Socioeconomic and genomic roots of verbal ability from current evidence

This research examines how the human genome and SES jointly and interactively shape verbal ability among youth in the U.S. The youth are aged 12–18 when the study starts. The research draws on findings from the latest GWAS as well as a rich set of longitudinal SES measures at the individual, family and neighborhood levels from Add Health (N = 7194). Both SES and genome measures predict verbal ability well, separately and jointly. More interestingly, including both sets of predictors in the same model corrects for an upward bias of about 20% in the effect of the education PGS, and implies that about 20–30% of the effects of parental SES are not environmental, but parentally genomic. The three incremental R² values that measure the relative contributions of the two PGSs, the genomic component in parental SES, and the environmental component in parental SES are estimated to be about 1.5%, 1.5%, and 7.8%, respectively. The total environmental R² and the total genomic R² are, thus, 7.8% and 3%, respectively. These findings confirm the importance of the SES environment and also pose challenges to traditional social-science research. Not only does an individual's genome have an important direct influence on verbal ability, but parental genomes also influence verbal ability through parental SES. The decades-long blueprint of including SES measures in a model and interpreting their effects as those of SES needs to be amended accordingly. A straightforward solution is to routinely collect DNA data for large social-science studies, even when the primary purpose is to understand social and environmental influences.


Supplementary Notes
In this section, we define and discuss some of the genetics terms and concepts that appear in the article.
For decades before DNA data were widely available, analysis of siblings and twins was the primary approach for obtaining genetic evidence. Such analysis requires clusters of individuals with known genetic relatedness, such as DZ twins, who share 50% of genetic variants on average, and MZ twins, who share 100%. It yields an estimate of the heritability ($h^2$) of a sample, or the proportion of variance in a trait due to genetic influence. Under a number of assumptions, heritability can be obtained as twice the difference between the MZ correlation and the DZ correlation, or $h^2 = 2(r_{MZ} - r_{DZ})$. This result refers to a sample or population and is obtained without DNA data. Twin and related studies made invaluable contributions to the understanding of the genetic nature of human traits. However, such analysis yields only parameters at the population level. Heritability can be calculated for a subpopulation defined by environmental variables, but these are still population parameters rather than mutually adjusted genetic and environmental effects at the individual level. For estimating such effects, mutually correlated genetic and environmental measures for each individual are indispensable.
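The twin-based formula above can be illustrated with a minimal sketch. The function name and the two correlations are hypothetical and chosen only for illustration; they are not estimates from this study.

```python
def falconer_heritability(r_mz: float, r_dz: float) -> float:
    """Twin-based heritability estimate: h^2 = 2 * (r_MZ - r_DZ)."""
    for name, r in (("r_mz", r_mz), ("r_dz", r_dz)):
        if not -1.0 <= r <= 1.0:
            raise ValueError(f"{name} must be a correlation in [-1, 1]")
    return 2.0 * (r_mz - r_dz)


# Hypothetical trait correlations for MZ and DZ twin pairs.
h2 = falconer_heritability(r_mz=0.75, r_dz=0.5)
print(h2)
```

The estimate describes the sample as a whole; it says nothing about the genetic or environmental effect for any one individual, which is the limitation noted above.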
For social scientists who are just beginning to understand the statistical analysis of genetic data, the human genome in an individual may be viewed as consisting of 3 billion DNA units or chemical base pairs. Only about 0.1% of the 3 billion units differ across individuals. This small proportion still amounts to millions of differences. The objective of statistical genetic analysis is to find credible associations between these differences across individuals and a trait.
A trait can be a disease such as type 2 diabetes or a characteristic such as hair color. A tiny portion of human traits are Mendelian. Mendelian human traits follow the Mendelian patterns of inheritance, first described by Gregor Mendel about 150 years ago, and are determined by one genetic locus. Examples of Mendelian traits include cystic fibrosis and color blindness. Complex human traits are traits that do not follow the Mendelian patterns of inheritance and are generally influenced by multiple genetic variables as well as multiple environmental variables. Examples of complex traits include body mass index, height, educational attainment, and cognitive ability. Most human traits are complex. All social outcomes are complex.
Association studies using candidate genes aim at finding statistical associations between genetic variation among a limited number of pre-specified genes and a trait. This approach was a main method for the study of complex traits before the era of genome-wide association studies (GWAS). Candidate-gene studies are typically based on a sample size ranging from a few dozen to a few thousand individuals. Many results fail to replicate because of a lack of statistical power. Even when such a study is successful, its findings are limited and do not cover all variation in the entire genome for the trait.
As a reaction to the inadequacies of the candidate-gene approach, GWAS aims at finding statistical associations between all possible genetic variations in a genome and a trait. In a current GWAS, the variation across the genome in the 0.1% of the 3 billion DNA units is measured by about 0.1% × 3 billion = 3 million single-nucleotide polymorphisms (SNPs). SNPs are a particular type of genetic variable that can be coded as a count of the number of risk alleles, taking a value of 0, 1 or 2. To establish which of the 3 million SNPs are associated with a trait, GWAS conducts 3 million separate regression analyses, one for each SNP. To guard against false positives, GWAS sets the critical P-value at $5 \times 10^{-8}$ and requires replication of the findings in an independent dataset. Note that this critical value is far more stringent than the conventional 0.05. The SNPs that are found to be associated with a trait under the two criteria are referred to as "top hits." GWAS top hits tend to be substantiated by subsequent studies in independent datasets.
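The one-regression-per-SNP logic can be sketched on simulated data. Everything in this example is an assumption made for illustration: the sample size, the number of SNPs, the allele frequency, the effect size, and the use of a normal approximation for the P-value. It is not the pipeline of any particular GWAS, and the replication step in an independent dataset is not shown.

```python
import math
import random

GENOME_WIDE_P = 5e-8  # critical P-value stated in the text


def snp_association(genotypes, trait):
    """Regress the trait on one SNP's risk-allele count (0, 1 or 2).

    Returns (slope, two-sided P-value). The P-value uses a normal
    approximation to the t statistic, reasonable only for large samples.
    """
    n = len(trait)
    mean_g = sum(genotypes) / n
    mean_y = sum(trait) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, trait))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_g
    rss = sum((y - intercept - slope * g) ** 2 for g, y in zip(genotypes, trait))
    se = math.sqrt(rss / (n - 2) / sxx)
    z = slope / se
    return slope, math.erfc(abs(z) / math.sqrt(2))


# Simulated toy data: 20 SNPs, of which only SNP 0 affects the trait.
random.seed(0)
n_people, n_snps, causal = 2000, 20, 0
genotypes = [
    [(random.random() < 0.3) + (random.random() < 0.3) for _ in range(n_people)]
    for _ in range(n_snps)
]
trait = [0.5 * genotypes[causal][i] + random.gauss(0.0, 1.0) for i in range(n_people)]

# One separate regression per SNP, keeping those below the critical value.
hits = [
    j for j in range(n_snps)
    if snp_association(genotypes[j], trait)[1] < GENOME_WIDE_P
]
print(hits)
```

A real GWAS would also adjust for covariates such as ancestry principal components, which this sketch omits.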
Polygenic scores (PGSs), or genetic risk scores, are one summary measure of the genome-wide influence on a trait. A PGS can be based on the top hits in a GWAS, which is equivalent to setting the P-value threshold at $P < 5 \times 10^{-8}$, or on all of the SNPs used in a GWAS, which is equivalent to setting the threshold at P = 1. As described in the section titled Genomic measures in our article, a PGS is the sum, over the SNPs included, of the number of risk alleles times the coefficient of the corresponding GWAS SNP. For example, the total number of top-hit loci based on the latest GWAS of BMI is 97 (Locke et al. 2015), and the corresponding PGS would sum over 97 loci. When a PGS is based on a P-value threshold above $5 \times 10^{-8}$, not all SNPs can be used in the construction of the PGS because of linkage disequilibrium (LD), the phenomenon that SNPs in a section of genomic sequence tend to be correlated. Including all the highly correlated SNPs would exaggerate the value of a PGS, since doing so amounts to overcounting SNP effects. The reduction of the correlation among the SNPs included in a PGS is achieved by pruning (which removes some of the highly correlated SNPs) and clumping (which keeps only one SNP in a section in which the SNPs are highly correlated).
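The construction of a PGS can be sketched as follows. The SNP names, P-values, pairwise r² values, weights, and the r² cutoff of 0.1 are all hypothetical. The greedy rule shown is one simple way to clump; actual clumping software also restricts comparisons to a physical distance window, which is omitted here.

```python
def clump(p_values, r2, r2_max=0.1):
    """Greedy clumping sketch.

    Visit SNPs from smallest to largest P-value and keep a SNP only if its
    r^2 with every SNP already kept is below r2_max. Pairs absent from r2
    are treated as uncorrelated.
    """
    kept = []
    for snp in sorted(p_values, key=p_values.get):
        if all(r2.get(frozenset((snp, k)), 0.0) < r2_max for k in kept):
            kept.append(snp)
    return kept


def polygenic_score(allele_counts, weights):
    """Sum of risk-allele counts (0, 1 or 2) times GWAS coefficients."""
    return sum(allele_counts[snp] * weights[snp] for snp in weights)


# Hypothetical GWAS summary statistics for three SNPs.
p_values = {"rs_a": 1e-9, "rs_b": 1e-6, "rs_c": 1e-4}
r2 = {frozenset(("rs_a", "rs_b")): 0.8}  # rs_a and rs_b are in high LD
gwas_coef = {"rs_a": 0.25, "rs_b": 0.25, "rs_c": 0.5}

kept = clump(p_values, r2)  # rs_b is dropped in favour of rs_a
weights = {snp: gwas_coef[snp] for snp in kept}

# Hypothetical risk-allele counts for one individual.
person = {"rs_a": 2, "rs_b": 1, "rs_c": 1}
score = polygenic_score(person, weights)
print(kept, score)
```

Without clumping, rs_a and rs_b would both enter the sum even though they largely carry the same signal, which is the overcounting described above.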

Supplementary tables
Supplementary Results that use the first principal component as a summary measure for specific SES predictors.
In addition to the results that use specific SES measures, presented in the main body of the article, we present results based on a principal component analysis. In these models, the first principal component is used as a summary of SES measures including mother's education, father's education, mother's occupation, father's occupation, household income, family structure, sibship size, neighborhood disadvantage, and whether the respondent was in school in the past school year. The principal component analysis was conducted only on cases with complete information on each of the SES measures. The advantage of using a single principal component is that the analysis is much simpler. In this analysis, the summary measure is almost always highly significant. The results convey a story very similar to that told by the specific SES measures. The drawback is that SES is now a black box whose effects cannot be easily interpreted.
Testing the additional effects of general health, mental health and problem behavior.
Self-reported health measures the general health of the respondents at Waves 1 and 3 and is based on the answer to the question "(I)n general, how is your health?" The answer has five categories: 1=Excellent, 2=Very good, 3=Good, 4=Fair, and 5=Poor. To facilitate interpretation, the variable is reverse coded so that a higher value corresponds to a better outcome: 1=Poor, 2=Fair, 3=Good, 4=Very good, and 5=Excellent; it is treated as a continuous variable in the analysis.
Depression is calculated by summing the items of a modified version of the Center for Epidemiologic Studies Depression (CES-D) scale (19 items) asked at Wave 1. The response categories for each item are: never or rarely (=0), sometimes (=1), a lot of the time (=2), and most of the time or all of the time (=3). Items 4, 8, 11, and 15 were reverse coded. See the Add Health website for the list of the 19 items.
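The two recoding rules described above can be sketched as follows. The function names are hypothetical, and the sketch assumes complete responses, because the handling of missing items is not described here.

```python
REVERSED_ITEMS = {4, 8, 11, 15}  # 1-based item numbers reverse coded per the text


def reverse_health(raw: int) -> int:
    """Map 1=Excellent ... 5=Poor onto 1=Poor ... 5=Excellent."""
    if raw not in (1, 2, 3, 4, 5):
        raise ValueError("self-reported health must be coded 1-5")
    return 6 - raw


def cesd_score(responses) -> int:
    """Sum 19 items scored 0-3, reverse coding the flagged items as 3 - value."""
    if len(responses) != 19:
        raise ValueError("expected 19 item responses")
    total = 0
    for item, value in enumerate(responses, start=1):
        if value not in (0, 1, 2, 3):
            raise ValueError(f"item {item} must be coded 0-3")
        total += (3 - value) if item in REVERSED_ITEMS else value
    return total


print(reverse_health(1))      # "Excellent" becomes the highest value
print(cesd_score([0] * 19))   # only the four reversed items contribute
```

With this coding the depression score ranges from 0 to 57, and a higher score indicates more depressive symptoms.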
Serious delinquency is constructed from a delinquency scale using 12 questions at Waves 1 and 3. The scale is similar to those widely used in delinquency and criminal behavior research (Thornberry and Krohn 2000). The 12 questions are about physical fighting that results in injuries needing medical treatment, use of weapons to get something from someone, involvement in physical fighting between groups, shooting or stabbing someone, deliberately damaging property, pulling a knife or gun on someone, stealing amounts larger or smaller than $50, breaking and entering, and selling drugs. Respondents were asked to report how many times they had engaged in these delinquent behaviors in the past 12 months. The answers are none, 1 or 2 times, 3 or 4 times, and 5 or more times. We recode them into 0, 1, 2, and 3, respectively, for each item. The scores are then summed and divided by twelve. The result is rounded up and coded into the categories of none, 1 or 2 times, 3 or more times, and missing.
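The scoring steps above can be sketched as follows. Two points are assumptions, since they are not spelled out here: "rounded up" is read as taking the ceiling of the mean, and a respondent with any unanswered item (represented as None) is coded as missing.

```python
import math

RECODE = {"none": 0, "1 or 2 times": 1, "3 or 4 times": 2, "5 or more times": 3}


def serious_delinquency(answers):
    """Code 12 item answers into none / 1 or 2 times / 3 or more times / missing."""
    if len(answers) != 12:
        raise ValueError("expected answers to 12 questions")
    if any(a is None for a in answers):
        return "missing"  # assumption: any unanswered item makes the scale missing
    mean_score = sum(RECODE[a] for a in answers) / 12
    level = math.ceil(mean_score)  # assumption: "rounded up" means ceiling
    if level == 0:
        return "none"
    if level == 1:
        return "1 or 2 times"
    return "3 or more times"


# Hypothetical respondent reporting one behavior once or twice.
example = ["1 or 2 times"] + ["none"] * 11
print(serious_delinquency(example))
```

Under the ceiling reading, any reported behavior moves a respondent out of the "none" category, so a different rounding rule would yield a different distribution across categories.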
The findings demonstrate that when general health, mental health and delinquency are added to our model of parental SES and offspring genome, only mental health significantly predicts verbal ability, and its effect is negative.
Notes: *** p<0.001, ** p<0.01, * p<0.05, + p<0.1; "omitted" indicates that the parameters are very similar to those in previous models and are omitted to avoid redundancy.