Introduction

Most neuroimaging studies rely on relatively small samples that are not representative of a well-defined target population. This has resulted in multiple calls to incorporate population science approaches into neuroimaging research1, 2. To date, however, the impact of convenience sampling on neuroimaging findings has not been examined empirically. In the current study, we address this need by examining whether sample composition influences age-related variation in brain structure among children in the United States.

All participants in research studies are drawn from target populations, even if study investigators do not explicitly define or enumerate that population. Even when a target population is defined (e.g., adults between the ages of 25 and 40 in the United States), study participants are unlikely to represent that target population unless they are randomly selected. Decades of methodological work in epidemiology and population science has shed light on the conditions that limit generalizability of findings generated from such non-representative samples3,4,5. This work suggests that sample composition may influence a study’s conclusions when the association between the independent and dependent variable (e.g., age and brain structure) differs between those selected into the study and those who are eligible from the target population but not included6, 7. Such a scenario is likely to occur when study participants do not represent the target population in characteristics known to influence neural structure or function, for example, socioeconomic status (SES)8. Participants recruited into neuroimaging studies are not typically selected to be representative of a known target population, under the assumption—implicit or explicit—that basic neural functions (e.g., visual processing) in healthy individuals are not influenced by sample characteristics. Study findings are often assumed to reflect universal aspects of brain structure and function regardless of the sampling strategy. However, this assumption is largely untested and likely false.

There are exceptional examples of neuroimaging studies that have attempted to select representative samples9, 10; however, logistical challenges and study design decisions reduce the generalizability of findings from these studies to the broader U.S. population. In the foundational NIH MRI Study of Normal Brain Development, investigators selected a sample representative of the population in the study areas9; however, this study included numerous exclusion criteria (e.g., the presence of clinically significant mental health symptoms) that reduced the true representativeness of the sample2. The more recent NKI Rockland study was also designed to minimize sampling bias and maximize generalizability and included a representative sample of children and adults from Rockland County, NY10. Although this study represents a considerable advance toward representative sampling in cognitive neuroscience, participants were from a single geographic location and had higher levels of SES than in the U.S. population overall, indicating that this sample does not fully represent the U.S. While sample composition has become a growing area of focus in neuroimaging research1, 2, to date there are no neuroimaging studies based on a representative sample of the U.S. population.

Here, we test the hypothesis that the use of non-representative samples in neuroimaging studies may influence interpretation of the association between age and brain structure. Age-related variation in brain structure in childhood and adolescence has been examined frequently in cognitive neuroscience. Prior studies have demonstrated substantial heterogeneity in the pattern of developmental change across brain structures and in the age at which peak thickness and surface area are reached for different cortical regions11,12,13,14,15. In the current study, we use a large neuroimaging data set of typically developing children, the Pediatric Imaging, Neurocognition and Genetics (PING) study16, to examine whether sample composition influences age–brain structure associations. We use 2010 U.S. Census data to estimate the national distributions of basic socio-demographic characteristics (i.e., race/ethnicity, age, sex, parental educational attainment, and income) for children in the age range of the study sample (3–18 years). We apply sample weights based on these distributions to the PING sample using a common epidemiological and survey method procedure called raking to create a weighted PING sample that approximates a representative sample of the U.S. To determine the impact of sample composition on age-related variation in brain structure, we compare associations of age with global and regional measures of gray matter structure in the original, unweighted PING sample (i.e., non-representative) to those from the weighted PING sample (i.e., more representative). We focus our analysis on global morphometric cortical gray matter measurements as well as measurements of each lobe of the brain. Specifically, we examine cortical volume, cortical surface area, and cortical thickness for the entire cortex and for the right and left hemispheres; we additionally examine cortical surface area and thickness of frontal, parietal, temporal, and occipital lobes. 
These are robust metrics of brain structure that are measured with high reliability relative to specific cortical regions17. We also examine the volume of three widely studied subcortical structures—amygdala, hippocampus, and basal ganglia—as well as total subcortical volume to determine whether sample composition has a greater influence on cortical vs. subcortical regions and on global vs. specific measures.

Our results suggest that sample composition alters the interpretation of how cortical and subcortical areas vary with age. In the weighted sample, we frequently observe cubic (S-shaped) developmental patterns for cortical surface area and subcortical volume and younger ages of peak surface area and volume compared to the unweighted sample. In contrast, we primarily observe quadratic (U-shaped) developmental trajectories and older ages at peak cortical surface area and subcortical volume in the unweighted sample. Our findings empirically demonstrate observable impacts of sample composition on cognitive neuroscience findings, even for questions about fundamental processes such as age-related change in neural structure.

Methods

Image acquisition and processing

The MRI protocol and standardized image processing techniques used in the PING study were designed to extract high-quality multimodal imaging data in a multisite study of children40. For each participant, a single whole-brain, T1-weighted structural magnetic resonance image was acquired in the sagittal plane using interleaved slice acquisition. All images were acquired on a 3 T scanner at one of 10 different study sites using Siemens, GE, or Philips scanners. Acquisition parameters were standardized across sites and are detailed as follows: for Siemens: TE = 4.33 ms, TR = 2170 ms, flip angle = 7 degrees, 160 slices with 1 × 1 × 1.2 mm voxels, FoV = 256; for Philips: TE = 3.1 ms, TR = 1665.9 ms, flip angle = 8 degrees, 170 slices with 1 × 1 × 1.2 mm voxels, FoV = 256; for GE: TE 1 = 4.0 ms, TR = 1500 ms, flip angle = 8 degrees, 170 slices with 1 × 1 × 1.2 mm voxels, FoV = 256. To reduce motion, prospective motion correction (PROMO) was applied during acquisition43. Because different scanners are likely to have different field inhomogeneities, resulting in differential sources of image distortion, a gradient field nonlinearity correction was applied prior to analysis40.

Cortical thickness and surface area estimates were calculated with the FreeSurfer image analysis suite, which is documented and freely available for download online (http://surfer.nmr.mgh.harvard.edu). FreeSurfer morphometric procedures are well established60,61,62, have demonstrated good test-retest reliability across scanner manufacturers and field strengths63, have been validated against manual measurement64, 65 and histological analysis66, and have been successfully used in studies of children as young as age 467.

FreeSurfer methods applied to the processing of PING structural data included removal of non-brain tissue using a hybrid watershed/surface deformation procedure68, automated Talairach transformation, previously validated in pediatric populations69, and segmentation of the subcortical white matter and deep gray matter volumetric structures, separately validated for use with pediatric populations67, 70. FreeSurfer provided thickness and surface area estimates for 68 cortical regions (34 for each hemisphere), according to the Desikan-Killiany atlas60, 71. Labels for cortical gray matter were assigned using surface-based nonlinear registration to a gyral and sulcal-based atlas62 and Bayesian classification rules61, 71. For subcortical structures, an automated, atlas-based, volumetric segmentation procedure was used to calculate volumes in mm3 for each structure, also executed in FreeSurfer40.

Prior to inclusion in the final data set, neuroimaging data were required to pass rigorous quality-control procedures. All images were reviewed by trained technicians for significant motion artifacts, operator error and scanner dysfunction within 24 h of the scan to allow for the re-scanning of participants when possible40. T1-weighted images were examined slice by slice for any evidence of excessive motion and rated as either acceptable or for attempted rescan40. The subcortical segmentations, cortical parcellations, and white and pial surface reconstructions from the processed images were also reviewed by trained staff40.

The publicly available PING data set provides preprocessed, labeled, and quality-controlled structural data for cortical surface area and thickness, and subcortical volumes based on the high-resolution T1-weighted scan. We chose to examine global and lobe-specific measures of cortical structure as they show high test-retest reliability and are more precisely estimated than smaller, individual structures17. Cortical gray matter measurements included total cortical volume, left/right hemispheric cortical volume, total subcortical volume, overall mean cortical thickness, left/right hemispheric mean cortical thickness, total cortical surface area, and left/right total cortical surface area. We also generated measurements for surface area and thickness for each lobe of the brain (frontal, occipital, temporal, and parietal) by combining regions identified in the Desikan-Killiany atlas (see Supplementary Table 1 for a complete list of regions)60, 71. We examined three subcortical structures: the amygdala, hippocampus, and basal ganglia.

Creating sample weights

When a recruited sample does not adequately and proportionally cover segments of a target population, sample weights can be used so that the marginal totals of the adjusted weighted sample align with the target population on predefined characteristics (e.g., age, sex, race/ethnicity, SES, etc.). A classic way in which to create this alignment is through raking18, 19. In raking, the inverse of the marginal distribution of each variable to be included in the weight is iteratively multiplied across individuals in the sample. Each sample participant is then assigned a weight that is estimated as the difference between the unweighted value and the population distribution for the set of raked estimates. For illustrative purposes, consider two of the four variables we used for raking: sex and race. The raking procedure is essentially accomplished by first multiplying each individual by the inverse probability of being the sex that they are based on the overall population distribution of sex; the resulting estimates thus match the population distribution of sex, but not race. Then, each individual is multiplied by the inverse probability of being the race that they are given the overall population distribution of race. The resulting estimates thus match the population distribution of race, but now the sex estimates may not match population distributions. We then multiply again the individual by the inverse of the probability of their sex based on the population, and iteratively move through this sequence until there is convergence by which all of the weighted estimates match the population distributions within a caliper of error18, 47. The generalized raking procedure we followed was similar but with four variables: sex, race/ethnicity, parental education, and income, such that at the end of the procedure, the distributions of these demographic characteristics in the weighted sample were comparable to the population distribution of the U.S. Census in 2010. 
To improve the stability of estimates and ensure that results are not sensitive to a few individuals with extreme weights, it is traditional in raking procedures to “trim” the weights so that no extreme observation has undue influence72. We applied such trimming to our sample, using an initial weight to estimate interquartile ranges (IQR) of the input sample and adjusted the weights so that no observation fell outside of 3 IQR of the initial weight.
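The iterative sex-then-race sequence described above can be sketched in a few lines of Python. This is only an illustration of basic raking (iterative proportional fitting) on two categorical variables, not the SUDAAN procedure used in the study. The function name `rake`, the toy sample, and the target proportions are all hypothetical, and weight trimming is not shown.

```python
from collections import defaultdict

def rake(sample, targets, max_iter=100, tol=1e-8):
    """Basic raking (iterative proportional fitting) over categorical margins.

    sample  : list of dicts, one per participant, e.g. {"sex": "F", "race": "A"}
    targets : {variable: {category: population proportion}}
    Returns one weight per participant; the weights sum to len(sample).
    """
    weights = [1.0] * len(sample)
    for _ in range(max_iter):
        max_gap = 0.0
        for var, target in targets.items():
            # Current weighted total for each category of this variable.
            totals = defaultdict(float)
            for row, w in zip(sample, weights):
                totals[row[var]] += w
            grand = sum(weights)
            for i, row in enumerate(sample):
                share = totals[row[var]] / grand
                max_gap = max(max_gap, abs(share - target[row[var]]))
                # Scale so this variable's weighted margin matches the target.
                weights[i] *= target[row[var]] / share
        if max_gap < tol:  # all margins already matched before this pass
            break
    return weights

# Hypothetical toy data: four participants and two raking variables.
toy_sample = [
    {"sex": "F", "race": "A"},
    {"sex": "F", "race": "B"},
    {"sex": "M", "race": "A"},
    {"sex": "M", "race": "A"},
]
toy_targets = {
    "sex": {"F": 0.5, "M": 0.5},
    "race": {"A": 0.6, "B": 0.4},
}
toy_weights = rake(toy_sample, toy_targets)
```

Each pass fixes one margin and may disturb the other, which is why the loop repeats until the largest discrepancy falls below the tolerance (the "caliper of error" in the text).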

We estimated population totals from the American Community Survey (ACS) Public Use Microdata Sample from 2009–2011. We then applied a raking procedure to the data using the “WTADJUST” procedure in SUDAAN, which employs a model-based approach and can be interpreted as a generalized raking procedure. The equations used to estimate the post-stratification weights are provided in the SUDAAN language manual73; here we summarize the main equation used for weight estimation and refer readers interested in full details and worked examples of the generalized raking procedure to the manual. We used the following equation for our post-stratification weight73:

$${\theta _k} = {\gamma _k}{\alpha _k} = {\gamma _k}\left( {\frac{{{l_k}\left( {{u_k} - {c_k}} \right) + {u_k}\left( {{c_k} - {l_k}} \right){\rm{exp}}\left( {{A_k}x_k^\prime \beta } \right)}}{{\left( {{u_k} - {c_k}} \right) + \left( {{c_k} - {l_k}} \right){\rm{exp}}\left( {{A_k}x_k^\prime \beta } \right)}}} \right)$$

In this equation, as applied to our analysis, $$k$$ indexes each respondent in the PING data for whom a final weight ($$\theta_k$$) was estimated. This final weight is the product of $$\gamma_k$$, the weight trimming factor used to stabilize the variance of the weighted estimates, and $$\alpha_k$$, the post-stratification adjustment. The post-stratification adjustment ($$\alpha_k$$) is a function of a vector of the socio-demographic variables we included ($$x_k^{\prime}$$, which in our model comprises sex, race/ethnicity, parental education, and income) and the model parameters ($$\beta$$), based on a logistic function. The remaining factors that determine the final weight are $$A_k$$, $$l_k$$, $$u_k$$, and $$c_k$$. These are all adjustments to improve weight stability, and include a lower bound ($$l_k$$), an upper bound ($$u_k$$), and a centering constant ($$c_k$$) for the weight of any individual in the data, which is required to be between the lower and upper bounds. $$A_k$$ is an additional constant that adjusts the final weight for stability. In summary, generalized raking procedures produce stable weight estimates based on a set of user-defined parameters that control the performance of the weight, as well as user-supplied variables that allow each individual respondent to be adjusted so that the weighted sample as a whole is representative of the selected characteristics in the user-defined target population. We provide all of our statistical code as an online supplement that includes our user-defined parameters and the assumptions we made in the statistical model regarding weight trimming factors (see Supplementary Data 1).
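To make the behavior of the adjustment factor concrete, the following Python sketch evaluates the equation above directly. The function and argument names are hypothetical, the numeric values are arbitrary, and the sketch takes the linear predictor (the product of the covariate vector and the model parameters) and the constant A as given inputs rather than estimating them as SUDAAN does.

```python
import math

def bounded_adjustment(lower, upper, center, a_const, linear_predictor):
    """Post-stratification adjustment (alpha) from the weight equation.

    Assumes lower < center < upper. `linear_predictor` stands for x'beta.
    """
    e = math.exp(a_const * linear_predictor)
    numerator = lower * (upper - center) + upper * (center - lower) * e
    denominator = (upper - center) + (center - lower) * e
    return numerator / denominator

def final_weight(trim_factor, lower, upper, center, a_const, linear_predictor):
    """Final weight (theta): trimming factor (gamma) times the adjustment (alpha)."""
    return trim_factor * bounded_adjustment(
        lower, upper, center, a_const, linear_predictor
    )

# Arbitrary illustrative values (not taken from the study).
alpha_at_zero = bounded_adjustment(0.5, 3.0, 1.0, 1.0, 0.0)
alpha_high = bounded_adjustment(0.5, 3.0, 1.0, 1.0, 50.0)
alpha_low = bounded_adjustment(0.5, 3.0, 1.0, 1.0, -50.0)
```

Under these assumptions, the adjustment equals the centering constant when the linear predictor is zero and approaches the upper or lower bound as the linear predictor becomes large and positive or large and negative, which is how the bounds keep individual weights from becoming extreme.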

Regression models

We next estimated separate models of the association of age with global and regional measures of gray matter structure to determine whether a linear, quadratic, or cubic term for age provided the best fit to the data. The best-fitting model for each measure was determined by comparing the Akaike information criterion (AIC)21. The more complex model (i.e., with quadratic or cubic terms) was selected when its AIC was at least 2.5 points lower than the AIC for the less complex model25. AIC is commonly used for model selection (i.e., selecting covariates that provide the best fit to the data and selecting the best functional form of a model)22,23,24. Model fit statistics determine how well a particular model aligns with the underlying data, while taking into account the number of parameters in that model (rather than examining the statistical significance of each parameter individually). Model fit has long been accepted as the gold standard approach for model selection across a wide range of scientific disciplines, including the behavioral sciences and epidemiology24, 25; this approach is particularly well suited for deciding among models with polynomial terms24.
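The selection rule can be illustrated with a short Python sketch. The names are hypothetical and the AIC values are invented. Two assumptions are made here that the text does not specify: each more complex model is compared against the currently selected simpler model, and the helper `gaussian_aic` uses the standard least-squares form of AIC (up to an additive constant), which may differ from the value reported by the authors' software for weighted models.

```python
import math

def gaussian_aic(n_obs, rss, n_params):
    """AIC for a least-squares fit, up to an additive constant."""
    return n_obs * math.log(rss / n_obs) + 2 * n_params

def select_polynomial_degree(aic_by_degree, threshold=2.5):
    """Pick the polynomial degree for age (1, 2, or 3).

    A more complex model replaces the current choice only when its AIC is
    at least `threshold` points lower than that of the current choice.
    """
    chosen = 1
    for degree in (2, 3):
        if aic_by_degree[chosen] - aic_by_degree[degree] >= threshold:
            chosen = degree
    return chosen

# Invented AIC values for one brain measure.
example_choice = select_polynomial_degree({1: 100.0, 2: 97.0, 3: 96.0})
```

In this invented example the quadratic model improves on the linear model by 3 points and is selected, while the cubic model improves on the quadratic by only 1 point and is not.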

All models included covariates for sex, race/ethnicity, parent educational attainment, family income, and scanner. Models for subcortical volume measurements also included intracranial volume (ICV). For both the unweighted and weighted samples, we used this same model building strategy to arrive at the best-fitting model to describe age-related variation in brain structure, so differences between the models can be attributed to the application of the sample weighting technique and underlying differences in the distribution of demographic characteristics in the unweighted and weighted samples.

To determine the extent to which differences in model parameterization led to meaningful differences in the interpretation of age-related variation between analytic approaches, we generated predicted values for each brain measure (area, thickness, and volume) at each age using the best-fitting models in the unweighted and weighted data and graphed these results. We also calculated the difference in age at peak surface area and volume between the unweighted and weighted data where applicable (i.e., in quadratic and cubic models) by calculating the first-order derivative of the fitted curves and solving for the age at which it equals zero. For quadratic models, we used the following formula to estimate peak age:

$${\rm{MeanAge}} + \frac{{ - {\it{a}}1\beta }}{{2{\rm{*}}{\it{a}}2\beta }}$$

where MeanAge is the estimated sample mean age, a1β is the beta estimate for the linear age term in the regression model, and a2β is the beta estimate for the age-squared term from the regression model. For cubic models, we used the following formula to estimate peak age:

$${\rm{MeanAge}} + \frac{{ - \left( {2{\rm{*}}{\it{a}}2\beta } \right) \pm \sqrt {{{\left( {2{\rm{*}}{\it{a}}2\beta } \right)}^2} - 4{\rm{*}}\left( {3{\rm{*}}{\it{a}}3\beta } \right){\rm{*}}{\it{a}}1\beta } }}{{2{\rm{*}}(3{\rm{*}}{\it{a}}3\beta )}}$$

where MeanAge is the estimated sample mean age, a1β is the beta estimate for the linear age term in the regression model, a2β is the beta estimate for the age-squared term from the regression model, and a3β is the beta estimate for the age-cubed term from the regression model.
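The two peak-age formulas can be checked numerically with the sketch below. The function names and coefficient values are hypothetical. One assumption goes beyond the text: because the cubic formula yields two roots, the sketch keeps the root at which the second derivative is negative (a local maximum), and it returns `None` when no maximum exists.

```python
import math

def peak_age_quadratic(mean_age, b1, b2):
    """Peak age for a model with mean-centered linear (b1) and squared (b2) age terms."""
    if b2 >= 0:
        return None  # curve opens upward or is linear: no maximum
    return mean_age + (-b1) / (2 * b2)

def peak_age_cubic(mean_age, b1, b2, b3):
    """Peak age for a cubic model: root of b1 + 2*b2*x + 3*b3*x**2 = 0."""
    if b3 == 0:
        return peak_age_quadratic(mean_age, b1, b2)
    discriminant = (2 * b2) ** 2 - 4 * (3 * b3) * b1
    if discriminant < 0:
        return None  # derivative never reaches zero
    for sign in (1, -1):
        x = (-(2 * b2) + sign * math.sqrt(discriminant)) / (2 * (3 * b3))
        if 2 * b2 + 6 * b3 * x < 0:  # negative second derivative: local maximum
            return mean_age + x
    return None

# Invented coefficients, with age centered at a sample mean of 10 years.
quadratic_peak = peak_age_quadratic(10.0, 2.0, -1.0)
cubic_peak = peak_age_cubic(10.0, -3.0, 0.0, 1.0)
```

With these invented values, the quadratic curve peaks 1 year above the mean age (11 years) and the cubic curve peaks 1 year below it (9 years); the other cubic root, 1 year above the mean, is a local minimum and is discarded.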

The predicted value graphs are intended to help readers visualize differences between the best-fitting models in the unweighted and weighted data, as even models with quadratic or cubic terms can describe patterns of variation that are effectively linear. However, we are unable to compare aspects of these graphs (e.g., differences in slopes) with statistical tests because they are derived from different samples. For the same reason, calculations of age at peak surface area cannot be statistically compared between unweighted and weighted data and are included to provide a more tangible demonstration of how age-related trajectories in brain development may differ as a result of sample composition. For subcortical volume, final models also included ICV, and thus peak age was based on predicted values averaging the estimated volume within each 2-year age interval. To examine whether differences between unweighted and weighted models may be due to differences in head size, we examined subcortical ICV as an outcome. For subcortical ICV, the best-fitting models in the unweighted and weighted data were quadratic and indicated similar rates of change with age (see Supplementary Table 6 and Supplementary Fig. 1).

Data availability

The PING Data Resource includes neurodevelopmental histories, information about developing mental and emotional functions, multimodal brain imaging data, and genotypes for over 1000 children and adolescents. The data are available to members of the research community after submission of data use requests, agreement to the data use policies, and registration. More information about the PING Data Resource is available at http://pingstudy.ucsd.edu/ and http://ping.chd.ucsd.edu/. Our statistical code is available in Supplementary Data 1 and it is also available on GitHub at the following link: https://github.com/kajalewinn/PING.git.