Introduction

Approximately 1.8 million children under age 15 are living with HIV (UNAIDS, 2020)1. Although antiretroviral treatment (ART) and HIV care have improved over the last two decades, children living with HIV (CLWH) experience a diverse set of comorbidities, including neurocognitive impairment2,3,4. HIV-related neurocognitive disorder typically presents with executive dysfunction and memory impairment; attention, multitasking, impulse control, and judgment are also disrupted5. CLWH perform significantly worse on planning, reasoning, processing speed, and visuospatial tasks relative to healthy controls6,7,8. Early detection of neurocognitive deficits in CLWH is critical because of the detrimental lifelong effects these deficits can have on educational outcomes, employment, and relationships, especially in low- and middle-income countries (LMICs), where most cases of HIV occur.

A variety of neurocognitive test batteries are used to assess cognitive impairment in adults and children. Neurocognitive batteries are time-consuming, labor-intensive, and must be administered and interpreted by trained professionals. This makes their use problematic in LMICs, where such resources are limited9. Neurocognitive test results are also affected by an individual's socioeconomic and educational status10. Low socioeconomic status and poor educational opportunities in children are often associated with poor neurocognitive function, which is also associated with HIV infection11,12. Therefore, predicting who may develop neurocognitive deficits due to HIV, and discerning whether poor neurocognitive function in children is HIV-related or due to other environmental and social factors, can be challenging.

A different approach to assessing brain function is to evaluate the brain’s ability to process complex sounds (i.e., central auditory processing). Tests assessing this ability, termed central auditory tests (CATs), include measures of the ability to understand speech in background noise, detect short periods of silence in a continuous noise presentation (gaps-in-noise), or decipher different words presented to both ears simultaneously. While peripheral auditory function (i.e., hearing sensitivity to tones) is necessary for accurate central auditory processing, CATs go beyond the peripheral auditory system and require rapid cortical processing, concentration, attention, and integration of executive function within the brain. Measures of central auditory processing correlate with neurocognitive function in people living with HIV13. Niemczak et al.13 showed that CATs are positively associated with both learning and working memory measures. Furthermore, recent research shows that electrophysiological measures of sound processing [e.g., the frequency following response (FFR)] can be used as markers of central nervous system dysfunction in CLWH (Ealer et al., accepted at AIDS). These results suggest CATs might be useful for tracking or predicting neurocognitive function.

Because the interaction of multiple factors can affect neurocognitive function, a machine learning approach to prediction may be useful. Machine learning allows multiple factors to be considered together to improve overall predictive ability from a diverse data set. To understand whether CATs, in addition to other factors such as education, can predict neurocognitive function, machine learning models such as random forest (RF), eXtreme Gradient Boosting (XGBoost), and support vector machines (SVM) may provide an improved analytical method14,15,16. Machine learning models consistently yield better predictive performance than traditional statistical models on biomedical data and other data with high levels of complexity17,18,19. Highly predictive machine learning models can detect predictor-outcome dependencies that traditional statistical models may fail to detect. Furthermore, machine learning models are better suited to learning nonlinear trends and patterns in the longitudinal predictor-outcome relationship when hyperparameters (i.e., model settings) are appropriately tuned during model training. These models have not previously been used to assess the predictive ability of central auditory function for future cognitive development in children.

In this study, we examined whether CATs early in childhood could predict cognitive performance at a later time point. We used longitudinal CAT scores and state-of-the-art machine learning models to examine the predictive ability of CATs for neurocognitive deficits in children living with and without HIV. Specifically, we investigated the ability of CAT performance to predict performance on Leiter-3 neurocognitive test composites (Nonverbal IQ, Processing Speed, and Nonverbal Attention/Memory) administered 0.5–1.5 years later. We used longitudinal data collected from children in Dar es Salaam, Tanzania. We hypothesized that predictors involving CATs would yield useful predictions of subsequent Leiter-3 composite performance.

Results

Predicting neurocognitive function

We used data from children between the ages of 3 and 10 years who were part of a longitudinal study of HIV and neurocognitive performance in Dar es Salaam, Tanzania. Children returned at 6-month to 1-year intervals for measurements of both central auditory function using CATs and cognitive performance using the Leiter-3. CATs could be either behavioral (the child heard a sound and gave a response) or electrophysiological (electrical activity of populations of neurons in the auditory pathway evoked by an acoustic stimulus). Children with fewer than three visits and/or missing CAT and Leiter-3 scores were excluded from the analysis. We used 7 different sets of variables to predict neurocognitive function as measured by the Leiter-3 (see “Methods” section for details). The performance metrics used were the area under the receiver operating characteristic curve (AUROC/AUC) and the F1-score. Rather than reflecting model performance at a single sensitivity/specificity threshold, the AUC provides a single summary measure of a model’s overall predictive ability across all possible thresholds. Accuracy, simply the proportion of correct predictions, is often used when the distribution of outcome classes is balanced. In our case a class imbalance exists; therefore, the F1-score is a more statistically appropriate performance metric. The F1-score accounts for the fact that the minority class (the class with the smaller sample) may be harder to detect than the majority class. Consequently, it provides a more balanced measure of a model’s performance across all classes than accuracy does in the presence of class imbalance. Both metrics are presented on a 0–1 scale, with higher values (closer to 1) being best. Four machine learning models were used: random forest, eXtreme Gradient Boosting, logistic regression, and support vector machines.

CATs demonstrated predictive ability for 2 of the 3 Leiter-3 composites. Machine learning models using CATs as predictors achieved the highest predictive capabilities. For the Nonverbal IQ Leiter-3 composite, models that contained behavioral CATs alone (F1 = 0.79, AUC = 0.73), electrophysiological CATs and covariates (e.g., education) (F1 = 0.65, AUC = 0.72), and behavioral and electrophysiological CATs and covariates (F1 = 0.75, AUC = 0.74) performed best. Models for the Leiter-3 Processing Speed composite performed best when behavioral CATs (F1 = 0.64, AUC = 0.76) and behavioral and electrophysiological CATs (F1 = 0.84, AUC = 0.85) were included as predictors. No model and predictor combination reached AUC > 0.70 for the Nonverbal Attention/Memory composite (see Tables 1, 2, 3, 4 for full results).

Table 1 Summary of distribution of demographic and central auditory tests data for HIV+ and HIV− children.
Table 2 Results for predicting Nonverbal IQ.
Table 3 Results for predicting Processing Speed.
Table 4 Results for predicting Nonverbal Attention/Memory.

We used the arithmetic mean (m) and standard deviation (sd) of F1 and AUC scores across the four machine learning models (see “Methods” section) to determine which sets of predictors performed best and were most stable. For Nonverbal IQ, the models that included electrophysiological CATs and covariates as predictors yielded the highest and most stable predictive performances across the four models (m F1 = 0.72 ± 0.04, m AUC = 0.64 ± 0.06). For Processing Speed, the models that included the behavioral CATs yielded the highest and most stable results across the models (m F1 = 0.72 ± 0.07, m AUC = 0.65 ± 0.07).

Discussion

As expected, factors such as education and HIV were useful for predicting performance on neurocognitive tests. The important finding, however, was that CATs alone were often the best predictor of subsequent poor neurocognitive function. This is interesting because CATs and the Leiter-3 differ substantially in how they engage the brain. The Leiter-3 is an entirely nonverbal test: instructions are given via gestures and facial expressions, and no information is provided using speech. CATs, by contrast, test the underlying processes involved in processing speech and auditory signals. Nevertheless, CATs predicted Leiter-3 performance, suggesting that both may depend on the same underlying neurological processes.

The results also suggest CATs may be useful as an early predictor: CATs administered early in this longitudinal study predicted cognitive performance later on. While these results are preliminary, they present the possibility that CATs could be used early in a child’s development, and perhaps tracked over time, to obtain information on how HIV may be affecting the brain. By combining CAT results with environmental and social factors (including HIV and education) at young ages and using the predictive ability of readily available machine learning models, we can reach better predictions of which children with HIV may have lower neurocognitive performance at later ages. This is of high importance given the aforementioned findings that CLWH have a higher risk of neurocognitive deficits. Implementing standardized testing at any age is challenging, particularly in LMICs. There are always cultural elements to consider, including the novelty of testing and the influence of normative differences between how tests are developed in the West and the cultural context in which they are being used. CATs can be completed with less influence from culture while providing a strong measure of brain function; hence, they could play an important role in improving predictions of neurocognitive performance in LMICs. Having tools that can predict functional problems in the future, and at such a young age, can lead to more rapid identification of developmental concerns and allow for earlier intervention to promote better outcomes20. This would be a significant improvement over trying to perform detailed neurocognitive testing.

A limitation of our approach is that we observed instability in the performance of the machine learning models. For instance, performance varied from model to model for certain predictor sets (e.g., AUC = 0.73 for XGBoost versus AUC = 0.53 for SVM with behavioral CATs alone as predictors and Nonverbal IQ as the outcome). This is likely a consequence of limited sample size and/or low representation of the below age-based expectations class (the neurocognitive underperformers) in the testing set. With low representation of this class in the testing set, a single misclassification or correct classification is enough to substantially increase or decrease the predictive performance of a model. We also observed that CATs did not perform well in predicting the Nonverbal Attention/Memory composite. It is challenging to discern why that is the case. It may be due to the nature of the specific tasks in that composite, which rely on higher-order processes (e.g., working memory, cognitive flexibility) that are less apparent in the Nonverbal IQ or Processing Speed composites. In this sense, CATs may align best with general cognitive reasoning and speed of information processing. Another challenging aspect of the analytical phase of our work is the low number of predictors in the data. Machine learning models reach their full predictive potential in the presence of a large number of predictors, where they can find high-order interplay between those predictors. With a low number of predictors we also run the risk of creating correlated trees in the random forest models, which would cause overfitting on the training set and subsequent underperformance on the testing set. Further investigation is needed on larger datasets to confirm our findings and obtain more consistent and stable predictive performance from our models. Additionally, we may need to follow children for a longer period of time to establish early CATs as true and meaningful predictors of later neurocognitive function.

In summary, we built machine learning models that used CATs and demographic factors such as education to predict neurocognitive function in CLWH and healthy controls. In multiple instances, models that contained CATs as predictors outperformed models that contained covariates only. We conclude that both behavioral and electrophysiological CATs have promise as predictors of neurocognitive function.

Methods

Participants and data

The data for CLWH and HIV-negative children were collected as part of an ongoing longitudinal study in Dar es Salaam, Tanzania. Our research protocol was approved by the Committee for Protection of Human Subjects of Dartmouth College and the Research Ethics Committee of Muhimbili University of Health and Allied Sciences, and all methodological steps were conducted in accordance with the relevant guidelines and regulations.

Participants were recruited from local pediatric programs, district hospitals, and schools. Informed consent was obtained for all minors in this study from a parent and/or legal guardian. As part of the study, participants attended biannual follow-ups until age 6 and annual follow-ups thereafter. At each visit, peripheral auditory function, CATs (behavioral and electrophysiological), and the Leiter-3 were assessed.

Inclusion criteria

Data for 109 children were selected from a dataset using the following criteria. At the time of enrollment, children were 3–11 years old, had normal hearing in both ears (i.e., ≤ 25 dB HL at 0.5, 1.0, 2.0, and 4.0 kHz), had no history of exposure to traumatic noise, and had normal tympanometry. Children with a history of mental illness, neurological disease, or loss of consciousness were excluded, as these factors impact performance on CATs21,22,23. HIV status was confirmed in children using medical records or a rapid HIV test and reconfirmed using an ELISA assay. Children with 3 or more visits were included in the analysis. Children with missing scores for more than one CAT, more than two electrophysiological CATs, or the Leiter-3 were excluded from the analysis, as missing data can impact the predictive abilities of machine learning models, especially in small datasets24.
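As a rough sketch, the inclusion filter can be expressed in pandas as follows; the dataframe layout and column names are hypothetical, and the completeness rule is simplified to requiring at least one non-missing value per score column.

```python
import pandas as pd

# Hypothetical long-format table: one row per child per visit.
visits = pd.read_csv("visits.csv")

# Keep children with three or more visits.
n_visits = visits.groupby("child_id")["visit_id"].nunique()
eligible = n_visits[n_visits >= 3].index

# Keep children whose CAT and Leiter-3 scores are sufficiently complete
# (simplified here: at least one non-missing value per score column).
score_cols = ["hint", "tdt", "ssw",                     # behavioral CATs
              "abr_wave_v", "ffr_f0",                   # electrophysiological CATs
              "leiter_nviq", "leiter_ps", "leiter_am"]  # Leiter-3 composites
complete = (visits[visits["child_id"].isin(eligible)]
            .groupby("child_id")[score_cols]
            .apply(lambda g: g.notna().any().all()))
analysis_df = visits[visits["child_id"].isin(complete[complete].index)]
```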

Audiometry

Pure-tone thresholds were collected from all children at 0.5, 1.0, 2.0, 4.0, 6.0, and 8.0 kHz using Békésy-like tracking and modified Hughson-Westlake procedures (see Niemczak et al.25 for details). Thresholds of 25 dB HL or higher for each ear were considered abnormal. The pure tone average (PTA) was calculated by averaging thresholds from 0.5 to 4.0 kHz.
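As a small worked example of the PTA computation (the threshold values below are purely illustrative):

```python
import numpy as np

# Thresholds in dB HL for one ear at the tested frequencies (illustrative values).
thresholds = {0.5: 10, 1.0: 15, 2.0: 10, 4.0: 20, 6.0: 25, 8.0: 20}

# PTA averages the thresholds from 0.5 to 4.0 kHz.
pta = np.mean([thresholds[f] for f in (0.5, 1.0, 2.0, 4.0)])
print(pta)  # 13.75 dB HL
```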

Behavioral central auditory tests

The Hearing in Noise Test (HINT), Triple Digit Test (TDT), and Staggered Spondaic Words Test (SSW) were used to measure central auditory processing in children (for details, see Niemczak et al.13). HINT and TDT assess the ability to perceive and process speech in noise, while the SSW measures dichotic processing26. All tests were administered in the Kiswahili language.

Electrophysiological tests

The auditory brainstem response (ABR) followed a similar methodology to Niemczak et al.25. An Intelligent Hearing Systems SmartEP (Miami, FL) was used to record ABRs evoked by 100 μs rarefaction clicks presented to the right ear at a rate of 21.1/s (slow) or 61.1/s (fast) at 80 dB sound pressure level. The electrode montage consisted of the right earlobe as the inverting electrode, Fpz as ground, and the high forehead (Fz) as the non-inverting electrode. Two repetitions of each click condition were recorded and averaged (2000 sweeps total). Responses were filtered from 0.1 to 1.5 kHz. The absolute latencies and amplitudes of waves I, III, and V were measured from baseline.

The frequency following response (FFR) was evoked using the /ba/ syllable and collected from all subjects using the same hardware as the ABR. The collection of the FFR has been described in detail elsewhere27, and the methodology follows Ealer et al. (in review). Stimuli were played monaurally to the right ear at 80 dB HL at a rate of 4.35 per second. Two runs of 3000 artifact-free responses were collected, and responses were then filtered offline from 0.7 to 2 kHz. The /ba/ was primarily analyzed at the vowel region of the stimulus (i.e., the /a/), which is spectrotemporally static. The /ba/ stimulus was 180 ms in duration, and the vowel region spanned 60 to 180 ms. The FFR was recorded with alternating polarity. When analyzing the FFR, we added the two polarities together (added condition) or subtracted them from one another (subtracted condition) to emphasize lower-frequency information (i.e., the fundamental frequency) or higher-frequency information (i.e., formants), respectively.
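A minimal sketch of the added and subtracted conditions, assuming two averaged waveforms recorded with opposite stimulus polarities (the arrays below are placeholders):

```python
import numpy as np

# Placeholder averaged FFR waveforms, one per stimulus polarity.
ffr_polarity_a = np.zeros(4096)
ffr_polarity_b = np.zeros(4096)

# Added condition: emphasizes lower-frequency information (fundamental frequency).
ffr_added = (ffr_polarity_a + ffr_polarity_b) / 2

# Subtracted condition: emphasizes higher-frequency information (formants).
ffr_subtracted = (ffr_polarity_a - ffr_polarity_b) / 2
```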

Leiter-3

We used the Leiter International Performance Scales, Third Edition (Leiter-3), which assesses neurocognitive functioning in children and adults from 3 to 75 years of age. Domains measured include fluid and categorical reasoning, visual identification, and mental sequencing. The test is entirely nonverbal, with instructions delivered via gestures and pantomime. Participants respond via pointing, block or manipulative placement, and paper-and-pencil task completion. We have previously demonstrated the feasibility and acceptability of the Leiter-3 in Tanzania6,28 (see Lichtenstein et al.6 for a discussion of training methods and procedures).

Predictors

Predictors were extracted from early visits of all participants (see Table 1). We defined early visits as those at or prior to the median age of a child during their participation in the study. Gender, HIV status, the maximum number of years of education (years of education hereafter), and the best score for each behavioral and electrophysiological CAT during early visits were extracted. We defined gender, HIV status, and years of education as covariates. We reasoned that the maximum years of education and the best CAT scores should be used to predict the best performance on the Leiter-3 composites. Age was excluded as a variable because it is highly correlated with years of education (r = 0.83); including age would introduce correlation bias into tree-based models29. Using these predictors, we constructed 7 predictor sets with a variety of variable combinations: (1) covariates alone, (2) behavioral CATs, (3) electrophysiological CATs, (4) behavioral and electrophysiological CATs, (5) covariates and behavioral CATs, (6) covariates and electrophysiological CATs, and (7) covariates with behavioral and electrophysiological CATs.
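A sketch of the predictor extraction and the 7 predictor sets, continuing the hypothetical dataframe from the inclusion-criteria sketch; whether "best" means the maximum or minimum differs by test in practice (maximum is used here purely for illustration).

```python
# "Early" visits: at or before each child's median age during the study.
median_age = analysis_df.groupby("child_id")["age"].transform("median")
early = analysis_df[analysis_df["age"] <= median_age]

# One row per child: covariates plus the best score per CAT.
X = early.groupby("child_id").agg(
    gender=("gender", "first"),
    hiv_status=("hiv_status", "first"),
    years_education=("years_education", "max"),
    hint_best=("hint", "max"),
    tdt_best=("tdt", "max"),
    ssw_best=("ssw", "max"),
    abr_wave_v_best=("abr_wave_v", "max"),
    ffr_f0_best=("ffr_f0", "max"),
)

covariates = ["gender", "hiv_status", "years_education"]
behavioral = ["hint_best", "tdt_best", "ssw_best"]
electro = ["abr_wave_v_best", "ffr_f0_best"]

predictor_sets = {
    "covariates": covariates,
    "behavioral": behavioral,
    "electrophysiological": electro,
    "behavioral+electro": behavioral + electro,
    "covariates+behavioral": covariates + behavioral,
    "covariates+electro": covariates + electro,
    "covariates+behavioral+electro": covariates + behavioral + electro,
}
```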

Outcome classes

Outcome variables were extracted from later visits, defined as visits after the median age of a child during their participation in the study. The highest age-adjusted Leiter-3 scores (as opposed to raw scores) were extracted for three Leiter-3 composites (Nonverbal IQ, Processing Speed, and Nonverbal Attention/Memory)30. We divided the Leiter-3 scores by the age of the child at the time the test was administered; this accounts for the fact that older children would be expected to outperform their younger counterparts. The scores were then converted into categorical neurocognitive outcome classes (i.e., within age-based expectations and below age-based expectations). A child’s neurocognitive function class assignment was based on where their Leiter-3 composite score fell in the overall distribution of Leiter-3 scores. For each Leiter-3 composite, a child with a score 1 or more standard deviations below the mean was assigned to the below age-based expectations class (class 1); all other children were categorized as within age-based expectations (class 0). We used a 1 SD cutoff instead of 1.5 SD below the mean due to the limited size of our sample: only up to 5 of the 109 children had scores ≤ 1.5 SD below the mean for each Leiter-3 composite, and models tend to underperform with such low representation of any class31.
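Continuing the sketch above, the outcome class for one composite (Nonverbal IQ is used here as an example) could be derived along these lines; the age adjustment and 1 SD cutoff follow the description above, while the column names remain hypothetical.

```python
# "Later" visits: after each child's median age in the study.
later = analysis_df[analysis_df["age"] > median_age]

# Highest age-adjusted score per child (Leiter-3 score divided by age at testing).
nviq_adj = (later["leiter_nviq"] / later["age"]).groupby(later["child_id"]).max()

# Class 1 = below age-based expectations (1 SD or more below the mean), class 0 otherwise.
cutoff = nviq_adj.mean() - nviq_adj.std()
y = (nviq_adj <= cutoff).astype(int)
```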

Predicting outcome classes

Models and performance metrics

We selected three widely used machine learning models (RF, XGBoost, and SVM) appropriate for the dimensions of our data, and we assessed the predictive ability of a more traditional statistical model, logistic regression. All analytical steps were conducted in Python using the scikit-learn package32. The selected tree-based models, as well as SVM, typically perform well when used with small datasets33.
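A minimal sketch of how the four classifiers might be instantiated with scikit-learn and xgboost; the settings shown are illustrative defaults, not the tuned hyperparameters used in the analysis.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from xgboost import XGBClassifier

models = {
    "RF": RandomForestClassifier(class_weight="balanced", random_state=0),
    # XGBoost handles imbalance via scale_pos_weight rather than class_weight.
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=0),
    "LogReg": LogisticRegression(class_weight="balanced", max_iter=1000),
    "SVM": SVC(class_weight="balanced", probability=True, random_state=0),
}
```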

After assigning each child to a neurocognitive function category based on their adjusted Leiter-3 scores for each composite, we randomly assigned two-thirds of the data to the training set and the remaining one-third to the testing set while keeping a 2:1 ratio of below age-based expectations children in the training and testing sets, respectively. We used the training set to find the best hyperparameters and the testing set for model evaluation. The final sample of children classified as below age-based expectations was 24 for Nonverbal IQ, 17 for Processing Speed, and 17 for Nonverbal Attention/Memory.
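The split can be reproduced approximately with a stratified train/test split, continuing the earlier sketches (X and y are assumed to be indexed by the same children):

```python
from sklearn.model_selection import train_test_split

# Two-thirds for training, one-third for testing, stratified on the outcome class so
# below age-based expectations children are split roughly 2:1 between the sets.
X_train, X_test, y_train, y_test = train_test_split(
    X[predictor_sets["behavioral"]], y,
    test_size=1 / 3, stratify=y, random_state=0,
)
```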

We used performance metrics that account for the observed class imbalance in our data. The F-measure is a metric of choice in the machine learning literature when faced with class imbalance, as it can assign different weights to precision and recall based on their relative contextual importance (see Eq. (1))32:

$$F = \frac{\left(\beta^{2} + 1\right) P R}{\beta^{2} P + R}.$$
(1)

\(\beta\) is a non-negative parameter that controls the relative importance of \(P\) (precision) and \(R\) (recall)34. Precision is the proportion of correct predictions out of all data points assigned to a class; recall is the proportion of data points from a given class correctly assigned to that class. In our case, we set \(\beta\) equal to 1, indicating equal importance and yielding:

$$F = \frac{2 P R}{P + R}.$$
(2)

We also present the area under the receiver operating characteristic curve (AUROC/AUC) for the models of interest as a second metric for assessing model performance.
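For reference, both metrics are available in scikit-learn; a minimal evaluation of one fitted model, continuing the earlier sketches, might look like:

```python
from sklearn.metrics import f1_score, roc_auc_score

clf = models["RF"].fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]   # predicted probability of class 1

f1 = f1_score(y_test, y_pred)        # Eq. (2), i.e., beta = 1
auc = roc_auc_score(y_test, y_prob)  # area under the ROC curve
```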

Class balancing technique

Due to class imbalance in our data, the models would underperform in detecting the below age-based expectations class (class 1) in the testing sets. Our testing sets contained 8 such children for the Nonverbal IQ composite, 6 for Processing Speed, and 6 for Nonverbal Attention/Memory. To account for the imbalance, we set the class weight parameter of the models to balanced during model training, which scaled the loss function of each model. While training on each point, the error was multiplied by the weight assigned to the class of that training point, forcing the model to reduce its loss by disproportionately penalizing each misclassification of the heavily weighted (below age-based expectations: class 1) training points.
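For illustration, the balanced class weights computed by scikit-learn are inversely proportional to class frequency; with hypothetical training counts of 56 within-expectation and 16 below-expectation children, the weights would be roughly 0.64 and 2.25.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical training labels: 56 within-expectation (0), 16 below-expectation (1).
y_train_example = np.array([0] * 56 + [1] * 16)

weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y_train_example)
print(weights)  # approximately [0.64, 2.25]; misclassifying class 1 costs ~3.5x more
```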