## The importance of socio-emotional skills

What makes a student successful in school? Studies have shown that socio-emotional skills such as self-control or self-esteem rival cognitive skills in predicting academic achievement1,2,3,4,5,6,7,8,9,10 and other life outcomes such as employment, earnings, health, and criminality11,12,13,14. Among these socio-emotional skills, conscientiousness, self-control, and grit have been identified as playing an important role in academic achievement and future life outcomes15,16,17,18,19,20. Recent empirical work has demonstrated that interventions to increase these skills in children led to an increase in academic performance, suggesting a causal link between these variables21,22,23,24,25. In light of these findings, policy makers and practitioners around the world have implemented programs aimed at developing socio-emotional skills in students (e.g., CASEL or KIPP charter schools in the United States, the Singapore Positive Education Network, the Construye-T program in Mexico, the Beyond Academic Learning Program of the OECD, and Energie Jeunes in France).

The domain of socio-emotional skills is the subject of interdisciplinary research, spanning fields from economics to developmental psychology and professionals from academics to educators. As a result, many different terms and theoretical frameworks are used to describe and understand these skills. For example, in the economics literature, scholars often use the phrase “non-cognitive skills”, while some education experts prefer to talk about “character skills”. In this article, we use the term socio-emotional skills, as it is widely used in the psychological literature, and define such skills in accordance with the OECD framework as individual capacities that can be manifested in consistent patterns of thoughts, feelings, and behaviors26.

## The present research

To compare these measures, we first assessed their reliability, i.e., whether each measure is consistent across time, raters, and items, using long-term stability and Cronbach’s alpha. Second, we assessed the validity of each method by testing whether the measures of socio-emotional skills correlate with behaviors related to the same psychological constructs. Time spent doing homework and discipline at school are both behavioral measures that we expect to correlate highly with conscientiousness, self-control, and grit. In addition, the literature has shown that these socio-emotional skills affect the development of linguistic, cognitive, and academic aptitudes21. We therefore expect that students who score higher on conscientiousness, self-control, and grit scales will be more likely to have higher grades or to see their grades improve over time. Valid measures should thus predict school behavior and academic performance.
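To make the reliability computation concrete, Cronbach’s alpha can be computed directly from an item-score matrix. Below is a minimal sketch in Python, using made-up Likert responses rather than the study’s data:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # sample variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Illustrative data: 5 respondents answering 4 Likert items (1-5)
scores = np.array([
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
])
print(round(cronbach_alpha(scores), 2))  # → 0.96
```

Higher values indicate that the items move together, i.e., that the scale is internally consistent.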

Finally, before completing the data analysis, we polled a sample of 114 researchers in both economics and cognitive science to prevent hindsight bias, that is, the tendency for researchers to think “I knew that already”41. A link was sent out to researchers via the networks of the Paris School of Economics, the École Normale Supérieure, and Université Paris Dauphine. The survey asked respondents to rank three measures of socio-emotional skills in middle-schoolers: (1) a standardized child self-reported questionnaire, (2) a standardized teacher-reported questionnaire, and (3) a standardized behavioral task. We found that a vast majority of researchers believe that behavioral tasks are better than teacher-reported questionnaires (81%) or better than self-reported questionnaires (76%). Detailed results can be seen in Fig. 1. Respondents’ research specialty (economics or cognitive science) did not affect responses.

## Results

### Validity of the measures

Contrary to our own expectations and to researchers’ predictions, our results show that the behavioral task is the least valid method to assess socio-emotional skills in students. We first tested the correlation between each measure and the time spent doing homework. Results in Fig. 2a show that the student questionnaire was most correlated with time spent doing homework (0.13–0.19, p = 0.01), compared to the teacher questionnaire (0.06–0.08, p = 0.01) and the behavioral task (0.05, p = 0.05). However, the $$R^2$$ for these regressions is quite low (ranging between 0.03 and 0.06). Given that time spent doing homework was a self-reported variable, we can suspect measurement error due to social desirability bias. Moreover, students prone to social desirability bias may over-declare both time spent on homework and their socio-emotional skills, which may explain the higher correlation between the two. Therefore, we also looked at an objective measure of behavior: the discipline index of students, which is based on school administrative records of absenteeism, tardiness, sanctions, and disciplinary actions. We found that, as expected, all socio-emotional measures are positively correlated with the discipline index, meaning that students who are more conscientious, gritty, or have higher self-control are more disciplined. Results in Fig. 2b also show that the teacher-reported questionnaire is more strongly correlated with the discipline index (0.40–0.47, p = 0.01) than the student questionnaire (0.13–0.27, p = 0.01) or the behavioral task (0.15, p = 0.01). We performed the same analysis for French and math GPA in panels 2c and 2d and similarly found that the teacher questionnaire is most predictive.

### Comparing the costs of different methods

For a method of measure to be useful, it must not only be reliable and valid, but also implementable in practice34. For this reason, we were interested in the relative cost of each method. Our analysis included the cost of hiring research assistants, questionnaire and task supports (electronic or paper), data transcription, data cleaning, etc. We estimated the cost of each method at 44,731 euros for the student questionnaire, 12,907 euros for the teacher questionnaire, and 50,191 euros for the behavioral task (see Table 2 for a more detailed breakdown of costs). Although specific costs may vary significantly from one study to the next, a number of factors make behavioral tasks consistently costlier than questionnaires. Moreover, one teacher can be surveyed about multiple students, which creates economies of scale for the teacher questionnaire. Another way to look at cost is the money spent per student to predict a change of one standard deviation in behavior. Based on their respective predictive power for change in the disciplinary index, the teacher questionnaire costs only 13 euros per student, while the student questionnaire and the behavioral task cost 75 euros and 154 euros per student, respectively. Our study suggests that many research teams may be allocating resources to the implementation of behavioral tasks even though cheaper and more accurate methods of measure exist.

## Discussion

Reliability and validity measures show that the behavioral task systematically under-performed relative to questionnaires in predicting outcomes related to socio-emotional skills. In addition, the teacher-reported questionnaire, which researchers considered the worst method of measurement, showed similar reliability and the highest correlation with behavioral and school outcomes. It is, however, important to acknowledge the limitations of our sample. The students who participated in our study are not representative of the entire French population, as they attend low-income schools. It may be the case that measures from a behavioral task are more reliable and valid for students from affluent backgrounds. It may also be the case that teachers in more affluent schools are less able to assess the socio-emotional skills of their students.

These results contradict researchers’ predictions and contribute to a growing literature demonstrating the limits of behavioral tasks. For example, a recent study on self-control shows that self-reported measures and inhibition task performance correlate very poorly46. The authors list three main reasons for the lack of convergence between self-reported measures and behavioral tasks: (1) self-reported questionnaires measure typical performance while tasks measure maximum performance, (2) self-reported measures capture central tendencies of behavior while behavioral tasks are momentary captures of one-time performance, and (3) self-reported questionnaires measure a general cross-domain trait while a behavioral task focuses on a narrower manifestation of the trait. A recent study by Enkavi et al. shows that self-regulation measures derived from self-reported questionnaires have higher test-retest reliability than those derived from behavioral tasks44. A number of studies also show that self-reported measures and behavioral tasks correlate poorly, with self-reported measures correlating better with real-life outcomes47,48. Finally, behavioral tasks may suffer from framing effects: for example, a study on cooperation shows that performance in a cooperation game is strongly affected by the name given to the game (Community Game, Wall Street Game, Environment Game, or simply Game)49.

Our findings are relevant not only to the study of personality, but also to many other fields measuring individual outcomes. Studies focusing on health outcomes, for instance, have shown that self-reported measures often predict actual morbidity, mortality, or other risk factors better than supposedly more objective measures such as the Global Activity Limitation Index50,51,52. Although health institutions such as the National Institutes of Health (USA) have pushed for standardized ways of measuring the health of patients, these measures might in fact be less accurate than self-reports at predicting morbidity and mortality. Including self-reported and third-party questionnaires therefore remains valuable, especially in domains where experimental tasks or objective measures may not capture inter-individual differences properly. Studies of happiness and wellness, for example, could include third-party questionnaires and compare the results with self-reported measures, as behavioral tasks and objective measures are hard to develop in this domain.

Does this mean we should stop using behavioral tasks altogether and rely only on self-reported or third-party questionnaires to measure socio-emotional skills? There is no perfect tool of measurement, and selecting the appropriate one depends on the context of the experiment. In the context of policy evaluation, for instance, self-reported questionnaires may not be reliable, as the intervention may affect both behavior and the perception of that behavior. A study by Algan, Guyon, and Huillery shows that following an intervention to curb school bullying, pupil-reported violence increased while objective and teacher-reported violence decreased53. Based on the self-reports alone, researchers would have falsely concluded that the intervention had a negative impact, when the results were in fact due to better awareness of bullying among pupils. Likewise, a teacher-reported questionnaire may be a poor predictor of behavior if the teacher has limited contact with the student or has an incentive to bias responses (such as being compensated based on the progression of students). Behavioral tasks, for their part, can be administered repeatedly to obtain a more accurate measure of the average level of the trait being measured, several behavioral tasks can be combined for a broader understanding of the construct, and tasks can be improved to better capture the trait of interest.

The most surprising finding of our study was the high validity of the teacher-reported questionnaire, even after controlling for observed behavior. Third-party informants are an interesting yet underused method for studying psychological traits. A study by Vazire shows that informants are a cheap way to gather data, that they are often willing to cooperate, and that they provide valid data54. One limitation of our study, however, is that attrition was higher for the teacher-reported questionnaire than for the student-reported questionnaire, because it is harder to collect data from staff than from the children themselves.
New approaches should be developed to ensure a high rate of teacher response. Instead of abandoning the use of questionnaires, we should work to improve them, for example by providing a shared reference point, or by ensuring that respondents believe in the confidentiality of their answers, which would minimize the influence of social desirability bias.

Psychologists and economists believe that behavioral tasks are the most reliable way to measure socio-emotional skills. Our study allowed us to test this intuition by comparing three tools for measuring socio-emotional skills: a student-reported questionnaire, a teacher-reported questionnaire, and a behavioral task. In addition to comparing the reliability of each tool, access to long-term behavioral data allowed us to compare their construct validity. We found that, contrary to researchers’ predictions, the behavioral task was the least valid tool while the teacher-reported questionnaire was the most valid. Research on socio-emotional skills may thus suffer from a bias regarding which tools are best to use.

## Method

The experimental protocol was approved by the Ethical Research Committee of the JPAL (Abdul Latif Jameel Poverty Action Lab) in Paris. All methods were carried out in accordance with relevant guidelines and regulations.

### Participants

We collected data in a sample of 97 French REP (priority education network) middle schools located across the country. Written informed consent was obtained from all subjects or, for subjects under 18, from a parent and/or legal guardian. Seven students per class were randomly selected among all sixth- and seventh-grade students to participate in our study. Students were 13 years old on average (SD = 0.78), 88% were of French nationality, and 52% benefited from financial aid, which is about 14 points above the national rate. Data collection was embedded in a larger study aimed at measuring the impact of a low-intensity intervention on middle school students to improve academic achievement25. To avoid confounding with the larger study’s intervention, we sampled students only from the control group.

All students in the experiment completed the questionnaire and the behavioral task; missing values therefore came from random issues in extracting school records. We excluded a total of 2,006 students for whom data were missing. However, to ensure that our results were not affected by attrition, we conducted the same analysis on the total sample using imputations and found no difference (see Appendix Figures C8 to C11). The data collection took place over three years. The first cohort consisted of 784 students in sixth grade in the spring of 2015. The second cohort consisted of 1,166 students in sixth grade and 1,117 students in seventh grade in the spring of 2016. The third cohort consisted of 930 students in seventh grade in the spring of 2017. A total of 3,997 separate measures were included in our study. A subset of these measures came from the same students, who were randomly selected in both the first and second cohorts, or in both the second and third cohorts.

### Experimental design

Research assistants were dispatched across the French territory to collect data in each middle school. After training, research assistants collected administrative data from the school and administered both the behavioral task and the student questionnaire during normal school hours. Students each received earbuds and a digital tablet to complete both the questionnaire and the behavioral task. Research assistants also distributed the teacher questionnaire in paper format to one teacher per class and collected the answers a few days later. Research assistants and the teachers involved in the study were blind to the purpose of the experiment and the hypothesis being tested.

### Measures of socio-emotional skills

To measure socio-emotional skills, we used three separate instruments: a student-reported questionnaire, a teacher-reported questionnaire and a behavioral task. When the instruments were only available in English, the material was translated from English to French using the Back Translation method to ensure a high degree of reliability. The French version of the questionnaires can be found in the Appendix (see Materials Section in Appendix).

### Student-reported questionnaire

Students completed a battery of self-reported questionnaires on digital tablets. They were told that their answers would remain anonymous and confidential. All responses were encoded on a scale from 1 (“not at all like me”) to 5 (“very much like me”). Some items in each scale were inverted on the questionnaire to make sure that students were not systematically choosing the same answer. Answers were re-coded and averaged such that a higher score always indicates more agreement with the construct. Students answered four questions from the Big Five Inventory to assess conscientiousness36. Students answered eight questions related to grit adapted from the Short Grit Scale: four related to consistency of interest and four related to perseverance of effort37. A grit composite index was calculated as the mean of the answers to all eight questions, with higher scores indicating more grit. Students also answered eight questions related to self-control from the Domain-Specific Impulsivity Scale for children38: four related to self-control in the domain of school work and four in the domain of interpersonal relationships. A self-control composite index was calculated as the mean of the answers to all eight questions, with higher scores indicating more self-control.
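As a concrete illustration of this scoring procedure, the following sketch re-codes reverse-keyed items on a 1–5 scale and averages them into a composite index. The data and item layout are illustrative, not the study’s:

```python
import numpy as np

MAX_SCALE = 5  # Likert responses run from 1 to 5

def score_scale(responses: np.ndarray, reversed_items: list) -> np.ndarray:
    """Average item scores per student, after flipping reverse-keyed items.

    responses: (n_students, n_items) matrix of raw 1-5 answers.
    reversed_items: column indices of reverse-keyed items.
    """
    scored = responses.astype(float)
    # On a 1-5 scale, a reverse-keyed answer x becomes 6 - x
    scored[:, reversed_items] = (MAX_SCALE + 1) - scored[:, reversed_items]
    return scored.mean(axis=1)  # higher = more of the trait

# Illustrative: 2 students x 4 items; items 1 and 3 reverse-keyed
answers = np.array([[5, 1, 4, 2],
                    [3, 3, 3, 3]])
print(score_scale(answers, reversed_items=[1, 3]))  # student 1 → 4.5, student 2 → 3.0
```

The same averaging is applied to each scale, so every composite index reads in the same direction.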

### Teacher-reported questionnaire

One teacher per class completed a questionnaire during normal school hours. Questionnaires were translated from the Character Growth Card39, and answers ranged from 1 (“this doesn’t resemble the student at all”) to 5 (“this completely resembles the student”). Within the Character Growth Card, teachers answered three questions related to grit and eight questions related to self-control for each student. A grit composite index was calculated as the mean of the answers to all three questions, with higher scores indicating more grit. A self-control composite index was calculated as the mean of the answers to all eight questions, with higher scores indicating more self-control.

### Behavioral task

For the behavioral task, we replicated the Academic Diligence Task developed in Galla et al., 2014; the pre-registration for the replication is available on the project’s OSF page https://osf.io/afzgx40. This task was designed to measure self-control and grit in students. Students had to choose between solving simple math questions and watching entertaining videos (e.g., a movie trailer or music videos). Before the beginning of the task, the experimenter explained that solving math problems is important to develop the brain and students were encouraged to solve as many math problems as possible. Students were also told that their answers would be anonymous and confidential, and that they could do whatever they wanted. The task consisted of three blocks of three minutes during which the students could choose between solving one-digit subtractions or watching entertaining videos. Our outcome variables were the total number of attempted subtractions, the total number of subtractions correctly solved, and the percentage of time spent solving math problems. For a complete description of the task and replication, please see the Appendix.

### Measures of behavioral outcomes and academic achievement

For each student, we extracted data on disciplinary outcomes from school records: number of late arrivals, number of absences, number of sanctions, and number of disciplinary actions. On average, students arrived late at school 4.8 times during the year (SD = 8.3), had a total of 3.0 days of unjustified absences (SD = 6.8), and received 3.5 sanctions (SD = 7.5) and 0.3 disciplinary actions (SD = 1.2). We standardized, inverted, and summed these four measures to create a disciplinary index, with higher values indicating more disciplined behavior. Each student was also asked, in the student-reported questionnaire, to report the time she spent on homework in the last two days. Given that time spent doing homework may be sensitive to the day of the week, we recorded the day of the week on which the student completed the survey and added it as a control. In addition, we collected students’ math and French GPA: average French and math GPA were respectively 12.1 and 11.9 out of 20 (SD = 3.4 and 3.9). For a subset of students, we had access to their school records in the years following our study, which allowed us to measure the change in the disciplinary index and in French and math GPA from one year to the next.
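The construction of the disciplinary index (standardize each count, invert it, and sum) can be sketched as follows; the counts below are illustrative, not the study’s records:

```python
import numpy as np

def disciplinary_index(*count_arrays):
    """Standardize, invert, and sum incident counts into one index.

    Each input is a 1-D array over students (e.g., late arrivals, absences,
    sanctions, disciplinary actions). Higher index values indicate more
    disciplined behavior.
    """
    index = np.zeros_like(count_arrays[0], dtype=float)
    for counts in count_arrays:
        z = (counts - counts.mean()) / counts.std(ddof=1)  # standardize
        index -= z  # invert: fewer incidents -> higher score
    return index

# Illustrative counts for three students
late = np.array([0.0, 5.0, 10.0])
absences = np.array([1.0, 2.0, 9.0])
sanctions = np.array([0.0, 3.0, 8.0])
actions = np.array([0.0, 0.0, 2.0])
index = disciplinary_index(late, absences, sanctions, actions)
# The student with the fewest incidents gets the highest index value
```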

### Statistical analysis

To estimate the validity of each method of measure, we used ordinary least squares to estimate the following equation:

$$\begin{aligned} Y_{is} = \alpha + \beta NC_{is} + \gamma X_{is} + \theta _{s}+ \epsilon _{is} \end{aligned}$$

where $$Y_{is}$$ is the standardized behavioral or school outcomes (e.g., time spent doing homework, disciplinary index, French or math GPA) for individual i in school s, $$NC_{is}$$ is the standardized measure of a socio-emotional skill (e.g., BIG5 conscientiousness score, number of subtractions correctly solved, etc.), $$X_{is}$$ is a vector of baseline covariates (including the school year of the student), $$\theta _{s}$$ are school fixed effects, and $$\epsilon _{is}$$ is an error term. In order to control for current behavioral outcomes, we estimated a similar equation:

$$\begin{aligned} Y_{is, t+1} = \alpha + \delta Y_{is, t} + \beta NC_{is, t} + \gamma X_{is, t} + \theta _{s}+ \epsilon _{is} \end{aligned}$$

where $$Y_{is, t+1}$$ is the behavioral outcome (e.g., time spent doing homework, disciplinary index, French or math GPA) in school year $$t+1$$, $$Y_{is, t}$$ is the behavioral outcome in school year t, and $$NC_{is, t}$$ is the standardized measure of socio-emotional skills in school year t. To test the robustness of our results to additional covariates, we ran the same regressions adding age, gender, nationality (French or foreign), and financial aid (yes or no) as dummies. All effects were robust to adding these covariates.
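As an illustration of the first specification, the sketch below estimates $$\beta$$ with school fixed effects entered as dummies, using simulated data and illustrative variable names (it assumes the statsmodels library is available; this is not the authors’ code):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "school": rng.integers(0, 20, n),  # school identifier s
    "grade": rng.choice([6, 7], n),    # baseline covariate in X
    "nc": rng.normal(size=n),          # standardized skill measure NC
})
# Simulated outcome: true beta of 0.3, plus school effects and noise
school_effect = rng.normal(size=20)
df["y"] = 0.3 * df["nc"] + school_effect[df["school"]] + rng.normal(size=n)

# OLS with school fixed effects, mirroring the first equation
model = smf.ols("y ~ nc + grade + C(school)", data=df).fit()
print(round(model.params["nc"], 2))  # estimate of beta, close to the true 0.3
```

The `C(school)` term absorbs school-level differences, so the coefficient on `nc` is identified from within-school variation.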