A national experiment reveals where a growth mindset improves achievement

A global priority for the behavioural sciences is to develop cost-effective, scalable interventions that could improve the academic outcomes of adolescents at a population level, but no such interventions have so far been evaluated in a population-generalizable sample. Here we show that a short (less than one hour), online growth mindset intervention—which teaches that intellectual abilities can be developed—improved grades among lower-achieving students and increased overall enrolment to advanced mathematics courses in a nationally representative sample of students in secondary education in the United States. Notably, the study identified school contexts that sustained the effects of the growth mindset intervention: the intervention changed grades when peer norms aligned with the messages of the intervention. Confidence in the conclusions of this study comes from independent data collection and processing, pre-registration of analyses, and corroboration of results by a blinded Bayesian analysis.


Main text
This document provides supplementary information for the main text: See manuscript for full author list, affiliations, and notes. Address correspondence to David S. Yeager (dyeager@utexas.edu) or Paul Hanselman (paul.hanselman@uci.edu).

Overview
This document provides methodological information and additional detail about the National Study of Learning Mindsets.
In the Research Questions section, we print the research questions from our pre-registered analysis plans.
In the Representativeness section, we compare the National Study of Learning Mindsets sample to the sampling frame to assess the representativeness of this sample.
In the Intervention section, we present methodological details of the intervention, including example screenshots and students' responses.
In the Implementation section, we provide methodological details of implementation and participation in the study.
In the Data Description section, we present descriptive statistics for the achieved sample, balance tests of the effectiveness of random assignment, and attrition.
In the Methods for Pre-registered Analyses section, we present methodological details for how we answered our primary research questions.
In the Methods for CACE Analyses section, we present methods for how we estimated the complier average causal effects, representing the effect of the treatment on the treated, in contrast to the pre-specified intent to treat estimates.
In the Mindset Norms Validity section, we present methodological information about the validity of the school mindset norms moderator measure.
In the Pre-registration File section, we reproduce the full pre-registered analysis plan, which is also available at: https://osf.io/afmb6/
• We hypothesized that the intervention effect would vary with respect to two factors: school achievement level and school challenge-seeking norms (called "mindset saturation" in the analysis plan, p. 2).
• The analysis plan stated that we would test a "hybrid" mixed-effects model (school fixed intercepts and random slope), and this is what was estimated.
• The results of this analysis are reported in the Extended Data; both hypothesized moderators were significantly associated with the size of the treatment effect.
• The analysis plan (p. 9) stated that we predicted moderation by behavioral norms but not by self-reported norms, and the manuscript reports this finding.
• Planned follow-up analyses comparing low- and high-achieving schools to medium-achieving schools, described in the analysis plan (section 18, page 12), appear in the Extended Data.
• A planned robustness analysis, in which statisticians were provided with a blinded dataset so they could conduct non-parametric analyses of the moderators (as described in section 18, page 12 of the analysis plan), was conducted via Bayesian Causal Forest (BCF). See model output in the Extended Data.
Follow-up analyses. The pre-registered analysis plan called for several follow-up analyses (section 20, page 12):
• The plan stated that we would take steps to address the robustness of the assumptions of the linear model, namely the homoskedasticity assumption. The models reported below did this by calculating heteroskedasticity-robust standard errors using defaults in the StataSE software.
• The plan stated that non-parametric models would explore potential differences among school achievement subgroups, and this is done in the "Bayesian Causal Forest" analysis reported in section 8 below.
• The plan stated that we would assess the representativeness of the participating schools and the potential to generalize to the population, and we do so in section 3 below.
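The heteroskedasticity-robust standard errors mentioned above (Stata's default robust option corresponds to the HC1 sandwich estimator) can be sketched directly. This is an illustration on simulated data; the variables and coefficient values are made up and are not from the study.

```python
import numpy as np

# Hedged sketch: HC1 heteroskedasticity-robust standard errors for OLS,
# the estimator matching Stata's robust default. Synthetic data only.
rng = np.random.default_rng(0)
n = 500
treat = rng.integers(0, 2, n)          # hypothetical treatment indicator
prior = rng.normal(0, 1, n)            # hypothetical prior-achievement covariate
# Heteroskedastic errors: the error variance grows with |prior|
y = 0.1 * treat + 0.5 * prior + rng.normal(0, 1 + np.abs(prior), n)

X = np.column_stack([np.ones(n), treat, prior])
k = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# HC1: sandwich estimator with small-sample scaling n / (n - k)
meat = X.T @ (X * (resid ** 2)[:, None])
cov_hc1 = (n / (n - k)) * XtX_inv @ meat @ XtX_inv
robust_se = np.sqrt(np.diag(cov_hc1))
print(beta[1], robust_se[1])  # treatment coefficient and its robust SE
```

Unlike classical OLS standard errors, the sandwich estimator remains consistent when the error variance differs across observations.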
Planned exploratory analyses that were conducted. We stated that these "will be reported in the manuscript or supplement regardless of the outcomes" (analysis plan p. 13).
• The analysis plan stated that we would report treatment effects on the "poor performance rate" (rates of earning D/F averages). This is reported in the manuscript.
• The analysis plan stated that we would report results for separate subjects, and this is done in the Extended Data.
• The analysis plan also stated that we hypothesized stronger results especially for math and science grades; this hypothesis was supported (with respect to the norms moderation finding), and so we report these results in the manuscript and in the Extended Data.
• The analysis plan stated that we would compute 5 metrics of schools' success at implementing the treatment with high fidelity, that we would aggregate those metrics into a single school-level measure, and that we would test whether this aggregated measure changes the primary heterogeneity findings. The school-level fidelity metrics are reported in the manuscript and in section 5.6 below. Analyses that test the robustness of our moderation conclusions to differences in the fidelity of implementation are presented in the Extended Data; these find that fidelity does not explain the primary moderation results. However, this exploratory analysis is preliminary.
• The analysis plan stated that we would explore whether cross-site heterogeneity in the strength of the intervention effect on the manipulation check might explain cross-site heterogeneity in the effects of the intervention on GPA. We report in the manuscript that we did not find significant cross-site heterogeneity in the size of the intervention effect on the manipulation check. Neither self-reported mindsets (reported below) nor self-reported challenge-seeking (a single item reported in Yeager et al., 2016 and not discussed here for parsimony) showed significant variability.
Exploratory analyses that were not conducted. A second set of "planned exploratory analyses" were described in this way (analysis plan pp. 14-15): "They could be presented as secondary analyses for the primary paper, or they could constitute papers of their own." In all but one case, we did not conduct this second set of planned exploratory analyses because they were beyond the scope of the paper. The only one of these planned exploratory analyses we conducted was #6 ("Interaction of achievement level and mindset saturation"). The "Bayesian Causal Forest" analysis allowed for this interaction and did not find it.
Deviations from the analysis plan. These are the minor deviations from the plan; none substantively impact the main results or conclusions of the study.
• We expected to receive grades data from 66 schools but one school provided only baseline (not post-intervention) data, so the number of schools was 65.
• On page 11 of the analysis plan we stated that we would conduct a permutation test of the variability of the treatment effect across schools, to test its significance, but an expert raised questions about the validity of that test, so we rely on the Q statistic instead (which was also pre-registered on page 11). That test was significant and yielded the same conclusions.
• On the same page we stated that we would compare our cross-site variability statistic to published benchmarks, but this may not be valid because not all past studies were universal prevention studies. So although the current estimate of variability in GPA impacts would be at the higher end of the distribution of the effects reported by Weiss et al. (2017, Table 1C), we refrain from explicit comparisons because it would be difficult to justify that they are appropriate.
• On page 12 of the analysis plan we stated that we would test whether there is a significant reduction in cross-site variability in the treatment effect after inclusion of the moderators; however, we were not able to find a satisfactory and valid test of this difference, so we did not conduct that analysis.
Other analysis notes. Unless otherwise noted, analyses employ weights to represent the population of regular public high schools in the U.S. (as pre-specified). Weights were provided by the survey research firm. Robustness analyses (see Extended Data) examine the impact of weights on the results.
In the main text and extended data, we present regression coefficients in terms of unstandardized effect sizes to make it easier to translate impact in terms of the natural metrics of a 0 to 4.0 GPA or the % of students prevented from failing. The standardized effect sizes we present are "Glass's Delta," defined as the group mean difference divided by the control group's standard deviation.
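Glass's Delta as defined above (group mean difference divided by the control group's standard deviation) can be computed directly. The GPA numbers below are simulated for illustration only.

```python
import numpy as np

# Hedged sketch: Glass's Delta, per the definition in the text.
# The simulated GPAs are illustrative, not study data.
rng = np.random.default_rng(1)
control_gpa = rng.normal(2.6, 0.9, 4000)
treated_gpa = rng.normal(2.7, 0.9, 4000)

unstandardized = treated_gpa.mean() - control_gpa.mean()   # in GPA points
glass_delta = unstandardized / control_gpa.std(ddof=1)     # control-group SD
print(round(unstandardized, 3), round(glass_delta, 3))
```

Dividing by the control group's standard deviation (rather than a pooled SD, as in Cohen's d) keeps the denominator unaffected by any treatment effect on outcome variance.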

Representativeness of the Analytic Sample
Because the National Study of Learning Mindsets (NSLM) had a school response rate of 56%, we evaluated whether site-level non-response compromised the generalizability of the sample. We did so by carrying out a benchmark analysis to assess the representativeness of the NSLM analytic sample relative to the national sampling frame. Here we provide a summary of this comprehensive benchmarking analysis. More details are reported in a technical paper on this topic (see Gopalan, 2018).
The school- and district-level benchmarks were obtained from publicly available data such as the Common Core of Data (CCD), the Office of Civil Rights (OCR), and a school district-level tabulation of American Community Survey (ACS) data. We obtained this information for the sampling frame, which included all regular, U.S. public high schools with at least 25 students in 9th grade and in which 9th grade is the lowest grade.
The NSLM analytic sample had a high degree of similarity to the inference population on two metrics. First, comparisons of school- and district-level characteristics between the NSLM analytic sample and the inference population found few statistically significant differences, as shown below. Second, an empirical measure of generalizability, the Tipton (2014) generalizability index, indicated that the analytic sample is highly generalizable to the population.

Generalizability Index
The generalizability index (Tipton, 2014) is a summary measure that provides the degree of distributional similarity between the schools in the analytic sample and the inference population, conditional on a set of covariates. The index is calculated using propensity scores from a sampling propensity score model, which predicts membership into the analytic sample, given a set of observed school-level characteristics, using logistic regression. The generalizability index takes on values between 0 and 1, where a value of 0 means that the analytic sample and inference population are completely different and a value of 1 indicates that the analytic sample is an exact miniature of the inference population on the selected covariates (i.e., all standardized mean differences for the covariates are 0). Please see Tipton (2014) for more details regarding the motivation, proofs, and empirical validity of this index in making generalizability claims. This index is estimated using kernel densities and the R code provided by Tipton (2014) in her online supplement.
Based on a simulation study, Tipton (2014) recommends that experimental samples with generalizability indices greater than 0.90 can be considered to be as good as a random sample from the population of interest, conditional on the covariates included in the sampling propensity score model. The Table below shows that the generalizability index is .98. Additionally, Gopalan (2018) finds that the analytic sample is similar to four other theoretically relevant inference populations identified based on school achievement categories and the proportion of stereotyped minority students in school (high vs. low). This is important because our paper seeks to make inferences about conditional average treatment effects within these subgroups. In all, we find that site-level non-response does not compromise the generalizability of the results from the achieved sample of schools in the NSLM.
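The index calculation described above can be sketched as the overlap of kernel density estimates of sampling propensity scores in the analytic sample versus the inference population. Tipton (2014) supplies the canonical R implementation; the Python version below, with simulated propensity scores and a hand-picked bandwidth, is only an illustration of the calculation.

```python
import numpy as np

# Hedged sketch of the Tipton (2014) generalizability index: the overlap
# of kernel density estimates of propensity scores for the analytic
# sample vs. the population. Scores here are simulated; real scores come
# from a logistic regression of sample membership on school covariates.
rng = np.random.default_rng(2)
pop_scores = rng.beta(2, 5, 5000)      # hypothetical population propensity scores
samp_scores = rng.beta(2.2, 5, 800)    # hypothetical analytic-sample scores

def kde(points, grid, bandwidth):
    """Simple Gaussian kernel density estimate evaluated on a fixed grid."""
    z = (grid[:, None] - points[None, :]) / bandwidth
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (
        len(points) * bandwidth * np.sqrt(2 * np.pi))

grid = np.linspace(0, 1, 512)
h = 0.03  # bandwidth chosen by eye for this illustration
f_pop = kde(pop_scores, grid, h)
f_samp = kde(samp_scores, grid, h)

# Index = integral of sqrt(f_samp * f_pop); 1 means identical distributions
vals = np.sqrt(f_samp * f_pop)
b_index = float(np.sum((vals[1:] + vals[:-1]) / 2 * np.diff(grid)))
print(round(b_index, 3))
```

The closer the two propensity-score densities are, the closer the integral is to 1, which is why a value near 0.90 or above is read as evidence of generalizability conditional on the modeled covariates.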

Overview of the Student Participation Process
The participation process is depicted in the Study Overview Figure below. As it shows, the design of the study called for schools to deliver both sessions of the intervention to all students in the school in the fall of 2015, and for the two sessions to be roughly three weeks apart.
A school liaison worked with a member of the data collection firm to select teachers who would devote two non-academic class periods to the study. There was no restriction on type of class, and often non-academic subjects such as PE or Music were selected. Teachers brought their students to the school's computer lab during normal class time and read a brief script explaining that students were about to participate in a study.
The study was described as a part of a research project that entailed a survey about the transition to high school. Students then signed into the research website and were randomized by the web server to a growth mindset condition or a "brain basics" control condition. Every person involved in the study was blind to condition assignment throughout the study (indeed, there was no way for a school staff person or research team member to access that information).
Students in the growth mindset condition: (1) were presented with information about neural plasticity that emphasized how brain functions can improve when one confronts new challenges and practices more difficult ways of thinking, (2) completed writing exercises designed to help students understand and internalize the intervention message by applying the message to their own life and restating the message for a future student. For examples of the intervention materials and student writing, see below.
Students in the control group, like those in the intervention group, read a brief article about the brain and answered reflective questions. However, they did not learn about the brain's malleability. Instead, they learned about basic brain functions and their localization, for example, the key functions associated with each cortical lobe. The experimental conditions were designed to look very similar so that students' instructors would remain blind to condition assignment, and to discourage students from comparing their materials.

Student Responses to Three Key Treatment Open Response Prompts
In this section we summarize students' engagement with the growth mindset intervention with information about their responses to three key treatment prompts that asked students to reflect on key aspects of the mindset message. Selected text from each of the three prompts is as follows:
• A. What is a time you grew stronger connections in your brain? Think about a time you had to work really hard to get better at something in school: maybe it was a new kind of writing assignment or a math problem that seemed really hard at first. What was a time you made your brain stronger in school?
• B. This is where we really need your input. Think about new students coming to 9th grade next year. Imagine a student who is struggling in one of their classes and is feeling discouraged. Maybe the work feels too hard for them, or maybe they are having trouble staying motivated. What is the most important thing (or things) you learned today that could help them? Write a personal letter to encourage a 9th grader next year in the box below.
• C. When people have a stronger brain, they're ready to do things that matter to them. And if we want to explain this to next year's students, we need to learn what kinds of issues matter to you. Please answer this question: What issues matter most to you personally? . . . Try especially to think of something where having a stronger brain might help a person like you make a difference for the issue one day.
The distribution of word counts represents high individual-level engagement with the interactive intervention materials. Response counts were lower in the second session because some students did not see the session 2 materials (570 did not participate and 60 received control condition materials due to matching errors). Among students who started session 2, engagement was comparable.
Example Responses. In response to the prompt in the treatment condition asking about a time when students had to stretch and grow their brains, students wrote: A time I made my brain stronger in school is every other day when I go to algebra class. It's not that it's a hard subject for me, it's just that when we first have to learn something new it's difficult at first. But then when we keep working and do practice on it, it becomes easier.
There was a math unit that I really didn't understand and when we took the quiz I got a really bad grade. But I studied more and was able to retake that quiz to get a better score. My brain grew stronger during exams and finals because you need to study in order to pass and learn by doing this your brain gets stronger and smarter.
In math because I couldn't really understand some assignments as much . But I started to help my mom with college alegbra so then I stared off again pumped up to do math. Ever since then I have been during real good in that class.
Example Responses. At the end of Session 1, students wrote what appear to be inspiring notes to future students who may be struggling in their freshman year. For instance, students wrote:

Free Response B
Dear Struggling Student, Don't be afraid to ask for help because once you do you won't regret it. And just because something is hard that doesn't mean you aren't smart.
It will be your first year in high school which means that it will be hard and you will struggle in some of your classes but that doesn't mean you have to give up and not try any different ideas. For example I thought my math was hard when I was a freshmen but after months passsing by I started to get better at math so then I started to get higher test scores on my test. So my word to you guys is to not give up and keep trying :) Don't be afraid or scared to learn. Just know that if you are trying your brain is getting smarter. Just because you don't know how to do it or it's too hard, just ask for help.
Example Responses. During Session 2, students were invited to reflect on issues that mattered most to them personally, and connect their learning to their desire to make a difference on those issues. Ninth grade students wrote passionately about a broad variety of important societal issues. Here are a few examples:

Free Response C
The issues that matter most to me personally are helping people who are less fortunate than us get jobs. Society lately has been very cruel to homeless people are those who do not possess a lot of money. They tell them that they need to get a job, yet how can they get a job when they have no money to get a house, or presentable clothes?
The issues that matter most to me personally would have to be dirty water in other countries. While we have nice somewhat clean water it's horrible that other countries have to drink horrible non sanitized drinking water.
One issue that matters to me is the Syrian refugees. In some refugee camps, they are treated very poorly and don't get enough food and water. Also, there are some people who are stuck in Syria and can't get away, and they are stuck in a war-torn country that they can no longer call home.

Qualitative Assessment of Responses
As a simple measure of students' levels of engagement with the intervention content, an analyst coded whether participants wrote valid, good-faith (non-gibberish) responses to free response questions A, B, and C listed above. These three were selected because they were considered critical to the intervention content (they were not comprehension checks) and they asked for substantive, more involved responses. The analyst drew a random 10% of participant responses separately for each of the three questions. Blank responses (which were 3% to 8% of responses) were not sampled. According to the codebook, a valid response was an honest attempt to answer the question. An invalid response was any of these: "idk," "I don't know," "nothing," "no," "never," a response that stopped at the sentence starter (i.e. "dear struggling student"), or a nonsense response (e.g. random letters and numbers). In this random sample, 99% of responses were judged to be valid, honest answers, and 1% were judged to be invalid or nonsense answers. This signals high levels of engagement with the open-ended questions.
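The codebook rule described above can be approximated programmatically. The pattern list and the no-vowel gibberish heuristic below are illustrative simplifications of the human coding, not the actual codebook.

```python
import re

# Hedged sketch of the validity-coding rule: flag responses matching the
# codebook's invalid patterns ("idk", "nothing", bare sentence starters)
# or an illustrative gibberish heuristic. The human coders' judgments
# were richer than this simple filter.
INVALID = {"idk", "i don't know", "i dont know", "nothing", "no", "never",
           "dear struggling student"}

def is_valid(response: str) -> bool:
    text = response.strip().lower().rstrip(".,!")
    if text in INVALID:
        return False
    # crude gibberish check: no vowels at all suggests random key-mashing
    if not re.search(r"[aeiou]", text):
        return False
    return True

responses = ["I worked hard in algebra and it got easier.",
             "idk", "Dear struggling student", "xkcdqrstvw",
             "Keep trying and ask for help!"]
valid_share = sum(is_valid(r) for r in responses) / len(responses)
print(valid_share)
```

A filter like this could pre-screen responses, but the 99% validity figure reported above came from human judgment on a random 10% sample.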

Summary of Implementation and Implications
Below, we present data on the timing of the intervention sessions in the schools.
The results show that, overall, schools were quite compliant with the timing requests. Some schools, however, implemented the intervention in the spring. Moreover, students varied in how long they had between sessions.
These descriptive statistics have three implications for our study. First, although our planned analysis was to use 8th grade GPA as the prior achievement variable and 9th grade fall and spring as the outcome variable, in schools where random assignment happened in the spring, it was preferable to use fall 9th grade as the prior achievement variable. In sensitivity analyses, we examine whether or not the choice of a prior achievement term affects our conclusions.
Second, some students received the intervention very late in the year, and it was not uncommon for spring-implementation schools to deliver the second half of the intervention well into March, just two months before the school year was over. This necessarily limits the potential for the intervention to affect their grades. This problem is especially acute when considering that some schools only provided a year-end grade, not broken out by semester. Therefore some students' outcomes were already mostly determined before random assignment. Thus, the intervention effect sizes in the study are conservative relative to what might be gained with ideal timing of implementation.
Third, the relatively high rate of compliance overall testifies to the scalability of the intervention and to the effectiveness of the study procedures designed by the research team and by the independent data collection firm ICF International.

Student Survey Response Rates
Response rates, defined as the proportion of eligible students in the school who started the survey and were in the intent-to-treat sample, were high. The mean response rate across all schools for session 1 was 93.5%, and the median was 98.0%. The few cases with very low response rates came from schools that required signed consent from parents. The mean response rate across all schools for session 2 was 88.4%, and the median was 95.0%.

Timing of Intervention Sessions Within the School Year
Three quarters of schools implemented the intervention in the fall, as planned. Most schools were able to follow the request to space the two intervention sessions 3 to 4 weeks apart.

How Long Did Students Spend on the Intervention Exercises?
Each of the two study sessions took students about 25 minutes on average, for a total of 50 minutes overall. This is notable because the primary analyses are looking for effects of this 50-minute experience on grades across all core classes at the end of the school year, sometimes many months later.
The control group's session was shorter in session 1, corresponding to somewhat less content than in the intervention condition. The control group's session was longer in session 2 because students answered extra questions about the classroom and school climate. These extra survey questions were included so that secondary data analysis of this dataset could be conducted on other topics besides intervention effects.

Average Time Spent on each Session, in Minutes
Session Mean (minutes) SD Median

Distribution of Time Spent on each Session, in Minutes
Here, we report the distribution of time spent on the intervention materials for each session. This information is relevant to the scalability of the intervention, because if large proportions of students required more time for a given session than a typical class period would allow, it would make the intervention difficult to scale. However, very few students required more than the typical 40-50-minute class period, and those in the high end of the distribution are likely to be students who forgot to close their Internet browsers.

Session Completion
Here we present the rates at which students started and finished key aspects of each intervention session. This is informative because it shows that students, in general, showed high compliance with study procedures. We distinguish starting/finishing the overall session (defined as seeing the first and last screen of the survey) from starting/finishing the intervention material (defined as seeing the first/last screen of the intervention materials). In both Sessions 1 and 2 there were screens before and after the intervention content. Survey items preceded the first session and were included after both sessions.

Proportion of Students who Started/Finished Session 1 Sections for All Participants (N = 13410)

Implementation Fidelity
We considered the following pre-registered measures of school-level fidelity of implementation, ultimately combining:
1. the percentage of open-ended questions that students answered during their on-line sessions,
2. the percentage of screens that students opened (and presumably viewed) during their on-line sessions,
3. the student-level response rate,
4. the amount of distraction that students reported experiencing during their on-line sessions, and
5. the amount of distraction that students reported other students experienced during their on-line sessions.
These correspond to the dimensions listed in the pre-registration plan, with the final item combining two distraction items (self and others). Fidelity was consistently high across these measures. This modest variation in fidelity reflects the refined design of the intervention and the ongoing efforts of the independent research firm; it also limits our ability to investigate fidelity as a moderator. Nevertheless, we conducted sensitivity analyses to see if including school-level fidelity in school moderation models altered the substantive conclusions. It did not (see below).
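One way to combine the five metrics into a single school-level measure is an equal-weight average of z-scored metrics. The data and the aggregation rule below are illustrative assumptions; the text does not specify the exact aggregation used.

```python
import numpy as np

# Hedged sketch: z-score each of the five fidelity metrics across schools,
# reverse the distraction items so higher = better fidelity, then average.
# All values are simulated; the paper's exact aggregation may differ.
rng = np.random.default_rng(3)
n_schools = 65
metrics = np.column_stack([
    rng.uniform(0.85, 1.00, n_schools),  # share of open-ended items answered
    rng.uniform(0.90, 1.00, n_schools),  # share of screens viewed
    rng.uniform(0.70, 1.00, n_schools),  # student-level response rate
    rng.uniform(0.00, 0.30, n_schools),  # self-reported distraction
    rng.uniform(0.00, 0.30, n_schools),  # perceived others' distraction
])
metrics[:, 3:] = -metrics[:, 3:]  # reverse distraction so higher = better

z = (metrics - metrics.mean(axis=0)) / metrics.std(axis=0, ddof=0)
fidelity_index = z.mean(axis=1)   # one aggregated fidelity score per school
print(fidelity_index.shape)
```

Z-scoring before averaging puts metrics measured on different scales (proportions vs. distraction ratings) on a common footing.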

Data Description
In this section we provide background on the key variables and characteristics of the experimental study.

Defining Key Variables
• Growth Mindset Scale = Post-intervention mindset 3-item scale; values range from 6 (most growth mindset) to 1 (most fixed mindset). The survey items are framed in terms of a fixed mindset, and so they are reverse-scored to obtain growth mindset values.
• Growth Mindset Indicator = Post-intervention growth mindset indicator: greater than 4.0 on the 3-item growth mindset scale (less than 3.0 on the original fixed mindset scale).
• Hard Problems = Willingness to seek out challenges in math (number of hard problems selected in the make-a-math-worksheet task); also referred to as "challenge-seeking" and the basis for a school mindset saturation measure.
• Self-reported Growth Mindset Norm = School average of the growth mindset self-reports noted above, estimated from all students prior to random assignment. In the pre-registration, we called this "mindset saturation (self-report operationalization)."
• Behavioral Challenge-Seeking Norm = School average of hard problems students chose on the make-a-math-worksheet task, estimated from all students in the control group who completed the Session 2 survey. In the pre-registration, we called this "mindset saturation (behavioral operationalization)."
• GPA = Post-intervention grades in core academic courses (mathematics, English/English Language Arts, science, social studies; omitting support courses like labs or tutorials) in 9th grade, on a 0 to 4.3 scale.
• D/F Avg = Core GPA in the D or F range (less than 2.0 on a 0 to 4.3 scale).
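The scale and indicator definitions above can be illustrated with made-up student records. The cutoffs (4.0 on the mindset scale, 2.0 GPA) come from the definitions in the text; the item values are invented.

```python
import numpy as np

# Hedged sketch of the key variable definitions, on three invented students.
# Three fixed-mindset-framed items, each scored 1-6.
fixed_mindset_items = np.array([[5, 5, 4],
                                [2, 1, 2],
                                [3, 3, 4]])
# Reverse-score (7 - item) so that 6 = most growth mindset, then average.
growth_mindset_scale = (7 - fixed_mindset_items).mean(axis=1)
growth_mindset_indicator = growth_mindset_scale > 4.0   # i.e., fixed scale < 3.0

core_gpa = np.array([3.7, 1.4, 2.9])   # core GPA on a 0 to 4.3 scale
df_avg = core_gpa < 2.0                # D/F average indicator
print(growth_mindset_scale, growth_mindset_indicator, df_avg)
```

Note that the two cutoffs are equivalent statements: a reverse-scored average above 4.0 is the same as an original fixed-mindset average below 3.0.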

Summary of Full Sample (unweighted)
Below we summarize key pre-intervention and outcome variables in the analytic sample. Note: ICC = intraclass correlation coefficient at the school level, representing the proportion of variance estimated to be between schools.
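The school-level ICC referenced in the note can be estimated with the standard one-way ANOVA estimator. The sketch below uses balanced simulated schools for simplicity; the study's schools vary in size, which would require an average-group-size adjustment.

```python
import numpy as np

# Hedged sketch: one-way ANOVA estimator of the school-level ICC, the
# proportion of outcome variance between schools. Balanced simulated data.
rng = np.random.default_rng(4)
n_schools, per_school = 60, 100
school_effects = rng.normal(0, 0.3, n_schools)   # between-school SD = 0.3
y = school_effects[:, None] + rng.normal(0, 0.9, (n_schools, per_school))

grand_mean = y.mean()
# Between-school and within-school mean squares
msb = per_school * ((y.mean(axis=1) - grand_mean) ** 2).sum() / (n_schools - 1)
msw = ((y - y.mean(axis=1, keepdims=True)) ** 2).sum() / (n_schools * (per_school - 1))
icc = (msb - msw) / (msb + (per_school - 1) * msw)
print(round(icc, 3))  # estimate of 0.3**2 / (0.3**2 + 0.9**2), i.e. near 0.1
```

An ICC near zero would mean schools are interchangeable; the larger the ICC, the more clustering must be accounted for in standard errors and multilevel models.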

Descriptive Statistics for Previously Lower-achieving Students (unweighted)
We hypothesize larger intervention effects on GPA for lower-achieving students, defined as those with achievement below the school median prior to random assignment. The steps taken to define this group are described in the pre-registered analysis plan. Below we present the descriptive statistics for this subgroup. Note: ICC = intraclass correlation coefficient at the school level, representing the proportion of variance estimated to be between schools.

Descriptive Statistics for Previously Higher-achieving Students (unweighted)
Higher-achieving students are defined as those above the school median prior to random assignment. The steps taken to define this group are described in the pre-registered analysis plan. Below we present the descriptive statistics for this subgroup. Note: ICC = intraclass correlation coefficient at the school level, representing the proportion of variance estimated to be between schools.

GPA Distributions for Previously Lower-and Higher-Achieving Students
Because the pre-specified definition of previously lower-achieving students is relative to their high school, there is some overlap in the distribution of prior achievement for these groups on the absolute grade point scale (which itself may not be fully comparable across schools). The distributions highlight that some students with below-median prior achievement have objectively high grades (e.g., an A-). Intervention impacts on GPA may be lower for this group. They may face fewer immediate academic challenges for which a growth mindset is thought to be most beneficial, and practically, there may be ceiling effects for achievement measures. As a result, we conduct sensitivity analyses with a more restrictive "lower-achieving" subgroup.

Additional Information on Components of the Lower-achieving Designation
Our pre-registered lower-achieving group indicator is based on pre-intervention GPA when available, supplemented with self-reported expectations of success or standardized test scores. Our pre-registered plan states: Previously low-performing students are defined as students who were earning grades lower (or equal) to 50 percent of his or her 9th grade school peers, prior to random assignment. At an operational level, this is a student whose pre-random assignment GPA is at or below the 50th percentile of his/her 9th grade peers.
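The within-school median split described above can be sketched directly. The records are made up; the real analyses used each school's full 9th-grade roster.

```python
import numpy as np

# Hedged sketch of the pre-registered designation: a student is
# "lower-achieving" if their pre-random-assignment GPA is at or below
# the 50th percentile of their own school's 9th graders. Invented data.
school_id = np.array([1, 1, 1, 1, 2, 2, 2, 2])
prior_gpa = np.array([3.8, 2.0, 2.9, 3.1, 1.5, 2.2, 3.9, 3.0])

lower_achieving = np.zeros(len(prior_gpa), dtype=bool)
for s in np.unique(school_id):
    in_school = school_id == s
    cutoff = np.median(prior_gpa[in_school])   # school-specific 50th percentile
    lower_achieving[in_school] = prior_gpa[in_school] <= cutoff
print(lower_achieving)
```

Because the cutoff is computed within each school, a 2.9 GPA can be "lower-achieving" at one school while a 3.0 is "higher-achieving" at another, which is exactly the relative definition the text describes.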
In the case of missing prior GPA data, we specified: "[W]e will impute prior achievement values using their 8th grade test scores and self-reports of expectations for success in the coming year (cf. Hulleman & Harackiewicz, 2009)." In this subsection, we report correlations among prior grades, expectations for success, and standardized achievement. These positive associations provide empirical support for our a priori decision to use these variables for imputation, and they show that expectations for success are more closely related to academic grades than standardized test scores. Among the control group, we see that both expectations for success and standardized test scores are predictive of 9th grade GPA even when controlling for prior GPA. Moreover, expectations are a stronger predictor of future grades (based on standardized coefficients) than standardized achievement.
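The imputation rule quoted above can be sketched as a regression fit on complete cases. The linear-regression implementation and all data below are illustrative assumptions; the plan does not fix a specific imputation model.

```python
import numpy as np

# Hedged sketch: when prior GPA is missing, predict it from 8th-grade
# test scores and expectations for success using a regression fit on
# students with complete data. Simulated data; illustrative model choice.
rng = np.random.default_rng(5)
n = 1000
test_z = rng.normal(0, 1, n)                     # standardized test score
expect = np.clip(3 + 0.8 * test_z + rng.normal(0, 1, n), 1, 5)   # 1-5 scale
prior_gpa = np.clip(2.8 + 0.25 * test_z + 0.3 * (expect - 3)
                    + rng.normal(0, 0.4, n), 0, 4.3)

missing = rng.random(n) < 0.1                    # ~10% missing prior GPA
X = np.column_stack([np.ones(n), test_z, expect])
beta, *_ = np.linalg.lstsq(X[~missing], prior_gpa[~missing], rcond=None)

imputed = prior_gpa.copy()
imputed[missing] = X[missing] @ beta             # fill gaps with predictions
print(beta.round(2))
```

Fitting on complete cases and predicting into the gaps preserves the observed associations between test scores, expectations, and grades that the text cites as justification for the imputation.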

Experimental Balance on Pre-treatment Characteristics
Random assignment was effective at producing balance between groups in terms of characteristics measured prior to random assignment.

Rates of Attrition
The intervention and control groups did not differ in terms of the proportion of students who were missing data for Grade Point Average or the outcome variables measured in session 2.

Differential Characteristics of Students Who Attrited Versus Those Who Did Not
Students missing data for the GPA outcome were more likely to be male, less likely to have a mother with a college degree, and had a higher fixed mindset and lower pre-treatment GPA.

Balance for Students with Outcome Information
Among students who were not missing data, the final sample was nevertheless balanced between conditions on pre-intervention characteristics.

Enrollment
Participants were identified for participation by the third-party research firm (ICF International), in consultation with school officials, by virtue of grade membership and enrollment in targeted classes.

Students Enrolled in the Experimental Study
Considered: 13,490
Parental refusal: 70
Intention to treat (randomized): 13,420

Allocation
Students were randomized at the start of the computerized activity. Students received the allocated intervention for session 1. Some students were absent and received no session 2 materials. Other students incorrectly entered their names at session 2; these students were always given the control group materials.
All analyses are intention-to-treat, regardless of whether students saw the session 2 materials. There are two primary reasons why participants were lost to follow-up for the primary analyses of GPA outcomes. First, one school did not provide administrative records. Second, some students' GPAs could not be matched with the administrative data, usually because their names or student IDs could not be matched, or because schools no longer had their records by the end of the year. We cannot discern every reason for non-matching records. However, as noted above, the students who were missing the grades data did not differ by condition in terms of baseline characteristics.

Pre-Registered Intervention Impacts on Academic Grade Point Average (GPA)
We pre-specified four core questions about the impacts of the growth mindset intervention on core academic GPA. These questions build to the primary research question, RQ4, which concerns cross-site heterogeneity in the treatment effect among lower-achieving students. In the sections that follow, we present the analysis methods that we used to answer the four questions.

Average Intervention Effects for All Students (Pre-registered RQ1)
Here we explain how we answered the first research question, which was: 1. What is the average treatment effect (ATE) of a Growth Mindset (GM) intervention on the GPA of 9th grade students in regular U.S. public high schools? (Note that the pre-registration did not predict a significant main effect, but instead only predicted a significant effect for RQ2.)
Following the pre-registration plan, the analytic model for RQ1 was:

Y_i = β0 + β1 T_i + β2 P_i + Σ_k β_k X_ki + Σ_j γ_j S_ji + ε_i

Where:
• T_i is an indicator for experimental group (1 if treatment, 0 if control)
• P_i is the prior achievement for student i, z-scored within schools
• X_ki is school-mean-centered baseline covariate k for student i
• S_ji is an indicator variable indicating that student i attends school j
Also following the pre-analysis plan, we estimated parameters using person-level weights and cluster-robust standard errors, clustered at the level of the primary sampling unit (typically pairs of schools). Given the survey design, the primary sampling unit is more appropriate than the school level (which we originally indicated in our pre-registration).
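The point estimates for this specification amount to weighted least squares. A minimal numpy sketch follows; the design matrix, inputs, and function name are hypothetical stand-ins, and the cluster-robust standard errors used in the actual analysis are omitted.

```python
import numpy as np

def weighted_ols(X, y, w):
    """Weighted least squares: beta = (X'WX)^{-1} X'Wy.

    Sketch of the RQ1 point estimates. In the real analysis, X would
    hold the treatment indicator, prior achievement, covariates, and
    school indicators, and w would be the person-level survey weights.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    XtW = X.T * w  # equivalent to X'W with W = diag(w)
    return np.linalg.solve(XtW @ X, XtW @ y)
```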

Conditional Average Intervention Effect for Previously Lower-achieving Students (Pre-registered RQ 2)
Here we explain how we answered the second research question, which was: 2. What is the conditional average treatment effect (CATE) of a GM intervention on the GPA of 9th grade previously lower-performing students in regular U.S. public high schools?
The models for RQ2 are similar to RQ1, except that analyses are restricted to the subgroup of lower-achieving students.

Robustness and Sensitivity Analyses for RQ1 and RQ2
In addition to the pre-specified analyses, we considered the sensitivity of results to several alternative specifications, listed below. Pre-specified options are labeled [pre-specified]. The results of these analyses appear in the Extended Data.
Survey weights:
1. Grade 9 weights [pre-specified] = records weighted based on the sampling design and non-response (including missing grade 9 GPA outcomes); weights calculated by the survey firm
2. Grade 9 weights trimmed = records weighted by a modified version of the Grade 9 weights, with weights top-coded to the 3rd-highest value within school achievement groups
3. Design weights = records weighted by the inverse of intervention selection for the school, given the sampling design
4. No weights = all individual records assigned a constant weight, maintaining clustering corrections for strata and primary sampling unit; expected to yield conservative estimates because the study over-sampled schools expected to show small or null effects

Alternate GPA outcomes:
1. Grade 9 post core [pre-specified] = GPA in core academic courses (Mathematics, English Language Arts, Science, Social Studies) from the intervention term to the end of the year; support courses not included
2. Grade 9 post academic = GPA in all academic courses (including support courses, Foreign Language, etc.) from the intervention term to the end of the year; expected to yield conservative estimates
3. Grade 9 post English/math/science = core GPA without social studies (English, Mathematics, and Science courses) from the intervention term to the end of the year; this variable replicates the pilot study's outcome, as explained below
4. Grade 9 average core = GPA in core academic courses for all of 9th grade; expected to yield conservative estimates because it includes pre-intervention information (as a year-long GPA, the majority of the outcome is based on pre-intervention performance)

Prior low performance definition:
1. Lower-achiever [pre-specified] = below the school-level median pre-intervention GPA relative to high school peers (using prior expectations or standardized tests when grades are unavailable)
2. Restricted lower-achiever = a more restrictive subset that starts with the pre-specified relative definition and omits students above absolute thresholds (GPA above 3.3 or highest self-reported expectations of academic success)
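The trimmed-weights option (top-coding weights at the 3rd-highest value within school achievement groups) can be sketched as follows; the function name and group labels are hypothetical stand-ins for the actual weight-construction code.

```python
import numpy as np

def trim_weights(weights, groups):
    """Top-code weights at the 3rd-highest value within each group.

    Sketch of the trimmed-weights sensitivity check; `groups` stands in
    for the school achievement groups described above.
    """
    weights = np.asarray(weights, dtype=float).copy()
    groups = np.asarray(groups)
    for g in np.unique(groups):
        idx = groups == g
        vals = np.sort(weights[idx])[::-1]  # descending
        if len(vals) >= 3:
            cap = vals[2]  # third-highest weight in the group
            weights[idx] = np.minimum(weights[idx], cap)
    return weights
```

Trimming in this way limits the influence of a handful of very large weights without discarding the weighted design entirely.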

School Heterogeneity (Pre-registered RQ3)
Here we explain how we answered the third research question, which was: 3. How much does the CATE of a GM intervention (on the GPA of 9th grade previously lower-performing students) vary across U.S. public high schools?
Following the pre-registration plan, the analytic model for RQ3 was a multilevel model.

Level one (students): Y_ij = α_j + β_j T_ij + β2 P_ij + Σ_k β_k X_kij + ε_ij
Level two (schools): β_j = β1 + r_j, with r_j ~ N(0, τ²)

Where:
• Y_ij is the outcome for student i in school j (GPA)
• α_j is a school-specific intercept
• T_ij is an indicator for experimental group (1 if treatment, 0 if control)
• P_ij is the prior achievement for student i in school j, z-scored within schools
• X_kij is school-mean-centered baseline covariate k for student i from school j

The Stata code used to estimate parameters of this model is as follows:

mixed gpa_post_avg treatment pregpa_imputed_smc pregpa_missing_smc pretest_imputed_smc pretest_missing_smc s1_exp_suc_1_imputed_smc s1_exp_suc_1_missing_smc pre_gpa_self_smc pre_gpa_self_dummy_smc gender_smc asian_smc black_smc hisp_smc native_smc mideast_smc pacisl_smc white_smc pared_1_smc pared_2_smc pared_3_smc pared_4_smc pared_5_smc pared_6_smc pared_7_smc pared_8_smc ell_smc sped_smc gt_smc firstyear_freshman_smc lunch_smc i.school_id if analysis_flag == 1 || school_id: treatment, nocons reml

The parameter of interest is tau, the standard deviation of intervention impacts across schools. Multilevel heterogeneity analyses are estimated with restricted maximum likelihood (REML), because this is the ideal method for estimating the random effect; however, this particular model cannot account for sampling weights because REML does not function with weights. In the unweighted sample, the estimated intervention effect for lower-achieving students on post-intervention GPA in an average school is 0.066 (SE = 0.022). The estimated standard deviation of school impacts is 0.09. To test whether tau is statistically significantly greater than zero, we use the Q-statistic proposed by Bloom et al. (2017). The Q statistic is 85.5 (df = 64, p = 0.038).
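The heterogeneity test can be illustrated with a generic precision-weighted Q-statistic; the exact construction in Bloom et al. (2017) may differ in its details, so treat this as a sketch with hypothetical inputs.

```python
import numpy as np

def q_statistic(effects, ses):
    """Heterogeneity Q-statistic for site-level effect estimates.

    Q compares each school's estimated effect to the precision-weighted
    mean effect and is referred to a chi-square distribution with
    J - 1 degrees of freedom. Sketch of the general form only.
    """
    effects = np.asarray(effects, dtype=float)
    ses = np.asarray(ses, dtype=float)
    prec = 1.0 / ses**2
    mean = np.sum(prec * effects) / np.sum(prec)
    q = np.sum(prec * (effects - mean) ** 2)
    return q, len(effects) - 1
```

A Q value well above its degrees of freedom, as in the reported Q = 85.5 with df = 64, suggests more cross-school variation than sampling error alone would produce.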

Theoretical Justification for School Moderation Analysis (Pre-registered RQ4)
Given that we have found that the treatment impact varies across schools, we now justify the approach we take for understanding this variability, which was our fourth research question.

Definition of Average Treatment Effects and Conditional Average Treatment Effects
Following standard notation for causal effects from the potential outcomes model 6, the individual treatment effect is defined as the difference between student i in school j's potential outcomes:

τ_ij = Y_ij(1) − Y_ij(0)

The sample average treatment effect (i.e., the treatment effect in the sample of participating students) is given by

SATE = (1/n) Σ_j Σ_i τ_ij

where there are m sampled schools, school j has n_j students participating in the study, and there are n = n_1 + n_2 + . . . + n_m total students in the sample. Notice that all we are doing here to obtain the overall sample average treatment effect is taking the average of all participating students' individual treatment effects. The population average treatment effect is defined similarly, with the sum running over all the students in the population:

PATE = (1/N) Σ_j Σ_i τ_ij

where there are M total schools, school j has N_j students, and N = N_1 + N_2 + . . . + N_M total students in the population. This parameter was estimated in RQ1.
Conditional average treatment effects (CATEs) are defined as the average difference in potential outcomes under treatment vs. control for individuals in a given subgroup of students g either in the sample or in the population. In RQ2, we estimated this for the population subgroup of previously-lower-achieving students. In RQ4 we will estimate conditional average treatment effects for previously-lower-achieving students attending particular types of schools, such as students in high-achieving schools or students in medium-achieving schools.
The population average treatment effect, then, is simply the weighted average of all of the population conditional average treatment effects for all of the subgroups g:

PATE = Σ_g (N_g / N) × CATE_g

where N_g is the number of students in group g, and i ∈ g indexes the students in subgroup g in the population. We can represent all of these averages as conditional expectations over the population. With a probability sample, we can estimate these population conditional average treatment effects consistently, provided that we make appropriate adjustments for the sampling design (described below). Next, we present the specific estimands of interest.
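The weighted-average identity above can be checked with a small numeric example; the subgroup sizes and effects below are hypothetical illustrations only.

```python
import numpy as np

# The population ATE is the size-weighted average of subgroup CATEs.
# Hypothetical numbers: two subgroups (e.g., lower- and higher-achievers).
group_sizes = np.array([6000, 4000])   # N_g for each subgroup
group_cates = np.array([0.10, 0.02])   # subgroup effects in GPA points

ate = np.sum(group_sizes * group_cates) / np.sum(group_sizes)
# (6000 * 0.10 + 4000 * 0.02) / 10000 = 0.068
```

This makes concrete why a sizable effect in one subgroup can coexist with a modest overall average effect.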

School Moderator Definitions and Analyses (Pre-registered RQ4)
We tested two school moderators:
1. School achievement level = composite of standardized achievement (PSAT, AP scores) and other indicators (see Tipton, Yeager, et al., in press, for information on how this was generated).
2. School challenge-seeking norms = school mean number of "hard" problem selections among control students on the worksheet task (called the behavioral operationalization of school mindset saturation in the pre-registration). We also considered a self-reported operationalization as a secondary measure, based on the concerns stated in our pre-registration.

School Achievement Level
Following the pre-registered analysis plan, we divide schools into 3 categories based on a school achievement composite (see Tipton, Yeager, et al., in press, for details on the composite construction).
• The low-achieving group is schools at or below the 25th percentile for school achievement level.
• The medium-achieving group is schools between the 25th and 75th percentile for school achievement level.
• The high-achieving group is schools at or above the 75th percentile for school achievement level.
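The percentile-based grouping can be sketched as follows, assuming a hypothetical school-level composite; note that numpy's default percentile interpolation may differ slightly from the procedure used to construct the actual categories.

```python
import numpy as np

def achievement_groups(composite):
    """Assign schools to low/medium/high achievement categories.

    Sketch of the pre-registered cutoffs: at or below the 25th
    percentile is "low", at or above the 75th percentile is "high",
    and everything in between is "medium".
    """
    composite = np.asarray(composite, dtype=float)
    p25, p75 = np.percentile(composite, [25, 75])
    return np.where(composite <= p25, "low",
                    np.where(composite >= p75, "high", "medium"))
```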

School Challenge-Seeking Norm
The school challenge-seeking norm is defined as the prevalence of growth mindset-relevant beliefs and behavior in the school environment (in the pre-registration, we called this "mindset saturation").
We test the hypothesis that there will be larger effects on GPA in schools with higher challenge-seeking norms. The reason may be that the environment reinforces the intervention's message over time; giving the intervention in a high mindset norm school might be like "planting a seed in tilled soil." At the same time, it was possible that students might benefit most when they attend schools with unsupportive norms; giving the intervention in a low mindset norm school might be like "water on parched soil." Challenge-seeking, as noted previously, is measured by the number of hard problems selected on the behavioral make-a-math-worksheet task. For model summarization purposes, after estimating models that used the full behavioral norms measure (as pre-specified), we label schools with above (below) the population mean number of hard problems selected as high (low) challenge-seeking norm.

Defining Conditional Average Treatment Effects for School Achievement and Mindset Norms
Having plotted variability in treatment effects by our two pre-specified moderators, here we define the estimands of interest more formally and explain our approach to modeling them.
Under randomization to treatment and SUTVA (Imbens & Rubin, 2015),

CATE_g = E[Y_ij(1) − Y_ij(0) | i ∈ g] = E[Y_ij | T_ij = 1, i ∈ g] − E[Y_ij | T_ij = 0, i ∈ g]

where i ∈ g indexes the students in subgroup g, Y_ij(z) are the potential outcomes for each student under the different treatment statuses z, and T_ij is the random assignment to treatment or control.
To get estimates of these conditional expectations, we can use mixed effects models of the form:

Y_ij = α_j + θ(x_ij) + [λ(q_ij) + r_j] T_ij + ε_ij

In this model, α_j is a school-level intercept (treated as fixed in the linear mixed effects models, following Bloom et al. (2017) 8, and random N(0, φ²) in the Bayesian models), θ(x_ij) is a function of school- and individual-level control covariates, centered at the school level and collected in a vector x_ij (see the pre-registration for definitions), q_ij is a variable or variables defining subgroups of interest, λ(q_ij) is the subgroup-dependent portion of the treatment effect for student i, and r_j is random school-level variability in treatment effects, modeled N(0, τ²).
Under this model, for a student with Q_ij = q_ij in school j, the conditional treatment effect is

E[Y_ij | T_ij = 1, Q_ij = q_ij, r_j] − E[Y_ij | T_ij = 0, Q_ij = q_ij, r_j] = λ(q_ij) + r_j

(Note that since the treatment effects do not depend on the control variables x_ij, they are omitted from the conditioning set in the conditional expectations above.) In the BCF analyses, we report estimated changes in conditional average treatment effects for a given change in moderators, namely an increase in mindset norms (as measured by challenge-seeking behavior on the make-a-worksheet task at baseline) of 0.5 of a difficult question (out of 8). Mathematically, this is simply the difference between two conditional average treatment effects:

λ(q′_ij) − λ(q_ij)

where q_ij is constructed using the realized moderators (including norms) and q′_ij is the same, except the continuous norms variable is increased by 0.5.
In order to estimate population (conditional) average quantities, it is necessary to account for the complex sampling design of the study. Unless the conditioning set q_ij includes all the variables used to determine the probability of selection, model-based estimates of the treatment effects will be biased when computed naively using sample data. In the linear mixed effects models, we adjusted for the complex sampling design by maximizing a weighted likelihood function constructed to estimate the population likelihood, rather than the (biased) sample likelihood, given the over-sampling of rare subgroups of schools and modest non-response. In the Bayesian models, we include the sampling weight as a control and as an effect moderator, and estimate population expected values using the relationship

E_pop[Y] ≈ Σ_j Σ_{i=1}^{n_j} w_ij E[Y_ij | x_ij, w_ij] / Σ_j Σ_{i=1}^{n_j} w_ij

where w_ij is the sampling weight for student i in school j, substituting model-based estimates for the conditional expectations on the right-hand side of the equation, similar to model-based post-stratification 9. Conditioning on the sampling weight makes the sampling design ignorable, enabling consistent estimation of the population expectations on the right-hand side of the equation from models fit to the sample data.
Raudenbush & Bloom (2015) note that we might expect the error variance to depend on the treatment arm when there is unmodeled heterogeneity. However, the bias due to ignoring this heteroskedasticity depends on the magnitude of the difference between the two variances. We expect any bias is very small, since the difference in the error variances is due to unmodeled treatment effect heterogeneity and the range of heterogeneous treatment effects is small relative to the unexplained variability in outcomes.

Pre-registered Mixed Effects Regression Models Testing School-level Moderators (RQ4)
Here is the primary model we used to answer our fourth and primary research question: 4. Do school-level factors explain the variability in the size of the CATE of the GM (on GPA for previously lower-performing students) in U.S. public high schools?
Following the pre-registration plan, the analytic model for RQ4 was a multilevel model.

Level one (students): Y_ij = α_j + β_j T_ij + β2 P_ij + Σ_k β_k X_kij + ε_ij
Level two (schools): α_j = γ0 + γ_A A_j + γ_M M_j + γ_N N_j + u_j; β_j = β1 + β_A A_j + β_M M_j + β_N N_j + r_j

Where:
• Y_ij is the outcome for student i in school j (GPA)
• α_j is a school-specific intercept
• T_ij is an indicator for experimental group (1 if treatment, 0 if control)
• P_ij is the prior achievement for student i in school j, z-scored within schools
• X_kij is school-mean-centered baseline covariate k for each student i from school j
• A_j is the grand-mean-centered achievement level for school j, coded continuously
• M_j is the grand-mean-centered percent minority (black, Hispanic, Native American) in school j
• N_j is the school challenge-seeking norm on its natural metric from 0 to 8 for school j

As for RQ3, school moderation hypothesis tests do not employ survey weights. (Survey weights were later applied to generate the conditional average treatment effects reported in the main paper, so that estimated effect sizes generalized to the population of inference.)

Multilevel Bayesian Causal Forest Model
The multilevel Bayesian causal forest model is specified as

Y_ij = α_j + θ(x_ij, w_ij) + [λ(q_ij, w_ij) + r_j] T_ij + ε_ij

Note that, unlike the linear model, the sampling weights w_ij are included among the controls and the moderators, as discussed above. The other moderators q_ij include our continuous measure of challenge-seeking norms, the school achievement categories, and the percent minority variable. Multilevel BCF generalizes the linear mixed effects model by allowing θ and λ to include interactions and nonlinearities of the variables in their arguments. These features are inferred from the data and need not be pre-specified. However, this flexibility requires the use of prior distributions to avoid overfitting. Specifically, we use Bayesian additive regression tree (BART) priors on θ and λ. These prior distributions encode conservative beliefs about λ in particular: the prior on the λ function is centered on a constant function at zero, and the prior favors simple forms for λ such as additive functions. Our prior specification follows Hahn et al. (2018) 10. The multilevel version above uses standard prior distributions for α_j and r_j (normal, with half-Cauchy priors on their standard deviations; cf. Gelman, 2006 11).

Interpreting the Results of the Bayesian Causal Forest Model
In addition to reproducing the primary analyses' results, the Bayesian analysis added three contributions beyond the conclusions of the main analysis.
First, the treatment effects on math or science GPA revealed that lower-achieving schools' estimated treatment effects fell between higher-achieving schools' and medium-achieving schools' effects. This is reflected by even posterior odds that lower- and higher-achieving schools differed, pr(CATE_Ach=High > CATE_Ach=Low) = .49, and a moderate probability that medium- and lower-achieving schools differed, pr(CATE_Ach=Medium > CATE_Ach=Low) = .78, both updated by the data from a prior probability of .5. This result matched our pre-specified hypothesis that the lowest-achieving schools in the U.S. may not have as much access as other schools to the formal resources needed to sustain the effects of an initial boost in motivation from the mindset intervention. Therefore, this result justifies future research into the potential minimal achievement level needed to produce a growth mindset treatment effect on GPA.
Second, the BCF model generated information that could serve as the basis for future research on the causal effects of growth mindset norms. We used the fitted model to estimate the increase in growth mindset treatment effects that could be expected under the hypothetical scenario in which schools were moved from being a low-norm school to being a high-norm school, assuming all other school-and student-level characteristics were left untouched (as in a random-assignment experiment). To estimate this, we used the model parameters to draw new posterior probability distributions for the average treatment effect for each low-norm school, but with norms set to a level corresponding to an increase of 0.50 additional challenging math problems chosen (out of 8), which is roughly the size of the school-level IQR. All other characteristics of students and schools were fixed at their true levels in the data. The original posterior distributions of treatment effects for each low-norm school were then subtracted from the counterfactual distributions, yielding a posterior distribution for the expected increase in treatment effect due to improvements in the norms holding all other moderators constant. The average increase in treatment effect expected for low-norms schools was .031 grade points (95% PI, -.012, .135), relative to the original distribution of treatment effects within the subgroup of low-norms schools (.024, 95% PI, -.096, .103). Thus, the model estimated a 130% increase. In other words, the treatment would be more than twice as effective on average. Moreover, the partial effect of norms on the size of the treatment effect was not different across school achievement levels or racial composition, justifying the primary linear model specification.
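The counterfactual contrast described above can be illustrated with simulated posterior draws; the draws below are synthetic stand-ins, not output from the fitted BCF model, and the numbers are for illustration only.

```python
import numpy as np

# Sketch: given posterior draws of a low-norm school's treatment effect
# under observed norms and under norms raised by 0.5 hard problems, the
# expected gain is summarized by the paired differences.
rng = np.random.default_rng(0)
factual = rng.normal(0.024, 0.05, size=4000)              # observed-norms draws (synthetic)
counterfactual = factual + rng.normal(0.031, 0.01, 4000)  # norms +0.5 draws (synthetic)

gain = counterfactual - factual
estimate = gain.mean()                        # expected increase in treatment effect
lo, hi = np.percentile(gain, [2.5, 97.5])     # 95% posterior interval for the gain
```

Subtracting paired draws, rather than summarizing the two distributions separately, preserves the posterior dependence between the factual and counterfactual effects.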
Third, the BCF Extended Data figure shows that some heterogeneity remained unexplained even after accounting for the pre-specified moderators. Therefore, exploratory analyses might be able to advance theory about the mechanisms for long-run growth mindset effects even further.

Complier Average Causal Effects
Students were randomized at the start of the computerized activity and received the allocated intervention for session 1. Therefore, the intent-to-treat (ITT) sample was defined to include all students who started session 1, regardless of whether they completed the key intervention components, engaged with the treatment message, or saw the session 2 materials. The ITT effect on students' core academic GPA is the most conservative, policy-relevant effect of interest because it provides an estimate of the average impact of offering the intervention.
As a secondary step, we estimate the causal effect of the treatment on those students who engaged adequately with the treatment message, or those who "took up" the treatment. However, defining treatment take-up in an online social-psychological intervention is not straightforward. Engagement with the treatment message can be measured in several ways: Did the students complete the key modules of the intervention? Did the students internalize the treatment message by responding to open-ended questions asked immediately after the treatment materials? Indeed, ITT estimates under-estimate the causal effectiveness of the treatment in the case of partial non-compliance. Thus, we estimate the complier average causal effect (CACE) of the growth mindset treatment under certain assumptions.
We use a key measure of engagement with the treatment to define treatment "take-up". At the end of session 1, students in the treatment condition were asked to write a note to future students who may be struggling in their freshman year (Open Response Prompt B above). This "saying-is-believing" exercise, used in past successful social-psychological interventions (Walton & Cohen, 2011 12), has been shown to be effective and integral in helping students internalize the treatment message (Yeager et al., 2016 13). In other words, we identify those students who wrote a note to future students as those who internalized the treatment and therefore "took up" the treatment.
Next, we estimate the CACE, also known as the "treatment on the treated" (TOT) effect, by instrumenting for treatment take-up with the randomized treatment assignment indicator T_i defined earlier in a two-stage least squares (TSLS) framework, following standard best practices. The second-stage equation is:

Y_i = β0 + β1 D_i + β2 P_i + Σ_k β_k X_ki + Σ_j γ_j S_ji + ε_i

Where:
• Y_i is the outcome for student i (GPA)
• D_i is the endogenous treatment take-up indicator (1 if "took up" the treatment, 0 if control). The indicator D_i is instrumented with T_i, an indicator for experimental group (1 if treatment, 0 if control), in a TSLS framework
• P_i is the prior achievement for student i, z-scored within schools
• X_ki is a vector of school-mean-centered baseline covariates k for each student i
• S_ji is a set of indicator variables indicating that student i attends school j
We estimate parameters using person-level weights and robust standard errors.
Formally, the first-stage equation in the TSLS framework is:

D_i = π0 + π1 T_i + π2 P_i + Σ_k π_k X_ki + Σ_j δ_j S_ji + ν_i

Under certain assumptions, described briefly below, the CACE estimates are valid when using the TSLS instrumental variable approach discussed above:
• (1) Random assignment: experimental group assignment is random.
• (2) One-sided non-compliance: control group members cannot access the treatment at all.
• (3) Valid exclusion restriction: treatment assignment affects outcomes entirely through the internalization of the treatment message, as measured by student responses on the saying-is-believing exercise.
While assumptions (1) and (2) are easily satisfied in a randomized intervention, the exclusion restriction cannot be directly verified. However, there is sufficient theoretical evidence that the "saying-is-believing" exercise may be an essential component of the treatment: this writing exercise enables the student to internalize the growth mindset message through reflection after adequately engaging with the treatment materials (Aronson, 1999 18; Yeager et al., 2016; Yeager & Walton, 2011). Therefore, while the exclusion restriction might not be strictly true, we believe that the effects of the treatment on those who did not engage with the materials and/or internalize the treatment message are likely to be much smaller. Future research should explore the determinants of take-up and exploit variation in take-up rates across schools to explore additional mediating mechanisms of the intervention.
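With one-sided non-compliance and no covariates, the TSLS estimate reduces to the familiar Wald ratio: the ITT effect on the outcome divided by the take-up rate. The sketch below uses a hypothetical function name and toy data, not the study's estimation code.

```python
import numpy as np

def wald_cace(y, t, d):
    """Complier average causal effect via the Wald/IV estimator.

    Sketch for a single binary instrument with no covariates:
    CACE = (ITT effect on y) / (effect of assignment on take-up).
    Numerically equivalent to TSLS in this simple case.
    """
    y, t, d = (np.asarray(a, dtype=float) for a in (y, t, d))
    itt = y[t == 1].mean() - y[t == 0].mean()
    takeup = d[t == 1].mean() - d[t == 0].mean()
    return itt / takeup
```

For example, with a 0.5 GPA-point ITT effect and 50% take-up in the treatment group, the implied CACE is 1.0 GPA point.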

Methodological Information About the School Challenge-seeking Norms Measure
In this section, we assess the predictive validity of the school challenge-seeking behavioral measure, which we label challenge-seeking norms. Recall that this measure consists of the mean number of hard items selected on the make-a-worksheet task among the control group students in each school.
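Computing the norm measure can be sketched as follows; the inputs are hypothetical, and only control students (who saw no intervention content) contribute, so the norm reflects the pre-existing school environment.

```python
import numpy as np

def school_norms(school_ids, treatment, hard_items):
    """School challenge-seeking norm: control-group mean of hard items.

    Sketch of the measure described above: for each school, average the
    number of hard worksheet problems selected by control students.
    """
    school_ids = np.asarray(school_ids)
    treatment = np.asarray(treatment)
    hard_items = np.asarray(hard_items, dtype=float)
    norms = {}
    for s in np.unique(school_ids):
        mask = (school_ids == s) & (treatment == 0)
        norms[s] = hard_items[mask].mean()
    return norms
```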
We test whether challenge-seeking norms predict school-level AP mathematics course-taking for previous cohorts of students, using data collected from official administrative sources. Results show that the behavioral measure is predictive of advanced mathematics course-taking, even when controlling for average scores on standardized tests administered earlier in high school to previous cohorts of students.

Mindset Norms and Mathematics AP: Schools with AP Data Available Only
Here, we repeat the validity analysis excluding the schools with 0% of students taking AP. This supports the same conclusion.

Cohort Analysis of i3 Evaluation Effect Sizes
The manuscript summarizes a "cohort analysis" of studies from the U.S. federal government's i3 initiative. The conclusion of this analysis was that it is rare for studies with adolescents to exceed what Kraft (2018) called a "large" effect for a randomized trial in education: .20 SD.
Here we provide greater detail about the information in the report that led to that conclusion. Abt Associates summarized the results of 67 evaluation studies funded by the U.S. Institute of Education Sciences as a part of the i3 initiative (Boulay et al., 2018). These studies are informative because they involved attempts to obtain causal effects (via random assignment experiments) on objective academic outcomes (e.g., grades or test scores), and the evaluators had to pre-register the study and pre-specify the outcome(s) of interest. Since the evaluation studies met several criteria for rigor prior to winning the funding (prior evidence, importance, and strong evaluation design), this cohort of studies is useful for generating an empirical distribution of the effect sizes for promising education interventions evaluated in rigorous randomized trials. Sixty-six of the studies listed in the Abt report pre-specified their analysis plans, 48 of these evaluations examined outcomes among adolescents (middle school or high school), and 13 of these involved a program administered in an existing school and reported at least one pre-registered outcome that met the highest standard of rigor (the What Works Clearinghouse standard of "without reservations"), which qualified them for the present analysis. One of these effects (the scale-up of KIPP, a charter network) was excluded because it is not comparable to a typical educational intervention program, which involves adding programming or training in an existing school (not starting a new school). When a program reported multiple pre-registered primary outcomes, those effect sizes were averaged.
The unweighted average effect size in this cohort of pre-registered studies was .03 SD (.04 SD when including KIPP), which is "small" according to Kraft (2018). Two programs (17%) showed a significant and "large" effect (.20 SD and .23 SD, respectively). One of these effects (.20 SD) came on tests developed by the research team, not on grades or state tests; because effect sizes on researcher-created tests are known to be larger (Cheung & Slavin, 2016), it was less relevant to the present study as a comparison. Only one program in this cohort analysis was successful at raising adolescents' grades. Last, fully 75% (9/12) showed effects smaller than .10 SD (the growth mindset intervention effect for the targeted group of lower-achieving students), regardless of the outcome.

Overview
A team of PhD and master's-level social science researchers and computer scientists at MDRC processed the grades data that had been delivered by the schools to ICF International (who cleaned and merged the raw grades files). MDRC's processing created the focal variables to estimate treatment impacts: pre-intervention core course (English, math, science, and social studies) grade point average (GPA) and post-intervention core course GPA.
It was necessary to harmonize this information across the schools because the grades data were provided by many different districts with many different naming conventions for course names and grading periods. MDRC's coding process drew on both human judgment and automated methods. This coding was conducted blind to students' condition assignments and blind to the impact of various decisions on estimates of treatment impact. All decisions were checked by a second coder and discrepancies were resolved through discussion or by revising the coding scheme. This was done so that the coding decisions could be reproducible and standardized. MDRC's study director can be contacted with queries about the technical details of the coding process (Pei Zhu: Pei.Zhu@mdrc.org).
The work carried out by MDRC and described here occurred in two phases: 1. Coding course names into specific subject areas (English, math, science, and social studies); 2. Coding grading periods to determine the pre- and post-intervention epochs and writing data analysis syntax to construct the pre- and post-intervention grades variables used in analyses.

Phase 1: Coding Course Names
In Phase 1, MDRC executed this aspect of the pre-analysis plan (p. 7): "Core course designation will be made through a combination of course catalogs from schools and coding of course names. Coding of core courses will be independent of knowledge of the effect on outcomes of the study, and all syntax will be retained to enable robustness checks." Phase 1 began by examining the schools' official course catalogues to determine the core classes dictated by the schools' curricula. When course names appeared in both the dataset and the course catalogue, a straightforward coding of the course names was executed. When the course catalogues and course names did not match, which was the case for many courses in many schools, human coders determined core course classifications. They did so by applying what is known about high school course offerings and by examining cross-tabs of course enrollments to make logical inferences. For example, a course titled "Geo" could be geometry (i.e. math), geology (i.e. science), or geography (i.e. social studies). But if the cross-tabs showed that "Geo" was mutually exclusive with Algebra 1 or 2 and often co-occurred with biology and world history, then the coders might infer that the course was geometry, not geology or geography. MDRC developed routines for flagging ambiguities, and these were resolved by pairs of human coders who made the final determinations and documented the reasons for their decisions. As a last step, MDRC used text patterns to detect all the courses that could potentially be core courses, focusing on students who were missing a core class for a given grading period, and two coders then validated them by hand.
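The co-enrollment logic described above can be sketched as follows. The course names, enrollment records, and the `co_enrollment_counts` helper are hypothetical illustrations of the kind of cross-tab a coder might examine, not MDRC's actual routines.

```python
from collections import Counter

# Hypothetical enrollment records: (student_id, course_name).
# Students rarely take two math courses at once, so a course that never
# co-occurs with Algebra but does co-occur with science and social studies
# courses is plausibly the student's math course (e.g. geometry).
enrollments = [
    (1, "Geo"), (1, "Biology"), (1, "World History"),
    (2, "Geo"), (2, "Biology"),
    (3, "Algebra 1"), (3, "World History"),
]

def co_enrollment_counts(course, records):
    """Count how often other courses co-occur with `course` for the same student."""
    by_student = {}
    for sid, name in records:
        by_student.setdefault(sid, set()).add(name)
    counts = Counter()
    for courses in by_student.values():
        if course in courses:
            counts.update(courses - {course})
    return counts

counts = co_enrollment_counts("Geo", enrollments)
# Here counts["Algebra 1"] is 0 while counts["Biology"] is 2, consistent
# with "Geo" being geometry (math) rather than geology or geography.
```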

Phase 2: Calculating GPAs
MDRC executed the pre-registered decision rules for determining pre- and post-intervention GPA, described on page 7 of the pre-analysis plan. That is, MDRC transformed the grades file so that it had one value per student for each subject area for the pre- and post-intervention epochs. Merging and data aggregation syntax was written by a trained computer scientist and was checked and commented by a team of experienced social scientists.
Following the analysis plan (p. 7), MDRC assigned each school's grading-period GPAs to the pre- or post-treatment epoch depending on when the intervention was given within that school. MDRC decided how to handle the case where a student had more than one grade for the same marking period, which could happen if schools allocate units in half-unit metrics (i.e., if the fall semester and spring semester show independent grades but are both listed as half of the "final" grade). In some cases, schools were re-contacted to clarify the delivered data. The next step involved standardizing all numeric and letter grades across all the schools to a scale of 0-4.33. Next, MDRC averaged together all the core courses taken by a student in a given grading period to compute the student's GPA for that grading period. MDRC also calculated subject-specific GPAs by grading period (mathematics, science, English, and social studies).
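A minimal sketch of the standardization and averaging steps, assuming a conventional letter-grade crosswalk onto the 0-4.33 scale described in the analysis plan; the exact crosswalk and the handling of numeric grades used by MDRC are assumptions here.

```python
# Illustrative letter-grade crosswalk; MDRC's actual mapping is not reproduced.
GRADE_POINTS = {
    "A+": 4.33, "A": 4.0, "A-": 3.67, "B+": 3.33, "B": 3.0, "B-": 2.67,
    "C+": 2.33, "C": 2.0, "C-": 1.67, "D+": 1.33, "D": 1.0, "D-": 0.67,
    "F": 0.0,
}

def period_gpa(core_grades):
    """Unweighted average of a student's core-course grades in one grading period."""
    points = [GRADE_POINTS[g] for g in core_grades]
    return sum(points) / len(points)

gpa = period_gpa(["A", "B+", "C", "B"])  # math, science, English, social studies
```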
The final product of this process was the syntax for creating pre- and post-intervention GPA variables for each student. This was shared with the principal investigators in October 2018, who then ran it to create the analytic file. In the future, MDRC will carry out its own analyses and release its own report.

Pre-registration File
In the following pages we append the full pre-registered analysis plan file. This file is archived at: https://osf.io/afmb6/

H 3: Cross-school variation in CATE.
We hypothesize that there will be significant cross-school variation in the school-average effect of the GM intervention for 9th grade previously low-performing students in regular U.S. public high schools (on GPA).

H 4: Explaining cross-school variation in CATE.
Research question 4 involves confirmatory analyses of previously-untested hypotheses. In particular, we hypothesize that:
H 4a. Among previously low-performing students, the school-average effect of the GM intervention will vary based on school achievement level. Directionally, we hypothesize that the CATE will be:
i. Smallest (and possibly zero) in the lowest-achievement schools.
ii. Significant and positive in medium-achievement schools, and larger than in the lowest-performing schools.
iii. Significant and positive, but of unknown relative magnitude, in the highest-achievement schools.
Supplemental analyses will test for the effects for different strata of schools defined in the initial sampling plan.
H 4b. Among previously low-performing students, the school-average effect of the GM intervention will vary based on school mindset saturation level.
There are two competing directional hypotheses:
i. Larger effects on GPA in higher mindset saturation schools, because the environment reinforces the message over time. Giving the intervention in a high mindset saturation school is like "planting a seed in tilled soil."
ii. Larger effects on GPA in lower mindset saturation schools, because in high mindset saturation schools students are already receiving growth mindset messages from their teachers and peers (the control group is effectively "treated"), so the intervention is a "drop in the bucket." Meanwhile, in lower mindset saturation schools, students are most in need of a growth mindset, so the intervention is like "water on parched soil."
We define school achievement level and school mindset saturation level below.
c) prior research that we are replicating (e.g. Paunesku et al., 2015; Yeager et al., 2016) only finds benefits for low-achieving students and does not focus on main effects in the full sample. Thus, the ATE for the average student (RQ 1), which includes previously low- and high-performing students, is expected to be very small and positive. The effect for previously low-performers is expected to be moderately positive, relatively larger than for the full sample, and statistically significant.

Sampling Plan
In this section we will ask you to describe how you plan to collect samples, as well as the number of samples you plan to collect and your rationale for this decision. Please keep in mind that the data described in this section should be the actual data used for analysis, so if you are using a subset of a larger dataset, please describe the subset that will actually be used in your study.

5.1.
Preregistration is designed to make clear the distinction between confirmatory tests, specified prior to seeing the data, and exploratory analyses conducted after observing the data. Therefore, creating a research plan in which existing data will be used presents unique challenges. Please select the description that best describes your situation. Please do not hesitate to contact us if you have questions about how to answer this question (prereg@cos.io).
If you indicate that you will be using some data that already exist in this study, please describe the steps you have taken to assure that you are unaware of any patterns or summary statistics in the data. This may include an explanation of how access to the data has been limited, who has observed the data, or how you have avoided observing any analysis of the specific data you will use in your study. The purpose of this question is to assure that the line between confirmatory and exploratory analysis is clear.
All of the present research questions concern the effect of an intervention on students' GPA assigned from the point of intervention through the end of 9th grade. Students' grades have been recorded by school districts but have not yet been delivered to the researchers for 62 of the 66 schools in the study. Most of the school districts have delivered their datasets to a third-party research firm, ICF International, which is cleaning and merging the data. ICF International has not yet shared the full grades dataset with the research team.
ICF shared an "early release" of 4 of the 66 schools to the research team so that the team could provide feedback on the data cleaning and merging process and make additional requests for formatting and information that could be applied to the full set of 66 schools. Furthermore, data from those 4 schools were cleaned and analyzed by the research team, to inform the pre-registered analysis plan.
In sum, achievement data for 62 of the 66 schools have not yet been delivered to the research team by the third-party research firm. Therefore, we are not yet able to test any of the four research questions above.

Data collection procedures.
7.1. Please describe the process by which you will collect your data. If you are using human subjects, this should include the population from which you obtain subjects, recruitment efforts, payment for participation, how subjects will be selected for eligibility from the initial pool (e.g. inclusion and exclusion rules), and your study timeline. For studies that don't include human subjects, include information about how you will collect samples, duration of data gathering efforts, source or location of samples, or batch numbers you will use.
A research firm selected a sample of schools and recruited them into the study. A school liaison, working with the research firm, helped students complete the materials in a school computer lab. The sampling plan is described in the methodological report for the study.

Sample size
8.1. Describe the sample size of your study. How many units will be analyzed in the study? This could be the number of people, birds, classrooms, plots, interactions, or countries included. If the units are not individuals, then describe the size requirements for each unit. If you are using a clustered or multilevel design, how many units are you collecting at each level of the analysis?
School level: We took a stratified random sample of approximately 150 high schools from the universe of all regular U.S. public high schools. 76 schools agreed to participate and collected student survey data, and 66 provided student record data. The primary analytic sample for the present study will be the 66 schools with student achievement records.
Student level: Students are nested within schools. Our target was to include all 9th grade students within each randomly selected school. Students are included in analyses of treatment effects provided that they (a) saw the first page of treatment or control content, and (b) have student records data (for calculating GPA).
There were approximately 16,000 students who began Session 1 in the 76 schools, but we do not yet know the sample size for the subset with student records because the student records have not yet been delivered.
Sample size rationale
9.1. This could include a power analysis or an arbitrary constraint such as time, money, or personnel.
We recruited as many schools as could be recruited in the period between April 2015 and February 2016 (when the final schools implemented the treatment). The plan was for all schools to complete the intervention by the second month of 9th grade, but we extended the window until February of 2016 to increase sample size.

Stopping rule
10.1. If your data collection procedures do not give you full control over your exact sample size, specify how you will decide when to terminate your data collection.
Our data collection procedures did not give us full control over exact sample size. Termination of data collection occurred when it was too late in the year to include more schools (February 2016).
Variables

Manipulated variables
11.1. Describe all variables you plan to manipulate and the levels or treatment arms of each variable. For observational studies and meta-analyses, simply state that this is not applicable.
We manipulated the materials during individual computer activities students completed at school. Students were randomly assigned by the computer program to be presented with either a growth mindset treatment or a control activity.

Measured variables
12.1. Describe each variable that you will measure. This will include outcome measures, as well as any predictors or covariates that you will measure. You do not need to include any variables that you plan on collecting if they are not going to be included in the confirmatory analyses of this study.

Outcome Measure(s):
GPA: GPA serves as the single confirmatory outcome measure for all hypotheses discussed in this analysis plan. GPA refers to the end-of-the-school-year GPA based on grades in core courses only. Grades are defined as grades on a 0-4.33 point scale. Core courses refer to math, science, social studies, and English/Language Arts. Grades in these core courses will be averaged (unweighted) to calculate GPA. Plans for data processing of grades are provided in the Indices section of the analysis plan.
Analyses of other configurations of grades are possible, but the confirmatory GPA variable will drive the main "story" regarding the effectiveness of the GM intervention. We will also explore the GM intervention's effects: • In specific subjects (e.g., Math).
• On "poor performance" at the end of 9 th grade, such that 1 = D/F average in core courses, 0 = satisfactory performance (C-or above).
Attitudes: Students self-reported a number of attitudes at pre-test and post-test, and analyses of these were preregistered prior to data delivery (https://osf.io/byc2e/). Students self-reported mindsets and we measured their behavior on the "make-a-worksheet" challenge-seeking task (as interim outcomes). We collected measures of treatment fidelity (described in the exploratory analyses).

Student-level subgroup(s):
To answer RQ 2-4, we must define who is considered a "previously low-performing student." We do that here.
Previously low-performing students are defined as students who were earning grades lower than (or equal to) 50 percent of their 9th grade school peers, prior to random assignment. At an operational level, this is a student whose pre-random-assignment GPA is at or below the 50th percentile of his/her 9th grade peers.

School-level subgroup(s):
To answer research question 4, we must define school achievement-level and school mindset saturation level. We do that here.
School-level achievement is defined as a latent variable derived from school-level achievement data. When testing for non-linear differences, we break school-level achievement into three categories, which align with the sampling plan and the hypotheses described in section 4.1:
i. Lowest-achievement schools are those schools in the bottom quartile of the school-level achievement index.
ii. Medium-achievement schools are those schools that fall above the 25th and below the 75th percentile on the school-level achievement index.
iii. Highest-achievement schools are those schools in the top quartile of the school-level achievement index.
The school-level achievement index, or measure, is described in section 13.
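The three categories above follow mechanically from quartile cut points of the achievement index. This standard-library sketch illustrates that categorization; how the study handled schools falling exactly on a cut point is unknown, so assigning boundary values to the outer categories here is an assumption.

```python
from statistics import quantiles

def categorize_schools(index_values):
    """Split schools into lowest (bottom quartile), medium (middle half),
    and highest (top quartile) groups on the achievement index.
    Boundary values go to the outer categories (an assumption)."""
    q1, _, q3 = quantiles(index_values, n=4)  # 25th and 75th percentile cuts
    return ["lowest" if v <= q1 else "highest" if v >= q3 else "medium"
            for v in index_values]

labels = categorize_schools([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
```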
School-level mindset saturation is defined as the prevalence of growth mindset thinking in the school environment.
A continuous variable will test the competing hypotheses described in section 4.1. The school-level mindset saturation indices are described in section 13.

Student-level Covariates
- Student male/female identification
- Student race/ethnicity: dummy variables for Asian/Asian-American, Hispanic/Latino/a, Black/African-American, or other, with white students as the referent group
- Student special education status (when available), dummy variable
- Student maternal education, dichotomized (1 = four-year degree or higher, 0 = less than a four-year degree)
- Student self-reported expectations for success (unless multi-collinearity with prior achievement is too high)
- Missing data: We will not use list-wise deletion of cases that are missing covariates. We will impute missing covariates using the missing value dummy method, unless an alternative method is recommended by our statistician advisors.
- Collinearity: We will remove a covariate from the models if it is too highly correlated with others, if there is excessive missing data, if it increases standard errors due to multi-collinearity, or if it prevents the model from converging.
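The missing value dummy method named above can be sketched as follows: replace each missing covariate value with a constant and add an indicator flagging the imputation, with both columns then entering the regression. The fill constant of 0 is an assumption; any constant works because the indicator absorbs the mean difference.

```python
import math

def dummy_impute(values, fill=0.0):
    """Missing-value dummy method: fill missing covariate values with a
    constant and return an indicator marking which entries were imputed."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))
    filled = [fill if is_missing(v) else v for v in values]
    missing = [1 if is_missing(v) else 0 for v in values]
    return filled, missing

# e.g. maternal education (1 = four-year degree or higher), one value missing
filled, missing = dummy_impute([1.0, None, 0.0])
```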

If any measurements are going to be combined into an index (or even a mean), what measures will you use and how will they be combined? Include either a formula or a precise description of your method. If you are using a more complicated statistical method to combine measures (e.g. a factor analysis), you can note that here but describe the exact method in the analysis plan section.

For Outcome Measure(s):
Processing of grades for calculating GPA: Here is how we will process both pre- and post-intervention grades:
- We will analyze grades at the term level (e.g. fall or spring semester, or, in block schedules, a quarter). When only independent marking period grades are provided (e.g., marking periods 1-3, but not fall semester) then we will aggregate them to the term level (except in the case of missing pre-treatment data, as noted below).
- We will analyze only core course grades. We define core courses as math, science, social studies, and English/language arts. Non-core courses are electives, such as art, PE, computers or music. Non-core courses also include "support" classes, such as a lab class that is co-enrolled with a science class.
  o Core course designation will be made through a combination of course catalogs from schools and coding of course names. Coding of core courses will be independent of knowledge of the effect on outcomes of the study, and all syntax will be retained to enable robustness checks.
- If a school has a non-standard schedule (e.g. a block schedule) we may need to create a school-specific rule. We will annotate the syntax in the grades processing file, along with the justification. These decisions will be made prior to merging data with the randomized condition variable.
- Grades will be provided as letter grades (e.g., A, B, C). Core course grades will be re-coded on a 0 to 4.33 point scale, with 0 referring to "F" and 4.33 referring to "A+." Some schools will only report up to an A, and so 4.0 will be the max grade for them. We will test the impact of putting all schools on the same scale (from 0 to 4).
GPA at the end of 9th grade.
- Goal: The main outcome is GPA in core courses at the end of 9th grade, weighting each core course equally. The initial plan was to average Fall 2015 and Spring 2016 achievement. However, some schools delivered the intervention in Spring 2016.
- Confirmatory Operationalization.
  o In schools that delivered the intervention in Fall of 2015, the outcome will be the average of Fall 2015 and Spring 2016 core course grades.
  o In schools that delivered the intervention in Spring of 2016, the outcome will be Spring 2016 core course grades only.
  o An exception will be if a school uses block scheduling and an entire quarter's grade is self-contained. In that event, we will look at the timing of the delivery of the treatment and the beginning and end of the quarter, to determine whether grades were recorded post-intervention or pre-intervention.
- Missing data.
  o We will use list-wise deletion of cases that are missing the primary outcome variable.
  o We will examine the impact of differential attrition on our inferences (for instance, perhaps the treatment kept marginal students from dropping out) and develop adjustments if attrition is differential.

For Student-level Subgroup(s)
Previously low-performing students:
- Goal. Conceptually, we wish to know if the treatment benefits students who were not already earning very high grades prior to receiving the intervention. The design of the study called for a fall intervention, and so we expected to use 8th grade achievement to define this subgroup. However, some schools gave the intervention in the spring term of 9th grade (or perhaps after a full quarter's grades were recorded, for block schedules). Therefore grades from the fall of 9th grade are the most recent term. We will use the most recent term to create the pre-intervention low-performing student subgroup.
- Confirmatory Operationalization.
  o In schools that delivered the intervention in the Fall of 2015, prior GPA will be the average grade in 8th grade core courses, or, if these were not provided, 8th grade spring core course grades.
  o In schools that delivered the intervention in the Spring of 2016, the pre-intervention GPA will be the Fall 2015 GPA in core courses (i.e. Fall of 9th grade).
  o The exception to these two rules would be schools using block scheduling on a quarter system where the treatment was delivered in the Fall but after an entire quarter's grades were recorded. In such cases, the completed first quarter Fall 2015 grades will be prior achievement and the remaining three quarters will be the outcome.
  o Pre-intervention GPA will be z-scored within schools.
To be precise, let:
GPA_pre,ij = the 8th grade GPA of student i attending 9th grade school j (among students where the study was implemented in the fall of 9th grade), or the fall 9th grade GPA of student i attending 9th grade school j (among students where the study was implemented in the spring of 9th grade or after a fall 9th grade block was finished).
Med_j = the median pre-treatment GPA of students attending 9th grade school j.
So: student i is classified as previously low-performing if GPA_pre,ij is at or below Med_j.
o When students do not have 8th grade achievement but at least some of their grades were reported on a progress report in 9th grade prior to the delivery of the treatment, such as first quarter grades, then these will constitute pre-treatment GPA.
o When students do not have any of these grades, we will impute prior achievement values using their 8th grade test scores and self-reports of expectations for success in the coming year (cf. Hulleman & Harackiewicz, 2009).
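Under these definitions, the subgroup flag and the within-school standardization reduce to simple per-school computations. The sketch below takes one school's students at a time; using the population standard deviation for z-scoring is an assumption (the study's exact denominator is not specified here).

```python
from statistics import mean, median, pstdev

def low_performer_flags(pre_gpas):
    """Flag students at or below their school's median pre-treatment GPA."""
    med = median(pre_gpas)
    return [g <= med for g in pre_gpas]

def z_within_school(pre_gpas):
    """Z-score pre-intervention GPA within one school."""
    m, sd = mean(pre_gpas), pstdev(pre_gpas)
    return [(g - m) / sd for g in pre_gpas]

school_gpas = [2.0, 3.0, 3.5, 4.0]        # one school's pre-treatment GPAs
flags = low_performer_flags(school_gpas)  # median is 3.25
zs = z_within_school(school_gpas)
```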

For School-level Subgroup(s)
School-level achievement index
- Goal. The goal for the school achievement variable is to understand whether treatment effects differ at schools with different levels of rigor and standards. As a proxy for this, we created a latent variable of school achievement level for the purposes of stratification when randomly sampling schools to participate in this project (see Tipton, Yeager et al.). This same latent variable will be used for subgroup analyses.
  o There is no missing data on the school achievement level variable.
School-level mindset saturation index
- Goal. The goal is to assess whether environments with a strong mindset climate have weaker or stronger effects. However, there is no established measure of mindset saturation.
- Confirmatory Operationalizations. We will test and report two operationalizations:
  o Self-report. The average "fixed mindset rating" on a 6-point scale for students in the school, measured prior to random assignment (both treatment and control group). The advantage of this measure is that it is a direct assessment of the construct: the prevalence of fixed/growth mindset thinking. The disadvantage of this measure is the potential for "reference bias" in making between-school comparisons (see Duckworth & Yeager, 2015). Another disadvantage is that peers may conform more to perceived actions than private beliefs. Then again, reference bias may be minimal for growth mindset (see West et al. 2017).

RQ 1. ATE for all students.
To estimate the average treatment effects (ATE), we use the following fixed effects model:

Y_i = Σ_j α_j S_ij + β1 T_i + β2 A_i + Σ_k γ_k X_ki + ε_i    (1)

where:
Y_i = the outcome for student i (GPA)
T_i = 1 if student i was randomized to treatment and zero otherwise,
A_i = the prior achievement for student i, z-scored within schools
X_ki = school-mean-centered baseline covariate k for student i (see section 12)
S_ij = indicator variable indicating student i attends school j
The model will use person-level survey weights (which include school-level adjustments) and will not include any school-level covariates. It will use cluster-robust standard errors, clustered at the school level, to account for the nesting of students within schools. The parameter of interest is β1, the average effect of the GM intervention for all students.
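A minimal numerical sketch of this fixed-effects specification on simulated data. The simulated effect sizes are arbitrary, and the omission of survey weights and cluster-robust standard errors is a simplification; in practice one would use a statistical package supporting weighted estimation with clustered errors.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_schools = 400, 4
school = rng.integers(0, n_schools, n)   # school membership
treat = rng.integers(0, 2, n)            # random assignment to GM vs. control
prior = rng.normal(0.0, 1.0, n)          # prior GPA, z-scored
# Simulated outcome with a true treatment effect of 0.10 SD (an assumption)
y = 0.10 * treat + 0.50 * prior + 0.20 * school + rng.normal(0, 0.5, n)

# Design matrix: school indicator columns absorb the fixed intercepts
S = np.eye(n_schools)[school]
X = np.column_stack([treat, prior, S])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
treatment_effect = beta[0]  # estimate of the average intervention effect
```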

RQ 2: ATE for previously low-performing students.
To answer research question two we will use equation (1) on the subsample of previously low-performing students (i.e. students at or below the median within their school). The parameter of interest is the coefficient on the treatment indicator: the average effect of the GM intervention for previously low-performing students.
We use the above model (rather than the random effects model in RQ3 and 4) because RQ1 and RQ2 seek to estimate the effect for the average student, not the average school.
Below are the conclusions we would draw from the analyses in RQ1 and RQ2.

RQ 3: Variability in effects across schools, among previously low-achieving students.
To estimate variability in the treatment effect across schools, we will estimate a mixed effects model in the subset of previously low-performing students, using the model described by Bloom et al. (2017):

Level one (students): Y_ij = α_j + B_j T_ij + β A_ij + Σ_k γ_k X_kij + ε_Tij    (2)
Level two (schools): B_j = B_0 + u_j, with u_j ~ N(0, τ²)    (3)

where:
Y_ij = the outcome for low-achieving student i from school j (GPA)
T_ij = 1 if student i from school j was randomized to treatment and zero otherwise,
A_ij = the prior achievement for student i from school j, z-scored within schools
X_kij = school-mean-centered baseline covariate k for student i from school j (see section 12)
For each school (j) this model allows for a fixed school-specific intercept (α_j), to account for the possibility of differences across schools in the proportion of students who are randomly assigned to the treatment vs. control group. This model allows for the treatment effect among low-achieving students to vary randomly across schools, via u_j, with variance τ². Note that the model allows the student-level residual variance to be different for treatment and control group members (represented by the subscript T in the term ε_Tij). These analyses will use survey weights and not include any school-level covariates. The parameter of interest for RQ 3 is τ, the standard deviation of the school-level distribution of average treatment effects.
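The logic behind the meta-analytic heterogeneity test used for this question can be illustrated on simulated data: estimate each school's treatment effect and its sampling variance, then compare the observed dispersion of effects with what sampling error alone would produce. This is a schematic stand-in for the pre-registered mixed model, not the study's estimator; the sample sizes and effect parameters are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
n_schools, n_per, tau = 30, 100, 0.15
true_effects = rng.normal(0.10, tau, n_schools)  # school-level effects (simulated)

est, var = [], []
for b in true_effects:
    treat = rng.integers(0, 2, n_per)
    y = b * treat + rng.normal(0.0, 0.8, n_per)
    y1, y0 = y[treat == 1], y[treat == 0]
    est.append(y1.mean() - y0.mean())                             # school effect estimate
    var.append(y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0))  # its sampling variance

est, var = np.array(est), np.array(var)
w = 1.0 / var                             # precision weights
grand = (w * est).sum() / w.sum()         # precision-weighted mean effect
Q = (w * (est - grand) ** 2).sum()        # heterogeneity statistic
df = n_schools - 1                        # Q ~ chi-square(df) if effects do not vary
```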
We will conclude the intervention effects vary across schools when either a permutation test or a Q-statistic from meta-analysis shows that τ (the cross-school standard deviation of treatment effects) is different from zero (see Bloom, Raudenbush, Weiss & Porter, 2016). We will interpret the practical significance of our estimate of τ by comparing it to published benchmarks in program evaluation research (Weiss et al., 2017).
Here are the conclusions we would draw from the analysis in RQ3:
• τ̂ > 0, p < .05: The effectiveness of the GM intervention varies across schools.
• τ̂ ≈ 0, p > .05: There is no discernible evidence that the effectiveness of the GM intervention varies across schools.
If τ̂ > 0 and p < .05, we will also estimate and graphically present the school-level distribution of average GM effects, as described in Bloom, Raudenbush, Weiss & Porter, 2016.

RQ 4: Predicting variability in effects across schools, among previously low-achieving students.
Regardless of the answer to RQs 1-3, we will test whether school factors predict variation in the GM intervention's effects among schools (i.e. the school-specific treatment effects in equation (3)). The moderators are school achievement level and mindset saturation. All models will control for percent minority (black, Latino/a, or Native American) because this could be confounded with school achievement level and mindset saturation. These are confirmatory analyses of exploratory hypotheses; thus the approach to the analyses is more flexible than the approach for RQ 1-3. This will also require a more cautious interpretation.
To preview, we will conduct four parametric tests: 1. School achievement as a continuous variable 2. School achievement as a categorical variable 3. Mindset saturation assessed via self-reports 4. Mindset saturation assessed via behavior Then we will estimate a flexible, non-parametric model that likely will use Bayesian inference.
As a first test, we will examine independent, linear predictors. Specifically, we estimate a two-level mixed effects model. The level 1 model is specified in equation (2). The level 2 model is in equation (4) below:

Level two (schools): B_j = B_0 + B_1 Ach_j + B_2 Min_j + B_3 Sat_j + u_j    (4)

where:
Ach_j = the grand-mean centered school achievement level for school j, coded continuously
Min_j = the grand-mean centered percent minority (black, Latino/a or Native American) in school j
Sat_j = the grand-mean centered saturation of fixed mindset for school j
The significance and direction of B_1 will answer RQ 4a. The significance and direction of B_3 will answer RQ 4b. We do not have a substantive hypothesis about the B_2 parameter, for minority composition, but would attempt to interpret and understand it if it was significant.
We will test whether there is a significant reduction in τ² as a result of the inclusion of school-level covariates (i.e. a comparison of equations (3) and (4)). This will answer the research question of whether these three school-level factors in general explain variability in the GM effect among schools.
Second, we will use the school-achievement level variable coded into the three categories that constituted the sampling strata (bottom 25%, middle 50%, and top 25%) and conduct planned contrasts of subgroup ATEs. In the second model mindset saturation will still be a continuous variable.
Third, with consultation from statisticians, we will evaluate potential non-parametric models to examine the independent and interactive impact of school-level moderators on between-school variability in the treatment impact in equation 4 (e.g. likely a variation on Bayesian Additive Regression Trees). These models will test robustness of results to potential confounds in the school-level moderators (such as rural/urban or poverty concentration). To avoid over-interpretation of results, we will provide the statisticians with a dataset where the name of the variable and the meaning of the value labels are masked. The initial summary of the significant moderators will be generated by the statisticians, blind to the identities of the variables or of the treatment or control values.

19.1.
If you plan on transforming, centering, recoding the data, or will require a coding scheme for categorical variables, please describe that process.

Follow-up analyses
20.1. If not specified previously, will you be conducting any confirmatory analyses to follow up on effects in your statistical model, such as subgroup analyses, pairwise or complex contrasts, or follow-up tests from interactions? Remember that any analyses not specified in this research plan must be noted as exploratory.
See primary analyses above and planned analyses below.
We will also examine whether the data meet the assumptions of the linear models. If they do not, we will adjust the model and possibly the estimation methods for standard errors to fit the data as appropriate. We expect the linear models with robust standard errors to be appropriate, however.
For research question 4, we will test the planned sub-groups of "low," "medium," and "high" achieving schools, and we will allow the penalized non-parametric models to tell us where subgroup effects are appearing.
Analyses of the characteristics of schools that did and did not agree to participate will be used to assess whether the answers to the RQs generalize to the population of all 9th grade regular U.S. public high schools or only to those represented by the sample that agreed to participate. We expect that non-participation will be unrelated to observable characteristics following non-response adjustment (i.e. weighting).

Results for different courses:
We will report results separately by course (math, English, social studies, etc.) either in the paper or in an online supplement. There might be larger effects in math and science, under the assumption that lay beliefs about fixed ability are stronger in those subjects, so grades in math and science might benefit more from correction via a growth mindset treatment.
Intervention fidelity:
We will assess the implementation fidelity of our treatment and control conditions with the following measures: (1) the percentage of open-ended questions that students answered during their on-line sessions, (2) the percentage of screens that students opened (and presumably viewed) during their on-line sessions, (3) the student-level response rate, (4) the amount of distraction that students reported experiencing during their on-line sessions, and (5) the amount of distraction that students reported other students experienced during their on-line sessions. We will create a composite of all or a subset of these measures (using factor analysis or analogous data-reduction methods) and aggregate it to the student and school levels. We will explore whether intervention fidelity explains differences in treatment impact, and whether it is a mechanism for potential moderation by achievement level.
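One simple data-reduction choice of the kind described above is to standardize the indicators and take the first principal component as the fidelity composite. The simulated indicators below are illustrative stand-ins for the five measures, not study data.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300

# A latent "fidelity" factor drives five noisy indicators
# (questions answered, screens opened, response rate, etc.).
latent = rng.normal(size=n)
measures = np.column_stack([
    latent + rng.normal(scale=0.5, size=n)  # one noisy indicator per column
    for _ in range(5)                        # five fidelity measures
])

# Standardize each indicator, then take the first principal component
# (the eigenvector of the correlation matrix with the largest eigenvalue).
z = (measures - measures.mean(axis=0)) / measures.std(axis=0)
cov = np.cov(z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
composite = z @ eigvecs[:, -1]  # scores on the leading component
```

The composite could then be averaged within schools to give the school-level fidelity measure used in the moderation analyses.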

Strength of manipulation check:
We will explore whether different schools show different treatment effects because they were more or less successful at delivering the treatment in a way that caused students to change their attitudes and interim behaviors, as measured by the size of the treatment effect on manipulation checks (self-reported mindsets and challenge-seeking behavior, after receiving the treatment) across schools.
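This strength-of-manipulation analysis can be sketched as follows: estimate per-school treatment effects on the manipulation check and on grades, then ask whether they covary. The number of schools, the delivery-strength mechanism, and all values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n_schools, n_per = 40, 200

manip_fx, grade_fx = [], []
for s in range(n_schools):
    strength = rng.uniform(0.0, 1.0)      # school's "delivery strength"
    treat = rng.integers(0, 2, n_per)
    # Both the manipulation check (self-reported mindset) and grades
    # shift with delivery strength (illustrative mechanism).
    mindset = strength * treat + rng.normal(size=n_per)
    grades = 0.5 * strength * treat + rng.normal(size=n_per)
    manip_fx.append(mindset[treat == 1].mean() - mindset[treat == 0].mean())
    grade_fx.append(grades[treat == 1].mean() - grades[treat == 0].mean())

# Do schools with stronger manipulation-check effects show larger grade
# effects? A positive correlation is consistent with moderation by
# strength of treatment delivery.
r = np.corrcoef(manip_fx, grade_fx)[0, 1]
```

In practice the per-school effects would be estimated with shrinkage (e.g. from the multilevel model) rather than raw differences in means.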
The additional planned exploratory analyses below address alternative research questions. They could be presented as secondary analyses in the primary paper, or they could constitute papers of their own. Although these questions are not fully developed, we pre-register them here so that they are documented before we see or analyze any results, thereby constraining researcher degrees of freedom.

Student-level moderators:
(1) academically negatively-stereotyped minority students (e.g., black, Latino, Native American) vs. academically non-negatively-stereotyped students (white or Asian-American students), (2) females vs. males (especially in quantitative classes), (3) students who are socioeconomically disadvantaged (defined either by parental education/occupation or by free/reduced-price lunch status) vs. advantaged students, (4) school track (i.e. advanced math vs. regular math), and (5) attitudinal measures obtained at the beginning of students' first on-line session, such as initial growth mindsets (to replicate the marginally significant moderation in Blackwell et al., 2007), expectations for academic success (conceptually replicating Hulleman & Harackiewicz, 2009), or math anxiety. In general, we will test the conceptual hypothesis that students who face more disadvantage or have greater vulnerability might show stronger treatment effects.
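Each of these subgroup analyses amounts to testing a treatment-by-moderator interaction. A minimal sketch, with a simulated binary moderator and effect sizes chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

treat = rng.integers(0, 2, n)
disadvantaged = rng.integers(0, 2, n)  # hypothetical subgroup indicator

# Simulated outcome: a larger treatment effect for the disadvantaged group
# (0.1 baseline effect plus a 0.3 interaction), as the hypothesis predicts.
y = (0.1 * treat + 0.3 * treat * disadvantaged
     - 0.2 * disadvantaged + rng.normal(size=n))

# Moderation is the coefficient on the treatment-by-subgroup interaction.
X = np.column_stack([np.ones(n), treat, disadvantaged, treat * disadvantaged])
beta = np.linalg.solve(X.T @ X, X.T @ y)
interaction = beta[3]
```

Continuous moderators such as initial mindset or math anxiety would enter the same way, centered, in place of the binary indicator.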

Interaction of achievement level and mindset saturation:
When investigating research question 4, achievement level and mindset saturation could interact. The largest treatment effects might occur in schools that combine the highest achievement level with the weakest school-level mindset saturation.
Adding convenience sample schools to increase cross-site statistical power:
A planned supplemental analysis for Research Questions 3 and 4 will combine the treatment effect estimates obtained in the pilot study (Yeager et al., 2016) and in a replication in a convenience sample of urban district schools (Hanselman et al., in prep) with the national study estimates, to increase the number of schools by 18. This will increase our power to detect cross-site variation in treatment effects. After merging these schools' data with the national sample, we will re-conduct the analyses for Research Questions 3 and 4.
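The pooling logic can be sketched with inverse-variance weighting of per-school effect estimates, plus Cochran's Q as one summary of cross-site variation. The five estimates and standard errors below are invented for illustration; the actual analysis would pool across all schools in the combined samples within the multilevel framework described earlier.

```python
import numpy as np

# Hypothetical per-school treatment-effect estimates and standard errors.
effects = np.array([0.12, 0.05, 0.20, 0.08, 0.15])
ses = np.array([0.06, 0.07, 0.05, 0.08, 0.06])

# Inverse-variance (fixed-effect) pooled estimate across schools:
# schools with more precise estimates get more weight.
w = 1 / ses**2
pooled = np.sum(w * effects) / np.sum(w)
pooled_se = np.sqrt(1 / np.sum(w))

# Cochran's Q statistic: evidence of cross-site variation in effects;
# under homogeneity it is approximately chi-square with k - 1 df.
Q = np.sum(w * (effects - pooled) ** 2)
```

Adding 18 schools increases the degrees of freedom available for detecting such heterogeneity, which is the point of the supplemental analysis.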