Students being and becoming scientists: measured success in a novel science education partnership

The primary and secondary learning years shape development of scientific interest and skills required for science literacy, presenting a critical timeline target for science education intervention. Although many initiatives exist to target this timeframe, the modern classroom resists easy scientific investigation. Numerous initiatives often run simultaneously in a given classroom, creating limited capacity for variable control. Consequently, there is a dearth of high-quality and meaningful data in education sciences that exacerbates the general segregation of education research from practice. Many science reform programmes go unmeasured. The limited number that are researched often report strictly qualitative results or stop short of statistically significant quantitative investigation. Lack of high-resolution data restricts the ability to make informed policy changes and precludes attainment of “evidence-based education”. Here, we demonstrate 5-year efficacy of a novel, inquiry-based primary and secondary science reform programme, Integrated Science Education Outreach (InSciEd Out). Five years of data over three cohorts of matched students from US grades 5–8 show maintained gains in science fair and honours biology election, as well as improved performance on Minnesota state standardized science testing. Detailed value-added analyses further reveal InSciEd Out-correlated gains in partnership-focused areas of life sciences, and history and nature of science. These analyses provide evidence that scientifically rigorous evaluation demonstrating relevant programme efficacy is indeed achievable in education science. Our results support the premise that the InSciEd Out programme is a scalable intervention capable of primary and secondary science education reform. The programme substantively builds upon prior efforts in the field. Although InSciEd Out deploys novel approaches and tools, the broad lessons learned from this programme are readily translatable to other contemporary efforts cultivating science literacy for all.


Introduction
Science, technology and innovation (STI) drive progress in sectors from public health to security. Globally, STI is imperative to achieving the Millennium Development Goals (United Nations Economic and Social Council, 2013; United Nations, 2014); nationally, the United States (US) views STI as key to securing the nation's future (National Science and Technology Policy, Organization, and Priorities Act of 1976). Although STI has contributed to >50% of US economic growth post-World War II (Sturko Grossman, 2008), Science, Technology, Engineering and Mathematics (STEM) occupations account for only 5.5% of the US workforce (Langdon et al., 2011). There is therefore much interest in bolstering the STEM pipeline by cultivating scientific interest during the primary and secondary years (Murphy and Beggs, 2003; Osborne et al., 2003; Tai et al., 2006; Osborne, 2007; Logan and Skamp, 2008).
Many initiatives exist to target the primary and secondary pipeline in schools. In response, the science education field has begun emphasizing rigorous measurement of student outcomes and greater consideration of study designs (Schroeder et al., 2007; Slavin et al., 2012). This push for rigour is welcomed, as major challenges in science education include both high-resolution capture of student outcomes (Schroeder et al., 2007; Slavin et al., 2012) and application of research data to practice (Porter and McMaken, 2009; Rust, 2009). One 1998 report found that less than 10% of teacher professional development programmes directly measured student achievement (Killion, 1998); another same-year math and science review found only four science programmes with data collection on student learning (Kennedy, 1998). More recent summations of the field show how lack of scientific rigour remains a barrier to evidence-based practice. Although calls for scientific teaching are widespread, actual practice of scientifically rigorous evaluation of science education is largely lacking (Handelsman et al., 2004; Hanauer et al., 2006; Wieman, 2007).
Part of the difficulty of capturing student gains lies within the myriad confounders embedded within the modern classroom. This challenge can often result in reduced statistical power due to limited sample size. It can also lead to reductions in the scope or depth of the tested intervention. One example is the concurrent implementation of multiple different programmes in every classroom as part of school or District-wide initiatives. This practice is a practical reality, but it presents the perfect catch-22: too many initiatives obstruct progress (Hatch, 2002; Fullan, 2004; Bartalo, 2012; DuFour et al., 2013; Freedman and Cecco, 2013; Smith, 2015), but simultaneous initiatives complicate selection of effective programming. Identifying a signal in this noise requires detailed quantification of value-add in a manner that both celebrates and accounts for the everyday classroom. There is therefore a need for effective statistical evaluation in education science to test correlations in student outcomes with specific programming. Selection of successful, evidence-based programmes is needed to better build the primary and secondary science pipeline for the future.
Herein we evaluate a school-wide, inquiry-based science education intervention, Integrated Science Education Outreach (InSciEd Out, insciedout.org). InSciEd Out is a collaborative partnership committed to rebuilding primary and secondary science education curricula for the twenty-first century. The programme is driven by scientific professional development internships for multidisciplinary teams of primary or secondary teachers from a common school. Internships are followed by sustained support in curriculum writing and implementation during the school year (Pierret et al., 2012). Teacher professional development remains a widely accepted method to strengthen the STEM pipeline (Kennedy, 1998; Guyton and Dangel, 2004; Czerniak et al., 2006; Schroeder et al., 2007; Lumpe et al., 2012; Shymansky et al., 2012; Slavin et al., 2012), but numerous aspects distinguish InSciEd Out from other professional development offerings.
First, InSciEd Out is a sustained partnership. The programme recruits whole teaching teams, not just science teachers, to cultivate a school culture of change. It then fosters connections between participant schools and their larger communities. The presence of school-to-community connections has been shown to correlate with improved student learning and behaviour (Michael et al., 2007). InSciEd Out's status as a partnership between schools, scientists, university faculty and parents follows recommendations to sustain professional development in science education (National Science Teachers Association, 2006). The long-term nature of InSciEd Out professional development and its ongoing support infrastructure are also designed to rectify key pitfalls of ineffective professional development programming (Gulamhussein, 2013).
Second, one signature component of InSciEd Out is the extensive use of an aquatic animal, the zebrafish (Danio rerio). Teacher interns spend considerable laboratory time exploring science through the zebrafish model system. In turn, the curricula they create incorporate zebrafish for student exploration of science. Zebrafish have previously been used effectively in inquiry-based classroom activities (Ekker, 2009). The model system is highly adaptable to the school environment due to its high fecundity, transparent external embryonic development, genetic similarities to humans, size, availability of tissue-specific transgenics and timeline of development (Kimmel et al., 1995). Of the above characteristics, transparent development has been shown to be especially effective in challenging students' Life Sciences misunderstandings, particularly with regard to cells and heredity (Berthelsen, 1999).
Access to model systems like the zebrafish contributes to InSciEd Out's status as a unique platform for inquiry-based science. Although inquiry-based learning is commonly cited in both professional development and education reform (Keys and Bryan, 2001; Anderson, 2002; Capps et al., 2012; Furtak et al., 2012), InSciEd Out's implementation of learner-driven inquiry is distinguishable in both scale and depth. InSciEd Out strives to realize the version of inquiry represented by Short (2009) as "a collaborative process of connecting to and reaching beyond current understandings [...]". InSciEd Out additionally believes that inquiry begins with a question or point of perplexity that is intriguing to a learner and involves a complex journey towards deeper understanding. Science is the application of a structure to ask and answer a question through inquiry. Inquiry is therefore essential to science education because you can teach science to learners, but without inquiry learners cannot be scientists. The fundamental expectation of InSciEd Out is that learners should be producers of novel science knowledge. InSciEd Out learners conduct peer-reviewed and publishable research, where the scientific outcome is truly unknown. To this end, both teacher interns and their primary and secondary students are supported to ask and answer their own new questions in science. This expectation of learners to strive for personal and novel science pushes learners towards self-direction on the National Academy of Science's Essential Features of Classroom Inquiry spectrum (Olson and Loucks-Horsley, 2000).
Ultimately, InSciEd Out is driven by a detailed theory of action. The above foci upon interdisciplinary partnership and student-driven inquiry are but two cornerstones of that theory. Many modern science education reform efforts are driven by incomplete theories of action (Fullan, 2006). InSciEd Out instead strives to explicitly state, understand and assess the strategies it employs. Many different theories and strategies shape InSciEd Out, and the programme is continuously revising itself alongside best practice. Nevertheless, InSciEd Out strives to follow the seven principles set forth by Michael Fullan in pursuit of meritorious change: (1) A focus on motivation; (2) Capacity building; (3) Learning in context; (4) Changing context; (5) A bias for reflective action; (6) Tri-level engagement; (7) Persistence and flexibility (Fullan, 2006).
InSciEd Out teacher professional development is structured as sequential tiers of internships. Tier 1 internships are 12 days of instruction and exploration in a thematic area, followed by an additional 3 days of curriculum development. Tier 1 teacher interns learn about community-generated health themes, genetics and development, pedagogy, dialogue and the nature of science. Tier 2 internships enable an additional 5 days of independent integration of cultural relevance into the initial scientific work developed during Tier 1. One key goal of Tier 2 learning is to revise InSciEd Out classroom curricula to reach students previously marginalized to STEM disciplines. Lastly, the Tier 3 Gold Master internship is an intensive, opt-in programme for teachers who wish to become InSciEd Out Teacher Leaders. Training spans the course of 2 years and is focused around capacity building in inquiry, action research, collaborative peer review and global awareness.
InSciEd Out curricula created within the internships range from a few lessons to a months-long experience integrated among disciplines such as Language Arts, Mathematics, Science, Physical Education and Art. The lessons are driven by state standards and are cross-matched to Next Generation Standards. Each unique set of InSciEd Out curriculum is called a module and is designed by teacher interns in partnership with InSciEd Out team members. Modules replace previously inefficient or outdated lesson plans. In this manner, high-quality student learning experiences are made possible without overtaxing content-saturated syllabi. An excerpt from the rubric for a module is included in Table 1.
The study here analyses outcomes for InSciEd Out partner school Lincoln K-8, Rochester, MN. Lincoln is part of the Rochester School District (MN#535) and has been an InSciEd Out partner since 2009. A previous preliminary analysis of 2-year InSciEd Out implementation at Lincoln revealed improvements in Lincoln student performance on the Minnesota Comprehensive Assessment (MCA) Science relative to the state and other District schools. Longitudinal analysis from 2008 to 2011 revealed increases in student science proficiency. Effect size analysis normed to the state of Minnesota showed that the first cohort of InSciEd Out Lincoln students outpaced District students on MCA Science improvement from grades 5 to 8. Multiple linear regression analyses controlling for demographics further showed that the level of Lincoln student MCA Science growth exceeded that of other schools' students in the District. InSciEd Out students at Lincoln also showed improvements in their science engagement through simultaneous increases in honours biology election and science fair participation (Pierret et al., 2012). While these results were promising and trended towards gains in student science learning, statistical significance was not achieved in these analyses.
This current analysis is a longitudinal, multi-cohort study of InSciEd Out's quantitative value-add at Lincoln, spanning 5 years of programme implementation. Previous analysis (Pierret et al., 2012) focused upon the 2011 grade 8 cohort (Cohort 1). Here, we analyse the 2012-2014 grade 8 cohorts (Cohorts 2-4), utilizing Lincoln students as their own internal controls and the broader Rochester Public Schools District (MN#535) and the state of Minnesota as externally normed comparisons. This study expands upon the previous pilot analysis of Lincoln to comprehensively evaluate InSciEd Out as a programme for science education achievement. In addition, the broader significance of these results is presented to emphasize the attainability of and need for appropriate and detailed statistical methods. These methods aid in capture of statistical significance for science education programming.

Engagement metrics
The first level of analysis involves engagement data including overall eligible Lincoln student cohorts pre- and post-InSciEd Out to depict overall trends. The percent election of honours biology and percent participation in regional science fair are calculated as the proportion of eligible Lincoln students engaging in the science pipeline. All enrolled grade 6-8 students are included for science fair analysis, and graduating grade 8 students are included in the honours biology analysis. Science fair allows students to voice ownership of their science. Honours biology election is an important self-selected science class decision that historically determined downstream high school science trajectory in the Rochester Public School District.

Achievement metrics
Longitudinal cohort achievement analysis utilizes individual-level data for Lincoln students' performances on grade 5 versus grade 8 MCA tests for each cohort. Demographics for the Lincoln student cohort were drawn from grade 5 school records. Publicly available summary data for the state of Minnesota, District and individual District schools were obtained from the Minnesota Department of Education (accessible at: http://education.state.mn.us/MDE/Data/). Grades 5 and 8 time points were chosen due to administration of MCA Science in grades 5, 8 and high school. At the time of our study, high school MCA proficiency in mathematics and reading were two requirements of graduation. Rochester Public Schools (RPS) has no minimal graduation requirement for science, much less a high-stakes middle school equivalent; the MCA Science remains the only Minnesota standards-based accountability assessment. Recent legislative changes post-study have since relaxed mathematics, writing and reading graduation requirements to first phase out use of the MCA test in favour of the ACT and then to eliminate mandatory graduation assessments entirely (Minnesota Department of Education, 2015).
Overall assessment: Analysis of overall MCA performance first compares grades 5 and 8 cohorts directly, without matching students, via percent proficiency. The MCA has four achievement levels: Exceeds Standards (E), Meets Standards (M), Partially Meets Standards (P) and Does Not Meet Standard (D). Percent proficient is the percentage of students at level M or E.
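The percent-proficient calculation described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `percent_proficient` and the eight-student cohort are hypothetical and are not drawn from the study data.

```python
from collections import Counter

# MCA achievement levels as described in the text:
# E = Exceeds, M = Meets, P = Partially Meets, D = Does Not Meet.
PROFICIENT_LEVELS = {"M", "E"}


def percent_proficient(levels):
    """Return the percentage of students at level M or E.

    `levels` is an iterable of one-letter achievement codes, one per student.
    """
    levels = list(levels)
    if not levels:
        raise ValueError("no students in cohort")
    counts = Counter(levels)
    proficient = sum(counts[level] for level in PROFICIENT_LEVELS)
    return 100.0 * proficient / len(levels)


# Hypothetical cohort of eight students (illustrative only, not study data).
example_cohort = ["E", "M", "M", "P", "D", "M", "P", "E"]
print(percent_proficient(example_cohort))  # 5 of 8 students -> 62.5
```

Because this comparison is unmatched, the grade 5 and grade 8 lists may contain different students; the matched strand analysis handles that separately.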
Strand analysis: Subsequent strand analysis utilizes z-scores, which are standard, state-centered scores representing the number of standard deviations any given data point is above or below the mean. While z-scores cannot wholly compensate for construct differences and use of normative growth does not reflect absolute growth in science knowledge, normative growth is common in programme evaluation. Z-scores enable analysis precluded by grades 5 versus 8 standards differences, versioning of the MCA test (II versus III) and raw versus stanine reporting of strand scores for different years. Strand analysis is matched, as some students enrolled in grade 5 may not remain enrolled in grade 8. Student ID individually matches students from grade 5 to grade 8 with October grade 8 enrolment providing school affiliations. Matched strand analysis uses z-score data from Lincoln school records for individual matching of students from grades 5-8 and only includes students continuously enrolled at Lincoln for the study timeframe. Data for other middle schools and the District (MN#535) are from District records and are also individually matched.
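As a hedged sketch of the matching step, the snippet below standardizes a raw score against a state mean and standard deviation and then computes each matched student's change in z-score (grade 8 minus grade 5). The function names, student IDs and z-scores are all hypothetical; the study's actual records and any stanine handling are not reproduced here.

```python
def z_score(raw, state_mean, state_sd):
    """Standardize a raw score against the state mean and standard deviation."""
    if state_sd <= 0:
        raise ValueError("state_sd must be positive")
    return (raw - state_mean) / state_sd


def matched_delta_z(grade5, grade8):
    """Return {student_id: z8 - z5} for students present in both records.

    `grade5` and `grade8` map student ID -> state-centered z-score.
    Students missing from either grade are dropped, mirroring the
    requirement of continuous enrolment.
    """
    matched_ids = grade5.keys() & grade8.keys()
    return {sid: grade8[sid] - grade5[sid] for sid in sorted(matched_ids)}


# Hypothetical z-scores keyed by hypothetical student IDs (not study data).
grade5_z = {"s01": 0.50, "s02": -0.25, "s03": 1.00}
grade8_z = {"s01": 0.75, "s03": 0.50, "s04": 0.00}

deltas = matched_delta_z(grade5_z, grade8_z)
mean_delta = sum(deltas.values()) / len(deltas)
print(deltas)      # only s01 and s03 appear in both grades
print(mean_delta)  # cohort-level mean change in z-score
```

A positive mean change indicates that the matched group gained ground relative to the state between grades 5 and 8; it does not by itself indicate absolute growth in science knowledge.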
Multiple linear regression: A series of multiple linear regressions examine the value-added contribution of Lincoln enrolment during InSciEd Out programme implementation. These models control for grade 5 MCA Science scores, demographics (gender, ethnicity, limited English proficiency, Special Education status, and Free or Reduced Price Lunch) and Lincoln enrolment to predict grade 8 MCA science scores. Each regression model is completed twice. The first is a "null" model with only the grade 5 score and other demographic covariates included in the fit. The second, "full" model includes a dichotomous variable indicating enrolment at Lincoln. As Lincoln enrolment during this period coincides with InSciEd Out implementation, it serves as a surrogate marker for InSciEd Out effect. Regression estimates effects of being enrolled at Lincoln with R² values calculating explained variance for each model. The F-test of change, or the F-statistic of the ANOVA test, compares the explanatory powers of the null versus full regression models. This identifies whether or not the inclusion of additional explanatory variables to the null model results in a significant increase in explained variance. This metric therefore detects the statistical significance resulting from the inclusion of the Lincoln K-8 Choice indicator in our study. When comparing null and full regression models, the change in R² quantifies the additional explained variance resulting from adding the additional explanatory variables to the null model. In this study, the F-statistic and change in R² determine effects of being enrolled at Lincoln. For a more conservative estimate of statistical significance, Bonferroni correction for multiple comparisons is provided to adjust for the number of variables in each regression model. The adjusted significance threshold is P = 0.005 (the original threshold of P = 0.05 divided by 9).
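The null-versus-full comparison can be illustrated with a small, self-contained sketch. It assumes the standard F-test of change for nested ordinary least squares models, F = ((R²_full − R²_null)/q) / ((1 − R²_full)/(n − k_full − 1)), where q is the number of added predictors and k_full is the number of predictors in the full model. The helper names, the eight hypothetical students and the omission of demographic covariates are simplifications for illustration, not the study's actual models. No P value is computed here; that would require an F distribution (for example from a statistics library).

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(predictors, y):
    """Fit OLS with an intercept via the normal equations and return R^2."""
    rows = [[1.0] + list(p) for p in predictors]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def f_change(r2_null, r2_full, n, k_full, q):
    """F-statistic for adding q predictors to reach k_full predictors."""
    return ((r2_full - r2_null) / q) / ((1.0 - r2_full) / (n - k_full - 1))


# Hypothetical data for eight students (illustrative only, not study data).
g5 = [-1.0, -0.5, 0.0, 0.5, 1.0, -0.8, 0.2, 0.9]     # grade 5 z-scores
lincoln = [0, 0, 0, 0, 1, 1, 1, 1]                   # enrolment indicator
g8 = [-0.9, -0.6, 0.1, 0.3, 1.4, -0.2, 0.8, 1.2]     # grade 8 z-scores

r2_null = r_squared([[a] for a in g5], g8)                 # grade 5 score only
r2_full = r_squared([[a, b] for a, b in zip(g5, lincoln)], g8)
f_stat = f_change(r2_null, r2_full, n=len(g8), k_full=2, q=1)

# Bonferroni threshold for nine variables; 0.05 / 9 is about 0.0056,
# which the text reports as P = 0.005.
bonferroni_alpha = 0.05 / 9

print(r2_null, r2_full, r2_full - r2_null, f_stat, bonferroni_alpha)
```

The normal-equations solver keeps the sketch dependency-free; for real analyses a numerically stabler least-squares routine from an established statistics package would be the more usual choice.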
This research was reviewed under the Mayo Clinic Human Research Protection Program by the Mayo Clinic Institutional Review Board and deemed exempt. InSciEd Out's use of zebrafish as a platform for student inquiry was approved by the Mayo Clinic Institutional Animal Care and Use Committee.

Results and discussion
Engagement analysis reveals sustained improvement of Lincoln students' science pipeline election correlated with InSciEd Out. The MCA Science test provides programme assessment insights extending engagement data to science learning. Overall analysis via the MCAs (Fig. 2) shows Lincoln emerging with statistical significance above the state and the District. InSciEd Out-correlated statistical significance emerges in Year 1 of implementation for grade 5 and in Year 3 for grade 8. Degree of significance is heightened and/or maintained with additional years of InSciEd Out programming. Despite these gains over time, comparisons of grade 5 versus grade 8 percent proficiencies are not statistically significant for any unique Lincoln cohort, posing a question as to where Lincoln's "within-cohort" gains may be found.
Deeper analysis consequently accounts for the four content areas, called strands, within the MCA Science test: History and Nature of Science (HNS, MCA-II) or Nature of Science and Engineering (NSE, MCA-III), Physical Science (PSCS), Earth and Space Science (ESS) and Life Science (LIFS). InSciEd Out's partnership with Lincoln in this study focused on HNS/NSE and LIFS to target historical performance issues. To better understand student outcomes attributable to the InSciEd Out intervention, in-depth strand analysis and multiple linear regression are conducted here, utilizing z-score conversion to standardize student scores to the state and allow for normative growth analysis. We evaluate data through two lenses: longitudinally, comparing each year to the last, and "within-cohort" by following unique groups of Lincoln students as they advance from grade 5 to grade 8. This "within-cohort" lens includes previously unpublished student data from 2009-2012 (Cohort 2), 2010-2013 (Cohort 3) and 2011-2014 (Cohort 4). Cohort 1 (2008-2011) was previously described by Pierret et al. (2012). (Table 2) Table S3, 0.735 and 0.730 z-score, respectively).

Student individually matched strand analysis
As Δz-scores are intrinsically reflective of "within-cohort" progress, positive Δz-scores at Lincoln are suggestive of Lincoln's "within-cohort" gains (Table 2). Lincoln Cohorts 2 and 3 exhibit "within-cohort" gains in HNS/NSE; all Lincoln cohorts show these gains in LIFS. In the two strands not targeted by InSciEd Out programming, Lincoln showed ESS declines relative to the state in all cohorts and PSCS declines in Cohorts 2 and 3. Nevertheless, ESS and PSCS strand scores remain relatively high compared with both state and District scores (supplementary Table S3). These results show a specific HNS/NSE and LIFS effect strongly correlated to InSciEd Out programming.
Multiple linear regression extends targeted strand score gains to demonstrate statistical significance of Lincoln's "within-cohort" student growth. Table 3 and supplementary Table S4 provide summary information from 40 regression models. These models include information for each of the three grade 5 to grade 8 cohorts and all cohorts combined, as well as both individual strand and overall modelling. R² statistics show highest explained variance for the "All Strands" model with lower explained variance for individual strand modelling. This can be attributed to the low number of questions used to assess each strand, which impedes reliability of the strand data. Analysis of regression coefficients (β) reveals that Cohort 2 exhibits positive, but not statistically significant, growth attributable to Lincoln enrolment in HNS/NSE (0.150, P = 0.268) and LIFS (0.196, P = 0.198). Cohort 3 students have similarly positive, but non-significant, HNS/NSE growth (0.217 z-score, P = 0.142), but statistically significant LIFS growth (0.452 z-score, P = 0.002). Cohort 4 students statistically significantly outscore predicted values in both HNS/NSE (0.375, P = 0.002) and LIFS (0.481 z-score, P < 0.001). Together, these results corroborate the strand analysis.
There are both longitudinal (increasing statistical significance over time) and "within-cohort" (positive β) improvements in statistical significance of the Lincoln enrolment predictor. All-strand modelling shows Lincoln students scoring about where they would be predicted to score for Cohorts 2 and 3 (−0.040 and −0.010 z-score) and higher than predicted, though not significantly so, for Cohort 4 (0.173 z-score, P = 0.06). Thus, this growth is again specific to the targeted HNS/NSE and LIFS strands. The F-test of change and R² change statistics demonstrate corresponding statistical effects (P of ΔR²) in Cohort 3 LIFS (P = 0.002) and Cohort 4 HNS/NSE (P = 0.002) and LIFS (P < 0.001). After applying Bonferroni adjustment based on number of variables in each regression model (P = 0.005), results are still significant for Lincoln. Longitudinal trace substantiates previous analyses with increasing magnitude and significance for the F-test of change in HNS/NSE and LIFS over time.

Conclusion
Overall, these data strongly support the premise that InSciEd Out is an efficacious science education intervention for Lincoln. Students maintain high status in state science assessments and engagement metrics with growth in InSciEd Out-targeted areas. As InSciEd Out activities in HNS/NSE and LIFS were designed within existing curriculum, they did not take away curricular focus upon ESS and PSCS. Thus, they cannot directly account for any noted declines. Future InSciEd Out partnerships will expand the fields of science education focus; this expansion has begun with the 2014-2015 launch of an Environmental Sciences module. InSciEd Out expansion is ongoing in the District, Minnesota, broader US, India and beyond.
Given the dynamic modern-day classroom, signal isolation from the noise is difficult, but necessary, for science education advancement. This study reveals two important points to help select successful reform initiatives. First, "best practice" study design varies with study intent. Our pre-post assessment of multiple Lincoln cohorts in comparison with District and state enabled unbiased assessment of student performance despite not using conventional study design hierarchies (Coalition for Evidence-Based Policy, 2007). Second, higher data resolution helps identify specific intervention strengths and weaknesses. Strand analysis in this study revealed targeted student gains. Combination with multiple linear regression enabled Lincoln "within-cohort" growth analysis correlated with InSciEd Out. The inclusion of engagement metrics additionally extended didactic knowledge gains towards preliminary quantification of student entry into science. Engagement with the science pipeline is predictive of further STEM pipeline progression (Osborne, 2007; Aschbacher et al., 2010).
Student data are foundational to improvement of the educational system despite research and practice having limited integration in the field of education (Porter and McMaken, 2009; Rust, 2009). The choice of appropriate measures to assess science education practice is a topic of contention, but US students' flagging performance on international achievement tests (Organisation for Economic Co-Operation and Development, 2012; Provasnik et al., 2012) is an opportunity to iteratively improve our use of US education resources. Better quantification of science education interventions is both possible and necessary for sustainable policymaking and continued betterment of student education.