The potential for effective reasoning guides children’s preference for small group discussion over crowdsourcing

Communication between social learners can make a group collectively “wiser” than any individual, but conformist tendencies can also distort collective judgment. We asked whether intuitions about when communication is likely to improve or distort collective judgment could allow social learners to take advantage of the benefits of communication while minimizing the risks. In three experiments (n = 360), 7- to 10-year-old children and adults decided whether to refer a question to a small group for discussion or “crowdsource” independent judgments from individual advisors. For problems affording the kind of ‘demonstrative’ reasoning that allows a group member to reliably correct errors made by even a majority, all ages preferred to consult the discussion group, even compared to a crowd ten times as large—consistent with past research suggesting that discussion groups regularly outperform even their best members for reasoning problems. In contrast, we observed a consistent developmental shift towards crowdsourcing independent judgments when reasoning by itself was insufficient to conclusively answer a question. Results suggest sophisticated intuitions about the nature of social influence and collective intelligence may guide our social learning strategies from early in development.

www.nature.com/scientificreports/

In Experiments 2 and 3, we contrasted the five-person group discussion with a crowd of 50 people answering alone. For the Reasoning questions, we chose a set of constraint-satisfaction problems that would challenge adults' capacities, but still be understandable to children (e.g., Sudoku). Because the solutions to these questions must satisfy a mutually understood set of explicit constraints, discussion can help groups generate potential solutions and reduce processing demands on individuals while relying on demonstrative reasoning to correct errors. In Experiments 1 and 2, we contrasted the Reasoning questions with Population Preference questions (e.g., the most popular fruit in the world). Though individuals' intuitions may sway as the discussion generates potential answers, discussion provides no objective means of adjudicating disagreement; thus, it may distort intuitions rather than sharpen them. In Experiment 3, we contrasted easy versions of the same Reasoning questions with a set of challenging Perceptual Discrimination questions (e.g., the fastest rotating item in an array), which a separate sample had rated as more difficult than the Reasoning questions. This allowed us to test the role of perceived difficulty against the potential for effective reasoning. If participants simply favor discussion for questions that feel more difficult, regardless of whether discussion can reliably adjudicate disagreement, then the preference for group discussion will be stronger for the Perceptual Discrimination questions than the Reasoning questions.
Our general prediction in all three experiments was that sensitivity to the contrast between reasoning and intuitive judgment would lead all ages to prefer group discussion for Reasoning questions. However, because past work has suggested that children may underestimate the risks of social influence until between the ages of 6 and 9 21,23,24,26, we predicted that a robust preference for crowdsourcing non-reasoning questions would emerge only among older children (ages 9-10) and adults, while younger children (ages 7-8) would favor group discussion for both kinds of questions in Experiments 1 and 3. All experiments were preregistered, and the data, materials, and power analyses (https://osf.io/6pw5n/) are available on the OSF repository. All experiments were approved by the Yale University Institutional Review Board and conducted according to its guidelines. Written informed consent was obtained from all adult participants. Because children participated online, parents were recorded reading the informed consent form aloud.

Experiment 1
Method.
Procedure. Children were introduced to the protagonist, Jack (a silhouette). They were told that Jack was unsure of the answers to the questions, and could ask five people for help. The five people could either help by Talking Together (giving Jack a single answer as a group) or by Answering Alone (each giving Jack their own answer after thinking about the question without consulting others). For each item, children and adults rated whether "talking together" or "answering alone" was "probably more helpful, or definitely more helpful," producing a 4-point scale of relative preference, where 1 corresponds to "definitely answering alone" and 4 corresponds to "definitely talking together." Adults used the scale directly; children's responses were staggered: they first chose the more helpful strategy, and then were asked for a "probably/definitely" judgment. After answering the eight test items, participants were asked the two comprehension check questions (these were not counterbalanced: Comp_TT was always presented first). Two features of the procedure are important to keep in mind. First, participants could not evaluate the content of any answer to any question, because none was given: they were asked to choose a means of obtaining advice, not to evaluate the quality of the advice itself. Second, they could not make judgments based on the degree or quality of consensus; they only knew that the group would have to give one answer, while the crowd would have to give five independent answers, which could differ or not.

Results
This developmental shift towards Answering Alone when discussion provides no objective criteria for evaluating accuracy is slightly earlier than we had predicted, but consistent with past work on children's evaluation of non-independent testimony 26.
Finally, all ages agreed that a teacher who wanted a group of five students to answer test questions accurately should have the students Talk Together, while a teacher who wanted to know which students had done their homework should have the students Answer Alone (Comp_AA: M Young = 65%, p = 0.04, M Old = 87.5%, p < 0.0001, M Adult = 92.5%, p < 0.0001; Comp_TT: M Young = 70%, p = 0.008, M Old = 85%, p < 0.0001, M Adult = 87.5%, p < 0.0001). This suggests that by age 7, children recognize that discussion could undermine inferences about individuals' "independent" beliefs, but expect group discussion to either generate or disseminate accurate answers. Taken together, these two tasks suggest that sophisticated intuitions about the risks and benefits of social influence may guide decisions about how to learn from collective judgment. Notably, these intuitions are consistent with empirical findings documenting a group advantage over individuals for reasoning questions, and the value of independent responding when discussion is likely to bias collective judgment.

Experiment 2
Could Experiment 1 have underestimated the value of crowdsourcing? Crowdsourcing may be most valuable with large crowds: larger crowds are more likely to include at least one accurate individual, and better represent the relative frequency of beliefs in the population. Moreover, in large enough crowds, even a minimal plurality will easily outnumber the unanimous consensus of a small group. Thus, if a belief's frequency is a cue to its accuracy, a large crowd will always be more informative than a small group. In Experiment 2, we contrasted the 5-person group with a larger 50-person crowd. We predicted that since the Popularity questions simply ask the group or crowd to estimate what most people in a population prefer, all age groups would find it intuitive to ask more people—i.e., the crowd. The benefit of large crowds is less clear for Reasoning questions. If few individuals can solve a problem alone, identifying the correct answer in the crowd may be akin to finding a needle in a haystack; indeed, if individual accuracy is known to be rare, the most common answer may be a widely-shared misconception 53. Yet, if many individuals can solve the problem alone, large crowds are redundant and a learner can outsource evaluating accuracy to a group discussion. We therefore predicted that adults and older children would continue to favor group deliberation over crowdsourcing for Reasoning questions. However, we saw two plausible alternatives for younger children. First, younger children could show the mature pattern. Alternatively, younger children's preference for reasoning in groups could be attenuated by a "more is better" bias. Additionally, since the only difference between Experiments 1 and 2 was the increased crowd size, our design also allows us to explore the effects of crowd size itself by comparing the two experiments directly.
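The intuition that larger crowds are more likely to include at least one accurate individual can be made concrete with simple arithmetic. The following sketch is purely illustrative (the per-advisor accuracy p = 0.10 is an assumed value, not an estimate from the study): if each of n advisors is independently correct with probability p, the chance that at least one is correct is 1 − (1 − p)^n.

```python
# Illustrative sketch (not from the study): how crowd size affects the
# chance of including at least one accurate individual, assuming each
# of n advisors is independently correct with probability p (here an
# arbitrary p = 0.10).

def p_at_least_one_correct(p: float, n: int) -> float:
    """P(at least 1 of n independent advisors is correct) = 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p) ** n

for n in (5, 50):
    print(n, round(p_at_least_one_correct(0.10, n), 3))
```

Under these assumed numbers, moving from a 5-person to a 50-person crowd raises the chance of including at least one accurate advisor from roughly 0.41 to over 0.99—though, as the text notes, the learner still faces the problem of identifying that individual in the crowd.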

Method.
Participants. We recruited 40 adults through MTurk, as well as 80 children (40 Younger, M = 8.01, SD = 0.56; 40 Older, M = 9.92, SD = 0.56; 39 girls). As in Experiment 1, children participated through an online platform for developmental research that allows researchers to video chat with families using pictures and videos on slides 53 . One additional child was excluded and replaced because the family lost internet connection partway through the experiment and could not rejoin.
Materials and procedure. The materials and procedure were identical to Experiment 1, but participants were first shown a large crowd of people, and told that Jack could either ask 5 of them to Talk Together, or 50 of them to Answer Alone. The answer choices from Experiment 1 were altered to display fifty cartoon icons for Answering Alone instead of five.
Results. As before, the four responses for each question Type (Fig. 2b) were averaged to create a single score for each Type. A repeated measures ANOVA revealed a significant effect of Type (F(1,117) = 376.88, p < 0.001, ηp² = 0.763) and AgeGroup (F(2,117) = 9.63, p < 0.001, ηp² = 0.141), and an AgeGroup*Type interaction (F(2,117) = 5.39, p < 0.01, ηp² = 0.084). Despite the crowd having ten times as many sources as the group, participants were not swayed by a "more is better" bias; all age groups continued to prefer the group discussion for Reasoning questions, both as compared to Popularity questions (Bonferroni corrected, Younger: t(117) = 8. We also conducted two exploratory analyses of the effect of crowd size. Our preregistered prediction in Experiment 2 was that participants would favor the crowd for Population Preference questions, but continue to favor the group for Reasoning questions. However, because the only difference between Experiments 1 and 2 was the increase in crowd size from 5 to 50 people, our data also enable us to test the crowd-size effect directly. We ran separate ANOVAs for each QuestionType using AgeGroup and Experiment as predictors. The tenfold increase in crowd size had no impact on participants' preference for discussing Reasoning questions in small groups (F(1,234) = 0.045, p = 0.8320); an AgeGroup*ExpNum interaction was significant (F(2,234) = 4.434, p = 0.0129), but post-hoc comparisons revealed only a marginal difference between younger children's and adults' preferences for reasoning in groups in Exp 1, with no other differences. However, participants were significantly more likely to crowdsource Popularity questions in Experiment 2 than in Experiment 1 (F(1,234) = 19.303, p < 0.0001), with no differences between age groups.
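The scoring step described above can be sketched as follows. The data here are simulated and purely illustrative, and the sketch uses a simple paired comparison rather than the study's full mixed-design ANOVA; it shows only how four 4-point ratings per question Type reduce to one score per participant before a within-participant contrast.

```python
# Hedged sketch of the scoring step (hypothetical data, not the study's):
# each participant gives four 4-point ratings per question Type
# (1 = definitely Answer Alone, 4 = definitely Talk Together); the four
# ratings are averaged into one score per Type, then the two Type scores
# are compared within-participant.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
n = 12  # hypothetical participants
reasoning = rng.integers(3, 5, size=(n, 4))   # ratings near "talk together"
popularity = rng.integers(1, 3, size=(n, 4))  # ratings near "answer alone"

reasoning_score = reasoning.mean(axis=1)    # one score per participant
popularity_score = popularity.mean(axis=1)

t, p = ttest_rel(reasoning_score, popularity_score)
print(f"t({n - 1}) = {t:.2f}, p = {p:.4f}")
```

A full reanalysis would add AgeGroup as a between-subjects factor (e.g., a mixed ANOVA); the paired test above only captures the within-participant Type contrast.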
As in Experiment 1, responses to the comprehension questions at the end of the task suggested even the youngest children recognized that talking together would make it impossible for the teacher to know which students had done their homework (Comp_AA: M Young = 67.5%, p = 0.019, M Old = 92.5%, p < 0.0001, M Adult = 90%, p < 0.0001). However, while older children and adults agreed that the students would do better on the test if they could discuss their answers, younger children were at chance (Comp_TT: M Young = 52.5%, p = 0.4373, M Old = 90%, p < 0.0001, M Adult = 90%, p < 0.0001). Younger children may be less confident in the value of discussion than their responses to the main task questions in Experiments 1 and 2 would suggest; however, informal questioning of participants after the experiment suggested that younger children in Exp 2 may have simply rejected talking together on a test as cheating, even though the question specified that the teacher could choose to allow students to talk together. In short, Experiment 2 suggests that young children's intuitions about the value of group discussion are consistent with empirical demonstrations of a group advantage for reasoning questions and the value of large crowds for intuitive estimations. Moreover, directly comparing Experiments 1 and 2 suggests that while children's preference for reasoning in small groups is stable even in the face of a much larger crowd, they also recognize that for some questions, larger crowds are more helpful than smaller crowds.

Experiment 3
Using Population Preferences as the Non-Reasoning questions in Experiments 1 and 2 leaves two points unclear. First, since a culture's preferences are intuitive for most people, the Popularity questions may have simply seemed easier to answer than the Reasoning questions. Second, because individual preferences are literally constitutive of the population preference, children's responses could reflect an understanding of the nature of preference polling as much as an understanding of the potential for groupthink. To test these two alternatives, we contrasted easy versions of the reasoning questions with challenging perceptual discrimination questions. Disagreement about a challenging perceptual discrimination task would leave a group of laypeople little to discuss beyond confidence, which may be sufficient to filter out obviously wrong answers 56,57 , but is generally an unreliable proxy for accuracy. In contrast, polling a large crowd has been shown to increase the accuracy of a collective decision for perceptual tasks 58 . The relative difficulty of the Easy Reasoning and Hard Percept items was confirmed in a pre-test (Supplemental Materials). If participants' preference for group discussion in Experiments 1 and 2 was driven by perceived question difficulty, then they will prefer group discussion for Hard Percept questions more than for Easy Reasoning questions. If group discussion was preferred because of its perceived benefits for reasoning, participants will prefer group discussion more for Easy Reasoning than Hard Percept. If participants recognize the risks of social influence when discussants cannot rely on demonstrative reasoning, they will prefer crowdsourcing for Hard Percept questions. 
We predicted that adults and older children would recognize the tradeoffs, but because children under 8 frequently fail to recognize the potential for motivational biases even in simpler cases 59, we predicted that younger children would prefer group discussion for both question types. As an exploratory analysis, we also compare the results in Experiment 3 directly to Experiment 2, but given that both the perceived difficulty and the subtype of Non-Reasoning question differ between experiments, direct comparisons should be interpreted with caution.

Method.
Participants. Two children were excluded and replaced when database records identified them afterwards as having already participated in Experiment 2. Two adults were excluded and replaced as well; though our preregistered plan was to accept all MTurkers who passed the basic attention screening, two worker identification codes appeared multiple times in the data, passing the attention screen after failing and being screened out two and three times, respectively, in violation of MTurk policies.
Materials and Procedure. Methods were identical to Experiment 2, with the exception of the following changes made to the questions themselves. First, we presented four new Non-Reasoning questions, replacing the four Popularity questions with four Percept questions: (1) decide which of two pictures of a face "at the tipping point of animacy" is a photo and which is a photorealistic drawing 60, (2) decide whether an opaque box contains 30 or 40 marbles by listening to a recording of it being shaken 61, (3) identify which of twelve colored squares in a visual array is rotating the fastest, and (4) rank the 25 brightest stars in a photo of the night sky in order of brightness. Second, we simplified the four Reasoning questions (see Supplemental Materials) by (1) completing most of the Sudoku, (2) reducing the number of treasures Mario was required to pick up in the vehicle routing problem, and (3) replacing the "impossible object bottle" with an analog of the "floating peanut" task, which requires the learner to extract an object from a jar of water without touching the jar or the object 62. The fourth Reasoning question, Nim, remained the same, as adults rated the 5-item Nim heap as easy to solve.
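Nim is a convenient example of why these Reasoning items afford demonstrative reasoning: a proposed strategy can be checked move by move against explicit rules. As a hedged illustration (the exact rules used in the study are not specified here), assume the common single-heap variant in which players alternately remove 1, 2, or 3 items and whoever takes the last item wins; a brute-force solver then shows that a 5-item heap is a first-player win.

```python
from functools import lru_cache

# Illustrative solver for an assumed single-heap Nim variant (remove 1-3
# items per turn; taking the last item wins). A position is winning iff
# some legal move leaves the opponent in a losing position -- exactly the
# kind of explicit, checkable criterion that lets one discussant
# demonstrate the correct answer to the rest of a group.

@lru_cache(maxsize=None)
def first_player_wins(heap: int) -> bool:
    return any(not first_player_wins(heap - take)
               for take in (1, 2, 3) if take <= heap)

print(first_player_wins(5))  # the 5-item heap used in the study's item
```

Under these assumed rules the solver recovers the classic pattern: a heap is losing for the player to move exactly when its size is a multiple of four.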
Results. For the primary test, the four responses within each question domain (Fig. 3) were again averaged to create a single score for each Type. A repeated measures ANOVA again revealed a significant effect of question Type (F(1,117)  . However, since Experiment 3 was designed to contrast Easy Reasoning questions with Hard Percept questions, rather than Hard Percept with Population Preferences, these direct comparisons with Experiment 2 should be interpreted with caution: for example, the weaker preference for crowdsourcing Hard Percept questions than Population Preference questions may be due to the difference in Non-Reasoning subtype or to an effect of difficulty that is specific to Non-Reasoning questions. We explore these possibilities further in the General Discussion.

General discussion
We asked children and adults to choose between two social learning strategies: soliciting a consensus response from a small discussion group, and "crowdsourcing" many independent opinions. Though discussion can sometimes lead to groupthink, by affording individuals the opportunity to correct each other's mistakes and combine insights while also reducing individual processing load, discussion can also allow small groups to outperform even their best member. In contrast, the value of crowdsourcing is fundamentally limited by the distribution of individual competence in the crowd relative to its size. The less competent individuals are on average, the larger the crowd needs to be to produce a reliably accurate estimate. Thus, when individual competence is low, crowdsourcing may be costly; when individual competence is high, crowdsourcing may add little over discussion, at least for problems where discussion is more likely to improve accuracy than to diminish it. Our results suggest that the decision to crowdsource or discuss may in part turn on learners' beliefs about the efficacy of demonstrative reasoning for a given question. Analogously to young children's failures on false belief tasks, our results suggest that the default expectation for group judgments may be that "truth wins": though individuals may initially disagree, discussion allows groups to ultimately see the truth. As an understanding of how conscious and unconscious biases can influence people's judgments develops, learners can preempt potential biases by crowdsourcing independent judgments. Though even the youngest children in our experiments expected discussion to improve accuracy on reasoning questions, the preference for crowdsourcing non-reasoning questions underwent a developmental shift in all three experiments.
Indeed, in Experiment 3, the youngest children favored discussion for both kinds of questions, suggesting that they may have failed to recognize when discussion can promote groupthink. The timing of the developmental shift is consistent with past work suggesting that between the ages of 6 and 9, children begin to use informational dependencies [21][22][23][24][25][26] and the potential for motivational bias in individual reports 29,59 to adjudicate cases of conflicting testimony. Though recent work suggests that even preschoolers identify cases of individual bias stemming from in-group favoritism 63, unconscious biases due to herding or groupthink may be less obvious, particularly if people assume that informants are motivated to be accurate. For example, even though children as young as six predict that judges are more likely to independently give the same verdict when objective standards are available than when they are not (e.g., a footrace vs. a poetry contest), at age ten children are still no more likely to diagnose in-group favoritism as an influence on judgments in subjective contexts than objective contexts 28,63. In our experiments, both the reasoning and non-reasoning questions had objective answers, but only the reasoning questions afforded an objective method of finding those answers. Learning to recognize this relatively subtle distinction may allow children to take advantage of the benefits of group discussion while avoiding the risks. This is not to suggest that people expect group reasoning to be infallible, merely that they expect groups to improve individual accuracy. This is consistent with recent work asking adults to predict group and individual accuracy on a classic reasoning task: while participants radically underestimated the true group advantage, they did expect groups to be more accurate than individuals 64. Interestingly, they also expected dyads to be less accurate than individuals.
A more granular approach to intuitive beliefs about the dynamics of social influence may reveal more sophisticated intuitions: for instance, beliefs about others' conformist tendencies and the distribution of individual competence may increase confidence that "truth wins" in small groups more than in dyads. Past work has suggested that while people dramatically underestimate the accuracy of crowds and overestimate their own 17,65, they defer to others more when uncertainty is high and crowds are larger 66,67. While increasing the crowd size from five to fifty had no impact on Reasoning questions in our experiments, the larger crowd did appear to increase crowdsourcing for Population Preference questions. However, while we only tested Hard Percept questions with a crowd of fifty, confidence in crowdsourcing was lower for Hard Percept questions than for Population Preferences (Supplemental Materials). While our design licenses no firm conclusions on this point, one reason seems evident: by definition, population preferences are whatever most individuals in a population prefer, while perceptual facts like the brightness of stars are wholly independent of individual judgments. Moreover, under the right conditions, discussing perceptual judgments with a single partner can improve accuracy 68,69. Thus, participants' reduced confidence in crowdsourcing Hard Percept questions may have been justified. The extent to which intuitive beliefs about the benefits of discussion and crowdsourcing for different question types correspond to the empirical benefits is an open question.
Our design is limited in one important respect: the discussion group was only allowed to give a single answer, while the crowd could give multiple answers. This procedure strictly ensured that group members could not answer independently, but also entailed a unanimous consensus endorsed by a minimum of five people. Unanimous consensus can be a powerful cue: even a single dissenter can sharply reduce conformity 24,67. However, the meaning of dissent may vary across contexts and questions. In a crowd, a single "dissenter" may simply have made a mistake; but dissent-despite-discussion signals that the group has failed to convince them. When questions afford conclusive demonstrations of accuracy, failure to convince all discussants may reflect poorly on group accuracy. Conversely, in more ambiguous contexts, unanimity may suggest groupthink. For instance, in ancient Judea, crimes more likely to elicit widespread condemnation were tried by larger juries for the express purpose of reducing the odds of consensus, and unanimous convictions were thrown out on the grounds that a lack of dissent indicated a faulty process—an intuitive inference confirmed by modern statistical techniques 70. A similar logic may underlie inferences about testimony that contradicts social alliances. For example, if Jenny says Jill is bad at soccer, even preschoolers give Jenny's judgment more credence if Jenny and Jill are friends than if they are enemies 63. Our results suggest that even in early childhood, the absolute number of sources endorsing a belief may be less important than how those sources arrived at their beliefs. Indeed, the limited number of possible answers to the questions in Experiments 2 and 3 guaranteed that even a plurality of the 50-person crowd would considerably outnumber the 5-person group. Yet, participants' preference for discussion and crowdsourcing bore no relationship to the number of possible endorsers. Future work will compare explicit degrees of consensus in groups and crowds.
The last decade has produced an extensive literature describing how individual social learning heuristics and patterns of communication in social networks can improve or diminish collective learning 33,[71][72][73]. By focusing on population-level outcomes, much of this work has tacitly treated individuals as passive prisoners of social influence. However, the heuristics guiding social learning develop in early childhood, and recent work has shown that like other intelligent systems capable of self-organization, people are capable of "rewiring" their social networks to improve both individual and collective learning, by "following" or "unfollowing" connections depending on their accuracy 73. Our experiments focused on two features of communication patterns that individuals can and do control in the real world, beyond who they choose to trust: how many people to talk to, and whether to talk with those people as a group or a crowd. Our results suggest that even in early childhood, people's judgments about how best to make use of group discussion and crowdsourcing heuristics may be consistent with the empirical advantages of each strategy. An understanding of how intuitions about social influence develop may contribute to a clearer empirical picture of how people balance the benefits of learning from collective opinion with the risks of being misled by it.