Making Sense of Data: Identifying Children’s Strategies for Data Comparisons

ABSTRACT The statistical properties of data are not present in any individual value, but rather emerge only by perceiving the set as a whole. Summarizing the statistical properties of sets (e.g., creating ensembles) is ubiquitous in cognition, yet one unanswered question is how this process changes over development. The properties of number sets (e.g., means) provide a unique opportunity to investigate the mechanisms underlying summarization. We presented fourth-grade (~ten-year-old) and sixth-grade (~twelve-year-old) children from the Midwestern region of the United States with a data comparison task, determining which of two golfers produced the farthest drive, and measured their accuracy, confidence, and eye fixation patterns while solving each trial. Children’s data strategies were identified by coding their eye tracking patterns. The results demonstrated that accuracy and confidence were related to the statistical properties of the sets. Older US children consistently used a strategy that demonstrated attention to diagnostic set properties (e.g., attending to most numbers in a set), whereas most younger children used a variety of strategies, many of which were less accurate (e.g., attending to only one number in a set), or used the same strategies less efficiently than older children (e.g., attending to non-diagnostic place values). The results add to our understanding of US children’s quantitative reasoning by identifying strategies children use to make sense of data, their developmental transitions, and how changes in children’s strategy use are a key component in understanding the developmental improvements in summarizing complex information in the environment.

However, open questions include what strategy or strategies young children use to summarize these properties from data sets, whether children use effective strategies with low fidelity, and whether children initially use a variety of strategies to make sense of data and show a developmental improvement in the selection and implementation of different strategies. The goal of the present experiment was to investigate how children make sense of data by identifying their strategies using eye tracking patterns and evaluating the efficacy of these strategies and their implementation as a means of understanding the development of children's reasoning about data evaluation.

Summarization
Adults frequently summarize the statistical properties of complex information in their environments (Wagemans et al., 2012; Whitney & Yamanashi Leib, 2018). For example, when shown a series of dots, adults summarize the average size of the set of dots, as demonstrated by erroneously "recognizing" the average more frequently than actual dots from the set (Ariely, 2001). Although there is considerable evidence that adults summarize such information rapidly (Obrecht, 2019; Peterson & Beach, 1968), there is surprisingly little evidence regarding children's ability to summarize statistical properties. Children and adults can quickly apprehend the statistical properties of sets of faces (e.g., average happiness of a set of faces; Rhodes et al., 2016) and children as young as 4 can quickly summarize the average size of an object from a set of objects, though not as rapidly as adults (Sweeny, Wurnitsch, Gopnik, & Whitney, 2015). When 4- and 5-year-olds were shown groups of oranges in trees, they were able to quickly summarize the average size of each set of oranges, though adults were more accurate (~75% vs. ~95%, respectively; Sweeny et al., 2015, Exp. 1).
It is possible that children summarize the statistical properties of number sets, much as adults summarize the statistical properties of complex visual scenes (Wagemans et al., 2012; Whitney & Yamanashi Leib, 2018). There is evidence that children (like adults) can make rapid and accurate informal estimates from data sets (Masnick & Morris, 2008; Watson & Moritz, 1998, 1999). For example, when shown two sets of multi-digit numbers, children as young as nine quickly estimated which set of data was larger than the other at above chance levels, though at significantly lower levels than adults (Masnick & Morris, 2008). However, little is known about why summaries improve over development and how this ability develops within the domain of number.

Making sense of data
Making sense of numerical data begins with making sense of individual numbers. Individual numbers are represented as precise, verbal categories (e.g., "15") and as approximate, relative magnitudes with error variance (Dehaene, 2009). Approximate representations are activation functions on a mental number line that peak near a value and decline with increasing distance from the value (Park & Starns, 2015). Comparisons between single-digit numbers are faster and more accurate as the ratio of difference between numbers increases (i.e., distance effect; Buckley & Gillman, 1974; Moyer & Landauer, 1967; Sprute & Temple, 2011). Approximate representations appear to be used for these comparisons because accuracy is greater and solutions are faster as the spatial distance between approximate representations on a mental number line increases (Nieder & Dehaene, 2009).
Number sets have unique properties not present in single values, such as central tendency, variance, and set size. These properties can be formally calculated, but they can also be estimated informally. In adults, comparison accuracy improves as the difference between the set means (e.g., the ratio of means) increases (Irwin & Smith, 1957; Morris & Masnick, 2015; Obrecht, Chapman, & Gelman, 2007; Trumpower, 2013). For example, when comparing products on the basis of aggregated consumer ratings, confidence judgments were most affected by mean differences in ratings (Obrecht et al., 2007). When estimating the number of dots in a subset of two mixed sets of dots, participant responses were influenced by the ratio of set means (Cordes et al., 2013). When asked which golfer hit the ball farther, adults were more accurate and more confident given a high (4:5) than a low (9:10) mean difference, and produced more visual fixations when comparing sets with low mean differences than high mean differences, suggesting that smaller differences cue the search for more information (Morris & Masnick, 2008). Variance is also perceived intuitively (Lathrop, 1967); however, it is often not as salient as mean differences (Masnick & Morris, 2008; Obrecht et al., 2007; Trumpower, 2013). The interaction between these properties contributes to the distinctiveness of sets. Specifically, mean difference and variance interact such that comparisons were more accurate and confidence was higher between sets with high mean differences and low variance than between those with low mean differences and high variance (Morris & Masnick, 2015; Obrecht et al., 2007; Trumpower, 2013).
One explanation for rapid, accurate performance is that numbers in a set have both individual representations and are part of summary representations that include means and variance. Comparisons can be made on the basis of the difference between summary representations. Summary representations could emerge from multiple approximate number representations (Masnick & Morris, 2022; Morris & Masnick, 2015). Recall that individual numbers are associated with an activation function on a mental number line that peaks around a specific location and declines proportionally with distance from this location (Bonny & Lourenco, 2013). Multiple numbers would create multiple activations, resulting in summaries that represent approximate set means and variance. Activation areas could differ in the amount of within-set similarity (approximate mean) and the width of the activation area (approximate variance; see Alvarez, 2011; Morris & Masnick, 2015 for a more detailed explanation). During comparisons, summary mean activations should be more salient than variance. A highly dissimilar set of numbers would result in an indistinct area of activation and a wide overall activation area while a highly similar set of numbers would produce a distinct area of activation and a narrow overall activation area. The summary activation areas would be easier to distinguish given distinct differences (i.e., distances) between them.
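The summary-representation account above can be illustrated with a minimal Python sketch. It is not the authors' model: it assumes that each number contributes a unit-area Gaussian activation whose width grows with magnitude, and the constant `WEBER_FRACTION`, the function names, and the example sets are all hypothetical choices made for illustration.

```python
import math

# Assumption: the width of each activation grows with magnitude (a
# hypothetical Weber fraction); 0.15 is illustrative, not from the paper.
WEBER_FRACTION = 0.15


def activation(x, value, w=WEBER_FRACTION):
    """Unit-area Gaussian activation on the number line, peaked at `value`."""
    sd = w * value
    return math.exp(-0.5 * ((x - value) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def summary_activation(values, grid_max=1000):
    """Sum the activations of all set members and describe the result.

    Returns (center, width): the centroid of the summed activation, standing
    in for the approximate mean, and its standard deviation, standing in for
    the approximate variance.
    """
    grid = range(grid_max + 1)
    total = [sum(activation(x, v) for v in values) for x in grid]
    mass = sum(total)
    center = sum(x * a for x, a in zip(grid, total)) / mass
    width = math.sqrt(sum(a * (x - center) ** 2 for x, a in zip(grid, total)) / mass)
    return center, width


similar_set = [290, 300, 310, 300]     # hypothetical drives, mean 300
dissimilar_set = [200, 400, 250, 350]  # hypothetical drives, mean 300

center_sim, width_sim = summary_activation(similar_set)
center_dis, width_dis = summary_activation(dissimilar_set)
```

Under these assumptions both sets yield a center near 300, while the dissimilar set produces a wider summed activation, which mirrors the distinct versus indistinct activation areas described in the text.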
As stated above, statistical properties that are not present in any individual set member emerge from a set of numbers. When perceiving sets of numbers or objects, the whole is greater than the sum of its parts. We refer to this process as summarizing statistical properties, while acknowledging and incorporating the findings from similar approaches (e.g., deriving Gestalt impressions, Wagemans et al., 2012, and creating ensemble codes, Whitney & Yamanashi Leib, 2018). One must attend to a sufficient proportion of the information in a set to summarize the statistical properties with reasonable fidelity (Alvarez, 2011). Although summaries of these statistical properties are rapid and largely accurate, the direction of attention influences the nature and quality of the representation (Whitney & Yamanashi Leib, 2018). For example, focusing attention on specific set members (e.g., a small dot) influences the subsequent representation (e.g., underestimating the set average; De Fockert & Marchant, 2008). This pattern suggests that attention to set elements determines what will be included in the summary value. For example, attending to only one number in each set will yield a comparison between single numbers, whereas attending to the hundreds place value of all numbers in two sets will yield two summary values that can then be compared. Although there is considerable evidence that adults and children summarize the properties of object sets (e.g., dots), less is known about the underlying processes of summarizing numerical data.
There are many ways to make sense of data; however, these strategies for data comparisons differ in their utility. For example, during comparisons, when data sets are very different from each other (e.g., very large mean difference), nearly any strategy should yield a correct answer. However, when data sets are similar, children will need to pick up a sufficient amount of diagnostic information to detect a true difference between sets. We suggest that there is an expected strategy that is most effective in picking up information when solving this task, and we discuss what makes a strategy expected in detail below. We will investigate these underlying processes by identifying the strategies used by children to make sense of data and evaluate the degree to which strategies are used effectively.

The role of strategies in summarization
The goal of the current experiment is to identify the data strategies children use from real-time, behavioral data created through process tracing, that is, from the eye tracking patterns during problem solving (Schulte-Mecklenbeck et al., 2017). Strategies have been investigated in a variety of domains such as spelling (Rittle-Johnson & Siegler, 1999), science (Chen & Klahr, 2008), deductive reasoning (Morris & Sloutsky, 2001), mathematics (e.g., addition; Thevenot, Barrouillet, Castel, & Uittenhove, 2016), and playing games (Crowley & Siegler, 1993). Children adaptively select and modify their strategies with experience (Siegler, 2016) and learn to discard less effective strategies over time (Kuhn, 2013). This process is dynamic and changes over very small time scales, as demonstrated by a child using multiple strategies across trials within the same task or within a single trial (Siegler, 2016). The overlapping waves model provides a conceptual framework to explain how strategies are acquired, their outcomes evaluated, how new selections are made, and why children demonstrate within- and between-task variability (Siegler, 2005, 2016). An earlier model, the test-operate, test-exit model, demonstrated that a strategy is implemented, feedback on its efficacy is obtained, and if the goal is not attained, an additional attempt is made using either the same or a different strategy (Miller, Galanter, & Pribram, 1960). In the overlapping waves model, strategy selections are adaptive in that children tend to select the fastest and most accurate strategy available (Fazio, DeWolf, & Siegler, 2016; Siegler, 2016). When a strategy does not produce an expected result, this prompts additional steps in the form of using the same strategy again or a different strategy to achieve the processing goal, even in the absence of feedback (Alibali, 1999).
There is evidence that children demonstrate multiple, conflicting ideas about and approaches to making sense of data, focusing on individual elements for some problems and relational elements for others (Watson & Moritz, 1999). Further, children and adults may select a strategy but use it ineffectively, resulting in incorrect answers (Miller, 2000). The overlapping waves model may help explain why children demonstrate such variability in their strategy decisions in data interpretation.
It is also possible that children do not initially use expected strategies to summarize the statistical properties of number sets but acquire these strategies over a developmental and educational progression. As a result, children may initially use a variety of strategies and increase their use of expected strategies over the course of development (Siegler, 2005, 2016). When asked to summarize an average (i.e., determine the center of a series of dots), children used different strategies than adults to solve this task and these strategies were often less effective for producing a correct response (Jones & Dekker, 2018). In this case, children used a variety of strategies within the same session to estimate the center of a cloud of dots. Thus, at least in some domains, children used less effective strategies for summarizing the statistical properties of sets. The use of less effective strategies may be driving less accurate summary estimates.
Another possible factor limiting performance is the fidelity of strategy use. Children's ability to differentiate perceptual and conceptual information improves as they attend to and summarize diagnostic information in the environment (Gibson, 1969). More recent work from this tradition suggests that performance in high-level cognitive tasks such as mathematics is related to learning to focus attention on diagnostic patterns (Goldstone et al., 2010). For example, when shown algebra equations, children initially focus on obvious features such as numbers versus letters but later attend to more nuanced relations such as the order of operations (e.g., items in parentheses are attended to first; Goldstone et al., 2017).
The limited evidence related to detecting diagnostic patterns in data sensemaking suggests that children may initially focus on irrelevant information (Garfield & Ben-Zvi, 2007; Hancock, Kaput, & Goldsmith, 1992). For example, third and fourth grade children may initially focus on individual values as being representative of the set (Watson & Moritz, 1999), leading them to incorrect answers. Older children in this study were more likely to focus on the entire set when drawing inferences from data (see Hancock et al., 1992 for similar results).

Identifying data sensemaking strategies
Our goal was to identify children's strategies for comparing number sets to elucidate the underlying processes used to summarize the statistical properties of number sets. In particular, to better understand how children make sense of data, we aimed to identify children's strategies, evaluate the efficacy of strategy use, measure the variability between and within participants, and assess the accuracy of different strategies. We used eye tracking because it provides real-time, direct evidence of what information is being attended to and at what point in the reasoning process. Eye tracking has been described as process tracing (Hayhoe & Ballard, 2005) because visual attention is gathered slightly in advance of action and attention allows organisms to monitor their actions. For example, in a study of a participant making a sandwich, nearly all fixations preceded actions (e.g., looking at a jar of peanut butter before reaching) or occurred during actions (i.e., monitoring; Rothkopf & Pelz, 2004). Mapping the pattern of what children are attending to during the task allows us to identify the strategies children used when comparing data sets. We identified possible strategies based on pilot data and from coding data from previous experiments (Morris & Masnick, 2015). As discussed above, an expected strategy would include attention to a sufficient sample of the information in a set (e.g., a large enough sample to approximate the properties of the set itself) as well as attention to diagnostic features of sets (e.g., hundreds values in three-digit numbers). The use of other strategies may provide information about how naïve children (i.e., children who have received no formal instruction) make sense of data. Table 1 provides example data sets, framed as the results of a golfing competition in which two golfers (Golfer A and Golfer B) are competing to win a contest for the longest drive (i.e., a golf shot with the goal of producing maximum distance).
In this example, each golfer hits a ball five times and the result is measured in feet. The goal was to determine which golfer hit the ball farther and a correct response was operationalized as choosing the set with the higher mean. Based on previous research (Masnick & Morris, 2008; Morris & Masnick, 2015), we hypothesized specific strategies and their associated eye tracking response patterns below (see Table 2).

Note. We also coded for a calculation strategy in which a participant explicitly calculates values such as sums or means. Only one child used this strategy and we eliminated the strategy from further analysis. More information about this strategy is included in the Supplementary materials.
Strategy 1: Comparison. One possible strategy is to compare a limited part of the total set of numbers (e.g., comparing only the first two drives in a set), prompting comparisons between potentially unrepresentative samples. In the example in Table 1, if a student compared only the first two numbers, she would erroneously choose Golfer 1 in Example A and correctly choose Golfer 4 in Example B. Focusing only on an unrepresentative sample of values, rather than the entire set, ignores critical information about the set. Use of this strategy may suggest a belief that any value equally represents the properties of the entire set. The evidence presented above (Masnick & Morris, 2008) suggested that this is a common approach for younger children (e.g., elementary school).
Strategy 2: High-Low. Another possible strategy is to compare the highest or lowest value in each set, which requires searching through each column for the highest (or lowest) value, and then comparing these two values. Returning to the examples in Table 1, if our student used this strategy, she would incorrectly choose Golfer 1 for Example A and correctly choose Golfer 3 in Example B. The use of this strategy may indicate knowledge about sets in which certain values, e.g., the highest or lowest values, are more diagnostic than other values. Although this strategy is more sophisticated than the comparison strategy above, it potentially ignores relevant information about the other properties of the set (e.g., differences in means, variance).
Strategy 3: Gist. Another strategy for comparing sets of numbers is to scan the entire set for an approximation of set properties, such as a rough estimation of mean and spread (see Morris & Masnick, 2015 for a discussion). Approximate averaging has been demonstrated in perceptual research; for example, presented with sets of dots of different sizes, adult participants appeared to encode the properties of the set of dots. When asked to select a dot they had seen, and shown a circle from the set and an "average" circle, that is, a circle that was a mathematical average of the size of the dots in the set and not in the original set, participants chose the average more frequently than they selected actual members of the original set (Ariely, 2001; Chong & Treisman, 2003, 2005). In these studies using dots, the gist of the set was an average of the size of the dot array. As noted above, there is evidence that the gist of a set of numbers may similarly yield an approximate, mathematical average (Morris & Masnick, 2015). Based on the very fast reaction times (often less than 100 ms), it was assumed that participants generated approximate averages (i.e., the gist) of the set not via the deliberate strategy of averaging, but instead from an intuitive (cf. Dehaene, 2009) representation of the set of values. The gist of a set, or an approximate summary of properties such as the mean of a set (Brainerd & Reyna, 2004), can be obtained by scanning entire numbers. For example, for a set of three-digit numbers, a student who encodes only the values in the hundreds column will encode a more diagnostic gist than a student who encodes only the ones column (Morris & Masnick, 2015). Children decompose multi-digit numbers early in development, as demonstrated by comparing compatible number pairs (i.e., when both digits of one number are bigger than in the comparison number; 68 vs. 25) faster than incompatible number pairs (e.g., 51 vs. 37, where 5 is bigger than 3, but 1 is smaller than 7; Nuerk, Weger, & Willmes, 2001). However, a child who scans only a small proportion of numbers or focuses on non-diagnostic place values would likely create an unrepresentative summary. Table 2 provides detailed descriptions of the eye tracking patterns used for coding.
Low fidelity strategy use. Another factor is that a strategy may be selected but implemented with low fidelity, i.e., ineffectively. One possible explanation is that younger children were not using strategies effectively (Miller, 2000). There is evidence that children (Bjorklund et al., 1992) and adults (Gaultney et al., 2005) sometimes implement strategies with low fidelity. However, it is also possible that a child could be using a partial strategy (Waters, 2000). Returning to the strategies above, a child might use the gist strategy but might attend to only a few numbers, perhaps 3-4, when comparing two sets of 10 numbers. In such a case, the child might generate a summary not representative of the full dataset, leading to inaccurate conclusions.
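The three hypothesized strategies and the low-fidelity case described above can be sketched in Python. This is an illustration under stated assumptions, not the coding scheme used in the study: the function names, the `n_attended` parameter, and the two data sets are hypothetical (the sets are not the Table 1 examples), and the gist strategy is approximated here by averaging hundreds digits only.

```python
from statistics import mean


def pick(left_score, right_score):
    """Return which side has the larger score ('tie' if equal)."""
    if left_score == right_score:
        return "tie"
    return "left" if left_score > right_score else "right"


def comparison(left, right, k=1):
    """Strategy 1: compare only the first k values of each set."""
    return pick(mean(left[:k]), mean(right[:k]))


def high_low(left, right):
    """Strategy 2: compare the highest value in each set."""
    return pick(max(left), max(right))


def gist(left, right, n_attended=None):
    """Strategy 3: approximate each set from its hundreds digits.

    n_attended limits how many numbers per set are scanned (None scans all);
    a small value mimics low-fidelity use of the same strategy.
    """
    left_digits = [v // 100 for v in left[:n_attended]]
    right_digits = [v // 100 for v in right[:n_attended]]
    return pick(mean(left_digits), mean(right_digits))


# Hypothetical drives in feet; the right set has the higher mean (300 vs. 250.4).
golfer_left = [412, 405, 130, 110, 195]
golfer_right = [300, 310, 295, 305, 290]

correct = pick(mean(golfer_left), mean(golfer_right))
answers = {
    "comparison": comparison(golfer_left, golfer_right),
    "high_low": high_low(golfer_left, golfer_right),
    "gist": gist(golfer_left, golfer_right),
    "gist_low_fidelity": gist(golfer_left, golfer_right, n_attended=2),
}
```

With these made-up sets, only the full gist scan agrees with the higher-mean answer; the comparison, high-low, and low-fidelity gist variants are all misled by the two unusually long drives at the top of the left set.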

Multiple strategies
As suggested above, multiple strategies may be used on the same trial. For example, if a gist strategy did not yield a conclusive answer, a child might use the same strategy again or might use a different strategy (e.g., comparing the highest values). This is consistent with the predictions of the test-operate, test-exit model in which strategies are implemented, evaluated, and, if a goal is not met, additional operations are implemented. In general, multiple strategies tend to be used more frequently when solving difficult problems compared to simpler problems (Carr & Alexeev, 2011; Rittle-Johnson & Siegler, 1999) and the use of multiple strategies in a single trial has been linked to subsequent improvements in reasoning (Fujimura, 2001; Siegler, 2007). As outlined above, the overlapping waves model provides a framework for explaining differences in individual strategy use, fidelity of use, and multiple strategy use.

Present study
The goal of the current study was to identify and evaluate the accuracy of children's spontaneous data comparison strategies, as inferred from their eye movement behavior (see Table 2). Although the task was similar to that of Morris and Masnick (2015), the present study advances our understanding not only by extending the sample to children, but also by conducting a detailed strategy analysis derived from the visual fixation patterns. These strategies illuminate what information is being used and when in real time during the comparison process. To this end, we presented fourth and sixth grade participants with 36 sets of data that varied systematically in the mean ratio, relative variance, and number of observations (details below). We measured eye fixations, accuracy, and confidence (using a 4-point scale), and administered a brief post-experimental survey of strategy use and knowledge of averaging (i.e., calculating averages from data sets). Fourth and sixth grade children were selected because both have multiple years of experience with place values and the mathematical skills to perform basic calculations (e.g., addition, multiplication). For this reason, all children should exhibit fluency with three-digit numbers. However, these grades differ in their experience with data and probability. The Common Core Math Standards for sixth grade outline basic concepts to be covered, including samples and populations, measures of center, and variance (National Governors Association, 2010). In contrast, the fourth grade standards do not include statistics and probability (e.g., central tendency, variance). These data were collected after the Common Core had been implemented in each district. In sum, both groups should be equally proficient with individual numbers but might show differences in formal skills for making sense of numbers in sets. We first coded each child's patterns of eye movements (fixations, saccades, etc.) for each trial to identify the strategy or strategies used. We then evaluated differences in strategy accuracy by the properties of the data sets (e.g., mean ratio). However, children may use a strategy but implement it with low efficacy; thus, differences in accuracy levels should be compared within the same strategy to investigate this possibility. The identification of differences in strategy use, particularly across ages, would suggest developmental changes in summarizing statistical properties. Another possibility is that children may attempt to summarize the statistical properties but that their attention to nondiagnostic features of set members may skew the resulting representations. For example, if a child focuses the majority of her attention on one number in a set, this number may be more heavily weighted in the summary representation (e.g., underestimating or overestimating an average).
The use of different strategies or ineffective implementation would suggest an emerging understanding of data sensemaking. For example, a child who compares only the first number from each set may believe that each number is equally informative, and that conclusions about sets do not require attention to the full data set. In addition, we analyzed individual variation in strategy use within and between trials to investigate how strategies are used to achieve a goal. Of particular interest is whether children used a single strategy or demonstrated flexible approaches that suggest a dynamic, iterative approach to making comparisons. Finally, we predicted that the use of multiple strategies would be more frequent for sets that are more similar than for sets that are more distinct. We provide an operational definition for these terms below.

Method
Participants. A G*Power analysis was conducted before collecting data, indicating that a sample size of 36 per grade would allow for the detection of large effects (.70) with .80 power. Participants were 44 4th-grade (mean age = 9.78 years, SD = .60, 48% female, 79% White, 11% African-American, 9% Asian-American) and 38 6th-grade (mean age = 12.14 years, SD = .44, 53% female, 71% White, 18% African-American, 10% Asian-American) students from three public schools in the Midwestern United States. We will refer to participants by grade level throughout the remainder of the paper. Our study protocol, Identifying data comparison strategies in children and adults using eye tracking, received ethics approval from the Kent State University IRB. The average household income in the school districts was $34,000, $33,000, and $36,000, respectively. All districts were eligible for free and reduced lunch for 100% of students. Twelve participants (five 4th and seven 6th graders) wore glasses and reported no difficulties seeing the number sets when they removed their glasses, which was necessary for accurate eye-tracking readings.
Materials. Participants saw 36 numerical set pairs organized in two columns on either side of a computer screen. Each set pair varied on the following properties: (a) set size: 4, 6, or 8 three-digit numbers per set, (b) mean ratio: a between set ratio of means of either 4:5 (relatively large difference) or 9:10 (relatively small difference), and (c) coefficient of variation within set of either .10 or .20 (equivalent across sets). For each of the 12 possible combinations of mean ratio, coefficient of variation, and set size, there were three trials with numerical set pairs meeting these specifications, for a total of 36 experimental trials.
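The factorial structure of the materials and the two set properties can be sketched as follows. This is a minimal illustration: the variable names and the example sets are hypothetical, and the use of the population standard deviation in the coefficient of variation is an assumption, since the text does not state which form was used.

```python
from itertools import product
from statistics import mean, pstdev

SET_SIZES = (4, 6, 8)                       # numbers per set
MEAN_RATIOS = ("4:5", "9:10")               # between-set ratio of means
COEFFICIENTS_OF_VARIATION = (0.10, 0.20)    # within-set SD / mean
TRIALS_PER_CELL = 3

# 3 set sizes x 2 mean ratios x 2 coefficients of variation = 12 cells
cells = list(product(SET_SIZES, MEAN_RATIOS, COEFFICIENTS_OF_VARIATION))
n_trials = len(cells) * TRIALS_PER_CELL


def mean_ratio(set_a, set_b):
    """Ratio of the smaller set mean to the larger set mean."""
    lower, higher = sorted((mean(set_a), mean(set_b)))
    return lower / higher


def coefficient_of_variation(values):
    """SD divided by the mean (assumption: population SD)."""
    return pstdev(values) / mean(values)


# Hypothetical set pair with a 4:5 mean ratio and equal relative variance
set_a = [240, 200, 280, 240]
set_b = [300, 250, 350, 300]
ratio = mean_ratio(set_a, set_b)
cv_a = coefficient_of_variation(set_a)
cv_b = coefficient_of_variation(set_b)
```

Because `set_b` is `set_a` scaled by 1.25, the two sets share the same coefficient of variation while their means stand in a 4:5 ratio, which is one simple way to hold relative variance equivalent across sets.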
Numbers were presented in 42-point Times New Roman font and each column of numbers was centered on each side. Within each number, an extra space was placed between the hundreds, tens, and ones places and 1.5 vertical spacing was used between numbers in each column. A Tobii T-60XL eye tracker was used for data collection. Participants were seated in front of a 17-inch monitor and adjusted position until approximately 70 cm from the monitor (the experimenter's screen displayed this information in real time to ensure distance and optimal angle precision). We provide more eye tracking details in the Supplementary materials.
Procedure. The eye tracking data were collected by the first author. After being seated in front of the eye tracker, participants were instructed to keep their heads and bodies still during the experiment. A nine-point calibration was performed with the eye tracker. After completing the calibration, participants were told that the session would begin. They were then provided with the following instructions: You will be shown the results from a series of golf drives (a single golf shot to achieve maximum distance). Each slide will show how far a golfer hit a series of balls from one of two tees (LEFT or RIGHT). Your job is to tell me which golfer, on average, hit the ball FARTHER (all drives were measured in feet). Participants responded verbally and the experimenter recorded their responses.
Next, participants were given the following instructions about the confidence scale: After you make your choice, I will ask you how sure you are using the scale in front of you. The scale goes from 1 to 4. A "1" indicates that you were NOT SO SURE about which golfer hit the ball farther and a "4" indicates that you were TOTALLY SURE that one golfer hit the ball farther. I will ask you: How sure are you that this golfer hit this ball farther? Please say a number. Participants verbally responded and the experimenter recorded the response.
On each trial, participants first saw a fixation slide in which a + was placed in the center of the screen for 1 second. After one second, participants saw a data slide consisting of the two sets of data (positioned on the left and right sides of the screen) and were asked which golfer hit the ball farther and to rate their confidence in this difference. After they produced both answers, the experimenter advanced the screen to the next fixation slide for 1 second. No participant required prompts during the experimental phase to produce a response. Data sets were presented in blocks by set size and within each block mean ratio and variance were randomized into two counterbalanced orders.
Defining Areas of Interest (AOI). AOIs for eye fixations were defined around stimuli before data collection. AOIs were defined around the hundreds, tens, and ones columns and around each three-digit number. The number of fixations and duration of fixations occurring within each AOI were automatically recorded using Tobii Studio. Coding details are provided in the Supplementary materials.
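The logic of assigning fixations to AOIs can be sketched in Python. In the study this step was performed automatically by the eye tracking software, so the code below is only an illustration of the idea and does not use or reproduce any Tobii interface; the `AOI` class, the pixel coordinates, and the fixation records are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AOI:
    """Rectangular area of interest in screen pixels."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x, y):
        return self.x0 <= x < self.x1 and self.y0 <= y < self.y1


def summarize_fixations(fixations, aois):
    """Count fixations and total duration (ms) inside each AOI.

    A fixation can fall in more than one AOI because the place-value column
    AOIs and the whole-number AOIs overlap.
    """
    summary = defaultdict(lambda: {"count": 0, "duration_ms": 0})
    for x, y, duration_ms in fixations:
        for aoi in aois:
            if aoi.contains(x, y):
                summary[aoi.name]["count"] += 1
                summary[aoi.name]["duration_ms"] += duration_ms
    return dict(summary)


# Hypothetical AOIs for the left set: three place-value columns and one number
aois = [
    AOI("left_hundreds", 100, 100, 140, 500),
    AOI("left_tens", 140, 100, 180, 500),
    AOI("left_ones", 180, 100, 220, 500),
    AOI("left_number_1", 100, 100, 220, 150),
]
# Hypothetical fixations as (x, y, duration in ms)
fixations = [(120, 120, 210), (150, 300, 180), (125, 130, 95), (400, 400, 250)]
aoi_summary = summarize_fixations(fixations, aois)
```

Here two fixations land on the hundreds column (and on the first number), one lands on the tens column, and the last falls outside every AOI and is ignored.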

Results
We begin with analyses of accuracy, followed by the confidence data, analyses of the eye tracking data, aggregated strategy data, and finally analyses of individual strategy use. Comparisons for gender yielded non-significant results, so gender was not considered a factor in the following analyses (see Section S1 in the Supplementary materials for full gender analyses). Grade is considered a between-subjects factor for the following analyses because sixth graders performed better than fourth graders, t(71.80) = 3.66, p < .001.
Accuracy. We considered a response to be accurate when a participant chose the set with the higher mean as the answer to which golfer, on average, hit the golf ball farther. The dependent variable was the total number of accurate responses (i.e., choosing the set with the higher mean) out of the 36 trials. We conducted a four-way mixed ANOVA with grade as a between-groups variable, resulting in a 2 (grade: 4th vs. 6th) x 3 (set size: 4, 6, & 8) x 2 (mean ratio: 4:5 vs. 9:10) x 2 (coefficient of variation: High vs. Low) analysis. The results indicated that accuracy was related to set properties as predicted (see Table 3 for detailed results). We briefly describe the results most relevant to the hypotheses described above and include effect sizes. There were main effects for grade, F(1, 79) = 86.36, p < .001, partial η² = .99; mean ratio, F(1, 79) = 46.97, p < .001, partial η² = .49; set size, F(1, 79) = 6.76, p < .001, partial η² = .20; and variance, F(1, 79) = 3.23, p < .001, partial η² = .09 (see Table 3 below for the descriptive statistics and section S2 in the Supplemental materials for full results). Tukey's HSD indicated that sixth graders (M = .90, SD = .08) were more accurate than fourth graders (M = .76, SD = .12; p < .001). In addition, children were more accurate for sets with a 4:5 mean ratio (M = 1.8, SD = .12) than a closer 9:10 mean ratio (M = 1.69, SD = .11; p = .01), set sizes of 4 and 6 (M = 1.67, SD = .12; p = .001) compared to sets of 8 (M = 1.59, SD = .13; p = .03), and with a smaller coefficient of variation (10% of mean) (M = 1.79, SD = .11; p < .001) than larger coefficient of variation (20% of mean) (M = 1.59, SD = .10; p = .01). A two-way interaction indicated that children were more accurate when comparing small sets with 4:5 mean ratios than larger sets with 4:5 mean ratios, F(1, 79) = 72.89, p < .001, partial η² = .21 (there were no other significant two-way interactions).
Finally, there was a three-way interaction between set size, mean ratio, and variance that demonstrated that children were more accurate with small sets with 4:5 mean ratio and low variance than large sets with 9:10 mean ratio and high variance, F(1, 79) = 72.89, p < .001, partial η² = .08 (see Table 3 and Figure 1 for details).
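To make the accuracy criterion and the manipulated set properties concrete, the following minimal Python sketch computes the mean and coefficient of variation of each set and identifies the correct answer for a single trial. The drive distances and function names are hypothetical illustrations rather than stimuli or code from the study, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def summarize_set(drives):
    """Return (mean, coefficient of variation) for one golfer's drives."""
    m = mean(drives)
    # Assumption: population SD; the source does not state which SD was used.
    return m, pstdev(drives) / m

def correct_side(left, right):
    """Accuracy criterion from the text: the set with the higher mean."""
    return "LEFT" if mean(left) > mean(right) else "RIGHT"

# Hypothetical set-size-4 trial (distances in feet), not actual stimuli.
left = [380, 420, 360, 440]    # mean 400
right = [480, 520, 450, 550]   # mean 500

left_mean, left_cv = summarize_set(left)
right_mean, right_cv = summarize_set(right)

# Ratio of the smaller mean to the larger mean; 0.8 corresponds to 4:5.
mean_ratio = min(left_mean, right_mean) / max(left_mean, right_mean)
print(correct_side(left, right), round(mean_ratio, 2))
```

In this illustrative trial, a response of RIGHT would be scored as accurate, and the two means stand in a 4:5 ratio.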

Confidence.
As with the accuracy data, we conducted a factorial ANOVA with three within-subjects factors resulting in a 3 (set size: 4, 6, & 8) x 2 (mean ratio: 4:5 vs. 9:10) x 2 (coefficient of variation: High vs. Low) analysis, with self-reported confidence ratings as the dependent measure (see Table 3 for descriptive data and S2 in the Supplemental materials for full results). There was no main effect of grade or gender and thus the analyses were conducted on all participants' data as one group. There was a main effect of mean ratio, F(1, 79) = 72.89, p < .001, partial η² = .43, indicating higher confidence given a larger, 4:5 mean ratio than a smaller, 9:10 mean ratio. A main effect of set size was also present, F(2, 78) = 5.34, p = .007, partial η² = .10, and a Tukey's HSD indicated a significant drop in confidence from sets of four to six and no difference between sets of six and eight. Confidence was greater when sets had a lower coefficient of variation (10% of mean) than a higher coefficient of variation (20% of mean), F(1, 79) = 11.88, p = .014, partial η² = .21. As expected, the mean ratio by coefficient of variation interaction was significant. Participants were more confident in their comparisons for sets with 4:5 mean ratios and low variance than sets with 9:10 mean ratios and high variance, F(1, 79) = 25.64, p < .001, partial η² = .29. No other interactions were significant.

Eye tracking analyses
We analyzed the eye tracking data in three ways. First, we analyzed the relation between the number of fixations and the properties of data sets. Second, we investigated the number of times children switched their attention between the two data columns as an index of difficulty. Third, we coded eye tracking patterns within each trial to determine strategy use.
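As an illustration of the second measure, the sketch below counts column switches from an ordered sequence of fixations labeled by the column they fell in. The labels, the example sequence, and the rule that every change of column counts as one switch are assumptions made for illustration; the source does not specify the exact counting procedure.

```python
def count_column_switches(fixation_columns):
    """Count transitions between columns in an ordered fixation sequence.

    fixation_columns: list of "L" or "R" labels, one per fixation, in the
    order the fixations occurred (hypothetical encoding).
    """
    switches = 0
    for previous, current in zip(fixation_columns, fixation_columns[1:]):
        if previous != current:
            switches += 1
    return switches

# Hypothetical trial: three fixations on the left set, two on the right,
# a brief return to the left, then back to the right.
sequence = ["L", "L", "L", "R", "R", "L", "R", "R"]
print(count_column_switches(sequence))
```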

Number of fixations.
A within-subjects ANOVA was used to investigate the number of fixations by data properties. We did not include duration results because they were highly correlated with fixation results. Not surprisingly, the number of fixations increased as set sizes increased, F(1, 79) = 12.296, p < .001, partial η² = .52. There was an interaction between mean ratio and variance. Specifically, there were more fixations for sets with 9:10 mean ratios and high variance than for sets with 4:5 mean ratios and low variance, F(2, 79) = 4.8, p = .008, partial η² = .37. There were no other significant interactions between confidence, mean ratio, and variance. There was also an interaction between set size and mean ratio, with more fixations for larger sets than for smaller sets only in sets with 9:10 mean ratios, F(2, 79) = 4.51, p = .012, partial η² = .12; there was no such difference for sets with a 4:5 mean ratio. No other interactions were significant.
Column Switches. The number of switches between columns was negatively related to mean accuracy, r = −.23, p = .049, suggesting that making fewer switches between columns was related to better accuracy. The results of a factorial ANOVA indicated that fourth graders had more column switches than sixth graders, F(1, 77) = 9.55, p = .001, partial η² = .24, and that the mean ratio by variance interaction was significant, F(1, 77) = 4.70, p = .033, partial η² = .11, for both grades (see Figure 2). Tukey's HSD indicated that, across all set sizes, sets with high (i.e., 4:5) mean ratios and low variance (10% of set mean) (M = 3.39, SD = .47) were associated with the fewest switches whereas sets with low (9:10) mean ratios and high variance (20% of set mean) were associated with the most switches (M = 5.08, SD = .73; p < .001). None of the other interactions were significant. The factorial ANOVA also indicated one main effect: there were fewer switches for sets with a 4:5 mean ratio than a 9:10 mean ratio, F(1, 77) = 34.40, p < .001, partial η² = .43, whereas there were no differences for coefficient of variation and set size (F's < 1).
Table 4 presents the proportion of strategies used during the task for all participants on all trials. There were 2,664 total trials across all participants and 4.4% of the trials were unable to be coded due to missing data. Two thousand five hundred fifty-seven (2,557) strategies were coded and 30% of trials were double coded by two hypothesis-naive coders using the coding criteria described in Table 2. The initial comparison yielded 97% reliability (κ = .83). All disagreements were resolved following a discussion with the first author. A chi-square analysis demonstrated a significant difference in the distribution of strategies by grade, χ²(1, N = 1630) = 67.13, p < .001 (see Table 4 for proportions and standardized residuals). The next analysis investigated accuracy by strategy and grade.
As above, the analysis compared accuracy by the total number of strategies (2,557) independent of participant (see Table 5). An ANOVA demonstrated significant differences in accuracy by strategy and grade, F(5, 240) = 7.3, p = .001, partial η² = .48. Tukey's HSD post hoc tests revealed that sixth graders (M = .90, SD = .08) were more accurate than fourth graders (M = .76, SD = .12; p = .003) regardless of strategy. As stated above, we could not conduct meaningful analysis for sixth graders due to the highly uneven distribution of strategies. Fourth graders were somewhat more accurate when using the comparison strategy (p = .01) than any other strategy.
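The two inter-coder reliability figures reported for the double-coded trials, percent agreement and Cohen's κ, can be illustrated with a short Python sketch. The strategy labels assigned to each coder below are invented for illustration and are not the study's coding data; the functions implement the standard definitions, and nothing here is taken from the authors' analysis scripts.

```python
from collections import Counter

def percent_agreement(codes_a, codes_b):
    """Proportion of trials on which the two coders assigned the same code."""
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)

def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(codes_a)
    p_observed = percent_agreement(codes_a, codes_b)
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    # Chance agreement: product of each coder's marginal rate, summed by code.
    p_expected = sum(freq_a[code] * freq_b[code] for code in freq_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical codes for ten double-coded trials (not the study's data).
coder_a = ["gist", "gist", "comparison", "high-low", "gist",
           "comparison", "gist", "gist", "high-low", "gist"]
coder_b = ["gist", "gist", "comparison", "high-low", "gist",
           "gist", "gist", "gist", "high-low", "gist"]

print(percent_agreement(coder_a, coder_b))
print(round(cohens_kappa(coder_a, coder_b), 2))
```

Because κ discounts agreement expected by chance, it is typically lower than raw percent agreement, particularly when one code dominates the distribution.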

Strategy analysis
To test the possibility that children might be summarizing the statistical properties that were weighted by attention to specific items (as in the comparison strategy), we examined accuracy on the only five trials in which the first number in each set was not diagnostic of the set. For example, we looked at cases where the set on the right had the higher average despite the first number in the left set being higher than the first number in the right set. If a child makes comparisons only based on the first two numbers in each set, he or she should answer incorrectly. However, if these features are merely weighted as part of averages, the child should answer correctly. As shown in Table 4, some children did demonstrate the comparison strategy via their eye tracking behavior. The results demonstrated that accuracy for children using the comparison strategy on these trials (32% accuracy) was lower than those using any other strategy (78%) on the same data sets. This finding demonstrates that these children were not summarizing the properties of the entire set while weighting the first item more heavily. Instead, some children were comparing single values from each set while effectively ignoring the set properties. This result provides evidence against the idea that children initially make use of expected strategies for summarizing the statistical properties of data sets.

Why were fourth graders less accurate when using the same strategies?
One interesting result is that fourth graders were less accurate in using the gist strategy than sixth graders. At the level of strategy identification, both ages demonstrated the same pattern of focusing on multiple numbers within the same column, then focusing on multiple numbers within the other column. The difference in accuracy suggests differences in the fidelity of strategy use. To investigate this, we conducted an analysis of two features that children attended to within each strategy. One, we identified the proportion of numbers within each column the child attended to within each trial. This may influence accuracy because attending to a higher proportion of numbers (e.g., attending to all numbers) should be associated with higher accuracy than attending to a smaller proportion of available numbers. This was coded by using the fixation count tool in Tobii Studio to identify the pattern of fixations within the AOIs in each column for each strategy. These fixations were counted to identify whether a number in a set had been attended to and then to determine a proportion of attended numbers. For example, in a set of eight numbers, if the AOIs of five numbers were attended to, the proportion for that column would be 0.625. Two, we identified place values attended to within each column. As mentioned earlier, a focus on hundreds place values will be more diagnostic than a focus on ones place values. This was coded by using the fixation count tool in Tobii Studio to identify fixations within the AOIs around the place value of each number. We totaled the fixations across the three place values and divided the number of fixations for each place value by that total to determine a proportion for that column. For example, if a participant made 16 fixations within a column, 8 in the hundreds place, 4 in the tens, and 4 in the ones, then this would be coded as .5, .25, and .25, respectively, with 50% of fixations on the hundreds place, and 25% each on the tens and ones.
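The two fidelity measures described above reduce to simple proportions, as the following Python sketch shows using the worked examples from the text. The function names and the per-number fixation counts are hypothetical, and treating a number as attended once it receives at least one fixation is an assumption, since the exact threshold is given only in the coding details.

```python
def proportion_attended(fixations_per_number):
    """Share of numbers in a column that received at least one fixation.

    Assumption: a single fixation inside a number's AOI counts as attended.
    """
    attended = sum(1 for count in fixations_per_number if count > 0)
    return attended / len(fixations_per_number)

def place_value_proportions(hundreds, tens, ones):
    """Share of a column's fixations falling on each place value."""
    total = hundreds + tens + ones
    return hundreds / total, tens / total, ones / total

# Set of eight numbers in which five AOIs received fixations
# (the individual counts are invented for illustration).
column_counts = [2, 1, 0, 3, 0, 1, 0, 2]
print(proportion_attended(column_counts))   # 0.625

# Sixteen fixations: 8 on hundreds, 4 on tens, 4 on ones.
print(place_value_proportions(8, 4, 4))     # (0.5, 0.25, 0.25)
```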
The results of these analyses are displayed in Tables 6 and 7 below. The results demonstrate two age-related differences in the fidelity of strategy use. One, fourth graders using the gist strategy attended to significantly fewer numbers than sixth graders using the same strategy. There were no significant differences for other strategies (see Table 6 below). Two, fourth graders attended to less diagnostic features of the three-digit numbers themselves, specifically, the tens and ones place values, than sixth graders. Sixth graders attended to the hundreds place value, the most diagnostic place value, more than fourth graders (see Table 7). Taken together, these results clarify why fourth graders' use of the gist strategy resulted in less accurate comparisons. Fourth graders used the gist strategy but attended to fewer numbers and were more likely to attend to less diagnostic number features, specifically the ones place value, than sixth graders, who focused on more numbers in each set and more diagnostic features of the numbers themselves (i.e., hundreds place value). In other words, sixth graders encoded a more comprehensive gist representation, leading to more accurate estimates.
Individual consistency. An important question is the degree to which participants were consistent in their use of a particular strategy. This is important because strategy variability is associated with partial, and often transitional, problem solving approaches (Siegler, 2007). The first analysis measured the degree to which participants used one strategy consistently throughout the task. We coded consistency as the use of one strategy on at least 75% of trials. We then compared the differences between strategy use and grade (see Figure 3). The results indicated that the distribution of strategies significantly differed from chance, χ²(1, N = 79) = 14.4051, p = .006. We used standardized residuals to determine significant differences between cells because these values are similar to z-scores. The results indicated that more sixth graders (1.53) than fourth graders (−1.49) were classified as using the gist strategy consistently while more fourth graders (1.99) than sixth graders (−2.09) were classified as using no strategy consistently. Accuracy was then compared between participants using the gist strategy consistently and those not using any strategy consistently. Results indicated no significant differences in accuracy between sixth graders using the gist strategy consistently (M = .93, SD = .18) and sixth graders using no strategy consistently (M = .88, SD = .22), t(78) = .774, p = .306. Fourth graders coded as not using a strategy consistently (M = .79, SD = .23) were significantly more accurate than fourth graders who consistently used the gist strategy (M = .70, SD = .16), t(78) = 1.88, p = .033. Sixth graders were highly consistent in strategy use across the problem set but fourth graders' strategy use was much more variable. This may suggest that most sixth graders' strategy use reflected an understanding of which strategy was most effective and under what conditions it was most effectively implemented.
In contrast, the fourth graders who consistently used the gist strategy were accurate most of the time, but still often erred in their conclusions, even more than those who used a variety of strategies. This result is consistent with prior work on children's knowledge about averages and variability that suggested that children may, at least initially, have a partial understanding of statistical concepts such as averages (Watson & Moritz, 1999). It also suggests the fourth graders' variation in strategy use may have been an effective approach for sets with relatively large differences but much less effective for less distinct sets.
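The consistency rule used in this analysis (one strategy on at least 75% of trials) can be sketched in Python as follows. The strategy labels, the per-trial lists, and the "none" category name are hypothetical; how trials with multiple strategies or missing codes entered the classification is not specified here, so the sketch assumes one coded strategy per trial.

```python
from collections import Counter

CONSISTENCY_THRESHOLD = 0.75  # "at least 75% of trials", as stated in the text

def classify_consistency(trial_strategies):
    """Return the consistently used strategy, or "none" if no strategy
    accounts for at least 75% of the participant's coded trials."""
    strategy, count = Counter(trial_strategies).most_common(1)[0]
    if count / len(trial_strategies) >= CONSISTENCY_THRESHOLD:
        return strategy
    return "none"

# Hypothetical participants with 36 coded trials each.
mostly_gist = ["gist"] * 30 + ["comparison"] * 6
varied = ["gist"] * 12 + ["comparison"] * 12 + ["high-low"] * 12

print(classify_consistency(mostly_gist))  # gist
print(classify_consistency(varied))       # none
```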

Multiple strategy use
The next analysis investigated whether children were using multiple strategies within a single trial. Multiple strategy use has been documented in many domains (e.g., mathematics, spelling, playing tic-tac-toe; Crowley & Siegler, 1993; Siegler, 2007). The first analysis was to determine the average number of strategies that participants used across trials. Sixth graders used a mean of 1.89 strategies (SD = 1.03) and fourth graders used a mean of 2.61 strategies (SD = 1.04) across the task, t(79) = 3.09, p = .003, with 59% of sixth graders and 90% of fourth graders using more than one strategy on at least one trial.
We next investigated the relation between the use of multiple strategies and problem features. The use of multiple strategies is likely to occur on more difficult problems than simple problems. We have suggested that because mean ratio is the most salient feature, that problems with high mean ratios should be simpler than those with low mean ratios. In addition, based on previous research (Morris & Masnick, 2015), we expected variance to interact with mean ratios in making sets with large mean differences and low variance more easily discriminable than sets with a small mean difference and high variance. To investigate this, we compared the relation between problem features and the use of multiple strategies across ages.
We identified trials on which multiple strategies were used and identified the problem features of each trial. We then calculated the mean number of multiple strategy uses for each problem type. No formal analyses were run for sixth graders due to the small number of observations (25 trials). For fourth graders, a within-subjects ANOVA indicated that the use of multiple strategies differed significantly by problem features. Multiple strategies were more frequent for sets with low mean differences than for sets with high mean differences, F(1, 79) = 5.6, p < .014, partial η² = .14. There was an interaction between mean ratios and variance, F(2, 79) = 3.4, p < .037, partial η² = .12. A Tukey's HSD post hoc test demonstrated that there were more strategies used for sets with 9:10 mean ratios and high variance (M = 2.3, SD = .22; p < .001) than sets with 4:5 mean ratios and low variance (M = 1.4, SD = .17; p < .001). No other differences were statistically significant. We also asked participants to self-report their strategies. These data did not provide additional information beyond the eye tracking data but are presented in the supplementary materials (see section S3).

Accuracy for averaging
The second part of the post-trial survey presented four averaging questions. These questions provided number sets and four response options. Participants were asked to select the average of each set. Fourth graders (M = 1.02, SD = 1.1) chose correct answers significantly less frequently than sixth graders (M = 2.49, SD = 1.36), t(78) = 5.3, p < .001, d = 1.18. We then analyzed the extent to which calculation accuracy explained unique variance in comparison accuracy. The results indicated that a small amount of variance in comparison accuracy was explained by calculation accuracy (r² = .126, b = .35, t(78) = 3.35, p < .001), suggesting that greater knowledge of calculating a mean provided a small, but significant, benefit for the accuracy of informal comparisons. In addition, a small amount of variance in confidence was explained by calculation accuracy (r² = .15, b = .39, t(78) = 4.1, p < .001), suggesting that greater knowledge of calculating a mean increased confidence in informal comparisons.

Discussion
Our experiment investigated how US children make sense of numerical data by investigating the strategies children used to summarize the properties of number sets. Accurately summarizing sets requires the use of effective strategies that direct attention to a sufficient proportion of diagnostic features. Our results extend knowledge of how US children make sense of data and their emerging understanding of how individual numbers are related to each other in sets in three ways. One, the data demonstrate that children initially use a variety of strategies to summarize the statistical properties of data sets. This pattern suggests that the use of expected strategies with fidelity is the result of a developmental and educational progression. Two, our results indicated that the statistical properties of data sets were a strong predictor of child accuracy and confidence suggesting that, at least for highly distinct sets, children's initial strategies may be sufficient to compare sets accurately. Three, we identified a series of strategies US children used to compare data sets and coded the consistency of their use. Our strategy coding was derived from eye tracking data during task trials and provides a measure of process tracing, or identifying the elements to which children attend in real-time as they solve problems (Schulte-Mecklenbeck et al., 2017). Our results demonstrated that several of the younger children used multiple strategies, often within the same trial, to make sense of the data.
Adults are quite adept at summarizing the statistical properties of sets (Whitney & Yamanashi Leib, 2018). Summarizing these properties from complex information provides useful information about the environment and reduces the overall processing burden (Alvarez, 2011). Although children can summarize statistical properties in some conditions (e.g., the average size of object sets, Sweeny et al., 2015), they appear to have difficulty with others (e.g., identifying the center from a dot array, Jones & Dekker, 2018). Regardless of context, children need to attend to relevant information in order to create accurate summaries. Our results add to this literature by describing and evaluating the strategies children use to summarize and compare data sets. Sixth graders were highly accurate in their comparisons and consistent in their strategy use. Specifically, nearly all used a gist strategy in which they attended to the most diagnostic features of the entire (or nearly entire) set from which they could summarize an approximate average for comparisons. In contrast, fourth graders were less accurate and less consistent in their strategy use than the older children. Many of the strategies they used were characterized by attention to part of a set or to one feature of the set properties (e.g., highest value). This suggests that younger children may not yet understand that summarizing set properties necessitates attending to the entire data set (similar to the results of Jones & Dekker, 2018). At the same time, younger children who used comparisons and a variety of strategies tended to be more accurate than those who used the gist strategy most consistently.
Why might children vary their strategies more and be able to summarize the statistical properties from some types of stimuli but not others? It may be that the stimuli used by Sweeny et al. (2015) provided supports that helped children understand the relations between elements. Specifically, in the Sweeny et al. (2015) experiment, children saw four oranges in one tree and four oranges in a different tree. The object sets were in close physical proximity and were conceptually and contextually related. These cues may have provided support that the objects were meaningfully related. In contrast, though the number sets used in the present experiment shared perceptual similarity, their relation to the concepts being measured may have been less familiar, weakening the cues that guide attention to relevant features and the summary of statistical properties. Further, the variation in the sets may have led children to differing strategies on different sets. Future research should investigate this potential limitation by investigating children's performance across different types of stimuli to measure the role of contextual support in performance.
The statistical properties of data sets emerge from the numbers in the set but are not present in any individual number. The results indicated that aggregated accuracy and confidence were closely related to the statistical properties of data sets, particularly when children directed their attention to the relevant features of the numbers in the set (e.g., using a gist strategy). The ratio of means was the strongest predictor of accuracy and confidence, indicating that relatively large differences between sets were easier to detect than smaller differences, regardless of strategy. Accuracy and confidence decreased as set sizes increased, suggesting that representing properties was easier for smaller sets. Variance was associated with main effects for accuracy and confidence and consistently interacted with mean ratios and set sizes by magnifying their effects, consistent with previous findings (e.g., Obrecht et al., 2007).
Eye fixation data provided converging evidence that set properties had a strong influence on performance. For example, participants made a greater number of fixations on numbers in the sets with small mean differences compared to sets with larger mean differences. This suggests that, in the aggregate, more attention was recruited to distinguish less distinct sets (e.g., sets with smaller mean differences and high coefficient of variation). Finally, participants made more column switches when comparing less distinct sets and accuracy decreased as the number of switches increased. These results suggest that less distinct sets cued the search for additional information during comparisons. A potential alternative explanation is that the increase in the number of column switches might indicate information processing overload. However, evidence from previous research suggests that information processing overload is associated with fewer fixations and gaze aversion, rather than increased fixations (Doherty-Sneddon & Phelps, 2005).
Younger children were less consistent in their strategy use than older children. Specifically, while over half of sixth graders consistently used the gist strategy (i.e., used a single strategy on at least 75% of trials), the majority of fourth graders used no strategy consistently. Previous research has suggested that the use of multiple strategies may indicate a transitional point in the learning process (Siegler, 2005). The strategies used by fourth graders also suggested a focus on individual numbers, rather than attention to set properties. For example, most fourth graders used a comparison or high-low strategy on at least one trial, which demonstrates more attention to individual numbers than to set properties. Comparing only the highest value in each set suggests a belief that such values carry information about the relations between numbers in a set, while focusing on the first value in the set suggests a belief that any value is equally diagnostic. Either strategy ignores other potentially relevant set information such as means and variance. Finally, even when younger children used potentially more effective strategies like the gist strategy, they often used them ineffectively. The data demonstrate that this ineffective use is related to attending to a smaller proportion of numbers and non-diagnostic number features (e.g., more attention to the ones place value than the hundreds place value). At the same time, fourth graders who used more varied strategies were more accurate than those who used primarily a gist strategy. This may be due to properties of the sets themselves: when the first values are aligned with the means, the comparison strategy leads to an accurate answer, so children may have treated it as a more efficient strategy, and it led to accurate answers with many of the data sets.
Children also used multiple strategies within a single trial. The use of multiple strategies fits the overlapping waves theory (Siegler, 2016) in which children's thinking is variable, children may use multiple strategies, and an initial approach might not yield a definitive result. In the absence of an immediate solution, reasoners might try the same approach again or might change their approaches. Our results provide evidence that fourth graders were more likely to use multiple strategies within a trial than sixth graders. One possible explanation is that younger children were not using strategies with fidelity (Miller, 2000; Waters, 2000). For example, a child might use the gist strategy but might attend to only a few numbers rather than the entire set. In such a case, he or she might not see a clear difference between sets because several of the numbers in each set might overlap. Our data support this in that sixth graders made more accurate comparisons than fourth graders using the same strategies. At the same time, overall the fourth graders did better when using multiple strategies than when using just the gist strategy, likely in part because of their low-fidelity use of the gist strategy. The results demonstrated that younger children attended to fewer numbers than older children when using the gist strategy and attended to less salient features (e.g., looking at the ones place value). These factors would result in summaries that were less accurate and could lead to erroneous comparisons.
The use of multiple strategies was related to the properties of the number sets. Multiple strategies were more likely to be used within trials with low mean differences compared to trials with high mean differences. There was an interaction between mean difference and variance in that multiple strategies were most likely to be used for sets with low mean difference and high variance. Further, we found a developmental difference, in that fourth graders used more strategies overall and more strategies within a single trial than sixth graders.
The results provide support for the idea that summarizing the statistical properties of data is the result of a developmental and educational progression. The evidence suggests that fourth graders were less likely than sixth graders to attend to salient features and were more likely to use multiple strategies across the set of trials. The latter reflects a child trying out different approaches for a novel problem and is consistent with the overlapping waves model (Siegler, 2016), in which children accrue data from strategy use that allows them to refine the likelihood of future strategy use. As demonstrated in the results on individual numbers and set properties, sixth graders were more likely to attend to diagnostic features of both (e.g., attending to most, if not all numbers in a set). This helps to explain why sixth graders used the gist strategy consistently while fourth graders used a variety of strategies across and within trials. The converging evidence from mean calculation accuracy supports this suggestion in that children with experience calculating the statistical properties of data sets may have more knowledge about the diagnostic properties of numbers in sets.
Finally, the results add to our understanding of how children become more adept at summarizing complex information. Our results suggest that summarizing complex information required visual attention to a sufficient amount of diagnostic information to represent set properties accurately. The focus on identifying and measuring changes in strategy use helps us understand, at a fine grain size, children's attentional patterns as they summarize complex information. Future research, using similar eye tracking measures, could help identify strategies children use to create ensembles.
The results suggest two educational implications. One, because young children enter classrooms with limited knowledge about data, providing experiences that add to their background knowledge may improve learning outcomes. Allowing children to explore data before instruction may be beneficial because it may help focus their attention on relevant features, which prepares children to learn (Schwartz & Martin, 2004). This may be especially useful when exploration provides comparisons between sets to highlight relevant statistical properties such as variability within sets. Two, it may be easier for children to detect the relevant properties of data from graphical representations than only through the presentation of numerical information (Garfield & Ben-Zvi, 2007). Properties of data (e.g., variance) may be more salient in a graphical representation than raw numerical representations (Schwartz & Martin, 2004). For example, college students learned more about probability distributions when they played a video game that provided dynamic, graphical information about distributions than when they read information about such distributions (Arena & Schwartz, 2014), and a similar effect may occur with children.
One limitation of this study is that we measured only one component of data sensemaking, comparisons. The strategies used to perform comparisons may be dissimilar to those used for other tasks, and thus caution should be used when generalizing to similar tasks. Other possible options for measuring data sensemaking, including generating approximate means and generating predictions from data, should be investigated in future research to determine the extent to which the strategies identified in our results are indicative of more general data sensemaking skills. A second limitation is that no data were collected to measure other, potentially related, numerical and mathematical skills. One example is the accuracy of children's number representations. As suggested in the introduction, summarization may occur by aggregating multiple approximate number representations. However, the accuracy of summary values is likely dependent on the accuracy of the underlying number representations. When underlying representations are highly inaccurate, summaries of these values are also inaccurate (Morris, Todaro, Arner, & Roche, 2022). Thus, future research should measure the accuracy of individual representations. In addition, more robust assessments of children's mathematical abilities might be useful in determining math skills and number sense that may contribute to data sense. A third potential limitation is the use of eye tracking to determine strategies. Although the fixation patterns themselves constitute objective data, the interpretation of the patterns is not objective. Future research that makes use of fixation patterns would be stronger with converging measures that bolster their interpretation by excluding potential alternative explanations. That said, the use of fixation patterns in this context was useful for identifying strategies and may be a useful approach for future research.

Conclusions
Children are driven to make sense of data. One way that adults and children make sense of data is by quickly summarizing the statistical properties of data sets. Our results suggest that although young children use many strategies to make sense of data, they sometimes use ineffective strategies and sometimes use effective strategies with low fidelity. As a result, children may make sense of data by focusing on part of a set (e.g., individual numbers) or values that provide a limited characterization of the set (e.g., highest value). Overall, children were more accurate, more confident in their evaluations, and produced fewer fixations as sets became more distinctive, that is, sets with large ratios of mean difference, low coefficients of variation, and small set sizes. We identified children's data strategies from their eye tracking patterns. Our results demonstrated that most sixth graders consistently used a strategy characterized by attending to the most diagnostic properties of the entire set. Fourth graders used more strategies across trials, were less likely to use strategies with fidelity, and were more likely to use multiple strategies within a single trial, suggesting that fourth graders are acquiring the foundational intuitions about the emergent properties of numbers in sets. These results suggest that a clearer understanding of children's strategies may improve our understanding of the cognitive processes underlying making sense of data and may suggest efficacious educational interventions. The results also extend our understanding of how improvements in children's strategy use help them improve their summaries of the abundant information in their environments. Finally, the results suggest possible targets for educational interventions that support children's attention to relevant problem features as well as the acquisition and selection of effective data strategies.