An innovative cycle-based learning approach to teaching with analog sandbox models

Abstract
Scaled analog modeling (“sandbox modeling”) allows deformational processes, such as the development of a mountain belt, to be observed in real time in a classroom setting. However, the actual learning gains from exposure to sandbox modeling in geology courses in higher education settings have not been explicitly studied. We begin to investigate the possible effects of incorporating a sandbox modeling activity on geologic understanding in an upper-level tectonics class. The designed activity utilized a cycle-based learning approach, where the 11 participating students predicted outcomes of different deformation experiments and then evaluated and revised their predictions in the light of their experimental observations. Scored predictive sketches and a spatial visualization test administered before and after the sandbox activity demonstrate improvements in geological understanding of deformation, the influence of different mechanical properties on deformation style, and penetrative thinking skill. The observed gains were particularly marked for students who had poorly developed penetrative thinking skills prior to the activity. These results indicate that the use of sandbox models in the classroom may have a measurable effect on penetrative thinking skills and geologic understanding, particularly in students with less expertise. However, further study is required to test if these effects can be reproduced, and shown to be statistically significant, in larger groups of students.


Introduction
Scaled analog modeling ("sandbox modeling") allows the recreation of tectonic deformation at a tabletop scale; in an educational context, it allows plate boundary processes, such as the development of a mountain belt, to be observed in real time in a classroom setting (Castello & Cooke, 2008). Modeling is one important way for geology researchers to gain broader spatial and temporal perspective for field observations (e.g., Bateman et al., 2020) and can support student learning (e.g., Woods et al., 2016).
The instructor of an upper-division undergraduate and graduate tectonics course was dissatisfied with the observed learning outcomes when using analog sandbox models to develop students' understanding of the patterns and processes of tectonic deformation, a key course learning objective. The instructor's original implementation of sandbox modeling involved groups designing and running their own experiments. Although the activity was popular with the students in the tectonics course, basic errors and omissions in their experimental write-ups indicated that they were struggling to identify and observe the evolution of key structural features, and were failing to accurately visualize the relationship between the observed deformation and the applied strain. This was supported by later assessments in the course that relied upon these critical skills, such as drawing a representative cross-section across a mountain belt after a class field trip. Castello and Cooke (2008) argued that sandbox modeling may increase student understanding of extensional and contractional faulting, fault development, and the influence of different material properties on structural development, but surprisingly, there is little published research studying student learning gains from the use of sandbox models beyond affective outcomes. The disappointing outcome of their original implementation of a sandbox activity led the instructor to hypothesize that a more structured approach might lead to more meaningful learning gains. The present investigation outlines this approach whilst also targeting a gap in the literature, by beginning to explore the effects of using sandbox models on student learning.
The activity developed for this study uses the cycle-based learning approach outlined by Davatzes et al. (2018). This approach is explicitly designed to overcome student misconceptions about geological processes by giving instant experiential feedback: students repeatedly predict an answer to a question and compare their answer to the correct one before being given the next question. This gives students the opportunity to continuously refine their mental models (the thought processes that they use to represent how a concept or process operates in reality) in response to the new information presented in each cycle. Amongst the studies summarized by Davatzes et al. (2018), Gagnier et al. (2017) indicated that sketching predictions of cross sections through 3D geologic block diagrams can promote spatial thinking over and above visualizing the interior of a model without sketching. Based on this research, we designed a sandbox activity with a "predict-observe-revise" cycle, which offered students an opportunity to identify and modify any misconceptions in their understanding of deformational processes with instructor support. Students were asked to sketch predicted cross-sections through different sandbox models, to observe the "correct" answers by running experiments in class, and then immediately to incorporate their observations into revised predictions of experiments that were not run. The primary goal of our new activity is that by its end, students will demonstrate an improved ability to predict deformational features consistent with the applied strain in the sandbox model experiments.
Sandbox modeling also has potential as a learning tool for developing spatial reasoning skills, a key goal of geological education (e.g., Libarkin & Brick, 2002; Manduca & Kastens, 2012). In particular, the opportunity to observe structures developing in a 3D volume, and to directly observe how subsurface deformation, viewed in cross-section, is translated into surface topography, is relevant to the development of penetrative thinking, the ability to visualize spatial relations inside an object. Two common instruments for testing penetrative thinking skill are the Planes of Reference Test (e.g., Titus & Horsman, 2009), where students choose the shape of intersection of a slicing plane with a geometric solid, and the Geologic Block Cross Sectioning Test (Ormand et al., 2014), where students select the correct vertical cross-section through a geologic block diagram. Although these tests are not precisely equivalent, Ormand et al. (2014) demonstrated a "moderately strong correlation" between scores for the Planes of Reference and Geologic Block Cross Sectioning Tests.
Both of these tests have been used to examine the evolution of spatial reasoning skills during an undergraduate geology degree. Titus and Horsman (2009) tested students before and after taking Introductory Geology, Tectonics, and Structural Geology classes at Carleton College, and documented an overall increase in average visual penetrative skill after they took these courses; they also noted that average scores were higher in the middle- and upper-level courses compared to Introductory Geology. In a similar study of students taking geology at three different institutions, Ormand et al. (2014) found modest gains of about 10% between pre- and post-test scores within courses and higher average scores in more advanced courses, but also observed that the skill level of individual students in advanced courses remained highly variable. Hannula (2019) documented improvements in penetrative thinking skill during a sophomore field methods course at a four-year public college that were retained through the beginning of a structural geology course nine months later. In addition to monitoring changes in spatial reasoning skills over multiple courses and semesters, the effects of specific instructional changes on the development of spatial reasoning skills can also be tested. For example, Giorgis (2015) recorded increased gains between pre- and post-test scores after adding a Google Earth exercise to a Structural Geology class.
The secondary goal of our new sandbox modeling activity is that by its end, students will increase their general penetrative thinking skill, as measured using a modified version of the Geologic Block Cross Sectioning Test.

Study participants and setting
The study was conducted at a large research university in the Midwestern United States. The three-credit hour course, Tectonics and Orogeny, was taught in-person over 15 weeks, meeting in the mornings twice per week for a total of 45 contact hours. We report on 11 students (all White, eight females, three males) from this class, with varying levels of geological experience (three graduate students pursuing a Master of Science in Geology; six undergraduate Juniors and Seniors pursuing a Bachelor of Science in Geology, all of whom had taken a previous course in structural geology; two geology minors who had not taken structural geology).

Design and implementation of sandbox modeling activity
The instructor paired with a science education expert to design the sandbox modeling learning experience. The sandbox activity was a self-contained unit scheduled after a module on the tectonics and evolution of different kinds of plate boundaries; this included general patterns of faulting and folding generated by extensional and compressional strain, and how these patterns are modulated by different rheologies within and between different plate boundary regions. The sandbox activity occurred directly before a mandatory three-day class field trip to the Appalachians. The field trip was explicitly linked to the sandbox activity by incorporating experiments that illustrated the different structural styles observed in different parts of the Appalachians, such as thin-skinned thrusting over a weak layer (Appalachian Plateau) and thick-skinned thrusting involving strong basement (Blue Ridge).

Sandbox model
The analog sandbox used (Figure 1) features a wide moving base plate that is pushed underneath a fixed backstop using a manually operated screw mechanism. Deforming layers of sand and/or other materials are confined by transparent perspex sheets on each side. For the class activity, a plywood separator was inserted at the midpoint between the two perspex sheets. This allowed two different experiments (each ∼300 mm wide) to be run in parallel, so that students could directly compare the response of different combinations of materials to the same input deformation. Experiments were set up by building up successive layers, typically ∼5 mm thick, of different materials between the side walls:
• sieved, colored art sand, with colors alternated to allow developing structures to be easily observed;
• sieved confectioner's sugar, for stronger, more cohesive layers;
• glass microbeads, for weaker layers.
During experimental runs, the base plate was slowly advanced in 1-2 cm steps, until a stable end-state was reached. After each step, digital photos were taken from above and on each side and were uploaded to a shared folder to complement students' notes and sketches.

Activity design
The sandbox modeling activity took place over a 2 1/2-week period toward the end of the semester. Two cycles of experiments situated in the predict-observe-revise framework were undertaken (Figure 2). Each cycle took two classes, with the first set of experiments focusing on extensional deformation, and the second on convergence. Within each cycle, the students were given four possible experimental set-ups (Table 1) and asked to individually sketch predictions of the structures that would develop on standardized worksheets (see Supplementary material). A vote was then held on which two experiments would be run in parallel in the next class. The two experiments selected were run as a group experiment, with the students progressing the model by increments, discussing (with guidance from the instructor) the changes that were occurring, and recording observations on provided logging sheets. At the end of the session the students filled in a worksheet where they (i) sketched the final state of the two experiments and compared it to their original predictive sketches, and (ii) were given the chance to sketch a revised prediction of what might happen in the other two experiments in the light of their observations. Space was explicitly provided for students to record their observations on how the experimental results differed from their predictions, and to reflect on how what they had seen affected their revised predictions. Following submission of their worksheets, photos from the other two experiments, run by the instructor outside of class time, were made available on a shared folder; however, student access was not tracked.

Evaluation
The activity began and ended with students taking a shortened version (10 questions, 15-minute time limit) of the Geologic Block Cross-Sectioning Test (referred to hereafter as the "block model test"; Ormand et al., 2014). The test was shortened because the recommended time limit for the full test was too long to easily fit in with the rest of the scheduled activity; the questions that were removed involved visualizing cross-sections with intrusive dikes, which were less relevant to the sandbox experiments than the folds and faults in the questions that were retained.
The four worksheets collected over the course of the activity were designed to record student thinking over a cycle of prediction, observation, and revision. Six predictive sketches for each cycle (four predictions prior to the experiments, two after the experiments for the set-ups that were not run) were rated using a score of 1-4 for how closely the predicted deformation was compatible with the applied strain (Figure 3), and consistent with the different mechanical properties of different layers of the models (Figure 4). Development of the scoring rubric and initial score assignment was performed by the instructor, with the education expert providing several rounds of feedback, focusing on a number of key examples, until a consensus was reached on both the rubric and scoring.
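One way to organize the rated sketches for analysis is sketched below in Python. The record layout, field names, student IDs, and example values are illustrative assumptions, not the study's data or workflow. Only the 1-4 scale, the two scored dimensions (strain and mechanical response), and the initial and revised stages within each cycle come from the text.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class SketchScore:
    """One rated predictive sketch (hypothetical record layout)."""
    student: str      # anonymized ID (assumed)
    cycle: int        # 1 = extension, 2 = convergence
    stage: str        # "initial" or "revised"
    experiment: int   # set-up number from Table 1
    strain: int       # rubric score, 1-4
    mechanical: int   # rubric score, 1-4

    def __post_init__(self):
        # Reject scores outside the 1-4 rubric range.
        for name in ("strain", "mechanical"):
            value = getattr(self, name)
            if not 1 <= value <= 4:
                raise ValueError(f"{name} score must be 1-4, got {value}")
        if self.stage not in ("initial", "revised"):
            raise ValueError(f"unknown stage: {self.stage}")


def mean_score(records, cycle, stage, dimension):
    """Mean over submitted sketches only; None if nothing was submitted."""
    values = [getattr(r, dimension) for r in records
              if r.cycle == cycle and r.stage == stage]
    return mean(values) if values else None


# Made-up values, for illustration only.
example = [
    SketchScore("S01", 1, "initial", 1, strain=2, mechanical=3),
    SketchScore("S01", 1, "initial", 2, strain=1, mechanical=2),
    SketchScore("S02", 1, "initial", 1, strain=3, mechanical=3),
]
print(mean_score(example, 1, "initial", "strain"))
```

Averaging over submitted sketches only is one possible way to handle students who missed a session; whether the study handled missing sketches this way is an assumption.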
In addition to grouping students based on their degree of previous geologic experience, the study population was also subdivided into two groups, based on performance in the block model pretest: the four minor and BS students who scored less than 8 out of 10, and the seven BS students and MS students who scored 8 or more. The cutoff score of 8 was selected based on a break in the score distribution. Three of the six BS students missed one of the four sessions and thus only fully completed one of the predict-observe-revise cycles rather than two. All of the evaluated metrics (pre- and post-block model test scores, and the changes in the sketch scores) indicate that the BS student population was also the most variable, but no particular features (such as an improving vs static/deteriorating trend) were clearly associated with these three students compared to those who had completed the two full cycles, so they were not treated as a separate group for analysis.
Student responses to two questions in an end-of-course written reflection were analyzed for their perspective on the impact of the sandbox models on their thinking, if any:
• To what extent did your ideas about what would happen under the different modelling conditions change as we performed the experiments? Please explain your reasoning.
• How, if at all, did you use the modeling results to interpret the deformation of the Appalachians [on the required class field trip]?
The researchers examined the student reflections independently to inductively analyze the comments. Open coding was followed by higher level coding to develop themes. Discrepancies were resolved through discussion. This resulted in three initial themes, which were checked against the data. Through discussion, the researchers agreed to make one theme a subtheme of another. This resulted in two final themes:
1. The sandbox models helped students to understand deformational processes and make better predictions (subtheme: some students attributed this change to being able to examine a physical artifact and visualize the processes and resulting structures).
2. The models helped students to identify issues with their reasoning.

Figure 2. Visualization of the predict-observe-revise framework used in the sandbox exercise. Students make predictive sketches of four proposed deformation experiments, observe the results of two of those experiments, and are then asked to revise their predictions of the two experiments that were not run.

Table 1. The four experimental set-ups offered to the class; these set-ups were subjected to extension in cycle 1 and convergence in cycle 2.
Experiment no. | Cycle 1 (extension) | Cycle 2 (convergence)
1 | Layered sand only (control) | Layered sand only (control)*
2 | Middle layer of (weak) glass microbeads within layered sand.* | Middle layer of (weak) glass microbeads within layered sand.
3 | Basal layer of (strong) confectioner's sugar beneath layered sand. | Basal layer of (strong) confectioner's sugar beneath layered sand.

Results
Overall, there were improvements in most individual students' predictions of the outcomes of the sandbox deformation experiments, as represented by the coded sketch scores (Figures 5 and 6, Tables 2 and 3), and their penetrative spatial thinking skill, as represented by the block model test scores (Figure 7). Student comments also reflected this change (Table 4). The small sample size (n = 11 before division into subgroups) precludes meaningful statistical analysis other than descriptive statistics.

Improvements in quality of predictive sketching
Mean scores for the students' predicted strain and mechanical responses on their sketches at the beginning and end of both predict-observe-revise cycles are plotted in Figures 5 and 6, subdivided according to previous experience (Figure 5) and performance on the block model pretest (Figure 6). Score distributions for the predictive sketches are listed in Tables 2 and 3, grouped by experience and pretest performance, respectively. When all students are considered together, the mean score for strain predictions increased from 1.9 out of a possible 4 points (Figure 3) at the beginning of cycle 1, to 2.3 at the end of cycle 1, then to 3.0 at the beginning of cycle 2, with no further increase within cycle 2. The mean score for mechanical predictions was ∼2.4 out of a possible 4 points (Figure 4) within cycle 1, with a marginal (<0.1 points) increase at the end of the cycle; the mean score then increased to 3.0 at the beginning of cycle 2, before falling to 2.8 at the end of cycle 2.
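The stage-to-stage changes in these all-student means can be tabulated with a short Python sketch. The strain values are the reported means. The mechanical values are approximate, because the text gives ∼2.4 for both ends of cycle 1 with an increase of under 0.1, and that increase is rounded away here. The function and variable names are hypothetical.

```python
# Reported all-student mean sketch scores (out of 4) at the four stages.
STAGES = ["cycle 1 start", "cycle 1 end", "cycle 2 start", "cycle 2 end"]
strain = [1.9, 2.3, 3.0, 3.0]
mechanical = [2.4, 2.4, 3.0, 2.8]  # approximate: "~2.4" with a <0.1 rise in cycle 1


def step_changes(scores):
    """Difference between each stage and the one before it."""
    return [round(later - earlier, 2)
            for earlier, later in zip(scores, scores[1:])]


def largest_step(scores):
    """Label and size of the largest single increase."""
    changes = step_changes(scores)
    i = max(range(len(changes)), key=changes.__getitem__)
    return f"{STAGES[i]} -> {STAGES[i + 1]}", changes[i]


for name, scores in (("strain", strain), ("mechanical", mechanical)):
    print(name, step_changes(scores), largest_step(scores))
```

For both types of prediction, the largest increase in these means falls between the end of cycle 1 and the beginning of cycle 2.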

Scores subdivided according to experience
There are improvements in predictions of strain response and mechanical behavior at all experience levels over the course of the two predict-observe-revise cycles (Figure 5, Table 2). MS students consistently made better predictions than BS and minor students, with typical scores of 2 or 3 for their predictions of strain behavior and mechanical response at the beginning of cycle 1 increasing to scores of 3 or 4 at the end of cycle 2. However, the change in BS and minor students' scores over the course of the activity was greater, particularly when considering their predictions of strain response; typical scores of 1 or 2 increased to 2, 3 or (occasionally) 4 over the course of the activity (Table 2), such that the gap between the MS students and the other two groups was much smaller by the end of cycle 2 (Figure 5, left graph).
The timing of these improvements in prediction quality also varies with group and type (Figure 5), although there was usually a marked improvement in predictive sketch scores between the end of cycle 1 and the beginning of cycle 2 a week later. Mean scores for MS students' strain predictions increased only at this step; within both cycles, their revised strain predictions were actually of lower quality than their initial predictions. In contrast, mean scores for MS students' mechanical predictions only increased at the end of cycle 1 and did not improve further. BS students' strain predictions increased in quality throughout the activity, although mainly within cycle 1 and in the gap between the cycles, whilst the quality of their mechanical predictions decreased within the two cycles and increased between them, for little overall gain. Minor students' strain and mechanical predictions did not improve within cycle 1, but mean scores for both increased in the gap between the cycles and within cycle 2.

Figure 5. Mean sketch scores for accuracy of predicted strain (left) and mechanical (right) behavior through two predict-observe-revise cycles, subdivided by experience level in Geology (MS = diamonds, BS = circles, minor = triangles). The thicker gray line plots the combined mean scores for all three groups.

Scores subdivided according to pretest performance
At the beginning of cycle 1 the mean score for predicted strain response is 2.1/4 for the seven students who achieved a score of ≥8/10 on the pretest, compared to a mean of 1.4/4 for the four students who got a score of <8/10 (Figure 6, Table 3). Although there was a considerable overlap in the score distributions for both groups, only individuals in the high-scoring group initially produced predictive sketches that scored more than 2 (Table 3). Within cycle 1 and between cycle 1 and 2, both groups improved the quality of their predictive sketching, but the low-performance group improved their predictive sketching much more than the high-performance group, such that by the end of the activity the mean strain prediction scores for both groups were the same (3.0/4, Figure 6). In contrast, at the beginning of cycle 1 the difference in mean scores for predictions of mechanical response was only ∼0.1, and this small gap was maintained for the duration of the activity: for both groups, there was a small increase in mean scores within cycle 1, a larger increase between the cycles, and a small decrease within cycle 2.

Performance on geologic block cross-sectioning test
The mean block model test score for all 11 students increased from 6.5/10 to 8.2/10 between the pretest and the post-test (Figure 7). On the pretest, scores vary with experience level, with MS students performing best (mean score 9/10), BS students performing less well (mean score 7/10), and students taking the minor performing the least well (mean score 1.5/10). This ranking was maintained on the post-test, but the gap closed between the minors and the other two groups. Compared to the other two groups, the BS students exhibited a much larger range of scores on the pretest (from 3/10 to 10/10) and more varied individual trajectories during the study: three students got a higher score on the post-test, one got the same score, and two got lower scores. All minor and MS students improved their scores from pretest to post-test.
When subdivided into two groups based on performance on the pretest, the four minor and BS students who did poorly on the pretest generally improved markedly on the post-test, with the average score increasing from 2.75/10 on the pretest to 7/10 on the post-test. In contrast, there was very little change in the average score for the seven BS and MS students in the high-scoring group.
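As a consistency check on the reported means, the subgroup means can be pooled by group size. The Python sketch below reproduces the all-student pretest mean from the experience-level means, then backs out the high-scoring group's means, which the text describes only as showing "very little change". The implied values are approximate because the reported means are rounded. All names are hypothetical.

```python
# Pretest means by experience level, as reported: (group size, mean out of 10).
pretest_by_experience = {"MS": (3, 9.0), "BS": (6, 7.0), "minor": (2, 1.5)}


def pooled_mean(groups):
    """Size-weighted mean of subgroup means."""
    total_n = sum(n for n, _ in groups.values())
    total_score = sum(n * m for n, m in groups.values())
    return total_score / total_n


def implied_other_mean(overall_mean, total_n, known_n, known_mean):
    """Mean of the remaining students, given the overall and one subgroup mean."""
    return (overall_mean * total_n - known_n * known_mean) / (total_n - known_n)


overall_pre = pooled_mean(pretest_by_experience)   # 72 / 11, reported as 6.5
high_pre = implied_other_mean(overall_pre, 11, 4, 2.75)
high_post = implied_other_mean(8.2, 11, 4, 7.0)    # from the reported post-test means

print(f"overall pretest mean: {overall_pre:.2f}")
print(f"implied high-scoring group: {high_pre:.1f} -> {high_post:.1f}")
```

Under these assumptions the high-scoring group moves from roughly 8.7 to roughly 8.9 out of 10, which is consistent with the small change described.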

Students' reflections on sandbox models
In their responses to questions related to the sandbox activity in the end-of-course survey (Table 4), eight students recognized that their ability to understand and predict the results of tectonic deformation increased after working with the sandbox model; five of those eight explicitly referred to the value of access to a physical artifact that they could observe and use to refine their mental models of deformation. Four students stated that the experiments helped them identify issues with their reasoning about how deformation should occur.
Participants identified as most helpful the modeling experiments and being able to ask questions openly. For example, one of the geology minors shared that "being able to ask questions as a non-geology major that may seem stupid and not be ridiculed for it allowed an open and non-judgmental learning space." An MS student reflected, "I had previously done similar experiments but these were a lot more useful and incorporated more variables that made them a lot more worthwhile."

Discussion
The primary goal of the sandbox activity described here was to improve students' understanding of how geological structures develop in response to applied strain, and the effect of different rheologies on the ultimate expression of those structures. Regardless of the students' experience level, the chance to observe structures developing in the sandbox coincided with an improved ability to predict geologically realistic deformation patterns (Figures 5 and 6). Students' own comments on the activity support the idea that the experimental activity helped students to make better-informed predictions (Theme 1/1a, Table 4). Low scores on predictive sketches in the first cycle were typically the result of unrealistic predictions about how deformation would occur. Common examples included predicting pure shear rather than buckling and the formation of faults in the sand pile, and failure to realistically model the differential strain in layers with different rheologies, such as predicting that the "strong" sugar layer would not deform at all (Figure 4). It is unclear whether these mistaken predictions reflect misconceptions about geological deformation, an initial failure to appreciate that physical scaling laws allow granular materials to successfully simulate brittle tectonic deformation in the Earth's crust, or a combination of these factors. However, improved sketch scores by the end of the second cycle reflect the much less frequent occurrence of unrealistic predictions, as students used their observations during the performed deformation experiments to refine their mental models of deformation, a process the cycle-based learning approach is designed to promote (Davatzes et al., 2018). Some students explicitly testified that they engaged in this process (Theme 2, Table 4).

Figure 6. Mean sketch scores for accuracy of predicted strain (left) and mechanical (right) behavior through two predict-observe-revise cycles for all experiments, subdivided by block model pretest performance (open squares <8/10, filled squares ≥8/10). The thicker gray line plots the combined mean scores for both groups.

Table 2. Predictive sketch score distributions, and mean sketch scores, for accuracy of predicted strain and mechanical behavior through the two predict-observe-revise cycles. Sub-groups are divided by experience level in Geology. Group totals marked by a * indicate where some students in the group missed the session or did not submit predictive sketches.

Table 3. Sub-groups are divided by block model pretest performance (low: <8/10, high: ≥8/10). Group totals marked by a * indicate where some students in the group missed the session or did not submit predictive sketches.
The most consistently observed increase in predictive sketch scores, for all groups and for both types of prediction, was between cycle 1 and cycle 2. It is possible that the new information revealed by the experiments was only successfully assimilated into students' mental models of deformation after a period of reflection, and/or most students were given insufficient time to reflect at the end of the experimental sessions before having to sketch and submit revised predictions. But it is also possible that this improvement is related to the switch from extension in the first cycle to convergence in the second. If students had a better conceptual understanding of convergent deformation compared to extensional deformation, then the observed improvements in the quality of their predictions might be unrelated to the cycle-based learning process. This latter hypothesis could be tested by switching the order of the experiments in a future iteration of this activity. In the present study, mechanical prediction scores, where the type of strain is not as important, and the predictive skill of the minor students, who were unlikely to be familiar with either type of deformation, both increase between cycles (Figure 5, Table 2), which might not be expected if the switch in strain type was an important factor. More generally, the different trajectories of improvement at different experience levels might reflect differences in how well-developed and firmly-held students' mental models of deformation are at different stages of a geology degree, although the sizes of the subgroups in this study are too small to draw firm conclusions.

Figure 7. Comparison of block model test scores (out of 10) before and after the sandbox exercise. Small symbols represent pretest scores and large symbols post-test scores. Individual score trajectories (upper panel) are plotted with open symbols for students who scored <8/10 on the pretest and filled symbols for students who scored ≥8/10. Middle panel plots mean scores for groups divided by experience level in Geology. MS students (n = 3; diamonds) showed a small but consistent improvement on their high initial scores; BS students (n = 6; circles) showed a small improvement on average, but there was much more variability in individual trajectories; minor students (n = 2; triangles) improved, but not consistently. When split according to performance on the pretest (lower panel), students in the low-scoring group record a substantial average improvement.
The secondary goal of the sandbox modeling activity was to improve students' penetrative thinking skill, a key spatial reasoning skill in geology. As shown in Figure 7, the mean block model score for all participants increased by ∼25% on the post-test. But these gains were not evenly distributed: the seven students who got high scores on the pretest showed little improvement, whilst the four students in the low-scoring group showed large increases in their post-test scores relative to the pretest. In the former case, scores of 8 or 9 out of 10 on the pretest leave little room for improvement on the post-test. This ceiling effect may have been exacerbated by the fact that the questions that were removed from the test to fit class scheduling involved igneous intrusions cross-cutting earlier structures, and thus required a higher skill level to answer correctly. The large score increases observed in the low-scoring group were unexpected, as previous work has indicated only modest gains in spatial reasoning skill over the course of an entire semester (Titus & Horsman, 2009; Ormand et al., 2014), even when the effect of adding a specific instructional method is being assessed (Drennan & Evans, 2011; Giorgis, 2015; Hannula, 2019). Although having two tests only two weeks apart could potentially enhance the "test-retest effect" compared to previous studies, which generally use a pretest at the beginning of the semester and the post-test at the end of semester, no significant test-retest effect has been observed for the block model test over a testing interval of 3-4 weeks (Ormand et al., 2014).
Furthermore, when the participants in this study are considered as a single group, a 25% improvement with a wide range of scores is more consistent with these previous studies; it is only when the low-performing students are analyzed separately that a big change is apparent, although further study is required to test if more variability would be present in a larger group of students with low penetrative thinking skill.
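The arithmetic behind the ∼25% figure and the ceiling effect can be made explicit with a minimal Python sketch. It uses only the reported means and the stated pretest scores of 8 or 9 out of 10; the function names are hypothetical, and "headroom" is simply the number of points left below the maximum score.

```python
MAX_SCORE = 10  # shortened block model test, 10 questions


def relative_gain(pre, post):
    """Change in score as a fraction of the pretest score."""
    return (post - pre) / pre


def headroom(pre, max_score=MAX_SCORE):
    """Largest improvement still possible from a given pretest score."""
    return max_score - pre


all_students = relative_gain(6.5, 8.2)   # reported all-student means
low_group = relative_gain(2.75, 7.0)     # reported low-scoring group means
low_group_points = 7.0 - 2.75            # observed mean gain in points

print(f"all students: {all_students:.0%}")
print(f"low-scoring group: {low_group:.0%} ({low_group_points} points)")
print(f"headroom from a pretest score of 8: {headroom(8)} points")
print(f"headroom from a pretest score of 9: {headroom(9)} points")
```

The low-scoring group's mean gain of 4.25 points is larger than the 1 or 2 points available to a student who started at 9 or 8, so gains of comparable size could not be registered for such students on this test.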
When predictive sketch scores are subdivided according to initial performance in the block model test (Figure 6, Table 3), there is an apparent relationship between lower predictive sketch quality at the beginning of the activity and less developed penetrative thinking skills: the group with low scores on the block model pretest were also less successful at predicting a realistic strain response. The larger improvement in performance in the block model post-test for the low-scoring group is also mirrored by the improvements in the quality of their predictions. By the end of the second predict-observe-revise cycle, the gap in mean strain prediction scores between the low-scoring and high-scoring groups had closed completely. Therefore, for this group of students, both the block model test and predictive sketch scores indicate that the sandbox activity had a larger effect on the geological understanding of students with less developed penetrative thinking skills, a hypothesis that could be more robustly tested in a study with a larger number of participants. It should also be noted that the high-scoring group's predictive sketch scores improve, which further supports the possibility that gains in penetrative thinking skill in this group, as measured by the block model test, are being hidden by a ceiling effect.
The specific mechanism by which students may have improved their penetrative thinking skill is beyond the scope of this study, but the apparent benefit of sketching cross-sections through a physical model bears an affinity to the observations of Gagnier et al. (2017), who indicated that sketching predictions of cross sections through 3D geologic block diagrams can promote spatial thinking over and above visualizing the interior of a model without sketching. Similarly, when discussing the positive effect of a mapping course on spatial reasoning skills, Hannula (2019) speculated that mapping areas where relief clearly exposes cross-sectional relationships helps with visualization. The clear sides of the sandbox model provide a similar window into deformed sequences, with the additional benefit of also allowing the actual process of deformation to be observed.

Table 4. Total responses to student surveys (out of 11 participants) where comments were coded as meeting the listed themes, and examples of student responses that met each theme.
Theme | Number of participants | Representative quotes
1. The sandbox models helped students to make better predictions/understand tectonic deformation | 8 | "Along with better understanding the actions of tectonics I believe I was better equipped to make a more informed prediction." (UG minor); "My ideas became more detailed and accurate because my knowledge of what would happened grew. The models helped with interpreting thick and thin skinned deformation [on the field trip]" (MS student)
1a. Some participants attributed this change to being able to examine a physical artifact and visualize the processes and resulting structures. | 5 | "Doing the sandbox experiment helped me better understand what happens during convergence and rifting, with different layers being put into place. At times, I have a hard time picturing in 3D… actually doing the experiment helped out tremendously" (BS student); "The convergent models helped visualize the deformation patterns seen on the trip. It was nice to do those right before we went over there. Good timing." (MS student)
2. The models helped students to identify issues with their reasoning | 4 | "As we performed the experiments, I felt like I got a better understanding of the extent of the deformation and/or fault/folding. … The modeling helped me to have a better understanding of the internal structure of the Appalachians." (BS student)

Limitations and suggested improvements
Although the results presented above are promising, they are limited by the small number of participants. A study of a larger cohort of students, and/or multiple cohorts across multiple institutions, would make it possible to perform robust statistical analysis of overall learning gains, the differences between subgroups, and changes at different stages in the activity. There are also a number of other limitations in how learning gains were measured, which could be improved upon:
• As already discussed in the previous section, the switch in strain type between the two learning cycles makes it possible that the learning gains observed at this step are related to preexisting differences in student understanding of convergent and extensional deformation, which could be tested in a future iteration by changing the order of experiments for some students.
• The comparison of predictive sketches across cycles is also potentially complicated by the fact that the specific experimental set-ups observed by the students also change between cycles (Table 1). One experimental set-up (no. 3) was never run in class and thus generated a full set of both initial predictions and revised predictions in both cycles, which can be directly compared without having to account for differences in students' understanding of different material behaviors. When the results from this set are considered separately (Supplementary material Figures S1 and S2), the trends are very similar to those observed for the full dataset (Figures 5 and 6). Nonetheless, a future study could be designed to investigate the effects of varying which experiments are run in each cycle for different sets of students.
• Because all assessment was contained within the activity itself, it is unclear if the apparent improvements in geological understanding and penetrative thinking skill at the end of the activity persist over longer timeframes. There was also no formal assessment of how successfully the students applied their improved mental models of deformation in the sandbox to real geological situations, although there is some support from student comments that, as intended, the examples used during the convergence experiments helped with the visualization of deformation in the Appalachians on the field trip. Later assessment activities could be incorporated to investigate these questions.
• The assessment of penetrative thinking skill development in more capable students was limited by the fact that many students got very high scores on the pretest and so could not show significant improvement on the post-test, even though their predictive sketch scores indicated some development in geological understanding over the course of the activity. This ceiling effect might be partially rectified by giving the full version of the block model test; however, it is also possible that an alternative or an additional instrument is needed when the level of penetrative thinking skill is already likely to be high.
• The student surveys provided positive, but quite generalized, information about how students viewed the activity. The questions analyzed were part of an end-of-semester survey and, despite specific prompting, student answers often referred to the whole course rather than the sandbox activity in particular. In the future, a separate survey or interviews conducted right after the activity is completed might allow students' thinking to be more fully interrogated.
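As one illustration of the kind of statistical analysis a larger cohort would allow, the following is a minimal sketch of an exact sign-flip permutation test on paired pretest and post-test scores. This is an assumed choice of method, not an analysis performed in this study. The function name is hypothetical and the scores are invented for illustration.

```python
from itertools import product


def paired_permutation_p(pre, post):
    """Exact two-sided sign-flip permutation test for paired scores.

    Under the null hypothesis of no change, each student's difference
    (post - pre) is equally likely to be positive or negative, so every
    assignment of signs to the differences is enumerated. This is only
    practical for small samples, since there are 2**n assignments.
    """
    diffs = [b - a for a, b in zip(pre, post)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped_sum = sum(s * d for s, d in zip(signs, diffs))
        if abs(flipped_sum) >= observed:
            extreme += 1
    return extreme / total


# Hypothetical scores for five students; NOT data from this study.
example_pre = [3, 4, 2, 8, 9]
example_post = [6, 7, 6, 9, 9]

if __name__ == "__main__":
    print(paired_permutation_p(example_pre, example_post))
```

Even when every invented student improves or stays level, five participants cannot yield a small p-value with this test, which is consistent with the point above that larger groups are needed before gains can be shown to be statistically significant.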
The design of the sandbox activity itself could also potentially be improved in future iterations. The sandbox experiments often ran close to the end of the scheduled class time, giving students little time to reflect and make their revised predictive sketches; ensuring that the experiments end 5-10 minutes before the end of class might yield clearer improvements within each learning cycle. It would also be possible to add a further reflection step by having students submit assessments of their revised predictions using pictures or videos of the experiments they did not run in person, effectively creating four learning cycles instead of two. During the experiments, the instructor could take a more active role in highlighting the critical features that are developing, and in guiding discussions of how they compare to the other model running in parallel and to previously observed experiments.

Implications and future research
Although sandbox modeling is widely acknowledged as a compelling tool for demonstrating tectonic deformation in an accessible way, there has been very little reported research that investigates the most effective methods of using this tool to develop students' geological understanding. This study provides the first evidence that, when coupled with a cycle-based learning approach, a sandbox activity can advance students' geological understanding, whilst also potentially developing more general spatial reasoning skills. With a deliberately designed activity, sandbox modeling can therefore do more than demonstrate tectonic deformation processes: it can also help realize important learning outcomes.
Whilst analysis of predictive sketches demonstrated growth in geological understanding for students at all levels of experience, the activity described here appeared to be more effective for students earlier in their geology training, and particularly for those who have weaker penetrative thinking skills (as measured by the block model test). This suggests that use of sandbox modeling activities early in the program of study might offer a simple yet effective way to promote spatial skill development. However, many studies (e.g., Ormand et al., 2014) have observed that geology majors who are quite advanced in their program of study can sometimes still have poorly developed spatial skills, which was also observed in this study (Figure 7).
The results presented here are based on a single class of eleven students. Further study in multiple and/or larger classes is therefore needed to confirm these conclusions. Other topics of interest for future research include more detailed analysis of how and when students successfully assimilate and transfer their observations of sandbox deformation experiments when making new predictions; what mechanism drives improvements in penetrative thinking ability; and the question of whether the short-term gains observed in this study persist beyond the end of the activity.