Pick the Smaller Number: No Influence of Linguistic Markedness on Three-Digit Number Processing

The symbolic number comparison task has been widely used to investigate the cognitive representation and underlying processes of multi-digit number processing. The standard procedure to establish numerical distance and compatibility effects in such number comparison paradigms usually entails asking participants to indicate the larger of two presented multi-digit Arabic numbers rather than to indicate the smaller number. In terms of linguistic markedness, this procedure includes the unmarked/base form in the task instruction (i.e., large). Here we evaluate distance and compatibility effects in a three-digit number comparison task observed in Bahnmueller et al. (2015, https://doi.org/10.3389/fpsyg.2015.01216) using a marked task instruction (i.e., ‘pick the smaller number’). Moreover, we aimed at clarifying whether the markedness of task instruction influences common numerical effects and especially componential processing as indexed by compatibility effects. We instructed German- and English-speaking adults (N = 53) to indicate the smaller number in a three-digit number comparison task as opposed to indicating the larger number in Bahnmueller et al. (2015). We replicated standard effects of distance and compatibility in the new pick the smaller number experiment. Moreover, when comparing our findings to Bahnmueller et al. (2015), numerical effects did not differ significantly between the two studies as indicated by both frequentist and Bayesian analyses. Taken together, our data suggest that distance and compatibility effects alongside componential processing of multi-digit numbers are rather robust against variations of linguistic markedness of task instructions.

to be classified as even (showing a so-called "odd effect"; see also Nuerk, Iversen, & Willmes, 2004). Following from the linguistic markedness account, the default (unmarked) pick larger setup might differ from the marked pick smaller setup resulting in differences in general task performance (i.e., longer reaction times in the pick smaller setup) as well as in observed numerical effects.

Linguistic Markedness and Numerical Effects
Up to now, only a few studies have investigated modulations of numerical effects resulting from manipulations of unmarked vs. marked task instructions. For instance, Verguts and De Moor (2005) manipulated linguistic markedness of task instruction (pick the smaller vs. pick the larger number) when investigating the distance effect in a two-digit number comparison task. They found an overall distance effect for within-decade number pairs (e.g., 64_68) but not for between-decade number pairs for which decade distance was held constant (decade distance was always 1; e.g., 68_72) for both the marked and the unmarked task instructions (see Moeller, Klein, & Nuerk, 2013, for a discussion of the differential results regarding distance effects). Crucially, although there was no formal statistical comparison, descriptively overall response times in the pick smaller condition were about 60 ms slower than in the pick larger condition (see Figure 1 in Verguts & De Moor, 2005). Thus, this study seems to show an effect of linguistic markedness on overall reaction times. However, no evidence was provided indicating a modulating effect of linguistic markedness on the numerical distance effect.
In contrast, Arend and Henik (2015) demonstrated that the linguistic markedness of the task instruction modulates the size congruity effect (SiCE). The SiCE refers to the finding that in numerical and physical comparison tasks, response times are shorter when number magnitude and physical size are congruent than when they are incongruent (Henik & Tzelgov, 1982). In the study by Arend and Henik (2015), reaction times were longer in the pick smaller condition compared to the pick larger condition. Moreover, the SiCE was larger when participants were instructed to pick the larger as compared to when they were instructed to pick the smaller number in the number magnitude comparison task, but no difference was found in the physical comparison task.
Further studies show that the linguistic markedness of task instruction also affects other types of Spatial-Numerical Associations (SNAs; see e.g., Cipora, Schroeder, Soltanlou, & Nuerk, 2018). Patro and Haman (2012) found an effect of SNA congruency (i.e., faster reactions to larger numerosities on the right) only in the pick larger but not in the pick smaller condition (i.e., reactions to smaller numerosities did not differ between left and right; see Figure 2 in Patro & Haman, 2012). Type of instruction also affects comparative judgments of conceptual size of objects, but not Arabic numbers (Shaki, Petrusic, & Leth-Steensen, 2012).
To sum up, the evidence for the modulating role of linguistic markedness of task instruction on numerical effects remains inconsistent. One potential mechanism by which linguistic markedness of task instruction might affect specific numerical effects is its influence on overall reaction times. For instance, the spatial-numerical association of response codes effect (SNARC effect; Dehaene, Bossini, & Giraux, 1993) was shown to increase with longer overall reaction times (Cipora, Soltanlou, Reips, & Nuerk, 2019; see also Gevers, Verguts, Reynvoet, Caessens, & Fias, 2006; see Cipora et al., 2016, for a discussion of potential measurement artifacts in this context). Other cognitive effects, such as the Simon effect, also seem to vary with general reaction time (Mapelli, Rusconi, & Umiltà, 2003; see also Glaser & Glaser, 1982, for the Stroop effect).
With respect to the effects of interest in the present study, the distance effect was shown to be more pronounced for longer reaction times (Hohol et al., 2020). However, to the best of our knowledge, associations of overall response times and compatibility effects have not been reported yet. Nonetheless, in developmental studies overall reaction times were standardized to control for potential effects of interindividual variability in reaction times on the size of compatibility effects (Mann, Moeller, Pixner, Kaufmann, & Nuerk, 2012; Nuerk, Kaufmann, Zoppoth, & Willmes, 2004; Pixner, Moeller, Heřmanová, Nuerk, & Kaufmann, 2011). The reasoning behind the standardization is that prolonged processing of a stimulus might lead to increased interference of task irrelevant digits (i.e., unit digit in two-digit number pairs, unit and tens digit in three-digit number pairs) in incompatible number pairs and, thereby, to larger compatibility effects.

The Present Study
The current study set out to evaluate the generality of basic effects in multi-digit number processing (i.e., distance and compatibility effects) across marked and unmarked task instructions (i.e., pick the larger vs. pick the smaller number). In particular, in a conceptual replication attempt of the study by Bahnmueller et al. (2015), we employed the same three-digit number comparison paradigm with Arabic digits in a comparable sample of German- and English-speaking adults. However, instead of asking participants to indicate the larger of two presented three-digit numbers, we asked participants to indicate the smaller of two three-digit numbers.
As it seems unlikely that a change in linguistic markedness of task instructions leads to major disruptions of the main underlying cognitive mechanisms of multi-digit Arabic number processing (i.e., number magnitude should still be processed, numbers should still be processed componentially), we predicted reliable main effects of hundred distance, hundred-decade compatibility, and hundred-unit compatibility when participants are asked to pick the smaller number.
To investigate potential modulating effects of linguistic markedness more directly, we compared overall reaction times as well as the respective numerical effects directly between the newly conducted pick smaller and the pick larger experiment in Bahnmueller et al. (2015). In line with previous reports (Arend & Henik, 2015; Verguts & De Moor, 2005), we expected prolonged reaction times when instructed to pick the smaller as compared to picking the larger number.
Regarding modulations of the numerical effects due to linguistic markedness of the task instruction, we expected to replicate the findings by Verguts and De Moor (2005) showing comparable distance effects for marked and unmarked task instructions. However, regarding the hundred-decade and the hundred-unit compatibility, we expected to find larger compatibility effects when instructed to pick the smaller number because longer overall reaction times and, thus, prolonged processing of number pairs in the pick smaller experiment should lead to increased interference of task irrelevant digits (i.e., unit and tens digit) in incompatible number pairs and, thereby, to larger compatibility effects.

Method
Participants
For the analyses of the pick smaller experiment, newly collected data of a total of 53 participants were considered (after exclusions, see below). Based on Bahnmueller et al. (2015; henceforth referring to the pick larger experiment), we did not expect three-digit number processing to be influenced by the number word structure (e.g., inverted vs. non-inverted number words; but see, e.g., Steiner et al., 2021, this issue, for inversion-related effects when processing multi-digit numbers in children). However, we recruited a comparable sample of German- and English-speaking participants for the pick smaller experiment. This allowed for optimal comparability between studies and further exploration of potential language-related modulations within the present pick smaller experiment.
Three participants were excluded in the pick smaller experiment because error rates exceeded 10% in the experimental trials. Moreover, another four participants were excluded because they consistently used the reverse response coding (i.e., they picked the larger number). Thus, the final pick smaller sample consisted of 30 native German speakers (24 female, all right handed, M age = 22.7 years, SD = 2.8) and 23 native English speakers (16 female, all right handed, M age = 19.7 years, SD = 1.4).
For the re-analyses of the pick larger experiment, data of a total of 51 participants were considered. Two participants were excluded because error rates exceeded 10%. Thus, the final pick larger sample consisted of 24 native German speakers (21 female, 20 right handed, M age = 23.1 years, SD = 6.3) and 27 native English speakers (21 female, 25 right handed, M age = 20.1 years, SD = 2.3).
German-speaking participants were recruited via postings at the University of Tuebingen and the Leibniz-Institut für Wissensmedien Tübingen. English-speaking participants were recruited at the University of York. Participants received course credit or 5€/4£ for compensation. The study was approved by the local ethics committee of the University of York.

Power Calculations
Sample size estimates for paired t-tests for the pick smaller experiment were calculated using JAMOVI (The jamovi project, 2020) and were based on the respective effect sizes observed in the pick larger experiment. Based on this, a sample size of 27 should be sufficient to detect a hundred-decade compatibility effect (i.e., the smallest main effect observed in the pick larger experiment) of an effect size of d = 0.59 or larger with α = .05 (one-tailed) and a power of .90. To achieve comparability between the pick smaller and the pick larger experiment and to increase sensitivity for detecting a smaller effect in the pick smaller experiment, we aimed at collecting a comparable number of participants (N = 51) allowing us to detect a medium-sized effect of d = 0.46. G*Power (Faul et al., 2009) was used for power estimates of the between-subject effect of instruction (pick smaller vs. pick larger) as well as the within-between interaction of the respective numerical effect and instruction in the 2 × 2 mixed factor ANOVAs. A total sample size of 100 is sufficient to detect a medium-sized between-subject as well as interaction effect of f = 0.33 (ηp² = .10) with α = .05 and a power of .90 (see Supplementary Materials for all outputs of the power calculations).
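As a rough cross-check of the reasoning above, the following sketch approximates the required sample size for a paired t-test with the normal approximation n ≈ ((z(1 − α) + z(power)) / d)² and converts ηp² into Cohen's f. The normal approximation is an assumption of this example and is not the noncentral-t computation used by jamovi or G*Power, so it yields slightly smaller values than the reported n = 27. The function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def approx_n_paired(d, alpha=0.05, power=0.90, one_tailed=True):
    """Normal-approximation sample size for a paired t-test (illustrative only)."""
    z = NormalDist()
    # Critical z for the chosen alpha level (one- or two-tailed).
    z_alpha = z.inv_cdf(1 - alpha if one_tailed else 1 - alpha / 2)
    # z corresponding to the desired power.
    z_power = z.inv_cdf(power)
    return ceil(((z_alpha + z_power) / d) ** 2)


def cohens_f_from_partial_eta_sq(eta_sq):
    """Cohen's f = sqrt(eta_p^2 / (1 - eta_p^2))."""
    return sqrt(eta_sq / (1 - eta_sq))


if __name__ == "__main__":
    # Smallest main effect of the pick larger experiment (d = 0.59).
    print(approx_n_paired(0.59))
    # Partial eta squared of .10 expressed as Cohen's f.
    print(round(cohens_f_from_partial_eta_sq(0.10), 2))
```

Under these assumptions the approximation lands a little below the exact estimate, which is expected because the t distribution has heavier tails than the normal distribution, and ηp² = .10 corresponds to f ≈ 0.33.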

Stimuli
The same stimulus set was used in the pick smaller and the pick larger experiment. In total, 640 three-digit number pairs were used. Of these, 320 were experimental items manipulated orthogonally according to hundred-, decade-, and unit distance (each small [1-3] vs. large [4-8]), as well as hundred-decade and hundred-unit compatibility (compatible vs. incompatible). Moreover, problem size was matched across all item categories and decade as well as unit distance was matched for the respective item categories. In addition to the 320 experimental items, 320 filler items were included in the stimulus set to prevent participants from focusing only on the decision-relevant hundred-digit (160 within-hundred filler items, e.g., 672_648; 160 within-hundred-within-decade filler items, e.g., 282_284). Please refer to the Supplementary Materials in Bahnmueller et al. (2015) for a more detailed description of the stimulus set as well as descriptive characteristics of all item categories.
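The item categories described above can be sketched as a small classifier. The helper below is hypothetical and not the authors' stimulus-generation code. It assumes the usual definition of compatibility, namely that a pair is compatible when comparing the decision-irrelevant digit (decade or unit) leads to the same decision as comparing the hundreds, and it uses the distance categories given in the text (small: 1-3, large: 4-8).

```python
def classify_pair(a, b):
    """Classify a pair of three-digit numbers (hypothetical helper)."""
    ha, da, ua = a // 100, (a // 10) % 10, a % 10
    hb, db, ub = b // 100, (b // 10) % 10, b % 10

    def sign(x):
        return (x > 0) - (x < 0)

    def compatibility(x, y):
        # Undefined when the hundreds or the compared digits are equal.
        if ha == hb or x == y:
            return None
        return "compatible" if sign(x - y) == sign(ha - hb) else "incompatible"

    distance = abs(ha - hb)
    if distance == 0:
        hundred_distance = None  # within-hundred filler item
    elif distance <= 3:
        hundred_distance = "small"
    else:
        hundred_distance = "large"

    return {
        "hundred_distance": hundred_distance,
        "hundred_decade": compatibility(da, db),
        "hundred_unit": compatibility(ua, ub),
    }


if __name__ == "__main__":
    print(classify_pair(342, 587))  # illustrative experimental-type pair
    print(classify_pair(672, 648))  # within-hundred filler from the text
```

For a within-hundred filler such as 672_648 all three entries are undefined, which reflects that fillers do not enter the distance and compatibility analyses.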
Unfortunately, due to a programming error in the pick smaller experiment, participants were only presented with 560 of the 640 items (i.e., the last block [80 items] was not presented). The 560 items were randomly drawn from the total item set for each participant. Regarding the 320 experimental stimuli included in the analyses, an item was presented 46.4 times on average (SD = 2.4, range: 40-52). Because items were drawn randomly, stimulus matching was not substantially affected (see Supplementary Materials for item characteristics of the experimental items in the pick smaller experiment compared to item characteristics of the matched stimulus set).
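The reported presentation counts can be checked with a short calculation. The sketch assumes that each of the 53 participants in the final sample received an independent random draw of 560 of the 640 items, so that the number of participants who saw a given item follows a binomial distribution with p = 560/640. The function name is hypothetical.

```python
from math import sqrt


def presentation_stats(n_participants, items_shown, items_total):
    """Expected number of presentations per item and its binomial SD."""
    p_shown = items_shown / items_total  # chance that one participant sees an item
    expected = n_participants * p_shown
    sd = sqrt(n_participants * p_shown * (1 - p_shown))
    return expected, sd


if __name__ == "__main__":
    mean_count, sd_count = presentation_stats(53, 560, 640)
    print(round(mean_count, 1), round(sd_count, 1))
```

Under these assumptions the expected values agree with the reported mean of 46.4 presentations and SD of 2.4.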

Procedure
The procedure of both experiments was identical and differed only with respect to the task instruction. In particular, participants were instructed to indicate the smaller (pick smaller experiment) or the larger (pick larger experiment) of two simultaneously presented three-digit numbers as fast and as accurately as possible. Numbers were presented above each other. In the pick smaller experiment, participants were asked to press the upward arrow of a standard keyboard in case the upper number was smaller, and they were asked to press the downward arrow in case the lower number was the smaller one. In contrast, in the pick larger experiment, participants had to indicate the location of the larger number by pressing the upward arrow in case the upper number was larger, and the downward arrow in case the lower number was larger.
The respective experiment started with 10 practice trials, followed by 8 blocks (7 blocks in the pick smaller study) containing 80 items each. After each block, the participant could take a short break. Stimulus order was randomized separately for each participant and across blocks. Stimuli were presented centrally in white against a black background (font: Arial, font size: 24, bold). A trial started with a fixation cross presented centrally for 500ms. Following the fixation cross, a number pair was presented and remained on the screen until a response was given. The next trial started after an inter-trial-interval of 500ms.
As error rates were very low (pick smaller experiment: M = 4.3%, SD = 2.0%; pick larger experiment: M = 3.7%, SD = 2.1%), analyses focused on reaction times (RTs). Practice trials and filler items were excluded from the analyses. Moreover, RTs faster than 200ms as well as RTs deviating more than ±3 SD from an individual participant's mean RT were excluded. This trimming procedure resulted in a loss of 1.4% of data.
Directly addressing our primary research question, we first report results of the analyses of numerical effects in the new pick smaller experiment using three paired t-tests¹ (i.e., one per numerical effect). Effect sizes (Cohen's d for paired t-tests) along with 95% confidence intervals were estimated as implemented in JASP. Moreover, a 2 × 2 × 2 × 2 mixed design ANOVA similar to the one reported by Bahnmueller et al. (2015) discerning the within-subject factors hundred distance, hundred-decade compatibility, and hundred-unit compatibility, as well as the between-subject factor language group (German vs. English) will also be reported for the pick smaller experiment.
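A minimal sketch of the trimming procedure for a single participant is given below. The order of the two steps (absolute cut-off first, then the ±3 SD criterion computed on the remaining trials) and the use of the sample standard deviation are assumptions of this example, since the text does not specify them. The function name is hypothetical.

```python
from statistics import mean, stdev


def trim_rts(rts_ms, floor_ms=200.0, sd_cutoff=3.0):
    """Return the RTs of one participant that pass both exclusion criteria."""
    # Step 1 (assumed order): drop anticipations faster than the floor.
    kept = [rt for rt in rts_ms if rt >= floor_ms]
    if len(kept) < 2:
        return kept  # no SD can be computed from fewer than two trials
    # Step 2: drop RTs beyond +/- sd_cutoff SDs of the participant's mean.
    m, s = mean(kept), stdev(kept)
    lower, upper = m - sd_cutoff * s, m + sd_cutoff * s
    return [rt for rt in kept if lower <= rt <= upper]


if __name__ == "__main__":
    # Twenty ordinary trials plus one anticipation and one very slow trial.
    trials = [700 + 10 * i for i in range(20)] + [150, 5000]
    print(len(trials), len(trim_rts(trials)))
```

Applying the function per participant and pooling the removed trials would give the overall proportion of excluded data.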
Analyses of the pick smaller experiment are directly followed by the re-analysis of the results of the pick larger experiment using the same, more focused analyses (i.e., one paired t-test per numerical effect). Afterwards, results of the direct comparison of the two experiments are reported separately for mean reaction times and each numerical effect using both frequentist as well as Bayesian measures to be able to quantify the evidence for both the null and the alternative hypothesis.
We further ran a 2 × 2 × 2 × 2 mixed design ANOVA discerning the within-subject factors hundred distance, hundred-decade compatibility, and hundred-unit compatibility, as well as the between-subject factor language group (German vs. English) for the pick smaller experiment. As expected based on the results of the t-tests above, we observed significant main effects of hundred distance, F(1, 51) = 353.75, p < .001, ηp² = .87, hundred-decade compatibility, F(1, 51) = 46.23, p < .001, ηp² = .48, and hundred-unit compatibility, F(1, 51) = 51.32, p < .001, ηp² = .50. Moreover, the interaction of hundred distance and hundred-unit compatibility was significant, F(1, 51) = 4.66, p = .036, ηp² = .08, indicating that the hundred-unit compatibility effect was significant for both small and large hundred distances (small: t(52) = 6.16, p < .001; large: t(52) = 4.03, p < .001) but was larger for small compared to large hundred distances, t(52) = 2.24, p = .029. Crucially, neither the main effect of language group, F(1, 51) = 1.88, p = .176, ηp² = .04, nor any of the interactions with language group were significant (all p ≥ .142). Thus, results for the pick smaller experiment provide no evidence for a difference in numerical effects between German and English speakers, replicating observations of Bahnmueller et al. (2015) previously reported for the pick larger experiment. Results of a parallel Bayesian mixed design ANOVA showing a comparable pattern can be found in the Supplementary Materials.
¹ Distance effects are often investigated using a continuous measure of distance rather than a categorical one. However, because we based our analyses on Bahnmueller et al. (2015), we decided to follow the categorical approach in the original paper and to use a categorical variable for both the analysis focusing on the distance effect only and the more complex factorial analysis.

Pick Larger Experiment
Paralleling analyses of the pick smaller experiment and providing a more focused analysis than presented in Bahnmueller et al. (2015), three separate paired t-tests were also run for the pick larger experiment. Comparable to the pick smaller study, a significant hundred distance effect was observed showing faster RTs for number pairs with a large (M = 728ms, SD = 160ms) as compared to a small hundred distance (M = 815ms, SD = 176ms), t(50) = 22.88, p < .001, d = 3.20, 95% CI [2.52, 3.88]. In addition, the effect of hundred-decade compatibility was significant, t(50) = 4.19, p < .001, d = 0.59, 95% CI [0.29, 0.88], indicating that compatible number pairs (M = 764ms, SD = 172ms) were responded to faster than incompatible number pairs (M = 777ms, SD = 165ms). Finally, the effect of hundred-unit compatibility was also significant, t(50) = 6.79, p < .001, d = 0.95, 95% CI [0.62, 1.28], with compatible number pairs (M = 757ms, SD = 165ms) being responded to faster than incompatible number pairs (M = 784ms, SD = 172ms). Again, the significance of results remains unchanged when correcting for multiple comparisons. Refer to Bahnmueller et al. (2015) for results of the analysis of the full factorial design.
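The effect sizes reported above can be related to the test statistics with a short calculation. The sketch assumes that Cohen's d for paired t-tests is the mean difference divided by the standard deviation of the differences, which equals t divided by the square root of the number of participants. The function name is hypothetical.

```python
from math import sqrt


def cohens_d_paired(t_value, n_participants):
    """Cohen's d for a paired t-test, computed as t / sqrt(n)."""
    return t_value / sqrt(n_participants)


if __name__ == "__main__":
    n = 51  # final pick larger sample, so df = n - 1 = 50
    for label, t in [("hundred distance", 22.88),
                     ("hundred-decade compatibility", 4.19),
                     ("hundred-unit compatibility", 6.79)]:
        print(label, round(cohens_d_paired(t, n), 2))
```

With n = 51 the three t values reproduce the reported effect sizes of 3.20, 0.59, and 0.95.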

Modulation of the Hundred Distance Effect
A mixed design ANOVA with the within-subject factor hundred distance (small vs. large) and the between-subject factor instruction (pick smaller vs. pick larger) revealed a significant effect of hundred distance, F(1, 102) = 818.76, p < .001, ηp² = .89 (small: M = 801ms, SD = 162ms; large: M = 711ms, SD = 143ms). Neither the main effect of instruction, F(1, 102) = 1.05, p = .308, ηp² = .01 (pick smaller: M = 741ms, SD = 143ms; pick larger: M = 771ms, SD = 173ms), nor the interaction of hundred distance and instruction, F(1, 102) = 1.56, p = .214, ηp² = .02, was significant. To quantify the evidence in case of non-significant results, we further ran a Bayesian mixed design ANOVA using default JASP prior scales. It revealed that the data were best represented by a model that included the main effect of hundred distance only. The Bayes Factor (BF₁₀) for this model was 4.33 × 10⁴⁶, indicating strong evidence for this model over the null model. Results further showed strong evidence against the model only including the main effect of instruction (BF₁₀ = 1.29 × 10⁻⁴⁷ or BF₀₁ = 7.75 × 10⁴⁶) as the data were 7.75 × 10⁴⁶ times more likely under the best model (i.e., the model only including the main effect of hundred distance). Moreover, results revealed weak/inconclusive evidence against the model including both main effects (BF₁₀ = 0.49 or BF₀₁ = 2.03) and moderate evidence against the model additionally including the interaction term (BF₁₀ = 0.18 or BF₀₁ = 5.55) when compared to the best model (see Table 1, see also Supplementary Materials for JASP output and analyses files).
Table 1
Note. HD = hundred distance; Instr = instruction; m = model; incl = inclusion. Models are compared to the best fitting model (i.e., the model only including the main effect of HD).
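The relation between the two Bayes factors reported for each model can be sketched as follows: BF₀₁ is the reciprocal of BF₁₀. The verbal labels use cut-offs of 3 and 10, which are an assumption of this example chosen to match the wording in the text; the function names are hypothetical. Small deviations from the reported BF₀₁ values arise because the reported BF₁₀ values are rounded.

```python
def bf01_from_bf10(bf10):
    """Evidence for the comparison model expressed as the reciprocal of BF10."""
    return 1.0 / bf10


def evidence_label(bf):
    """Verbal label for a Bayes factor >= 1 (assumed cut-offs: 3 and 10)."""
    if bf < 1:
        raise ValueError("pass the Bayes factor in the direction that is >= 1")
    if bf < 3:
        return "weak/inconclusive"
    if bf < 10:
        return "moderate"
    return "strong"


if __name__ == "__main__":
    for model, bf10 in [("instruction only", 1.29e-47),
                        ("both main effects", 0.49),
                        ("both main effects + interaction", 0.18)]:
        bf01 = bf01_from_bf10(bf10)
        print(model, f"{bf01:.3g}", evidence_label(bf01))
```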

Modulation of the Hundred-Decade Compatibility Effect
A mixed design ANOVA with the within-subject factor hundred-decade compatibility (compatible vs. incompatible) and the between-subject factor instruction revealed a significant effect of hundred-decade compatibility, F(1, 102) = 54.64, p < .001, ηp² = .35 (compatible: M = 747ms, SD = 154ms; incompatible: M = 763ms, SD = 150ms). The interaction of hundred-decade compatibility and instruction was not significant, F(1, 102) = 1.57, p = .213, ηp² = .02. The paralleling Bayesian mixed design ANOVA showed that the data were best represented by a model that included the main effect of hundred-decade compatibility only. The BF₁₀ for this model was 1.72 × 10⁸, indicating strong evidence for this model when compared to the null model. Moreover, there was strong evidence against the model only including the main effect of instruction (BF₁₀ = 2.51 × 10⁻⁹ or BF₀₁ = 3.98 × 10⁸), as the data were 3.98 × 10⁸ times more likely under the best model (i.e., only including the main effect of hundred-decade compatibility). Finally, results revealed weak/inconclusive evidence against the model including both main effects (BF₁₀ = 0.45 or BF₀₁ = 2.23) and moderate evidence against the model additionally including the interaction term (BF₁₀ = 0.17 or BF₀₁ = 5.97) when compared to the best model (see Table 2).
Table 2, row Instr: P(m) = .20, P(m|data) = 1.55 × 10⁻⁹, BFm = 6.21 × 10⁻⁹, BF₁₀ = 2.51 × 10⁻⁹.
Note. HDC = hundred-decade compatibility; Instr = instruction; m = model; incl = inclusion. Models are compared to the best fitting model (i.e., the model only including the main effect of HDC).

Modulation of the Hundred-Unit Compatibility Effect
A final mixed design ANOVA with the within-subject factor hundred-unit compatibility (compatible vs. incompatible) and the between-subject factor instruction revealed a significant effect of hundred-unit compatibility, F(1, 102) = 93.43, p < .001, ηp² = .48 (compatible: M = 742ms, SD = 149ms; incompatible: M = 767ms, SD = 155ms). The interaction of hundred-unit compatibility and instruction was not significant, F(1, 102) = 0.47, p = .494, ηp² = .01. The corresponding Bayesian mixed design ANOVA showed that the data were best represented by a model that included the main effect of hundred-unit compatibility only. The BF₁₀ for this model was 1.06 × 10¹³, indicating strong evidence for this model when compared to the null model. When compared to the best model (i.e., only including the main effect of hundred-unit compatibility), results revealed strong evidence against the model only including the main effect of instruction (BF₁₀ = 5.61 × 10⁻¹⁴ or BF₀₁ = 1.78 × 10¹³). Moreover, when compared to the best model, results revealed weak/inconclusive evidence against the model including both main effects (BF₁₀ = 0.43 or BF₀₁ = 2.32) and moderate evidence against the model additionally including the interaction term (BF₁₀ = 0.11 or BF₀₁ = 9.01; see Table 3).
Table 3
Note. HUC = hundred-unit compatibility; Instr = instruction; m = model; incl = inclusion. Models are compared to the best fitting model (i.e., the model only including the main effect of HUC).
Figure 1 illustrates Cohen's d and 95% confidence intervals around the respective effect separately for each numerical effect and instruction (pick smaller vs. pick larger). In line with Bayesian analyses, similar point estimates and largely overlapping confidence intervals do not provide evidence for a difference in numerical effects between experiments.

Figure 1
Cohen's d and 95% Confidence Intervals Presented Separately for Each Numerical Effect and Task Instruction

Bin Analyses
To explore potential differences in the time course of the effects of interest both within and across experiments, we further ran a bin analysis dividing the RT distribution in each condition into four equal bins (i.e., from fastest to slowest RTs; cf. Arend & Henik, 2015). In contrast to Arend and Henik (2015), the result pattern did not show evidence for a systematic influence of RT bin on the numerical effects of interest (neither in the pick smaller nor in the pick larger experiment). The differential result pattern may result from differences in the effects under investigation (size congruity effect versus distance and compatibility effects) and in number range (single- vs. multi-digit numbers). For the interested reader, results of these analyses are provided in the Supplementary Materials.
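A bin analysis of this kind can be sketched as follows for one participant and one numerical effect. The helper names are hypothetical, and the way bin boundaries are set when the number of trials is not divisible by four is an assumption of this example; the authors' exact procedure is documented in their Supplementary Materials.

```python
from statistics import mean


def bin_means(rts_ms, n_bins=4):
    """Sort RTs from fastest to slowest and return the mean RT per bin."""
    ordered = sorted(rts_ms)
    n = len(ordered)
    if n < n_bins:
        raise ValueError("need at least one trial per bin")
    # Integer bin edges, so that bin sizes differ by at most one trial.
    edges = [i * n // n_bins for i in range(n_bins + 1)]
    return [mean(ordered[edges[i]:edges[i + 1]]) for i in range(n_bins)]


def effect_per_bin(rts_baseline, rts_contrast, n_bins=4):
    """Per-bin effect, e.g. incompatible minus compatible mean RT."""
    base = bin_means(rts_baseline, n_bins)
    contrast = bin_means(rts_contrast, n_bins)
    return [c - b for b, c in zip(base, contrast)]


if __name__ == "__main__":
    compatible = [610, 640, 655, 700, 720, 760, 810, 905]    # made-up RTs in ms
    incompatible = [625, 660, 670, 715, 745, 790, 840, 950]  # made-up RTs in ms
    print(effect_per_bin(compatible, incompatible))
```

An effect that grows from the first to the last bin would indicate a dependence on response speed, whereas a flat profile corresponds to the pattern described above.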

Discussion
In a conceptual replication attempt of the study by Bahnmueller et al. (2015), the present study aimed at evaluating the generalizability of basic effects in multi-digit number processing across marked and unmarked task instructions. Overall, we replicated effects of hundred distance, hundred-decade compatibility, and hundred-unit compatibility that were previously reported using an unmarked task instruction (i.e., pick the larger number, cf. Bahnmueller et al., 2015) in a three-digit number comparison task using a marked task instruction (i.e., pick the smaller number). Results showed no significant difference in overall reaction times between the comparison tasks using the marked (pick smaller) and the unmarked (pick larger) task instruction. Additional Bayesian analyses provided weak evidence that linguistic markedness of the task instruction did not affect overall reaction times. Moreover, no evidence for a difference between experiments in the size of any of the numerical effects was observed. These results were confirmed by Bayesian analyses providing moderate evidence against the interaction of task instruction and the respective numerical effect. Taken together, our data suggest that distance and compatibility effects and, with them, componential processing of multi-digit numbers are largely unaffected by variations of the linguistic markedness of task instructions.

Numerical Effects and Task Instruction
In line with previous observations regarding three-digit number comparison tasks (Bahnmueller et al., 2015, 2016; Huber et al., 2013; Korvorst & Damian, 2008; Mann et al., 2012), we replicated both the hundred-decade and the hundred-unit compatibility effect as well as the effect of hundred distance in the pick smaller experiment. Importantly, effect sizes observed in the pick smaller experiment were very similar to those observed in the pick larger experiment, and the interaction between task instruction and the numerical effects of interest was not significant. Moreover, Bayesian analyses provided moderate evidence against an influence of linguistic markedness on the three numerical effects under investigation. Thus, no major disruptions of the behavioural signatures of multi-digit Arabic number processing were observed when participants were confronted with a marked task instruction. Thereby, the present study provides further evidence for the robustness of the numerical effects under investigation and suggests that these numerical effects do not seem to be bound to specific experimental setups. Furthermore, as indexed by significant compatibility effects resulting from interference due to the decision irrelevant tens/unit digit, the present study provides evidence towards the componential processing account put forward for multi-digit number processing (cf. Huber et al., 2016).

General Performance and Task Instruction
However, in contrast to previous findings in single- and two-digit number comparison (Arend & Henik, 2015; Verguts & De Moor, 2005), we did not detect reliable differences in overall response times in frequentist analyses. Although the Bayesian analysis supports the null model, the evidential value is relatively weak. Thus, it is possible that with a larger sample the direction of the evidence would change, providing evidence for an effect of linguistic markedness. However, given our sample size, this scenario seems rather unlikely. What we can conclude is that an effect of linguistic markedness on general reaction times, if it exists, must be rather subtle. Furthermore, as overall reaction times were comparable between experiments, the mechanism through which we anticipated modulations of the compatibility effects (i.e., longer reaction times when confronted with the marked task instruction resulting in more elaborated processing of a stimulus and, therefore, increased interference due to the irrelevant tens/unit digit in incompatible trials) could not be demonstrated.
Moreover, it seems that most participants in the pick smaller experiment adapted fairly well to the marked task instruction. Interestingly, in the pick smaller experiment, four participants had to be excluded from the analyses because they consistently picked the larger number although instructed to pick the smaller one. Similar confusions did not occur in the pick larger experiment. Thereby, our results may suggest that, when comparing numbers beyond the two-digit number range, following a marked task instruction relies on an initial categorical internalization of the task instruction rather than on a continuous, ongoing conflict or source of interference throughout the comparison task. As this account is rather speculative, future studies might consider manipulating linguistic markedness of the task instruction in within-participant designs, for instance, using a task switching paradigm (cf. Shaki et al., 2012). In such a task switching paradigm participants would have to switch between marked and unmarked task instructions when comparing numbers on a trial-by-trial basis. This would allow for evaluating whether marked task instructions indeed influence multi-digit number processing on a trial-by-trial basis when an initial categorical internalization of the task instruction is not possible.

Conclusion
Taken together, we successfully replicated main results reported by Bahnmueller et al. (2015) showing that distance and compatibility effects in a three-digit number comparison task generalize across marked and unmarked task instructions. Crucially, linguistic markedness of task instructions did not seem to influence basic numerical processing, as the size of numerical effects was comparable between experiments using a marked compared to an unmarked task instruction. In particular, results suggest that basic strategies in three-digit number processing are rather robust against variations of the linguistic markedness of task instructions.