The Ups and Downs of Black and White: Do Sensorimotor Metaphors Reflect an Evolved Perceptual Interface?

ABSTRACT The Implicit Association Test (IAT) was used to measure population levels of conceptual alignment among two polar sensory metaphors and clusters of concepts to which they are commonly applied. A total of 873 participants were tested online, to compare within- and between-cluster alignments of concepts associated with two different polar sensory metaphors (up/down and black/white). IAT results were sensitive to semantic alignments that were also picked up by Latent Semantic Analysis (LSA) using a large-scale corpus of English. However, even with these semantic alignments taken into account, the dual categorization results demonstrated strong metaphor-cluster alignments over and above the predictions of LSA. It is proposed that, rather than ontogenetic development of metaphoric concepts based on sensorimotor experience, conventionalized sensorimotor metaphors in language may be cognitive tools that provide language learners with insights supported by conceptual properties of a phylogenetically evolved perceptual interface.

Max Black (1955) argued that, rather than regarding metaphor suspiciously in philosophical and scientific discourse, we should recognize that good metaphors provide a new perspective. Contrary to the idea that metaphoric meaning was simply a substitute for what could be expressed literally, Black's interaction theory proposed that strong metaphors create cognitions not possible by other means. Thus, although the words "black" and "down" can both be used in contexts where the word "bad" might suffice, interaction theory proposes that each of these specific metaphoric uses conveys something more precise because they are related to different "system[s] of associated commonplaces" (Black, 1955). Thus, in Black's theory (1955; 1977), metaphoric statements have the potential to create new cognitions, and even conventional metaphors might be thought of as important cognitive tools already established in the language. In this paper, we argue that sensory metaphors, in particular, may derive some of their meaning from the evolved sensory interface that humans share.
For example, English is rife with metaphoric appeals to a conceptual alignment between mood and the vertical axis: A day spent with a good friend can lift one's spirits; an e-mail delivering bad news can send them plummeting. These expressions contain a consistent mapping between spatial direction and emotion, where a good mood is linked with an upward direction and a bad mood is linked with a downward direction (Lakoff & Johnson, 1980a; Nagy, 1974). In emphasizing the importance of metaphor for structuring thought, Conceptual Metaphor Theory (CMT; Lakoff & Johnson, 1980a), which is a type of interaction theory because it asserts that metaphors create meaning, proposes that sensorimotor experience, among other things, can provide grounding metaphors by which humans can understand other aspects of the world. CMT suggests that metaphoric mappings underlie our conceptual understanding of the world, and this conceptual understanding is generative, resulting in multiple idiomatic expressions referring to the same underlying metaphoric structure. Thus, in this theory, sensorimotor metaphors structure our thoughts. In support of this idea, a great deal of evidence for the interaction of sensorimotor space and emotional or social interpretation has been adduced by proponents of CMT (e.g., Crawford, Margolies, Drake, & Murphy, 2006; Dudschig, de la Vega, & Kaup, 2015; Meier & Robinson, 2004).
However, there are many other possible forms of interaction theory (Kittay & Lehrer, 1981; see also Landau et al., 2010; Tourangeau & Rips, 1991) and there are also theories that provide alternative accounts of the generativity of metaphoric ideas. For example, Gentner and colleagues (e.g., Bowdle & Gentner, 2005; Gentner & Grudin, 1985; see also Murphy, 1997) suggest that consistent metaphoric mappings are common in everyday language because they take advantage of existing structural alignments between different domains (see also Tourangeau & Sternberg, 1982). Becoming fluent in metaphorical language may help to conventionalize abstract conceptualizations (Bowdle & Gentner, 2005). Moreover, clusters of similar analogical alignments (extended metaphors) may develop for communicative efficiency (Thibodeau & Durgin, 2008), and systematic alignments might also act as embedded cognitive tools in language that convey accumulated cultural knowledge to language learners.
In this paper we will propose a hybrid interactionist theory that might reconcile the structural alignment perspective (e.g., Gentner, 1983) and CMT in a new way. One major difference between CMT and structural alignment theories concerns whether the concepts that are understood by means of conventional metaphors are grounded in our various sensorimotor experiences of the world, or whether metaphors contribute structural analogies that are merely combined with other concepts. This contrast may be a false dichotomy. Modern theories of perception suggest that evolution has created and tuned our experienced perceptual categories, making them essentially proto-conceptual (Hoffman, Singh, & Prakash, 2015). Thus, it is possible that even metaphors based on sensorimotor experience can be understood as appealing to preexisting structural analogies. This synthesis is an interactive theory because it allows that conventional sensorimotor metaphors appeal to existing structural analogies, yet nonetheless alter the concepts they become associated with by providing new insights into them. In this paper, we will use the Implicit Association Test (IAT; Greenwald, McGhee, & Schwartz, 1998) to test this idea by examining whether concepts that are structured by a common sensorimotor metaphor are strongly aligned to one another. This hypothesis will receive preliminary support. Moreover, the data gathered in this process provide an interesting perspective on the possible effects of metaphor on implicit cognitive biases (see also Thibodeau & Boroditsky, 2011).

Is experience constructed from evolved sensorimotor concepts?
New (and old) theories of perceptual experience suggest that much of sensorimotor experience is a kind of evolved conceptualization or interface (Hoffman, Singh, & Prakash, 2015; see also the concept of Umwelt, e.g., Clark, 2009). The interface theory of perception is based on the idea that evolved perceptual experience is analogous to a computer interface in which icons have locations and colors on the computer screen that simply do not reflect the real characteristics and locations of the relevant "file" within the computer. Despite the widespread tacit assumption that perception must be veridical (accurate) to be useful, the modeling of evolutionary processes suggests that utility and accuracy essentially always diverge (Hoffman, 2018; though see Berke, Walter-Terrill, Jara-Ettinger, & Scholl, 2022). Evolved perceptual interfaces highlight information that is of utility in ways that might include gross distortions, or even completely "fictional" properties. Color is perhaps the most well-known example. Light itself (i.e., electromagnetic radiation in the "visible spectrum") doesn't have color, but our trichromatic visual systems convert ratios of receptor activities into a kind of fictional experience that seems very real to us. Many theorists of perception agree that space, too, seems to be structured in our experience in ways that are more tuned to the service of survival than to the actual geometry of the world (Durgin & Li, 2017; Jackson & Cormack, 2007; Proffitt et al., 1995). Thus, our sensorimotor experiences of both color and space seem to be products of evolved abstraction.
Solomon Asch sought to discover if the conceptual structure of sensory experience (Hoffman's "interface") was relatively universal, by examining whether sensorimotor metaphors that are applied to persons (such as warm, cold, high, low, black, and white) had similar interpretations across languages (Asch, 1955; Asch, 1958). He asked scholars of six essentially unrelated languages to search for instances of social sensory metaphors using these terms. Although presented only qualitatively, Asch's data seemed to show that there was a great deal of similarity across languages, such that even when the social metaphoric meanings of sensory qualities differed (e.g., sour person as a spurned lover, a cold person as a lonely person), the metaphors were still easily interpreted by English speakers. This is consistent with the idea that sensory experience is a kind of quasi-conceptual interface.

Grounding vs. structural alignment and the interface theory
The folk psychological idea of perception is that sensory experience is an unmediated veridical representation of the world. Lakoff and Johnson (1980a) proposed that metaphors serve to ground abstract concepts in sensory experience, but they emphasized that an ontogenetic developmental trajectory grounded these metaphors. For example, they suggested that up represents more because of the way putting more of something into a pile results in a higher pile. Their explanation of why up represents happy refers to postural associations (upright when happy vs. slouched when depressed). This kind of argument was used to provide a developmental story of why we understand more and happy to be up and is also claimed to ground the concepts of more and happy in a sensory understanding of up.
In general, the claim that basic metaphors can be accounted for by developmental origin stories has been controversial. Keysar and Bly (1995, 1999) have argued that such post hoc explanations of metaphoric alignment are cognitive illusions. They used antiquated idioms (e.g., "the goose hangs high") in a study in which they showed that once an interpretation was attached to such idioms (e.g., either positive or negative), participants taught one of the two interpretations would later argue that the alternative interpretation would be unsatisfactory. Keysar and Bly's findings suggest that post hoc developmental explanations are not good evidence for developmental grounding stories. Moreover, Murphy (1996, 1997) has argued that there seems to be no explanatory advantage in trying to establish metaphors as a source of grounded meaning because one still has the task of keeping the metaphoric and literal meanings separate. Gentner (1988) has proposed that conventional metaphor contributes to the understanding of existing concepts by processes of structural analogy, and thus concepts that are structured by association with the same metaphor ought to have overlapping interpretations. This position allows common abstract structure that is not captured in the origin stories, and does not appeal to grounding, thus avoiding Murphy's (1996) principal concern.
In explaining up/down metaphors, H. Clark (1973) has already provided a detailed analysis of why the vertical dimension (up) is ideally suited to representing magnitudes generally (rather than preferences). Just as prices can go up or down (in magnitude) from their current level, mood can go up or down (in quantity of happiness) from its current level. Perhaps the metaphoric association of up with happy implies that one of our standard instrumental understandings of mood (with respect to valence) is thus a magnitude representation with ground level as neutral (neither up nor down). This would not be a grounding of the meaning of happy, but an enrichment through shared abstraction and implicit analogy (Gentner, 1983). If two categories associated with similar sensorimotor metaphors show shared structural alignment (e.g., more/less and happy/sad), this would support the central idea of interaction theory that strong metaphors act as cognitive tools that affect our understanding of the world (Black, 1977).

Measuring alignment with the IAT
The IAT (Greenwald, McGhee, & Schwartz, 1998) seems to be a candidate measure of alignment across domains defined by pairs of terms, such as up/down or black/white. In a typical IAT, participants are asked to simultaneously perform two categorization tasks. For example, they might sort presented items into one of four categories (e.g., black vs. white and positive vs. negative), but these categories are reduced to two responses (typically a left button press and a right button press). When both "black" and "positive" are assigned to the same button (and "white" and "negative" to the alternative button), responses are typically slower than when the categories "white" and "positive" are paired. The items presented for categorization can be, differentially, words (e.g., for positive and negative) and pictures (e.g., for black and white), so that it is unambiguous which category is being selected, though other forms of the test (e.g., all words) are also used. Interestingly, an equally strong alignment effect can be found when the black or white color of a square is used rather than faces of African-Americans and European-Americans (Smith-McLallen et al., 2006; see also Williams and Morland, 1976). This seems consistent with the idea that the IAT may be sensitive to the metaphoric alignment of the words white and black with positive and negative valences. Indeed, previous work has used the IAT to measure associations between up/down and social status (Gagnon, Brunye, Robin, Mahoney, & Taylor, 2011).
Note that there is controversy over whether the IAT is a suitable instrument for studying individual differences (e.g., in racism) on the grounds, for example, that race IATs do not have adequate test-retest reliability to qualify as measures of individual differences (Schimmack, 2021). These criticisms, which seem important, do not apply to our intended use of the IAT, inasmuch as we are not seeking to differentiate the alignment scores of individual participants, but rather to measure a mean alignment across many participants. There is good reason to believe that IATs are successful at picking up group differences concerning the categorical labels used (De Houwer, 2001), and we are seeking to use them to discriminate among items measured on groups of speakers. To be more confident, we conducted a pilot experiment to validate our novel use of the method.
In this pilot experiment (see supplemental materials, Experiment S1), we examined IAT scores for five different concept pairs in conjunction with up/down. Three of the pairs were concepts metaphorically associated with up/down according to Lakoff and Johnson (1980b): happy/sad, more/less, and healthy/sick, and two pairs were used as controls: liquid/solid and fruit/vegetable. As expected, the evidence of alignment (the mean D-score) was strong for the metaphorically associated terms, and was weak for the controls. Thus, it appears that the IAT can be used to measure alignment strength.
Indeed, a further observation of this pilot experiment was that the mean D-scores across the five items were correlated with a simple measure of alignment based on statistical models of semantic content in a large natural language corpus (Latent Semantic Analysis; LSA; Landauer & Dumais, 1997; Landauer, Foltz & Laham, 1998). This seemed to suggest that the IAT can pick up relationships within language itself, supporting our use of the IAT as a measure of alignments present in language, and supporting Kintsch's (2000, 2008) idea that vector models of meaning in language are able to capture metaphoric meanings to some extent. With this initial evidence in hand, we selected two sets of metaphors that draw on different sensorimotor experience to compare the strength of between- vs. within-cluster alignments.

The present study
We created a design in which seven distinct category pairs were factorially crossed with each other. Two of the category pairs were sensorimotor metaphor terms referring to space and to color (up/down and black/white). Much as psychological (happy/sad) and physical (more/less) evaluations of quantity seem to be the primary pair of domains for which vertical orientation is used, surface lightness is particularly associated with two domains representing psychological (moral: good/evil) and physical (clean/dirty) domains of purity (e.g., Sherman & Clore, 2009). Of the remaining five categories, two were known to be related to up/down (happy/sad, more/less), two were known to be related to black/white (evil/good, dirty/clean; Berne, 1959), and the seventh category was a control pair (fruit/vegetable).
Our goal was to determine whether mean IAT scores for each of the 21 unique pairings of these 7 pairs with each other would (1) be predicted by LSA alignment scores (see supplementary materials), or would (2) additionally show stronger effects for pairs of concepts related to the same sensorimotor metaphor, as predicted by the idea that sensorimotor metaphors are based on quasi-conceptual structures posited by the interface theory of perception (Hoffman, 2018).
Before proceeding to the design of our experiment, we computed LSA alignment scores for all the 21 possible domain pairings using the category labels. This alignment score (for details, see the method section below) is the difference between the sum of the cosines between the concepts of the same valence and the sum of the cosines between the concepts with opposed valences. The mean scores are shown in Figure 1. Consistent with a primary sensitivity to valence (or the evaluative dimension, Osgood, 1952), we observed that LSA alignment scores did not strongly differentiate within-metaphor alignments from cross-metaphor alignments. The theoretical question of the experiment was therefore whether paired domains from within a metaphor cluster would in fact show stronger IAT alignment effects than paired domains from different metaphor clusters. This prediction was contrasted with the null hypothesis (based on Experiment S1 in the supplemental materials) that IAT D-scores would simply be related to their LSA alignment scores. The further question addressed by the study would be whether, indeed, LSA alignment scores predicted the D-scores at all, in this more complex design.

Method
This study and those in the supplementary materials were approved by the local Institutional Review Board (protocol 14-15-018-O). Although not pre-registered, the main experiment reported here (and the two experiments reported in the supplementary materials) represents all of the data we collected on this project (i.e., the results are not selectively reported from among a larger set of obtained results). All the data were collected between September of 2016 and April of 2017.
In order to more easily test 21 pairings, we first conducted a second pilot study (Experiment S2 in the supplementary materials) to ensure that all pairings could be based on the same simple set of stimuli. A previous study has argued that using a single item to represent a category is less effective than using multiple items (Nosek, Greenwald & Banaji, 2005). However, the evidence for this claim had used the category label itself as the sole item, which seemed to confound singleness with identity. In the second pilot experiment, we tested both visual (symbols) and verbal items, comparing D-scores for IATs using either 10 items for each category or a single item (not the category label). The highest D-scores were observed when only a single iconic item was used for each category (up and down arrows for up and down, and smiley and frowny faces for happy and sad). Based on this evidence, we developed seven pairs of iconic images that could be easily paired as stimuli across all 21 pairings of the main experiment.

Participants
We sought to have at least 40 participants for each of the 21 conceptual pairings. The participants included 892 registered MTurk users who were paid $2 to participate. In order to access the study, users had to be located within the United States and have a minimum task approval rate of 90%. Nineteen of the recruited participants were excluded based on criteria developed in our pilot experiments. Specifically, participants who answered more than 25% of the total trials incorrectly or skipped more than 10% of the trials were excluded from the analysis and replaced. Following the recommendation of Greenwald, Nosek, and Banaji (2003), we also dropped participants who responded within 300 milliseconds on more than 10% of the trials. This left 873 participants in the final analysis (with 40-43 participants per pairing). Forty-six percent of the participants identified as male. The mean age of the participants was 38 years, with a range from 19 to 76 years. All but thirteen participants (1.5%) reported that English was their primary language by the age of 5 or earlier.
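The three participant-level exclusion rules described above can be summarized in a short sketch. This is only an illustration under stated assumptions: the function name, its arguments, and the use of the participant's total trial count as the denominator for every rule are hypothetical choices, not the authors' actual screening script.

```python
def should_exclude(n_trials, n_incorrect, n_skipped, n_fast):
    """Return True if a participant meets any exclusion criterion.

    Hypothetical helper. n_fast counts responses made within 300 ms.
    Assumption: each rule uses the participant's total trial count
    as its denominator.
    """
    too_many_errors = n_incorrect / n_trials > 0.25   # more than 25% incorrect
    too_many_skips = n_skipped / n_trials > 0.10      # more than 10% skipped
    too_many_fast = n_fast / n_trials > 0.10          # more than 10% very fast
    return too_many_errors or too_many_skips or too_many_fast


# Illustrative counts only (7 blocks of 20 trials would give 140 trials).
print(should_exclude(140, 36, 0, 0))
print(should_exclude(140, 10, 2, 3))
```

Because each rule is phrased as "more than" a threshold, a participant sitting exactly at a threshold is retained in this sketch.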

Materials
Each category of a domain was represented by a single black and white symbol during the task (Figure 2). For example, we used a schematic symbol of an angel with a halo to represent good, and a schematic symbol of a devil with horns to represent evil. For fruit/vegetable, we used collections of different fruits and vegetables enclosed by a circle to represent the superordinate categories of fruit and vegetable (Nosek et al., 2007). Each of the seven domains was paired with each of the others, resulting in 21 unique pairings, of which 6 were within-cluster, 9 were between-cluster and 6 were control pairings.
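The breakdown of the 21 pairings follows directly from the cluster membership of the seven domains, as the following sketch shows. The cluster names ("vertical", "lightness") and the dictionary layout are labels chosen here for illustration; only the domain pairs themselves come from the design described above.

```python
from itertools import combinations

# Cluster membership as described in the text; the cluster names are
# illustrative labels, not terms used by the authors.
CLUSTER = {
    "up/down": "vertical",
    "happy/sad": "vertical",
    "more/less": "vertical",
    "black/white": "lightness",
    "good/evil": "lightness",
    "clean/dirty": "lightness",
    "fruit/vegetable": "control",
}


def pairing_type(a, b):
    """Classify a pairing of two domains as within, between, or control."""
    cluster_a, cluster_b = CLUSTER[a], CLUSTER[b]
    if "control" in (cluster_a, cluster_b):
        return "control"
    return "within" if cluster_a == cluster_b else "between"


def count_pairings():
    """Count every unordered pairing of the seven domains by type."""
    counts = {"within": 0, "between": 0, "control": 0}
    for a, b in combinations(CLUSTER, 2):
        counts[pairing_type(a, b)] += 1
    return counts


print(count_pairings())  # {'within': 6, 'between': 9, 'control': 6}
```

Each cluster of three domains yields three within-cluster pairings, the two clusters cross to give nine between-cluster pairings, and the control domain pairs with each of the other six.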

Latent semantic analysis alignment scores
In Experiment S1 we found evidence that IAT scores were correlated with the statistics of language usage. We had estimated the alignment strength between different domain pairings using LSA (Landauer et al., 1998). LSA is a computational method that extracts occurrence patterns of words from existing texts and determines the similarity between words by analyzing all contexts in which a word does and does not appear. Each word can be represented as a high-dimensional (e.g., 300-dimensional) vector. LSA has shown human-like performance on different linguistic tasks such as priming semantically-related items, judging essay quality, as well as categorizing words (see Günther, Rinaldi, & Marelli, 2019).
We computed the alignment score between each pairing of word pairs, using the LSA cosine similarities. The linguistic space ("EN_100k_lsa") was built from a corpus containing about two billion tokens (across 5.4 million documents) from the British National Corpus, the ukWaC Corpus, and a 2009 Wikipedia dump, which were implemented in the R package LSAfun (Günther, Dudschig, & Kaup, 2015). The alignment score between two pairs of words (e.g., up/down and happy/sad) was defined as the sum of the similarities between the valence-aligned categories (e.g., up and happy, and down and sad), minus the sum of the similarities between the nonaligned categories (e.g., up and sad, and down and happy).
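The alignment score defined above can be sketched in a few lines. The original analysis was carried out in R with LSAfun; the Python below is only an illustrative re-expression of the definition, and the function names and the toy two-dimensional vectors are made up for the example rather than taken from the "EN_100k_lsa" space.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_score(vectors, pair_a, pair_b):
    """Aligned similarities minus nonaligned similarities.

    pair_a and pair_b are (first pole, second pole) tuples whose poles
    are listed in valence-aligned order, e.g. ("up", "down") and
    ("happy", "sad").
    """
    a1, a2 = pair_a
    b1, b2 = pair_b
    aligned = cosine(vectors[a1], vectors[b1]) + cosine(vectors[a2], vectors[b2])
    nonaligned = cosine(vectors[a1], vectors[b2]) + cosine(vectors[a2], vectors[b1])
    return aligned - nonaligned


# Toy vectors (hypothetical values chosen so that the poles align perfectly).
toy = {
    "up": [1.0, 0.0], "down": [0.0, 1.0],
    "happy": [1.0, 0.0], "sad": [0.0, 1.0],
}
print(alignment_score(toy, ("up", "down"), ("happy", "sad")))  # 2.0
```

With real word vectors the score ranges between -4 and 4, and reversing the order of the poles in one pair flips its sign.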

IAT task and procedure
As in a standard IAT experiment, each participant completed 7 blocks in the order shown in Table 1 (Greenwald, Nosek, & Banaji, 2003). The critical blocks were the 3rd and 4th blocks as well as the 6th and 7th blocks, which measured implicit association for the aligned (e.g., the same response key for happy and up) vs. nonaligned (the same response key for happy and down) pairings.
The experiment was programmed in the online library, PsyToolkit (Stoet, 2010, 2017), using a black background with white text for the category labels. At the beginning of each trial, a fixation cross appeared on the center with labels on each side (one label on each side for the practice blocks and two for the critical blocks). Labels were the category names of each domain (e.g., happy and sad or up and down) and stayed on the screen for the entire trial, whereas the fixation disappeared after 200 ms. After another 200 ms, the stimulus, one of the images shown in Figure 2, appeared on the center of the screen. Participants were instructed to match each stimulus with one of the labels presented on the screen by pressing either the "e" (for the left) or "i" (for the right) key on the keyboard. The stimulus stayed on the screen for a maximum of 3000 ms or disappeared when a response was made.
After responding, participants received feedback in the form of a green "✓" mark for 300 ms if they categorized the stimulus correctly and a red "x" mark that flashed for 2000 ms if they categorized the stimulus incorrectly or did not respond; because paid participants tend to seek to minimize time on task, we designed the experiment to proceed faster if fewer errors were made (see the supplementary materials). Within each block, items were presented in random order, with each item shown an equal number of times across the 20 trials per block. Participants were randomly assigned to one of the 21 pairings using TurkPrime (Litman, Robinson & Abberbock, 2017). Within each pairing, the order of which domain appeared in the first block (note that response keys/labels for this domain switched in the fifth block) was counterbalanced across participants. The experiment took 8-12 min, and was followed by a few demographic and language usage questions.

Data analysis
Our analysis was restricted to 80 critical trials from blocks 3, 4, 6, and 7. Among retained participants, individual trials were excluded if they received an incorrect response (n = 3217; 4.6% of trials), no response (n = 350; 0.5% of trials), or a response time of less than 301 ms (n = 29; 0.04% of trials). Excluded trials were not replaced. After these exclusions, IAT D-scores for each participant were computed using the improved scoring method of Greenwald, Nosek, and Banaji (2003): Namely, for each participant, we subtracted their mean response time during each of the two "aligned" blocks from that of each "non-aligned" block (i.e., block 3 vs. 6, and block 4 vs. 7), resulting in two difference scores. Each of these two difference scores was then normalized by dividing it by the standard deviation of response times pooled across the two blocks that contributed to that difference score (i.e., SD of blocks 3 and 6, and SD of blocks 4 and 7). The final D-score used for analysis for the participant was the mean of the two normalized difference scores. D-scores were then analyzed using linear mixed effects regression (Bates, Mächler, Bolker, & Walker, 2014). The model included fixed effects of Cluster (within, between, and control), and normalized LSA alignment scores, as well as a random intercept for item. Contrasts for the Cluster fixed effect were dummy coded with within-cluster being the baseline reference. The p-value estimates were obtained using Satterthwaite's approximation of degrees of freedom (see Luke, 2017) in the lmerTest package for R (Kuznetsova, Brockhoff & Christensen, 2017; R Core Team, 2019).
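The D-score computation described above can be sketched as follows. This is a minimal illustration rather than the authors' analysis script: the function name and argument layout are hypothetical, the inputs are assumed to be response times (in ms) that have already passed the trial-level exclusions, and the pooled standard deviation is assumed to be the sample SD (n - 1 denominator) of all trials in the two contributing blocks.

```python
from statistics import mean, stdev


def d_score(aligned_first, nonaligned_first, aligned_second, nonaligned_second):
    """Mean of two normalized (non-aligned minus aligned) RT differences.

    Each argument is a list of response times for one critical block.
    The "first" pair corresponds to, e.g., blocks 3 and 6, and the
    "second" pair to blocks 4 and 7; which block is the aligned one
    depends on counterbalancing.
    """
    def normalized_difference(aligned, nonaligned):
        # Assumption: SD is computed over all trials of both blocks combined.
        pooled_sd = stdev(list(aligned) + list(nonaligned))
        return (mean(nonaligned) - mean(aligned)) / pooled_sd

    first = normalized_difference(aligned_first, nonaligned_first)
    second = normalized_difference(aligned_second, nonaligned_second)
    return mean([first, second])


# Made-up response times: slower responding in the non-aligned blocks
# yields a positive D-score.
example = d_score([600, 700], [800, 900], [620, 680], [790, 910])
print(round(example, 3))
```

A positive score indicates faster responding when the two category pairs share response keys in the aligned configuration; swapping the aligned and non-aligned blocks reverses the sign.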

Results
The mean D-score data by cluster type are plotted in Figure 3. Table 2 shows the mean D-score for each pairing. Consistent with the correlation observed in Experiment S1, there was a significant effect of LSA alignment score, β = 0.097, SE = 0.039, t(17.0) = 2.51, p = .023. This confirms that the present paradigm, using pictorial symbols as test items, was indeed sensitive to semantic alignments of the verbal category labels that were identifiable in statistical models of language. That is, the IAT can pick up regularities associated with language itself.
Crucially, however, even with these LSA alignment scores accounted for, D-scores for within-cluster pairs showed much larger alignment effects (M = 0.67) than both the between-cluster pairs, M = 0.26, β = −0.286, SE = 0.077, t(17.0) = 3.71, p = .002, and the control pairs, M = 0.35, β = −0.275, SE = 0.100, t(17.0) = 2.75, p = .014. A secondary analysis confirmed that the between-cluster pairs did not differ significantly from the control pairs, t(12.0) = 0.24, p = .811. In other words, over and above the semantic alignments detected by LSA, our D-scores were additionally predicted by the differential metaphor clusters relating to up/down or black/white.
However, the strongest test of our hypothesis concerns the six pairings between the target domains governed by the same metaphor (e.g., more/less with happy/sad) and those governed by different metaphors (e.g., more/less with good/evil), excluding the metaphor pairings themselves. Because there were only two such within-cluster pairings (one for each cluster), and only four such between-cluster pairings, we chose to make signed predictions, so that we could use a more sensitive, 1-tailed test. The analysis was otherwise structured exactly as the previous analyses. As before, there was a significant effect of the LSA alignment score in the expected direction, β = 0.091, SE = 0.034, t(2.97) = 2.67, p = .038, among these items. More importantly, even with the effect of LSA alignment taken into account, D-scores for within-cluster pairs showed larger alignment effects (M = 0.69) than the between-cluster pairs (M = 0.51), β = −0.172, SE = 0.072, t(2.98) = 2.40, p = .048.

Testing alternative vector models
Although our experimental analysis was planned based on the use of LSA, we subsequently sought to see how general the results would be by testing two other vector models, given that there is recurrent interest in the power of vector models to explain metaphor (e.g., Nick Reid & Katz, 2018). Using the PsychWordVec R library (Bao, 2023; Bao et al., 2023) we generated alternative alignment scores using two other types of vector model: word2vec, using the "word2vec_googlenews_eng_1word.RData" model (Mikolov, Chen, et al., 2013; Mikolov, Sutskever, et al., 2013), and GloVe, using the "glove_-commoncrawl_300d.RData" model (Pennington et al., 2014). In each case, we computed the alignment measure just as we had for LSA. The between-metaphor cluster was used for the baseline in these new comparisons, given that it had the most unique items (9 pairings). This streamlines the analyses without affecting the conclusions. For LSA, this version of the LMER model still, of course, showed that within-metaphor-cluster IAT scores were much higher than between-cluster IAT scores, β = 0.29, SE = 0.077, t(17.0) = 3.71, p = .0017, even with LSA alignments taken into account, while LSA also remained predictive of additional variance, β = 0.097, SE = 0.092, t(17.0) = 2.51, p = .023.
Two additional LMER models, using these other two vector models of English, also showed that within-cluster pairings predicted stronger IAT D-scores than the between-cluster pairings, both when normalized word2vec-derived alignment scores were included in the model, β = 0.39, SE = 0.072, t(17.0) = 5.35, p < .0001, and when normalized GloVe-derived alignment scores were included in the model, β = 0.28, SE = 0.076, t(17.0) = 3.70, p = .0018. In other words, these additional analyses replicate the observation that, within the present study, belonging to a common metaphor cluster strongly predicted greater IAT-measured alignment (as contrasted with pairings between the clusters), a result that was not explained by alignment scores derived from any of the vector models we have tested. At the same time, both these LMER models confirmed that, across the 21 pairings tested, the IAT was simultaneously sensitive to the alignments that were computed from these vector models, both for word2vec, β = 0.114, SE = 0.034, t(17.0) = 3.36, p = .0038, and for GloVe, β = 0.092, SE = 0.035, t(17.0) = 2.64, p = .017.

Discussion
The present experiment was designed with two questions in mind. Our primary theoretical question concerned the possibility that categories associated with similar sensorimotor metaphors (e.g., vertical direction) might show evidence of shared structural alignment. We expected this because sensorimotor experience is already quasi-conceptual (Hoffman, 2018) and thus supports structural alignment based on those underlying categories. We contrasted this view with the possibility that metaphoric language is based on direct experience of the world, as implied by the origin stories of Lakoff and Johnson (1980a). Our goal was not to disprove Lakoff and Johnson's CMT, but to provide a new way of looking at how the sensory grounding of metaphoric meaning emerges (phylogenetically rather than ontogenetically). A secondary, but theoretically important, question from our perspective was whether the alignment strength of the category labels, as assessed by vector models of language such as LSA, affects IAT scores. Kintsch (2000, 2008) has proposed that at least some of the work that metaphors do can be captured by vector models of language, like LSA, even though those vector models of language are based on databases that do not differentiate literal from figurative tokens.
With regard to our primary theoretical question, we found evidence that appears to support the idea that concepts structured by the same sensorimotor metaphors do show an elevated level of IAT score relative to the predictions of LSA alone. This result seems consistent with Gentner's (1983) structural alignment view of metaphors, as opposed to the origin stories by Lakoff and Johnson (1980a). The origin stories tend to predict that there might be little alignment across target domains that simply happen to use the same sensory metaphor (piling up as a source for "more is up" is very different from standing upright as a source of "happy is up"). Although Lakoff and Johnson do argue that some orientation metaphors (e.g., for health, for happiness, and for life) cohere into a general well-being structure, they do not include "more is up" in that framework. After all, even murder rates can be "up." In contrast, Gentner's views suggest that if there is a common abstract structural alignment between metaphors and their targets, then two targets that share a common metaphor might indeed become yoked, as our data suggest. This evidence, though preliminary, does provide some support for the idea that the quasi-conceptual nature of sensorimotor experience may serve as a source of common insight with regard to multiple target domains (Hoffman, 2018).
Our second theoretical question was whether alignment strength assessed by vector models of language such as LSA affects IAT scores. We found that LSA alignment significantly predicted the IAT scores across the 21 category pairings, replicating what we found in the initial pilot study. Thus, we regard this as strong, replicable evidence that the IAT is indeed sensitive to information about the category labels embedded in language. This observation provides support for Kintsch's (2000, 2008) claim that metaphoric meaning can be represented in vector models, and also shows that IATs can be sensitive to this meaning. Because our initial pilot studies found that IAT performance with vertical direction metaphors could be predicted in part by our LSA alignment scores, we reframed our primary question to ask whether the IAT picks up information that goes above and beyond what vector models of language can capture, a question addressed in the following section.
In considering the deviations between LSA alignment scores (Figure 1) and the observed IAT D-scores (Figure 3) for within-cluster vs. between-cluster pairings, it initially appeared that LSA may have exaggerated the evaluative alignment of within- and between-cluster pairs. On this account, the fact that the IAT differentiated between within- and between-cluster pairings suggested to us that the IAT was sensitive to something over and above what the LSA alignment scores revealed. That is, even with the LSA alignment effect taken into account, IAT scores were significantly higher for within-cluster pairings than for between-cluster pairings. However, it is logically possible that the differences between the IAT scores and the LSA alignment scores are actually due to a limitation of the IAT. That is, it is possible that what the IAT measures, rather than being over and above what LSA alignment makes available, is actually under and below what language makes available. Consider that participants in an IAT may develop strategies to primarily use one dimension, such as valence (e.g., Osgood, Suci & Tannenbaum, 1957), to align categories, resulting in less sensitivity to LSA. However, this explanation would seem to predict that LSA would end up having little predictive power, whereas it continued to serve as a significant predictor in the model.
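The logic of the "over and above" comparison can be illustrated with a minimal sketch. It assumes an ordinary least-squares model in which D-scores are predicted from an LSA alignment score plus a within-cluster indicator; the model actually fitted in the study may differ (for example, in how items are treated). The names `fit_ols`, `lsa`, `within`, and `d_scores`, and every number below, are hypothetical toy values constructed for illustration and are not data from the experiment.

```python
import numpy as np

def fit_ols(predictors, outcome):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    design = np.column_stack([np.ones(len(outcome)), predictors])
    coefs, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coefs

# Hypothetical toy values (NOT the study's data): one row per category pairing.
lsa = np.array([0.10, 0.20, 0.30, 0.40, 0.15, 0.25, 0.35, 0.45])  # LSA alignment
within = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 1 = both categories in same metaphor cluster

# Toy D-scores built so that BOTH predictors matter:
# an LSA slope of 1.0 plus a within-cluster boost of 0.3.
d_scores = 0.2 + 1.0 * lsa + 0.3 * within

# Fit D ~ LSA + within. A nonzero coefficient on `within` with LSA in the
# model is what "over and above LSA" means in this simplified setting.
intercept, b_lsa, b_within = fit_ols(np.column_stack([lsa, within]), d_scores)
print(f"LSA slope: {b_lsa:.2f}, within-cluster effect: {b_within:.2f}")
```

In this sketch the LSA slope stays nonzero alongside the within-cluster effect, which mirrors the pattern described above: the cluster effect is not absorbed by LSA, and LSA does not lose its predictive power.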
From this, we would argue that two main findings of the present study support an interactive theory of metaphor. First, there was a strong alignment between literal concepts expressed by the same sensory metaphor, suggesting common structural alignments rather than distinct experiential origins. This is not antithetical to CMT, but it does depart from the causal stories told by Lakoff and Johnson (1980a). Second, the conceptual alignment scores derived from LSA (and two other models) continued to predict IAT scores even though they did not account for the metaphoric alignment effect. This suggests that the instantiation of metaphoric alignment detected by the IAT is above and beyond the statistics of language tokens, even though those statistics also affect IAT performance. This claim is strengthened by the evidence that alternative vector models performed about as well as LSA at predicting IAT scores, but could not predict the effects of sharing a metaphoric cluster. This might be because these metaphoric alignments tap extralinguistic structures (i.e., the perceptual interface).

Conclusion
We have shown that the IAT can be used to measure and contrast metaphoric alignments. Moreover, IAT scores also seem to reflect the alignments between category labels in vector models of language, consistent with the proposals of Kintsch (2000, 2008). We propose that modern theories of perception, such as Hoffman's (2018) interface theory, imply that structural alignments based on an evolved perceptual interface could underlie (provide conceptual content to) many conventional sensorimotor metaphors. Like conceptual metaphor theory, our proposal is an interactive theory, and it supposes that conventional metaphoric mappings observable in language constitute cognitive tools for understanding the world. However, our proposal is that structural alignments (Gentner, 1983, 1988) based on an evolved perceptual interface can provide novel perspectives about domains to which they are applied. Good metaphoric alignments can become conventionalized metaphors (Bowdle & Gentner, 2005), which eventually become part of the language. Conventionalized extended metaphor systems may be thought of as cognitive tools, embedded in language, that can supplement our understanding by linking abstract ideas to our evolved perceptual interface.

Figure 1. Mean LSA alignment scores computed for the various pairings, with (by-item) standard error bars. Note that LSA seems to poorly differentiate the two sets of metaphor clusters.

Figure 2. The symbols used as test items to represent the 7 pairs of concepts.

Figure 3. Mean D-scores in the IAT with (by-item) standard error bars.

Table 1. Structure of IAT task.