Is Region-of-Interest Overlap Comparison a Reliable Measure of Category Specificity?

Analysis of the degree of overlap between functional magnetic resonance imaging-derived regions of interest (ROIs) has been used to assess the functional convergence and/or segregation of category-selective brain areas. An examination of the extant literature reveals no consistent usage for how such overlap is calculated, nor any systematic comparison between different methods. We argue that how ROI overlap is computed, especially the choice of the denominator in the formula, can profoundly affect the results and interpretation of such an analysis. To demonstrate this, we compared the overlap of the FFA-FFA (fusiform face area) and FFA-FGA (fusiform Greeble-selective area) in a localizer study testing both Greeble novices and experts. When using a single ROI as the denominator, we found a significant difference in FFA-FFA versus FFA-FGA overlap, consistent with the result of a previous study arguing for face specificity of the FFA [Rhodes, G., Byatt, G., Michie, P. T., & Puce, A. Is the fusiform face area specialized for faces, individuation, or expert individuation? J Cogn Neurosci, 16, 189-203, 2004]. However, these ROI overlap differences disappeared when the denominator combined both of the involved ROIs, and the patterns of such overlap comparisons were dependent on given statistical thresholds. We also found proportionally decreasing FFA-FFA overlap with increasing center-of-FFA distance, resolving an apparent contradiction between the consistency of the location of the FFA and the seemingly low FFA-FFA overlap. Finally, Monte Carlo simulations revealed the most stable formula (the most resistant to ROI size variations) to be the average of the two single-ROI-denominator-based overlap indices. In sum, ROI overlap analysis is not a reliable tool for assessing category specificity, and caution should be exercised with regard to ROI overlap definition, underlying assumptions, and interpretation.

In a similar vein, the present study focuses on a particular form of conjunction analysis: the region-of-interest (ROI) overlap comparison. Specifically, an ROI overlap index is the number of overlapping voxels between two ROIs divided by a common denominator. By comparing the overlap percentage across different tasks or conditions, a number of studies have used this overlap index difference to infer the degree of convergence or segregation between functionally defined ROIs. For example, O'Craven and Kanwisher (2000) measured the percentage of overlap of the fusiform face area (or FFA; Kanwisher, McDermott, & Chun, 1997) derived from participants imagining faces and from participants actually viewing faces. They reported a high degree of overlapping voxels in the FFA between the two conditions, and argued that this result supported a shared representation for face perception and face imagery.
In another study, Grill-Spector, Knouf, and Kanwisher (2004) compared the FFA derived from a localizer scan (passive viewing of faces vs. objects) and from an identification task using a variety of stimuli, such as faces, birds, flowers, houses, guitars, and cars. In their study, two overlap indices were calculated. Based on our interpretation of Table 1 (p. 559), the numerator was the number of overlapping voxels between the FFA from the localizer and the specific category that was tested; the denominator was either the voxels for the FFA localizer or the FFA task. Both indices showed significantly higher overlap between the FFA localizer and the FFA task compared to the overlap between the localizer and the other task-derived object-selective areas. However, their ROI overlap comparison was derived from novices' data (n = 5), so whether car experts, for example, might show higher overlap between FFA and fusiform car-selective areas was unclear.
With similar results to those of Grill-Spector et al. (2004), Rhodes, Byatt, Michie, and Puce (2004) reported the overlap between the FFA and a fusiform Lepidoptera area (Lepidoptera are commonly known as butterflies and moths), or FLA, in both Lepidoptera novices (Experiment 1) and experts (Experiment 2). They scanned people with an initial FFA localizer (or LO), which consisted of passive viewing of faces and objects, followed by a similar passive viewing (or PV) of faces, objects, and Lepidoptera, and a subsequent individuation (or IN) task, requiring participants to recognize the items they viewed in the previous session (half of them were from the previous session or "old"). The FFA-FLA overlap index was defined as the overlap between face- and Lepidoptera-selective voxels in the fusiform gyrus (FG) divided by the face-selective voxels in the FG, or (FFA ∩ FLA)/FFA. For the purpose of comparison, the FFA-FFA overlap across the three tasks (LO, PV, and IN) was divided into two values, (FFA_pv ∩ FFA_lo)/FFA_pv and (FFA_pv ∩ FFA_in)/FFA_pv, and was calculated for each participant to provide the baseline overlap index, which was a mean of 26.2% in Experiment 1 and 28.3% in Experiment 2 (across both left and right hemispheres, Table 1, p. 194). In comparison, the average FFA-FLA overlap, or the mean of (FFA_pv ∩ FLA_pv)/FFA_pv and (FFA_in ∩ FLA_in)/FFA_in, across both hemispheres, was 10.7% in Experiment 1 and 6.7% in Experiment 2. The significant difference between the FFA-FFA and FFA-FLA overlap was taken as support for the face specificity hypothesis (Downing, Chan, Peelen, Dodds, & Kanwisher, 2006; Kanwisher, 2000).
In addition to the above studies, which are directly related to the decade-long debate between the face specificity and the perceptual expertise hypotheses (Gauthier & Bukach, 2007; Bukach, Gauthier, & Tarr, 2006; Kanwisher, 2006; McKone & Robbins, 2006; McKone & Kanwisher, 2005; Gauthier, Anderson, Tarr, Skudlarski, & Gore, 1997; Kanwisher et al., 1997), a recent commentary (Peelen & Downing, 2005a) also reported the ROI overlap of imagined extrastriate action-related area (ARA; Astafiev, Stanley, Shulman, & Corbetta, 2004) and the extrastriate body-part area (EBA; Downing, Jiang, Shuman, & Kanwisher, 2001) as only 14%, implying a possible segregation between ARA and EBA. In their later reply, Astafiev et al. (2004) pointed out that a low index may be biased by the use of the larger ROI (here EBA) as the denominator. They suggested that the overlap might have been larger if the smaller ARA had been used. This raises two important concerns in interpreting ROI overlap results: (1) the percentage of overlap should be compared to an appropriate baseline condition, just as the FLA-FFA overlap was compared to the repeated FFA overlap used in the Rhodes et al. (2004) study; (2) the choice of the denominator can greatly affect the calculated overlap percentage. This second point will be elaborated in the next section.

The Choice of the Denominator in ROI Overlap Comparison
Although the ROI overlap has been used in a number of studies for a variety of purposes, a careful examination of the extant studies that adopted the ROI overlap analyses reveals one surprising finding: Although the numerator used in the overlap equation was always the total number of overlapped voxels between the two ROIs, the chosen denominator varied from study to study. As mentioned earlier, Peelen and Downing (2005a), Grill-Spector et al. (2004), Rhodes et al. (2004), and O'Craven and Kanwisher (2000) used a single ROI as the denominator, whereas other studies used the sum (Ganis, Thompson, & Kosslyn, 2004), the mean (Nieto-Castanon, Ghosh, Tourville, & Guenther, 2003), or the union (Sawamura, Georgieva, Vogels, Vanduffel, & Orban, 2005) of the two ROIs as overlap denominators. In most of these studies, the reasons for the use of a particular denominator were not discussed, nor were alternatives considered. As a real example of how the choice of denominator can drastically affect the overlap ratio, we recalculated the data from Table 1 of O'Craven and Kanwisher (2000, p. 1016). This study compared the overlap for imagining and viewing faces. When the voxel overlap (mean size = 9 voxels) was divided by the size of the face imagery ROI (mean size = 11 voxels), the mean overlap was 84%. However, when the voxel overlap was divided by the size of the face perception ROI (mean size = 63 voxels), the average overlap dropped to 22%. Therefore, it is reasonable to infer that in the Rhodes et al. (2004) study, the low overlap value (<30%) may have been due to the use of the comparatively larger ROI as the denominator; they reported a larger FFA_pv than FFA_lo (1440 vs. 813 mm³) for the FFA-FFA overlap, and a larger FFA than FLA in both the FFA_pv-FLA_pv and FFA_in-FLA_in overlaps (1440 vs. 650 and 1759 vs. 1346 mm³, respectively).
Agreeing that the volume (e.g., FFA vs. FLA) differences may account for some variation in the degree of overlap between faces and Lepidoptera, Rhodes et al. (2004) emphasized that "...it cannot account for the relatively modest degree of overlap between faces and Lepidoptera (7%) compared with the two sets of face stimuli (27%)" (p. 194). Indeed, although the choice of a single larger ROI as the denominator seems unable to explain the consistently larger FFA-FFA overlap compared to FFA-FLA overlap, the magnitude of the overlap difference (27% vs. 7%) can surely be adjusted by the denominator chosen. Figure 1 shows the results of three different ROI overlap indices using data from the Rhodes et al. report. When the choice of denominator changes from a single ROI (what Rhodes et al. used) to either the union or the sum of the two ROIs, the overlap difference drops from 20% to 16% to 13%, respectively. Although a 4-7% decrease in overlap difference may appear trivial, it does call for a systematic examination of the effect of the denominator in the ROI overlap comparison, and for an evaluation of which formula might fare best in the current ROI overlap scenarios in assessing functional selectivity of the ROI.

How Consistent is the FFA (across Runs and Sessions)?
Another interesting issue raised in the Rhodes et al. (2004) study was the apparent contradiction between their finding of low FFA-FFA overlap (27%) and the high consistency in the location of the FFA across tasks, runs, and sessions reported in the literature (Peelen & Downing, 2005c; Grill-Spector et al., 2004; Gauthier, Tarr, Anderson, Skudlarski, & Gore, 1999; Kanwisher et al., 1997). Rhodes et al. state that "given the apparent robust consistency of previous research demonstrating activation of FG [fusiform gyrus] to faces, it should be noted that the measure of overlap used here is a conservative index as it is the degree of overlap in activated regions thresholded to the same conservative criterion for both tasks" (p. 194). First, it is not clear why their uncorrected p value thresholds of .001 should be considered "conservative" compared to other published FFA studies which used p value thresholds of .0001 (uncorrected) or less (Kanwisher, Tong, & Nakayama, 1998; Kanwisher et al., 1997). Second, even if the threshold was indeed conservative, further explanation is needed regarding how statistical threshold affects the FFA-FFA overlap, including whether a lenient statistical threshold elevates the ROI overlap and whether a more stringent threshold decreases it, and if so, by how much. It is unclear whether changes in the threshold would have any substantial effect on the overlap index. Any increase of the threshold, say from p = .05 to p = .10, will likewise increase the sizes of both ROIs in the denominator and their overlap in the numerator (and similarly, a decrease in the threshold would also be propagated equally), in many cases rendering the final outcome equivalent.
In the current study, we will (1) further explore the relationship between the FFA sizes, distances, and their resulting overlaps (defined by various formulae); and (2) compare the FFA-FFA and FFA-FXA overlap (X being the object of expertise category) under several statistical thresholds to see whether there is any systematic relationship. The latter point not only addresses the use of different statistical thresholds and the effect on the low FFA-FFA overlap specifically but also more generally tests the effect of statistical criterion on various ROI overlap comparison results.
Figure 1. ROI overlap indices computed from the Rhodes et al. (2004) data with different denominators: (1) single ROI (FFA from the passive viewing session, or FFA_pv), (4) the other single ROI (FFA from the localizer session, or FFA_lo), (2) the union, and (3) the sum of FFA_pv and FFA_lo. As shown, the differences of the ROI overlap (FFA-FFA minus FFA-FLA overlap) decreased from 20% (27% − 7%) in (1), to 16% (21% − 5%) in (2), or 13% (18% − 5%) in (3). In addition, the FFA-FFA overlap increased from the original 27% to 49% by just switching the denominator from FFA_pv to FFA_lo. These FFA values were derived from Table 1 of Rhodes et al. (2004, p. 194).

Goal of the Present Study
Based on the above review and discussion, the goal of the present study is to examine: (1) the effect of the denominator choice (or the formula used) on ROI overlap comparisons; (2) the relationship between FFA location and the overlap index; and (3) how ROI overlap is affected by the choice of statistical threshold. To achieve this, we tested 10 Greeble novices with three runs of a localizer scan (consisting of blocks of faces, objects, and Greebles). Five of these participants were subsequently given 2 weeks of training to become "Greeble experts." These five laboratory-trained experts were further scanned in two additional sessions with the same localizer, once during training and again immediately following training (training procedure = 2 weeks; Rossion, Kung, & Tarr, 2004; Gauthier, Williams, Tarr, & Tanaka, 1998). With the data we sought to (1) replicate previous results using an overlap analysis similar to the one used by Rhodes et al. (2004); (2) further examine six different denominators and compare their results; (3) reconcile the issue of FFA location consistency across runs and sessions and corresponding FFA-FFA overlap measures; and (4) evaluate the effect of varying statistical thresholds on ROI overlap formulae and their values.
In our assessment of FFA-FFA overlap, we compared the same task, face stimuli, and epoch length across three separate scanning sessions. We expected no difference in the FFA size across scans and predicted no differences regardless of which single FFA served as the denominator. In addition, we predicted that the size of the FFA would be larger than the FGA ("fusiform Greeble-selective area"), and this size difference would lead to inconsistencies in the ROI overlap measure, dependent on the particular formula used. More specifically, when using only the larger face area as the denominator, we predicted that the FFA-FFA overlap would be significantly larger than the FFA-FGA overlap, consistent with the face specificity prediction. However, when a combination of both FFA and FGA is used as the denominator, we predicted that the resulting mean overlap between FFA-FFA and FFA-FGA would not be significantly different, consistent with the prediction of the perceptual expertise hypothesis. To test these predictions, we used six different denominators. The numerator was the same across all formulae, that is, the number of overlapping voxels (represented as ROI_a ∩ ROI_b in the formulae below).
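For reference, the six indices described verbally in the paragraphs that follow can be summarized in one place. This is a sketch reconstructed from those verbal definitions; the notation |·| for voxel count and the labels O_a to O_f are assumptions introduced here for illustration.

```latex
\begin{align*}
O_a &= \frac{|\mathrm{ROI}_a \cap \mathrm{ROI}_b|}{|\mathrm{ROI}_a|}, &
O_b &= \frac{|\mathrm{ROI}_a \cap \mathrm{ROI}_b|}{|\mathrm{ROI}_b|}, \\[4pt]
O_c &= \frac{|\mathrm{ROI}_a \cap \mathrm{ROI}_b|}{|\mathrm{ROI}_a| + |\mathrm{ROI}_b|}, &
O_d &= \frac{|\mathrm{ROI}_a \cap \mathrm{ROI}_b|}{\tfrac{1}{2}\left(|\mathrm{ROI}_a| + |\mathrm{ROI}_b|\right)} = 2\,O_c, \\[4pt]
O_e &= \frac{|\mathrm{ROI}_a \cap \mathrm{ROI}_b|}{|\mathrm{ROI}_a \cup \mathrm{ROI}_b|}
     = \frac{|\mathrm{ROI}_a \cap \mathrm{ROI}_b|}{|\mathrm{ROI}_a| + |\mathrm{ROI}_b| - |\mathrm{ROI}_a \cap \mathrm{ROI}_b|}, &
O_f &= \frac{O_a + O_b}{2}.
\end{align*}
```

Because O_d is exactly twice O_c for any pair of ROIs, the two indices differ only by a constant scale factor, so a comparison of FFA-FFA versus FFA-FGA overlap necessarily yields the same test statistic under either formula.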
For formulae a and b, the number of voxels for a single category is used as the denominator. For example, the number of active voxels in the FG calculated for a face localizer task could be used (ROI_a), which might be measured as 1000 voxels. Alternatively, the number of active voxels for a Greeble localizer task could be used as the denominator (ROI_b), which might be measured as 50 voxels. If our number of overlapping voxels for faces and Greebles was 25 voxels, this would yield an overlap index of 2.5% in the case of formula a, and 50% in the case of formula b.
For formula c, the denominator is the sum of the active voxels for two categories. For example, if there were 1000 voxels for faces and 50 for Greebles, the denominator would be 1050 voxels. With 25 overlapping voxels, this would yield a percent overlap of 2.38% using formula c.
In formula d, the mean volume of two ROIs is calculated. For example, if there were 1000 voxels for faces and 50 for Greebles, the denominator would be 525 voxels. Using our example of 25 overlapping voxels, this would yield a percent overlap calculation of 4.76% in the case of formula d.
In formula e, the union of both ROIs is calculated and used as the denominator. This differs from the sum of voxels in that the shared voxels for Greebles and faces are not counted twice. For example, if the face area was 1000 voxels and the Greeble area was 50 voxels, but 25 of these voxels are shared, then the denominator would be 1025 voxels (975 unique face voxels + 25 unique Greeble voxels + 25 shared voxels). Using our example of 25 overlapping voxels again, this would yield a percent overlap calculation of 2.44% using formula e.
Formula f is the average of formulae a and b. Thus, for this formula, we take the mean overlap index for formula a (2.5%) and formula b (50%), yielding an overlap index of 26.25%. From these examples, we can see that the overlap index varies widely with the use of different denominators. However, the critical issue here is how much these different overlap denominators affect the comparison of the overlap for repeated measures of the face-selective area (e.g., FFA-FFA overlap) versus the overlap between face-selective and another category-selective area, in this case, Greebles (e.g., FFA-FGA overlap). We examined each of these formulae using fMRI data for Greebles and faces.
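The worked example above (1000 face voxels, 50 Greeble voxels, 25 shared voxels) can be reproduced with a few lines of code. The following is a minimal sketch, not the analysis code used in the study; the function name `overlap_indices` and its argument names are hypothetical.

```python
def overlap_indices(size_a, size_b, n_overlap):
    """Return the six overlap indices (as proportions) for two ROIs.

    size_a, size_b : voxel counts of ROI a and ROI b
    n_overlap      : number of voxels shared by the two ROIs
    The keys 'a'..'f' follow the formula labels used in the text.
    """
    if size_a <= 0 or size_b <= 0:
        raise ValueError("ROI sizes must be positive")
    if not 0 <= n_overlap <= min(size_a, size_b):
        raise ValueError("overlap cannot exceed the smaller ROI")

    a = n_overlap / size_a                        # single ROI a
    b = n_overlap / size_b                        # single ROI b
    c = n_overlap / (size_a + size_b)             # sum of both ROIs
    d = n_overlap / ((size_a + size_b) / 2)       # mean of both ROIs
    e = n_overlap / (size_a + size_b - n_overlap) # union of both ROIs
    f = (a + b) / 2                               # average of a and b
    return {"a": a, "b": b, "c": c, "d": d, "e": e, "f": f}


# Worked example from the text: faces = 1000, Greebles = 50, shared = 25.
example = overlap_indices(1000, 50, 25)
for label, value in example.items():
    print(f"formula {label}: {100 * value:.2f}%")
```

The same numerator yields indices ranging from under 3% to 50%, which is the point of the example: the denominator alone determines most of the apparent overlap when the two ROIs differ greatly in size.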

Participants
Ten undergraduate or graduate students at Brown University participated in the one-session scan (6 women, 1 left-handed, mean age = 25.9 years, SD = 4.2 years). Five of them (3 women, mean age = 25 years, SD = 4.74 years) continued with two additional scans, a week apart, during the middle and end of 2 weeks of Greeble expertise training. Behavioral performances in both picture-naming and picture-name verification tasks were monitored to make sure the participants achieved expertise criterion, for example, no significant difference between basic- and subordinate-level objects in accuracy and response time (Tanaka & Taylor, 1991). Participants typically achieved criterion by the eighth or ninth training session (Rossion et al., 2004). All the participants gave informed consent as approved by the Brown University Institutional Review Board (IRB).

Stimuli
The stimuli consisted of 90 full-color Caucasian front-view face photographs (half male), 100 full-color object photos (sampled from commercial object image CDs), and 80 asymmetric Greeble images created using 3-D Studio Max R4 (www.discreet.com). The complete set of the Greeble stimuli, both images and 3-D models, is available at www.tarrlab.org/. The asymmetric Greebles were created by randomly repositioning the "boges," "quiff," and "dunth" within a particular placement range. In interviews before and after training, participants indicated that the asymmetric Greebles did not appear facelike. The stimuli, which subtended approximately 8° of visual angle, were projected onto an opaque screen and were viewed by participants through a mounted mirror on the head coil.

fMRI Procedure
Each scan session began with an MPRAGE structural scan (1 mm³), followed by three runs of functional FFA localizers. The FFA localizers consisted of blocks of faces, objects, and Greebles (20 images per block), interleaved with the fixation baseline (Figure 2). Participants were instructed to press a button with their dominant hand when they detected a stimulus repetition (1-back identity task). Forty-five 3-mm axial EPI slices (in-plane resolution = 3 × 3 mm², TR = 3600 msec, TE = 38 msec, flip angle = 90°, gap = 0 mm, 76 measurements) were acquired in the 1.5-T Siemens scanner located at Memorial Hospital of Rhode Island.

fMRI Data Analysis
We used BrainVoyager 2000 v4.96 (Brain Innovation) to (1) discard the first two volumes in each run and (2) preprocess using 3-D motion correction, slice scan time correction (interleaved ascending), linear trend removal, and high-pass filtering at 3 cycles/sec. Structural images were warped into Talairach space for group analysis. The volume time-course files were analyzed using a general linear model (Boynton, Engel, Glover, & Heeger, 1996) with three convolved predictors (faces, objects, and Greebles), corrected for autocorrelation (AR1), and restricted by the gray matter mask. Subsequent contrasts of faces minus objects (p = .05, corrected for main experiment, with p = .01 and .10 for additional analyses) were applied either to each run (for all 10 Greeble novices), or to each session collapsing over three runs (for the five Greeble experts). The face-selective voxel cluster, or FFA, was determined by confirming the colocalization of (1) a reasonable anatomical range (x_TAL: 35 to 40; y_TAL: −40 to −60; z_TAL: −15 to −25), (2) a higher average BOLD response for faces than for other object categories (Figure 2), and (3) a significant mean percent signal change (PSC) of about 1% for faces compared to fixation, and a smaller PSC for objects and Greebles.

Within- and Between-session FFA Spatial Consistency
Our first analysis sought to reconcile the discrepancy between the low FFA-FFA overlap (M = 27%) reported by Rhodes et al. (2004), and the consistent within-subject FFA spatial location measured across runs and sessions commonly reported in the literature (Peelen & Downing, 2005c; Gauthier et al., 1999; Kanwisher et al., 1997). All the participants' FFA locations (center-of-mass Talairach coordinates) and size (mm³) are listed in Table 1. Examples of the spatial location of the FFA are shown for four participants in Figure 3. The top two rows show the data by run and the bottom two rows show data by session. Several points can be made about these data. First, there are considerable between-subject differences for the overlap indices, in both the between-run and between-session calculations. For example, using formula f, two participants exhibited overlap in the range of ~60% to 70% (Rows 1 and 3), whereas the two remaining participants exhibited overlap of only ~3% (Row 2) and ~20% (Row 4). Second, as expected, the larger the FFA-FFA distance, the smaller the overlap index. This inverse relationship was significant for all the correlations between the FFA-FFA distances and the six overlap indices (a, b, c, d, e, and f), both across-run [n = 10, r(29) = −.50, −.52, −.76, −.76, −.76, and −.66; all p < .05] and across-session [n = 5, r(14) = −.64, −.80, −.78, −.78, −.74, and −.83; all p < .05]. One example of the correlation results for a single formula (formula f) is shown in Figure 4. Third, the by-session analysis combined data over three runs, resulting in FFA sizes that were larger than the by-run analysis [t(9) = 3.36, p = .008 for 10 novices; and t(14) = 2.71, p = .016 for 5 experts who underwent 3 scans] and more consistent in size, which was indicated by the highly significant FFA-FFA volume correlation [r(14) = .91, p < .0001] combining all 15 possible pairings for the across-session analysis (5 participants × 3 sessions).
In comparison, the across-run FFA volume correlation [r(29) = .39, p > .05] was much smaller, possibly reflecting the combined effects of fewer data points per run and within-subject noise (including adaptation, fatigue, etc.), which jointly contribute to larger variability in FFA size across runs. Finally, the range of center-of-FFA Talairach coordinates is consistent with what has been reported in the literature: Across Tables 1 and 2, the coordinates are in the range of x = 35 to 45, y = −40 to −60, z = −10 to −25.
Table 1. FFA Volume (mm³) and the Center-of-Mass Talairach Coordinates (x, y, z) of the Present Experiment, Across-Run (n = 10, One Session) and Across-Session (n = 5, 3 Sessions). Average FFA volume (n = 10): 834 mm³.
Our FFA results, in terms of the location and size, were both within the ranges reported in the literature, therefore supporting the qualitative notion of FFA "consistency." However, the substantial differences in calculated overlap indices between individuals reveal several important factors related to overlap calculations. Consistency in the spatial location of the FFA can be linked to "small" center-of-FFA distance variations (e.g., ±5 mm in the y-axis) and high correlations among the calculated sizes of the FFA (at least across sessions). In contrast, the FFA-FFA overlap index was concurrently "low" (e.g., based on the equation shown in Figure 4, with 5 mm FFA-center distance, the expected mean overlap is ~35%), yielding the impression of a discrepancy between spatial location and overlap. This apparent "inconsistency" can be resolved by a quantitative comparison. We found a significant negative correlation between FFA-center distance and the corresponding overlap, shown in Figure 4. These data indicate that, as one would expect, as the distance between the center points of calculated face-selective areas increases, the amount of overlap decreases.
Figure 3. The axial slice map of the FFA location (Talairach coordinates and FFA sizes, in mm³, shown on top of each slice) of four representative subjects (upper two rows for within-session, lower two rows for across-session analysis). With a revised overlap index (formula f), which averages two single-ROI-as-denominator overlap indices, the mean FFA-FFA overlap was 64% and 3% for the upper two rows, and 67% and 27% for the bottom two rows. In general, our results are congruent with Peelen and Downing (2005b), who found that, for individual subjects across runs or sessions, the location and size of the FFA are relatively consistent, with some across-subject variation.
Our results, therefore, suggest that FFA consistency depends on what one defines as the measure of consistency, and where one places the cutoff for "consistent" versus "inconsistent." If consistency is measured as center-of-FFA distance, a difference of several millimeters is thought to be trivial and the data are interpreted as indicative of a stable, functionally defined area. However, when measured as FFA-FFA overlap, the result appears much smaller (e.g., <50%), which might be interpreted as indicative of a less stable area, even though these measures are significantly correlated.
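The geometric intuition behind this inverse relationship can be illustrated with a purely synthetic example. The sketch below is an assumption-laden toy, not a model of the present data: it uses two idealized cubic ROIs of identical size (real ROIs are irregular and differ in volume), shifts one of them along a single axis in whole-voxel steps, and reports formula f against center-of-mass distance. All function names are hypothetical.

```python
from itertools import product
from math import dist


def cube_roi(center, half_width):
    """Set of integer voxel coordinates forming a cube around `center`."""
    cx, cy, cz = center
    offsets = range(-half_width, half_width + 1)
    return {(cx + dx, cy + dy, cz + dz)
            for dx, dy, dz in product(offsets, repeat=3)}


def center_of_mass(roi):
    """Mean coordinate of all voxels in the ROI (unweighted)."""
    n = len(roi)
    return tuple(sum(voxel[axis] for voxel in roi) / n for axis in range(3))


def overlap_f(roi_a, roi_b):
    """Formula f: average of the two single-ROI-denominator indices."""
    shared = len(roi_a & roi_b)
    return 0.5 * (shared / len(roi_a) + shared / len(roi_b))


# Reference ROI: a 7 x 7 x 7 voxel cube (hypothetical size).
reference = cube_roi((0, 0, 0), 3)

# Shift a second, identical cube along the y-axis one voxel at a time.
profile = []
for shift in range(8):
    moved = cube_roi((0, shift, 0), 3)
    distance = dist(center_of_mass(reference), center_of_mass(moved))
    profile.append((distance, overlap_f(reference, moved)))

for distance, overlap in profile:
    print(f"center distance = {distance:.0f} voxels, overlap f = {overlap:.2f}")
```

Even in this idealized case, a displacement that is small relative to typical between-study coordinate ranges removes a large fraction of the overlap, which is why a "consistent" location and a "low" overlap index can coexist.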

Comparison of FFA-FFA vs. FFA-FGA Overlap
As a basis for later comparisons with different overlap formulae, the first step of our FFA-FFA versus FFA-FGA comparison was to replicate previous results showing larger FFA-FFA overlap compared to the overlap between FFA and the category-selective responses arising from a second domain of expertise, including Lepidoptera (Rhodes et al., 2004) and cars (Grill-Spector et al., 2004). Here we used asymmetric Greebles, and created experts with a training paradigm that has been shown to be effective in a number of published studies (Rossion et al., 2004; Gauthier et al., 1998, 1999). Across 10 Greeble novices, using a single FFA measure as the denominator and p = .05 as the statistical threshold for the face minus object contrast (the middle bars for each graph in Figure 5), we found a significantly larger mean FFA-FFA overlap (40.8%) compared to the mean FFA-FGA area overlap (21.7%), t(9) = 2.81, p = .02. This result is consistent with Rhodes et al. (2004) and Grill-Spector et al. (2004). Note that Rhodes et al. used FFA activation as measured in an initial passive viewing session as the denominator, rather than using the FFA activation for faces measured during the experimental tasks (e.g., individuation for the trained novices, Experiment 1 or an average of passive viewing and individuation for the untrained novices in Experiment 1 and the experts in Experiment 2). In our case, because there was no systematic size difference for face activation across the three runs, we averaged the FFA-FFA overlap across all three comparisons (FFA_r1-FFA_r2, FFA_r2-FFA_r3, and FFA_r1-FFA_r3). In order to compare the effect of different denominators on ROI overlap, we calculated six plausible variations of the overlap index by fixing the numerator as the number of overlapping voxels for ROI_a and ROI_b, and then varying the denominator from formula a to f (see Methods).
Figure 5A summarizes the mean results of the six different within-session FFA-FFA and FFA-FGA overlap comparisons. Using the p = .05 threshold level, we find significant FFA-FFA and FFA-FGA overlap differences [for a, t(9) = 2.81, p < .05; for b, t(9) = −4.66, p < .001], but only when overlap is calculated with a single ROI as the overlap denominator (the two leftmost comparisons in Figure 5). However, there were no significant overlap differences for the remaining overlap formulae [for c and d, t(9) = −0.19, p = .84; for e, t(9) = −0.03, p = .97; for f, t(9) = −1.7, p = .12], in which the denominator included both ROIs. Thus, at least using p = .05 for the threshold level, we replicated previous results, finding a significant difference between the FFA-FFA overlap and the overlap of the FFA with the area selective for another class of objects. However, it is important to note that this difference is obtained only when we rely on a formula that uses only a single ROI as the denominator. In contrast, overlap formulae that take both ROIs into account reveal no significant differences between the overlap of two face localizers and the overlap of faces and Greebles.
Figure 5. The six chosen ROI overlap indices (various denominators) calculated either across- or within-sessions (across runs) for FFA-FFA and FFA-FGA ("fusiform Greeble area") overlap analyses, with three levels of statistical thresholds (p = .01, .05, and .10, all corrected), using our fMRI data from 10 subjects. The upper row (A) represents the within-session, the lower row (B) the cross-session ROI overlap comparison. Across the two rows, both the first and second panels use a single ROI as the denominator of ROI overlap index and reveal significant FFA-FFA and FFA-FGA overlap differences (e.g., consistent with the face specificity prediction). Using denominators that take both ROI sizes into account (panels 3-6), we find no systematic differences in FFA-FGA versus FFA-FFA overlap. *p < .05, **p < .001.
In Figure 5A, the FFA-FFA overlap was calculated for three separate run comparisons, FFA_r1-FFA_r2, FFA_r2-FFA_r3, and FFA_r1-FFA_r3, and then averaged together. However, due to the limited number of Greeble-selective voxels in each run, the FFA-FGA overlap was averaged across three runs within a single session, FFA_r123-FGA_r123, and then compared. Because this was not a direct comparison (across-run in the FFA-FFA comparison vs. a combined-run comparison in FFA-FGA overlap), we also calculated the overlap comparison for the five Greeble experts over three scanning sessions so that across-session FFA-FFA and FFA-FGA overlap comparisons, calculated identically for both, were possible. These across-session results are shown in Figure 5B. Again for a p = .05 threshold, we observed a significant FFA-FFA and FFA-FGA overlap difference [for formula a, t(14) = 2.58, p = .02; for formula b, t(14) = −2.64, p = .019] in the first two columns. For the remaining four overlap comparisons, in which both ROIs were taken into account in the denominator, there were no significant differences between the overlap for face areas and the overlap for face and Greeble areas [for formulae c and d, t(14) = 1.26, p = .22; for e, t(14) = 1.31, p = .20; for f, t(14) = −0.4, p = .96]. A comparison of Figure 5A and B shows highly similar FFA-FFA and FFA-FGA overlap results for each overlap formula, regardless of how the sessions were combined. This demonstrates a consistency in the results across runs and sessions in our study. Together, these data strongly support the proposal that ROI overlap varies significantly across different calculation methods using different denominators.
It is worth pointing out that in Figure 5, the FFA-FFA versus FFA-FGA overlap comparisons for two formulae (a and b) only differed in the FFA-FGA comparison [21% in a vs. 79% in b, t(9) = −0.68, p < .0001 for the across-run analysis; 18% in a vs. 60% in b, t(14) = −4.41, p = .0005 for the across-session analysis], but did not differ substantially in the FFA-FFA overlap [41% in a vs. 35% in b, t(9) = 1.81, p = .10 for the across-run analysis; 40% in a vs. 38% in b, t(14) = 0.23, p = .81 for the across-session analysis]. This suggests a reverse pattern (e.g., "FFA-FFA > FFA-FGA" in a, vs. "FFA-FFA < FFA-FGA" in b) by merely changing the overlap denominator from the larger FFA in a to the smaller FGA in b. Another minor point of interest is, similar to the point raised in Figure 1, whether the denominator size alone can explain the current ROI overlap comparison pattern. The observation that the mean denominator sizes (for the FFA-FGA overlap in both across-run and across-session analyses) in formulae a (774 mm³) and b (212 mm³) were indeed smaller than in c (986 mm³) and in e (828 mm³) seems to support the importance of denominator size per se in explaining the pattern of overlap comparison across formulae, despite the case of formula d (493 mm³), whose denominator size was in between those of formulae a and b, but with a similar overlap comparison pattern as with formulae c and e. To complicate matters, the results of formula b also illustrate the problem of changing the denominator across FFA-FFA and FFA-FGA comparisons (because the FFA was the denominator for calculating FFA-FFA overlap, and the FGA was the denominator for calculating FFA-FGA overlap), whose significant size differences surely contributed to the final significant overlap difference [mean FFA vs. FGA: 834 vs. 230 mm³, t(9) = 3.23, p = .01 for the across-run analysis; 714 vs. 194 mm³, t(14) = 3.23, p = .005 for the across-session analysis].
Therefore, inspections of formulae (a, b) versus (c, d, e, and f) in Figure 5 suggest that it is the combination of several factors, including at least the denominator size and the use of the identical denominator for the baseline and category-selective overlap calculations, that affects the final overlap comparison results.
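To make the role of the denominator concrete, the following minimal Python sketch computes the six overlap indices from two ROI sizes and their shared volume. The function name `overlap_indices` and the 165 mm³ intersection are hypothetical and chosen only for illustration. The two ROI sizes are the mean FFA and FGA denominators quoted above, so the printed values are not expected to match the per-subject means plotted in Figure 5.

```python
def overlap_indices(size_a, size_b, intersection):
    """Return the six overlap indices (formulae a-f) for two ROIs.

    size_a, size_b: ROI volumes (e.g., in mm^3).
    intersection:   volume shared by the two ROIs (same units).
    """
    if size_a <= 0 or size_b <= 0:
        raise ValueError("ROI sizes must be positive")
    if not 0 <= intersection <= min(size_a, size_b):
        raise ValueError("intersection must lie between 0 and the smaller ROI")
    total = size_a + size_b          # sum of both ROIs
    union = total - intersection     # union of both ROIs
    single_a = intersection / size_a
    single_b = intersection / size_b
    return {
        "a": single_a,                    # single ROI (A) as denominator
        "b": single_b,                    # single ROI (B) as denominator
        "c": intersection / total,        # sum of both ROIs
        "d": intersection / (total / 2),  # average size of both ROIs
        "e": intersection / union,        # union of both ROIs
        "f": (single_a + single_b) / 2,   # average of indices a and b
    }


# Mean FFA and FGA denominators from the text (774 and 212 mm^3);
# the 165 mm^3 intersection is a hypothetical value, not a reported result.
indices = overlap_indices(774, 212, 165)
for name, value in indices.items():
    print(f"formula {name}: {value:.1%}")
```

With one large and one small ROI, the same shared volume yields a low index under formula a and a high index under formula b, which is the reversal described above.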

Effects of Statistical Threshold on Overlap Analysis
Because the definition of ROI overlap is based on the ROI sizes and their area of overlap, all of which are determined by the given statistical threshold in the fMRI data analysis procedure (e.g., face-object contrast after the general linear model), overlap analysis is strongly dependent on the statistical threshold. So far, we have only discussed the results for a threshold of p = .05, corrected, a common threshold used in many studies (Downing et al., 2006; Hasson, Nir, Levy, Fuhrmann, & Malach, 2004). To qualify Rhodes et al.'s (2004) claim that the conservative threshold can, in part, explain low ROI overlap calculations (<30%), and, more importantly, to address how statistical threshold affects the ROI overlap in general, we carried out two additional overlap analyses using p = .01 and p = .10, both corrected. These data are shown alongside the p = .05 bars in Figure 5. The general pattern is that for a threshold of p = .01, the overlap value slightly decreases, and for a threshold of p = .10, the overlap value increases, when compared to the original p = .05 analysis. Examined individually, the more conservative threshold (from p = .05 to p = .01) yielded three significant FFA-FFA and FFA-FGA overlap differences in the by-session analysis: formulae c and d, t(14) = 2.28, p = .038; and formula e, t(14) = 2.29, p = .037. In contrast, the more lenient threshold (from p = .05 to p = .10) yielded nonsignificant or marginally significant results for formula a in the by-run analysis [t(9) = 1.62, p = .13], and formula b in the by-session analyses [t(14) = −1.88, p = .07]. In order to further examine the interaction between statistical threshold and overlap formula, we ran a 3 (p = .01, p = .05, and p = .10) × 6 (the six formulae presented in the Methods) analysis of variance.
With the mean ROI overlap difference as input, the interaction between statistical threshold and overlap formula was not significant [F(10, 162) = 0.38, MS = .02, p = .95 for the by-run analysis; F(10, 72) = 0.23, MS = .01, p = .99 for the by-session analysis], suggesting that quantitatively, the overall effect of statistical threshold change did not interact with the formulae chosen (refer to Figure 5). Taken together, these results suggest that the mean ROI overlap is, indeed, contingent on the statistical threshold chosen. When examined individually, these threshold shifts can result in changes in whether an ROI overlap index passes or fails a significance test. Thus, the result of the local overlap comparison does depend on the specific threshold chosen. However, the global effect of threshold shift on ROI overlap is about the same across the different overlap formulae, with no significant interaction found between threshold and the different overlap formulae.
Lastly, our additional analyses only found partial support for Rhodes et al.'s (2004) claim that the more conservative threshold decreases the overlap value (and a more lenient threshold increases the overlap value). In our study, such increases or decreases were small or moderate, and clearly not the only contributing factor to the overall low overlap values. It is more likely that the low FFA-FFA overlap reported in their study (27%) was primarily due to the large FFA chosen as the denominator, not the conservative statistical criteria they adopted.
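The dependence of ROI size and overlap on the threshold can be illustrated with a toy Python sketch. Everything in it is hypothetical: the six voxels, their corrected p-values, and the helper names `roi_at_threshold` and `mean_single_roi_overlap` were invented for illustration and are not taken from our data or analysis pipeline. The sketch only shows that the same two maps give different ROI sizes, and therefore different overlap values, at p = .01, .05, and .10.

```python
def roi_at_threshold(p_values, alpha):
    """Return the set of voxel ids whose p-value passes the threshold."""
    return {voxel for voxel, p in p_values.items() if p < alpha}


def mean_single_roi_overlap(roi_1, roi_2):
    """Formula f: average of the two single-ROI-denominator indices."""
    if not roi_1 or not roi_2:
        return 0.0  # no overlap can be defined for an empty ROI
    shared = len(roi_1 & roi_2)
    return (shared / len(roi_1) + shared / len(roi_2)) / 2


# Hypothetical corrected p-values for six voxels in two localizer maps.
map_1 = {1: 0.001, 2: 0.008, 3: 0.03, 4: 0.04, 5: 0.08, 6: 0.20}
map_2 = {1: 0.002, 2: 0.04, 3: 0.009, 4: 0.09, 5: 0.06, 6: 0.03}

results = {}
for alpha in (0.01, 0.05, 0.10):
    roi_1 = roi_at_threshold(map_1, alpha)
    roi_2 = roi_at_threshold(map_2, alpha)
    # Store (size of ROI 1, size of ROI 2, overlap index under formula f).
    results[alpha] = (len(roi_1), len(roi_2), mean_single_roi_overlap(roi_1, roi_2))
    print(alpha, results[alpha])
```

In this toy case the overlap index rises as the threshold becomes more lenient, in line with the general pattern described above, although real maps need not behave so regularly.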

Monte Carlo Simulations
A comprehensive evaluation of the overlap analysis is not complete without an objective assessment of the expected overlap range under each overlap formula. Computing the mean overlap index using various ROI sizes provides the chance to compare the results from the fMRI data analysis with those from the simulation. Also, the simulation serves as an important selection criterion for determining the "best" overlap formula for use in the type of study presented here. We define "best" as the formula that is least biased for the comparison of two ROIs, regardless of whether the ROI sizes are quite similar (e.g., FFA-FFA) or largely different (FFA-FXA, X being any given object category of expertise) (Note 3). Here, we used a Monte Carlo simulation to determine the range of mean expected values under systematic ROI size variations (Note 4). Figure 6 shows the simulation results of 10,000 iterations for each of the six formulae used in our ROI overlap comparisons. The mean ROI overlap is represented along the vertical axis, and the horizontal axis indicates the size difference between the two virtual ROIs. To simulate the potential size relationships between category-selective brain regions, we used 50-unit increments for ROI 1 (ranging from 50 to 1000 units) and held ROI 2 constant at 1000 units. This produced 19 steps, with the ROI differences becoming systematically smaller (e.g., 950 to 0) as the simulations progress toward the right in Figure 6. Three observations are apparent. First, the overlap index varied by formula when the ROI sizes were equal (ROI 1 and ROI 2 = 1000). For formulae a, b, d, and f (Figure 6A, B, D, and F) it was 50%, for formula c it was 25% (Figure 6C), and for formula e it was 33% (Figure 6E). Second, the effect of ROI size differences on the estimated overlap index differed substantially across formulae, as represented by the slopes of the bar graphs.
For example, when the denominator was a single ROI, the overlap index varied substantially with ROI size difference (Figure 6A and B). Formulae c, d, and e yielded intermediate slopes (Figure 6C, D, and E), which showed a plateau at an ROI 1 size of 500 units (ROI size difference of 500). In contrast, formula f was robust to changes in the size difference, yielding the same mean overlap (50%) across all ROI size differences (Figure 6F). Not surprisingly, our empirically established ROI overlap results fall within the range of values predicted by the Monte Carlo simulations. For example, in the third panel of Figure [...] Figure 6A and B, in contrast, clearly show a more drastic change in the overlap index as the size difference varies when a single ROI is used as the denominator, reflected by a larger slope of change (compared with c, d, e, and f). This variability makes it more difficult to compare the simulations with real fMRI data.
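The equal-size values reported above (50% for a, b, d, and f; 25% for c; 33% for e) can be checked with a small Python sketch. This is not the simulation code used in the study. It rests on a loudly stated assumption: that on each iteration the shared volume is drawn uniformly between zero and the size of the smaller ROI. It also assumes that ROI 1 is the denominator of formula a and ROI 2 the denominator of formula b. Under these assumptions only the equal-size case is meant to be compared with the values in the text; the way the indices change with ROI size difference need not match the slopes in Figure 6.

```python
import random


def simulate_mean_indices(size_1, size_2=1000, iterations=10_000, seed=0):
    """Mean overlap indices (formulae a-f) for two virtual ROIs."""
    rng = random.Random(seed)
    smaller = min(size_1, size_2)
    # ASSUMPTION (not from the paper): the shared volume is uniform between
    # 0 (no overlap) and the smaller ROI size (complete containment).
    mean_overlap = sum(rng.uniform(0, smaller) for _ in range(iterations)) / iterations
    total = size_1 + size_2
    single_1 = mean_overlap / size_1
    single_2 = mean_overlap / size_2
    return {
        "a": single_1,                              # ROI 1 as denominator
        "b": single_2,                              # ROI 2 as denominator
        "c": mean_overlap / total,                  # sum of both ROIs
        "d": mean_overlap / (total / 2),            # average ROI size
        "e": mean_overlap / (total - mean_overlap), # union of both ROIs
        "f": (single_1 + single_2) / 2,             # average of a and b
    }


# ROI 1 grows in 50-unit steps while ROI 2 stays at 1000 units.
simulated = {size_1: simulate_mean_indices(size_1) for size_1 in range(50, 1001, 50)}
equal = simulated[1000]
print({name: round(value, 2) for name, value in equal.items()})
```

Because the mean shared volume is about half the ROI size when both ROIs are equal, the different equal-size values follow directly from the denominators: one ROI for a and b, two ROIs for c, and one and a half ROIs for e.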

DISCUSSION
The present study examined whether the functional properties of category-selective brain regions can be inferred on the basis of the degree to which ROIs overlap. Under this proposal, we expect a high overlap index for tasks or object classes that are thought to recruit common functional systems, for example, the FFA in perceiving and imagining faces (O'Craven & Kanwisher, 2000), or in passively viewing two sets of faces (Rhodes et al., 2004). At the same time, we expect a low degree of overlap for categories or tasks that we believe are processed by distinct functional systems. Obviously, this argument relies on the assumption that spatially localized brain regions correspond, at least roughly, to functional mechanisms. Although this logic is widely accepted in the neuroimaging literature, spatial localization (or co-localization) may not entirely address the issue of functional commonalities. For example, as will be discussed in more detail later, higher-resolution imaging might reveal functional separability at a finer level of analysis (e.g., Grill-Spector, Sayres, & Ress, 2006). As Astafiev et al. (2004) suggest, a low overlap index may be the result of using a large, single area as the denominator, a proposal tested systematically by our current study.
Consistent with Astafiev et al.'s (2004) claims, the results of our FFA-FFA and FFA-FGA ("fusiform Greeble area") overlap comparisons reveal a clear influence of the choice of denominator: When a single FFA is used, the FFA-FFA overlap is significantly larger than the FFA-FGA overlap (replicating Rhodes et al., 2004). Importantly, this overlap difference disappeared altogether when the overlap denominator was a combination of both ROIs, be it the sum, the union, or the average. Monte Carlo simulations corroborated this empirical result, and further revealed that, compared to a single ROI denominator, when a combination of the two ROIs was used, the overlap index was much more resistant to ROI size variations, and the average of the two single-ROI-denominator indices was the best among the six formulae tested in the current context (comparing FFA-FFA and FFA-FXA overlap), in which within-subject ROI size variation was a central issue (Note 5).
In addition to finding significant changes in the overlap index depending on the formula chosen, our results also address the issue of consistency in the FFA across tasks. Although both 1-back identity and passive viewing tasks are commonly used as default FFA localizer tasks, and some studies have found no significant difference between the two tasks (Tong, Nakayama, Moscovitch, Weinrib, & Kanwisher, 2000), the nature of passive viewing may still vary within the task context. In the Rhodes et al. (2004) study, participants were told to "passively view" the stimuli in both the initial FFA localizer and the ensuing experimental scans with faces. However, the mean FFA size for the localizer (FFA_lo) was almost half of that in the experimental face scans (FFA_pv) (813 mm³ vs. 1440 mm³). Rhodes et al. suggested that this difference was the result of using different stimuli and epoch lengths. In contrast, several studies have found that the FFA is consistent across sessions and/or experimental block length (Gauthier et al., 1999; Kanwisher et al., 1997). In our study, we compared face activation in the same task (1-back identity) across three runs within a session, or three separate scanning sessions. Consistent with these early reports, we found little size variation (mean FFA size: 775 mm³ in FFA_1, 780 mm³ in FFA_2, and 623 mm³ in FFA_3) and a high correlation across three scan sessions [r(14) = .91, p < .0001], indicating that FFA activation was relatively stable, at least across sessions, in terms of small center-of-FFA distance variation (several millimeters). In addition, we found a proportionate decrease in terms of ROI overlap indices as the center-of-FFA distance increased, both across runs and across sessions (Figure 4). Thus, the issue of FFA "consistency" depends on how consistency is defined, either in terms of distance between center-of-mass Talairach coordinates (relatively small) or in terms of ROI overlap indices (larger decrements).
Figure 6. Six simulation results (over 10,000 iterations each) of changing the size of ROI 1 in 50-unit steps (from 50 units to 1000 units) shown along the horizontal axis (ROI 2 was kept constant at 1000 units); the vertical axis plots the mean overlap index. Given that each formula has a different asymptotic overlap value (ranging from 25% to 50%) and a different degree of stability over ROI size changes, the most consistent overlap formula is the average of (A ∩ B)/A and (A ∩ B)/B (F), which also seems best suited for comparing FFA-FFA to FFA-FGA (typically having different-sized ROIs) in the current study.
Our results are congruent with the majority of the FFA literature demonstrating the reliable consistency of repeated FFA measures. In addition, we are able to explain what has been an incongruence between overlap measures and center-distance measures. Despite the apparent disparity between these measures, we found a systematic relationship between FFA-center distance and corresponding FFA-FFA overlap. Furthermore, the correspondence between the approximately 40% FFA-FFA overlap calculated from the fMRI data, and the 50% overlap calculated from the Monte Carlo simulation (both from formula f, which is most resistant to ROI size variations), suggests that even in a highly consistent region like the FFA, the expected overlap across runs or sessions is only about 50%. In support of this finding, if we recalculate the overlap for several studies using formula f, we find similar results. The average FFA-FFA overlap for O'Craven and Kanwisher (2000) is 53%, and for Rhodes et al. (2004) is 38%. These calculations are similar to our findings using a 1-back identity task, both across-run (38%) and across-session (39%). The only higher-than-expected overlap index for the FFA-FFA was found for Grill-Spector et al. (2004). Using formula f to recalculate their data, we find a mean overlap of 72%. Among other possibilities, this high overlap may be due to differences in task, or to the small sample size (n = 5) in their study.
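Why a center shift of only a few millimeters can coexist with a modest overlap index can be illustrated with an idealized geometric sketch in Python. It assumes, purely for illustration, that two repeated FFA measurements are equal-sized spheres; real ROIs are irregular, so the numbers are indicative only. For two spheres of radius r whose centers are d apart, the shared fraction of either sphere is 1 − 3d/(4r) + d³/(16r³), and with equal sizes formulae a, b, d, and f all reduce to this fraction. The function names are hypothetical, and the 775 mm³ volume is one of the mean FFA sizes reported above.

```python
import math


def sphere_radius(volume):
    """Radius of a sphere with the given volume."""
    return (3 * volume / (4 * math.pi)) ** (1 / 3)


def equal_sphere_overlap_fraction(radius, distance):
    """Fraction of one sphere shared with an equal sphere `distance` away."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    if distance >= 2 * radius:
        return 0.0  # the spheres no longer touch
    ratio = distance / radius
    return 1 - 0.75 * ratio + ratio ** 3 / 16


# ASSUMPTION: a spherical ROI of 775 mm^3, a mean FFA size from the text.
radius = sphere_radius(775)
for distance_mm in range(0, 7):
    fraction = equal_sphere_overlap_fraction(radius, distance_mm)
    print(f"center distance {distance_mm} mm: overlap {fraction:.0%}")
```

For small shifts the leading term, 3d/(4r), makes the drop in overlap nearly proportional to the center distance, which agrees qualitatively with the relationship described above.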
It is also important to note that, although our data were generally consistent with the predictions of the "perceptual expertise hypothesis" (Bukach et al., 2006), our participants only exhibited behavioral evidence for Greeble expertise by the third scan (Gauthier et al., 1999;. Thus, it is somewhat surprising that we found no significant difference in the overlap index for Greebles and faces from the initial scans onward. This apparent dissociation may be explained, in part, by previous studies that have shown that fusiform activity is related to task demands (Rogers, Hocking, Mechelli, Patterson, & Price, 2003) and high within-family homogeneity of the stimulus class . In particular, we may have failed to find a change in the degree of FFA-FGA overlap with training because we employed a 1-back identity task, a task that has been associated with increased focal fusiform activity relative to passive viewing (Kanwisher et al., 1998). Such task-related activation might also help account for previous results which show a nonlinear relationship between behavioral measures of expertise and FFA activity for objects of expertise (e.g., car experts viewing cars) when using a 1-back identity task . This heightened activation for an object category in the 1-back task may explain why we found similar overlap for the FFA-FGA and FFA-FFA in participants' first scans, prior to expertise training. Regardless of one's preferred explanation, the significant FFA-FFA and FFA-FGA overlap we observed before expertise training is certainly not easily explained by current instantiations of the face specificity hypothesis.
Last year, two fMRI studies explored the effect of "expertise training" of novel objects in the whole brain (Moore, Cohen, & Ranganath, 2006) or in the extrastriate cortex (Op de Beeck, Baker, DiCarlo, & Kanwisher, 2006). After training, participants in Moore et al.'s study showed increased activation in the dorsolateral prefrontal, inferior parietal, and occipito-temporal cortices. In Op de Beeck et al.'s study, they found increased activation in the lateral occipital complex (LOC). Both studies failed to find a significant increase in FFA activity after training. A closer look at the training paradigms for these studies, however, reveals disparate tasks used in training and scanning. In previous Greeble training studies (Rossion et al., 2004; Gauthier et al., 1998), object naming tasks (e.g., what is this Greeble's name?) and shape-name verification (Is this Greeble's name "Tezi"?) were alternated. This alternation of different tasks was intended to increase the speed of subordinate-level access, a characteristic of expert-level recognition (Tanaka, 2001). During scans, a simple delayed-matching or passive-viewing task was typically used (Gauthier et al., 1999). In contrast, in the Moore study, simultaneous match-to-sample, delayed recognition, family placement, and family discrimination tasks were used during training, and a typical working memory task (cue-delay-probe) was used during the scan. In the Op de Beeck study, discrimination learning was used during training, and a demanding color-change detection task was used during the scan. Therefore, the task differences and, more importantly, the disparate underlying representations the participants may have acquired in these studies, may explain their failure to find significantly increased FFA activity after training.
Such conjecture, however, calls for future studies that systematically compare various training paradigms on (a) specific "aspects of expertise" acquired during training, (b) the different neural manifestations underlying the training process, and (c) the different patterns of neuronal reorganization.
Finally, as mentioned earlier, a recent discussion of high-resolution fMRI studies of the FFA (Baker, Hutchison, & Kanwisher, 2007; Grill-Spector, Sayres, & Ress, 2007; Simmons, Bellgowan, & Martin, 2007; Grill-Spector et al., 2006) further reinforces our arguments regarding the importance of formula definition for the particular results and ensuing conclusions. Grill-Spector et al. (2006) scanned five novices and used [Preferred − Nonpreferred] / [Preferred + |Nonpreferred|] as the category selectivity index. They found evidence, at 1-mm³ voxel resolution, for category-selective voxels to faces, animals, cars, and even sculptures. However, as Baker et al. (2007) and Simmons et al. (2007) pointed out, the analysis method used by Grill-Spector et al. was flawed due to the voxel selection procedure (dependent vs. independent) and the lack of a baseline adjustment. Especially relevant to our study is the commentary of Simmons et al., in which they argue that using the absolute value of the nonpreferred condition in the selectivity index can inflate the voxel selectivity for other categories within the FFA. Sidestepping the issue of whether the FFA is homogeneously face selective or intermingled with other nonface-selective voxels (Grill-Spector et al., 2007; Peelen & Downing, 2005b; Schwarzlose, Baker, & Kanwisher, 2005), our study highlights the importance of choosing a sound index and verifiable rationale, and then cautiously interpreting the results.
To conclude, an overlap index, when appropriately used, can provide useful information and complement other analyses. However, it is not a suitable measure to assess category specificity on its own, as demonstrated in our study. We found that changing the denominator using six perfectly plausible alternatives significantly changed the results of the overlap comparison. These data suggest that finding a significant difference between FFA-FFA overlap and FFA-FXA (X being any given category of domain experts) overlap may be largely a result of choosing a denominator that biases the results in this direction, rather than reflecting a robust difference between face areas and areas of expertise. Furthermore, using ROI overlap comparisons to test the face specificity and perceptual expertise hypotheses is not ideal, in particular, because we found equivalent FFA-FFA and FFA-FXA overlap even in novices. In addition, the mean overlap value for different face measures, using the least-biased overlap formula, was only about 50%. Therefore, it is our contention that different models of FFA function, including face specificity (McKone & Kanwisher, 2005; Grill-Spector et al., 2004; Spiridon & Kanwisher, 2002; Kanwisher, 2000), distributed (O'Toole, Jiang, Abdi, & Haxby, 2005; Hanson, Matsuka, & Haxby, 2004; Haxby et al., 2001) and perceptual expertise accounts (Gauthier & Bukach, 2007; Bukach et al., 2006;, may be better served by focusing more on why certain stimuli, tasks, and analysis methods are chosen, and how these particular concatenations do or do not give rise to category selectivity.
%computing mean ROI overlaps in 6 formula
mean_ovp1(j) = mean(ovp(i))/a(j);
mean_ovp2(j) = mean(ovp(i))/b(j);
mean_ovp3(j) = mean(ovp(i))/tot(j);
mean_ovp4(j) = mean(ovp(i))/(tot(j)/2);
mean_ovp5(j) = mean(ovp(i))/(tot(j) - mean(ovp(j)));
mean_ovp6(j) = (mean(ovp(i))/a(j) + mean(ovp(i))/b(j))/2;
end
5. With the recommended formula of $d' = \dfrac{M_{\text{preferred}} - M_{\text{nonpreferred}}}{\sqrt{(\sigma^2_{\text{preferred}} + \sigma^2_{\text{nonpreferred}})/2}}$ (Afraz, Kiani, & Esteky, 2006), we also adopted and recalculated the overlap index with the formula $\dfrac{\mathrm{ROI}_a \cap \mathrm{ROI}_b}{\sqrt{(\mathrm{ROI}_a^2 + \mathrm{ROI}_b^2)/2}}$ (the standard deviation in the denominator was not available because each ROI had only the size information here, unlike the case of $d'$ with four responses: faces, objects, sculptures, and animals, for each voxel). The results of the FFA-FFA versus FFA-FGA overlap are 0.291 versus 0.363, t(9) = 0.64, p = .53 for across-run, and 0.332 versus 0.397, t(14) = −0.696, p = .497 for across-session analysis, consistent with the results from formulae c, d, e, and f.
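As a minimal Python sketch of the additional index described in this note, the shared volume is divided by the root mean square of the two ROI sizes. The function name `rms_overlap_index` and the example intersection of 165 mm³ are hypothetical; the two ROI sizes are the mean FFA and FGA denominators quoted in the Results, so the printed value is illustrative and is not one of the reported results.

```python
import math


def rms_overlap_index(size_a, size_b, intersection):
    """Shared volume divided by the root mean square of the two ROI sizes."""
    if size_a <= 0 or size_b <= 0:
        raise ValueError("ROI sizes must be positive")
    rms_size = math.sqrt((size_a ** 2 + size_b ** 2) / 2)
    return intersection / rms_size


# Hypothetical intersection of 165 mm^3 between a 774 mm^3 and a 212 mm^3 ROI.
example_index = rms_overlap_index(774, 212, 165)
print(f"RMS-denominator overlap index: {example_index:.3f}")
```

Like formulae c to f, this denominator depends on both ROI sizes, which is why it groups with them rather than with the single-ROI formulae a and b.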