Links and Legibility: Making Sense of Historical U.S. Census Automated Linking Methods

Abstract: How does handwriting legibility affect the performance of algorithms that link individuals across census rounds? We propose a measure of legibility, which we implement at scale for the 1940 U.S. Census, and find strikingly wide variation in enumeration-district-level legibility. Using boundary discontinuities in enumeration districts, we estimate the causal effect of low legibility on the quality of linked samples, measured by linkage rates and share of validated links. Our estimates imply that, across eight linking algorithms, perfect legibility would increase the linkage rate by 5-10 percentage points. Improvements in transcription could substantially increase the quality of linked samples.


Introduction
Linking historical U.S. Census records is crucial in the study of a range of economic outcomes such as migration and intergenerational mobility. However, data quality issues reduce the accuracy of these links, with unclear consequences for the resulting economic analyses. This article explores one specific source of error: the difficulty in transcribing records with poor handwriting legibility. We show that this is a quantitatively important barrier to accurate linking using a novel measure of legibility that compares two independent transcriptions of the 1940 U.S. Census schedule. We document wide variation in legibility and show that low legibility reduces the number and quality of links. We also find that the effect of legibility on these links depends on the choice of linking algorithm.
The first contribution of this article is to propose and document a measure of legibility. For a given enumerator's handwritten entries, we use the share of recorded names where independent transcriptions by Ancestry.com and FamilySearch.org are identical.[1] Names written less legibly will be more likely to be entered differently by two independent transcribers. We find wide variation in this measure. For the lowest decile of enumerator-level legibility, fewer than half of the transcriptions agree, while for the highest decile the share is almost 90%. Figure 1 illustrates differences in legibility between enumerators: the handwriting in Figure 1(a) leads to fewer transcription errors than that in Figure 1(b).
Legibility is critical to linking records across census rounds. We implement a variety of existing algorithms to link the 1930 and 1940 Census rounds and document how these methods perform as legibility changes.[2] The proportion of individuals who are linked across census rounds (the "linkage rate") increases by up to two-thirds when moving from the bottom to the top decile of legibility. Further, the share of false positives declines with legibility: we find a decrease of up to 20% in the share of links that fail a validation test when moving from the bottom to the top decile of legibility.

[1] See Ruggles (2021) for a history of collaboration between these organizations and IPUMS, the flagship organization for the distribution of historical and contemporary U.S. census data.
CONTACT Sam Il Myoung Hwang hwangii@mail.ubc.ca, University of British Columbia, Vancouver, Canada. Supplementary materials for this article are available online. Please go to www.tandfonline.com/UBES.
One may be concerned that the quality of the linked samples and our measure of legibility are correlated for reasons that are not relevant to the underlying "true" legibility. For example, our measure of legibility may be low for enumeration districts with a high share of unusual names of foreign origin. To identify the causal effect of legibility on the quality of linked samples, we exploit discontinuities in legibility at the boundaries of enumeration districts. The discontinuity arises from the following feature of the enumeration procedure for the U.S. Federal Census since 1880: all households in an enumeration district are enumerated by a single census enumerator.[3] As a result, to the extent that different census enumerators have handwriting with varying degrees of legibility, our measure of legibility changes discontinuously at the boundaries of enumeration districts. Figure 2, a map of the city of Yonkers, New York, illustrates the variation in our legibility measure across neighboring enumeration districts. In fact, Figures 1(a) and (b) also embody our research design: these two census schedules contain information on households that live on opposite sides of the same street (South Highland Avenue, Los Angeles, California). This street happens to lie on the boundary of two enumeration districts.
One challenge in implementing this research design is that one needs to know which enumeration districts share boundaries. Enumeration district maps exist in principle for each U.S. Federal Census, but they were not digitized until recently. Shapefiles of enumeration districts in the 1940 Census have been made available for 43 cities by Logan and Zhang (2017).[4] We find that legibility does have a causal effect on the quality of linked samples. We measure the quality of a linked sample in two ways: the linkage rate (i.e., the share of the linkable population that is linked by a given algorithm) and the share of validated links (share validated henceforth). The definition of a validated link may depend on the particular datasets being linked; we follow Bailey, Cole, and Massey (2020b) and define a link as validated if the two records that are linked have matching parents' birth places. As shown in Bailey, Cole, and Massey (2020b), this is informative about whether a link is true.[5] We also present results using middle name initials as an alternative validation variable.

[4] See section C for the list of these cities.
[5] Another important measure of the quality of linked samples is whether the linked sample is representative of the linkable population. However, we find that none of the linked samples used in this paper are representative. This is not surprising, since most previous studies also find that their linked samples are not representative of their respective populations.
We find that moving from the 10th to the 90th percentile of the legibility distribution (from 53% to 88% legibility) increases the linkage rate by 16% to 41%, depending on the linking algorithm used. We also find that the share of links that are not validated drops by 12% to 23%. This implies that legibility has a large causal impact on linking performance, and that the effect is much larger for some algorithms than for others. Generally, algorithms that use first and last names with minimal cleaning are more sensitive to legibility than those that employ string comparators or phonetic codes.
Finally, we quantify the importance of legibility in determining the overall quality of the linked samples. We do this by estimating counterfactual values of the linkage rate and share validated as if legibility were perfect across our linkable sample. Although our boundary sample has the advantage of providing credible causal inference, it consists only of people living in large cities, and hence may not be representative of the broader population. Hence, in this section we use OLS estimates of the effects of legibility from our entire linkable sample. Using these coefficients, we estimate that perfect legibility would increase the linkage rate by 5-10 percentage points, depending on the linking algorithm used.[6] Observed linkage rates are approximately 75% to 88% of what they would be without legibility problems.
A concern with these estimates from the entire linkable sample is that, unlike with the boundary sample, our estimates of the effects of legibility on the linkage rate may suffer from endogeneity. Our results suggest that this is unlikely. First, the coefficients on legibility using the entire linkable sample are very similar to the (more plausibly causal) ones from our boundary sample. Second, our results in this section include township fixed effects, which are likely to absorb much of the potentially problematic variation.[7] Third, we show that our results change very little if, as per Oster (2019), we allow for high levels of selection on unobservables.
In sum, we find that low handwriting legibility degrades transcriptions sufficiently to cause quantitatively important declines in linkage rates.

Literature Review
Linked samples created from historical datasets have helped researchers answer important questions on a variety of topics, including immigration (e.g., Abramitzky, Boustan, and Eriksson 2012), internal migration (e.g., Collins and Wanamaker 2014), intergenerational mobility (e.g., Long and Ferrie 2013), and culture (e.g., Bazzi, Fiszbein, and Gebresilasse 2020). At the same time, considerable effort has been made to evaluate the way linked samples are created, that is, to evaluate the performance of linking algorithms. This is likely out of concern for the quality of datasets generated from historical sources, which raise several quality issues relative to modern datasets.

[6] Strikingly, algorithms with higher linkage rates see somewhat larger increases in linkage rates from eliminating legibility errors. This suggests that "better" linking algorithms would not compensate for the problems caused by illegibility.
[7] There are 25,630 townships in our data, far more than the number of counties (3,108).
Evaluating linking algorithms would be a straightforward task if true links were observable. This is rarely the case in historical datasets.[8] In the absence of true links, some authors rely on manually constructed high-quality links (Bailey et al. 2020a; Bailey, Cole, and Massey 2020b; Abramitzky et al. 2021) or crowd-sourced links available on genealogical websites (Price et al. 2021; Abramitzky et al. 2021; Helgertz et al. 2022).
However, these high-quality links that have been the basis of evaluation may share some of the same issues as the linked samples being evaluated. Eriksson (2017) highlights the fact that when researchers manually link a small sample to a population (because of the large costs involved in manual linking), the resulting links may contain false links. The reasoning, which is also pointed out by Abramitzky et al. (2021), is as follows: suppose that a link is created between person A in the sample data and person B in the population data. The link between A and B may be false if there exists another person A' who is not in the sample data but is actually the true match to person B. Eriksson (2017) uses Swedish census data to show that sample-to-population linkage results in Type-1 error rates of up to 24.4%, and that Type-1 error rates increase as sampling rates fall.
In the case of crowd-sourced links available on genealogical websites, the links are likely to be true because the users of such websites, either genealogists or descendants of the people being linked, may use additional information (e.g., birth or marriage certificates) beyond what is available in regular datasets such as the U.S. census. However, the coverage of these user-provided links is typically not sufficient to evaluate all of the links that are algorithmically created. For example, most links in Abramitzky et al. (2021) and Helgertz et al. (2022) (ranging from 63% to 95%) cannot be cross-checked with links on FamilySearch.org, a popular genealogical website. It is not clear whether these links are similar in quality to the links that can be cross-checked, since the users of these websites may not be representative of the population.
Our approach complements existing work on the quality of linked samples. Instead of relying on high-quality linked samples to evaluate link quality, we attempt to investigate why the quality of linked samples is low. In other words, we attempt to measure the effect of variation in a factor (the legibility of historical documents) that might affect the quality of linked samples. There are at least two strengths to our approach: we can quantify the role that this factor plays in degrading the quality of linked samples (see Section 4); and if a factor is indeed identified as degrading the quality of linked samples, the research community may consider employing or developing technology that can address the issue (e.g., advanced optical character recognition technology for digitizing the original handwritten census returns).
The remainder of this article proceeds as follows: Section 2 describes our datasets and legibility measure, and introduces the linking algorithms that we evaluate. Section 3 presents the causal effect of legibility on the quality of linked samples. In Section 4, we quantify the effect of (il)legibility by simulating linkage rates and share validated under a counterfactual scenario in which names are perfectly legible in all enumeration districts. Comparing simulated rates to actual rates then tells us the degree to which legibility degrades link quality. Finally, Section 5 concludes. Sections denoted with letters can be found in the supplemental appendices. Figures and tables for additional descriptives, analyses, and robustness checks are collected in Appendix H and prefixed with "A." Additional results, including a simple model of census data linkage and legibility, and heterogeneous effects of legibility on the quality of linked samples, can be found in Appendices F and G, respectively.

Legibility
Our treatment variable is the legibility of census schedules. Our measure of legibility is calculated for each enumeration district (and hence each enumerator). It is defined as the share of records in the enumeration district whose transcriptions of given names and surnames by FamilySearch.org and Ancestry.com are identical after standard cleaning procedures.[9] We obtain the transcriptions of Ancestry.com from IPUMS and those of FamilySearch.org from their website. We combine the two datasets and successfully match 93.1% of the 132,404,766 records in the 1940 census.[10] Among these, 71.9% have identical transcriptions; conversely, in 28% of cases the transcriptions do not match. The mean of our enumerator-level legibility measure (across 150,156 enumeration districts) is 0.719, with a standard deviation of 0.147. For all of our analyses in the following sections, we drop enumeration districts that are too small (containing fewer than 50 people) or that do not have two transcriptions for a sufficiently large share of people (less than 90%). We also restrict the analysis to White and Black males 8 years or older in 1940 (who hence could plausibly be linked to a 1930 individual). This leaves us with approximately 48 million observations. This sample will frequently be referred to as the "linkable population" in later analyses. Table A2 compares the mean of various observable characteristics of our linkable population with the overall population (all men), and with our sample of individuals who live along a relevant boundary.
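To make the construction of the measure concrete, the following Python sketch computes district-level legibility from toy records carrying two independent transcriptions per name. All function and field names are our own illustration, not taken from the authors' code; the cleaning step stands in for the "standard cleaning procedures" mentioned above, and the size and coverage filters mirror the thresholds described in the text.

```python
from collections import defaultdict

def clean_name(name):
    """Lowercase and strip everything but letters (including internal
    spaces), a stand-in for the standard name-cleaning procedures."""
    return "".join(ch for ch in name.lower() if ch.isalpha())

def district_legibility(records, min_size=50, min_coverage=0.9):
    """Share of records per enumeration district whose two independent
    transcriptions agree exactly after cleaning.

    `records` is an iterable of (district_id, transcription_a,
    transcription_b) tuples; either transcription may be None if missing.
    Districts with fewer than `min_size` records, or where fewer than
    `min_coverage` of records carry both transcriptions, are excluded
    (their legibility is returned as None)."""
    by_district = defaultdict(list)
    for district, a, b in records:
        by_district[district].append((a, b))
    legibility = {}
    for district, pairs in by_district.items():
        both = [(a, b) for a, b in pairs if a is not None and b is not None]
        if len(pairs) < min_size or len(both) / len(pairs) < min_coverage:
            legibility[district] = None  # dropped from the analysis sample
            continue
        agree = sum(clean_name(a) == clean_name(b) for a, b in both)
        legibility[district] = agree / len(both)
    return legibility
```

Because one enumerator covers one district, this district-level share doubles as an enumerator-level measure of handwriting legibility.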
We use a sample from our linkable population to identify the causal effect of the legibility of census schedules on the quality of linked samples. We refer to this as the "boundary sample." This boundary sample is drawn from the 43 cities for which shapefiles of the 1940 census enumeration districts are available. It consists of individuals who live on either side of a street along the border of two neighboring enumeration districts. We obtain the exact street address of each household in the 43 selected cities from geographic reference files created by the Urban Transition Project (Logan and Zhang 2018). We drop from our sample (a) boundaries of enumeration districts that overlap with township or ward boundaries; and (b) boundaries only one side of which is inhabited. We are left with 739,643 individuals living along 13,838 boundaries.
To identify the causal effect of the legibility of census schedules on the quality of linked samples, we exploit discontinuities in our measure of legibility at the boundaries of enumeration districts. To test for balance across these boundaries, we present descriptive statistics for the "Less legible" and "More legible" sides of each boundary. The former group consists of all individuals who live on the side of the street where the measure of legibility is worse than on the other side, and the latter group consists of the rest. Table 1 shows the mean of the legibility measure for each group: the difference in means is 0.116, which is approximately 0.9 standard deviations of the legibility measure in the boundary sample (0.129). The difference is statistically significant at the 99.9% level. Note that in our empirical analysis, we never use this binary distinction between more and less legible sides of a boundary. Instead, we always rely on our full continuous measure of legibility, which uses the actual gap in legibility across these boundaries. Please refer to Section 3 for details of the specification of the model.
To identify the effect of legibility on the quality of linked samples, it is necessary that both observable and unobservable characteristics of individuals on either side of the boundaries are balanced. We find that they are. Tables 2 and A3 compare the means of observable characteristics between the two groups, the "Less legible" versus the "More legible" group. For most characteristics, the difference in means is not statistically significant. When it is statistically significant, the standardized difference is below 0.1, the threshold recommended in Austin (2009) to determine balance.
As for unobservable characteristics, they may not be balanced if individuals sort across enumeration district boundaries based on these characteristics. Although the existence of such sorting cannot be ruled out, it is unlikely, because enumeration district boundaries are drawn only for the purpose of census enumeration and do not serve any other function that might induce sorting. Since these boundaries may overlap with other meaningful boundaries (such as county, township, or ward boundaries), we drop such overlapping boundaries from our sample.[11] We refer interested readers to section D for a description of how enumeration district boundaries were determined for the 1940 census.
Table notes: The share of observations for which "Foreign born mother" and "Foreign born father" are nonmissing is 39% and 34% in the boundary sample. These are greater than 5%, the sampling rate for the census "long form" in which parents' birth places are surveyed, because IPUMS assigned parents' birth places to those who did not take the "long form" survey but were living with one of their parents at the time of the 1940 census. + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001.

Linkage Rates and the Share of Validated Links
We measure the quality of linked samples using two outcomes: linkage rates and the share of validated links. The linkage rate is defined as the share of a given sample that is linked. The other measure of quality, the share of validated links ("share validated" henceforth), is defined as the share of linked records that is validated by an auxiliary variable, or "validation variable," that was not used as a linking variable. Our baseline validation variable is parents' birth places. That is, a link is validated if the father's and mother's birth places recorded in the 1930 and 1940 censuses match. We use this variable for validation because Bailey, Cole, and Massey (2020b) provide evidence, using ground-truth links, that links validated with parents' birth places are more likely to be true links than those that are not.[12] We also check the robustness of our results regarding share validated with an alternative validation variable: middle name initials.[13] That is, a link is validated if middle name initials match across the two censuses. Although U.S. Federal Census questionnaires do not specifically ask about middle names, many people report them. In our main sample, 24% of records are associated with a middle name initial.[14] The validation status of a link using middle name initials is strongly correlated with that from using parents' birth places. The share of links whose validation status remains unchanged between the two validation methods is about 75% (see Table A1), suggesting that middle name initials are indeed effective for validation.

[11] We are unable to drop school district boundaries from the sample because, as far as we are aware, digitized maps of school districts are not available for 1940.
[12] See section A for further discussion about using parents' birth places as a validation variable.
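As a minimal sketch of the two validation rules, the following Python fragment checks a linked pair of records against parents' birth places and, alternatively, middle name initials. Records are plain dicts, and the field names are illustrative assumptions of ours, not taken from the authors' data.

```python
def validated_by_parents(rec1930, rec1940):
    """Baseline rule: a link is validated if both the father's and the
    mother's birth places match across the two census records."""
    return (rec1930["father_bpl"] == rec1940["father_bpl"]
            and rec1930["mother_bpl"] == rec1940["mother_bpl"])

def validated_by_middle_initial(rec1930, rec1940):
    """Alternative rule: matching middle name initials, where both
    records actually report one."""
    a, b = rec1930.get("middle_initial"), rec1940.get("middle_initial")
    return a is not None and b is not None and a == b

def share_validated(links, validator):
    """Share of linked record pairs that pass a given validation rule."""
    return sum(validator(r30, r40) for r30, r40 in links) / len(links)
```

Because neither parents' birth places nor middle initials are used as linking variables, agreement on them provides an out-of-sample check on a declared link.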

Linking algorithms
This paper uses the samples created by the following eight linking algorithms.[15] In the interest of space, we henceforth use abbreviations for these algorithms:
1. Abramitzky, Boustan, and Eriksson algorithm with exact names as a linking variable ("ABE-exact")
2. ABE-exact, additionally removing links with names that overlap with anyone else's in the ±2 year band ("ABE-exact5")
3. ABE algorithm with NYSIIS-standardized names as a linking variable ("ABE-NYSIIS")
4. ABE-NYSIIS, additionally removing links with NYSIIS-standardized names that overlap with anyone else's in the ±2 year band ("ABE-NYSIIS5")
5. ABE algorithm where names are considered to match if they are within 0.1 Jaro-Winkler distance ("ABE-JW")[16]
6. ABE-JW, additionally removing links with names that are within 0.1 Jaro-Winkler distance of anyone else's in the ±2 year band ("ABE-JW5")
7. Machine learning algorithm (Feigenbaum 2016, "ML")
8. The algorithm that creates the Multigenerational Longitudinal Panel dataset (Helgertz et al. 2022, "MLP")
We refer interested readers to the review article by Abramitzky et al. (2021) (algorithms 1 to 6) or the references cited above (algorithms 7 and 8) for precise descriptions of each algorithm. We note similarities and differences between these algorithms that are important for interpreting our results in the following sections. The first seven algorithms are similar in the sense that the linking variables they use are individual characteristics that are either time invariant or evolve in a predictable way, such as given names and surnames, race,[17] birth place, and age. The main difference among these seven algorithms lies in how they use these linking variables (especially names) to declare links and whether they remove links with names that are common. The MLP algorithm, on the other hand, represents a departure from the other algorithms in that it expands the set of linking variables from only immutable characteristics of an individual to time-varying information about the individual (e.g., place of residence) and also to information about members of the same household (parents, spouse, siblings, etc.). This feature of the MLP algorithm likely leads to over-representation of households whose members do not change across censuses. None of the algorithms generate linked samples that are representative of the population.[18] Table A4 compares various observable characteristics of our linkable population to each of the linked samples. All linked samples under-represent Blacks and over-represent Midwesterners relative to our linkable population. The share of Blacks in the linked samples is approximately 45%-78% of that in the population, whereas the share of Midwesterners in the linked samples is approximately 116%-123% of that in the population. For most of the other characteristics, differences in means between the population and the linked samples are statistically significant but moderate in magnitude.
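To fix ideas, here is a highly stylized Python sketch of the exact-matching logic underlying ABE-exact. The actual algorithm widens the age band in steps (exact age, then ±1, then ±2) and includes further rules, so this is only an illustration under simplifying assumptions; the field names are our own choosing, not the authors' code.

```python
from collections import Counter, defaultdict

def abe_exact_links(census_1930, census_1940):
    """Stylized ABE-exact sketch: link a 1940 record to a 1930 record only
    when exactly one 1930 candidate agrees on (cleaned) first name, last
    name, and birth place, with ages consistent across the ten-year gap,
    and when that 1930 record is not claimed by any other 1940 record."""
    idx_1930 = defaultdict(list)
    for rec in census_1930:
        idx_1930[(rec["first"], rec["last"], rec["bpl"])].append(rec)

    links = []
    for rec40 in census_1940:
        key = (rec40["first"], rec40["last"], rec40["bpl"])
        candidates = [r for r in idx_1930[key]
                      if r["age"] + 10 == rec40["age"]]
        if len(candidates) == 1:  # require uniqueness on the 1930 side
            links.append((candidates[0]["id"], rec40["id"]))

    # Enforce uniqueness on the 1940 side as well: drop any 1930 record
    # matched by more than one 1940 record.
    used = Counter(id30 for id30, _ in links)
    return [(id30, id40) for id30, id40 in links if used[id30] == 1]
```

The variants above differ mainly in this name-comparison step: ABE-NYSIIS would compare phonetic codes of the names instead of raw strings, and ABE-JW would accept names within 0.1 Jaro-Winkler distance.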

The Causal Effect of Legibility
We estimate the following model with our boundary sample to obtain estimates of the causal effect of census schedule legibility on the quality of linked samples:

q_i = β e_i + X_i'γ + δ_{b_i} + ε_i,    (1)

where the dependent variable q_i is one of our quality measures. For linkage rates, q_i is equal to 1 if person i is linked, and 0 otherwise. For share validated, q_i is equal to 1 if the link for person i is validated with his/her parents' birth places, and 0 otherwise. e_i is our legibility measure for person i's enumeration district, X_i is a vector of observable characteristics of person i as well as a constant (see the notes under Table 3 for the list of covariates), and δ_{b_i} is the fixed effect for the boundary b_i along which person i lives. Lastly, ε_i captures the effect of unobservable factors on the quality of linked samples. We estimate model (1) separately for each algorithm. We use the entire boundary sample when the outcome is linked/not linked, whereas we use only linked records with nonmissing values of the validation variable when the outcome is validated/invalidated.

[17] There exists some evidence that recorded race for the same individual changes over time: for example, Dahis, Nix, and Qian (2019) argue that at least 1.4% of Blacks passed for White at some point between the 1850 and 1940 censuses.
[18] As far as we know, none of the linked samples used in previous studies were representative of their respective populations.
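As an illustration of how the boundary fixed effects can be absorbed, the following sketch estimates the legibility coefficient in a stripped-down version of model (1) that omits the covariates X_i. By the Frisch-Waugh-Lovell theorem, demeaning the outcome and the legibility measure within each boundary and running simple OLS on the residuals recovers the same coefficient as including a full set of boundary dummies. This is a toy implementation under that simplification, not the authors' estimation code.

```python
from collections import defaultdict

def boundary_fe_beta(outcomes, legibility, boundary_ids):
    """OLS slope of `outcomes` on `legibility` with boundary fixed
    effects, via within-boundary demeaning (Frisch-Waugh-Lovell).
    The covariate vector X_i of model (1) is omitted for brevity."""
    groups = defaultdict(list)
    for i, b in enumerate(boundary_ids):
        groups[b].append(i)
    q_tilde = list(outcomes)
    e_tilde = list(legibility)
    for idx in groups.values():
        q_mean = sum(outcomes[i] for i in idx) / len(idx)
        e_mean = sum(legibility[i] for i in idx) / len(idx)
        for i in idx:
            q_tilde[i] = outcomes[i] - q_mean
            e_tilde[i] = legibility[i] - e_mean
    # Simple OLS without intercept on the demeaned data.
    num = sum(e * q for e, q in zip(e_tilde, q_tilde))
    den = sum(e * e for e in e_tilde)
    return num / den
```

With a binary outcome (linked or validated), this is a linear probability model, so the coefficient reads directly as a change in the linkage rate or share validated per unit of legibility.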

The Effect of Legibility on Linkage Rates
We find that the legibility of census schedules affects linkage rates for each of the eight algorithms. Figure 3 illustrates our finding. This figure, created with the boundary sample, plots mean linkage rates for each linking algorithm against different levels of legibility. There is a positive relationship between legibility and the linkage rate for each algorithm. In addition, Figure A2, which focuses on two of the algorithms (ABE-exact and ABE-JW), suggests that the effects of legibility on linkage rates are heterogeneous across these two algorithms. Specifically, the ABE-exact algorithm appears to yield higher linkage rates than ABE-JW when legibility is above 0.57 (denoted with the vertical line), yet the latter yields higher linkage rates when legibility is below 0.57. We observe a similar pattern in Figure A3, where we focus on the linkage rate-legibility profiles of three conservative algorithms (ABE-exact5, ABE-JW5, and ABE-NYSIIS5). The linkage rate of ABE-exact5 appears to be larger than that of the other two algorithms when legibility is greater than 0.61, but it falls below ABE-JW5 at 0.61 and below ABE-NYSIIS5 at 0.59 (denoted with vertical lines). Our estimates of model (1) are consistent with the impression that Figures 3, A2, and A3 provide. Table 3 presents coefficient estimates on legibility in model (1), estimated separately for each linking algorithm. The coefficients are statistically significant for all algorithms at the 99.9% level, though the magnitudes of the coefficients vary. According to our estimates, a one standard deviation increase in the legibility measure (approximately an increase of 0.129 in the boundary sample) raises linkage rates by 2.3-4.5 percentage points, or 6.1%-16.1% of the mean linkage rate, depending on the linking algorithm.
Our estimates in Table 3 also indicate that the coefficient on legibility for the ABE-exact algorithm is larger than those for all of the other seven algorithms. To formally test the equality of coefficients on legibility between ABE-exact and each of the other seven algorithms, we estimate the following model with "stacked" boundary samples:

q_i = β_1 e_i + β_2 (e_i × 1{non-ABE-exact algorithm}_i) + α 1{non-ABE-exact algorithm}_i + X_i'γ + δ_{b_i} + ε_i,    (2)

where 1{non-ABE-exact algorithm}_i is an indicator that is equal to 1 if record i is associated with one of the seven non-ABE-exact linking algorithms, and 0 otherwise. To estimate model (2), we stack two copies of the boundary sample, one associated with the ABE-exact algorithm and the other with one of the other seven algorithms. The null hypothesis we test is β_2 = 0, that is, that the coefficient on legibility associated with a given non-ABE-exact algorithm is equal to that associated with the ABE-exact algorithm.
The estimates of β_1 and β_2 are presented in Table A5. The estimates of β_2 are negative for each of the other seven algorithms, meaning that the coefficient on legibility is smaller for these algorithms than for the ABE-exact algorithm. We reject each of the null hypotheses (equality of the coefficients) at the 99.9% level. The magnitudes of the differences (i.e., |β_2|) are quite large: they are roughly half the magnitude of β_1, which is the effect of legibility on the linkage rate for ABE-exact. Similarly, we also find that the coefficient on legibility for the ABE-exact5 algorithm is larger than that of each of the other seven algorithms (see Table A6 for estimates of model (2), with dummy variables replaced appropriately).
The sensitivity of linkage rates to the legibility measure for the ABE-exact and ABE-exact5 algorithms likely arises because they link two records only if the names on the records match exactly. Our finding suggests that this linking strategy may yield linkage rates higher than those of algorithms that employ string comparators (e.g., Jaro-Winkler distance) or phonetic codes (e.g., NYSIIS) when the source documents are sufficiently legible. However, as the legibility of the source documents deteriorates, ABE-exact and ABE-exact5 likely yield lower linkage rates, because poor legibility can induce incorrect transcription of names. Our results suggest that string comparators or phonetic codes can mitigate the effect of poor legibility on linkage rates.
Our results survive two sets of robustness checks. In our first set of robustness checks, we use two alternative measures of legibility. The first alternative is constructed in the same way as our baseline measure, except that we do not remove spaces between letters in transcribed names. Recall that in constructing our baseline legibility measure, we remove all spaces between letters in names before comparing the two transcriptions. In this robustness check, we test whether our results are sensitive to this particular name-cleaning procedure. For the second alternative legibility measure, we require that the two transcriptions of a person's name be sufficiently different from each other to be counted as not identical. Specifically, we require the Jaro-Winkler distance between two names to be greater than the 75th percentile value in the population for them to be counted as not identical. The 75th percentile is equal to 0.044, which is close to 0 because most names are transcribed identically in the two transcriptions (and therefore have a Jaro-Winkler distance of zero).
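For reference, the following is a self-contained implementation of the Jaro-Winkler similarity underlying this alternative measure. The code itself is our own illustration; the 0.044 threshold used in the usage check below is the 75th-percentile distance cited in the text.

```python
def jaro_winkler(s1, s2, prefix_weight=0.1):
    """Jaro-Winkler similarity in [0, 1]; 1.0 means identical strings.
    The Jaro-Winkler *distance* is 1 minus this value."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters match if equal and within half the longer length.
    window = max(len(s1), len(s2)) // 2 - 1
    match1 = [False] * len(s1)
    match2 = [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters appearing in a different order.
    seq1 = [c for c, m in zip(s1, match1) if m]
    seq2 = [c for c, m in zip(s2, match2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) / 2
    jaro = (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3
    # Winkler boost for a common prefix of up to four characters.
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * prefix_weight * (1 - jaro)

def transcriptions_disagree(name_a, name_b, threshold=0.044):
    """Alternative measure: count two transcriptions as not identical
    only if their Jaro-Winkler distance exceeds the threshold."""
    return 1 - jaro_winkler(name_a, name_b) > threshold
```

Because most transcription pairs are identical (distance zero), only genuinely divergent pairs exceed the 0.044 cutoff, which makes this alternative measure slightly more forgiving than exact comparison.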
Tables A7 and A8 present estimates of model (1) using each of the two alternative legibility measures, while Tables A9 to A12 present estimates of model (2) for each of these two alternatives. Our baseline conclusion remains robust.
In the second set of checks, we show that our results are robust to varying the extent to which legibility changes across enumeration district boundaries. We rerun our analysis restricting our sample to boundaries where legibility changes by at least a certain threshold value. The thresholds are the 5th, 10th, 25th, and 50th percentiles of the distribution of differences in legibility (where the unit of observation of the distribution is a boundary). Table A13 presents estimates from model (1) for each threshold (as well as our baseline results for reference). The coefficients on legibility are statistically significant and stable for all linking methods across all thresholds. Our conclusion that the linkage rate of ABE-exact(5) is more sensitive to legibility than other linking algorithms remains robust to this check as well (see Tables A14 and A15).

The Effect of Legibility on the Share of Validated Links
Turning to the second measure of quality, we find that legibility also positively affects the share of validated links. Figure 4 presents the share validated-legibility profile associated with each linking algorithm. The share validated is increasing in legibility across all linked samples. We confirm this with estimates of model (1): the coefficient on legibility is positive and statistically significant at the 95% level for ABE-exact and at the 99.9% level for all the other algorithms, indicating that increases in legibility raise share validated (Table 4). To the extent that share validated is negatively correlated with Type-1 error rates, our findings suggest that increases in legibility reduce Type-1 error rates.
The magnitude of the effect of legibility on the share validated is modest: our estimates imply that a one standard deviation increase in legibility (approximately 0.129) raises the share validated by 0.7%-2.8% of its mean. Equivalently, it reduces the share invalidated by 0.6 to 2.2 percentage points, or 4.3%-10.1% of the mean share invalidated, depending on the algorithm. We emphasize, however, that the moderate effect of legibility on the share (in)validated does not necessarily mean that it has a moderate effect on Type-1 error rates: a validated link can still be false.
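The back-of-the-envelope conversion between a regression coefficient and these effect sizes works as follows; the coefficient and outcome mean below are placeholders, with only the 0.129 standard deviation taken from the text:

```python
# Translate a regression coefficient on legibility into the effect of a
# one-standard-deviation improvement, expressed relative to the mean.
sd_legibility = 0.129          # SD of legibility, from the text
beta = 0.10                    # hypothetical coefficient from model (1)
mean_share_validated = 0.80    # hypothetical sample mean of the outcome

effect_pp = beta * sd_legibility                      # in outcome units
effect_pct_of_mean = 100 * effect_pp / mean_share_validated

print(f"{effect_pp:.4f} increase, i.e. {effect_pct_of_mean:.2f}% of the mean")
```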
We also find that the effects of legibility on share validated are heterogeneous across algorithms. Specifically, share validated for the MLP algorithm is less sensitive to legibility than for other algorithms. This pattern is visible in Figure 4: the share validated-legibility profile of the MLP algorithm appears flatter than the others. We formally test this by estimating model (2), replacing the dependent variable and the dummy variables accordingly. Table A16 presents estimates of model (2) for the MLP algorithm and each of the other algorithms. The estimates indicate that the coefficient on legibility for the MLP algorithm is positive (i.e., β1 > 0) and smaller than those for the other algorithms (i.e., β2 > 0), and the differences are statistically significant at the 95% level or higher, except when compared with ABE-exact5. Our estimates imply that share validated for the MLP algorithm is not only larger than for other algorithms, but the difference also grows as legibility deteriorates.19

Our results in this section are robust to various checks. The first two are the same checks we conducted for linkage rates in Section 3.1: using alternative measures of legibility, and restricting the sample to boundaries where the difference in our baseline measure of legibility across the boundary is sufficiently large. Tables A21, A22, and A23 present results for these robustness checks and show that the effects of legibility on share validated are statistically significant, with magnitudes similar to the baseline estimates.

The third robustness check uses an alternative validation variable: the initial of one's middle name, as discussed in Section 2.2. Table A24 presents these results. The estimates of the effect of legibility are statistically significant at the 99.9% level for all of the algorithms. However, as with the baseline estimates, the magnitudes of the effects of legibility on share validated are modest: a one standard deviation increase in legibility (0.129) increases share validated by 1.4%-4.9% of its mean.

Finally, for the last robustness check, we weight each observation by the inverse of the predicted probability of being linked and having nonmissing values for the baseline validation variable (i.e., parents' birth places). This check is necessary because our estimation uses only linked records when the dependent variable in model (1) is a validation variable. To the extent that the data linkage selects on observable or unobservable characteristics, these characteristics may not be balanced across boundaries conditional on being linked. Tables A25 through A32 compare the means of observable characteristics between the more and less legible sides of the boundaries, similarly to Table 2, but conditional on being linked under each of the algorithms. None of the differences is large enough for the standardized difference to exceed the threshold of 0.1.20 Our weighting procedure, which corrects for the potential imbalance in observables created by linkage, yields estimates that are similar to our baseline estimates, although three of the eight estimates are no longer statistically significant (see Table A33). See Appendix E for details about our weighting procedure.

19 These results are robust to alternative measures of legibility, an alternative validation variable, and restricting the sample to boundaries across which the legibility measure is sufficiently different. See Tables A17, A18, A19, and A20.

20 Lack of evidence for unbalanced observables does not necessarily mean that linkage will not cause selection on unobservables.
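The inverse-probability weighting described above can be sketched as follows; the predicted probabilities here are simulated rather than estimated from the census data, so this is only an illustration of the reweighting step:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Hypothetical setup: p_hat is each person's predicted probability of
# being linked AND having a nonmissing validation variable (in the
# article this prediction is described in Appendix E; here it is
# simulated for illustration).
p_hat = rng.uniform(0.2, 0.9, size=n)
linked = rng.random(n) < p_hat          # selection into the linked sample
outcome = rng.random(n) < 0.85          # e.g., link validated (1/0)

# Unweighted mean over the linked sample vs. the inverse-probability-
# weighted mean, which reweights linked records back toward the full
# linkable population.
w = 1.0 / p_hat[linked]
unweighted = outcome[linked].mean()
ipw = np.average(outcome[linked], weights=w)
print(f"unweighted={unweighted:.3f}, IPW={ipw:.3f}")
```

Because the simulated outcome is independent of selection here, the two means roughly coincide; with selection on characteristics that predict the outcome, they would differ.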

Quantifying the Effect of (Il)legibility
Having established that legibility positively affects linkage rates and the share validated, in this section we quantify the role that it plays in determining the overall quality of linked samples. To do so, we simulate linkage rates and the share validated under a counterfactual scenario in which our measure of legibility is equal to 1 in all enumeration districts. We then compare the simulated quality of linked samples to the actual quality observed in the data. The ratio of observed quality to simulated quality (or the difference between the two) is our estimate of the degree to which legibility degrades the quality of linked samples between the 1930 and 1940 censuses.
For our simulation, we estimate the effect of legibility on our quality measures using the linkable population rather than the boundary sample. We do so because the boundary sample consists of those living in large cities and hence is not representative of the population (see Table A2 for a comparison between our boundary sample and the linkable population). The effect of legibility in the population may therefore differ from that in the boundary sample, and applying estimates from this sample to the population could introduce systematic biases.
In practice, we estimate the following model using the linkable population, with enumeration districts as the unit of observation:

q_e = β ℓ_e + X_e′ γ + δ_f(e) + ε_e,   (3)

where q_e and X_e are enumeration-district-level averages of the corresponding variables in our baseline model (i.e., q_i and X_i in model (1), respectively), ℓ_e is the legibility measure for enumeration district e, and δ_f(e) is a fixed effect for an administrative division f(e) that includes enumeration district e (e.g., townships, counties, or states). Finally, ε_e captures a random shock to the quality of linked samples in enumeration district e. This specification is our baseline model (1) aggregated to the enumeration district level. The only difference is that, in model (3), we can only control for administrative divisions larger than an enumeration district, since our legibility measure varies at the enumeration district level. We therefore use township fixed effects in model (3), because the township is the smallest administrative division available in our dataset for the entire population.
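A fixed-effects model of this kind can be estimated with the within transformation (demeaning by township). The sketch below uses synthetic data and hypothetical parameter values; it illustrates the estimator, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(2)
n_townships, per_town = 200, 10
township = np.repeat(np.arange(n_townships), per_town)

# Synthetic enumeration-district data: legibility plus a township-level
# shock; the true coefficient on legibility is set to 0.25 (illustrative).
legibility = rng.uniform(0.4, 1.0, size=township.size)
town_effect = rng.normal(0, 0.05, size=n_townships)[township]
quality = 0.25 * legibility + town_effect + rng.normal(0, 0.01, township.size)

def within_demean(x, groups):
    """Subtract the group (township) mean from each observation."""
    means = np.bincount(groups, weights=x) / np.bincount(groups)
    return x - means[groups]

y = within_demean(quality, township)
x = within_demean(legibility, township)
beta_hat = (x @ y) / (x @ x)   # OLS on demeaned data = FE estimator
print(f"beta_hat = {beta_hat:.3f}")
```

Demeaning removes anything constant within a township, so the township-level shock drops out and the estimate recovers the coefficient on legibility.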
The OLS estimates of β in model (3) for linkage rates and the share validated are presented in the first column (labeled "Unadjusted") of Tables A34 and A35, respectively. We find that the signs of the coefficients on legibility are the same as those obtained from the boundary sample: an increase in legibility raises both linkage rates and the share validated. The magnitudes of the coefficients are also similar to our baseline estimates from model (1), obtained with the boundary sample and boundary fixed effects (see the "Baseline" column in the same tables).
However, we are less confident that the OLS estimate of β in model (3) represents the causal effect of legibility, compared to our baseline estimates using the boundary sample. Township fixed effects may not capture all of the unobservable factors that are correlated with our measure of legibility, leading to omitted variable bias. To address this issue, we adjust for the potential bias in β by adopting the method proposed by Oster (2019), which is devised to address selection on unobservables in linear models.
One of the key parameters in Oster (2019) is the coefficient of proportionality, denoted by δ.21 This parameter measures the strength of selection on unobservables relative to selection on observables. Its value may vary across contexts, but assuming that δ is positive, Oster (2019) suggests that a reasonable upper bound for δ is 1 (i.e., selection on unobservables is as strong as selection on observables). She then shows that the true treatment effect lies between the unadjusted (OLS) estimate and the estimate of β under the assumption that δ = 1.
We adopt this approach and estimate the effect of legibility on the quality of linked samples under the assumption that δ = 1. We also do so for δ = −1, since we are unable to verify that δ is positive in our setting. Note that δ must be zero for our OLS estimates to be interpreted as causal.
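For intuition, the bias adjustment can be sketched with the approximation commonly used to implement Oster (2019); the formula below is the widely cited approximation, not necessarily the exact estimator implemented here, and all the numbers are illustrative:

```python
def oster_beta_star(beta_tilde, r_tilde, beta_dot, r_dot,
                    delta=1.0, r_max=1.0):
    """Bias-adjusted coefficient via the approximation in Oster (2019):
    beta* = beta_tilde - delta * (beta_dot - beta_tilde)
                       * (r_max - r_tilde) / (r_tilde - r_dot),
    where beta_dot/r_dot come from the short regression (no controls)
    and beta_tilde/r_tilde from the regression with observed controls."""
    return beta_tilde - delta * (beta_dot - beta_tilde) * \
        (r_max - r_tilde) / (r_tilde - r_dot)

# Illustrative numbers (not from the article's tables):
b = oster_beta_star(beta_tilde=0.25, r_tilde=0.60,
                    beta_dot=0.30, r_dot=0.40, delta=1.0, r_max=1.0)
print(round(b, 3))
```

With δ = 0 the adjustment vanishes and the OLS estimate is returned unchanged; flipping the sign of δ flips the direction of the adjustment, which is why both δ = 1 and δ = −1 are considered in the text.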
Our estimates do not vary substantially with δ. The columns labeled "δ = 1" and "δ = −1" in Tables A34 and A35 present these results. For linkage rates, the effect of legibility is statistically significant for all linked samples regardless of the assumption about δ; the same is true for share validated, except for the three conservative algorithms (ABE-exact5, ABE-JW5, and ABE-NYSIIS5) and the MLP algorithm when δ = 1. Below, we also check the sensitivity of our eventual simulation results against alternative assumptions about δ (i.e., δ = 1 or δ = −1).22

Under the counterfactual scenario where legibility is equal to 1 in all enumeration districts, we simulate the quality of linked samples for each enumeration district as follows:

q̃_e = q_e + β̂ (1 − ℓ_e),

where q_e and ℓ_e, respectively, denote the quality measure (linkage rate or share validated) observed in the data and the legibility measure for enumeration district e, and β̂ denotes the estimate of β in model (3). We truncate the simulated quality at 1, because that is the upper bound on these measures by construction. The upper bound of 1 is rarely binding for the simulated linkage rates, because linkage rates are far below 1 for most enumeration districts. It is binding for some enumeration districts when simulating share validated, but the share of enumeration districts for which the simulated share validated had to be truncated at 1 is at most 10% (see Table A36 for details).

Table 5 presents the linkage rates observed in our data alongside the simulated rates for different values of δ. Using our unadjusted estimates of β, we find that observed linkage rates are approximately 74.8%-88% of what they would be if legibility were equal to 1 in all enumeration districts. In terms of levels, illegibility accounts for a 5.2-9.6 percentage point reduction in the linkage rate, depending on the linking algorithm used. Notably, we do not observe smaller benefits from increasing legibility for linking algorithms with higher (baseline) linkage rates. This suggests that algorithm improvements that increase linkage rates do not compensate for low legibility.

Alternative assumptions about the value of δ do not make a considerable difference in the simulated linkage rates. Under the assumption of δ = 1 (δ = −1), observed linkage rates are between 76% (73%) and 88% (88%) of what they would have been if legibility were equal to 1 in all enumeration districts. In terms of levels, illegibility accounts for a 4.4 (5.6) to 8.8 (10.1) percentage point decrease in linkage rates under the assumption that δ = 1 (δ = −1) (results available upon request). The lack of sensitivity of our simulation results to the value of δ is perhaps because legibility can only improve so much, and the estimated coefficients on legibility in model (3) are between 0.2 and 0.3 regardless of the value of δ. Small differences in β due to different assumptions about δ therefore cannot make a large difference in simulated linkage rates.

While illegibility has a large effect on linkage rates, it plays a modest role in reducing share validated. Table 6 presents simulation results for share validated using parents' birth places as the validation variable. Using our unadjusted estimates of β, observed share validated is between 96.1% and 99.2% of what it would have been with perfect legibility. In terms of levels, differences between simulated and observed share validated range from 0.7 to 3.3 percentage points (results available upon request).

21 There is another parameter, which Oster (2019) denotes R_max; it corresponds to the R-squared of a hypothetical regression that includes as controls all variables, observable or unobservable, that belong in the true model for explaining variation in the dependent variable. As opposed to Altonji, Elder, and Taber (2005), Oster (2019) allows R_max to be less than 1 in cases where there is measurement error in the dependent variable. In our context there is no measurement error in the dependent variable, because our dependent variables are constructed from information that is already in our dataset. We therefore set R_max equal to 1 in our implementation of the bias-adjustment procedure suggested by Oster (2019).

22 As a further robustness check, we estimate δ using our boundary sample, and then estimate the β in model (3) that corresponds to the estimated δ. To estimate δ with the boundary sample, we first estimate model (1) with the boundary sample, replacing the boundary fixed effects with township fixed effects. Then, using the formula in Proposition 3 of Oster (2019), we obtain the δ that corresponds to our baseline estimates, that is, the β in model (1) obtained with the boundary sample and boundary fixed effects. Essentially, assuming that the estimate of β obtained with boundary fixed effects is the true effect, we estimate the δ that corresponds to that true effect in a model with township fixed effects. We then estimate model (3) using the linkable population, with δ set at the estimated value. The estimates of β and δ, as well as the simulation results, can be found in Tables A34 and A35 and in Tables 5 and 6. We find that the estimated δ is close to zero, which suggests that the extent of selection on unobservables is limited, at least for the boundary sample. As a result, the estimates of β and the simulation results are similar to those obtained without any bias adjustment. Note that this exercise is valid only if the degree of selection on unobservables relative to selection on observables is the same across the two samples. While it is difficult to test this assumption, it is comforting that it does not make a large difference in the simulation results.
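As a concrete illustration of this counterfactual exercise (with made-up district values, and β̂ set to 0.25, inside the 0.2-0.3 range reported above), the simulation with truncation at 1 can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical enumeration districts: observed linkage rate and legibility.
observed_rate = rng.uniform(0.3, 0.6, size=5)
legibility = rng.uniform(0.5, 0.95, size=5)
beta_hat = 0.25  # within the 0.2-0.3 range reported for model (3)

# Counterfactual quality if legibility were 1 everywhere, truncated at 1.
simulated = np.minimum(observed_rate + beta_hat * (1.0 - legibility), 1.0)
shortfall_pp = 100 * (simulated - observed_rate)
print(np.round(shortfall_pp, 1))  # percentage points lost to illegibility
```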
These results are robust under the alternative assumption that δ = −1, both quantitatively and qualitatively (see the column labeled "δ = −1" in Table 6). However, when we set δ = 1, the differences between simulated and observed share validated are not statistically significant for three linked samples (ABE-exact5, ABE-JW5, and ABE-NYSIIS5): the observed share validated for these linked samples lies within the 95% confidence interval of the simulated share validated. These results are expected, since these three linking algorithms impose more stringent conditions for declaring a link than the others (see Section 2.3 or references therein for descriptions of the algorithms). It is therefore possible that there is little room for improvement in share validated for these conservative algorithms even if legibility improves significantly. Our results are robust to using middle-name initials as the alternative validation variable (see Table A37).23

Conclusion
In this article, we document the importance of handwriting legibility for the performance of popular linking algorithms in a case study of the 1930-1940 U.S. Census rounds. We find that low enumerator handwriting legibility is associated with lower linkage rates and a smaller share of validated links, and that this holds across linking algorithms. We show that this relationship is causal by focusing on the boundaries of enumeration districts.
Legibility problems are a quantitatively important source of linkage errors. We estimate that 5-10 percentage points more links would be found across these census rounds if all enumerators had perfect legibility. This improvement is just as large for algorithms with higher linkage rates, suggesting that improvements in linking algorithms would not substitute for improvements in legibility. Automated transcription methods may be a promising source of improvement in link quality for historical sources, if they can be programmed to outperform humans.
One remaining issue is how legibility affects downstream analyses that use linked samples. Our findings suggest that legibility affects downstream analyses largely through its impact on false negatives, given its limited effect on the incidence of false positives. We leave a thorough investigation of this issue to future research.24

Supplementary Materials
Figures and tables for additional descriptive statistics, analyses, and robustness checks are collected in the supplemental appendix. Additional results, including a simple model of census data linkage and legibility and heterogeneous effects of legibility on the quality of linked samples, can also be found in the appendix. It can be downloaded from the authors' websites, as well as from this article's page on the journal website.

Figure 2 .
Figure 2. Legibility by enumeration district in Yonkers, NY. NOTE: "Share of people with same transcriptions" is the number of people in each enumeration district for whom the two transcriptions (one by Ancestry.com and another by FamilySearch.org) agree, divided by the number of people in that enumeration district for whom both transcriptions exist. A few enumeration districts are missing because none of the people in those districts have two transcriptions of their names.
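The district-level measure described in this note can be sketched as follows, with hypothetical transcription pairs; the cleaning step mirrors the space-removal described in the text:

```python
from collections import defaultdict

# Hypothetical records: (enumeration_district, transcription_a, transcription_b).
records = [
    ("ED-1", "john smith", "john smith"),
    ("ED-1", "mary jones", "mary janes"),
    ("ED-1", "wm baker",   "wm baker"),
    ("ED-2", "h o'neill",  "h oneill"),
    ("ED-2", "ann kelly",  "ann kelly"),
]

def clean(name: str) -> str:
    """Lowercase and remove spaces before comparing transcriptions."""
    return "".join(name.lower().split())

agree = defaultdict(int)
total = defaultdict(int)
for district, a, b in records:
    total[district] += 1
    agree[district] += clean(a) == clean(b)

# District-level legibility: share of double-transcribed names that agree.
legibility = {d: agree[d] / total[d] for d in total}
print(legibility)
```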

Figure 3 .
Figure 3. The effect of legibility on linkage rates. NOTE: We use the boundary sample to create this figure (N = 739,634). For each linking algorithm (see legend), the symbol corresponds to the linkage rate for that particular bin. The bins are of equal size. Confidence intervals are omitted for clarity of presentation.

Table 1 .
Legibility and the number of people on each side of the boundaries.
NOTE: The unit of observation for this table is boundary × enumeration district. The standardized difference for a continuous covariate x is equal to

(x̄_more legible − x̄_less legible) / √[(s²_more legible + s²_less legible) / 2],

where x̄_more legible and x̄_less legible are the sample means of the covariate x for the more legible and less legible groups, respectively, and s²_more legible and s²_less legible are the corresponding sample variances. The standardized difference for a binary covariate is defined analogously (see Austin 2009 for a reference). + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001.
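The standardized difference in this note can be computed as follows (the covariate values are illustrative):

```python
import math

def standardized_difference(x_more, x_less):
    """Standardized difference between two groups, as in the Table 1 note:
    (mean_more - mean_less) / sqrt((var_more + var_less) / 2),
    using sample variances. Values above 0.1 are the usual imbalance flag."""
    n1, n2 = len(x_more), len(x_less)
    m1 = sum(x_more) / n1
    m2 = sum(x_less) / n2
    v1 = sum((v - m1) ** 2 for v in x_more) / (n1 - 1)
    v2 = sum((v - m2) ** 2 for v in x_less) / (n2 - 1)
    return (m1 - m2) / math.sqrt((v1 + v2) / 2)

# Illustrative check with hypothetical covariate values:
d = standardized_difference([0.5, 0.6, 0.7], [0.49, 0.61, 0.69])
print(round(d, 3))
```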

Table 2 .
Balance of observables at the boundary of enumeration districts.
NOTE: This table presents the means of various observable characteristics of those who live on either side of an enumeration district boundary: the side where legibility is relatively lower ("Less legible") and the side where it is relatively higher ("More legible"). The unit of observation in this table is a person. The sample used to create this table is the boundary sample, that is, those who live on streets that serve as the border between two neighboring enumeration districts. The variables are obtained from the 1940 census.

Table 3 .
The effect of legibility on linkage rates.

Table 4 .
The effect of legibility on share validated.
NOTE: We use only the linked observations in the boundary sample to estimate model (1). The dependent variable is validated/not validated (1/0), where the validation variable is parents' birth places.

Table 5 .
Comparison between the observed and simulated linkage rates of the linked samples. NOTE: This table presents the observed linkage rates for each linked sample (in the column labeled "Obs. qual.") as well as the simulated linkage rates under different assumptions about δ. The column labeled "Unadjusted" contains the simulated linkage rate obtained with the OLS estimate of β in model (3); its confidence intervals are omitted, given how tight the confidence intervals for the OLS estimates of β are. The columns labeled "δ = 1" and "δ = −1" contain simulated linkage rates under the corresponding assumption about δ, and include the 95% bootstrap confidence intervals for the simulated linkage rates. The column labeled "Estimated δ" contains simulated linkage rates when δ is set at the estimated value. See footnote 22 in the main text for details about how δ is estimated, and see Table A34 for the estimates of δ.

Table 6 .
Comparison between the observed and simulated share validated of the linked samples. NOTE: This table presents the observed share validated for each linked sample (in the column labeled "Obs. qual.") as well as the simulated share validated under different assumptions about δ. The validation variable in this table is parents' birth places. The column labeled "Unadjusted" contains the simulated share validated obtained with the OLS estimate of β in model (3); its confidence intervals are omitted, given how tight the confidence intervals for the OLS estimates of β are. The columns labeled "δ = 1" and "δ = −1" contain simulated share validated values under the corresponding assumption about δ, and include the 95% bootstrap confidence intervals for the simulated estimates. The column labeled "Estimated δ" contains share validated when δ is set at the estimated value. See footnote 22 in the main text for details about how δ is estimated, and see Table A35 for the estimates of δ.