Applications of machine learning in tabular document digitisation

Abstract Data acquisition forms the primary step in all empirical research. The availability of data directly impacts the quality and extent of conclusions and insights. In particular, larger and more detailed datasets provide convincing answers even to complex research questions. The main problem is that large and detailed usually imply costly and difficult, especially when the data medium is paper and books. Human operators and manual transcription have been the traditional approach for collecting historical data. We instead advocate the use of modern machine learning techniques to automate the digitization and transcription process. We propose a customizable end-to-end transcription pipeline that performs layout classification, table segmentation, and handwritten text transcription, and that is suitable for tabular data, as is common in, e.g., census lists and birth and death records. We showcase our pipeline through two applications: The first demonstrates that unsupervised layout classification applied to raw scans of nurse journals can be used to obtain valuable insights into an extended nurse home visiting program. The second application uses attention-based neural networks for handwritten text recognition to transcribe age and birth and death dates and includes a comparison to automated transcription using Transkribus in the regime of tabular data. We describe each step in our pipeline and provide implementation insights.


Introduction
Big data and machine learning (ML) receive considerable attention in the historical and digital humanities literature. Archives and historians have realized how exploiting the new technologies can increase the scalability and speed of traditional digitization and transcription methods at a fraction of the costs. Specifically, within economic history, Gutmann, Merchant, and Roberts (2018) point out that large collections of historical documents are important examples of big data and call for the development of new and efficient ML methods for extraction, visualization, and analysis. More broadly, recent work by Colavizza et al. (2021) and Jaillant (2022) discusses how artificial intelligence (AI) and ML can automate workflows for metadata extraction, tagging, and document content extraction, making large historical archives more searchable and more accessible, hence of greater value for both the scientific community and the general public. Along these lines, Bartz, Rätz, and Meinel (2021) present a detailed workflow for adding metadata to documents residing at archives by classifying text fragments and thereby highlighting documents containing, for example, dates or certain words. Detection of selected handwritten keywords, also known as word-spotting, in historical archives using modern ML techniques is also extensively studied, see, e.g., Giotis et al. (2017). Other work on ML-based archiving and digitization methods that are designed for big data includes Colavizza, Ehrmann, and Bortoluzzi (2019) and Moss, Thomas, and Gollins (2018).
The focus of our work is limited to tabular documents, that is, documents that contain tables with sequences of numbers and characters organized in cells. Paper-based sources of tabular documents are often manually digitized by researchers or crowdsourcing platforms. However, manual digitization becomes infeasible as the volume of data grows. This is especially true for documents containing structured and tabular data which have high information density and exist in massive quantities at the population level. 1 Examples include census lists, birth and death records, church records, medical records, school grade sheets, logbooks, weather measurements, and school diaries. While such large collections of tabular documents provide compelling opportunities to study historical phenomena, especially long-run outcomes and intergenerational transmission, see, e.g., Gutmann, Merchant, and Roberts (2018), Feigenbaum (2018), and Abramitzky et al. (2021), they also pose significant challenges to the transcription process due to their scale.
ML, combined with computer vision techniques, is a potential solution to this problem as it can fully automate the digitization and transcription processes. The main gains of automated transcription by ML are reproducibility, cost efficiency, and scale. Several methods have been developed to automate the transcription of scanned documents. Traditional optical character recognition (OCR) has been widely applied for documents with machine written text. 2 Handwritten text recognition (HTR) has also been used with success, although it often requires more complex ML methods to achieve adequate performance in documents with varying layout and script. 'Transkribus' (Muehlberger et al. 2019) and 'Monk' (Lambert 2020) are publicly available projects that focus on transcription of historical manuscripts with streams of handwritten text. 3 To use Transkribus, a sufficiently large number of pages must be manually annotated with respect to layout, baselines, and text to establish a ground truth that can guide the transcription. 4 Transkribus then uses ML to perform layout analysis, line segmentation, and transcription. Monk works similarly but improves performance when documents are heterogeneous in layout and handwriting style, see Weber et al. (2018) and Lambert (2020).
Both OCR and HTR methods have significantly contributed to the transcription of books and long-form manuscripts like newspaper articles, see, e.g., The Australian Newspapers Digitization Program by the Australian National Library. 5 However, structured and tabular documents pose a substantially different challenge. Tabular data are characterized by a large number of cells that are organized into tables. Such tables have high information density, and due to the cell structure, there is substantial within-table variation in the locations and baselines of the text. These complexities make it difficult to apply off-the-shelf tools. Further, in addition to transcribing the text, we need to analyze the table structure to organize the transcribed information for use in downstream analyses. Technologies such as Transkribus and Monk are not specifically designed to transcribe tabular data and they require manual adaptations to work on complex tables, see, e.g., Muehlberger et al. (2019). 6 While there has been work to adapt Transkribus to table transcription, the current approaches are not fully automatic and they entail a significant manual workload related to segmentation and drawing of baselines. 7 When the table complexity is high and the document count is large, even minor manual input per document will drastically increase the overall workload. In such cases, fully automated transcription is preferable. These considerations leave significant unsolved challenges in the transcription of large-volume tabular data. In this work, we focus specifically on tabular data and we propose a custom end-to-end ML pipeline for automated transcription of tabular documents with handwritten text. As we demonstrate, our pipeline is particularly well suited to transcribe large collections of documents containing population data such as mortality statistics, names, dates, cause of death, occupation, place of residence, sex, civil status, etc.
We test and illustrate our pipeline on two large collections of historical documents. In the first case study, we demonstrate that we can use layout analysis, 'step one' of our pipeline, in itself to collect data. We apply layout classification to a complex tabular dataset of nurse records. We also discuss how the ML method gives additional insights on treatment assignment that complement, and expand on, the results from an intention-to-treat analysis, without requiring additional manual transcription of the source data. The second case study extends this example and shows that, after the initial layout classification, we can use table segmentation and handwritten text recognition to transcribe handwritten information inside the tabular document. We also show that, based on the raw scans, the pipeline can automatically collect lifespan data that are directly useful in modeling mortality risk. In this application, we consider all three steps of the pipeline, and apply the methods to a subset of a large collection of death certificates. The death certificates are interesting as they represent a realistic example of a large structured dataset: the images are non-trivial to transcribe (significant variation in layout and script), the collection is large, and it is available online. Finally, we also compare the ML approach to traditional crowdsourcing and Transkribus, and show that our ML approach can significantly reduce the transcription burden for tabular documents while performing on-par with crowdsourced transcriptions and better than transcriptions from Transkribus. In particular, we find that the accuracy of our transcriptions is close to that of crowdsourcing, and that the statistical distribution of the transcribed data is roughly equivalent to that of the ground truth data. This is especially important when the data are used in downstream statistical models.
The rest of the paper is structured as follows. Section "Pipeline" provides an overview of our ML pipeline for tabular documents. Section "Applications" discusses two case studies where we test the pipeline. Section "Comparison of ML, crowdsourcing, and Transkribus" compares the results from the ML pipeline with data obtained from crowdsourcing and Transkribus. Section "Discussion and conclusion" concludes. The Technical Appendix, which is available online, documents the implementation details of our approach. Our datasets and code are available on request.

Pipeline
We now discuss the structure of our ML pipeline. The goal is to transform raw scans of tabular documents into datasets. In general, such scans are heterogeneous in resolution, table structure, and script. As a consequence, and to maximize the generalization capabilities, our ML pipeline is split into three sequential components: (1) 'layout classification' sorts the documents based on layout, (2) 'table segmentation' extracts the cells of interest from the source images, and (3) 'transcription' transcribes the extracted cell images. Each component is responsible for a specific task and can be adapted to the problem at hand. Figure 1 illustrates the pipeline. We emphasize that the pipeline is highly customizable and that the downstream research question, and statistical model, should dictate the choice of individual components in the pipeline. For example, if our research question focuses on mortality, we can limit our attention to cells containing death dates and we can tailor our transcription model to dates. Such constraints can accelerate the learning process.
The inputs to the pipeline are scanned documents stored digitally as image files. These images consist of colored dots called pixels. Each pixel is characterized by a location and a color, and stacking a certain number of pixels horizontally and vertically forms an image. Thus, we can consider an image of height h and width w to be an h × w matrix where each entry corresponds to a single pixel. The objective of the pipeline is to learn a mapping from the image matrix into a representation suitable for statistical analysis or storage in a database. However, this is both complex and challenging as the image matrix can be of high dimension, which calls for specialized statistical models.
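To make the input representation concrete, the following minimal sketch loads a scanned page as such a matrix; the file name is hypothetical and any image library would do.

```python
# Minimal sketch: a scanned page as an h x w matrix of pixel intensities.
# The file name is hypothetical.
import numpy as np
from PIL import Image

page = Image.open("journal_page_001.png").convert("L")  # "L" = 8-bit grayscale
matrix = np.asarray(page)                               # shape (h, w)
print(matrix.shape, matrix.dtype)                       # e.g., (3508, 2480) uint8
```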
The first step of the pipeline is layout classification. Layout classification refers to the process of organizing a document collection by common layout structures, e.g., a common table, a heading, or preprinted landmarks. The layout type is important as the table segmentation relies on a pre-defined template that must match with the preprinted structure in the image. Due to variation in layout across a given document collection, we need to construct one template for each layout type, and when we fit a template to a document, we need the template and layout types to match. This is straightforward if the documents are sorted according to layout as all images of a given layout type will share the same template. Our pipeline allows for different layout classification models and we showcase both an unsupervised and supervised approach in, respectively, Section "Nurse records and infant care" and Section "Death certificates and mortality". 8 The unsupervised method clusters the documents based on visual descriptors extracted by a neural network. However, while its performance is impressive in our first case study, it does not generalize to more heterogeneous documents. This motivates a supervised approach in our second case study. In the supervised setting, we use the Bag-of-Words (BoW) method which was originally developed for classifying chunks of text (Murphy 2012, 87). It has since been successfully applied in the field of computer vision, see, e.g., Csurka et al. (2004) and Sivic and Zisserman (2009). BoW compares documents based on the frequency of certain clusters of key points, so-called 'visual words'. Examples of visual words are the corners of a table, or a particular document heading.
The key points are based on Speeded-Up Robust Features (SURF) (Bay et al. 2008) as this has historically been the common choice, see, e.g., Csurka et al. (2004). As a result, every document will have an associated histogram that describes its distribution of these visual words. We then train a model to classify a given document into its respective layout group based on its histogram. The classification model can be of any type, but we use support vector machines with radial basis kernels. We provide additional details on layout classification in Section A.2.2 of the Technical Appendix.
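As a rough illustration of this step, the sketch below builds a visual vocabulary with k-means, represents each page as a normalized histogram of visual words, and fits an RBF-kernel support vector machine. It is a generic BoW implementation under stated assumptions, not our exact code (see the Technical Appendix): SURF requires an opencv-contrib build with the non-free modules enabled, the vocabulary size of 200 is arbitrary, and `train_images`, `train_labels`, and `test_images` are assumed to exist.

```python
# Generic Bag-of-Visual-Words sketch (not the paper's exact implementation).
# Assumes: opencv-contrib with non-free modules (for SURF; cv2.ORB_create()
# is a patent-free drop-in), and that train_images/test_images are lists of
# grayscale uint8 arrays with train_labels giving the layout classes.
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def extract_descriptors(images):
    surf = cv2.xfeatures2d.SURF_create()
    # One descriptor matrix per page; assumes every page yields keypoints.
    return [surf.detectAndCompute(img, None)[1] for img in images]

def bow_histograms(desc_list, vocab):
    hists = []
    for desc in desc_list:
        words = vocab.predict(desc)  # nearest visual word per descriptor
        hist, _ = np.histogram(words, bins=np.arange(vocab.n_clusters + 1))
        hists.append(hist / hist.sum())  # normalize: pages differ in keypoint count
    return np.vstack(hists)

train_desc = extract_descriptors(train_images)
vocab = KMeans(n_clusters=200).fit(np.vstack(train_desc))   # 200 visual words
clf = SVC(kernel="rbf").fit(bow_histograms(train_desc, vocab), train_labels)
layouts = clf.predict(bow_histograms(extract_descriptors(test_images), vocab))
```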
The second step of the pipeline is table segmentation where we extract images corresponding to each field (or cell) in a given table in the source image. Coüasnon and Lemaitre (2014) provide a general introduction to this topic. For a given classified group of document images, we need to choose a reference or 'template' document. The template can be any well-scanned document in the respective layout group. We note the coordinates of line intersections and line endpoints and refer to them as the template key points. We next construct an 'overlay', which is the set of rectangles that encloses each field of interest and where each rectangle is represented by the coordinates of its four corners. This is done using the same document that provided the template. Constructing the template and overlay is an operation that has to be carried out only once per document group. Figure 2 illustrates the template and the corresponding overlay for a Danish death certificate, see also Section "Death certificates and mortality". With the template and overlay in place, we process the remaining documents of the respective group, which we refer to as the 'target' images. For each target image we need to locate the same key points as those defined in the template. This is done with standard computer vision operations that identify the vertical and horizontal lines of the table structure, see Figure A.5 for a visual example. Based on the identified lines, we use their intersections and endpoints to find the same key points that were defined in the template. What remains is to find the transformation that maps the key points of the target image to their respective counterparts in the template. To do this, we use Coherent Point Drift (CPD) which iteratively aligns the two point sets (Myronenko and Song 2010). After applying CPD, we obtain translation and transformation parameters that, when applied to the target key points, align them with those of the template. We can then extract each field using the transformed coordinates. There are also other methods that could be adapted to segment the documents, see, e.g., Li, Wang, and Fang (2019); Clinchant et al. (2018), and these could seamlessly be integrated into our pipeline. However, we do not consider this here. We provide details on table segmentation in Section A.2.3 of the Technical Appendix.
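The following sketch illustrates the alignment step with the pycpd package's affine CPD registration; the actual pipeline may use a different CPD variant, and the key point coordinates here are made up. Once the map from target to template is estimated, its inverse carries the overlay rectangles into the target image, where the fields can be cropped.

```python
# CPD alignment sketch using pycpd (pip install pycpd); coordinates are
# illustrative. X is the fixed template point set, Y the moving target set.
import numpy as np
from pycpd import AffineRegistration

template_pts = np.array([[120., 80.], [980., 80.], [120., 640.], [980., 640.]])
target_pts = np.array([[131., 95.], [992., 88.], [125., 655.], [989., 649.]])

reg = AffineRegistration(X=template_pts, Y=target_pts)
aligned_pts, (B, t) = reg.register()   # target -> template: y @ B + t

def template_to_target(points, B, t):
    # Invert the estimated map to carry overlay corners into the target image.
    return (points - t) @ np.linalg.inv(B)

overlay = np.array([[150., 100.], [400., 100.], [150., 180.], [400., 180.]])
print(template_to_target(overlay, B, t))  # crop the target image at these corners
```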
The final step of the pipeline transcribes the extracted fields. Transcription is the process of converting an image of text into a string representation. We use an attention-based neural network, suggested by Xu et al. (2015) for image captioning, and adapt it to transcription of handwritten text. The model by Xu et al. (2015) applies to tasks with image inputs and sequence outputs, which is also what characterizes the transcription problem where the output is a sequence of characters. An advantage of the attention approach is that the model requires only rough segmentation at the field level and does not rely on, e.g., text baselines. The model processes either characters or words and we limit the set of possible characters or words based on the task at hand. This set could either be the alphabet in the case of name transcription or digits for age transcription. We provide additional details in Section A.2.4 of the Technical Appendix.
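To convey the idea, the sketch below is a heavily simplified attention encoder-decoder in PyTorch: a small CNN encodes the field image into a grid of feature vectors, and at each step a GRU decoder scores every grid location against its hidden state, forms a context vector, and predicts the next token. This is a minimal stand-in for the architecture detailed in the Technical Appendix; all dimensions and the vocabulary size are arbitrary.

```python
# Minimal attention-based transcriber in the spirit of Xu et al. (2015);
# a simplified stand-in, not the paper's exact architecture.
import torch
import torch.nn as nn

class AttentionTranscriber(nn.Module):
    def __init__(self, vocab_size, enc_dim=64, dec_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(                # image -> grid of features
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, enc_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.embed = nn.Embedding(vocab_size, dec_dim)
        self.rnn = nn.GRUCell(dec_dim + enc_dim, dec_dim)
        self.att = nn.Linear(enc_dim + dec_dim, 1)   # additive attention score
        self.out = nn.Linear(dec_dim, vocab_size)

    def forward(self, images, tokens):
        feats = self.encoder(images)                 # (B, C, H', W')
        feats = feats.flatten(2).transpose(1, 2)     # (B, L, C) with L = H'*W'
        h = feats.new_zeros(feats.shape[0], self.rnn.hidden_size)
        logits = []
        for step in range(tokens.shape[1]):          # teacher forcing
            expanded = h.unsqueeze(1).expand(-1, feats.shape[1], -1)
            weights = torch.softmax(self.att(torch.cat([feats, expanded], -1)), 1)
            context = (weights * feats).sum(1)       # (B, C): attended features
            h = self.rnn(torch.cat([self.embed(tokens[:, step]), context], -1), h)
            logits.append(self.out(h))
        return torch.stack(logits, 1)                # (B, T, vocab_size)

# e.g., digits 0-9 plus <Start>/<End> markers for age transcription
model = AttentionTranscriber(vocab_size=12)
out = model(torch.randn(2, 1, 64, 256), torch.zeros(2, 4, dtype=torch.long))
print(out.shape)  # torch.Size([2, 4, 12])
```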

Nurse records and infant care
In economics, there is a large literature on the relationship between early-life conditions and later-life outcomes, see, e.g., Almond and Currie (2011a), Almond and Currie (2011b), Almond, Currie, and Duque (2018), and Hoehn-Velasco (2021). This is commonly studied in a regression framework where data on individuals are collected at birth, or during childhood, and linked to outcomes in adulthood. These studies inherently require long-run historical datasets that cover the individuals' lifespans. In the following, we exemplify such a study by considering the impact of early-life care on outcomes in adulthood.
To estimate a causal effect of infant care, we need an assignment of each individual to a treatment or control group. In the social sciences, we often infer treatment assignment based on an intervention or policy that (quasi) randomly has assigned each individual to treatment or control, see, e.g., Angrist and Krueger (1991), Angrist, Imbens, and Rubin (1996), and Imbens and Rubin (1997). This section considers a policy where a subset of infants was made eligible to participate in an expanded care programme. The participants in the programme received additional home visits from nurses. Enrollment in the programme was governed by the date of birth. Individuals born in the first three days of each month were eligible to receive additional monitoring. The details of approximately 95,000 infants (whether enrolled or not) were collected in journals kept by the health care system. The journals have previously been described and used by Biering-Sørensen, Hilden, and Biering-Sørensen (1980), and Bjerregaard et al. (2014) have manually transcribed a small subset of the contents to study birth weight and breastfeeding. The infants who received additional monitoring have a specific follow-up table in their journal only if the monitoring took place, i.e., the presence of the table is decided by actual treatment, not eligibility. The journals have been scanned and are available as digital images. 9 While parts of the journals have previously been digitized, the presence of the treatment table was not recorded. Figure 3 illustrates the pages in a typical journal.
In the following, we use the layout classification step of our ML pipeline to detect whether an infant received extended care, i.e., was treated. This is to illustrate that 'step one' of the pipeline can, in isolation, collect important data for applied research, and that transcription is not always needed after the initial layout analysis. It can save both time and complexity if the digitization process is adapted to the downstream research question, and one strength of our pipeline is that it allows for such adaptations. To detect whether an infant was treated, our ML model analyses the layout of each journal page and identifies the group of children that received follow-up care. We compare this ML-based detection to an intention-to-treat indicator inferred from the three-day policy and find that there is so-called noncompliance (Angrist, Imbens, and Rubin 1996). Our dataset contains 95,323 journals with a total page count of 261,926. The page count and order vary between journals. This implies that all pages in all journals have to be reviewed to identify the treated. Since the treated individuals can be identified by the presence of a particular page in their journal, we can use layout classification to detect treatment. If a page in their journal is classified as containing the treatment table layout, then the individual is classified as treated. We did not have access to a labeled dataset to train a supervised classifier for the treatment page, as will often be the case in practice. Thus, we pursue an unsupervised approach where we rely only on the scanned images without labels. The technical details are described in Section A.1 of the Technical Appendix. Note that we still need to manually construct an evaluation dataset to probe the performance of the applied method. We will show an example of a supervised classifier for the same purposes in Section "Death certificates and mortality", where training data is available.

Results
We now discuss the results from applying 'step one' of our pipeline to the nurse records. The layout classification method clusters pages together according to their visual appearance. To efficiently describe the visual information of the nurse journals, we use a neural network to construct a low-dimensional representation of each journal page. We then apply a density-based clustering algorithm to the output of this neural network to recover the different layout types. In Figure 4 we illustrate this process by showing how the pages separate into different clusters based on their visual appearance. Each point in Figure 4 represents a page in a journal and the points are colored according to their assigned cluster. A clear structure is evident. There are 37 clusters, which we manually review and annotate according to their contents. Annotation is carried out by randomly sampling 10 pages from each cluster and assigning a label for the whole cluster based on the contents of these pages. This amounts to 370 journal pages that need manual review (out of 261,926). This procedure shows that the treatment pages are contained in four distinct clusters. We extract all pages residing in these clusters and classify the underlying individuals as treated, i.e., an individual is treated if any page in their journal belongs to one of the four treatment page clusters.
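A minimal sketch of this unsupervised step is given below, with a generic pretrained CNN standing in for the descriptor network used in our pipeline (the actual setup is described in the Technical Appendix); `pages` is assumed to be a list of PIL images, and the DBSCAN parameters are illustrative.

```python
# Unsupervised layout sketch: a pretrained CNN stands in for the descriptor
# network. `pages` is assumed to be a list of PIL page images.
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.cluster import DBSCAN

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # drop the classifier head -> 512-d embeddings
backbone.eval()

prep = T.Compose([T.Resize((224, 224)), T.ToTensor(),
                  T.Normalize(mean=[0.485, 0.456, 0.406],
                              std=[0.229, 0.224, 0.225])])

@torch.no_grad()
def embed_pages(pil_pages):
    batch = torch.stack([prep(p.convert("RGB")) for p in pil_pages])
    return backbone(batch).numpy()            # (n_pages, 512)

embeddings = embed_pages(pages)
clusters = DBSCAN(eps=5.0, min_samples=10).fit_predict(embeddings)  # -1 = noise
```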
To evaluate the unsupervised procedure, we manually construct a ground truth evaluation dataset by reviewing all 10,914 pages of 4,000 randomly selected journals. For each journal, we record the presence of the treatment table. We review each journal twice and find 234 treated journals. Table 1 shows a confusion matrix for the ML treatment detection in the evaluation sample. All 234 treated and 3,766 untreated individuals are correctly classified. Due to class imbalance, the classifier could obtain an accuracy of 3,766/4,000 ≈ 94.15% by simply assigning everyone as non-treated. 10 Thus, more informative performance measures are precision and recall (Murphy 2012, 184-185). In our context, precision is 'the fraction of individuals correctly predicted as treated out of the total number of individuals predicted as treated', while recall is 'the fraction of individuals correctly predicted as treated out of the total number of treated individuals'. In the ground truth sample, the unsupervised method has precision and recall equal to unity.

Table 1. Confusion matrix for the ML treatment detection in the evaluation sample.
                      ML: Treated    ML: Not treated    Total
Treated               234            0                  234
Not treated           0              3,766              3,766
Total                 234            3,766
Note: The frequencies are based on a randomly sampled and manually reviewed validation set of 4,000 journals (10,914 pages). The treatment detection model does not rely on any segmentation, but detects the presence of the whole page containing the treatment table. Policy assignment is based on an official assignment rule which offered enrollment in the nurse visiting programme to all children born in the first three days of each month. The ML assignment is based on the machine learning model and bases assignment on the presence of the treatment page in the journals. This allows for assessment of compliance in addition to the intention-to-treat effect, i.e., the date-of-birth assignment mechanism.
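For concreteness, a small sketch reproducing these measures from the Table 1 counts with scikit-learn:

```python
# Precision and recall recomputed from the Table 1 counts (1 = treated).
from sklearn.metrics import precision_score, recall_score

y_true = [1] * 234 + [0] * 3766   # ground truth evaluation sample
y_pred = [1] * 234 + [0] * 3766   # ML predictions: no errors in this sample
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))  # 1.0 1.0
```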
Apart from the performance of the classifier itself, the results from the layout detection give insights on treatment assignment. Table 2 shows that not all eligible individuals received the follow-up visits. 8,596 children were eligible, but only 4,901 individuals born in the three-day eligibility period received a visit. 11 This is a treatment uptake of 57.01%. In addition, there is a group of 397 children born outside the intervention days that still received treatment even though the policy assigns them as controls. 12 This reveals an issue with noncompliance (8,596 − 4,901 + 397 = 4,092 non-compliers). Noncompliers are individuals that do not comply with their assignment to treatment (or control), e.g., children born between the 1st and the 3rd that do not receive follow-up care. For statistical reasons, this is important when studying the effects of extended care in a regression framework, see, e.g., Angrist, Imbens, and Rubin (1996). Without applying our ML pipeline, these details on the intervention would not have been discovered. That is, unless all 261,926 pages were manually reviewed.

Death certificates and mortality
Denmark introduced the national use of death certificates in 1832. 13 A death certificate documents the death of a single individual. Death certificates are used to study a number of historical, economic, and health questions at the individual and population levels (Alonso et al. 2012). Importantly, the certificates contain individual-level lifespan and cause-of-death data, which could be used to model mortality risk (Högberg and Wall 1986). Potentially, they can also be linked to other registries and datasets to model the relationship between mortality and external conditions, see, e.g., van den Berg, Lindeboom, and Portrait (2006) and Bruckner et al. (2014). The information from the death certificates also plays a central role in studies of place-of-death (Cohen et al. 2007) and pandemics (Simonsen et al. 2011; Gill and DeJoseph 2020).
The death certificates are an important example of historical big data as they are hardly feasible to transcribe manually. It is estimated that over 3 million Danish death certificates exist and each of them contains 10-12 fields, see Figure 5 for an example. 14 This equals millions of individual fields to transcribe. However, as we will illustrate, our ML pipeline is well suited for efficient transcription at this scale. In this case study, we work on a subsample of approximately 250,000 death certificates that were collected across multiple years and locations. The certificates contain preprinted forms that are filled in by hand. The forms change over time and several subtypes are used to distinguish between deceased infants, suicides, accidents, etc. These are illustrated in Figure A.4 in the Technical Appendix. The decentralized scanning of the documents, by multiple archives with different volunteers and equipment, has resulted in substantial variation in scan quality. Unlike the first case study, we use all three steps of our pipeline to illustrate a fully automated approach where the pipeline transforms raw scans into transcribed lifespan data. For the sake of illustration, and to limit the complexity, we focus only on transcription of dates (birth and death) and age. These variables are important in studies of mortality and they have a well-defined dictionary consisting purely of numbers 0-9 and months (January-December). Also, these fields allow for an internal consistency check as the difference between an individual's birth and death dates must match the individual's age. The fields are still challenging to transcribe as they are handwritten and vary significantly in script and format, in part due to the presence of multiple authors, see Figures 6 and 7. We limit our attention to the type B death certificates, as the preprinted form has clearly delineated fields and our collection has a large proportion of type B certificates. 15
Figure 5 shows a type B certificate where we highlight the relevant fields, including the fields for age and birth and death dates. We apply the pipeline to the type B death certificates to transcribe age and birth and death dates. The pipeline uses BoW for layout classification, CPD for segmentation, and attention-based neural networks for transcription. In this process, we rely on several ground truth datasets that we describe in Section A.2.1 of the Technical Appendix. As the table segmentation method does not need training, there is no dataset for this step. We construct the evaluation and training datasets by manually transcribing a random sample of images from the death certificates. We verify each image twice and discard images with segmentation errors. Importantly, there is no overlap between the training and evaluation datasets.

Results
Layout classification. We train the layout classifier on a dataset containing 7,000 randomly sampled death certificates and their ground truth layout type. We consider four distinct layout classes: (1) type A certificates, (2) type B certificates, (3) all other certificates, and (4) empty pages. We evaluate the classifier on 2,184 certificates. Table 3 shows the confusion matrix. The classifier performs well with only two false positives and one false negative, and the class-wise precision and recall for type B certificates are unity. After evaluation, we use the classifier to predict the layout class of the 250,000 death certificates and extract 44,903 certificates that are classified as type B.

Note to Table 3: The frequencies are based on a randomly sampled and manually reviewed evaluation set of 2,184 death certificates. Death certificates are classified into four classes: Empty, A, B, and Other. In this application we are only interested in the type B certificate.
Table segmentation. Our table segmentation method does not need training data except for an initial template that is manually drawn to match the form in the document. We apply the segmentation method to the 44,903 type B certificates, extracted in the layout classification step, and we segment the fields of interest. The segmentation is not without errors. Segmentation errors typically imply that parts of a field are cut off in the segmented image. Naturally, this affects the final transcription accuracy. We discuss this issue further in Section "Comparison of ML, crowdsourcing, and Transkribus".
Transcription. We evaluate the performance of the transcription models using two metrics. One relates to the average accuracy across individual components (called tokens) in the sequence of numbers and characters, denoted TA, and the other to the accuracy of complete sequences, denoted SA. 16 Specifically, SA_m is the proportion of correctly predicted sequences when m mistakes are allowed in the sequence. The token and sequence accuracies are closely related to the character and word accuracies in the HTR literature, see, e.g., Graves et al. (2009). 17 This section focuses on the performance of the ML transcription models in isolation, so all results are conditional on the transcription models receiving correctly segmented images. The trained transcription models are used to predict the content of the images in the date and age evaluation datasets using beam search. Prior to prediction, the images are standardized to zero mean and unit variance using the mean and variance estimated from the entire evaluation dataset. 18 No other preprocessing is applied. Transcription of a single image happens in less than 75 ms. Table 4 shows the performance of the two transcription models, for date and age respectively, on their corresponding evaluation sets. We present results both for training with and without augmentation of the training dataset, where augmentation is the process of slightly altering the images to generate more data. The token accuracy (TA) for dates is 97.9% with augmentation and 92.2% without. For ages the TA is 98.5% with augmentation and 96.6% without. These are the average accuracies of predicting a single token correctly. Figures 6 and 7 show random samples of incorrect and correct dates as predicted by the ML date transcription model.
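For clarity, the sketch below computes TA and SA_m for substitution-only errors on equal-length sequences, which is how we read the definitions above; it is an illustration rather than our evaluation code.

```python
# TA and SA_m sketch, assuming substitution errors only (equal-length
# predicted and true token sequences); illustrative, not the evaluation code.
def token_accuracy(preds, truths):
    pairs = [(p, t) for pr, tr in zip(preds, truths) for p, t in zip(pr, tr)]
    return sum(p == t for p, t in pairs) / len(pairs)

def sequence_accuracy(preds, truths, m=0):
    # SA_m: share of sequences with at most m token mistakes
    hits = sum(sum(p != t for p, t in zip(pr, tr)) <= m
               for pr, tr in zip(preds, truths))
    return hits / len(preds)

preds = [list("1890"), list("1851")]
truths = [list("1890"), list("1857")]
print(token_accuracy(preds, truths))         # 0.875 (7 of 8 tokens)
print(sequence_accuracy(preds, truths, 0))   # 0.5: one sequence fully correct
print(sequence_accuracy(preds, truths, 1))   # 1.0: one substitution allowed
```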
In Table 4, the zero-error sequence accuracies SA_0 are significantly lower than the token accuracies for both models. Under augmentation, the date model achieves an SA_0 of 90.5% while the age model achieves an SA_0 of 97.2%. In the context of US censuses, Nion et al. (2013) transcribed age at a sequence accuracy of approximately 85% using convolutional neural networks.
Notice that a sequence prediction is only correct if all tokens are predicted correctly. This implies correctness of 11 tokens for dates and 4 tokens for ages (including <Start> and <End> markers). Hence it is no surprise that the zero-error sequence accuracy is much higher for ages than for dates. Also, the variation in dates is larger than in ages with respect to both format and combination of digits and characters.
It is apparent from Table 4 that an increase in the number of allowed mistakes per sequence improves the accuracy significantly. For example, if we allow one mistake (i.e., one substitution) in the date sequence, the one-error sequence accuracy is 98.9%. Under some circumstances, e.g., linking, a certain number of mistakes in the sequence might be acceptable. Also, in statistical models, the transcription errors might not matter unless they depend systematically on the transcribed information.
Notice also the difference between training with and without augmentation in the first and second rows of Table 4. The significant differences are due to the small training datasets of 11,630 dates and 11,072 ages. If our training datasets were larger, the payoff from augmentation would be smaller. However, the differences display the benefits of augmentation to boost performance in smaller training datasets.

Note to Table 4: Standard errors are in parentheses. The first row uses only ground truth training samples, while the second row shows the result when training is conducted on the augmented dataset. The date training set contains 11,320 samples and the evaluation set contains 1,000 samples. The age training set contains 11,072 samples and the evaluation set contains 1,000 samples. The dates contain both birth and death dates. Note that these are not end-to-end accuracy rates, so they do not factor in the performance of the table segmentation (i.e., segmentations that obscure the written information have been discarded). The accuracy excludes the <Start> token but includes the <End> token (the <Start> token is forced in the network, so it will always be present).

Comparison of ML, crowdsourcing, and Transkribus
To gauge the performance of our ML pipeline, we compare the ML predictions to transcriptions from crowdsourcing and Transkribus. The usefulness of this comparison is twofold. First, we can evaluate if the ML pipeline performs on-par with crowdsourcing and Transkribus. Second, using the large crowdsourced dataset, we can estimate the end-to-end performance of the ML pipeline including errors of segmentation, transcription, and, to an extent, layout classification. 19 The crowdsourced dataset is freely available online from the Danish National Archive, and anyone can contribute to the dataset through their website. 20 Our own evaluation dataset, as used in Table 4, was manually reviewed to exclude (1) images where segmentation errors obscured the text in the image and (2) images belonging to a document of the wrong layout type (i.e., anything other than type B death certificates). Evaluating the model on this dataset provides a clean measure of the transcription model in isolation. However, it does not give any insights on the transcription performance when the model might receive a badly segmented image. The crowdsourced dataset is different in this respect as it is constructed by humans looking at the raw document, finding the relevant field, and transcribing the text. This implies that the crowdsourced transcriptions cannot be impacted by segmentation or layout classification errors. By running the entire pipeline on the raw images, and comparing the final transcription output to the crowdsourced transcription, we can get a measure of the overall performance of the pipeline in practice. Of course, this relies on the assumption that the crowdsourced data are perfectly transcribed. Thus, to get a baseline indication of the quality of the crowdsourced dataset, we compare it against our ground truth training and evaluation sets (overlap of 2,864 documents) and find that the dates are identical in 96.30% of the cases, see Table 5. For comparison, the performance of the transcription model on the evaluation set (i.e., perfectly segmented images) is 90.5%. If we look at the individual components of the date, the ML performance is 96%, 97%, and 97.2% on days, months, and years, respectively. For crowdsourcing, the corresponding accuracy rates are 98.3%, 98.7%, and 98.8%. Thus, on the individual date components, the accuracies of crowdsourcing and ML are fairly close, although statistically different. However, this neglects the impact of the layout classification and segmentation steps in the pipeline.

Note to Table 5: Standard errors are in parentheses. The evaluation set for the ML (pipeline) model consists of 1,000 samples, while the evaluation set for the crowdsourced predictions consists of 2,864 samples as we can pool both our training and evaluation ground truth datasets in this case. The training and evaluation sets have been manually reviewed twice, and dates that are unreadable due to bad segmentation have been removed. Age is not directly transcribed in the crowdsourced dataset and hence excluded here. Columns 1-2 are sequence accuracies for the whole date, while columns 3-8 are sequence accuracies on the individual components of the date (day, month, and year). Note that the comparison takes into account common date formatting, e.g., that the dates 01-10-2000 and 1-10-2000 convey the same point in time.
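A minimal sketch of such format normalization (cf. note 21 and the note to Table 5 above) is shown below; the separator set and day-month-year order are assumptions.

```python
# Date normalization sketch: "01-3-2000" and "1/3-2000" should compare equal.
# Assumes day-month-year order and the separator set [-/. ].
import re

def normalize_date(s):
    day, month, year = re.split(r"[-/. ]+", s.strip())
    return f"{int(day):02d}-{int(month):02d}-{year}"

print(normalize_date("01-3-2000") == normalize_date("1/3-2000"))  # True
```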
Next, we evaluate the ML transcriptions by using the crowdsourced transcriptions as ground truth. The crowdsourced transcriptions overlap with our sample for 23,263 documents, each containing a birth and death date for a total of 46,526 dates, and we filter out any overlap with the training sample used to train the ML model. Note that the crowdsourced dataset does not contain transcriptions of age. The dates predicted by the ML model and crowdsourcing are identical in 83.66% of the cases, and for 89.96% of the dates the difference is less than one calendar year. 21 This is a substantial difference compared to Table 4 where the ML sequence accuracy was 90.5%. As elaborated above, the performance difference relative to Table 4 stems from two sources: (1) noise in the crowdsourced dataset and (2) the other pipeline steps prior to transcription. Thus, unless (1) is large, this gives an approximation to the end-to-end performance of the whole pipeline. We should keep in mind that the 83.66% sequence accuracy allows for zero mistakes in the predicted sequence and that this is the expected performance if we, without any pre-processing or adjustments, feed a collection of raw scans into the pipeline. Also, given the noise in the crowdsourced dataset, we can argue that 83.66% might be a slightly conservative estimate unless crowdsourcing participants and ML make the exact same mistakes on the exact same documents. Using the ML approach, it is cheap to transcribe additional fields on the documents as it only requires a training sample. As we have seen, the training sample can be much smaller than the full collection of documents. This can be exploited to produce higher sequence accuracy rates if there are internal correspondences between the fields in the source document. For example, the death certificates contain both birth and death date and age. These three fields should be internally consistent. If they do not match, then either (1) the source document contains a mistake or (2) the ML model made a mistake. If we transcribe both age and dates and exploit the correspondence between these fields, we can filter out 5,767 cases where the predicted and implied age differ by more than one year. This leaves 17,496 documents where we achieve an ML sequence accuracy of 93.56% end-to-end. Even if the problematic documents need to be manually transcribed, the ML model still produces a reduction in the manual transcription burden of around 75% relative to manually transcribing the whole dataset of 23,263 documents (46,526 dates). Hence, relationships between fields can provide automatic verification and be used to flag problematic records for manual review. This method can of course also be applied in a manual or crowdsourcing context, but in that case transcription of additional fields is more costly.
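The consistency filter can be summarized by the sketch below, which flags records where the age implied by the transcribed dates differs from the transcribed age by more than a tolerance; field names and the one-year tolerance mirror the description above, but the code itself is illustrative.

```python
# Internal consistency sketch: compare implied and transcribed age.
# Record fields ('birth', 'death', 'age') are illustrative names.
from datetime import date

def implied_age_years(birth, death):
    # age in completed years at death
    return death.year - birth.year - ((death.month, death.day) < (birth.month, birth.day))

def split_by_consistency(records, tolerance=1):
    accept, review = [], []
    for r in records:
        if abs(implied_age_years(r["birth"], r["death"]) - r["age"]) <= tolerance:
            accept.append(r)       # keep the automatic transcription
        else:
            review.append(r)       # flag for manual review
    return accept, review

records = [{"birth": date(1850, 3, 1), "death": date(1910, 2, 5), "age": 59},
           {"birth": date(1850, 3, 1), "death": date(1910, 2, 5), "age": 72}]
accept, review = split_by_consistency(records)
print(len(accept), len(review))    # 1 1
```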
In addition to the accuracy rates, we also compare the data obtained from the ML and crowdsourcing approaches. Any systematic bias in the ML model would produce deviations from the empirical distribution observed in the crowdsourced dataset. Also, based on the discussion above, we expect internal consistency between ML transcribed ages and dates. Figure 8 compares kernel density estimates of the age distribution produced by (red line) ML age transcriptions, (blue line) ML date transcriptions, and (green line) crowdsourced date transcriptions. Note that the figure only displays ages in the interval [0, 100], any ages outside this interval are discarded, and we discard all predicted dates where the year does not contain four digits. In some documents, the year has been abbreviated to only the last two digits (e.g., 1890 becomes 90 or 1910 becomes 10), so the leading digits could be either 18 or 19. This is not a shortcoming of the transcription model but rather a lack of information in the source documents. In Figure 8, we see that the distribution of ML age transcriptions is very close to the age distribution implied by the crowdsourced date transcriptions. The age distribution implied by the ML date transcriptions also appears similar, although with a notable difference around ages 75-85. This deviation is not surprising as the dates contain longer sequences to transcribe (relative to age) and the ML model needs to transcribe both birth and death date correctly (at least down to the year) to get an approximately correct age prediction. Motivated by the notable difference in performance between the whole pipeline and transcription only, we manually review some of the cases where the crowdsourced and model predictions differ. This reveals, albeit qualitatively, that most discrepancies are related to segmentation issues where the CPD segmentation obscures the year in the date fields. The CPD segmentation template makes a cut to separate the age and death date fields. The year is the rightmost component of the date and hence most likely to be impacted by this cut, see the position of the death date and age in Figure 5. This explanation is corroborated if we look at the accuracy rates for the individual date components with the crowdsourced dataset as ground truth: Day accuracy is 95.92%, month accuracy is 96.98%, and year accuracy is 89.11%. Clearly, the accuracy for the year component is notably lower. When discovered, such issues can easily be corrected in the ML pipeline. Had the transcription been done manually, it would be much more costly to correct systematic transcription errors.

Figure 8. Lifetime distributions. Kernel density estimates of the lifetime distribution implied by ML age transcriptions (red), ML date transcriptions (blue), and crowdsourced date transcriptions (green) in an overlapping sample of 23,263 individuals restricted to the age interval [0, 100]. Implied age refers to the difference between the birth and death dates in years.
As a concluding remark, keeping in mind the resources and time needed to perform manual or crowdsourced transcription, we note that the ML end-to-end accuracy of 83.66%, trained on only 11,630 dates and 11,072 ages, might in many cases be acceptable, especially for large document collections that are otherwise infeasible to transcribe.
It should be noted that the ML model in this application has not been carefully optimized and it has only been trained on a small training dataset. It is conceivable that the model can perform substantially better. In addition, improvements in segmentation would also positively impact the end-to-end accuracy. Also, in practical applications, the model can be used to speed up transcription while retaining manual review of each (or some) predictions. The reviewed predictions can then be used to re-train and improve the model.
There is also the possibility to rely on the confidence measures associated with model predictions and remove transcriptions where the model is highly uncertain, i.e., where the confidence measure is below a certain threshold. Subsequently, the removed transcriptions could be reviewed manually and the model could be retrained based on the additional labels.
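A sketch of this triage logic is given below, assuming the beam search exposes a per-sequence confidence such as the mean token log-probability; the threshold is illustrative and would be tuned on validation data.

```python
# Confidence-based triage sketch. Assumes predictions arrive as
# (text, mean_token_logprob) pairs; the threshold is illustrative.
import math

def triage(predictions, threshold=math.log(0.9)):
    auto = [(t, s) for t, s in predictions if s >= threshold]    # accept
    manual = [(t, s) for t, s in predictions if s < threshold]   # review, relabel
    return auto, manual

auto, manual = triage([("12. maj 1893", -0.02), ("3?. ??? 18??", -1.7)])
print(len(auto), len(manual))  # 1 1
```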
Finally, we apply Transkribus for transcribing dates. For comparison, we use the same collection of evaluation death certificates as the date model. Extraction of the date fields is difficult without a segmentation step, hence for a fair comparison of the transcription performance, we assist Transkribus by extracting the relevant table segments. The assisted Transkribus method achieves a sequence accuracy of 73.9%, compared to 90.5% using the date model and 96.3% using crowdsourcing. Not only is the performance of the assisted Transkribus significantly worse, but extracting the transcribed information from Transkribus is also more time-consuming. In terms of costs, it should also be stressed that using Transkribus to transcribe large collections may be expensive. 22 However, Transkribus is a general-purpose tool for HTR and can readily be used for most tasks. Also, it does not necessarily need customization for transcribing specific historical documents, which reduces initial costs.

Discussion and conclusion
We have proposed a custom ML pipeline and we show that it can efficiently transcribe massive collections of tabular documents, thereby producing data that are directly usable in quantitative research. We test and illustrate our pipeline on two large collections of historical documents.
In the first case study, we apply the layout classification to a complex tabular dataset of nurse records and demonstrate that we can use layout analysis, 'step one' of our pipeline, in itself to collect data. We also discuss how the ML method gives additional insights on treatment assignment that complement, and expand on, the results from an intention-to-treat analysis, without requiring additional manual transcription of the source data.
The second case study extends this example and shows that, after the initial layout classification, we can use table segmentation and handwritten text recognition to transcribe handwritten information inside the tabular document. We also show that, based on the raw scans, the pipeline can automatically collect lifespan data that are directly useful in modeling mortality risk. In this application, we consider all three steps of the pipeline, and apply the methods to a subset of a large collection of death certificates. Finally, we also compare the ML approach to traditional crowdsourcing and Transkribus, and show that ML can significantly reduce the transcription burden for tabular documents while performing on-par with crowdsourced transcriptions and better than transcriptions from Transkribus. In particular, we find that the accuracy of our transcriptions is close to that of crowdsourcing, and that the statistical distribution of the transcribed data is roughly equivalent to that of the ground truth data. This is especially important if the data are to be used in downstream statistical models.
We wish to emphasize again that our ML pipeline scales well. The cost difference between transcribing hundreds of thousands or millions of documents is negligible, as opposed to the cost of the equivalent manual transcription. For our pipeline, the fixed costs of the initial setup are high, while the variable costs are very low. For off-the-shelf tools, such as Transkribus and Monk, the opposite is true. This motivates the use of a custom pipeline for large-scale projects where the variable costs dominate the initial fixed costs. Also, the initial fixed costs for a custom pipeline can be reduced by relying on the modular tools and training datasets we provide in this work.

Notes
19. […] crowdsourced dataset. These certificates will be missed here.
20. See (in Danish) https://bit.ly/2VnFLzb.
21. Not completely identical, as we take common differences in formatting into account, so, e.g., 01-3-2000 and 1/3-2000 would be considered equal.
22. The current price is €0.24 per handwritten page when buying credits in bulk for 24,000 documents at a time, see https://readcoop.eu/transkribus/credits/, with some discounts available through subscription and for members. For millions of documents, this is not financially feasible for small organizations or independent researchers.