An Analysis of BERT (NLP) for Assisted Subject Indexing for Project Gutenberg

Abstract

In light of AI (Artificial Intelligence) and NLP (Natural Language Processing) technologies, this article examines the feasibility of using AI/NLP models to enhance the subject indexing of digital resources. While BERT (Bidirectional Encoder Representations from Transformers) models are widely used in scholarly communities, the authors assess whether BERT models can support machine-assisted indexing of the Project Gutenberg collection by suggesting Library of Congress subject headings filtered by selected Library of Congress Classification subclass labels. The findings of this study are informative for further research on BERT models to assist with automatic subject indexing for digital library collections.


Introduction
Natural language processing (NLP) is challenging because human languages are complicated and full of lexical ambiguity, a phenomenon still difficult for computers to comprehend and process. For instance, most English words have multiple meanings, and a word has no meaning outside a given context. Structured data helps to disambiguate, but not everyone or everything is mapped to a knowledge graph. Ontology-driven natural language processing links entities and relationships, yet NLP is still needed to help fill in the gaps between named entities. 1

In 2018, BERT (Bidirectional Encoder Representations from Transformers), a transformer-based machine learning technique for NLP pre-training, was created and published by the Google AI Language team. Google introduced BERT to its search engine backend on October 21, 2019, and applied BERT to English search queries, impacting 10% of all search queries and especially improving results for complicated queries that depend on context. 2 BERT has made significant contributions to NLP because it provides "context," with bidirectional language models and pre-training on unlabeled text. BERT improves on earlier language models (such as Word2Vec and GloVe), which build context-free word embeddings with unidirectional language modeling. 3 To elaborate, GloVe embedding is not contextual embedding: it "is an unsupervised learning algorithm for obtaining vector representations for words, while training performed on aggregated global word-word co-occurrence statistics from a corpus." 4 As a result, the two banks, "Bank" of America and "bank" of a river, have the same word embedding. A word is meaningless unless it is in a given context; in a nutshell, semantic context unquestionably matters. More recent contextual embeddings, such as BERT and other transformers (masked language modeling), are better able to understand the context of words. BERT has improved performance on natural language tasks such as named entity recognition, textual entailment/next sentence prediction, coreference resolution, question answering, word sense disambiguation, automatic summarization, and polysemy resolution. 5 Furthermore, the original data used to train BERT consisted of BookCorpus and Wikipedia, 6 and the Longformer (the Long-Document Transformer, efficient for longer sequences) used three datasets: WikiHop, TriviaQA (Wikipedia setting), and HotpotQA. 7

BERT's groundbreaking work on contextualization has led many scholars to pursue further research on BERT-related models and tools, especially their context-specific functions. To cite an example, Ettinger applied psycholinguistic diagnostics to the BERT model as a case study and found that "it robustly retrieves noun hypernyms, but it struggles with challenging inference and role-based event prediction-and, in particular, it shows clear insensitivity to the contextual impacts of negation." 8 Furthermore, Yu and Ettinger questioned "how and at what points they might be combining word meanings into phrase meanings since transformers maintain representations for every token," and found that "phrase representation in these models relies heavily on word content, with little evidence of nuanced composition." 9 In short, NLP is a cross-disciplinary research field between computer science/information engineering/AI and linguistics, or a dialogue between computers and natural/human languages.
For decades, library subject indexing and classification tools have included controlled vocabularies such as the Library of Congress Subject Headings (LCSH), in conjunction with classification schedules such as the Library of Congress Classification (LCC), for subject searching and classification in online catalogs and library discovery platforms. Subject indexing has been one of users' top demands for accessing library resources effectively. With the rapid increase of digital resources, automated subject indexing has become an important research subject in recent years. However, various studies report major challenges in providing quality subject access to users. 10 This study explored various digital collections and selected Project Gutenberg as the main test platform. The Project Gutenberg ebook collection is a library of over 60,000 free ebooks, primarily works of literature and humanities resources. 11 All the digitized books are available in full text in ten data formats for batch downloading and testing. Most importantly, each book has bibliographic metadata with LCC subclass labels and LC subject headings. As LC subject headings are perceived as the most widely used subject terms, it is essential to harness these controlled terms for the assessment of recommended subjects or subject indexing. 12 The majority of the ebooks are written in English or Romance languages; therefore, we did not need to handle another complex issue, non-Latin scripts, especially as related to OCR (Optical Character Recognition) or Unicode. With all of these considerations in mind, the authors identified Project Gutenberg as a well-suited candidate, with text as data, for a BERT/NLP test of automatic subject indexing.

Literature review
Before determining a research methodology and designing test workflows, the authors conducted a literature review across library and information sciences with an emphasis on automatic indexing for subject terms, digital text, and the humanities. Research has been carried out on machine-aided indexing (MAI), computer-assisted indexing (CAI), and automatic subject indexing. Greenberg et al. identify some key pieces of foundational work, such as the Cranfield experiments as a forerunner of NLP and Salton's 1968 Automatic Information Organization and Retrieval. 13 Furthermore, Liddy conducted a series of studies on automatic metadata generation, 14 automatic classification of documents, 15 and automatic text summarization. 16 All of these studies have laid a solid and robust foundation of empirical research.
In the field of knowledge organization (KO) and knowledge organization systems (KOS), studies of automatic indexing, clustering, automatic classification, and automatic subject indexing of text have been conducted continuously in recent years. Smiraglia and Cai studied the evolution of clustering (an unsupervised learning method), machine learning, automatic indexing, and automatic classification (supervised learning) in knowledge organization and found that "there is an emphasis on KO for information retrieval; there is much work on clustering (which involves conceptual points within texts) and automatic classification (which involves semantic groupings at the meta-document level.)" 17 On the other hand, Golub noted that "automatic subject indexing solutions have still not been widely adopted in operative information systems of libraries and related institutions." 18 Digby and Dinsmore reported a use case of machine-aided indexing with enhanced metadata performed on an e-collection, but no automated tools or algorithms were explicitly used or explained in the automated process. 19 Golub et al. found that subject access for humanities journal articles is supported neither in the world's largest commercial abstract and citation database, Scopus, nor in the local repository of a public university in Sweden. 20 The HIVEing project conducted research on science and technology resources, 21 and its recent work applies HIVE (Helping Interdisciplinary Vocabulary Engineering) to materials science and uses HIVE to enhance and explore medical ontologies. 22 Papadakis et al. examined subject-based information retrieval within digital libraries employing LC subject headings for the subject of accounting. 23 Empirical research concerning automatic indexing for "humanities" ontologies has been limited, and this is one of the key areas of this study.
Topic modeling, a type of statistical model for discovering abstract topics in documents, has been widely used for DH (digital humanities) and text mining since the family of models that includes Latent Dirichlet Allocation (LDA) emerged, beginning with Thomas Hofmann's probabilistic latent semantic analysis in 1999. 24 Concerning criticism of some topic modeling efforts, Drucker stated that "some network diagraming and topic modeling tools are just too crude for humanistic work." 25 Dobson critiqued "context-less data" and "information loss," arguing that certain stop words, such as whatever, but, and no, are considered superfluous and removed before running the text through the algorithm, even though they "might have important meanings and semantic value depending on genre and period (think 1980s valley girl)." 26 Because of these concerns about "context-less data" in statistical models, the authors determined that our tests should focus on using BERT to assist in automatic subject indexing. BERT offers contextualization, which corresponds to the semantically structured LC subject headings and the related LC classifications used for filtering or clustering related subject terms. 27

Yi and Chan 28 and Julien et al. 29 were studied during our tests because the authors sought additional information regarding whether the hierarchical structures of LCSH (such as tree structures or ontologies) had been implemented and made available through an open-access platform. One of the primary reasons for this interest was for these hierarchical structures to aid in testing tools with machine learning or pretrained language models. Unfortunately, while both articles implemented pilot testing or used data visualization tools, neither made the platform available on the web. Julien et al. explicitly acknowledged the limitation of not being "generalizable to knowledge domains beyond science and engineering whose topic network might differ significantly (e.g., collections from the humanities)," and concluded that "organized collections have difficulty demonstrating the value created by expensive manual indexing using CVs [controlled topic vocabularies] that are not often used explicitly or understood by its users." Lastly, they stated that "ongoing development efforts are focused on managing the computational needs of the large redundant LCSH tree and ensuring its usability." 30

In machine learning, a classifier is an algorithm that automatically sorts or categorizes data into one or more "classes"; targets, labels, and categories are all terms used to describe classes. The task of estimating a mapping function f from input variables x to discrete output variables y is known as classification predictive modeling. Classification is a type of supervised learning in which target labels are supplied along with the input data. The classifier is used to train a model, and the model is then used to classify new data. Training datasets are provided to supervised and semi-supervised classifiers, which teach them how to categorize data into specified categories. There are different types of classifiers in machine learning, and popular classification algorithms include Naive Bayes, k-Nearest Neighbors, Support Vector Machines, and Random Forest. 31
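As a minimal illustration of this classifier concept, the sketch below (Python with scikit-learn; synthetic data, not any cited study's code) learns a mapping f(x) → y with two of the algorithm families named above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

# Synthetic stand-in for labeled documents: 300 samples, 3 classes.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

# Each classifier estimates a mapping f(x) -> y from the labeled examples.
for clf in (GaussianNB(), LinearSVC()):
    scores = cross_val_score(clf, X, y, cv=5)
    print(type(clf).__name__, round(scores.mean(), 3))
```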
The Swedish National Library tested automatically produced Dewey Decimal Classification (DDC) classes for Swedish digital collections and evaluated the performance of two machine learning algorithms on the top three hierarchical levels of the DDC. Evaluation shows that a Support Vector Machine (SVM) with a linear kernel outperforms the Multinomial Naive Bayes algorithm. 32 As mentioned in the introduction, there are major challenges in providing quality subject access to users. This article aims to examine how to improve the effectiveness of information retrieval, especially through automatic subject indexing, which provides accurate subject terms and consistent search results for users.

Initial tests of BERT and deep learning models
Fonseca et al. propose an automatic contextualization research methodology pipeline, including text representation such as word embeddings, and classifiers (e.g., SVM and BERT). 33 They used two different types of word embeddings, including GloVe (Global Vectors for Word Representation). For deep learning models, they used Convolutional Neural Networks (CNNs) and other architectures composed of multiple layers, such as convolutional layers. 34 The CNN was inspired by the way neurons operate in the human brain and is one of the most popular types of deep learning algorithms. 35 A CNN is also one of the most popular types of deep neural networks, a term that refers to functions with greater complexity in the number of layers and of units within a single layer. Deep neural networks excel at finding representations that solve complex tasks with large datasets. The two phases of neural networks are called training (or learning) and inference (or prediction), and they refer to development versus production: the developer chooses the number of layers and the type of neural network, and training determines the weights. 36 Multi-label classification is an AI text analysis technique that automatically labels (or tags) text to classify it by topic, and it can apply more than one classification tag to a single text. 37 This aligns with the goal of our tests to cluster and connect related subjects (a minimal sketch follows). Our test approaches are elaborated in the sections below.
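The following is a minimal, hypothetical illustration of multi-label classification (synthetic features and labels; scikit-learn's one-vs-rest wrapper, not our production pipeline):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))  # stand-in document features
y = [["Ethics"], ["Ethics", "Conduct of life"], ["Sermons, English"],
     ["Ethics"], ["Sermons, English"], ["Conduct of life"],
     ["Ethics", "Sermons, English"], ["Conduct of life"]]
Y = MultiLabelBinarizer().fit_transform(y)  # one indicator column per label

# One binary classifier per label lets a text receive several tags at once.
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
print(clf.predict(X[:2]))  # indicator rows: possibly multiple 1s per text
```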

Initial test results
To recommend subject headings, we first tested three Convolutional Neural Networks (CNNs) with GloVe embeddings for multi-label classification: (1) standard classification; (2) hierarchical classification; and (3) word vector prediction. Both CNNs and Recurrent Neural Networks (RNNs) are popular types of deep neural networks. The authors decided to use CNNs because they tend to be much faster than RNNs, according to DeepBench's test results. 38 Since we were dealing with long texts, the runtime of an RNN would have been too long and the results would not have been ideal.
In standard classification, we simply viewed all subject headings as being at the same level of the subject heading structure. For example, we considered "Philosophy" and "Epistemology" to be at the same level, while in fact the latter is a narrower term of the former. This approach failed because one-level subject headings are too sparse; the number of subject headings was roughly the same as the number of books we were trying to classify. The second approach, hierarchical classification, tried to mitigate the sparsity problem by classifying subject headings one level at a time. That is, considering the subject headings to be a tree-like structure, with classes on top and subclasses under classes, we built a classification model for each non-terminal node. However, as there were still too many non-terminal nodes and too few books belonging to each leaf of the tree, we could only apply classification models at nodes closer to the root, i.e., for subject headings that are broader terms.
In contrast to the prior two approaches, which treat subject headings as classification labels, the third approach, word vector prediction, tries to predict the embedding of subject headings. We simply used the pre-trained GloVe embedding weights and averaged the embedding vectors of the words in a subject heading as our target vector (see the sketch below). Nevertheless, this approach also failed: the contextual information of words seems to be lost when the embedding vectors are averaged, making the prediction targets of the model less distinguishable from each other.
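A minimal sketch of this target construction, assuming a local copy of a pre-trained GloVe file (the path below is illustrative):

```python
import numpy as np

def load_glove(path="glove.6B.100d.txt"):  # hypothetical local GloVe file
    """Parse the standard GloVe text format: word then float components."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.split()
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

glove = load_glove()
heading = "Conduct of life"
words = [w for w in heading.lower().split() if w in glove]
# Averaging collapses the words of the heading into one prediction target;
# as noted above, this loses the contextual distinctions between headings.
target = np.mean([glove[w] for w in words], axis=0)
```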

Initial test analysis
From our initial tests we discovered that there are too many distinct narrower terms among the subject headings, and that the structure of these narrower terms is too complex for standard classification approaches. Nevertheless, we did discover that the classes (e.g., "B," "N," "P") and subclasses (e.g., "BC," "BR," "PA") have more stable structures. Therefore, we decided to adopt a supervised-unsupervised hybrid approach to suggest subject headings. For the upper part (broader terms/classes) of the tree structure, we built classification models to classify classes and subclasses; for the lower part (narrower terms/classes), we recommend subject headings based on the subject headings of similar books. When the set of possible subject headings is too broad (e.g., 1,000+ possible subject headings), classification models generally do not perform well. Therefore, k-NN (k-Nearest Neighbors, which requires no per-label training) is used to overcome this issue.

Goal
In order to improve information retrieval effectiveness, various experiments and studies have been conducted to examine whether controlled vocabularies or structured thesauri with lexical-semantic relationships (focusing on term definitions and grammatical treatments recorded in thesauri, etc.) can increase relative recall or precision. 39 LCSH has been studied and interpreted through the rich lexical-semantic relationships of vocabulary terms, such as UF (Used For) for equivalence relationships, BT/NT (Broader Term/Narrower Term) for hierarchical relationships, and RT (Related Term) for associative relationships, in light of the semantic linking defined in ANSI/NISO Z39.19-2005 (R2010): Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies. 40 The authors would like to explore whether BERT (AI/NLP contextual) models can assist in machine-aided subject indexing of digitized text collections such as Project Gutenberg, by suggesting accurate LC subject headings filtered with a range of LCC subclass labels such as B (Philosophy (General)), BJ (Ethics), and BR (Christianity). To sum up, we determined the goal of our research project as follows.

Research goal
Can BERT (AI/NLP contextual) models assist in automatic subject indexing of the Gutenberg collection (digitized text resources) through suggesting accurate LC subject headings? We aim to evaluate the outcome of this goal from the following perspectives.
• How do we design a process to leverage BERT models to suggest LC subject headings?
• How do we evaluate the effectiveness and models' performance of the process?
• Can LC classification and LC subject headings embedded in the bibliographic metadata of digitized books support the evaluation of test outcomes of BERT (AI/NLP contextual) tools?

Test steps
Our test data comes from 3,671 English e-books with LC class B (Philosophy. Psychology. Religion) in the Project Gutenberg collection. The test data includes not only the text contents of the e-books but also embedded bibliographic metadata, including LC subject headings and LCC subclass labels such as BJ.

Step 1: Document embeddings: word embeddings and tokenization
The most natural question to ask is: how do we define the similarity of books' subjects? Words are strongly connected through co-occurrence, through sharing similar neighbors, and through similarity and relatedness. "Language models are trained on very large text corpora or collection loads of words to learn distributional similarity and build vector space models for word embeddings." In a good embedding, position (distance and direction) in the vector space can encode semantics. For instance, visualizations of real embeddings show geometrical relationships that capture semantic relations, such as the relation between a country and its capital (Figure 1). 41 A word embedding is the representation of a word as a real-valued vector that encodes the meaning of the word such that words closer in the vector space are expected to be similar in meaning. 42 The word embedding process maps words or phrases from the vocabulary to vectors of real numbers with high dimensions. 43

At the beginning of this conversion process, the text needs to be tokenized. A tokenizer takes the input sentence and decides whether to keep each word whole, split it into sub-words, or decompose it into individual characters. For instance, "embeddings" would be converted to ['em', '##bed', '##ding', '##s']. After breaking the text into tokens, the sentence is converted from a list of strings to a list of vocabulary indices. 44 Here we used the embedding technique based on the Longformer model in Hugging Face's transformers library, 45 whose website 46 provides APIs to easily download and train state-of-the-art pretrained models, to calculate similarity and to find text similarities with machine learning algorithms. The Longformer (Long-Document Transformer) has an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer; this pretrained Longformer consistently outperforms RoBERTa on long document tasks and set new state-of-the-art results on WikiHop and TriviaQA. 47 Devlin et al. note that BERT's model architecture is a multi-layer bidirectional Transformer encoder and that contextual embeddings can be used as input to a randomly initialized two-layer 768-dimensional BiLSTM (Bi-directional Long Short-Term Memory) before the classification layer. Their test results demonstrate that "the best performing method concatenates the token representations from the top four hidden layers of the pre-trained Transformer." 48

First, we applied the Longformer tokenizer, which uses Byte-level Byte-Pair Encoding (BBPE), 49 to tokenize the texts of the books. Tokenization, the process of breaking a piece of text down into small units called tokens, is one of the most foundational NLP tasks and a difficult one, "because every language has its own grammatical constructs, which are often difficult to write down as rules." 50 One significant advantage of BBPE is that it can easily handle out-of-vocabulary (OOV) words by representing them with subwords; OOV words can be a serious problem, as library collections often cover niche areas whose vocabulary may not be covered in the pre-training step. 51 Secondly, we passed the first 4,096 tokens of each book to the Longformer model and extracted the last 4 hidden states, each of which has 768 dimensions per token.
Finally, we combined these hidden states into a single embedding per book. After tokenizing, the feature dimensions were (4096, 768, 4), as we extracted the last 4 hidden states. We concatenated the 4 hidden states to obtain dimensions (4096, 3072), since 3072 = 768 × 4. Lastly, we averaged over the 4,096 tokens and obtained a final 3,072-dimensional vector as the embedding of the book. Note that the Longformer model implemented by Hugging Face currently supports at most 4,096 tokens. Having obtained the embeddings of the books, we could calculate the similarity of books using cosine similarity (Figure 2). 52 Cosine similarity is a similarity-based metric that identifies the most similar objects by the highest values, implying that they live in closer neighborhoods. Figure 3 demonstrates that cosine similarity calculates the cosine of the angle between two vectors. 53

Step 2: k-Nearest Neighbors (k-NN): to suggest subjects

The k-Nearest Neighbors algorithm is a straightforward machine learning technique that predicts an unknown observation by using the k most similar known observations in the training dataset. 55 Given the similarities of books' subjects, we suggested the subject headings of each book based on its k nearest neighbors, that is, the k books with the most similar subjects. We considered all subject headings that these k books have as subject heading candidates for suggestion. In our tests, we set k = 10, but the number can be fine-tuned in future research. A minimal sketch combining steps 1 and 2 follows.
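The sketch below illustrates steps 1 and 2 under stated assumptions: the checkpoint name, the toy book texts, and the k value are illustrative, not our production configuration (we used k = 10 on 3,671 books).

```python
import torch
from sklearn.neighbors import NearestNeighbors
from transformers import LongformerModel, LongformerTokenizer

# Assumed public checkpoint for a pretrained Longformer.
name = "allenai/longformer-base-4096"
tokenizer = LongformerTokenizer.from_pretrained(name)
model = LongformerModel.from_pretrained(name, output_hidden_states=True)
model.eval()

def embed_book(text: str) -> torch.Tensor:
    # Keep only the first 4,096 tokens, as in our tests.
    inputs = tokenizer(text, truncation=True, max_length=4096,
                       return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # Concatenate the last 4 hidden states: (1, seq, 768) -> (1, seq, 3072),
    # then average over tokens to get one 3,072-dimensional book embedding.
    last4 = torch.cat(out.hidden_states[-4:], dim=-1)
    return last4.mean(dim=1).squeeze(0)

# Toy stand-ins for full ebook texts.
books = ["A treatise on ethics and the conduct of life.",
         "Sermons on Christian faith and duty.",
         "An essay concerning human understanding."]
embeddings = torch.stack([embed_book(b) for b in books]).numpy()

# Cosine-based nearest neighbors: each book's most similar books.
knn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(embeddings)
distances, indices = knn.kneighbors(embeddings)
```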

Step 2.5: Connect LCSH/LCC to document embedding
To link books to their corresponding LC subject headings, we assigned a unique ID to each book. The IDs remain the same after we transformed the text into embeddings in step 1. Then, in step 2, each nearest-neighbor embedding carries its own unique ID. Given these IDs, we find the corresponding subject headings in the stored metadata table; that is, we join the embeddings to the metadata table, which includes subject headings, on the IDs.
To illustrate this technique, if the ten books have embedded LC subject headings as well as LC class/subclass labels: {"BJ," "Ethics"}, {"BJ," "Conduct of life"}, {"BV," "Sermons, English"}, etc., the subject heading candidates would be the union of these LC subject headings and LC classes/subclasses: {"BJ," "Ethics," "Conduct of life," "BV," "Sermons, English," etc.}. The LCC subclasses are used as grouping labels rather than as part of the prediction outcomes. As a result, a book labeled B may receive suggested subjects in BR, because some B and BR books may be close to each other. It should be noted that we intentionally removed subject headings that appeared only once among the neighbors, as including these subjects greatly increases the number of candidates while hardly benefiting recall; this corresponds to the finding of Boiculese et al. on improving the recall of a k-Nearest Neighbors algorithm for classes of uneven size. 56
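A small illustrative sketch of this step, with hypothetical IDs and metadata (the union is taken over the neighbors' headings, and singletons are dropped):

```python
from collections import Counter

import pandas as pd

# Hypothetical metadata table keyed by book ID (the join target of step 2.5).
metadata = pd.DataFrame({
    "book_id": [11, 12, 13, 14],
    "subclass": ["BJ", "BJ", "BV", "BJ"],
    "subjects": [["Ethics"], ["Conduct of life"],
                 ["Sermons, English"], ["Ethics"]],
})
neighbor_ids = [11, 12, 13, 14]  # IDs returned by the k-NN step
joined = metadata[metadata["book_id"].isin(neighbor_ids)]

# Union of the neighbors' headings, dropping headings that appear only once.
counts = Counter(h for subjects in joined["subjects"] for h in subjects)
candidates = {h for h, c in counts.items() if c > 1}
print(candidates)  # {'Ethics'}
```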

Step 3: Filter results with LCC subclass labels
There may be too many subject heading candidates, since we simply suggested all subject headings that the nearest neighbors have. This might not be an issue if the nearest neighbors are closely clustered, but it can be a serious problem if the nearest neighbors are scattered far from each other in the embedding space. Therefore, to filter candidates, we limited them to the subclass predicted by the classification algorithms (RF/SVM/NN, explained below). For instance, if step 2 produces the subject heading candidates {"BJ," "Ethics"}, {"BJ," "Conduct of life"}, {"BV," "Sermons, English"}, …, and the classification algorithm predicts subclass BJ, we would not include {"BV," "Sermons, English"} among the candidates, making the final candidate list {"BJ," "Ethics," "Conduct of life," …}. In short, even though the candidates are the union of both BJ and BV headings, this step decides which candidates are included based on the classification algorithms, as in the sketch below.
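A minimal sketch of this filter, with hypothetical candidates:

```python
# Keep only candidates whose source books fall in the subclass predicted by
# the classifier (RF/SVM/NN); data here is illustrative.
predicted_subclass = "BJ"
neighbor_candidates = [("BJ", "Ethics"), ("BJ", "Conduct of life"),
                       ("BV", "Sermons, English")]
final_candidates = [heading for subclass, heading in neighbor_candidates
                    if subclass == predicted_subclass]
print(final_candidates)  # ['Ethics', 'Conduct of life']
```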
Bear in mind that this step could be applied before the k-NN step, as subclasses can limit the search space of the nearest neighbors; some approximate nearest neighbor algorithms use this strategy to achieve faster runtimes. Step 3 as presented here is the simpler, naive version of the test process; in further research we would apply this step prior to step 2 to improve speed.
Among the various machine learning classification algorithms for categorizing a given set of data into classes, we decided to test three: (1) Random Forest (RF); (2) Support Vector Machine (SVM); and (3) a Neural Network (NN) with a single hidden layer. RF is a tree-based algorithm that combines the qualities of multiple decision trees for making decisions; it can be described as a forest of "randomly created decision trees." 57 SVM uses algorithms to train and classify data within degrees of polarity, taking it a degree beyond x/y prediction. 58 An NN with a single hidden layer is also called a shallow neural network; the hidden layer of the perceptron is trained to represent the similarities between entities in order to generate recommendations. 59 Hidden layers reside between input and output layers; the word "hidden" implies that they are not visible to external systems and are "private" to the neural network. 60 The accuracy of these classifiers has been tested and researched in various articles. 61 Two tests were run to predict LCC subclass labels, for (1) all B subclasses 62 and (2) subclasses B/BJ/BR only, with each of the three classifiers; that is, we have 2 × 3 = 6 test accuracies for comparison. The three subclasses B/BJ/BR were selected by the authors because of their correlated subject domains within LCC class B and its subclasses, allowing a thorough comparison of accuracy results. The input features for the three classification algorithms are the document embeddings we obtained in step 1. After we obtained the predicted subclasses in step 3, we passed them and the real subclasses into the seaborn heatmap function, 63 which automatically produces the confusion matrix. A minimal sketch of this comparison follows.
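The sketch below mirrors this comparison under stated assumptions: the features are random placeholders for the step 1 embeddings (with reduced dimensionality to keep the sketch light), not our actual data.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Placeholder features/labels; real inputs are 3,072-dim book embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(728, 64))
y = rng.choice(["B", "BJ", "BR"], size=728)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

classifiers = {"RF": RandomForestClassifier(random_state=0),
               "SVM": SVC(),
               "NN": MLPClassifier(hidden_layer_sizes=(100,), max_iter=300)}
for name, clf in classifiers.items():
    pred = clf.fit(X_train, y_train).predict(X_test)
    print(name, round(accuracy_score(y_test, pred), 3))

# Row-normalized confusion matrix rendered with the seaborn heatmap function.
pred = classifiers["SVM"].predict(X_test)
cm = confusion_matrix(y_test, pred, normalize="true")
sns.heatmap(cm, annot=True)
plt.show()
```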
Below is a flowchart to summarize the steps (Figure 4).

Suggested subjects filtered in certain subclasses for evaluation
As noted in the previous paragraphs, some subclasses in class B, such as BH and BM, are more difficult to classify with SVM due to very limited data. To keep the filtering more reliable and consistent, we decided to filter test results only within B/BJ/BR for evaluation purposes.

Test results: Assessment and annotation (with randomly selected samples in certain subclasses)
The test data includes 3,671 titles in class B and 728 titles in the B, BJ, and BR subclasses. The author/developer randomly selected 20% of the B/BJ/BR subclass titles (728 × 0.2 ≈ 146) and included the full titles of the 146 works on a spreadsheet so that the annotator could evaluate the test results. The author/annotator reviewed 25% of the test results and provided feedback and questions on some irrelevant suggested subjects. The author/developer reviewed the code and found one bug, which mixed up part of the training data and test data. After debugging, the test results were slightly different for some books. A screenshot of some suggested subjects in the test results appears in Figure 5. The entire spreadsheet of randomly selected test results is in the Supplementary material.

Using precision, recall, and F1-score, calculated in both micro and macro ways
The recommended subject headings are ultimately evaluated by the author/annotator. However, since human assessments are too time consuming, we needed quantitative metrics during the development phase to determine which processes produce objectively better subject heading recommendations that align with human assessments. We used three metrics: precision, recall, and F1-score, calculated in both micro and macro ways, to evaluate our test results for suggested subjects and to compare results between all B subclasses and only the B/BJ/BR subclasses. The F1-score is an aggregate measure of both precision and recall: 64

F1 = 2 × (precision × recall) / (precision + recall)

The formula is simply the harmonic mean of precision and recall, meaning that an F1-score can be considered a kind of average of the two. 65 That is, when we are trying to evaluate whether approach A is better than approach B and want to look at only one metric, we choose the F1-score instead of precision or recall individually. Also, regarding the meaning of the metrics: if we want to recommend new subject headings instead of only the existing ones, which were used as the right answers, precision may not mean much, as we hope to create more subject index suggestions than the existing ones. Using precision here is only a preliminary evaluation to compare candidate approaches against the existing indices. The final call would be human evaluation, which we address in the assessment and annotation section, to see whether this approach did create something different but valid.
The micro F1 score calculates metrics globally by counting the total true positives, false negatives, and false positives. 66 The macro F1 score calculates metrics for each subject heading and takes their unweighted mean. 67 The macro F1 score does not take class imbalance into account and may therefore be the better calculation when we are interested in recommending subject headings that are rarely seen, while the micro F1 score is a more general evaluation of our tests. A sketch of both calculations follows.
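A short sketch of both calculations with scikit-learn, on hypothetical labels:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical true vs. predicted labels standing in for subject assignments.
y_true = ["BJ", "BR", "B", "BJ", "BR", "B"]
y_pred = ["BJ", "B", "B", "BJ", "BR", "BR"]
for average in ("micro", "macro"):
    p = precision_score(y_true, y_pred, average=average, zero_division=0)
    r = recall_score(y_true, y_pred, average=average, zero_division=0)
    f = f1_score(y_true, y_pred, average=average)
    print(average, round(p, 3), round(r, 3), round(f, 3))
```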
We input the predicted subclass labels and the real subclasses into scikit-learn functions to automatically obtain precision, 68 recall, 69 micro F1 score, and macro F1 score. The following tables show the accuracy of our test results for suggested subjects: for all subclasses in class B, see Table 1; for only subclasses B, BJ, and BR, see Table 2. According to both Tables 1 and 2, precision and recall both perform better in the micro setting than in the macro setting, meaning that our approach is not good at recommending rare subject headings. This is not surprising, as our recommendations are based on multiple nearest neighbors, and popular subject headings are more likely to appear in suggestions. As for the comparison between Tables 1 and 2, Table 2 improved slightly in the micro setting but almost not at all in the macro setting. Therefore, increasing classification accuracy may not improve the suggestion of rare subject headings, though it may improve overall recommendation performance.
Regarding the comparison between the two tables, it is evident that the F1-score increased in both micro and macro settings when we limited our subclasses to B, BJ, and BR. The F1 scores provided us with empirical evidence to refine our approach: if we can further improve the subclass classification filter for all subclasses, we can surely improve the F1-score. Since the classification filter does not work well across all subclasses, we decided to limit the effect of suboptimal filters by using only 3 subclasses, to see how well the nearest neighbor step works with the step 3 filter.

Using confusion matrix and accuracy to evaluate 3 classifiers in step 3
We would like to further discuss the classification results in step 3 and their implications. We compared the accuracies of the different classifier model candidates (Random Forest, SVM, and Neural Network) and plotted a confusion matrix to further analyze the classification results. A confusion matrix is "a specific table layout that allows visualization of the performance of an algorithm. … Each row of the matrix represents the instances in an actual class while each column represents the instances in a predicted class, or vice versa." 70 The confusion matrix for the SVM classifier is elaborated in the latter part of this section. Table 3 shows the accuracy comparisons between the three model candidates, where accuracy is calculated simply as: accuracy = number of books with correct classes / number of all books. SVM (Support Vector Machine) shows the best result, with a test accuracy of roughly 66%, on all subclasses in class B, and NN (Neural Network) with one hidden layer shows the best result, with a test accuracy of roughly 89%, on subclasses B, BJ, and BR. We decided to continue with only one classifier, SVM. To further examine the classification results of SVM, we used the confusion matrix to analyze the distribution of suggested classes and compare them to the true classes. In Figure 6, the visualization of the confusion matrix results, the block where True Label = BC and Predicted Label = B has the value 0.67, meaning that 67% of books in BC are incorrectly classified by SVM as B. Accordingly, the diagonal shows where the algorithm correctly classified the books.
We can see that the SVM cannot correctly classify subclasses BC and BD, both of which have 0 on the diagonal. It is not surprising that 67% of BC (Logic) and 29% of BD (Speculative philosophy) books were misclassified as B (Philosophy (General)), because BC and BD are close to B in their subclass nature and scope. Another example is BM (Judaism), which has 75% of its data misclassified as BS (The Bible). As we can see in Figure 7, BC, BD, and BM all have too little data to train on, producing less than ideal classification results, along with other potential factors such as the internal structure of the subclass itself. However, these examples do not only show the limitations of the classifier (SVM); they also show whether the document embeddings can truly reflect the context of books. In short, the causes of incorrect classification lie in both the classifier's and the embeddings' accuracy. For instance, for subclass BR, only 32% of the embeddings are correctly classified as BR (Christianity), while 35% are wrongly classified as BX (Christian Denominations). It is interesting that, even when misclassified, the embeddings are assigned to related subclasses rather than to ones with less correlation to BR, e.g., BC (Logic) or BF (Psychology). That is, even though the embeddings cannot be perfectly classified, the misclassified subclasses are still mostly highly correlated with the true subclass, reassuring us that the embeddings can truly represent the context of books. In addition to these factors, it is challenging to assign only one classification subclass when a book covers multiple topics, and our tests include only the first 4,096 tokens of each book.

Annotator's assessment and analysis
The test results were converted from a CSV (comma-separated values) file to a spreadsheet with title, creator(s), filtered subjects (embedded in the bibliographic record), and subjects suggested by BERT. The author/annotator verified each suggested subject heading against the filtered (assigned) LC subject headings, with their structure of broader, narrower, and related terms, in the bibliographic record of each book. Both suggested and filtered/assigned subject headings appear on the spreadsheet with subdivisions removed. Furthermore, any subject heading that appeared only once was removed. If no related LC subject headings were available, a full-text search was conducted to examine whether the suggested subject term(s) were related subjects. The annotations were classified into three categories: all/some terms related; no additional subject terms added; and some terms uncertain. The annotator's notes were recorded on the spreadsheet for data analysis, percentage calculation, and discussion.
All of these suggested subjects were reviewed by the author/annotator. As the evaluation outcome indicates, 61 of 146 titles were totally matched, 78 of 146 were partially matched, 6 of 146 had no matched subjects, and 1 had no additional subjects suggested. The percentages of these categories are listed in Figure 8.
Totally matched subjects represented 41.8% of the total, and 53.4% were partially matched; the predominant majority of titles thus had either totally or partially matched subjects. The annotator was extremely impressed by some additional subjects suggested by BERT and felt these subjects would be very helpful for researchers. For instance, the bibliographic metadata for Recent Tendencies in Ethics, written by W. R. Sorley, has only one LC subject assigned (Ethics), but "Ethics, Evolutionary" was suggested and is a topic mentioned in the book. Another good example is Life of Luther. Its bibliographic metadata has only one LC subject assigned (Luther, Martin, 1483-1546), but the suggested subjects included "Education" and "Reformation," which are very relevant. The LC subject heading "Education" includes variant forms such as "Students-Education" and "Youth-Education"; chapter VII of this book, "Luther's Student Days," describes Luther's student education, and another chapter relates how Lutheran Saxons contributed to the education of ministers in the United States in 1839.
Only 16% of titles (24 out of 146) had subjects removed because they appeared only once, as noted in the methodology. Still, 57.5% of titles had partially matched or non-matched subjects suggested. For instance, The Religious Situation, written by Goldwin Smith, has the LC subject "Christianity" (BR), yet the suggested subjects include "Etiquette for children and teenagers," "Home economics," "Japan," and "Table," which are in BJ. The authors analyzed these subjects and proposed possible causes, including the quantity of training data and neighboring books with diverse subjects. It seems likely that the more books there are in the training data, the better the nearest-neighbor suggestions will be; applying this AI principle, the results should improve greatly if our future tests can benefit from more data. On the other hand, future research may find that these seemingly irrelevant subjects uncover hidden topics or indicate related books. For example, the annotator could not find the subject "Ethics, Evolutionary" or related terms in the book The Philosophy of Despair, written by David Starr Jordan; nevertheless, the annotator could find the terms despair and pessimism in books about evolution and ethics.
Only one title, A Philosophical Dictionary, Volume 04, had no additional subjects suggested. We found that Volumes 02, 03, and 05-10 are in the training data; therefore, subject headings for Volume 04 might have been suggested by the rest of the volumes in the training data. We also found that A Philosophical Dictionary, Volume 01 has an additional suggested subject on the test result spreadsheet that is not among Volume 04's suggested subjects. We suspect that Volume 01 has content more dissimilar to the rest of the volumes; even though most of its suggested subjects were identical to Volume 04's, Volume 01 also received the suggested subject "Philosophy, Hindu." Because most books in Project Gutenberg are classic books, some of their terminology varies from modern terms. As explained in the literature review, Wikipedia and the other original datasets used to train BERT contain more contemporary terms, which can be a barrier when testing on classical literature. The author/annotator attempted to search for certain terms in variant forms and had some successful results for related terms: for instance, the annotator searched "being" instead of ontology, "morality" for ethics, and "manners" for etiquette, and found "Hindoo" in the text rather than Hindu. "African Americans," one of the suggested LC subjects, resulted in more non-matched outcomes than other terms, and the major reason could be the evolution of language and terms. Likewise, today almost nobody says "melancholy"; instead we say "depression." The new/old distinction will be a persistent problem as long as older books are still being read and studied. Therefore, anybody using BERT-based approaches to find books on "depression" from the 19th century should fine-tune models on corpus data with more archaic context. Nevertheless, humanities research is being developed that applies NLP/BERT models or perspectives to semantic change, such as "detecting semantic change in diachronic corpora and representing the change of concepts over time." 71

Limitations and further improvements
Given our limited manpower and time, our tests have the following limitations:

Limited tokens embedded
Only the first 4,096 tokens of each book were passed to the Longformer tool. Accordingly, long books whose key concepts appear in later parts might not be well indexed through this tool. To embed information from later parts, it may be better to iterate the same embedding process (step 1) over successive chunks and concatenate the resulting embedding vectors, as sketched below. However, this would greatly increase the computational cost and complexity.
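A hedged sketch of this idea, reusing the tokenizer and the embed_book() helper from the step 1 sketch above (both are illustrative assumptions, not a tested implementation):

```python
import torch

def embed_long_book(text: str, chunk_tokens: int = 4096) -> torch.Tensor:
    # Split the full token sequence into 4,096-token chunks, then decode each
    # chunk back to text so it can be fed through the same embedding step.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [tokenizer.decode(ids[i:i + chunk_tokens])
              for i in range(0, len(ids), chunk_tokens)]
    # One 3,072-dim embedding per chunk; concatenating preserves later parts
    # of the book at the cost of a longer, variable-length feature vector.
    return torch.cat([embed_book(c) for c in chunks])
```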

Limited training data
Regarding classifiers, we could not fine-tune a Longformer classifier because we did not have sufficient book data and computing resources/memory. Nevertheless, it should be expected that transformer-based classifiers would outperform our current, simpler SVM classifier if the dataset were large enough. In addition to training a better classifier, more training data could also bring better nearest-neighbor recommendations: if we increase the density of the embedding space, the nearest embeddings should provide subject heading candidates that are more similar to the recommended books.

Limited corpora diversity
We tested only the B subclasses of the Project Gutenberg eBooks. Therefore, more tests on different corpora should be conducted to validate our proposed process. However, as mentioned, the Longformer model and the preceding BERT model family were pretrained on mostly modern word contexts from BookCorpus and Wikipedia; we therefore expect that our proposed process would provide better results if applied to contemporary corpora.
A possible way to improve the BERT embeddings' understanding of classical literature is to fine-tune the embeddings on Project Gutenberg ebooks, that is, to reproduce the BERT pre-training (masked language modeling) process on the Project Gutenberg ebooks so that the fine-tuned embeddings can understand historical context. A hedged sketch follows.
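The sketch below illustrates this fine-tuning idea under stated assumptions: the checkpoint name and the local file "gutenberg.txt" (a hypothetical text dump of the ebooks) are illustrative, not part of our tests.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

name = "allenai/longformer-base-4096"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

# "gutenberg.txt" is a hypothetical local dump of Project Gutenberg ebooks.
dataset = load_dataset("text", data_files={"train": "gutenberg.txt"})["train"]
tokenized = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=4096),
    remove_columns=["text"])

# Randomly mask 15% of tokens: the standard BERT-style MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="longformer-gutenberg",
                                         num_train_epochs=1),
                  train_dataset=tokenized,
                  data_collator=collator)
trainer.train()
```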

Conclusion
Key tasks completed in these tests include designing a recommender system based on Longformer embeddings and k-NN to suggest correct LC subject headings (without subdivisions). Based on the confusion matrix, the SVM (Support Vector Machine) classifier was chosen for a simplified process, with an 87% accuracy rate for suggesting the LCC subclass labels B/BJ/BR in comparison with all B subclasses, achieving an F1 score of 0.31 on recommending 750+ LC subject headings filtered with LCC subclasses B/BJ/BR (728 titles) from 3,671 Project Gutenberg English ebooks. One hundred forty-six titles (a random 20% of the B/BJ/BR subclass titles; 728 × 0.2 ≈ 146) were evaluated by the author/annotator to verify whether the suggested subjects matched LC subject headings, including their narrower, broader, and related terms. Of the total, 41.8% of the subjects were totally matched and 53.4% were partially matched; in short, the predominant majority were either totally or partially matched.
Our tests focused on recommending LC subject headings. Nevertheless, we also explored whether machine learning classifiers can predict LCC subclass labels correctly; three classifiers (SVM, NN, and RF) performed effectively in the revised tests (step 3). Even though our initial tests failed, we were able to explore and test different machine learning classification algorithms, including a hierarchical multi-label classification algorithm, to classify books in accordance with LCC class/subclass for B/BJ/BR.
To optimize test procedures and results, the test results of each step were analyzed with quantitative metrics, and the evaluation outcomes supported the decision on next steps and the choice of algorithms/techniques/approaches; the final test results also included qualitative analysis performed by a human annotator with metadata knowledge. To illustrate, the test results were calculated using cosine similarity (measuring how similar two documents are likely to be in terms of their subject matter), the F1-score (the harmonic mean of precision and recall), and the confusion matrix (measuring the performance of a supervised learning algorithm through an error matrix that shows whether the system is confusing two classes, actual versus predicted) for their accuracy, and the suggested LC subject headings were evaluated for relevance and accuracy by a human annotator with metadata knowledge. Most importantly, cosine similarity, the F1-score, and the confusion matrix have been consistently used in data mining and machine learning research, and this article follows that quantitative usage accordingly.

NLP models are still a work in progress
In terms of coding, the Longformer implementation carries a disclaimer: "This model is still a work in progress, if you see something strange, file a Github Issue." 72 Undoubtedly, there are still complicated issues to resolve in NLP contextual models because of the lexical ambiguity in human languages, such as the phrases and negative sentences mentioned earlier in this article.

Tree structure in LCSH & LCC subclass for computational needs and usability tests
In our initial tests, we had to modify the process due to the complexity of LCSH and its related LCC classes/subclasses. The recommendations of Julien et al. resonated with the authors: "Ongoing development efforts are focused on managing the computational needs of the large redundant LCSH tree and ensuring its usability. This will be followed by usability tests and retrieval performance comparisons with traditional, separate browsing and searching systems." 73 Similarly, Yi and Chan (2010) conducted a syntactical and structural analysis of LCSH for the digital environment. They recommended a number of directions such as preserving strong hierarchical relationships but abandoning weak ones, in order to develop LCSH into a viable system for digital resources. 74

Research goal
Can BERT (AI/NLP contextual) tools assist in automatic subject indexing of the Gutenberg collection (digitized text resources) through suggesting accurate LC subject headings?
Yes, BERT tools recommended correct LC subject headings with high accuracy rates. The predominant majority of the 146 evaluated titles had either totally matched (41.8%) or partially matched (53.4%) subjects. The recommender system achieves an F1 score of 0.31 on recommending 750+ related LC subject headings filtered with LCC subclasses B/BJ/BR, from 3,671 Project Gutenberg English ebooks.
• How do we design a process to leverage BERT models to suggest LC subject headings? We can leverage BERT embeddings to conduct nearest-neighbor searches that find related books for subject heading recommendations, and we can further improve the recommendations by filtering related subclasses through SVM classifiers.
• How do we evaluate the effectiveness and models' performance of the process? We used precision, recall, and F1 score as quantitative metrics and further assessed the LC subject heading recommendations with a human annotator. We also analyzed the subclass classifier with accuracy scores and a confusion matrix.
• Can Library of Congress classification and subject headings embedded in the bibliographic metadata of digitized books support the evaluation of test outcomes of BERT (AI/NLP contextual) tools? Yes, they are indeed very helpful. Our tests focus on automatic subject indexing in a contextual setting rather than on frequency-based keyword clusters, which might not be contextual or demonstrate any semantic relationship among the keywords. LC subject headings supported the evaluation of our test results accurately and objectively. On the other hand, the authors had to perform additional tasks, under defined conditions, to make them work; the major barriers were the complex structures/syntaxes of LCSH subdivisions and LCC subclass levels.
Research to repurpose the rich data inside LC subject headings for broader, narrower, and related terms would be a great candidate for the next test. Furthermore, using LC subject headings to train a BERT-based model in the fine-tuning process would be an essential task to pursue.

Further research
The authors deeply believe that big data is leading the future. In his lecture "Datasets make algorithms: how creating, curating, and distributing data creates modern AI" at the 2019 AI for LAM Conference, Bryan Catanzaro, Vice President of Applied Deep Learning Research at NVIDIA, noted that AI/machine learning projects need quality datasets, which should be created by researchers/educators/librarians and maintained by libraries, which can store and provide access to books and human knowledge. 75 For instance, we chose the Project Gutenberg collections because they provide text data in multiple formats, while many digitized library collections are not processed by OCR. Cataloging and metadata librarians need to strengthen their knowledge of how to communicate and collaborate with AI/NLP experts, to optimize the quality of metadata and datasets, and to provide user-centered services by engaging with user communities.
After our tests were completed, we found a few new articles published in late 2021 or early 2022 and are very excited by these initiatives. Kazi, Lane, and Kahanda developed predictive models to automatically annotate scholarly articles with a subset of biology-related LCSH concepts, using a gold-standard dataset generated from extracted keywords mapped to LCSH for developing multi-label classification models; of the three models used, BERT significantly outperformed both DT (Decision Tree) and ANN (Artificial Neural Network). 76 The National Library of Finland (NLF) has created Annif, an open-source toolkit for automated subject indexing and classification, and launched Finto AI, which suggests subjects for text in Finnish, Swedish, and English and integrates semi-automated subject indexing into metadata workflows. 77 As the white paper Investigating the National Need for Library Based Topic Modeling Discovery Systems commented, "it is obvious that machine learning for cultural heritage, library, and scholarly use outside of strictly computer science disciplines is at a relatively early stage," albeit "the profession as a whole is transforming from supporting research through scholarly resource acquisition and access to collaborative immersion in the creation of scholarship itself." 78 Setting a higher priority for strategic directions is imperative for academic libraries, technical services, collection development, scholarly publishing, and library discovery systems.
Lately, more recent research has looked into exploiting semantic word-cluster representations to enhance topic modeling, 79 which has been used in digital humanities for decades. Even though BERT and topic modeling have fundamentally different learning models, research on topic modeling (unsupervised training) with BERT (supervised training) is ongoing and has increased swiftly in recent years. 80 Library metadata communities are committed to creating linked open data, with entities and relationships linked in the bibliographic metadata. However, not everyone or everything is mapped to the knowledge graph in digitized text collections, and NLP can help fill in the gaps between named entities. 81 The authors strongly believe that these three domains, library metadata, BERT (NLP), and digital humanities (topic modeling), will see more cross-collaboration as an emerging trend. In Figure 9, we envision that the future (in the center) will expand as the three disciplines' circles converge and move into the center as one domain or community.

Figure 9. The future of interdisciplinary collaboration.