Weakly Supervised Learning for Textbook Question Answering

Textbook Question Answering (TQA) is the task of answering diagram and non-diagram questions given large multi-modal contexts consisting of abundant text and diagrams. Deep text understanding and effective learning of diagram semantics are important for this task due to its specificity. In this paper, we propose a Weakly Supervised learning method for TQA (WSTQ), which regards the incompletely accurate results of essential intermediate procedures of this task as supervision to develop Text Matching (TM) and Relation Detection (RD) tasks, and then employs these tasks to learn strong text comprehension and effective diagram semantics respectively. Specifically, we apply the results of text retrieval to build positive as well as negative text pairs. To learn deep text understanding, we first pre-train the text understanding module of WSTQ on TM and then fine-tune it on TQA. We build positive as well as negative relation pairs by checking whether there is any overlap between the items/regions detected from diagrams using object detection. The RD task forces our method to learn the relationships between regions, which are crucial for expressing diagram semantics. We train WSTQ on RD and TQA simultaneously, i.e., multitask learning, to obtain effective diagram semantics and thereby improve TQA performance. Extensive experiments are carried out on CK12-QA and AI2D to verify the effectiveness of WSTQ. Experimental results show that our method achieves significant accuracy improvements of 5.02% and 4.12% over the current state-of-the-art baseline on the test splits of the above datasets respectively. We have released our code at https://github.com/dr-majie/WSTQ.


I. INTRODUCTION
Question answering, such as machine reading comprehension [1], [2] and visual question answering [3], [4], [5], has attracted extensive attention due to its popularity in intriguing real-world applications, e.g., autonomous driving [6] and image retrieval [7]. Recently, a new task, Textbook Question Answering (TQA) [8], [9], which possesses the characteristics of both machine reading comprehension and visual question answering, has pushed forward vision-and-language comprehension. In particular, TQA is the task of answering diagram and non-diagram questions given the multi-modal contexts shown in Figure 1, which is analogous to the real-life process of a human learning new knowledge from a lesson and then being assessed on it. Taking a diagram question as an example, this task requires a system to have a deep semantic understanding of multi-modal inputs and then predict answers accurately.
TQA presents some challenges due to its specificity. First, it is difficult to learn a deep semantic understanding of long textual contexts with limited training data, e.g., only 15,153 samples in the CK12-QA train split [9]. (The TQA dataset is collected from http://www.ck12.org. In this paper, we call the TQA dataset CK12-QA to distinguish the TQA task from the TQA dataset.) The textual contexts, especially the text most relevant to a question, are very important for predicting answers. For example, the text with olive backgrounds in Figure 1 is the key knowledge to answer Question 1. Second, it is difficult to learn effective semantic representations of diagrams without annotations. The semantics of diagrams in textbooks, which are also essential for predicting answers, are expressed by a collection of items with 2D positions and a collection of relationships between items. Such relationships are expressed by the connections or overlaps between the items. For example, the diagram of Question 1 shown on the right of Figure 1 depicts nitrogen cycles by the overlaps between regions and contains the important visual knowledge to answer this question, i.e., the fertilizer flow direction. However, no annotations such as items and relationships exist for diagrams.
The de facto paradigm for tackling TQA is to first apply recurrent neural networks such as LSTMs to learn the semantic representations of questions and of the paragraphs extracted from textual contexts by text retrieval methods such as TF-IDF, then employ object detection methods such as YOLO [10] to capture the diagram semantics, and finally use multi-modal fusion to predict answers. Existing methods [9], [11], [12] only take the result (output) of previous steps as the input of the next steps, not making full use of the results of intermediate procedures. Although these results are not completely accurate, they can be used to alleviate the above challenges.
In this paper, we propose a Weakly Supervised learning method for TQA (WSTQ), in which the incompletely accurate results of essential intermediate procedures for TQA are regarded as supervision to develop Text Matching (TM) and Relation Detection (RD) tasks, and these tasks then motivate the system to learn effective text comprehension and diagram semantics respectively. Concretely, we develop TM from the results of text retrieval, an important intermediate procedure for TQA [13], [14] that is used to find the text most relevant to each question. We consider the text t_i, which is most relevant to the question q_i in the lesson l_u, as the matching text of q_i, and regard the text t_j of q_j ∈ l_v, u ≠ v, as the mismatching text of q_i. The text understanding module of WSTQ is first pre-trained on TM and then fine-tuned on TQA to learn deep text understanding. We develop RD by constructing positive and negative relation pairs according to whether there is any overlap between the items/regions detected from diagrams by object detection. The detection is also an important procedure of TQA [12], [15] and is used to obtain diagram features. The RD task forces our method to learn the relationships between regions, which are crucial for expressing diagram semantics. To learn effective diagram semantics and improve TQA performance, our method is trained on RD and TQA simultaneously, with the parameters of the diagram understanding module shared by both tasks, i.e., multitask learning. It is worth noting that TM and RD are developed automatically rather than manually. In addition, their labels are not always accurate, i.e., weak supervision [16].
We evaluate WSTQ on two TQA datasets including CK12-QA [9] and AI2D [8]. Experimental results show that our method achieves the new State-Of-The-Art (SOTA) accuracy of 52.61% and 72.05% on CK12-QA and AI2D test splits respectively. To summarize, our contributions are mainly threefold.
1) We propose a novel multitask learning framework that applies TM and RD to drive WSTQ to deepen text understanding and learn effective diagram semantics respectively. 2) We propose a weakly supervised development strategy that uses the results of essential intermediate procedures for TQA to build TM and RD automatically. 3) We conduct extensive experiments and ablation studies on CK12-QA and AI2D to verify the effectiveness of WSTQ. We are the first to report the performance on various types of questions such as what and how within these datasets.
The remainder of this paper is organized as follows. Section II introduces the related work. Section III describes the task formulation. The details of our method are described in Section IV. The experiments on CK12-QA and AI2D are discussed in Section V. Finally, we make concluding remarks in Section VI.

II. RELATED WORK
Researchers have proposed various methods for TQA, visual question answering, and video question answering, which try to address either multi-modality interaction or explainability challenges. In this section, we introduce how they address the issues.

A. Multi-Modality Interaction
The information interactions between questions and multi-modal contexts play a key role in predicting answers. Kembhavi et al. [8] first softly embedded the textual contexts most relevant to questions as well as candidate answers via an attention mechanism and then projected textual and visual representations into a common space to predict answers. IGMN [11] finds the contradictions between textual contexts and candidate answers to build contradiction entity relationship graphs and then reasons over the multi-modal inputs under the guidance of these graphs. In contrast, F-GCN [14] applies graph convolutional networks [17] on textual contexts and diagrams to build unified graphs that memorize relevant question background information, and predicts answers by reasoning over the graphs. EAMB [18] applies an essay-anchor attentive multi-modal bi-linear pooling method to learn the joint representations of text and diagrams. It first builds textual graphs based on textual contexts and then applies the bilinear MFB [19] model to fuse graph and diagram representations. MoQA [20] regards textual contexts and diagrams as knowledge and then selects the top K most similar pieces of knowledge to answer questions. It also explores the TQA performance obtained by different information representations. All of the above methods were evaluated only on the TQA validation split [9], because the test split was unavailable at that time, and they were trained end-to-end only on CK12-QA. By comparison, ISAAQ [12] achieved SOTA results relying on fine-tuning large pre-trained models, ensemble learning and large datasets. The textual ISAAQ is pre-trained on the RACE [21], ARC-Easy, ARC-Challenge [22] and OpenBookQA [23] datasets and fine-tuned on CK12-QA. Similarly, the multi-modal ISAAQ is pre-trained on the VQA abstract scenes, VQA [3] and AI2D [8] datasets and fine-tuned on CK12-QA.
In parallel with the above TQA methods, there are many works [24], [25], [26], [27] addressing multi-modality interaction between natural images/videos and languages. UNITER [28], a large pretrained model, provides universal image-text representations for vision-and-language tasks. It is pre-trained on four tasks including masked language modeling, masked region modeling, image-text matching, and word-region alignment. CLIP-BERT [29] is a general framework for end-to-end video and language learning, which employs sparse sampling to use only a few sampled short clips from videos at each training stage.

B. Explainability
Practical TQA methods should not only answer textbook questions accurately but also provide students with explanations, which helps them gain a deeper understanding of what they have learned. Only one work, XTQA [13], studies TQA explainability. It regards the whole textual context of a lesson as candidate evidence and applies a coarse-to-fine grained algorithm to extract span-level explanations for answering questions. However, it can only provide textual explanations for students rather than both textual and visual explanations. In parallel with this TQA method, many neural-symbolic methods [30], [31], [32], [33] focus on improving the transparency and explainability of visual/video question answering. NS-VQA [31], NS-CL [32], DCL-Oracle [33], and DCL [30] first apply image/video and question parsers to obtain scene/event representations and symbolic programs respectively, and then execute the programs to predict answers.
The challenges WSTQ tries to address are different from those of the above works. Similar to RAFR [34], our method tries to learn effective diagram representations. However, RAFR only considers the text in diagrams, which causes the loss of visual information. By contrast, WSTQ considers not only the region representations but also the relationships between them to learn more effective diagram semantics.

III. TASK FORMULATION
The questions can be classified into three categories: Non-Diagram True or False (NDTF) with two candidate answers, Non-Diagram Multiple-Choice (NDMC) with four to seven candidate answers, and Diagram Multiple-Choice (DMC) with four candidate answers. Following previous works [12], [13], we split TQA into NDTF, NDMC and DMC. We regard NDMC and DMC as multi-class classification and NDTF as binary classification. We only use the text of the multi-modal contexts due to the lack of diagrams in some lessons. An example of multi-modal contexts is shown on the left of Figure 1. In this section, we describe the formulation of each subtask.

A. NDTF and NDMC
Given a dataset S_ψ consisting of N_ψ triples (q_i, t_i, A_i) with a question q_i ∈ Q_ψ, text t_i ∈ T_ψ and candidate answers A_i ∈ A_ψ, NDTF and NDMC can be defined as follows:

â_i = argmax_m p(a_i,m | q_i, t_i),

where a_i,m denotes the m-th candidate answer for q_i, â_i denotes the predicted class, and |A_i| denotes the number of candidate answers. |A_i| = 2 for NDTF and |A_i| = 7 for NDMC.

B. DMC
Given a dataset S_φ consisting of N_φ quadruples (q_k, d_k, t_k, A_k) with a question q_k ∈ Q_φ, diagram d_k ∈ D_φ, text t_k ∈ T_φ and candidate answers A_k ∈ A_φ, DMC can be defined as follows:

â_k = argmax_m p(a_k,m | q_k, d_k, t_k),

where a_k,m denotes the m-th candidate answer for q_k, â_k denotes the predicted class, and |A_k| = 4.
To describe the differences between subtasks, we use different subscripts such as q_i and q_k. In the following subsections, we drop these subscripts for readability. Following previous works [12], [13], we devise a corresponding method for answering the questions of each subtask.

IV. METHOD
In this section, we first provide a brief overview of the architecture of our method. Then, we describe the weakly supervised learning method for NDTF and NDMC. Finally, we introduce the weakly supervised multitask learning method for DMC. We show the TM development in Figure 2, where (q_0 + A_0, t_0) and (q_0 + A_0, t_5) are positive and negative TM pairs respectively, A_0 denotes the candidate answers of the question q_0, t_0 denotes the most relevant text of q_0 extracted by information retrieval methods, and t_5 denotes the most relevant text of q_5.

A. Overview
We show the RD development on the bottom right of Figure 2, where r_i,j denotes the j-th region detected by object detection methods within diagram d_i. Region pairs with overlaps, such as (r_i,15, r_i,17), are considered positive samples, whereas region pairs without overlaps, such as (r_i,0, r_i,1), are regarded as negative samples. In non-diagram question answering, we first pre-train the text understanding module on TM and then fine-tune it on NDTF and NDMC respectively. In diagram question answering, the parameters of the diagram understanding module are shared by DMC and RD. The text understanding module pre-trained on TM is also fine-tuned on DMC. We train our method on DMC and RD jointly.

B. Weakly Supervised Learning for NDTF and NDMC
Text understanding is important for answering questions accurately due to the TQA specificity, but it may not be learned well from limited training data, e.g., only 15,153 samples in the CK12-QA train split. Text retrieval is an essential intermediate procedure of TQA because it is used to find the text most relevant to each question. Inspired by this, we regard the incompletely accurate results of text retrieval as supervision to develop the TM task and train the text understanding module on TM to overcome this issue.
Following previous work [12], Information Retrieval (IR), Next Sentence Prediction (NSP), and Nearest Neighbors (NN) methods are applied to retrieve the text most relevant to each question. Particularly, we first concatenate the question q_i and its candidate answers A_i as a query. Then, (1) a traditional search engine such as ElasticSearch is used to perform IR; (2) the text retrieval task is treated as NSP using a Transformer [35] with frozen parameters; (3) the Transformer is applied to obtain the representations of queries and of sentences within the textual context, and the cosine similarity between them is computed to obtain NN. These text retrieval methods can also be replaced by other technologies such as TF-IDF [36]. WSTQ applies the above three methods respectively to explore their differences in TQA performance.
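As a rough illustration of the NN criterion, the following sketch selects the sentence whose embedding is most cosine-similar to the query embedding. The toy 4-dimensional vectors stand in for the frozen-Transformer representations; the function name is ours, not from the released code.

```python
import numpy as np

def nearest_neighbor_retrieval(query_vec, sentence_vecs):
    """Return the index of the sentence whose embedding has the highest
    cosine similarity with the query embedding (the NN criterion)."""
    q = query_vec / np.linalg.norm(query_vec)
    s = sentence_vecs / np.linalg.norm(sentence_vecs, axis=1, keepdims=True)
    return int(np.argmax(s @ q))

# Toy embeddings standing in for frozen-Transformer representations.
query = np.array([1.0, 0.0, 1.0, 0.0])
sentences = np.array([
    [0.0, 1.0, 0.0, 1.0],   # unrelated sentence
    [0.9, 0.1, 0.8, 0.0],   # close to the query
    [0.5, 0.5, 0.5, 0.5],   # neutral sentence
])
best = nearest_neighbor_retrieval(query, sentences)  # index 1
```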
Relevant knowledge may exist in adjacent lessons due to the TQA specificity, e.g., carbon and living things in Lesson 1 and carbon cycle in Lesson 2. This would cause a situation where negative text pairs are in fact relevant. To address this issue, we devise a strategy to develop a relatively precise TM task. Specifically, we sort all the questions according to the lesson order and select the mismatching text t_j of q_i from a question q_j that is far away from q_i in this order, so that q_i and q_j come from distant lessons, where t_j is the most relevant text of q_j.
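The pair-construction strategy above can be sketched as follows. The concrete negative-sampling offset (half the dataset away) is our assumption for illustration; the paper only specifies that the negative text must come from a distant question in the lesson order.

```python
def build_tm_pairs(questions, retrieved_texts):
    """Build weakly supervised TM pairs.

    questions[i] is the query string (question + candidate answers),
    already sorted by lesson order; retrieved_texts[i] is its most
    relevant text t_i.  Each question contributes one positive pair
    (q_i, t_i, 1) and one negative pair whose text comes from a question
    half the dataset away (an assumed offset), so that adjacent,
    possibly related lessons are never used as negatives.
    """
    n = len(questions)
    pairs = []
    for i in range(n):
        j = (i + n // 2) % n  # a distant question q_j (assumed offset)
        pairs.append((questions[i], retrieved_texts[i], 1))  # matching
        pairs.append((questions[i], retrieved_texts[j], 0))  # mismatching
    return pairs

qs = ["q0+A0", "q1+A1", "q2+A2", "q3+A3"]
ts = ["t0", "t1", "t2", "t3"]
pairs = build_tm_pairs(qs, ts)
```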
Obviously, achieving high performance on TM and TQA requires a deep understanding of the text t_i. Inspired by this, we apply TM, which is developed automatically via weak supervision, to drive the text understanding module to learn deep text understanding. Specifically, we first train the text understanding module on TM by optimizing a binary cross-entropy loss L_TM. Then, we fine-tune it on NDTF and NDMC by optimizing L_NDTF and L_NDMC respectively, which denote binary and multi-class cross-entropy losses. RoBERTa [37] is applied as the text understanding module to learn the joint representations of q_i, a_i,m and t_i; it can be replaced by other existing text representation methods.

C. Weakly Supervised Multitask Learning for DMC
Object detection is also an important intermediate procedure of TQA and is used to extract image/diagram features [12], [15], [38]. The relationships expressed by the connections or overlaps between regions play a key role in expressing the semantics of diagrams. Inspired by this, we first apply object detection methods such as YOLO [10] to detect regions and check whether they overlap, in order to build positive as well as negative relation pairs. Then, we devise a multitask learning architecture that drives WSTQ to learn not only on RD but also on DMC. This enables our method to learn effective semantic representations of diagrams and achieve good DMC performance.
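The overlap check that labels relation pairs can be sketched as below. The (x1, y1, x2, y2) box format and the helper names are our assumptions; the released code may represent detections differently.

```python
def boxes_overlap(a, b):
    """True if two (x1, y1, x2, y2) boxes intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def build_relation_pairs(boxes):
    """Label every ordered region pair: 1 if the boxes overlap
    (weakly supervised positive relationship), 0 otherwise."""
    pairs = {}
    for j, a in enumerate(boxes):
        for k, b in enumerate(boxes):
            if j != k:
                pairs[(j, k)] = int(boxes_overlap(a, b))
    return pairs

# Three detected regions: the first two overlap, the third is separate.
boxes = [(0, 0, 10, 10), (5, 5, 15, 15), (20, 20, 30, 30)]
rels = build_relation_pairs(boxes)
```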
1) Diagram Understanding (DU): WSTQ applies CNNs such as ResNet [39] to learn an x-dimensional vector v_i,k of the k-th region r_i,k detected by YOLO within the diagram d_i. The coordinate c_i,k ∈ R^4 of r_i,k is projected into an x-dimensional position vector using a Fully Connected (FC) layer due to its importance to relationship representations. Our method takes the arithmetic mean of the two vectors as the representation d_i ∈ R^{μ×x} of d_i as follows:

d_i = LN((v_i + c_i W_c) / 2),

where μ denotes the number of regions within d_i, LN denotes layer normalization [40] and W_c ∈ R^{4×x} denotes the learned weight matrix.
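A minimal numpy sketch of the DU computation, assuming the mean-then-LayerNorm form described above. The region features would come from ResNet and W_c would be learned in the real model; here they are random toy values with a small feature size.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Row-wise layer normalization (LN)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def diagram_understanding(region_feats, coords, W_c):
    """d_i = LN((v + c W_c) / 2): average each region feature with its
    projected position vector, then layer-normalize."""
    return layer_norm((region_feats + coords @ W_c) / 2)

rng = np.random.default_rng(0)
mu, x = 5, 8                       # 5 regions, toy feature size 8
feats = rng.standard_normal((mu, x))   # stand-in for ResNet features
coords = rng.random((mu, 4))           # (x1, y1, x2, y2) coordinates
W_c = rng.standard_normal((4, x))      # learned projection in the model
d_i = diagram_understanding(feats, coords, W_c)
```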
2) RD Optimization: Due to the lack of diagram annotations, our method only learns implicit relations instead of explicit ones such as (subject, relationship type, object) in the visual relation detection task [41], [42]. To obtain the relationship scores between regions, our method first repeats d_i along its first and second dimensions μ times, which yields d_i^0 ∈ R^{μ×μ×x} and d_i^1 ∈ R^{μ×μ×x} respectively. Then, they are multiplied using the Hadamard product to obtain the joint representations d̃_i ∈ R^{μ×μ×x}. Finally, WSTQ applies an FC layer to infer the relationship scores s_i^r ∈ R^{μ×μ}. The above steps can be denoted as follows:

s_i^r = (d_i^0 ⊙ d_i^1) W_r,

where ⊙ denotes the Hadamard product and W_r ∈ R^x denotes the learned weight matrix.
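The broadcast-and-project step can be sketched with numpy broadcasting in place of explicit repetition (the two are equivalent here). W_r is a toy random vector standing in for the learned FC weights.

```python
import numpy as np

def relation_scores(d_i, W_r):
    """s_r[j, k] = (d_i[j] * d_i[k]) @ W_r: broadcast the region matrix
    along two axes, take the Hadamard product of every region pair, and
    project each joint vector to a scalar score."""
    d0 = d_i[:, None, :]    # (mu, 1, x), repeats along axis 1
    d1 = d_i[None, :, :]    # (1, mu, x), repeats along axis 0
    return (d0 * d1) @ W_r  # (mu, mu) relationship scores

rng = np.random.default_rng(1)
mu, x = 4, 8
d_i = rng.standard_normal((mu, x))
W_r = rng.standard_normal(x)
s_r = relation_scores(d_i, W_r)
```

Because the Hadamard product is commutative, the resulting score matrix is symmetric, which matches the undirected overlap labels.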
Our method regards RD as a binary classification. In this task, negative relationship pairs greatly outnumber positive pairs. For example, the diagram with 18 regions shown on the right of Figure 1 has 18 × 18 = 324 possible relationship pairs but only 51 positive ones. To keep the focus on positive samples, a weighted binary cross-entropy loss L_RD is applied as follows:

L_RD = -(1/N_φ) Σ_i Σ_j [w^+ y_i,j^r log ŷ_i,j^r + w^- (1 - y_i,j^r) log(1 - ŷ_i,j^r)], ŷ_i^r = σ(s_i^r),

where N_φ denotes the number of questions within DMC, w^+ and w^- denote the weights of positive and negative relationship pairs respectively, y_i^r ∈ {0, 1}^{μ²} denotes the labels of relationship pairs within d_i, ŷ_i^r ∈ [0, 1]^{μ²} denotes the predicted probabilities of relationship pairs being positive, μ² denotes the number of possible relationship pairs within d_i and σ denotes the sigmoid function.
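A sketch of the weighted loss over one diagram's pair scores, using the reported weights w+ = 1.5 and w− = 1 from the implementation details; averaging over diagrams is omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_bce(scores, labels, w_pos=1.5, w_neg=1.0):
    """Weighted binary cross-entropy over the relationship pairs of one
    diagram; up-weighting positives counters the heavy class imbalance."""
    p = sigmoid(scores)
    loss = -(w_pos * labels * np.log(p)
             + w_neg * (1 - labels) * np.log(1 - p))
    return loss.mean()

scores = np.array([2.0, -2.0, 0.0])   # raw pair scores s_r (flattened)
labels = np.array([1.0, 0.0, 1.0])    # overlap-derived weak labels
loss = weighted_bce(scores, labels)   # ~0.45
```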
3) Text Understanding (TU): Learning a deep understanding of t_i is also important for answering DMC questions. Hence, the text understanding module pre-trained on TM, with the same setting as in the above subsection, is applied to learn text representations. For simplicity, WSTQ applies this module to learn the joint representations e_i,m ∈ R^x of t_i, q_i and a_i,m as follows:

e_i,m = TU(t_i, q_i, a_i,m),

where TU is RoBERTa.

4) Information Fusing: Attention mechanisms are widely used to obtain attended representations of diagrams. For example, top-down [15] and question-guided attention mechanisms [34] are used to learn globally attended image representations, which can improve performance. However, attention mechanisms reduce TQA performance in WSTQ. Therefore, our method weights each region equally and obtains the global diagram representation d_i^α ∈ R^x by summing the representations of the regions.
To obtain the multi-modal fusion representations f_i ∈ R^{|A_i|×x} of e_i ∈ R^{|A_i|×x} and d_i^α ∈ R^x, WSTQ applies the Hadamard product to fuse them. The mentioned steps can be denoted as follows:

d_i^α = Σ_k d_i,k,  f_i = e_i ⊙ d_i^α,

where d_i is the learned diagram representation with μ regions and |A_i| denotes the number of candidate answers of q_i.

5) DMC Optimization: WSTQ uses an FC layer to predict the scores of candidate answers s_i^a ∈ R^{|A_i|} as follows:

s_i^a = f_i W_a,

where W_a ∈ R^x denotes the learned weight matrix. WSTQ regards DMC as a multi-class classification and applies the multi-class cross-entropy loss L_DMC to optimize it:

L_DMC = -(1/N_φ) Σ_i y_i^a log ŷ_i^a, ŷ_i^a = softmax(s_i^a),

where y_i^a ∈ {0, 1}^{|A_i|} denotes the answer label, ŷ_i^a ∈ [0, 1]^{|A_i|} denotes the probabilities of candidate answers belonging to their corresponding classes and softmax denotes the softmax function.
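The fusion-and-scoring pipeline can be sketched as below: the text representation of each candidate answer is fused with the global diagram vector via the Hadamard product (numpy broadcasting), then mapped to a score. All tensors are random toy stand-ins for the learned representations.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dmc_scores(e_i, d_alpha, W_a):
    """f_i = e_i ⊙ d_alpha (broadcast over candidates), then a linear
    layer maps each fused vector to an answer score."""
    f_i = e_i * d_alpha  # (|A|, x) ⊙ (x,) -> (|A|, x)
    return f_i @ W_a     # (|A|,) candidate-answer scores

rng = np.random.default_rng(2)
n_ans, x = 4, 8
e_i = rng.standard_normal((n_ans, x))  # joint text reps per candidate
d_alpha = rng.standard_normal(x)       # summed region representations
W_a = rng.standard_normal(x)           # learned scoring weights
probs = softmax(dmc_scores(e_i, d_alpha, W_a))
```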

6) Multitask Optimization:
To optimize DMC and RD simultaneously, the weighted sum L_MTL of L_DMC and L_RD is applied as follows:

L_MTL = L_DMC + λ L_RD,

where λ denotes the weight that adjusts L_RD.
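The combined objective is a one-liner; λ = 0.1 is the value reported in the implementation details.

```python
def multitask_loss(l_dmc, l_rd, lam=0.1):
    """L_MTL = L_DMC + lam * L_RD (lam = 0.1 in the reported experiments)."""
    return l_dmc + lam * l_rd

total = multitask_loss(1.0, 2.0)  # 1.0 + 0.1 * 2.0 = 1.2
```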

V. EXPERIMENTS
In this section, we first describe the experimental setup, including the evaluation datasets and implementation details. Then, the results of each subtask within CK12-QA and AI2D are discussed. Third, we introduce the ablation studies on CK12-QA. Finally, we show the results on various types of questions such as how and what.

A. Experimental Setup 1) Datasets and Evaluation Metrics:
To the best of our knowledge, existing TQA methods except ISAAQ [12] are evaluated only on CK12-QA. Following [12], we evaluate WSTQ not only on CK12-QA [9] but also on AI2D [8], which contains textbook (diagram) questions. Specifically, CK12-QA is developed from middle school curricula covering life science, earth science and physical science. It is split into a training split with 666 lessons, a validation split with 200 lessons and a test split with 210 lessons. AI2D is developed from grade school curricula and consists only of diagram questions. The detailed statistics for each split of these datasets are shown in Table I. In CK12-QA, NDTF, NDMC and DMC have 3,490, 5,162 and 6,501 training questions respectively. In AI2D, DMC contains 7,824 training questions; note that AI2D only contains DMC questions. Following previous works [9], [12], [13], we use accuracy to evaluate our method.
2) Implementation Details: We introduce the implementation details of each module within WSTQ as follows. In DU, the pre-trained ResNet-101 backbone is fine-tuned to learn the x = 1024 dimensional representation of each region within diagrams. In RD Optimization, WSTQ applies a YOLO detector fine-tuned on AI2D [8] with an initial learning rate of 1e-4 to detect regions within diagrams. Our method applies w+ = 1.5 and w- = 1 to optimize relation detection. RoBERTa-large [37] is used as the text understanding module, which is first fine-tuned on TM and then fine-tuned on the subtasks of CK12-QA. Our method uses maximum input sequences of 64 tokens for NDTF and 180 for NDMC, DMC and TM. In Multitask Optimization, λ = 0.1 is used as the weight of L_RD. Our method is trained for 6 epochs with the Adam optimizer [43] using a linearly decayed learning rate with warm-up. We select an initial learning rate of 1e-5 for NDTF, 2.5e-6 for NDMC and DMC, and 1e-6 for TM. A dropout value of 0.1 is chosen to avoid over-fitting. We implement WSTQ in PyTorch and run our code on one NVIDIA Tesla V100 card.
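The warm-up-then-linear-decay schedule mentioned above can be sketched as follows. The warm-up fraction is our assumption for illustration; the paper does not report it.

```python
def lr_at_step(step, total_steps, base_lr=1e-5, warmup_frac=0.1):
    """Linearly warmed-up, linearly decayed learning rate, a common
    schedule for fine-tuning RoBERTa-style models.  The warm-up
    fraction is an assumed hyper-parameter, not reported in the paper."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # ramp from 0 up to base_lr during warm-up
        return base_lr * step / max(1, warmup_steps)
    # then decay linearly from base_lr down to 0
    remaining = total_steps - warmup_steps
    return base_lr * max(0.0, (total_steps - step) / max(1, remaining))

# peaks at the end of warm-up, reaches zero at the end of training
peak = lr_at_step(10, 100)   # == base_lr
```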

B. Results on CK12-QA 1) Comparison With SOTA Baselines:
We compare WSTQ with the previous SOTA methods on the CK12-QA validation and test splits. We select XTQA [13], RAFR [34] and ISAAQ [12] as baselines because the other works introduced in Section II lack results on the test split and their authors have not released their code. RAFR analyzes the dependencies between the text within diagrams to build visual graphs and then applies dual attention to predict answers. It obtains the best validation performance among models without pre-training and fine-tuning. XTQA achieves the best results on the test splits under the same comparison conditions. However, their results are rather modest. ISAAQ achieves the current SOTA results based on fine-tuning large pre-trained models, training on large datasets and ensemble learning. Please see the details in Section II.
We select three ISAAQ versions, ISAAQ_IR, ISAAQ_NSP, and ISAAQ_NN, that are trained only on CK12-QA to compare fairly with our method. Please see the details of IR, NSP and NN in Section IV-B. We run the code of ISAAQ and WSTQ three times on the same machine with random seeds. The best result of each run is used to compute the average and standard deviation. Table II shows the main results on the CK12-QA validation and test splits. We can see that WSTQ_IR significantly outperforms the current SOTA method ISAAQ_IR, improving the accuracy on all questions of the test split from 47.78% to 52.61%. The improvement is observed consistently for the other versions of WSTQ, e.g., WSTQ_NN outperforms ISAAQ_NN by 5.02% on all questions of the test split. Our method performs best on all subtasks, especially on DMC. It can be seen that WSTQ/ISAAQ with different text retrieval methods have significantly different performance, which demonstrates the importance of retrieving the text most relevant to questions. Traditional IR methods such as ElasticSearch may have the best retrieval performance. We will investigate how to retrieve more accurate text in the future, which may improve TQA performance substantially. We can also see that the generalization ability of WSTQ and ISAAQ on DMC is slightly weaker than that on NDTF and NDMC, which may be caused by the difficulty of diagram understanding and the different data distributions between splits. For the former, explicit relations between regions, as in visual relation detection [41], [42], [44], may improve diagram understanding. For the latter, fine-grained attention may enhance the reasoning ability to overcome the data shift [45], [46]. Furthermore, we conduct a pair-wise significance test (paired t-test) between WSTQ* and ISAAQ* on each subtask. We can see that WSTQ is significantly better than ISAAQ (p ≤ 0.05) on all subtasks except NDTF within the test split. This demonstrates the effectiveness of our method.
We can also see that the results of pre-training (fine-tuning) based methods such as our method and ISAAQ are better than those of RAFR and XTQA, which are trained from scratch without pre-training and fine-tuning. We can conclude that large pre-trained models bring significant improvements on specific tasks with limited data.

C. Results on AI2D
1) Comparison With SOTA Baselines: We also compare WSTQ with the previous SOTA methods on AI2D [8]. There are only a few works on TQA, and most of them conduct experiments only on CK12-QA. DQA-NET [8], the first work on this dataset, answers DMC questions by reasoning over diagram parse graphs. It does not depend on pre-training and fine-tuning and its results are rather modest. ISAAQ [12] is the current best-performing method on this dataset as well. We choose ISAAQ trained only on AI2D for a fair comparison. AI2D contains no textual context and does not require information retrieval. Therefore, there is only one version of our method and of ISAAQ. We use the same settings as on CK12-QA to obtain the results on AI2D.

2) Results: Table III shows the accuracy on the validation and test splits of AI2D. We can see that WSTQ achieves the new SOTA performance, significantly improving the accuracy on the test split by 4.12%. We also conduct a pair-wise significance test (paired t-test) between WSTQ and ISAAQ. It can be seen that our method is significantly better than ISAAQ on the AI2D validation and test splits (p ≤ 0.05). The results also show that our method and ISAAQ significantly outperform DQA-NET. In summary, WSTQ pushes forward the SOTA results on two public datasets, demonstrating its effectiveness.

D. Results on Diagram Question Answering
To further demonstrate the efficacy of WSTQ, we compare our method with UNITER [28] and ClipBERT [29], which are state-of-the-art models in VQA and video question answering respectively. We have introduced them in Section II. We do not use their pre-trained models to conduct the experiments for two reasons: (1) TQA is different from VQA and video question answering.
(2) All the methods are trained only on the CK12-QA and AI2D training splits respectively. Owing to their multimodal specificity, UNITER and ClipBERT cannot perform textual question answering. Therefore, we only conduct experiments on diagram question answering.
We modify UNITER and ClipBERT so that they can adapt to TQA: (1) Owing to the various objects in diagrams and the lack of object-class annotations in TQA datasets, object detection models such as YOLO [10] achieve high localization performance but very low classification performance. Therefore, we use ResNet to learn the representations of the objects detected by YOLO [10]. (2) We modify the classification layer to adapt to TQA. (3) We concatenate questions, the closest sentences, and each candidate answer as textual tokens. (4) UNITER and ClipBERT cannot converge on the limited training data. Therefore, we reduce the number of hidden layers in the BERT encoder to 1. Table IV shows the experimental results on the CK12-QA and AI2D test splits. The CK12-QA results of all methods are the average accuracy of their IR, NN, and NSP versions. As can be observed, WSTQ surpasses all of the baselines, indicating its efficacy. The result further demonstrates that the relationships between objects play a key role in diagram understanding and TQA. In addition, although our method requires more supervision than the other methods, the supervision of TM and RD is obtained automatically from the results of intermediate procedures in the TQA paradigm, whereas existing methods [8], [12], [28], [29] only take the output of previous steps as the input of the next steps, not making full use of these intermediate results.

E. Results on Various Types of Questions
Previous works [14], [34] only reported experimental results on NDTF, NDMC and DMC, making their model analyses less detailed. To further analyze our model, we classify the questions of DMC and NDMC into 8 categories following [47]: what, how, which, where, when, who, why and other. The questions are classified by checking whether the above-mentioned class labels appear in them. We believe that the results on these various types of questions provide a comprehensive analysis of WSTQ. For example, achieving high performance on when, why, where and how questions usually requires high-level reasoning ability [47]. We do not classify the questions of NDTF because they do not contain the above-mentioned labels.
We show the experimental results in Figure 3. Concretely, Figures 3a and 3b show the detailed NDMC and DMC accuracy within CK12-QA respectively. The DMC subtask does not contain when, who and why questions. It can be seen that WSTQ obtains better performance on all types of questions within NDMC than within DMC, which demonstrates that multi-modal question answering is more challenging than textual question answering. We can see that the accuracy of WSTQ on how questions is not as good as that on what questions within CK12-QA, because how questions may necessitate high-level multi-hop reasoning ability. Which questions, such as "which letter denotes the mitochondrion in this diagram," usually require models to have a deep semantic understanding and word-region alignment. Figure 3c shows the detailed DMC accuracy within AI2D. We can see that WSTQ achieves good results on all types of questions within the test split, demonstrating its strong generalization and reasoning ability. Moreover, our method obtains better accuracy on AI2D than on CK12-QA. This may be caused by the following reasons: (1) Achieving high performance on questions within CK12-QA usually requires external knowledge, which may not be necessary for questions within AI2D. (2) CK12-QA and AI2D are developed from middle and grade school curricula respectively, which means questions in the former are more challenging than those in the latter. The accuracy on how questions is slightly worse than on other types of questions, similar to the situation on CK12-QA.

F. Ablation Study
In order to further analyze WSTQ, we carry out the ablation studies shown in Table V.

2) W/O TM:
The text understanding module is pre-trained on a binary classification TM task. It may fit NDTF better and contribute more there than on NDMC because the former is also a binary classification task. Indeed, we can see that the decrease on NDTF is larger than that on NDMC. This setting also demonstrates the effectiveness of regarding information retrieval results as supervision to develop TM.
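The weak TM supervision built from information retrieval can be sketched as follows; this is a minimal illustration under our assumptions (the function name, and the choice of top-ranked paragraphs as positives and bottom-ranked ones as negatives, are ours, not the paper's exact recipe):

```python
# Hypothetical sketch of building weakly supervised TM pairs from
# information-retrieval results: the top-ranked retrieved paragraph
# forms a positive (matching) pair with the question, while
# low-ranked paragraphs form negative pairs. Labels are noisy by
# construction, which is exactly the weak supervision the TM task uses.
def build_tm_pairs(question, ranked_paragraphs, num_negatives=1):
    """ranked_paragraphs: paragraphs sorted by retrieval score, best first.
    Returns (question, paragraph, label) triples; label 1 = match."""
    pairs = [(question, ranked_paragraphs[0], 1)]          # top hit -> positive
    for para in ranked_paragraphs[-num_negatives:]:        # bottom hits -> negative
        pairs.append((question, para, 0))
    return pairs
```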

3) W/O RD:
The DMC accuracy drops from 50.92% to 49.08%, close to the performance achieved by WSTQ w/o diagrams, which demonstrates the effectiveness of RD for the semantic representations of diagrams and the effectiveness of regarding object detection results as supervision to develop RD. Combining the first setting with this one, we can conclude that current diagram understanding leaves room for improvement. For example, learning explicit relations and building diagram graphs [48] under different relations may be effective.
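The overlap-based construction of RD supervision can be sketched as follows; this is our illustrative reconstruction (helper names and the axis-aligned box convention are assumptions), where two detected regions form a positive relation pair if their bounding boxes overlap and a negative pair otherwise:

```python
# Hypothetical sketch of the weak RD supervision: every pair of
# detected regions is labeled 1 (related) if their bounding boxes
# overlap, else 0. Boxes are (x1, y1, x2, y2) with x2 > x1, y2 > y1.
def boxes_overlap(a, b):
    """True if two axis-aligned boxes share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def build_rd_pairs(boxes):
    """Label every region pair: (i, j, 1) if overlapping, else (i, j, 0)."""
    pairs = []
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            pairs.append((i, j, int(boxes_overlap(boxes[i], boxes[j]))))
    return pairs
```

As with TM, these labels are incompletely accurate (overlap does not always imply a semantic relation), which is why the paper treats them as weak rather than gold supervision.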

4) Freezing DU:
We freeze the diagram understanding module to explore whether the RD loss in Eq. (5) can be used to optimize other modules. This loss is specifically designed to optimize the diagram understanding module, allowing WSTQ to learn effective semantic representations of diagrams. In this setting, the variant of our method performs RD while freezing DU, which means the RD loss is forced to optimize the other modules. This variant of WSTQ outperforms the one without RD by 1.12%, which supports our argument.
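The freezing mechanism in this setting can be illustrated with a minimal framework-agnostic sketch; this is our assumption of the setup (mirroring the `requires_grad` convention of common deep learning frameworks), not the released code:

```python
# Minimal sketch of the "Freezing DU" ablation: parameters of the
# frozen module are flagged so the update step skips them, and the
# RD loss can therefore only change the remaining modules.
class Param:
    def __init__(self, value):
        self.value = value
        self.requires_grad = True

def freeze(params):
    """Mark a module's parameters as frozen (no gradient updates)."""
    for p in params:
        p.requires_grad = False

def sgd_step(params, grads, lr=0.1):
    """Apply a gradient step, skipping frozen parameters."""
    for p, g in zip(params, grads):
        if p.requires_grad:
            p.value -= lr * g
```

In a framework such as PyTorch the same effect is obtained by setting `requires_grad = False` on the DU parameters before training, so the shared RD loss back-propagates only into the unfrozen modules.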

G. Case Study
We present case studies for our method. Figure 4 depicts a qualitative case for each TQA subtask in order to intuitively show the strengths of WSTQ_IR. WSTQ_IR and its variant, which neither pre-trains the text understanding module on TM nor trains the diagram understanding module on RD, use the same relevant text to answer the specific question but make very different predictions.
1) NDTF: As can be seen on the top left of Figure 4, the variant of WSTQ_IR may not fully comprehend the textual context and may make predictions solely based on text similarities. In contrast, our method predicts the answer B with a high degree of certainty, demonstrating that TM motivates WSTQ_IR to learn a deeper text understanding.
2) NDMC: On the bottom left of Figure 4, it can be seen that the variant predicts a close probability for each candidate answer, which shows that it may make predictions based on text similarity. By contrast, our method predicts very different probabilities for (A, D) and (B, C). Moreover, WSTQ_IR can still predict the answer C accurately, although there is a long distance between "felsic lavas" and "explosively". These results demonstrate that TM drives our method to develop deep understanding and summarization abilities for long text. The above cases also show the strength of considering the results of information retrieval as supervision to develop TM, despite the presence of some noise.
3) DMC: The variant of WSTQ_IR, which neither pre-trains the text understanding module on TM nor trains the diagram understanding module on RD, only considers separate region information as the representations of diagrams, resulting in incorrect predictions. Nevertheless, our method explicitly predicts the relationships between regions, such as seasons, for deep understandings of diagram semantics, making completely accurate predictions for the questions within DMC. This shows the strength of regarding the results of object detection as supervision to develop RD, despite the presence of some noise. In addition, this also intuitively shows that RD and TQA can enhance each other, i.e., the advantage of multitask learning.
In addition, we also present some failure cases in Figure 5 to analyze the weaknesses of our method in depth.

4) Undetected Regions:
The question on the left of this figure requires our method to first detect all the regions and grasp the relations between them, and then to make accurate predictions with the help of the extracted textual contexts. However, our method fails to detect region A (stem) and returns an incorrect answer. Therefore, improving the localization accuracy of object detection methods could improve TQA performance.

5) Unrelated Textual Contexts:
The question on the right of this figure requires our method to first understand the meaning of all detected regions and the relations between them, and then to predict the answer with the help of the extracted textual contexts. However, our method fails to extract the text most relevant to the question and introduces noisy information into the prediction stage. Similar to ISAAQ [12] and RAFR [34], our method extracts the textual context in a pipelined way, resulting in error accumulation. Therefore, we can apply answer supervision to optimize text extraction and alleviate this issue.
In summary, each procedure in the TQA paradigm has an effect on performance. TQA requires the system not only to grasp the relations between objects within diagrams but also to perform reasoning with the help of extracted textual contexts.

VI. CONCLUSION AND FUTURE WORK
In this paper, we propose a weakly supervised learning method for TQA called WSTQ, which regards the intermediate procedures that are essential for this task as supervision to develop the TM and RD tasks, and then uses them to drive itself to learn deep semantic understandings of text and diagrams respectively. More specifically, the TM task motivates WSTQ to learn a deep text understanding, while the RD task drives our method to take into account the relationships between regions, which are important for expressing diagram semantics. Extensive experiments and ablation studies demonstrate the effectiveness of WSTQ and the contribution of each module. We also report experimental results on various types of questions, such as where and when, to further analyze our method.
In the future, we will investigate the following directions. 1) We will explore how to generate textual attributes for each detected region and devise an attribute-word guided attention mechanism to learn more effective vision-language representations. 2) We will explore how to obtain more accurate relevant textual context, which is important for answering questions accurately. 3) We will explore how to detect the explicit relationships between regions and apply graph neural networks to learn diagram representations under specific relations.