SolveWords: An Algorithm for Automatically Solving Mathematical Word Problems with Machine Learning
AubinBenjamin
2019
<div>For a long time, a goal for researchers has been to automatically solve mathematical word problems (Upadhyay & Chang, 2017). Contained within the text of most mathematical word problems is all the information needed to solve it, meaning they can easily be solved without understanding the full implications of what words mean. Still, they combine both a task that is easy for humans, natural language parsing and understanding as well as advanced mathematics, and act as a measure of artificial intelligence (Clark & Etzioni, 2016).</div><div><br></div><div>A current standard for automatic word problem solvers advocates solving word problems by deriving templates and solving them (Upadhyay & Chang, 2017). Template equations are limited in scope, and can only solve equations similar to those it’s already seen - it’s not smart enough to do multi-step problems. Another published method, called ARRIS, uses verb categorization (Hosseini, Hajishirzi, Etzioni & Kushman 2014). Verb categorization methods fall prey to errors related to parsing errors, set completion, irrelevant information, and entailment, reaching accuaracy of only 70% in perfect scenarios (Hosseini et. al 2014). </div><div><br></div><div>New research shows that it’s possible to parse sentences with 92.1% accuracy (Kong, Alberti, Andor, Bogatyy, & Weiss, 2017) by using Sequence to Sequence models (Suskever, Vinyals & Le, 2017). I propose an algorithm that uses a sequence to sequence model, based on these recent advances in machine learning technology, to extract equations from word problems.</div><div><br></div><div>The results of this research could be generalized to enhance personal assistants like Apple’s Siri or the Google Assistant.</div><div><br></div><div>Materials</div><div>-----<br></div><div><br></div><div><div>The model was built with Keras (Chollet F., 2015) and using the Tensorflow backend (Abadi et al., 2015).</div><div><br></div><div>The training and validation dataset is a mix of problems from MAWPS (Koncel-Kedziorski et al., 2016) and Dolphin 18k (Huang, Shi, Lin, Yin, & Ma, 2016) and consists of about 5,000 problems when cleaned. The testing dataset is from Upadhyay S. & Change, 2016. All datasets are cleaned to only contain problems with correct equations that use numbers from the text.</div><div><br></div><div>Due to the small size of publically available word problem corpuses and the high cost of generating and labeling new ones, the model takes spaCy (Honnibal & Montani, 2017) word vectors as input to lessen training data requirements.</div><div><br></div><div>The model was trained on Google’s ML Engine - a public cloud platform for training neural networks built with TensorFlow.</div><div><br></div><div>Sympy is used to solve the generated equations, an open-source Python computer algebra system (Meurer et al., 2017).</div><div><br></div><div>The source of the algorithm is written in Python 2.7</div><div><br></div><div>All graphs and visuals created by the researcher using TensorBoard and Matplotlib (Hunter, 2007)</div></div><div><br></div><div><div>Design Process & Explanation</div></div><div>-----</div><div><br></div><div><div>The initial design for Solve Words was a rule-based process. It involved parsing a sentence using a natural language parser, classifying each token to create a tree (“classification“), then converting that tree through a series of defined steps to “reify” the tree into an equation ready to use with a computer algebra system (“reification”). This proved problematic: English is not static, even with nicely written word problems. Attempting to take a static route from tree to an equation is almost impossible to do well.</div><div><br></div><div>To address this difficulty the design shifted to combine the classification and reification stages using a neural network. Sequence to sequence models work well for machine translation (Wu et al., 2016), thus the design of the network is based on a sequence to sequence model.</div><div><br></div><div>The model consists of three sections: the encoder, decoder, and attention. The encoder section creates a tensor containing a representation of each word’s use in the equation. The attention section generates a score for each word’s importance, which is used to compute a weighted average of the encoded representation for each output. The decoder decodes the attention and outputs a set of values, each corresponding to either an operation, a variable, or a number from the input sentence.</div><div><br></div><div>Instead of writing equations in standard notation, a machine-friendly prefix notation is used instead. It reduces errors by assigning numerical indexes to variables in equations and can be represented with high information density.</div><div><br></div><div>Because of limitations on available data, the model is only able to generate equations with a length of fewer than 70 characters, using single-token numbers contained in the first 71 words of the problem. Pre-trained word vectors from SpaCy are used as inputs, cutting data requirements. </div></div><div><br></div><div>Results</div><div>-----</div><div><br></div><div>The best model scored 94.09% per-character accuracy on its testing dataset of problems. This translates to generating correct equations for 25.74% of the problems in the model’s training dataset and 12.34% of problems outside of the model’s training dataset.<br></div><div><br></div><div>Conclusion</div><div>-----<br></div><div><br></div><div><div>Computers struggle with understanding word problems. When compared to other models on real-world problems, SolveWords achieved 12.35% accuracy. In Huang et al. (2016), the highest accuracy solver scored about 15.9%, using a similarity-based model that may fail outside the scope of problems on which it has been trained. Since SolveWords processes the entire sentence, it should be more accurate tested against problems unlike those it has seen before compared to other models, but a lack of data makes it hard to verify this.</div><div><br></div><div>SolveWords achieved testing scores of 25.74% (and could obtain more with training time), an indicator that the model fits word problems well, but “overfits” (failed to generalize) on training data because of a lack of samples. In the future, it should be possible to achieve higher accuracy with this model by simply gathering more data to train with, ideally at least 30,000 problems (which was impossible due to budget constraints). Still, using limited publically available datasets, SolveWords is a state-of-the-art model for solving word problems.</div></div>