Classification and Generation of Grammatical Errors

The misuse of grammar is a common and natural nuisance, and a strategy for automatically detecting mistakes in grammatical syntax is warranted. This research defines and implements a unique approach that combines machine-learning and statistical natural language processing techniques. Several important methods are established: (1) the automated and systematic generation of grammatical errors and parallel error corpora; (2) the definition and extraction of over 150 features of a sentence; and (3) the application of various machine-learning classification algorithms to the extracted feature data, in order to classify and predict the presence of grammatical errors in a sentence.


INTRODUCTION
Digital communication lends prevalence and persistence to language in the written form: it comprises the documents we write, the e-mails we send, the instant messages we exchange, and the web pages we view. Naturally, there is a higher tendency for its misuse, especially in the form of grammatical mistakes, and there is good incentive to eliminate these errors wherever possible. It is highly desirable, then, to have an automated and feasible way of detecting grammatical errors where they exist. This research presents methods for automatically predicting the grammatical correctness of a sentence through machine learning [1].
Given sufficiently many examples (or evidence) of grammatically correct or incorrect constructions, it may be possible to discern a pattern between the two classes. Certainly, humans perform this kind of learning intuitively: grammar is acquired fluidly, through observing its use by native speakers [2]. Similarly, it is possible to have computational algorithms learn by example and make statistical observations about the frequency of patterns and features in data that may aid in predicting future events. This process is commonly referred to as machine learning [1]. It is the primary interest and focus of this research to develop a machine-learning method to detect and classify various forms of grammatical incorrectness, as learned from natural and synthesized example data.
In support of this approach, this research accomplishes several major goals: (1) the collection and generation of a suitable dataset of example sentences; (2) the definition and extraction of a set of features from the example data; and (3) the application of machine-learning classification algorithms to the extracted feature data in order to predict the grammatical correctness of a sentence.

NLP & RELATED RESEARCH
Natural language processing (NLP) is a modern domain of computer science at the intersection of artificial intelligence and linguistics [3,4]. Several important topics of NLP form the supporting theory of this research: processing textual data, parsing it into words and grammatical constructs, and analyzing and verifying its grammatical structure.
Many approaches to NLP rely on collections of examples of written text, referred to as a corpus (plural: corpora), in order to learn and make statistical inferences about an observed phenomenon [4]. There exist several well-known corpora for the English language, such as the Brown corpus, the Wall Street Journal corpus, the British National Corpus, and a hybrid of the former two: the Penn Treebank corpora [5].

Tagging
Tagging is the process of categorizing each word in a sentence into its part of speech (e.g. noun, verb, adjective). During the tagging process, a part-of-speech tag is assigned to each word from the set of possible tags for that word, as defined in a tag dictionary [5]. Rule-based methods for tagging rely on a set of defined rules in order to determine the part of speech of a given word [3]. In contrast, probabilistic approaches view tagging as a process of statistical inference [4], where each word and its corresponding part of speech can be assigned a probability based on how likely it is to occur in the corpus, in a given context. A particularly effective probabilistic approach applies the concept of maximum entropy [6,7] to tagging; it is able to capture and exploit contextual information from words and tags.
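To make the tagging step concrete, the following is a minimal sketch (not part of the toolchain described in this research) using Python and NLTK's default part-of-speech tagger on the example sentence from the next subsection; the model-download line and the exact tags shown are assumptions that depend on the installed NLTK version.

```python
import nltk

# One-time model fetch for NLTK's default (perceptron-based) tagger.
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = ["The", "election", "official", "gathers", "the", "ballots"]
print(nltk.pos_tag(tokens))
# Expected output (Penn Treebank tag set), e.g.:
# [('The', 'DT'), ('election', 'NN'), ('official', 'NN'),
#  ('gathers', 'VBZ'), ('the', 'DT'), ('ballots', 'NNS')]
```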

Parsing
Syntactic parsing of a sentence involves generating a parse tree from the constituents of a sentence [3]. Constituents are meaningful groups of words, formed from sub-constituents such as part-of-speech tags and phrasal groups. Probabilistic constituency parsing attempts to generate a tree that represents the most likely intended parse of a given sentence; for example, the sentence "The election official gathers the ballots" is expressible as a parse tree of nested noun and verb phrases. Statistical approaches comprise some of the leading techniques for parsing in natural language processing [8,9]. One approach is probabilistic context-free grammar (PCFG) [9], which learns probabilities of phrase-structure rules [10,8] from an annotated corpus. Another popular approach applies the concepts of maximum entropy to the task of parsing [11].
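As an illustrative sketch of PCFG parsing, NLTK's ViterbiParser returns the most probable parse of the example sentence along with its probability. The grammar below is a toy written by hand for this one sentence, not a grammar learned from an annotated corpus as described above.

```python
from nltk import PCFG
from nltk.parse import ViterbiParser

# Toy phrase-structure rules with hand-assigned probabilities;
# the rule probabilities for each left-hand side must sum to 1.
grammar = PCFG.fromstring("""
    S   -> NP VP      [1.0]
    NP  -> Det N      [0.6]
    NP  -> Det N N    [0.4]
    VP  -> V NP       [1.0]
    Det -> 'The' [0.5] | 'the' [0.5]
    N   -> 'election' [0.4] | 'official' [0.3] | 'ballots' [0.3]
    V   -> 'gathers'  [1.0]
""")

parser = ViterbiParser(grammar)
sentence = "The election official gathers the ballots".split()
for tree in parser.parse(sentence):
    tree.pretty_print()   # the most likely constituency parse
    print(tree.prob())    # its probability under the toy grammar
```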

Grammar Verification
Perhaps the most familiar and commonplace grammar and style checker can be found in the Microsoft Word document-processing software [12,13]. Increasingly, there are now several other proprietary, online grammar-checking systems, such as WhiteSmoke [14] and Ginger [15]; however, we limit our scope to non-proprietary research in grammar verification. LanguageTool, an English-language grammar checker [16], and GRANSKA, a Swedish-language grammar checker [17], are examples of grammar verification based on heuristics. More recent and adaptive approaches to the problem of grammatical verification apply machine-learning techniques to model and predict errors; two such approaches are discussed towards the end of this section.

LanguageTool
LanguageTool [16] is an open-source, rule-based grammar and style checker for the English language that performs checking via rule matching. Rules are developed for the rule base by considering a grammatical error and reducing it to its general form, expressed in its parts of speech.
In tests on a small corpus of collected errors, LanguageTool was compared with Microsoft Word 2000's grammar checker, detecting 42 errors in comparison to the 49 detected by Microsoft Word [16].

GRANSKA
GRANSKA is a grammar checker that targets specific errors common to the Swedish language [17] (although it can be adapted to other languages). Particular focus is given to three types of errors: split compounds (similar to compound nouns in English), noun-phrase disagreement (e.g. in number, tense, or gender), and subject-predicative disagreement (improper use of adjectives). Grammatical verification occurs by matching rules in its rule base against the sequence of generated tags.
Performance was tested by the authors on a test set containing 418 grammatical errors spanning news, science, and essays; GRANSKA achieved an overall 53% precision and 52% recall. In comparison with Microsoft Word 2000's Swedish grammar checker on the Svante corpus, GRANSKA performed at 82% precision and 63% recall, while Microsoft Word's performance was 47% and 66%, respectively [17].

Grammatical error prediction
During the course of this research, a related doctoral thesis entitled Grammatical error prediction [18] was published by the University of Cambridge; therein, Andersen describes a machine-learning approach to the binary classification of a sentence as grammatically correct or incorrect. Three principal tasks are required by the machine-learning approach: evidence gathering, feature extraction, and classification [18].
The Cambridge Learner Corpus (CLC) [19] is used as evidence data in Andersen's research; it is an error-annotated learner corpus¹ that contains both grammatically correct and incorrect sentences. The annotation scheme encodes various types of errors and their corrected forms.
Andersen defines a set of 10 features to be extracted from each sentence [18], which capture information about the words, tags, and parses in a sentence. The extracted features are used as training data for several machine-learning algorithms: Naive Bayes, Balanced Winnow, Maximum Entropy, and Support Vector Machines (SVM). However, detection rates depend greatly on the types and number of errors present in a sentence, as seen in the experimental results shown in Figure 2. Spelling errors and inflection errors (such as improper pluralization) result in the highest detection accuracies, between 77% and 79%; by contrast, errors resulting from improper verb conjugation or omitted words can be much more difficult to detect, with accuracies of 58% to 62%.

¹ A learner corpus primarily contains works by authors learning the language as a second language [20].
Figure 2: Classification accuracy by error type [18]

A particular limitation is the lesser performance of the system on conjugation errors, such as errors in the tense of a verb [18].
Andersen's results appear to corroborate that a machine-learning approach is at least viable for predicting grammatical errors, provided that there is a good set of examples of both grammatical correctness and incorrectness. Andersen notes that the set of sentence features defined is not exhaustive [18], and that further experimentation may seek to define more and better features.

Attempts to verify grammar
Prior to the appearance of Andersen's work, we published a similar machine-learning approach to grammatical verification via sentence classification [21]. In contrast to Andersen's approach (using bigrams and trigrams to define features), we defined a feature set derived from statistical natural language processing techniques, such as the probabilities associated with part-of-speech tags and parse trees.
Example data for the machine-learning process was extracted from the Open American National Corpus (OANC), a collection of written English text spanning several domains of writing, such as articles, magazines, travel guides, and non-fictional and fictional works [22,23]. Examples of grammatical incorrectness were procedurally synthesized from this corpus by injecting errors into the original sentences, specifically displacement errors, which occur when certain words are misplaced within a sentence. The approach we take to grammatical verification in this research extends earlier work, especially that of [21], and accompanies the thesis work in [24]; it is described in more detail in the following sections.

METHODOLOGY
The methods in this research focus on the prediction of grammatical errors and grammatical correctness using statistical natural language processing and machine-learning techniques. Succinctly, machine learning encompasses algorithms for pattern recognition and representation [1]; it is a sub-domain of artificial intelligence and, as described in Section 2, it leverages example data in order to learn and make predictions about the phenomenon of interest. Patterns and concepts learned from the example data are expressed by a model, which is capable of making predictions about future, unknown data.
In the context of this research, the primary concept of interest is grammaticality: determining whether a sentence is grammatically correct or incorrect. This section discusses three important methods in support of the machine-learning process: (1) gathering example data in the form of correct and incorrect sentences; (2) defining and extracting features from a sentence; and (3) performing machine-learning classification on the extracted feature data, using various algorithms. These three phases are summarized in Figure 3.

For the concept of grammatical correctness, an appropriate data set is comprised of sentences that exhibit grammatical correctness and grammatical incorrectness (manifesting one or more errors). Example 3.1 is a trivial data set constructed from just two sentences (grammatically correct and incorrect, respectively); the errors in the second sentence are the verb "is" disagreeing in number with "results", and the malformed "a winners".
Example 3.1.
The official results are quickly tallied and a winner is announced.
The official results is quickly tallied and a winners is announced.
There exist many corpora of overwhelmingly correct sentences; these are described in subsection 3.1. However, the same cannot be said for grammatically incorrect sentences: where they exist, they are either from non-public sources (such as the CLC) [25,26], or tend to be interspersed amongst correct sentences, making their extraction a difficult task. A novel approach taken in this research is to generate examples of grammatical incorrectness by synthesizing errors in preexisting sentences that are assumed to be correct. This approach to gathering example data is explained further in subsections 3.2 to 3.4.
Another important requisite for the machine-learning process is a set of defined features (or attributes) that represent interesting characteristics of the data [1]. Features may be defined over a sentence as a function of the underlying data, such as the words, tags, or parse tree. This research defines over 150 features for a sentence, derived from natural language processing techniques. The methods for extracting feature values are discussed in subsections 3.5 to 3.7.
Machine-learning algorithms operate on feature data (or attribute values) extracted from example data [1]. Concepts and patterns learned from the data are encapsulated in a model, in a representation dependent on the algorithm (for example, a set of rules, or a probability distribution). Machine-learning classification is used to generate models capable of predicting grammatical correctness, and the presence of grammatical errors, in a written sentence; this procedure is described further in subsection 3.8.

Grammatically correct samples
As discussed in Section 2, there exist several well-known English corpora, such as the Wall Street Journal corpus [27], the British National Corpus (BNC) [28], the American National Corpus (ANC) [22,29], the Brown Corpus [30], and others [5]. They consist of writings from multiple domains of English, including published articles, news wires, reports, stories, and other publications [22]. Provided that the published writings are largely free of grammatical errors, these collections serve as convenient sources of example data. While this does incur some degree of noise², we deem it acceptable, and moreover, the only systematic and automated way of obtaining example data for grammatical correctness.
The Open American National Corpus (OANC) [23] and the Manually Annotated Sub-Corpus (MASC) [31,32,33] are used in this research as sources of correct sentences. Both corpora are freely available derivatives of the American National Corpus (ANC) [22,29]; they are comprised of published works by native English speakers and span several domains of writing, including fictional and non-fictional works, newspaper articles, blogs, technical papers, travel guides, magazines, etc. The MASC is a manually annotated, 500,000-word subset of the OANC, which itself contains over 15 million words. We constrain these corpora to a subset that is representative of the most common uses of English, such as newspaper and magazine articles. The corpora used in this research can be retrieved from [34].

Grammatically incorrect samples
Obtaining sources of purely incorrect sentences proves difficult: where these sources are isolated (or at least isolatable), they tend to be rare and non-public (as is the case with the CLC); where grammatically incorrect sentences are publicly abundant, they tend to be interspersed amongst grammatically correct sentences, presenting the problem of extraction. For example, the International Corpus of Learner English (ICLE) [20] features collected writings by second-language learners of English, which can be expected to contain a much higher incidence of grammatical errors. Another example is the source of grammatical errors to be found in unmoderated comments on public forums or social media websites. In both of these examples, however, the challenge becomes extracting the incorrect sentences from the likely many grammatically correct ones, and filtering by hand is typically infeasible for constructing large corpora.
² A sample of 100 sentences taken from the Verbatim sub-corpus of the OANC [23] yields above 80% (±10%) grammatically correct sentences. Textual fragments such as titles, headings, figures, listings, and dates form part of this noise; though not treated in this research, it may be possible to reduce this noise to some degree through filtering and preprocessing techniques.

Generating grammatical errors
A more feasible approach to obtaining grammatically incorrect sentences is to generate them. Using natural language generation (NLG), sentences can be constructed with grammatical errors systematically injected into them, thus producing grammatically incorrect sentences. While it is possible to use NLG to construct malformed sentences, it still requires that some semantic context be provided (e.g. subjects, objects, actions). Rather than generate arbitrary prose, it is more convenient to modify preexisting sentences so that they become incorrect; this also has the advantage of retaining the semantics and naturalness of the original sentence as much as possible. Ambiguity can lead to situations in which errors in the intention of a sentence may not necessarily manifest as errors in its syntax; however, the impact of this type of noise is minimal³. Most importantly, a systematic method of generating errors provides a degree of control to the experimental design. An implementation of this method is discussed further in subsection 3.3.
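As a minimal sketch of the error-injection idea (the function name and interface below are illustrative, not CorpusTools' actual API), a displacement error can be synthesized by interchanging two adjacent tokens of an otherwise correct sentence:

```python
import random

def inject_displacement(tokens, rng=random):
    """Return a copy of `tokens` with one displacement error:
    two adjacent tokens interchanged at a random position."""
    if len(tokens) < 2:
        return None  # too short to displace anything; discard
    i = rng.randrange(len(tokens) - 1)
    out = list(tokens)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out

correct = "The official results are quickly tallied".split()
print(" ".join(inject_displacement(correct, random.Random(1))))
```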

Grammatically parallel corpora (GPC)
Generating grammatically incorrect sentences from preexisting natural sentences is not only convenient and feasible, but also especially useful for isolating the concept of grammatical incorrectness. Sentences which have been modified to contain grammatical errors typically retain most of their other linguistic properties; domain and semantics (such as the subjects, objects, and actions) tend to remain similar or unchanged (as is the case in Example 3.1). Thus, variations between the original sentences and their error-induced parallels are isolated to the presence of grammatical errors. For example, the pair of sentences in Example 3.1, while differing in syntax and validity, still largely retains much of the same meaning. Given a corpus of grammatically correct sentences, it is possible to generate a parallel corpus of grammatically incorrect sentences, similar in most ways except where it exhibits grammatical errors. In this research, we refer to a pair of corpora constructed in this way as grammatically parallel corpora (GPC)⁴. By modifying sentences from an existing corpus rather than artificially creating new sentences, the resulting corpus generally retains much of the same words, meaning, context, and domain as the original, thereby eliminating variations that may not necessarily be useful to the concept of grammatical correctness. Conceptually, a grammatically parallel corpus is the juxtaposition of grammatically correct sentences with the ways in which these sentences may become grammatically incorrect.

CorpusTools
Implementations of the methods discussed thus far are provided by CorpusTools [34], software developed during the course of this research to construct and preprocess corpora, especially grammatically parallel corpora. Provided a source of grammatically correct sentences, it offers an automated, systematic method of generating various types of grammatical errors, specifiable in type, number, and context; the error types it can generate are listed in subsection 3.3.1. Each type of error is constrained to a set of word types to which it may be validly applied. CorpusTools distinguishes between the following part-of-speech categories: noun, verb, determiner (or article), adjective, adverb, preposition, pronoun, punctuation, and "other".

Error selection & application
For each input sentence, a set of applicable error types is constructed based on the types of words present in the sentence. One or more error types are then selected and applied to the tokens in the sentence (or the sentence is discarded if no errors can be applied). Each error type is associated with a specifiable weight: an arbitrary positive integer that implies the desired frequency⁵ with which the associated error should appear in the generated error corpus. Errors are selected via weight-proportional probability: the probability of selecting an error type is proportional to its weight, relative to the weights of all other error types that are valid for the sentence.
Once an error type has been selected, a subset of applicable tokens is constructed from the sentence. A token is selected at random from this set, and the selected error type is applied to the token⁶. The token is then marked as consumed by the error-generation process and is not available for future selection (in the case where more than one error is being applied)⁷. The process of error selection and application iterates until a valid error is generated or until there are no more valid, unconsumed tokens remaining. If an error is generated successfully, the resulting sentence is retained and made available for further error generation, or it is output to the resulting corpus.
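The weight-proportional selection step can be sketched as follows; the names and weight values are hypothetical, and CorpusTools' actual algorithms are described in [24]:

```python
import random

# Hypothetical weights: larger values request a higher frequency of
# that error type in the generated corpus (not a statistical guarantee).
ERROR_WEIGHTS = {"insertion": 1, "omission": 1, "displacement": 2,
                 "number": 1, "tense": 2, "objectivity": 1, "case": 1}

def select_error_type(applicable, weights=ERROR_WEIGHTS, rng=random):
    """Choose one error type from `applicable` with probability
    proportional to its weight among the valid types."""
    valid = [(etype, weights[etype]) for etype in applicable]
    r = rng.uniform(0, sum(w for _, w in valid))
    cumulative = 0.0
    for etype, w in valid:
        cumulative += w
        if r <= cumulative:
            return etype
    return valid[-1][0]  # guard against floating-point rounding

# e.g. a sentence containing verbs and nouns but no pronouns:
print(select_error_type(["omission", "displacement", "tense", "number"],
                        rng=random.Random(1)))
```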
The result of the error-generation process, applied to a corpus of correct sentences, is an output corpus of grammatically incorrect sentences, modified with one or more errors, up to (or exactly) the number of errors specified. When taken together with the original corpus, the result is a grammatically parallel corpus.

Corpora
As mentioned in subsection 3.1, two freely available English corpora are used throughout this research as sources of grammatically correct sentences: the Open American National Corpus (OANC) [23] and the Manually Annotated Sub-Corpus (MASC) [31,35,32,33]; both are sub-corpora of the American National Corpus (ANC) [22,29]. Subsets were extracted from each corpus to ensure a manageable size (the full OANC contains over 15 million words), and to limit the domain to commonly used English, such as that found in magazine articles or newspapers. The specific subset corpus used in generating the results for this research is the Verbatim corpus; it contains over 580,000 words across a collection of issues of Verbatim magazine [23].

⁵ The weights merely imply the desired frequencies of error types in the resulting corpus; they do not statistically guarantee them. This is due to two reasons: (1) the pseudo-random nature of the error selection process; and (2) the possibility that a generated error may not result in a valid change to the sentence and is discarded. A constant seed can be specified for further control over the determinism of the pseudo-random process (defaulting to 1).
⁶ Algorithms for error application are described comprehensively in the complementing thesis work [24].
⁷ While it is conceivable to have multiple errors applied to a single token, the implementation refrains from this to avoid possibly undoing or reversing any previously applied errors.
CorpusTools was used with each of these corpora to generate a collection of grammatically parallel corpora, each exhibiting one or more of the error types defined in subsection 3.3.1, in varying quantities. The error corpora are generated in a systematic manner: for each error type, n different error corpora are generated, one for each error quantity from 1 to n errors per sentence. Thus, the total number of error corpora generated from each input corpus is t · n, where t is the number of error types. In this research, t = 7 (the seven error types defined in subsection 3.3.1) and n = 5 (up to five errors per sentence); this results in 35 grammatically parallel corpora generated per input corpus⁸.
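Concretely, the generation schedule enumerates one parallel error corpus per (error type, error quantity) pair; a sketch of the bookkeeping:

```python
# The seven error types of subsection 3.3.1 and up to five errors
# per sentence give t * n = 7 * 5 = 35 parallel corpora per input corpus.
ERROR_TYPES = ["insertion", "omission", "displacement",
               "number", "tense", "objectivity", "case"]
MAX_ERRORS_PER_SENTENCE = 5

jobs = [(etype, n)
        for etype in ERROR_TYPES
        for n in range(1, MAX_ERRORS_PER_SENTENCE + 1)]
assert len(jobs) == 35
```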
Generating corpora in this systematic manner allows results to be compared and categorized individually by corpus, error type, and number of errors per sentence, as presented in Section 4.

Extracting feature data
Machine-learning algorithms operate on datasets of values defined by features [1]. A feature (also known as an attribute) represents a piece of interesting information about an instance in the example data. The intent in defining a feature is to discover and encode information that may directly or indirectly contribute to learning the concept of interest. The example data in this research consists of sentences; the concept to be learned is grammatical correctness. Thus, a feature captures some information about a sentence that may have an impact on its grammatical correctness. For example, the parse probability of a sentence may constitute a feature; another feature may capture the average probability of the part-of-speech tags in the sentence; perhaps another may record the lowest tag probability and its associated word; and so on. A collection of such features forms the feature set, and each instance in the example data takes on a set of feature values.
A total of 152 sentence features are defined by this research. The values extracted by each feature assume one of two types: numeric (an integer or real-valued number) or nominal (a predefined value from a finite set, such as true/false, or noun/verb/article). Particular interest is given to defining features that are derived from statistical natural language processing techniques, especially those that associate probabilities with the decisions they make (such as tagging and parsing, described in Section 2). A comprehensive list of all 152 features can be found in the related thesis work [24].

GrammarTools
The GrammarTools software [34] provides an implementation of feature extraction, along with integration of several other useful NLP tools. It was developed during the course of this research in support of the methods in this section, and its implementation is further described in [24]. An important function of GrammarTools is to transform a corpus of input text into an output dataset of feature values, suitable for the machine-learning process⁹. In conjunction with the Weka data mining software [36], it serves as the primary platform for experimentation in this research.

Feature extraction
With the necessary tools in place, the next phase is the extraction of feature data to be used in the machine-learning process. GrammarTools provides a dataset tool [34] for the task of extracting feature data from corpora. It implements algorithms and calculations, especially those involving statistical natural language processing, that operate on a sentence in order to derive the values of each feature. Simple examples are the opennlpParseProb and stanfordParseProb features, which record the parse probabilities returned by the OpenNLP and Stanford parsers, respectively. Another example is the opennlpDeltaParseProbOmitMin feature, which records the change in parse probability when the word with minimum tag probability has been removed from the sentence. A more complex example is the avgDeltaVerbChangeOpennlpParseProb feature, defined by Equation 1, which calculates the average difference (delta) in the parse probability of a sentence (as obtained using the OpenNLP parser) that results from altering the form of each verb in the sentence such that it yields the highest parse probability for that sentence:

\[ \mathrm{avgDeltaVerbChangeOpennlpParseProb}(s) = \frac{1}{n}\sum_{i=1}^{n}\Bigl(\max_{f \in \mathrm{forms}(v_i)} P\bigl(s[v_i \to f]\bigr) - P(s)\Bigr) \tag{1} \]

where n is the number of verbs in the sentence s, v_i is the verb currently being altered, P(s) is the parse probability of s, and s[v_i \to f] denotes s with v_i replaced by the form f. Consider the following example of an incorrect sentence with improperly conjugated verbs:
Example 3.2.
A winner be announces.
In Example 3.2, the verb "be" will be altered to the form which yields the highest parse probability for the sentence (i.e. "is"), and the difference is taken between the parse probability of the original sentence and that of its altered form; likewise, "announces" is changed to "announced" and the difference is taken; the average difference is then calculated and used in various related features. Such features may serve to indicate possible errors in verb form: a positive average delta may suggest that better forms (or at least statistically more common forms) exist for one or more verbs within the sentence. Many other features are defined similarly, and leverage statistical natural language processing to extract information about a sentence that may be useful to the machine-learning process.
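A sketch of how such a feature value could be computed follows; the `parse_prob` and `verb_forms` callables stand in for the OpenNLP parser and a morphological generator, and are assumptions for illustration, not the actual GrammarTools interfaces:

```python
def avg_delta_verb_change_parse_prob(tokens, verb_indices,
                                     verb_forms, parse_prob):
    """Average gain in parse probability from re-conjugating each verb
    to its best-scoring form (in the spirit of Equation 1)."""
    base = parse_prob(tokens)
    deltas = []
    for i in verb_indices:
        best = base
        for form in verb_forms(tokens[i]):
            altered = tokens[:i] + [form] + tokens[i + 1:]
            best = max(best, parse_prob(altered))  # best form for verb i
        deltas.append(best - base)
    return sum(deltas) / len(deltas) if deltas else 0.0
```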
Given the large amounts of data from the various corpora, the process of extracting feature data was expedited by executing it in batch and distributing it across available computing hardware¹⁰. Time for completion varies depending on the size of the dataset: from several minutes to hours for smaller corpora (such as the MASC), to several days or weeks for larger corpora (such as the OANC). The extracted feature data from each error corpus is then post-processed (removing any invalid values, such as those resulting from parsing failures) and merged with the feature data from its original corpus. This results in a collection of feature datasets that each contain both grammatically correct and incorrect instances, categorized by corpus, error type, and error quantity. The feature datasets, in their entirety, can be retrieved from [34].

⁹ Feature data generated by GrammarTools can be encoded in either ARFF or CSV file format; these are particularly well suited for use with Weka [36].
¹⁰ The typical computer used for this task was equipped with an Intel i7 quad-core processor and 8 gigabytes of main memory.

Machine-learning classification
The Weka data mining software [36] is used as an experimentation platform to investigate the performance and accuracy of several well-known classification algorithms on the feature data discussed in the previous section. This research leverages several classification algorithms, with implementations provided through Weka¹¹.
Stratified ten-fold cross-validation [1] is used to ensure the relevancy and applicability of the results. This process entails dividing the training data into k subsets or "folds" (k = 10); on each iteration, training data is made available from all folds except one (which is held out), and the trained model is tested on the held-out fold. The process iterates k times, once for each fold, and the results are averaged.
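The experiments in this research use Weka, but the cross-validation protocol itself is library-agnostic; an equivalent sketch in Python with scikit-learn and synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in for the extracted feature data: 152 features per sentence,
# with labels 1 = grammatically correct and 0 = incorrect.
X, y = make_classification(n_samples=2000, n_features=152, random_state=1)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(f"mean accuracy over 10 folds: {scores.mean():.3f}")
```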
The effects of any class imbalance that may occur in some datasets are mitigated by the Synthetic Minority Over-sampling Technique (SMOTE) [37], along with cost-sensitive classification and boosting via AdaCost [38,1].
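For illustration, SMOTE as implemented in the imbalanced-learn Python package (a different implementation from the Weka-based tooling used in this research) rebalances a skewed dataset by synthesizing minority-class instances:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# A deliberately imbalanced stand-in dataset (10% minority class).
X, y = make_classification(n_samples=2000, n_features=152,
                           weights=[0.9, 0.1], random_state=1)
print("before:", Counter(y))

X_bal, y_bal = SMOTE(random_state=1).fit_resample(X, y)
print("after: ", Counter(y_bal))  # classes now equally represented
```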
Due to the large size and number of corpora, it was necessary to expedite the classification process through distribution and automation. Weka provides a Java library, which has been integrated within the GrammarTools suite of software in order to automate and distribute classification tasks across available computing hardware¹⁰. Time for completion varies, and depends on the size of the feature data as well as the computational complexity of the classification algorithm; simple algorithms (such as OneR) applied to smaller datasets (such as the MASC) completed within minutes to hours, while more computationally rigorous algorithms (such as Logistic) applied to larger datasets (such as the OANC) required days or weeks to complete (or, in some cases, did not complete at all).
The results of the completed classification tasks are collected and analyzed; classification accuracies for each algorithm are organized by error type and number of errors per sentence, and presented in the following section.

RESULTS & DISCUSSION
In this section we present the results of the machine-learning classification process discussed in subsection 3.8, using the algorithms listed in Table 2 and applied to the datasets described in subsection 3.4. For succinctness, only the results of the largest and most thorough dataset, Verbatim, are given here; a more comprehensive treatment covering all corpora and classifiers used can be found in the complementing thesis work [24].
The results are presented in bar-chart format, displaying classifier accuracy versus the quantity of errors per sentence, categorized by the type of grammatical error present (as in subsection 3.3.1). Accuracy is given as a percentage, and indicates the rate of success with which the top-performing classifier (using stratified ten-fold cross-validation) is able to predict the grammatical correctness of a sentence. The accuracy appears above each plotted bar value, rounded to the nearest percentage, along with the name of the classifier used to achieve it; two numbers appear directly below this, corresponding to the number of grammatically correct (upper number) and incorrect (lower number) instances present in the training data. The names of some classification algorithms have been abbreviated to fit within the plots; the displayed names and their corresponding algorithms are listed in Table 2. Bar values plotted with a dashed outline indicate results obtained from an imbalanced dataset, to which boosting or cost-sensitive classification techniques have been applied, as described in subsection 3.8.

The Verbatim dataset is a subset of the OANC, featuring articles on language and linguistics from Verbatim magazine [23]. The corpus contains approximately 23,000 original sentences, from which errorful sentences were generated (in most cases, doubling the dataset in size). The following figures show the accuracy of the best-performing classifiers in predicting grammatical correctness in the presence of various errors, as the quantities of these errors increase from 1 to 5 per sentence. Figure 4 shows between 64% and 81% classification accuracy for displace errors; Figure 5, between 55% and 70% for omit errors; Figure 6, between 75% and 93% for tense errors; Figure 7, between 54% and 72% for number errors; Figure 8, between 67% and 99%¹² for object errors; and Figure 9, between 60% and 76% for insert errors.

Discussion of classifier performance
An immediate observation is the tendency for classification accuracy to improve as the number of grammatical errors within a sentence increases; this appears to be true for all error types. The improvements in accuracy are in many cases modest (between 3% and 10%); in some cases, such as tense, displace, and insert errors, this effect becomes less pronounced towards 5 errors per sentence, while in other cases, such as number and omit errors, the gains are incremental. However, some noticeable irregularities occur from imbalanced data, such as in Figure 8 (plotted with a dashed outline).
It is also evident that several classifiers routinely achieve the best performance, such as SimpleLogistic (SL) and sequential minimal optimization (SMO), and appear to perform comparably well in different contexts, across error types and quantities. Results data in tabular form for each corpus and error type can be found in [24].
In summary, we achieve between 75% and 93% accuracy in determining the grammatical correctness of sentences in which tense errors occur (ranging between 1 and 5 errors per sentence); between 64% and 81% for displace errors; 60% to 76% for insert errors; 55% to 70% for omit errors; 54% to 72% for number errors; and between 67% and 99% for object errors¹². Thus we find that sentences exhibiting tense and displace errors are the most accurately and reliably detected for grammatical incorrectness, while sentences containing omit and number errors are more difficult to detect; further, grammatical incorrectness becomes more reliably predictable (beyond 70%) as the number of errors per sentence increases.

Comparison with similar methods
Grammar verification is an open-ended challenge, and varying approaches can assume different goals in the types of errors they purport to detect. Perhaps the closest comparative findings are from the research outlined in subsection 2.5.1. Comparing Figure 2 to the results presented in this section: for Missing word errors, detected with 62% accuracy in the compared research, this research achieves between 55% and 70% accuracy for the comparatively similar omit errors, for a similarly sized corpus (Verbatim); where Unnecessary word errors are detected with 60% accuracy, this research achieves between 60% and 76% for similar insert errors; for Incorrect tense errors detected with 58% accuracy, this research achieves between 75% and 93% accuracy for sentences containing tense errors; where Word order errors are detected with 61% accuracy, this research achieves between 64% and 81% in the case of similar displace errors; finally, where the overlapping error types Wrong form, Agreement, Derivation, and Incorrect inflection are detected with between 68% and 77% accuracy (collectively), this research achieves between 54% and approximately 80% accuracy¹³ (collectively) among similar number and object errors.¹⁴

Comparing classification accuracy by error quantity, Andersen's work detects single errors with an accuracy of 49%, and towards 5 errors, an accuracy of 72% [18]; the results in this research achieve an accuracy of 63% in detecting grammatical correctness for sentences containing a single error¹⁵, and up to approximately 81% for five errors. However, it is important to note that this comparison is made against a single classification model in the compared research, in contrast with several independent models for different error types in this research. While the results and comparisons in this section do not have exact parity with other research, we believe that this research achieves some noted improvements in accuracy over the nearest methods.

¹³ The upper bound is corrected for any potential bias introduced by underrepresentation in the Verbatim dataset for object errors; it disregards the outlying value of 99%.
¹⁴ The compared research does not specify or constrain the number and type of errors per sentence for the accuracies reported above [18]; hence a range is given (for error quantities 1 through 5) when comparing with the results in this research.
¹⁵ This is obtained by averaging the classification accuracies for a single error across all error types from the Verbatim corpus.

CONCLUSIONS
The grammatical structure of human natural language uniquely shapes the understanding and exchange of information, especially in the digital and written form, and its common misuse warrants methods for automatically detecting mistakes in grammatical syntax. The work accomplished in this research seeks to address this challenge, and in doing so defines and implements a unique approach that combines machine learning and statistical natural language processing. Several core methods are established: (1) the automatic generation of grammatical errors, and the formation of grammatically parallel corpora; (2) the definition and extraction of over 150 features from a sentence; and (3) the application of the extracted feature data in machine-learning classification.
Using these methods, this research finds it possible to predict grammatical incorrectness in sentences containing various types of synthesized grammatical errors, and reports classification accuracies between 60% and 93% over a large corpus (depending on the types and quantities of errors present). Tense and displace errors are the most reliably detected, while omit and number errors are more difficult to detect. A corollary is that, regardless of error type or corpus, all errors become more robustly detected (above 70% accuracy) as the error quantity approaches or exceeds 5 errors per sentence.

Limitations
The research is conducted with a discrete set of grammatical errors (subsection 3.3.1) that are constructed artificially; while these errors aim to represent many of those that occur commonly in English-language misuse, it is important to note that they are only a subset of what may render a sentence grammatically incorrect. Further, models are trained on a per-error basis; there is little reporting on the performance of a single model applied to various error types, or across different corpora.

Future Work
New features may be defined to improve data separation and classifier performance, as well as features that address sentences on a semantic level. Feature-selection strategies may help to reduce the feature set to only those features which are most predictive.
Further research should explore the inter-domain and cross-domain applicability of trained models, to determine how well models trained on one corpus apply to another.
Finally, the results of this and future research should be realized into a usable grammar verification system for error detection and correction.

Figure 3: A machine-learning approach to grammar verification.

3.3.1 Error Types

insertion: insertion/duplication of a token
omission: removal of a token
displacement: interchanging of adjacent tokens
number: alteration of noun plurality
tense: alteration of verb tense
objectivity: alteration of pronoun objectivity
case: alteration of noun capitalization

Table 2: Classification algorithms and their abbreviated labels.