Evaluation of neural networks applied in forensics; handwriting verification example

ABSTRACT There is growing interest in the possible applications of artificial neural networks in forensics. Extensive research has been published on this subject, especially in the field of handwriting examination; however, it seldom discusses forensic and legal standards, which are the most fundamental conditions for the acceptance of artificial neural networks in forensics. From the perspective of handwriting analysis, we exemplify and systematize general methods for the informal falsification of artificial neural networks applied to the verification of offline handwritten documents' authorship. These approaches, aimed at objectively exposing and proving models unreliable, should be generally effective against applications of neural networks in forensics.


Introduction
Artificial neural networks (ANNs) are one of the most popular machine learning methods utilized for automatic problem-solving. In essence, they are mathematical functions, i.e. relations over sets, where one set consists of data describing the problem (in numerical form) and the other set consists of solutions to this problem (in numerical form), while the problem is solved by the network function mapping one set onto the other. Network functions consist of many simpler mathematical functions (sequences of weighted multiplications, summations and nonlinearities, called neurons), which are highly adaptive due to the many parameters interconnecting them (numerical weights, called synapses), allowing such networks to be trained through a process of automatic parameter optimization (gradient descent). Overall, neural networks are organized into hierarchically ordered layers of neurons interconnected by modifiable synapses. Thus, to solve a problem, one needs to find a network function that approximately maps one set onto the other. ANNs are well suited to solving problems because, under certain conditions, they are universal function approximators 1. However, ANNs are hardly interpretable as to their decision patterns and the decision-relevant features of the data, hence they are referred to as black boxes 2; it is therefore crucial to establish a methodology for their evaluation and procedures for their practical application in forensics.
The first research into applications of ANNs in forensics was conducted in the 1990s 3, while computational forensics became a distinct discipline during later decades; e.g. the first International Workshop on Computational Forensics (IWCF) was held in 2007 4, and since then seven such workshops have been organized (the last one in 2018) 5, with over 75 papers presented and published in connection with them. To the best of our knowledge, only a few ANNs have ever been utilized in the practice of forensic examination 6,7. On the subject of offline document examination (where the questioned document is not itself digital, even if its examination is digitized) and online document examination, about 24 papers have been published in connection with the IWCF. Other notable papers were published by Fiel and Sablatnig 8 and CEDAR 7. There are many popular computer programs for handwriting examination, but these do not utilize ANNs, or we do not know whether they do.
We understand that: (1) verification consists of tests aiming to prove a hypothesis, while falsification consists of tests aiming to disprove a hypothesis 9; (2) reliability is the ability of a model to maintain its performance under different conditions 10; (3) falsification is successful (i.e. the model is found to be unreliable) if the differences among results obtained under different conditions are statistically significant.
Considering standards for the admission of new forensic methods, e.g. the internationally acclaimed Daubert standards 11, the aforementioned papers were not significantly concerned with methodology or procedures for computational criminalistics. These would have to be developed if ANNs were to be accepted as forensic tools and utilized in practice 12. While ANNs may easily be verified through basic testing (based on random sampling of training and testing subsets), their falsification is highly restricted because their decision-making processes are uninterpretable. Nonetheless, the best solution is to attempt their falsification, since we can either: (a) succeed at falsification and uncover their unreliability, which is often present despite extremely high performance rates; or (b) fail at falsification and prove them applicable. Considering the standard approach to the evaluation of ANNs in computational forensics 13, especially during competitions 14, there are many useful examples of verification and performance benchmarking, but no examples of falsification.
In our opinion, it is very important to support further research into applications of ANNs in forensics; however, it is not enough to develop better models while we lack an adequate methodology to tell them apart and while there are no procedures for their use in practice. Until these are in place, we need to ensure understanding of the problem among experts. For example, many useful observations were recently provided by NIST experts on the subject of automated handwriting examination systems and their evaluation, yet the same experts argued that: 'The scores produced [by CEDAR-FOX] showed over 95% accuracy, which provided support for admitting handwriting testimony in Daubert' 15. We shall demonstrate that accuracy scores aimed at proving reliability (by random sampling of training and testing subsets) are not enough to support the Daubert standards.
There is a vast body of literature concerned with probabilistic evaluation of evidence and forensic methods, but these approaches are either highly general or particular to these methods.
Hence, we have conducted an empirical study into the problem of methodology and procedures for forensic applications of ANNs, aiming at their informal falsification from the perspective of handwriting analysis and machine learning guidelines 16. We present and systematize examples of informal falsification methods, based on an ANN applied to the verification of offline handwritten documents' authorship (classification of document pairs into a positive 'same writer' or negative 'different writers' class).

Data
The first dataset consisted of 1604 offline handwritten documents (full-page scans) by 310 writers from the CVL database 17. It was based on seven templates (five English and two German), where templates number 7 and 8 were test-set exclusive, and numbers 3 and 6 were German. The second dataset consisted of 1539 offline handwritten documents (full-page scans) by 657 writers from the IAM database 18. It was based on 500 templates from the English LOB corpus, where all test-set templates were exclusive to it. The training set (2740 document images by 822 writers) and the test set (403 document images by 145 writers) were dichotomous.

Preprocessing
Images were converted to greyscale and then colour-inverted; the writing space was extracted (extracts), extract dimensions were reduced to 1024 px by 1024 px and divided into 256 px by 256 px patches, and finally the images were converted from TIF to PNG format. Patches that contained no text or only small amounts of text were omitted.
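For illustration, these steps could be implemented roughly as follows (a minimal sketch assuming Pillow and NumPy; the bounding-box extraction of the writing space and the ink-fraction cut-off for omitting near-empty patches are our own simplifying assumptions, not the exact procedure used):

```python
# A minimal sketch of the described preprocessing pipeline (assumed
# implementation; thresholds and helper names are illustrative).
import numpy as np
from PIL import Image, ImageOps

def preprocess(tif_path, out_dir, patch=256, target=1024, min_ink_fraction=0.01):
    img = Image.open(tif_path).convert("L")   # greyscale
    img = ImageOps.invert(img)                # colour inversion (ink becomes bright)

    # Extract the writing space: bounding box of non-background pixels (assumption).
    arr = np.array(img)
    ys, xs = np.nonzero(arr > 0)
    extract = img.crop((int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1))

    # Reduce the extract to 1024 x 1024 px and split it into 256 x 256 px patches.
    extract = extract.resize((target, target))
    arr = np.array(extract)
    kept = 0
    for r in range(0, target, patch):
        for c in range(0, target, patch):
            tile = arr[r:r + patch, c:c + patch]
            # Omit patches containing no text or only small amounts of text.
            if (tile > 0).mean() < min_ink_fraction:
                continue
            Image.fromarray(tile).save(f"{out_dir}/patch_{r}_{c}.png")  # TIF -> PNG
            kept += 1
    return kept
```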

Dataframes
The CSV files contained the names, directories and classes of image pairs. Images were paired under the assumption that the number of positive and negative class instances ought to be equal.
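A minimal sketch of how such a balanced pairing file could be produced is given below (the sampling strategy, column names and helper names are illustrative assumptions, not the exact procedure used):

```python
# A minimal sketch of balanced pair generation for the dataframes.
import csv
import random
from collections import defaultdict
from itertools import combinations

def build_pairs_csv(image_index, out_csv, seed=0):
    """image_index: list of (path, writer_id) tuples."""
    rng = random.Random(seed)
    by_writer = defaultdict(list)
    for path, writer_id in image_index:
        by_writer[writer_id].append(path)

    # Positive pairs: within-writer combinations.
    positives = [(a, b, 1) for paths in by_writer.values()
                 for a, b in combinations(paths, 2)]

    # Negative pairs: sample cross-writer combinations to match the positive count.
    writers = list(by_writer)
    negatives = set()
    while len(negatives) < len(positives):
        w1, w2 = rng.sample(writers, 2)
        negatives.add((rng.choice(by_writer[w1]), rng.choice(by_writer[w2])))
    rows = positives + [(a, b, 0) for a, b in negatives]

    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image_a", "image_b", "label"])
        writer.writerows(rows)
```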

Model
The model consisted of two identical convolutional neural networks, each path processing only one image of the pair and outputting a feature vector; the pairs of feature vectors were then passed together to a dense neural network for binary classification.
In terms of architecture: (1) there were three convolutional layers per path (256-512-1024 filters of size 12-6-3 and stride 4-2-1, ReLU activated); (2) after each convolution, there was a batch-normalization and a max-pooling layer (size 2 and stride 2), where the last pooling layer was a global average pooling (GAP) layer producing a feature vector; (3) outputs from the GAPs were fed into four dense layers (1024-512-256-1 neurons, ReLU activated), each preceded by a dropout layer (probability = 0.5) and followed by a batch-normalization layer; (4) a cosine and a Euclidean distance were calculated between the GAP outputs; (5) finally, a single neuron (sigmoid activated) performed the binary classification based on the two distance measures and the single output from the last dense layer.
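The architecture could be sketched in Keras roughly as follows (a minimal, non-authoritative reading of the description above; details not stated in the text, such as padding and the exact wiring of the head, are assumptions):

```python
# A minimal Keras sketch of the described two-path (Siamese) model.
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_path(input_shape=(256, 256, 1)):
    inp = layers.Input(shape=input_shape)
    x = inp
    conv_specs = [(256, 12, 4), (512, 6, 2), (1024, 3, 1)]
    for i, (filters, size, stride) in enumerate(conv_specs):
        x = layers.Conv2D(filters, size, strides=stride, activation="relu")(x)
        x = layers.BatchNormalization()(x)
        if i < len(conv_specs) - 1:
            x = layers.MaxPooling2D(pool_size=2, strides=2)(x)
    x = layers.GlobalAveragePooling2D()(x)  # the last pooling layer is a GAP
    return Model(inp, x)

def build_model():
    path = conv_path()  # identical (weight-sharing) convolutional paths
    img_a = layers.Input(shape=(256, 256, 1))
    img_b = layers.Input(shape=(256, 256, 1))
    feat_a, feat_b = path(img_a), path(img_b)

    # Dense head over the pair of GAP feature vectors.
    x = layers.Concatenate()([feat_a, feat_b])
    for units in (1024, 512, 256, 1):
        x = layers.Dropout(0.5)(x)
        x = layers.Dense(units, activation="relu")(x)
        x = layers.BatchNormalization()(x)

    # Cosine and Euclidean distances between the GAP outputs.
    cosine = layers.Dot(axes=1, normalize=True)([feat_a, feat_b])
    euclid = layers.Lambda(
        lambda t: tf.norm(t[0] - t[1], axis=1, keepdims=True))([feat_a, feat_b])

    # Final sigmoid neuron over the two distances and the last dense output.
    out = layers.Dense(1, activation="sigmoid")(
        layers.Concatenate()([cosine, euclid, x]))
    return Model(inputs=[img_a, img_b], outputs=out)
```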

Evaluation methods
We have utilized two approaches to evaluation: (1) subset-based evaluation, where we split the test subset into many smaller ones based on properties of the data, data sources or datasets; (2) input-based evaluation, where we modify the input data with respect to the standards of handwriting examination.

Subset-based evaluation
Criteria. The model was trained and tested on the combined CVL and IAM databases, which constitutes the basic None criterion, followed by the separate IAM and CVL test-set criteria. While the IAM test set and training set did not share document templates with each other, an additional Hard criterion was necessary for the CVL test set, containing only its test-set exclusive templates. Because no cross-database image pairs were utilized for the training or testing of the model, the Negative criterion denotes a test set of cross-database image pairs (we have assumed that all such pairs are negative). The Negative-Denoised criterion contains the same image pairs, additionally denoised by thresholding pixel values lower than 55.
Categories. The most fundamental approach to the evaluation of biometric models is to test them with respect to different properties of the data sources. However, because neither the IAM nor the CVL database contained the relevant subsets, we have decided to manually categorize the test writers based on the handwriting features, suggested by Huber et al. 20, that are indicative of the femininity, masculinity, sinistrality or dextrality of handwriting (but not of the writers).

Input-based evaluation
Different input quantities. One of the most crucial standards in the field of handwriting analysis is the quantitative baseline for the examination of questioned and comparative materials, where more data always ensures more reliable results, while the minimum quantity of data depends on the quality of that data. This baseline is sometimes utilized to define the robustness of probabilistic methods, which should maintain their performance when the quantity of data is reduced 21. Conversely, we have increased the quantity of data, to test whether the accuracy of the model would also increase, by supplying it with extracts of the documents' writing space (vide Preprocessing).

Different input qualities.
Because questioned documents are often written on grid paper, it is necessary to test the model against such documents (especially as they were absent from the test and training data). Thus, we have altered the quality of the input data by adding a realistic grid to the images (uniform line width of 2 px; pixel value of 54; intersections every 0.5 cm).
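A minimal sketch of such a grid overlay is given below (assumed implementation; the pixels-per-centimetre value depends on the scan resolution and is an assumption, with 118 px/cm corresponding roughly to 300 dpi):

```python
# A minimal sketch of overlaying a grid on an inverted greyscale image.
import numpy as np

def add_grid(image, px_per_cm=118, line_width=2, line_value=54):
    """image: 2D uint8 array (inverted greyscale, ink bright on dark)."""
    out = image.copy()
    step = int(round(px_per_cm * 0.5))          # grid intersections every 0.5 cm
    for start in range(0, out.shape[0], step):  # horizontal lines
        out[start:start + line_width, :] = np.maximum(
            out[start:start + line_width, :], line_value)
    for start in range(0, out.shape[1], step):  # vertical lines
        out[:, start:start + line_width] = np.maximum(
            out[:, start:start + line_width], line_value)
    return out
```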

Different input difficulties.
Given the current data, we decided that the following subsets would simulate increasing difficulty levels of handwriting examination, i.e. (1) a subset excluding pairs of identical patches; (2) a subset excluding pairs of patches from the same document/extract; (3) a subset including only pairs of patches in different languages (here, the None and CVL criteria were equivalent).
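These subsets could be derived from a pairs dataframe roughly as follows (a minimal sketch; the column names are illustrative assumptions):

```python
# A minimal sketch of deriving the three difficulty subsets from a pairs
# dataframe with assumed columns: image_a, image_b, doc_a, doc_b, lang_a, lang_b.
import pandas as pd

def difficulty_subsets(pairs: pd.DataFrame):
    not_identical = pairs[pairs.image_a != pairs.image_b]                            # (1)
    cross_document = not_identical[not_identical.doc_a != not_identical.doc_b]      # (2)
    cross_language = cross_document[cross_document.lang_a != cross_document.lang_b] # (3)
    return not_identical, cross_document, cross_language
```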

Different preprocessing methods.
We suspected that the model could learn to differentiate between writers based on the handwriting instruments (assuming that different writers used different instruments and used them consistently). Hence, we have decided to test the model on denoised (by thresholding pixel values lower than 25 to 0) and binarized images.
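A minimal sketch of the two image variants is given below (the binarization threshold is not stated in the text and is an assumption):

```python
# A minimal sketch of the denoising and binarization variants, applied to
# inverted greyscale arrays (assumed implementation).
import numpy as np

def denoise(image, threshold=25):
    """Set faint pixels (values below the threshold) to 0."""
    out = image.copy()
    out[out < threshold] = 0
    return out

def binarize(image, threshold=25):
    """Reduce the image to ink / no-ink, discarding intensity information."""
    return np.where(image >= threshold, 255, 0).astype(np.uint8)
```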

Significance
To measure the significance of our results (α = 0.05), we have employed two approaches: (1) a z-test for the difference of two proportions 22,23; (2) a discrete probability distribution of the given metric with respect to the given range of subset sizes (Figure 1).
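The first approach is the standard two-proportion z-test; a minimal sketch with illustrative numbers follows (the accuracies and subset sizes in the example are hypothetical):

```python
# A minimal sketch of the two-proportion z-test used to compare accuracies
# between two test subsets (standard formulation; variable names are ours).
from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(correct1, n1, correct2, n2):
    p1, p2 = correct1 / n1, correct2 / n2
    pooled = (correct1 + correct2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))   # two-sided
    return z, p_value

# Hypothetical example: is 0.95 accuracy on 400 pairs significantly
# different from 0.88 accuracy on 250 pairs at alpha = 0.05?
z, p = two_proportion_z_test(380, 400, 220, 250)
significant = p < 0.05
```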

Criteria
There are significant disproportions among the non-Negative criteria results (Table 1). Furthermore, despite the exclusion of cross-database pairs from the model's training and testing, results for the Negative criterion are significantly higher than the other results, while the Negative-Denoised results are both significantly lower than the Negative results and significantly higher than the other results; therefore, noise does not seem to be the key determinant of these results.
We may hypothesize that there exists at least one feature which is the most discriminatory for image pairs belonging to different databases, but is otherwise not the most discriminatory. Therefore, average feature vectors were calculated, separately for the IAM and the CVL, to determine which features are frequently observed for one test set but not the other (i.e. whether their normalized and averaged difference in frequency was at least equal to 0.25). These features were subsequently removed from the model (89 out of the 1024 third-layer filters were zeroed) and the model was tested again. Considering the Removed criteria (Table 2), the Negative accuracies are not significantly different from the None ones, but there is a significant preference for positive classification. The hypothesis thus seems promising, and this technique could become useful for obtaining better models.
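A minimal sketch of this feature-removal technique is given below (our reading of the procedure; only the 0.25 cut-off and the count of 1024 third-layer filters follow the text, while the normalization of the averaged difference and the way the filters are zeroed are assumptions):

```python
# A minimal sketch of identifying and zeroing database-discriminating features.
import numpy as np

def outlying_feature_mask(feats_iam, feats_cvl, cutoff=0.25):
    """feats_*: arrays of shape (n_images, 1024) with GAP feature vectors."""
    diff = np.abs(feats_iam.mean(axis=0) - feats_cvl.mean(axis=0))
    diff = diff / diff.max()        # normalize the averaged difference (assumption)
    return diff >= cutoff           # True for features to remove

def zero_filters(conv_kernel, conv_bias, mask):
    """Zero the third-layer filters corresponding to the masked features."""
    conv_kernel[..., mask] = 0.0
    conv_bias[mask] = 0.0
    return conv_kernel, conv_bias
```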

Categories
As may be observed (Table 3 and Figure 1), the least populous category led to significantly lower results for the model. Most probably, because this category was characteristically different and underrepresented in the training subset, the model did not learn well enough to generalize and distinguish between the writers of this category.

Different input quantities
It may be observed (Tables 1 and 4) that, other than for the Negative criterion, results for the whole-text document extracts are significantly lower than for the document patches (significant class disparities may also be observed). We hypothesize that the distribution of features for document patches is different from the distribution of features for document extracts. Hence, if there exist such subsets of patches and extracts where every patch has at least one equivalent extract, in terms of authorship and the distribution of features, then the results should be equivalent for both subsets. Thus, 84 pairs of extracts and patches were found, where equivalence was defined as the cosine distance of the normalized feature distributions (L1-norm applied; maximum distance at a 0.05 threshold). In contrast to the most exemplary of the previous results (Tables 1 and 4), the results of this evaluation are not significantly different for the two subsets (Table 5), except for the Negative criteria. Thus, the model has failed, because it should have achieved higher results given a higher quantity of data.
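The patch-extract matching could be sketched roughly as follows (our reading of the procedure; only the L1 normalization, the cosine distance and the 0.05 threshold follow the text, the rest is an assumption):

```python
# A minimal sketch of matching patches to "equivalent" extracts by the cosine
# distance of L1-normalized feature distributions.
import numpy as np

def l1_normalize(v):
    return v / np.abs(v).sum()

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def match_equivalents(patch_feats, extract_feats, patch_writers, extract_writers,
                      threshold=0.05):
    """Return (patch_index, extract_index) pairs written by the same writer
    whose feature distributions are within the cosine-distance threshold."""
    pairs = []
    for i, (pf, pw) in enumerate(zip(patch_feats, patch_writers)):
        for j, (ef, ew) in enumerate(zip(extract_feats, extract_writers)):
            if pw == ew and cosine_distance(l1_normalize(pf), l1_normalize(ef)) <= threshold:
                pairs.append((i, j))
                break   # keep one equivalent extract per patch
    return pairs
```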

Different input qualities
As may be observed (Table 6), results for all criteria are significantly lower, and the model is unsuitable for documents written on grid paper. In practice, this problem would not be solved by the removal of grid lines from questioned documents, since the model is not robust against variations in feature distributions, which could be entailed by different writing conditions (e.g. grid paper).

Discussion
Until ANNs are interpretable, we cannot prove them reliable (verify their reliability); however, we can fail to prove them unreliable (fail to falsify their reliability). As was highlighted by the United States Supreme Court in the Daubert case: 'Scientific methodology today is based on generating hypotheses and testing them to see if they can be falsified; indeed, this methodology is what distinguishes science from other fields of human inquiry' 11.
While there is no universal solution, we can propose two levels of hypotheses aimed at the informal falsification of black boxes from the outside: first, where we differentiate among the testing subsets and break down the results into meaningful criteria or categories; second, where we manipulate the input data based on the standards of the discipline and break down the results by applying the first-level approach.
Overall, we have demonstrated that ANNs cannot be applied in practice without prior falsification (identifying the conditions under which they are unreliable, or failing to find any unreliability), given that they could not maintain their performance even under highly favourable laboratory conditions. Thus, we have demonstrated that ANNs should be utilized by examiners with caution, and that each particular application should be preceded by a case-relevant evaluation of the model. Furthermore, we have also demonstrated that falsification can be constructive, by exposing the unreliability of models and allowing us to design specific remedies.
Ultimately, one should remember that, without interpretability, the only way to prove artificial neural networks to be useful and valid tools of forensic practice is to try to falsify them fiercely, but fail completely.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Figure 1.
Discrete probability distribution of the model's accuracy. Probability was calculated as the frequency of a given accuracy occurring with respect to a given number of randomly drawn writers (each drawing was repeated 500 times from the pool of 145 writers; accuracies were rounded, hence discrete).

Table 1.
Criteria-based evaluation results.

Table 2.
Criteria-based evaluation results after the removal of outlying features.

Table 4.
Evaluation results for whole-text document extracts.