Multi-lingual character handwriting framework based on an integrated deep learning based sequence-to-sequence attention model

Online signals are rich in dynamic features such as trajectory chronology, velocity, pressure and pen up/down movements. Their offline counterparts consist of a set of pixels. Thus, online handwriting recognition accuracy is generally better than offline recognition accuracy. In this paper, we propose an original framework for recovering the temporal order and the pen velocity from offline multi-lingual handwriting. Our framework is based on an integrated sequence-to-sequence attention model. The proposed system involves extracting a hidden representation from an image using a convolutional neural network (CNN) and a bidirectional gated recurrent unit (BGRU), and decoding the encoded vectors to generate dynamic information using a BGRU with temporal attention. We validate our framework using an online recognition system applied to benchmark Latin, Arabic and Indian on/off dual-handwriting character databases. The performance of the proposed multi-lingual system is demonstrated through a low error rate on point coordinates and a high system accuracy rate.


Introduction
Handwriting analysis has been an active area of research, covering tasks such as handwriting recognition [8,10], writer identification, and signature verification [6,12]. Depending on the acquisition technique, handwriting falls into two categories: online and offline. Online handwriting requires special digital devices and represents the data as a succession of points ordered in time. This mono-dimensional signal therefore carries dynamic features: the temporal order, the pen velocity, the pressure, and the pen up/down movements. In contrast, offline handwriting is captured from paper with a camera or a scanner. Online devices are thus more expensive than offline ones. However, online handwriting has become an efficient choice thanks to its dynamic features, which make more information available to recognition systems. It is worth stressing the importance of the pen velocity, which enriches the online information and improves the recognition accuracy. In [22], the authors demonstrated the effectiveness of velocity in the recognition of Arabic handwritten characters: they obtained 98.8% for a re-sampled online signal against 95.8% for an online signal without velocity. In addition, since the offline category consists of static images, the storage of handwriting images is larger than that of online data. These images are represented by a set of pixels without dynamic information. In general, the presence of dynamic features makes online systems more effective than offline systems. Qiao et al. [21] proved the effectiveness of an online system compared to its offline counterpart, using recognition rates as an evaluation metric: they obtained 96% for online digits against 90% for offline images.
To exploit the advantages of both offline and online processing, researchers have presented many methods to recover the temporal order from static handwriting images. Reconstructing the drawing order has been studied since the nineties. In general, this process is based on several steps: preprocessing, ambiguous zone selection, terminal point detection, and searching for the smoothest path [19]. According to Rousseau et al. [24], each step affects the next one. Moreover, these earlier methods suffer from assumption problems, because the writing direction differs according to the language. In [19], the authors affirmed that recovering the trajectory chronology was promising, but there was no way to recover some dynamic information like the pen velocity. Deep learning can handle these problems without complicated algorithms or assumptions. Memory recurrent networks, able to treat long-term sequential tasks, have achieved great success; among these investigations, image captioning has reached the level of translating images into text. Motivated by this, we assume that Sequence-to-Sequence (Seq2Seq) models with attention have great potential to become the new state of the art for handwriting recovery problems. Our framework contains: (a) a convolutional neural network (CNN) to extract the lower-level features, (b) a bidirectional gated recurrent unit (BGRU) to encode the extracted features into a single vector, and (c) a BGRU to decode the encoded features into ordered coordinates based on an attention model. To the best of our knowledge, this is the first work to implement a Seq2Seq-BGRU with attention model for temporal-order and pen-velocity recovery.
The major contributions of this work are as follows:
• investigating a novel Seq2Seq with attention model based on a BGRU NN to predict the dynamic information from static handwriting images;
• combining a CNN and a BGRU to extract features from images;
• recovering, for the first time, the pen velocity besides the temporal order;
• providing an end-to-end system able to recover multi-lingual characters, so that no assumptions about the pen order are needed.
The rest of the paper is organized as follows. Section 2 gives an overview of related work. Section 3 describes the framework of the study. Section 4 discusses the implementation and the obtained results. Finally, Sect. 5 provides the implications of the study and the conclusion.

Related work
Some work quoted in the literature has recovered the temporal trajectory order following one of two categories: contour-based or skeleton-based approaches. The contour technique [26] suffers from a high computational time. For example, in [26], the authors focused on loop analysis: they processed different models for loop types and performed a thorough loop contour analysis. Nevertheless, the effectiveness of their investigation on loops was not clear enough, as they did not show any practical handwriting recognition results and the evaluation time was high. On the other hand, the skeleton approach has given good results and a faster response compared to the contour method [7,9,21,24]. Based on the edge continuity relation [21], the authors suggested three main steps. First, they identified the different relations at each node; for a node of degree four, they used a NN, otherwise they relied on some assumptions. Second, they selected double-traced lines using maximal weighted matching. Their last step was to find the smoothest possible path through all the curves of the handwriting graphic model; based on the optimal Euler path, they selected the smoothest one. However, their work applied only to single-stroke characters of a single language. In [24], the authors utilized handwriting knowledge to propose the possible start/end points. Afterwards, different paths were produced and the best one was chosen. Their approach was applied to multi-stroke letters, and they reported a good recognition rate to demonstrate its performance. Even so, their assumptions were based on the Latin language only. In addition to the contour and skeleton categories, handwriting recovery approaches can be divided into two groups: local and global search methods. The goal of the local tracing method is to search for the smoothest path at each ambiguous zone based on the tracing history and the actual configuration [6].
The major limitation of this method is that designing heuristic rules that apply to different handwriting styles is difficult. This limitation can be overcome by the global graph technique, which creates a graph model of the input skeleton image and then uses a search technique to find an optimal path through the text [9,21,24]. The drawback of the global method is its high computational time, which depends on the complexity of the algorithms used as search techniques; some cases are also hardly treatable. For example, Phan et al. [20] used a greedy algorithm to search for the optimal path in a global model. Their work was based on limited assumptions about start/end points, ambiguous zones and double-traced segments, which made it difficult to obtain the right trajectory. In [7], the authors considered that start point detection and skeleton separation represent a hard task, in addition to the high complexity of searching for the smoothest path at junction zones. Based on the skeleton graph, they separated touching characters and crossing strokes, and the optimal path was fixed by a greedy algorithm. Their model remained sensitive to the processed language and shared the common problem of being slow and complex [21,24]. However, it is not clear whether the local method can achieve a more effective performance than the global method: both suffer from problems. Consequently, some existing work has combined these two tracing methods [9], where the number of possibilities is reduced by adding local features such as the curvature and the inclination angle. It cannot be denied that previous work has achieved strong performances, particularly on the Latin language. However, most of it is weakened by other languages, such as Arabic, whose characters admit many written variants [9].
Moreover, the problem of handwriting recovery has been based on finding the correct terminal points (start/end points), the junction points and the main direction in each detected ambiguous zone. Furthermore, Rousseau et al. [24] demonstrated that each step of the recovery procedure can affect the recognition rate. Thus, in our opinion, these steps are complicated. A number of recent works [2-4, 12, 22] addressed the use of an end-to-end system for handwriting recovery. For example, the authors in [2] proposed an end-to-end model based on an encoder-decoder BLSTM NN. Here, a CNN-BLSTM encoder module takes the offline Indic handwriting as input and outputs a set of features. The decoder BLSTM state is initialized with the last encoder BLSTM state to finally generate equidistant points sequentially. The main limitation of this model is the absence of velocity. In addition, in [3], a variational auto-encoder model was proposed to reconstruct the online signal from its offline counterpart and also to generate offline Latin characters from their corresponding online script.
In [4], the authors integrated an attention layer between the encoder and decoder models to reconstruct multi-stroke signals from offline Japanese kanji characters. However, this model suffered from the loss of velocity, making the recovered writing extremely slow and unnatural compared to human-like writing. Thus, this type of multi-stroke signal could affect the online recognition system.
In [12], a heuristic-based system was developed to recover complex handwritten signatures. The system comprises three main stages: point classification, local examination and global reconstruction. The advantage of this work is its analysis of complex signatures; however, the velocity features were not considered and the question of speed recovery was left unaddressed.
In [22], the authors used the VGG-LSTM to extract features from images and utilized the BLSTM as a decoder model. Their system is related to our work, with a different focus. We use the CNN-BGRU to extract features from images, instead of the VGG-LSTM. Further, it is more challenging to produce human-like velocity for the task of recovering temporal order from offline handwriting. The authors in [22] recovered an online signal with equidistant points and utilized a re-sampling step to add velocity to the obtained signal. However, in this paper, our framework produces an online signal that is characterized by a trajectory chronology with velocity.

End-to-end recovery framework
In this section, we clarify the main architecture of the proposed framework. We employ a Seq2Seq with attention model [1] to transform a sequence of offline handwriting characters into a sequence of their homologous online signal. The obtained signal contains dynamic features like the temporal order and the pen velocity. The main objective of this work can be summarized as p_c = Φ(f_p), where p_c is the generated sequence of point coordinates characterized by dynamic features, corresponding to image I, and f_p is the sequence of pixels representing the static features of I. As a result, the length and type of the input (image) and output (signal) are different. Our model consists of three parts: a CNN, an encoder BGRU NN and a decoder with an attention model. These parts form an end-to-end system. Training is therefore supervised, taking into account image I and its counterpart online signal of n points, i.e. {I} → {(x_1, y_1), …, (x_n, y_n)} ∈ R^(2×n). The function D() is a multilayer BGRU model. It receives the local context state generated by the attention layer and the previously predicted coordinates. In Sect. 4, we will show the effectiveness of different basic encoder-decoder models and prove that a BGRU NN is the best candidate. To summarize, the decoding process can be written as <x_t, y_t> = D(c_a, <x_{t−1}, y_{t−1}>), where <x_{t−1}, y_{t−1}> are the previously predicted points and c_a is the attention context state.
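As a purely illustrative, shape-level sketch (the layer internals below are random stand-ins, not the trained model), the data flow I → C() → E() → attention → D() can be wired as follows; the dimensions follow the paper (64 × 64 images, D = 512, 50 output points), but all function bodies are placeholders of ours:

```python
import numpy as np

# Shape-level sketch of the pipeline I -> C() -> E() -> attention -> D().
# Dimensions follow the paper; the toy layers themselves are random stand-ins.
rng = np.random.default_rng(0)

def conv_extractor(img):                 # C(): image -> N feature columns of depth D
    N, D = 8, 512
    return rng.standard_normal((N, D))   # stands in for the CNN feature map F

def bgru_encoder(F):                     # E(): features -> hidden states h_t
    return F @ rng.standard_normal((F.shape[1], 512))

def decode_with_attention(H, n_points=50):
    points = []
    prev = np.zeros(2)                   # decoder starts from <0, 0>
    for _ in range(n_points):
        scores = H @ rng.standard_normal(H.shape[1])
        alpha = np.exp(scores - scores.max()); alpha /= alpha.sum()
        context = alpha @ H              # attention context c_a
        prev = (context[:2] + prev) / 2  # placeholder point prediction
        points.append(prev)
    return np.stack(points)

img = np.zeros((64, 64))
pc = decode_with_attention(bgru_encoder(conv_extractor(img)))
print(pc.shape)                          # (50, 2): 50 predicted coordinates
```

The point of the sketch is only the interface: a 2-D image goes in, and a fixed-length sequence of (x, y) pairs comes out.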

Preprocessing and pen velocity process
The training step is based on two inputs: (a) the handwriting character image, and (b) its counterpart online signal. Generally, the first step in handwriting recovery is preprocessing. This process includes image normalization, so that all images have the same size (64 × 64). These images are passed to the ConvExtractor C(), while their counterpart online signals are used as the input to the DecoderExtractor D(). This type of signal is necessary to train the supervised decoder BGRU network. The number of online signal points is fixed to 50. A sampling step is necessary to obtain a signal with the desired number of points (see the algorithm of the normalization step). In fact, online handwriting is a series of non-equidistant points saved during the writing process. A study of neuronal and muscular effects shows that the pen velocity decreases at the terminal points of strokes and at the junction zones of the curve. This information is used in online systems to calculate the velocity. Thus, the points are represented as a two-dimensional matrix and characterized by the pen velocity. To the best of our knowledge, researchers have not reconstructed the pen velocity when solving the handwriting recovery problem. In [9,22], the authors used a re-sampling step to add velocity to the recovered signal, without generating a signal that carries the temporal order and the pen velocity at the same time; by way of contrast, our framework does exactly that. Figure 2a shows the ground truth signal: the speed of the pen decreases at the beginning and end of strokes and at the significant angular zones of the curve. This type of signal is used as the input to the normalization algorithm. Figure 2b represents the output signal after normalization to the desired number of points (50), while maintaining the same speed as the original. The 50 points are normalized to (64 × 64).
The GRU and LSTM networks can be adapted to variable sizes (with an end-of-stroke token). However, the proposed framework treats 50 points, as in [2,21,22], since each point is informative and long dependencies are reduced. The normalization step keeps the velocity information consistent within a character, and the distance between two points depends on the natural speed of the pen and on the total length of the character. In more detail, the normalization to 50 points per character results from a compromise between preserving the relevant dynamic and geometric data of the trajectory and reducing both the number of signal frames and the number of hidden layers in the network (the BGRU decoder) responsible for predicting signals with speed profiles from offline input. Consequently, the strength of the following algorithm is to solve this issue and normalize the online signal to a fixed number of points. The proposed normalization algorithm receives as input the original signal of size N_I and outputs its corresponding normalized signal of size N_O (50 points), reflecting the dynamic profiles of its tracing (velocity, radius of curvature). We begin by calculating the average speeds Vm_I and Vm_O of the original and normalized paths, respectively, while initializing the latter at the starting point M_O(1) = M_I(1). Next, each new current point of the normalized path M_O(i) is marked iteratively on the trace from its predecessor M_O(i − 1) using an elementary curvilinear step x_curved_O(i). Figure 3a presents the ground truth of the long Arabic character "sin", while Fig. 3b shows the normalized signal with 50 points. This signal preserves the same dynamic characteristics as its ground truth, as shown in Fig. 3c, d, respectively.
As illustrated in these figures, our experimental tests show that the choice of 50 points ensures the extraction (detection) of the inflection points corresponding to the extrema of variation of the curvature radius, which are essential for segmenting the trajectory into 'Beta' strokes and, subsequently, for extracting the parameters of the Beta-elliptic model. To conclude, in accordance with the curves, the number of points does not impact the Beta-elliptic features, and hence the recognition rate.
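A minimal sketch of the normalization idea, under our own reading of the algorithm: re-sampling the trace at equally spaced time instants keeps the point spacing proportional to pen speed (close points where the pen is slow). The real algorithm works with curvilinear steps and the average speeds Vm_I and Vm_O; the interpolation below is a simplification of ours:

```python
import numpy as np

# Sketch of the normalization step: re-sample the original trace at n_out
# equally spaced *time* instants, so point spacing still reflects pen speed.
# Assumes the input points were recorded with a uniform sampling clock.
def normalize_signal(points, n_out=50):
    points = np.asarray(points, dtype=float)        # (N_I, 2) original trace
    t_in = np.linspace(0.0, 1.0, len(points))       # uniform acquisition times
    t_out = np.linspace(0.0, 1.0, n_out)
    x = np.interp(t_out, t_in, points[:, 0])
    y = np.interp(t_out, t_in, points[:, 1])
    return np.stack([x, y], axis=1)                 # (n_out, 2), speed preserved

# Example: a trace that lingers near the start then accelerates.
trace = np.array([[0, 0], [1, 0], [2, 0], [3, 0], [10, 0], [20, 0]])
sig = normalize_signal(trace, n_out=50)
print(sig.shape)  # (50, 2)
```

Because slow regions of the input contribute many close samples, the normalized signal stays dense there, mirroring the velocity profile of the original trace.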

Convolutional extractor
Deep CNNs [17] have been used in different tasks [2,15]. Given the CNN success in providing a solid representation of features [29,30], we use CNNs to extract a sequence of image features. The main goal of the convolutional extractor is to convert the character image into visual features. A deep CNN model is used without its last fully connected layer. The normalized images are fed into the network, and the convolutional layers produce the feature maps. The features are extracted from left to right, column by column, from the feature maps; one feature vector corresponds to a rectangular region and describes that image region. The details of the CNN configuration are described in Sect. 4. The input image (64 × 64) is transformed into CNN features of size (Batchsize, N, D). Specifically, the batch size is fixed to 32, and N and D are the length and depth of the CNN features, respectively. As illustrated in Fig. 1, the CNN output is denoted by F = (F_1, F_2, …, F_N) with F_i ∈ R^D (D = 512).
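The column-wise reading of the feature map can be sketched as follows. The feature map is random here, since the convolution stack is elided, and collapsing the height by averaging is our own simplification of "one feature vector per rectangular region":

```python
import numpy as np

# Sketch of turning a CNN feature map into a left-to-right feature sequence.
# Shapes follow the paper (batch 32, depth D = 512); the map is random
# because the convolution layers themselves are elided here.
batch, h, w, depth = 32, 2, 8, 512
fmap = np.random.default_rng(1).standard_normal((batch, h, w, depth))

# Read column by column: each of the N = w columns becomes one feature
# vector F_i describing a vertical rectangular region of the image.
F = fmap.mean(axis=1)          # collapse the height axis -> (batch, N, D)
print(F.shape)                 # (32, 8, 512)
```

In the real model the height collapse is done by the pooling layers rather than a mean, but the resulting (Batchsize, N, D) interface is the same.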

Encoder extractor
The encoder-decoder model was previously employed for machine translation [1], but it has recently been applied to other tasks [2,15]. The encoder is the portion of the network that processes the input to produce a single hidden representation of all the input information. Among the various types of neural networks, recurrent NNs (RNNs) can forecast the most accurate results [23]. However, simple RNNs suffer from the well-known vanishing gradient problem, which the GRU [5] and the LSTM [13] can alleviate. According to previous studies, no conclusion can be drawn about which is best: LSTM or GRU; both have achieved similar results in many tasks. Even so, the GRU can be faster thanks to its smaller parameter size. In addition, according to [5], the GRU can be a better choice for modeling temporal information compared to the LSTM. That is why the GRU is chosen instead of the LSTM. The hidden state is computed by the following equations:

g_t = σ(W_fg F_t + U_hg h_{t−1})
r_t = σ(W_fr F_t + U_hr h_{t−1})
c_t = tanh(W_fh F_t + U_rh (r_t ⊙ h_{t−1}))
h_t = (1 − g_t) ⊙ h_{t−1} + g_t ⊙ c_t

where g_t, r_t and c_t are respectively the update gate, the reset gate and the candidate activation value, σ is the sigmoid function, F_t is the input (the extracted CNN features), and W_fg, U_hg, W_fr, U_hr, W_fh and U_rh are the corresponding weight matrices. To obtain more information and a better representation, a BGRU with multiple layers is chosen as the most efficient encoder type for our problem. Precisely, after many experiments on various models with different configurations, the encoder is fixed as a BGRU NN with three layers of 512 units each. The encoder BGRU model is built on top of the CNN features extracted from the character image. This feature vector is mapped to a fixed-length vector through the encoder. As depicted in Fig. 1, the ConvExtractor generates an intermediate feature F which is reshaped to a map F' = (F'_1, F'_2, …, F'_N) with two-dimensional features. Those features are used as the input to the first layer of the BGRU encoder.
The remaining layers use their previous hidden state as input. For each time step, the output of the encoder is h_t ∈ H, calculated by Eq. (8) as the concatenation of the forward and backward GRU states: h_t = [h_t^forward ; h_t^backward].
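The GRU update described above can be transcribed directly in numpy, using the paper's weight names (toy sizes and random weights; this is a single unidirectional cell step, not the full three-layer BGRU):

```python
import numpy as np

# Direct numpy transcription of the GRU gate equations, using the paper's
# weight names W_fg, U_hg, W_fr, U_hr, W_fh, U_rh; sizes are toy.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(F_t, h_prev, W):
    g_t = sigmoid(F_t @ W["W_fg"] + h_prev @ W["U_hg"])          # update gate
    r_t = sigmoid(F_t @ W["W_fr"] + h_prev @ W["U_hr"])          # reset gate
    c_t = np.tanh(F_t @ W["W_fh"] + (r_t * h_prev) @ W["U_rh"])  # candidate
    return (1.0 - g_t) * h_prev + g_t * c_t                      # new hidden state

rng = np.random.default_rng(2)
d_in, d_h = 4, 3
W = {k: rng.standard_normal((d_in if k.startswith("W") else d_h, d_h))
     for k in ["W_fg", "U_hg", "W_fr", "U_hr", "W_fh", "U_rh"]}
h = gru_step(rng.standard_normal(d_in), np.zeros(d_h), W)
print(h.shape)  # (3,)
```

A bidirectional layer simply runs one such cell forward and one backward over the sequence and concatenates the two hidden states per time step.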

Decoder extractor
The decoder is a BGRU NN with three layers, like the encoder architecture, and each layer has 512 units. The role of the DecoderExtractor is to generate the predicted coordinate sequence, as shown in Eq. (9), with P_1, …, P_i = [<x_1, y_1>, …, <x_i, y_i>], where i is the number of desired points and F is the set of features. Besides, F_N ∈ R^D, where N and D are respectively the length and depth of the CNN features. According to the entire feature sequence (F_1, …, F_N), the decoder estimates a set of points (P_1, …, P_i). Moreover, S is the last encoder state, which summarizes the whole input feature; the hidden state of the decoder is initialized with S at the first time step. In the first layer, the decoder takes as input the last encoder state S and the first coordinate <0, 0>. Then, the remaining layers receive as input the hidden state of the previous decoder layer S^d_{i−1} and the previous coordinate <x_{i−1}, y_{i−1}>, as shown in Eq. (10): S^d_i = g(S^d_{i−1}, <x_{i−1}, y_{i−1}>), where S^d_i is the current state of the decoder, S^d_{i−1} is its previous state, and g is a GRU function. The generated coordinates are calculated by Eq. (11): <x_i, y_i> = U S^d_i + b, where we apply a linear activation function (dense) and U and b are the weights and bias, respectively.
As mentioned above, we introduce the recovery process of a classical Seq2Seq [28], which can resolve the problem of handwriting recovery.

Attention mechanism
The classic Seq2Seq without attention has the defect that all the hidden representations of the encoder are compressed into a single fixed-length context vector S. Consequently, the prediction accuracy gradually decreases as the input length increases [1]. Thus, this paper proposes an attention mechanism for the handwriting recovery process. An attention layer is placed between the encoder and the decoder. The input of the BGRU encoder is F' = (F'_1, F'_2, …, F'_N). At each time step N, the encoder reads F'_N and updates its hidden state h_t; then, the attention context vector is produced as a weighted sum of the h_t, which is used to detect the best hidden representation of the encoder. The following equations describe the attention process:

e_{i,t} = align(S^d_i, h_t)   (12)
α_{i,t} = exp(e_{i,t}) / Σ_{k=1..N} exp(e_{i,k})   (13)
c_a = Σ_{t=1..N} α_{i,t} h_t   (14)

Formula (12) is the alignment computation between the encoder hidden state h_t and the decoder hidden state S^d_i. Formula (13) gives the attention weights, which indicate the importance of the input value at time step t for generating the output at time step i; the softmax function normalizes the vector e_i (of length N) into an attention mask over the input sequence. Formula (14) gives the final attention state c_a. At each time step, the decoder predicts a probability distribution of point coordinates. The ground truth point coordinates are used to train the network to generate an online signal compatible with the original script (see Fig. 2b). We calculate the L1 loss between the predicted vector P and the ground truth, where i is the number of points. The training continues until the L1 loss converges, and the model is saved.

Fig. 3 Example of the long Arabic character /si:n/: (a) the original signal before normalization; (b) the normalized signal with 50 points; (c), (d) the dynamic profile curves of the ground truth and normalized signals
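Formulas (12)-(14) can be sketched in numpy as follows; an additive-style align() is assumed here, since the paper does not spell out its scoring function, and all weights are random toys:

```python
import numpy as np

# Numpy sketch of the attention step: score, softmax, weighted sum.
# An additive align() with weights v, W_h, W_s is an assumption of ours.
def attention(H, s_dec, v, W_h, W_s):
    e = np.tanh(H @ W_h + s_dec @ W_s) @ v        # (12) align(S^d_i, h_t) per t
    a = np.exp(e - e.max()); a = a / a.sum()      # (13) softmax attention weights
    c_a = a @ H                                   # (14) context vector c_a
    return c_a, a

rng = np.random.default_rng(3)
N, d = 6, 4
H = rng.standard_normal((N, d))                   # encoder hidden states h_t
s = rng.standard_normal(d)                        # current decoder state S^d_i
c_a, a = attention(H, s, rng.standard_normal(d),
                   rng.standard_normal((d, d)), rng.standard_normal((d, d)))
print(round(a.sum(), 6), c_a.shape)               # 1.0 (4,)
```

The weights a sum to one and select which encoder positions feed the context vector at each decoding step.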
The test step uses the offline handwriting image as input; based on the obtained model, the framework then generates a set of point coordinates representing human-like writing.

Experiments
In this section we study the effect of different settings of basic encoder and decoder candidates. Then, we explore the performance of the proposed model and we compare our results with the existing methods on different datasets.

Datasets
We test the proposed framework on Arabic, Latin and Indian corpora. Specifically, for Arabic, we use the dual on/off Arabic LMCA database [16]. It contains 28 letters, joined or isolated (we choose the isolated form), and the training set contains 2500 samples per letter. For Latin data, we adopt the dual on/off Latin IRONOFF dataset [27]. We use the isolated letters (upper and lower cases) and digits, sorted into 26 and 10 classes, respectively. The training set comprises 2500 samples per character and 7000 samples per digit. For Indian, we use the Telugu dataset, which contains 116 Telugu characters with 500 training samples per character. All these datasets are used without considering the pen up/down. We created additional patterns using a data augmentation strategy based on distorted samples (changing the inclination angle, smoothing, and baselines). The obtained signals are converted to offline handwriting. First, we concatenate the pen points to obtain the skeleton of the image. Then, we use a filter to grow the skeleton so that if the current point is foreground, all its neighbors are set to the foreground. Table 1 gives the number of samples used for training, testing and validation.
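The skeleton-growing filter amounts to a 3 × 3 binary dilation, which can be sketched as follows (the function name is ours):

```python
import numpy as np

# Sketch of the "grow the skeleton" filter: a 3x3 binary dilation, so every
# neighbor of a foreground pixel becomes foreground.
def grow_skeleton(img):
    img = np.asarray(img, dtype=bool)
    out = img.copy()
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            # OR in the image shifted by (dy, dx)
            out |= np.roll(np.roll(img, dy, axis=0), dx, axis=1)
    return out.astype(np.uint8)

skel = np.zeros((5, 5), dtype=np.uint8)
skel[2, 2] = 1                       # a single skeleton pixel
print(grow_skeleton(skel).sum())     # 9: the pixel plus its 8 neighbors
```

Note that np.roll wraps around at the borders; for strokes touching the image edge a padded dilation (e.g. scipy.ndimage.binary_dilation) would be the safer choice.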

Metrics and implementation
Thanks to the available on/off data for each corpus, the evaluation step becomes easy. Both signals (ground truth and generated) are not only compared by the root mean square error (RMSE) and the Euclidean distance (ED), but also recognized by an online recognition system. We thus state the effectiveness of the proposed framework on three evaluation metrics: the RMSE, the ED and an online recognition system based on an LSTM NN. The RMSE and the ED are chosen as distance criteria. The RMSE measures the rate of transformation from one set of points to another; its definition has been used differently in related work such as [6,12]. In our case, it represents the difference between the online signal and the recovered one according to the following formula:

RMSE = sqrt( (1/L) Σ_{i=1..n} Σ_{t=1..l_i} [ (x_t − x'_t)² + (y_t − y'_t)² ] )

where n is the number of samples, l_i is the number of points in each sample, x_t and y_t are the coordinates of the online signal, x'_t and y'_t are the recovered coordinates, and L denotes the total length of the characters. The better system is the one that reaches the lowest RMSE value. The results are reported in Table 4.
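The RMSE described in this subsection can be computed as follows (the helper is ours; the per-character loop mirrors the double sum over samples and points, with L the total point count):

```python
import numpy as np

# RMSE between ground-truth and recovered signals: double sum of squared
# coordinate differences over all samples and points, divided by the total
# number of points L, then square-rooted.
def rmse(truth, pred):
    # truth/pred: lists of (l_i, 2) arrays, one per character sample
    sq, L = 0.0, 0
    for gt, pr in zip(truth, pred):
        sq += np.sum((gt - pr) ** 2)
        L += len(gt)
    return np.sqrt(sq / L)

gt = [np.array([[0.0, 0.0], [1.0, 1.0]])]
pr = [np.array([[0.0, 1.0], [1.0, 1.0]])]
print(rmse(gt, pr))  # sqrt(1 / 2) = 0.7071...
```
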
The chosen ED is used in [6], following a cumulative warping-path formulation:

ED(p, q) = d(p, q) + min( ED(p−1, q), ED(p, q−1), ED(p−1, q−1) )

where d(p, q) is the Euclidean distance between ground-truth point p and predicted point q, and p and q vary between 1 and L. The ED metric minimizes the cumulative distance between the ground truth and the predicted elements via a warping path. Thus, the best method is the one that achieves the lowest values.
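A warping-path distance consistent with this description is the classic cumulative recurrence sketched below; the exact variant used in [6] may differ in its boundary or step constraints:

```python
import numpy as np

# DTW-style cumulative distance: each cell accumulates the Euclidean point
# distance plus the cheapest of the three admissible predecessor cells.
def warped_distance(A, B):
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for p in range(1, n + 1):
        for q in range(1, m + 1):
            d = np.linalg.norm(A[p - 1] - B[q - 1])   # Euclidean point distance
            D[p, q] = d + min(D[p - 1, q], D[p, q - 1], D[p - 1, q - 1])
    return D[n, m]

A = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
print(warped_distance(A, A))  # 0.0 for identical signals
```

Because the warping path can stretch or compress time, a recovered signal with the right trajectory but a slightly different pacing is penalized less than by a rigid point-to-point distance.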
The results are given in Table 4. We concede that the RMSE and even the ED are harsh metrics when evaluating a predicted signal with a small deviation, even if its temporal order matches its ground truth; the accuracy result can therefore be unexpected. For this reason, we also use an online LSTM recognition system [22], based on the beta-elliptic and grapheme segmentation method [10], which requires the existence of the temporal order and the velocity to produce a reasonable result. We extract 10 characteristics from the original online signal to train the network. Then, we test the network with our reconstructed signal.
Our framework is trained on a single Nvidia GT 650M GPU using the TensorFlow platform. The training batch contains 32 image-signal pairs. To update the parameters, we use the Adam optimizer with a learning rate of 10^−3. We save a checkpoint model every 700 iterations. Training lasts 16 hours, and testing takes around seven minutes per 20 samples.

Ablation study
Recently, there have been many networks used to extract features from images: ResNet-50 [11], VGG-16 [25] and CNN, followed by LSTM NN or GRU NN, to obtain a higher feature representation. Thus, we evaluate the encoder combinations on the IRONOFF digit dataset, and the results of the training loss are presented in Table 2. This table shows the following studies: (1) we use 5 networks (i.e. ResNet, VGG, CNN1, CNN2 and CNN3) followed by LSTM as an encoder. The CNN configurations are listed in Table 3. We fine-tune the pre-trained deep models (ResNet and VGG). The results demonstrate that with the same followed network (i.e. LSTM), CNN2 performs better than other networks in terms of training accuracy. We use eight convolution layers with a kernel size of (3 × 3). The primary role of the Max pooling is to extract the main significant characteristics from the output of the previous convolution layers.
Each layer is preceded by an activation function (rectified linear unit). Batch normalization is employed after the third and last convolution layers. (2) With CNN2, the results show that the GRU NN is better than the LSTM: GRU-CNN2 is the best combination, achieving the lowest training loss. We then investigate whether the proposed framework can generate the closest signal to the original. With the same GRU-CNN2 based encoder, we use four decoder combinations (i.e. Seq2Seq-LSTM, Seq2Seq-GRU, Seq2Seq-BGRU, and Att-BGRU). Figure 4 depicts the performance of the different decoder combinations during training. The results show that BGRU performs better than LSTM and GRU for the Seq2Seq model; in addition, the attention model achieves a more accurate result than the simple Seq2Seq. In other words, we choose GRU-CNN2 as the encoder and Att-BGRU as the decoder for handwriting recovery.

Architecture exploration
We aim to demonstrate whether the proposed framework and velocity reconstruction can boost the performance of the handwriting recovery process. Specifically, we compare five different handwriting recovery models: (1) the basic model (Att-BGRU) presented in Sect. 4.3, which utilizes equidistant points as online training data, so it is described as a framework without velocity; (2) sequence-to-sequence based BLSTM with velocity (S2S-BLSTM-V), which is inspired from [2] and integrates the velocity feature; here, the online training data are passed through the normalization algorithm (introduced in Sect. 3.1) to obtain an online script with velocity and a fixed number of points; (3) sequence-to-sequence with a VGG-16 and BLSTM based encoder (S2S-VBLSTM-V), which is inspired from [22] and takes similar training data to S2S-BLSTM-V; (4) sequence-to-sequence with a CNN2 and BGRU based encoder (S2S-CBGRU-V), which takes similar training data to the latter model; and (5) our Att-BGRU-V framework, which integrates the velocity concept as detailed in Sect. 3.1.

Table 1 Dataset details

Scripts               Training  Testing  Validation
LMCA                  70,000    20,000   10,000
IRONOFF lower-case    65,000    17,800   10,200
IRONOFF upper-case    65,000    17,800   10,200
IRONOFF digits        70,000    20,000   10,000
Telugu                58,000    17,400   11,600

All the experimental results are shown in Table 4, from which we can see the following: (1) BGRU with the attention model (Att-BGRU) performs better than both S2S-BLSTM-V and S2S-VBLSTM-V. This indicates that the attention mechanism can improve the performance of handwriting recovery. However, in some cases, its margin over S2S-CBGRU-V is low, which demonstrates the effectiveness of a BGRU trained with velocity signals; thus, the pen velocity reconstruction improves effectiveness. (2) S2S-CBGRU-V performs slightly better than both S2S-BLSTM-V and S2S-VBLSTM-V, thanks to the BGRU NN, which outperforms the BLSTM for handwriting recovery.
(3) Our Att-BGRU-V achieves the best performance on all the evaluation metrics (i.e., 2.0 RMSE, 22.8 ED and a 94.6% recognition rate) on IRONOFF upper-case. (4) To assess the effectiveness of the pen velocity process, we also evaluate the model in a scheme without velocity (Att-BGRU), obtained by training the framework with offline handwriting and its corresponding online signal represented as a set of equidistant points. Comparing the latter with our proposed framework with velocity (Att-BGRU-V), we find that using the pen velocity is very effective for handwriting recovery and recognition. To sum up, the experimental results demonstrate that the proposed framework, combining the attention model and a BGRU trained with velocity, enhances the effectiveness of the handwriting recovery process.
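The temporal attention used by the Att-BGRU decoder variants can be illustrated with a minimal sketch: a softmax-weighted sum over the encoder hidden states. The dot-product scoring function below is an assumption for illustration; the paper's exact alignment function is not reproduced here.

```python
import math

def attention_context(decoder_state, encoder_states):
    """Return the temporal attention weights and context vector for one decoder step.

    decoder_state:  list of floats (current decoder BGRU hidden state)
    encoder_states: list of lists (one hidden vector per encoder time step)
    """
    # Alignment scores: dot product between the decoder state and each encoder state
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_states]
    # Softmax over time steps -> temporal attention weights
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of encoder states
    dim = len(encoder_states[0])
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context
```

At each decoding step, the context vector would be concatenated with the decoder input, letting the decoder focus on the relevant local encoder states rather than only the final one.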

Comparison to existing methods
To evaluate the effectiveness of the proposed framework, we compare our work with three existing systems, re-implemented and tested under the same environment conditions. Our proposal uses the CNN2-BGRU based encoder and the Att-BGRU based decoder, and is intended to recover an online signal with velocity. Precisely, we compare four frameworks: (1) Elbaati's approach [9], based on a graph model that represents an image as a set of segments, with a genetic algorithm used to find the smoothest path across those segments; (2) S2S-BLSTM, proposed in [2], the classical Seq2Seq model without attention, with a CNN-BLSTM encoder and a BLSTM decoder; (3) S2S-VBLSTM, proposed in [22], similar to S2S-BLSTM but with a VGG-16 and BLSTM based encoder, where the authors applied a sampling step to the recovered signal to add velocity; and (4) our baseline model (Att-BGRU-V), introduced in Sect. 4.4. All the experimental results are provided in Table 5, from which we observe the following: (1) The approach of Elbaati et al. [9] lags behind all deep models on all the evaluation metrics, indicating that with the rise of deep learning, handwriting recovery can be handled more efficiently. (2) The framework of Ayan et al. [2] was the first deep model for handwriting recovery. Its highest accuracy rate (91.9%), obtained on the LMCA dataset, remains below both the S2S-VBLSTM [22] rate (98.8%) and ours (98.9%). This indicates that the pen acceleration obtained after the sampling step of [22] is beneficial, and that our attention model with velocity further improves the recognition accuracy. (3) Rabhi's framework [22] was the best state-of-the-art framework. The authors re-sampled the recovered signal (a set of equidistant points) to obtain an online signal with pen acceleration.
However, our framework generates a meaningful signal with velocity without any sampling-based post-processing step, and their framework achieves a lower accuracy than ours, thanks to our attention mechanism. In addition, the BGRU proves its efficiency in terms of both accuracy and time (0.7 s/step) compared to the BLSTM (1 s/step). (4) Our proposed Att-BGRU without velocity achieves better results than the related approaches [2,9,22], which demonstrates the significance and benefits of our architecture. In addition, our novel Att-BGRU-V framework performs better than the state-of-the-art models on all evaluation metrics. It outperforms Elbaati's approach [9], Ayan's system [2] and Rabhi's framework [22] by 16, 0.3 and 0.1 absolute error points, respectively, when testing the digit data. For the ED evaluation, our proposed framework again achieves the best performance, surpassing Elbaati's approach [9], Ayan's system [2] and Rabhi's framework [22] by 30.9, 0.2 and 0.1 absolute error points, respectively, when testing the Arabic data.
In addition, the recognition rate is higher than that of the other methods on every database. To sum up, the effectiveness of the velocity is captured when the attention mechanism is applied together with the BGRU.
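The point-wise comparisons above rely on RMSE and an Euclidean-distance (ED) score between the recovered and ground-truth point sequences. A minimal sketch of these two metrics, assuming both sequences have the same length (as produced by the fixed-point normalization) and that ED is the summed point-to-point distance, which is our reading rather than a definition taken from the paper:

```python
import math

def rmse(pred, truth):
    """Root-mean-square error over corresponding (x, y) points."""
    assert len(pred) == len(truth)
    se = sum((px - tx) ** 2 + (py - ty) ** 2
             for (px, py), (tx, ty) in zip(pred, truth))
    return math.sqrt(se / len(pred))

def euclidean_distance(pred, truth):
    """Sum of point-to-point Euclidean distances between two trajectories."""
    assert len(pred) == len(truth)
    return sum(math.hypot(px - tx, py - ty)
               for (px, py), (tx, ty) in zip(pred, truth))
```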
Table 6 shows a range of effectiveness among existing work. No standardized metrics have been developed to date to evaluate the performance of drawing order reconstruction. We nevertheless include a number of baselines in our experiment, even though they do not provide a fair basis for comparison. For example, when we compare our results with existing systems [3,7,21,24], the experimental setup is not consistent, since we do not use the same recognition engines or datasets. For instance, Qiao et al. [21] use a K-nearest-neighbor approach and the online Unipen 1a digit database to generate static 2D images (15,612 samples in total). Further, Rousseau et al. [24] do not specify the nature of the engine they use, so the performance of their recovered signal cannot be analyzed. Finally, we are unable to objectively compare these results since they do not use the same metrics and datasets. However, we argue that the signal generated by our framework is stronger due to the recovery of speed, which is the first such attempt reported to date.

Visual analysis of the reconstructed velocity
We analyze the velocity prediction of the proposed framework using a visual plot of the Arabic letter <<waw>> from the LMCA dataset. Figure 5a-c shows the velocity curves of the ground truth and of the predicted models (S2S-BLSTM-V, S2S-VBLSTM-V and our model Att-BGRU-V). As shown in these figures, the deviations between the predicted velocities of the different models and the ground truth velocity are not large, and our model has the smallest deviation from the ground truth. To further analyze the reconstructed velocity, Fig. 5d-f shows the trajectory reconstructions corresponding to these curves. The trajectories are divided into strokes based on inflection points, which are located according to the variation of pen acceleration [10]. As indicated in these figures, the difference in point locations is not informative. However, our model reconstructs a trajectory with a flexible curvature, as in human writing, thanks to the attention layer, which can focus on the detailed oval curves. Figure 6 presents two successfully recovered samples: an Indian character and an IRONOFF digit. These scripts are reconstructed successfully by the models both with and without an attention layer.
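The velocity curves in Fig. 5 are magnitudes of the pen-tip velocity along the recovered point sequence. Assuming a uniform sampling period, a simple finite-difference sketch is:

```python
import math

def velocity_magnitudes(points, dt=1.0):
    """Approximate pen-tip speed between consecutive points by finite differences.

    points: ordered list of (x, y) pen positions
    dt:     sampling period (assumed constant)
    """
    return [math.hypot(x1 - x0, y1 - y0) / dt
            for (x0, y0), (x1, y1) in zip(points, points[1:])]
```

With equidistant resampling (the Att-BGRU variant without velocity) this curve is flat, which is why the normalization of Sect. 3.1 is needed to preserve a meaningful speed profile.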

Analysis of reconstructed characters
Nevertheless, the models without attention (S2S-BLSTM-V, S2S-VBLSTM-V) may recover online trajectories that do not match the offline image, because the encoder-decoder model extracts only the final encoder state from different character samples. Due to the absence of the attention layer, the decoder may generate an online trajectory identical to an existing sample in the training dataset (see Fig. 7). The attention layer therefore helps to avoid overfitting and to adapt to the true samples. Figure 7 shows two reconstructed samples of Latin and Arabic characters. These samples are successfully reconstructed with our attention framework but fail with the models without attention. In some cases, the latter models generate erroneous scripts that differ from the counterpart offline handwriting; these scripts can correspond to existing samples of the target dataset. For example, when recovering the character <<sin>>, we obtain an erroneous signal corresponding to the existing character <<ba>>. Similarly, the Latin letter <<b>> is recovered as the character <<z>>, which exists in the IRONOFF dataset. Here, the decoder relies on the last encoder feature, which can be similar to the encoder feature of other samples. Figure 8 shows our recovered signal with a small deviation from its ground truth counterpart; at the same time, the temporal order is consistent with the online signal. The proposed framework takes an offline image as input and generates an online signal with temporal order and pen velocity. Figure 9 presents the pen velocity of our recovered signal and of the ground truth signal.
Acceleration decreases in the zones marked by red circles, as in the ground truth signal, which confirms that the recovered signal respects the velocity profile of human writing. As indicated in Fig. 10, the recovered signal traverses the loop in the true direction, and both the start and terminal points are essentially consistent with the original ones. Thus, the temporal order of our recovered signal respects that of the ground truth signal. Figure 11 depicts some failure cases of the suggested framework, where the zone marked by a pink arrow is lost because of the pen up/down movement, a process that has not been handled yet: the proposed system deals with mono-stroke isolated characters. Our contribution focuses on the reconstruction of the pen velocity feature, which has not been addressed by either earlier or recent systems.

Velocity performance
The velocity curve varies between velocity extrema (maxima, minima), which specify the number of strokes. The purpose of reconstructing the pen velocity from an offline image is to give meaning and dynamic information to offline handwriting. Hence, we become able to segment an image into primitive lines based on the reconstructed pen-tip velocity, and we obtain more features to improve the offline handwriting recognition rate. In this study, velocity reconstruction is visually apparent when plotting the character (see Fig. 12b), where the points are not equidistant. In addition, the magnitude of the pen velocity (Fig. 12d) shows the variation of acceleration as a function of time. As illustrated in Fig. 12, we obtain an online signal; based on the beta-elliptic model [10], it provides two feature types, the dynamic and geometric profiles.
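Segmenting the signal into strokes at velocity extrema, as described above, can be sketched as follows. This is a plain local-minimum test on the speed curve; the thresholding details of the beta-elliptic model [10] are not reproduced here.

```python
def stroke_boundaries(speed):
    """Indices of local minima of the speed curve, used as stroke boundaries."""
    return [i for i in range(1, len(speed) - 1)
            if speed[i] < speed[i - 1] and speed[i] <= speed[i + 1]]

def split_strokes(points, speed):
    """Split an ordered point sequence into strokes at velocity minima."""
    cuts = [0] + stroke_boundaries(speed) + [len(points)]
    return [points[a:b] for a, b in zip(cuts, cuts[1:]) if b > a]
```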
In the geometric profile, each beta stroke can be represented by an elliptic arc described by four geometric features: a, b, theta, and theta_p, where a and b are the large and small half-dimensions of the elliptic arc, and theta and theta_p are respectively the inclination angle of the ellipse and the tangent inclination. These profiles are detailed in [10].
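As a rough illustration of the geometric profile, a point on such an elliptic arc can be generated from the semi-axes a and b and the inclination angle theta. The parametrization below is an assumption for illustration only; the full beta-elliptic formulation, including theta_p, is given in [10].

```python
import math

def elliptic_arc_point(a, b, theta, t, cx=0.0, cy=0.0):
    """Point at parameter t on an elliptic arc with half-dimensions a, b,
    rotated by the ellipse inclination theta around center (cx, cy)."""
    x, y = a * math.cos(t), b * math.sin(t)
    # Rotate the canonical ellipse point by the inclination angle
    return (cx + x * math.cos(theta) - y * math.sin(theta),
            cy + x * math.sin(theta) + y * math.cos(theta))
```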

Conclusion
In this study, we have introduced a novel framework based on a Seq2Seq model to recover the temporal order and pen velocity of multi-lingual handwritten characters. We have shown the importance of the attention model for focusing on the local state while recovering online trajectories. The framework is an end-to-end system based on a CNN to extract features and an encoder-decoder BGRU with an attention model to generate a signal with temporal order and velocity information.
The effectiveness of our proposed system is demonstrated through several aspects. First, its competitiveness comes from a novel Seq2Seq attention model based on the BGRU, which can focus on detailed oval curves; our model is able to reconstruct a trajectory with a flexible curvature similar to human writing. Secondly, we demonstrate the importance of training the network with a normalized signal characterized by pen velocity. In this way, our framework generates a signal with temporal order and pen velocity, and consequently achieves higher recognition rates than existing state-of-the-art models. For research reproducibility, the framework is developed in Python and is freely downloadable from GitHub. Among the challenges that could be addressed in the future is recovering the dynamic information of words, sentences and complicated signatures, taking into consideration the pen up/down information. The use of meta-learning will be required in this context, especially to generalize the recovery model efficiently to unseen offline handwriting. We will also deploy a dependent bidirectional recurrent NN, as it addresses the Seq2Seq erroneous prediction problem and could improve the accuracy results. Finally, other deep learning and reinforcement learning models will need to be explored [18], and a range of benchmark multi-lingual handwriting character databases developed for comparative evaluation with other state-of-the-art approaches (e.g., multi-task [28] and multi-model deep learning [14]).