DNN acoustic modeling with modular multi-lingual feature extraction networks

In this work, we propose several deep neural network architectures that are able to leverage data from multiple languages. Modularity is achieved by training networks for extracting high-level features and for estimating phoneme state posteriors separately, and then combining them for decoding in a hybrid DNN/HMM setup. This approach has been shown to achieve superior performance for single-language systems, and here we demonstrate that feature extractors benefit significantly from being trained as multi-lingual networks with shared hidden representations. We also show that existing mono-lingual networks can be re-used in a modular fashion to achieve a similar level of performance without having to train new networks on multi-lingual data. Furthermore, we investigate extending these architectures to make use of language-specific acoustic features. Evaluations are performed on a low-resource conversational telephone speech transcription task in Vietnamese, while additional data for acoustic model training is provided in Pashto, Tagalog, Turkish, and Cantonese. Improvements of up to 17.4% and 13.8% over mono-lingual GMMs and DNNs, respectively, are obtained.


INTRODUCTION
In recent years, neural networks have again become inherent parts of state-of-the-art automatic speech recognition (ASR) systems. After first successful applications to phoneme recognition [1], [2], and subsequently to continuous speech recognition [3], [4], were demonstrated about 20 years ago, neural architectures for acoustic modeling were widely abandoned in favor of Gaussian mixture models (GMMs), which often performed well enough and offered training algorithms that are easy to parallelize.

Supported in part by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Defense U.S. Army Research Laboratory (DoD/ARL) contract number W911NF-12-C-0015. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoD/ARL, or the U.S. Government. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number OCI-1053575.
Today, improved training algorithms, large amounts of available reference data as well as parallel hardware in the form of GPUs are fueling the development of larger and deeper network architectures that can leverage the modeling power of their sometimes billions of connections. It could thus be shown that training deep neural networks (DNNs) to predict context-dependent phonetic target states results in acoustic models that achieve remarkable improvements over GMMs when used in hidden Markov model based ASR decoders [5], [6].
Besides their high modeling capacity, neural networks have other desirable properties that can be exploited in speech recognition systems. Usually, neurons and their trainable connections are organized in multiple layers. Each layer can be regarded as a representation of the input data that has been optimized towards the network training criterion. This allows for architectures in which some of those representations (layers) are shared between tasks, while others are allocated exclusively to individual problems. Such networks are amenable to joint training of shared and exclusive network layers and may perform better on certain tasks since the parameters in the shared layers can be trained with more data. Recent work showed that neural network acoustic models with shared hidden layers can indeed benefit from being trained on multiple languages [7], [8].
It is also possible to re-use learned intermediate representations in order to solve complex tasks more easily. This was first explored in the context of phoneme recognition, where layers of networks trained to predict only few classes were re-used in a larger network that was trained to discriminate between all classes [9]. Related ideas have been and are still used to construct hierarchical architectures, e.g. for preprocessing of speech features [10] [11], in which networks combine the outputs of previously trained networks or merge them with different features. In more recent work, feature extraction networks trained with a bottleneck layer were employed as modules for constructing large neural networks for acoustic modeling, which resulted in significant gains over training standard DNNs on acoustic features directly [12].
In low-resource settings, i.e. when only a small amount of transcribed data is available for acoustic model training, it becomes hard to obtain good speaker-independent acoustic model networks. Unsupervised, layer-wise pre-training helps in preventing large networks from overfitting on the training set, but the relative gains achieved over GMM systems become smaller as less data is available [13].
In the following, we will propose several architectures that apply the ideas motivated above to exploit the availability of training data in multiple languages in order to create significantly better acoustic models for a low-resource target language.

MODULAR ACOUSTIC MODELING
In this section, we describe our general approach to neural network acoustic modeling with separate networks for feature extraction and prediction of phonetic target states, which is motivated by the success of bottleneck networks in extracting low-dimensional discriminative features for GMMs [14]. Although standard feed-forward networks are capable of handling rich and highly correlated input such as raw images or mel scale filterbank coefficients, it could be shown that DNN acoustic models benefit from bottleneck features as well [12].

Feature Extraction
Our feature extraction scheme follows the general approach described in [15], which applies deep learning techniques [16] to bottleneck feature (BNF) extraction. In the standard BNF setup described in [14], a neural network with small hidden "bottleneck" layer, placed between two larger hidden layers, is trained to predict phonetic target states. The activations of the units in the bottleneck can then be used as input features for Gaussian mixture models.
In order to initialize the deep bottleneck feature (DBNF) network, a stack of auto-encoder layers is first trained on standard speech features in a greedy, layer-wise and unsupervised fashion [17]. The auto-encoder layers can be converted into a simple feed-forward network, and the architecture is completed by adding a small bottleneck layer, another hidden layer and the final output layer. The resulting network is then trained to predict HMM states, which yields the final bottleneck features in the small hidden layer.
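The resulting DBNF forward pass can be sketched in numpy as follows. This is only a minimal illustration, not the actual implementation: random weights stand in for the auto-encoder pre-trained layers, and the hidden layer and output layer that follow the bottleneck (needed only during supervised training) are omitted. The layer sizes (330-dimensional input windows, four 1024-unit hidden layers, a 42-unit bottleneck) follow the experimental setup described later in the paper.

```python
import numpy as np

def init_layer(n_in, n_out, rng):
    # Random weights stand in here for auto-encoder pre-trained ones.
    return rng.normal(0.0, 0.01, (n_in, n_out)), np.zeros(n_out)

def forward(x, layers):
    """Propagate an input window through sigmoid hidden layers."""
    h = x
    for W, b in layers:
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))
    return h

rng = np.random.default_rng(0)
# Four pre-trained hidden layers followed by the 42-unit bottleneck.
sizes = [330, 1024, 1024, 1024, 1024, 42]
layers = [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=330)   # one context window of filterbank coefficients
bnf = forward(x, layers)   # 42-dimensional deep bottleneck feature
```

During training, the whole stack (including the omitted post-bottleneck layers) is fine-tuned to predict HMM states; at feature extraction time only the activations of the 42-unit layer are kept.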

Neural Network Acoustic Modeling
In most contemporary work, neural network acoustic models are employed in a hybrid approach to compute acoustic scores for hidden Markov models [3]. Scores are class-conditional probabilities given a vector of acoustic features x, which can be estimated from the posterior probabilities p(q|x) obtained at the neural network output layer with Bayes' rule as p(x|q) = p(q|x) p(x) / p(q). The class priors p(q) required for this conversion are commonly estimated from the available training data, while p(x) is constant for a given frame and can be ignored during decoding. In current setups, the phonetic classes q are context-dependent phone states determined by standard clustering algorithms from previously trained Gaussian mixture acoustic models [5].
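In log space this conversion is a single subtraction per frame. A minimal sketch (the state counts and posterior values below are made up for illustration; real priors come from the training alignment):

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors):
    """Convert DNN posteriors p(q|x) into scaled likelihoods:
    log p(x|q) = log p(q|x) - log p(q) + const, where the constant
    log p(x) is dropped since it is the same for every state q."""
    return log_posteriors - log_priors

# Toy example: 3 states, priors estimated from state counts.
counts = np.array([100.0, 250.0, 650.0])
log_priors = np.log(counts / counts.sum())

posteriors = np.array([0.1, 0.2, 0.7])   # softmax output for one frame
scores = scaled_log_likelihoods(np.log(posteriors), log_priors)
```

Note that dividing by the prior can change the ranking of states: a state with a high posterior but an even higher prior may score lower than a rarer state.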
The connection of bottleneck feature extraction and acoustic model network as described in [12] yields a large DNN in which the bottleneck network is shifted in the time domain over a large input feature window (Fig. 1). While the application of multiple copies of the same network at neighboring feature windows introduces temporal invariance [1], the dimensionality reduction performed by the bottleneck layer makes shifting over many neighboring frames computationally feasible.
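Shifting the bottleneck network over the input window can be sketched as below. This is an illustrative numpy version only: a fixed random projection stands in for the trained bottleneck network, and the hop size between sub-windows is an assumption (the paper does not specify it).

```python
import numpy as np

def shift_bottleneck(frames, bnf_net, sub_len, hop):
    """Apply the same bottleneck network to shifted sub-windows of a
    large input window and concatenate the resulting activations."""
    outputs = []
    for start in range(0, len(frames) - sub_len + 1, hop):
        window = frames[start:start + sub_len].ravel()
        outputs.append(bnf_net(window))
    return np.concatenate(outputs)

rng = np.random.default_rng(0)
# Toy "bottleneck network": a fixed projection down to 42 dimensions.
W = rng.normal(0.0, 0.01, (11 * 30, 42))
bnf_net = lambda x: np.tanh(x @ W)

frames = rng.normal(size=(21, 30))   # 21 frames of 30 filterbank coefficients
features = shift_bottleneck(frames, bnf_net, sub_len=11, hop=5)
```

Because every sub-window shares the same weights, the cost of covering a wide temporal context grows only with the number of shifts, while the 42-dimensional bottleneck keeps the concatenated feature vector small.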

MULTI-LINGUAL ARCHITECTURES
We now describe possible approaches to multi-lingual neural network training within the modular acoustic modeling framework described above. In particular, we focus on using data from medium-sized corpora in multiple languages to improve feature extraction networks for a low-resource setting with only 10 hours of transcribed training data.

Shared Hidden Representations
As noted previously, neural networks offer the ability to share intermediate (hidden) representations across different tasks. This works particularly well for speech recognition, where different languages may have distinctive sounds but may also share acoustic cues (or combinations thereof) which can be learned simultaneously on many languages. Successful demonstrations include training feature extraction networks in which all layers are shared (target states are obtained from a merged phone set) [18] or with one or more language-specific layers at the network output [19], [7], [8]. Since most modern DNN acoustic models are pre-trained in an unsupervised fashion, it is also possible to use multiple languages during pre-training only. While pre-training has indeed been shown to be language-independent [13], current algorithms hardly benefit from adding more unlabeled data for acoustic model training [20].
Here, we focus on training bottleneck feature extraction networks with shared hidden representations and language-specific output layers. The auto-encoders used to initialize the hidden layers prior to the bottleneck are pre-trained on multiple languages as well.
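The parameter sharing can be sketched as follows. This is a simplified illustration, not the training code: one shared hidden layer stands in for the full pre-bottleneck stack, weights are random rather than trained, and the language codes are those of the additional corpora used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared hidden stack (collapsed to a single layer for illustration).
W_shared = rng.normal(0.0, 0.01, (330, 1024))

# One softmax output layer per training language.
languages = ["PUS", "TGL", "TUR", "YUE"]
W_out = {lang: rng.normal(0.0, 0.01, (1024, 2000)) for lang in languages}

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, lang):
    """Shared representation, then the output layer of this sample's language."""
    h = np.tanh(x @ W_shared)
    return softmax(h @ W_out[lang])

p = forward(rng.normal(size=330), "PUS")
```

During training, each sample updates the shared weights plus only the output layer belonging to the language it was drawn from, so the shared layers see the combined data of all languages.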

Target Language Adaptation
Another variant of sharing representations is the adaptation of previously trained layers to a new task. For multi-lingual network training, this has been successfully applied to both bottleneck feature extraction [21] as well as acoustic modeling [22]. In addition to being straightforward to implement, this approach makes it possible to obtain good acoustic models in a short amount of time, as the source networks for adaptation might already be available from past experiments.
When adapting previously trained DBNF networks in our setting, one approach is to simply fine-tune them by performing another training run in the target language before training the DNN acoustic model. Since the bottleneck layer is connected to the acoustic model network, it is also possible to jointly train both networks by backpropagating errors through the bottleneck layer. In this case, the DBNF network is adapted to the target language without an intermediate supervised training step.
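Both variants reduce to continuing gradient descent with target-language labels; the only difference is whether the error signal comes from the DBNF network's own softmax layer or is backpropagated from the acoustic model above. A toy single-sample SGD step for one softmax layer (hypothetical sizes; the full networks apply the same rule to every layer) illustrates the update:

```python
import numpy as np

def finetune_step(W, x, target, lr=0.001):
    """One cross-entropy SGD step for a linear softmax layer.
    Adapting a whole DBNF network applies this update to all layers,
    with gradients either from its own output layer or backpropagated
    through the bottleneck from the acoustic model network."""
    z = x @ W
    e = np.exp(z - z.max())
    p = e / e.sum()
    # dCE/dW for one sample: outer product of input and (p - onehot).
    grad = np.outer(x, p - np.eye(len(p))[target])
    return W - lr * grad

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.01, (42, 10))
x = rng.normal(size=42)
W2 = finetune_step(W, x, target=3)
```

With a small learning rate, each step increases the probability assigned to the target-language state label, gradually bending the shared representation towards the new language.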

Mono-lingual Network Modules
In a similar manner to what has been proposed in the past for re-using networks that detected particular phonemes [9], feature extraction networks trained on single languages might also be used as modules. A possible architecture with two bottleneck network modules is depicted in Fig. 2. Both networks are applied to the input features, and the bottleneck activations are concatenated and repeated over neighboring input feature windows.
In our framework, individual bottleneck networks can be adapted to the target language as before by backpropagating errors obtained at the first acoustic model layer. Furthermore, the acoustic model can be connected to the input features by different means, e.g. by adding another layer of hidden units that observe the whole input feature window. Those new units are then connected to the first layer of the acoustic model network (Fig. 3). This concept was introduced in [9] as "connectionist glue" and used to obtain information from the input data that may be relevant for the current task but is ignored by the modules originally trained on a different task.

Fig. 2. Multiple feature extraction networks can be used as modules that are applied in parallel to the input features.
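The parallel-module architecture with glue units can be sketched as follows. Fixed random projections stand in for trained mono-lingual DBNF networks; the 128 glue units match the value settled on in the experiments, while the module count and dimensions are otherwise illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two mono-lingual bottleneck modules (toy projections to 42 dimensions).
modules = [rng.normal(0.0, 0.01, (330, 42)) for _ in range(2)]

# "Glue" units connected directly to the whole input window; they can
# pick up information the pre-trained modules ignore.
W_glue = rng.normal(0.0, 0.01, (330, 128))

def extract(x):
    """Run every module on the same input and concatenate their
    bottleneck activations with the glue unit outputs."""
    parts = [np.tanh(x @ W) for W in modules]
    parts.append(np.tanh(x @ W_glue))
    return np.concatenate(parts)

features = extract(rng.normal(size=330))   # 2 * 42 + 128 dimensions
```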

Extension to Language-specific Input Features
Depending on the characteristics of the target language, it may be desirable to use additional acoustic features that capture specific elements of the speech signal. For tonal languages such as Mandarin, where tonality is used to define lexical meaning, features that extract pitch information from the acoustic signal are of interest. Here, we investigate how to integrate fundamental frequency variation (FFV) features [23] into multi-lingual architectures. Recent work demonstrated their suitability for automatic speech recognition, especially when used as input features for neural networks, on a larger version of the Vietnamese corpus used here [24].
A straightforward approach is adding FFV spectrum filterbank outputs to the input features on which the DBNF networks have been trained. Even though adding those features was shown not to hurt performance on non-tonal languages [24], this might require the time-consuming retraining of all feature extractors on the modified input feature space. Alternatively, extra features can be added alongside bottleneck features trained on the common features, or can be integrated by adding glue units as discussed previously. This means that only the acoustic model network for the target language has to be trained (which has to be done in any case).

Corpora and Baseline Description
We perform experiments with various corpora released in the course of the ongoing Babel program [25] as listed in Table 1.
All corpora contain narrow-band, conversational telephone speech from land lines as well as mobile phones. Decoding was done on 2 hours of Vietnamese speech, while only 10 hours of transcribed data were provided for in-domain training. An additional 344 hours of data that could be used to improve acoustic models was available in Pashto (PUS), Tagalog (TGL), Turkish (TUR) and Cantonese (YUE). The baseline was provided by a flat-start GMM/HMM system trained on the respective languages in Table 1 only. After several iterations of training, context-dependent target states for neural network training were clustered and the required alignment of feature frames to states was generated.
We trained our networks to predict roughly 2000 context-dependent targets from 30 log mel scale filterbank coefficients extracted from 16 ms windows with a 10 ms frame shift. Features from neighboring frames were concatenated to context windows resulting in feature vectors of 630 elements. Bottleneck networks were trained on smaller windows with 330 elements first, and were then applied to neighboring sub-windows of the full input.
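The context window construction (21 frames of 30 coefficients yielding 630-dimensional vectors) can be sketched as below; padding edge frames by repetition is an assumption, since the paper does not specify edge handling.

```python
import numpy as np

def stack_context(frames, left, right):
    """Concatenate each frame with its neighbors into one context window.
    Edge frames are handled by repeating the first/last frame (assumed)."""
    n, d = frames.shape
    padded = np.concatenate([np.repeat(frames[:1], left, axis=0),
                             frames,
                             np.repeat(frames[-1:], right, axis=0)])
    return np.stack([padded[i:i + left + 1 + right].ravel()
                     for i in range(n)])

# 30 filterbank coefficients with 10 frames of context on each side
# gives the 630-dimensional acoustic model input vectors.
frames = np.random.default_rng(0).normal(size=(100, 30))
windows = stack_context(frames, left=10, right=10)
```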
Hidden network layers were pre-trained without supervision as denoising auto-encoders, in which a single layer is trained to properly reconstruct its original input from a version that has been corrupted with random noise [17]. We applied Gaussian noise to corrupt the real-valued mel scale input features and masking noise (i.e. turning elements randomly to zero) for subsequent layers. For supervised fine-tuning, we selected learning rates with the "newbob" algorithm, in which two separate thresholds control the start of learning rate decay and the total duration of training by monitoring the frame-level classification accuracy on a held-out validation set.
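The two corruption schemes can be sketched as follows (the noise levels below are illustrative assumptions, not the values used in the experiments):

```python
import numpy as np

def corrupt(x, rng, gaussian_std=None, mask_prob=None):
    """Corrupt denoising auto-encoder inputs: Gaussian noise for the
    real-valued filterbank features, masking noise (random zeroing)
    for the activations fed to subsequent layers."""
    if gaussian_std is not None:
        x = x + rng.normal(0.0, gaussian_std, x.shape)
    if mask_prob is not None:
        x = x * (rng.random(x.shape) >= mask_prob)
    return x

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 330))
noisy = corrupt(features, rng, gaussian_std=0.1)   # first layer
hidden = rng.random((5, 1024))
masked = corrupt(hidden, rng, mask_prob=0.2)       # deeper layers
```

Each auto-encoder is then trained to reconstruct the clean input from the corrupted version, which prevents the layer from simply learning the identity mapping.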
The feature extraction networks contained 4 auto-encoder layers with 1024 units each, i.e. 7 layers in total (with bottleneck, additional hidden layer and output layer). 42 units were used in the bottleneck layer, while the layer afterwards contained 1024 units, too. Acoustic models were not pre-trained and consisted of 3 larger hidden layers containing 2048 units each as well as the final output layer predicting the target states.
A 3-gram language model was built from the reference transcriptions of the Vietnamese corpus. The actual decoding was done with the Janus speech recognition toolkit [26], while networks were trained on GPUs with Theano [27]. Table 2 lists the performance in word error rate (WER) of the baseline systems. The GMM system is a context-dependent system using the same states as the hybrid setups and was trained from the same alignment described in the previous section. A standard DNN acoustic model does not provide much improvement in this low-resource condition (about 4% relative). The modular combination with deep bottleneck features, denoted as DBNF-DNN in the following, performs better with 72.2% WER. Jointly training the architecture improves this result to 70.8% WER.
Results for applying multi-lingual training with shared hidden layers to the feature extraction networks are listed in Table 3. Different methods of using the target data are compared: no inclusion (none), i.e. training the acoustic model on DBNF networks that have not been exposed to any Vietnamese data yet; including the VIE data in the multi-lingual training (incl); adapting a DBNF network trained without VIE by training on VIE only (adapt); and jointly training both acoustic model and feature extraction networks not exposed to VIE yet (jointly). It can be seen that feature extractors trained on completely different languages increase the recognition performance of the DBNF-DNN setup compared to the baseline performance of 72.2% WER. Here, Pashto provides the best single-language features with 68.4% WER, which is even slightly better than DBNFs from Cantonese, another tonal language (69.2% WER). Performance increases steadily as more languages (and thus a larger amount of data) are provided for DBNF training. When integrating the small Vietnamese dataset, best results are obtained by jointly training the acoustic model and the DBNF network. The network trained on all 4 extra languages achieved 64.2% WER this way, an improvement over the baseline systems of 9.3% (jointly trained DBNF-DNN) and 17.4% (GMMs). Adapting a previously trained DBNF network on Vietnamese mostly results in slightly fewer recognition errors compared to including the target language in the multi-lingual training stage.

In a separate experiment, we investigated whether jointly fine-tuning both networks is generally helpful, i.e. even if the feature extraction network has been adapted already. For an adapted Pashto DBNF network, this resulted in 66.6% WER, which is an improvement over 67.6% obtained without joint training but slightly worse than performing joint training without adapting the DBNF network first (66.2%).
The results of experiments in which mono-lingual networks were used as feature extraction modules are shown in Table 4. The architecture benefits from the combination of multiple modules, even though using those modules on their own does not increase recognition accuracy (see single-language results in Table 3). Word error rates obtained with this approach are slightly higher than the results for performing multi-lingual training with shared layers. Adapting the DBNF networks on Vietnamese before using them as modules helps, but as for DBNF networks with shared layers, joint training yields the largest improvements. Adding glue units (we settled on 128 units) that are directly connected to the input only results in small improvements and does not match the performance achieved by adapting the DBNF networks.
Table 5 lists the results of integrating the FFV features at the bottleneck level and via glue units; both variants improve the acoustic model, although training the DBNF network directly on the augmented feature space works best.

DISCUSSION & CONCLUSION
The results presented in this work show that DNN acoustic models benefit significantly from bottleneck features trained on different languages for which a larger amount of data might be available. For adapting the feature extractor networks to the target language, joint training of an unadapted DBNF network and a DNN acoustic model performed best. Including the target language during multi-lingual training resulted in slightly worse features compared to adapting a DBNF network as more data from other languages was added. It remains an open question whether this observation will persist when larger amounts of target language data are available.
It could be shown that adding mono-lingual feature extraction networks as modules improves recognition performance as well. This implies that several pre-existing networks can be re-used for building acoustic models in a new language; the more, the better. However, sharing representations resulted in better accuracy at the expense of the additional time required to train a new DBNF network on multiple languages at once.
We could confirm gains as reported in [24] by including tonal features in our architecture. While training new DBNF networks on the augmented features worked best, integrating the tonal features at the bottleneck level or via glue units improved the resulting acoustic model as well.
Future work will consist of investigating whether the proposed architectures are able to benefit not only from multiple languages but also from both wide-band and narrow-band audio, which was shown to be helpful for training standard DNN acoustic models [28]. We also look forward to exploring how multi-lingual data can be leveraged to improve acoustic model network training itself, and to further enhancing the suggested architectures.