ShefCE: A Cantonese-English bilingual speech corpus -- speech recognition model sets and recording transcripts Wai Man Ng Alvin C.M. Kwan Tan Lee Thomas Hain 10.15131/shef.data.4522925.v1 https://orda.shef.ac.uk/articles/dataset/ShefCE_A_Cantonese-English_bilingual_speech_corpus_--_speech_recognition_model_sets_and_recording_transcripts/4522925 This online repository contains the speech recognition model sets and the recording transcripts used in the phoneme/syllable recognition experiments reported in [1].<div><br></div><div>Speech recognition model sets<br></div><div>-----------------------------------------</div><div>The speech recognition model sets are available as a tarball, named model.tar.gz, in this repository.</div><div><br></div><div>The models were trained on Cantonese and English data. For each language, two model sets were trained: one under the background setting and one under the mixed-condition setting. All models are DNN-HMM models, that is, hybrid feed-forward neural network models with 6 hidden layers and 2048 neurons per layer. Details can be found in [1]. The Cantonese models include a bigram syllable language model. The English models include a bigram phoneme language model. All model sets are provided in the Kaldi format.</div><div><br></div><div>1. The background-cantonese model was trained on CUSENT (68 speakers, 19.4 hours of read Cantonese speech).</div><div>2. The background-english model was trained on WSJ-SI84 (83 speakers, 15.2 hours of read English speech).</div><div>3. The mixed-condition-cantonese model was trained on background-cantonese data and ShefCE Cantonese training data (25 speakers, 9.7 hours).</div><div>4. 
The mixed-condition-english model was trained on background-english data and ShefCE English training data (25 speakers, 2.3 hours).</div><div><br></div><div><div>Recording transcripts<br></div></div><div>----------------------------</div><div>The recording transcripts are available as a tarball, named stms.tar.gz, in this repository. These transcripts cover the ShefCE portion of the training data and the ShefCE test data.</div><div><br></div><div>The stms.tar.gz archive contains four files:</div><div>- ShefCE_RC.train.v*.stm contains the transcripts for the ShefCE training set (Cantonese)</div><div>- ShefCE_RE.train.v*.stm contains the transcripts for the ShefCE training set (English)</div><div>- ShefCE_RC.test.v*.stm contains the transcripts for the ShefCE test set (Cantonese)</div><div>- ShefCE_RE.test.v*.stm contains the transcripts for the ShefCE test set (English)</div><div><br></div><div><div><br></div></div><div>The ShefCE corpus data can be accessed online with <a href="https://doi.org/10.15131/shef.data.4522907">DOI:10.15131/shef.data.4522907</a><br></div><div>Please cite [1] when using ShefCE data, models or transcripts.</div><div><div><p>[1] Raymond W. M. Ng, Alvin C.M. Kwan, Tan Lee and Thomas Hain, "ShefCE: A Cantonese-English Bilingual Speech Corpus for Pronunciation Assessment", in <i>Proc. The 42nd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</i>, 2017.</p></div></div> 2017-03-10 14:07:02 Cantonese English data sets speech recognition system Language learning Chinese Languages English as a Second Language English Language Natural Language Processing
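The transcript file naming described in the record (ShefCE_RC / ShefCE_RE, train / test, and a version suffix after "v") can be mapped back to language and data split programmatically. The following is a minimal sketch, not part of the original release: the function names `classify_stm_name` and `list_transcripts` are hypothetical, the version strings used in the usage note are made up for illustration, and the archive is assumed to have been downloaded locally as stms.tar.gz.

```python
import re
import tarfile
from typing import Dict, Optional, Tuple

# Filename pattern taken from the record: ShefCE_R{C,E}.{train,test}.v*.stm
# The actual version string is not stated in the record, so it is captured as-is.
_STM_NAME = re.compile(
    r"^ShefCE_R(?P<lang>[CE])\.(?P<split>train|test)\.v(?P<version>[^/]*)\.stm$"
)
_LANGUAGES = {"C": "Cantonese", "E": "English"}


def classify_stm_name(name: str) -> Optional[Tuple[str, str, str]]:
    """Return (language, split, version) for a ShefCE transcript filename.

    Returns None if the name does not follow the documented pattern.
    Hypothetical helper, written for illustration only.
    """
    base = name.rsplit("/", 1)[-1]  # ignore any directory prefix inside the tarball
    match = _STM_NAME.match(base)
    if match is None:
        return None
    return _LANGUAGES[match.group("lang")], match.group("split"), match.group("version")


def list_transcripts(archive_path: str = "stms.tar.gz") -> Dict[str, Tuple[str, str, str]]:
    """Map each transcript file in the archive to (language, split, version).

    Assumes the archive has already been downloaded to archive_path.
    """
    found = {}
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile():
                info = classify_stm_name(member.name)
                if info is not None:
                    found[member.name] = info
    return found
```

For example, `classify_stm_name("ShefCE_RC.train.v1.stm")` yields the Cantonese training set, where the version "1" is an invented placeholder; calling `list_transcripts()` on the downloaded archive should report the four files listed in the record.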