Assessing language development in Arabic: The Arabic language: Evaluation of function (ALEF)

Arabic is characterized by extensive dialectal variation, diglossia, and substantial morphological complexity. Arabic lacks comprehensive diagnostic tools that would allow for a systematic evaluation of its development, critical for the early identification of language difficulties in the spoken and written domains. To address this gap, we have developed an assessment battery called Arabic Language: Evaluation of Function (ALEF), aimed at children aged 3 to 11 years. ALEF consists of 17 subtests indexing different language domains, modalities, and associated skills and representational systems. We administered the ALEF battery to native Gulf Arabic-speaking children (n = 467; ages 2.5 to 10.92; 55% boys; 20 children in each 6-month age band) in Saudi Arabia in two data collection waves. Analyses examining the psychometric properties of the instrument indicated that after the removal of misfitting items, the ALEF subtests had reliability coefficients in the range from 0.78 to 0.98, and resulting subtest scores displayed a consistent profile of positive intercorrelations and age effects. Taken together, the results indicate that the ALEF battery has good psychometric properties, and can be used for the purpose of evaluating early language development in Gulf Arabic-speaking children, pending further refinement of the test structure, examination of gender-related differential item functioning, and norming.


Introduction
Arabic is the mother tongue of over 313 million people in the Middle East and North Africa and is the fifth most spoken language in the world (Simons & Fennig, 2018). Even though there has been much descriptive work done on classical Arabic and its modernized version, Modern Standard Arabic, very few studies of spoken Arabic, and especially of child language acquisition and developmental language disorders (DLD), have been conducted. Although the typological differences in phonology, syntax, morphology, and orthography between Arabic and Indo-European languages, like English, necessitate such research, few studies have examined early child language development in Arabic-speaking children. One of the major reasons that research on DLD is currently so limited in Arabic is that there have been no standardized assessment devices suitable for such research. The absence of such instruments impedes not only research on child development in general and DLD in particular, but also educational and clinical practice, effectively preventing evidence-based identification and treatment of early language disorders.
In this article, we report on the development of a novel assessment battery named Arabic Language: Evaluation of Function (ALEF) developed to address the need for a theoretically mindful and psychometrically sound assessment of spoken Arabic. Although the battery can be adapted for use with speakers of a variety of dialects, the data reported here were collected with children acquiring Gulf Arabic, a group of closely related varieties of vernacular Arabic spoken in several countries on the Arabian Peninsula along the Persian Gulf, from northern Kuwait to Oman (Hole, 1995). The core area of Gulf Arabic encompasses the eastern coast of Saudi Arabia, including Al-Hasa-Dammam, the area where the present data were collected. Here we provide the theoretical context that motivated the development of ALEF, present the data from two waves of data collection aimed at establishing ALEF's preliminary psychometric properties, and gauge its clinical potential through the examination of subtest intercorrelations and their sensitivity to age effects.

Language acquisition in children with disorders of spoken and written language
Theories of language acquisition assume that, when unfolding typically, it progresses relatively uniformly in all languages (Guasti, 2002) and entails a sequence of genetically and experientially regulated developmental stages that result in the acquisition of language-specific representations (e.g., inventories of sounds, word roots and affixes, inflectional paradigms) and a grammar (i.e., a combinatorial system that forms the basis for the generative capacity of human language). Neurodevelopmental disorders may affect one, some, or all levels of linguistic representations and/or the capacity for manipulating them in real time, depending on the severity of the condition and its individual manifestations. These deficits are central to the cluster of primary disorders of language development and functioning, exemplified here by DLD and language-related learning disorders, highly prevalent childhood-onset neurodevelopmental disorders, still largely underidentified and understudied in most linguistic and cultural settings. These disorders are characterized by diverse patterns of deficits in phonological, morphological, lexical, syntactic, semantic, and pragmatic, as well as written language domains. In addition to considerable within-disorder heterogeneity, there also exists a large degree of cross-disorder commonality in linguistic traits (Levy & Schaeffer, 2011). For example, whereas it has been documented that phonological awareness is particularly impaired in reading disability (Frost et al., 2009; Kovelman et al., 2012; Ziegler & Goswami, 2005), its deficits have also been reported in DLD (Boudreau & Hedberg, 1999); conversely, morphosyntactic deficits are an area of weakness specifically associated with DLD (Armon-Lotem, 2012; Bedore & Leonard, 2001; Hansson & Nettelbladt, 1995; Rice, Tomblin, Hoffman, Richman, & Marquis, 2004), although such weaknesses have been reported in children with reading disability (Cantiani, Lorusso, Guasti, Sabisch, & Mannel, 2013; Rispens, Roeleven, & Koster, 2004); and even though pragmatic deficits are typically observed in Autism Spectrum Disorders (Ben-Yizhak et al., 2011; Reisinger, Cornish, & Fombonne, 2011; Young, Diehl, Morris, Hyman, & Bennetto, 2005) and social (pragmatic) communication disorder (Norbury, 2014), they have also been documented in DLD (Bishop, Chan, Adams, Hartley, & Weir, 2000).
This cross-disorder commonality of symptoms, as well as the transformation of disorder phenotypes throughout development, necessitates a dimensional and developmental approach, rather than treating each disorder as categorically distinct and invariant across individuals and across the life span. Correspondingly, to study DLD, multivariate phenotypes should be derived from assessments that tap into distinct but related aspects of language functioning, encompassing sublexical, lexical, grammatical, and pragmatic knowledge, as well as the cognitive systems that support the acquisition of this knowledge and its online application (e.g., memory systems, executive functions, and sentence processing algorithms). A language assessment that aims at being comprehensive should sample from multiple types of linguistic representations (e.g., phonological, lexical, morphological, and syntactic), modalities (e.g., comprehension and production), and relevant related cognitive processes (e.g., verbal and phonological working memory, naming fluency, phonemic awareness).

Typical and atypical acquisition of Arabic
The status of research on typical and atypical acquisition of the Arabic language, although limited in scope, has provided the foundation for the development of the ALEF. Here we review the existing relevant literature with regard to the aspects of language acquisition that are reflected in the ALEF: phonology, morphology, syntax, and literacy. We also pay special attention to the distinct features of Arabic, namely diglossia and dialectal variation in spoken language.

Phonology
In Arabic phonology, most published research has focused on typical development. For example, Amayreh and colleagues (Amayreh, 2003; Amayreh & Dyson, 1998, 2000; Dyson & Amayreh, 2000) established developmental trajectories for the acquisition of Arabic consonants in a sample of Jordanian children 2 to 6 years of age. The study focused on the standard form of Arabic ("Educated Spoken Arabic"), rather than the vernacular form children would be acquiring as their mother tongue, which likely affected the results, especially with respect to late developing sounds. Overall, these studies indicated patterns of Arabic consonant acquisition similar to the patterns typically observed cross-linguistically, but with some Arabic-specific effects.
Thus, just as has been reported for other languages, bilabial and alveolar (nonemphatic) stops and nasals were the earliest acquired consonants (acquired by the age of two to three); most fricatives, particularly sibilants, were acquired later (by the age of six), with the rhotics, interdental fricatives, affricates, and emphatic consonants being late acquired (with complete mastery not reached until the age of nine). Interestingly, these studies indicated that atypical trajectories of phonological development in Arabic might be best captured by tests that emphasize the production of consonants in word-medial positions, typically ignored in English-based tests of articulation.
A study of phonological disorders in Jordanian Arabic demonstrated that children with phonological deficits employ a variety of substitutions and simplifying lenitive processes, both common in normal development cross-linguistically, such as fronting, stopping, prevocalic voicing, deaffrication, assimilation, final consonant deletion, and consonant cluster reduction, as well as atypical processes also found in disordered phonological development in other languages, such as backing and glottal replacement, and Arabic-specific processes, such as deemphasis and spirantization (Bader, 2009).
A descriptive study of phonological development in 140 Qatari children 16 to 40 months of age (Al-Buainain, Shain, Al-Timimy, & Khattab, 2012) reported examples and frequencies of occurrence for various phonological errors in children's spontaneous speech, largely consistent with patterns found in other studies of phonological acquisition of Arabic.
Finally, a recent study of speech sound acquisition in Kuwaiti Arabic found more advanced speech sound acquisition compared to previous studies, with many complex sounds in place in the speech of 4-year-olds, including pharyngeals and uvulars. The emphatic consonants, the trilled /r/, and sibilants were the groups of sounds still produced with errors in the speech of 5-year-olds (Ayyad, Bernhardt, & Stemberger, 2016).

Morphology
A unique aspect of Arabic grammar that makes it particularly interesting for the study of DLD is its pervasive morphological complexity, such that nearly all words are morphologically complex, containing at least two templatic morphemes: a tri- or quadriconsonantal root, which encodes the semantic meaning (Holes, 2004), and a vocalic pattern, which denotes grammatical information (e.g., part of speech, tense, number). The abstract templates (i.e., consonantal roots and the vocalisms) constitute separate morphemes (Habash, 2007) that have to be acquired separately by the child. As these morphemes never occur as continuous phonetic entities, they must be inferred from underlying distributional patterns (Boudelaa, Pulvermuller, Hauk, Shtyrov, & Marslen-Wilson, 2010). Another unique property of Arabic morphology is that there is no clear "division of labor" between templatic and affixal morphemes with respect to their function (inflection versus derivation), as both perform each function, thus blurring the lines between inflectional and derivational morphology.
There are few published studies of the acquisition of morphology by Arabic-speaking children. Published studies, limited by small sample sizes, focused on plural noun inflection (Abdalla, Aljenaie, & Mahfoudhi, 2013) and tense and agreement (Abdalla & Crago, 2008). The latter study of Hijazi Arabic (spoken in Jedda, Saudi Arabia) reported over 94% correct subject-verb agreement marking in both present and past tense in typically developing children between 2.0 and 5.2 years of age. In contrast, it was found that typically developing children still may not reach high proficiency with plural marking of nouns by the age of 5 (Abdalla et al., 2013), resorting to regular substitutions of irregular ("broken") plural forms as well as rule-based masculine "sound plural" with a default feminine sound plural. The findings regarding children with DLD were in accord with what has been reported cross-linguistically, namely that children with DLD have pronounced difficulty with verbal tense and agreement morphology (Abdalla & Crago, 2008). The errors included errors of omission, when children used an imperfective bare stem, thought of as an acquisitional default. Children with DLD also substantially underperformed on plural noun inflection, a grammatical category that has shown mixed results in cross-linguistic studies of DLD (Abdalla et al., 2013), but has been proposed as a potential specific language impairment (SLI) marker in Arabic.

Syntax
There is a notable scarcity of research with respect to the development of syntax in Arabic. Arabic subject-verb word order is variable and determined by a complex confluence of factors related to the properties of a given noun or a pronoun, definite or indefinite, as well as its discourse-pragmatic status, i.e., whether it expresses given or new information (Owens, Dodsworth, & Rockwood, 2009). Despite this greater variability of word order, a study of the acquisition of Egyptian Arabic (Omar, 1973), based on recorded speech samples of 37 children from 6 months to 15 years old, indicated that most aspects of syntax were acquired by the children on a timeline comparable to what was reported for other languages (between 2 and 5 years of age), and that their word order was flexible from early on. This suggested that typically developing children successfully master the complex relationship between word order and the syntactic/pragmatic categories that determine it at an early age. With regard to the syntactic development of children with DLD, an unpublished study (Shaalan, 2010) demonstrated that child speakers of Gulf Arabic affected with DLD had pronounced difficulties with complex syntactic structures, particularly those that involved the fronting of object noun phrases and verbs.

Literacy
Studies of literacy acquisition in Arabic indicate that while the acquisition of reading skills in Arabic is heavily dependent on both visual and phonological processing skills, the development of phonemic awareness and phonological memory is an especially critical area of weakness for Arabic children with DLD and reading disability (Abu-Rabia, Share, & Mansour, 2003; Ibrahim, 2011). This supports the idea that reading acquisition follows certain universal patterns across alphabetic and consonantal (abjad) orthographies, as phonological awareness has been shown to be an important predictor of reading development across a number of languages (Ziegler et al., 2010). This research also pointed out some interesting Arabic orthography-specific effects, namely demonstrating that phonological processing might be an even more critical skill for Arabic literacy acquisition given the existence of two parallel lexicons (one of the spoken and the other of the literary language), which places great demands on the phonological abilities of the child (Bentin & Ibrahim, 1996; Ibrahim, 2011).

Language variation and diglossia in Arabic
Like many other languages, and perhaps to a much greater extent than most other languages, due to being the mother tongue in many countries across a large geographic area, Arabic is characterized by considerable variation among its spoken varieties. This presents a challenge for creating a language development assessment, as it needs to be adapted to each spoken variety or be dialect-neutral. Furthermore, because of the exclusive use of Classical or Modern Standard Arabic as the language of literacy in the Arab world, the spoken language differs from the written/formal Arabic. Thus, although the distinction between written and spoken language and formal and informal registers exists in many languages, Arabic has a more complex situation known as diglossia (Ferguson, 1959), that is, regular parallel use of two distinct languages for distinct purposes: one in everyday life, the other in formal situations and written communication. Classical Arabic, as the language of the holy Qur'an, has been carefully preserved in written form as its modernized equivalent, Modern Standard Arabic (MSA). This highly codified formal language, largely learned through formal education, exists alongside the dynamic and constantly evolving spoken vernacular. The two are highly divergent from each other. The linguistic distance between the two Arabic varieties is particularly acute in phonology, morphosyntax, word formation, and syntax, including phonemic inventories, syllabic structure, phonotactic constraints, stress patterns, and inflectional categories (Aoun, Benmamoun, & Sportiche, 1994; Aoun, Choueiri, & Benmamoun, 2010; Brustad, 2000; Holes, 2004; Watson, 2002). For example, MSA has a richer system of agreement compared to the less differentiated systems in some spoken varieties, as well as different word order, distribution and frequency of verbal patterns, passivization, and nominal constructions (Benmamoun, 2000; Shawarbah, 2007; Shlonsky, 1997). Lexically, MSA and spoken varieties overlap only partially, with 80% of the lexicon of young children consisting of words with divergent forms in spoken and formal Arabic.
One consequence of the existence of the spoken vernaculars alongside the standard written form, each reserved for a distinct purpose, is that the latter is revered by Arabic speakers the world over as beautiful, perfect and "correct," while the former has low prestige and is not even always acknowledged as a legitimate language, but rather as "slang" (Versteegh, 1996). This attitude creates a serious challenge for developing assessments of child language development, as applying the prescriptive norms of literary Arabic to a spoken language assessment would not yield a valid measure. However, as all educated adults are trained on these prescriptive norms, it is quite challenging to overcome these language attitudes and develop norms based on the spoken language. Secondly, because of the linguistic distance between MSA and spoken Arabic, learning literary Arabic in school has been likened to learning a second language (Maamouri, 1998). For example, one study demonstrated that Arabic-speaking students faced with a lexical decision task in spoken Arabic exhibited priming effects only when primed with a spoken Arabic word and not with either Hebrew or literary Arabic words (Ibrahim, 2009). This suggests that for Arabic speakers, who are bi-dialectal, their spoken and written lexicons are distinct, similarly to what was found with respect to lexical organization of nonbalanced bilinguals (Kroll & de Groot, 1997; Sholl, Sankaranarayanan, & Kroll, 1995). This means that for Arabic-speaking children literacy develops essentially in a second language (spoken Arabic being the primary and literary Arabic being the secondary language); hence, literacy acquisition in Arabic is a process substantially different from that in monolingual, nondiglossic environments.

Need for a standardized assessment tool of spoken Arabic
Due to the unique diglossic nature of the language learning environment in Arabic-speaking countries and the properties of the Arabic language (including its complex root-and-pattern morphology and flexible word order) and writing system (e.g., consonantal script), quantitative studies of the acquisition of Arabic with adequate sample sizes are very important, as they may contribute unique insights to the understanding of universal aspects of language development and impairment, as well as show how language-specific characteristics influence typical patterns of acquisition and the presentation of DLD.
Currently, it is widely acknowledged that there is a dearth of research on DLD in Arabic (Abdalla & Crago, 2008) and an even more pressing need for the development of theory-rooted and psychometrically sound assessment tools. The majority of existing language tests used in Arabic are direct translations of English tests, which have validity problems, as they show lack of sensitivity to age-related change in the age range 6-9 (Balilah & Archibald, 2018). There are few tests that have been developed for and validated on Arabic-speaking children. A recent special issue of the Arab Journal of Applied Linguistics (Mahfoudhi & Abdalla, 2017) included a paper that presented a set of Arabic developmental language tests for Qatari-speaking children (Shaalan, 2017), which showed limited sensitivity to age-related change. The only other assessment of language development in Arabic with a preliminary report of adequate psychometric properties is the Egyptian Arabic Pragmatic Language Test (EAPLT) (Khodeir, Hegazi, & Saleh, 2017).
This means that the validation and verification of the battery presented in the current study, the ALEF battery, will provide an invaluable tool for identifying and measuring language development in Gulf Arabic-speaking children. The ALEF is a comprehensive standardized assessment of language comprised of three modules. The main battery involves a wide range of tests of spoken language, encompassing all linguistic domains in both expressive and receptive modalities: phonology (i.e., word articulation, sound discrimination), lexicon (receptive and expressive vocabulary), grammatical morphology, comprehension, production and imitation of complex syntax, and pragmatic knowledge. In addition, there are two supplementary batteries: one comprised of measures of important prerequisite cognitive processes relevant for language development (i.e., short-term memory, pseudoword repetition, serial naming speed, and phonological awareness); and another targeting basic literacy skills (word and pseudoword decoding, spelling, paragraph reading fluency, and reading comprehension). Thus, the ALEF provides a way to systematically assess a wide variety of language skills in children from 3 to 11 years of age, constituting an invaluable tool for both clinicians and researchers.

Participants
We recruited 467 Arabic-speaking children from local kindergartens and schools in the Dammam region, where the population is relatively ethnically homogeneous and predominantly native Saudi. Only public child-care centers and schools were sampled. Based on observations in the literature (Denman et al., 2017) and preliminary power analyses, the recruitment design aimed for 20 children (10 boys, 10 girls) in each of the 13 sampled 6-month age bands. The inclusion criteria included: (a) Arabic as the native language; (b) at least two generations (parent and grandparent) of Saudi nationalities; (c) knowledge of the tribal origin of the family; and (d) Saudi citizenship. The exclusion criteria included known history of (a) any developmental or educational problems; (b) known medication for serious conditions (e.g., epilepsy); and (c) any explicit knowledge, by administrators at child care centers and schools, of ongoing source of stress in children's lives (e.g., loss of a parent, family distress).
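The banded recruitment design described above can be illustrated with a short sketch. It is not the authors' procedure: the function names are hypothetical, and the assumption that bands start on half-year boundaries (e.g., 6.0 to 6.5) is ours, since the text does not state the band edges. Only the quota figures (13 bands, 20 children per band, 10 per sex) come from the text.

```python
import math

# Figures taken from the recruitment design in the text.
N_BANDS = 13
CHILDREN_PER_BAND = 20
CHILDREN_PER_SEX_PER_BAND = 10

# Assumption (not stated in the text): bands are aligned to half-year boundaries.
BAND_WIDTH_YEARS = 0.5


def band_start(age_years: float) -> float:
    """Return the lower edge of the 6-month band containing a decimal age."""
    return math.floor(age_years / BAND_WIDTH_YEARS) * BAND_WIDTH_YEARS


def target_sample_size() -> int:
    """Total number of children implied by the per-band quota."""
    return N_BANDS * CHILDREN_PER_BAND


if __name__ == "__main__":
    for age in (2.75, 6.22, 10.92):
        print(f"age {age:.2f} -> band starting at {band_start(age):.1f}")
    print("quota implied by the design:", target_sample_size())
```

Under these assumptions the quota amounts to 26 band-by-sex cells of 10 children each; how that quota relates to the two data collection waves is not specified in this section.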
Our goal was to develop a comprehensive, theory-driven and psychometrically sound assessment of language development and functioning in Arabic. Correspondingly, for the vast majority of the 17 ALEF subtests, described in the ALEF Battery section, we opted to initially develop an excessive number of test items that would be calibrated in terms of their psychometric properties. Given the large size of the ALEF battery, and since the participants included young children, both fatigue and logistical considerations led to a study design that involved administering the ALEF subtests in two independent data collection waves (see Table 1).
For Wave 1 data collection, we administered the following subtests (Expressive Vocabulary, Receptive Vocabulary, Word Articulation, Pseudoword Discrimination, Sentence Imitation, Pseudoword Repetition, Digit Span, and Rapid Automatized Naming) to 240 children in the age range from 2.5 to 10.92 years (M = 6.22, SD = 1.71; 119 girls, 121 boys). For the Wave 2 data collection, we administered the remaining subtests (Sentence Completion, Sentence Comprehension, Pragmatic Knowledge, Spelling, Word Reading, Pseudoword Reading, Paragraph Comprehension, and Phoneme Awareness) to n = 227 children in the age range from 2.75 to 9.83 years (M = 6.40, SD = 1.86; 93 girls, 134 boys). The reading subtests were only administered to children age 5 and above. Note that we administered ALEF to only two children below the age of 3 years. The removal of their data from the dataset did not result in changes in parameter estimates, and we therefore retained their data.
Informed consent was obtained from the children's parents, and all children provided an oral assent for completing the study tasks at the time of the assessment. Ten booklets were assembled to randomize the effect of specific sequences of first and last subtests; no booklet (subtest order) effect was documented. Testing was discontinued for younger children when excessive fatigue was detected.
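As a quick consistency check, the wave-level counts reported above can be summed and compared with the overall sample description (n = 467; 55% boys). The sketch below uses only the counts given in the text; the variable and function names are illustrative.

```python
# Counts of girls and boys per data collection wave, as reported in the text.
WAVES = {
    "Wave 1": {"girls": 119, "boys": 121},
    "Wave 2": {"girls": 93, "boys": 134},
}


def summarize(waves: dict) -> dict:
    """Return per-wave totals, the overall total, and the percentage of boys."""
    per_wave = {name: c["girls"] + c["boys"] for name, c in waves.items()}
    total = sum(per_wave.values())
    boys = sum(c["boys"] for c in waves.values())
    return {
        "per_wave": per_wave,
        "total": total,
        "percent_boys": 100.0 * boys / total,
    }


if __name__ == "__main__":
    summary = summarize(WAVES)
    print(summary["per_wave"])
    print("total:", summary["total"])
    print(f"boys: {summary['percent_boys']:.1f}%")
```

The two waves sum to the reported total, and the share of boys rounds to the 55% given in the sample description.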

ALEF battery
The ALEF battery is an individually administered assessment of language development in Arabic that consists of three modules: ALEF-SL (Spoken Language, the core module), ALEF-WL (Written Language, a supplemental module), and ALEF-CP (Language-Related Cognitive Processes, a supplemental module).
The choice of subtests for the ALEF was guided by two chief considerations. First, we followed the descriptive-developmental model of DLD as the one with the most clinical utility (Paul & Norbury, 2012). According to this model, the best-evidence method of language assessment for the purpose of identifying DLD includes obtaining a detailed description of a child's performance on a range of linguistic tasks involving speaking and listening (the core module ALEF-SL), and locating the child's performance in the sequence of normal development. These tasks may be augmented by, but not replaced with, an assessment of collateral skills (e.g., short-term memory; ALEF-CP). In addition, for school-age children, the evaluation of literacy skills (ALEF-WL) may be appropriate. Overall, whether used for the purposes of establishing base rate performance for selecting intervention goals, tracking intervention progress, or for grouping children in research studies, any subset of the subtests may be used in any combination, as appropriate.
Secondly, the subtests selected for each module aim at being fully comprehensive, covering a wide range of skills in all major linguistic domains in different modalities. For example, the subtests in ALEF-SL cover skills in all oral language domains in both expressive and receptive modalities within the form-content-use framework (Bloom & Lahey, 1978; Paul & Norbury, 2012): phonology, morphology and syntax (form), lexical semantics (content) and pragmatics (use). The subtests in ALEF-WL cover phonological decoding, reading fluency and reading comprehension, as well as spelling. Finally, ALEF-CP combines subtests that can be grouped under the umbrella of "phonological processing," broadly defined, and include subtests similar to those included in the Comprehensive Test of Phonological Processing, CTOPP-2 (Wagner et al., 1999), widely used to assess language and reading-related skills in children.
With respect to specific skills, the core, spoken language (ALEF-SL) module (see Figure 1)
The development of ALEF was carefully aligned with the considerations presented in the Introduction, and was conducted by an international team comprised of specialists in language acquisition, psychometrics, neuro-cognitive development and the Arabic language. The development of the test structure and test items was completed in several iterations with three rounds of reviews of the test materials (including stimuli, instructions, and training and scoring materials for the test administrators and data collectors) conducted by a panel of specialists from King Faisal University. Illustrations for test items were commissioned from a professional graphical artist, and were carefully reviewed for cultural appropriateness as well as construct validity with respect to the targeted lexical and grammatical categories. Test administration was performed by a set of trained school teachers who received a two-day training course on the basics of the assessment procedures in general and the details of the ALEF administration in particular.

Spoken language (ALEF-SL) module
Receptive Vocabulary and Expressive Vocabulary. The aim of these two subtests is to assess the receptive and expressive knowledge of concrete vocabulary and basic concepts using developmentally appropriate lexical items sampled from main grammatical categories (nouns, verbs, adjectives, adverbs, and prepositions), as well as various semantic classes (animals, transportation, home/household items, body parts, nature, food, clothing, materials, emotions, actions, colors, and shapes). In addition, they assess the knowledge of verbs sampled from all Arabic verb classes (I-X). Words were selected and arranged according to a three-tier taxonomy: (a) basic, high-frequency words learned in everyday social interactions; (b) high- and moderate-frequency words mainly learned through educational experiences; and (c) more advanced words, learned from exposure to scholarly texts. The words were selected from a frequency dictionary of Arabic (Buckwalter & Parkinson, 2014) and their appropriateness for inclusion was confirmed independently by native Arabic-speaking experts with advanced degrees in Psychology and Linguistics. The procedure for the Expressive Vocabulary subtest is picture naming: the child is shown pictures and asked to name each one. In the Receptive Vocabulary subtest, the child is shown three pictures for each item (one depicting the target object, concept or action, and two distractors) and asked to point to the picture depicting the target given by the test administrator. Each subtest took around 10 minutes to complete.
Sentence Comprehension and Sentence Completion. These two subtests measure the child's receptive and expressive grammar skills. They are designed to target the following specific aspects of Arabic grammar: formation of plural nouns, verb tenses, subject-verb agreement, and comparative forms of adjectives. During the receptive subtest, Sentence Comprehension, the child is shown three pictures (one corresponding to the target sentence and two foils) and asked to point to the picture that corresponds to the target sentence. In the corresponding Sentence Completion test, the child is shown one picture and given the sentence describing it, then shown a second, related picture and asked to complete the sentence describing this picture using a cloze format. For example, for an item targeting the child's knowledge of plural noun formation, he or she is given sentences with their corresponding pictures analogous to the following: "In this picture, there is a cat, and here we have two _______" (expected response, "cats"). The administration of these two subtests took around 7 minutes per subtest.
Sentence Imitation. Research in (a)typical language acquisition has established the important role of sentence imitation as a probe into the child's internalized grammatical system. Elicited Sentence Imitation allows us to assess children's knowledge of precise grammatical factors, and, based on any changes (omissions or substitutions) that the child makes in his or her response, to evaluate his or her stage of grammatical development. This assumption is based on the observation confirmed by a large amount of empirical evidence that in order for the child to imitate a structure, the structure must already be part of the child's grammatical competence (Lust, Chien, & Fynn, 1987). In addition, processing and imitating complex sentence structures taps working memory resources (Just & Carpenter, 1992) and is one of the best tasks for differentiating children with DLD from their typically developing peers. Our subtest contains sentences of varying and increasing complexity that target specific syntactic structures (e.g., wh-questions of varying complexity, sentences with negation, sentences containing complex predicates and conjoined clauses, sentences with subordinate clauses of various types and complexity). The child is asked to repeat them exactly as they are spoken by the adult, and the score for each item is assigned based on the number and type of errors (e.g., omission, substitution, permutation). The subtest took around 7 minutes to complete.
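To make the idea of scoring by number and type of errors concrete, the following is a minimal sketch of how omissions and substitutions in an imitated sentence could be tallied automatically. It is not the ALEF scoring rule, which is not specified in this section: the function name, the whitespace tokenization, the English example sentences, and the use of Python's difflib alignment are all illustrative assumptions. Permutations (reordered words) are not detected separately here and would surface as an omission plus an addition.

```python
from difflib import SequenceMatcher


def count_imitation_errors(target: str, response: str) -> dict:
    """Tally word-level differences between a target sentence and a child's response.

    Hypothetical helper: aligns whitespace-separated tokens and classifies
    each mismatch as an omission, substitution, or addition.
    """
    target_tokens = target.split()
    response_tokens = response.split()
    counts = {"omission": 0, "substitution": 0, "addition": 0}

    matcher = SequenceMatcher(a=target_tokens, b=response_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        n_target, n_response = i2 - i1, j2 - j1
        if tag == "delete":        # words in the target missing from the response
            counts["omission"] += n_target
        elif tag == "insert":      # extra words produced by the child
            counts["addition"] += n_response
        elif tag == "replace":     # words swapped for other words
            counts["substitution"] += min(n_target, n_response)
            if n_target > n_response:
                counts["omission"] += n_target - n_response
            else:
                counts["addition"] += n_response - n_target
    return counts


if __name__ == "__main__":
    print(count_imitation_errors("the boy who fell is crying",
                                 "the boy fell is crying"))
    print(count_imitation_errors("the girl is eating an apple",
                                 "the girl is eating the apple"))
```

A tally of this kind would only be a starting point; converting error counts into an item score requires the weighting conventions defined in the test's own scoring materials.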
Story Telling. Research has shown that eliciting narratives is a valid measure of language development both in typically and atypically developing children (Norbury & Bishop, 2003). For the purpose of the assessment, we developed a supplementary subtest in the form of a culturally appropriate wordless picture story (analogous to the one widely used in the West, "Frog, Where are You?" by M. Meyer) with original illustrations. The child is asked to look through all of the pictures and then tell a story about them.
The narrative is recorded and may be scored using two different options, as dictated by the specific needs of the researcher/clinician. First, there is a brief option that does not require transcription and therefore can be used in clinical practice by implementing a "rough-and-ready" scoring procedure on the following scales: (a) intelligibility; (b) amount of output; (c) grammatical well-formedness; and (d) global narrative quality. The second option is more detailed and involves transcribing the narrative and scoring it on a number of lexical, syntactic and pragmatic characteristics, which provide a rich source of information with regard to the child's level of productive language, measured in an ecologically valid way (as part of connected discourse). Given the open-ended nature of this subtest and the resource demands of appropriate transcription and scoring procedures, we do not report the data on Story Telling in this manuscript, while explicitly stipulating that such an analysis will be performed separately through the lens of clinical linguistics and language acquisition. The subtest took around 8 minutes to complete.
Pragmatic Knowledge. This subtest consists of two parts. Part 1 (Social Use of Language) is aimed at measuring the child's familiarity with the conventions guiding the use of language in social situations. This part is designed for use with children 3-11 years of age and consists of orally presented situations (accompanied by pictures as memory aids) followed by a question. Each item describes an aspect of daily life that requires using language in a certain way to accomplish a particular goal (e.g., greeting, offering help, asking for help, requesting food, comforting a friend in distress, congratulating). Each response is judged on whether it is (a) appropriate for the situation; (b) in a socially appropriate form; and (c) formulated correctly. Part 2 of the subtest (Inferences) is aimed at measuring more advanced pragmatic skills, namely being able to infer implied meaning from stated information. In some items the inference depends on world knowledge, in others on the knowledge of semantically complex words, and in still others on familiarity with conversational conventions. Thus, it assesses children's knowledge of those aspects of meaning construction that go beyond lexical semantics and the meaning directly asserted in the sentence. This part is aimed at children 6-12 years of age and includes mini-scenarios and a question posed to the child at the end of each situation. The child is instructed to listen to the stories and answer the questions using clues from the story. Items targeting different types of implied meaning and requiring imputation of the speaker's intention are included. The scoring rule for each item is explicitly provided. This subtest specifically targets the language difficulties typical of children with Autism Spectrum Disorders and Social Communication/Pragmatic Language Disorder. The subtest took around 12 minutes to complete.
Word Articulation and Pseudoword Discrimination. These are two phonetic tests aimed at evaluating the sound inventory development of spoken Arabic. In the Word Articulation subtest, the child is given an opportunity to pronounce a list of words, each containing either a certain developmentally significant consonant or a difficult-to-articulate consonant cluster in the initial, medial or final position. To avoid providing an auditory prompt that contains the target sound (which the child could simply imitate), the child is prompted by a picture depicting each item and asked to complete a sentence spoken by the test administrator using the word depicted in the visual prompt. In the Pseudoword Discrimination subtest, the child is given pairs of contrasting pseudowords (i.e., identical except for one consonant or consonant cluster) interspersed with identical pairs and asked to judge whether the two in each pair are the same or different. This procedure is aimed at testing the child's ability for sound discrimination, a skill crucial for both spoken and subsequent written language development. Each of the two subtests took around 8 minutes to complete.

Written language (ALEF-WL) module
The written language module aims at evaluating a range of literacy skills and includes the following subtests: Spelling, Word Reading, Pseudoword Reading, and Paragraph Reading. This part of the assessment includes the measures typically used to assess a child's reading and spelling skills: namely, (a) a dictated Spelling task (a list of words of varying frequency and complexity is presented first in isolation and then in a syntactic frame, and the child is asked to spell each word; 5 minutes); (b) Word Reading and Pseudoword Reading (a list of words and pseudowords of varying complexity spelled in the vowelized script that the child is asked to read out loud; the number of items correctly read in 1 minute is recorded); and (c) a Paragraph Reading subtest that indexes reading fluency and reading comprehension (two age-appropriate passages that the child is asked to read, while the examiner records the number of words read correctly in one minute; the child is then asked to answer multiple questions based on the content of the passages to assess reading comprehension; 15 minutes). This module is designed to be used with school-age children and is aimed at identifying difficulties with literacy acquisition, characteristic of children with various types of DLD.

Collateral cognitive processes (ALEF-CP) module
This part of the assessment evaluates cognitive components identified as relevant prerequisites for language and literacy development, namely, verbal working memory (Digit Span), phonological short-term memory (Pseudoword Repetition), naming speed (Rapid Automatized Naming), and Phonological Awareness (Elision task). These are commonly used short assessments that involve asking the child to repeat series of digits, forward and backward, repeat pseudowords of increasing length and complexity, name as fast as they can a serially arranged recurring set of letters, numbers, colors and objects (depicted on 4 cards, one for each type of stimuli), and to repeat a word and then modify it by saying it without one sound or syllable elided either from the beginning, the middle or the end of the word. Pseudoword repetition tasks have been previously identified as a robust clinical marker for DLD (Conti-Ramsden, Botting, & Faragher, 2001). Phonological awareness and rapid automatized naming have been shown to be impaired in children with reading disability (Vellutino, Fletcher, Snowling, & Scanlon, 2004; Wolf, Bowers, & Biddle, 2000). Finally, verbal working memory has been previously identified as an important factor in both reading comprehension (Carretti, Borella, Cornoldi, & De Beni, 2009) and sentence processing (Montgomery & Evans, 2009). The ALEF-CP module took around 20 minutes to complete.

Statistical analyses
To establish the ALEF's psychometric properties, we used a set of complementary approaches. First, all of the ALEF subtests were examined for their internal consistency (as indexed by Cronbach's coefficient alpha, α) to evaluate reliability. Following recent recommendations in the literature (Dunn, Baguley, & Brunsden, 2014), we also computed McDonald's coefficient omega, ω, a recent alternative to coefficient alpha based on the congeneric (compared to the tau-equivalent) model, which estimates the general factor saturation of a subtest, and follows similar interpretational rules.
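As an illustration of the internal consistency index mentioned above, the following is a minimal sketch of coefficient alpha computed from a children-by-items score matrix. The function name `cronbach_alpha` and the toy data are hypothetical and are not taken from the ALEF dataset; coefficient omega is not shown because it requires fitting a factor model.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Coefficient alpha for a score matrix (one row per child, one column per item).

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(scores[0])                       # number of items
    item_columns = list(zip(*scores))        # transpose: one tuple per item
    item_variances = [variance(col) for col in item_columns]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical toy data: 5 children, 4 dichotomous items.
toy_scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(toy_scores)
```

The sketch uses sample variances throughout; using population variances for both the items and the total gives the same ratio and therefore the same alpha.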
In addition, we fit a set of item response theory (IRT) models to item-level response pattern data. Unidimensional IRT models were fit using the mirt package for R (Chalmers, 2012): 2PL (estimating item difficulty and discrimination) models were fit for Receptive Vocabulary, Word Articulation, Pseudoword Discrimination, Pseudoword Repetition, Spelling, Phoneme Awareness, Sentence Comprehension, Pragmatic Knowledge, Word and Nonword Reading. A generalized partial credit model (GPCM) was fit to the data for subtests that allowed partial credit and required rubric-based scoring: Sentence Imitation, Sentence Completion, and Expressive Vocabulary. Subtest scores were computed using the expected a posteriori (EAP) method after the removal of items that showed <1% or >95% average response accuracy (for dichotomous items), showed a negative item-total correlation, or displayed significant IRT misfit according to the plausible value statistic (PV p < .05).
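The first two item-removal criteria described above (extreme accuracy and a negative item-total correlation) can be sketched as follows. This is an illustration under stated assumptions: the function `screen_items` and the toy responses are hypothetical, the correlation is computed against the total score with the item itself excluded (the text does not say which variant was used), and the IRT misfit criterion is not reproduced here.

```python
import math
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either variable is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def screen_items(responses, low=0.01, high=0.95):
    """Return indices of dichotomous items that pass both screening rules."""
    keep = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        rest = [sum(row) - row[j] for row in responses]  # total without item j
        accuracy = mean(item)
        if accuracy < low or accuracy > high:
            continue                                     # too hard or too easy
        if pearson(item, rest) < 0:
            continue                                     # negative item-total r
        keep.append(j)
    return keep


# Hypothetical toy data: item 0 is answered correctly by everyone,
# item 5 runs against the rest of the scale.
toy_responses = [
    [1, 0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 1],
    [1, 1, 1, 0, 0, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 0],
]
kept = screen_items(toy_responses)
```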
Age and gender effects were established by fitting a set of linear models to the data with centered age and age² (to account for possible nonlinearity) terms and dummy-coded gender (0 = girl, 1 = boy) as fixed effects predicting subtest scores. P-values were obtained using permutations for the linear model as implemented in the lmPerm R package.
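The model specification above (intercept, centered age, squared centered age, dummy-coded gender) can be sketched in a few lines. The permutation step shown here simply shuffles the outcome and refits the model; it is a simplified stand-in for illustration and is not the algorithm implemented in lmPerm. All function names and the toy data are hypothetical.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        residual = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = residual / m[i][i]
    return x


def design_matrix(ages, boys):
    """Columns: intercept, centered age, squared centered age, gender (1 = boy)."""
    mean_age = sum(ages) / len(ages)
    return [[1.0, a - mean_age, (a - mean_age) ** 2, float(g)]
            for a, g in zip(ages, boys)]


def ols(design, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


def permutation_p(design, y, index, n_perm=2000, seed=1):
    """Two-sided permutation p-value for one coefficient (outcome shuffling)."""
    rng = random.Random(seed)
    observed = abs(ols(design, y)[index])
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(ols(design, shuffled)[index]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: 8 children, scores generated without noise.
toy_ages = [3, 4, 5, 6, 7, 8, 9, 10]
toy_boys = [0, 1, 0, 1, 0, 1, 0, 1]
X = design_matrix(toy_ages, toy_boys)
toy_scores = [2 + 0.5 * r[1] - 0.1 * r[2] + 0.3 * r[3] for r in X]
beta = ols(X, toy_scores)   # approximately [2, 0.5, -0.1, 0.3]
```

Centering age before squaring it reduces the correlation between the linear and quadratic terms, which makes the two coefficients easier to interpret separately.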
If the child failed to comply with the instructions or did not provide a single correct response for the entire subtest, their data were coded as missing.

Results
Preliminary psychometric analyses of the ALEF battery found that all of the subtests (except for one paragraph in Reading Comprehension) had high to very high internal consistency estimates (generally convergent for Cronbach's and McDonald's estimates). After the removal of poorly performing items (from 2.9% of items in Pseudoword Repetition to 18.9% of items in Receptive Vocabulary), internal consistency coefficients ranged from α = .67 to .98 (Median = .88) and ω = .78 to .98 (Median = .89), providing strong support for ALEF's reliability as an assessment of early language development in multiple language domains and modalities (see also Table 1 for a summary of key psychometric properties).
IRT models explained from 21% (Sentence Completion) to 72% (Spelling) of variance in item response patterns (Median estimate = 44.5%), with estimated empirical reliabilities for EAP scores ranging from .78 (Sentence Comprehension) to .96 (Word Reading and Sentence Imitation; Median estimate = .90). Obtained item characteristic curves (ICCs) are presented in the Supplementary Data.
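For readers unfamiliar with the quantities reported above, the following sketch shows a 2PL item characteristic curve and an EAP ability estimate obtained by numerical integration over a standard normal prior. It assumes the traditional discrimination/difficulty parameterization and hypothetical item parameters; it is not the estimation routine used by the mirt package, and the function names are illustrative.

```python
import math


def p_correct(theta, a, b):
    """2PL item characteristic curve: P(correct | ability theta).

    a = discrimination, b = difficulty (hypothetical parameter values below).
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def eap_score(responses, items, n_points=61, lo=-4.0, hi=4.0):
    """Expected a posteriori ability estimate on an evenly spaced theta grid.

    responses: list of 1/0 scores (None = item not administered)
    items:     list of (a, b) tuples, one per item
    """
    step = (hi - lo) / (n_points - 1)
    numerator = denominator = 0.0
    for i in range(n_points):
        theta = lo + i * step
        weight = math.exp(-0.5 * theta * theta)   # unnormalized N(0, 1) prior
        for u, (a, b) in zip(responses, items):
            if u is None:
                continue
            p = p_correct(theta, a, b)
            weight *= p if u == 1 else 1.0 - p    # likelihood of the response
        numerator += theta * weight
        denominator += weight
    return numerator / denominator


# Hypothetical two-item example with difficulties placed symmetrically around 0.
toy_items = [(1.0, -1.0), (1.0, 1.0)]
theta_high = eap_score([1, 1], toy_items)
theta_low = eap_score([0, 0], toy_items)
```

Because the prior pulls estimates toward zero, EAP scores remain finite even for all-correct or all-incorrect response patterns.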
Subtest scores were generally normally distributed, but with significant (Shapiro-Wilk's p < .05, corrected for multiple testing using the Bonferroni correction) deviations from normality established for Word Articulation and Pseudoword Discrimination (due to ceiling effects), as well as Receptive Vocabulary, Sentence Imitation, Spelling, Word and Nonword Reading subtests, and the RAN measures. Visual analyses of the subtest score distributions suggested that the departures from normality in Receptive Vocabulary, Sentence Imitation, Spelling, and reading subtests were likely due to the presence of a subpopulation with low scores that formed the second peak in the distributions, potentially pointing to possible DLD and RD-related deficits.
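The Bonferroni correction referred to above amounts to multiplying each p-value by the number of tests (capped at 1) before comparing it with the nominal threshold. A minimal sketch, with a hypothetical function name and made-up p-values:

```python
def bonferroni(p_values, alpha=0.05):
    """Return (adjusted p-value, significant?) pairs under Bonferroni correction."""
    m = len(p_values)                                  # number of tests
    adjusted = [min(1.0, p * m) for p in p_values]
    return [(p_adj, p_adj < alpha) for p_adj in adjusted]


# Hypothetical p-values for three normality tests.
toy_results = bonferroni([0.001, 0.02, 0.5])
```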
Gender and age effects were evaluated jointly in a set of linear models summarized in Table 2. Demographic variables jointly explained from 0% (Sentence Comprehension) to 54% (RAN Errors) of the variance in subtest scores, with the median estimate of R² = 0.16. With the exception of Spelling, Sentence Comprehension, and RAN Time, all ALEF subtests were sensitive to age effects (in many cases the quadratic effect for age was also significant) in the predicted direction (i.e., ability estimates increased with age), pointing to the validity of ALEF as a test of language development.
Corrected for multiple comparisons, only two subtests displayed significant gender effects: boys scored slightly higher than girls on Word Articulation (with the standard effect estimate of B = 0.26), and lower than girls on the Pragmatic Knowledge - Part 2 subtest (B = -0.43), indicating that this subtest is sensitive to potential gender differences in the developmental trajectories for pragmatic language skills.
Analyses of subtest intercorrelations were performed separately for wave 1 and wave 2 of the data collection, and revealed a predictable positive manifold of intercorrelations (see Tables 3 and 4). The strongest correlations for wave 1 were obtained between Digit Span and Expressive Vocabulary (r = .60, p < .001) and Digit Span and Pseudoword Repetition (r = .51, p < .001). Notably, naming error counts from the RAN subtest were also negatively related with most wave 1 subtests. For wave 2, the strongest correlations were established between Nonword Reading and Reading Comprehension (r = .61, p < .001), and Nonword Reading and Word Reading (r = .52, p < .001). Interestingly, although Phoneme Awareness scores did not correlate with reading subtests, they were positively related to Spelling (r = .41, p < .001), suggesting a close relationship between phonological skills and spelling ability in Arabic.
Overall, the results indicated that our newly developed assessment of Arabic language development and functioning showed high internal consistency, substantial age sensitivity, minimal gender bias, and predictable patterns of intercorrelations among subtests.

Discussion
The main goal of the study (representing the pilot stage of a long-term effort) was to establish the preliminary psychometric properties of the individual subtests of the ALEF. To date, there is no comparable comprehensive language test of Arabic with adequate psychometric properties. A recent study investigated the validity of the existing language development tests available for Arabic children, the majority of which represent direct translations of English language instruments (Balilah & Archibald, 2018), including the Arabic Expressive-Receptive Vocabulary Test (El-Halees & Wiig, 1999), the Arabic Language Screening Test (El-Halees & Wiig, 1999), as well as the Arabic Picture Vocabulary Test and the Arabic Language Test (Shaalan, 2010). The study found that Arabic language tests were not sensitive to age-related differences across the 6-9 year age range, and that there was a need for tests measuring more complex language skills for older children. Another study pointed to a similar conclusion (Shaalan, 2017): tests suitable for a broader age range of children are needed.
The ALEF, a comprehensive standardized assessment tool for Arabic-speaking children covering all language domains in expressive and receptive modalities, was constructed with careful consideration of the unique linguistic features of Gulf Arabic, with test stimuli controlled for the relevant key linguistic and cultural characteristics. As a result, we developed an instrument that is theory-driven, psychometrically sound, and culturally sensitive. We used a set of complementary analyses to identify a number of misfitting items, which were then removed to shorten the subtests while preserving each subtest's psychometric fidelity. After the removal of misfitting items, all ALEF subtests demonstrated high reliability estimates, generally convergent between classical test theory and IRT, suggesting that they may be used as reliable tools for indexing spoken and written language abilities in Arabic-speaking children. All of the subtests of the ALEF Spoken Language module (with the exception of Sentence Comprehension) showed expected (both linear and nonlinear) age effects, indicating that the ALEF is a valid test of language development. Furthermore, the positive manifold of correlations among the subtests suggests that the ALEF can be effectively (in a flexible way, as permitted by the assessment's structure) used to evaluate Arabic language development in children as young as three years of age. The resulting shorter length makes the ALEF more practical for clinical use.
The test provides researchers and clinicians with a new tool for the evaluation of language development in Arabic-speaking children whose home language is Gulf Arabic, with a possibility of adapting it to a wide variety of spoken Arabic dialects. Because of its breadth, the ALEF can be used to measure selectively and in depth a child's performance in all subdomains of language, either expressively or receptively, to evaluate a child's level of literacy acquisition, and/or to obtain additional indicators measuring cognitive skills known to be pre- (or co-) requisites of language and literacy acquisition (phonological and verbal working memory, rapid automatized naming and phonemic awareness). The subtests can be used in any combination, as age- and developmentally appropriate for each individual child.
The results reported here are too preliminary to recommend the ALEF, in its current state, as a freestanding diagnostic tool. Even after a sufficiently large and representative norming sample for the ALEF is obtained in the future, as with any clinical evaluation, a combination of methods would be required to determine a differential diagnosis (including language sampling, behavioral observation, and a caregiver report). However, even in its current form the ALEF represents an important clinical and research assessment tool for the study of language development and impairment in Arabic. For example, an important clinical application of the instrument would include obtaining a comprehensive description of a child's baseline linguistic profile, including language comprehension (in the domains of Syntax, Semantics, and Pragmatics) and production (Phonology, Morphology, Syntax, Semantics and Pragmatics), which can be used to determine areas of strength and weakness, to establish intervention goals, and to track a child's progress towards achieving therapy goals. Another useful feature of the battery is the inclusion of a number of processing-based tasks (such as pseudoword repetition), known to be sensitive clinical markers of language impairment (Conti-Ramsden, 2003). Furthermore, this task was shown to minimize cultural bias in language assessment (Oetting & Cleveland, 2006), an important consideration for Arabic speakers, given the substantial cultural and linguistic diversity in this population. A clinician may adopt a combination of tests (e.g., Expressive Vocabulary and Sentence Imitation), in conjunction with the Pseudoword Repetition test, as an informal screener (until the test development process is completed and normative tables are published) to identify children with language difficulties. Finally, the test can be used with school-age children, whose main weakness may be manifested as difficulty acquiring literacy skills (including decoding, reading comprehension, and spelling).
While this represents an important advance, it is only a stepping-stone in the creation of a fully functional diagnostic tool. Future research is required to conduct further tests of validity, construct age norms (norm tables separate for boys and girls), adapt the test for use in Arabic-speaking populations from other regions, and determine the test's diagnostic accuracy (i.e., sensitivity, specificity, and predictive value). Further research would allow us to create, in addition to individual subtest scores, composite scores addressing separate modalities and/or language subdomains. For example, subtests may be combined to derive composite Index scores covering such areas as Core Language (e.g., combining Receptive Vocabulary, Sentence Completion, and Sentence Comprehension), Expressive Language (e.g., combining Expressive Vocabulary, Sentence Completion, and Sentence Imitation), and Receptive Language (e.g., combining Receptive Vocabulary, Sentence Comprehension, and Inferences from Pragmatics), as performed, for instance, in standardized instruments in English. This will require a new, larger-scale data collection than reported here, now made possible by the present pilot stage. Which subtests should be combined in each index will be determined using a data-driven approach. Having index scores, in addition to the individual subtest scores and composite module scores, would allow the administration of the ALEF to be individualized, time-efficient, and flexible. Further research will be aimed at overcoming several other limitations of the present study. For example, given the large initial pool of items, it was necessary to administer the assessment in two data waves, with no overlap between individuals in waves 1 and 2, thereby prohibiting us from conducting a covariance-based set of procedures aimed at the construct validation of ALEF (i.e., the confirmatory factor analysis of the structure depicted in Figure 1). Because of the size of the battery at the initial stage of test development, it was also not feasible to collect additional data using other assessment measures for evaluating ALEF's convergent and divergent validity. For example, given the aims of the pilot and the length of the battery, we opted not to add to an already large battery an assessment of general cognition. In the future, such an assessment should be added to rule out participants with intellectual disability, and to provide a measure of divergent validity. Our task was also complicated by the absence of an existing "gold standard" for language assessment in Arabic, which, we hope, the ALEF has the potential to become.
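The diagnostic accuracy indices named above (sensitivity, specificity, and predictive value) all derive from a two-by-two table that crosses test outcome with a reference diagnosis. The sketch below shows the standard definitions; the function name and the counts are hypothetical and do not come from ALEF data, for which no reference standard was available.

```python
def diagnostic_accuracy(tp, fp, fn, tn):
    """Standard accuracy indices from a 2x2 classification table.

    tp: impaired children flagged by the test      (true positives)
    fp: unimpaired children flagged by the test    (false positives)
    fn: impaired children missed by the test       (false negatives)
    tn: unimpaired children passed by the test     (true negatives)
    """
    return {
        "sensitivity": tp / (tp + fn),   # proportion of impaired cases detected
        "specificity": tn / (tn + fp),   # proportion of unimpaired cases passed
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
    }


# Hypothetical counts for illustration only.
toy_accuracy = diagnostic_accuracy(tp=40, fp=20, fn=10, tn=130)
```

Unlike sensitivity and specificity, the two predictive values depend on how common the disorder is in the sample, so they should be reported together with the base rate.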
We also did not examine differential item functioning (DIF) using ALEF subtest data, as DIF analyses generally require the availability of at least 5 to 10 individuals in the dataset per response category for all items. Although this was not the case in our dataset, future studies using the revised ALEF battery will directly examine and remove items that show gender (or other, for example, age, family SES, and parental education) bias. Of note is that the gender effects observed in this study were small in magnitude, and the only significant differences after corrections for multiple testing were obtained for Word Articulation and Pragmatic Knowledge, with boys outperforming girls on the former and underperforming on the latter. The presence of gender differences in Pragmatic Knowledge is notable given that it indexes social use of language skills that form diagnostic features for disorders with traditionally higher prevalence indicators for boys such as autism spectrum disorders and social pragmatic communication disorder.
In sum, this article presents the first step in the development and the pilot usage of the battery, which is now ready to be administered to a larger representative sample of Arabic-speaking children in the Kingdom of Saudi Arabia in order to complete the development of this comprehensive and precise clinical diagnostic tool.
As the development of the ALEF continues, we are confident that future standardization and norming efforts (as well as adaptations to closely related spoken Arabic dialects) will result in a polished product that will advance our understanding of (a)typical language development and related neurodevelopmental disorders.
Due to the limitations of this research, the ALEF in its current stage is not intended to serve as a standalone norm-based diagnostic tool. Furthermore, because of the considerable linguistic and cultural diversity of the Arabic-speaking world, much work is needed to understand how or even whether a single instrument can be used across various Arabic-speaking populations. However, the existence of this instrument and its first encouraging psychometric properties in the scientific literature should encourage researchers working with Arabic-speaking populations in different settings to utilize it in their studies. Collectively, the field should be able to arrive at a situation where one of the largest language groups in the world can count on the availability of a standardized instrument for the assessment of language competencies in children.
The ALEF Spoken Language module consists of the following subtests: Receptive Vocabulary, Expressive Vocabulary, Word Articulation, Pseudoword Discrimination, Sentence Comprehension, Sentence Completion, Sentence Imitation, Story Telling, and Pragmatic Knowledge. ALEF-WL contains the following subtests: Word Reading, Pseudoword Reading, Paragraph Reading, and Spelling. In addition, ALEF-CP contains the following subtests: Pseudoword Repetition, Digit Span, Rapid Automatized Naming, and Phonological Awareness. Together, these 17 subtests represent key linguistic skills tested both receptively and expressively, which differ in their modes of administration and response (e.g., picture selection, picture naming, cloze format).

Table 1.
Psychometric properties of the Arabic Language: Evaluation of Function (ALEF) battery subtests. Note. α = Cronbach's coefficient alpha; ω = McDonald's coefficient omega; Var IRT = proportion of variance in response patterns explained by an IRT model; r_xx = empirical reliability of estimated IRT ability scores; RAN = Rapid Automatized Naming.

Table 2.
Estimated regression model parameters for Arabic Language: Evaluation of Function (ALEF) subtests (with permutation-based p-values).