Content validity to support the use of a computer-based phonological awareness screening and monitoring assessment (Com-PASMA) in the classroom

Purpose. This study investigated the content validity of a computer-based phonological awareness (PA) screening and monitoring assessment (Com-PASMA) designed to evaluate school-entry PA abilities. Establishing content validity by confirming that test items suitably "fit" and sample a spectrum of difficulty levels is critical for ensuring educators can draw accurate information with which to differentiate curricular reading instruction. Method. Ninety-five children, including 21 children with spoken language impairment, participated in a 1-year longitudinal study in which the Com-PASMA was administered at the start, middle and end of the school year. Result. Estimates of content validity using Rasch Model analysis demonstrated that: (1) rhyme oddity and initial phoneme identity tasks were most appropriate at school-entry and sampled a spectrum of difficulty levels, (2) more challenging phoneme-level tasks (e.g. final phoneme identity, phoneme blending, phoneme deletion and phoneme segmentation) became increasingly appropriate and differentiated between high- and low-ability students by the middle and end of the first year of school and (3) letter-knowledge tasks were appropriate but declined in their ability to differentiate student ability as the year progressed. Conclusion. Findings demonstrate that the Com-PASMA has sufficient content validity to measure and differentiate between the PA abilities of 5-year-old children on entry to school.


Introduction
Modern-day educators face the challenge of providing high-quality instruction suitably differentiated to meet the learning needs of an increasingly diverse population of students (Bender, 2012). In the area of reading acquisition, this diversity is well documented throughout international studies of reading achievement such as the Progress in International Reading Literacy Study (PIRLS) and the Programme for International Student Assessment (PISA). For example, in New Zealand educators are confronted with an ongoing "long tail" of underachievement in reading performance, where 14% of 10-year-old students perform at an advanced reading level and 25% of students fail to reach an average reading level (Mullis, Martin, Foy, & Drucker, 2012). It is estimated that up to one in five children in New Zealand classrooms struggle with reading development (Nicholson, 2009). In Australia, up to 10% of 10-year-old students show advanced reading performance.

The role of technology in supporting classroom reading practices

Over the past decade, advances in technology coupled with improved affordability and enhanced utility have seen classrooms worldwide transform into highly connected, digitalized environments. In New Zealand, 92% of primary schools report that more than three-quarters of their classrooms are linked to a network (Johnson, Hedditch, & Yin, 2011). Up to three-quarters of New Zealand primary school principals report that the Internet has a significant impact on teaching and learning practices in their classrooms (Johnson et al., 2011). Notably, the literacy and numeracy initiatives of more than 90% of New Zealand primary schools are supported by resources located online (Johnson et al., 2011). In Australia, the government reports that all schools provide computer and internet access (Australian Government, 2014).
Indeed, the Digital Education Revolution, an Australian Commonwealth government initiative, supported high-speed broadband internet connectivity to schools, with the states and territories committing to provide enhanced technological infrastructure and improved availability of digitalized resources to schools (Australian Government, 2008). Despite the exponential growth of technology in the classroom, few studies have evaluated the role of technology in supporting educators to differentiate between children with high and low phonological awareness (PA) abilities on entry to schooling.

Classroom computer-based phonological awareness screening and monitoring
Although a number of skills support the acquisition of reading proficiency, the omnipresence of PA as a critical contributor to early reading success is widely recognized (Ehri, Nunes, Willows, Schuster, Yaghoub-Zadeh, & Shanahan, 2001). Children who enter school with an impoverished awareness of the syllable, onset-rime and phonemic units of sounds within spoken words are at far greater risk of falling up to 3 years behind in reading acquisition by age 10 compared to their peers who approach reading instruction with these skills (Torgesen, Wagner, & Rashotte, 1994). What is less understood is the role of technology in supporting the routine screening and monitoring of PA as children enter school and engage in their first year of formal education. Few technology-based tools exist to support teachers in measuring the PA skills of children in their classrooms, with even fewer supporting measurement at the critical level of PA, the phoneme level (see Carson, Boustead, & Gillon, 2013a for a review). Many web-based applications that support screening and progress monitoring of literacy skills take the form of a data-management system whereby the educator administers an assessment and then, once completed, enters data into an online system. Examples include the Dynamic Indicators of Basic Early Literacy Skills (DIBELS) (Good & Kaminski, 2003) and the Academic Improvement Measurement System (AIMSweb) (AIMSweb, 2014). Computer-based assessments such as the Cognitive Profiling System (CoPS) (Singleton, Thomas, & Leedale, 1996) and Performance Indicators in Primary Schools (PIPS) (Tymms, 1999) have the added feature of test administration in addition to automated scoring and storage of results; however, these assessment tools only measure the syllable and onset-rime layers of PA, not the critical phoneme level.
In recognition of the lack of technology-based assessments that measure PA at the phoneme level and can administer, record and score results for classroom educators, the authors developed the freely available "Computer-based PA Screening and Monitoring Assessment" (Com-PASMA) in 2010. Specifically, development of the Com-PASMA aimed to provide classroom educators with a method of: (a) time-efficient screening and monitoring of skills known to be predictive of reading outcomes; (b) identifying risk for reading impairment from the outset of formal schooling; and (c) differentiating between high- and low-ability students to support curricular decision-making and, in turn, the minimization of large gaps in reading outcomes.
The Com-PASMA is comprised of six PA and two letter-knowledge (LK) tasks and presents all test items, automatically scores responses and stores data in a centralized database. Collectively, these features enable children to self-administer the assessment in the classroom. Specific skills measured include rhyme oddity (e.g. which word does not rhyme: cat, mat, bus), initial phoneme identity (e.g. what word starts with the /s/ sound: bee, sun, tent), final phoneme identity (e.g. what word ends with the /t/ sound: hat, hole, sun), phoneme blending (e.g. what word do you think I am saying? m-ou-se), phoneme deletion (e.g. if I say "spin" without the /s/ sound, what word do we get?), phoneme segmentation (e.g. how many sounds do you hear in the word "hand"?), letter-name recognition (e.g. show me the letter "m") and letter-sound recognition (e.g. what letter makes the /mmmm/ sound?).
Recent studies show that the Com-PASMA is 30% faster in administration time compared to paper-based equivalents (Carson, Gillon, & Boustead, 2011), produces scores congruent with identical paper-based testing methods (Carson et al., 2011) and can predict reading outcomes at 6 years of age with 94% accuracy when administered at 5 and 5.5 years of age (Carson et al., 2013a). Alongside the reporting of time-efficiency and predictive validity, it is important that content validity be established to ensure test items appropriately "fit" and sample a range of abilities. This is particularly important given that technology-assisted assessment practices may support educators in differentiating between children with high and low PA abilities on entry to school, in an effort to help redress international issues related to widening gaps in reading achievement.

Establishing content validity of computer-based phonological awareness screening and monitoring
According to Guernsey, Levine, Chiong, and Severns (2012), the technological boom has resulted in a significant quantity of easy-to-download literacy-based applications claiming to support reading development, with little evidence of validity, reliability, efficacy or effectiveness to support their claims. The purpose of the current investigation is to report the content validity of the Com-PASMA. Content validity refers to the process of ensuring that test content accurately reflects the knowledge being measured (Anastasi & Urbina, 1997). According to Lissitz and Samuelsen (2007), test items are the building blocks of any measurement tool and are, therefore, a critical source of content validity. Construction of a test with a sufficient degree of content validity includes analysing items within a test to determine whether they are appropriate for the intended population and whether they sample a range of difficulty that enables differentiation between high- and low-ability students (Crocker & Algina, 1986); this is a particularly important feature for classroom assessment, planning and the minimization of risk for future reading impairment.
A number of statistical methods can be applied to establish the content validity of an assessment tool, one of which is Rasch Model analysis (Bond & Fox, 2007). The Rasch Model is a measurement model whereby the "fit" and difficulty spectrum of test items relative to an intended population can be formally reviewed, and it was selected as the model of choice in the current investigation. This model provides a theoretical range against which test developers can compare patterns of responses to determine whether items show a "fit" or "misfit" to the ability of test takers. Items deviating from the ideal range (i.e. showing a misfit) require adaptation or removal from the measurement tool. The model evaluates a test item in terms of difficulty (the item parameter) and people in terms of their ability (the person parameter) to score a correct or incorrect response on a particular test item. The Rasch Model is occasionally referred to as the One-Parameter Logistic Model under Item Response Theory (IRT). Despite sharing a mathematical similarity, these models differ on a conceptual level (Baker, 2001). In the Rasch Model, data must conform to the properties of the model for measurement to take place (Andrich, 2004). Items that do not conform to the model (i.e. show a model "misfit") require careful investigation and an explanation as to why this is the case. In IRT, the importance of data-model fit is also emphasized; however, additional model parameters enable the model to adjust to reflect the pattern of the data (Embretson & Reise, 2000).
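The core of the model can be illustrated with a short sketch (illustrative only; not part of the Com-PASMA software): under the Rasch Model, the probability that a child answers an item correctly depends solely on the difference between the child's ability and the item's difficulty, both expressed in logits.

```python
import math

def rasch_p_correct(ability: float, difficulty: float) -> float:
    """Probability that a person with the given ability (theta, in logits)
    answers an item with the given difficulty (b, in logits) correctly:
    P = exp(theta - b) / (1 + exp(theta - b))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# When ability equals difficulty, the chance of success is exactly 0.5;
# a more able child has a higher chance on the same item.
print(round(rasch_p_correct(0.0, 0.0), 2))  # 0.5
print(round(rasch_p_correct(2.0, 0.0), 2))  # 0.88
```

Because only this one difference drives the prediction, observed response patterns that deviate strongly from these expected probabilities are what the "misfit" diagnostics below detect.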
To ensure the Com-PASMA contains satisfactory levels of content validity to support educators in accurately differentiating between children with high and low PA abilities, the following hypotheses were proposed:
(1) Test items within each task in the Com-PASMA will demonstrate an appropriate "fit" to the abilities of 5-year-old children in the first year of formal schooling.
(2) Test items within each task in the Com-PASMA will sample a range of difficulty levels among 5-year-old children.

Participants
The participants were 95 New Zealand children (39 boys, 56 girls) commencing their first year of formal schooling, aged between 5 years 0 months and 5 years 2 months (M = 60.41 months, SD = 0.59). Of this group, 74 children presented with typical spoken language skills (TD) and 21 children presented with spoken language impairment (SLI). Inclusion of children with (i.e. SLI) and without (i.e. TD) risk for reading problems purposefully enabled the authors to evaluate whether test items in each task of the Com-PASMA demonstrated an appropriate "fit" to the ability spectrum of the intended population and sampled a wide spectrum of difficulty. All participants presented with typical cognitive, hearing and physical development and attended a mainstream school. The participants formed a control sample for another, larger study evaluating the effectiveness of teacher-directed classroom PA instruction (see Carson, Gillon, & Boustead, 2013b).

Classification of SLI and TD
To be classified as having TD, children needed to score within or above the average range on the following assessments: (a) Clinical Evaluation of Language Fundamentals – Preschool, Second Edition (CELF-P2) (Wiig, Secord, & Semel, 2006). To be classified as having SLI, children needed to perform at least one standard deviation below the mean on one of the baseline language measures (e.g. CELF-P2 or PIPA) or present with phonologically based speech errors with a percentage of consonants correct (PCC) below 93%. In practice, children who perform at least 1–2 SD below the mean are considered to have impaired language skills and, thus, a cut-off of at least 1 SD below the mean was used in this study. A PCC of 93.4–100% is considered typical for children aged 5 years and 0 months to 5 years and 11 months (Shriberg et al., 1997). Children in this study who had a PCC below 93% were considered to present with phonologically based speech impairment, ranging in severity from mild to severe. The majority of children (81%) classified as having SLI presented with deficits in both language and speech development. Table I profiles the language, PA, speech and non-verbal intellectual abilities of children with SLI and TD at the start of the study.

Procedure
A 12-month longitudinal research design was employed whereby each participant completed all tasks in the Com-PASMA at the start, middle and end of the first year at school. Test items within each task were systematically reviewed using Rasch Model analysis to establish content validity of the Com-PASMA at each of the three points throughout the school year.

Assessment measures
For the purposes of the current investigation, test items in the Com-PASMA are the assessment measures of focus. Each participant completed each task in the Com-PASMA at ~5 (i.e. start of the year), 5.5 (i.e. middle of the year) and 6 years of age (i.e. end of the year). Children were assessed individually under the supervision of a qualified speech-language pathologist in a quiet area of the school environment. The Com-PASMA contains six PA and two LK tasks (i.e. rhyme oddity, initial phoneme identity, final phoneme identity, phoneme blending, phoneme deletion, phoneme segmentation, letter-name recognition, letter-sound recognition). Tasks take ~4–5 minutes to complete. Calculations of sensitivity and specificity from the start and middle of the school year against reading outcomes at the end of the school year are 0.89 and 0.95, respectively (Carson et al., 2013a). Average performances on each task in the Com-PASMA at the start, middle and end of the school year are profiled in Table II for the entire study sample and for the sub-set of children with SLI.
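As a reminder of how such screening-accuracy figures are derived, the sketch below computes sensitivity and specificity from a confusion matrix; the counts used are hypothetical and are not the study's data.

```python
def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int):
    """Sensitivity: proportion of children with later reading difficulty
    who were flagged by the screen (tp / (tp + fn)).
    Specificity: proportion of children without later difficulty who were
    correctly passed by the screen (tn / (tn + fp))."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical counts for illustration only (not the study's data).
sens, spec = sensitivity_specificity(tp=18, fn=2, tn=57, fp=3)
print(round(sens, 2), round(spec, 2))  # 0.9 0.95
```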

Scoring reliability
Tasks in the Com-PASMA were administered individually during one session at the start, middle and end of the school year. Data were recorded in real time as well as by the automated scoring system of the Com-PASMA. In addition, all assessment sessions were recorded using a DVD recorder. Half of the assessment sessions were viewed anonymously and in their entirety by an independent researcher at the start, middle and end of the school year. One hundred per cent inter-rater reliability agreement between real-time scoring, automated Com-PASMA scoring and DVD scoring was achieved for all tasks and test items that were reviewed.

Analysis procedure
Rasch Model analysis was conducted on the responses of the 95 children to each test item at the start, middle and end of the school year by entering responses into a software programme called Winsteps (Version 3.70) (Linacre, 2010). This analysis provided information on: (a) which test items showed a significant model "fit" or "misfit" and (b) how the test items in each task related to each other in terms of difficulty. Using the Winsteps "simulate data" option, a minimum sample of 64 participants was required to achieve a confidence interval of 95% within ±½ logit. Winsteps provides several types of statistical analyses to evaluate test items in relation to the latent trait being measured. For the purposes of our analysis, the "outfit statistics" of mean-square and ZSTD for each test item were used to evaluate which items showed a "fit" or a significant model "misfit". The relevance of the mean-square statistic and the ZSTD statistic is described below:
• Mean-square statistic (MNSQ): The mean-square statistic draws attention to the accuracy of an item by providing an indication of the size of an item's "misfit" to the model. An item with a mean-square close to 1.0 suggests that the item is accurate. An item with a mean-square less than 1.0 is considered less accurate, but this does not cause any real problems (Linacre, 2010, p. 23). An item with a mean-square greater than 2.0 is considered inaccurate and in need of attention.
• ZSTD statistic: A ZSTD statistic is assigned to each mean-square statistic to indicate whether the size of the "misfit" is statistically significant. The ZSTD is "standardized like a Z-score" (Linacre, 2010, p. 25). An item with a ZSTD statistic between -2 and +2 indicates an acceptable model "fit" (i.e. no statistically significant misfit).
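To make the outfit statistic concrete, the following sketch computes an outfit mean-square for a single item from dichotomous responses, using the standard Rasch residual formulation. The ability and difficulty estimates would normally come from Winsteps, so all values here are illustrative.

```python
import math

def rasch_p(ability: float, difficulty: float) -> float:
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def outfit_mnsq(responses, abilities, difficulty):
    """Outfit mean-square for one item: the average of the squared
    standardized residuals (x - p)^2 / (p * (1 - p)) across persons.
    Values near 1.0 indicate good fit; values above 2.0 flag the item
    as inaccurate and in need of attention."""
    z2 = []
    for x, theta in zip(responses, abilities):
        p = rasch_p(theta, difficulty)
        z2.append((x - p) ** 2 / (p * (1.0 - p)))
    return sum(z2) / len(z2)

# Toy data: two children whose ability equals the item's difficulty
# (p = 0.5 each); one succeeds, one fails, giving a mean-square of 1.0.
print(outfit_mnsq([1, 0], [0.0, 0.0], difficulty=0.0))  # 1.0
```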
In line with Linacre (2010), test items with a mean-square statistic greater than 2.0 and a ZSTD statistic less than -2 or greater than +2 were interpreted as showing a statistically significant model "misfit". Items may show a "misfit" for a number of reasons, including: (1) being too easy or difficult, (2) confusing or ambiguous instructions or (3) poor image quality (i.e. animated or static graphics) or heightened linguistic complexity. Items demonstrating a "misfit" may require adaptation or deletion from the instrument (Bond & Fox, 2007).
The point-measure correlation is another outfit statistic; it provides an indication of an item's discrimination ability. Positive correlations above 0.3 indicate that the item is well correlated with the ability being measured. Negative correlations, or correlations close to zero, suggest there is little relationship between the item and the ability being measured; such an item does not effectively distinguish between individuals with more or less ability, which is cause for concern.
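A point-measure correlation is, in essence, a Pearson correlation between children's scores on one item and their overall ability measures. A minimal sketch with hypothetical data follows.

```python
def point_measure_correlation(item_scores, person_measures):
    """Pearson correlation between 0/1 responses on a single item and the
    children's ability measures (in logits). Values above 0.3 suggest the
    item discriminates well between low- and high-ability children."""
    n = len(item_scores)
    mx = sum(item_scores) / n
    my = sum(person_measures) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(item_scores, person_measures))
    sx = sum((x - mx) ** 2 for x in item_scores) ** 0.5
    sy = sum((y - my) ** 2 for y in person_measures) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: the two most able children pass and the two least
# able fail, so the item discriminates well.
r = point_measure_correlation([0, 0, 1, 1], [-1.0, -0.5, 0.5, 1.0])
print(round(r, 2))  # 0.95
```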

Results
Using Rasch Model analysis, Tables III and IV profile examples of the mean-square statistic, ZSTD statistic and point-measure correlation for test items at the start of the school year in the rhyme oddity and phoneme segmentation tasks, respectively. Rhyme oddity was considered to be the easiest PA task, as well as appropriate for 5-year-old children. Therefore, it was anticipated that the majority of rhyme oddity test items would demonstrate a model "fit" and point-measure correlations above 0.3. Phoneme segmentation, however, was considered the hardest task and, according to the literature, is challenging for 5-year-old children (Adams, 1990). Hence, it was expected that a number of phoneme segmentation items would demonstrate a significant model "misfit" at the start of the school year.
Table III demonstrates that, of the rhyme oddity test items, eight showed a model "fit" and items 1 and 2 demonstrated a significant model "misfit". Inspection of responses to item 1 revealed that 87% of children responded correctly to this item, suggesting that the "misfit" occurred because the item was too easy. Inspection of responses to item 2 revealed that this item was of average difficulty and that the item's "misfit" could not be attributed to the item being too easy or too difficult for 5-year-old children. In addition, the point-measure correlations for items 1 and 2 were just below 0.3, indicating that these items do not differentiate well between high- and low-ability students. Table IV shows that phoneme segmentation items 1, 2 and 4 demonstrate a model "fit", while items 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17 and 18 demonstrate a significant model "misfit". Items 1 ("moon"), 2 ("tooth") and 4 ("cup") were simple CVC words containing either long medial vowels, earlier developing sounds (e.g. /k/, /p/, /m/) or both.
This may have increased the salience of the phonemes in these items, allowing 5-year-old children to segment them correctly at school-entry. Item 3 ("cow") was thought to be easy during the construction of the test. However, over 85% of children indicated that this item had three sounds, as opposed to two. Video observations revealed that children often added a final schwa phoneme when segmenting "cow" verbally (i.e. /kaʊə/). Only 9% of children scored correctly on this item. The majority of phoneme segmentation items demonstrating a significant "misfit" had a CCVC or CVCC syllable structure. Less than 10% of participants provided a correct response to phoneme segmentation items 5–18, indicating that these items showed a "misfit" because they were extremely difficult. Table V aggregates the mean-square statistic, ZSTD statistic and point-measure correlation for all test items at the start, middle and end of the first year at school. This table illustrates that 108 out of 114 test items demonstrate a model "fit" at one or multiple points throughout the first year at school. All items demonstrating a model "fit" also had point-measure correlations above 0.3, indicating that they discriminate well between individuals of high and low ability. Items demonstrating a model "misfit" may do so because of test item difficulty (as described in the following section) or another extraneous factor such as instruction quality, as opposed to the presence of SLI.

Identifying a hierarchy of item difficulty in each Com-PASMA task
Rasch analysis enables comparison of item difficulty through computation of "estimates of item difficulty" (Bond & Fox, 2007). Winsteps refers to the "estimate of item difficulty" as the "measure" statistic, which in essence is a logit (log-odds) score assigned to an item to indicate its difficulty (Linacre, 2010). A logit score is plotted along an interval scale called a logit scale. The logit value of zero represents an arbitrary mean; therefore, items with a logit score near zero are considered to be of average difficulty. Items with increasingly positive logit scores are more difficult, while items with increasingly negative logit scores are easier. In theory, a logit scale can range from negative infinity to positive infinity (Bond & Fox, 2007). Therefore, for the purposes of this investigation, the following difficulty descriptions were applied to logit values: 8 and above = very difficult; 5–7 = difficult; 2–4 = moderately difficult; 1 to -1 = average difficulty; -2 to -4 = moderately easy; -5 to -7 = easy; and -8 and below = very easy.

Table V. Summary of items by task demonstrating a "fit" or significant model "misfit" at the start, middle and end of the school year.
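The difficulty banding can be expressed as a simple lookup. Note that the study states the bands at integer cut-points, so the treatment of in-between logit values below is our assumption.

```python
def difficulty_label(logit: float) -> str:
    """Map a Winsteps 'measure' (logit) score onto the difficulty bands
    used in this study. Boundary handling for non-integer values is an
    assumption, as the study reports bands at integer cut-points only."""
    if logit >= 8:
        return "very difficult"
    if logit >= 5:
        return "difficult"
    if logit >= 2:
        return "moderately difficult"
    if logit > -2:
        return "average difficulty"
    if logit > -5:
        return "moderately easy"
    if logit > -8:
        return "easy"
    return "very easy"

print(difficulty_label(0.0))   # average difficulty
print(difficulty_label(-3.0))  # moderately easy
```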
Supplementary Tables A, B and C, to be found online at http://informahealthcare.com/doi/abs/10.3109/17549507.2015.1016107, summarize the hierarchy of difficulty for test items in each task at the start, middle and end of the school year by plotting the "measure" statistic (i.e. logit score) for each item against a logit scale. From the start to the middle of the school year (i.e. Supplementary Tables A and B), items in the final phoneme identity, phoneme blending, phoneme deletion and phoneme segmentation tasks begin to sample a wider range of difficulty levels. This is likely because tasks and items that were more difficult at school-entry became easier as children's PA skills developed. From the middle to the end of the school year (i.e. Supplementary Tables B and C), test items became increasingly easier for children to complete as they approached 6 years of age.

Ensuring test item "fit" and difficulty are supported by reliable test parameters
Internal consistency of the items within each PA task at the start, middle and end of the first year at school was calculated using Cronbach's alpha to profile stability at each assessment point as well as across time. Cronbach's alpha scores above 0.7 indicate that the items within a task are internally consistent (Field, 2009). Table VI profiles the Cronbach's alpha scores for each task throughout the school year.
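For reference, Cronbach's alpha can be computed directly from item-level scores. The sketch below uses sample (n - 1) variances and hypothetical 0/1 data, not the study's dataset.

```python
def cronbach_alpha(item_scores):
    """Cronbach's alpha for one task. item_scores[i][j] is child j's
    score (0 or 1) on item i; sample (n - 1) variances are used."""
    k = len(item_scores)
    n = len(item_scores[0])

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    sum_item_var = sum(variance(item) for item in item_scores)
    totals = [sum(item[j] for item in item_scores) for j in range(n)]
    return (k / (k - 1)) * (1.0 - sum_item_var / variance(totals))

# Hypothetical data: two items that always agree yield alpha = 1.0.
print(cronbach_alpha([[0, 1, 0, 1], [0, 1, 0, 1]]))  # 1.0
```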
At school-entry, rhyme oddity and initial phoneme identity showed a high degree of internal consistency, with Cronbach's alpha scores of 0.81 and 0.89, respectively. In addition, the letter-name and letter-sound tasks achieved scores of 0.81 and 0.80, respectively. Unsatisfactory Cronbach's alpha scores were calculated for final phoneme identity, phoneme blending, phoneme deletion and phoneme segmentation at the start of the year. This is consistent with the Rasch analysis findings, indicating that these latter tasks are generally more difficult and less reliable at school-entry. By the middle and end of the school year, high Cronbach's alpha scores were calculated for all tasks in the Com-PASMA.
Test–retest reliability refers to the consistency of a test across repeated administrations under identical conditions (Thorndike & Thorndike-Christ, 2010). Test–retest reliability was assessed by correlating each task with itself across the three assessment points during the school year. The resulting correlation matrix, shown in Table VII, revealed significant correlations at p < 0.01 for each task.
Collectively, internal consistency and test–retest reliability provide strong evidence that the Com-PASMA is internally reliable and highly consistent over repeated administrations.

Discussion
This study investigated the content validity of a computer-based PA screening and monitoring assessment (Com-PASMA). Content validity was established through systematic item review using Rasch Model analysis to determine: (a) the appropriateness of test item "fit" to the PA ability spectrum of 5-year-old children and (b) sufficient sampling of a range of difficulty levels to support differentiation of student abilities to inform curricular reading instruction.

Ensuring appropriateness of "fit" to the ability spectrum of 5-year-old children
The first hypothesis predicted that test items in each of the eight tasks in the Com-PASMA would demonstrate an appropriate "fit" to the abilities of 5-year-old children in the first year of formal schooling. Rasch Model analysis using the "outfit statistics" of mean-square, ZSTD and point-measure correlation demonstrated that the majority of test items (i.e. 108 out of 114) across the eight tasks showed a model "fit" at one or multiple points during the first year at school. At school-entry, phoneme segmentation was the only task where the majority of test items (15 out of 18) did not show a model "fit". This suggests that phoneme segmentation items, as part of the Com-PASMA, are less suitable for measuring the PA ability of 5-year-old children at school-entry. However, by the middle and end of the school year, 14 out of 18 phoneme segmentation items demonstrated a "fit", indicating that these items become increasingly appropriate measures as children begin to interact with beginning classroom literacy instruction. This is consistent with research findings showing that phoneme segmentation is a difficult task at 5 years of age (Adams, 1990). Importantly, Rasch Model analysis demonstrated that, as PA tasks, including phoneme segmentation, became increasingly appropriate across the first year of schooling, the capacity of those items to differentiate between high- and low-ability students increased, as evidenced through improved point-measure correlation statistics.
Of the 114 test items, six demonstrated a significant model "misfit" at all three assessment points in the school year: rhyme oddity items 1 and 2, phoneme deletion item 1 and phoneme segmentation items 16, 17 and 18. Other items demonstrated a "misfit" at only one or two assessment points in the school year. These include rhyme oddity item 4, final phoneme identity items 2 and 4, phoneme deletion items 5, 6 and 11 and phoneme segmentation item 1. Adapting these "misfit" items in terms of linguistic complexity (i.e. word familiarity, syllable structure and manner of articulation) and presentation (i.e. animated and static graphics or verbal instructions) at the point at which the "misfit" occurred will be required in future investigations. However, some items demonstrating a significant "misfit" may not necessarily require adaptation, because their low level of difficulty is a purposeful part of test construction, helping to ensure graded levels of difficulty within tasks. For example, rhyme oddity item 1 was developed using a simple syllable structure, a high-frequency rhyme unit and salient contrasts in the manner of articulation between correct and distractor options. This was done to ensure children's success and familiarity with the responding procedure on what would usually be one of the first tasks administered as part of the Com-PASMA.

Sampling a range of difficulty levels among 5-year-old children
The second hypothesis stated that test items within each task in the Com-PASMA would sample a range of difficulty levels among 5-year-old children. Rasch Model analysis confirmed this hypothesis. Using "estimates of item difficulty", results demonstrated that, at school-entry, test items in the Com-PASMA sampled a wide range of difficulty levels (see Supplementary Table A to be found online at http://informahealthcare.com/doi/abs/10.3109/17549507.2015.1016107). Rhyme oddity and initial phoneme identity items provided an even spectrum from easier items to those that are more difficult. For example, rhyme oddity items ranged from moderately easy to moderately difficult and initial phoneme identity items ranged from easy to difficult. Final phoneme identity, phoneme blending and phoneme deletion items predominantly sampled the moderately difficult to difficult range. For example, eight out of 10 final phoneme identity items were classified as difficult, while nine out of 15 phoneme blending and 10 out of 15 phoneme deletion items were considered moderately difficult. This suggests that, while the majority of test items in these tasks demonstrate a "fit" at school-entry, their spectrum of difficulty indicates they may be more appropriate later in the first year of schooling. Only four out of 18 phoneme segmentation items could be analysed at school-entry; 14 items were extremely difficult and could not be analysed because the majority of respondents scored incorrectly. This suggests that these items do not adequately sample a range of difficulty levels at this stage of schooling. The majority of letter-name items were moderately easy, whereas letter-sound items tended to be of average difficulty; both are considered appropriate at school-entry.
During the middle of the school year (see Supplementary Table B to be found online at http://informahealthcare.com/doi/abs/10.3109/17549507.2015.1016107), rhyme oddity and initial phoneme identity continued to sample a range of easy to moderately difficult levels, with initial phoneme identity items being less spread than at school-entry. While the majority of final phoneme identity, phoneme blending, phoneme deletion and phoneme segmentation items continued to be of greater difficulty, they began to sample the moderately easy to easy range. For example, four out of 15 phoneme blending items, five out of 15 phoneme deletion items and five out of 18 phoneme segmentation items were either moderately easy or easy by the middle of the school year. This is in comparison to one out of 15 phoneme blending items, one out of 15 phoneme deletion items and two out of 18 phoneme segmentation items being classified as moderately easy or easy at school-entry. Letter-name and letter-sound test items became easier to complete by the middle of the school year. By the end of the school year (see Supplementary Table C to be found online at http://informahealthcare.com/doi/abs/10.3109/17549507.2015.1016107), test items were more evenly spread across high to low logit scores, particularly for items in the final phoneme identity, phoneme deletion and phoneme segmentation tasks. Interestingly, phoneme deletion item 11, which was moderately difficult at the start and middle of the school year, became very difficult by the end of the school year and will require further investigation in future studies.

Implications for classroom assessment practices
Providing evidence of content validity through the use of Rasch Model analysis has implications for classroom practices, in that educators are informed of which PA tasks are appropriate at which stages of the first year at school. Specifically, aggregating test item "fit" with "estimates of test item difficulty" demonstrates that: (1) rhyme oddity and initial phoneme identity test items are most appropriate at school-entry and sample a spectrum of difficulty levels, (2) more challenging phoneme-level test items (e.g. final phoneme identity, phoneme blending, phoneme deletion and phoneme segmentation) become increasingly appropriate and differentiate between high- and low-ability students by the middle and end of the first year of school and (3) letter-knowledge test items are appropriate but decline in their ability to differentiate between high- and low-ability students as the first year of schooling progresses. Such findings are consistent with previous research on the developmental progression of increasingly complex PA skills (Elbro, Borstrom, & Petersen, 1998; van Bon & van Leeuwe, 2003). Although the majority of test items in the Com-PASMA demonstrate a "fit" to the ability spectrum of 5-year-old children, sample a range of difficulty levels and discriminate well in the first year of schooling, educators should select those tasks within the Com-PASMA that are best suited to the age-level (i.e. 5 years, 5.5 years and 6 years) of their students when using this tool in the classroom.

Limitations
In light of the positive outcomes of this investigation, it is important to identify limitations that can be addressed by future research. First, the study consisted of a small sample size of 95 participants, all within the same age range. Future investigations could extend the current findings by increasing the sample size and measuring the PA abilities of children prior to, during and after the first year of schooling to identify which test items are appropriate for a range of age-levels. Second, the Com-PASMA currently consists of a total of 114 test items over eight tasks. Future investigations should ideally focus on constructing a larger range of test items to help support the development of an adaptive version of the assessment. Computer-adaptive testing requires large banks of test items so that the computer is able to select appropriate test items when adapting to the responses of the child. The number of test items in the current investigation that demonstrate a "fit" at some stage in the school year and sample a range of difficulty levels provides a starting point for the expansion of test items and the construction of an adaptive form of the Com-PASMA.
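To make the adaptive-testing idea concrete, the core item-selection step can be sketched as below. This is a minimal illustration under a simple dichotomous Rasch assumption; the item names and the bank itself are hypothetical, not Com-PASMA items. Under the Rasch model an item is most informative when its difficulty matches the child's current ability estimate, so the selector simply picks the closest unadministered item:

```python
def next_item(theta, item_bank, administered):
    """Pick the unadministered item whose logit difficulty is closest to
    the child's current ability estimate theta (Rasch item information
    peaks when theta equals the item difficulty b)."""
    return min(
        (name for name in item_bank if name not in administered),
        key=lambda name: abs(theta - item_bank[name]),
    )
```

This is precisely why a large item bank matters: with few items, the closest available difficulty may still sit far from the child's ability, yielding little information per response.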
Ensuring all young children have the opportunity to develop proficiency in reading is currently a key area of interest globally. Large gaps in reading outcomes between stronger and weaker readers, particularly among developed countries, have created a need for researchers to investigate methods that support the early identification and prevention of risk for reading impairment as part of daily curricular practices. Providing educators with technology-based assessment methods that have demonstrated a robust ability to differentiate between children who enter school with high and low levels of skills known to predict later reading success, such as PA, is perhaps one way in which educators, researchers and policymakers can help reduce the prevalence of large gaps in school-aged reading outcomes.