Academic performance of students from entrance to graduation via quasi U-statistics: a study at a Brazilian research university

We present a novel methodology to assess undergraduate students' performance. Emphasis is given to potentially dissimilar behaviors due to high school background and gender. The proposed method is based on measures of diversity and on the decomposability of quasi U-statistics to define average distances between and within groups. One advantage of the new method over the classical analysis of variance is its robustness to distributional deviations from normality. Moreover, compared with other nonparametric methods, it also includes tests for interaction effects which are not rank transform procedures. The variance of the test statistic is estimated by jackknife and p-values are computed using its asymptotic distribution. A college education performance data set is analyzed. The data set is formed by students who entered the University of Campinas, Brazil, between 1997 and 2000. Their academic performance was recorded until graduation or drop-out. The classical ANOVA points to significant effects of gender, type of high school and working status. However, the residual analysis indicates a highly significant deviation from normality. The quasi U-statistics nonparametric tests proposed here show a significant effect of the interaction between type of high school and gender but do not show a significant effect of working status. The proposed nonparametric method also results in smaller error variances, illustrating its robustness against model misspecification.


Introduction
Assessment of undergraduate performance from entrance to graduation has been of great interest in the literature ([7,12,14,19,20] and references therein). For instance, since the implementation of a quota system in the Public Federal Universities in Brazil, the interest in the performance of undergraduate students has grown. The law, in effect since 2013, states that all the courses/majors in Public Federal Universities must have at least 12.5% of their positions reserved for students who studied all High School years in public high schools (PuHS). In addition, this percentage should gradually reach 50% in 2017. It is important to point out that most of the middle-class students go to Private Schools (Elementary, Middle and High Schools) and there is great competition to enter Public Universities. Therefore, the socioeconomic status is being measured indirectly by looking at the High School system of the students. There are highly competitive exams to get into the Public Universities in Brazil. The State Universities in São Paulo, Brazil, do not have a quota system, but they are implementing some affirmative action programs. The State University of Campinas (Unicamp), located in the state of São Paulo, is one of the top universities in Brazil and in 2005 implemented an affirmative action program, which allows students who studied all High School years in public schools to receive a bonus in their final scores of Unicamp's highly selective Entrance Exam. Unicamp has an average of over 15 candidates per undergraduate position offered each year (www.comvest.unicamp.br). Specifically, the affirmative action program started in 2005 at Unicamp gives a bonus of 30 points in the Entrance Exam Score (EES) to students coming from PuHS and 10 more points to those PuHS students who declared themselves as Black, Pardo or Indian. Note that indirectly the affirmative action program is adjusting for income level and race.

*Corresponding author. Email: hildete@ime.unicamp.br. © 2015 Taylor & Francis
Most of the studies were based on classical analysis of variance or correlation analysis. In some cases when the normality assumption is not met, nonparametric tests based on ranks have been used [19], such as Kruskal-Wallis test for one-way analysis of variance, Friedman test for complete block designs or Durbin test for incomplete block design and others [4,6,8]. However, none of these tests accommodate interaction effect hypotheses. In this paper we propose a nonparametric methodology which allows the analysis of non-normal or unspecified distribution data as well as normal or other known distribution data. Furthermore, we add the possibility of interaction effects, which is not possible with the most popular nonparametric approaches. The proposed test statistics are asymptotically normal under both null and alternative hypotheses for a large class of models [16,17]. Moreover, here we extend the results presented in [1] for homogeneity tests based on quasi U-statistics by including tests for interaction effects.
Pedrosa et al. [14] propose regression models to assess the performance of undergraduate students using as response variable the relative gain, which is based on the relative rank of the final (or last) recorded GPA (grade point average) and the relative rank of the entrance exam grade. The relative gain is defined as follows. The students of each course/major who entered in the same year (the same 'class', say) are ranked twice. The student's first rank is based on the EES; the second rank is based on the final (or last) GPA score. For instance, the student with the worst EES or worst GPA will be assigned rank 1, while the one with the highest score will have rank n. We thus have an initial and a final rank for each student. The ranks are then divided by the total number of students in the same 'class' (same major, entering in the same year). The relative gain is then obtained as the difference between the final and the initial relative ranks. Therefore, the relative gain is a continuous variable symmetric around zero, limited between −1 and 1, and the tails of its distribution are lighter than those of the normal distribution, that is, its distribution is platykurtic. Even for large sample sizes, the normal approximation may not be appropriate.
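The construction of the relative gain described above can be sketched in a few lines of Python. This is only an illustration for a single 'class': the function names `ranks` and `relative_gain`, the made-up scores, and the choice of giving tied students their average rank are assumptions, not details taken from [14].

```python
def ranks(values):
    """Ascending ranks 1..n; tied values share their average rank (an assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def relative_gain(ees, gpa):
    """Final relative rank minus initial relative rank for one 'class'."""
    n = len(ees)
    initial = ranks(ees)  # rank 1 = lowest entrance exam score
    final = ranks(gpa)    # rank 1 = lowest final (or last) GPA
    return [(f - i) / n for i, f in zip(initial, final)]


# Hypothetical class of three students (same major, same entry year).
ees_scores = [500.0, 600.0, 700.0]
gpa_scores = [8.0, 6.0, 7.0]
gains = relative_gain(ees_scores, gpa_scores)
print(gains)
```

In this toy class the student with the lowest entrance score ends with the highest GPA and therefore has the largest gain; the gains of a class always sum to zero because both sets of ranks have the same total.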
We pursue more robust methods, which can be applied to small or moderate sample sizes as well as to any case in which a normality assumption is not reasonable. Robust methods are key to evaluating the performance of students, since GPA scores can vary a lot from one area to another, for example, from Exact Sciences to Social Sciences. Even in the same course, we may have very different grades according to the class/lecturer. So, it is crucial to employ more robust measures of performance.
The proposed method is based on measures of diversity and decomposability of quasi U-statistics [16,17] to properly decompose between- and within-group distances. Furthermore, the method allows us to look at interaction effects by extending the decomposition of diversity measures using the idea in [18]. The main emphasis here is given to the sector of high school education from which college students come: private or public. A homogeneity test is proposed for group comparisons using a nonparametric approach. The data come from the State University of Campinas (Unicamp), as do the data in [12]. We extend the analysis in [12] in two ways. First, we present a thorough analysis of the consolidated data set. Second, the analysis is more inferential, as opposed to the more descriptive analysis of the preliminary data in [12]. The data set consists of students who enrolled at Unicamp from 1997 to 2000, of whom 76.8% have already graduated and 23.2% have dropped out of the University.
In Section 2, we present a short introduction about diversity measures, the development of a hypothesis test to evaluate the homogeneity between groups and the extension of the procedure to a multifactor design, which can be a parametric or a nonparametric approach based on quasi U-statistics. The decomposition of diversity measures based on quasi U-statistics presented so far in the literature [1,16,17] is analogous to a one way ANOVA. Here, we extend the quasi U-statistic method to multifactors as in [18]. In Section 3, we present the data that motivated the study. Section 4 presents an application to the Unicamp database with all the students (those who graduated and those who dropped out) as well as an application with only those who graduated. A discussion follows in Section 5.

Measures of diversity and decompositions
Several metrics have been constructed to measure distances in qualitative or quantitative variables ([2,9,11] and references therein). Measures of diversity have been widely used to measure variability in ecology, genetics, physics and many other areas. Measures of diversity can be used to decompose the total diversity into within- and between-group diversity due to a certain number of factors [18]. When we have a mixture of groups, one may be interested in knowing whether the diversity is due to differences within or between groups. The classical ANOVA case may be thought of as a special case of the decomposition of measures of diversity. For instance, the total sample variance can be written as the mean over all pairwise comparisons of half of the squared difference between $x_i$ and $x_j$, which in other words is a U-statistic of degree 2 with kernel $\phi(x_i, x_j) = (x_i - x_j)^2/2$ [5,10]. In the classical ANOVA case, the $X_i$'s are assumed to be independent with normal distribution, so that the between and within mean sums of squares follow chi-square distributions and their ratio follows Snedecor's F distribution. Here we will use a more general approach for which we do not need an assumption for the distribution of the $X_i$'s. Consider a symmetric and non-negative function $\phi(X_i, X_j)$, which is a measure of the difference between two individuals. Then define, for $g \neq g' = 1, \ldots, G$,

$$\theta_g = E\left[\phi(X_{gi}, X_{gj})\right], \; i \neq j, \qquad \theta_{gg'} = E\left[\phi(X_{gi}, X_{g'j})\right],$$

where $X_{gi}$ denotes the $i$th individual of group $g$, $\theta_g$ is the diversity within group $g$ and $\theta_{gg'}$ is the diversity between groups $g$ and $g'$. Estimators of $\theta_g$ and $\theta_{gg'}$ can be found using U-statistics [5]. If $\phi(x_i, x_j)$ is a convex function, then

$$\theta_{gg'} \geq \tfrac{1}{2}\left(\theta_g + \theta_{g'}\right).$$

The excess dissimilarity measure between groups $g$ and $g'$ is given by

$$D_{gg'} = \theta_{gg'} - \tfrac{1}{2}\left(\theta_g + \theta_{g'}\right),$$

that is, $D_{gg'}$ is the excess measure of diversity between groups $g$ and $g'$ compared to the average of the within-group diversity measures of groups $g$ and $g'$.
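As a quick numerical check of the statement that the sample variance is a U-statistic of degree 2, the following Python sketch compares the usual sample variance with the pairwise average of the kernel (x_i − x_j)²/2. The data values and the helper names `pairwise_u_statistic` and `half_squared_difference` are illustrative assumptions, not part of the original study.

```python
from itertools import combinations
from statistics import variance


def pairwise_u_statistic(x, kernel):
    """Average of kernel(x_i, x_j) over all unordered pairs i < j."""
    pairs = list(combinations(x, 2))
    return sum(kernel(a, b) for a, b in pairs) / len(pairs)


def half_squared_difference(a, b):
    """Kernel whose degree-2 U-statistic is the sample variance."""
    return (a - b) ** 2 / 2


sample = [0.10, -0.25, 0.40, 0.05, -0.30]  # made-up relative gains
u_stat = pairwise_u_statistic(sample, half_squared_difference)
print(u_stat, variance(sample))  # the two values agree up to rounding error
```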
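Additionally, the estimators of $\theta_g$ and $\theta_{gg'}$, for $g \neq g' = 1, \ldots, G$, are given by

$$\bar\theta_g = \binom{n_g}{2}^{-1} \sum_{1 \le i < j \le n_g} \phi(X_{gi}, X_{gj}), \qquad \bar\theta_{gg'} = \frac{1}{n_g n_{g'}} \sum_{i=1}^{n_g} \sum_{j=1}^{n_{g'}} \phi(X_{gi}, X_{g'j}),$$

where $X_{gi}$ is the $i$th observation of the $g$th group and $n_g$ is the number of individuals in the $g$th group. Note that $\bar\theta_g$ is a U-statistic of degree 2 and $\bar\theta_{gg'}$ is a two-sample U-statistic of degree (1,1), so that $E(\bar\theta_g) = \theta_g$ and $E(\bar\theta_{gg'}) = \theta_{gg'}$. The overall mean distance is defined as the total variability of the pooled sample and it can be estimated by

$$U_n = \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} \phi(X_i, X_j),$$

with $n = \sum_{g=1}^{G} n_g$. As shown in [15,16], we can decompose $U_n$ as

$$U_n = W_n + B_n, \qquad W_n = \sum_{g=1}^{G} \frac{n_g}{n}\, \bar\theta_g, \qquad B_n = \sum_{1 \le g < g' \le G} \frac{n_g n_{g'}}{n(n-1)} \left(2\bar\theta_{gg'} - \bar\theta_g - \bar\theta_{g'}\right).$$

Note that $W_n > 0$, so that $E(W_n) = \sum_{g=1}^{G} (n_g/n)\, \theta_g > 0$, but $B_n$ may assume any real value and

$$E(B_n) = \sum_{1 \le g < g' \le G} \frac{n_g n_{g'}}{n(n-1)} \left(2\theta_{gg'} - \theta_g - \theta_{g'}\right).$$

Here we use $\phi(X_i, X_j) = (X_i - X_j)^2$, with $X_i$ and $X_j$ being the relative gain of students $i$ and $j$, respectively.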
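The decomposition of the overall mean distance into a within-group and a between-group term can be checked numerically. The sketch below assumes the squared-difference kernel used in this paper and the standard weights n_g/n for the within term and n_g n_g'/(n(n−1)) for the between term; the function names and the toy data are hypothetical.

```python
from itertools import combinations, product


def phi(a, b):
    """Kernel used in the paper: squared difference."""
    return (a - b) ** 2


def within(x):
    """U-statistic of degree 2: mean kernel over pairs inside one group."""
    pairs = list(combinations(x, 2))
    return sum(phi(a, b) for a, b in pairs) / len(pairs)


def between(x, y):
    """Two-sample U-statistic of degree (1,1): mean kernel over cross pairs."""
    return sum(phi(a, b) for a, b in product(x, y)) / (len(x) * len(y))


def decompose(groups):
    """Return (U_n, W_n, B_n) for a list of groups, each with at least 2 values."""
    pooled = [value for group in groups for value in group]
    n = len(pooled)
    u_n = within(pooled)  # overall mean distance of the pooled sample
    w_n = sum(len(g) / n * within(g) for g in groups)
    b_n = sum(
        len(g) * len(h) / (n * (n - 1)) * (2 * between(g, h) - within(g) - within(h))
        for g, h in combinations(groups, 2)
    )
    return u_n, w_n, b_n


# Made-up relative gains for three groups.
groups = [[0.10, -0.20, 0.05], [0.30, 0.45, 0.20, 0.50], [-0.40, -0.10]]
u_n, w_n, b_n = decompose(groups)
print(u_n, w_n + b_n)  # the two values agree up to rounding error
```

The between term is computed here from its own formula rather than as U_n − W_n, so the agreement of the two printed values is a genuine check of the identity.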

Hypothesis testing
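We can test homogeneity between groups using the individual performance data. Intuitively, the null hypothesis of homogeneity is $H_0: D_{gg'} = 0$, $\forall\, 1 \le g < g' \le G$, and, given Equations (1) and (2), we can write the hypotheses as

$$H_0: \theta_{gg'} = \tfrac{1}{2}(\theta_g + \theta_{g'}) \;\; \forall\, g < g' \quad \text{versus} \quad H_1: \theta_{gg'} > \tfrac{1}{2}(\theta_g + \theta_{g'}) \;\; \text{for some } g < g'.$$

The average $(\theta_g + \theta_{g'})/2$ can be thought of as a baseline. Under the null hypothesis, $\theta_{gg'} = (\theta_g + \theta_{g'})/2$, which means that the diversity between groups is not greater than the baseline, that is, the excess diversity between groups is zero, $D_{gg'} = 0$. Large sample values of $B_n$ indicate large values of $D_{gg'}$. Hence, one rejects $H_0$ for large values of $B_n$. Under mild regularity conditions $B_n$ is asymptotically normal [16,17]. Let

$$\phi(x, y) = \theta + \psi_1(x) + \psi_1(y) + \psi_2(x, y), \qquad \theta = E\,\phi(X_1, X_2), \qquad \psi_1(x) = E\,\phi(x, X_2) - \theta,$$

which is the Hoeffding decomposition of $B_n$'s kernel $\phi$. Note that, under $H_0$, we can rewrite $B_n$ as

$$B_n = \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} \eta_{nij}\, \psi_2(X_i, X_j),$$

where $X_1, \ldots, X_n$ represent the ordered pooled sample, in which the first $n_1$ observations relate to group 1, the next $n_2$ to group 2 and so on; $\eta_{nij} = 1$ if $i$ and $j$ are from different groups and $\eta_{nij} = -(n - n_g)/(n_g - 1)$ if $i$ and $j$ are both from the same group $g$.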
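The representation of the between-group term as a weighted sum over all pairs of the pooled sample can be illustrated as follows. The sketch assumes weights equal to 1 for pairs from different groups and −(n − n_g)/(n_g − 1) for pairs inside group g, a choice under which the weighted sum reproduces U_n − W_n; the function names `eta` and `quasi_u` and the data are hypothetical. It also prints the sum of the weights attached to each observation, which is zero; this is why the constant and first-order terms of the Hoeffding decomposition drop out under homogeneity.

```python
from itertools import combinations


def eta(labels, i, j):
    """Pair weight: 1 across groups, -(n - n_g)/(n_g - 1) inside group g."""
    if labels[i] != labels[j]:
        return 1.0
    n = len(labels)
    n_g = labels.count(labels[i])
    return -(n - n_g) / (n_g - 1)


def quasi_u(x, labels, kernel):
    """Weighted average of the kernel over all unordered pairs of the pooled sample."""
    n = len(x)
    total = sum(
        eta(labels, i, j) * kernel(x[i], x[j])
        for i, j in combinations(range(n), 2)
    )
    return total / (n * (n - 1) / 2)


# Made-up pooled sample, ordered by group, with its group labels.
x = [0.10, -0.20, 0.05, 0.30, 0.45, 0.20, 0.50, -0.40, -0.10]
labels = [0, 0, 0, 1, 1, 1, 1, 2, 2]

b_n = quasi_u(x, labels, lambda a, b: (a - b) ** 2)
row_sums = [
    sum(eta(labels, i, j) for j in range(len(x)) if j != i)
    for i in range(len(x))
]
print(b_n)
print(row_sums)  # each entry is zero up to rounding error
```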
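Pinheiro et al. [16] showed that, under the null hypothesis of homogeneity between groups, the asymptotic distribution of $B_n$ is normal, i.e.

$$\frac{B_n}{\sqrt{\mathrm{Var}(B_n)}} \xrightarrow{\;d\;} N(0, 1),$$

where $\mathrm{Var}(B_n)$ is the variance of $B_n$ under $H_0$. From the result given in Equation (5), one can also find the power of this test. Note that

$$B_n = \sum_{1 \le g < g' \le G} p_g\, p'_{g'} \left(2\bar\theta_{gg'} - \bar\theta_g - \bar\theta_{g'}\right),$$

where $p_g = n_g/n$ and $p'_{g'} = n_{g'}/(n - 1)$, for all $g, g' = 1, 2, \ldots, G$. Then,

$$E(B_n) = 2 \sum_{1 \le g < g' \le G} p_g\, p'_{g'}\, D_{gg'}.$$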
The power of the test then follows from the asymptotic normality of $B_n$ under the alternative hypothesis. The classical tests (both parametric and nonparametric) assess differences in location. Differences in scale are a nuisance to the analysis, leading only to a loss in statistical power. The quasi U-statistics are built upon differences in the distributions. Therefore, both scale and location sample differences contribute to rejecting the null hypothesis.

The multifactors problem
The results shown in the previous sections correspond to a one-way ANOVA. In this section we first extend the quasi U-statistics method to two factors. For more than two factors, the theory is easily generalized. Consider two factors, $A_1$ with $s$ levels and $A_2$ with $t$ levels, as in [13]. To obtain the combined effect of $A_1$ and $A_2$, consider the cross-classification of $A_1$ and $A_2$ as a single factor with $s \times t$ levels. We can then decompose $U_n$ by

$$U_n = W_n(A_1, A_2) + B_n(A_1, A_2),$$

where $B_n(A_1, A_2)$ is computed as Equation (4) with $s \times t$ groups. Thus, the expected value for $B_n(A_1, A_2)$ will be

$$E\left[B_n(A_1, A_2)\right] = \sum_{1 \le k < k' \le s \times t} \frac{n_k n_{k'}}{n(n-1)} \left(2\gamma_{kk'} - \gamma_k - \gamma_{k'}\right),$$

for $k, k' = 1, \ldots, s \times t$, where $n_k$ is the number of individuals in group $k$, $\gamma_{kk'}$ is the diversity between the groups $k$ and $k'$ and $\gamma_k$ is the diversity within the group $k$. Therefore, the null hypothesis to test the interaction effect between the factors $A_1$ and $A_2$ is stated in terms of the quantities $2\gamma_{kk'} - \gamma_k - \gamma_{k'}$. Moreover, as in the sum of squares for the classical analysis of variance, $B_n(A_1, A_2)$ can be decomposed as

$$B_n(A_1, A_2) = B_n(A_1) + B_n(A_2 | A_1),$$

with

$$B_n(A_2 | A_1) = \sum_{l=1}^{s} \frac{n_l}{n}\, B^{(l)}_{n_l}(A_2),$$

where $n_l$ is the number of individuals at level $l$ of $A_1$ and $B^{(l)}_{n_l}(A_2)$ is the between-group term computed for the $t$ levels of $A_2$ using only those individuals. Therefore, $B_n(A_2 | A_1)$ can be interpreted as a weighted average of the diversities between the levels of $A_2$ for each level of $A_1$. This represents the proportion of variability not explained by $A_1$ which is explained by $A_2$. Defining the expected value of $B_n(A_1)$ analogously, with groups given by the levels $l, l' = 1, \ldots, s$ of $A_1$, the null hypothesis to test the effect of $A_2$ given $A_1$ can be defined through the expected value of $B_n(A_2 | A_1) = B_n(A_1, A_2) - B_n(A_1)$.

The method proposed here can be parametric or nonparametric, that is, we may assume a known distribution for $X$ (the relative gain in this particular case) or we can work under a large class of distributions for $X$. In the parametric approach, maximum likelihood estimators (MLE) of the moments of $X$ can be used for the estimation of the mean and variance of the test statistic $B_n$. Details and analytical results of the MLE for the normal and triangular distributions are shown in Appendix A of the supplementary material.
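The two-factor decomposition of the between-group term can be verified on a small example. The sketch below assumes the squared-difference kernel and weights n_l/n for the per-level terms of the conditional component; the helper names and the cell values are hypothetical, and every cell needs at least two observations.

```python
from itertools import combinations, product


def phi(a, b):
    return (a - b) ** 2


def mean_within(x):
    pairs = list(combinations(x, 2))
    return sum(phi(a, b) for a, b in pairs) / len(pairs)


def mean_between(x, y):
    return sum(phi(a, b) for a, b in product(x, y)) / (len(x) * len(y))


def between_term(groups):
    """Between-group component B_n for a list of groups."""
    n = sum(len(g) for g in groups)
    return sum(
        len(g) * len(h) / (n * (n - 1))
        * (2 * mean_between(g, h) - mean_within(g) - mean_within(h))
        for g, h in combinations(groups, 2)
    )


# cells[l][m]: made-up observations at level l of A1 and level m of A2.
cells = [
    [[0.10, -0.20, 0.05], [0.30, 0.45]],
    [[-0.40, -0.10, 0.00], [0.25, 0.15, 0.35]],
]
n = sum(len(cell) for row in cells for cell in row)

# Cross-classification treated as a single factor with s * t levels.
b_joint = between_term([cell for row in cells for cell in row])
# Factor A1 alone: pool the A2 cells inside each level of A1.
b_a1 = between_term([[v for cell in row for v in cell] for row in cells])
# A2 given A1: weighted average of the A2 between terms inside each level of A1.
b_a2_given_a1 = sum(
    sum(len(cell) for cell in row) / n * between_term(row) for row in cells
)
print(b_joint, b_a1 + b_a2_given_a1)  # the two values agree up to rounding error
```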
For the nonparametric approach, no assumption is made about the distribution of the relative gain. The variance of $B_n$ is estimated using the jackknife variance according to [3] and p-values are computed according to the asymptotic distribution given in Equation (5). Details of the algorithm are given in Appendix B of the supplementary material.
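A minimal sketch of the nonparametric procedure is given below. It is not the algorithm of Appendix B, which is not reproduced here: it assumes the standard delete-one jackknife variance estimator and a one-sided p-value from the standard normal distribution, and the function names and data are hypothetical. Each group needs at least three observations so that every leave-one-out sample still has two observations per group.

```python
import math
from itertools import combinations, product


def phi(a, b):
    return (a - b) ** 2


def mean_within(x):
    pairs = list(combinations(x, 2))
    return sum(phi(a, b) for a, b in pairs) / len(pairs)


def mean_between(x, y):
    return sum(phi(a, b) for a, b in product(x, y)) / (len(x) * len(y))


def between_term(groups):
    """Between-group component B_n for a list of groups."""
    n = sum(len(g) for g in groups)
    return sum(
        len(g) * len(h) / (n * (n - 1))
        * (2 * mean_between(g, h) - mean_within(g) - mean_within(h))
        for g, h in combinations(groups, 2)
    )


def jackknife_test(groups):
    """Return (B_n, jackknife variance, one-sided p-value).

    Assumes a delete-one jackknife and a normal reference distribution;
    fails with a division by zero if all leave-one-out values coincide.
    """
    b_n = between_term(groups)
    n = sum(len(g) for g in groups)
    leave_one_out = []
    for gi, g in enumerate(groups):
        for k in range(len(g)):
            reduced = [
                h[:k] + h[k + 1:] if hi == gi else h
                for hi, h in enumerate(groups)
            ]
            leave_one_out.append(between_term(reduced))
    mean_loo = sum(leave_one_out) / n
    var_jack = (n - 1) / n * sum((b - mean_loo) ** 2 for b in leave_one_out)
    z = b_n / math.sqrt(var_jack)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper tail: reject for large B_n
    return b_n, var_jack, p_value


# Made-up, clearly separated groups.
low = [0.10, 0.20, 0.15, 0.12, 0.18]
high = [0.90, 0.80, 0.85, 0.95, 0.88]
b_n, var_jack, p_value = jackknife_test([low, high])
print(b_n, var_jack, p_value)
```

With two groups this far apart the between term is large relative to its jackknife standard error, so the p-value is very small; with overlapping groups the statistic can be close to zero or negative and the p-value is correspondingly large.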

Database
The data set is composed of 8225 students who enrolled at Unicamp in the years 1997, 1998, 1999 and 2000 in all Bachelor's degree majors. These students entered the system without any differential treatment due to Race or Public/Private High School system. The academic situation of these students was classified as follows: Graduates (6316 students who have already graduated, 76.8%) and Others (1909 students who dropped out of the University, 23.2%). Socio-economic and demographic characteristics of the students were provided by a questionnaire filled out by the students when they registered for their entrance exam. The EESs as well as their final GPA scores were also provided.
The students were, in their majority, between 16 and 24 years old (only 8.1% were older than 24), of both genders (58.6% male and 41.4% female), from all Brazilian regions and enrolled in 45 different majors from the areas of Health Sciences, Engineering and Exact Sciences, Social Sciences and Arts.
Students who declared having studied all or most of their high school years in private schools, were considered coming from private school (PrHS). Analogously, the ones who declared having studied all or most of their high school years in public schools were considered coming from public schools (PuHS). Table 1 shows the distribution according to type of high school and entry year. In total 30.7% of all the students who enrolled between 1997 and 2000 come from PuHS.
There is no information about Race in the database as this question was only introduced in Unicamp's questionnaire in 2003. Another characteristic of interest evaluated by the models is whether students worked at the time of their enrollment. In the sample 28% of the students declared they worked when enrolled at the university. Within students from PuHS and PrHS, 48.5% and 18.6% declared to be working at enrollment time, respectively. Table 2 presents the distribution of the students according to type of high school by gender and by working status.
The Spearman correlation coefficient between the initial and final relative ranks is 0.249, indicating a weak correlation between both relative ranks. Tables 3 and 4 present some descriptive statistics of the relative gain according to gender, type of high school and working status as well as results of two-sample t-tests and Wilcoxon rank sum tests for group comparisons using the whole data set and only those who graduated, respectively. We would like to point out that, even though the standard deviations seem to be equal or very close, in many cases they are not (the equality of variances was statistically rejected in all three cases for Table 3 and for gender in Table 4). So, when performing the two-sample t-tests we use the p-values for unequal variances in Table 3 and for gender in Table 4. The PuHS students present a mean relative gain significantly greater than those who studied in PrHS. When comparing genders, women have a significantly greater relative gain than men. On the other hand, when comparing students who worked to those who did not, we do not observe any difference in the relative gain for the whole data set, but there is a significant result when we look only at the students who graduated, with a greater relative gain for those who did not work.

Looking at Figure 1 (the whole data set), it seems that the difference in relative gain between types of high school is slightly greater among women (mean difference = 0.0771) than among men (mean difference = 0.0563), suggesting a possible interaction between gender and type of high school. In Figure 2 (only those who graduated), the situation is reversed: the difference in relative gain between types of high school is slightly greater among men (mean difference = 0.0823) than among women (mean difference = 0.0722).
In addition, we fitted a classical linear model for all the students with all three main effects (type of high school, gender and work) and all pairwise interactions. None of the interaction terms was significant and the analysis of variance table for the final model is presented in Table 5. The linear model uses dummy variables (Female = 1, Male = 0; PuHS = 1, PrHS = 0; did not work = 1, worked = 0) as coding for the design matrix. According to the parameter estimates, all the coefficients are positive, which means that female students have a greater relative gain than male students, students coming from PuHS have a greater relative gain than those from PrHS and those who did not work have a greater relative gain than those who worked. When we look at the results of the t-tests and Wilcoxon rank sum tests in Table 3, there is no difference in relative gain between those who worked and those who did not work, whereas in the model it is statistically significant. This may be due to the fact that the model is adjusting for the other effects. Figure 3 shows the Q-Q plot of the residuals for the model in Table 5 and we can see that the tails of the distribution are significantly lighter than those of the normal. The tests for normality of the residuals reject the null hypothesis of normality (both tests, Anderson-Darling and Kolmogorov-Smirnov, present p-values < .0001). The test results clearly show that the normality assumption is not satisfied by the data set. This motivates us to pursue the analysis with alternative methods which are robust to distributional deviations from normality but which, at the same time, allow us to test main effects and interactions. In the following sections, we present a nonparametric method which attains both robustness and model flexibility.

Application
As neither the normal nor the triangular (−1,0,1) distribution is reasonable for the data set here (see Figure 4), we will use the nonparametric method, since it is a more robust procedure. Tables 6 and 7 display the estimates of B n , their standard deviations and the p-values obtained from the nonparametric method for all the students and only those who graduated, respectively. In Table 6, we observe a significant difference between types of high school (p-value = .0002) and genders (p-value < .0001). We do not observe significant differences between the students who worked and those who did not (p-value = .3716). Using the extension of quasi U-statistics to multifactors (discussed in Section 2.3), which is comparable to the ANOVA results given in Table 5, there are significant effects of gender and type of high school (p-values < .0001) when adjusted for the other variables. On the other hand, we do not find a significant effect of working status adjusted by gender and type of high school (p-value = .3153) and there is a significant effect of the interaction between gender and type of high school (p-value < .0001), which is different from the conclusions of the classical analysis of variance in Table 5. As observed in Figure 1 (all students), the difference in the relative gain between students who came from PuHS and PrHS is slightly greater among female students than among male students. As for Figure 2 (only those who graduated), the situation is reversed, with the difference in relative gain between types of high school slightly greater among men than among women (see details in Section 3).

Table 7. Analysis of diversity (only for the students who graduated) using the nonparametric approach.
Although there are small numerical differences between Tables 6 (all students) and 7 (only those who graduated), the conclusions in terms of significance of the effects are the same.
Typically, when the assumptions of normality are not met, the tails of the distribution are heavier than those of the normal and the SE's, assuming normality, tend to be smaller than they should be. Here the tails are lighter and the empirical distribution is more concentrated than the normal distribution. For this reason the SE's for the nonparametric method are smaller, showing here an advantage over parametric procedures when the assumptions are not met.

Discussion
We present a nonparametric method as an alternative for analyzing continuous response variables when the assumption of normality is not met and the use of analysis of variance may be compromised. Furthermore, the proposed method, compared with classical nonparametric methods, has the advantage of allowing the analysis of more than two factors and also their interactions, making it more flexible and robust, and more competitive with the analysis of variance. It is a general tool that can be applied in many different fields.
We propose the new methodology to analyze the performance of college students from entrance to graduation with a real data set from the University of Campinas, Brazil. The light-tailed (platykurtic) nature of academic performance data makes the robustness of the nonparametric method even more relevant. The difficulties in the analysis of this kind of data are twofold. First of all, given the complexity and diversity of grading systems, the mere use of numerical measures may inadvertently lead to weighting systems biased according to unknown discrepancies among the subjects. These discrepancies may be in scale, in location or in both. The use of ranks helps ameliorate the effects of the grading system. The second problem regards the specification of the underlying distribution. Both the normal and the triangular distributions can be postulated as feasible options. We amply illustrate that these are not adequate distributions for the data set in question. The nonparametric quasi U-statistics decomposition can be used as a solution for unspecified distributions in general.
Finally, the results show that the misspecification of the underlying distribution may lead to wrong conclusions, since the results with the proposed method and the analysis of variance diverged. Although the solutions proposed here demand more computer power than their linear counterparts, the method can be easily implemented in any computer language, especially in R, and is within the grasp of any researcher in the field.