Multiple linear regression with compositional response and covariates

ABSTRACT The standard regression model designed for real space is not suitable for compositional variables; one must consider whether the response and/or covariates are of a compositional nature. There are three common types of multiple regression model with compositional variables: Type 1 refers to the case where all the covariates are compositional and the response is real; Type 2 is the opposite of Type 1; Type 3 relates to the model with compositional response and covariates. Models exist for all three types. In this paper, we focus on Type 3 and propose multiple linear regression models, including a model in the simplex and a model in isometric log-ratio (ilr) coordinates. The model in the simplex is based on a matrix product that projects a D_2-part composition to a D_1-part composition, and can therefore handle compositional variables with different numbers of parts. Some theorems are given to point out the relationships between the parameters of the proposed models. Moreover, inference for the parameters of the proposed models is also given. A real example is studied to verify the validity and usefulness of the proposed models.


Introduction
With the development of society, data types have gradually become more diversified and complicated; examples include compositional data, symbolic data, and functional data. Compositional data are vectors in which all the components are positive real numbers and carry only relative information [18]. For a column vector x = (x_1, x_2, . . . , x_D)^T denoting a D-part composition with constant sum k, the sample space is the simplex S^D [1], defined as

S^D = { x = (x_1, x_2, . . . , x_D)^T : x_i > 0 (i = 1, 2, . . . , D), x_1 + x_2 + · · · + x_D = k },

where k is an arbitrary positive real number, usually 1 or 100 depending on the units of measurement. This type of data occurs in many applications, for example, the geochemical composition in geosciences [16,20,21], the vote proportions of elections in economics [16], the gut microbiome composition in biosciences [14], and examples in many other disciplines [1].
Compositional data refer to parts of some whole and provide information only about the relative magnitudes of the components. The two important principles in compositional data analysis are scale invariance and subcompositional coherence. Ratios of components satisfy the two principles, but the variances and covariances of ratios are awkward to manipulate. To solve this problem, the log-ratio methodology is used in compositional data analysis. Aitchison [1] proposed log-ratio transformations including the additive log-ratio (alr) transformation and the centered log-ratio (clr) transformation. In fact, both the alr and clr transformations represent coordinates with respect to the Aitchison geometry [18]. The alr coordinates are non-isometric and asymmetric. The clr coordinates are isometric and symmetric, but the covariance matrix of clr coordinates is singular. Later, Egozcue et al. [7] proposed the ilr coordinates with respect to a given orthonormal basis, which is not unique and can be obtained through a sequential binary partition of the parts of a composition [6].
The traditional regression model leads to misleading statistical inference when it is applied directly to compositional variables. Early research on Type 1 introduced the log-contrast model [2]; later, linear regression models in log-ratio coordinates were proposed [12,14], and the latest research is spatio-temporal regression on compositional covariates [4]. There are many models for Type 2, such as the Dirichlet component regression model [11], the linear regression model [9,21], and the regression model using distributions defined on the hypersphere [19]. Type 3 is studied in [22], which proposed multiple linear regression when both response and covariates are compositional. In particular, a model in the context of compositional time series was introduced [13]. Furthermore, non-parametric regression for all three types was proposed [5].
In this paper, we are interested in Type 3. The model in [22] assumes that the compositional response and covariates have the same number of parts, and that the different parts of a compositional covariate share the same regression coefficient. In practice, the numbers of parts of different compositional variables may differ. For this case, we extend the model in [22] and propose multiple linear regression with compositional response and covariates that have different numbers of parts. The proposed models include a model in the simplex using the matrix product and a model in ilr coordinates. Furthermore, we also give inference for the parameters of the proposed models.
This paper is structured as follows: Section 2 gives some preliminaries for compositional data. Section 3 presents the proposed multiple linear regression models. A real example is given in Section 4 to verify the validity and usefulness of the proposed models. Section 5 presents some concluding remarks.

Preliminaries
In this section, several basic operations in the simplex and in log-ratio coordinates are reviewed. Moreover, we give the definition and some important properties of the matrix product, which plays an important role in this paper.
For any two compositions x = (x_1, x_2, . . . , x_D)^T, y = (y_1, y_2, . . . , y_D)^T ∈ S^D and real number α ∈ R, the perturbation ⊕ and powering ⊙ operations [1] are defined as

x ⊕ y = C(x_1 y_1, x_2 y_2, . . . , x_D y_D)^T,   α ⊙ x = C(x_1^α, x_2^α, . . . , x_D^α)^T,

where C(·) refers to the closure operation [1] (i.e. each component is divided by the total sum of the components and then multiplied by k). The two operations define a (D − 1)-dimensional vector space structure on the simplex S^D. To obtain a Euclidean vector space structure [3,17], we take the following inner product ⟨·, ·⟩_a with its related norm ‖·‖_a and the Aitchison distance d_a(·, ·):

⟨x, y⟩_a = Σ_{i=1}^D log(x_i / g_m(x)) log(y_i / g_m(y)),   ‖x‖_a = √⟨x, x⟩_a,   d_a(x, y) = ‖x ⊖ y‖_a,

where the subscript 'a' stands for Aitchison, g_m(x) denotes the geometric mean of the components of x, and x ⊖ y equals the perturbation x ⊕ ((−1) ⊙ y).
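The closure, perturbation, powering, and Aitchison geometry just defined can be sketched in Python (a minimal illustration; the function names are ours and the constant sum is taken as k = 1):

```python
import numpy as np

def closure(x, k=1.0):
    """Closure operation C(.): rescale positive parts to sum to k."""
    x = np.asarray(x, dtype=float)
    return k * x / x.sum()

def perturb(x, y):
    """Perturbation x (+) y = C(x1*y1, ..., xD*yD)."""
    return closure(np.asarray(x, dtype=float) * np.asarray(y, dtype=float))

def power(alpha, x):
    """Powering alpha (.) x = C(x1^alpha, ..., xD^alpha)."""
    return closure(np.asarray(x, dtype=float) ** alpha)

def aitchison_inner(x, y):
    """Aitchison inner product, computed via clr coordinates."""
    lx = np.log(np.asarray(x, dtype=float))
    ly = np.log(np.asarray(y, dtype=float))
    return float((lx - lx.mean()) @ (ly - ly.mean()))

def aitchison_dist(x, y):
    """Aitchison distance d_a(x, y) = || x (-) y ||_a."""
    diff = perturb(x, power(-1.0, y))
    return np.sqrt(aitchison_inner(diff, diff))
```

Note that d_a([1, 2, 3], [2, 4, 6]) is (numerically) zero, reflecting scale invariance.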
Log-ratio coordinates are widely used in compositional data analysis. For any composition x = (x_1, x_2, . . . , x_D)^T ∈ S^D, the expression of the composition in clr coordinates [1] is

clr(x) = (log(x_1 / g_m(x)), log(x_2 / g_m(x)), . . . , log(x_D / g_m(x)))^T,

where clr(x)_i (i = 1, 2, . . . , D) denotes the ith coordinate of clr(x). Thus a D-part composition x ∈ S^D is transformed to a D-part real vector clr(x) ∈ U ⊂ R^D, where U contains all real vectors with constant sum zero. The clr coordinates preserve the inner product, namely ⟨x, y⟩_a = ⟨clr(x), clr(y)⟩ = (clr(x))^T clr(y), where ⟨·, ·⟩ denotes the inner product of two real vectors. For a real vector u = (u_1, u_2, . . . , u_D)^T ∈ U, the inverse mapping is given by

clr^{−1}(u) = C(exp(u_1), exp(u_2), . . . , exp(u_D))^T.

The other widely used log-ratio coordinates are the ilr coordinates [7] with respect to a given orthonormal basis, and there are several ways to construct such a basis. Given a D × (D − 1) contrast matrix V with V^T V = I_{D−1}, the corresponding orthonormal basis {e_1, e_2, . . . , e_{D−1}} can be obtained from the columns of V, and a composition x = (x_1, x_2, . . . , x_D)^T ∈ S^D can be transformed to the ilr coordinates ilr(x) = (ilr(x)_1, . . . , ilr(x)_{D−1})^T = V^T clr(x). For a real vector z = (z_1, . . . , z_{D−1})^T ∈ R^{D−1}, the inverse mapping is given by ilr^{−1}(z) = C(exp(Vz))^T. From the scale invariance property of compositional data, the obtained composition x can be represented as a vector with a chosen constant sum. The relationship between the ilr coordinates and the clr coordinates of a composition x [7] is clr(x) = V ilr(x), with V V^T = G_D, where G_D is defined in Table 1. In particular, the first coordinate of ilr(x) and the first coordinate of clr(x) have the linear relationship

ilr(x)_1 = √(D/(D − 1)) clr(x)_1.

The coordinate ilr(x)_1 is a scaled sum of all the log-ratios of part x_1 with the other parts of x. It extracts all relative information concerning x_1 and captures the relative contribution of x_1 with respect to all the other parts [10,12]. To explain all relative information about component x_l, we define the permuted composition x^(l) = P_{D,l} x, where P_{D,l}, defined in Table 1, is a permutation matrix.
Similar to the interpretation of the relation between ilr(x)_1 and x_1, the first ilr coordinate ilr(x^(l))_1 of the permuted composition x^(l) explains the relative information of x_l with respect to all the other parts. To keep the notation consistent throughout this paper, the notations and definitions of the elementary matrices are given in Table 1, and some relationships between these matrices are stated in the following property.
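The clr coordinates and one particular choice of ilr coordinates (pivot coordinates, whose first coordinate matches the linear relation ilr(x)_1 = √(D/(D − 1)) clr(x)_1 above) can be sketched in Python (a minimal sketch; the function names are ours):

```python
import numpy as np

def clr(x):
    """clr coordinates: log(x_i / g_m(x))."""
    lx = np.log(np.asarray(x, dtype=float))
    return lx - lx.mean()

def clr_inv(u, k=1.0):
    """Inverse clr mapping: closure of the componentwise exponential."""
    e = np.exp(np.asarray(u, dtype=float))
    return k * e / e.sum()

def pivot_ilr(x):
    """Pivot ilr coordinates; the first one is
    sqrt((D-1)/D) * log(x_1 / gm(x_2, ..., x_D))."""
    x = np.asarray(x, dtype=float)
    D = len(x)
    z = np.empty(D - 1)
    for i in range(D - 1):
        r = D - i - 1                         # number of remaining parts
        z[i] = np.sqrt(r / (r + 1)) * (np.log(x[i]) - np.log(x[i + 1:]).mean())
    return z
```

Since the ilr map is an isometry, the Euclidean norms of pivot_ilr(x) and clr(x) coincide, and the first pivot coordinate equals √(D/(D − 1)) times the first clr coordinate.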
(In Table 1, e_{D,i} denotes the column vector of length D with 1 in the ith position and 0 in every other position.) The elementary matrices in Table 1 have the following properties.
(2) According to Property 2.1(1), we have the stated identity; the third equality in its derivation follows from P_{D,i} P_{D,i}^T = I_D and P_{D,i} J_D = J_D, and the remaining identity is obtained similarly. In the next section, we study multiple linear regression with compositional response and covariates that have different numbers of parts. In order to express a linear function between the simplexes S^{D_2} and S^{D_1}, a special kind of matrix product in the simplex is introduced in the following definition.

Definition 2.2: Let x be a composition in S^{D_2} and consider a real D_1 × D_2 matrix A = (a_{ij}). The matrix product ⊡ in the simplex is defined as

A ⊡ x = C( ∏_{j=1}^{D_2} x_j^{a_{1j}}, ∏_{j=1}^{D_2} x_j^{a_{2j}}, . . . , ∏_{j=1}^{D_2} x_j^{a_{D_1 j}} )^T.
The matrix product transforms a composition x ∈ S^{D_2} to a composition A ⊡ x ∈ S^{D_1} and has the following property.

Property 2.3: For any composition
(1) The matrix product has another expression: A ⊡ x = C(exp(A log x))^T, where log and exp are applied componentwise. According to Property 2.3(1) and Equation (8), we obtain the corresponding relation in coordinates. If A also satisfies A j_{D_2} = 0_{D_1}, then A G_{D_2} = A, and the remaining identity follows from Equations (8) and (9).
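Assuming the power-product form of Definition 2.2, the matrix product can be sketched in Python via its equivalent expression C(exp(A log x)); the zero-row-sum condition A j_{D_2} = 0_{D_1} mentioned above makes the result scale invariant (function names are ours):

```python
import numpy as np

def closure(x, k=1.0):
    """Closure operation C(.): rescale positive parts to sum to k."""
    x = np.asarray(x, dtype=float)
    return k * x / x.sum()

def simplex_matprod(A, x):
    """Matrix product A [.] x = C(exp(A log x)): maps S^{D2} -> S^{D1}.

    If every row of A sums to zero (A j_{D2} = 0_{D1}), the result does not
    depend on the scale of x, i.e. the map is scale invariant.
    """
    A = np.asarray(A, dtype=float)
    return closure(np.exp(A @ np.log(np.asarray(x, dtype=float))))
```

For a 2 × 3 matrix A with zero row sums, simplex_matprod(A, x) and simplex_matprod(A, 10 x) coincide, illustrating scale invariance.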

Regression model in the simplex
The multiple linear regression model in the simplex is

v_i = a_0 ⊕ (A_1 ⊡ u_{1i}) ⊕ (A_2 ⊡ u_{2i}) ⊕ · · · ⊕ (A_q ⊡ u_{qi}) ⊕ ε_i,  i = 1, 2, . . . , n,

where v_i ∈ S^L is the compositional response, u_{ji} ∈ S^{D_j} (j = 1, 2, . . . , q) are the compositional covariates, a_0 and A_1, . . . , A_q are the parameters, and the operations ⊕ and ⊡ are defined in Section 2. We assume that the compositional residual ε_i ∈ S^L follows the normal distribution on the simplex [15]. The sum of squared norms of the errors is

SSE = Σ_{i=1}^n ‖ε_i‖_a^2 = Σ_{i=1}^n ‖v_i ⊖ a_0 ⊖ (A_1 ⊡ u_{1i}) ⊖ · · · ⊖ (A_q ⊡ u_{qi})‖_a^2,

where the last equality in its expansion follows from Property 2.3(3). The parameters can be estimated by minimizing SSE with respect to log(a_0), A_1, A_2, . . . , A_q, respectively, which yields the normal equations (11), where 0_{L×D_k} is an L × D_k matrix of zeros. If both sides of Equation (11) are multiplied from the left by the matrix G_L, then since G_L A_j = A_j, the estimated parameters can be solved through Equation (12).

Regression model in ilr coordinates
The multiple linear regression model in ilr coordinates can be formulated as

ilr(v_i) = b_0 + B_1 ilr(u_{1i}) + B_2 ilr(u_{2i}) + · · · + B_q ilr(u_{qi}) + ilr(ε_i),  i = 1, 2, . . . , n,

where ilr(v_i) and ilr(u_{ji}) represent the ilr coordinates of v_i and u_{ji} (j = 1, 2, . . . , q), respectively, b_0, B_1, . . . , B_q are the parameters, and ilr(ε_1), ilr(ε_2), . . . , ilr(ε_n) are residuals following a multivariate normal distribution with zero mean and common covariance matrix. Using the least-squares (LS) method, the parameters b_0, B_1, . . . , B_q are estimated by solving the corresponding normal equations. The following theorem, proved in Appendix 1, points out that the parameters of the regression model in real space are related to the parameters of the model in the simplex.
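An LS fit of the ilr-coordinate model can be sketched in Python (a minimal sketch; function names are ours, pivot ilr coordinates are used as one concrete basis choice, and each B_j maps R^{D_j − 1} to R^{L − 1}):

```python
import numpy as np

def pivot_ilr(X):
    """Row-wise pivot ilr coordinates of an (n, D) array of compositions."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    n, D = X.shape
    L = np.log(X)
    Z = np.empty((n, D - 1))
    for i in range(D - 1):
        r = D - i - 1
        Z[:, i] = np.sqrt(r / (r + 1)) * (L[:, i] - L[:, i + 1:].mean(axis=1))
    return Z

def fit_ilr_regression(V, U_list):
    """LS fit of ilr(v_i) = b0 + sum_j B_j ilr(u_ji) + e_i.

    V: (n, L) compositional responses; U_list: list of (n, D_j) covariates.
    Returns the intercept b0 in R^{L-1} and the coefficient matrices B_j.
    """
    Y = pivot_ilr(V)                       # (n, L-1)
    Zs = [pivot_ilr(U) for U in U_list]    # each (n, D_j - 1)
    Z = np.hstack([np.ones((len(Y), 1))] + Zs)
    coef, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    b0 = coef[0]
    Bs, pos = [], 1
    for Zj in Zs:
        d = Zj.shape[1]
        Bs.append(coef[pos:pos + d].T)     # (L-1, D_j - 1)
        pos += d
    return b0, Bs
```

As a sanity check, regressing a composition on itself recovers a zero intercept and an identity coefficient matrix.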
The parameter in the first row and first column of the matrix B_j (j = 1, 2, . . . , q) explains the influence of the first coordinate ilr(u_j)_1 on the first coordinate ilr(v)_1. To interpret the remaining coefficients, consider the analogous model for the permuted compositions, Equation (15), where ilr(v_i^{(l_0)}) and ilr(u_{ji}^{(l_j)}) represent the ilr coordinates of the permuted compositions and B_j^{(l_0,l_j)} (j = 1, 2, . . . , q) are the regression coefficient matrices. The model in Equation (13) is a special case of Equation (15). In the following theorem, which is proved in Appendix 2, we show that the intercept vector and regression coefficient matrices in Equation (15) can be expressed through the parameters b_0, B_1, . . . , B_q in Equation (13).

Theorem 3.2:
The model in Equation (15) is a classical multivariate-response regression model. We focus on the interpretation of the parameter in the first row and first column of the matrix B_j^{(l_0,l_j)} (j = 1, 2, . . . , q); that is, we are interested in the first response e_{L−1,1}^T ilr(v^{(l_0)}), so we only study the model in Equation (16). Denote l = (l_0, l_1, . . . , l_q) and D = Σ_{j=1}^q D_j − q; then the model in Equation (16) can be written as Equation (17), and the regression coefficients β^{(l)} can be estimated by the LS method.

Inference for the parameters in proposed models
Although the estimates of the parameters a_0, A_1, A_2, . . . , A_q can be obtained, it is difficult to derive inference for these parameters directly because of the constraint a_0 ∈ S^L and the zero-sum constraints on A_j (j = 1, 2, . . . , q). To solve this problem, we first study inference for the parameters β^{(l)} of the model in Equation (17). For given m (m = 0, 1, . . . , D), consider the hypothesis test H_0: β_m^{(l)} = 0, with corresponding test statistic T_m^{(l)}. For the hypothesis test H_0: β_1^{(l)} = β_2^{(l)} = · · · = β_D^{(l)} = 0 against the alternative that they are not all equal to 0, the test statistic is

F^{(l)} = (β̂_{−0}^{(l)})^T [{((Z^{(l)})^T Z^{(l)})^{−1}}_{−1,−1}]^{−1} β̂_{−0}^{(l)} / (D σ̂²),

where β̂_{−0}^{(l)} = (β̂_1^{(l)}, . . . , β̂_D^{(l)})^T is the estimated regression coefficient vector excluding the intercept, and {((Z^{(l)})^T Z^{(l)})^{−1}}_{−1,−1} denotes the matrix ((Z^{(l)})^T Z^{(l)})^{−1} with its first row and first column removed. If the null hypothesis holds, the test statistic F^{(l)} follows Fisher's F-distribution with D and n − D − 1 degrees of freedom.
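The t and F statistics described above are those of standard LS theory; they can be sketched in Python as follows (a minimal sketch with our own helper names; Z includes the intercept column, as in Equation (17)):

```python
import numpy as np

def ls_inference(Z, y):
    """t and F statistics for y = Z beta + e, with Z containing an
    intercept column first.

    Returns beta_hat, the t statistics T_m = beta_m / se(beta_m), and the
    overall F statistic for H0: all non-intercept coefficients are zero.
    """
    n, p = Z.shape                          # p = D + 1 regressors incl. intercept
    G = np.linalg.inv(Z.T @ Z)
    beta = G @ Z.T @ y
    resid = y - Z @ beta
    s2 = resid @ resid / (n - p)            # sigma^2 estimate, df = n - D - 1
    t = beta / np.sqrt(s2 * np.diag(G))
    # F: exclude the intercept's row and column from (Z'Z)^{-1}, invert,
    # and form the quadratic form in the remaining coefficients.
    b1 = beta[1:]
    W = np.linalg.inv(G[1:, 1:])
    F = (b1 @ W @ b1) / (p - 1) / s2
    return beta, t, F
```

The quadratic form in the numerator equals the regression sum of squares, so F agrees with the usual (R²-based) overall F statistic.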
Through the above process, we obtain the estimation and tests of the parameters β^{(l)}; the parameter β_{k_j}^{(l)} (j = 1, 2, . . . , q) explains the influence of ilr(u_j^{(l_j)})_1 on ilr(v^{(l_0)})_1. When l_0 is fixed, we are mainly interested in the permutation invariance of the test statistics T_{k_j}^{(l)} and F^{(l)}, which is given in the following theorem, proved in Appendix 3: for the model in Equation (16), when l_0 is fixed, these statistics do not depend on the choice of l_j (l_j = 1, 2, . . . , D_j; j = 1, 2, . . . , q). When l_0 varies from 1 to L and l_j varies from 1 to D_j (j = 1, 2, . . . , q) in the regression model in Equation (16), the resulting coefficients can be collected into the vector c_0 and the matrices C_j of Equation (18).

Theorem 3.4:
The vector c_0 is related to the compositional vector a_0, and the matrix C_j (j = 1, 2, . . . , q) has a linear relationship with the regression coefficient matrix A_j of the model in the simplex. The proof of Theorem 3.4 can be seen in Appendix 4. The order of the matrix A_j (j = 1, 2, . . . , q) is the same as that of C_j, so the coefficients in A_j and C_j are in one-to-one correspondence; therefore we can obtain inference for the coefficients of A_j through inference for the parameters C_j.

Example
In this section, we verify the usefulness and effectiveness of the proposed linear regression models with compositional response and covariates. The models discussed in the previous section are applied to modelling the linear relationship between the consumption structure and the age structure of Shanxi province, China. The consumption structure composition (y) includes food (y_1), clothing (y_2), housing (y_3), household equipment (y_4), medical care (y_5), transportation and communication (y_6), educational entertainment (y_7), and other activities (y_8). The age structure composition (x) consists of three parts: age < 15 (x_1), age 15-64 (x_2), and age > 64 (x_3). The consumption structure and age structure data, taken from the statistical yearbook of Shanxi Province, are given in Table 2.
Considering the consumption structure as the response variable and the age structure as the explanatory variable, we build the regression model in the simplex, Equation (19), where a_0 and A_1 are the parameters. The regression model in ilr coordinates of the permuted consumption composition and age composition is Equation (20), where the response and covariate are e_{7,1}^T ilr(y^{(l_0)}) (l_0 ∈ {1, 2, . . . , 8}) and ilr(x^{(l_1)}) (l_1 ∈ {1, 2, 3}), respectively, and e_{7,1}^T b is the regression coefficient vector. When l_0 varies from 1 to 8 and l_1 varies from 1 to 3 in Equation (20), the estimation and tests of the parameters are given in Table 3 (regression results for the model in ilr coordinates of the permuted consumption composition and age composition), where the value in parentheses is the p-value of the corresponding t statistic. In addition, the test statistic F (p-value) is also given in Table 3. When l_0 is fixed, by Theorem 3.3(3), the test statistic F can be obtained from any of the models in Equation (20) with arbitrary l_1 (l_1 = 1, 2, 3).
Based on Table 3, we can obtain c_0 and C_1 as defined in Equation (18). According to Theorem 3.4, we obtain the estimates â_0 and Â_1, where a value in bold indicates that the parameter is significant at the 0.1 significance level. It can be verified that â_0 is a compositional vector and that the sum of each row and each column of Â_1 is zero. The parameter in the ith row and jth column of the matrix A_1 interprets the influence of the log-ratio representing component x_j on the log-ratio representing component y_i; that is, it explains the influence of all relative information about component x_j on all relative information about component y_i. The interpretation of the regression coefficient matrix A_1 is as follows: the age groups < 15, 15-64, and > 64 pay more attention to educational entertainment, transportation and communication, and housing, respectively. The age group < 15 has a positive influence in explaining educational entertainment because education is a lasting priority beginning in childhood; the age group 15-64 has a positive influence in explaining transportation and communication because cars and mobile phones are essential in the daily life of middle-aged people; the age group > 64 has a positive influence in explaining housing, which corresponds to the social phenomenon of older people buying houses for their children. Accordingly, the age group < 15 and educational entertainment, the age group 15-64 and transportation and communication, and the age group > 64 and housing each move in direct proportion over time; this is consistent with the data trend in Table 2.
Consider another model fitted to the untransformed data in real space, Equation (21), where a parameter in bold is significant at the 0.1 significance level and the value in parentheses is the corresponding p-value. From the regression coefficients γ̂, we find that the age groups < 15 and 15-64 both have significant influence, with the same sign, in explaining the same response variables y_1, y_3, y_6, and y_7, which is unreasonable given the differences between the age groups < 15 and 15-64.
To compare the prediction accuracy of the two models fitted to the original compositional data, Equations (19) and (21), we consider the mean of squared Aitchison distances MSD = (Σ_{i=1}^{11} d_a^2(y_i, ŷ_i))/11 as the evaluation index. For the model in Equation (21), ŷ_{i,8} = 100% − Σ_{j=1}^{7} ŷ_{ij}. The MSD of the two models in Equations (19) and (21) are 0.024 and 0.025, respectively. The results show that the prediction accuracy of the proposed model in Equation (19) is very close to that of the model in Equation (21); moreover, the former performs better from the perspective of interpretability of the regression parameters.
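The MSD criterion can be computed directly from clr coordinates, since d_a(x, y) equals the Euclidean distance between clr(x) and clr(y); a minimal sketch (function names are ours):

```python
import numpy as np

def aitchison_dist2(x, y):
    """Squared Aitchison distance via clr coordinates."""
    d = np.log(np.asarray(x, dtype=float)) - np.log(np.asarray(y, dtype=float))
    d = d - d.mean()                  # centre: equivalent to clr(x) - clr(y)
    return float(d @ d)

def msd(Y, Yhat):
    """Mean squared Aitchison distance between observed and fitted compositions.

    Y, Yhat: arrays of shape (n, D), one composition per row.
    """
    return float(np.mean([aitchison_dist2(y, yh) for y, yh in zip(Y, Yhat)]))
```

Because the distance is scale invariant, the compositions may be given with any constant sum (1 or 100%) without changing the MSD.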

Conclusions
Regression analysis is one of the most widely used methods of data analysis in applications. Early research on regression with compositional variables focused on Type 1 and Type 2; the existing models are applied in log-ratio coordinates, and in particular the ilr coordinates are widely used. Because the clr coordinates satisfy a zero-sum constraint that would cause multicollinearity, they are not used in regression models. The models proposed in this paper address Type 3 with compositional response and covariates that have different numbers of parts. The theorems given in Section 3 are used to obtain the estimation and tests of the regression coefficients in the proposed models. Applying the proposed models to a real example, the interpretation of the regression coefficients appears to reflect real patterns.
When the number of parts of a composition is particularly large, the proposed models cannot be applied to high-dimensional compositional covariates, since only the case of LS estimation is treated here. Moreover, linearity is not always satisfied in practice, so nonlinear, nonparametric, and semiparametric regression are needed. These two aspects can be discussed further.