Inference for Multivariate Regression Model Based on Synthetic Data Generated Using Plug-in Sampling

Abstract In this article, the authors derive likelihood-based exact inference for singly and multiply imputed synthetic data in the context of a multivariate regression model. The synthetic data are generated via the Plug-in Sampling method, in which the unknown parameters in the model are set equal to the observed values of their point estimators based on the original data, and synthetic data are drawn from this estimated version of the model. Simulation studies are carried out in order to confirm the theoretical results. The authors provide exact test procedures which, when multiple synthetic datasets are permissible, are compared with the asymptotic results of Reiter. An application using 2000 U.S. Current Population Survey public use data is discussed. Furthermore, properties of the proposed methodology are evaluated in scenarios where some of the conditions used to derive the methodology do not hold, namely for nonnormally and discretely distributed random variables, cases in which the inferential procedures developed still show very good performance.


Introduction
Methods of statistical disclosure control are used to achieve the competing goals of publishing statistical outputs from surveys while protecting the survey respondents' confidential data from disclosure. Statistical disclosure control methods include data swapping, perturbation with randomly added or multiplied noise, and the release of synthetic data. The use of synthetic datasets has gained considerable popularity and importance in recent times (Klein, Mathew, and Sinha 2013). In this article, we investigate some inferential aspects of statistical analysis based on synthetic data for situations when either a single or multiple synthetic datasets based on the original data are created as a substitute for publication and analysis. Little (1993) and Rubin (1993) first advocated the use of synthetic data for statistical disclosure control, using the framework of multiple imputation (Rubin 1987). Rubin (1993) argued that synthetic data so created do not correspond to any actual sampling unit, thus preserving the confidentiality of the respondents. Inferential methods for fully synthetic data were developed by Raghunathan, Reiter, and Rubin (2003). Reiter (2005a) presented an illustration and empirical study of fully synthetic data, and Reiter and Raghunathan (2007) provided an overview of multiple imputation techniques, including their use in statistical disclosure control. Reiter (2003) presented methods for drawing inference for partially synthetic data; this is exactly the context of our article.
There are two main methods one can use to generate synthetic data: Posterior Predictive Sampling (PPS) and Plug-in Sampling (Reiter and Kinney 2012), and statistical methods of data analysis can be developed for both methods.
Although most inferential methods for synthetic data are based on multiple imputation, in a series of recent papers Klein and Sinha (2015a, 2015b, 2015c, 2016) developed exact parametric inferential methods based on singly imputed synthetic data for several probability models, including the multiple linear regression model, where the sole response variable is taken as sensitive, thus requiring protection, while the covariates are treated as nonsensitive. There are cases where singly imputed synthetic data have been released (Hawala 2008; Kinney et al. 2011; Kinney, Reiter, and Miranda 2014), and therefore procedures for valid data analysis in this case are desirable.
Our main objective in this article is to extend this scenario to the case of a multivariate linear regression model where there are multiple sensitive responses following a multivariate normal distribution with means modeled as linear combinations of multiple non-sensitive covariates. Based on the fitted multivariate linear regression model, we synthesize the sensitive responses based on the Plug-in Sampling method, and develop exact inferential data analysis procedures for both single and multiple imputation.
A brief description of the Plug-in Sampling method, which will be used throughout the article, follows. Suppose that $Y = (y_1, \ldots, y_n)$ are the original data, which are jointly distributed according to the probability density function (pdf) $f_{\theta}(Y)$, where $\theta$ is the unknown (scalar, vector, or matrix) parameter. We start by taking the value of a point estimator $\hat{\theta}(Y)$ of $\theta$, and plug it into the joint pdf of $Y$. The resulting pdf, with the unknown $\theta$ replaced by the observed value of the point estimator $\hat{\theta}(Y)$, is denoted by $f_{\hat{\theta}}$. The singly imputed synthetic data, denoted by $V$, are then generated by drawing $V = (v_1, \ldots, v_n)$ from the joint pdf $f_{\hat{\theta}}$. In case of multiple imputation, this procedure is independently repeated $M$ times to generate $M$ synthetic datasets.
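To fix ideas, the following is a minimal sketch of Plug-in Sampling for a univariate normal model (Python is used here for concreteness; the particular point estimators chosen below are illustrative, not prescribed by the method):

```python
import numpy as np

rng = np.random.default_rng(0)

# Original (confidential) data: y_1, ..., y_n iid N(mu, sigma^2),
# with theta = (mu, sigma) unknown.
y = rng.normal(loc=10.0, scale=2.0, size=200)

# Plug-in step: replace theta by the observed value of a point estimator.
mu_hat, sigma_hat = y.mean(), y.std(ddof=1)

# Draw M synthetic datasets, each of size n, from f_{theta_hat}.
M, n = 5, y.size
V = [rng.normal(loc=mu_hat, scale=sigma_hat, size=n) for _ in range(M)]
```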
In terms of the multivariate linear regression model, in our context, we consider several sensitive response variables $y_j$, $j = 1, \ldots, m$, forming the vector of response variables $y = (y_1, \ldots, y_m)'$, and a set of $p$ nonsensitive predictors $x = (x_1, \ldots, x_p)'$. We assume that $y|x \sim N_m(B'x, \Sigma)$, with $B$ and $\Sigma$ unknown. We write $Y = (y_1, \ldots, y_n)$ with $y_i = (y_{1i}, \ldots, y_{mi})'$, and $X = (x_1, \ldots, x_n)$ with $x_i = (x_{1i}, \ldots, x_{pi})'$. We also assume that $\mathrm{rank}(X : p \times n) = p < n$ and $n \geq m + p$. We are thus considering the following multivariate regression model
$$Y = B'X + E_{m \times n}, \qquad (1)$$
where $E_{m \times n}$ is distributed as $N_{mn}(0, I_n \otimes \Sigma)$. It is well known that, based on the original data, $\hat{B} = (XX')^{-1}XY'$ is the MLE and the UMVUE of $B$, distributed as $N_{pm}(B, \Sigma \otimes (XX')^{-1})$, independent of $\hat{\Sigma} = \frac{1}{n}(Y - \hat{B}'X)(Y - \hat{B}'X)'$, which is the MLE of $\Sigma$, with $n\hat{\Sigma} \sim W_m(\Sigma, n - p)$, and therefore $S = \frac{n\hat{\Sigma}}{n-p}$ will be the unbiased estimator of $\Sigma$.
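These estimators, and the plug-in generation of the synthetic responses under model (1), can be sketched as follows (illustrative Python; the true $B$ and $\Sigma$ below are arbitrary, and we plug in $\hat{B}$ and the MLE $\hat{\Sigma}$ as one natural choice, the article's exact plug-in choices being as specified in the text):

```python
import numpy as np

rng = np.random.default_rng(1)
m, p, n = 2, 3, 100

B = rng.normal(size=(p, m))                 # true p x m coefficients (illustrative)
Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])  # true m x m covariance (illustrative)

X = rng.normal(size=(p, n))                 # fixed nonsensitive covariates, p x n
E = rng.multivariate_normal(np.zeros(m), Sigma, size=n).T
Y = B.T @ X + E                             # original sensitive responses, m x n

B_hat = np.linalg.solve(X @ X.T, X @ Y.T)   # (XX')^{-1} X Y', the MLE/UMVUE of B
resid = Y - B_hat.T @ X
Sigma_hat = resid @ resid.T / n             # MLE of Sigma
S = n * Sigma_hat / (n - p)                 # unbiased estimator of Sigma

# Plug-in Sampling: M synthetic response matrices drawn from the fitted model.
M = 5
V = [B_hat.T @ X + rng.multivariate_normal(np.zeros(m), Sigma_hat, size=n).T
     for _ in range(M)]
```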
There are several tests for $B$, based on the original data, in the literature (Anderson 2003). In this article, the authors will develop two new procedures to be used with synthetic data to draw inference for $B$, and also for $C = AB$ and $\Delta = ABD$, where $A$ is a $k \times p$ matrix with $\mathrm{rank}(A) = k \leq p$ and $k \geq m$, and $D$ is an $m \times r$ matrix with $\mathrm{rank}(D) = r \leq k$.
The organization of the article is as follows. In Section 2, based on singly and multiply imputed synthetic data generated via Plug-in Sampling, we develop two exact inference procedures for the matrix of regression coefficients $B$. These will be based on pivotal statistics which are different from the classical test statistics for $B$ under this model (see Anderson 2003); the classical statistics are shown to be nonpivotal in the case of imputed synthetic data generated via Plug-in Sampling. The new exact inferential procedures are compared with Reiter's asymptotic methodology for multiply imputed synthetic data (Reiter 2005b). In Section 3, we present some simulation results in order to check the accuracy of the theoretically derived results for the singly and multiply imputed synthetic data, comparing the latter with the results obtained using an adaptation of Reiter's methodology. We also define the radius (distance between the center and the edge) of the confidence sets for the matrix of regression coefficients $B$, both for the original data and for the singly and multiply imputed synthetic data. The Plug-in Sampling method offers smaller radii of the confidence sets than the PPS method and also gives estimates of the parameters closer to the ones obtained from the original data, despite giving slightly higher levels of disclosure risk (Moura 2016). Section 4 presents data analyses under the proposed methods for singly and multiply imputed synthetic data in the context of public use data from the 2000 U.S. Current Population Survey (CPS), and the results are compared with those obtained from the original data. In Section 5, using the CPS data, we present an evaluation of the level of protection of the released synthetic datasets by comparing single and multiple imputation scenarios. Some concluding remarks are added in Section 6. Proofs of the theorems, corollaries, and other technical derivations appear in Appendices A and B.
We conclude this section with an observation regarding the existence of sufficient statistics. Suppose the original data are $Y \sim f_\theta$, and the synthetic data $V = (V_1, \ldots, V_M)$ are generated such that $V_1, \ldots, V_M | Y$ are iid from $f_{\hat{\theta}}$. Suppose that $T(Y)$ is a sufficient statistic for $\theta$ based on the original data. Then the pdf of the synthetic data is
$$f(V) = \mathbb{E}_{Y}\Bigl[\,\prod_{j=1}^{M} f_{\hat{\theta}(Y)}(V_j)\Bigr],$$
which implies the following result.
Lemma 1.1. Suppose that when the original data $Y$ are observed, $T(Y)$ is a sufficient statistic for $\theta$. Then, when the synthetic data $V = (V_1, \ldots, V_M)$ are made available, $(T(V_1), \ldots, T(V_M))$ is jointly sufficient for $\theta$. Furthermore, if $M = 1$, the sufficient statistic is simply $T(V_1)$. This holds, in particular, when $f_\theta$ belongs to the exponential family.
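A short factorization sketch of the lemma (our reconstruction, assuming each member $f_{\theta'}$ of the family factors as $f_{\theta'}(v) = g_{\theta'}(T(v))\,h(v)$):

```latex
f(V) = \mathbb{E}_{Y}\Bigl[\,\prod_{j=1}^{M} f_{\hat\theta(Y)}(V_j)\Bigr]
     = \Bigl[\,\prod_{j=1}^{M} h(V_j)\Bigr]\,
       \mathbb{E}_{Y}\Bigl[\,\prod_{j=1}^{M} g_{\hat\theta(Y)}\bigl(T(V_j)\bigr)\Bigr],
```

and since the remaining expectation (taken over $Y \sim f_\theta$) depends on $V$ only through $(T(V_1), \ldots, T(V_M))$, the factorization theorem yields the joint sufficiency stated in the lemma.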

Analysis Under Single and Multiple Imputation
In this section, a likelihood-based approach for the analysis of synthetic data generated from a multivariate regression model via the Plug-in Sampling method is presented. First, we provide two new and exact inferential procedures based on the likelihood principle for singly and multiply imputed synthetic data (for $M = 1$, the single imputation case, both procedures coincide), and then work out an adaptation of Reiter's method for our setup.
Consider the multivariate linear regression model (1), with $Y$, $X$, $B$, $\Sigma$, $\hat{B}$, and $S$ defined in that same context.

A First New Procedure Based on the Mean Synthetic Covariances
Toward the aforementioned objective of drawing inference on $B$, based on the partially synthetic data, let $B^*_j = (XX')^{-1}XV_j'$, $j = 1, \ldots, M$, and let $\overline{B}^*_M$ and $S^*_M$ denote the corresponding mean synthetic estimators of $B$ and $\Sigma$. One may then derive the following inferential results:
1. an unbiased estimator of $\Sigma$ is $S^*_M$;
2. the statistic $T_M$, defined in Equation (4), is a pivotal quantity (see Corollary 2.1 below);
3. if one wants to test the significance of a set of regression coefficients or, more generally, a linear combination of $B$, namely $AB = C$ where $A$ is a $k \times p$ matrix with $\mathrm{rank}(A) = k \leq p$ and $k \geq m$, one may define the statistic $T_{M,C}$ as in Equation (5);
4. then, we may build (i) a test for the parameter matrix $C$: in order to test $H_0 : C = C_0$ versus $H_1 : C \neq C_0$ at a given $\gamma$ level, we reject $H_0$ whenever the computed value of $T_{M,C_0}$ exceeds its $1 - \gamma$ quantile; in particular, a test for $B = B_0$ follows upon taking $A = I_p$; and (ii) a confidence set for $C$: a $(1-\gamma)$-level confidence set for $C$ is given by Equation (6), where $\delta_{M,k,m,p,n;\gamma}$ is the $1 - \gamma$ quantile of $T_{M,C}$ (the value of $\delta_{M,k,m,p,n;\gamma}$ can be obtained by simulating the distribution of $T_{M,C}$, by first generating $W \sim W_m(I_m, n - p)$ and then generating $T_{M,C}|W$ from Equation (5); a Monte Carlo sketch of this recipe is given at the end of this subsection);
5. to infer about $ABD = \Delta$, where $A$ is a $k \times p$ matrix and $D$ is an $m \times r$ matrix with $r \leq k$, we start from its natural point estimator $\Delta^*_M = A\overline{B}^*_M D$ and propose to use the pivotal quantity $T_{M,\Delta}$ of Equation (7) (see Corollary 2.2). Taking $r = 1$ and $k = 1$, and making $A : 1 \times p$ a matrix of zeros except for $A_{1,g} = 1$, and $D : m \times 1$ a matrix of zeros except for $D_{h,1} = 1$, for $g = 1, \ldots, p$ and $h = 1, \ldots, m$, we obtain the $(1-\alpha)$ confidence interval for $B_{(g,h)}$.
These results rest on Theorem 2.1 (see Appendix A), which gives the joint pdf of $(\overline{B}^*_M, S^*_M)$ and from which we can infer that, conditional on $S$, $\overline{B}^*_M$ and $S^*_M$ are independent.
Corollary 2.1. The distribution of $T_M$ defined in Equation (4) can be obtained from the decomposition given in Appendix A. This implies that $T_M$ is a pivotal quantity, that is, its distribution does not depend on $\Sigma$.
Corollary 2.2. The distribution of $T_{M,\Delta}$ defined in Equation (7) can be obtained from the decomposition given in Appendix A. We may note that all the above results remain valid for $M = 1$, that is, the single imputation case, for which inferential procedures were not available in the literature.
Remark 2.1. When $m = 1$ and $M = 1$, the statistic $T_M$ in Equation (4) reduces to the statistic $T_2$ used in Klein and Sinha (2015a), whose distribution is derived therein.
Remark 2.2. One could think of using, in our context, adaptations of the classical test criteria for the multivariate regression model (see Anderson 2003 for the classical test criteria); however, as shown by the simulations reported in Appendix B, these adapted classical statistics turn out to be nonpivotal for synthetic data generated via Plug-in Sampling.
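As noted in item 4(ii) above, the cut-off $\delta_{M,k,m,p,n;\gamma}$ has no closed form, but since $T_{M,C}$ is pivotal its quantiles can be estimated by Monte Carlo. A minimal sketch of the two-stage recipe follows (the conditional draw of the statistic given $W$, from Equation (5), is left as a user-supplied placeholder, since its exact form is the one given in the article; the same recipe yields the cut-off $\omega$ of Section 2.2):

```python
import numpy as np
from scipy.stats import wishart

def estimate_cutoff(draw_T_given_W, m, n, p, gamma=0.05, reps=100_000, seed=0):
    """Monte Carlo estimate of the (1 - gamma) quantile of a pivotal statistic.

    draw_T_given_W(W, rng) must implement the conditional draw of the
    statistic given W from its decomposition (Equation (5) for T_{M,C});
    it is a placeholder here, not the paper's exact expression.
    """
    rng = np.random.default_rng(seed)
    draws = np.empty(reps)
    for i in range(reps):
        # Stage 1: W ~ W_m(I_m, n - p), the Wishart factor of the decomposition.
        W = wishart.rvs(df=n - p, scale=np.eye(m), random_state=rng)
        # Stage 2: draw the statistic from its conditional distribution given W.
        draws[i] = draw_T_given_W(W, rng)
    return np.quantile(draws, 1.0 - gamma)
```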

A Second New Procedure Based on a Combination of Mean and Cross Synthetic Covariances
Noting that it is possible to gather more information from the released synthetic data, we propose, in this subsection, another likelihood-based approach for exact inference about $B$, based on a combination of mean and cross synthetic covariances. Let us start by recalling that $V_j$ is an $m \times n$ matrix formed by the vectors $(v_{1j}, \ldots, v_{nj})$, $j = 1, \ldots, M$, and by noting that, conditionally on $\hat{B}$ and $S$, one may construct from the mean synthetic covariances an estimator $S_v$ of $\Sigma$, which ends up being the same estimator defined in Section 2.1.
We may obtain additional information about $\Sigma$ from the cross synthetic covariances, which can be combined with the previous estimator $S_v$ to obtain $S_{comb}$, defined in Equation (9). Analogous to what was done in Section 2.1, one can derive the following inferential results, for $p \geq m$ and $n > m + p$:
1. an unbiased estimator of $\Sigma$ is $S_{comb}$ (see Appendix B.2);
2. we prove in Corollary 2.3 (see below) that $T_{comb}$, defined in Equation (10), is a pivotal quantity whose distribution can be simulated starting from $W \sim W_m(I_m, n - p)$;
3. if one wants to test the significance of a set of regression coefficients or, more generally, a linear combination of $B$, namely $AB = C$ where $A$ is a $k \times p$ matrix with $\mathrm{rank}(A) = k \leq p$ and $k \geq m$, one may define $T_{comb,C}$ as in Equation (11); then, we may build (i) a test for the parameter matrix $C$: in order to test $H_0 : C = C_0$ versus $H_1 : C \neq C_0$ at a given $\gamma$ level, we reject $H_0$ whenever the computed value of $T_{comb,C_0}$ exceeds its $1 - \gamma$ quantile; in particular, a test for $B = B_0$ follows upon taking $A = I_p$; and (ii) a confidence set for $C$: a $(1-\gamma)$-level confidence set for $C$ is given by Equation (12), where $\omega_{M,k,m,p,n;\gamma}$ is the $1 - \gamma$ quantile of $T_{comb,C}$ (the value of $\omega_{M,k,m,p,n;\gamma}$ may be obtained by simulating the distribution of $T_{comb,C}$, by first generating $W \sim W_m(I_m, n - p)$ and then generating $T_{comb,C}|W$ from Equation (11), analogous to the sketch at the end of Section 2.1);
4. to infer about $ABD = \Delta$, where $A$ is a $k \times p$ matrix and $D$ is an $m \times r$ matrix with $r \leq k$, we start from its natural point estimator $\Delta^*_M = A\overline{B}^*_M D$ and propose to use the pivotal quantity $T_{comb,\Delta}$ of Equation (13) (see Corollary 2.4). Taking $r = 1$ and $k = 1$, and making $A : 1 \times p$ a matrix of zeros except for $A_{1,g} = 1$, and $D : m \times 1$ a matrix of zeros except for $D_{h,1} = 1$, for $g = 1, \ldots, p$ and $h = 1, \ldots, m$, we obtain the $(1-\alpha)$ confidence interval for $\Delta = B_{(g,h)}$ (for details of the proof of this result, see Section S3 of Part II of the supplementary material).
The above results are derived based on the observation that $S_{mean}|S \sim W_m\!\left(\frac{S}{M},\, n - p\right)$, and on the following theorem and corollaries, whose proofs are provided in Appendix A.
Theorem 2.2. The joint pdf of $(\overline{B}^*_M, S_{comb})$, defined in Equations (8) and (9), is proportional to the expression given in Appendix A, from which we can infer that, conditional on $S$, $\overline{B}^*_M$ and $S_{comb}$ are independent.
Corollary 2.3. The distribution of $T_{comb}$ defined in Equation (10) can be obtained from the decomposition given in Appendix A, which implies that $T_{comb}$ is a pivotal quantity, that is, its distribution does not depend on $\Sigma$.
Corollary 2.4. The distribution of $T_{comb,\Delta}$ defined in Equation (13) can be obtained from the decomposition given in Appendix A.
Remark 2.3. It is the case that $\mathrm{var}(S^*_M) > \mathrm{var}(S_{comb})$ for $M \geq 2$ (with equality for $M = 1$), and therefore the second new procedure is expected to outperform the first for $M \geq 2$. We nevertheless make both procedures available in the article, since the first procedure has an easier implementation, which the analyst may prefer to use, given that for larger sample sizes there are no big differences between the results of the two procedures in terms of the radius, as shown in Section 3.
The proof of this Remark may be seen in Appendix B.3.

Reiter's (2005b) Methodology Under Multiple Imputation
Now we present an adaptation of Reiter's (2005b) methodology, originally developed for drawing inference on a vector-valued parameter based on multiply imputed synthetic data, to draw inference on a matrix of regression coefficients. Although originally developed for synthetic data generated by PPS, Reiter and Kinney (2012) showed that the methodology in Reiter (2005b) is also valid for synthetic data generated via Plug-in Sampling.
In order to make it possible to use Reiter's (2005b) methodology to estimate $B$ from $V_1, \ldots, V_M$, the synthetic datasets defined at the beginning of Section 2, we consider $\mathrm{vec}(B^*_j)$, the vectorization of the synthetic estimate of $B$ from each dataset, and apply the combining rules to these vectors.
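For reference, the following is a sketch of the combining rules we assume as the basis of this adaptation (the partially synthetic rules of Reiter (2003), with total variance $\bar{u}_M + b_M/M$, on which the multicomponent test of Reiter (2005b) is built), applied to the vectors $\mathrm{vec}(B^*_j)$:

```python
import numpy as np

def combine_synthetic(q_list, u_list):
    """Combining rules for M partially synthetic datasets (assumed form).

    q_list: M point estimates, here the vectors vec(B*_j);
    u_list: M within-dataset covariance matrix estimates of those vectors.
    Returns the combined estimate q_bar and the total covariance
    T = u_bar + b/M, where b is the between-dataset covariance.
    """
    M = len(q_list)
    q = np.stack(q_list)                   # shape (M, d)
    q_bar = q.mean(axis=0)
    u_bar = np.mean(u_list, axis=0)
    dev = q - q_bar
    b = dev.T @ dev / (M - 1)              # between-imputation covariance
    return q_bar, u_bar + b / M
```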

Simulation Studies
In this section we present the results of some simulations. The objectives of these simulations are (i) to show that the inferential methods of Section 2 perform as predicted for singly and multiply imputed synthetic data generated via Plug-in Sampling, and (ii) to compare the accuracy of our proposed methodology with that of Reiter's (2005b) methodology for multiply imputed partially synthetic data. All simulations were carried out using the software Mathematica. To conduct the simulation, we take the population distribution as a multivariate normal distribution with expected value given by the right-hand side of model (1), with matrix of regression coefficients $B$ and covariance matrix $\Sigma$, for $m = 2$ and $p = 3$. The values $x_{1i}, x_{2i}, x_{3i}$, $i = 1, \ldots, n$, of the explanatory variables are generated as iid $N(0, 1)$ and held fixed for the entire simulation.
Based on a Monte Carlo simulation with $10^5$ iterations, we compute estimates of the coverage probability (percentage of observed values of the statistics smaller than the respective theoretical cut-off points) of the following confidence regions, where, in all cases, the level of the confidence region is set to 0.95 (a generic sketch of the coverage computation is given after this list):
1. for the two new procedures in Sections 2.1 and 2.2, based on single and multiple synthetic data, the confidence sets for $B$ and for $AB = C$, given by Equations (6) and (12), are computed, with $A = (0_{2\times 1} \,|\, I_2)$, using the methodology described in the two subsections referred to above; for $M = 1, 2, 5, 10$, and $20$ synthetic datasets, the estimated average coverage probabilities of the confidence sets are shown in Table 1 under the columns B(1) and AB(1) for the procedure in Section 2.1, and under the columns B(2) and AB(2) for the procedure in Section 2.2; for $M = 1$, only one column is needed, since the two procedures coincide and Reiter's adapted procedure is not available for singly imputed data;
2. the confidence set for $B$ is obtained using the adapted methodology of Reiter (2005b) in Section 2.3, for $M (> 1)$ synthetic datasets; for each of the cases $M = 2, 5, 10$, and $20$, the estimated average coverage probabilities of the confidence sets are shown in Table 1 under the column vec(B).
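Schematically, the coverage estimates follow the generic recipe below (a sketch: draw_statistic would encapsulate generating an original dataset, the $M$ plug-in synthetic datasets, and the value of the statistic under study, while cutoff is the theoretical $1 - \gamma$ quantile of the pivot):

```python
import numpy as np

def estimate_coverage(draw_statistic, cutoff, reps=100_000, seed=0):
    """Estimated coverage probability: the fraction of Monte Carlo draws of
    the statistic that fall below the theoretical cut-off point."""
    rng = np.random.default_rng(seed)
    below = sum(draw_statistic(rng) <= cutoff for _ in range(reps))
    return below / reps
```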
The results reported in Table 1, for sample sizes n = 10, 20, 50, 100, and 200, show that, based on singly and multiply imputed synthetic data, the 0.95 confidence sets for B and AB have estimated average coverage probabilities approximately equal to 0.95, confirming that the confidence sets perform as predicted. Using the adapted methodology of Reiter (2005b) for multiply imputed partially synthetic data, the estimated average coverage probabilities fall short of the stipulated 0.95 level for very small sample sizes, as expected, since this procedure is asymptotic in nature, but quickly attain the desired level even for moderate sample sizes when M ≥ 5.
In Part I of the supplementary material, we address cases where the response variables have nonnormal distributions, namely a multivariate t-distribution, a multivariate skew-normal distribution, a binomial distribution, a Poisson distribution, and a distribution with a spike at zero. As may be seen from Tables S1-S8 in the supplementary material, our procedures show, in general, for all these distributions, estimated average coverage probabilities for $B$ and $AB$ (with $A = (0_{2\times 1} \,|\, I_2)$) very close to the nominal value of 0.95, even for small sample sizes. We may note that in all cases the adaptation of Reiter's (2005b) procedure gives somewhat similar results, at least for the larger sample sizes. Only for distributions of response variables with a spike at zero, when testing for the matrix $B$, does the adapted Reiter procedure present noticeably lower average coverage probabilities than our procedures. We may also note that in the single imputation case (M = 1), the estimated average coverage probabilities are slightly closer to the nominal value of 0.95, mainly when considering discrete distributions for the response variables. This may lead to the conclusion that the use/release of singly imputed datasets may be more adequate in these cases.
In order to measure the radius (distance between the center and the edge) of the confidence sets, we propose, for a level $0 < \gamma < 1$, the measure $\Upsilon_M$. In order to compare with the original data, we take $M = 0$, with $\overline{S}_0 = S$, and the cut-off points $\delta_{0,k,m,p,n;\gamma}$ are obtained as the $\gamma$ quantiles of the corresponding statistics, where $F_{p,l} \sim F_{p-l+1,\,n-p-l+1}$ and $F_{k,l} \sim F_{k-l+1,\,n-p-l+1}$. The expected value of this measure, $E(\Upsilon_M)$, involves a constant $K_{M,p,n,m}$, where $K_{0,p,n,m} = 1$ for the original data, while for $M \geq 1$ the constant takes different forms for the procedures in Sections 2.1 and 2.2, with $M = 1$ referring to the case of singly imputed synthetic data. For more details about these expected values, see Appendix B.4.
Observing Tables 2 and 3, we conclude that, as the number $M$ of released synthetic datasets increases, $\Upsilon_M$ decreases and eventually coincides with $\Upsilon_0$, the value for the original data, as expected: as $M$ increases, the amount of information released about the original data increases, bringing the inference closer to that drawn from the original data. We also observe that the values of $\Upsilon_M$, for $M > 1$, become identical for the two procedures for larger sample sizes.
We may see that the Plug-in Sampling method offers radii that are much smaller than those obtained with the PPS method. For M = 1, that is, in the case of single imputation, the PPS method leads to radii approximately two and a half times those obtained under Plug-in Sampling, which may be seen as an important advantage of the Plug-in Sampling method (Moura 2016, sec. 4.2.2).

An Application Using the Current Population Survey Data
In this section we provide an application based on real data and compare the inference based on the original data with the inference based on the synthetic data, according to the procedures developed in Section 2 and also the adapted method of Reiter (2005b). We will focus only on the household records. The full dataset has seventeen variables measured on 51,016 heads of households and includes the variables age, race, sex, and marital status as key identifiers, together with a mix of other categorical and numerical variables. For the vector y of sensitive response variables, we have selected two numerical variables, namely total household income (I) and household property tax (PT). After deleting all entries where at least one of these variables is reported as 0, we were left with a sample size of 32,923. The example addressed below, using the exact methods developed in Sections 2.1 and 2.2, illustrates the capabilities of these methods. We will assume normality of the fifth roots of the two response variables I and PT. As may be observed in Figure 1, the marginal distributions of the transformed variables are approximately normal. In any case, as shown by the results in the supplementary material, even if these variables were not normally distributed, the procedures in Section 2 would still perform adequately.
We take the n = 32,923 households as a random sample, with I and PT as confidential variables. We will use the following set of covariates: N: number of people in the household; L: number of people in the household who are less than 18 years old; A: age of the head of the household; E: education level of the head of the household (coded to take values 31-46); M: marital status of the head of the household (coded to take values 1-7); R: race of the head of the household (coded to take values 1-4); S: sex of the head of the household (coded to take values 1, 2).
For further details, namely on the coding of the variables, we refer to the Current Population Survey March 2000 technical documentation (available at http://www.census.gov/prod/techdoc/cps/cpsmar00.pdf) and to Klein and Sinha (2015a).
The model regresses the fifth roots of I and PT on these covariates, where the indicator variables for the first code present in the sample for each categorical variable are taken out in order to make the model matrix full rank, and where I(E = 32) is the indicator variable for E = 32, that is, for individuals who have completed 1st, 2nd, 3rd, or 4th grade, and so on. The model matrix $X = (x_1, \cdots, x_n)$ has $p = 29$ rows and $n = 32{,}923$ columns, with rank equal to 29. Using the Plug-in Sampling method, we generate a single synthetic dataset. The realizations of the unbiased estimators $B^*$ and $S^*$ of $B$ and $\Sigma$ are shown, respectively, in Table 4 and in expression (17). We see that the point estimates of $B$ based on the synthetic data and the original data tend to be in agreement. We also find that the two estimates of $\Sigma$, $S$ and $S^*$, tend to be in general agreement.
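A sketch of this model-matrix construction (hypothetical miniature data; pandas' drop_first is used here to mimic removing the indicator of the first code of each categorical covariate, under the assumption that the first code present is also the smallest):

```python
import pandas as pd

# Hypothetical miniature of the CPS covariates described above.
df = pd.DataFrame({
    "N": [3, 2, 4], "L": [1, 0, 2], "A": [45, 37, 52],
    "E": [39, 32, 43], "M": [1, 3, 1], "R": [1, 2, 1], "S": [1, 2, 1],
})
categorical = ["E", "M", "R", "S"]
X = pd.get_dummies(df, columns=categorical, drop_first=True)
X.insert(0, "intercept", 1.0)   # rows are observations; the article's X is p x n
```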
As remarked in Moura (2016, sec. 5.1), estimates obtained from Plug-in generated synthetic data seem to be more in agreement with those obtained from the original data than estimates obtained from synthetic data generated via PPS. We now present inferences on the regression coefficients obtained by applying the methodology of Section 2 to the singly imputed and multiply imputed synthetic data, considering M = 2 and M = 5. For this purpose, we use the statistics $T$, $T_M$, $T_{comb}$, and $T_{R,M}$ defined in Section 2 and their empirical distributions ($10^5$ simulation size) to test the significance of the model, for $\gamma = 0.05$. For M = 1, the computed value of the statistic $T$ was 4.96468, which is larger than the cut-off point for this case, $\delta_{1,29,2,29,32923;0.05} = 5.14914 \times 10^{-6}$, with a corresponding p-value approximately equal to 0, therefore rejecting the nonsignificance of the model, that is, concluding that the explanatory variables in x have a significant role in determining the values of the response variables in y. For M = 2 and M = 5, one finds similar p-values, with computed values of $T_M$, for the first procedure, equal to 4.94839 and 5.06947, and computed values of $T_{comb}$, for the second procedure, equal to 4.94420 and 5.06190. Performing the same test on the original data, we obtain for $T_O$ in Equation (14) the computed value 4.93432, also larger than the corresponding cut-off point $1.27984 \times 10^{-6}$, with a p-value approximately equal to 0, again rejecting the nonsignificance of the model. For Reiter's adapted procedure, the p-values obtained were also approximately equal to zero, both for M = 2 and M = 5.
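The reported p-values come from these simulated null distributions; schematically (a minimal sketch, with the rejection direction as in the tests above):

```python
import numpy as np

def empirical_p_value(observed, simulated_draws):
    """p-value as the fraction of simulated null draws at least as large
    as the observed value of the statistic (tests reject for large values)."""
    return float(np.mean(np.asarray(simulated_draws) >= observed))
```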
In Figure 2, one can see a histogram of the empirical distribution of $T_M$ for M = 1 ($10^5$ simulation size).
We further considered the test of the null hypothesis $H_0 : AB = 0$, using
$$A = \begin{pmatrix} 0_{2\times 3} & \begin{matrix} 1 & 0 & 0 \\ 0 & 0 & 1 \end{matrix} & 0_{2\times 23} \end{pmatrix}$$
and the statistics $T_{M,C}$ and $T_{comb,C}$ in Equations (5) and (11), and also the adapted procedure of Reiter (2005b). Also for this case, we may note that for all procedures the p-values are very close to zero, as is the p-value obtained from the original data when using Equation (15). As a consequence, it is interesting to observe that all p-values lead to the same conclusion, the rejection of the nonsignificance of this set of regression coefficients, and that the p-values obtained for M = 1 are not very far from those obtained for M = 2. Comparing the two multiple imputation procedures developed, we observe that they present very similar p-values. Also, as M increases, the p-values get smaller, that is, closer to the p-values obtained with the original data; although this may be seen as an advantage, it comes at the expense of a decrease in confidentiality.
Alternatively, it is possible to construct individual confidence intervals for all regression coefficients by using Result 5 in Section 2.1 and Result 4 in Section 2.2, based on Corollaries 2.2 and 2.4, whose detailed proofs may be found in Section S3 of Part II of the supplementary material. Sections S3.1-S3.6 show the confidence intervals for all regression coefficients derived from the original data and from the synthetic datasets for M = 1, 2, 5. From these confidence intervals, one may observe that, for increasing values of M, the confidence intervals become smaller, approaching the size of those derived from the original data. This fact concurs with the study of the radius in Section 3.

Privacy Protection of Singly Versus Multiply Imputed Synthetic Data
It is anticipated that singly imputed synthetic data will offer greater protection than multiply imputed synthetic data; nevertheless, one needs to evaluate this level of protection. In this section, we perform this evaluation using the CPS data referred to in the previous section. Assume that, after having access to the released synthetic data, an "intruder" estimates the original values $y_i = (y_{1i}, \ldots, y_{mi})'$ by $\hat{y}_i$. We then propose three criteria as measures of the level of privacy protection. From the first criterion, we consider the quantity $D_{1,\epsilon}$, for $i = 1, \ldots, n$ and $l = 1, \ldots, m$, and from the third criterion, the quantity $D_3$. We use Monte Carlo simulation with $10^4$ iterations to estimate the above measures for each of the n = 32,923 households in the CPS dataset.
In Table 5, we show the values of the first and second measures for $\epsilon = 0.01$, and, for $D_{1,0.01}$, its minimum, first quartile ($Q_1$), median, third quartile ($Q_3$), and maximum. In Table 6, we show the values of the third measure for $\epsilon = 0.01$ and $\epsilon = 0.1$, together with the minimum, $Q_1$, median, $Q_3$, and maximum of $D_3$. Looking at Table 5, we observe that the values of the measures increase as M increases, showing that the disclosure risk grows with the number of released synthetic datasets. We also observe that, even for M = 5, the maximum value of $D_{1,0.01}$ is 0.3279, already indicating a substantial disclosure risk compared with 0.1491 for the singly imputed case. Likewise, we observe from Table 6 that if we set $\epsilon = 0.09$, the third measure equals 0 for M = 1 but 0.1886 for M = 5.

Concluding Remarks
The data analysis methodologies of Reiter (2003), Reiter (2005b), and Raghunathan, Reiter, and Rubin (2003) are asymptotic in nature and can only be used when multiply imputed synthetic datasets are released. Moreover, those procedures were developed to draw inference only on scalar and vector parameters. In this article, two exact likelihood-based solutions are offered for the case when multiple or single synthetic datasets are released, and inference procedures are obtained for a matrix of regression coefficients under a multivariate linear regression model when synthetic data are generated via Plug-in Sampling. Furthermore, the authors provide an adaptation of Reiter's (2005b) vector methodology to a matrix of parameters.
The second procedure proposed for multiple synthetic data performs slightly better than the first for small sample sizes, and their performances are nearly the same for larger sample sizes. Nevertheless, the first procedure offers a simpler way of analyzing the synthetic datasets, so it is important to have access to both procedures.
Although singly imputed synthetic data offer greater protection than multiply imputed synthetic data, when releasing increasing numbers of datasets the confidence regions shrink, approaching the size of those derived from the original data.
Simulation studies show that the two new exact methodologies developed lead to confidence sets with the expected level of confidence, even for small sample sizes, for both single and multiple imputation.
Our simulations also reveal that, as the number of synthetic datasets released increases, the inference derived from synthetic datasets comes closer to that based on the original data, but of course at the expense of compromising privacy, namely by increasing the disclosure risk. Due to the increasing need to protect privacy, some entities have already decided not to release multiply imputed synthetic datasets, releasing only a singly imputed dataset. The procedures developed in this work now allow the analysis of the data in the single imputation case, encouraging imputers to consider this scenario without concern about the feasibility of its analysis. We may note that another advantage of single imputation is that the estimated average coverage probabilities are slightly closer to the nominal value of 0.95 than those obtained from multiple imputation, mainly when considering discrete distributions for the response variables; moreover, an analyst may find it less confusing to receive a single dataset.
One of the advantages of using the Plug-in Sampling method is that it offers smaller radii (distance between the center and the edge) of the confidence sets than the PPS method, while also giving estimates of the parameters closer to those obtained from the original data, although at the expense of slightly higher levels of disclosure risk (Moura 2016). Furthermore, the procedures developed, although based on model (1), which may seem a quite restrictive framework, showed very good performance when data are generated by adaptations of the Plug-in method that produce nonnormal or discrete response variables.
In the future it would be interesting to research how the procedures developed would perform on partially synthetic datasets generated by CART (Classification and Regression Trees) methodology (Reiter 2005c), and how techniques similar to those developed here might be applied using the LASSO (Least Absolute Shrinkage and Selection Operator) and other shrinkage and penalized regression methods.

Among the four adapted classical statistics referred to in Remark 2.2 is $T^{(4)} = \lambda_1$, where $\lambda_1$ denotes the largest eigenvalue of $(n-p)\,H \times (\Sigma + S)^{1/2} S^{-1/2} \times G^{-1} \times S^{-1/2} (\Sigma + S)^{1/2}$.
From $T^{(1)}$ we can observe that a term of the denominator involves the product $S^{-1/2}(\Sigma + S)^{1/2}$, and in the other statistics there are similar terms. We can also observe that all of the terms involve a product similar to $S^{-1/2}(\Sigma + S)^{1/2}$ that cannot be simplified in the same way as when using the determinant, as in the statistic $T_M$ used in this article.
After running the above simulations, we can observe from the resulting empirical distributions and cut-off points ($\gamma = 0.05$) that these four statistics are nonpivotal.

B.3. Proof of Remark 2.3
We derive the ratio $\mathrm{var}(S^*_M)/\mathrm{var}(S_{comb})$, which will be larger than 1 for $M \geq 2$ and equal to 1 for $M = 1$.

B.4. Details on Results in Section 3
Lastly, we provide some details about the derivations of the results in Section 3. Combining the result for $E(|(n-p)S|)$ with each of the synthetic expected values, conditionally on $S$, we end up with the expression for $E(\Upsilon_M)$ given in Section 3.