Semi-Structured Distributional Regression

Abstract Combining additive models and neural networks allows to broaden the scope of statistical regression and extend deep learning-based approaches by interpretable structured additive predictors at the same time. Existing attempts uniting the two modeling approaches are, however, limited to very specific combinations and, more importantly, involve an identifiability issue. As a consequence, interpretability and stable estimation are typically lost. We propose a general framework to combine structured regression models and deep neural networks into a unifying network architecture. To overcome the inherent identifiability issues between different model parts, we construct an orthogonalization cell that projects the deep neural network into the orthogonal complement of the statistical model predictor. This enables proper estimation of structured model parts and thereby interpretability. We demonstrate the framework’s efficacy in numerical experiments and illustrate its special merits in benchmarks and real-world applications.


Introduction
In many applications, it is crucial to also capture uncertainties of predictions besides mere point estimates.Different approaches exist to model this (aleatoric) uncertainty and estimate predictive distributions; which we refer to as distributional regression (DR).Examples are quantile regression (Koenker, 2005) or structured additive distributional regression (SADR; see, e.g., Rigby and Stasinopoulos, 2005;Klein et al., 2015).SADR explicitly models all distributional parameters of a distribution D and thereby allows to learn the whole distribution.The general idea of modeling distributions has been advocated early in the machine and deep learning literature (e.g., density networks as proposed by Bishop, 1994).
Among the approaches that directly model predictive distributions, DR can be considered a natural extension of traditional statistical regression models.Classical mean regression is used to estimate the expectation µ := E D (Y |ν) of a target variable Y using tabular input features ν.The most widespread mean regression is commonly known as the generalized linear model (GLM; Nelder and Wedderburn, 1972), defining the set of candidate models as with model weights or coefficients w and response (or activation) function h.As an extension, generalized additive models (GAMs; Wood, 2017) allow for general structured additive predictors µ = h( j f j (ν)) that allow estimating functional (e.g.non-linear, spatial) relationships between features and the response through generic function space representations f j .GLMs and GAMs usually assume only very few and low-dimensional interactions between features ν in the additive predictor j f j (ν).Due to the additivity and function space assumptions of feature effects, this so-called structured predictor allows for a straightforward interpretation of the model components.For example, in a GAM with no interactions, µ = h( j f j (ν j )), the contribution of each effect f j (ν j ) on h −1 (µ) can be quantified explicitly by holding all other effects in the additive predictor constant.SADR extends GAMs by combining

Related Work
Existing SADR models in statistics (e.g., Klein et al., 2015;Groll et al., 2019) assume a complex structured additive predictor.Several authors have described amalgamations of a simpler statistical model and a neural network (NN).Sarle (1994) describes the commonalities and differences between statistical models and NNs, De Waal and Du Toit (2011); Brás-Geraldes et al. (2019) consider GAMs when framed as a NN.Agarwal et al. (2020) propose learning the non-linear additive functions of GAMs within a NN using separate networks for each feature.Various authors in the statistical literature have also proposed a combination of statistical regression and NNs.Tran et al. (2020) propose the class of deep GLMs, where only the conditional expectation µ is modeled through a predictor that is based on a DNN.Their model is similar to the deep Bayesian regression model of Hubin et al. (2018) who implement exact Bayesian inference, while Tran et al. (2020) consider an approximation via fixed form variational Bayes and allow for additional random effects.In Umlauf et al. (2018) a single-layer NN is included in a SADR model for a very specific choice of weight and bias generation, but this approach does not allow for more complex or deep network structures.Existing wide and deep NNs only focus on modeling the distribution mean (e.g., Cheng et al., 2016;Chen et al., 2018) or the hazard in survival regression (Pölsterl et al., 2020).Li et al. (2021) propose a deep distributional regression approach by transforming the estimation problem into a constrained multi-class classification problem but do not use (semi-)structured additive predictors.

Simplified Problem Formulation
While approaches marrying statistical regression and DL exist, these methods either do not provide aleatoric uncertainty or are limited to special use cases.More importantly, none of the existing approaches that try to leverage interpretable regression models and DL actually preserve the interpretability of the structured regression part satisfactorily.This is due to an identifiability problem between the structured and unstructured predictors.For instance, consider a deep GLM that fuses a DNN d with inputs ν additively with a linear GLM predictor using the same features: From the universal approximation theorem (Cybenko, 1989) and related literature, it is well known that d can potentially approximate any continuous function under certain conditions.Thus, without loss of generality, assume that d(ν) = ν m + f (ν), decoupling the learned effect of d into a linear predictor part ν m and some non-linear effect f of ν.
We can directly observe that the deep GLM encompasses an identifiability issue, as we can arbitrarily shift a linear portion of ν from the linear model part in η to the deep part d of the model and vice versa, e.g.: While not relevant for prediction, this identifiability issue inhibits interpretability of the model.In the deep GLM example, it is unclear how much of the linear effect the model will be attributed to the structured predictor and how much to the deep model part.This problem becomes even more entangled for more complex structured predictors as commonly used in GAMs or SADR.

Main Contributions
We present a novel neural network-based framework for the combination of SADR and (deep) NNs.We address challenges with estimation and tuning of such a model, and in particular, propose a solution to the inherent identifiability issue in this model class based on a well-chosen orthogonalization cell.This cell permits the joint modeling of the structured model part and the unstructured DNN predictor in a unifying, end-toend trainable network, while preserving the interpretability of the structured part.On the one hand, our model class subsumes classical statistical regression models such as GAMs or SADR as special cases within a NN architecture while having similar or even superior estimation performance.On the other hand, due to the generality of our approach, extensions of existing regression approaches can be built in a straightforward fashion.This facilitates, e.g., combining interpretable structured effects of tabular features with a DNN to capture potential higher-order interactions, or multimodal learning problems in which multiple different data modalities such as tabular, image or text data are present.
After introducing semi-structured distributional regression and addressing the accompanied identifiability issue in Section 2, we describe implementation details in Section 3. In Section 4, we first examine how to achieve the identifiability of our model.Thereafter, we investigate its estimation and prediction performance in comparison with state-of-the-art distributional models.We then apply our method to several benchmark data sets in Section 5.
Finally, we will use the proposed framework in a multimodal data setting to predict prices of rental apartments in Section 6.We implement our proposed framework in a corresponding software package, available at https://github.com/davidruegamer/deepregression.All codes to reproduce results from numerical experiments, benchmarks and our application are available at https://github.com/davidruegamer/semi-structured_distributional_regression.

Semi-Structured Distributional Regression
In this section, we first introduce SADR and define the basic notation.Afterward, we provide the theoretical basis to achieve identifiability in our general model class.

Distributional Regression and Notation
As briefly introduced before, SADR models aim at estimating arbitrary parametric distributions D(θ 1 , . . ., θ K ) by learning the corresponding distributional parameters θ = (θ 1 , . . ., θ K ) ∈ Θ ⊆ R K .SADR allows to regress features ν on potentially all parameters θ k , k = 1, . . ., K of the response distribution D. In the spirit of GAMs, each of the K distributional parameters is related to (possibly different subsets of) available features ν through a monotonic and differentiable response function The predictors η k (ν) ∈ R specify the relationship between features ν and the (transformed) parameters h −1 k (θ k ), while h k ensures that possible parameter space restrictions on θ k are fulfilled, e.g.h k (•) = exp(•) to ensure positivity for a variance parameter.Note that D itself can be also a more complicated distribution, e.g., a mixture of distributions.

Semi-Structured Distributional Regression
Our proposed framework advances SADR by extending the additive predictors η k to include one or more latent representations learned through (possibly different) DNNs.To this end we embed SADR into a NN and learn the distribution by adapting deep probabilistic modeling approaches.We denote our approach detailed in the following semi-structured distributional regression (SSDR).

Output Model Structure
In order to implement SADR in a NN, we define the last layer of the network as a distributional layer that computes the predicted distribution D(θ 1 (ν), . . ., θ K (ν)) based on the outputs η k (ν) of the subnetworks for the θ k , k = 1, . . ., K. Given a realization y of Y , the model can be estimated by optimizing the negative log-likelihood (NLL) − (θ) = − log p D (y|θ 1 (ν), . . ., θ K (ν)) based on the probability density or mass function p D of D evaluated at the estimated parameters θ.For simplicity, we consider a univariate response Y , but the generalization to multivariate responses is readily available.

Network Inputs
The subnetworks for the distributional parameters each process the feature vector (or different subsets of it) ν.We consider ν to be the concatenated set of input features x = (x 1 , . . ., x p ) ∈ R p modeled as structured linear effects, input features z = (z 1 , . . ., z r ) ∈ R r modeled as structured non-linear effects, and features u = (u 1 , . . ., u q ) that are passed through a DNN and constitute the unstructured model part in one or more of the additive predictor(s) η k , k = 1, . . ., K.

Predictor Structure
With this notation, the semi-structured predictors η k in SSDR are assumed to be an additive decomposition of structured linear parts f k,0 (x) = b k +x w k , structured non-linear functions f k,j (z j ), and one or more unstructured predictors d k,j (u) constituting linear combinations of latent features learned through DNNs.The inputs for these DNNs can be subsets of the features u and can also be (partially) identical to the features x, z: Note that we have suppressed the index k in the subset of feature inputs to not overload notation.However, SSDR allows to specify all additive terms individually for all parameters θ k .This can be relevant for example when prior information is available about which features are relevant for the location or the scale of the response.
We further assume that the DNN predictors can be represented as d k,j (u) = û k,j γ k,j , with latent features ûk,j taken as the outputs of the network trunk (all DNN layers up to the penultimate layer) and γ k,j being the last-layer weights forming the network head (cf. Figure 1).The functions f k,j (•) represent penalized smooth non-linear effects of univariate or low-dimensional features using a linear combination of L k,j appropriate basis functions.A univariate non-linear effect of feature z j is, e.g., approximated by l=1 B k,j,l (z j )w k,j,l , where B k,j,l (z j ) is the lth basis function (such as regression splines, polynomial bases or B-splines) evaluated at z j .Tensor product representations allow for two-or moderate-dimensional non-linear interactions of a subset of z.It is also possible to represent discrete spatial information or cluster-specific effects in this way (see, e.g., Wood, 2017).This representation also allows for random effects in the additive predictor, as these can represented by ridge-penalized linear effects (see Section 3.1 for details).

Identifiability
Identifiability is crucial when features overlap in the structured and unstructured model parts.
More formally, we here address the following identifiability issue for DR with structured additive predictor(s) η str k and unstructured additive predictor(s) η unstr k , k = 1, . . ., K: Definition 2.1.Semi-structured identifiability We say that a semi-structured distributional regression is identified in its structured model parts if there exists no where ( θ) = (θ) and θ k (η k ) in θ is replaced by θ k (η k ) for any k ∈ {1, . . ., K}.
To derive a theoretical concept assuring identifiability in the following, we assume w.l.o.g. that we are only interested in linear effects of pre-specified features x and allow for a single additional DNN predictor d k with arbitrary features u in η k for one distributional parameter θ k .The treatment of the more general case can be found in Supplementary Material A. We again suppress the index k in the feature vectors and design matrices for readability.As we will explain in a later remark, we consider identifiability of every additive predictor η k on the level of n observations.Therefore, let (x 1 , . . ., Lemma 2.2.Orthogonalization Let P X ∈ R n×n the projection matrix for which P X A is the linear projection of A ∈ R n×s , s ≤ n, onto the column space spanned by the features of X and P ⊥ X := I n − P X the projection into the respective orthogonal complement.Then, replacing the latent features U k with and multiplying the result with the last layer's weights γ k ∈ R s , ensures a decomposition of the final predictor into an identified linear part learned from features X and a non-linear part learned from features U k .
A proof of Lemma 2.2 can be found in the Supplementary Material A and can be used to proof the identifiability of structured terms in η k as follows.
Theorem 2.3.Identifiability Replacing U k with U k and multiplying the result with the weights γ k ensures identifiability of the structured linear part in the final predictor (1).
A proof is given in Supplementary Material A.
Remark 2.4.If x, z and u overlap in their features, we first reparameterize the structured non-linear model part to ensure identifiability between the structured linear and non-linear parts and then combine the two into one joint predictor (not shown in Figure 1).The non-linear part can then be interpreted as the non-linear deviation from the corresponding linear effect.Finally, we apply the orthogonalization using the joint structured predictor to separate it from the unstructured DNN predictor.
Remark 2.5.When the DNN and the structured part share p columns with p > n, P ⊥ X is equal to a zero matrix 0 n×n , making U k de facto irrelevant.In this case the estimated model is equivalent to the defined structured model.We see this edge case rather as a property than a limitation of our framework as the key requirement of the proposed architecture is to provide identifiable structured effects, which in this case is only possible by excluding the unstructured predictor part.
Remark 2.6.Last, we note that an alternative orthogonalization type and commonly used technique to make effects identifiable in structured models (see, e.g., Rügamer et al., 2018) is to calculate Ξ k , a matrix with columns spanned by ker( Û k X), and then use U k = Ûk Ξ k .This would have the advantage of being applied on the columns of Ûk and not its rows and would therefore make the orthogonalization independent of the sample or batch size.However, as Ûk is updated in each iteration, the required null space ker( Û k X) is also changing and can thus not be used in an end-to-end differentiable architecture (as gradients cannot be calculated for a random null space).This is the reason we chose the orthogonalization as proposed in Lemma 2.2 and consider identifiability on the level of observations.

Unifying Network Architecture
In contrast to most previous approaches that combine structured regression and DNNs, we propose a unifying network that learns the structured effects of x and z using single unit hidden layers with linear activation functions and different regularization terms for each input type and each distributional parameter.The DNN model part(s) processing u can be arbitrarily specified to, e.g., incorporate complex feature interactions.To ensure identifiability, we propose an orthogonalization cell based on the previous theorem.This allows us to combine structured and unstructured feature effects in a distributional regression setting while ensuring identifiability.Figure 1 and its caption provide a detailed explanation.
Remark 2.7.For latent features ûk,j learned in DNNs d k,j , a distinction is made between DNNs whose inputs are also part of the structured inputs x, z and DNNs whose inputs only appear in the unstructured predictor (xor-node in Figure 1).In the latter case, the DNN outputs are directly summed up as a weighted combination with structured predictors (lower path after xor-node in Figure 1) and fed into the distributional layer.For those ûk,j whose DNNs d k,j also share inputs with x or z, the orthogonalization operation is applied before adding its outputs ũk,j as weighted sums to the remaining predictor parts.While this approach ensures the identifiability of the structured model part, a custom orthogonalization for specific (non-overlapping) inputs in u might also be interesting in applications (see Section 7).

Implementation Details
We now provide further details on various implementation aspects of the SSDR framework.

Penalization, Optimization and Tuning
It is common ground that gradient descent (GD) routines used to train NNs hold an implicit regularization behaviour (see, e.g., Arora et al., 2019).However, in most of our experiments we observe rather coarse estimated non-linear effects or even convergence difficulties when not penalizing structured non-linear effects.We therefore allow for additional quadratic penalization of structured non-linear effects by regularizing the corresponding weights, with the regularization strength controlled through smoothing parameters λ k,j ∈ R for each structured non-linear term f k,j , k ∈ {1, . . ., K}, j ∈ {1, . . ., r k } and appropriate penalty matrices S k,j ∈ R L k,j ×L k,j (see, e.g., Wood, 2017).Optimization of the model is done by Tuning a large number of smooth terms, potentially in combination with DNNs, is a challenging task.Non-linear structured additive models alone require a sophisticated estimation procedure, often based on second-order information and full batch training.This gets even more complicated in semi-structured models.As DL platforms allow for custom optimization routines, one possible optimization strategy could be an iterative procedure alternating between the structured and unstructured model parts.This would allow the use of a stochastic GD routine for the DNN part(s), and a classical statistical optimization for the structured model part(s) including the estimation of smoothness parameters.Here however, we propose an alternative strategy that fosters easy training and tuning by defining the smoothness of each effect in terms of the degrees of freedom (df ; see, e.g., Buja et al., 1989).
This approach can be implemented efficiently using the Demmler-Reinsch Orthogonalization (DRO, cf.Ruppert et al., 2003).The latter can easily be parallelized or sped up using a randomized singular value matrix decomposition (Erichson et al., 2019).We thereby also allow for meaningful default penalization and comparability of effect complexities by setting df k,j to the same value df * k for all smooth effects j of the kth parameter.At the same time, we ensure enough flexibility by choosing df * k = min j∈{1,...,r k } max df k,j , i.e., the largest possible degree of freedom that leaves the least flexible smoothing effect unpenalized and regularizes all others to have the same amount of flexibility.After estimating the smoothing parameters λ k,j ∈ R, optimization can be done efficiently and jointly for all model components using a stochastic GD routine.

Implementation of the Orthogonalization
The orthogonalization proposed in the previous section can be implemented in different variants, depending on the batch size used in training.When using full-batch GD, the projection P ⊥ X can be calculated upfront and applied in each epoch, yielding exact orthogonalization.For mini-batch training with batch matrices X b , we calculate the respective projections for each batch using, e.g., a QR decomposition that allows for stable calculation and inclusion in fully automatic differentiated routines (see, e.g., Roberts and Roberts, 2020).In the next section, we will investigate the difference between full-batch GD and its stochastic GD variant using mini-batch orthogonalization.

Numerical Experiments
We conduct three different simulation experiments to assess the goodness-of-fit in terms of the structured effect estimation and the prediction performance.Our first experiment demonstrates the efficacy of the orthogonalization cell in practice.In the second experiment, we examine the properties of the orthogonalization operation when using mini-batch stochastic GD training.The third experiment compares the proposed framework against classical statistical regression frameworks to demonstrate that estimating a penalized SADR model works equally well when cast as a NN.Unless stated otherwise, the structured non-linear effects f k,j in all experiments and benchmarks are instantiated using thin-plate regression splines (see, e.g., Wood, 2017).

Identifiability and Interpretability
Here, we mimic a situation where the practitioner's interest explicitly lies in decomposing certain feature effects into structured linear, structured non-linear and unstructured nonlinear parts, for reasons of interpretability.We simulate 10 data set replicates with n =  contains linear and non-linear effects of all features, as well as an interaction term of the 10 features: η = b + 10 j=1 x j w j + 10 j=1 f j (x j ) + log 10 ( 10 j=1 (x j + 2)).
The weights w 1 , . . ., w 10 are defined as 2 1 , 2 2 , . . ., 2 10 .f 1 , . . ., f 10 are ten different non-linear functions defined in the Supplementary Material B. Our framework explicitly models the true linear and non-linear terms by separating both effects via orthogonalization.Further, to account for the interaction, an unstructured DNN predictor is defined using a fully connected network with ReLu activation and 32 and 16 hidden units in two hidden layers, respectively.
By projecting the output of the second-last layer into the orthogonal complement of the structured predictors, we ensure identifiability of the linear and non-linear effects.We do not fine-tune the model but rather use the DRO approach described in Section 3.1 and train for a fixed number of 2000 epochs.All models are optimized using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.01 and a batch size of 32.
Results. Figure 2 visualizes the estimated and true non-linear relationships between selected features and the response.Overall the resulting estimates in Figure 2 demonstrate the capability of the framework to recover the true partial non-linear effects, and only the simulation using a normal distribution with Var(Y ) = 100, which amounts to a maximum of 4% signal-to-noise ratio (predictor variance divided by noise variance), shows an overfitting behavior.Most importantly, the results highlight that the DNN predictor, which is also fed all 10 features, does not learn the linear or non-linear part of the structured effects, and thereby the structured part of the additive predictor η remains identifiable and interpretable as constructed and desired.

Mini-Batch Orthogonalization
As discussed in Section 3.2, a full-batch orthogonalization yields an exact projection, while various large-scale applications require mini-batch training, and hence only permit an approximate orthogonalization.Here, we provide evidence that the latter works equally well in practice.We therefore simulate linear models Y = Xw + ε, where X ∈ R n×p and ε ∼ N (0 n , I n ), with different number of samples n ∈ {100, 1000, 10000} and features of size p ∈ {1, 10} drawn independently from N (0, 1).We define w to be equally spaced coefficients from −3 to 3 (w = −3 for p = 1).We then check whether the true effects of these features can be recovered in the presence of a DNN (fully-connected NN with ReLU activation and 100 and 50 hidden units in two hidden layers).The DNN is provided the same input features as defined for the structured part using either no orthogonalization, full-batch or mini-batch orthogonalization with batch sizes B ∈ {25, 50}.The models are trained for 1000 epochs using Adam optimizer with early stopping and a learning rate of 0.001.We repeat each experiment 10 times based on different realizations of standard normal errors ε.

Model Comparison
In this simulation, we compare the estimation and prediction performance of our framework with three other approaches that implement SADR based on a location, scale and shape (LSS) parameterization.More specifically, we run our approach against a likelihood-based optimization (gamlss; Rigby and Stasinopoulos, 2005), a Bayesian optimization (bamlss Umlauf et al., 2018) and a model-based boosting routine (MBB; Mayr et al., 2012).For  Results.Table 1 summarizes the mean log-scores of 0.25n test data points and mean RMSE measuring the average of the deviations between true and estimated structured effects.We observe that our approach (SSDR) often yields the best or second-best estimation and prediction performance while being robust to more extreme scenarios in which the number of observations is small in comparison to the number of feature effects.In this situation, other approaches tend to suffer from convergence problems.We conclude that the estimation of SADR within our DNN framework works at least as good as classical statistical approaches, but yields more robust results in high-dimensional settings.

Benchmark Studies
In addition to the experiments on synthetic data, we provide a number of benchmarks in several real-world data sets in this section.If not stated otherwise, we report measures as averages and standard deviations (in brackets) over 20 random network initializations.Unless reported otherwise, activation functions for DNN predictors in SSDR are ReLU for hidden layers and linear for output layers.Additional investigation of our framework checking its competitiveness for quantile regression, calibrated regression, and high-dimensional classification problems can be found in Supplementary Material C.

Deep Mixed Models
))), with x i,t being the 12 features also used in Tran et al. (2020), individual specific random effects w i ∼ N (0, τ 2 ) ∀i, and d µ , d σ two different fully-connected NNs with two hidden layers and five neurons each.The model is estimated with default settings of Adam, batch size of 32 and the number of epochs chosen by cross-validation.
Results.Our approach yields an average MSE of 0.04 (0.005) which makes our method competitive with the approach of Tran et al. (2020), who report an MSE of 0.05 for the given data split.

Deep Calibrated Regression
Next, we use the data sets Diabetes, Boston, Airfoil, and Forest Fire analyzed by Song et al. (2019) to benchmark our SSDR approach against the two post-hoc calibration methods isotonic regression (IR; Kuleshov et al., 2018) and the GP-Beta model (GPB; Song et al., 2019) with 16 inducing points.The uncalibrated model for the latter two is a Gaussian process regression (GPR) which performed better than ordinary least squares and standard NNs in Song et al. (2019).SSDR directly models the parameters of a normal distribution.
Here, we only fine-tune the specific structure for the predictors η µ = µ, η σ = log(σ), which consist of structured linear and non-linear effects, as well as a DNN for all features (see the Supplementary Material C.2 for details about each model's specification).We split the data into 75% for training and 25% for model evaluation, measured by negative log-scores.
Results.Table 2 suggests that compared to other calibration techniques our method yields more stable results while permitting interpretable structured effects for features of interest.Even though we did not fine-tune the output distribution, SSDR performs as good as the benchmarks in terms of average negative log-scores.

Application to Multimodal Data
Finally, we present an application of our framework to Airbnb price listing data, available at http://insideairbnb.com/get-the-data.html.Various data modalities are included in the data, such as numeric variables for latitude and longitude, integer variables such as the number of bedrooms, textual information such as a room description, dates as well as an image of each property (size 200 × 200 × 3), see Table 3.We will focus on apartments in Munich, Germany consisting of 3,504 observations and 73 features of mixed data modalities.
Our goal is to predict the listing price of each apartment in an interpretable structured additive model, while also accounting for the room description and images.As the data set is relatively small and the information content in the images seems also not substantially decisive for the room price, we use this application to demonstrate a further property of the orthogonalization-its regularization effect.To this end, we compare the model using two Model Specification For D, we use a log-normal distribution.D is parameterized by its location (mean) µ and scale parameter σ with corresponding additive predictors η µ and η σ , respectively.To model the mean of the logarithmic apartment prices (µ), we define our linear predictor η µ as η µ = b µ + 13 j=1 x j w µ,j + 4 j=1 f µ,j (z j ) + f µ,5 (z 5,1 , z 5,2 ) with features given in Table 3.For the scale parameter σ we observe a better validation performance when only including a few structured predictors.As for the mean we include the room type as linear (dummy-)effects and a tensor product spline for location.Further model specifications and test results are given in Supplementary Material D.
Results.Results suggest that structured effects are the main driving factor in our model.
As images and descriptions are rather noisy in this data set, this result is not surprising.
We observe a strong spatial effect of the geographic location on the log price of apartments.
As shown in Figure 4, the more expensive apartments are located in the center of Munich, close to the English garden, or in rather prestigious areas.Relative to a typical apartment in Munich, the price for an apartment in the central part of Munich can, e.g., be up to 1.4  times more expensive, while a comparable apartment in, e.g., the south might cost only 60% of the average room.The partial effect of the latent feature learned via the images and texts is only slightly correlated with the response.The multiplicative effect of remaining structured non-linear effects on the mean prices are visualized in Figure 5.Here the number of accommodated people has the potentially highest impact with a tent-shaped partial multiplicative effect of over two times the average apartment price at 10 accommodated people.The effects for η σ can be found in Table 2 and Figure 3 in the Supplementary Material.
We also compare different model specifications of η µ based on the correlation of predicted and true price values on train and test set.

Conclusion and Outlook
In this work, we demonstrate why the combination of structured additive models and DNNs is not straightforward.We present a solution that provably permits the identification of structured effects in the presence of more flexible DNNs.We further develop a unified network architecture that combines SADR and DL by embedding the former into an overarching DNN.Using an orthogonalization cell enables the estimation of (interpretable) structured feature effects next to unstructured DNN predictors while ensuring identifiability of the structured model part(s).Simulations, benchmark studies, and a multimodal learning application demonstrate the generality and robustness of our proposed approach.
While ensuring identifiability and interpretability of structured effects in the additive predictor, we note that the orthogonalization approach is much more versatile and can be applied to various other use cases beyond the identification of structured effects.In particular, using the orthogonalization, a structured effect can be "subtracted" from a latent effect learned from an unstructured data input, e.g., to adjust for confounders.

B.2 Details on Section 4.3
We generate all features using a standard uniform distribution.The coefficients of linear effects are drawn from a uniform distribution with limits −3 and 3. We use the nonlinear functions as defined in the previous subsection and additionally use the following 3 functions in cases where more than 10 non-linear functions are used (due to the multiple distribution parameters and depending on the amount of overlap between the predictors in these parameters): As intercepts for the location we use 1, for the scale −1.After adding up intercept, linear effects and non-linear effects, the additive predictors η k , k = 1, 2 are transformed according to the link function of the corresponding distribution and distribution parameter.Using the resulting distribution parameters θ k , the outcome values are drawn from the distribution defined by these parameters.

C Benchmark Studies: Further Experiments
In the following, we provide further benchmark experiments on the use of our framework for quantile regression (Section C.1), calibrated regression (Section C.2) and high-dimensional classification (Section C.3).

D Application: Further Details
In the following, we provide further details on the presented Airbnb application.We briefly describe the training of our framework, further results of the proposed model, and the CNN architecture used in the models with images in the unstructured predictor part.All specifications are trained with Adam for 4000 epochs with early stopping, using a learning rate of 0.001 and batch size of 32.

D.1 Warm Starts
In the application, we first estimate a structured model without any DNN and use the estimated coefficients as a warm start for the multimodal network.This effectively tackles the problem of different learning speeds of the two model parts without the need for a dedicated optimization routine.

D.3 Further Results
Figure 7 visualizes the density of the partial effect U 1 γ1 of images on the mean apartment prices.The estimated coefficients for the structured linear predictors are given in the following Table 6 with no entries if the effect was not included in the predictor.The reference apartment in the intercept corresponds to 0 bedrooms and beds (which means either the entry is missing or there is no dedicated bedroom, but instead, e.g., a dormitory) and an entire home or apartment as room type.
The effect of the geographic location on the estimated distribution scale is depicted in Figure 8. Similar to the effect on the mean, the scale parameter is larger in the center of Munich, but also in some less prestigious areas.The effect of these locations is yet much more pronounced than the effect on the mean.

E Details on Computing Environment
Benchmarks and simulation studies were performed on a Linux server with 32 CPUs (Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz), 64 GB RAM and took several minutes (Benchmarks, Identifiability Study) to 4 days (Model Comparison Study).The application was performed on a personal computer with 4 CPUs (Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz), 16 GB RAM and took 6.8 hours.
be the structured feature matrix for n ∈ N observations and w k ∈ R p the corresponding weights.Further, define the collection of n feature vectors (u 1 , . . ., u n ) as U ∈ R n×q , which is fed into the DNN d k , and let U k ∈ R n×s , denote the matrix collecting n latent feature vectors of length s, s ≤ n, obtained from the second-last layer of d k .Finally, let η k = (η k,1 , . . ., η k,n ) ∈ R n be the final predictor vector.If not constrained, d k is able to capture linear effects of any of its features U , including those also present in X, thus making the attribution of shared effects to either one of the model parts in η k not identifiable.The following Lemma shows how the orthogonalization induces a meaningful decomposition of learned effects in η k and guarantees identifiability in the sense of Definition 2.1.

Figure 1 :
Figure 1: Exemplary SSDR architecture incorporating an orthogonalization cell: Structured linear features x and structured non-linear features z (represented by their respective basis evaluations) are fed into the cell, concatenated and combined with the DNN using the orthogonalization operation.The solid lines represent the informing connections that either simply combine information or create a projection matrix P ⊥ .The latter is multiplied with the latent features û to disentangle the structured and unstructured parts.The final additive predictors η k are created by adding the structured predictors to the (orthogonalized) linear combination of latent features ũ and their weights before being passed on to the distributional layer.

Figure 2 :
Figure 2: Non-linear partial effects of six selected features (columns) on the mean of the response from the five different distributions (rows) with true effect in red and estimated functions of the 10 replicates in grey.

Figure 3 :
Figure 3: Squared deviations of model coefficients of the oracle model and the estimated model (y-axis) for different numbers of observations (x-axis), different numbers of features (rows), different batch sizes (colors) both with and without orthogonalization (columns).The differences in squared deviations for "with orthogonalization" are on the scale of 10 −5 and thus negligible.
three different distributions (normal, gamma, logistic), we investigate a combination of different sample sizes n ∈ {300, 2500} and different numbers of linear feature effects in the location (p ∈ {10, 75}), while fixing the number of linear effects in the scale parameter to two.In addition to the linear feature effects, we add 10 non-linear effects for the location and two non-linear effects for the scale.Further details can be found in Supplementary Material B. The SSDR models are trained using Adam with a learning rate of 0.001, batch size of 32, and the number of epochs selected by 5-fold cross-validation.The simulation results are replicated 20 times.
Tran et al. (2020) use a panel data set with 595 individuals and 4165 observations fromCornwell and Rupert (1988) as an example for fitting deep mixed models by accounting for within subject correlation.Performance is measured in terms of within subject predictions of the log of wage for future time points.We follow their analysis by training the model on the years t = 1, . . ., 5 and predicting the years t = 6, 7. We use a normal distribution with constant variance and model the mean with the same NN predictor as implemented byTran et al. (2020).However, instead of being part of the DNN predictor, the subject ID is included as a structured random effect w i for each individual i: log-wage i,t ∼ N (b + different specifications.One model is specified with structured and unstructured predictors which are simply added up.The other specification regularizes the learned latent image effects by orthogonalizing them w.r.t.all specified structured features.In other words, the second model subtracts the structured information from the learned image information in order to regularize the DNN.For model inspection and fine-tuning, we set aside 10% of the data for testing and use 10% of the resulting training data for early stopping.

Figure 4 :
Figure 4: Estimated multiplicative effect (color) of the apartment's geographic location (points on the map) on the mean price value.

Figure 5 :
Figure 5: Estimated multiplicative effect (color) of accommodates, days since last review, reviews per month and review rating, on the mean price value.

Figure 7 :
Figure 7: Distribution of partial effects of the image DNN for the mean apartment price with five selected examples.

Figure 8 :
Figure 8: Estimated multiplicative effect (color) of the apartment's geographic location (points on the map) on the scale of the price value distribution.

Table 1 :
Median and median absolute deviation of the mean negative predictive log-scores and mean RMSE values of estimated weights and non-linear point estimates across all settings and 20 replications.The best performing approach is highlighted in bold.

Table 2 :
Comparison of negative log-scores of different methods (columns) on four different UCI repository data sets (rows).

Table 3 :
Overview of features in the data set, their description and how the features are parameterized.FC denotes a fully-connected layer with a single unit.The embedding layer is of size 100 for a lookup of 10,000 words with maximum sentence length of 100.The CNN architecture is described in the Supplementary Material D.2.The FC DNN architecture consists of 16, 4 and 2 FC layers with dropout layers in between and is used to model the interactions of structured features.
Table 4 compares the model with full information as given in Table 3 (Full), with and without orthogonalization (w/o OZ), with a model

Table 4 :
Correlations of predicted and true values on train and test set (columns) for the six different models (rows).

Table 5 :
Comparison of AUROC on three cancer data sets (first row: Colon cancer; second row: Leukaemia; thrid row: Breast cancer) for our method, a simple DNN and the VAFC