Revisiting Convolutional Neural Networks from the Viewpoint of Kernel-Based Methods

Abstract Convolutional neural networks, like most artificial neural networks, are frequently viewed as methods different in essence from kernel-based methods. In this work we translate several classical convolutional neural networks into kernel-based counterparts. Each kernel-based counterpart is a statistical model called a convolutional kernel network with parameters that can be learned from data. We provide an alternating minimization algorithm with mini-batch sampling and implicit partial differentiation to learn from data the parameters of each convolutional kernel network. We also show how to obtain inexact derivatives with respect to the parameters using an algorithm based on two intertwined Newton iterations. The models and the algorithms are illustrated on benchmark datasets in image classification. We find that the convolutional neural networks and their kernel counterparts often perform similarly. Supplemental appendices and code for the article are available online.


Introduction
A convolutional neural network is the composition of a transformation mapping, or feature representation, and of a prediction method. The key piece of a convolutional neural network is the feature representation part, which is learned from data. Convolutional neural networks date back to the work of Fukushima (1975), whose original inspiration came from Hubel and Wiesel's observations on visual information processing (Hubel and Wiesel 1962). The feature representation part was trained using unsupervised learning algorithms (Fukushima 1980; Miyake and Fukushima 1984). The work of LeCun and collaborators transformed the convolutional neural network literature by showing how to train the networks end to end, that is, training both the feature representation and the prediction parameters jointly, using gradient-based optimization equipped with reverse-mode automatic differentiation (LeCun 1988). This work resulted in the landmark LeNet architectures, which set the state of the art in digit and image recognition at the time (LeCun et al. 2001).
Over the last decade, convolutional neural networks have become a basis of mainstream approaches for tackling visual recognition problems (Goodfellow, Bengio, and Courville 2016). Such approaches have demonstrated a spectacular ability to learn powerful visual representations from large amounts of labeled images. The parallel development of automatic differentiation software to implement gradient backpropagation algorithms and of graphical processing hardware to implement image filtering operators has been critical to this success. Yet, simple mathematical questions left so far unanswered hinder the possibility of statistical modeling and inference. The approximation theory of neural networks is an active research topic (Gribonval et al. 2021). The series of landmark architectures formed by the LeNet series, All-CNN, and AlexNet constitutes an interesting ground to better understand convolutional neural networks from a statistical viewpoint.
The approach we pursue here is to revisit these architectures from the viewpoint of kernel-based methods or smoothing splines, that is, statistical models estimated using penalized likelihood maximization in reproducing kernel Hilbert spaces (Wahba 1990; Schölkopf and Smola 2002). There is a direct approach that would consist in comparing a convolutional neural network for image classification such as LeNet or AlexNet with a kernel-based method with a simple kernel such as the Gaussian kernel acting on images treated as vectors. This approach has been previously pursued (Le, Sarlos, and Smola 2013; Que and Belkin 2016). We seek instead to transpose the way convolutional neural networks process image data into the world of kernel-based methods. This endeavor is challenging, as it requires identifying and adapting each component of an existing convolutional neural network to define a corresponding convolutional kernel network. The Nyström quadrature approximation of an integral operator associated with a reproducing kernel (Williams and Seeger 2000) builds a bridge between the two worlds. The feature mapping of a convolutional neural network can be related to the data-driven approximation of an integral operator. This viewpoint leads to the kernel-based counterparts of convolutional neural networks, convolutional kernel networks, introduced by Mairal et al. (2014). A convolutional kernel network can be trained in an unsupervised manner (Mairal et al. 2014) or in a supervised manner (Mairal 2016).
We translate the classical convolutional neural networks LeNet-1, LeNet-5, All-CNN-C, and AlexNet into their kernel-based counterparts. We obtain the kernel-based counterparts of each component of these convolutional neural networks. We also extend convolutional kernel networks by considering Matérn kernels, which are a generalization of Gaussian radial basis function kernels. After recalling how to train a convolutional kernel network in an unsupervised manner (Mairal et al. 2014), we detail how to train one in a supervised manner using an alternating minimization algorithm with mini-batch sampling and implicit partial differentiation. In particular, we explain how to obtain inexact derivatives with respect to the parameters using an algorithm based on two intertwined Newton iterations. We present experimental results with the various networks in a number of settings. The code to reproduce the results is publicly available with qualitative illustrations and a step-by-step example (Jones, Roulet, and Harchaoui 2021a, 2021b; Roulet, Jones, and Harchaoui 2021).
We discuss in Section 2 the related work. We describe in Section 3 the integral operator Nyström approximation of convolutional kernel networks. In Section 4, we describe the algorithms we used to train them in a supervised manner. We present in Section 5 the experimental results of our convolutional kernel networks and their neural originals for digit classification on MNIST and image classification on CIFAR-10 and a subset of ImageNet.

Related Work
A convolutional neural network (ConvNet) typically acts on multi-dimensional inputs. Consider an input $x \in \mathbb{R}^{f_0 \times h_0 \times w_0}$, where $f_0$ is the number of input channels (e.g., three for an RGB color image) and $h_0$ and $w_0$ are the spatial dimensions. We will collapse the latter two dimensions, representing the input as $x_0 \in \mathbb{R}^{f_0 \times p_0}$. A convolutional network first extracts contiguous patches of total size $s_1$ from the input, resulting in $E_1(x_0) \in \mathbb{R}^{s_1 \times p_1}$. Then a weight matrix $W_1 \in \mathbb{R}^{f_1 \times s_1}$ (whose rows are called "filters"), bias vector $b_1 \in \mathbb{R}^{f_1}$, and an elementwise nonlinearity $a : \mathbb{R} \to \mathbb{R}$ are applied. The result may then be post-processed by applying a function $c_1$ that acts separately on each channel (i.e., row) $1, \dots, f_1$ in order to perform what is called "pooling." Starting from $F_0^{\mathrm{CN}}(x) = x_0$, this process is repeated $L$ times, resulting in an output $F_L^{\mathrm{CN}}(x)$. Unlike a multi-layer perceptron, a ConvNet extracts multiple patches at each layer and applies the same weight matrix $W_\ell$ to each patch. The sharing of weights across spatial locations allows the ConvNet to achieve translation invariance.
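To make the notation concrete, here is a minimal NumPy sketch of one such layer; the 3-by-3 patch size, the ReLU nonlinearity, and the helper names (extract_patches, convnet_layer) are illustrative assumptions, not the paper's code.

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def extract_patches(x, k=3):
    # x: (f, h, w) feature map; returns E(x) of shape (f*k*k, p) with one
    # contiguous k-by-k patch per column, where p = (h-k+1)*(w-k+1).
    f, h, w = x.shape
    windows = sliding_window_view(x, (k, k), axis=(1, 2))  # (f, h-k+1, w-k+1, k, k)
    p = (h - k + 1) * (w - k + 1)
    return windows.transpose(1, 2, 0, 3, 4).reshape(p, f * k * k).T

def convnet_layer(x, W, b):
    # W: (f1, s1) filters, b: (f1,) bias; the same W is applied to every patch.
    E = extract_patches(x)                      # E_1(x), shape (s1, p1)
    return np.maximum(W @ E + b[:, None], 0.0)  # elementwise nonlinearity (ReLU)

For instance, convnet_layer(np.random.randn(3, 32, 32), np.random.randn(16, 27), np.zeros(16)) produces a 16-channel output map, one column per spatial location.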
ConvNets have been studied from different kernel perspectives. Recall that $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ for a set $\mathcal{X}$ is a positive definite kernel if there exists a Hilbert space $\mathcal{H}$ and a map $\varphi : \mathcal{X} \to \mathcal{H}$ such that $k(z, z') = \langle \varphi(z), \varphi(z') \rangle_{\mathcal{H}}$ for all $z, z' \in \mathcal{X}$. We review previous work along the various perspectives: (a) conjugate kernels; (b) Gaussian processes; (c) tangent kernels; and (d) data-driven convolutional kernel networks. Appendix F includes additional considerations.

Conjugate Kernel

Daniely, Frostig, and Singer (2016) noted that for a probability measure $\mu$ and an activation function $a : \mathbb{R}^f \times \mathbb{R}^f \to \mathbb{R}$ that is square integrable with respect to its second input, the quantity
$$k(z, z') = \mathbb{E}_{w \sim \mu}\big[a(w, z)\, a(w, z')\big] \qquad (2)$$
corresponds to the evaluation of a reproducing kernel $k$ on image patches $z, z' \in \mathbb{R}^f$. More generally, Fan and Wang (2020) call a conjugate kernel $k(x, x') = \mathbb{E}_{w \sim \nu}[\varphi(x; w)^\top \varphi(x'; w)]$, where $\varphi(\cdot\,; w)$ is the parameterized feature mapping of the hidden layers of the deep network and $\nu$ a probability measure. The approximation of the expectation with an empirical average with random $w_1, \dots, w_N$ leads to the random features viewpoint on a deep network. This is related to the random feature approximation of Rahimi and Recht (2007).
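As a minimal sketch of this random-features view, assuming a standard Gaussian measure $\mu$ and an activation of the form $a(w, z) = \tanh(w^\top z)$ (both assumptions are for illustration only):

import numpy as np

def conjugate_kernel_mc(z1, z2, N=4096, act=np.tanh, seed=0):
    # Monte Carlo estimate of k(z, z') = E_{w ~ mu}[a(w, z) a(w, z')]
    # using N random filters w_1, ..., w_N drawn from mu = N(0, I).
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((N, z1.shape[0]))
    return float(act(W @ z1) @ act(W @ z2)) / N

Increasing N tightens the estimate; the sample size needed for a good approximation is precisely what motivates the data-driven alternatives discussed below.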

Neural Network Gaussian Process
The neural network Gaussian process viewpoint focuses on the pre-activations $z_\ell := W_\ell E_\ell(F_{\ell-1}(x)) + b_\ell$ at each layer. A Gaussian process is a collection of random variables, any finite subset of which has a joint Gaussian distribution (Rasmussen and Williams 2006). Neal (1996) and Williams (1996) first observed the relationship between single-hidden-layer neural networks and Gaussian processes. Recently, Garriga-Alonso, Aitchison, and Rasmussen (2019) and Novak et al. (2019) showed that the pre-activations at each layer of ConvNets with random weights converge in distribution to Gaussian processes as the number of filters $f_\ell \to \infty$ either sequentially or simultaneously at every layer.

Tangent Kernel
The tangent kernel viewpoint focuses on $(\nabla_W F(x))^\top (\nabla_W F(x'))$, where $W$ contains the network weights and $F$ is the full network input-output function, along the iterations of a supervised training stochastic algorithm. Assuming a proper scaling of the parameters, one can derive the limiting kernel in the case where the dimension of each layer goes to infinity simultaneously (Jacot, Hongler, and Gabriel 2018; Arora et al. 2019; Fang, Dong, and Zhang 2021). The tangent kernel viewpoint has been used to analyze the "lazy dynamics" of training algorithms, that is, the stochastic fluctuations of $(W^{(t)})_{t \ge 0}$ around the optimum $W^\star$.

Data-Driven Convolutional Kernel Networks
The conjugate kernel viewpoint suggests that a data-driven choice of the weights can lead to a better approximation of the ideal kernel with a convolutional kernel network (CKN) with a finite number of weight matrices. Mairal et al. (2014) pursued this approach using an unsupervised procedure and Mairal (2016) did so with a supervised end-to-end training procedure to learn weight matrices from data. While we develop a training algorithm with a stochastic component, the dynamics of the training and their theoretical analysis, as in the neural tangent kernel literature, are not our focus. We are primarily concerned with the translation of classical convolutional neural networks into their kernel-based counterparts and the computational aspects of their supervised training.

Convolutional Kernel Networks
We now describe CKNs, going from theoretical foundations to computational aspects. We also summarize correspondences between components of CKNs and ConvNets. Appendices A and C present in detail the CKN counterparts of the LeNets, All-CNN-C, and AlexNet.

Infinite-Dimensional Viewpoint
Consider once again an input $x_0 \in \mathbb{R}^{f_0 \times p_0}$ that has been collapsed to two dimensions. As with ConvNets, a CKN first extracts contiguous patches of total size $s_1$ from the input, resulting in $E_1(x_0) \in \mathbb{R}^{s_1 \times p_1}$. However, CKNs then apply the feature map $\varphi_1$ of a pre-specified positive definite kernel $k_1$, thereby mapping each patch to an infinite-dimensional space (a Reproducing Kernel Hilbert Space (RKHS)). The result may then be post-processed by applying a pooling operator $\pi_\ell$. For example, for average pooling, $\pi_\ell$ is the feature map corresponding to a cross-product kernel: the pooled feature at patch $i$ is $\sum_{j \in N_{\ell,i}} \varphi_\ell(z_{\ell,j})$, where $z_{\ell,i}$ denotes patch $i$ at layer $\ell$ and $N_{\ell,i}$ is a set of neighbors of patch $i$, so that inner products between pooled features are double sums of kernel evaluations. Repeating this process $L$ times, we obtain, starting from $F_0(x) = x_0$, a final representation $F_L(x)$, on top of which we apply an affine transformation to get the output of the network.

Approximate Feature Maps
While computing the output using the feature maps $\varphi_\ell$ is theoretically possible (assuming the kernels $k_\ell$ at each layer only depend on inner products of features at previous layers), it is computationally unwieldy for even moderately sized networks.
To overcome this, we approximate the kernel $k_\ell$ at each layer by finding a finite-dimensional map $\psi_\ell$ such that $k_\ell(z, z') \approx \langle \psi_\ell(z), \psi_\ell(z') \rangle_{\mathbb{R}^{f_\ell}}$ for a dimension $f_\ell$. The $\psi_\ell$'s then replace the $\varphi_\ell$'s at each layer, thereby providing feature representations of size $f_\ell$ of the patches at each layer $\ell$. There are many ways to choose $\psi_\ell$, including optimizing an approximation to the kernel, using random features, and projecting onto a subspace. A simple procedure to set the weights of a network is to use random weights. From a kernel viewpoint, the issue with such an approach is that a good approximation of the kernels by random features requires quite a few filters; see Figure 1. It is also unclear why further optimizing these features could lead to a better approximation, since approximation results are based on drawing features according to a well-specified measure to approximate (2). We consider instead projections onto subspaces. The idea is to select a finite basis in the RKHS that is the most relevant for the task. In an unsupervised scenario, this basis is computed by approximating the distribution of the inputs in the RKHS. In a supervised scenario, this basis is refined to get the best possible basis that maps the inputs to the outputs. This is a difference between CKNs and ConvNets. In CKNs, learning the weights amounts to learning that basis, and the function space is an RKHS.

Nyström Approximation
We perform the approximation by projecting onto a subspace spanned by "filters." This is usually referred to as the Nyström method (Williams and Seeger 2000; Bo and Sminchisescu 2009; Mairal 2016). Owing to the projection, the resulting feature map is guaranteed to live in the RKHS. These filters may be initialized at random by sampling patches from the images. We will show in Section 4.2 how to differentiate through this approximation and learn the filters from data in a supervised manner.
Consider a dot product kernel $k$ with corresponding RKHS $\mathcal{H}$ and canonical feature map $\varphi$. Furthermore, let $w_1, \dots, w_f \in \mathbb{R}^s$ be a set of filters, collected as the rows of a matrix $W$. Given a patch $z \in \mathbb{R}^s$ of size $s$, the Nyström approximation projects $\varphi(z)$ onto the subspace spanned by $\varphi(w_1), \dots, \varphi(w_f)$ in $\mathcal{H}$ by solving the following least squares approximation problem in the Hilbertian norm associated with $\mathcal{H}$:
$$\alpha(z) \in \operatorname*{argmin}_{\alpha \in \mathbb{R}^f} \Big\| \varphi(z) - \sum_{j=1}^f \alpha_j\, \varphi(w_j) \Big\|_{\mathcal{H}}^2,$$
whose solution is $\alpha(z) = k(WW^\top)^{-1} k(Wz)$, with $k$ applied entrywise. Using these coefficients to compute an approximation of the inner products, we get that
$$\langle \varphi(z), \varphi(z') \rangle_{\mathcal{H}} \approx \langle \psi(z), \psi(z') \rangle \quad \text{with} \quad \psi(z) = k(WW^\top)^{-1/2}\, k(Wz).$$
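A short sketch of this computation follows; the eigendecomposition route to the inverse square root and the small jitter eps are implementation choices assumed here.

import numpy as np

def nystrom_features(Z, W, kappa, eps=1e-6):
    # Z: (s, p) patches as columns; W: (f, s) filters; kappa: scalar kernel
    # profile applied entrywise. Returns psi(Z) of shape (f, p) such that
    # psi(z)^T psi(z') approximates kappa(W z)^T kappa(W W^T)^{-1} kappa(W z').
    K_WW = kappa(W @ W.T) + eps * np.eye(W.shape[0])  # Gram matrix of the filters
    vals, U = np.linalg.eigh(K_WW)                    # K_WW = U diag(vals) U^T
    inv_sqrt = (U / np.sqrt(vals)) @ U.T              # K_WW^{-1/2}
    return inv_sqrt @ kappa(W @ Z)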

Approximation Quality
The number of filters $f_\ell$ controls the quality of the approximation at layer $\ell$: the larger the number of filters, the better the approximation. Since optimizing over $W_\ell$ leads to a non-convex problem, prior work has studied the error of this approximation when setting $W_\ell$ using, for example, k-means++ (Oglic and Gärtner 2017). See also Rudi and Rosasco (2017) for a theoretical comparison of the performance of random features and the Nyström method for kernel ridge regression.
Figure 1 illustrates the interest of the Nyström method on a simple example (see Appendix H.3 for more details). The figure compares results from using (a) the random features $a(W, x)$ where each entry of $W$ is drawn randomly; (b) the Nyström method where the rows of $W$ are chosen by randomly sampling from a large set of patches; and (c) the Nyström method where the rows of $W$ are chosen as the centroids output from spherical k-means on a set of 10,000 patches from MNIST images. Relative to the Nyström method with spherical k-means, the random features method performs 6.8 to 650 times worse for a given number of features, while the Nyström method with random sampling performs 1.3 to 4.0 times worse.

Nyström Interpretation as a Whitening Process
In unsupervised learning, the Nyström approach admits an interpretation as a ZCA-whitening transformation in the RKHS, for the first layer of the convolutional kernel network. Consider the empirical measure $P$ of the filters $w_1, \dots, w_f$. Assume that $P$ is close to the empirical measure of patches of the same size extracted from real images of a dataset. This is the case if one would use, for instance, a quantization algorithm such as k-means to obtain the filters from a sample of patches extracted from the dataset. Assume also that the filters have zero mean in $\mathcal{H}$, that is, for any $g$ in $\mathcal{H}$ we have $\mathbb{E}_{v \sim P}(g(v)) = 0$. The covariance operator $C$, defined for any $g, h$ in $\mathcal{H}$ as $\langle g, Ch \rangle_{\mathcal{H}} = \mathrm{cov}_{v \sim P}(g(v), h(v))$, and the integral operator $T : L^2(P) \to L^2(P)$, defined as $(Tg)(v) = \int g(u)\, \langle \varphi(u), \varphi(v) \rangle_{\mathcal{H}}\, P(du)$, share the same positive eigenvalues. On the other hand, $T$ can be identified with the scaled Gram matrix $k(WW^\top)/f = [k(w_i^\top w_j)/f]_{i,j=1,\dots,f}$; hence, $T$ and $k(WW^\top)/f$ share the same eigenvalues. Therefore, the nonlinear transform performed by the Nyström method,
$$v \mapsto k(WW^\top)^{-1/2}\, \kappa(v) = U \Lambda^{-1/2} U^{-1} \kappa(v),$$
where $U \Lambda U^{-1}$ is the eigenvalue decomposition of $k(WW^\top)$ with $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_f)$ collecting the sequence of eigenvalues $(\lambda_\nu)_{\nu=1,\dots,f}$, and where $\kappa(v) = (k(w_1^\top v), \dots, k(w_f^\top v))$ collects the dot products in $\mathcal{H}$ with the filters $w_1, \dots, w_f$, can be interpreted as performing a rescaling of $(k(w_j^\top v))_{j=1,\dots,f}$ along the principal directions of variations of the filters $(w_j)_{j=1,\dots,f}$ when mapped via $k$ to $\mathcal{H}$. This form of pre-processing of the data is beneficial for the subsequent layers to use the best possible representation of the inputs in the RKHS.

Overall Formulation of a CKN
A core hyperparameter in CKNs is the kernel. For simplicity of the exposition we assume that the same kernel is used at each layer. Traditionally, CKNs use normalized kernels of the form
$$K(z, z') = \|z\|\, \|z'\|\, k\Big(\Big\langle \tfrac{z}{\|z\|}, \tfrac{z'}{\|z'\|} \Big\rangle\Big),$$
where $k$ is a dot product kernel on the sphere. Examples of such kernels $k$ include the arc-cosine kernel of order 0, the RBF kernel on the sphere, and the Matérn kernel on the sphere. Using dot-product kernels on the sphere allows us to restrict the filters to lying on the sphere.
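For concreteness, here are sketches of such dot-product profiles as functions of the inner product $u = \langle z, z' \rangle$ on the unit sphere; the bandwidth and order parameterizations below are illustrative choices, not the paper's settings.

import numpy as np

def arccos0(u):
    # arc-cosine kernel of order 0
    return 1.0 - np.arccos(np.clip(u, -1.0, 1.0)) / np.pi

def rbf_sphere(u, alpha=2.0):
    # Gaussian RBF restricted to the sphere: since ||z - z'||^2 = 2 - 2u there,
    # exp(-||z - z'||^2 / (2 sigma^2)) reduces to exp(alpha (u - 1)) with alpha = 1/sigma^2.
    return np.exp(alpha * (u - 1.0))

def matern32_sphere(u, ell=1.0):
    # Matern profile of order 3/2 with r = ||z - z'|| = sqrt(2 - 2u) on the sphere
    r = np.sqrt(np.maximum(2.0 - 2.0 * u, 0.0))
    return (1.0 + np.sqrt(3.0) * r / ell) * np.exp(-np.sqrt(3.0) * r / ell)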
To account for the aforementioned forms of kernels, denote by $N_\ell$ the function normalizing the patches of the representation $F_{\ell-1}$ of the input at layer $\ell$. In addition, assume the pooling is a linear operation and denote the pooling operator by the matrix $P_\ell$. (See Appendix D in the supplementary materials for precise definitions.) Then the representation at the next layer, given by extracting patches, normalizing them, projecting onto a subspace, re-multiplying by the norms of the patches, and pooling, takes the form
$$F_\ell = \big[\Psi_\ell\big(N_\ell(E_\ell(F_{\ell-1}))\big)\, \mathrm{diag}\big(n_\ell(F_{\ell-1})\big)\big]\, P_\ell,$$
where $n_\ell(F_{\ell-1})$ collects the norms of the patches of $F_{\ell-1}$. Here $\Psi_\ell$ denotes the function that applies the approximate feature map $\psi_\ell$ as derived above to the features at each spatial location. After $L$ such compositions we obtain a final representation that can be used for a classification task. Figure 2 illustrates a basic CKN architecture.
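Putting the pieces together, a single CKN layer can be sketched as follows; the patch extractor is passed in as a callable, and the shapes, the jitter eps, and the kernel profile kappa are illustrative assumptions.

import numpy as np

def ckn_layer(F_prev, W, kappa, P, extract, eps=1e-6):
    # F_prev: (f_prev, p_prev) previous representation; W: (f, s) unit-norm filters;
    # P: (p, p_out) linear pooling matrix; extract: returns patches of shape (s, p).
    E = extract(F_prev)                                  # extract patches
    norms = np.linalg.norm(E, axis=0) + eps
    Z = E / norms                                        # normalize patches
    K_WW = kappa(W @ W.T) + eps * np.eye(W.shape[0])
    vals, U = np.linalg.eigh(K_WW)
    psi = (U / np.sqrt(vals)) @ U.T @ kappa(W @ Z)       # project onto span{phi(w_j)}
    return (psi * norms) @ P                             # re-multiply norms, then pool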
Precisely, given a set of images $x^{(1)}, \dots, x^{(n)}$ with corresponding labels $y^{(1)}, \dots, y^{(n)}$, we consider a loss $L$, leading to the nonconvex optimization problem
$$\min_{\substack{W_1 \in S_{d_1}, \dots, W_L \in S_{d_L} \\ W_{L+1},\ b_{L+1}}} \; \frac{1}{n} \sum_{i=1}^n L\big(y^{(i)},\, W_{L+1}\, \varphi(x^{(i)}, W) + b_{L+1}\big) + \frac{\lambda}{2} \|W_{L+1}\|_2^2.$$
Here $\lambda \ge 0$ is a regularization parameter and $S_{d_\ell}$ is the product of Euclidean spheres in which the filters $W_\ell$ at layer $\ell$ lie, such that each row of $W_\ell$ has unit Euclidean norm.
Compositions such as a convolution followed by a nonlinearity can be translated by the application of a simple kernel such as the arc-cosine kernel of order 1. Pooling layers can be translated by the application of a double-sum kernel such as the cross-product kernel.

Supervised Training of CKNs
A convolutional kernel network can be trained using a gradient-based optimization algorithm in a differentiable programming framework. End-to-end learning for training a CKN in a supervised manner was first considered by Mairal (2016). We describe here the main algorithm for supervised training of CKNs that we use for our experiments. This algorithm, summarized in Algorithm 1, is similar to that of Mairal (2016), that is, it is an alternating minimization algorithm. However, in contrast to the algorithm of Mairal (2016), ours uses mini-batch sampling and a form of implicit differentiation; see Appendix E for a detailed discussion. We also describe an auxiliary algorithm to compute, up to any accuracy, the gradient of the training objective with respect to the CKN parameters.

Algorithm 1 Training of CKNs
Inputs: CKN architecture composed of $L$ layers, dataset of images and labels $(x^{(i)}, y^{(i)})_{i=1,\dots,n}$, maximum iterations of supervised training $T$
Initialization:
- At each layer $\ell$, define $W_\ell$ as centroids of the data from the previous layer from k-means
- Train classifier $W_{L+1}, b_{L+1}$ on the initial representation of the data using L-BFGS
Supervised training:
for $t = 1, \dots, T$ do
- Compute features $F_L^{(i)}$ through the CKN layers on a sampled mini-batch
- Optimize classifier on the represented mini-batch using an approximation of the loss
- Update filters using the first-order information associated to the optimized mini-batch loss
end for
- Train classifier $W_{L+1}, b_{L+1}$ on the final representation of the data using L-BFGS
Output: $W_1, \dots, W_L, W_{L+1}, b_{L+1}$ (optimal filters and classifier parameters)

Main Algorithm
The main algorithm is summarized in Algorithm 1. At first, the CKN architecture is just a skeleton defining the operations performed at each layer by specifying, for example, the number of filters at each layer. The filters of the CKN are initialized as follows. At the input layer we cluster patches of the images using spherical k-means (since we consider normalized kernels), and then take the means associated with the clusters to be the initialized filters; a minimal sketch of this clustering step appears below. At each subsequent layer we apply the same process on patches from the features input to that layer, which are obtained using the already-initialized filters from the previous layers. Once all filters are initialized, the classifier parameters are initialized by minimizing the objective for a fixed representation of the images given by the initialized filters, using the L-BFGS algorithm. The overall initialization of the filters can be seen as an unsupervised training of the network, as it does not require the label information of the data.
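A minimal sketch of spherical k-means for this initialization step; the fixed iteration count and the handling of empty clusters are simplifications assumed here.

import numpy as np

def spherical_kmeans(Z, f, iters=50, seed=0):
    # Z: (s, n) patches as columns; returns W: (f, s) unit-norm centroids.
    rng = np.random.default_rng(seed)
    Z = Z / (np.linalg.norm(Z, axis=0, keepdims=True) + 1e-12)
    W = Z[:, rng.choice(Z.shape[1], size=f, replace=False)].T  # init from data
    for _ in range(iters):
        assign = np.argmax(W @ Z, axis=0)      # nearest centroid by cosine similarity
        for j in range(f):
            members = Z[:, assign == j]
            if members.shape[1] > 0:
                m = members.sum(axis=1)
                W[j] = m / (np.linalg.norm(m) + 1e-12)  # mean, re-projected on sphere
    return W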
Once the filters and the classifier are initialized, a supervised training procedure refines the filters by using the label information. The supervised training proceeds by computing features of a random mini-batch of samples and performing one step of a partial implicit optimization approach, detailed below. Formally, recalling that $W = (W_1, \dots, W_L)$ collects the parameters of the layers $1, \dots, L$ and denoting by $V = (W_{L+1}, b_{L+1})$ the parameters of the classifier, the objective reads as $\min_{W, V} \sum_i L_i(\varphi_i(W), V)$, where each $L_i$ and $\varphi_i$ are, respectively, the regularized loss function and the feature representation evaluated on the $i$th instance. Denote by $q_{L_S}$ the quadratic approximation of the losses $L_S = \sum_{i \in S} L_i$ on a random mini-batch $S$ for the current classifier $V$ and representations $\varphi_S(W)$. Our optimization scheme computes one step of a regularized Newton method with respect to $V$, leading to a stochastic approximation $M_S(\varphi_S(W))$ of $\min_V L_S(\varphi_S(W), V)$, and uses the gradient of $M_S \circ \varphi_S$ to update the parameters,
$$W \leftarrow W - \gamma\, \nabla_W (M_S \circ \varphi_S)(W).$$
The step size is $\gamma$ and the regularization is $\tau > 0$. An additional projection onto the unit sphere is performed after the gradient step to ensure that the filters remain on the sphere. The minimization defining $M_S$ can be computed in closed form since it amounts to a quadratic problem. The derivative of $M_S$ with respect to $\varphi_S(W)$ can then also be computed in closed form. The overall gradient $\nabla_W (M_S \circ \varphi_S)(W)$ can be accessed formally by the chain rule and in practice by automatic differentiation, provided that we have access to the derivative of $\varphi_S(W)$ along any direction. The next section delves into the efficient computation of $\varphi_S(W)$ and its implementation in a differentiable programming framework.
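The following PyTorch sketch illustrates one such step, with a squared loss standing in for the quadratic approximation of the multinomial logistic loss; the shapes, the squared-loss surrogate, and the hyperparameter values are assumptions for illustration only.

import torch

def ulr_step(F, Y, V, lam=1e-3, tau=1.0):
    # F: (m, d) mini-batch features phi_S(W), built from the filters so that
    # autodiff can reach W; Y: (m, c) targets; V: (d, c) current classifier.
    m, d = F.shape
    # Closed-form minimizer of the quadratic model in V: squared loss + ridge(lam)
    # + a proximal term (tau) keeping the new classifier close to the current one.
    A = F.T @ F / m + (lam + tau) * torch.eye(d)
    B = F.T @ Y / m + tau * V
    V_new = torch.linalg.solve(A, B)   # differentiable with respect to F
    # M_S(phi_S(W)): the partially minimized mini-batch objective.
    M_S = 0.5 * ((F @ V_new - Y) ** 2).sum() / m + 0.5 * lam * V_new.pow(2).sum()
    return M_S, V_new.detach()

Calling M_S.backward() then yields the gradient of $M_S \circ \varphi_S$ through the graph that produced F; a gradient step on the filters followed by a row-wise projection onto the unit sphere completes the update.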

Encoding CKNs in a Differentiable Programming Framework
We detail now how to encode the feature representation $\varphi(x, W)$ of an input image $x$ in a differentiable programming framework to access its first-order information up to any numerical accuracy. Encoding the feature representation in a differentiable programming framework amounts to encoding $\varphi$ such that the gradient of $W \mapsto \langle \varphi(x, W), M \rangle$ can be accessed for any $M$ given the computation of $\varphi(x, W)$. For example, to compute the gradient of the loss with respect to the filters, one would need to access the gradient of $W \mapsto \langle \varphi(x, W), M \rangle$ with $M$ the gradient of the loss with respect to the representation. In our training procedure we need to have access to the gradient of $W \mapsto \langle \varphi_S(x_S, W), \nabla M_S(F_{S,L}) \rangle$, for $M_S(F_{S,L})$ the objective partially minimized with respect to the classifier parameters on a quadratic approximation of the losses and $F_{S,L} = \varphi_S(x_S, W)$ the feature representations of a mini-batch of images.
Overall the gradient is computed in a backward pass over the network. Denoting the adjoint of the differential of a function $f$ as the operator that outputs the gradient $\nabla f(W)[M]$ of $W \mapsto \langle f(W), M \rangle$ at $W$ for any $M$, the gradients are propagated backward through the layers, with $F_\ell = \Psi_\ell(N_\ell(E_\ell(F_{\ell-1})))\, P_\ell$ the feature representation at the $\ell$th layer (norm re-multiplication omitted here for brevity) and $F_0 = x$.
The derivatives of the CKN layer with respect to its inputs and its filters are thoroughly detailed in Appendix D. Overall, the computation involves differentiating through the inverse matrix square root function $s : S^n_{++} \to \mathbb{R}^{n \times n}$ given by $s(K) = K^{-1/2}$, where $K = k(WW^\top) + \varepsilon I$ in our case. To compute its derivative, we use that for a positive definite matrix $K \in S^n_{++}$ and a matrix $M$, the directional derivative reads $\nabla s(K)[M] = -K^{-1/2}\, Y\, K^{-1/2}$, where $Y$ solves $K^{1/2} Y + Y K^{1/2} = M$. Hence, computing the gradient of the CKN in this manner consists of solving a continuous Lyapunov equation (Khalil 1992, chap. 3.3). However, solving such a large system may be impractical.
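As a sketch, this directional derivative can be computed with SciPy's Lyapunov solver; this is the direct, exact route, as opposed to the iterative one developed in the next subsection.

import numpy as np
from scipy.linalg import sqrtm, solve_continuous_lyapunov

def d_inv_sqrt(K, M):
    # Directional derivative of s(K) = K^{-1/2} along M for symmetric positive
    # definite K, via d(K^{1/2}) = Y with K^{1/2} Y + Y K^{1/2} = M
    # and d(K^{-1/2}) = -K^{-1/2} d(K^{1/2}) K^{-1/2}.
    R = np.real(sqrtm(K))                  # K^{1/2}
    Y = solve_continuous_lyapunov(R, M)    # solves R Y + Y R = M
    Rinv = np.linalg.inv(R)
    return -Rinv @ Y @ Rinv

For $f$ filters this costs on the order of $f^3$ per direction and does not map cleanly onto GPU-friendly primitives, which is what motivates the iterative scheme below.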

Fast Computation and Derivation of the Inverse Matrix Square Root
We propose to compute the inverse square root matrix $K^{-1/2}$ of a matrix $K$ such as $K = k(WW^\top) + \varepsilon I$ by an iterative method producing iterates $A_t \to K^{-1/2}$, where $A_t$ denotes the $t$th iterate of the method. We use the Denman and Beavers algorithm (Denman and Beavers Jr. 1976). This algorithm amounts to a Newton method (referred to as the outer Newton method in the following) (Higham 2008, theorem 5.6). The process begins with $S_0 = K$ and $T_0 = I$ and proceeds with the iterations
$$S_{t+1} = \tfrac{1}{2}\big(S_t + T_t^{-1}\big), \qquad T_{t+1} = \tfrac{1}{2}\big(T_t + S_t^{-1}\big). \qquad (7)$$
The sequences $T_t$ and $S_t$ converge quadratically to $K^{-1/2}$ and $K^{1/2}$, respectively, provided that $\|K\|_F \le 1$ (Higham 2008, theorem 5.6). The derivative of the matrix square root can then be approximated by back-propagating the gradients with respect to each step of the iterative method, that is, by differentiating through the map $K \mapsto T_{t_{\max}}(K)$, where $T_t(K)$ is the result of $t$ steps of the algorithm. The actual back-propagation of the gradients is automatically taken care of in modern automatic differentiation packages. Each iteration, however, involves the inverses of the iterates $T_t$ and $S_t$, which are expensive to compute when using a large number of filters. We therefore apply the Newton method one more time (which we refer to as the inner Newton method), yet now to compute $T_t^{-1}$ and $S_t^{-1}$ (Higham 1997). Recall that a Newton step from a matrix $X_t$ to compute $A^{-1}$ is $X_{t+1} = 2X_t - X_t A X_t$. Using $S_t$ and $T_t$ as initial guesses to compute $T_t^{-1}$ and $S_t^{-1}$, respectively, and plugging a single inner Newton step into (7), we get Algorithm 2.

Algorithm 2 Intertwined Newton Method for Matrix Inverse Square Root
Input: Positive definite matrix $K = k(WW^\top) + \varepsilon I \succ 0$, total number of iterations $t_{\max}$
Initialize: $S_0 = K$, $T_0 = I$
for $t = 0, \dots, t_{\max} - 1$ do
- $S_{t+1} = \tfrac{1}{2}\big(3 S_t - S_t T_t S_t\big)$
- $T_{t+1} = \tfrac{1}{2}\big(3 T_t - T_t S_t T_t\big)$
end for
Output: $T_{t_{\max}} \approx K^{-1/2}$ (and $S_{t_{\max}} \approx K^{1/2}$)

An experimental evaluation of this strategy when we run, say, 20 iterations of the outer Newton method yet only 1 iteration of the inner Newton method demonstrates that it is effective in practice; see Figure 3. We therefore adopt this approach in the experiments.
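A sketch of Algorithm 2 in NumPy follows; the normalization of $K$ (to meet the scaling condition above) and its subsequent un-scaling are implementation details assumed here.

import numpy as np

def intertwined_newton(K, t_max=20):
    # Returns approximations of (K^{-1/2}, K^{1/2}) using only matrix products,
    # so the loop is GPU-friendly and easy to back-propagate through.
    n = K.shape[0]
    nu = np.linalg.norm(K)        # scale K so that the iterations contract
    S, T = K / nu, np.eye(n)
    for _ in range(t_max):
        # one inner Newton step refines each inverse (T_t ~ S_t^{-1} and vice versa)
        S_inv = 2.0 * T - T @ S @ T    # Newton step from T_t toward S_t^{-1}
        T_inv = 2.0 * S - S @ T @ S    # Newton step from S_t toward T_t^{-1}
        S, T = 0.5 * (S + T_inv), 0.5 * (T + S_inv)
    return T / np.sqrt(nu), S * np.sqrt(nu)

As a quick sanity check, for a random positive definite K the product T @ T @ K approaches the identity as t_max grows.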

Experiments
We experimentally explore our CKN translations of classical ConvNets. Previous work reported that CKNs with ad hoc architectures can achieve comparable performance to ConvNets on MNIST and CIFAR-10 (Mairal et al. 2014; Mairal 2016). Instead, we investigate whether, given a classical ConvNet, a CKN counterpart can be designed, following Section 3, and trained from data to achieve comparable accuracy to the classical ConvNet.

Data
The experiments use the datasets MNIST and CIFAR-10, in addition to a 10-class subset of ImageNet 2012 (LeCun et al. 2001; Krizhevsky and Hinton 2009; Krizhevsky 2014). MNIST consists of 60,000 training images and 10,000 test images of handwritten digits numbered 0-9 of size 28 × 28 pixels. In contrast, CIFAR-10 consists of 50,000 training images and 10,000 test images from 10 classes of objects of size 3 × 32 × 32 pixels. The 10-class subset of ImageNet 2012 consists of 5000 training images, 1000 validation images, and 1000 test images of birds of varying dimension. These images were randomly sampled from the classes "lorikeet" through "goose" in the original ImageNet 2012 training set, and the classes are equally represented. Details about the data preprocessing are in Appendix H.

Architectures and Losses
We consider LeNet-1 and LeNet-5 on MNIST (LeCun et al. 2001), All-CNN-C on CIFAR-10 (Krizhevsky and Hinton 2009; Springenberg et al. 2015), and AlexNet on the subset of ImageNet (Krizhevsky, Sutskever, and Hinton 2012; Krizhevsky 2014). LeNet-1 and LeNet-5 are prominent examples of the first modern versions of ConvNets. Both use convolutional layers and pooling/subsampling layers and achieved state-of-the-art performance on digit classification tasks on MNIST. The ConvNets from Springenberg et al. (2015), including All-CNN-C, only used convolutional layers and were among the best-performing models on CIFAR-10 at the time of publication. Finally, AlexNet changed the landscape of statistical learning by outperforming competing approaches in the ImageNet 2012 competition. All details on the ConvNets and their CKN counterparts can be found in Appendices A-C.
Using the principles outlined in Section 3, we translate each architecture into its CKN counterpart as faithfully as possible. Specific components such as the incomplete connection scheme of the LeNets and the dropout schemes are detailed in Appendix H. Both ConvNets and CKNs are trained end-to-end using a multinomial logistic loss and an additional squared $\ell_2$ regularization. For ConvNets, a small regularization on the filters is used. Their parameters are tuned by hold-out validation; see Appendix H. The batch size is set to the largest power of two that fits on the GPU when training the CKN counterpart; see Table 3 in Appendix H.

Initialization, Training, and Evaluation
The initialization of the ConvNets is performed using random samples from zero-mean Gaussians; see Appendix H for details. For each layer of a CKN, we initialize the $f_\ell$ filters by applying spherical k-means to patches sampled from the previous layer's feature representation (with the initial feature representation being the input image).
For both ConvNets and CKNs, we use the optimization scheme based on a partial minimization of the objective with respect to the classifier parameters, described in Section 4.1, that we call the Ultimate Layer Reversal method. The corresponding step size and regularization are tuned by hold-out validation as described in Appendix H. For CKNs, the gradients of the inverse matrix square root operation are computed using the intertwined method presented in Section 4.2. At each step of the ultimate layer reversal, the classifier parameters are optimized on a random mini-batch. When evaluating the performance of the network, the classifier is reoptimized over the whole dataset. The metrics presented in the following figures, used during hold-out validation, consider an optimal classifier computed over a fixed feature representation using SciPy's L-BFGS (Liu and Nocedal 1989) for 1000 iterations.
The overall training phase is performed for 10,000 iterations for a single random seed. The results we report come from training with 10 different random seeds, with hyperparameters set to those found during the validation stage described in Appendix H. The reported results are from training on the training set rather than the combined training and validation sets.

Gradient Computation
The straightforward way to compute the gradient of a CKN with respect to the weights involves differentiating through matrix singular value decompositions. In Section 4.2 we put forth another approach: the intertwined Newton method. We compare the two approaches for the network with the largest number of layers, the All-CNN-C CKN, both in terms of the training accuracy and the accuracy of the gradients.
Figure 3 shows that a small number of iterations of the intertwined method can be sufficient to ensure an efficient decrease of the misclassification error. Moreover, for 128 filters/layer, the intertwined method can provide an additional gain in generalization performance compared to the SVD. We find that the intertwined Newton method, with 20 Newton iterations, yields relative errors that are 2.5 times smaller than those from differentiating through the SVD, when comparing in terms of gradient accuracy for the All-CNN-C CKN on CIFAR-10 with 8 filters/layer. This supports that differentiating through Newton iterations is more numerically stable than doing so through the SVD. Moreover, Newton iterations allow us to control the accuracy of the gradients and matrix inverse square roots.
Figure 4 shows that, for a given runtime, the intertwined Newton method led to decreases in the training loss of 3%-5% relative to using a singular value decomposition, and decreases in the test misclassification error of up to 2%. Moreover, the intertwined Newton method can be directly written using matrix multiplications, which is ideally suited for GPU computing acceleration. Hence, it provides a flexible means of computing the gradient for a large number of filters.
Compared to the ConvNets, most of the computations in a CKN are the same as the ones of their ConvNet counterparts, such as the convolution operation or the pooling operation. The computation of the inverse square root matrix involves an additional computational cost of the order of $O(t_{\max} f_\ell^3)$ per layer $\ell$, where $t_{\max}$ is the number of iterations used in the intertwined Newton method and $f_\ell$ is the number of filters at layer $\ell$. Implemented in a differentiable programming framework, the intertwined method requires a space complexity of the order of $O(t_{\max} f_\ell^2)$ per layer $\ell$, while the computational cost of the backward pass is of the order of $O(t_{\max} f_\ell^2 p_\ell)$ for $p_\ell$ the number of patches at layer $\ell$.

Training Algorithm
A straightforward optimization algorithm is stochastic gradient optimization. However, while stochastic, this approach dismisses the structure of the problem. There are two different types of parameters: the filters and the classifier parameters. For fixed filters, the classifier can easily be optimized by convex optimization, which suggests using an alternating optimization approach similar to the one of Mairal (2016). One concern with such a full-batch alternating optimization method is that it is not stochastic. The ultimate layer reversal method allows us to get the best of both worlds, between a regular stochastic gradient approach and an alternating optimization one, by obtaining the gradient information from a quadratic approximation of the objective on a random mini-batch. Taking the size of the mini-batch to the full batch size, under appropriate conditions given in Appendix E, one retrieves the training procedure of Mairal (2016). On the other hand, compared to a regular stochastic gradient algorithm, the proposed algorithm can provide substantial speed-ups, as Figure 5 shows. For large-scale problems, PyTorch's DataParallel allows one to use even larger batch sizes if multiple GPUs are available.

Performance at Initialization
A ConvNet is generally initialized with zero-mean Gaussian random filters. Such an approach is aligned with the random features viewpoint on deep networks explained in Section 2. In contrast, a CKN can be initialized by selecting representative filters drawn from the dataset using a k-means algorithm. The bottom two lines in each plot in Figure 6 show the performance of the networks after initialization only, that is, in the unsupervised setting, where only the classifier is optimized for a fixed initial representation. We observe that the unsupervised CKNs generally outperform their ConvNet counterparts. These observations corroborate the benefit of the statistical viewpoint and of the Nyström approximation compared to the straightforward random feature approximation presented in Figure 1.

Performance after Training
Now we turn to the comparison between supervised CKNs and ConvNets. We perform this comparison for LeNet-1 and LeNet-5 on MNIST, All-CNN-C on CIFAR-10, and AlexNet on our subset of ImageNet. The two top lines in each plot in Figure 6 display the results when we vary the number of filters per layer by powers of two, from 8 to 128. See Figure 14 in Appendix H for a zoomed-in version.
Beginning with the LeNets, we see that both the CKNs and ConvNets perform well on MNIST over a wide range of numbers of filters per layer. The CKN outperforms the ConvNet for almost every number of filters per layer. At best, the performance of a CKN is 0.3% better on average and at worst it is approximately the same. The former value is large, given that the accuracies of both the CKNs and the ConvNets exceed 98%. The success of the CKNs continues for All-CNN-C on CIFAR-10. For All-CNN-C the best CKN performance is 11% better on average than that of the ConvNet (in the case of 64 filters/layer) and at worst is 4% worse (in the case of 8 filters/layer). On the other hand, the CKNs do not perform as well as ConvNets for AlexNet on a subset of ImageNet; the performance ranges from 6% to 17% worse.
Compared to the unsupervised performance of the CKNs, we observe that the supervised networks' performance is often significantly higher. For example, for All-CNN-C, the unsupervised networks' accuracy was 44%-59% worse than that of the supervised CKN with the same number of filters. Therefore, the supervised training contributes tremendously to the overall performance of the All-CNN-C CKN. These results also suggest that for more complex tasks the unsupervised CKN may require more than 16 times as many filters to achieve comparable performance to the supervised CKN and ConvNet.

Varying the Kernel
One kernel that can be seen as a generalization of the Gaussian RBF kernel is the Matérn kernel (Rasmussen 2003). The Matérn kernel can offer greater flexibility compared to the Gaussian RBF kernel, owing to the additional degree of freedom provided by the order parameter in the small-bandwidth regime. We explore whether the Matérn kernel can improve the performance of the AlexNet CKN. In Figure 7 the CKN with the Matérn kernel performs similarly to the CKN with the Gaussian RBF kernel for the AlexNet architecture. In particular, the CKN with the Matérn kernel does between 6% worse (in the case of 64 filters/layer) and 9% better (in the case of 8 filters/layer). These results suggest that the CKNs are rather robust to the choice of the kernel, while the width of each layer may have a more important impact.

Qualitative Comparisons
We present in Figure 8 the filters obtained by a ConvNet and a CKN at different layers of the All-CNN-C architecture trained on CIFAR-10. The represented filters were sampled from a determinantal point process to get a diverse set of filters, as explained in Appendix H. The filters of the CKN present sharper contrasts in the initial layers, while the difference fades as we look into further layers.

Conclusion
We presented a systematic study of the correspondence between a ConvNet and its CKN counterpart. The CKNs we studied often achieved comparable performance to their ConvNet counterparts. This shows that the CKN approach can be fruitful. The extension to data with a geometrical structure (Bronstein et al. 2017; Bietti and Mairal 2019) and the model selection of network architectures are interesting avenues for future work.

Figure 1. Average performance of three kernel approximation methods across 10 trials when varying the dimension of the approximation. Error bars or bands in plots show one standard deviation from the mean across 10 random seeds.

Figure 2. Example CKN architecture. Each parallelogram is a two-dimensional representation of a feature representation at a given layer. The block sizes are the sizes of the filters at each layer. The height of the blocks at the input layer is three to represent the input having three channels. At every subsequent layer the feature representation has more dimensions than shown, which is indicated by the dots above the boxes. The arrows show how one block gets transformed into the block at the next layer. The numbers on the sides of the parallelograms indicate the sizes of the feature maps at each layer.

Figure 3. Test misclassification error over time when training All-CNN-C CKNs, either using an SVD to compute the matrix inverse square root or using the intertwined Newton method with a varying number of outer Newton iterations. Each plot shows the first 1000 training iterations. To highlight the differences, we omit the initial portions of the curves.

Figure 4. Relative training loss (left) and relative test misclassification error (right) over time when training the All-CNN-C CKN with 128 filters/layer and varying the number of outer Newton iterations in the intertwined Newton method. The reported values are relative to those obtained when using the SVD to compute the matrix inverse square root.

Figure 6. Average performance of ConvNets and CKN counterparts across 10 trials when varying the number of filters per layer. Note that the y-axis scales differ across plots.

Figure 7. Average performance of ConvNets and CKN counterparts across 10 trials when training versions of AlexNet on a subset of ImageNet and varying the number of filters per layer.

Figure 8. Examples of learned filters of the All-CNN-C CKN and ConvNet, respectively.
Table 1 lists correspondences between ConvNets and CKNs; the values in parentheses after the arc-cosine kernels indicate their order, and * denotes an inexact correspondence. Details on the aforementioned classical ConvNets and their CKN counterparts are in Appendices A-C.