Select to Better Learn: Fast and Accurate Deep Learning Using Data Selection From Nonlinear Manifolds



Abstract
Finding a small subset of data whose linear combination spans the remaining data points, also called the column subset selection problem (CSSP), is an important open problem in computer science with many applications in computer vision and deep learning, such as the ones shown in Fig. 1. Several studies solve CSSP with polynomial time complexity w.r.t. the size of the original dataset. We propose a simple and efficient selection algorithm with linear complexity, referred to as spectrum pursuit (SP), that pursues the spectral components of the dataset using the available sample points. The proposed non-greedy algorithm aims to iteratively find K data samples whose span is close to that of the first K spectral components of the entire data. SP has no parameter to be fine-tuned, and this desirable property makes it problem-independent. The simplicity of SP enables us to extend the underlying linear model to more complex models such as nonlinear manifolds and graph-based models. The nonlinear extension of SP is introduced as kernel-SP (KSP). The superiority of the proposed algorithms is demonstrated in a wide range of applications.

Introduction
Processing M data samples, each including N features, is not feasible for most systems when M is very large. Therefore, it is crucial to select a small subset of K ≪ M samples from the entire set such that the selected data capture the underlying properties or structure of the entire data. This way, complex systems such as deep learning (DL) networks can operate on the informative selected data rather than the redundant entire data. Randomly selecting K out of M samples, while computationally simple, is inefficient in many cases, since non-informative or redundant instances may be among the selected ones. On the other hand, the optimal selection of data for a specific task involves solving an NP-hard problem [2]. For example, finding an optimal subset of data for training a DL network with the best performance requires (M choose K) trials, which is intractable. It is essential to define a versatile objective function and to develop a method that efficiently selects the K samples that optimize it. Let us assume the M data samples are organized as the columns of a matrix A ∈ R^{N×M}. The following is a general-purpose cost function for subset selection, known as the column subset selection problem (CSSP) [3]:

argmin_{S, |S|=K} ‖A − π_S(A)‖²_F,    (1)

where π_S is the linear projection operator onto the span of the K columns of A indexed by the set S.

[Figure 2: (a) A set of 20 face images from [1]. (b) The images in (a) are represented as blue dots; the three most significant eigenfaces are shown as green dots. These eigenfaces are not among the data samples. Here, we are interested in selecting the best 3 out of the 20 real images whose span is closest to the span of the 3 eigenfaces; there are (20 choose 3) possible combinations from which the best subset must be selected. The proposed SP algorithm selects K samples such that their span pursues the span of the first K singular vectors. (c) Using the proposed linear selection algorithm (SP), a tractable algorithm is developed for selecting from low-dimensional manifolds: first, a kernel defined by neighborhoods transforms the data on a manifold to a latent space; then, the linear selection is performed.]

CSSP is an open problem that has been shown to be NP-hard [4,2]. Moreover, the cost function is not sub-modular [5], and greedy algorithms are not efficient for tackling Problem (1). Over the last 30 years, computer scientists and mathematicians have proposed many tractable selection algorithms that guarantee an upper bound on the projection error ‖A − π_S(A)‖²_F. These works include algorithms based on QR decomposition of A with column pivoting (QRCP) [6,7,8], methods based on volume sampling (VS) [9,10,11], and matrix subset selection algorithms [3,12,13]. However, the guaranteed upper bounds are very loose, and the corresponding selections are far from the actual minimizer of CSSP in practice. Interested readers are referred to [14,12] and Sec. 2.1 in [15] for detailed discussions. For example, VS guarantees that the projection error on the span of the K selected samples is less than K + 1 times the projection error on the span of the first K left singular vectors (which is too loose for large K). Recently, it was shown that VS performs even worse than random selection in some scenarios [16]. Some efforts have also been made using convex relaxation and regularization; however, fine-tuning these methods is not straightforward, and their cubic complexity is an obstacle to employing them in diverse applications.
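As a concrete illustration of the cost in (1), the projection error for a candidate subset S can be computed from an orthonormal basis of the selected columns. The brute-force search below is a minimal numpy sketch on synthetic data (our own illustration, not code from the paper); it makes plain why exhaustive minimization is only feasible for tiny M and K.

```python
import itertools
import numpy as np

def projection_error(A, S):
    """Squared Frobenius error of projecting A onto the span of its columns in S."""
    Q, _ = np.linalg.qr(A[:, list(S)])       # orthonormal basis for span(A_S)
    return np.linalg.norm(A - Q @ (Q.T @ A), "fro") ** 2

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 20))             # N = 8 features, M = 20 samples
K = 3

# Exhaustive search over all (20 choose 3) = 1140 subsets -- intractable for large M.
best = min(itertools.combinations(range(A.shape[1]), K),
           key=lambda S: projection_error(A, S))
print(best, projection_error(A, best))
```

SP aims to reach a subset close to this exhaustive minimizer while touching each column only a handful of times.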
Recently, a low-complexity approach to CSSP was proposed, referred to as iterative projection and matching (IPM) [17]. IPM is a greedy algorithm that selects K consecutive, locally optimal samples, without the option of revisiting previous selections and escaping local optima. Moreover, IPM samples the data from linear subspaces, while in general data points reside in a union of nonlinear manifolds.
In this paper, an efficient non-greedy algorithm is proposed to solve Problem (1) with linear complexity. The proposed subspace-based algorithm outperforms the state-of-the-art algorithms in terms of accuracy for CSSP. In addition, the simplicity and accuracy of the proposed algorithm enable us to extend it for efficient sampling from nonlinear manifolds. The intuition behind our work is depicted in Fig. 2. Assume that, for solving CSSP, we are not restricted to selecting representatives from the data samples and are allowed to generate pseudo-data as representatives. In this scenario, the best K representatives are the first K spectral components of the data, by the definition of the singular value decomposition (SVD) [18]. However, the spectral components are not among the data samples. Our proposed algorithm aims to find K data samples whose span is close to that of the first K spectral components of the data. We refer to our proposed algorithm as spectrum pursuit (SP). Fig. 2 (b) shows the intuition behind SP, and Fig. 2 (c) shows a straightforward extension of SP for sampling from nonlinear manifolds. We refer to this algorithm as kernel spectrum pursuit (KSP).
Our main contributions can be summarized as: • We introduce SP, a non-greedy selection algorithm with linear complexity w.r.t. the number of original data points. SP captures the spectral characteristics of a dataset using only a small number of samples. To the best of our knowledge, SP is the most accurate solver for CSSP.
• Further, we extend SP to kernel-SP (KSP) for manifold-based data selection.
• We provide extensive evaluations to validate our proposed selection schemes. In particular, we evaluate the proposed algorithms on training generative adversarial networks, graph-based label propagation, few-shot classification, and open-set identification, as shown in Fig. 1. We demonstrate that our proposed algorithms outperform the state-of-the-art.

Data Selection from Linear Subspaces
In this section, we first review related work on matrix subset selection and then propose our algorithm for CSSP.

Related Work
A simple approach to selection is to reduce the entire dataset and evaluate a criterion only on the reduced set, A_S. Mathematically speaking, we need to solve the following problem [19,10]:

argmax_{S, |S|=K} φ(A_S^T A_S),    (2)

where φ(·) is a function of the matrix eigenvalues, such as the determinant or the trace. This is an NP-hard, non-convex problem that can be solved via convex relaxation of the ℓ0 norm with time complexity O(M³) [20,19]. There are several other efforts in this area on designing the function φ [10,21,22,23]. Inspired by D-optimal design, VS [11] assigns each subset of data a selection probability proportional to the determinant (volume) of the reduced matrix [10,24,25]. To the best of our knowledge, the tightest bound for selecting K columns in CSSP, introduced in a paper published in NeurIPS 2019 [26], guarantees that the projection error ‖A − π_S(A)‖²_F is within a multiplicative factor of ‖A − A_K‖²_F, where A_K is the best rank-K approximation of A. Moreover, VS guarantees a projection error up to K + 1 times worse than that of the first K singular vectors [11]. A set of diverse samples optimizes cost function (2), and algorithms such as VS assign such sets a higher probability of being chosen. However, selecting diverse samples that are merely different from each other does not necessarily provide good representatives for the (un-selected) data.
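For instance, taking φ(·) to be the log-determinant of the reduced Gram matrix gives a D-optimal-style criterion of the kind discussed above. The snippet below is our own sketch of evaluating that criterion for one candidate subset, not a specific published implementation.

```python
import numpy as np

def logdet_criterion(A, S):
    """phi = log det of the reduced Gram matrix A_S^T A_S (D-optimal design flavor).
    Larger values correspond to more 'voluminous', i.e., diverse, subsets."""
    A_S = A[:, list(S)]
    sign, logdet = np.linalg.slogdet(A_S.T @ A_S)
    return logdet if sign > 0 else -np.inf   # degenerate subsets score -inf
```

Volume sampling turns this determinant into a sampling probability rather than a hard maximization, which is where its K + 1 guarantee comes from.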
Ensuring that the selected samples can reconstruct the un-selected samples is a more robust approach than selecting a diverse subset. The exact solution of Problem (1) aims to find such a subset. An equivalent formulation of Problem (1) is proposed in [27]. Their formulation exploits the mixed ℓ_{2,0} norm, which is not convex, and they employ ℓ1 regularization to relax it [27]. There is no guarantee that convex relaxation provides the best approximation for an NP-hard problem. Furthermore, such convex-programming methods are usually computationally intensive for large datasets [27,28,29,30]. In this paper, we present another reformulation of Problem (1) and propose a fast and accurate algorithm for addressing CSSP.

Spectrum Pursuit (SP)
Projection of all data onto the subspace spanned by K columns of A, indexed by S, i.e., π_S(A), can be expressed by a rank-K factorization U V^T, in which U ∈ R^{N×K}, V ∈ R^{M×K}, and U is a set of K normalized columns of A indexed by S. Therefore, the optimization problem (1) can be restated as [17]:

argmin_{U, V} ‖A − U V^T‖²_F   s.t.  u_k ∈ Ã,  k = 1, …, K,    (3)

where Ã = {ã_1, ã_2, …, ã_M}, ã_m = a_m / ‖a_m‖_2, and u_k is the k-th column of U. Note that U is restricted to be a collection of K normalized columns of A, while there is no constraint on V. As mentioned before, this is an NP-hard problem. Recently, IPM [17], a fast, sub-optimal, and greedy approach to tackling (3), was proposed. In IPM, samples are iteratively selected in a greedy manner until K samples are collected. In this paper, we propose a new selection algorithm, referred to as spectrum pursuit (SP), which finds a more accurate solution to Problem (3) than IPM does. The time complexity of both IPM and SP is linear with respect to the number of samples and the dimension of the samples, which is desirable for selection from very large datasets. Our proposed SP algorithm facilitates revising the selection in each iteration and escaping from local optima. In SP, we split (3) into two sub-problems. The first is built on the assumption that we have already selected K − 1 data points and the goal is to select the next best one. However, it relaxes the constraint u_k ∈ Ã in (3) to the milder constraint ‖u_k‖_2 = 1. This relaxation makes the solution tractable at the expense of a solution that may not belong to our data points. To fix this, we introduce a second sub-problem that re-imposes the underlying constraint and selects the data point with the highest correlation to the point found by the first sub-problem. These sub-problems are formulated as

(u_k, v_k) = argmin_{u, v : ‖u‖_2 = 1} ‖E_k − u v^T‖²_F,  where E_k = A − U_{\k} V_{\k}^T,    (4a)

S_k = argmax_m |ẽ_m^T u_k|.    (4b)

Here S_k is a singleton that contains the index of the selected data point, and the matrices U_{\k} and V_{\k} are obtained by removing the k-th column of U and V,
respectively. Sub-problem (4a) is equivalent to finding the first left singular vector (LSV) of E_k, and ẽ_m is the normalized replica of the m-th column of the residual matrix E_k. The set of normalized residuals is denoted by Ẽ_k. The constraint ‖u‖_2 = 1 keeps u on the unit sphere to remove the scale ambiguity between u and v. Moreover, the unit sphere is a superset of Ã and keeps the modified problem close to the original one [32,33]. The stopping criterion can be convergence of the set S or reaching a pre-defined maximum number of iterations. The convergence behavior of SP is studied in the supplementary document. The simplicity and accuracy of SP facilitate its extension to nonlinear manifold sampling with a wide range of applications. We refer to this extended version as kernel-SP (KSP), discussed next in Section 3.
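The alternating scheme described above can be sketched in a few lines of numpy. This is an illustrative reimplementation (random initialization, residual computation, first LSV, correlation-based reselection), not the authors' reference code; correlating against the normalized residual columns follows our reading of (4b).

```python
import numpy as np

def spectrum_pursuit(A, K, n_iter=30, seed=0):
    """Sketch of SP: cyclically refine one selected index at a time.
    E_k is the residual of projecting A off the span of the other K-1 selections;
    sub-problem (4a) takes its first left singular vector, and (4b) swaps in the
    data point whose normalized residual correlates most with that vector."""
    N, M = A.shape
    rng = np.random.default_rng(seed)
    S = list(rng.choice(M, size=K, replace=False))   # random initialization
    for it in range(n_iter):
        k = it % K                                   # k = mod(iter, K) + 1, zero-based
        rest = S[:k] + S[k + 1:]
        Q, _ = np.linalg.qr(A[:, rest])              # basis of the other selections
        E = A - Q @ (Q.T @ A)                        # residual matrix E_k
        u = np.linalg.svd(E, full_matrices=False)[0][:, 0]   # first LSV, (4a)
        norms = np.linalg.norm(E, axis=0)
        corr = np.abs(E.T @ u) / np.maximum(norms, 1e-12)    # |e~_m^T u_k|
        corr[norms < 1e-10] = -np.inf                # skip fully-explained columns
        S[k] = int(np.argmax(corr))                  # reselection, (4b)
    return S
```

Each sweep costs one thin QR and one truncated SVD, so the overall cost stays linear in M and N, matching the complexity claim in the text.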

Kernel SP: Selection based on a Locally Linear Model
The goal of CSSP introduced in (1) is to select a subset of data whose linear subspace spans all data. Obviously, this model is not suitable for general data types, which mostly lie on nonlinear manifolds. Accordingly, we generalize (1) and propose the following selection problem in order to efficiently sample from a union of manifolds:

argmin_{S, |S|=K} Σ_{m=1}^{M} ‖a_m − π_{S ∩ Ω_m}(a_m)‖²_2,    (5)

where Ω_m represents the indices of the local neighbors of a_m based on an assumed distance metric. This problem reduces to CSSP in Problem (1) if each Ω_m is assumed to contain all indices.

[Algorithm 1: Spectrum Pursuit. Recoverable steps from this extraction include "k = mod(iter, K) + 1" and "u_k = first left singular vector of E_k, obtained by solving (4a)".]

In contrast to (1), the cost in (5) is written for each column of A separately in order to incorporate a neighborhood for each data point. This facilitates fitting a locally linear subspace to each data sample in terms of its neighbors. Nonlinear techniques demonstrate significant improvements over linear methods in many scenarios [34,35,36].
Similar to Section 2, where we introduced SP as a low-complexity algorithm to tackle the NP-hard Problem (1), here we propose an extension of SP, referred to as kernel SP (KSP), to tackle the combinatorial search in Problem (5). Manifold-based dimension reduction techniques and clustering algorithms do not provide prototypes suitable for data selection. However, inspired by spectral clustering of manifolds [37], a main tool for nonlinear data analysis that partitions data into nonlinear clusters based on the spectral components of the corresponding normalized similarity matrix, we formulate KSP as

argmin_{S, |S|=K} ‖L − π_S(L)‖²_F,   where L = D^{−1/2} S D^{−1/2},    (6)

S ∈ R^{M×M} is the similarity matrix of the data, and D is a diagonal matrix with d_ii = Σ_{j≠i} s_ij. The similarity matrix can be defined based on any similarity measure; a typical choice is a Gaussian kernel with parameter α. Note that Problem (6) is the same as Problem (1) with A replaced by L. The steps of the KSP algorithm are summarized in Algorithm 2.
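Assuming a Gaussian similarity and the degree normalization in (6), constructing L takes only a few lines. This is our own sketch of the kernel step, with α as the bandwidth parameter of the Gaussian kernel; whether the diagonal of S is kept or zeroed is a modeling choice, and we zero it here to match d_ii = Σ_{j≠i} s_ij.

```python
import numpy as np

def ksp_kernel(A, alpha):
    """Build L = D^{-1/2} S D^{-1/2} from Gaussian similarities
    s_ij = exp(-||a_i - a_j||^2 / alpha), with d_ii = sum_{j != i} s_ij."""
    d2 = np.sum((A[:, :, None] - A[:, None, :]) ** 2, axis=0)  # pairwise sq. distances
    S = np.exp(-d2 / alpha)
    np.fill_diagonal(S, 0.0)                                   # exclude j = i from degrees
    d_inv_sqrt = 1.0 / np.sqrt(S.sum(axis=1))
    return (d_inv_sqrt[:, None] * S) * d_inv_sqrt[None, :]
```

KSP then simply runs SP on L in place of A, per Problem (6).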

Algorithm 2 Kernel Spectrum Pursuit
Require: A, α, and K. Output: the selected set S.
[Algorithm steps lost in this extraction; per the text, KSP forms the similarity matrix S from A (e.g., a Gaussian kernel with parameter α), computes L = D^{−1/2} S D^{−1/2}, and runs SP on L.]

Empirical Results and Some Applications of SP/KSP
To evaluate the performance of our proposed selection algorithms, we consider several applications and conduct extensive experiments. The selected applications are: (i) fast GAN training using a reduced dataset; (ii) semi-supervised learning on graph-based datasets; (iii) large-graph summarization; (iv) few-shot learning; and (v) open-set identification.

Training GAN
There have been many efforts [38,39] to employ manifold properties to stabilize the GAN training process and improve the quality of generated samples, but none of them benefit from smart sample selection to expedite training, as suggested in the first column of Fig. 1. Here, we present our experimental results on the CMU Multi-PIE Face Database [40] for representative selection. We use 249 subjects with various poses, illuminations, and expressions; there are 520 images per subject. Fig. 4 (top) depicts 10 selected images of a subject for different selection methods: SP (our proposed method) is compared with three well-known selection algorithms, DS3 [41], VS [33], and K-medoids [42]. As can be seen, SP selects from more diverse angles. Fig. 4 (bottom) compares the performance of different state-of-the-art selection algorithms in terms of the normalized projection error of CSSP, defined as the cost function in (1). As shown, SP outperforms all other methods. There is also a considerable performance gap between SP and IPM [17], the second-best algorithm.
In this application, we use the selected samples to train a generative adversarial network (GAN) to generate multi-view images from a single-view input. For this, the GAN architecture in [45] is employed. The experimental setup and implementation details from [45] are used, where the first 200 subjects are used for training and the rest for testing. We select only 9 images from each subject and train the network with the selected images for 300 epochs using a batch size of 36. Table 1 shows the normalized ℓ2 distances between features of the real and generated images, referred to as identity dissimilarities, averaged over all images in the testing set. Features are extracted using a ResNet18 trained on the MS-Celeb-1M dataset [46,47]. As can be seen, SP and KSP outperform the other selection methods. Moreover, KSP performs better than SP due to its selection from a nonlinear manifold.
The test set contains multi-view images from 50 subjects not seen during training. In the test phase, a single view of each of these 50 subjects is given, and we are to generate the other views. Note that in this application we do have ground-truth images for all views, so any similarity measure can be applied. The evaluation is performed identically to that of [45].
Table 1: Identity dissimilarities between real and GAN-generated images for different selection methods. For each method, the GAN is trained on the selected data points.

SMRS   S5C    FFS    DS3    K-Med  VS     IPM    SP     KSP
0.631  0.617  0.608  0.602  0.599  0.583  0.553  0.550  0.546

Trained GAN using all data: 0.5364

Graph-based Semi-supervised Learning
To evaluate the performance of our proposed selection algorithm in more complicated scenarios, we consider the graph convolutional network (GCN) proposed in [48], which serves as a semi-supervised classifier on graph-based datasets. A GCN takes a feature matrix and an adjacency matrix as inputs and, for every vertex of the graph, produces a vector whose elements correspond to the scores of belonging to the different classes. The semi-supervised task here considers the case where only a selected subset of nodes is labeled in the training set, and the loss is computed on the output vectors of these labeled nodes to perform back-propagation. Moreover, we inherit the same two-layer network architecture from [48]. To be more specific, an identity matrix is added to the original adjacency matrix so that every node is assigned a self-connection. Further, we normalize the sum of the two matrices using the kernel discussed in lines 2 and 3 of Algorithm 2, with the adjacency matrix serving as the similarity matrix S.
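The renormalization just described (self-loops plus symmetric degree normalization, as in [48]) can be sketched as:

```python
import numpy as np

def gcn_normalize(adj):
    """A_hat = A + I, then D^{-1/2} A_hat D^{-1/2}, where D is the degree
    matrix of A_hat. This is the standard GCN renormalization trick."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return (d_inv_sqrt[:, None] * a_hat) * d_inv_sqrt[None, :]
```

In our pipeline, this normalized matrix is both the GCN propagation operator and the matrix on which KSP picks the vertices to label.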
Our proposed KSP algorithm, together with other baselines, is tested on the Cora dataset, a real citation network with 2,708 nodes and 5,429 edges, as well as on a random cluster-based graph with 200 nodes from 10 clusters, where every node independently belongs to one of the 10 clusters with equal probability and the cluster of a node serves as its label during classification. Instead of constructing a completely connected graph, the existence of edges between node pairs is determined according to a density matrix, in which the density of edges connecting nodes in the same cluster is uniformly generated from the interval [0.2, 0.6], whereas the inter-cluster densities are sampled uniformly from the interval [0, 0.2]. The network is trained in a semi-supervised fashion, i.e., it is fed with the feature and adjacency matrices of the entire graph while the loss is only computed on the labeled vertices. Here, the labeled vertices are the subset selected by applying our proposed algorithm (KSP) to the normalized adjacency matrix. We train on both datasets for a maximum of 100 epochs using Adam [49] with a learning rate of 0.01 and early stopping with a window size of 10, i.e., we stop training if the validation loss does not decrease for 10 consecutive epochs. The results are summarized in Figure 5.
Due to the inherent randomness of training neural networks with gradient-based optimizers, some ripples appear in the curves. However, it is clear that, as expected, the test accuracy tends to increase as more labeled points are used for training. Further, as can be seen from the figure, our proposed KSP algorithm significantly outperforms the other algorithms over almost the whole range of selected points. This implies the superior performance of KSP in selecting the subset of data that comprises the most representative points of the clusters. Lastly, because of outliers in the random graph, the accuracy of the proposed algorithm continues to improve slowly at about 70%, whereas the other competitors saturate at about 60%. Note that the model is trained with only 10% of the data, so this also implicitly suggests that our algorithm successfully picks out the most informative nodes.

Graph Summarization
Clusters (also known as communities) in a graph are groups of vertices that share common properties. Identifying communities is a crucial task in graph-based systems; instances include protein-protein interaction networks in biology [61], recommendation systems in computer science [62], social media networks, etc. In the following, we design an experiment to find the vertices with a central position in several types of graphs, both synthetic and from real datasets such as [55], which contains the aggregated network of some users' Facebook friends. In that dataset, vertices represent individuals on Facebook, and an edge between two users means they are Facebook friends.
Various community-detection-based algorithms, such as betweenness centrality (BC), have been proposed to measure the importance of a user in the network [54] by considering how many shortest paths pass through that user (vertex) when connecting each pair of other users (vertices). The more shortest paths pass through a user, the more central that user is in the social network. Now, assuming that a graph G or a similarity matrix is given, the aim is to first apply our method to the graph to approximate it with a subset of the vertices, and then exploit shortest-path distances to evaluate accuracy. We report the following performance measure: instead of computing the average shortest path between each vertex of the graph and all other vertices, which is very expensive (requiring Dijkstra's algorithm n² times, where n is the number of vertices), we compute the average shortest path between all vertices and the vertices selected by KSP. The latter can be computed using Dijkstra's algorithm only kn times, where k is the number of selected vertices.
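The evaluation just described can be sketched as follows, reading "average shortest path between all the vertices and the selected vertices" as the distance from each vertex to its nearest selected vertex (one plausible interpretation; one Dijkstra run per selected vertex).

```python
import heapq

def dijkstra(graph, src):
    """graph: node -> {neighbor: edge weight}. Shortest distances from src."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def avg_dist_to_selected(graph, selected):
    """k Dijkstra runs instead of n: average, over all vertices, of the
    distance to the closest selected vertex."""
    best = {v: float("inf") for v in graph}
    for s in selected:
        for v, d in dijkstra(graph, s).items():
            best[v] = min(best[v], d)
    return sum(best.values()) / len(best)
```

A lower average indicates the selected vertices sit more centrally in the graph.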
Further, in this experiment we compare KSP against several state-of-the-art algorithms for data selection and coreset construction, where a coreset is a small (weighted) subset of the data that approximates the full dataset. The results are shown in Table 2, where 10 vertices are selected from each graph (except for the Karate Club graph sketched in Fig. 6, from which we select 2 vertices) by the different selection algorithms. As can be seen, our proposed method provides significant improvements in shortest-path error over the state-of-the-art.

Few Shot Learning
Training on Sampled Pairs: Next, we evaluate the performance of SP on more common data such as images and features. This analysis is motivated by the work in [63]; we employ their proposed Siamese neural network architecture. We adopt the Omniglot dataset and split it into training, validation, and test subsets, each consisting of entirely different classes. For training and validation, two images are randomly sampled from the corresponding subset and fed as input to the Siamese network, and a binary label is assigned to each pair according to the classes from which the images are sampled. The network trained on these pairs achieves over 90% accuracy in distinguishing inter-class and intra-class pairs.
Classification with Few-Shot Learning: After being fully trained on the sampled pairs, the model is used for few-shot classification. In other words, if the model can reliably distinguish the classes to which pairs belong, then, given a few representatives of a specific class, the trained Siamese network can serve as a binary classifier that verifies whether a test instance belongs to that class. The problem therefore reduces to selecting the best representatives of every class to be paired with each test image. The class producing the pairings with the highest average score is then returned as the classification result. The test set of Omniglot, after the partitioning discussed above, comprises 352 classes, each composed of 20 images. For each of the 352 classes, we choose the most informative subset of the 20 images by applying our selection algorithm to the flattened features extracted from the last convolutional layer of the network fed with those 20 images. The classifier built from the Siamese network and the 352 selected representative groups is then evaluated on all 7,000+ images in the test set. Figure 7 shows the few-shot learning results when 2, 3, 4, and 5 images are selected out of the 20, together with an example of selected groups in 2-shot learning.
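The scoring rule just described (pair the test image with each class's selected representatives, average the verification scores, take the argmax) can be sketched as follows, with score_fn standing in for the trained Siamese network (a hypothetical placeholder).

```python
import numpy as np

def few_shot_classify(score_fn, test_image, representatives):
    """representatives: class label -> list of images chosen by the selection
    algorithm. Returns the class with the highest average pairwise score."""
    avg_scores = {label: float(np.mean([score_fn(test_image, r) for r in reps]))
                  for label, reps in representatives.items()}
    return max(avg_scores, key=avg_scores.get)
```

Because only K representatives per class are stored and scored, both memory and the number of Siamese forward passes shrink by a factor of 20/K relative to keeping all images.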
It can be observed in Figure 7 that the images selected by the evaluated algorithms are generally more standard and more identifiable than the others. Among the competing algorithms, KSP makes the best selection for this character. Since the classification accuracy is evaluated on the 352 test classes, which do not appear in the training set, around 60% correct classification is quite acceptable. In particular, SP achieves accuracies of 59.84%, 62.70%, 63.55%, and 64.89% for 2-shot, 3-shot, 4-shot, and 5-shot classification, respectively, which is comparable to the GIGA results of 60.21%, 62.36%, 63.42%, and 65.21%, while outperforming the other baseline algorithms. Note that SP has lower memory requirements and computational complexity than its peers.

Open-Set Identification
In this experiment, the open-set identification problem is addressed using the proposed selection method, which results in significant accuracy improvement over the state-of-the-art. In open-set identification, test data of a classification problem may come from unknown classes other than those seen during training, and the goal is to identify such samples as belonging to the open set rather than to the known labeled classes [64]. Interested readers are referred to [65,66,67,68,69] for state-of-the-art approaches to the open-set problem.
Employing the entire closed-set data during training leads to the inclusion of untrustworthy closed-set samples. Regularized or underfitting models (such as low-rank representations [70,71,72]) still suffer from memorizing such samples, which degrades the separation between the open and closed sets by adding ambiguity to the decision boundary between the closed-set and open-set classes. To resolve this issue, we utilize our proposed selection method, KSP, which selects the core representatives. The selected representatives are therefore more robust for rejecting open-set test samples, which do not fit the core representatives well. We illustrate the proposed scheme for open-set identification in the rightmost panel of Fig. 1; the proposed algorithm is referred to as the selection-based open-set identification scheme (SOSIS) hereafter (Algorithm 3).
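The rejection idea behind SOSIS can be sketched as a normalized projection error onto the span of the selected representatives. This is our own illustration of the projection step; the rejection threshold would be tuned on validation data.

```python
import numpy as np

def openset_score(x, reps):
    """reps: columns are the selected closed-set representatives (features).
    Returns ||x - proj_span(reps)(x)|| / ||x||; a large residual suggests
    the test sample belongs to the open set."""
    Q, _ = np.linalg.qr(reps)            # orthonormal basis for span(reps)
    residual = x - Q @ (Q.T @ x)
    return float(np.linalg.norm(residual) / np.linalg.norm(x))
```

With fewer, core representatives the absolute errors grow, but the gap between closed-set and open-set scores widens, which is the splitting behavior reported in Fig. 8.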
Experiment Set-up: We use the MNIST dataset as the closed set with samples from Omniglot as the open set. The ratio of Omniglot to MNIST test data is set to 1:1 (10,000 from each), the same as the simulation scenario in [67]. A classifier with the ResNet-164 architecture [73] is trained on MNIST for step 1 of Alg. 3. Macro-averaged F1-scores [74] for SOSIS with different selection methods and different numbers of samples are listed in Table 3, together with the state-of-the-art results from [67]. The best F1-score, 0.964, is achieved by SOSIS with KSP selection using 50 representatives. The second-best performance is achieved by SOSIS with SP selection, again using 50 representatives. Performance degrades both when too few representatives are chosen (5 or fewer) and when all the data are used.
The gap between the error values resulting from projecting the open- and closed-set test samples onto the selected samples, computed in step 4 of Alg. 3, differs significantly from the gap obtained by projecting onto the entire dataset (due to overfitting and the memorization effect). We call this the splitting property, as reflected in Fig. 8 (a) (entire dataset) vs. Fig. 8 (b) (selected samples) at the testing phase. For better visualization, projection errors are sorted separately for the closed-set and open-set data. As observed, a smaller number of representatives results in higher projection error; however, at the same time, the closed-set and open-set test data are better split, as also observed in Fig. 8.

Conclusion
A novel approach to data selection from linear subspaces is proposed, and its extension for selection from nonlinear manifolds is presented. The proposed SP algorithm provides an accurate solution for CSSP.

[Table 3 excerpt: Supervised only [67]: 0.680; LadderNet [67]: 0.764; DHRNet [67]: 0.793.]

Moreover, SP and KSP have shown superior performance in many applications. The investigated fast and efficient deep learning frameworks, empowered by our selection methods, show that working with selected representatives is not only fast but can also be more effective. This manuscript is devoted mostly to algorithm design and applications of data selection. Theoretical results and further supporting experiments can be found in the supplementary document.

Acknowledgements
This research is based upon work supported in part by the National Science Foundation under Grants No. 1741431 and CCF-1718195 and by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. D17PC00345. The authors appreciate valuable comments from Alireza Zaeemzadeh, Marziyeh Edraki, and Mehrdad Salimitari. The views, findings, opinions, and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the NSF, ODNI, IARPA, or the U.S. Government.

Supplementary Document
The supplementary material in this document is organized as follows. In Section 1, we present a theoretical result on the equivalence of locally linear selection and linear selection after applying a kernel. In Section 2, further experiments investigate the performance of the proposed approaches on several real datasets.

Theoretical Results
The following lemma shows how the locally linear selection problem introduced in the original paper can be turned into plain linear selection on the kernelized version of the data.
Lemma 1. Consider M data points, where the m-th data point and its neighborhood are denoted by a_m and Ω_m, respectively. The following problems have the same selection results using the SP algorithm:
P1 (locally linear selection) and P2 (linear selection on the kernelized data).

Proof of Lemma 1: Matrix X_m ∈ R^{M×N} is defined as an all-zero matrix except in the rows indexed by Ω_m; each nonzero row equals a_m^T. Matrix X ∈ R^{MN×M} is formed by taking vec(X_m) as its m-th column, where the operator vec(·) reshapes a matrix into a vector. Using this definition of X, Problem P1 can be cast in terms of X. Note that the neighborhood information has been infused into the matrix X, so the explicit neighborhood constraints of P1 are removed. In other words, each data point is approximated only by its neighbors, as intended by P1: non-neighbor samples have no impact on the least-squares cost function since x_i^T x_j = 0 for every non-neighbor pair (i, j). Thus, P1 can be re-written in this unconstrained form. Given the singular value decomposition X = UΣV^T, where U and V are orthogonal matrices and Σ is the diagonal matrix of singular values, we have X^T X = VΣU^T UΣV^T = VΣ²V^T. Thus, the k-th eigenvector of X^T X is v_k, the k-th column of V. Moreover, X^T u_k = VΣU^T u_k = σ_k v_k, where the last equality follows from the orthogonality of U. Therefore, X^T u_k is a scaled version of v_k, the k-th eigenvector of X^T X.
As the next step of the proof, we show that the data index m_1 that maximizes |x_m^T u_k| also maximizes |h_m^T X^T u_k|, where h_m is the m-th column of H = X^T X. This can be shown as follows. Let m_1 = arg max_m |x_m^T u_k| and m_2 = arg max_m |h_m^T X^T u_k|. Since X^T u_k = σ_k v_k, the quantity x_m^T u_k is the m-th entry of σ_k v_k. Likewise, h_m^T X^T u_k = x_m^T X X^T u_k = x_m^T UΣ²U^T u_k = σ_k² x_m^T u_k = σ_k³ [v_k]_m. This means both optimizations (8) and (9) find the index of the element of v_k with the largest absolute value, so m_1 = m_2. Therefore, selection with SP results in the same selection as solving P2 in place of P1.
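The two identities used in the proof can be checked numerically. The sketch below uses a random Gaussian matrix as a stand-in for the stacked matrix X (the dimensions are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 8))          # stand-in for the stacked matrix X
U, s, Vt = np.linalg.svd(X, full_matrices=False)
H = X.T @ X                               # similarity matrix H = X^T X

k = 0
u_k, v_k = U[:, k], Vt[k]

# Identity 1: X^T u_k is a scaled version of v_k (scale sigma_k)
assert np.allclose(X.T @ u_k, s[k] * v_k)

# Identity 2: both criteria pick the largest-magnitude entry of v_k
m1 = np.argmax(np.abs(X.T @ u_k))          # arg max_m |x_m^T u_k|
m2 = np.argmax(np.abs(H.T @ (X.T @ u_k)))  # arg max_m |h_m^T X^T u_k|
assert m1 == m2 == np.argmax(np.abs(v_k))
```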
The SP algorithm performs an iterative selection. In each iteration, selection is performed on the residual of the data after projection on the null space of the previously selected samples. Thus, in each iteration, P1 and P2 operate on the residual corresponding to the current iteration, and they result in the same index.
Matrix H is equal to a weighted replica of the autocorrelation matrix of the data, A^T A, where the weights come from the neighborhood information. For example, if data points i and j are not neighbors, then h_ij = 0; and if they share P neighbors, then h_ij = P a_i^T a_j. Matrix H is a similarity matrix, and any other graph-based similarity matrix can reasonably substitute for H. In the main paper, we employ a normalized similarity matrix whose definition is inspired by the Laplacian graph of the neighborhood. This is a conventional similarity matrix in the context of manifold-based dimension reduction. Moreover, it can easily be employed for graph summarization, which is investigated in the main manuscript. The neighborhood and weighting in the definition of matrix H are hard, while the normalized similarity matrix based on a Gaussian kernel provides a soft neighborhood definition via smooth weighting. Employing the normalized similarity matrix results in Problem (6) in the main paper. Fig. 9 illustrates the impact of nonlinear modeling on a toy example containing a set of 100×100 images in which each image is a rotated and resized version of the others (Fig. 9(a)). Since none of the images lies on the linear subspace spanned by the rest, the ensemble of these data does not form a linear subspace. Therefore, this dataset is of high rank, and a union of linear subspaces is not a proper underlying model for it. The KSP algorithm is implemented using a Gaussian kernel with parameter α, i.e., s_ij = e^{−α‖a_i − a_j‖²}. As shown in Fig. 9(c), the nonlinear selection algorithm is able to discover the intrinsic structure of the data and select samples from more distinct angles than in Fig. 9(b), in which the plain SP is applied.
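A minimal numpy sketch of both similarity constructions follows: the hard-neighborhood matrix H with h_ij = P a_i^T a_j (P being the number of shared neighbors), and the soft Gaussian-kernel alternative s_ij = exp(−α‖a_i − a_j‖²). The toy data and neighborhoods are illustrative assumptions:

```python
import numpy as np

def build_H(A, neighborhoods):
    """Hard-neighborhood similarity: h_ij = |Omega_i ∩ Omega_j| * a_i^T a_j."""
    M = A.shape[1]
    H = np.zeros((M, M))
    for i in range(M):
        for j in range(M):
            shared = len(set(neighborhoods[i]) & set(neighborhoods[j]))
            H[i, j] = shared * (A[:, i] @ A[:, j])
    return H

def gaussian_similarity(A, alpha):
    """Soft alternative: s_ij = exp(-alpha * ||a_i - a_j||^2)."""
    d2 = ((A[:, :, None] - A[:, None, :]) ** 2).sum(axis=0)
    return np.exp(-alpha * d2)

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 4))                   # 4 data points in R^5
Omega = [[0, 1], [0, 1, 2], [1, 2, 3], [2, 3]]    # toy neighborhoods
H = build_H(A, Omega)
S = gaussian_similarity(A, alpha=1e-1)
# Points 0 and 3 share no neighborhood indices, so h_03 = 0
```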

Supplementary Experiments
The experiments in this section complement those reported in the main paper.

Convergence of SP
A provably convergent version of the SP algorithm requires a slight modification, which is explained later in this section. However, extensive experiments show that the proposed SP algorithm in the main paper converges in fewer than 5K iterations for selecting K samples. Fig. 10 and Fig. 11 show the convergence behavior of SP and KSP for selecting from the Multi-PIE face dataset and the Cora citation dataset, respectively, within fewer than 5K iterations. We have demonstrated our empirical results on the convergence of SP in this section; however, they do not guarantee that SP is provably convergent. A slight modification of SP can guarantee convergence. At each iteration of SP, a new sample is selected only if the resulting residual error decreases (Alg. 1 (SP), line 7). This way, the error is non-increasing. The error is also lower bounded by ‖A − A_K‖²_F. These two conditions guarantee that the algorithm converges and that the quality of the selected subset always improves or remains the same. Alg. 4 describes the provably convergent version of the SP algorithm. In lines 7, 8, and 9, we check whether the newly updated sample provides a better minimizer than the previous sample. The initial selection of the SP algorithm can affect the final selected set. However, regardless of initialization, SP converges to approximately the same cost, as shown here in Fig. 12. Further, initializing SP with a deterministic algorithm such as IPM [17] or SMRS makes SP independent of initialization.
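The guarded update can be sketched as follows. This is a minimal Python implementation of the SP loop written from the description above, with illustrative choices of our own (QR-based projection, a fixed 5K iteration budget); it accepts a candidate replacement only when the projection error does not increase, so the error trajectory is non-increasing as in Alg. 4:

```python
import numpy as np

def proj_error(A, idx):
    """Squared Frobenius error of projecting A onto the span of columns idx."""
    Q, _ = np.linalg.qr(A[:, idx])
    return np.linalg.norm(A - Q @ (Q.T @ A)) ** 2

def spectrum_pursuit(A, K, n_iter=None, seed=0):
    """SP loop with the convergence guard: a replacement is kept only if
    the projection error does not increase."""
    rng = np.random.default_rng(seed)
    M = A.shape[1]
    S = list(rng.choice(M, size=K, replace=False))   # random initialization
    err = proj_error(A, S)
    for it in range(n_iter or 5 * K):                # typically 5K iterations
        k = it % K
        rest = [s for i, s in enumerate(S) if i != k]
        Q, _ = np.linalg.qr(A[:, rest])
        E = A - Q @ (Q.T @ A)                        # residual matrix E_k
        u = np.linalg.svd(E, full_matrices=False)[0][:, 0]   # first LSV of E_k
        # column of E most aligned with u (small eps guards zero columns)
        scores = np.abs(u @ E) / (np.linalg.norm(E, axis=0) + 1e-12)
        cand = int(np.argmax(scores))
        trial = list(S)
        trial[k] = cand
        new_err = proj_error(A, trial)
        if cand not in rest and new_err <= err:      # convergence guard
            S, err = trial, new_err
    return sorted(S), err

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 60))                    # 60 samples in R^20
S, err = spectrum_pursuit(A, K=5)
```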

GAN on Multi-pie Face Dataset
As discussed in the main paper, we select only 9 images from each subject (1,800 subjects in total) and train the network with the reduced dataset for 300 epochs using a batch size of 36. Fig. 13 shows the generated images of a subject in the testing set, using the network trained on the reduced dataset as well as on the complete dataset. The network trained on samples selected by KSP (fifth row) is able to generate more realistic images, with fewer artifacts, compared to the other selection methods (rows 1-4). The parameter of KSP is set to 1e-4 for constructing the similarity matrix.

Graph Summarization
In Section 4.3 of the main paper, we presented one of the important applications of the KSP algorithm, i.e., graph summarization. Here, in Fig. 14, we compare the central-vertex selection and community-detection capability of KSP with the other state-of-the-art algorithms listed in Table 2, on the Powerlaw Cluster graph [56].

Open-set Identification
It is worth noting that, in some contexts, the open-set is defined as the set containing both known and unknown classes. In this paper, we assume that the open-set contains only the unknown classes, and the classes known at training time are called the closed-set.
Here, we provide a discussion on how to select the threshold in the open-set identification experiments. First, a network is trained on the MNIST training data. Next, the validation data, consisting of samples from both the known and unknown classes, is used to find the threshold, as in Algorithm 3 of the main text. At test time, a pre-determined threshold is required for deciding on test samples. Our proposed method works by accessing a set of error values, splitting them, and deciding on the threshold; using one test sample at a time does not yield a set of error values to split. Therefore, one can simply set the threshold to a value slightly larger than the maximum of the error values obtained by projecting the training samples onto the selected representatives from each class. Alternatively, if the learning framework is allowed to access validation data, the threshold can be obtained by clustering the error values of the balanced validation data into two groups and taking the average of the two centroids (a 1:1 sample ratio for Omniglot and MNIST in our case).
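The two threshold strategies above can be sketched in a few lines of numpy. The synthetic error distributions below are illustrative assumptions standing in for real projection errors; the 2-means step is written out explicitly rather than using a library clustering routine:

```python
import numpy as np

def threshold_from_train(train_errs, margin=1.05):
    """Option 1: slightly above the largest closed-set training error.
    The 5% margin is an illustrative choice."""
    return margin * np.max(train_errs)

def threshold_from_validation(val_errs, n_iter=100):
    """Option 2: cluster validation errors into two groups (plain 2-means)
    and return the average of the two centroids."""
    c = np.array([val_errs.min(), val_errs.max()])   # centroid initialization
    for _ in range(n_iter):
        labels = np.abs(val_errs[:, None] - c[None, :]).argmin(axis=1)
        c = np.array([val_errs[labels == g].mean() for g in (0, 1)])
    return c.mean()

rng = np.random.default_rng(3)
closed_val = rng.normal(0.1, 0.02, 500)   # toy closed-set (known) errors
open_val = rng.normal(1.0, 0.2, 500)      # toy open-set (unknown) errors
tau = threshold_from_validation(np.concatenate([closed_val, open_val]))
```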
Fig. 15 shows the macro-averaged F1-score vs. threshold for different numbers of representatives selected by SP. Fine-tuning the open-set identifier by selecting the best representatives enhances the accuracy significantly, as observed in Fig. 15. As the number of representatives decreases, the sensitivity of the performance to the threshold adjustment increases, which means there is a trade-off between the accuracy of the selection-based scheme and the stability of the performance w.r.t. the designed threshold range. Fig. 15 also shows that selecting between 50 and 100 samples from each training class (each containing about 6,000 samples) leads to the optimal F1-score.
In Fig. 16, the area under the receiver operating characteristic curve (ROC-AUC) is plotted for the KSP method in open-set identification. Different numbers of selected representatives in the proposed SOSIS algorithm (Alg. 3 in the main text) are considered. Sweeping through the threshold range, the ROC-AUC is obtained for the SOSIS algorithm with each desired number of selected samples. As observed and magnified in Fig. 16, the best ROC-AUC performance (higher in the plot) is achieved for about 20-50 selected representatives.
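For a scalar score such as the projection error, the ROC-AUC of the threshold sweep equals the Mann-Whitney statistic (the probability that a random open-set error exceeds a random closed-set error). The sketch below computes it directly on synthetic errors; the error distributions are illustrative assumptions:

```python
import numpy as np

def roc_auc(err_closed, err_open):
    """AUC of the rule 'declare open-set when error > tau', computed as
    the Mann-Whitney statistic; ties are counted with weight 1/2."""
    diffs = err_open[:, None] - err_closed[None, :]
    return ((diffs > 0).sum() + 0.5 * (diffs == 0).sum()) / diffs.size

rng = np.random.default_rng(6)
err_closed = rng.normal(0.1, 0.05, 1000)   # toy closed-set projection errors
err_open = rng.normal(0.8, 0.2, 1000)      # toy open-set projection errors
auc = roc_auc(err_closed, err_open)
# Well-separated error distributions give an AUC close to 1
```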

Figure 1 :
Figure 1: Several deep learning applications of our proposed data selection algorithms discussed in this paper.

Figure 2 :
Figure 2: Intuitive illustration of our main contributions in this paper. (a) A dataset including 20 real images from the AT&T face database

Figure 3 :
Figure 3: Two consecutive iterations of the SP algorithm. In each iteration the residual matrix, E_k, is computed. The first LSV of the residual matrix, u, is a vector on S^{N−1}. The column of E_k most aligned with u is selected at each iteration and replaces the previous selection for the next iteration. Please note that E_k ⊂ S^{N−1} ⊂ R^N in the Venn diagrams.

After solving the recast problem (3) for u_k, we find the data point that matches it best in (4b). The steps of the SP algorithm are elaborated in Algorithm 1, and Fig. 3 illustrates Problem (4) pictorially. SP is a low-complexity algorithm with no parameters to be tuned. The complexity order of computing the first singular component of an M × N matrix is O(MN) [31]. As the proposed algorithm only needs the first singular component for each selection, the computational complexity of SP is O(MN) per iteration, which is much faster than convex relaxation-based algorithms with complexity O(M³) [19]. Moreover, SP is faster than the K-medoids algorithm and volume sampling, whose complexities are of order O(KN(M − K)²) and O(MKN log N), respectively [32, 33]. The stopping criterion can be the convergence of the set S or reaching a pre-defined maximum number of iterations. The convergence behavior of SP is studied in the supplementary document. The simplicity and accuracy of SP facilitate its extension to nonlinear manifold sampling with a wide range of applications. We refer to this extended version as kernel-SP (KSP), which is discussed next in Section 3.
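The O(MN) cost of the first singular component can be realized with power iteration on E E^T, where each pass costs one pair of matrix-vector products. This brief numpy sketch is illustrative; the iteration count is an assumption of ours, not a value from the paper:

```python
import numpy as np

def first_singular_vector(E, n_iter=200, seed=0):
    """Power iteration on E E^T: each pass costs O(MN), matching the
    per-selection complexity quoted for SP."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(E.shape[0])
    for _ in range(n_iter):
        u = E @ (E.T @ u)            # one O(MN) multiply pair
        u /= np.linalg.norm(u)
    return u

E = np.random.default_rng(4).standard_normal((50, 200))
u = first_singular_vector(E)
u_svd = np.linalg.svd(E, full_matrices=False)[0][:, 0]
# |<u, u_svd>| is close to 1: agreement up to sign with the full SVD
```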

Algorithm 1
Spectrum Pursuit Algorithm
Require: A and K
Output: A_S
1: Initialization: S ← a random subset of {1, . . . , M} with |S| = K; {S_k}_{k=1}^K ← partition of S into K subsets, each containing one element; iter = 0
while a stopping criterion is not met
2:

Figure 5 :
Figure 5: Semi-supervised classification accuracy of GCN on (left) the Cora dataset [50]; and (right) a random cluster-based graph dataset. Only the selected nodes are labeled, and the subset selection is performed using the proposed KSP algorithm, GIGA [51], FW [51], and random selection (RND).

Figure 6 :
Figure 6: Zachary's Karate Club is a small social network in which a conflict arises between the administrator and the instructor of the club [60]. Each node of the network represents a member of the karate club, and a link between members indicates that they interact outside the club. The administrator and the instructor correspond to nodes {0, 33} of this graph, respectively. We apply KSP and several other algorithms to choose two of the main vertices. GIGA, MP and FW select •, IS selects •, VS selects •, and KSP, FFS and DS3 select •.

Figure 7 :
Figure 7: Few-shot learning on the Omniglot dataset using a Siamese neural network. (Left) Visualization of images selected from the first class of Omniglot's test set in 2-shot learning. Images selected by an algorithm are marked at the corners with the same color used in the right plot. (Right) Classification accuracy with few-shot learning.

Algorithm 3
Figure 8 :
Figure 8: Sorted values of err in Step 4 of Alg. 3 for 20,000 test samples (10,000 for each of the closed and open sets). (a) All data are selected as representatives. (b) Only 20 representatives are selected. For both (a) and (b), a projection error above/below the threshold leads to classifying a sample as open-set/closed-set. Blue and red points correspond to correctly classified and misclassified samples, respectively. As shown, implementing SOSIS enabled by KSP significantly reduces the number of misclassified samples, from 5,642 to 984.

Figure 9 :
Figure 9: (a) A dataset lies on a two-dimensional manifold identified by two parameters, rotation and size; however, the rank of the matrix corresponding to this dataset is large. (b) Linear embedding using linear PCA and selection using linear SP. (c) Nonlinear embedding using t-SNE [75] and selection using kernel-SP. Unselected and selected samples are shown as red and black dots in the embedded space, respectively. The nonlinear embedding using a kernel is able to keep the intrinsic structure, and the nonlinear selection provides more diverse samples.

Figure 10 :
Figure 10: Selecting 5 and 20 representatives from the first 20 classes of the Multi-PIE dataset. Each class has 520 samples, and the error trajectory of each single run is depicted in order to show that the SP algorithm converges for each independent selection. (Left) Projection error for selecting 5 samples versus iterations. (Right) Projection error for selecting 20 samples versus iterations. Typically, SP selects K representatives within 5K iterations.

Figure 11 :

Figure 12 :
Figure 12: CSSP cost function of selecting K = 5 out of 520 samples using SP with 100 random initializations as the first iteration vs. the IPM algorithm, which is deterministic. Interestingly, the accuracy of IPM is comparable with that of SP using only K iterations with a rough random initialization; however, SP continues iterating.

Algorithm 4

Figure 13 :
Figure 13: Multi-view face generation results for a sample subject in the testing set using CR-GAN [45]. The network is trained on a selected subset of the training set (9 images per subject) using random selection (first row), K-medoids (second row), DS3 [28] (third row), IPM (fourth row), and our proposed KSP algorithm (fifth row). The sixth row shows the results generated by the network trained on all the data (360 images per subject). KSP generates the closest results to the complete dataset. In the main paper, a quantitative measure is studied for comparing the generated images and the ground truth from different views.

Figure 14 :
Figure 14: We apply KSP and the other algorithms from Table 2 to choose three of the main vertices from another graph, the Powerlaw Cluster graph, for which the quantitative results were provided in Table 2. The nodes selected by the different methods are: GIGA, MP and FW select •, IS selects •, VS selects •, DS3 selects •, and KSP and FFS select •. As is evident, KSP and FFS are the only methods able to detect the clusters and their corresponding vertices.

Table 2 :
Error performance of different state-of-the-art coreset construction algorithms for graph summarization (central vertex selection) on various types of graphs. Practically all major social networks provide social clusters, for instance, 'circles' on Google+ and 'lists' on Facebook and Twitter. Concerning the Facebook ego graph, with the KSP algorithm we define the task of identifying users' social clusters in a user's ego-network by exploiting the network structure.

Table 3 :
Comparing the F1-score of the proposed SOSIS algorithm with state-of-the-art methods for open-set identification. SP, KSP and FFS [43] are employed as the core of SOSIS.