Differential network analysis and graph classification: a glocal approach

Based on the glocal HIM metric and its induced graph kernel, we propose a novel solution in differential network analysis that integrates network comparison and classification tasks. The HIM distance is defined as the one-parameter family of product metrics linearly combining the normalised Hamming distance H and the normalised Ipsen-Mikhailov spectral distance IM. The combination of the two components within a single metric allows overcoming their drawbacks and obtaining a measure that is simultaneously global and local. Furthermore, plugging the HIM kernel into a Support Vector Machine gives us a classification algorithm based on the HIM distance. First, we outline the theory underlying the metric construction. We introduce two diverse applications of the HIM distance and the HIM kernel to biological datasets. This versatility supports the adoption of the HIM family as a general tool for information extraction, quantifying difference among diverse in- stances of a complex system. An Open Source implementation of the HIM metrics is provided by the R package nettols and in its web interface ReNette.


Introduction
The paradigm shift towards complex systems science [3], stimulated by its recent theoretical and computational advance [15,4], has paved the way for a parallel leap in computational biology by moving the focus from the differential gene expression analysis to differential network analysis (NetDA) [20,25].Due to the heterogeneity in the NetDA process and potential ill-posedness of some of the involved functional Fig. 1 A pair of similar subgraphs from a comparison of D. melanogaster and S. cerevisiae proteinprotein interaction network as shown in [12].Blue links are present only in the S. cerevisiae subnet.
operations [1,5,38], a number of alternative approaches has appeared in the literature, with different strategies and aims [23,22,20,45,25,51,10,50,41,6,7].For example, NetDA can be used to compare networks corresponding to different organisms, phenotypes or conditions.The subgraph of the protein-protein interaction network shown in Fig. 1 (from [12]) is the same in terms of shared nodes for the fruitfly and the budding yeast.A group of links is shared by both instances of the subgraph, but the budding yeast network includes nine additional edges.Clearly, when graphs to compare have a more complex structure, more sophisticated quantitative indicators are needed also to ensure a reproducibile analysis [26].In general, the two key applications of NetDA are network comparison and network classification.Both can be framed in terms of similarity between graphs, which is best dealt by defining a distance.However, non-metric alternatives can be used [49,16], and even combinations of metric and statistical approaches [43,35].
Here we propose to use the Hamming-Ipsen-Mikhailov (HIM) distance [32,31] first as the underlying metric for the NetDA framework, and also to induce a kernel for classification purposes.The HIM metric linearly combines two distances, the Hamming [24,17,48,28,40] and the Ipsen-Mikhailov [27]; the first is an edit distance, while the latter is a spectral measure.These are the two most relevant families of graph distances: the edit distances are based on functions of insertion and deletion of matching links between the compared graphs, while the spectral measures are functions of the eigenvalues of one of the graph connectivity matrices.The Ipsen-Mikhailov distance was chosen after a comparative review [30], while Hamming was selected as the simplest member of the edit family.As a characterizing feature, HIM is a glocal distance that overcomes the drawbacks of local (edit) and global (spectral) metrics when separately considered.In fact, local functions disregards the overall network structure, while spectral measures cannot distinguish isospectral graphs.
NetDA based on the HIM distance has been used in metagenomics [52], MEG neuroimaging [21], liver high-throughput oncogenomics [19] and oncoimmunol-ogy [39].Moreover, the same method has found applicability also out of computational biology, e.g., socioeconomics [32] or even in multiplex network theory [29].Here we present, after a brief summary of the main definitions, one application example in neurogenomics and one in developmental functional genomics.In the first example, we highlight and quantify weighted network dissimilarities among gene expression of brain tissues with different phenotypes (location, sex, health status), while in the latter we describe the trajectory of the binary developmental gene network in fruitfly across its different life stages.
Finally, we describe the CRAN R package nettools and the web framework ReNette [18], which are available to implement NetDA projects.

The HIM distance and kernel
We recap hereafter the main definitions and results about the Hamming-Ipsen-Mikhailov (HIM) metric and kernel.The synthesis is based on the notations of Tab.1: for a fully detailed description, including mathematical proofs, we refer the reader to [31].The (normalized) Hamming distance [24,17,48,28,40] is the (local) simplest edit metric, counting the presence/absence of matching links: By definition, H ranges between 0 and 1, where H = 0 for A (1) = A (2) and H = 1 for A (1) + A (2) = 1 N − I N .
Note that, for H, all links are equivalent regardless of their position within the network: for instance, in Fig. 2, both networks B 1 and B 2 differ from A for just one link, and thus H(A, B 1 ) = H(A, B 2 ), although B 1 is connected as A while B 2 is not.The Ipsen-Mikhailov distance [27] is the (global) L 2 integrated difference of the Laplacian spectral densities: By definition, IM too ranges between 0 and 1, where Table 1 Notation and list of symbols A (1) , A (2) Corresponding adjacency matrices, with a (1) i j , a (2) IM = 0 for spec(L (1) ) = spec(L (2) ) and IM = 1 for In fact, being a spectral measure, IM cannot distinguish isospectral (non isomorphic) networks.
To overcome the drawbacks of both H and IM, we define their normalized cartesian product, the Hamming-Ipsen-Mikhailov distance: When ξ is not close to the bounds {0, +∞} (and one of the factors becomes dominant), the impact of ξ is minimal, and in general more relevant when HIM ξ is used as a kernel [21].Hereafter ξ = 1 will be assumed, and the subscript ξ omitted.Again, HIM is bounded between 0 and 1, with HIM = 0 for A (1) = A (2) and HIM The HIM distance can be naturally extended to directed networks, by transforming it into an undirected bipartite graph through the procedure shown in [36].
The HIM distance naturally induces a kernel via Gaussian (Radial Basis Function) map [13,9], to be used standalone or in a Multi-Kernel Learning framework to increase performance and enhance interpretability [33]: for a positive real number γ.
Although the HIM kernel is not positively defined in general for all γ ∈ R + 0 , by results in [44] it can be used in Support Vector Machines or other algorithms whenever K is positive for the given training data, which is the case for all the examples shown in what follows.

The UKBEC dataset
The United Kingdom Brain Expression Consortium (UKBEC) hybridized on a Affymetrix Human Exon 1.0 ST Array (transcript version) 1213 human brain samples from 10 diverse regions.Samples originated from 134 neurologically and neuropathologically normal individuals and were used in three studies aimed at better understanding gene expression differences [47,46,42].Data details about  sample stratification according to sex and tissue location are listed in Tab.2(a).
Here, this dataset 1 is used to build the absolute Pearson coexpression networks corresponding to different phenotypes defined on the 50 genes involved in the BRAIN DEVELOPMENT (GO:0007420, GSEA M7203) pathway 2 , corresponding to 1012 probes on the Affymetrix Human Exon 1.0 ST Array platform 3 .First, we consider planar projections of all the mutual HIM distances between networks with shared nodes based on the metric multidimensional scaling (mMDS) [37,14] in Fig. 3.The mMDS plot shows the mutual HIM distances with networks stratified for both sex and tissue location.Citing the authors, the study in [46] "provides 1 available as GEO46706 at http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE46706 2 available at http://software.broadinstitute.org/gsea/msigdb/cards/BRAIN_DEVELOPMENT 3 the platform has no probes for the 51st gene of the pathway, VCX3A unequivocal evidence that sex-biased gene expression in the adult human brain is widespread in terms of both the number of genes and range of brain regions involved".In our analysis, the result is numerically confimed by the major effect emerging at the gene coexpression level (Fig. 3): male and female networks can be linearly separated in the mMDS space, with large HIM distances between both interand intragender tissue locations.In particular, intragender HIM distances among different tissue regions are larger for the female samples (range [0.112,0.232],median 0.146) than for the male (range [0.077,0.200],median 0.118), with statistical significance (t-test p-value 1.9 • 10 −4 ).In Fig. 4, we show instead the mMDS projections for the mutual HIM distances of the coexpression networks built separately for male and female subjects, partitioned in 10 age groups: the sample size for each network is listed in Tab.2(b).While the plot for the females does not show any global pattern, for males the first 5 groups (age < 58 y) have small mutual HIM distances and they result clustered together.On the other hand, the five older male groups are both mutually distant and distant from the younger subjects cluster, too.In this dataset, the small sample size in the female subgroup may be a relevant source of noise for some of the age groups, e.g. the 32 ∼ 44.Our results are consistent with findings obtained with different data and methodology by Berchtold and colleagues in [8], suggesting the existence of a global pattern of gene expression change associated with brain aging, more evident from the sixth decade onward, with different evolutions between males and female, with larger variations in male subjects.Biologically, this is due to a wider global decrease in males in the catabolic and anabolic capacity with aging, mainly in genes linked to energy production and protein synthesis and transport [8].

The D. Melanogaster development dataset
In [34], Kolar and colleagues applied the Keller algorithm to infer the gene regulatory networks of Drosophila melanogaster from a time series of gene expression data measured during its full life cycle, originally published in [2].They followed the dynamics of 588 development genes along 66 time points spanning through four different stages (Embryonic -time points 1-30, Larval -t.p. 31-40, Pupal -t.p. 41-58, Adult -t.p. 59-66), constructing a time series of inferred networks N i4 : in Fig. 5(a) we show four instances of the N i networks, at different timing.
As a first step in the quantitative NetDA of this dataset, we measure the HIM distance between each N i and the initial network N 1 : the resulting distance time series is shown in Fig. 5(b).The largest variations, both between consecutive terms and with respect to the initial network N 1 , occur in the embrional stage (E).In particular, the HIM distance grows until time point 23; next networks get closer again to N 1 , showing that the interactions of the selected 588 genes in the adult stage are more similar to the corresponding net of interaction in the embrional stage, rather than in the other two stages.Moreover, while the Hamming component ranges between 0 and 0.0223, the Ipsen-Mikhailov distance has 0.0851 as its maximum, indicating an higher variability of the networks in terms of structure rather than matching links.
Then we computed all 2145 HIM distances HIM(N i , N j ), and we projected them on a 2D mMDS representation, shown in Fig. 5(c).Interestingly, the networks for the Embryonal stage split into two clusters (before and after time points 17), and the Embryonal and Pupal stages are orthogonal in this representation.
Moreover, the Adult stage networks form a cluster well separated from the other nets, with the Larval stage graphs mixing with the Pupal and late Embryonal stages.Finally, a Support Vector Machine classifier with HIM kernel was developed with the kernlab package in R, with a 5-fold Cross Validation with γ = 10 3 and C = 1.The classifier reached accuracy 0.97 in discriminating Embryonic and Adult networks from Larval and Pupal.Similarly, in the same setup, perfect separation is reached between Embryonic and Adult stages for all values of γ > 10 3 .

Conclusion
The interest of the HIM metric is its global/local approach: by combining edit and spectral distance types, we overcome the drawbacks of the two distance components.The two applications in functional hig-throughput -omics presented support the effectiveness of the approach.The strategy of a NetDA based on the HIM distance offers a reproducible method: the metric gives a completely quantitative assessment of the differences among networks (on shared nodes) as well as a scalar product for kernel learning machines.Operatively, we provide an Open Source implementation of the HIM distance with the R package nettools available on CRAN and GitHub5 , and in the web interface ReNette [18] 6 .In particular, ReNette includes a complete pipeline for NetDA, integrating a comprehensive collection of tools for network inference, network comparison and network stability analysis [19] through queue-based submission system and asynchronous task management.The software is already configured for usage on multicore workstations, on high performance computing (HPC) clusters and on the cloud, to deal with the extraction of the Laplacian spectrum, which represents the computational bottleneck of the algorithm.

Fig. 2
Fig. 2 Link equivalence for Hamming metric: H(A, B 1 ) = H(A, B 2 ) although B 1 is connected while B 2 consists of two connected components.

Fig. 3
Fig. 3 Metric Multidimensional Scaling projection on two dimensions of all 190 mutual HIM distances between gene coexpression Brain Development networks stratified by gender and tissue locations.

Fig. 4
Fig. 4 Metric Multidimensional Scaling projection on two dimensions of all 45 mutual HIM distances between gene coexpression Brain Development networks stratified by age groups, separately for the male (a) and female (b) subjects.a ∼ b means a < x ≤ b.

Fig. 5 D
Fig. 5 D. melanogaster development network dataset.(a) Keller interaction network N i for the D. melanogaster development genes at the time points i = 1, 20, 35, 66.(b) Evolution of H (cyan), IM (magenta) and HIM (goldenrod) distances network time series across 66 time points in the 4 stages Embryonic (E), Larval (L), Pupal (P) and Adult (A).(c) Metric multidimensional scaling planar projection of the mutual HIM distances between the 66 networks N i , colored according to the developmental stage Embryonic (blue), Larval (red), Pupal (green) and Adult (orange).

Table 2
Sample size of the UKBEC human brain dataset stratified by gender and tissue location (a) and by gender and age group (b).Region: the tissue location, Abbr.: abbreviation as in Fig. 3, M: number of samples from male individuals, F: number of samples from female individuals.a ∼ b means a < x ≤ b