Distributed cosegmentation via submodular optimization on anisotropic diffusion

The saliency of regions or objects in an image can be significantly boosted if they recur in multiple images. Leveraging this idea, cosegmentation jointly segments common regions from multiple images. In this paper, we propose CoSand, a distributed cosegmentation approach for highly variable, large-scale image collections. The segmentation task is modeled as temperature maximization on anisotropic heat diffusion, in which temperature maximization with a finite number K of heat sources corresponds to a K-way segmentation that maximizes the segmentation confidence of every pixel in an image. We show that our method enjoys a strong theoretical property: the temperature under linear anisotropic diffusion is a submodular function, so a greedy algorithm guarantees at least a constant-factor approximation to the optimal solution for temperature maximization. This theoretical result is successfully applied to scalable cosegmentation as well as diversity ranking and single-image segmentation. We evaluate CoSand on the MSRC and ImageNet datasets, and show both its competitive performance over previous work and its much superior scalability.


Introduction
Cosegmentation refers to a procedure that simultaneously segments common regions from multiple images [6,7,13,15], leveraging the intuition that the saliency of regions or objects in an image can be significantly boosted if they recur in multiple images. Cosegmentation has wide potential in web-scale applications. For example, it can guide interactive image editing by suggesting popular regions in the image database [1,13], or summarize personal photo collections by automatically segmenting highly co-occurring object instances such as persons or dogs [7].
Despite the promising appeal of cosegmentation, very few algorithms are applicable to web-scale applications, which require cosegmentation to be not only scalable but also adaptable to heterogeneous images with high variability in content and complexity. In this paper, we address these problems with a new cosegmentation framework, which builds on the solid theoretical ground of submodular optimization and is readily applicable to large-scale image collections with high variability. Our approach is easily parallelizable: most computations occur independently on individual images, and an integration step then quickly merges all outputs from individual images into a coherent cosegmentation result. We quantitatively show that our approach outperforms state-of-the-art methods [6,7] on the MSRC dataset [17]. We also evaluate the scalability of our method on the challenging ImageNet [4]. The dataset sizes in our experiments exceed those of previous work by orders of magnitude.
The compelling performance and scalability of our approach stem from a novel optimization formulation on anisotropic diffusion (which inspires the name of our algorithm, CoSand, standing for Co-Segmentation via anisotropic diffusion). The optimization problem underlying CoSand can be summarized in a single sentence: given a system under heat diffusion and a finite number K of heat sources, where should one place the sources in order to maximize the temperature of the system? In terms of image segmentation, the optimization corresponds to finding the K segment centers that maximize the segmentation confidence of every pixel in the image (e.g., in the ideal segmentation, every pixel has confidence one of being clustered with one of the K segment centers). This idea is extended to the cosegmentation problem by coupling the source placements across multiple images. This diffusion-theoretic optimization framework enjoys a strong theoretical property that inspires an efficient computational algorithm. We prove in this paper that the temperature, which is to be optimized in our problem, is a submodular function if the system is under linear anisotropic diffusion. A well-known beneficial property of submodular functions is that one can achieve at least a constant factor of the optimal solution with a simple greedy algorithm, which iteratively chooses the K locations that maximize the marginal temperature gain. Such a greedy solution is particularly promising for cosegmentation tasks on large-scale image collections.

Relations to Previous work
Submodular optimization: In recent years, submodular optimization has emerged as a useful tool in a variety of machine learning problems such as active learning, structure learning, clustering, and ranking [8,9]. A submodular function is characterized by the diminishing-returns property, which states that the marginal gain of adding an element to a smaller subset is at least as high as that of adding it to a larger superset. Typical submodular functions explored in machine learning include the cut function of a graph, and the entropy and information gain of Gaussian random variables [8].
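The diminishing-returns property can be checked numerically. The Python sketch below (our own illustration, not part of the paper) verifies it exhaustively for the cut function of a toy four-node graph:

```python
import itertools

# Toy 4-node undirected graph to illustrate diminishing returns.
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
V = {0, 1, 2, 3}

def cut(S):
    """Cut function F(S): number of edges crossing between S and V \\ S."""
    S = set(S)
    return sum(1 for a, b in edges if (a in S) != (b in S))

def is_submodular():
    """Check F(A + v) - F(A) >= F(B + v) - F(B) for every A subset of B, v not in B."""
    nodes = sorted(V)
    for r in range(len(nodes) + 1):
        for B in itertools.combinations(nodes, r):
            Bs = set(B)
            for ra in range(len(B) + 1):
                for A in itertools.combinations(B, ra):
                    As = set(A)
                    for v in V - Bs:
                        if cut(As | {v}) - cut(As) < cut(Bs | {v}) - cut(Bs):
                            return False
    return True

print(is_submodular())  # True: the graph-cut function is submodular
```

The same exhaustive check would fail for a non-submodular set function, which makes this a convenient sanity test when designing new objectives.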
To the best of our knowledge, our work is the first to address submodular optimization on physical diffusion.
Cosegmentation: Cosegmentation is the problem of jointly segmenting each of M images into K different regions [6,7,11,13,15]. Table 1 compares our work with other unsupervised cosegmentation methods. In summary, our approach is unique in terms of M and K. Most previous work has dealt with binary figure-ground segmentation (K=2) of small image sets (mostly M=2, but M ≤ 30 in [7]). In contrast, our algorithm is able to segment a large-scale dataset with arbitrary K. We tested with M ≥ 10^3 images in our experiments, but an even more scalable setup is also applicable. The optimization methods for cosegmentation in most previous work, except [7], are based on the graph-cut algorithm; hence, it is not straightforward to extend them to arbitrary K-way cuts. In theory, the method of [7] can perform cosegmentation with K > 2, but this was not evaluated in the paper. In contrast, our algorithm attains a constant-factor approximation to the optimum for arbitrary K, and the computation time is at worst linear in K.
In addition, our approach supports automatic selection of K and is robust against a wrong choice of K. Both are demonstrated in the experiments in Section 4, which also show that CoSand is compelling in segmentation quality over the state-of-the-art techniques [6,7] on the MSRC [17] and ImageNet [4] datasets.

Anisotropic diffusion:
The heat diffusion framework, represented by a partial differential equation, has been a successful technique in image processing and computer vision. Notable examples include image segmentation [18], optical flow estimation [2], and image smoothing [16]. In these applications, the temperature corresponds to various objectives: the clustering confidence in segmentation, the optical flow in motion analysis, or the RGB value in image smoothing. In this paper, we focus on image segmentation, but our optimization is also easily extensible to problems such as large-scale edge-preserving image smoothing or layered motion segmentation in video.

Summary of Contributions
The main contributions of this paper are as follows: (1) We propose a diffusion-based optimization framework that is applicable to a wide range of computer vision problems. In this paper, we show that our optimization leads to an effective solution to diversity ranking, single-image segmentation, and cosegmentation.
(2) We prove that the temperature of a linear anisotropic diffusion system, which corresponds to many important objectives in computer vision tasks, including the cosegmentation score of concern in this paper, is a submodular function. This is a new result that widens the applicability of submodular optimization in computer vision research.
(3) We present CoSand, a distributed cosegmentation method exploiting the submodularity of our diffusion-inspired segmentation objective. As compared in Table 1, our approach has unique benefits, including compelling performance over previous methods, superior scalability, and the desirable ability to automatically decide the number of segments.

Optimization on Anisotropic Diffusion
We begin with the general theory of anisotropic diffusion [16]. Let Ω ⊂ R^d denote the domain of a system and x ∈ Ω be a point. Since we are usually interested in discrete systems (e.g., images or graphs), we assume that Ω is a discrete set of points. Let u(x, t) be the temperature at position x at time t, and let D(x) be a d×d positive symmetric tensor called the diffusion tensor. Linearity of the diffusion means that D is not a function of u or ∇u. Anisotropy means that the flux −D(x)∇u(x, t) and the gradient ∇u(x, t) are not parallel in the image domain. The diffusion equation of such a system is

∂u(x, t)/∂t = ∇ · (D(x)∇u(x, t)).    (1)

Our optimization problem is to maximize the temperature sum of the system under anisotropic diffusion by choosing the locations of K heat sources. Formally,

max_{S ⊆ Ω, |S|=K} Σ_{x∈Ω} u(x, t; S),    (2)

where we assume that the temperature of the environment (i.e., outside of the system Ω) is zero (i.e., u(g) = 0), and the source temperature is one at all times (i.e., u(s) = 1 for s ∈ S).
For a physical analogy, one may imagine a metal plate in open air whose temperature is to be maximized with K point heat sources. Without loss of generality, we explicitly decompose the heat flux at every point into two parts: a flux within the system and a dissipation flux out of the system. Let z(x) be a positive scalar diffusivity to the environment at x; the dissipation heat loss is then −z(x)(u(x) − u(g)). If z(x) = 0 for all x ∈ Ω, the system is insulated. From now on, we assume that −D(x)∇u(x, t) solely contributes to the diffusion within the system.
In order to efficiently solve the optimization of Eq. (2) for arbitrary K, we first prove that the temperature under linear anisotropic diffusion is submodular.
Theorem 1 (Submodularity on Anisotropic Diffusion). Suppose that the system undergoes linear anisotropic diffusion. Let u(x, t; S) be the temperature at position x at time t when identical heat sources are attached to S (⊂ Ω). Then, for all x ∈ Ω and all t ∈ [0, ∞], u(x, t; S) is a nondecreasing submodular function of S.
Proof. The proof is given in the supplementary material. Let U(t; S) = Σ_{x∈Ω} u(x, t; S) be the temperature sum of the system at time t. Intuitively, U(t; S) is also submodular, since it is a sum of submodular functions [8]. Theorem 2 below states that a simple greedy algorithm achieves a near-optimal solution for the maximization of a submodular function.

Examples: Diversity ranking and clustering
For a better understanding of the above diffusion formulation, let us first examine a simple case: diversity ranking in a graph. Diversity ranking [19] aims to re-rank items to reduce redundancy while maintaining their centrality, which is highly relevant to the goal of segmentation. Intuitively, in order to maximize the temperature of the system with limited sources, the sources should be located in center-of-gravity regions that are densely connected to other elements with high conductivity. Simultaneously, the sources should be sufficiently distant from one another to have a broad and balanced coverage of the system. In the next section, we extend this idea to the cosegmentation problem.
Suppose the following: (1) The system Ω is a graph G = (V, E). (2) We are interested in the steady state (i.e., t → ∞), so we can drop t from our notation. (3) The diffusivity (i.e., conductance) is defined by the Gaussian similarity between the features of vertices,

d(x, y) = exp(−||g(x) − g(y)||² / 2σ²),    (3)

where g(x) is the feature vector at node x ∈ V. (4) The dissipation conductance at a vertex x is constant in time, denoted by z_x; that is, each node x is connected to an environment node g with conductance z_x. With these assumptions, the diffusion reduces to the famous random walk model [5] or Gaussian random fields [20]. The optimization problem of Eq. (2) then grounds to the more specific form of Eq. (4), where a_x is the degree of x. In terms of random walks, the optimization of Eq. (4) corresponds to selecting K absorbing nodes so as to maximize the sum of absorption probabilities of a random walker in the given network G. In terms of linear electric circuits, the first constraint of Eq. (4) is the Kirchhoff equation, and the problem is to locate K voltage sources so as to maximize the electric potential of the circuit. Since the objective u(x; S) is submodular, we can obtain a near-optimal solution with a greedy algorithm, which starts with an empty S and iteratively adds the item s_k that maximizes the marginal temperature gain U(S_{k−1} ∪ {s_k}) − U(S_{k−1}), as shown in Eq. (5). The details of the greedy algorithm are discussed in Section 3. The dissipation conductance z is a parameter that controls the trade-off between centrality and diversity. With a larger z, the heat loss to the environment is larger as well, and only the neighbors within a shorter range of a source get high temperatures; hence, a point closer to the already ranked set S_{k−1} is likely to be chosen as the next s_k.
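To make the greedy procedure concrete, the following Python sketch implements steady-state temperature maximization on a small graph. Pinning the sources at temperature one and solving the dissipative Laplacian system for the free nodes follows the random-walk interpretation above; the function names, the dense-matrix representation, and all parameters are our own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def temperature_sum(A, z, S):
    """Steady-state temperature sum U(S): sources in S are pinned at 1,
    the environment is at 0, A holds pairwise conductances d_ij, and
    z holds per-node dissipation conductances z_x."""
    n = A.shape[0]
    S = list(S)
    F = [i for i in range(n) if i not in S]  # free (non-source) nodes
    if not F:
        return float(n)
    deg = A.sum(axis=1)
    # Laplacian restricted to free nodes, plus dissipation to the environment.
    L_FF = np.diag(deg[F] + z[F]) - A[np.ix_(F, F)]
    b = A[np.ix_(F, S)].sum(axis=1)          # heat flowing in from the sources
    u_F = np.linalg.solve(L_FF, b)
    return len(S) + u_F.sum()

def greedy_sources(A, z, K):
    """Greedily add the source with the largest marginal temperature gain."""
    S, U = [], 0.0
    for _ in range(K):
        gains = {v: temperature_sum(A, z, S + [v]) - U
                 for v in range(A.shape[0]) if v not in S}
        best = max(gains, key=gains.get)
        S.append(best)
        U += gains[best]
    return S, U
```

Because the objective is monotone submodular, the returned U is guaranteed to be at least a (1 − 1/e) fraction of the optimal temperature sum.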
Fig. 1 shows two toy examples of diversity ranking and clustering. Here, the location of a point is used as the feature g(i) = [x, y]^T to compute the similarity of Eq. (3); therefore, a closer point pair (i, j) has a larger diffusivity d_ij. In the first example of three Gaussian distributions (Fig. 1.(a)-(f)), intuition tells us that the center point of the largest blob should be selected as the first item s_1, and it indeed has the highest marginal gain in Fig. 1.(b). In the next iteration, since the points near s_1 already have high temperature, the second choice to maximize the marginal gain should be not only distant enough from s_1 (diversity) but also densely linked to other points with high diffusivity d_ij (centrality), which is s_2 in Fig. 1.(c). In sum, s_k is chosen as the most central point that is distant enough from the already selected items S_{k−1}.
In the second example of three concentric circles (Fig. 1.(g)-(l)), one interesting behavior is that, among the points in each circle, the point on the opposite side of the circle from the selected point has the highest marginal gain. Thus, when the fourth source s_4 is chosen in Fig. 1.(k), it is the exact opposite of s_3 in the circle. That is, with K=4 the largest circle in Fig. 1.(l) is divided into two exact half-circles.
This algorithm may seem similar to the Grasshopper algorithm [19], a greedy algorithm for diversity ranking. However, the objective function is different, and our main contribution over [19] is that our method is not ad hoc but a constant-factor approximation based on submodularity.

Large-scale CoSegmentation
In this section, we present our scalable cosegmentation algorithm. Below, we begin with the segmentation of a single image to illustrate the basic behavior of the algorithm.

Segmentation of a Single Image
The segmentation of a single image aims to find K segment centers that maximize the sum of segmentation confidences of every pixel in the image. This can be achieved via the following procedure.
Building the intra-image graph of an image: For faster computation, we first extract superpixels from an image as shown in Fig. 2.(b). Any edge-preserving superpixel method can be applied; TurboPixels [10] is used in our implementation. We then build the intra-image graph G_i = (V_i, E_i, D_i), where the vertex set V_i is the set of superpixels and the edge set E_i connects all pairs of adjacent superpixels. Let N_i denote the number of superpixels of image i. In each superpixel, 3-D CIE Lab color and 4-D texture features are extracted. The diffusivity D_i is computed by the Gaussian similarity of Eq. (3) on the features of the superpixels. The adjacency matrix G_i of G_i is a sparse N_i×N_i matrix, in which the number of nonzero elements for each superpixel equals the number of its neighbors, which in most cases is less than 10.
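A minimal sketch of this graph construction in Python, assuming the per-superpixel features and the adjacency pairs are already available (the bandwidth sigma and all names are illustrative, not from the paper):

```python
import numpy as np

def intra_image_graph(features, adjacent_pairs, sigma=1.0):
    """Diffusivity matrix for one image: Gaussian similarity on superpixel
    features, nonzero only for spatially adjacent superpixels.
    `features` stands in for the 7-D CIE Lab + texture descriptors, and
    `sigma` is an assumed bandwidth parameter."""
    n = len(features)
    A = np.zeros((n, n))
    for i, j in adjacent_pairs:
        d = np.exp(-np.sum((features[i] - features[j]) ** 2) / (2 * sigma ** 2))
        A[i, j] = A[j, i] = d   # symmetric conductance between neighbors
    return A
```

In a real implementation the matrix would be stored sparsely, since each superpixel touches only a handful of neighbors.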
Construction of the evaluation set: In the diversity ranking discussed earlier, we compute the marginal gain at every data point to find the maximum (Fig. 1). However, this search is inefficient, since the actually distinctive regions in an image are usually far fewer than N_i. For example, in Fig. 2 there are many sky superpixels, and there is little difference in the segmentation results no matter which sky superpixel is chosen as a segment center. Thus, we first run agglomerative clustering on G_i to find the set of evaluation points L_i (|L_i| ≤ 100 in our experiments). The marginal gain is computed only at L_i; that is, segment centers are restricted to a subset of L_i (i.e., S_i ⊂ L_i ⊂ V_i in the third constraint of Eq. (6)). Fig. 2.(b) shows an example of L_i as colored superpixels.
Basic behavior of the segmentation: In summary, our segmentation algorithm greedily selects the largest and most coherent regions. As shown in Fig. 2.(d), the sky is chosen first with K=2. As K increases, the regions of the tree, the house in the center, and the building on the left are chosen in decreasing order of their sizes and coherence in Fig. 2.(d)-(g). This desirable trend comes from the greedy nature of our algorithm, and it is quite helpful for automatic selection of K: we can keep increasing K until the detected segment is no longer significant (i.e., the temperature increase from adding a new source becomes insignificant). As the iteration proceeds, we re-use the previous results for lower K, which significantly reduces the computation time (e.g., via the lazy greedy approach of [9]).
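The stopping rule for automatic selection of K can be sketched as follows. The relative threshold on the marginal gain is our own illustrative choice; the paper only states the idea of stopping when the temperature increase becomes insignificant:

```python
def choose_k(marginal_gains, rel_threshold=0.05):
    """Pick K automatically: keep adding sources while the marginal
    temperature gain stays significant relative to the first gain.
    The relative-threshold rule and its value (5%) are assumptions
    made for illustration."""
    for k, gain in enumerate(marginal_gains, start=1):
        if gain < rel_threshold * marginal_gains[0]:
            return k - 1          # the newest source was insignificant; stop before it
    return len(marginal_gains)
```

Because the objective is submodular, the marginal gains are non-increasing, so the first insignificant gain marks a safe stopping point.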

Scalable Cosegmentation
The input of cosegmentation is an image set I and the number of segments K. The optimization formulation for cosegmentation in Eq. (6) is an extension of that of diversity ranking (Eq. (4)).
The objective in Eq. (6) is the sum of the temperatures (i.e., segmentation confidences) of every image in the dataset. Thus, it encourages each image to be segmented into the K largest and most coherent regions that are nevertheless content-wise diverse with respect to one another. In order to enforce inter-image similarity between chosen clusters, the second constraint of Eq. (6) is introduced. Here f(g(s_ik), g(s_jk)) is an increasing function of the feature affinity between the k-th sources of image i (s_ik) and image j (s_jk): the more visually similar the features of s_ik and s_jk are, the higher the value of f(g(s_ik), g(s_jk)). It is intuitive that the system temperature is linear in the source temperature (e.g., if the source temperature is halved, the temperatures of all points in the system are halved as well). Hence, the second constraint pushes the k-th source placement of image i to be similar to its corresponding placements in the other images of N(i), the neighborhood image set of i to be jointly cosegmented. If N(i) = I\i, then each image is cosegmented with respect to all the other images in I. Meanwhile, the affinity function f controls how strongly the inter-image similarity is imposed. If f(g(s_ik), g(s_jk)) is constant, the optimization of Eq. (6) reduces to independent segmentation of each image; if it is a fast-increasing function, the inter-image similarity is highly weighted. We use the Gaussian similarity of Eq. (3) for f. Algorithm 1 presents the greedy algorithm to solve Eq. (6). Note that Algorithm 1 is easily parallelizable: all steps except step 5 can be computed independently for each image, and the computational complexity of step 5 is O(|I||N|).
Once we obtain the K source placements S_i for each image, the segmentation is straightforward. Here we use the method of [5], which is summarized in steps 7-8 of Algorithm 1. It first calculates an (N_i−K)×K matrix X in which X(j, k) is the probability that a random walker starting at an unselected j-th point (i.e., x_j ∈ V_i\S_i) reaches the k-th source point. We then cluster together the superpixels that share the same source point as their most probable destination.
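This clustering step can be sketched as follows, assuming a dense conductance matrix. The absorption probabilities are obtained by solving a Laplacian system restricted to the unselected nodes, in the spirit of the random-walker method of [5]; the names and dense representation are our own:

```python
import numpy as np

def absorption_probabilities(A, sources):
    """For each unselected node j, the probability that a random walker
    starting at j is absorbed at each source node."""
    n = A.shape[0]
    F = [i for i in range(n) if i not in sources]   # free (unselected) nodes
    deg = A.sum(axis=1)
    L_FF = np.diag(deg[F]) - A[np.ix_(F, F)]        # Laplacian on free nodes
    B = A[np.ix_(F, list(sources))]                 # conductances to the sources
    X = np.linalg.solve(L_FF, B)                    # (N-K) x K absorption matrix
    return F, X

def cluster(A, sources):
    """Assign every node to the source it most probably reaches."""
    F, X = absorption_probabilities(A, sources)
    labels = {s: k for k, s in enumerate(sources)}
    labels.update({j: int(np.argmax(X[idx])) for idx, j in enumerate(F)})
    return labels
```

On a connected graph each row of X sums to one, so the assignment is simply the column-wise argmax.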
Fig. 3 shows an example of our cosegmentation on three MSRC cow images with K=4. Since our algorithm can handle arbitrary K, the brown and black cows and the river in the first image can be detected as individual clusters.
Optimality: The constant-factor approximation of our algorithm is guaranteed if the element with the maximum marginal gain is chosen in each round (step 5). In diversity ranking and single-image segmentation, we compute the exact solution for this step. However, for large-scale cosegmentation with full dependency, we use belief propagation, which is an approximate maximization. In most cases, this relaxed solution is good enough to obtain a high-quality segmentation result.

In the listing of Algorithm 1, u is obtained by solving a linear system in which L_i is the Laplacian of G_i and u is an N_i×1 vector with the constraints u({S_i ∪ l_j}) = 1 and u(g) = 0; step 4 then obtains the gain ΔU_i(l_j) = |u|_1 (the l-1 norm of u), and step 5 solves the energy maximization by belief propagation (I_S denotes a K×K identity matrix).
A more scalable setting: In practice, a large-scale image set is likely to contain noisy information as well. If heterogeneous images are cosegmented, the results can be worse than those of individual image segmentation. Thus, one can first decompose I into disjoint sets I = I_1 ∪ ... ∪ I_O so that each subset I_o consists of similar images, and then apply Algorithm 1 to each I_o separately. This decomposition can be done by the proposed diversity ranking and clustering of Eq. (4) on the similarity graph of I, which can be constructed by applying the Gaussian similarity to image descriptors (e.g., dense SIFT or GIST).

Experiments
We evaluate our approach with two different experiments: (1) figure-ground segmentation with a pair of images (M=2 and K=2), and (2) scalability tests with a large number of images (M ∼ 1000). The figure-ground tests quantitatively compare our method with other state-of-the-art cosegmentation techniques that are only applicable in this setting. The scalability tests evaluate how well our algorithm works with real-world data.

Evaluation on Figure-ground Cosegmentation
In the figure-ground tests, we use the MSRC dataset [17], which provides 30 pixel-wise labeled images per object class. Two recent cosegmentation methods, [6] and [7], are compared using the original authors' implementations with the default parameter settings. We run [6], [7], and our method on 100 randomly generated pairs in each class.
Unlike the others, the method of [6] requires prior labels of foreground (fg) and background (bg) RGB colors. In order to obtain the labels, we first identify the fg and bg regions of each image from the ground truth. Then, we apply K-means to the RGB space of the fg and bg pixels to compute three cluster centers each, which are used as labels (i.e., six fg and six bg RGB labels in total per pair). These labels can be regarded as strong supervision, but they were used because the performance of [6] was highly sensitive to the labels.
Since our method is not designed specifically for figure-ground segmentation, we add a step to generate binary segmentation results. Our approach iteratively chooses large and coherent regions across input images in a bottom-up way; thus, if the foreground object consists of several distinct regions, it is likely to segment them into multiple regions. For binary segmentation, we first conservatively cosegment a pair of images with a large K (K=8 in our experiments). Then, we apply Normalized cuts to the similarity graph of the eight pairs of cosegments to obtain two balanced and discriminative partitions. We observed that our approach showed excellent performance at detecting a moderate number of cosegments, but the final figure-ground segmentation accuracy depended heavily on this binarization.
Table 2 summarizes the segmentation accuracies on the random test pairs of the MSRC dataset. The accuracy is measured by the intersection-over-union metric that is standard in the PASCAL challenges (i.e., for each image, Ac = |GT_i ∩ R_i| / |GT_i ∪ R_i|). Our method outperformed both [6] and [7] on most objects of the MSRC dataset. Our algorithm was also significantly faster than both competitors; it took less than 10 seconds for a pair of images of size 320×213 with 750 superpixels and K=8.
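For reference, the accuracy metric is a straightforward intersection-over-union computation on binary masks, as in the PASCAL measure Ac = |GT ∩ R| / |GT ∪ R|:

```python
import numpy as np

def iou_accuracy(gt_mask, result_mask):
    """Intersection-over-union accuracy between a ground-truth mask and a
    segmentation result mask (the PASCAL overlap measure)."""
    gt = np.asarray(gt_mask, dtype=bool)
    res = np.asarray(result_mask, dtype=bool)
    union = np.logical_or(gt, res).sum()
    if union == 0:
        return 1.0   # both masks empty: define as a perfect match
    return np.logical_and(gt, res).sum() / union
```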

Evaluation on Scalable Cosegmentation
For the scalability tests, we use ImageNet [4] and compute segmentation accuracies using its bounding box annotations. The bounding boxes may not be a perfect ground truth for segmentation evaluation, but in practice it is difficult to obtain pixel-wise labels for large-scale datasets.
We compare our algorithm to MNcut [3] and the method of [14], both of which are publicly available. As a baseline, MNcut [3] independently segments each image with K=2, and fg and bg are assigned so that the segmentation accuracy is maximized. For [14], we run the algorithm several times with the number of topics varied from two to eight, and report the best results. Note that most previous cosegmentation methods, including [6] and [7], cannot run well with a large number of images ([7] reported that their algorithm took between 4 and 9 hours for 30 images).
For the ImageNet tests, we select 50 synsets that provide bounding box labels and randomly select up to 1000 images per synset. Since the ImageNet images are too diverse to be jointly cosegmented at once, we first split each synset into 100 disjoint sets I = I_1 ∪ ... ∪ I_100 by our diversity ranking and clustering. Then, our cosegmentation is applied to each I_o separately. This decomposition is much more favorable for the performance; we also tested a simultaneous cosegmentation of all 1,000 images with full dependency, for which both accuracy and speed were much worse. Fig. 5.(a) shows an example of synset decomposition: a single synset has several different aspects, which were successfully detected by our diversity ranking and clustering. Table 3 shows the segmentation accuracies for 13 selected synsets. Our algorithm significantly outperformed the two competitors, by more than 10%. Our algorithm took 60-70 minutes for 1,000 images on a single machine; note that this computation time can be significantly reduced by parallelization, as discussed in Section 3.2.
Fig. 4 and Fig. 5.(b) show examples of cosegmentation on the MSRC and ImageNet datasets. We make two interesting observations: (i) our method can easily segment multiple instances in the images; (ii) our algorithm is robust against an incorrect selection of K. In the duck example in the second column of Fig. 4, the best choice of K would be four, but a faulty guess of K=8 did little harm: the four significant segments were successfully detected (e.g., three ducks and grass), and the other four overestimated segments were trivially selected as tiny dots. (ImageNet: http://www.image-net.org/challenges/LSVRC/2010. Code is available for [3] at http://www.seas.upenn.edu/∼timothee and for [14] at http://www.cs.washington.edu/homes/bcr/projects/multseg discovery/.)

Conclusion
In this paper, we proved that the temperature of a system under linear anisotropic diffusion is submodular. Based on this finding, we designed a constant-factor greedy solution to temperature maximization with limited sources. Our theoretical results were successfully applied to diversity ranking, single-image segmentation, and scalable cosegmentation.

Figure 1 .
Figure 1. Two toy examples of diversity ranking. The data points are randomly generated from three Gaussian distributions in (a) and three concentric circles in (g). In (b)-(e) and (h)-(k), the marginal temperature gain of each point, U(S ∪ {x}) − U(S), is shown along the z-axis. The s_k (∈ S) are iteratively selected by solving Eq. (5). Once a point is selected, the marginal gains of its neighbors drop sharply because they already have high temperatures. In (f) and (l), the final three clusters are shown. The clustering from S is discussed in Algorithm 1.

Figure 2 .
Figure 2. An example of segmenting a single image. (a) An input image. (b) 1000 superpixels and colored evaluation locations L. (c) Image segmentation with red boundaries. (d)-(g) Color-coded segmentation outputs for K ranging from 2 to 8. As K increases, the following regions are detected in turn: {sky, tree, wall (center), roof (left), windows (left), building (left), and trash container}.

Algorithm 1 :
CoSand Cosegmentation. Input: (1) Intra-image matrix G_i for all I_i ∈ I. (2) Number of segments K. (3) Evaluation set size |L|. Output: Cluster centers S_i and segmented images for I_i ∈ I.

Figure 4 .
Figure 4. Four cosegmentation examples on the MSRC dataset. (a) Pairs of input images. (b) Our cosegmentation results with K=8. The cosegmented pairs are shown in the same colors. Some segments are too small to be visible. (c) Figure-ground segmentation results induced from the eight pairs of cosegments.

Figure 5 .
Figure 5. Examples of scalable cosegmentation on the ImageNet dataset. (a) Decomposition of the Gorilla synset by the proposed diversity ranking and clustering. Three cluster centers and their three closest images are shown. (b) Examples of cosegmentation on green lizard, siamang, ferret, and nymphalid butterfly. In each set, 20∼60 images are simultaneously cosegmented and five selected images are shown.

Table 3 .
Accuracies of scalable cosegmentation tests for 13 selected synsets from the ImageNet dataset.