Extracting Scale and Illuminant Invariant Regions through Color

Despite the fact that color is a powerful cue in object recognition, the extraction of scale-invariant interest regions from color images frequently begins with a conversion of the image to grayscale. The isolation of interest points is then completely determined by luminance, and the use of color is deferred to the stage of descriptor formation. This seemingly innocuous conversion to grayscale is known to suppress saliency and can lead to representative regions being undetected by procedures based only on luminance. Furthermore, grayscaled images of the same scene under even slightly different illuminants can appear sufficiently different as to affect the repeatability of detections across images. We propose a method that combines information from the color channels to drive the detection of scale-invariant keypoints. By factoring out the local effect of the illuminant using an expressive linear model, we demonstrate robustness to a change in the illuminant without having to estimate its properties from the image. Results are shown on challenging images from two commonly used color constancy datasets.


Introduction
The representation of objects and scenes by sparse local patches has proven to be well suited to the practical challenges of occlusion, clutter and variation in viewing conditions. Approaches employing this representation [14] typically start with the detection of a set of distinguished points from the image. These "interest" points are often chosen as the extrema of functions of the input specifically constructed to have some desirable properties, e.g. high entropy, stability and invariance to geometric transformations. The detection step is followed by the computation of a descriptor at each interest point, possibly after the patch surrounding the point at its appropriate scale has been suitably normalized [13].
When the input is a color image, a common preprocessing step is the conversion of the image to grayscale. This conversion is convenient for several reasons. Firstly, by moving from our original vector-valued image to a scalar-valued image we make the available theory [11] and its proven guarantees of scale-invariant behavior directly applicable to our data. Secondly, it is well recognized that while color is a useful cue, its reliable use as a feature is hampered by various practical difficulties. The measured pixel values in a scene are influenced not just by the spectral reflectivity at that point, but also by the material properties, the sensitivities of the camera sensors and the spectral profile of the illuminant. Without additional information, inference of the "true" color at a pixel (defined, say, as seen with a reference illuminant) is often under-determined and necessitates modeling assumptions on a subset of these unknowns [5]. Reverting to grayscale and postponing the use of color to when computing the descriptor [18] is an attractive alternative. However, the conversion to grayscale has a number of side-effects that are particularly undesirable for reliable interest point detection. It is well recognized in neurobiology and computer graphics that grayscaled versions of color images do not preserve chromatic saliency [8,12]. Regions that exhibit chromatic variation often lose their distinctiveness when naively mapped to scalars based on isoluminance. Figure 1 (top row) shows an example of a Laplacian of Gaussian (LOG) based interest point detector applied to a grayscaled image from [8] with added Gaussian noise. It can be seen that the luminance values in the grayscale image do not exhibit the original differences in chrominance between many of the disks and the gray background as in the original. This causes the LOG-based detector to not detect significant stable extrema at several locations.
A related effect occurs due to change in illuminant. Using images of the cruncheroos scene under a white Sylvania 75W bulb (halogen) and fluorescent Philips Ultralume (ph-ulm) tube from the Simon Fraser University (SFU) dataset [5], we can compute the linear transform that best numerically approximates the mapping of colors from halogen to ph-ulm. These two illuminants are common to indoor scenes and the only perceivable visual difference between the two is in the form of a mild blue tone in the latter. We may then render the image to appear as it would under the ph-ulm illuminant in Figure 1 (bottom row). Note the difference between the grayscaled version of this image and that of the original. It can also be seen that the LOG-based detector picks a larger subset of keypoints from the content of the picture.
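The linear transform mentioned above can be estimated by ordinary least squares. The following is a minimal sketch, assuming NumPy, pixel-aligned images of the same scene under the two illuminants, and Rec. 601 luma weights for the grayscale conversion. The function names and the synthetic data are illustrative only and are not taken from the paper or the SFU dataset.

```python
import numpy as np

def fit_illuminant_map(src_rgb, dst_rgb):
    """Least-squares 3x3 matrix A such that dst ~= A @ src at every pixel."""
    X = src_rgb.reshape(-1, 3).astype(float)
    Y = dst_rgb.reshape(-1, 3).astype(float)
    # Solve X @ M ~= Y in the least-squares sense; then A = M.T.
    M, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return M.T

def apply_illuminant_map(A, rgb):
    """Apply the 3x3 color transform A to every pixel of an (H, W, 3) image."""
    return (rgb.reshape(-1, 3) @ A.T).reshape(rgb.shape)

def to_gray(rgb, weights=(0.299, 0.587, 0.114)):
    """Grayscale conversion with assumed Rec. 601 luma weights."""
    return rgb @ np.asarray(weights)

# Synthetic stand-in for a pair of registered images (not real data).
rng = np.random.default_rng(0)
halogen = rng.uniform(0.05, 1.0, size=(32, 32, 3))
A_true = np.array([[0.90, 0.05, 0.00],
                   [0.03, 0.95, 0.04],
                   [0.00, 0.08, 1.20]])   # hypothetical mild blue shift
phulm = apply_illuminant_map(A_true, halogen)

A_fit = fit_illuminant_map(halogen, phulm)
rendered = apply_illuminant_map(A_fit, halogen)

# The grayscale image changes even though the scene content does not.
gray_shift = np.abs(to_gray(rendered) - to_gray(halogen)).mean()
```

On real image pairs the fit is only approximate, and the residual indicates how well a single global linear map explains the change in illuminant.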
The above two observations indicate that the use of a grayscale intermediary to guide the selection of keypoints poses two kinds of risks. First, by attenuating differences in chrominance in favor of preserving luminance across the image, it can restrict the representative power of extracted regions. Secondly, by its sensitivity to the global illuminant, it can inconsistently accentuate parts of the scene leading to lowered repeatability of the detected keypoints. This paper addresses the question of how to appropriately combine information from the color channels to detect scale-invariant features in a manner that is robust to change in illuminant.

Approach
Our proposed method first constructs a one-parameter family of functions on the color image analogous to the scale-space representation of scalar-valued images [11]. The function at each level is designed to be a relative invariant to the transformation group approximating the effect of the illuminant. The extrema of this scale-space representation along with their appropriately sized neighborhoods are chosen as the interest regions.
In Section 2.1 we look at two possible choices of linear transformation groups to approximate the effect of the illuminant. Section 2.2 then constructs normalized differential invariants that are relative invariants to these groups and are subsequently used to form a linear scale-space representation of the color image.

Choice of Illumination Model
We choose to numerically approximate the effect of the illuminant as a linear transformation of the pixel RGB values. Finlayson [4] proposed that under this model, a diagonal transform is sufficient to discount the illuminant if a pre-computed transformation matrix, tuned to a particular illuminant pair, is first applied to the sensor outputs.
The diagonal model is sometimes also defended with Shafer's dichromatic reflectance model [3,17] under the weaker assumption of sensor responses behaving as delta functions. Consider the "Mondrian" assumption of Lambertian surfaces illuminated by a single light source. The reading $c_i(x)$ recorded by sensor $i$ at location $x$ may be expressed as:
$$c_i(x) = m_b(x) \int b(\lambda, x)\, e(\lambda)\, s_i(\lambda)\, d\lambda \qquad (1)$$
where $m_b(x)$ models the effect of the illuminant geometry on surface reflectance (shadows, incident angle), $b(\lambda, x)$ denotes the surface reflectance, $e(\lambda)$ is the spectral profile of the illuminant and $s_i(\lambda)$ is the spectral sensitivity of the $i$-th sensor. If the sensors $s_i(\lambda)$ behave as delta functions $\delta(\lambda - \lambda_i)$, the sensor response $c'_i(x)$ to a new illuminant $e'(\lambda)$ can then be trivially related by a scale factor as $c'_i(x) = \alpha_i c_i(x)$ with $\alpha_i = e'(\lambda_i)/e(\lambda_i)$, making the diagonal model valid exactly. Unfortunately, most consumer cameras are nowhere near obeying this assumption [1,19]. Several researchers [6] have subsequently noted that a diagonal matrix model is often insufficient to explain observed data.
We may instead adopt a non-diagonal illumination model for the scene-independent terms $e(\lambda) s_i(\lambda)$ in Eqn. (1) as:
$$e'(\lambda)\, s_i(\lambda) = \sum_{j=1}^{3} \alpha_{ij}\, e(\lambda)\, s_j(\lambda), \qquad i = 1, 2, 3.$$
The quality of the diagonal approximation then depends on how well the function $e'(\lambda) s_i(\lambda)$ can be numerically approximated by $\alpha_i e(\lambda) s_i(\lambda)$ for the best possible choice of scaling factor $\alpha_i$ for all channels $i = 1, 2, 3$.
It is easy to verify that for real sensors and the choice of $L_2$ cost function, the optimal $\alpha_{ij}$ values reduce to zero for $i \neq j$ only if $s_i(\lambda) s_j(\lambda) = 0$ for all $\lambda$. That is, the diagonal model is $L_2$-optimal if the sensor responses are non-overlapping. This is in agreement with the principle behind spectral sharpening [1], and the observation that narrow-band sensors tend to obey the diagonal model better since their profiles are "closer" to delta functions.
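One direction of this statement can be sketched from the normal equations of the least-squares fit. The inner-product notation $\langle u, v \rangle = \int u(\lambda) v(\lambda)\, d\lambda$ and the cost $E_i$ are introduced here for illustration only, and the sketch assumes that each $e(\lambda) s_k(\lambda)$ is not identically zero.

```latex
\begin{aligned}
E_i &= \int \Big( e'(\lambda)\, s_i(\lambda)
        - \sum_{j=1}^{3} \alpha_{ij}\, e(\lambda)\, s_j(\lambda) \Big)^{2} d\lambda, \\
\frac{\partial E_i}{\partial \alpha_{ik}} = 0
  \;&\Longrightarrow\;
  \sum_{j=1}^{3} \alpha_{ij}\, \langle e s_j,\, e s_k \rangle
  = \langle e' s_i,\, e s_k \rangle, \qquad k = 1, 2, 3, \\
s_j(\lambda)\, s_k(\lambda) = 0 \;\; \forall \lambda,\; j \neq k
  \;&\Longrightarrow\;
  \langle e s_j,\, e s_k \rangle = 0 \;\text{ and }\;
  \langle e' s_i,\, e s_k \rangle = 0 \;\; (k \neq i), \\
  \;&\Longrightarrow\;
  \alpha_{ik} = 0 \;\; (k \neq i), \qquad
  \alpha_{ii} = \frac{\langle e' s_i,\, e s_i \rangle}{\langle e s_i,\, e s_i \rangle}.
\end{aligned}
```

With non-overlapping sensors the Gram matrix of the functions $e s_j$ is diagonal, so each channel decouples and the best full model coincides with the best diagonal model.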
To accommodate the common scenario of overlap in spectral response functions, we will consider both the full as well as the diagonal 3 × 3 matrix models. Hence, the illuminant dependent transformation is modeled as
$$c'(x) = A\, c(x)$$
where $c'(x)$ and $c(x)$ are the 3-vector responses at pixel $x$ under two illuminants, and A ∈ GL(3, R), the set of real non-singular 3 × 3 matrices, or A ∈ GD(3, R), the set of real non-singular diagonal 3 × 3 matrices.

Constructing Relative Invariants
Having fixed the transformation groups GL(3, R) and GD(3, R) to model the effect of a change in illuminant on pixel value, we now proceed to construct a 1-parameter scalar function of the input image that is invariant to this transformation and exhibits a semi-group property with respect to this parameter. We will then compute a stack of scalar functions of the image corresponding to increasing values of this parameter to form a scale-space representation of the image. The extrema in this 3-D representation will then constitute our desired result of a set of interest points and their scales.
Our procedure to construct a relative differential invariant is similar to that of [15], although that work focused on semi-differential invariants. We will denote the image $f(x)$ by a 3-tuple of functions as $f(x) = [\, r(x) \;\; g(x) \;\; b(x) \,]^T$ denoting the three color channels.
We will work with the larger group GL(3, R) from which construction of invariants to the subgroup GD(3, R) will be self-evident. We define a relative invariant to the transformation group GL(3, R) as a real-valued function $h : f(x) \to \mathbb{R}$ for which there exists a weight function $\xi$ such that
$$h(Af) = \xi(A)\, h(f)$$
where A ∈ GL(3, R). In the special case of $\xi = 1$, $h(f)$ is called an absolute invariant.
From the definition of the group action, we know that $h((AB)f) = h(A(Bf))$ for all A, B ∈ GL(3, R). This implies that for non-trivial invariants ($h \neq 0$), this relation will hold only if the weight function satisfies the property $\xi(AB) = \xi(A)\xi(B)$ for all A, B ∈ GL(3, R). It is a standard result in Lie group theory [15] that for the group GL(3, R), the weight function must take the form
$$\xi(A) = \det(A)^{\beta}$$
for some $\beta \in \mathbb{R}$. We will consider the case of $\beta = 1$ since invariants for other values may be reduced to this case.
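The multiplicative property of the weight can be spelled out in two lines. The following worked step is a sketch that applies the definition of a relative invariant twice, and then notes that a weight of determinant-power form is consistent with it wherever the power is defined.

```latex
\begin{aligned}
h\big((AB) f\big) &= \xi(AB)\, h(f), \\
h\big(A (B f)\big) &= \xi(A)\, h(B f) = \xi(A)\, \xi(B)\, h(f), \\
\Longrightarrow \quad \xi(AB) &= \xi(A)\, \xi(B) \quad \text{whenever } h(f) \neq 0, \\
\det(AB)^{\beta} &= \big( \det(A) \det(B) \big)^{\beta} = \det(A)^{\beta} \det(B)^{\beta}.
\end{aligned}
```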

Case of 1-D signals
We first analyze the case of a 1-D signal, i.e. $f(x) : \mathbb{R} \to \mathbb{R}^3$. Consider the function
$$h(f) = \det \begin{bmatrix} r & r_x & r_{xx} \\ g & g_x & g_{xx} \\ b & b_x & b_{xx} \end{bmatrix} \qquad (6)$$
where the subscripts denote first and second order derivatives of the signals. By the linearity of the differentiation operator, the effect of a change in illuminant modeled by a group element A ∈ GL(3, R) can be expressed as
$$h(Af) = \det\!\big( A \begin{bmatrix} f & f_x & f_{xx} \end{bmatrix} \big) = \det(A)\, h(f) \qquad (7)$$
using the property of determinants. Hence by our definition in Eqn. (5), $h(f)$ is a relative invariant to the group GL(3, R). The assumption of non-singularity of the transformation matrices avoids reduction to the trivial case of $h(Af) = 0$. Physically this corresponds to the absence of degenerate lighting such as monochromatic light sources that would elicit sensor response only in a small part of the spectrum.
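The relative invariance can be checked numerically on a sampled signal. The sketch below assumes NumPy and takes $h(f)$ to be the determinant of the 3 × 3 matrix whose columns are $f$, $f_x$ and $f_{xx}$, with derivatives approximated by finite differences. The function name `h_1d`, the test signal and the matrix `A` are hypothetical choices made for illustration.

```python
import numpy as np

def h_1d(f, dx=1.0):
    """Determinant of [f, f_x, f_xx] at every sample.

    f has shape (3, N) with rows r, g, b. Derivatives are finite
    differences, which are linear in f just like true derivatives.
    """
    f_x = np.gradient(f, dx, axis=1)
    f_xx = np.gradient(f_x, dx, axis=1)
    # Shape (N, 3, 3): one matrix per sample, columns are derivative orders.
    M = np.moveaxis(np.stack([f, f_x, f_xx], axis=-1), 1, 0)
    return np.linalg.det(M)

x = np.linspace(0.0, 2.0 * np.pi, 200)
f = np.stack([np.sin(x) + 2.0,
              np.cos(2.0 * x) + 2.0,
              np.sin(3.0 * x + 1.0) + 2.0])

# Hypothetical non-singular illuminant transform (det = 0.91).
A = np.array([[1.0, 0.2, 0.1],
              [0.1, 0.9, 0.3],
              [0.0, 0.2, 1.1]])

lhs = h_1d(A @ f, dx=x[1] - x[0])
rhs = np.linalg.det(A) * h_1d(f, dx=x[1] - x[0])
```

Because the finite-difference operator is linear, `lhs` and `rhs` agree up to floating-point error, so the locations of the extrema of `h_1d` do not depend on `A`.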
In practice, differentiation is not a well-posed operation in the Hadamard sense, and we are forced to look for locally regularized alternatives for computing Eqn. (6). The pioneering work by Iijima [9] followed by Witkin [20] and others introduced the idea of constructing regularized multi-resolution representations of a signal through a one-parameter family of functions. The functions were obtained by convolution with Gaussian kernels $G(x, \sigma)$ of increasing width parameter $\sigma$, also termed the scale. Later work by Lindeberg [11] showed that by a clever change of variables, one can obtain normalized derivatives whose extrema positions are covariant to spatial scaling of the signal.
Following [11], we normalize the terms of the relative invariant in Eqn. (6) to yield
$$h(f, \sigma^2) = \det \begin{bmatrix} G * r & \sigma\, G_x * r & \sigma^2\, G_{xx} * r \\ G * g & \sigma\, G_x * g & \sigma^2\, G_{xx} * g \\ G * b & \sigma\, G_x * b & \sigma^2\, G_{xx} * b \end{bmatrix} \qquad (8)$$
where $G \equiv G(x, \sigma^2)$, the '$*$' denotes convolution and the subscripts denote the appropriate order of Gaussian derivative. Combined with Eqn. (7), the scale invariance property of each term in the 3rd order polynomial formed by the determinant in Eqn. (8) implies the property
$$h(f', \sigma^2)\big|_{x} = \det(A)\, h(f, \gamma^2 \sigma^2)\big|_{\gamma x}, \qquad f'(x) = A f(\gamma x),$$
for a spatial scaling factor $\gamma \in \mathbb{R}$ and A ∈ GL(3, R). Hence the set of 2-D extrema of the function $h(f, \sigma^2)$ in $x$ and $\sigma$ is invariant to illumination change and relative invariant to spatial scaling of the signal. Note that because the effect of the illuminant is factored out in the invariant through the scalar $\det(A)$, the detection of extrema does not require the actual estimation of the matrix $A$. Also, the effect of the illuminant through $A$ is treated locally through the particular value of scale. Hence we may expect that the detection of extrema will be robust to gradual changes in the value of $A$ across the image.
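A regularized version of the determinant can be sketched with sampled Gaussian derivative kernels. The code below is a minimal illustration, assuming NumPy, sampled (not discrete-analogue) Gaussian kernels truncated at four standard deviations, and derivative terms of order $m$ weighted by $\sigma^m$ as the assumed form of the normalization. All names and the test signal are hypothetical. Only the illuminant property is checked; exact scale covariance holds in the continuous setting and is only approximate on a sampled grid.

```python
import numpy as np

def gaussian_kernels(sigma, truncate=4.0):
    """Sampled Gaussian and its first two derivatives."""
    radius = int(truncate * sigma + 0.5)
    t = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-t**2 / (2.0 * sigma**2))
    g /= g.sum()
    g_x = -t / sigma**2 * g
    g_xx = (t**2 / sigma**4 - 1.0 / sigma**2) * g
    return g, g_x, g_xx

def h_normalized(f, sigma):
    """Scale-normalized determinant for a 1-D color signal f of shape (3, N)."""
    g, g_x, g_xx = gaussian_kernels(sigma)

    def filt(kernel):
        return np.stack([np.convolve(ch, kernel, mode="same") for ch in f])

    # Columns: zero-th, first and second order terms, weighted by sigma^m.
    cols = [filt(g), sigma * filt(g_x), sigma**2 * filt(g_xx)]
    M = np.moveaxis(np.stack(cols, axis=-1), 1, 0)   # shape (N, 3, 3)
    return np.linalg.det(M)

x = np.arange(512, dtype=float)
f = np.stack([np.exp(-(x - 100.0)**2 / (2 * 6.0**2)),
              np.exp(-(x - 140.0)**2 / (2 * 10.0**2)),
              np.exp(-(x - 200.0)**2 / (2 * 15.0**2)) + 0.1])

A = np.array([[1.0, 0.2, 0.1],
              [0.1, 0.9, 0.3],
              [0.0, 0.2, 1.1]])   # hypothetical illuminant transform

sigmas = [2.0, 4.0, 8.0]
stack = np.stack([h_normalized(f, s) for s in sigmas])
stack_A = np.stack([h_normalized(A @ f, s) for s in sigmas])
```

Since every entry of the matrix is a linear filter response, `stack_A` equals `det(A) * stack` up to rounding, so extrema over position and scale are found without estimating `A`.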

Case of 2-D signals
For 2-D images, we desire the additional property of invariance to spatial rotation. This is done by choosing the appropriate rotationally invariant 2-D form of the terms in Eqn. (8). We henceforth use $G(x, \sigma^2)$ to denote a circular 2-D Gaussian.
The zero-th order term $G(x, \sigma^2) * r(x)$ requires no modification as it is already rotationally invariant. Also the second order terms in Eqn. (8) can be easily corrected by replacing $G_{xx}$ with the rotationally symmetric Laplacian of Gaussian $\nabla^2 G(x, \sigma^2) = G_{xx}(x, \sigma^2) + G_{yy}(x, \sigma^2)$.
The first order derivative terms, however, cannot be corrected easily. Because the Gaussian function is radially symmetric, it is not possible to generate a linear combination of odd-order spatial derivatives that is invariant to rotation. Some remedies include (a) using a different family of scale-space generators such as the Poisson kernel [16] and its derivatives to replace the odd-order Gaussian derivatives, and (b) using regularized higher order even derivatives of $G(x, \sigma^2)$. However, the former requires more computation and the latter risks susceptibility to noise.
We opt to compute a local invariant frame $\{u, v\}$ at each pixel with its $u$ axis aligned with the direction of maximal norm in intensity change in all 3 channels. The value of $G_x(x, \sigma^2) * r(x)$ is then replaced by $G_u(x, \sigma^2) * r(x)$. This is a computationally efficient alternative but with the drawback that it is only invariant to a subgroup of GL(3, R). However, we show experimentally that it works well over a broad range of illuminants.
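One way to realize such a local frame is sketched below. It assumes NumPy and interprets "direction of maximal norm in intensity change" as the unit vector $u$ maximizing the summed squared directional derivative over the three channels, i.e. the principal eigenvector of the 2 × 2 matrix accumulated from the per-channel Gaussian gradients. This interpretation, the function names and the test image are assumptions made for illustration, and the sign of $u$ is left ambiguous (the angle is defined modulo π).

```python
import numpy as np

def gauss_kernels(sigma, truncate=4.0):
    """Sampled 1-D Gaussian and its first derivative."""
    radius = int(truncate * sigma + 0.5)
    t = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-t**2 / (2.0 * sigma**2))
    g /= g.sum()
    return g, -t / sigma**2 * g

def conv_axis(img, kernel, axis):
    """Convolve every 1-D slice of img along the given axis (zero padded)."""
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), axis, img)

def local_frame_derivative(img, sigma):
    """Return (theta, f_u, f_x, f_y) for an (H, W, 3) image.

    theta is the angle of the u axis; f_u is the first-order Gaussian
    derivative of each channel taken along u.
    """
    g, g_d = gauss_kernels(sigma)
    f_x = conv_axis(conv_axis(img, g_d, 1), g, 0)   # derivative in x, smooth in y
    f_y = conv_axis(conv_axis(img, g_d, 0), g, 1)   # derivative in y, smooth in x
    s_xx = (f_x * f_x).sum(axis=-1)
    s_yy = (f_y * f_y).sum(axis=-1)
    s_xy = (f_x * f_y).sum(axis=-1)
    # Orientation of the principal eigenvector of [[s_xx, s_xy], [s_xy, s_yy]].
    theta = 0.5 * np.arctan2(2.0 * s_xy, s_xx - s_yy)
    f_u = np.cos(theta)[..., None] * f_x + np.sin(theta)[..., None] * f_y
    return theta, f_u, f_x, f_y

# Test image varying only along x, with different gain per channel.
H, W = 40, 40
ramp = np.arange(W, dtype=float) / W
gains = np.array([1.0, 0.5, 0.25])
img = np.ones((H, 1, 1)) * ramp[None, :, None] * gains[None, None, :]

theta, f_u, f_x, f_y = local_frame_derivative(img, sigma=2.0)
interior = (slice(12, -12), slice(12, -12))
```

For this test image the $u$ axis aligns with the $x$ axis away from the borders, so `f_u` reduces to `f_x` in the interior.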
Another alternative is to settle for the diagonal illumination model which, as outlined in Section 2.1, will work well for narrow-band sensors. This reduction in the number of model parameters vastly increases the space of available relative invariants. We will consider the invariant $h_{\mathrm{diag}}$ formed by a simple modification to the LOG detector.
The final algorithm then proceeds as follows, in a manner similar to traditional interest point detection:
1. Construct pyramids of Gaussian blurred and 1st order derivatives of the color image following the procedure of [13].
2. Compute the invariant $h_{\mathrm{full}}$ (or $h_{\mathrm{diag}}$) using the image channels and their derivatives at each pyramid level.
3. Find the extrema in the invariant scale-space pyramid and their corresponding scales.
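The last step above can be sketched as a search for strict local extrema in a 3 × 3 × 3 neighborhood of a scale-space stack. The code assumes NumPy and a stack sampled on a common pixel grid at every scale, which is a simplification of the pyramid construction of [13]. The function name and the `threshold` parameter are hypothetical.

```python
import numpy as np

def scale_space_extrema(stack, threshold=0.0):
    """Strict local maxima and minima of a (S, H, W) stack.

    Each interior sample is compared against its 26 neighbors in scale
    and space. Returns a list of (scale_index, row, col) tuples.
    """
    S, H, W = stack.shape
    core = stack[1:-1, 1:-1, 1:-1]
    is_max = np.ones(core.shape, dtype=bool)
    is_min = np.ones(core.shape, dtype=bool)
    for ds in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if ds == 0 and dy == 0 and dx == 0:
                    continue
                nb = stack[1 + ds:S - 1 + ds,
                           1 + dy:H - 1 + dy,
                           1 + dx:W - 1 + dx]
                is_max &= core > nb
                is_min &= core < nb
    mask = (is_max | is_min) & (np.abs(core) > threshold)
    # Shift indices back to the coordinates of the full stack.
    return [(int(s) + 1, int(y) + 1, int(x) + 1)
            for s, y, x in np.argwhere(mask)]

# Toy stack with one positive and one negative isolated extremum.
demo = np.zeros((5, 9, 9))
demo[2, 4, 4] = 1.0
demo[3, 6, 2] = -2.0
found = scale_space_extrema(demo)
```

Both maxima and minima are kept because the sign of the invariant flips with the sign of $\det(A)$, while the threshold on the magnitude discards weak responses.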

Experiments
In this section we evaluate the repeatability of the detected regions across illuminants. We use images from two online color constancy datasets from SFU. The older dataset, which we label objects98, is associated with work in [5] and contains static household objects under 5 different illuminants. The second, labeled mondrian, is more recent [2] and contains scenes captured under as many as 11 illuminants. Images in both datasets were taken with a camera having narrow-band sensors [19].
We adopt the error metric of area repeatability used in the survey of [14] that compared affine-invariant detectors. The repeatability score between a pair of images is defined as the ratio of the number of pairs of interest points matched between the images to the minimum of the number of detected points in the pair. Two detected keypoints are determined to overlap if the ratio of their area of intersection to their union exceeds 60%. Figure 2 shows an example of matched interest points between the reference tide image in objects98 and rotated images under the harsh blue mb-5000+3202 illuminant. The top and bottom rows show keypoints detected using the full and diagonal illuminant model respectively. For each object in objects98, we pick its image under the halogen illuminant as the fixed reference image and record its pairwise repeatability rate with test images taken under the other illuminants. We also evaluate repeatability when the test images are transformed by in-plane rotation (Figure 4) and spatial scaling (Figure 5). Due to space limitations, we show only a subset of the objects for two test illuminants for only the full invariant $h_{\mathrm{full}}$. It may be seen from the plots that the detection rate is quite stable to geometric transformations.
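The repeatability score described above can be sketched as follows. The code assumes that each interest region is a circle given as `(x, y, radius)`, that the test regions have already been mapped into the coordinate frame of the reference image, and that matches are assigned greedily one-to-one. These assumptions and the function names are illustrative and are not specified in the text.

```python
import math

def circle_iou(c1, c2):
    """Ratio of intersection area to union area of two circles (x, y, r)."""
    (x1, y1, r1), (x2, y2, r2) = c1, c2
    d = math.hypot(x2 - x1, y2 - y1)
    if d >= r1 + r2:                      # disjoint circles
        inter = 0.0
    elif d <= abs(r1 - r2):               # one circle inside the other
        inter = math.pi * min(r1, r2) ** 2
    else:                                 # partial overlap (lens area)
        a1 = math.acos((d * d + r1 * r1 - r2 * r2) / (2.0 * d * r1))
        a2 = math.acos((d * d + r2 * r2 - r1 * r1) / (2.0 * d * r2))
        tri = 0.5 * math.sqrt((-d + r1 + r2) * (d + r1 - r2)
                              * (d - r1 + r2) * (d + r1 + r2))
        inter = r1 * r1 * a1 + r2 * r2 * a2 - tri
    union = math.pi * (r1 * r1 + r2 * r2) - inter
    return inter / union

def repeatability(ref_regions, test_regions, min_overlap=0.6):
    """Matched pairs divided by the smaller number of detections."""
    if not ref_regions or not test_regions:
        return 0.0
    unused = set(range(len(test_regions)))
    matches = 0
    for ref in ref_regions:
        best, best_iou = None, min_overlap
        for j in unused:
            iou = circle_iou(ref, test_regions[j])
            if iou > best_iou:
                best, best_iou = j, iou
        if best is not None:              # each test region is used once
            unused.remove(best)
            matches += 1
    return matches / min(len(ref_regions), len(test_regions))

# Made-up regions for illustration only.
ref = [(10.0, 10.0, 5.0), (30.0, 30.0, 4.0)]
test = [(10.5, 10.0, 5.0), (50.0, 50.0, 4.0), (80.0, 80.0, 3.0)]
score = repeatability(ref, test)
```

In this toy example only the first reference region finds a partner with more than 60% overlap, so the score is one match over the two reference detections.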
Figure 3 shows a scatter plot of repeatability rates for images from the mondrian dataset. Each point in the plot represents the repeatability rate using a luminance-based LOG detector and that with the full ($h_{\mathrm{full}}$) or diagonal ($h_{\mathrm{diag}}$) invariants for the indicated test illuminant. It can be seen that both invariants tend to either appreciably increase the repeatability rate or leave it relatively unchanged.
We consider the null hypothesis that the median difference in repeatability between the LOG and $h_{\mathrm{full}}$ / $h_{\mathrm{diag}}$ detectors is zero. The Wilcoxon two-sided paired sign test convincingly rejects the null hypothesis for the $h_{\mathrm{full}}$ detector with a p-value of $1.21 \times 10^{-5}$ at a 5% significance level, and similarly rejects the same for the $h_{\mathrm{diag}}$ detector with a p-value of $1.73 \times 10^{-11}$. Thus the use of illuminant invariants has a statistically significant and favorable influence on the repeatability rate.
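A paired test of this kind can be sketched with the Python standard library alone. The code below implements the plain exact two-sided sign test on paired scores, which is an assumption: the procedure actually used for the reported p-values may differ (for instance, a Wilcoxon signed-rank test would also use the magnitudes of the differences). The function name and the scores are made up for illustration and are not results from the paper.

```python
from math import comb

def sign_test_two_sided(x, y):
    """Exact two-sided sign test for paired samples x and y.

    Ties (zero differences) are discarded. Under the null hypothesis the
    number of positive differences is Binomial(n, 1/2).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0
    k = sum(d > 0 for d in diffs)
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Made-up repeatability scores for ten image pairs (illustration only).
invariant_scores = [0.9] * 10
luminance_scores = [0.5] * 10
p_value = sign_test_two_sided(invariant_scores, luminance_scores)
```

With all ten differences positive the p-value is 2 / 2^10, so the null hypothesis of a zero median difference would be rejected at the 5% level for these toy scores.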

Related Work
The goal of scale and illuminant invariance from color images relates to two kinds of endeavors in the literature, and to the authors' knowledge they have only been pursued independently. The first is that of the extraction of scale invariant features from scalar images [11], popularized by [13] and recently surveyed in [14] for the affine invariance setting. The detectors were designed to be at most invariant to affine transformations of the luminance values. Recent work in [18] independently formed color invariants and concatenated them to the chosen descriptor only after the initial detection of keypoints from luminance.
The second is that of color constancy and extracting features that are illumination invariant. The main challenge here is how to correctly combine information from each channel of a vector-valued image. Work in [7] constructed color invariants using the Kubelka-Munk model and was based on approximating the Gaussian function and its derivatives by linear combinations of implicitly known sensor profiles. Lenz et al. [10] constructed semi-differential invariants for 2 × 2 transformations of color values, but only for a fixed pair of points having the same material properties. An early paper by Di Zenzo [21] proposed a modified structure tensor, termed the color tensor, that has been adapted in [3,17] and others for detecting edges and corners from an illuminant invariant representation of the image. However, the non-linearities normally associated with the formation of these invariants do not preserve the relationships required for a linear scale-space representation. The target illuminant invariance is also traditionally restricted to the diagonal model for simplicity [3,18] or to a rigid rotation of the color coordinate axis [17].
In contrast to the above work, this paper focused on the use and construction of joint invariants that are covariant with spatial scaling while being robust to the traditional GD(3, R) and the larger GL(3, R) group of transformations.

Conclusions
The use of grayscale to represent a color image can have adverse effects depending on the distribution of colors in the image. While scale invariant detectors have reached a sufficient level of maturity, the next step, as also concluded in [14], consists of generalizing them to other representations such as RGB.
In this paper, we proposed to complement the existing class of interest region detectors by using an alternate intermediate representation. The constructed invariants enjoy robustness to a larger class of illuminant transformations than is traditionally addressed, while retaining the highly desirable property of scale invariance.
Qualitatively we have observed that the detector using the full model finds fewer keypoints than its diagonal model counterpart. By our choice of $h_{\mathrm{diag}}$, the latter also yields many keypoints at locations common to those found using the traditional LOG detector. As seen in Figure 3, the influence of using a full invariant is either large, particularly in scenes where the repeatability with the traditional luminance-based LOG detector is low, or negligible. We hypothesize that invariants to the larger GL(3, R) group tend to form smoother functions with extrema that are not as well localized in some scenes as extrema of invariants to the diagonal subgroup. Hence, an appropriate practical strategy would be to combine the detector outputs using both the full and diagonal models. Future directions include a more rigorous characterization of possible joint invariants, experiments using images acquired with broad-band sensors, as well as the design of better metrics to compare detectors in a manner that is independent of the scheme for threshold selection.

Figure 1: Example of the effect of grayscale conversion on a synthetic color image under two perceptually similar illuminants. Note the repeatability of detected scale-invariant LOG features (circled). This figure is best seen in color.

Figure 2: Interest points matched between (a) the reference image under the halogen lamp and (b-d) images under the mb-5000+3202 illuminant (Macbeth 5000 tube with Roscolux #3202 full blue filter), rotated by 0°, 45° and 90°. Top row uses the full invariant $h_{\mathrm{full}}$ and bottom row uses the diagonal invariant $h_{\mathrm{diag}}$. Presented images have been contrast-enhanced for clarity.

Figure 3: Scatter plot of LOG repeatability against the (a) full and (b) diagonal matrix illumination models between halogen and 10 other illuminants.

Figure 4: Variation in area repeatability with in-plane rotation between halogen and the ph-ulm and syl-cwf illuminants for the full matrix illumination model.

Figure 5: Variation in area repeatability with spatial scaling between halogen and the ph-ulm and syl-cwf illuminants for the full matrix illumination model.