Invisible Geolocation Signature Extraction From a Single Image

Geotagging images of interest is increasingly important to law enforcement, national security, and journalism. Today, many images do not carry location tags that are trustworthy and resilient to tampering, and landmark-based visual clues may not be present in every image, especially in those taken indoors. In this paper, we exploit an environmental signature from the power grid, the electric network frequency (ENF) signal, which can be inherently captured in a sensing stream at the time of recording and carries useful time-location information. Compared with recent work on extracting ENF traces from audio and video recordings, extracting an ENF trace from a single image is very challenging. We address this challenge by first mathematically examining the impact of the ENF embedding steps, such as electricity-to-light conversion, scene-geometry dilution of radiation, and image sensing. We then incorporate the verified parametric models of the physical embedding process into our proposed entropy minimization method. The optimized results of the entropy minimization are used to create a two-level ENF presence-classification test for region-of-capture localization: it identifies whether a single image has an ENF trace and, if so, whether it is at 50 or 60 Hz. We quantitatively study the relationship between the ENF strength and its detectability from a single image. This paper is the first comprehensive work to bring out a unique forensic capability of environmental traces that sheds light on an image's capturing location.


I. INTRODUCTION
This digital era has witnessed an unprecedented amount of digital audio, images, and video generated daily. Digital media contain many metadata fields, one of which can record the capture location using the device's built-in GPS receiver, information that is valuable to law enforcement, national security, and journalism. This piece of metadata, however, is not always available and can easily be tampered with. In this paper, we focus on deriving from a single image information about the location at which it was captured, especially in challenging scenarios that existing technologies cannot sufficiently address.
Recent advances, notably through the Finder program of the U.S. Intelligence Advanced Research Projects Activity [2], have used computer vision techniques and a big-data paradigm to exploit visible terrain and landmarks appearing in an image and match them against various geospatial data to identify the image's geographic location. These technologies face considerable challenges in the absence of visible landmarks, such as when the background of an outdoor image has no differentiating features or when an image is captured indoors. (The preliminary results of this paper were presented at ICASSP 2018 [1].) The latter scenario is common in such global efforts as fighting child exploitation crimes [3]-[5], where a significant portion of images of such mistreatment and abuse are taken indoors in private locations by perpetrators. The anonymous nature of the internet makes it difficult to track down the perpetrators. New technologies to locate where such images were taken, even at a coarse level, would provide an unprecedented tool to aid global investigations, supporting law enforcement in the appropriate jurisdictions.
In this work, we explore an unnoticeable location-related signature that may be inherently captured in an image. This signature comes from the power distribution network, whose varying frequency over time is defined as the electric network frequency (ENF) signal [6]-[8]. The ENF signal over time in a multimedia recording reflects the behavior of the power grid at the time and the location where the media recording was captured. Such a relation enables the estimation of the time, location, and integrity of corresponding multimedia recordings [9]-[12].
Due to the temporally varying nature of the ENF, which plays the critical role in determining the time and location of a recording, ENF extraction in the literature thus far requires the hosting signal to have a temporal nature, e.g., audio or video. Specifically, to produce one frequency estimate, audio/video recordings need a signal several seconds long. Images captured by a rolling shutter also have a temporal dimension, but it is very short, i.e., 0.02-0.03 seconds. Hence, a series of intriguing questions that push the boundary of the ENF's localization capability arises: Can ENF traces be captured in a single image? If yes, can we determine the presence of ENF traces? If a trace is present, can we narrow it down to 50 vs. 60 Hz? This localization capability may be a valuable clue for law enforcement investigations and investigative journalism. To the best of our knowledge, this is the first comprehensive work (building on our preliminary conference paper [1]) to develop a unique forensic capability to geotag individual images that have neither a visible landmark nor a GPS tag, especially those captured indoors. This paper begins by characterizing the physical embedding process of the ENF, modeling the steps of electricity-to-illumination conversion, scene-geometry dilution of radiation in controlled experiments, and image sensing. This effort leads to a more precise parametric model than that in our preliminary version [1]. The new model contains time-varying amplitude and trend for embedded ENF traces, which better reflects realistic scenarios. Using the parametric form of the embedded ENF traces and an entropy-related statistical phenomenon of indoor photos, we approach the problem of detecting and reconstructing ENF traces in an image as an optimization problem and name our approach the entropy minimization method.
The optimization results will be used for a two-level ENF presence-classification test in which the first level determines whether an embedded ENF trace is present in a test image; if it is, the second-level test determines whether the images come from a 50 or 60 Hz region.
In addition to the entropy minimization method in conjunction with a simple parametric model for the embedded ENF traces presented in our preliminary work [1], we summarize the main contributions of this journal version as follows:
• We model the physical process of ENF embedding, which comprises the steps of light generation, the geometry dilution of radiation, and image sensing. We also show experimentally and via simulation that ENF traces in camera-captured photos may have time-varying amplitude and trend.
• We propose a more precise parametric model than that of [1] for embedded ENF traces based on insights into the physical embedding process, allowing the entropy minimization method to extract more precise ENF traces.
• We design a two-level ENF presence-classification test that produces three decision outcomes: no ENF, ENF at 50 Hz, and ENF at 60 Hz.
The rest of this paper is organized as follows. Section II provides background and preliminaries on the ENF and the image acquisition principles of rolling-shutter cameras. Section III investigates the physical embedding process for ENF traces, and Section IV presents our improved parametric model for ENF trace extraction from a single image and the two-level ENF presence-classification test. Section V shows the experimental results. Section VI discusses a few technical points. Finally, Section VII concludes the paper.

II. BACKGROUND AND PRELIMINARIES

A. Electric Network Frequency (ENF)
The electric network frequency (ENF) of power distribution networks has a nominal value of 60 Hz in most of the Americas and 50 Hz in most of the rest of the world. The instantaneous frequency of the sinusoidal supply signal of a power network does not stay at the nominal value. Instead, it fluctuates around the nominal value due to the mismatch between power supply and demand across the grid. The trends in the ENF variations tend to be very similar at different points interconnected in the same grid. The changing instantaneous value of the ENF over time is what we define as the ENF signal.
The ENF signal has been a growing subject of multimedia forensics research due to its presence in media sensing signals. If an audio or video recording is made near an electrical power source, the recording may capture the ENF variations at the time of recording. In audio recordings, this has been mainly attributed to the acoustic hum emitted by devices connected to the power mains [13]-[15]. In video recordings, it comes from the invisible flickering of electric lighting [14], [16]-[19]. This paper will show that it is possible to extract ENF traces captured in a single image taken in an electrically lit setting. In particular, the ENF value can be estimated from an image captured by a camera equipped with the widely available complementary metal-oxide-semiconductor (CMOS) image sensors that use the rolling-shutter mechanism.

B. Rolling Shutter and Read-Out Time
Unlike cameras with a global shutter that acquire all pixels of an image frame simultaneously, a camera with a rolling shutter captures an image frame one row at a time. Although traditionally regarded as a source of negative artifacts in images/videos, the rolling shutter has been exploited beneficially for computational photography applications [20], [21]. This work will show that the rolling shutter's sequential read-out mechanism allows the resulting image to capture samples of an incident light signal at slightly different time instants and thus record a short segment of ENF traces. Studies on ENF from audio [11], [15] and video [16], [19], [22], [23] show that one frequency estimate requires a signal of 5-10 seconds for audio and 12-20 seconds for video. In contrast, the signal length available from a single rolling-shutter image is only about 0.02-0.03 seconds, corresponding to about 2-3.6 flicker cycles, which is not long enough to observe ENF fluctuations.
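The 2-3.6 cycle figure follows directly from the read-out time and the flicker frequency (twice the grid frequency); a quick back-of-envelope sketch, with the read-out times quoted above:

```python
# Back-of-envelope check: how many light-flicker cycles fit into one
# rolling-shutter read-out window? (Read-out times are the 0.02-0.03 s
# range quoted above; the function name is ours, for illustration.)
def flicker_cycles(read_out_time_s: float, light_freq_hz: float) -> float:
    """Number of flicker cycles captured during one frame read-out."""
    return read_out_time_s * light_freq_hz

cycles_min = flicker_cycles(0.02, 100.0)  # 50 Hz grid -> 100 Hz flicker
cycles_max = flicker_cycles(0.03, 120.0)  # 60 Hz grid -> 120 Hz flicker
```

This reproduces the approximately 2 to 3.6 cycles available from a single image.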
The amount of time a rolling shutter takes to acquire the rows of an image frame, which we call the read-out time and denote by T_ro, is specific to the camera's model line. The read-out time is typically not given in a camera's user manual or technical specifications. To extract ENF information buried in a single image, we may need to know the read-out time T_ro of the camera that produced the image in question. In our work, we use a protocol inspired by the one proposed in [24], [25] to estimate the T_ro value of a camera at hand. Table I lists the values we estimated for the rear cameras of different iPhone models.

C. Image Acquisition by Rolling-Shutter Cameras
The rolling-shutter mechanism effectively turns the rows of a camera's CMOS sensor array into high-frequency samplers. The sampling frequency of a row can be expressed as f_row = M / T_ro, where M is the number of rows of the image. A CMOS camera captures ENF traces by recording a visual scene under an electric light source. The intensity of the electric light recorded in the resulting image follows the instantaneous power, which is the square of the supply voltage, so its nominal flicker frequency is twice the nominal ENF value, i.e., 120 Hz in most of the Americas and 100 Hz in most other parts of the world. By considering the acquisitions at different row indices i as samples of a fluctuating signal, the ENF trace captured in an image is modeled in our preliminary work [1] as

e(i) = A sin(2π f i / f_row + ϕ),  i ∈ {1, . . . , M},   (1)

where f is set to a constant of 100 or 120 Hz, considering that the frequency change over the few captured cycles is negligible. Note that A is the magnitude, i is the row index of the image, and ϕ is the initial phase of the ENF trace immediately before the first row is acquired. Our preliminary work [1], which uses the oversimplified parametric model in (1), showed good performance on images with synthetically added ENF traces but limited performance on images with real ENF traces. Thus, to improve the model fidelity and, ultimately, the accuracy of the ENF estimation, we begin by investigating the physical embedding process in Section III to obtain a more accurate parametric model of ENF traces embedded in an image.
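To make the row-sampling relation concrete, here is a minimal sketch of the constant-amplitude model described above; the values of M, T_ro, A, and ϕ are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Each row i of a rolling-shutter image is a temporal sample of the light
# flicker taken roughly at time i * T_ro / M, i.e., at rate f_row = M / T_ro.
M = 3000                 # number of image rows (illustrative)
T_ro = 0.03              # read-out time in seconds (camera-specific)
f_row = M / T_ro         # per-row sampling frequency in Hz
f = 120.0                # flicker frequency: twice the 60 Hz nominal ENF
A, phi = 5.0, 0.3        # constant magnitude and initial phase

i = np.arange(1, M + 1)
enf_trace = A * np.sin(2 * np.pi * f * i / f_row + phi)  # simple model (1)
```

With these numbers, each flicker cycle spans roughly f_row / f ≈ 833 rows, so the image holds only a few cycles of the trace.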

III. PHYSICAL EMBEDDING PROCESS FOR ENF TRACE
This section will show that the amplitude and bias of embedded ENF traces are not constant but time-varying, and it will investigate what causes these time-varying artifacts. The investigation is conducted by examining and modeling the physical ENF embedding process, which includes light generation, the geometry dilution of radiation, and image sensing. Fig. 1(a)-(c) show three scenes corrupted by stripes of ENF traces: Fig. 1(a) and Fig. 1(b) have nonuniform foregrounds and almost uniform backgrounds, and Fig. 1(c) shows a piece of white paper with no foreground object. Fig. 1(d) shows their averaged column signals, each obtained by averaging pixel intensities across all columns of an image. Fig. 1(e) shows the sinusoidal part (excluding the trend/bias) of the ENF traces for Fig. 1(a)-(c) estimated by the entropy minimization algorithm proposed in Section IV-D. Each sinusoidal part of the estimated ENF trace exhibits a sinusoidal shape similar to the averaged column signal, evidenced by the well-aligned peaks and valleys. One can notice that the overall trend in each averaged column signal is sinusoidal-like, implying an impact from the ENF. In addition, the scene contents of Fig. 1(a) and Fig. 1(b) contribute biases that alter the sinusoidal shape and produce small teeth in the averaged column signals, making the mixture of the scene content and the ENF-induced fluctuation nonsinusoidal. Unlike Fig. 1(a) and Fig. 1(b), Fig. 1(c) has no scene content but shows a slightly decreasing trend along the row index, which makes its averaged column signal a nonideal sinusoid as well.

A. ENF Embedding into Photos Under Indoor Lighting
In this subsection, we analyze the physical ENF embedding process, which primarily consists of light-generation and pixel-sensing steps. We approximate the supply voltage signal as an ideal sinusoidal wave to obtain analytic expressions, and later use simulation driven by real supply-voltage measurements to demonstrate that ENF traces embedded in an image have a time-varying behavior.
The voltage-to-power conversion is illustrated in the first block of Fig. 2. Here, we use a simplified model; a more realistic scenario is investigated in Appendix A of the supplementary materials and discussed in Section VI-A. When a light source is powered by an AC voltage signal v(t) = V_l cos(2π f_0 t + ϕ_0) with a voltage amplitude V_l ∈ R+, a frequency f_0 ∈ {50, 60} Hz, and an arbitrary initial phase ϕ_0, the instantaneous power has twice the frequency of the supply voltage, namely,

p_l(t) = v²(t)/R = (V_l² / 2R) [1 + cos(2ω_0 t + 2ϕ_0)],   (2)

where ω_0 = 2π f_0 is the angular frequency of the power mains. Note that we assume only resistive loads R are present in the circuit, but the result can be extended to capacitive and inductive loads. It is also noteworthy that the ratio of the AC component to the DC component of the supply power equals one. The model presented in Appendix A of the supplementary materials is more realistic, whereas the model presented here is idealized; regardless of the model, the frequency component 2ω_0 is dominant in the final camera-captured image.

The circuit load converts the electricity into illumination and acts as an equivalent lowpass filter, which attenuates the relative strength of the AC component. The power-to-illumination conversion is illustrated by the second block of Fig. 2 and formulated as a convolution of the power with a single exponentially decaying function [26], [27]:

ℓ(t) = K_l · p_l(t) * (1/τ) e^{−t/τ} u(t),   (3)

where "*" denotes linear convolution, τ is a time constant accounting for the illumination half-life, and u(t) is a unit step function with value 1 for t ≥ 0 and 0 elsewhere. Note that K_l is a scalar with a unit of lm · W^{−1} · m^{−2} that relates the power to the illuminance of the light source. The exponentially decaying function acts as a lowpass filter, converting the power signal into a smoother illumination. Hence, the resulting luminance signal ℓ(t) carries a DC component and a relatively smaller flickering AC component.

Fig. 2. An illustration of the ENF embedding process. When a light source is powered by a supply voltage v(t) with frequency ω_0, the instantaneous power p_l(t) is converted into the illuminance of the light ℓ(t), which has a DC component and an AC component flickering at 2ω_0 rad/s. The visible light then falls onto an object and interacts with the object's surface. The camera acquires the light reflected or reemitted from the surface and creates a visual image by collecting photons during the exposure time. This photon-accumulation process results in a sampled and quantized temporal pixel intensity signal I(t) whose intensity fluctuates at 2ω_0 rad/s.
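The two conversion steps can be simulated numerically. The sketch below uses illustrative constants (the kernel is normalized so K_l drops out) and checks that the exponential decay attenuates the AC component relative to DC by the lowpass gain 1/√(1 + (2ω_0 τ)²) at 2ω_0:

```python
import numpy as np

# Simulate v(t) -> p_l(t) -> ell(t) for a resistive load and an exponential
# decay kernel; all constants are illustrative, not from the paper.
fs = 20_000                             # simulation sampling rate (Hz)
t = np.arange(0, 0.1, 1 / fs)           # 0.1 s of signal
f0, V, R = 60.0, 1.0, 10.0
v = V * np.cos(2 * np.pi * f0 * t)      # ideal supply voltage
p = v**2 / R                            # instantaneous power at 2*f0

tau = 1.9e-3                            # illumination time constant (s)
h = np.exp(-t / tau) / tau              # (1/tau) * exp(-t/tau) kernel
ell = np.convolve(p, h)[: t.size] / fs  # discrete approx. of convolution

# AC-to-DC ratio of the steady-state illumination vs. the analytic gain.
seg = ell[int(0.03 * fs):]              # skip the start-up transient
ratio = (seg.max() - seg.min()) / 2 / seg.mean()
theory = 1 / np.sqrt(1 + (2 * 2 * np.pi * f0 * tau) ** 2)
```

The power signal's AC-to-DC ratio is exactly one; after the exponential lowpass it shrinks to about 0.57 for these constants, matching the analytic gain.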
The third block of Fig. 2 illustrates a geometric setup in which rays from a light source fall onto an object, reflect off it, and enter a camera's imaging system. During transmission and reflection, the light's intensity degrades per the inverse-square law and the surface albedo of the object. The camera acquires the light reflected from and/or reemitted by the object and forms a visual image of the object. The overall effect of the geometric setup is a nonuniform spatial scaling of the magnitude of the visible light, yet it does not change the AC-to-DC ratio of the light arriving at the camera lens. We therefore defer a more detailed analysis of the geometric setup to Section III-B and focus on the image sensing stage, which further reduces the AC-to-DC ratio. To this end, we introduce a spatially varying discounting factor α(p) < 1 at location p on the object to capture the light energy lost during reflection and transmission due to the geometric setup.
The incoming light focused by the lens is further processed by the camera's sensing unit to create digital images. Photons of the incoming light emitted from a small region in the scene arrive at an active cell of the sensor array. These photons are converted into electrons and then into a voltage level representing the intensity of a pixel of a digital image. Ignoring the complication of color, the intensity of the pixel is proportional to the number of photons landing on the sensor cell. The photons are collected during the camera exposure time t_exp, which can be modeled as a convolution of the perceived light signal with a rectangular window as follows [12]:

I(t) = β · ℓ(t) * (1/t_exp) [u(t) − u(t − t_exp)] = I_e(t) + I_dc,   (4)

where β is a scalar that relates the illuminance to the pixel intensity values. The pixel intensity I(t) is decomposed into two components: a time series I_e(t) capturing the flicker of the pixel intensity over time and a constant term I_dc capturing the offset of the pixel intensity. This camera integration process is a second lowpass filter that further reduces the AC-to-DC ratio.
To derive an analytic form of the relative strength between I_e(t) and I_dc, we apply the Fourier transform to the intensity signal in (4). The exponential decay kernel in (3) contributes a magnitude gain of 1/√(1 + (2ω_0 τ)²) at the flicker frequency 2ω_0, and the rectangular exposure window contributes a magnitude gain of |sin(ω_0 t_exp)| / (ω_0 t_exp). The ratio between the AC and DC parts of the pixel intensity is therefore

I_ac / I_dc = |sin(ω_0 t_exp)| / (ω_0 t_exp √(1 + (2ω_0 τ)²)),   (7)

which implies that the relative strength, also known as the modulation index [28] in the research field of flickering light's impact on vision, is controlled by the nominal ENF ω_0, the time constant τ, and the exposure time t_exp.
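A small helper makes the dependence on ω_0, τ, and t_exp explicit; this is a sketch of the ratio as reconstructed above, evaluated with the τ and t_exp values used in the paper's simulation:

```python
import numpy as np

# AC-to-DC ratio (modulation index): the exponential light decay and the
# exposure-time box window each act as a lowpass filter whose gain is
# evaluated at the flicker frequency 2*w0.
def modulation_index(f0_hz: float, tau_s: float, t_exp_s: float) -> float:
    w0 = 2 * np.pi * f0_hz
    lamp_gain = 1.0 / np.sqrt(1.0 + (2 * w0 * tau_s) ** 2)    # exp. decay
    shutter_gain = abs(np.sin(w0 * t_exp_s)) / (w0 * t_exp_s)  # box window
    return lamp_gain * shutter_gain

# tau = 1.9 ms and t_exp = 1/165 s as in the paper's simulation settings.
r = modulation_index(60.0, 1.9e-3, 1 / 165)
```

As expected, shortening the exposure or the lamp's decay time raises the modulation index toward one.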
To model ENF traces more precisely, we use our analytic pipeline and simulate the pixel intensity signal using an experimentally measured AC voltage source signal as input. We directly measured the supply voltage v(t) from a power outlet in the USA. Fig. 3 reveals that the practical AC voltage is a nonideal sinusoid with varying frequency and varying amplitude. We subsequently simulated the instantaneous power p_l(t), the illuminance of the light ℓ(t), and the resulting pixel intensity I(t) using (2), (3), and (4) by setting R = 10 Ω, K_l = 1 lm · W^{−1} · m^{−2}, α(p) = 1, β = 1, τ = 1.9 msec, and t_exp = 1/165 sec, respectively, and display them in Fig. 3. We observe that the upper and lower envelopes of the simulated camera-captured pixel intensity I(t) change slightly over time, which implies that ENF traces embedded in an image can have a time-varying behavior. Such behavior may result from the varying frequency and/or the varying amplitude of the AC voltage v(t). In Appendix B of the supplementary materials, we further investigate the causes using simulated voltage signals of different characteristics and reveal that both the varying frequency and the varying amplitude result in a time-varying behavior of the pixel intensity signal, as shown in Fig. 4(a). Also, we shall see that the parameterization of the amplitude and the bias of image-embedded ENF traces over the short duration of temporally sampling a rolling-shutter image, as displayed in Fig. 4(b), can be simplified using affine functions.

The amount of light reflection can be calculated using the diffuse and specular geometric reflection models that have been widely utilized in computer vision and computer graphics [29], [30]. The diffuse reflection corresponds to the scenario in which the reflected light is uniformly scattered in all directions, and its strength depends on the angle formed by the directions of the incident light and the surface normal.
The specular reflection corresponds to the scenario in which the light is reflected as if bouncing off a mirrored surface, and its strength depends on the angle formed by the viewing/sensing direction and the reflected light. In our work, we assume perfect diffuse reflection and ignore specular reflection, as most reflections are predominantly diffuse.

B. Trends of Amplitude Due to Camera-Scene Geometry
Considering the light-intensity degradation due to the inverse-square law and the diffuse reflection, we model the intensity of the reflected light arriving at the camera as a function of the row i of an image as

L_i = ρ cos θ_i / (d_i² d'_i²),   (8)

where ρ is the albedo parameter of the surface, d_i is the distance between the light source and the object, d'_i is the distance between the camera and the object, and θ_i is the angle between the surface normal vector and the incident light direction. The attenuation and reflection factors considered in (8) allow us to analytically show that the amplitude of the ENF trace is approximately linear in the row index i of an image, owing to the small-scale assumption of the captured scene. To show this relationship analytically, we consider a special case in which the camera lies in the same plane as that formed by the incident light and the surface normal vector. Further, we use a simplified 2-D scenario in which pixels in the same row share the same geometric configuration. Appendix C of the supplementary materials analytically shows that the light intensity L at location (b + u, 0) picked up by the camera has a linear relationship with the row index i, which indicates an approximately linear increase/decrease of the ENF trace's amplitude across the rows.

To empirically confirm this theoretical result, we captured a set of images under a fluorescent lamp using an iPhone 6s camera in an image acquisition environment similar to Fig. 5(b). We used a piece of blank paper to ensure a relatively uniform color distribution over the entire scene. To boost the accuracy of ENF trace estimation, we averaged pixel intensities across all columns, which increases the SNR in the linear scale by a factor of the number of columns. As a result, the trend of the ENF traces at the time of capture can be well revealed. Fig. 5(c) shows overlaid averaged column signals at the top and detrended averaged column signals at the bottom, where the linear trend in the top plot is removed from the corresponding averaged column signal to visualize the envelope of the amplitude. Our controlled experiment confirms that the overall trend and amplitude of the averaged column signals decrease approximately linearly with the row index due to the geometry in Fig. 5(b), where the camera is positioned in the same plane as that formed by the incident light and the surface normal vector. This result, together with the result of the previous subsection, allows us to model the amplitude and bias of the sinusoidal-like ENF trace using affine functions, as detailed in the next section.
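The near-linearity can also be checked numerically with a toy geometry (all positions in meters are assumptions for illustration, not the paper's setup): over a small patch of the scene, the modeled intensity deviates from a fitted line by only a small fraction of a percent.

```python
import numpy as np

# Toy check of the near-linear amplitude trend: reflected-light intensity
# L ~ rho * cos(theta) / (d_light^2 * d_cam^2) is almost linear over the
# small span of scene positions mapped to consecutive image rows.
rho = 0.8
u = np.linspace(0.20, 0.25, 200)         # small patch of the surface (m)
d_light_sq = u**2 + 1.0                  # light at (0, 1), surface at y = 0
d_cam_sq = (u - 0.5) ** 2 + 1.0          # camera at (0.5, 1)
cos_theta = 1.0 / np.sqrt(u**2 + 1.0)    # surface normal is (0, 1)
L = rho * cos_theta / (d_light_sq * d_cam_sq)

slope, intercept = np.polyfit(u, L, 1)   # best-fit line across the patch
residual = np.max(np.abs(L - (slope * u + intercept))) / np.mean(L)
```

For this geometry the intensity decreases across the patch, and the worst-case deviation from the fitted line stays well below 0.2% of the mean intensity.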

IV. PROPOSED METHOD FOR ENF PRESENCE DETECTION AND CLASSIFICATION
To address the challenging geotagging problem, we ask the following research questions: Can we determine whether an ENF trace is present in an image? If yes, can we reveal if the ENF trace is at 100 or 120 Hz? Lastly, what is the lowest amplitude of the embedded ENF trace for it to be correctly identified? To answer these questions, we propose a method that exploits: i) the domain knowledge presented in Section III that the image-embedded ENF trace has a parametric form and ii) a largely predictable change in the statistical behavior of the image before and after ENF embedding.

A. Phenomenon of Entropy Increase Due to ENF Embedding
We observe that smooth regions in images tend to become busier, i.e., have higher entropy, after being corrupted by real-valued sinusoidal signals. In the case of an image column with constant intensity, the single bin of the histogram before corruption splits into multiple bins due to the positive and negative additive values contributed by the sinusoidal signal. Such corruption increases the entropy of the constant signal from zero to a positive value. For an image column with linearly increasing intensity values, similar statistical behavior can be observed, even though the increase is not as drastic as in the constant-intensity case. The histogram of the linearly increasing intensity column before corruption has a rectangular shape. After corruption, the bins at the two sides of the rectangular shape split, which may increase the entropy.
We verify the claimed statistical behavior via simulation. First, we simulated the ENF embedding process by adding real-valued sinusoidal signals at 100 or 120 Hz, with ten randomly picked parameter configurations, to the columns of images. Then, we calculated the estimated entropy increase for each column and drew the histogram of the entropy increase in Fig. 6(a). It reveals that 99% of all columns exhibit an increase in entropy after being corrupted by sinusoidal-like signals, with the understanding that the entropy values of textured regions of images may not be sensitive to additive sinusoidal-like signals.
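The entropy-increase effect is easy to reproduce on a synthetic column; the column length, amplitude, and 32-bin histogram below are illustrative choices, not the paper's configuration:

```python
import numpy as np

# A constant-intensity column has zero histogram entropy; adding a
# sinusoid spreads the histogram over several bins and raises the entropy.
def column_entropy(col: np.ndarray, n_bins: int = 32) -> float:
    counts, _ = np.histogram(col, bins=n_bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

M = 3000
i = np.arange(M)
flat = np.full(M, 128.0)                             # smooth column
enf = 5.0 * np.sin(2 * np.pi * 120 * i / 100_000 + 0.3)
h_before = column_entropy(flat)
h_after = column_entropy(flat + enf)
```

The entropy jumps from exactly zero to several bits once the sinusoid is added.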

B. Parameterization of ENF Traces
The theoretical analyses and the controlled experiments in Section III confirm that an ENF trace in images can have time-varying amplitude and bias. This allows us to generalize the simple model (1) of this work beyond constant amplitude A and zero bias. Specifically, in Appendix B of the supplementary materials, we attribute the ups and downs of the pixel intensity signal to the varying amplitude and varying frequency of the supply voltage. For rolling-shutter captured images, which typically contain 2-3.6 sinusoidal cycles, a sinusoidal model with linearly changing amplitude and bias can adequately track the change of the pixel intensity. In Section III-B, we derived and experimentally verified that the amplitude of the light intensity arriving at a camera changes approximately linearly as a function of the row index, per the diffuse illumination model and the inverse-square law. Based on this newly obtained evidence, we propose the following sinusoidal function with affine amplitude A(i) and affine bias B(i) to model the ENF trace:

e(i) = A(i) sin(2π f i / f_row + ϕ) + B(i),  i ∈ {i_s, . . . , i_t},   (9)

where i is the row index, and i_s and i_t are the indices of the first row and the last row of an available region of an image, respectively. We parameterize the amplitude A(i) and the bias B(i) differently, namely,

A(i) = a_s + (a_t − a_s)(i − i_s)/(i_t − i_s),  B(i) = k_E i + b_E,   (10)

where the function A(i) is a line passing through the first point (i_s, a_s) and the last point (i_t, a_t). Parameters k_E and b_E are the slope and the intercept of B(i), respectively. The use of the two-point form for A(i) allows us to constrain the whole line segment to be positive/negative for all i, which is reflected in the constraints of the optimization problem (15) we will later formulate. On the other hand, we use the slope-intercept form for B(i), which helps avoid undesirable local minima that the two-point form can produce. Details of the justification are provided in Appendix D of the supplementary materials.
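The generalized trace model can be sketched directly in code. The amplitude, slope, intercept, and phase values below are borrowed from the paper's synthetic experiment in Section IV; f_row and the row count are our illustrative assumptions:

```python
import numpy as np

# Generalized trace: a sinusoid with two-point-form amplitude A(i) and
# slope-intercept-form bias B(i).
def enf_trace(i, i_s, i_t, a_s, a_t, f, phi, f_row, k_E, b_E):
    A = a_s + (a_t - a_s) * (i - i_s) / (i_t - i_s)   # two-point form
    B = k_E * i + b_E                                  # slope-intercept form
    return A * np.sin(2 * np.pi * f * i / f_row + phi) + B

i = np.arange(1, 3001)
e = enf_trace(i, i[0], i[-1], a_s=18.0, a_t=14.0, f=120.0, phi=3.77,
              f_row=3000 / 0.03, k_E=-1.68e-4, b_E=0.21)
```

The envelope shrinks from about 18 near the first row to about 14 near the last row, with the small affine bias superimposed.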

C. ENF Embedding Using Additive Model
One or more columns from an image can be exploited to estimate an embedded ENF trace. Fig. 7(b) visualizes the pixel intensities of two columns from the image displayed in Fig. 7(a), where the column signal in red comes from the smooth region (boxed with a dashed line) and the signal in black comes from the nonsmooth region (boxed with a solid line). Since the native image content in the smooth region is nearly uniform, the column signal from that region tends to change with the row index following the ENF trace's fluctuation at the time of capture. Similarly, the column signal from the nonsmooth region roughly follows the ENF variation but contains outliers compared with the column signal from the smooth region. Based on the observations of Fig. 7(b) and other column-signal examples, we make a simplifying assumption that the jth column of the native image content (without ENF) comprises an overall linear trend and an error term characterizing the visual components, acquisition noise, and outliers, namely,

x^{(j)}(i) = k_x^{(j)} i + b_x^{(j)} + ϵ^{(j)}(i),   (11)

where k_x^{(j)} is the slope of the linear trend, b_x^{(j)} is the intercept, and ϵ^{(j)}(i) is the zero-mean error term. The embedding of ENF traces e(i) into the visual content x^{(j)}(i) has been considered an additive procedure in the literature [14], [16]-[18]. We will contrast the additive and multiplicative models in Section VI-C, but in the rest of the paper we adopt the additive model. Combining the ENF trace model (9)-(10) and the native image content model (11), the pixel intensity signal at column j can be written as

I^{(j)}(i) = e(i) + x^{(j)}(i)   (12a)
           = A(i) sin(2π f i / f_row + ϕ) + k_j i + b_j + ϵ^{(j)}(i),   (12b)

where k_j = k_E + k_x^{(j)} and b_j = b_E + b_x^{(j)} merge the slopes and intercepts of the ENF bias and the native linear trend.

D. Entropy Minimization for Frequency Estimation
1) General Principle: Below, we outline the proposed idea of estimating the frequency from a single column signal of an ENF-corrupted image, and then extend it to using multiple column signals for improved estimation accuracy. The statistical behavior of the ENF-induced entropy increase discussed in Section IV-A guarantees that, in general, if the added sinusoid is removed from the ENF-corrupted image, the entropy will reduce to the level of the uncorrupted image. A linear trend in an image can also increase the entropy by stretching the histogram wider. By examining the ingredients of the pixel intensity model (12b), we note that the first two terms contribute to the inflated entropy of ENF-corrupted images. We define their sum as the entropy-affecting signal I_e^(j)(i) to facilitate the subsequent description of the proposed algorithm:

I_e^(j)(i) = A(i) cos(2π f/f_row · i + ϕ) + k_j i.   (13)

We remark that the bias term b_j in (12b) does not change the entropy, as it merely shifts the histogram. The noise term ϵ^(j)(i) in (12b) is a part of the native image (11) and hence is irrelevant to the entropy changes caused by ENF traces. When the true parameters of (13) are plugged in and I_e^(j)(i) is subtracted from the jth column signal of the ENF-corrupted image, the resulting entropy will be reduced. Formally, we can set up the following cost for the jth column signal as a function of θ_j = [a_s, a_t, f, ϕ, k_j]^T, for which the true parameters may lead to the global minimum or a local minimum:

J_j(θ_j) = entropy( hist( {I^(j)(i) − I_e^(j)(i)}_i ) ),   (14)

where hist(·) calculates a histogram of an unordered/ordered set of numbers and entropy(·) returns the entropy after properly normalizing the input to a probability mass function. We note that hist(·) can be highly nonlinear, as it involves counting numbers in predefined intervals. Among all unknown parameters, the frequency parameter f is our ultimate goal. The remaining parameters a_s, a_t, ϕ, and k_j are called nuisance parameters [31, Sec. 8.2] and need to be jointly solved with f to obtain a correct estimate of the frequency.
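The single-column cost (14) can be sketched as follows. This is an illustrative implementation under stated assumptions: a fixed unit bin width (so that a wider residual range occupies more bins and raises the entropy), an affine amplitude from a_s to a_t, and an assumed row-sampling rate `f_row`; the paper's exact histogram settings may differ.

```python
import numpy as np

def entropy_of(residual, bin_width=1.0):
    """Histogram-based entropy with fixed-width bins, so a residual spanning
    a wider intensity range occupies more bins -> higher entropy."""
    lo = np.floor(residual.min()) - bin_width
    hi = np.ceil(residual.max()) + 2 * bin_width
    counts, _ = np.histogram(residual, bins=np.arange(lo, hi, bin_width))
    p = counts[counts > 0] / len(residual)       # normalize to a pmf
    return -np.sum(p * np.log2(p))

def cost_j(theta, col, f_row=30960.0):
    """Single-column cost J_j of (14): entropy of the column after removing
    the candidate entropy-affecting signal I_e of (13)."""
    a_s, a_t, f, phi, k_j = theta
    i = np.arange(len(col))
    A = a_s + (a_t - a_s) * i / (len(col) - 1)   # affine amplitude a_s -> a_t
    I_e = A * np.cos(2 * np.pi * f / f_row * i + phi) + k_j * i
    return entropy_of(col - I_e)
```

With the true parameters plugged in, the residual collapses toward the content noise, and the entropy tends to drop relative to leaving the sinusoid in place.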
2) The Complete Optimization Problem: We propose to use multiple columns I^(1), . . . , I^(K) sparsely sampled from an ENF-corrupted image to improve the accuracy of the frequency estimation. We set up a multi-column optimization problem to solve for θ ≜ [a_s, a_t, f, ϕ, (k_j)_{j=1}^K]^T as follows:

min_θ Σ_{j=1}^K J_j(θ_j)   s.t.   a_s a_t ≥ 0,  f ∈ {100, 120}.   (15)

The parameters a_s, a_t, f, and ϕ, solely attributed to the embedded ENF trace, are shared across the K contributing columns, whereas the parameters k_j, which depend on the image content, are allowed to vary column-wise. This multi-column approach improves on the single-column approach (14) in two ways. First, the true solution may be merely a local minimum for each of the single-column cost functions {J_j(θ_j)}_{j=1}^K. When the cost functions are summed up in (15), the true solution may be boosted to the global minimum, and the aggregated cost at another local minimum may increase. A similar phenomenon is observed in the frequency estimation algorithm MUSIC [32], in which multiple eigenvectors in the noise subspace contribute to finding the global minimum. Second, even when each of {J_j(θ_j)}_{j=1}^K can return a solution near the ground truth, it can be analytically shown that the cost function involving K columns can boost the estimation accuracy by up to √K times, measured in the standard deviation of the frequency estimator. The proof can be found in Appendix E of the supplementary materials.
We further comment on the choices of constraints. The constraint a_s a_t ≥ 0 ensures that the amplitude of the ENF trace at the first row, a_s, and the amplitude at the last row, a_t, have the same sign. If not, the amplitude-varying sinusoid would have a (fractional) row index with zero amplitude. Typical amplitude-varying sinusoids should look similar to those displayed in Fig. 5(c).
The constraint f ∈ {100, 120} can significantly reduce the complexity of the optimization by effectively exploring one less dimension of the search space. This is motivated by the fact that an instantaneous ENF usually departs from the nominal frequency by only a small fraction, e.g., ±0.2% in Europe [33] and the USA [34]. This added constraint is also justified by our observation that a slight departure from the ground-truth frequency does not significantly affect the optimization results. We have experimentally verified this using images with synthetically added ENF traces, where the ENF frequency was generated from a uniform distribution over the range [120 − 0.24, 120 + 0.24] Hz and the remaining parameters of the ENF traces were fixed to a_s = 18, a_t = 14, k = −1.68 × 10^−4, b = 0.21, and ϕ = 3.77. We added each synthetic ENF trace to the same image and applied the entropy minimization with the frequency parameter set to 100 or 120 Hz, where the parameter search was initiated from the same initial point. Following the procedure for solving the optimization problem introduced in the next subsubsection, we obtained the estimates â_s, â_t, and ϕ̂ and drew a histogram for each of them, as shown in Fig. 8. One can observe that the obtained estimates have slight bias and small variance, which implies that the optimization results are not significantly affected by slight deviations from the ground-truth frequency.
3) Solving the Optimization Problem via Decomposition: We propose to solve the multi-column optimization problem (15) by decomposing it into two subproblems. The first subproblem solves for {k_j}_{j=1}^K using the RANSAC algorithm [35], which is more robust to outliers than the least-squares method. This step estimates the slope of each selected column without depending on the sinusoidal component, since the two are relatively uncoupled, as evidenced in Fig. 5(c). Plugging in the estimated {k_j}_{j=1}^K, the second subproblem solves for a_s, a_t, and ϕ for f = 100 and 120 Hz separately. The more restricted 3-d search problem in the space of [a_s, a_t, ϕ]^T is easier, especially when the cost function is nonconvex. We use a gradient-free optimizer as the numerical solver, as it requires neither computing gradients nor the cost function being convex or smooth. Specifically, we use the Nelder-Mead simplex method [36], which finds a solution by iteratively narrowing down the search space. In our implementation, we impose the search-space constraints |a_s| ≤ 25, |a_t| ≤ 25, and 0 ≤ ϕ ≤ 2π, where 25 was empirically determined based on the largest possible amplitude of ENF traces. To understand the landscape of the 3-d error surfaces, we calculated their 2-d projections along the directions of a_s, a_t, ϕ, and many randomly chosen directions in the space of (a_s, a_t, ϕ). We visualize the evolution of the 2-d projected error surfaces in animations that can be found in the multimedia supplemental files of the paper and at this URL: https://bit.ly/3IlMilE. It is observed that the cost function is generally well behaved as the projected error surface evolves through the true solution, i.e., locally convex around the true solution with slight bias. Each of the 2-d projected error surfaces with the remaining parameters set to the true values is shown in Fig. 6(b).
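The two-subproblem decomposition can be sketched as follows: a minimal hand-rolled RANSAC for the per-column slopes, then a Nelder-Mead search over (a_s, a_t, ϕ) at a fixed candidate frequency. This is an illustrative sketch, not the paper's implementation; the RANSAC settings, the entropy bin width, and `f_row` are assumptions, and the `bounds` argument of Nelder-Mead requires SciPy ≥ 1.7.

```python
import numpy as np
from scipy.optimize import minimize

def ransac_slope(col, n_iter=300, tol=2.0, seed=0):
    """Subproblem 1 (sketch): robust slope of one column via minimal RANSAC,
    treating the ENF oscillation and outliers as deviations from the trend."""
    rng = np.random.default_rng(seed)
    i = np.arange(len(col))
    best_k, best_count = 0.0, -1
    for _ in range(n_iter):
        p, q = rng.choice(len(col), size=2, replace=False)
        k = (col[q] - col[p]) / (q - p)          # line through two samples
        b = col[p] - k * p
        count = np.sum(np.abs(col - (k * i + b)) < tol)
        if count > best_count:
            best_k, best_count = k, count
    return best_k

def entropy_cost(resid):
    """Unit-bin-width histogram entropy of a residual."""
    edges = np.arange(np.floor(resid.min()) - 1, np.ceil(resid.max()) + 2)
    p = np.histogram(resid, bins=edges)[0] / len(resid)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def br(cols, f, f_row=30960.0):
    """Subproblem 2 (sketch): with slopes fixed by RANSAC, search
    (a_s, a_t, phi) via Nelder-Mead at candidate frequency f; return cost."""
    ks = [ransac_slope(c) for c in cols]
    def total(x):
        a_s, a_t, phi = x
        if a_s * a_t < 0:                        # enforce a_s * a_t >= 0
            return 1e9
        J = 0.0
        for c, k in zip(cols, ks):
            i = np.arange(len(c))
            A = a_s + (a_t - a_s) * i / (len(c) - 1)
            J += entropy_cost(c - A * np.cos(2 * np.pi * f / f_row * i + phi)
                              - k * i)
        return J
    res = minimize(total, x0=[5.0, 5.0, 1.0], method="Nelder-Mead",
                   bounds=[(-25, 25), (-25, 25), (0, 2 * np.pi)])
    return res.fun
```

Evaluating `br` at both candidate frequencies yields the two costs whose difference drives the tests of the next subsection.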
We have outlined in this subsection a procedure to estimate the frequency when an image is corrupted by an ENF trace. However, this procedure was not designed to answer another important question: Is an ENF trace present in a given image? As we shall see in the next subsection, the presence of the ENF can be inferred from the intermediate quantities that appear in the algorithm for solving the optimization problem (15).

E. Proposed Two-Level ENF Presence-Classification Test
Next, we propose a two-level hypothesis testing framework, summarized in Algorithm 1, to determine the presence of the ENF and to conduct a binary classification between 100 and 120 Hz if necessary. A first-level test determines whether an ENF is present in a given test image. If yes, a second-level test decides whether the test image was captured in a 100 or 120 Hz environment. Our testing framework relies on intermediate quantities, such as the estimated nuisance parameters of Section IV-D, and exploits the behaviors of the cost function (15) to construct effective test statistics. We present and explain the two subordinate tests of the framework below.
The first-level hypothesis test is formulated as follows:

H_0: no ENF trace is present in the test image;
H_1: an ENF trace at 100 or 120 Hz is present in the test image.

We propose a decision rule that rejects the null hypothesis H_0 if entropy margin ≥ η_1, where η_1 ∈ R_+ is a user-defined decision threshold. Here, the test statistic entropy margin is defined to be the absolute difference between the values of the cost function (15) evaluated at the two candidate frequencies, namely, entropy margin = |br(1) − br(2)|, where br(1) and br(2) are the costs for f = 100 and 120 Hz, respectively.

Algorithm 1: Proposed Two-Level ENF Test.
Input: An RGB image with no ENF or ENF at 100/120 Hz
Output: no ENF or f̂ ∈ {100, 120} Hz
Step 1. Initialization
1: Convert the image to grayscale
2: Randomly select K columns and augment them into a sample matrix I = [I^(1), . . . , I^(K)]
3: Use the RANSAC algorithm to obtain k̂ = [k̂_1, . . . , k̂_K]
Step 2. Estimation of Nuisance Parameters
4: F(1) = 100, F(2) = 120
5: br(ℓ) = min_{a_s, a_t, ϕ} cost(I, F(ℓ), a_s, a_t, k̂, ϕ) s.t. a_s a_t ≥ 0, for ℓ = 1, 2
Step 3. Two-Level Classification
6: entropy margin = |br(1) − br(2)|
7: if entropy margin < η_1 then
8:   declare no ENF presence
   else
9:   if br(1) − br(2) < η_2 then f̂ = 100 Hz else f̂ = 120 Hz

We justify the use of the entropy margin as the test statistic for the first-level ENF presence test as follows. When the alternative hypothesis H_1 is true, the cost function (15) evaluated at f = 100 and 120 Hz will result in significantly different cost values. When the cost function is evaluated at f = 100 Hz and the true ENF is at 100 Hz, the residue in (15) is close to the detrended image content, which should lead to a lower cost value per the statistical phenomenon reported in Section IV-A. In contrast, when the cost function is evaluated at f = 120 Hz given the true ENF at 100 Hz, the residue contains two extra sinusoidal-like ENF-trace components at 100 and 120 Hz in addition to the detrended image content. The extra components will increase the dynamic range of the residue and therefore lead to a higher entropy or cost function value. Hence, the costs at f = 100 and 120 Hz are very different under H_1, and it is easy to see that the same argument also applies when the nominal ENF is 120 Hz. In contrast, when the null hypothesis H_0 is true, no matter whether f = 100 or 120 Hz, the cost function contains a residue that is corrupted by one sinusoidal-like ENF trace contributed by the prediction model (13). Hence, the costs evaluated at the two candidate frequencies may not differ as significantly as under H_1. Simulated results in Fig. 9 confirm that when no ENF traces are added to images, the entropy margin is close to zero. In contrast, when ENF traces are added to images, the entropy margin is nonzero. Fig. 9 also shows that the entropy margin increases as a function of the ENF amplitude, which reveals that the proposed first-level ENF presence test can provide higher statistical power with stronger ENF.
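The decision logic of Algorithm 1 (Step 3) reduces to a few comparisons. A minimal sketch, where `br100` and `br120` stand for the costs br(1) and br(2), and the threshold values are illustrative:

```python
def two_level_decision(br100, br120, eta1, eta2=0.0):
    """Step 3 of Algorithm 1: detect ENF presence from the entropy margin,
    then classify 100 vs 120 Hz from the signed cost difference."""
    margin = abs(br100 - br120)
    if margin < eta1:
        return "no ENF"
    return "100 Hz" if br100 - br120 < eta2 else "120 Hz"

print(two_level_decision(3.10, 3.12, eta1=0.1))  # margin below eta1 -> no ENF
print(two_level_decision(1.8, 3.4, eta1=0.1))    # lower cost at 100 Hz -> 100 Hz
```

The candidate frequency with the lower cost wins because removing the correctly matched sinusoid shrinks the residual's histogram and hence its entropy.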
This is another desirable characteristic of the proposed test. In Appendix F of the supplementary materials, we analytically justify that as the ENF amplitude increases, the entropy margin tends to increase, as revealed in Fig. 9.

Fig. 9. Box plots of the entropy margin for different ENF strengths ("No ENF trace" vs. "With ENF trace"). The strength is quantified using the amplitude of the synthetically added sinusoidal ENF traces; the pixel intensity ranges from 0 to 255. The entropy margin is almost zero when no ENF traces are present in images, whereas it is nonzero when ENF traces are present. As the ENF strength increases, the entropy margin also increases.
If the first-level test decides that an ENF trace is present, the test image is passed to the second-level hypothesis test, formulated as follows:

H_0: the ENF trace is at 100 Hz;  H_1: the ENF trace is at 120 Hz.

Instead of using a simple decision rule such as br(1) > br(2), we propose to reject the null hypothesis if br(1) − br(2) ≥ η_2, where η_2 ∈ R is a second user-defined decision threshold.

A. Results for Images with Synthetic ENF Traces
We created a set of test images to simulate those captured in 50 Hz and 60 Hz regions using six template images that do not contain detectable ENF traces. Based on the parameterization in (9) and (10), we generated synthetic ENF traces parameterized by (a_s, a_t, ϕ, k_E, b_E), where the mean sinusoidal amplitude, L = (a_s + a_t)/2 ∈ {2^−1, 2^−0.5, 2^0, 2^0.5, 2^1, 2^2, 2^3, 2^4}, has the same unit as the pixel intensity with 256-level shades of gray. For each L, we randomly assigned the parameters as follows: a_s = L − 0.5δ and a_t = L + 0.5δ, where δ ∼ U(−L, L); ϕ ∼ U(0, 2π); k_E ∼ U(−2/M, 2/M); and b_E ∼ U(−1, 1), where M is the number of rows of an image. Then, we added the generated synthetic ENF traces to the native images to generate a total of 96 (= 2 × 6 × 8) images. For each image, the entropy minimization and a baseline method described in Appendix G of the supplementary materials were applied 20 times to ensure that we obtain consistent results. We randomly chose evenly spaced N = 32 columns. The performance of the first-level and second-level classifications was assessed separately. Fig. 10 shows representative results of the entropy minimization applied to images with synthetic ENF traces. Each row has one example that contains an image with an ENF trace, an ENF trace-removed image, and an estimated sinusoidal-like ENF trace (along with the true ENF trace). We display only the sinusoidal-like term A(i) cos(2π f/f_row · i + ϕ) but ignore the bias B(i) of (9), since the ENF slope term k_E in B(i) is mixed with the content slope term k_x^(j) as in (12b), and therefore neither of them is individually estimable. It is observed that the zero crossings of the estimated and true sinusoidal-like ENF traces are well aligned, which implies that the ENF traces are well estimated by the entropy minimization. Fig. 14 in Appendix H of the supplementary materials shows a trend that the stronger an embedded ENF trace is, the more reliable the estimated ENF traces are in general. Fig. 15 in Appendix H also presents receiver operating characteristic (ROC) results of the entropy minimization for images with ENF traces of varying strengths. The ROC results reveal that as the ENF strength increases, both the first-level and the second-level classification performance improves, yielding equal error rates (EERs) closer to 0. The EER represents the error rate at the ROC operating point where the false positive and false negative rates are equal. Similar results can also be found in the next subsection, where we tested images with real ENF traces.
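The parameter sampling described above can be sketched directly from the stated distributions; this is a faithful transcription of the sampling scheme, with only the function name and the example values of L and M being our own.

```python
import numpy as np

def sample_enf_params(L, M, rng):
    """Draw one set of synthetic ENF-trace parameters: the mean amplitude L
    is split into (a_s, a_t) via delta ~ U(-L, L); phase, slope, and bias
    are uniform as described in the text. M is the number of image rows."""
    delta = rng.uniform(-L, L)
    a_s, a_t = L - 0.5 * delta, L + 0.5 * delta
    phi = rng.uniform(0, 2 * np.pi)
    k_E = rng.uniform(-2 / M, 2 / M)
    b_E = rng.uniform(-1, 1)
    return a_s, a_t, phi, k_E, b_E

rng = np.random.default_rng(0)
params = [sample_enf_params(L, M=1080, rng=rng)
          for L in 2.0 ** np.array([-1, -0.5, 0, 0.5, 1, 2, 3, 4])]
```

By construction, (a_s + a_t)/2 = L for every draw, so each strength level is controlled exactly while the amplitude slope varies randomly.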

B. Results for Images with Real ENF Traces
We also created a real-world image dataset with iPhone-captured indoor photos. Following the image acquisition setup described in Appendix I of the supplementary materials, we acquired a dataset with a 60 Hz ENF trace in the United States using an iPhone 6s, and a dataset with a 50 Hz ENF trace in China using an iPhone 6. Each dataset contains ten scenes, each of which was captured at different ENF strength levels set on the camera by changing the exposure time. As shown in (7), the exposure time determines the intensity of the ENF trace embedded in the captured images. We show all captured photos in Fig. 17 of the supplementary materials. For each image, the entropy minimization and the baseline method were applied five times to ensure consistent results.
We assess performance through the ROC curves shown in Fig. 11. At the first level, the presence of an ENF trace is considered positive; at the second level, the 120 Hz ENF trace is treated as the positive case. The top row of Fig. 11 shows the ROC results for images with real ENF traces when the entropy minimization method was applied. The performance of the first-level and second-level classifiers is shown in the left and right panels of Fig. 11, respectively. Square, diamond, and triangle markers in the top row's left panel of Fig. 11 are operating points of the first-level classifier with decision thresholds η_1 = 0.006, 0.008, and 0.010, respectively. The top row's right panel of Fig. 11 contains results for the second-level classifier of the entropy minimization applied to the true positive cases of the first-level classifier.
The first-level decision results of the entropy minimization in the top row's left panel of Fig. 11 show that as the ENF strength increases, the classification performance generally improves. The second-level decision results of the entropy minimization, given in the top row's right panel of Fig. 11, show that when the strength is the lowest, the second-level classifier performs the worst regardless of the operating point of the first-level classifier. Conversely, when the strength increases above 0.23, which corresponds to the ENF strength embedded in the images in the first three columns of Fig. 16 in Appendix I of the supplementary materials, the EER begins to approach 0.
The bottom row of Fig. 11 shows the ROC results for images with real ENF traces when the baseline method was applied. Square, diamond, and triangle markers in the bottom row's left panel of Fig. 11 are operating points of the first-level classifier with decision thresholds γ_1 = 10, 20, and 25, respectively. The bottom row's right panel of Fig. 11 contains results for the second-level classifier of the baseline method applied to the true positive cases of the first-level classifier. One can observe that in most cases, the baseline method underperforms the entropy minimization method. For example, the first-level classifier of the baseline method is little better than a random guess when the strength is below 0.66. Even when the strength is greater than or equal to 0.66, it is worse than the corresponding performance level of the entropy minimization method. The results for the second-level classifier also show no consistent performance trend as a function of the ENF strength and no perfect classification result with EER = 0 at any strength level.

C. Comparison with Preliminary Parametric Model
To compare the proposed, more sophisticated parametric model with the preliminary parametric model (1) proposed in [1], we applied entropy minimization using the preliminary parametric model on the real-world image dataset of Section V-B. Fig. 12 compares the EER results of the two models for the first-level classifier. The EERs are shown as a function of the ENF strength, defined as the numerator of (7), i.e., sinc(ω_0 t_exp/π). The proposed model provides much better ENF detection performance in the challenging scenarios where the ENF is weak; when the ENF is strong, the performance of the two models is comparable. This observation implies that our proposed model outperforms the preliminary model precisely in the more challenging scenario of images with weaker ENF traces, which are more difficult to extract amid dominating visual content.

A. ENF Embedding Process for Fluorescent Lamps
Appendix A of the supplementary materials analyzes the ENF embedding process when the source light is a nonideal fluorescent lamp. A more realistic lamp voltage response is close to a periodic square wave given a sinusoidal-like alternating supply current [38], [39]. The result in (21) of the supplementary materials shows that the steady-state instantaneous power consumed by a fluorescent lamp contains frequency components at 2ω_0 k, k ∈ N. In contrast, the instantaneous power signal of (2c) in the ideal model has its fundamental frequency only at 2ω_0. Even though the more realistic case contains extra harmonics, the frequency component at 2ω_0 ends up dominating the final camera-captured image, because the power-to-illumination conversion (3) and the camera integration (4a) act as lowpass filters. The lowpass filters not only suppress higher harmonics of the power signal but also attenuate the strength of the AC component relative to the DC. Hence, the resulting pixel intensity contains a DC component and a relatively small sinusoidal-like flicker AC component at frequency 2ω_0, regardless of the characteristics of the lamp's voltage response.

B. Color Combination
In images captured under the lighting of fluorescent lamps, we may see colored ENF stripes, which may be caused by uneven energy distributions in the radiation spectrum. Thus, instead of using the standard grayscale conversion, a weighted average of the RGB channels, a weight vector emphasizing the portion of the light spectrum with stronger spectral energy may improve performance. We leave the investigation of colored ENF stripes as future work.

C. Multiplicative Model
In addition to the additive model [14], [16]-[18], which is resilient to a global intensity bias and is used in this paper, we also considered the multiplicative model, which is resilient to albedo and illumination changes. In a realistic scenario of capturing an image using a rolling shutter, the light interacts with physical surfaces within the camera's view. This interaction can be nonlinear, making the multiplicative model, as shown in (18a), physically more sensible. A large part of the trace can, however, be considered additive to the native image content when the strength of the AC component of the ENF trace is small relative to that of the DC component, namely, e(i) ≪ 1. We demonstrate the argument mathematically as follows:

I^(j)(i) = x^(j)(i) [1 + e(i)] = x^(j)(i) + x^(j)(i) e(i) ≈ x^(j)(i) + C e(i),

where C is the average intensity of the image. The additive model is also more mathematically tractable. Our experimental results showed that an algorithm developed based on the additive model performs better on real-world data than one based on the multiplicative model.
VII. CONCLUSION AND FUTURE WORK

In this paper, we have explored an almost unnoticeable power signature, the ENF trace, which can be embedded in images acquired by CMOS cameras, for geotagging purposes. Specifically, we have addressed the very challenging research question of whether the basic attributes of an embedded ENF trace can be correctly identified. To this end, we have investigated the physical ENF embedding process, considering experimentally measured power supply signals and a light reflection-transmission model. We have discovered that an image-embedded ENF trace is a sinusoidal wave with varying amplitude and varying bias.
We have proposed a parametric model of image-embedded ENF traces with varying amplitude and varying bias, which allows the entropy minimization method to extract more precise ENF traces. We have subsequently designed the two-level ENF presence-classification test to classify images among no ENF, 50 Hz ENF, and 60 Hz ENF. The experimental results have shown that our proposed method achieves higher decision accuracy than the baseline method.
In this proof-of-concept work, we have demonstrated a unique forensic capability of extracting invisible traces to help narrow down the geographic region in which an image was captured. It is important to recall the significant challenge of extracting ENF traces over a short duration buried in a single image, as it is extremely difficult to overcome the dominating visual content and reveal the weak ENF traces. In contrast, ENF extraction from video [16], [18] can take advantage of multiple frames to estimate the native visual content and separate the ENF from it. Despite the difficulty, the results have shown the promising capability of our proposed algorithms as the first attempt to tackle this challenge.
In future work, we plan to examine performance in a larger-scale investigation with more cameras and images. We also plan to explore additional location information that may be inferred from ENF traces using a set of image frames captured in a continuous shooting mode. This may facilitate wider adoption of ENF analysis in the image forensics field.
ACKNOWLEDGMENT

We thank Mr. Yuan Frank Fang for providing the references on LED drivers and for discussing the lighting mechanisms.

APPENDIX A
ENF EMBEDDING PROCESS FOR NONIDEAL FLUORESCENT LAMPS

Reference [40] shows that gas-discharge lamps such as fluorescent lights have a negative voltage-current characteristic, where an increase in current across the lamp terminals results in a voltage drop. Consequently, when an alternating current of sinusoidal form is applied to a fluorescent lighting system, the lamp voltage is significantly distorted. Fig. 7.1 of [40] illustrates the lamp voltage formation against one half cycle of sinusoidal alternating current. During the following half cycle, the same lamp voltage pattern with the opposite sign is observed.
To facilitate analysis, the lamp voltage has often been represented as a square wave [38], [39]:

v_l(t) = V_l · sgn(i_l(t)),

where V_l ∈ R_+ is the lamp voltage amplitude, i_l(t) = I_l sin(ω_0 t + ϕ) is the alternating supply current, and the sign function is defined as sgn(x) = 1 for x > 0, −1 for x < 0, and 0 for x = 0. Using the power law, we define the steady-state instantaneous power consumed by the lamp at time t as

p_l(t) = v_l(t) i_l(t) = V_l I_l |sin(ω_0 t + ϕ)|,

where I_l ∈ R_+ is the current amplitude, ω_0 = 2π f_0 is the angular frequency relating to the mains frequency f_0 ∈ {50, 60} Hz, and ϕ ∈ [0, 2π) is the initial phase. According to the Fourier series of |sin(ω_0 t)|, the power p_l(t) can be expanded in terms of cosine functions:

p_l(t) = V_l I_l [ 2/π − (4/π) Σ_{k=1}^∞ cos(2k(ω_0 t + ϕ)) / (4k² − 1) ],   (21)

in which the frequency component at 2ω_0 dominates. The time-dependent part of p_l(t) determines the changes in light intensity.
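The dominance of the 2ω_0 component in the rectified-sine power signal can be checked numerically; the sampling rate and f_0 = 50 Hz below are illustrative choices.

```python
import numpy as np

# Numerically confirm that |sin(w0 t)| (square-wave lamp voltage times a
# sinusoidal current, up to a scale) concentrates its AC energy at 2*f0.
f0 = 50.0                                 # mains frequency, Hz
fs = 10000.0                              # simulation sampling rate, Hz
t = np.arange(0, 1, 1 / fs)               # 1 s -> 1 Hz frequency resolution
p = np.abs(np.sin(2 * np.pi * f0 * t))    # instantaneous power, normalized
spec = np.abs(np.fft.rfft(p - p.mean()))  # remove DC, inspect AC components
freqs = np.fft.rfftfreq(len(t), 1 / fs)
peak = freqs[np.argmax(spec)]
print(peak)  # 100.0 -> dominant AC component at 2*f0
```

The harmonic at 2kω_0 scales as 1/(4k² − 1), so the k = 1 term at 2ω_0 carries five times the amplitude of the next harmonic, matching (21).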
Fluorescent lamps generate light by exciting mercury vapor, which produces ultraviolet radiation through the discharge process. The ultraviolet radiation generated in the gas discharge causes the phosphor on the inner wall of the tube to radiate visible light. To detect photons of visible light and measure fluorescence lifetime, a technique named time-correlated single photon counting (TCSPC) [26], [27] is commonly used. Using the fact that the likelihood of observing a photon at a certain time after a pulsed light excitation is proportional to the fluorescence intensity at that time, TCSPC helps to draw histograms of fluorescence photon arrival times from a series of excitation pulses. Studies [26], [27] show that the fluorescence intensity is modeled by a single exponentially decaying function or a sum of multiple exponential functions. Following the literature, we assume that the impulse response of the power-to-light conversion process takes the form of a single exponential decay. Thus, the resulting illuminance of the fluorescent lamp is represented as

l(t) = K_l · p_l(t) * ( e^(−t/τ) u(t) ),

where "*" is the linear convolution operator, τ is a time constant accounting for the illumination half-life, and u(t) is a unit step function with value 1 for t ≥ 0 and 0 elsewhere. Note that K_l is a power factor accounting for the portion of the consumed power that contributes to the illumination.
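The lowpass effect of the exponential-decay impulse response can likewise be verified numerically. This is a sketch: the time constant `tau`, kernel truncation length, and sampling rate are illustrative assumptions, not measured values.

```python
import numpy as np

# Convolving the power signal with an exponential-decay kernel (the assumed
# power-to-light impulse response) attenuates the 4*f0 harmonic more than
# the dominant 2*f0 component, i.e., it acts as a lowpass filter.
fs, f0, tau = 10000.0, 50.0, 0.004
t = np.arange(0, 1, 1 / fs)
p = np.abs(np.sin(2 * np.pi * f0 * t))        # rectified-sine power signal
h = np.exp(-t[:200] / tau)                    # truncated kernel e^(-t/tau) u(t)
lum = np.convolve(p, h)[:len(t)] / h.sum()    # normalize so the DC gain is 1
mag = lambda x, f: np.abs(np.fft.rfft(x))[int(f)]   # 1 Hz bins for 1 s signals
ratio_before = mag(p, 200) / mag(p, 100)      # harmonic-to-fundamental ratio
ratio_after = mag(lum, 200) / mag(lum, 100)
print(ratio_after < ratio_before)  # True: the higher harmonic is suppressed more
```

A first-order lowpass with time constant τ has gain 1/√(1 + (2πfτ)²), so the 200 Hz harmonic is attenuated roughly twice as strongly as the 100 Hz component for the τ chosen here.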

APPENDIX B ATTRIBUTION OF ARTIFACTS OF TIME-VARYING BEHAVIOR
In Section III-A, we noticed that the time-varying behavior of the camera-captured pixel intensity signal is present when the nonideal sinusoidal-like supply voltage is applied. To investigate whether the varying frequency and/or the varying amplitude of the supply voltage are the cause(s), we design an experiment in which simulated voltage signals of different characteristics are fed into the physical embedding process for a factor analysis. We adopt the following three parametric models of the supply voltage, namely, a constant frequency and constant amplitude (ConFreq-ConAmp) sinusoidal model, a varying frequency and constant amplitude (VarFreq-ConAmp) model, and a varying frequency and varying amplitude (VarFreq-VarAmp) model, denoted by

v_ConFreq-ConAmp(t) = V_l sin(ω_0 t + ϕ),   (23a)
v_VarFreq-ConAmp(t) = V_l sin(∫_0^t ω_0(s) ds + ϕ),   (23b)
v_VarFreq-VarAmp(t) = V_l(t) sin(∫_0^t ω_0(s) ds + ϕ),   (23c)

where V_l, V_l(t), ω_0, ω_0(t), and ϕ represent the constant amplitude, the varying amplitude, the constant angular frequency, the varying angular frequency, and the initial phase, respectively. The ConFreq-ConAmp model is a standard sinusoid with fixed parameters, whereas the VarFreq-ConAmp and VarFreq-VarAmp models have time-varying frequency and/or amplitude parameters. We estimate the parameters of each model as follows. For the ConFreq-ConAmp model (23a), we estimate the constant parameters V_l and ω_0 by applying a trust-region-based nonlinear least-squares algorithm [41] to the whole voltage signal. For the VarFreq-ConAmp and VarFreq-VarAmp models, we calculated a sequence of 6.8 msec segments with 50% overlap and averaged them using a triangular-windowed moving average to obtain the simulated signals. For the VarFreq-VarAmp model (23c), we estimated a voltage amplitude-frequency pair for each signal segment. By creating a sequence of overlapped windows, we obtain a sequence of time-varying voltage amplitudes and frequencies.
The estimation procedure for the VarFreq-ConAmp model (23b) contains an extra step in addition to that for the VarFreq-VarAmp model: the constant voltage V_l is estimated by averaging the sequence of time-varying voltage amplitudes. Fig. 4(a) shows three simulated pixel intensity signals I(t) driven by the different voltage models in (23). We used the same supply voltage measurement as in Section III-A to generate the three simulated voltage signals. Then, we simulated the pixel intensity signals I(t) by setting the parameters to be the same as those in the experimental setup of Fig. 3, except with t_exp = 40 msec. We used an exposure time of 40 msec so that the amplitude and the frequency can vary dynamically within the time window, considering the length of the simulated voltage segments of the VarFreq-ConAmp and VarFreq-VarAmp models. One can observe that the VarFreq-ConAmp and VarFreq-VarAmp cases result in time-varying trends, and that VarFreq-VarAmp leads to even more variation, which means that both the varying amplitude and the varying frequency of the AC source cause the time-varying behavior of the embedded ENF traces in images. Fig. 4(b) shows a few zoomed-in segments of the simulated I(t) for the VarFreq-ConAmp and VarFreq-VarAmp models, in which the displayed signal length roughly corresponds to 2, 2.5, and 3 cycles of a single image's acquisition duration. The segments' envelopes and biases roughly follow linear trends, which allows us to simplify the parameterization of the amplitude and the bias of the ENF trace using affine functions in Section IV-B.

Fig. 5(b) shows that, given a point light source at (0, h), the incident light reaches the object's surface at locations (b, 0) and (b + u, 0) with incident angles θ_0 and θ_i, respectively. The distances from the light source to locations (b, 0) and (b + u, 0) are defined to be d_0 and d_i. After hitting the object's surface, the reflected light travels to the camera located at (b + b′, h′), which is d′_0 and d′_i away from each location of interest.

Note: This is a supplemental document for the paper "Invisible geolocation signature extraction from a single image," submitted to IEEE T-IFS by Jisoo Choi, Chau-Wai Wong, Adi Hajj-Ahmad, Min Wu, and Yanpin Ren.

APPENDIX C AMPLITUDE OF AN ENF TRACE AS A FUNCTION OF ROW INDEX
The light intensity L_i of the reflected light at location (b + u, 0) arriving at the camera can be expressed as a function of the row i of an image, where c is a constant per (8), and we assume the scale of the captured scene u is much smaller than the vertical distance h and horizontal distance b to the light source and the horizontal distance b′ to the camera, i.e., u ≪ h, u ≪ b, and u ≪ b′, which we refer to as the small-scale assumption. To verify that L_i is approximately linear in u, we take the partial derivative with respect to u and examine whether it is constant.
Due to the small-scale assumption of the captured scene, we have cot²θ_0 + (1 + u/b)² ≈ cot²θ_0 + 1 = sin⁻²θ_0, and similarly for the camera-side term, which further reduces (26). Applying the small-scale assumptions u ≪ b < d_0 and u ≪ b′ < d′_0 again, we finally obtain (28), which is a constant not depending on u.
The result of (28) shows that the light intensity L_i has a linear relationship with u, which can be related to the row index i of the image. When the light reflecting off the object's surface travels through the camera lens, in the case of DSLR cameras, it is turned upside down and then flipped over again by a pentaprism. In the case of mobile cameras, as there is no reflecting mechanism, the light directly hits the camera's imaging sensor. Regardless of the camera type, a linear relationship between u and the row index i holds, meaning that as u increases, it contributes to forming higher rows of the image. Therefore, we conclude that the light intensity L_i at location (b + u, 0) picked up by the camera has a linear relationship with the row index i of the image.

APPENDIX D
SLOPE-INTERCEPT AND TWO-POINT FORMS
There are two common ways of parameterizing B(i) of (9): the slope-intercept form and the two-point form. The slope-intercept form is written as

B(i) = k_E i + b_E,

where k_E and b_E are the slope and the intercept of B(i), respectively. The two-point form can be written as

B(i) = b_s + (b_t − b_s)(i − i_s)/(i_t − i_s),

where the function B(i) is a line passing through the first point (i_s, b_s) and the last point (i_t, b_t). Given an image corrupted by a synthetic ENF trace, we visualize the landscapes of the cost function to determine the better parameterization for B(i) between the two possibilities. Fig. 13(a) and Fig. 13(b) correspond to the error surfaces of the slope-intercept form and the two-point form, respectively. Each is shown with respect to two search variables when the remaining search variables, including the frequency parameter, are set to the true values. Regardless of the parameterization, the cost functions for the three combinations (a_s, a_t), (a_s, ϕ), and (a_t, ϕ) are generally well behaved, i.e., locally convex around the true solution with slight bias. On the other hand, whenever one of the two variables is a bias-related parameter, i.e., b for the slope-intercept form or b_s and b_t for the two-point form, there is little change in the cost value as a function of the bias-related parameter.
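The two parameterizations describe the same line and can be converted into each other; a minimal sketch of the conversion (the endpoint values below are illustrative):

```python
def two_point_to_slope_intercept(i_s, b_s, i_t, b_t):
    """Convert the two-point form of the line B(i), passing through
    (i_s, b_s) and (i_t, b_t), to the slope-intercept form (k_E, b_E)."""
    k_E = (b_t - b_s) / (i_t - i_s)   # slope between the two endpoints
    b_E = b_s - k_E * i_s             # extrapolate back to i = 0
    return k_E, b_E

k_E, b_E = two_point_to_slope_intercept(0, 0.2, 999, -0.3)
```

The conversion is exact, which is why the discussion below concerns only how each parameterization shapes the cost landscape, not what lines it can represent.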
An interesting observation can be made from the error surfaces of (k, b) in Fig. 13(a) and (b_s, b_t) in Fig. 13(b), which suggests that the slope-intercept form is preferred over the two-point form. The variable b of the slope-intercept form merely shifts the histogram without changing its shape. This does not change the cost function, making it impossible to estimate b. Meanwhile, k changes the histogram's shape, leading to a change in the cost function, which makes it possible to determine k. These behaviors are seen in Fig. 13(a). In practice, we choose to combine each of these parameters with the corresponding parameters of the native image content signal and set them to fixed values using the RANSAC algorithm, as described in Section IV-C of the paper.
The variables b_s and b_t of the two-point form can be related to b of the slope-intercept form, since increasing b_s and b_t simultaneously corresponds to increasing b of the slope-intercept form. The error surface of (b_s, b_t) in Fig. 13(b) shows that the cost values are identical or very similar as both b_s and b_t increase along the top-right direction, which implies that a unique solution cannot be determined when b_s and b_t are both search variables. Furthermore, multiple local minima around the true parameters are observed, so optimization algorithms are likely to get stuck in local minima, making it even more difficult to estimate the correct ENF frequency.
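The shift-invariance of the entropy cost with respect to the intercept b, and its sensitivity to the slope k, can be sketched numerically. The following Python snippet is only an illustration with a hypothetical flat column modeled by the line B(i) = k·i + b; it is not the paper's full cost function:

```python
import numpy as np

def column_entropy(vals):
    """Entropy (bits) of the rounded-intensity histogram of one column."""
    ints = np.round(vals).astype(np.int64)
    counts = np.bincount(ints - ints.min())
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

rows = np.arange(480)
k_g, b_g = 0.05, 100.0          # hypothetical ground-truth line B(i) = k*i + b
col = k_g * rows + b_g          # smooth native content modeled by the line

# Entropy of the residual col - (k*rows + b) as the search variables move:
h_true    = column_entropy(col - (k_g * rows + b_g))       # correct (k, b)
h_b_shift = column_entropy(col - (k_g * rows + b_g) + 10)  # wrong b only
h_k_wrong = column_entropy(col - (0.10 * rows + b_g))      # wrong k

assert h_true == h_b_shift == 0.0  # b only shifts the histogram: cost is flat in b
assert h_k_wrong > h_true          # k reshapes the histogram: cost identifies k
```

A wrong intercept translates every bin by the same amount, so the histogram shape, and hence the entropy, is unchanged; a wrong slope spreads the residual over many bins and raises the entropy.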

A. Using Single Column to Estimate Parameters
Let us define a column of smooth native image content as x^(j) ∈ R^((i_t − i_s + 1)×1), and an embedded ENF signal as e(i; θ_g) parameterized by the ground-truth parameters. We define θ_g = [a_{s,g}, a_{t,g}, f_g, ϕ_g]^T and θ = [a_s, a_t, f, ϕ]^T to be the vectors of the ground-truth parameters and the search parameters, respectively. The corrupted sensing signal is therefore I = x^(j) + e. We set up a cost function J_j(θ), the entropy of the ENF-trace-removed column, to estimate θ, and denote by θ̂(x^(j)) a numerical suboptimal solution to the optimization problem

θ̂(x^(j)) = arg min_θ J_j(θ).

Taking into consideration the randomness of the image and the suboptimality of the numerical algorithm for finding the optimal solution, we model θ̂(x^(j)) as a random vector with a small bias b and a positive definite variance-covariance matrix Σ, namely,

θ̂(x^(j)) ~ (θ_g + b, Σ).
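The single-column estimation step can be sketched as follows. This Python snippet is a toy stand-in for the paper's solver: it fixes the frequency to its true value, assumes a flat native column, and uses a coarse grid search over a hypothetical constant amplitude a and phase ϕ (all numeric values are illustrative):

```python
import numpy as np

def residual_entropy(col, a, phi, f_norm):
    """Entropy of col minus a candidate trace a*sin(2*pi*f_norm*i + phi)."""
    idx = np.arange(col.size)
    resid = np.round(col - a * np.sin(2 * np.pi * f_norm * idx + phi)).astype(np.int64)
    counts = np.bincount(resid - resid.min())
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Hypothetical corrupted column I = x + e with flat content x = 120.
i = np.arange(600)
f_g, a_g, phi_g = 0.03, 6.0, 0.7
col = 120.0 + a_g * np.sin(2 * np.pi * f_g * i + phi_g)

# Coarse grid search: a numerical, suboptimal stand-in for the optimizer.
grid_a = np.linspace(2.0, 10.0, 33)
grid_phi = np.linspace(0.0, 2 * np.pi, 64, endpoint=False)
scores = [(residual_entropy(col, a, p, f_g), a, p)
          for a in grid_a for p in grid_phi]
_, a_hat, phi_hat = min(scores)
assert abs(a_hat - a_g) < 0.5 and abs(phi_hat - phi_g) < 0.2
```

Because the grid is coarse and the column is random in practice, the returned minimizer is suboptimal, which motivates modeling θ̂(x^(j)) as a random vector with bias and covariance.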

B. Using Multiple Columns to Improve the Accuracy of the Estimation
We use a random sample (x^(1), …, x^(K)) to improve the accuracy of the estimation, and denote by θ̂({x^(j)}_{j=1}^K) a numerical suboptimal solution to the multi-column optimization problem

θ̂({x^(j)}_{j=1}^K) = arg min_θ Σ_{j=1}^K J_j(θ). (35)

We conduct a Taylor-series expansion of J_j(θ) around θ̂(x^(j)) and obtain

J_j(θ) ≈ J_j(θ̂(x^(j))) + (1/2)(θ − θ̂(x^(j)))^T ∇²J_j (θ − θ̂(x^(j))), (36)

where the first-order term vanishes because θ̂(x^(j)) minimizes J_j, and ∇²J_j is a Hessian matrix whose individual components are second-order partial derivatives evaluated near θ̂(x^(j)). We further assume J_j(·) is twice continuously differentiable, hence ∇²J_j is a symmetric matrix. Substituting the Taylor-expanded cost function for each column in (36) into the multi-column cost function in (35), and setting the gradient with respect to θ to zero, we obtain the optimal analytical solution to (35) as

θ̂ = (Σ_{j=1}^K ∇²J_j)^{-1} Σ_{j=1}^K ∇²J_j θ̂(x^(j)), (37)

i.e., a Hessian-weighted average of the single-column estimates. The sample average of the gradients ∇J_j evaluated at θ̂(x^(j)) converges to 0 in distribution as K increases. Hence, θ̂ has mean approximately θ_g + b and covariance on the order of Σ/K. The above results reveal that by using K columns, the estimator has a significantly reduced variance and a bias similar to the single-column case.
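The variance-reduction claim can be checked with a toy Monte Carlo experiment. The snippet below is a simplified scalar illustration, not the paper's estimator: per-column estimates are modeled as independent draws around θ_g + b, and, with similar per-column Hessians, the Hessian-weighted combination reduces to a plain average, so the variance shrinks roughly like 1/K:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_g, bias, sigma = 1.0, 0.02, 0.3   # hypothetical scalar parameter model

def combined_estimate(K):
    """Average of K toy per-column estimates theta_g + bias + N(0, sigma^2)."""
    return (theta_g + bias + sigma * rng.standard_normal(K)).mean()

trials = 4000
var_1 = np.var([combined_estimate(1) for _ in range(trials)])
var_16 = np.var([combined_estimate(16) for _ in range(trials)])
mean_16 = np.mean([combined_estimate(16) for _ in range(trials)])

assert var_16 < var_1 / 8                    # variance shrinks roughly as 1/K
assert abs(mean_16 - (theta_g + bias)) < 0.01  # the bias is unchanged by averaging
```

This mirrors the analytical conclusion: K columns shrink the covariance by about a factor of K while leaving the small bias b in place.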

APPENDIX F ENTROPY MARGIN BEHAVIOR
For an arbitrary column of an ENF-containing image, I(i), we measure the entropy of a random variable Y as

H(Y) = Σ_y p_I(y) log (1/p_I(y)),

where y represents pixel intensity with 256 shades of gray, i.e., y ∈ {0, 1, …, 255}, and p_I(y) is the probability mass function obtained by normalizing the histogram of the column of the ENF-containing image. We formulate our research problem of identifying whether an image contains a 100 Hz or a 120 Hz trace as an optimization problem that solves for the search parameters θ = [a_s, a_t, f, ϕ]^T:

θ̂ = arg min_θ H(p(y; θ)), (41)

where p(y; θ) is the normalized distribution of the column of the ENF-trace-removed image.
We define the entropy margin as the absolute difference between the two entropy values at the ground-truth ENF f_g and the non-ground-truth ENF f_{−g}:

|H(p(y; θ, f_g)) − H(p(y; θ, f_{−g}))|. (42)

We assume that the image column of our native image has a constant intensity, so that its histogram contains a single bin. Thus, when the estimated ENF trace ê(i; θ) obtained by the optimization problem (41) is very close to the ground-truth ENF trace, we have

H(p(y; θ, f_g)) = Σ_y p(y; θ, f_g) log (1/p(y; θ, f_g)) = p(y_i; θ, f_g) log (1/p(y_i; θ, f_g)) + Σ_{y≠y_i} p(y; θ, f_g) log (1/p(y; θ, f_g)) = 0, (43)

where y_i is the intensity of the single bin. The first term of (43) equals 0 because log(1) = 0, and the second term equals 0 because p(y; θ, f_g) is zero for every y ≠ y_i. This leaves the entropy margin depending only on H(p(y; θ, f_{−g})):

H(p(y; θ, f_{−g})), (44)

which is the entropy of the ENF-trace-removed column at the non-ground-truth ENF f_{−g}, I(i) − e(i; θ, f_{−g}). When the estimated ENF frequency is not equal to the ground-truth frequency, we can view the ENF-trace-removed column as an image column containing two ENF traces characterized by two different ENF frequencies. Thus, we may observe bins spread over some range around the single bin of the native image column, which increases the entropy. Moreover, as the ENF strength gets larger, the bins spread over a wider range of pixel intensities, which makes the entropy even higher. This trend has been confirmed by our simulation results in Fig. 9. For an image that does not contain ENF, the entropy margin can be expressed as

|H(p(y; θ, f = 100)) − H(p(y; θ, f = 120))| = |Σ_y p(y; θ, f = 100) log (1/p(y; θ, f = 100)) − Σ_y p(y; θ, f = 120) log (1/p(y; θ, f = 120))| (45a)
≈ 0, (45b)

where the conversion from (45a) to (45b) is due to ê(i; θ, f = 100) ≈ 0 and ê(i; θ, f = 120) ≈ 0.
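The two regimes of the entropy margin can be demonstrated numerically. The sketch below uses hypothetical normalized frequencies as stand-ins for the 100/120 Hz pair and a flat native column; it is illustrative only:

```python
import numpy as np

def entropy_bits(vals):
    """Entropy (bits) of the rounded-intensity histogram of a column."""
    counts = np.bincount(np.round(vals - vals.min()).astype(np.int64))
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

i = np.arange(600)
def trace(f, a):
    return a * np.sin(2 * np.pi * f * i)

f_g, f_ng = 0.04, 0.0333   # hypothetical stand-ins for the two candidate ENFs

# Column WITH an ENF trace: removing the true trace collapses the histogram
# to a single bin; removing the wrong-frequency trace leaves two traces behind.
col_enf = 130.0 + trace(f_g, 6.0)
margin_enf = abs(entropy_bits(col_enf - trace(f_g, 6.0))
                 - entropy_bits(col_enf - trace(f_ng, 6.0)))

# Column WITHOUT an ENF trace: the best-fit trace at either frequency is ~0,
# so both entropies coincide and the margin vanishes, as in (45).
col_flat = 130.0 * np.ones(i.size)
margin_flat = abs(entropy_bits(col_flat - trace(f_g, 0.0))
                  - entropy_bits(col_flat - trace(f_ng, 0.0)))

assert margin_enf > 1.0 and margin_flat == 0.0
```

Increasing the amplitude passed to `trace` widens the spread of the wrong-frequency residual histogram, reproducing the trend that a stronger ENF yields a larger margin.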

APPENDIX G CURVE FITTING AS A BASELINE METHOD
We propose a baseline method based on curve fitting, where the pixel intensity signal in (12b) is used as the fitting function. Using the nonlinear least-squares fitting in the MATLAB Curve Fitting Toolbox [41] with the trust-region optimization algorithm and the bisquare robust-weighting method, we fit the function to the averaged pixel values of K selected image columns, Ī(i) = (1/K) Σ_{j=1}^K I^(j)(i), to obtain parameter estimates for f, a_s, a_t, k, ϕ, and b. As in the entropy minimization method, we also impose the constraints f ∈ {100, 120} Hz and a_s a_t ≥ 0, leading to a total of 4 (= 2 × 2) fitting results for a single image. Among the four fitting results, we choose the frequency with the lowest fitting RMSE as the final frequency estimate f̂.
Similar to the entropy minimization, we use the two-level ENF presence-classification test, where the MSE margin, defined as the absolute difference between the MSE values at f = 100 and f = 120, and the signed difference between the two MSE values are used as the test statistics for the first-level and second-level tests, respectively.
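The fit-and-select step of the baseline can be sketched in Python. Instead of the MATLAB toolbox, the illustration below linearizes the model (a_s + slope·i)·sin(2πf·i + ϕ) + k·i + b into sin/cos columns so that ordinary least squares applies; the normalized frequencies and parameter values are hypothetical:

```python
import numpy as np

i = np.arange(480, dtype=float)
F = {100: 0.05, 120: 0.06}   # hypothetical normalized row frequencies

def design(f_norm):
    """Design matrix for a linearly varying amplitude sinusoid plus a line."""
    th = 2 * np.pi * f_norm * i
    return np.column_stack([np.sin(th), np.cos(th),
                            i * np.sin(th), i * np.cos(th),
                            i, np.ones_like(i)])

# Averaged column carrying a synthetic "120 Hz" trace plus a linear background.
amp = 4.0 - 2.0 * i / i[-1]
y = amp * np.sin(2 * np.pi * F[120] * i + 0.5) + 0.01 * i + 128.0

def fit_rmse(f_norm):
    A = design(f_norm)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sqrt(np.mean((A @ coef - y) ** 2)))

# Choose the candidate frequency with the lowest fitting RMSE.
f_hat = min(F, key=lambda khz: fit_rmse(F[khz]))
assert f_hat == 120
```

The absolute and signed differences of the two RMSE (or MSE) values then serve as the first-level and second-level test statistics, respectively.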

APPENDIX H DETAILED RESULTS FOR IMAGES WITH SYNTHETIC ENF TRACES
To examine the accuracy of the estimators, we calculated Pearson's correlation coefficients between the sinusoidal components of each estimated ENF trace and the true ENF trace. The performance evaluation using the correlation coefficient is shown in Fig. 14 using box plots, where each box shows the median, the upper and lower quartiles, and the most extreme data points excluding outliers; the outliers are plotted individually using the "+" symbol. The box plots reveal that as the ENF amplitude level increases, the median of the correlation coefficients generally moves closer to 1 with smaller variance. Under all ENF amplitude levels except the strongest, the entropy minimization method generally outperforms the baseline method in terms of both the median and the variance of the correlation coefficients. Fig. 15 illustrates the ROC and the EER performances on images with synthetic ENF traces as a function of the ENF strength. The top-left panel of Fig. 15 shows the first-level decision results of the entropy minimization: as the ENF strength increases, the classification performance improves, and once the strength exceeds 2.0, the first-level detector performs perfectly. The second-level decision results of the entropy minimization are presented in the top-right panel of Fig. 15, showing that no matter what the threshold of the first-level classifier is, as long as the strength is more than 0.7, the second-level classifier can almost perfectly distinguish images between 100 Hz and 120 Hz.
The bottom row of Fig. 15 shows the ROC curve results on images with synthetic ENF traces when the baseline method is applied. Square, diamond, and triangle markers in the bottom-left panel of Fig. 15 are the operating points of the first-level classifier with decision thresholds γ_1 = 4, 5, and 6, respectively. The bottom-right panel of Fig. 15 contains the ROC results for the second-level classifier of the baseline method applied to the true-positive cases of the first-level classifier. It is observed that when the ENF strength is less than 8, the first-level detector of the baseline method performs almost the same as a random guess. When the strength is greater than or equal to 8, the EER approaches 0; at a strength of 16, the detector performs best, though still not perfectly. The results for the second-level classifier also demonstrate that its performance improves roughly in proportion to the ENF strength regardless of the threshold of the first-level classifier, and only at a strength of 16 does the second-level classifier perform perfectly.
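The way the EER shrinks as the ENF strength grows can be illustrated with a toy score model. The snippet below is not the paper's data: it draws hypothetical detector scores for ENF and non-ENF images and computes the EER by sweeping a threshold:

```python
import numpy as np

def eer(pos_scores, neg_scores):
    """Equal error rate: threshold where false-accept ~ false-reject."""
    best = 1.0
    for t in np.sort(np.concatenate([pos_scores, neg_scores])):
        far = np.mean(neg_scores >= t)   # non-ENF images accepted
        frr = np.mean(pos_scores < t)    # ENF images rejected
        best = min(best, max(far, frr))
    return best

rng = np.random.default_rng(2)
# Toy margins: ENF images score higher; a stronger ENF widens the separation.
weak = eer(rng.normal(1.0, 1.0, 500), rng.normal(0.0, 1.0, 500))
strong = eer(rng.normal(4.0, 1.0, 500), rng.normal(0.0, 1.0, 500))
assert strong < weak
```

With well-separated score distributions (the strong-ENF case), the EER drops toward 0, matching the trend seen in Fig. 15.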

APPENDIX I DETAILED RESULTS FOR IMAGES WITH REAL ENF TRACES
Under the same image acquisition environment, τω_0 in the denominator of (7) is constant. This allows us to define the ENF strength using the numerator of (7), i.e., sinc(ω_0 t_exp / π). Each scene of the 60 Hz dataset was taken with seven different exposure time settings, producing seven ENF strength levels. To obtain a similar average brightness across images even with different shutter speeds, the ISO sensitivity was varied according to the shutter speed change. The authors in [42] provided the following formula relating a pixel value in an image, N_d, to the camera settings and scene luminance:

N_d = K_c · L_s · t_exp · S / f_s²,

where K_c, f_s, S, and L_s are the calibration constant, the aperture number (f-stop), the ISO sensitivity, and the scene luminance, respectively. A pixel value represents the light intensity, or brightness, at that pixel location of the scene. Thus, assuming a uniform color intensity over the entire pixel area and the same scene luminance over different scenery images, it is possible to calculate the ISO sensitivity required to obtain a similar average brightness on images captured with different exposure times. For example, if we use 1800 as the ISO sensitivity for the 1/311 s shutter speed, the ISO value for the 1/30 s exposure time should be an integer around 30 × 1800/311. The left half of Fig. 16 shows ten scenes taken in the United States, where the brightness of the electric lighting varies at 120 Hz; for each scene, images with the seven ENF strength levels controlled by the exposure time are shown. As we can see, a higher ENF strength captures clearer fluctuations of the ENF trace.
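The ISO-matching rule follows directly from the brightness relation: if the pixel value is assumed proportional to L_s · t_exp · S / f_s², then holding brightness constant requires t_exp · S to stay fixed, so ISO scales inversely with exposure time. A minimal sketch, with the helper name `matched_iso` being illustrative:

```python
# Assumption: pixel value N_d ~ K_c * L_s * t_exp * S / f_s^2, so matching
# brightness across shutter speeds keeps the product t_exp * S constant.
def matched_iso(iso_ref, t_ref, t_new):
    """ISO needed at exposure t_new to match brightness at (iso_ref, t_ref)."""
    return round(iso_ref * t_ref / t_new)

# The example above: ISO 1800 at 1/311 s -> ISO for a 1/30 s exposure,
# an integer around 30 * 1800 / 311.
assert matched_iso(1800, 1 / 311, 1 / 30) == round(30 * 1800 / 311)
```

In practice the computed value would be rounded to the nearest ISO setting the camera actually offers.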