Global Illumination on a Mobile Phone: Scalable Real-time Global Illumination using Sparse Radiance Probes

Abstract

Real-time global illumination that scales from low to high-end hardware is important for interactive applications so they can reach wider audiences. To achieve this, the real-time lighting algorithm must offer adjustable performance characteristics. Sparse Radiance Probes (SRP) is a recent real-time global illumination algorithm that runs in under 5 ms per frame on a high-end Nvidia Titan X GPU. Its low per-frame timings suggest it could scale to low-end devices, but no prior work provides complete implementation details and evaluates its performance across devices with varying performance characteristics to prove this. This thesis therefore aims to fill this gap and determine if SRP is scalable across low to high-end devices. SRP is implemented with adjustable scaling parameters, and its performance is compared across three test devices: a low-end iPhone 7, a mid-range AMD Radeon 560 graphics card, and a high-end AMD RX Vega 56 graphics card. The implementation in this thesis ran above 60 FPS for simple scenes on the iPhone 7, and with a reasonable reduction in quality, it ran just above 30 FPS on more complex scenes like Crytek Sponza. These results show that SRP can scale to low-end devices. While the implementation in this thesis runs in real time, there are implementation optimisations that would make SRP run even faster across all the test devices without reducing quality.

Acknowledgements

First, I would like to thank my supervisors, Taehyun Rhee and Andrew Chalmers. The advice and guidance you have given me over the past year have been invaluable. I also want to thank Thomas Roughton, both for the countless technical discussions we have had and for the support you have given me as a friend. Lastly, I want to thank my parents and two brothers for supporting and encouraging me over my years at university.


Introduction
Imagine you are working on a video game. Your desktop computer with a high-end graphics card simulates the world beautifully, the lighting feels realistic, and the game runs at a smooth 60 frames per second. You have a problem though: the game needs to run on more devices than just your computer. Unfortunately, the global illumination (GI) algorithm that provides the realistic lighting does not scale to the limited memory and processing power of mobile devices.
A scalable GI algorithm needs to run well on low-end devices while not being so limited that it cannot take advantage of the extra computation and storage available on high-end devices. Scalability is especially important as more and more devices with rendering capabilities are becoming available, from low-end mobile devices up to high-end desktop computers with the latest graphics cards. Applications for scalable real-time GI are not limited to video games: for example, architects may need to show a realistic visualisation of a design to a client on a mobile device while on-site, and yet still have a higher quality version available on a desktop computer in their office.
There are many real-time GI methods in production use, each offering a different performance-quality tradeoff. Screen-space methods [1,2] perform well on low-end devices, but their inherent lack of off-screen information means the result is equally inaccurate on high-end devices. Voxel cone tracing [3] produces high-quality results, but it is too expensive to run on current low-end devices [4]. Precomputed lightmaps [5] scale well across all hardware but force geometry and lighting to remain static at runtime.
Recently, Silvennoinen and Lehtinen introduced Sparse Radiance Probes (SRP) [6], a real-time GI method which supports dynamic lighting, cameras, and diffuse materials. It efficiently reconstructs indirect illumination at a dense set of receiver points distributed over scene surfaces by using a sparse set of radiance probes and precomputed transport coefficients. In computer graphics, a method is generally accepted as real-time if it runs in under 33.3 ms; on a high-end Nvidia Titan X graphics card, their method runs in under 5 ms per frame. As SRP runs well under the real-time cut-off on this high-end graphics card, it suggests that low-end devices could also run SRP in real time. Additionally, the sparse nature of SRP means low-end devices with limited memory can still store SRP's precomputed transport coefficients, and the precomputed data can be packaged into reasonably small downloads over mobile networks. Silvennoinen and Lehtinen reported that SRP used less than 100 MB of memory for a scene used in an unannounced "triple-A" video game.
Although these metrics suggest SRP could scale to lower-end devices, there is no previous work that implements and evaluates SRP on low-end devices such as mobile phones. Additionally, the authors of SRP did not provide complete implementation details for precomputing the transport coefficients. The aim of this thesis is therefore to determine if SRP is a scalable real-time GI method for various platforms, including an iPhone 7: that is, to determine if it runs in real time on low-end devices while also scaling up to provide higher quality results on high-end hardware. I implemented my own version of SRP with adjustable scaling parameters, and analysed its performance, quality, and scalability across low, mid, and high-end hardware.
The main contributions of this thesis can be summarised as:
• An implementation of SRP which runs across three devices with varying performance characteristics: a high-end AMD RX Vega 56 graphics card, a mid-range AMD Radeon 560 graphics card, and a low-end iPhone 7.
• A pipeline and complete implementation details for precomputing SRP's transport coefficients on the graphics processing unit (GPU).
• A performance and visual quality evaluation of my SRP implementation across low, mid, and high-end hardware to determine its scalability.
The thesis is structured as follows. Chapter 2 provides background on topics necessary to understand the chapters which follow. Chapter 3 describes the related work in real-time scalable GI algorithms. Chapter 4 covers the theory of SRP; it describes how SRP encodes radiance at a sparse set of probes while correcting for parallax error and visibility, how SRP reconstructs irradiance at receiver points using the probes and transport coefficients, and lastly how SRP compresses its transport coefficients using clustered principal component analysis (CPCA). Chapter 5 describes my implementation of SRP; it covers the steps of my implementation and how data flows between them, how radiance at the probes is efficiently projected into spherical harmonics at runtime using compute shaders, and how the pipeline for precomputing the transport coefficients on the GPU works. Chapter 6 provides the results of my performance and visual quality evaluation across low, mid, and high-end hardware to determine the scalability of SRP. Lastly, Chapter 7 outlines the conclusions of this thesis, discusses its limitations, and provides suggestions for future work.

Global Illumination
Global illumination (GI), roughly speaking, is the effect that other lit objects have on the illumination of a surface. A basic real-time rasteriser only considers local illumination. Local illumination assumes that the illumination of a point on a surface depends only on light sources in a scene; this means a rasteriser can compute illumination for each pixel independently of every other pixel. However, in the real world, the illumination of a point also depends on how light interacts with other surfaces. This interdependency between surfaces makes global illumination expensive to simulate.

Physically Based Rendering Units
Physically based rendering (PBR) is a methodology that aims to simulate light as closely as possible to how it behaves in the real world. One component of PBR is the use of physically based units from radiometry to measure light:
• Radiant flux, Φ, is the flow of radiant energy over time. It is measured in watts (W).
• Radiance is the radiant flux per square metre per steradian.
• Irradiance is the density of radiant flux over an area, $E = \frac{d\Phi}{dA}$.
Radiance and irradiance are the units most commonly used in computer graphics. At a high level, radiance is a measure of an amount of light in a direction ω, and irradiance is the amount of light a point x receives from all directions.

The Rendering Equation
The aim of any global illumination method is to solve or approximate the rendering equation [8]. This equation describes the distribution of light, or more specifically, radiance in a scene:

$$L_o(x, \omega_o) = L_e(x, \omega_o) + \int_{\Omega} f_r(x, \omega_i, \omega_o) \, L_i(x, \omega_i) \, (n \cdot \omega_i) \, d\omega_i \qquad (2.1)$$

It states that the outgoing radiance from a point x in the direction ω_o is the sum of the radiance x emits in direction ω_o plus the radiance x reflects in direction ω_o. The integral describes the amount of light x reflects: it gathers incoming radiance from all directions over the hemisphere Ω centred at the surface normal n and modulates it by the bidirectional reflectance distribution function (BRDF) and Lambert's cosine law. The BRDF controls the amount of radiance from an incoming direction ω_i that x reflects in the outgoing direction ω_o.
The interdependence between surfaces can be seen through the recursion in this equation. The incoming radiance at a point x depends on the radiance it receives from all points it can see, and the incoming radiance at those points again depends on the incoming radiance from all the points they can see, ad infinitum.

Spherical Harmonics
Spherical harmonics are an infinite set of orthonormal basis functions defined over the unit sphere. Many computer graphics techniques use spherical basis functions to efficiently encode radiance or irradiance as a linear combination of coefficients and basis functions:

$$f(\omega) = \sum_{j} c_j B_j(\omega) \qquad (2.2)$$

Each coefficient c_j conveys how similar a particular basis function B_j is to the original function f(ω). Finding these coefficients is done by projection, and recreating the original function is called reconstruction (Equation 2.2). To project a function f(ω) onto an orthonormal basis, the inner product is taken with each basis function, yielding a coefficient c_j which represents how similar the basis function is to the original function:

$$c_j = \int_{S^2} f(\omega) \, B_j(\omega) \, d\omega \qquad (2.3)$$

Limiting the number of basis functions to n gives a band-limited approximation of the original function:

$$\tilde{f}(\omega) = \sum_{j=1}^{n} c_j B_j(\omega)$$

This approximation is useful for real-time global illumination methods as it provides a way to efficiently store and evaluate radiance and irradiance distributions. In fact, spherical harmonics can represent Lambertian diffuse irradiance using nine coefficients with only 1% error [10]. However, a band-limited approximation only captures low-frequency information. Spherical harmonics represent low-frequency diffuse irradiance well, but information is lost when encoding radiance or irradiance with higher-frequency information like visibility. Unfortunately, low-frequency representations are a necessary trade-off for real-time methods, as alternatives generally require expensive sampling of the rendering equation.
In computer graphics, we only use the real part of spherical harmonics. The real spherical harmonic basis functions are defined as:

$$Y_l^m(\theta, \phi) = \begin{cases} \sqrt{2} \, K_l^m \cos(m\phi) \, P_l^m(\cos\theta) & m > 0 \\ \sqrt{2} \, K_l^m \sin(-m\phi) \, P_l^{-m}(\cos\theta) & m < 0 \\ K_l^0 \, P_l^0(\cos\theta) & m = 0 \end{cases}$$

where P is the associated Legendre polynomial function, and K is a normalisation factor. Spherical Harmonic Lighting: The Gritty Details [11] includes definitions of the Legendre polynomials and the normalisation factor.
Spherical harmonic functions are parameterised by l and m. These parameters group the functions into orders, where l is the order of a function and m is the index of a function within that order. Orders can also be referred to as bands. While orders are indexed from zero, bands start at one (i.e. order-0 is the first band, order-1 the second, etc.). For a given order l, the index m lies within the range [−l, l], and it follows that each order has 2l + 1 basis functions. As an example, an order-2 spherical harmonic (3rd row of Figure 2.2) has nine functions in total: the five functions from its band and the four from the bands before it.

The total number of functions for an order l (including all lower bands) is (l + 1)². For cleaner notation in this thesis, I refer to the direction using a single vector ω instead of spherical coordinates (θ, φ), and I index each basis function with a single index j instead of l and m (i.e. Y_j(ω) instead of Y_l^m(θ, φ)). Spherical harmonics are not the only spherical basis functions in use in rendering; non-linear wavelets [12], spherical radial basis functions (SRBFs) [13], and Ambient Dice [9] are some examples.
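As a concrete illustration of this flattened indexing, a minimal sketch (the helper names are illustrative, not taken from my implementation):

// Hypothetical helpers for the flattened spherical-harmonic indexing
// described above: order l >= 0 and index m in [-l, l].
inline int shIndex(int l, int m) { return l * (l + 1) + m; }           // j for Y_j = Y_l^m
inline int shCoefficientCount(int l) { return (l + 1) * (l + 1); }     // all bands up to order l

For example, shIndex(0, 0) is 0, the order-1 functions occupy indices 1 to 3, and shCoefficientCount(2) is 9, matching the order-2 example above.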
To calculate each coefficient c j , I solve Equation 2.3 using Monte Carlo integration within a compute shader. The next section introduces Monte Carlo integration, followed by a section on compute shaders.

Monte Carlo Integration
Monte Carlo integration is a numerical integration technique which uses random samples to estimate the value of an integral. It is especially useful for solving complex multi-dimensional integrals like the rendering equation because its convergence rate does not depend on the dimension of the integrand.
The basic Monte Carlo integration algorithm is simple: it takes N samples of a function and then divides the sum by the total number of samples N. For this to work, the samples must be uniformly distributed over the integral's domain. Given enough samples, the algorithm will converge to the true value of the integral.
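As a minimal one-dimensional sketch of this idea (illustrative code, not part of my implementation), the sum of N uniform samples is scaled by the size of the domain and divided by N:

#include <random>

// Estimate the integral of f over [a, b] with N uniformly distributed samples.
// The estimator is (b - a)/N * sum_k f(x_k); it converges as N grows.
template <typename F>
double monteCarloIntegrate(F f, double a, double b, int N) {
    std::mt19937 rng(42);                               // fixed seed for repeatability
    std::uniform_real_distribution<double> dist(a, b);  // uniform over the domain
    double sum = 0.0;
    for (int k = 0; k < N; ++k)
        sum += f(dist(rng));
    return (b - a) * sum / N;                           // scale by the domain size
}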
For more information on Monte Carlo integration, Physically Based Rendering (PBRT) [15] provides a good introduction to the subject and why it works in the context of computer graphics.

Compute Shaders
Compute shaders are general-purpose shaders which allow non-graphics-specific functions to run on a graphics processing unit (GPU); they are a generalisation of vertex and fragment shaders. Each shader function is run in parallel over different data; in vertex and fragment shaders, this data is usually a single vertex or a pixel, but for a compute shader it can be anything. Different shading languages use different terminology to describe how compute shaders operate. This thesis uses the terminology from the Metal Shading Language [16]. To transform data using a compute shader, it must be broken into chunks. Metal calls these chunks threadgroups. Within a threadgroup are individual threads which each run a kernel function. The threads within a threadgroup can run in parallel and can access shared threadgroup memory.
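A minimal Metal kernel illustrates this terminology (an illustrative example rather than code from my implementation); each thread processes one element of a buffer, and the dispatched grid is divided into threadgroups:

#include <metal_stdlib>
using namespace metal;

// Each thread scales one element of a buffer. Threads in the same
// threadgroup may also share threadgroup memory (not needed here).
kernel void scaleBuffer(device float *data     [[buffer(0)]],
                        constant float &factor [[buffer(1)]],
                        uint tid               [[thread_position_in_grid]])
{
    data[tid] *= factor;
}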

Principal Component Analysis and Singular Value Decomposition
Principal Component Analysis (PCA) is a technique used to find the axes in a data set with the most variation. These axes are called principal components.
Singular Value Decomposition (SVD) is one way to perform PCA. SVD decomposes a matrix A into the form:

$$A = U \Sigma V^T$$

where U is an m × m matrix, Σ is an m × n rectangular diagonal matrix, and V is an n × n matrix. SRP uses the SVD to implement Clustered Principal Component Analysis (CPCA) [17] for compressing its precomputed transport coefficients.
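As a brief sketch of how the SVD yields principal components (using the Eigen library purely for illustration; this thesis is not tied to it): centre the data, take the SVD, and read the principal axes from V:

#include <Eigen/Dense>

// Rows of X are data points. After centring, the columns of V from the SVD
// of X are the principal components, ordered by decreasing singular value
// (i.e. by the amount of variation they capture).
Eigen::MatrixXf principalComponents(const Eigen::MatrixXf &X) {
    Eigen::MatrixXf centred = X.rowwise() - X.colwise().mean();
    Eigen::JacobiSVD<Eigen::MatrixXf> svd(centred,
                                          Eigen::ComputeThinU | Eigen::ComputeThinV);
    return svd.matrixV();
}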

Related Work
There is a great deal of research available on real-time global illumination. In this chapter, I cover the related work most relevant to scalable real-time global illumination. For more detailed background, Ritschel et al. provide in-depth coverage of the state of the art in interactive global illumination [18].
Monte Carlo path tracing [8] sets the bar in terms of the quality achievable by rendering algorithms. It is the industry standard for rendering VFX, animated films, commercials, and more [19]. Weta Digital's Manuka [20], Pixar's RenderMan [21], and Disney's Hyperion [22] are renderers from the top VFX and animation studios which are all based on path tracing. However, path tracing and other ray-tracing-based methods [23,24] are typically very expensive, making them unsuitable for real-time applications.
While ray tracing can be expensive, recent advances in both GPU hardware and algorithms are making real-time ray tracing feasible [25]. For example, Weta Digital now uses real-time ray tracing as a pre-visualisation tool [26], and some video games such as Battlefield 5 use real-time ray tracing to render reflections and other global illumination effects [27].
The current problem with these advances is that they are only performant on recent high-end graphics cards specifically designed to improve the performance of ray tracing. Low-end devices like mobile phones present a more challenging environment for ray tracing: their small form factor means there is little room for cooling, and energy usage must be kept to a minimum to preserve battery life.
One way mobile GPUs reduce energy usage is through dynamic voltage and frequency scaling (DVFS). DVFS adjusts the voltage supplied to the GPU based on previous GPU usage. This typically saves energy at the cost of minor decreases in performance but unfortunately does not work well with global illumination algorithms [28]. There is research looking to improve this; Lee et al. proposed a novel GPU architecture in 2013 to improve ray tracing performance on mobile devices [29]. However, at the time of publication, their work was in the simulation stage and they had not developed any hardware. Real-time ray tracing may be a viable solution in the future for scalable real-time global illumination, but hardware acceleration is not yet available on a wide enough range of devices.
One way to overcome the cost of ray tracing at runtime is Lightmapping [5]. Lightmapping stores precomputed illumination samples in a two-dimensional texture called a lightmap. In a manner similar to texture mapping, UV coordinates are assigned to surfaces in a scene which map to locations in a lightmap. Path tracing is commonly used to precompute the illumination samples. For diffuse Lambertian surfaces, a lightmap stores precomputed irradiance samples. At runtime, a forward or deferred pass queries the lightmap at the UV coordinates for the current surface the pass is shading. For normal-mapped or rough specular surfaces, it is infeasible to store samples for every view direction. Instead, a lightmap can store radiance samples encoded using a spherical basis such as spherical harmonics [30,31].
Lightmapping has been proven to be scalable by many video games; for example, the puzzle game The Witness uses lightmapping and is available on mobile devices, consoles, and desktop computers [32,33]. The primary limitation of lightmapping is that geometry and lighting must remain static at runtime. This is not an issue for applications like The Witness where there are no dynamic lights or geometry, but other techniques must be considered for applications for which these limitations are too restrictive.
Screen space global illumination techniques use the final rendered image and other auxiliary textures from the rasterisation process like the depth buffer to approximate global illumination effects. Examples include Screen Space Ambient Occlusion (SSAO) [1], which approximates the self-shadowing around edges of geometry; and Screen Space Directional Occlusion (SSDO) [2], which extends SSAO by adding directional shadows and one diffuse indirect bounce of light. Screen space techniques are highly performant, support both dynamic lighting and geometry, and need no precomputation. However, they lack accuracy as indirect light from offscreen sources cannot contribute.
Many-light methods approximate global illumination using many virtual point lights (VPLs). Instant Radiosity [34] first introduced this idea: paths of light are traced through a scene, and a VPL is placed at the intersections the paths make with geometry. Many-light methods are biased (meaning their result is not exactly correct), but they produce images without noise and are scalable, as their performance and quality can be tuned from plausible renders in real time to high-quality offline renders. Lightcuts [35] made many-light methods scalable so they could produce high-quality offline renders, and Bidirectional Lightcuts [36] helped reduce the inherent bias. However, these extensions mainly improve offline applications.
Imperfect shadow maps by Ritschel et al. [37] is a many-light method which interactively renders GI with dynamic lighting and geometry. They later improved on their original imperfect shadow map implementation by adding support for more complex scenes [38]. While close, imperfect shadow maps do not run above 30 FPS on the dedicated graphics cards they were tested on, indicating that the technique may not be suitable for low-end devices. For more information on many-light methods, Dachsbacher et al. provide an in-depth survey [39].
Voxel Cone Tracing (VCT) by Crassin [3] is a real-time global illumination technique which supports indirect diffuse and specular lighting, as well as dynamic lighting and geometry. At a high-level, VCT is similar to path tracing but with two approximations to improve performance. First, it approximates geometry by voxelising the scene, and second, it traces cones through the voxelised scene instead of firing many individual rays. Cascaded voxel cone tracing [40] offers improved performance by using dynamic resolution voxel grids. Similar to mip-maps, they use a high resolution grid near the camera, which progressively decreases in resolution as the distance from the camera increases.
VCT runs in real time on mainstream graphics cards, but it is still too expensive for lower-end devices. Nvidia reported that their implementation runs in 7.4 ms on a GTX 770 and 28.1 ms on a then-mid-range GTX 650 at a resolution of 1920×1080 and "medium" quality. Both of these graphics cards are considerably more performant than modern mobile graphics hardware. Wahlén's master's thesis found that VCT was too expensive to run in real time on a Samsung S7 Edge mobile phone [4], and Gaijin Entertainment also stated at GDC 2019 that voxel cone tracing was too expensive for their purposes on an Xbox One [41]; however, their GI budget is only a few milliseconds due to the other GPU computation costs of their game. Instead of using VCT, they use their own method, which also makes use of a voxelised scene; it runs in 0.7 ms for a single bounce and 1.6 ms for n bounces on an Xbox One.
McGuire et al. introduce a new data structure and an algorithm called light field probes [42]. It supports real-time global illumination for static scenes with both static and dynamic objects.
Enlighten is a commercialised and proprietary solution for rendering global illumination in real time [43,44]. It runs on everything from low-end mobile devices to high-end desktop GPUs. Unfortunately, there is limited information available on the details of their solution; however, at a high level, it uses lightmapping and light probes for diffuse indirect lighting. Their solution aims to minimise GPU time so that the GPU is free to perform other tasks, and as such they update their probes asynchronously on the central processing unit (CPU).
Precomputed radiance transfer (PRT) is a method introduced by Sloan et al. for rendering indirect environment lighting in real time [45]. PRT splits the rendering equation into two parts: a lighting function and a transfer function. The lighting function describes incoming radiance, and the transfer function describes how that incoming radiance interacts with surfaces in a scene. This separation means PRT can reduce computation considerably by precomputing the transfer function. While this means geometry must remain static, it reduces runtime computation enough to make dynamic real-time indirect lighting possible. In addition to splitting the rendering equation, precomputed radiance transfer methods also band-limit the indirect environment lighting by projecting it into a spherical basis. Sloan et al. use spherical harmonics [46], but other bases like non-linear wavelets have also been used [12].
Early PRT methods were limited to environment lighting. Hašan et al. extended PRT by adding support for local lights (e.g. point and spot lights) [47]. Theirs is a direct-to-indirect method which computes indirect illumination by gathering from a large set of radiance/irradiance samples distributed in a scene, called radiance probes.
Sparse Radiance Probes (SRP) is a real-time global illumination method based on PRT and other direct-to-indirect methods which supports dynamic lighting, cameras, and diffuse materials [6]. Additionally, it can support glossy materials for the final bounce towards the camera. Silvennoinen and Lehtinen, the authors of SRP, reported that it ran in 2.5 ms on an Nvidia Titan X for the test scene Sponza, and in 4.87 ms for Brutalist Hall, a novel scene from a "triple-A" video game. While these scenes were rendered on a high-end graphics card, the low render times suggest it could also scale to low-end devices. The next chapter of this thesis describes the theory of SRP in detail, followed by a chapter describing my SRP implementation.

Sparse Radiance Probes: Theory
The Sparse Radiance Probes (SRP) technique aims to compute indirect illumination at a dense set of points called receivers. These receivers can be any set of points distributed over scene surfaces or within space. Computing the indirect illumination (or, more specifically, the outgoing indirect radiance) requires densely sampling the rendering equation [8] over directions ω_o for all receivers x:

$$L_o(x, \omega_o) = \int_{\Omega} f_r(x, \omega_i, \omega_o) \, L_i(x, \omega_i) \cos\theta_i \, d\omega_i \qquad (4.1)$$

Unfortunately, it is too expensive for current hardware to do this in real time. SRP first simplifies the problem by assuming all surfaces use the view-independent diffuse Lambertian BRDF:

$$L_o(x) = \frac{\rho}{\pi} \int_{\Omega} L_i(x, \omega_i) \cos\theta_i \, d\omega_i \qquad (4.2)$$

Removing a dimension from the problem helps make it more tractable, but the incoming radiance at each receiver L_i(x, ω_i) remains too expensive to sample in real time. Instead, SRP approximates L_i(x, ω_i) by interpolating between radiance observed at a much sparser set of points called radiance probes. Section 4.1 covers how SRP performs this interpolation while accounting for parallax error and visibility.
The interpolation also helps, but SRP still needs to compute the incoming radiance at each probe. On top of this, it must do this every frame to support dynamic lighting in real time. The problem is that even with only a few probes, it is still too costly to compute incoming radiance at each probe by sampling the rendering equation. To overcome this, SRP band-limits each probe's radiance by encoding it in a spherical basis (section 4.2). This encoding means SRP can efficiently store and evaluate radiance in real time using a small number of radiance coefficients. Dynamic lighting is possible as we can efficiently generate these coefficients in real time using a compute shader (section 5.2). The trade-off of this representation is that the encoding only captures low-frequency information, which can reduce the accuracy of indirect shadows and cause colour bleeding. Silvennoinen and Lehtinen [6], the authors of SRP, use spherical harmonics as the set of spherical basis functions.
While the interpolation and band-limited probes significantly reduce computation, performing the interpolation at every receiver in real time is still too expensive. The interpolation provides an approximation for the incoming radiance in a direction at a receiver, but to calculate outgoing radiance we must integrate the incoming radiance over a hemisphere at the receiver; since it requires densely sampling an integral, this operation is too expensive to perform in real time. To overcome this, SRP factors the radiance coefficients out of this integral and precomputes the resulting integral (section 4.3). This precomputation yields a set of transport coefficients for each receiver.
For simple scenes, a weighted sum of the radiance coefficients and transport coefficients for a receiver is enough to compute diffuse indirect irradiance in real time. However, this is not possible for more complex scenes, as storing the transport coefficients for every receiver can easily consume gigabytes of memory. The last component of SRP is compression. Section 4.5 covers how SRP uses clustered principal component analysis (CPCA) to compress its transport coefficients.

Sparse Interpolation
To avoid densely sampling the rendering equation (Equation 2.1) at every receiver, SRP builds upon the work of sparse interpolation techniques [48,49]. Sparse interpolation techniques use a more accurate but more expensive method to sample radiance at a sparse set of points and then reconstruct radiance at a denser set of points by interpolating between those sparse samples. SRP calls the sparse sample points radiance probes, and the dense sample points receivers. The receivers interpolate between nearby probes to approximate their incoming radiance (Figure 4.1).
Ward et al. [48] introduced this idea with the following interpolator:

$$\tilde{L}(x, \omega) = \frac{\sum_{i=1}^{n} w_i(x) \, L(p_i, \omega)}{\sum_{i=1}^{n} w_i(x)} \qquad (4.3)$$

where x is a receiver, n is the probe count, L(p_i, ω) is the incoming radiance for a probe p_i in direction ω, and w_i(x) is a weight controlling how much of probe i's radiance contributes to the receiver's radiance.
SRP uses a weighting function from Lehtinen et al. [50] based on the distance between p_i and the receiver x, parameterised by t = ‖x − p_i‖/r, where r is a cut-off radius that defines the maximum distance at which a probe contributes to a receiver.
There are two main issues with the interpolator in Equation 4.3: first, it suffers from parallax error, and second, it does not account for visibility. The remainder of this section describes how SRP addresses these two issues and then defines SRP's interpolator.

Figure 4.1: The dark grey boxes represent geometry, the light grey circles distributed over the geometry represent the dense set of receiver points, and the larger green circles represent the sparse set of radiance probes. The coloured circles around the highlighted receivers represent their cut-off radius (i.e. the maximum distance at which a probe can contribute to a receiver). The receivers interpolate their radiance from nearby probes; for example, receiver A interpolates its radiance from probes p_0 and p_1 (probe p_0 contributes more to A than probe p_1 as it is closer); receiver B interpolates from probes p_1 and p_2; and receiver C interpolates from probe p_2.

Parallax Error Correction
Parallax error occurs in Ward's interpolator (Equation 4.3) when the incoming radiance observed by a probe in a direction ω comes from a different location than the radiance observed by a receiver in that same direction ω. The only scenarios in which this does not occur are when a probe is in the same location as a receiver, or when the source of the radiance is at infinity (e.g. an environment map). Parallax error is an issue when the radiance observed differs between the probe and receiver (Figure 4.2a); it is less of an issue with low-frequency diffuse lighting but becomes more noticeable with higher-frequency changes such as harsh shadows. To account for parallax error, we sample in a corrected direction φ(ω) instead of the sample direction ω used at a probe (Figure 4.2b). The corrected direction points from the probe towards the same point in the scene the receiver sees in the direction ω.

We calculate the corrected direction using the following equation:

$$\phi(\omega) = \frac{\Gamma(x, \omega) - p}{\lVert \Gamma(x, \omega) - p \rVert} \qquad (4.5)$$

where Γ(x, ω) is the ray-cast operator that returns the intersection point of a ray fired from x in direction ω. Equation 4.5 accounts for parallax error, but the observed radiance in the corrected direction may still be incorrect as it does not account for visibility. If an object o lies between p and the intersection point Γ(x, ω), then the radiance from o will be incorrectly used instead of the radiance from Γ(x, ω) (Figure 4.3).
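In code, the correction reduces to a ray cast followed by a normalised difference. In this sketch, raycast() is a hypothetical stand-in for the engine's ray-cast operator Γ:

#include <simd/simd.h>

// Hypothetical ray-cast helper standing in for Γ(x, ω): returns the first
// intersection point of a ray fired from x in direction w.
simd_float3 raycast(simd_float3 x, simd_float3 w);

// Equation 4.5: the corrected direction points from the probe towards the
// point the receiver sees in direction w.
simd_float3 parallaxCorrectedDirection(simd_float3 receiver,
                                       simd_float3 probe,
                                       simd_float3 w) {
    simd_float3 hit = raycast(receiver, w);
    return simd_normalize(hit - probe);
}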

Mutual Visibility
Methods before SRP have accounted for visibility, but only by checking between the receiver point x and the probe p [42]. This check is not enough to be correct: the surface point Γ(x, ω) must be mutually visible, i.e. both the receiver and the probe must see Γ(x, ω).

Interpolation Operator
SRP combines the parallax error correction and the mutual visibility check into one interpolator:

$$\tilde{L}(x, \omega) = \frac{\sum_{i} w_i(x) \, v(p_i, \Gamma(x, \omega)) \, L(p_i, \phi_i(\omega))}{\sum_{i} w_i(x) \, v(p_i, \Gamma(x, \omega))} \qquad (4.6)$$

where v(p, y) is one if the point y is visible from the probe p and zero otherwise, and φ_i(ω) is the parallax-corrected direction for probe p_i. If all surfaces are diffuse, the interpolator (Equation 4.6) is exact given an infinite-resolution sampling at the probes, and that the point Γ(x, ω) can see at least one probe [6].

Band-Limited Probes
To make sparse interpolation viable for real-time use, SRP still needs an efficient way to compute incoming radiance L(p_i, ω) at each probe p_i. Even though there are considerably fewer probes than receivers, a high-resolution sampling of the incoming radiance at each probe is still too expensive, both in terms of computation and in storage required. As a cheaper alternative, SRP band-limits the radiance at the probes by encoding it in a spherical harmonic basis:

$$L(p_i, \omega) \approx \sum_{j} \lambda_{ij} \, Y_j(\omega) \qquad (4.7)$$

where each radiance coefficient λ_ij is an element of the probe radiance vector λ. This vector contains the radiance coefficients for all probes. For convenience, this thesis indexes λ with probe index i and basis function index j.
Band-limiting radiance at the probes trades high-frequency information for much-needed execution speed and storage. For diffuse surfaces, the main source of high-frequency information comes from sharp changes in albedo or visibility. As an example, if an object casts a hard-edged shadow onto a surface, a low-frequency probe will not capture the sharp change in radiance it causes; this can make areas appear brighter than is correct. For spherical harmonics, it is possible to bring back high-frequency information by increasing the number of coefficients a probe uses; however, the cost increases quadratically, since order n requires (n + 1)² coefficients.

Precomputed Local Transport Operator
To combine the interpolation operator from Equation 4.6 with the band-limited probes from Equation 4.7, SRP defines a local transport operator P_x(λ) for every receiver x. This operator transforms the incoming radiance encoded in the probe radiance vector λ into incoming indirect radiance at a receiver x:

$$P_x(\lambda)(\omega) = \frac{\sum_{i} w_i(x) \, v(p_i, \Gamma(x, \omega)) \sum_{j} \lambda_{ij} \, Y_j(\phi_i(\omega))}{\sum_{i} w_i(x) \, v(p_i, \Gamma(x, \omega))} \qquad (4.8)$$

While significantly cheaper than densely sampling the rendering equation at every receiver, the local transport operator is still too expensive to evaluate in real time. To see why this is, recall that the final goal of SRP is not to compute incoming indirect radiance at each receiver, but to compute outgoing indirect radiance towards the camera. To do this for diffuse surfaces, we must calculate irradiance over a hemisphere centred at x:

$$E(x) = \int_{\Omega} P_x(\lambda)(\omega) \cos\theta \, d\omega \qquad (4.9)$$

Again, we come up against an integral which is too expensive to sample in real time. As the mutual visibility and parallax-corrected radiance terms in the local transport operator depend on the variable we are integrating over, ω, we must evaluate these terms many times to converge on the solution to the integral. To overcome this expense, SRP factors the probe radiance vector λ out of the integral and then precomputes the remaining components of the integral offline. This can be done because λ does not depend on the direction ω. Due to this precomputation, geometry must remain static, but lighting can still change at runtime. The next section covers the steps to factor λ out of Equation 4.9.

Irradiance Transport
This section describes how to factor the probe radiance vector λ out of the irradiance integral in Equation 4.9.
Starting with the irradiance transport integral from Equation 4.9:

$$E(x) = \int_{\Omega} \frac{\sum_{i} w_i(x) \, v(p_i, \Gamma(x, \omega)) \sum_{j} \lambda_{ij} \, Y_j(\phi_i(\omega))}{\sum_{k} w_k(x) \, v(p_k, \Gamma(x, \omega))} \cos\theta \, d\omega$$

we bring the radiance coefficients out of the integral:

$$E(x) = \sum_{ij} \lambda_{ij} \int_{\Omega} \frac{w_i(x) \, v(p_i, \Gamma(x, \omega)) \, Y_j(\phi_i(\omega))}{\sum_{k} w_k(x) \, v(p_k, \Gamma(x, \omega))} \cos\theta \, d\omega$$

Then, we define the transport kernel K_ij(x, ω):

$$K_{ij}(x, \omega) = \frac{w_i(x) \, v(p_i, \Gamma(x, \omega)) \, Y_j(\phi_i(\omega))}{\sum_{k} w_k(x) \, v(p_k, \Gamma(x, \omega))} \cos\theta \qquad (4.10)$$

leaving us with:

$$E(x) = \sum_{ij} \lambda_{ij} \int_{\Omega} K_{ij}(x, \omega) \, d\omega \qquad (4.11)$$

SRP precomputes the integral in Equation 4.11, yielding a transport coefficient α_ij for each probe i and basis function j at a receiver x:

$$\alpha_{ij}(x) = \int_{\Omega} K_{ij}(x, \omega) \, d\omega \qquad (4.12)$$

so that irradiance reduces to a dot product between the probe radiance vector and a precomputed transport vector:

$$E(x) = \sum_{ij} \lambda_{ij} \, \alpha_{ij}(x) = \lambda \cdot \alpha(x) \qquad (4.13)$$

Equation 4.13 provides an efficient way to transform the incoming radiance at the probes into indirect irradiance at a receiver point. To convert to outgoing indirect radiance, we multiply at runtime by the Lambertian BRDF ρ/π.

Clustered Principal Component Analysis
Sections 4.1 to 4.3 show how SRP reduces evaluating indirect diffuse irradiance at a receiver x to a dot product between the probe radiance vector λ and a precomputed transport vector α (Equation 4.13). This is efficient to evaluate but impractical to store, as keeping a transport vector α per receiver requires a significant amount of memory. To reduce the memory required, SRP compresses the transport vectors using clustered principal component analysis (CPCA) [17]. CPCA is an important part of SRP's scalability: a medium-sized scene with around 200 000 receivers, 100 probes, and 192 coefficients (order-7 spherical harmonics) would use 28.61 GB if stored in 16-bit floating point. To put this in perspective, an iPhone 7 only has 2 GB of memory shared between the GPU and CPU, and the largest amount of memory available on current high-end GPUs like the AMD FirePro W1900 or NVIDIA Quadro GV100 is 32 GB.
This section describes how CPCA compresses the transport vectors for SRP. CPCA works by dividing receivers into clusters. In each cluster, it uses principal component analysis (PCA) to find the axes which best describe the variation in the transport vectors. The top n_c axes are kept, along with weights which describe how to reconstruct each transport vector from those axes. As n_c decreases, less memory is used, but the error in the reconstructed transport vectors increases.

Cluster Compression
To compress the transport vectors, we perform PCA on each cluster. We first define a cluster transport matrix T_c, where the ith row of T_c contains the transport vector for the ith receiver in the cluster.

We then compute the SVD of T_c:

$$T_c = U \Sigma V^T$$

The SVD decomposes T_c into three matrices: U, Σ, and V^T. U is an n × n matrix, where n is the number of receivers in the cluster; Σ is an n × n_i n_j matrix; and V^T is an n_i n_j × n_i n_j matrix, where n_i is the number of probes and n_j is the number of basis functions used to represent radiance at a probe. The notation n × n_i n_j is short for n × (n_i · n_j), where n_i · n_j is a scalar; in this case, n is the number of rows in the matrix and n_i · n_j is the number of columns.
To compress T_c, we keep only the first n_c columns of the matrix U and the first n_c rows of the matrix Σ:

$$T_c \approx U_c \Sigma_c V^T$$

To evaluate indirect irradiance at runtime, we save two matrices: the cluster projection matrix Σ_c V^T, which contains the top n_c principal axes, and the receiver reconstruction matrix U_c, which contains reconstruction weights for each receiver in the cluster. Section 4.5.2 describes how these matrices are used to evaluate indirect irradiance at runtime.
As the receivers in a cluster are near each other, each cluster only uses a small subset of the total probes in a scene. This means the cluster projection matrix Σ_c V^T contains many zero columns. Further compression is possible by storing only the non-zero columns and their offsets.
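A sketch of this per-cluster compression (again using Eigen purely for illustration; the zero-column compaction just described is omitted for brevity):

#include <Eigen/Dense>

// Tc is n x (ni*nj): one transport vector per receiver in the cluster.
// We keep the top nc singular values, storing the cluster projection matrix
// (Sigma_c V^T) and the receiver reconstruction matrix (U_c).
struct CompressedCluster {
    Eigen::MatrixXf projection;     // nc x (ni*nj)
    Eigen::MatrixXf reconstruction; // n  x nc
};

CompressedCluster compressCluster(const Eigen::MatrixXf &Tc, int nc) {
    Eigen::JacobiSVD<Eigen::MatrixXf> svd(Tc,
                                          Eigen::ComputeThinU | Eigen::ComputeThinV);
    CompressedCluster c;
    c.reconstruction = svd.matrixU().leftCols(nc);              // U_c
    c.projection = svd.singularValues().head(nc).asDiagonal()   // Sigma_c
                   * svd.matrixV().leftCols(nc).transpose();    // top rows of V^T
    return c;
}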

Runtime Reconstruction
At runtime, SRP evaluates indirect illumination in two steps. First, each cluster projection matrix Σ_c V^T is multiplied with the probe radiance vector λ to produce an n_c-dimensional light basis vector l for each cluster. Second, the receiver reconstruction matrix U_c is multiplied with the light basis vector l to produce an n_k-dimensional vector (where n_k is the number of receivers in cluster k) containing the diffuse indirect irradiance for each receiver in the cluster. To convert irradiance to outgoing radiance when shading, each element of this vector is multiplied by the surface albedo and divided by π.
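As a sketch, these two steps are two small matrix multiplications per cluster, reusing the CompressedCluster struct from the previous example (in practice this is performed per colour channel):

#include <Eigen/Dense>

// lambda holds the probe radiance coefficients laid out to match the columns
// of the cluster projection matrix. The first multiply yields the
// nc-dimensional light basis vector; the second reconstructs irradiance for
// every receiver in the cluster.
Eigen::VectorXf clusterIrradiance(const CompressedCluster &c,
                                  const Eigen::VectorXf &lambda) {
    Eigen::VectorXf lightBasis = c.projection * lambda;
    return c.reconstruction * lightBasis;
}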

Runtime Overview
At runtime, the implementation is split across several passes on the GPU. These passes are listed below. Passes 2-5 reference sections which describe them in more detail. Figure 5.2 shows how data flows between these passes and where precomputed data is used.

Lightmap Update
This pass rasterises the direct illumination in the scene to a lightmap.
After the first loop of the algorithm, this pass also accounts for secondary indirect light bounces. It adds the contribution of the indirect illumination lightmap from the previous frame to the direct illumination lightmap. This creates a feedback loop, as the probes updated in the next pass now see the indirect lighting from the previous frames.

Probe Update (Section 5.2)
This pass uses the lightmap from the previous step to update the probe radiance vector λ.

Cluster SVD Projection (Section 5.3)
This pass computes a light basis vector l for every cluster p by multiplying each precomputed cluster projection matrix by the probe radiance vector λ.

Irradiance Transport (Section 5.3)
This pass takes the light basis vector l_p for each cluster p from the previous pass and multiplies it with the corresponding receiver reconstruction matrix. This results in an n-dimensional vector whose nth element contains the reconstructed irradiance for the nth receiver in that cluster. The pass saves the reconstructed irradiance at each receiver's location in a new indirect illumination lightmap κ.

Forward Render Pass (Section 5.4)
Finally, the forward render pass samples the indirect irradiance from the indirect illumination lightmap κ and adds it to the direct lighting. A dilation pass is also run on the lightmap before the forward pass to remove dark seams around geometry edges (section 5.5).

Radiance Coefficient Generation
SRP transforms direct illumination stored in the probe radiance vector λ into indirect irradiance at a receiver by convolving it (which can be performed with a dot product) with a precomputed transport vector α. GPUs can perform this operation in real time, but we still must update λ every frame to support real-time dynamic changes in lighting. To update λ, we must project the radiance at each probe into spherical harmonics using the following equation:

$$\lambda_{ij} = \int_{S^2} L(p_i, \omega) \, Y_j(\omega) \, d\omega \approx \frac{4\pi}{N} \sum_{k=1}^{N} L(p_i, \omega_k) \, Y_j(\omega_k)$$

The serial implementation in Source Code 5.1 loops through all of the probes in a scene and computes the jth basis coefficient by summing each sample, L(p_i, ω)Y_j(ω), multiplied by a weight. As the samples are uniformly distributed over the unit sphere, the weight is 4π/N, where N is the number of sample directions. To evaluate the incoming radiance L(p_i, ω) at a probe, the algorithm samples from either a direct illumination lightmap or an environment map: if a ray fired from a probe in a sample direction hits a surface, the algorithm samples from the direct illumination lightmap; if it misses, it samples from an environment map. As casting a ray for every sample at runtime is too expensive, the samples are precomputed; for samples that hit scene surfaces, the lightmap UV coordinate of the hit location is saved, while the sample direction is saved for samples that miss. The basis functions are also precomputed in the directions of the precomputed samples. The only operations we must perform at runtime are sampling from the lightmap or environment map, multiplying by the appropriate precomputed basis function and weight, and summing the radiance samples. Source Code 5.2 shows an updated serial algorithm which uses two precomputed sample buffers: one buffer of lightmap UVs for the samples that hit and another buffer of directions for the samples that missed.

Source Code 5.2: Serial probe projection code on the CPU using precomputed sample data. Precomputed sample rays that hit surfaces sample from a lightmap using the UV coordinates of the hit location. Precomputed sample rays that miss sample from an environment map. Y is a buffer of basis functions evaluated in the precomputed sample directions and multiplied by the weight 4π/N.
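The original listing is not reproduced here; the following sketch reconstructs its structure from the description above, with illustrative names and hypothetical texture-lookup helpers:

#include <simd/simd.h>
#include <vector>

// Illustrative types; sampleLightmap and sampleEnvironment are hypothetical
// stand-ins for the engine's texture lookups.
struct ProbeSample {
    bool hit;        // did the precomputed ray hit geometry?
    simd_float2 uv;  // lightmap UV of the hit point (valid when hit is true)
    simd_float3 dir; // sample direction (used for the environment lookup on a miss)
};

simd_float3 sampleLightmap(simd_float2 uv);
simd_float3 sampleEnvironment(simd_float3 dir);

// samples[i] holds the precomputed samples for probe i; Y[i][k * coeffCount + j]
// holds Y_j evaluated in sample direction k, pre-multiplied by the weight 4*pi/N.
// lambda is laid out as probeCount x coeffCount.
void projectProbesSerial(const std::vector<std::vector<ProbeSample>> &samples,
                         const std::vector<std::vector<float>> &Y,
                         int coeffCount,
                         std::vector<simd_float3> &lambda)
{
    for (size_t i = 0; i < samples.size(); ++i) {
        for (int j = 0; j < coeffCount; ++j) {
            simd_float3 c = simd_make_float3(0.0f, 0.0f, 0.0f);
            for (size_t k = 0; k < samples[i].size(); ++k) {
                const ProbeSample &s = samples[i][k];
                // Rays that hit sample direct lighting from the lightmap;
                // rays that missed sample the environment map.
                simd_float3 L = s.hit ? sampleLightmap(s.uv)
                                      : sampleEnvironment(s.dir);
                c += L * Y[i][k * coeffCount + j];
            }
            lambda[i * coeffCount + j] = c;
        }
    }
}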

To implement this algorithm on the GPU, we must split the work into smaller chunks so that it can be executed in parallel. I perform the projection for all probes in a single compute pass. The pass assigns each probe to its own threadgroup, and each thread within that threadgroup handles a subset of the total samples for the probe. More specifically, each thread performs the per-sample sums of the serial algorithm in Source Code 5.2 for the subset of samples it is assigned. Each thread saves the result of its sums in threadgroup memory and then waits at a threadgroup memory barrier. This barrier blocks the execution of a thread until all threads in its threadgroup have reached the barrier. The first j threads in each threadgroup then sum the sums stored in threadgroup memory to produce each coefficient λ_ij for the probe. Source Code 5.3 contains pseudocode for the compute kernel.
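As Source Code 5.3 is not reproduced here, the following Metal sketch shows the same structure with illustrative names and sizes; for brevity it reduces each coefficient with a single thread rather than the first j threads, and assumes the per-sample radiance values have already been fetched into a buffer:

#include <metal_stdlib>
using namespace metal;

#define SAMPLES_PER_PROBE 4096  // assumed
#define COEFF_COUNT 64          // assumed
#define THREADS_PER_GROUP 256   // assumed

// One threadgroup per probe. radiance holds L(p_i, w_k); Y holds
// Y_j(w_k) * 4*pi/N, laid out [k][j].
kernel void projectProbeRadiance(
    device const float3 *radiance [[buffer(0)]],
    device const float  *Y        [[buffer(1)]],
    device float3       *lambda   [[buffer(2)]],
    uint probe [[threadgroup_position_in_grid]],
    uint tid   [[thread_position_in_threadgroup]])
{
    threadgroup float3 partial[THREADS_PER_GROUP];
    for (uint j = 0; j < COEFF_COUNT; ++j) {
        // Each thread sums its subset of the weighted samples.
        float3 sum = float3(0.0f);
        for (uint k = tid; k < SAMPLES_PER_PROBE; k += THREADS_PER_GROUP)
            sum += radiance[probe * SAMPLES_PER_PROBE + k] * Y[k * COEFF_COUNT + j];
        partial[tid] = sum;
        threadgroup_barrier(mem_flags::mem_threadgroup);
        // Reduce the partial sums into coefficient lambda_ij.
        if (tid == 0) {
            float3 c = partial[0];
            for (uint t = 1; t < THREADS_PER_GROUP; ++t)
                c += partial[t];
            lambda[probe * COEFF_COUNT + j] = c;
        }
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }
}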

Sample Generation
For our use case, Monte Carlo integration requires uniformly distributed samples to produce correct results [15]. For each probe p i , I generate uniform samples over a sphere centred at the probe using stratified sampling.
Stratified sampling works in two steps. First, we evenly distribute n samples over a unit square: we divide the square into a √n × √n grid and then choose a random sample point (x, y) within each cell. Second, we map each sample point (x, y) to the surface of the unit sphere (θ, φ):

$$\theta = \cos^{-1}(1 - 2x), \qquad \phi = 2\pi y$$

Source Code 5.3: For clarity and to save space, arguments to the kernel function and offsets to account for multiple probes are omitted. Each thread performs the sums for a subset of the total samples for a probe (lines 5 to 19). All threads then wait until every thread has reached the threadgroup memory barrier (line 21). Lastly, the first j threads sum the sums saved in threadgroup memory to produce each coefficient λ_ij using the sumAt function (lines 23 to 25). Y is a buffer of basis functions evaluated in the precomputed sample directions and multiplied by the weight 4π/N.
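A sketch of this stratified sphere sampling (standalone, illustrative code):

#include <simd/simd.h>
#include <cmath>
#include <random>
#include <vector>

// Divide the unit square into a g x g grid, jitter one point per cell, and
// map each (x, y) to the unit sphere with theta = acos(1 - 2x), phi = 2*pi*y.
std::vector<simd_float3> stratifiedSphereSamples(int g) {
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> jitter(0.0f, 1.0f);
    std::vector<simd_float3> samples;
    samples.reserve(g * g);
    for (int i = 0; i < g; ++i) {
        for (int k = 0; k < g; ++k) {
            float x = (i + jitter(rng)) / g;
            float y = (k + jitter(rng)) / g;
            float theta = std::acos(1.0f - 2.0f * x);
            float phi = 2.0f * float(M_PI) * y;
            samples.push_back(simd_make_float3(std::sin(theta) * std::cos(phi),
                                               std::sin(theta) * std::sin(phi),
                                               std::cos(theta)));
        }
    }
    return samples;
}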

Cluster SVD Projection and Irradiance Transport
To produce the indirect illumination lightmap, two matrix multiplications must be performed (section 4.5.2). I use one compute pass for each matrix multiplication. The first compute pass, cluster SVD projection, multiplies the cluster projection matrix Σ_c V^T by the probe radiance vector λ and outputs an n_c-dimensional light basis vector l for each cluster. To do this, each thread in the pass multiplies one row of the cluster projection matrix by the probe radiance vector. Most of the columns of the cluster projection matrix contain only zeroes, as probes which lie outside the cut-off radius of all receivers in a cluster make no contribution; these columns are not stored. Instead, each thread loops through the non-zero columns of the cluster projection matrix and uses a buffer that maps each stored column index to its actual column index in the original matrix, so that each stored column is multiplied with the correct element of the probe radiance vector.
The second pass, irradiance transport, reconstructs the irradiance at each receiver in a cluster. To do this, it multiplies the receiver reconstruction matrices with the corresponding light basis vector to produce the reconstructed indirect irradiance for each receiver in a cluster. Each thread performs the multiplication for a single receiver by multiplying a row of the receiver reconstruction matrix for a cluster by the light basis vector for that cluster. The multiplication produces the reconstructed indirect irradiance for that receiver which is then saved into an indirect illumination lightmap at the corresponding texel for that receiver.

Forward Render Pass
In my implementation, a forward render pass performs the last step of adding the contribution of the indirect lighting to the direct lighting by sampling from the indirect illumination lightmap. When sampling from the texture, it uses a linear texture filter to produce a smoothly blended result. While a forward renderer is used here, this could equally be performed in the geometry pass of a deferred lighting setup.

Lightmap Dilation
Dark seams can occur around the edges of geometry due to the limited resolution of lightmaps (Figures 5.3a and 5.3b). As texels inside geometry are black, when the forward render pass samples the lightmap using a linear filter, it blends the black texels inside geometry with the texels outside geometry, causing dark seams. To fix this, a dilation pass extends valid texels that are beside invalid black texels out by a one-texel radius so that no black texels remain near the edges of geometry (Figures 5.3c and 5.3d).
I implement dilation in a compute shader pass which runs over all texels in the lightmap. If a texel has an alpha of one, then the texel is valid and the shader leaves it as it is; however, if a texel has an alpha of zero, it is invalid, and so the mean contribution of its valid neighbouring texels is used instead.

Figure 5.3: Dark seams appear around the edges of geometry in 5.3a due to invalid black receivers inside geometry blending with valid receivers outside geometry. 5.3b shows these invalid texels from within the cuboid in 5.3a. 5.3d shows how lightmap dilation removes the seams by extending valid texels so the invalid texels are no longer near edges.
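A minimal Metal sketch of such a dilation pass (illustrative, not the exact shader):

#include <metal_stdlib>
using namespace metal;

// Valid texels (alpha == 1) pass through; invalid texels take the mean of
// their valid neighbours within a one-texel radius.
kernel void dilateLightmap(texture2d<float, access::read>  src [[texture(0)]],
                           texture2d<float, access::write> dst [[texture(1)]],
                           uint2 gid [[thread_position_in_grid]])
{
    if (gid.x >= src.get_width() || gid.y >= src.get_height()) return;
    float4 texel = src.read(gid);
    if (texel.a == 1.0f) { dst.write(texel, gid); return; }

    float3 sum = float3(0.0f);
    float count = 0.0f;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            int2 p = int2(gid) + int2(dx, dy);
            if (p.x < 0 || p.y < 0 ||
                p.x >= int(src.get_width()) || p.y >= int(src.get_height()))
                continue;
            float4 n = src.read(uint2(p));
            if (n.a == 1.0f) { sum += n.rgb; count += 1.0f; }
        }
    }
    dst.write(count > 0.0f ? float4(sum / count, 0.0f) : texel, gid);
}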

Transport Coefficient Precomputation
This section describes my implementation of a pipeline for precomputing SRP's transport coefficients. Although the transport coefficient precomputation is run on a high-end device, it still plays a part in the scalability of SRP, as parameters chosen for precomputation change SRP's quality and performance at runtime. The number of receivers and their positions, the number of probes and their positions, the number of basis coefficients per probe, and the number of PCA vectors for compression are all parameters set at precomputation time which affect the quality and performance of SRP.
One challenge of a GPU implementation is the limited memory available. As discussed in section 4.5 on CPCA compression, storing a transport coefficient for every receiver and every basis function consumes many gigabytes of memory. This means it is not possible to compute all transport coefficients at once. Fortunately, each transport coefficient can be solved for independently, which means the precomputation can be performed in batches. Each batch computes the transport coefficients for a portion of the receivers. The method computes the transport coefficients for each batch on the GPU and then copies them to the CPU, performing CPCA compression for that batch while the GPU computes the transport coefficients for the next batch. The process repeats until all batches are complete.
For irradiance transport, the goal is to compute a transport coefficient vector α for every receiver x. To do this, we must solve the following integral (derived in section 4.4) for every probe i and basis function j:

$$\alpha_{ij}(x) = \int_{\Omega} K_{ij}(x, \omega) \, d\omega \qquad (5.3)$$

To compute the transport coefficients for each batch, I use a method which progressively solves the integral in Equation 5.3, in a manner similar to how GPU path tracers like AMD's Radeon ProRender [51] progressively solve the rendering equation. The method runs in passes, where one sample is taken per receiver per pass. The more passes, the more samples per receiver, and the better the estimate for each transport coefficient becomes.

Sections 5.6.1 to 5.6.5 cover how I compute the transport coefficients on the GPU, and section 5.6.6 describes how I compress the coefficients using CPCA on the CPU.

Overview
The method I use to compute the transport coefficients solves Equation 5.3 progressively in passes. The goal of each pass is to take one sample of the transport kernel K_ij for each receiver in the batch. In each pass, the following five steps are performed:

1. Primary Ray Generation (Section 5.6.2)
Generate a ray for each lightmap texel (i.e. the receiver points x). The direction of the ray is chosen from the hemisphere centred around the world-space normal of the texel.

2. Primary Ray Intersection (Section 5.6.3)
Run an intersection query for the generated rays.

3. Mutual Visibility Ray Generation (Section 5.6.4)
For each primary ray, and for all the probes within the cut-off radius of the receiver the ray was fired from:
If the primary ray hit, generate a ray from the probe to the intersection point. This ray is used in the next step to make sure the probe can see the hit point.
If the primary ray missed, generate a ray from the probe in the direction of the primary ray. This ray is used in the next step to make sure the probe can see the environment lighting.

4. Mutual Visibility Check (Section 5.6.3)
Run an occlusion query for the rays generated in the previous step to check visibility between the probe and the intersection point, or between the probe and the environment lighting.

5. Accumulation (Section 5.6.5)
Steps 1-4 generate the data necessary to perform one step of Monte Carlo integration. To accumulate, for every receiver, sample the transport kernel K_ij(x, ω) for each nearby probe i and basis function j, and then accumulate each sample progressively using Welford's method [52].

Primary Ray Generation
The first step for each batch is generating the primary rays from the receiver points ( Figure 5.4). To do this, my implementation uses a compute shader which runs over all valid lightmap texels 3 . The compute shader I used to do this was based on a shader written by Thomas Roughton for their Master's thesis on path-traced lightmaps [53].  3 Not all texels in a lightmap are valid. Lightmap packing algorithms aim to pack surfaces as tightly as possible into a two-dimensional texture but there will still be gaps. The receivers for SRP are texels which map to a valid surfaces.

Each thread in the compute shader generates a ray. The ray's origin is a point chosen randomly within the lightmap texel, and the ray's direction is chosen using cosine-weighted sampling over a hemisphere centred at the normal of the receiver point.
Using cosine-weighted samples has two benefits over uniform sampling. First, Monte Carlo integration converges faster on a solution if the samples are drawn from a distribution similar to the integrand [15]. Second, when sampling from a non-uniform distribution, each sample of the function being integrated must be divided by the probability density function (PDF). For Equation 5.3, dividing by the PDF cancels the cosine term in the integral we are solving. This means it is not necessary to multiply by the cosine when accumulating the samples in step 5.
To show this mathematically, the estimator for the transport coefficient with PDF p is:

$$\alpha_{ij}(x) \approx \frac{1}{N} \sum_{k=1}^{N} \frac{K_{ij}(x, \omega_k)}{p(\omega_k)}$$

For cosine-weighted samples, the PDF for sampling over a hemisphere is cos θ / π:

$$\alpha_{ij}(x) \approx \frac{1}{N} \sum_{k=1}^{N} \frac{\pi \, K_{ij}(x, \omega_k)}{\cos\theta_k}$$

which, as K_ij contains a matching cos θ factor, simplifies to:

$$\alpha_{ij}(x) \approx \frac{\pi}{N} \sum_{k=1}^{N} \frac{w_i(x) \, v(p_i, \Gamma(x, \omega_k)) \, Y_j(\phi_i(\omega_k))}{\sum_{l} w_l(x) \, v(p_l, \Gamma(x, \omega_k))}$$

The factor of π can also be ignored if we compensate by not dividing the diffuse albedo by π at runtime.
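A sketch of cosine-weighted hemisphere sampling via the standard disk-projection construction (an illustrative helper, not the exact implementation):

#include <simd/simd.h>
#include <algorithm>
#include <cmath>

// Sample a direction about the normal n with pdf(w) = cos(theta)/pi:
// sample a unit disk (r = sqrt(u1), phi = 2*pi*u2) and project it up.
simd_float3 cosineSampleHemisphere(simd_float3 n, float u1, float u2) {
    // Build an orthonormal tangent frame around n.
    simd_float3 helper = fabsf(n.x) > 0.5f ? simd_make_float3(0, 1, 0)
                                           : simd_make_float3(1, 0, 0);
    simd_float3 t = simd_normalize(simd_cross(n, helper));
    simd_float3 b = simd_cross(n, t);
    float r = std::sqrt(u1);
    float phi = 2.0f * float(M_PI) * u2;
    float z = std::sqrt(std::max(0.0f, 1.0f - u1)); // cos(theta)
    return t * (r * std::cos(phi)) + b * (r * std::sin(phi)) + n * z;
}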

Primary Ray Intersection and Mutual Visibility Check
Both the primary ray intersection and mutual visibility check steps require ray casting. LlamaEngine has integrated support for both Radeon Rays [54] and Metal Performance Shaders [55], which both provide compute kernels for accelerated ray intersection and occlusion queries. An intersection query returns information about the surface the ray intersected, while an occlusion query only returns whether the ray hit or missed.

Mutual Visibility Ray Generation
This step generates the rays that the next step uses to check for mutual visibility.
For each primary ray, a compute shader generates a mutual visibility ray for every probe that contributes to the receiver the ray was fired from. A probe contributes if the weight term w(x) in the transport kernel is non-zero. In my implementation, this means that the probe is within a certain radius of the receiver. Therefore, I refer to these probes as nearby probes.
If the primary ray hits a surface, the mutual visibility ray for each nearby probe points from the probe towards the intersection point. This makes sure both the receiver and the probe can see the intersection point. I refer to these rays as probe-to-intersection rays.
If the primary ray does not hit any surfaces, the mutual visibility ray for each nearby probe instead points in the same direction as the primary ray. This makes sure both the receiver and the probe can see the same point in the environment lighting. Parallax error is not an issue with environment lighting as it is assumed to be infinitely far away. I refer to these rays as probe-to-environment rays.
Probe-to-environment rays are important regardless of whether a scene includes environment lighting or not. Every transport coefficient still needs to encode the fact that there is no radiance coming from a direction. Figure 5.5 shows the difference before and after accounting for probe-to-environment rays.
The direction of the probe-to-intersection ray is the parallax-corrected direction φ(ω). The direction of the probe-to-environment ray is also technically parallax corrected but is equal to the original direction ω.
Step 5 uses the corrected direction when evaluating the basis function in each transport kernel.
Lastly, there are two validation checks for probe-to-intersection rays; a ray is marked as inactive if it fails either check (see the sketch after this list):
1. The probe must be in front of the intersected surface. A probe behind the surface cannot see it.
2. The receiver-to-intersection direction must point against the intersection's surface normal. Otherwise, the surface is facing away from the receiver.
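To make the per-probe logic concrete, here is a minimal CPU-side sketch of how one primary ray expands into mutual visibility rays. The names are hypothetical; the real implementation is a compute shader that writes the rays and their active flags into buffers for the occlusion pass.

```python
import numpy as np

def mutual_visibility_rays(primary_origin, primary_dir, hit_point, hit_normal,
                           nearby_probe_positions, eps=1e-4):
    """For one primary ray, emit one mutual visibility ray per nearby probe.
    Returns (origin, direction, max_t, active) tuples; max_t = inf marks a
    probe-to-environment ray."""
    rays = []
    for probe in nearby_probe_positions:
        if hit_point is None:
            # Primary ray escaped: probe-to-environment ray in the same
            # direction as the primary ray (the environment is treated as
            # infinitely far away, so there is no parallax to correct).
            rays.append((probe, primary_dir, np.inf, True))
            continue
        to_hit = hit_point - probe
        dist = np.linalg.norm(to_hit)
        direction = to_hit / dist  # parallax-corrected direction phi(omega)
        # Check 1: the probe must be in front of the intersected surface.
        front_facing = np.dot(hit_point - probe, hit_normal) < 0.0
        # Check 2: the surface must face the receiver.
        faces_receiver = np.dot(hit_point - primary_origin, hit_normal) < 0.0
        # Clip max_t slightly short of the hit point to avoid self-intersection.
        rays.append((probe, direction, dist - eps, front_facing and faces_receiver))
    return rays
```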

Figure 5.5:
Before and after accounting for probe-to-environment rays. Both renders used one directional light and no environment lighting. In the top render, probe-to-environment rays were not accounted for in the mutual visibility ray generation step; the image appears too bright because, even though there is no environment lighting, the transport coefficients still need to encode that no radiance arrives from those directions. In the bottom render, the probe-to-environment rays are correctly accounted for.

Accumulation
The last step in a pass accumulates one sample of the transport kernel $K_{ij}(x, \omega)$ for every receiver $x$, nearby probe $i$, and basis function $j$. Recall the transport kernel from Equation 4.10 in section 4.4:

$$K_{ij}(x, \omega) = \frac{w_i(x)\, V_i(x, \omega)\, B_j(\varphi_i(\omega))}{\sum_k w_k(x)\, V_k(x, \omega)}$$

where $w_i(x)$ is the weight of probe $i$ at receiver $x$, $V_i$ is the mutual visibility term, and $B_j$ is basis function $j$ evaluated in the parallax-corrected direction $\varphi_i(\omega)$. To perform the accumulation, a compute shader runs over all the receivers in the batch. Each thread of the compute shader first computes the denominator shared between all the transport kernels, i.e. the total weighted-visibility over all probes. To avoid looping through all probes, the thread looks up a buffer precomputed on the CPU that contains a list of nearby probes for the current receiver. The thread then sums the weights of each nearby probe which passes the mutual visibility test. Mutual visibility is checked by reading from a buffer generated by the occlusion pass in step 4.
If the total weighted-visibility is not zero, then for every probe that passes the mutual visibility test, the thread updates the current estimate of the corresponding transport coefficient $a_{ij}$ by accumulating the contribution of the transport kernel $K_{ij}$ for every basis function $j$ in the corrected direction using Welford's method.
Welford's method [52] is a single-pass algorithm for computing the mean of a data set as samples become available. The method keeps track of the current mean and the total weight of the samples seen so far. A new sample is incorporated by adding to the mean the difference between the sample and the current mean, scaled by the sample's weight divided by the new total weight.
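As an illustration, a minimal CPU-side sketch of this running mean, assuming each sample carries a weight and that invalid samples (discussed next) are simply never added:

```python
class RunningMean:
    """Welford-style streaming weighted mean of one transport coefficient.
    Invalid samples (zero total weighted-visibility) are never added, so
    the mean is taken over valid samples only."""

    def __init__(self):
        self.mean = 0.0
        self.total_weight = 0.0

    def add(self, sample, weight=1.0):
        # Pull the mean towards the new sample in proportion to its
        # share of the total weight seen so far.
        self.total_weight += weight
        self.mean += (sample - self.mean) * (weight / self.total_weight)
```

With a weight of 1 per sample this reduces to the familiar incremental mean; skipping invalid samples rather than counting them against a fixed 1/N weight is what the adjusted per-sample weighting in the next paragraphs achieves.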
If the total weighted-visibility is zero, then the sample is invalid. Ideally, the total weighted-visibility will never be zero, as every sample should have at least one nearby probe that is mutually visible to the receiver and the intersection point. However, this is hard to guarantee when probes are placed manually, as they are in my implementation. Additionally, an edge case occurs when a lightmap texel lies both inside and outside geometry (Figure 5.6). Because the origin of a primary ray can be offset anywhere within a texel, a sample ray's origin may be inside geometry in some passes and outside it in others. When the origin is inside geometry, the receiver cannot see any probes, so the total weighted-visibility is zero and the sample is invalid.

The per-sample weight must be adjusted to account for these invalid samples. If all samples were valid, the per-sample weight would be 1/N, where N is the number of passes. However, if this weight is used regardless, areas will appear darker than they should (Figure 5.6).

Figure 5.6:
Left: Using an incorrect per-sample weight. In general, the render appears too dark, and dark seams appear near edges in geometry (regions (a) and (b)). The green region (b) is enlarged to show a receiver that is both inside and outside geometry; the outlined square in the enlarged region is a single receiver, the dotted line is inside the geometry, and the solid line is outside. Samples inside the geometry incorrectly contribute zero radiance, so when they are mixed with correct samples outside the geometry, the receiver appears darker than it should. Right: Using the adjusted sample weight correctly accounts for invalid samples.

Clustered Principal Component Analysis
After the transport coefficients have been computed for a pass, they are compressed on the CPU using CPCA. My implementation is mainly based on the original paper by Sloan et al. [17], but I have only implemented the static version of CPCA without iterative or adaptive refinement. For each cluster in a batch, I create a cluster transport matrix $T_c$ where each row contains the transport coefficients for one receiver in the cluster. To compute the SVD of this matrix, I use the dgesvd routine from the LAPACK library [56].
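As an illustration, a minimal numpy sketch of this static CPCA step, assuming $T_c$ has one row of coefficients per receiver (numpy's SVD stands in for LAPACK's dgesvd here; the names are hypothetical):

```python
import numpy as np

def compress_cluster(T_c, n_pca=32):
    """Static CPCA for one cluster. T_c has one row of transport
    coefficients per receiver. Returns the cluster mean, the n_pca
    principal directions, and per-receiver weights such that
    T_c is approximately mean + weights @ basis."""
    mean = T_c.mean(axis=0)
    centred = T_c - mean
    # Rows of vt are the principal directions of the centred matrix.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    basis = vt[:n_pca]              # (n_pca, n_coefficients)
    weights = centred @ basis.T     # (n_receivers, n_pca)
    return mean, basis, weights
```

At runtime, each receiver's coefficients are reconstructed as mean + weights @ basis, so only the per-cluster mean and basis plus 32 weights per receiver need to be stored.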
The clusters CPCA compresses can be computed in multiple ways. Sloan et al. [17] use the Linde-Buzo-Gray algorithm [57] to perform the clustering, but Silvennoinen and Lehtinen [6] instead use an axis-aligned bounding box (AABB) tree clustering technique as it produces smaller clusters. This speeds up compression, as computing the SVD of large matrices is expensive.
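The recursive construction I use is described in the next paragraph; as a minimal numpy sketch, assuming `points` is an array of receiver positions:

```python
import numpy as np

def aabb_cluster(points, indices=None, max_size=1024):
    """Recursively split the bounding box of the receivers orthogonal to
    its longest axis until each leaf holds at most `max_size` receivers.
    Returns a list of index arrays, one per leaf cluster."""
    if indices is None:
        indices = np.arange(len(points))
    if len(indices) <= max_size:
        return [indices]
    lo = points[indices].min(axis=0)
    hi = points[indices].max(axis=0)
    axis = int(np.argmax(hi - lo))        # longest axis of the AABB
    mid = 0.5 * (lo[axis] + hi[axis])     # split the box in half
    mask = points[indices, axis] <= mid
    left, right = indices[mask], indices[~mask]
    if len(left) == 0 or len(right) == 0:  # degenerate split: stop here
        return [indices]
    return (aabb_cluster(points, left, max_size)
            + aabb_cluster(points, right, max_size))
```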
To implement AABB clustering, I start with an AABB enclosing all the receivers at the root node. The AABB is divided in half by a split orthogonal to its longest axis, and a child node is added for each half (Figure 5.7). I repeat this process recursively until the leaves of the AABB tree contain at most 1024 receivers.

To determine whether SRP is scalable across a range of hardware, I performed three experiments which tested my implementation of SRP on a low-end, mid-range, and high-end device. The low-end device was an iPhone 7, the mid-range was a 2017 Macbook Pro with a Radeon 560 graphics card, and the high-end device was the same 2017 Macbook Pro with an AMD RX Vega 56 graphics card externally connected via a Sonnet eGFX Breakaway Box 550 (see Table 6.1 for more details). Four test scenes of varying complexity were used for the experiments; the complexity of a scene in this thesis is based on triangle count. Section 6.1 provides details on the four test scenes. As a preview of the results, Figure 6.1 shows my SRP implementation running the most complex test scene used in this thesis, Crytek Sponza, in real time, scaling from a Macbook with a high-end graphics card down to an iPhone 7.
The first experiment was a visual quality (section 6.2) and performance evaluation (section 6.3). The aim of the visual quality evaluation was to show how my SRP implementation visually compared to a path-traced lightmap reference and to direct-only lighting. The aim of the performance evaluation was to show how my SRP implementation performed in terms of frame time over the four test scenes across the three test devices.
The second experiment was to see whether using higher order spherical harmonics is worth the extra computation cost on low-end devices (section 6.4). To do this, I compared an order-7 reference image of a scene to the same scene rendered with lower order spherical harmonics. As well as comparing the entire image, I also compared specific regions within the image to see if error decreased at different rates for high and low frequency areas.
The last experiment focuses on scalability (section 6.5). It shows how frame time changes as spherical harmonic order, probe count, and receiver count are increased. The aim of this experiment was to see if and how changing these parameters could improve the performance of SRP and allow more complex scenes to scale down and run in real time on the iPhone 7.

Test Scenes
For the visual quality and performance evaluations, two simple scenes and two complex scenes were used (Table 6.2). Complexity is based on triangle count. The two simple scenes were Cornell Box, a scene often used as a benchmark for global illumination algorithms [59], and Blue-Orange Box, a scene I created which is similar to Cornell Box but longer, with the boxes closer to the scene surfaces to show indirect shadowing. The complex test scenes were Modern Hall [60] and Crytek Sponza [61].
The precomputation for all the scenes was performed using 4096 sample rays per receiver, 1024 sample rays per probe, and 32 PCA vectors per cluster for CPCA compression. The probes were manually placed near scene surfaces, and all scenes were rendered at a resolution of 1334 × 750 on all devices. This resolution was chosen as it is the maximum possible on the iPhone 7; however, note that the cost of the algorithm depends mainly on the lightmap resolution, not the screen resolution, as the main overhead during the lighting pass is a single filtered texture sample of the runtime-generated lightmap.

Visual Quality Evaluation
To perform the visual quality evaluation, my SRP implementation is compared to a ground-truth reference and to direct-only lighting. As the receivers in my implementation are lightmap texels, path-traced lightmaps were used as the ground-truth reference. Path-traced lightmaps are accurate but significantly more expensive than SRP, taking seconds or even minutes to produce renders without noise. They were generated using a progressive lightmapper developed by Thomas Roughton for their Master's thesis [53]. All of the path-traced lightmap reference images were rendered on the AMD RX Vega 56 graphics card.

Figure 6.4:
Indirect-only comparison between path-traced lightmaps and SRP for the two simple test scenes. Multiplication by the surface albedo is omitted in order to more clearly show the contribution of the indirect lighting.

Figure 6.5:
Indirect-only comparison between path-traced lightmaps and SRP for the two complex test scenes. Multiplication by the surface albedo is omitted in order to more clearly show the contribution of the indirect lighting.

Performance Evaluation
For the performance evaluation, the mean render time per frame was recorded over 1000 frames. To show the performance difference between low and high-order spherical harmonics, the frame time of SRP was recorded for both order-7 and order-2 spherical harmonics. Additionally, the frame time of direct lighting was recorded to show the baseline cost of rendering a scene in LlamaEngine.
For the two simple test scenes, SRP ran in real time at a minimum of 60 FPS across all three test devices for both order-7 and order-2 spherical harmonics (Figures 6.6a and 6.6b). SRP took more than twice as long to render on the iPhone 7 as on the two dedicated graphics cards. The performance gap between the iPhone 7 and the dedicated graphics cards is also larger for SRP than for direct lighting only. For example, for the Cornell Box scene, the difference between SRP with order-2 spherical harmonics and direct lighting was 2 ms on the Radeon 560, yet 9 ms on the iPhone 7.
The two complex test scenes ran in real time on the dedicated graphics cards. However, only Modern Hall ran in real time on the iPhone 7, at 33.67 FPS (29.7 ms per frame) with order-2 spherical harmonics. It ran slightly below real time with order-7 spherical harmonics at 21.34 FPS (46.84 ms per frame) (Figure 6.6c). Crytek Sponza ran at interactive frame rates of 16.36 FPS (61.1 ms per frame) with order-2 spherical harmonics and 13.24 FPS (75.5 ms per frame) with order-7 (Figure 6.6d).
While these results show the complex test scenes do not run in real time on the iPhone 7, the scalability results in section 6.5 show how the performance of SRP can be scaled by reducing the receiver and probe count.

Error vs Spherical Harmonic Order
The second experiment was to see whether using higher order spherical harmonics is worth the extra computation cost on low-end devices. I compared an order-7 indirect-only render of a test scene (Figure 6.7 top left) to the same scene rendered with lower order spherical harmonics (Figure 6.7 top right). The comparison was done using the mean squared error (MSE) of the lightness channel (L) of the renders in the CIE L*a*b colour space [62]. I chose to perform the comparison in the CIE L*a*b colour space because it is based on human perception of light and therefore gives a more intuitive error metric than a mean squared error in the RGB (red, green, blue) colour space. To see if the error changed at different rates in different areas of the render, I also performed this comparison in four regions (Figure 6.7 top left). Regions a) and b) were placed over areas with higher frequency changes: the edge of the object by the left wall, and the corner of the room. Regions c) and d) were placed over areas with low frequency changes: the back wall and the floor.
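Computing this metric needs only the L channel of CIE L*a*b, which depends solely on luminance. Below is a minimal numpy sketch, assuming linear-RGB input images and a D65 white point; L* here is on its conventional 0 to 100 scale, and the exact normalisation behind the reported MSE values may differ.

```python
import numpy as np

def lightness(rgb_linear):
    """CIE L* of a linear-RGB image (D65 white), i.e. the L channel of
    CIE L*a*b*. Only the luminance Y is needed to compute L*."""
    y = (0.2126 * rgb_linear[..., 0]
         + 0.7152 * rgb_linear[..., 1]
         + 0.0722 * rgb_linear[..., 2])
    # Standard CIE f(t) transform with its linear toe for small values.
    f = np.where(y > (6 / 29) ** 3, np.cbrt(y), y / (3 * (6 / 29) ** 2) + 4 / 29)
    return 116.0 * f - 16.0

def lightness_mse(render, reference, region=None):
    """MSE in L* between two renders, optionally restricted to a region
    given as (y0, y1, x0, x1) pixel bounds."""
    err = (lightness(render) - lightness(reference)) ** 2
    if region is not None:
        y0, y1, x0, x1 = region
        err = err[y0:y1, x0:x1]
    return float(err.mean())
```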
For the entire image, the MSE in lightness decreased as spherical harmonic order increased (grey line in Figure 6.7 bottom right). For the regions, the lightness MSE decreased faster for the low frequency regions than for the higher frequency regions as spherical harmonic order was increased. The low frequency regions dropped to an MSE below 0.01 by order-1, but it took the high frequency regions until order-4 to do the same. This is shown in both the bottom left and bottom right sub-figures of Figure 6.7.

Figure 6.7:
Top left: the order-7 reference render with the four comparison regions marked. Top right: the same scene rendered with lower order spherical harmonics. Bottom left: the original render and greyscale difference images of the L channel (CIE L*a*b) between the order-7 render and each lower order render, where an MSE of 0 is black and 0.01 is white. Bottom right: graph of the MSE in lightness for each region as spherical harmonic order increases. The graph omits order-0 to avoid skewing the Y-axis scale.

Scalability
SRP has scaling parameters which can be adjusted in the precomputation process to reduce the amount of computation required at runtime. These include spherical harmonic order, receiver count, and probe count. To see how each of these parameters affects frame time, one parameter at a time was increased in increments, and the mean frame time over 1000 frames was recorded for each increment on the Blue-Orange Box test scene.

Spherical Harmonic Order
Spherical harmonic order was increased from order-0 up to order-7. The probe count was fixed at 20 and receiver count at 50 784.
Interestingly, the increase in frame time was higher for the dedicated graphics cards: the difference between order-7 and order-2 spherical harmonics was 1.7 ms for the Radeon 560 and 2.3 ms for the Vega 56, whereas on the iPhone 7 it was 1.1 ms.

Probe Count
Probe count was increased from 20 to 200. Spherical harmonic order was fixed at 2 and receiver count at 50 784.
The number of probes in a scene had a significantly larger impact on the iPhone 7. The difference in frame time between 200 probes and 20 probes was 49.3 ms for the iPhone 7, 6.7 ms for the Radeon 560, and 2.25 ms for the Vega 56. For a receiver count of 50 784 with order-2 spherical harmonics, around 100 probes was the limit for staying under the real-time cut-off of 33.3 ms on the iPhone 7.

Receiver Count
Receiver count was increased in increments from 50 784 to 1 000 000 receivers. Spherical harmonic order was fixed at 2 and probe count at 20.
As with the probe count, increasing the receiver count had a significantly larger impact on the iPhone 7. The frame time difference between 1 000 000 and 50 784 receivers was 150.1 ms on the iPhone 7, 11.7 ms on the Radeon 560, and 1.9 ms on the Vega 56.

Reducing Receiver Counts to Improve Performance
The results from the previous section show that performance improves if any of the spherical harmonic order, probe count, or receiver count are reduced. To see if Crytek Sponza could run in real time on the iPhone 7 while still producing acceptable results, the receiver count was decreased in increments from 437 557 to 56 581 receivers. All of the renders in this section were performed on the iPhone 7 using order-2 spherical harmonics and 169 probes.
Crytek Sponza ran in real time at 31.5 FPS (31.7 ms per frame) when the receiver count was reduced to 120 479, but there are some quality differences when compared to the render with 437 557 receivers ( Figure 6.9). Notably, the indirect lighting is slightly darker, the indirect shadows have less definition, and the indirect colour bleeding from the cloth curtains extends over a slightly larger area.
The quality differences are clearer when only indirect lighting is shown without surface albedo. To a lesser extent, the same quality differences appear when the receiver count of a path-traced lightmap reference render is reduced. Figure 6.11 shows a path-traced lightmap reference render of Crytek Sponza with 437 557 (6.11a) and 56 581 receivers (6.11b). The renders show only indirect lighting without surface albedo, and no linear filter is applied, to clearly show the size of the receivers. At 56 581 receivers, the path-traced lightmap render also becomes darker, indirect shadows are no longer visible, and the indirect colour bleeding extends out slightly more. However, these quality differences are much more pronounced in the same render using SRP (Figure 6.12b). Additionally, the path-traced lightmap reference render with 437 557 receivers has much more detail than SRP with 437 557 receivers. For example, the yellow seal on the curtains is visible in the path-traced reference (Figure 6.11a) but not in the SRP version (Figure 6.12a).

Analysis
The performance evaluation results (section 6.3) showed that, at least for the simple scenes, it is possible for SRP to run in real time on an iPhone 7. The two simple test scenes ran on all devices at 60 FPS (Figures 6.6a and 6.6b), but for the two complex test scenes, SRP only ran in real time on the iPhone 7 for Modern Hall with order-2 spherical harmonics. For Crytek Sponza, SRP still ran interactively on the iPhone 7 at 16.4 FPS for order-2 spherical harmonics, but it did not run above the 30 FPS real-time threshold.
The visual evaluation results (section 6.2) show that my SRP implementation is comparable to a path-traced lightmap reference image. Although there are differences between them, SRP is considerably faster. For Crytek Sponza, the path-traced image took 3 minutes and 12 seconds to render compared to 75.5 ms on the iPhone 7. The main differences between SRP and the path-traced lightmaps were less defined indirect shadows and colour bleeding (Figure 6.13). This lack of definition is due to the band-limiting of the probes, the number of probes, and the probe positions. It is not because of the limited resolution of the receivers as the path-traced lightmap reference renders use the same receiver points as the SRP renders yet still have more definition.
The error vs order results (section 6.4) indicate that high-order spherical harmonics may not be worth the extra computation cost on low-end devices. For the low-frequency regions, the error is significantly reduced by order-2 spherical harmonics. This is expected as spherical harmonics can capture Lambertian diffuse irradiance signals with 1% error [10]. However, higher order spherical harmonics are still needed to better capture higher frequency information like visibility. For the Blue-Orange test scene with one directional light, the reduction in lightness MSE between order-3 and order-7 spherical harmonics was minimal for both the low-frequency and high-frequency regions, and not worth the 0.82 ms increased cost on the iPhone 7 between order-2 and order-7 harmonics. However, as the experiment only tested one scene, the results can only indicate and do not confirm that high-order spherical harmonics are not worth the extra computation. There may be some scenes where high-order spherical harmonics have a greater impact.

The scalability results (section 6.5) show that the receiver count has a large impact on performance for the iPhone 7. Reducing the receiver count from 437 557 to 120 479 receivers for Crytek Sponza brought the per-frame time down to 31.7 ms (section 6.6), just under the real-time cut-off of 33.33 ms. The render with 120 479 receivers was still comparable to the render with 437 557 receivers, although there were slight reductions in quality: the render was slightly darker, indirect shadows were a little less defined, and the coloured indirect lighting from the curtains bled further across the ground. Once the receiver count was reduced to 56 581, however, the render was no longer comparable. It was significantly darker, indirect shadows were barely visible, and much more colour bleeding was apparent.
The darkening of the renders is mostly due to the receiver resolution. If half a receiver is lit by a light, then the whole receiver will be half the intensity of the lit half as each receiver stores the average irradiance of the patch it covers. This makes the half lit by the light appear too dark and the half in shadow appear too bright. With fewer receivers in a scene, the area each receiver covers is larger and so it is more likely that a receiver will cover an area which differs in intensity. A slightly darker image was expected when rendering Crytek Sponza with fewer receivers because some receivers were positioned underneath the curtains. These receivers were both in light and shadow at the same time. Figure 6.11b shows a path-traced lightmap reference render of Crytek Sponza with 56 581 receivers; as expected, the receivers underneath the curtains are considerably darker. However, the same render with SRP in Figure 6.12b is even darker still and there is considerably more colour bleeding over the receivers in the middle of the courtyard. These differences are likely due to the band-limited probes and their placement, but without further investigation, it is hard to tell if the differences are inherent to SRP or due to an implementation error.
Either way, the results show that receiver count is a point of scalability which can be reduced to improve performance at the cost of quality, and the renders are still visually comparable as long as the receiver count is not reduced too much. In this case, reducing the receiver count allowed Crytek Sponza to run in real time on the low-end iPhone 7.
The original implementation by Silvennoinen and Lehtinen [6] rendered their most complex test scene, Brutalist Hall, in 4.87 ms of GPU time on an Nvidia Titan X, while the most complex test scene in this thesis took 14.26 ms on the AMD Vega 56 graphics card. This is a significant difference, especially given that the specifications of the Vega 56 are generally better than those of the Titan X. On top of this, Brutalist Hall is more complex than Crytek Sponza: its triangle count is not published, but it uses 115 more probes and has 129 813 more receivers. With more time, effort, and expertise spent on optimisation, I believe the performance cost of my SRP implementation could be considerably reduced, even for low-end devices like the iPhone 7.
There are two main areas where future work could improve the performance of my implementation. Firstly, the compute shaders my implementation uses to perform the matrix multiplications needed to reconstruct irradiance at the receivers use a naïve approach (section 5.3). Each compute thread for both multiplications only multiplies a single row of the matrix; there is an overhead associated with spawning threads, so reducing the number of threads by increasing the number of rows each thread multiplies may improve performance. Additionally, as the cluster SVD projection multiplication is sparse, it would be worth investigating efficient algorithms available for performing sparse matrix multiplication on the GPU [64,65]. Secondly, my implementation projects radiance at all probes in a scene into spherical harmonics every frame. It would be interesting to see if staggering probe updates across frames could reduce costs without a perceivable drop in quality; for example by updating half of the probes one frame, then the other half on the next, or by updating probes closer to the camera more often.
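As a sketch of the second idea, a hypothetical round-robin schedule that refreshes a fraction of the probes each frame (updating probes closer to the camera more often would need a priority scheme instead):

```python
def probes_to_update(probe_count, frame_index, groups=2):
    """Round-robin scheduling: update 1/groups of the probes each frame,
    so the full probe set is refreshed every `groups` frames."""
    return [i for i in range(probe_count) if i % groups == frame_index % groups]
```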
There are a few other areas which would be worth exploring in future work. My SRP implementation was limited to irradiance transport. Silvennoinen and Lehtinen support radiance transport in their implementation, which allows for normal mapping and glossy BRDFs on the final bounce towards the camera; they had to reduce spherical harmonics to order-4 to maintain performance. It would be interesting to see if low-end devices like the iPhone 7 could handle radiance transport in real time. Next, my thesis did not explore any automatic probe placement methods; probes were placed manually for all results. This is not ideal, as probe placement affects the quality of renders. Exploring automatic probe placement methods to help with the scalability of SRP would also be worthwhile future work. Fewer probes placed in better positions could provide higher quality results at a lower performance cost.
Real-time global illumination that scales from low to high-end devices means applications that require realistic lighting can reach wider audiences by running on more devices. This thesis shows that SRP scales from a high-end AMD RX Vega 56 graphics card down to an iPhone 7 while remaining visually comparable to a path-traced lightmap reference. Additionally, an optimised implementation would run even faster, making SRP viable for applications that need to run other computation alongside it.