Contextual Learning for Content Caching With Unknown Time-Varying Popularity Profiles via Incremental Clustering

With the rapid development of social networks and high-quality video sharing services, the demand for delivering large quantities of high-quality content under stringent end-to-end delay requirements is increasing. To meet this demand, we study the content caching problem at a network edge server, modelled as a Markov decision process, when the popularity profiles are unknown and time-varying. In order to adapt to the changing trends of content popularity, a context-aware popularity learning algorithm is proposed. We prove that the learning error of this scheme is sublinear in the number of requests. In light of the learned popularities, a reinforcement learning-based caching scheme is designed on top of the state-action-reward-state-action algorithm with a function approximation. A reactive caching algorithm is also proposed to reduce the complexity. The time complexities of both caching schemes are studied to demonstrate their feasibility in real-time systems, and a theoretical analysis is performed to prove that the cache hit rate of the reactive caching algorithm asymptotically converges to the optimal cache hit rate. Finally, simulations are presented to demonstrate the superiority of the proposed algorithms.


I. INTRODUCTION
With the proliferation of machine-type communications and increasing user demand for video streaming, content caching is emerging as a key technology to reduce the delays experienced by users and network congestion. For instance, according to [1], Internet video is expected to account for 82 percent of all business Internet traffic by 2022. Popularity plays an important role in video traffic; for instance, only one percent of the popular Facebook videos represent 83 percent of total watch time [2]. Hence, a proper understanding of content popularity can aid efficient cache deployment and algorithm design.
However, understanding and tracking the variation of content popularities is challenging. Classical schemes such as the least frequently used (LFU) and the least recently used (LRU) are widely applied in current caching systems. LFU can be treated as optimal under an independent reference model (IRM), whereas LRU reaches the optimal competitive ratio when the requests are made adversarially. However, both struggle to handle non-stationary popularity profiles, as is the case in practical caching networks. In addition, although it is possible to estimate the global popularity in systems such as on-demand video streaming, the global popularity may not match the local popularity because servers located at the network edge can only serve a small geographical area with limited requests, and the trends of the global popularity and the local popularity can be totally different. These challenges motivate us to study the content caching scheme of an edge server under non-stationary and time-varying popularity profiles without any global information.
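As a concrete illustration of the classical policies discussed above, the LRU eviction rule can be sketched in a few lines (a minimal Python sketch for illustration only; the class and method names are our own, not part of the proposed scheme):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least recently used file on overflow."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()  # file_id -> True, ordered by recency

    def request(self, file_id):
        """Returns True on a cache hit, False on a miss."""
        if file_id in self.store:
            self.store.move_to_end(file_id)  # mark as most recently used
            return True
        if len(self.store) >= self.capacity:
            self.store.popitem(last=False)   # evict the least recently used file
        self.store[file_id] = True
        return False
```

Under a popularity shift, such a rule reacts only to recency, which is precisely the limitation discussed above.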
A. Related Work

Content popularities have been studied from two perspectives. First, a large proportion of papers employ the IRM to track content popularities and synthesize user request patterns, which assumes the user requests are drawn in an i.i.d. fashion from a predefined distribution [3]-[5]. On the other hand, many papers work on learning the content popularity profiles and then optimizing the caching system. Social networks and user mobility information are considered to learn the content popularities, and various types of machine learning schemes are utilized. However, most of them assume that prior knowledge of the content popularities is available for the cache placement. Alternatively, a training session is required to learn the file popularities (which means these schemes are not online). Regarding placement strategies, deterministic caching schemes determine the content placement by solving optimization problems based on some knowledge of the network and the contents. Probabilistic caching (e.g., [13], [14]) assumes that each cache-enabled node has a probability mass function (PMF) for caching different content files rather than deterministically caching the popular files. Coded caching (e.g., [15], [16]) is another caching technology in which each content is partitioned into several coded fragments and each cache-enabled node is only able to cache part of the fragments, so that a request must be served by a set of nodes concurrently.
From another perspective, combining learning and content caching has become a promising research direction because learning algorithms can be used to estimate file popularity and user request patterns, especially with the help of extra side information. In [6], an extreme learning machine neural network algorithm was proposed for content popularity prediction, utilizing the learned popularity to initialize the caching capacity of each base station (BS) and to decide which files should be cached at each BS. The authors in [17] collected not only the real request information from wireless access points and BSs but also user mobility and location information to design a geo-collaborative caching strategy. However, the main drawback of these learning algorithms is the need for prior knowledge and offline training. For these schemes, the data collected for learning could become outdated, and the caching performance would then be unpredictable.
To address these issues, context-aware online learning has been explored as a new strategy to enhance content caching performance and accelerate learning. In such schemes, the context information of requests is collected to aid learning. For example, in [18], each video instance with context information was mapped to a point in a context space, and, by grouping nearby points, an average popularity value was maintained for each group to forecast future video popularity. Moreover, in [19] and [20], the caching decision problem is modelled as a contextual multi-armed bandit problem and a reinforcement learning (RL) algorithm is proposed to choose the file that should be cached. Furthermore, [21] and [22] utilized context information by designing a grid-based partitioning method to group requests into different hypercubes and estimate the file popularities. The main drawback of this approach is that the partitioning method is not sufficiently flexible. To improve the clustering process and to learn the file popularities better, we propose a context-aware popularity learning scheme that groups points into adaptive clusters based on the Euclidean distance via incremental clustering. Numerical results show that this scheme creates a proper partition that enhances the cache hit rate (CHR).
In addition, RL has been considered a good learning tool in caching algorithm design due to its ability to solve decision-making problems in an interactive environment rather than using only a fixed dataset. In [23], a Q-learning based caching policy was proposed to dynamically cache files at small base stations with the aim of maximizing the CHR. The work in [24] leveraged RL to perceive the file popularities and to solve a proactive caching problem in wireless networks to minimize the average energy cost. In [25], a model-free RL algorithm was proposed to solve the caching problem for an energy-harvesting access point. In [26], a multi-armed bandit (MAB) problem was formulated to learn the file popularities and to maximize the caching reward.

B. Contributions
The problems identified in both the classical and novel caching schemes reemphasize a practical need for a robust online caching scheme that provides stable caching performance under time-varying content popularity. In this work, we model a realistic caching decision problem in an edge server as a Markov decision process (MDP) and rigorously demonstrate how to learn the popularities of contents online and use the learning results to improve cache management (caching decisions). We first design a context-aware popularity learning algorithm to track the time-varying file popularity, which is then used in RL to solve the MDP for dynamically updating cached contents in edge servers. Moreover, a reactive caching algorithm is studied to lower the complexity of the RL-based caching scheme. Our algorithm requires neither prior knowledge about the users and their content requests nor any training sessions, which may be inaccurate, outdated and expensive to obtain. The contributions of this paper are summarized below:
• We propose a context-aware popularity learning algorithm to learn the time-varying file popularity profiles. In particular, an incremental clustering algorithm is applied to requests with varying context information. This enables us to exploit the similarity among requests, which enhances the learning rate and accuracy.
• The caching decision problem is modelled as a non-stationary MDP. By invoking a linear function approximation, an RL-based content caching scheme is designed via state-action-reward-state-action (SARSA). By incorporating the knowledge learned from the context-aware popularity learning algorithm, the RL is accelerated and the caching decisions are improved. Enlightened by the RL-based caching, a reactive caching algorithm is proposed to reduce the computational complexity.
• A rigorous theoretical analysis of the popularity learning performance is provided and a sublinear learning error over time is demonstrated. We prove that the proposed reactive caching scheme converges to the optimal caching scheme with an increasing number of requests and for a given true file popularity. Moreover, the time complexity of the proposed algorithms is shown to be competitively low, which broadens the scope of the algorithms for practical applications.
• Multiple settings of time-varying popularity profiles are designed for performance evaluation, by simulating both independent and temporally correlated file request processes. Numerical results confirm that both algorithms provide a more robust caching performance compared to several existing solutions when the file popularities are unknown and time-varying.
The remainder of this paper is organized as follows. Section II describes the system model. The context-aware popularity learning algorithm is presented in Section III. The detailed RL-based caching algorithm and the reactive caching algorithm are described in Section IV, followed by their theoretical performance in Section V. The simulation setup and results are presented in Section VI, and the conclusions are drawn in Section VII.

II. SYSTEM MODEL
We consider a content delivery network where a content provider (CP) has a library of files F = {1, 2, . . . , F}. The number of files F may be very large, and caching is a viable solution to improve the quality of service. We assume a cache-enabled server at the network edge whose caching capacity is M, representing the maximum number of contents that can be locally stored. We focus on the caching decision problem for a single server to design a decentralized caching scheme and to maximize the CHR of each node independently. Without loss of generality and for the sake of simplicity, we assume the file sizes are identical. Consider req_n as the n-th content request made by the users, where n ∈ N = {1, 2, . . . , N}. Each request is represented by a 3-tuple req_n = <f_n, t_n, ν_n>, where f_n ∈ F is the file being requested, t_n is the time at which the request was made, and ν_n is the context vector associated with the request. Generally, the context data is a d-dimensional vector and each element represents one piece of context information.
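The request tuple req_n = <f_n, t_n, ν_n> can be made concrete with a small sketch (illustrative Python; the field names are our own):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Request:
    """One content request req_n = <f_n, t_n, v_n> (field names are illustrative)."""
    file_id: int                 # f_n: index of the requested file in the library F
    time: float                  # t_n: time at which the request was made
    context: Tuple[float, ...]   # v_n: d-dimensional context vector

# Example: file 3 requested at t = 0 with a 3-dimensional context vector
req = Request(file_id=3, time=0.0, context=(12.0, 80.0, 431.0))
```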
In this section, we model the caching decision process as a non-stationary MDP, which is an appropriate mathematical framework to model sequential decision making under the random dynamics of the system under study [27]. The MDP formulation is useful to model and study the long-term effect of each caching decision on the CHR. In this problem, the server is an agent that decides whether and how to cache the requested file to maximize the long-term CHR. In particular, at each time step n ∈ {1, 2, . . .}, the server picks a caching action a_n from the action space A given the current state of the system g_n ∈ G, where G is the state space. Given the current state-action pair (g_n, a_n), the server moves to some state g′ with probability Pr(g′|g_n, a_n) and receives a reward r_n(g_n, a_n) [28]. In the following, the action space, states, transition probabilities, and reward function are defined.

A. Action Space
When a request arrives, the server checks whether it can be served locally. No change is required if the file is available in the cache. Otherwise, the requested file will be retrieved from the CP, and the server decides whether and how to update its cache, taking into account the possibility of future requests. Since we assume a time-varying popularity profile in this work, replacing the least popular file in the cache with the requested file is not always optimal. In such a case, it is imperative to take precautions and explore all caching possibilities to some extent to balance the exploitation-exploration trade-off and track the time-varying popularity profile. Therefore, the server should be able to cache the requested file by replacing any of the files currently in the cache. Given the cache capacity M, this implies that the action space of the server has M + 2 possible actions, A = {α_1, α_2, . . . , α_{M+2}}, where
• α_1: The server caches the requested file by replacing the most popular file in the cache.
. . .
• α_M: The server caches the requested file by replacing the M-th most popular file in the cache.
• α_{M+1}: The requested file is retrieved from the CP but not cached.
• α_{M+2}: The server serves the requested file locally without changing its cache.
The server chooses among the first M + 1 actions if the requested file is not currently cached. Otherwise, if the requested file is locally available, the last action α_{M+2} is invoked.

B. State Space and Transition Probability
The state of the process is determined by the currently cached files and the requested file. In particular, the state at time step n is defined as g_n = <S_n, f_n>, where S_n represents the set of cached files and f_n represents the requested file at time step n. More specifically, suppose the set of cached files is S_n = {s_n^1, s_n^2, · · · , s_n^M}, in which the cached files are indexed in decreasing order of file popularity, and let S′ denote the subsequent cache state after taking an action. Moreover, define f′ ∈ F as the next requested file.
The transitions among the states can be modelled by a Markov chain. Let us define the transition probability from state g_n to state g′. When an action is taken, the cache state transfers from S_n to S′ deterministically. Therefore, the transition probability depends solely on the popularity of the next requested file f′ at time step n, which is defined as P_{f′}(t_n).
If the requested file f_n ∉ S_n (i.e., it cannot be locally served) and an action α_m ∈ {α_1, α_2, · · · , α_M} is taken, the transition probability can be written as
Pr(g′|g_n, a_n = α_m) = P_{f′}(t_n).
If the requested file f_n ∉ S_n and action α_{M+1} is taken, the requested file will be served by the CP without replacing the cached files, namely S′ = S_n. In this case, the transition probability is
Pr(g′|g_n, a_n = α_{M+1}) = P_{f′}(t_n).
Otherwise, if f_n ∈ S_n, the requested file can be served locally and α_{M+2} will be taken. In this case, the cached files will not be replaced and the transition probability is
Pr(g′|g_n, a_n = α_{M+2}) = P_{f′}(t_n).
The sizes of the state space and the transition probability matrix depend on the file library size and the cache size of the server. In this regard, the size of the state space is F\binom{F}{M}, and the size of the transition probability matrix is F\binom{F}{M} × F\binom{F}{M} for each of the M + 2 actions.
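Since the cache-state part of the transition is deterministic, it can be sketched as a small helper (illustrative Python; the function name and the 0-indexed action encoding are our own conventions):

```python
def next_cache_state(cached, requested, action, capacity):
    """Deterministic cache-state transition S_n -> S' for the action space
    of Sec. II-A. `cached` is ordered by decreasing estimated popularity.
    Actions (0-indexed here for convenience):
      0 .. capacity-1 : replace the (m+1)-th most popular cached file (alpha_m)
      capacity        : fetch from the CP without caching (alpha_{M+1})
      capacity + 1    : serve locally, cache unchanged (alpha_{M+2})
    """
    if action < capacity:              # alpha_1 .. alpha_M: swap one file
        assert requested not in cached
        new_state = list(cached)
        new_state[action] = requested  # replace the chosen file
        return new_state
    return list(cached)                # alpha_{M+1} or alpha_{M+2}: no change
```

Only the requested-file part of the next state is random, with probability P_{f′}(t_n), which is why the transition probabilities above all reduce to the popularity of the next request.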

C. Reward Function
In MDP, the reward is the return of the process after transferring from one state to another under a taken action. The CHR is an appropriate metric to evaluate the performance of content caching schemes. The CHR H can be calculated as H = N L /N Total , where N L is the number of locally served requests and N Total is the total number of received requests.
Caching relatively popular files can increase the CHR by locally serving more requests. Therefore, we consider the file popularity when defining the reward. To prioritize caching relatively popular files, the reward function is defined as
r_n(g_n, a_n) = P_{f_n}(t_n) − P_{s_n^m}(t_n), if a_n = α_m, 1 ≤ m ≤ M,
r_n(g_n, a_n) = 0, if a_n = α_{M+1},
r_n(g_n, a_n) = P_{f_n}(t_n), if a_n = α_{M+2},
where P_{f_n}(t_n) represents the popularity of the requested file and P_{s_n^m}(t_n) represents the popularity of the replaced file. If a cached file is replaced by the requested file, the reward reflects the change in file popularity, P_{f_n}(t_n) − P_{s_n^m}(t_n). Note that in this case the reward can be negative if the requested file is less popular than the replaced file. In addition, if action α_{M+1} is taken, the cache remains unchanged and the reward is set to zero since the requested file is not locally served. Finally, if a requested file is locally served, which corresponds to action α_{M+2}, the popularity of the requested file is used as the reward.
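The piecewise reward above can be sketched as follows (illustrative Python; the 0-indexed action encoding and the names are our own, and the popularity values would come from the learning scheme of Section III):

```python
def reward(action, capacity, pop_requested, pop_replaced=None):
    """Reward of Sec. II-C (a sketch).
    - actions 0..capacity-1 (alpha_m): popularity gain of the swap, may be negative
    - action == capacity (alpha_{M+1}): 0, the request is not served locally
    - action == capacity+1 (alpha_{M+2}): popularity of the locally served file
    """
    if action < capacity:          # replace the (action+1)-th most popular cached file
        return pop_requested - pop_replaced
    if action == capacity:         # fetch from the CP without caching
        return 0.0
    return pop_requested           # local cache hit
```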

D. Action-Value Function
The action-value function is used to quantify how beneficial it is for the agent to perform a given action in a given state under a specific policy. Define π_n(g) = a as the policy of choosing action a in state g at decision time step n. The action-value function of taking action a in state g at time step n under a policy π_n can be defined as
Q^π_n(g, a) = E[ Σ_{k=0}^{∞} γ^k r_{n+k}(g_{n+k}, a_{n+k}) | g_n = g, a_n = a ],
where γ ∈ [0, 1) is the discount factor reflecting the importance of the long-term reward compared to the current reward.

III. CONTEXT-AWARE POPULARITY LEARNING
In the MDP formulated in the previous section, the reward function and transition probabilities depend on the file popularities. Therefore, we first need to learn and track the popularities in order to solve the MDP. A simple and effective way to learn the popularity of files is to calculate the frequency with which the files are requested based on the request history. However, this method faces several problems. First, since the server is located at the network edge, the number of requests is limited in the early stage, so the calculated frequency might be inaccurate, especially when the file popularities are non-stationary and time-varying. In addition, the similarity between files is not considered, while multiple files with similar context information may tend to have close popularities. Therefore, in this section, we propose a context-aware popularity learning algorithm to track the time-varying content popularities.

A. Context Information Management
Context information represents the features of the content and captures the situation under which the request is made. The file request history can be treated as context information [22]. For example, the context information may include the request counts of a file in the last hour, the last day or the last week. With context information, each request req_n is associated with a multi-dimensional context vector ν_n (whose dimension d depends on how many types of context information are available), which can be mapped to a request point in the context space V, where each type of context information is treated as one axis.
In order to overcome the two deficiencies of popularity learning mentioned above, we apply a clustering algorithm to the request points to group them into multiple sets. After the clustering process, request points with similar context information are expected to be in the same set. Hence, instead of learning the popularity of each file, we learn the popularity of each set. By utilizing this context information, we have many more points with which to estimate the time-varying popularities, which greatly enhances the accuracy and the learning rate of the popularity estimation.
However, the key challenge is the efficient and dynamic clustering of different request points with similar context information. K-means clustering is a classic scheme that partitions N request points into K clusters. However, K-means is designed for a fixed number of points, whereas in our problem the number of request points increases over time. To solve this problem, inspired by and advancing [29], we propose an incremental clustering-assisted learning algorithm which is suitable for learning the time-varying file popularities by processing the context information incrementally. Figure 1 illustrates an example of clustering in a two-dimensional context space V. The red points demonstrate various requests received over time, which have been grouped into four clusters. In this way, request points with similar context information are grouped into the same cluster and treated as having the same popularity. Since the instant frequency of a requested file can be easily gathered, we consider the average instant frequency of the request points in a cluster as the popularity of that cluster.

B. Incremental Clustering-Assisted Popularity Learning
Define c k as the center of cluster k ∈ {1, . . . , K n }, where K n represents the total number of clusters created till t n . The clustering and learning process is divided into three steps as follows.
In the first step, once a new request arrives, if K n is smaller than a certain threshold κ, the server will calculate the Euclidean distances between ν n (the context point of req n ) and each existing cluster center. Let us define D min n as the distance between ν n and the closest cluster center. If D min n is larger than the initial clustering distance threshold δ (which is a predefined parameter), this new context point will be selected as a new cluster center, otherwise, it will be grouped into the closest cluster. The logic behind this step is to create a number of clusters which are not too close to each other for future clustering.
Once a request point is clustered into an existing cluster k, the average popularity \bar{P}_k of cluster k is used as the estimated popularity of the requested file. \bar{P}_k is updated as
\bar{P}_k = Σ_k / Θ_k, (6)
where Σ_k is the sum of the estimated popularities of the points in cluster k and Θ_k is the number of points in cluster k. When a new point is added to cluster k, Σ_k and Θ_k are updated as
Σ_k ← Σ_k + q_n, Θ_k ← Θ_k + 1, (7)
where q_n is the real popularity of f_n, calculated as the number of requests received for file f_n so far divided by the total number of requests.
In the second step, when K_n is equal to κ, the server calculates the distances between any two cluster centers. The set of these distances is denoted by L. Define d_x ∈ L as the x-th smallest distance between any two centers. Then, the initial clustering distance threshold is set to the sum of the z smallest distances, i.e.,
Δ_1 = Σ_{x=1}^{z} d_x,
where Δ_1 is the clustering distance threshold in phase 1, which will be used to decide whether a new cluster should be created once a new request point arrives.
As the number of clusters increases, the clustering distance threshold Δ_r should increase to control the growth rate of the cluster quantity. Specifically, the clustering distance threshold is updated in different phases. Let us define Δ_r as the clustering distance threshold in phase r. Each phase includes the creation of κ clusters. Define l_r as the number of clusters created in phase r. When l_r is larger than κ, a sufficient number of clusters has been created under the current threshold Δ_r. Therefore, Δ_{r+1} is scaled up to ξΔ_r and l_r is reset to zero. It should be ensured that ξ > 1 so that the clustering distance threshold is nondecreasing. In this way, we prevent the algorithm from creating too many clusters in a relatively small area of the context space, which may cause clusters to overlap with each other.
In the third step, when K_n becomes larger than κ, the proposed algorithm searches for the closest cluster center to an arriving request point and calculates D_n^min. Depending on how large D_n^min is compared to the clustering distance threshold Δ_r, the algorithm decides whether to map the newly arrived point to the closest cluster or to select it as a new cluster center. This mechanism is implemented stochastically with the probability defined as ρ_{n,r} = min(D_n^min/Δ_r, 1). More specifically, the new point will be selected as a new cluster center with probability ρ_{n,r}; otherwise, it will join the closest cluster with probability 1 − ρ_{n,r}. Accordingly, if the distance between the new point and the closest cluster center (i.e., D_n^min) is much smaller than Δ_r, the point is likely to be added to the closest existing cluster. On the other hand, if D_n^min is comparable with Δ_r, the point creates a new cluster with a higher probability. In particular, if D_n^min ≥ Δ_r, then ρ_{n,r} = 1, meaning this new point is quite far from the existing cluster centers and will certainly be selected as a new cluster center. The complete context-aware popularity learning algorithm is depicted in Algorithm 1.
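The three steps above can be sketched as follows (an illustrative Python sketch under our own naming; the fixed cluster centers and the running-average bookkeeping of Eqs. (6)-(7) follow the description, while details such as tie-breaking are our own choices):

```python
import math
import random

class IncrementalClustering:
    """Sketch of the incremental clustering of Sec. III-B. Parameter names
    (kappa, delta, xi, z) follow the text; Sigma_k / Theta_k is the
    running-average popularity of cluster k."""

    def __init__(self, kappa, delta, xi, z):
        self.kappa, self.delta, self.xi, self.z = kappa, delta, xi, z
        self.centers = []          # cluster centers c_k (fixed once created)
        self.sums = []             # Sigma_k: sum of observed popularities
        self.counts = []           # Theta_k: number of points per cluster
        self.threshold = None      # Delta_r, set once kappa clusters exist
        self.created_in_phase = 0  # l_r

    def _closest(self, point):
        dists = [math.dist(point, c) for c in self.centers]
        k = min(range(len(dists)), key=dists.__getitem__)
        return k, dists[k]

    def _new_cluster(self, point, q):
        self.centers.append(tuple(point))
        self.sums.append(q)
        self.counts.append(1)

    def observe(self, point, q):
        """Cluster the context point, update Sigma_k and Theta_k, and return
        the cluster's average popularity; q is the observed popularity."""
        if not self.centers:
            self._new_cluster(point, q)
            return q
        k, dmin = self._closest(point)
        if self.threshold is None:                 # step 1: fewer than kappa clusters
            if dmin > self.delta:
                self._new_cluster(point, q)
                k = len(self.centers) - 1
            else:
                self.sums[k] += q; self.counts[k] += 1
            if len(self.centers) == self.kappa:    # step 2: initialize Delta_1
                pairwise = sorted(math.dist(a, b)
                                  for i, a in enumerate(self.centers)
                                  for b in self.centers[i + 1:])
                self.threshold = sum(pairwise[:self.z])
        else:                                      # step 3: stochastic assignment
            rho = min(dmin / self.threshold, 1.0)
            if random.random() < rho:
                self._new_cluster(point, q)
                k = len(self.centers) - 1
                self.created_in_phase += 1
                if self.created_in_phase > self.kappa:  # enter phase r + 1
                    self.threshold *= self.xi
                    self.created_in_phase = 0
            else:
                self.sums[k] += q; self.counts[k] += 1
        return self.sums[k] / self.counts[k]
```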

IV. CACHING UPDATE ALGORITHMS
We modeled the caching decision process as an MDP in Section II and proposed a learning algorithm to track content popularities in Section III. Since the transition probabilities of the caching management MDP are uncertain due to the time-varying content popularities, in this section a model-free temporal-difference (TD) learning scheme, namely SARSA, is utilized to design an optimal caching algorithm [30] without requiring the dynamics of the environment (transition probabilities).

A. RL-Based Caching
In SARSA [31], an agent interacts with the environment and updates the policy based on the taken actions. In particular, SARSA consists of taking an action on a state, noting the reward of the action and the next state, then choosing the next action on the following state, and updating the Q value.

Algorithm 1 Context-Aware Popularity Learning
Update Σ_k, Θ_k and \bar{P}_k based on (6) and (7)
Calculate D_n^min, generate a Bernoulli binary variable x_{n,r} with success probability ρ_{n,r} = min(D_n^min/Δ_r, 1)
Update Σ_k, Θ_k and \bar{P}_k based on (6) and (7)

The update of the state-action value (i.e., the Q-value) depends only on the previous Q-value, the reward, and the Q-value of the next state-action pair; the dynamics of the environment are not needed. After updating the Q-value, the agent moves to the next state and executes the action chosen earlier. The Q-value of a state-action pair is updated by
Q^π_{n+1}(g, a) = Q^π_n(g, a) + ω[r_n(g, a) + γQ^π_n(g′, π_n(g′)) − Q^π_n(g, π_n(g))], (8)
where ω represents the learning rate, which determines to what extent the Q-value is updated based on the newly acquired information, and γ is the discount factor. To decide which action to choose in the current state, the ε-greedy policy is introduced with 0 < ε < 1. According to this policy, the agent either chooses the action which maximizes the Q-function at time step n given the current state, with probability 1 − ε, or randomly chooses an action, with probability ε. The ε-greedy action selection method can be described as
Pr(π_n(g) = argmax_{a∈A} Q^π_n(g, a)) = 1 − ε, Pr(π_n(g) = randi(A)) = ε. (9)
In TD algorithms such as SARSA, a table is maintained to store the Q-values of all state-action pairs as the basis of the Q-value update and action selection. However, the size of the state space in our model is very large, especially when a large file library and a large cache size are considered, so directly applying the standard SARSA incurs a prohibitive memory cost [32]. In addition, this tabular method ignores the correlation between different Q-values, which makes it inefficient to learn the Q-value of each state-action pair individually and leads to poor generalization [33]. As a result, the standard SARSA is difficult and inefficient to implement. Therefore, a linear function approximation is utilized to reduce the storage cost and to accelerate the learning, owing to the fact that the algorithm can generalize its earlier experiences to previously unseen states. The Q-function is represented by a linear combination of a number of features which appropriately reflect the inherent characteristics of the caching system. Therefore, the Q-function can be approximated as
Q^π_n(g, a) ≈ θ_n^T η_n(g, a), (10)
where η_n(g, a) is the feature vector and θ_n is the parameter vector at time step n. η_n(g, a) is defined as
η_n(g, a) = [I_req, I_cache, I_action]^T, (11)
where I_req, I_cache, and I_action are binary indicators denoting which file is requested, which M files are cached, and which action is taken, respectively. Based on this definition, the features of different states and actions are distinguishable, which helps to accurately update the value of the Q-function.
To depict the contribution of each feature, the parameter vector θ_n ∈ R^{2F+M+3} is introduced as the weight of each feature. With this approximation, the SARSA update of the Q-value translates into an update of the parameter vector:
θ_{n+1} = θ_n + ω[r_n(g, a) + γθ_n^T η_n(g′, a′) − θ_n^T η_n(g, a)] η_n(g, a). (12)
In the RL-based caching, each time a request is received, Algorithm 1 is called to perform the clustering and to estimate the file popularity. In the beginning, the cache of the server is assumed to be empty, so it always caches the requested files. When the cache is full, the initial state g_0 is observed and the action a_0 is selected based on (9). Note that we add a constraint to exclude α_{M+2} from the action selection process; the server always checks whether a request can be locally served, to make sure α_{M+2} is taken whenever possible. After executing the selected action, the algorithm moves to a new state g_{n+1}, and an immediate reward is received. Then the next action a_{n+1} is selected and the parameter vector is updated following (12). The detailed RL-based caching is described in Algorithm 2.
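The core updates of the RL-based caching, the ε-greedy selection and the semi-gradient parameter update of the linear approximation, can be sketched as follows (illustrative Python with NumPy; the helper names and the feature-building callback are our own):

```python
import numpy as np

def q_value(theta, features):
    """Linear approximation Q(g, a) ~ theta^T eta(g, a)."""
    return float(theta @ features)

def sarsa_update(theta, feat, reward, feat_next, omega, gamma):
    """One SARSA step under linear function approximation (a sketch;
    omega is the learning rate, gamma the discount factor)."""
    td_error = reward + gamma * q_value(theta, feat_next) - q_value(theta, feat)
    return theta + omega * td_error * feat

def epsilon_greedy(theta, feature_fn, state, actions, epsilon, rng):
    """Epsilon-greedy action selection; feature_fn builds eta(g, a)."""
    if rng.random() < epsilon:
        return actions[rng.integers(len(actions))]   # explore
    return max(actions, key=lambda a: q_value(theta, feature_fn(state, a)))
```

In the full scheme, the locally-served action would be forced whenever the requested file is cached, so the ε-greedy choice only ranges over the replacement and no-cache actions.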

B. Reactive Caching
Due to the requirement of real-time processing, the caching decision should be made with the least delay to improve the CHR. For the proposed RL-based caching scheme, the computing load still cannot be ignored when a large file library is considered. In addition, in the proposed scheme, there are three parameters that need to be adjusted to reach the optimal performance, which is time-consuming and can cause a generalization problem.

Algorithm 2 RL-Based Caching
Require: γ, ω, ε; randomly initialize θ, observe g_0
if the request can be locally served then
    a_0 ← α_{M+2}
else
    Select action a_0 based on (9)
end if
a_n ← a_0, g_n ← g_0
while n ≤ N do
    Play action a_n, and receive a request req_n
    Call Algorithm 1, observe r_n and the next state g_{n+1}
    if f_n is cached then
        a_{n+1} ← α_{M+2}
    else
        Select the next action a_{n+1} based on (9)
    end if
    Update θ based on (12), g_n ← g_{n+1}, a_n ← a_{n+1}, n ← n + 1
end while
In the RL-based caching scheme, the objective is to maximize the Q-function, which is a function of the discounted sum of the file popularities. Solving this problem ensures a high CHR since the algorithm tends to cache the popular files. Enlightened by this observation, a reactive caching scheme is designed to overcome the above-mentioned problems of the RL-based caching scheme. In particular, in this scheme, if the requested file cannot be locally served and is more popular than the least popular cached file, the server caches it and drops the least popular file, without considering the long-term effect. It should be noted that the reactive caching can also explore the dynamic popularity profiles. The exploration is done by calling Algorithm 1, which utilizes the context information to track the time-varying file popularity.
To ensure that the reactive caching scheme works, the server maintains an extra record of the popularity of the cached files. Each time a request arrives, Algorithm 1 is called to group it into a cluster and estimate the popularity of the requested file, denoted by \bar{P}_{f_n}(t_n). If the cache is not full, the server caches the requested file to improve the CHR. Otherwise, for each incoming request, the server checks whether it can be locally served. If the requested file can be locally served, its recorded popularity is updated with the latest estimate \bar{P}_{f_n}(t_n). If the requested file cannot be locally served, the popularities of the least popular cached file f_least and the requested file f_n are compared, and the server caches the more popular of the two. The detailed reactive caching scheme is described in Algorithm 3.
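One step of the reactive scheme can be sketched as follows (illustrative Python; the names are our own, and the popularity estimates are assumed to come from the learning of Algorithm 1):

```python
def reactive_cache_step(cache_pop, capacity, file_id, est_pop):
    """One step of the reactive caching scheme (a sketch).
    cache_pop maps each cached file -> its latest popularity estimate.
    Returns True on a cache hit, False on a miss."""
    if file_id in cache_pop:
        cache_pop[file_id] = est_pop        # refresh the recorded popularity
        return True
    if len(cache_pop) < capacity:           # cache not full: always cache
        cache_pop[file_id] = est_pop
        return False
    least = min(cache_pop, key=cache_pop.get)
    if est_pop > cache_pop[least]:          # keep the more popular of the two
        del cache_pop[least]
        cache_pop[file_id] = est_pop
    return False
```

Compared with the RL-based scheme, each step is a single dictionary lookup plus a minimum over the M cached popularities, which is what makes the scheme attractive for real-time processing.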

V. PERFORMANCE ANALYSIS
In this section, we first bound the learning regret of the file popularity and then utilize it to derive the bound of the CHR of the proposed reactive caching scheme.

Algorithm 3 Reactive Caching
for n ← 1 to N do
  Receive a request req_n ← <f_n, t_n, ν_n>, call Algorithm 1, get P̂_{f_n}(t_n)
  if the cache is not full then
    Cache f_n
  else if f_n is cached then
    Update the recorded popularity of f_n with P̂_{f_n}(t_n)
  else
    Compare P̂_{f_n}(t_n) with the popularity of the least popular cached file f_least and cache the more popular of the two
  end if
end for

A. Learning Regret of File Popularity
A widely applied assumption [35] is that files with similar context information have similar expected popularities. This assumption can be mathematically formulated as follows.
Assumption 1 (Uniform Lipschitz Continuity): There exists a real number β > 0 such that, for any two requests with context vectors v_i and v_j, the popularity difference between the requested files can be bounded as |P_{f_i}(v_i) − P_{f_j}(v_j)| ≤ β ||v_i − v_j||, where || · || denotes the Euclidean norm.

Based on this assumption, the popularity difference between two files can be bounded by the Euclidean distance between the corresponding context points in the context space V. As a result, to bound the learning regret of the file popularity, the worst-case regret is defined through the largest Euclidean distance between a new point and the farthest point in the same cluster. Before estimating the total forecast popularity error incurred by the first N requests, we prove the following lemma.

Lemma 1: Consider a sequence of N independent experiments x_1, x_2, ..., x_N, where experiment n succeeds with probability p_n = min{A_n/B, 1}, with B ≥ 0 and A_n ≥ 0 for n = 1, 2, ..., N. If u denotes the random number of consecutive unsuccessful experiments, then for some μ < 1 we have E[Σ_{i=1}^{u} A_i] ≤ B^μ.

Proof: Let u' be the maximal index such that p_i < 1 for all i ≤ u'. Therefore, we have (15). By mathematical induction, (15) can be rewritten as (16) (see the Appendix for the proof). Splitting the right-hand side of (16) into two parts gives (17). Defining q_i = A_i/B < 1 for i ≤ u' as the success probabilities of the different events, (17) can be represented as (18). The product term in (18) is the probability that event i is the first successful one. Since these events (for different values of i) are mutually exclusive, (18) can be upper bounded by a quantity that can be interpreted as the expected value of A_i corresponding to the first successful experiment x_i, i.e., E[A_i | x_i is the first successful experiment]. This conditional expectation is larger than the unconditional expectation of A_i: intuitively, the first succeeding experiment has a larger A_i on average than a generic experiment. Therefore, there exists an upper bound that is smaller than B (since E[A_i] > 0), and this bound can be further represented as B^μ with μ < 1. ∎

Lemma 2: The number of clusters created while serving the first N requests can be upper bounded by O(κ log_ξ(υN)), where υ ≥ 1 is the dataset aspect ratio, i.e., the ratio of the maximum to the minimum distance between any two points in the context space.

Proof: Based on Theorem 3 in [29], consider the phase r of our algorithm in which, for the first time, Δ_r ≥ W κ log n, where n indexes the n-th point and W is the sum of the distances between each point and its cluster center in the optimal solution. The total number of clusters created by our algorithm before phase r can be upper bounded by O(κ log_ξ(υn)), and the number created during and after phase r can be upper bounded by O(log_ξ n). Combining both bounds, we conclude that the number of clusters after N requests is of order O(κ log_ξ(υN)). ∎
Proposition 1: The expected total popularity learning error over the first N requests can be upper bounded by O(N^μ) for some μ < 1.
Proof: In the first step of the proposed scheme, when a point is chosen as a cluster center (i.e., it is the first point in that cluster), no error is incurred, because the caching decision is made only based on the instantaneous request frequency of that file. In the third step, each incoming point is selected as a new cluster center with probability ρ_{n,r} and is grouped into an existing cluster with probability 1 − ρ_{n,r}. In the first case, the forecast popularity error is obviously zero. In the second case, the largest distance (the worst-case error) can be written as λ_n D_n^min, where λ_n is a factor scaling D_n^min to the worst case (the largest distance). Therefore, the expected forecast popularity error of one point can be expressed as (21), where q̂_n is the learned popularity of f_n, q_n is the file popularity in the optimal scheme, and λ_sup is the upper bound of λ_n. In phase r, κ clusters are created, which can be interpreted as κ sequences of unsuccessful experiments if a success is defined as the creation of a new cluster. For each of these sequences, since the per-experiment error is bounded by (21) and the success probability is ρ_{n,r} = min{D_n^min/Δ_r, 1}, the expected error can be upper bounded by λ_sup(Δ_r)^μ according to Lemma 1. Consequently, the expected sum of forecast popularity errors in phase r is at most κλ_sup(Δ_r)^μ; mathematically, this yields (22), where n_r is the number of requests received in phase r. Therefore, the expected total forecast popularity error of the first N requests is at most (23), where R is the total number of phases corresponding to N requests.

Since Δ_r = ξ^{r−1}Δ_1, (23) can be rewritten as (24). As (24) is the sum of a geometric progression with common ratio ξ^μ, the expected total forecast popularity error can be upper bounded by (25). By Lemma 2, the number of clusters created while serving the first N requests is upper bounded by O(κ log_ξ(υN)); hence the total number of phases R is upper bounded by (26), because the algorithm enters a new phase whenever κ new cluster centers are created. Substituting (26) into (25) gives the claimed bound. ∎
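The probabilistic cluster-creation rule analyzed in this proof can be sketched as follows. The selection probability min{D_n^min/Δ_r, 1} and the geometric threshold schedule Δ_r = ξ^{r−1}Δ_1 follow the text, while the phase-advance bookkeeping and the parameter values are illustrative assumptions:

```python
import numpy as np

def incremental_clustering_sketch(points, kappa=3, delta1=1.0, xi=2.0, seed=0):
    """Group context points incrementally: a new point becomes a cluster
    center with probability min(D_min / Delta_r, 1), where D_min is its
    distance to the nearest existing center; a new phase starts
    (Delta_r <- xi * Delta_r) after kappa new centers are created.
    """
    rng = np.random.default_rng(seed)
    centers = [points[0]]          # the first point starts a cluster
    delta, new_in_phase = delta1, 1
    labels = [0]
    for p in points[1:]:
        d = [np.linalg.norm(p - c) for c in centers]
        i_min = int(np.argmin(d))
        rho = min(d[i_min] / delta, 1.0)
        if rng.random() < rho:
            centers.append(p)                 # p becomes a new center
            labels.append(len(centers) - 1)
            new_in_phase += 1
            if new_in_phase >= kappa:         # enter phase r + 1
                delta *= xi
                new_in_phase = 0
        else:
            labels.append(i_min)              # group p into nearest cluster
    return np.asarray(centers), labels
```

Distant points are likely to seed new clusters while nearby points are absorbed, and the growing threshold Δ_r slows cluster creation over time, matching the logarithmic bound of Lemma 2.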

B. Learning Regret of CHR
Similar to [21], let us divide time into periods, each containing φ requests. Under the proposed caching scheme, the server always caches the M most popular files. In addition, let Q^sort denote the sorted vector of the popularities of all files in time period s. For each requested file, assume that the popularity error of file f satisfies (27), where Q_e^sort(f) is the popularity of file f estimated by the proposed algorithm and Q_r^sort(f) is its corresponding real popularity. Based on Proposition 2 of [21], the CHR of the proposed algorithm in time period s can be lower bounded as in (28), where Q(M) denotes the normalized total popularity of the M most popular files. This lower bound can be divided into two parts. The first part, Q(M) − 2M/φ, depends on the cache capacity M and the file popularity distribution (through Q(M)). The second part, H̃_s, also depends on the total learning error of the file popularities in s, denoted ΔQ_s = Σ_{n=1}^{φ} ΔQ_n, since a larger error leads to a lower CHR. For the optimal scheme, full knowledge of the file popularity is available, so ΔQ_s = 0 and the CHR in s is defined as in (29).

Proposition 2: The expected CHR of the proposed reactive caching scheme asymptotically converges to that of the optimal scheme as the number of requests grows.

Proof: First, consider the CHR difference between the optimal scheme and the proposed scheme in time period s, which can be represented as (30), where Q_inf is a lower bound of Σ_{f∈F} Q_r^sort(f). Next, we compute the expected difference over all time periods, which can be represented as (31), where H and H̃ are the final CHRs of the optimal and the proposed caching scheme after receiving N requests. To bound the mean CHR difference from the first time period to the last, the forecast popularity error is summed from the first request to the last. Therefore, by Proposition 1, the expected difference can be represented as (32). According to (32), the expected CHR difference between the proposed algorithm and the optimal algorithm vanishes as N grows, so our algorithm converges to the optimal scheme. Note that this conclusion holds for any file library size. ∎

C. Time Complexity
For the proposed RL-based caching scheme, each time a request arrives, Algorithm 1 is called to estimate the popularity of the requested file, followed by the SARSA algorithm, which makes a decision using the learned file popularity. In Algorithm 1, we need to find the nearest cluster center to the newly requested file. According to Lemma 2, the number of clusters created while serving n requests can be upper bounded by O(κ log_ξ(υn)); therefore, the time complexity of finding the nearest cluster center is O(log_ξ(υn)). In SARSA, before an action is taken, a matrix multiplication is needed to calculate the Q-value of each action. Recall that M denotes the cache size and F the file library size. Since the feature matrix has size (M + 2) × (2F + M + 3) and the parameter vector has size (2F + M + 3) × 1, the time complexity of the multiplication is O(MF). Therefore, the time complexity of the proposed RL-based caching is O(log_ξ(υn) + MF).
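The per-decision Q-value computation can be illustrated as a single matrix-vector product with the dimensions given above; the random feature matrix here is only a stand-in for the paper's feature construction:

```python
import numpy as np

M, F = 10, 100                                # cache size and file library size
phi = np.random.rand(M + 2, 2 * F + M + 3)    # feature matrix, one row per action
theta = np.random.rand(2 * F + M + 3, 1)      # parameter vector

# (M+2) x (2F+M+3) times (2F+M+3) x 1: O(MF) scalar multiplications
q_values = phi @ theta
best_action = int(np.argmax(q_values))        # greedy action among M + 2 candidates
```

For a limited cache (small M) but a large library (large F), this O(MF) term dominates the O(log_ξ(υn)) clustering step, which is the cost the reactive scheme avoids.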
The proposed reactive caching scheme involves no matrix multiplication; it only calls Algorithm 1 and sorts the cached files in decreasing order of popularity. Therefore, its time complexity is O(log_ξ(υn) + M). Comparing the two, both algorithms achieve low time complexity. However, in practice, the cache size is always limited while the file library can be very large. In this case, the RL-based caching incurs a higher time complexity, which highlights the significance of designing the reactive caching scheme.

VI. SIMULATION RESULTS
In this section, the performance of the proposed caching schemes is evaluated and compared with other existing algorithms. The effect of parameters used in the context-aware popularity learning scheme is also investigated.

A. Simulation Setup
The file popularities are modelled using a Zipf distribution, which is widely used in the content caching literature [7], [10], [36]. The PMF of the Zipf distribution is written as P_f = f^{−ψ} / Σ_{j=1}^{F} j^{−ψ}, where P_f is the popularity of the file of rank f, F is the file library size, and ψ ≥ 0 is the skewness factor. A larger ψ means that the popular files become even more popular; when ψ = 0, the Zipf distribution reduces to a uniform distribution. In addition, the request arrivals are assumed to follow a Poisson process, which implies that the time interval between two consecutive requests follows an exponential distribution with rate parameter ζ; a larger ζ corresponds to more frequent requests.
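This request model can be sketched as follows (the Zipf PMF and exponential inter-arrival times follow the text; the parameter values and fixed seed are illustrative):

```python
import numpy as np

def zipf_pmf(F, psi):
    """Zipf popularity over a library of F files with skewness psi."""
    ranks = np.arange(1, F + 1)
    w = ranks ** (-float(psi))
    return w / w.sum()

def generate_requests(F=100, psi=1.0, zeta=2.0, horizon=50.0, seed=0):
    """Draw request times from a Poisson process of rate zeta
    (i.e., exponential inter-arrival times) and file ids from the Zipf PMF."""
    rng = np.random.default_rng(seed)
    pmf = zipf_pmf(F, psi)
    t, times, files = 0.0, [], []
    while True:
        t += rng.exponential(1.0 / zeta)   # inter-arrival time
        if t > horizon:
            break
        times.append(t)
        files.append(int(rng.choice(F, p=pmf)))
    return times, files
```

With psi = 0 the PMF is exactly uniform, and increasing psi concentrates requests on the top-ranked files.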
The simulation lasts for 72000 time slots, the duration of each time slot is one minute, and F = 100. To evaluate the robustness of the proposed algorithms against time-varying popularity profiles, a deterministic variation is introduced to the file popularity distribution by randomly permuting all the files every 12000 time slots. The learning rate, the discount factor, and the ε-greedy factor are set to 0.1, 0.05, and 0.1, respectively. All the results are generated by running the simulation five times and averaging the CHR. In addition, the request history is chosen as the context information in the simulation; specifically, we use the total numbers of requests for a file in the past week and in the past day.
As benchmarks, we simulate LRU, LFU, least frequently recently used (LFRU) [37], and a popularity-driven (POP) caching [21] algorithm. In the LRU scheme, the node always updates its cache by replacing the least recently requested file with the newly requested file. LRU is suitable for the case in which file popularity changes over time, because it does not consider the request history when making a decision. In the LFU scheme, the node always caches the most frequently requested files; however, its performance may drop drastically if the file popularity changes. LFRU combines the benefits of LRU and LFU by partitioning the cache into two parts and is a well-known scheme for content caching networks. Finally, POP caching is a context-aware caching algorithm with the ability to handle time-varying file popularity. Figure 2 illustrates the CHR of the different caching algorithms over time, with the caching capacity set to 10 files and ψ = 1. The proposed algorithms can identify the changes in the popularity profiles. The RL-based caching scheme reaches the highest CHR and competitive robustness among all six schemes because it introduces a reasonable amount of exploration to deal with the variability of the popularity profiles, thus caching the more popular files when the popularity profiles change. The reactive caching reaches a competitive CHR and robustness as well. Note that before the variation happens, this scheme performs well because the proposed online learning algorithm quickly learns the popularity profiles and the incremental clustering is more efficient than the grid-based clustering scheme utilized in POP caching. Among the remaining benchmark schemes, LFU performs well before any variation but its robustness is very poor, while LRU is very robust but cannot reach a good CHR when the storage resource is limited.
As a combination of LRU and LFU, LFRU performs much better; however, there is an obvious gap because LFRU does not consider the context information. Besides, this scheme is rather heuristic, which means it needs tuning to reach better performance. Figure 3 shows the caching performance over time when the period of the popularity variation decreases from 12000 to 6000 time slots, so the variation happens more frequently. The RL-based caching scheme still reaches the highest CHR. In addition, due to the more frequent variations, the gap between the RL-based caching and the reactive caching becomes larger, which further demonstrates the robustness of the proposed RL-based caching scheme. Moreover, the reactive caching scheme still reaches a higher CHR than the benchmark schemes. Figure 4 depicts the caching performance versus the cache capacity, which ranges from 3 to 20 files; the CHRs are collected at the end of the simulation with ψ = 1. The proposed RL-based caching scheme outperforms the benchmark schemes for all cache capacities, and the reactive caching scheme reaches a competitive CHR as well, which shows that the proposed schemes are suitable and robust for the cache-limited scenario.
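The eviction rules of the LRU and LFU baselines can be sketched as follows (a minimal sketch; practical variants such as LFRU add cache partitioning and aging on top of these rules):

```python
from collections import Counter, OrderedDict

class LRUCache:
    """Evict the least recently requested file on a miss with a full cache."""
    def __init__(self, size):
        self.size, self.files = size, OrderedDict()
    def request(self, f):
        hit = f in self.files
        if hit:
            self.files.move_to_end(f)          # f is now the most recent
        else:
            if len(self.files) >= self.size:
                self.files.popitem(last=False)  # drop the least recent
            self.files[f] = True
        return hit

class LFUCache:
    """Evict the least frequently requested cached file on a miss."""
    def __init__(self, size):
        self.size, self.files, self.freq = size, set(), Counter()
    def request(self, f):
        self.freq[f] += 1
        hit = f in self.files
        if not hit:
            if len(self.files) >= self.size:
                self.files.remove(min(self.files, key=self.freq.__getitem__))
            self.files.add(f)
        return hit
```

The contrast is visible directly in the code: LRU keeps no counts, so it adapts instantly but ignores long-run popularity, while LFU's counters persist across popularity changes, which is exactly why its CHR collapses after a permutation of the profiles.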

B. Numerical Results
In Figure 5, the effect of the request arrival rate is studied by changing the rate parameter ζ of the exponential distribution, with the caching capacity set to 10 files and ψ = 1. The performance of the proposed caching schemes is robust against ζ: according to the figure, the choice of the length of the context information is not sensitive to the request arrival rate. To further study the effect of the file library size on the CHR, we simulate a new scenario with 1000 files in the library and a cache size of 20. As observed in Figure 6, the proposed RL-based caching algorithm reaches the highest CHR and is very robust against frequent changes of the file popularity profiles; the reactive caching scheme also reaches a competitive performance compared with the other benchmark schemes. Finally, the running times of both proposed caching schemes are measured under this setting. Both algorithms were run in MATLAB on an i5 desktop computer and, to measure the running time accurately, executed for 1000 time slots. The running times of the RL-based caching and the reactive caching are 1159 seconds and 642 seconds, respectively; the reactive caching runs faster, confirming its lower computational complexity.
In the previous results, the change of the popularity profile is assumed to be independent over time, and the popularities of all files change at each variation. Inspired by [36], we further test the caching performance of the proposed algorithms under correlated file popularity, in which only a fraction of the file popularities change, so the popularity profile is dependent over time and correlated with the previous profile. In Figure 7, the variation percentage of the popularity profiles ranges from 60% to 100%. According to the results, as the variation percentage increases, the CHRs of the proposed schemes drop because the popularity profiles become more dynamic. However, the RL-based caching scheme still reaches the best performance, and the performance gap between it and the reactive caching becomes larger as the popularities of more files vary. To further evaluate the caching performance, a new scenario is simulated: at each variation of the popularity profiles, a subset of the files is removed from the file library and a number of new files are included, while a fraction of the file popularities still changes periodically. Figure 8 presents the caching performance for this dynamic scenario. The RL-based scheme reaches the highest CHR, which demonstrates its robustness under correlated time-varying popularity profiles.

C. Correlated File Request Process
In this section, we design two new settings to extend the simulation results, capturing the temporal correlation of file requests in different ways. The first is based on the Bernoulli model [36] and the second on the shot noise model (SNM) [38]. To implement the time-varying popularity profile, a part of the popularity profile is changed periodically, consistent with the previous results. In Figure 9, we study the Bernoulli request model, which captures the request correlation. We assume there are 10 users, each user makes a file request with probability 0.1 in each time slot, and the file library size is 1000. The result shows that the proposed algorithms perform well in this new setting and that the correlated requests do not degrade the caching performance compared with the results of the independent request model (Figure 6).
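This Bernoulli request model can be sketched as follows (the user count, per-slot request probability, and library size follow the text; drawing each requested file from a Zipf-style popularity is an illustrative assumption):

```python
import numpy as np

def bernoulli_requests(n_users=10, p_req=0.1, F=1000, psi=1.0,
                       n_slots=100, seed=0):
    """In each time slot every user independently issues a request with
    probability p_req; the requested file is drawn from a Zipf-distributed
    popularity over the F-file library."""
    rng = np.random.default_rng(seed)
    ranks = np.arange(1, F + 1)
    pmf = ranks ** (-float(psi))
    pmf /= pmf.sum()
    requests = []                                 # list of (slot, file) pairs
    for slot in range(n_slots):
        n_active = int(rng.binomial(n_users, p_req))  # users requesting now
        for f in rng.choice(F, size=n_active, p=pmf):
            requests.append((slot, int(f)))
    return requests
```

Unlike the Poisson model, requests arrive in slot-aligned batches whose size is bounded by the user population, which is what introduces the temporal correlation.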
In addition, we also simulate the SNM, which is able to capture the popularity evolution and explicitly account for the temporal locality of file requests. Following the setting in [39], the requests for the contents in a window of time slots are generated based on the SNM with the exponential shape [38]. The average number of requests for each content in the considered time window follows the Zipf distribution. The result is presented in Figure 10: both proposed caching schemes outperform all the benchmarks.

D. Parameter Determination
Here, we investigate the effect of the parameters of the proposed context-aware popularity learning algorithm. The popularity profiles are assumed to be unchanged. The optimal caching scheme, which is assumed to know the true file popularities, is also simulated; for the optimal caching, the server caches the most popular files without any updates. Figure 11 depicts the CHR and the number of clusters versus δ. When δ is relatively small, the CHR gap between the optimal scheme and the reactive caching scheme is very small. However, when δ ≥ 100, the gap starts to increase, which means that the popularity learning scheme gradually loses its accuracy. The reason is that when δ is too large, only very few clusters are created, so request points with different popularity levels can be grouped into the same cluster; hence there is a high probability that the algorithm cannot distinguish between two requests. In the simulation, to avoid creating too few clusters, δ is set to 5. Figure 12 shows the caching performance versus κ. As κ increases, the CHR difference between the optimal scheme and the reactive scheme first increases and then remains stable. This means that creating too many clusters does not necessarily improve the learning accuracy and may even confuse the popularity learning. In our simulations, to avoid the algorithm creating too few clusters and thus reducing the CHR, κ is set to 20. We also investigate the effects of ξ and z. The results show that these parameters do not severely affect the caching performance but only the speed of creating clusters; therefore, ξ and z are chosen as 10 and 15, respectively.

VII. CONCLUSION
In this paper, we studied the caching decision problem at a network edge server under dynamic, time-varying file popularity profiles. We first designed a context-aware popularity learning algorithm to improve the accuracy and speed of tracking the file popularity. With the assistance of this algorithm, an RL-based caching scheme was proposed to make proper caching decisions and improve the CHR. Furthermore, a reactive caching algorithm was designed to reduce the computational complexity for real-time processing. Through theoretical analysis, we demonstrated the superiority and efficacy of the proposed algorithms, and through numerical results, we showed that they achieve a competitive and robust caching performance compared with various benchmark schemes.

APPENDIX
In order to verify the transition from (15) to (16), we need to prove the identity in (33). This can be done via mathematical induction. First, we rewrite (33) in the form of (34). For i = 1, (34) holds directly. Assuming that (35) holds for i = X, we need to show that (36) holds for i = X + 1. The left-hand side of (36) can be written as (37). Combining the induction hypothesis (35) with (37) yields (38), which establishes (36). Therefore, the proof of the transition from (15) to (16) is complete. ∎