Guaranteeing performance based on time-stability for energy-efficient data centers

ABSTRACT We consider a system of multiple parallel single-server queues where servers are heterogeneous, with resources of different capacities, and can be powered on or off and run at different speeds when powered on. In addition, we assume that application requests are heterogeneous, with different workload distributions and resource requirements, and that the arrival rates of requests are time-varying. Managing such a heterogeneous, transient, and non-stationary system is a tremendous challenge. We take an unconventional approach, in that we force the queue lengths in each powered-on server to be time-stable (i.e., stationary). This allows operators to guarantee performance and effectively monitor the system. We formulate a mixed-integer program to minimize energy costs while satisfying time-stability. Simulation results show that our suggested approach can stabilize queue length distributions and provide probabilistic performance guarantees on waiting times.


Introduction
We consider a data center that consists of a set of heterogeneous servers with different maximum processing speeds and different capacity limits on various resources, such as memory and storage. The data center hosts heterogeneous application classes with different workload distributions, whose time-varying requests arrive with resource requirements and levels of Quality of Service (QOS) guarantees. In this setting, servers process requests belonging to multiple classes, whereas requests in the same class are stochastically identical and arrive at the data center according to a piecewise-constant non-homogeneous Poisson process. We assume that the environment process driving the arrival rates of the non-homogeneous Poisson process is cyclic. This is a fairly reasonable assumption, as arrivals tend to have daily or weekly patterns that repeat in a cyclic fashion (Gmach et al., 2007; Lin et al., 2012; Lin et al., 2013; Liu et al., 2013; Kwon and Gautam, 2015). Under that assumption, we model each cycle as divided into a set of intervals corresponding to the piecewise-constant periods of the arrival process. In addition, we assume that the number of classes is large (larger than the number of servers; e.g., 10 servers and 20 applications); however, their workloads are so small that most servers host a mixture of heterogeneous classes. For the non-homogeneous arrival process, the arrival rate of each class is time-varying and changes so fast that steady state is not reached before the rates change again. Based on this scenario, we model the data center as a system of multiple parallel single-server queues where a dispatcher routes arriving requests to servers.
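Such a piecewise-constant non-homogeneous Poisson arrival process is easy to simulate. The sketch below (rates, interval length, and cycle count are all illustrative, not from the article) draws exponential inter-arrival gaps within each constant-rate interval and restarts at each boundary, which is exact because of the memorylessness of the exponential distribution:

```python
import random

def piecewise_nhpp_arrivals(rates, interval_len, cycles, seed=42):
    """Simulate arrival times of a non-homogeneous Poisson process whose
    rate is piecewise constant: rates[i] holds on the i-th interval of each
    cycle. Within an interval the process is homogeneous Poisson, so we draw
    exponential gaps and restart at each interval boundary (exact by
    memorylessness)."""
    rng = random.Random(seed)
    arrivals, t0 = [], 0.0
    for _ in range(cycles):
        for lam in rates:
            end = t0 + interval_len
            t = t0 + rng.expovariate(lam)
            while t < end:
                arrivals.append(t)
                t += rng.expovariate(lam)
            t0 = end
    return arrivals

# A cyclic pattern with three unit-length intervals, repeated for two cycles.
arr = piecewise_nhpp_arrivals([5.0, 20.0, 10.0], interval_len=1.0, cycles=2)
```

The returned arrival stream is strictly increasing and confined to the simulated horizon; replaying the same `rates` list in each cycle reproduces the cyclic environment process assumed above.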
CONTACT Natarajan Gautam gautam@tamu.edu. Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/uiie. Supplemental data for this article can be accessed at www.tandfonline.com/uiie.

For such a time-varying and heterogeneous system, our fundamental aim is to manage the servers in the data center in an energy-efficient manner while satisfying performance guarantees by achieving time-stability. For time-varying and heterogeneous systems, the arrival rates of requests change so fast that a steady state of the system is not reached, making it extremely difficult to provide performance guarantees. Without assuming that the system reaches a steady state, one can provide performance guarantees by powering on a large number of servers to handle peak workload; however, that incurs a huge additional energy cost. In this context, achieving time-stability is extremely useful for managing a non-homogeneous and transient system: it enables operators to effectively design and analyze the time-varying system, it provides guaranteed QOS as desired, and it supports effective monitoring and control. Achieving time-stability and providing performance guarantees, while being mindful of energy costs, is the main objective of this study. To achieve time-stability, we consider the following control decisions that can be tuned.
- Assignment: Applications corresponding to each class of request can be assigned to servers such that each server hosts one or more classes and each class is hosted on multiple servers. We assume that there is no cost either for the assignments or for switching between assignments.
- Sizing: Each server can be dynamically powered on or off across time intervals. However, we consider neither switching costs (from on to off and vice versa) nor reliability costs for on-off cycles. Note that some modern servers allow "sleep" settings instead of being completely powered off; from a mathematical standpoint, we treat the two as equivalent.
- Routing fraction: The dispatcher routes arriving requests to one of the powered-on servers based on predetermined routing fractions that account for the assignment of classes to servers. A key assumption is that the dispatcher cannot observe the real-time state of any server.
- Speed scaling: For each server, it is possible to dynamically change the processing speed by scaling the voltage and frequency, up to a maximum speed.

We will show that assignment, sizing, routing, and speed scaling can be appropriately performed in an integrated fashion to achieve both time-stability and energy efficiency through the suggested framework described in Sections 2 and 3. Figure 1 shows the scenario and the main objective of this study. Next we review the relevant literature and introduce the notation used in this article.

Literature review
As demand for cloud computing has surged in recent years, server management in data centers has received significant attention from both industry and academia. Data centers provide benefits in cost reduction, flexibility, and accessibility by allowing enterprises to outsource resources as a service rather than managing their own. As a result, data centers face the challenging tasks of achieving energy efficiency and providing performance guarantees in cloud computing environments characterized by time-varying workloads with significant variation and uncertainty. In general, data centers need to provide strict QOS guarantees to users, and they therefore over-provision their servers to handle peak loads, owing to the inherent uncertainty and variability in demand. On the one hand, this over-provisioning of servers results in low utilization levels, as reported in Barroso and Holzle (2007) and Vogels (2008). On the other hand, data centers incur a significant amount of energy usage for operating and cooling servers, as explained and documented in Hamilton (2008) and Koomey (2009). Fortunately, to abate the skyrocketing energy consumption in data centers (see the report by Brown et al. (2008)), there are tremendous opportunities to conserve energy, such as powering servers off and running them at slower speeds.
In this context, server provisioning plays a key role in improving utilization by selecting active servers (e.g., powering off servers or allowing them to enter a power-saving mode) in accordance with traffic changes while considering performance guarantees, and a number of techniques for efficient server provisioning have been proposed to address this problem. Gandhi et al. (2011) presented an approach based on a combination of predictive and reactive provisioning to correctly allocate resources in data centers such that Service Level Agreement (SLA) violations and energy consumption are minimized. Zhu et al. (2011) presented a data center architectural design based on virtualized resources in order to reduce provisioning overhead; they also proposed a dynamic provisioning technique that satisfies users' SLAs while maximizing overall profits. Also, Wang et al. (2015) provided an analytic framework that captures non-stationarities and stochastic variation of workloads for dynamic re-sizing in data centers. Lin et al. (2013) suggested a new on-line algorithm for dynamic right-sizing in data centers, motivated by optimal off-line solutions to minimize energy costs including switching costs. In addition, some studies have considered heterogeneous workloads for server provisioning and allocation problems. There is another body of literature that has proposed optimization approaches based on powering servers on/off and Dynamic Voltage/Frequency Scaling (DVFS) to minimize energy consumption. Bertini et al. (2008) proposed a Mixed Integer Program (MIP) for the problem of selecting the servers' states and processing speeds with QOS control, and Gallego Arrubla et al. (2013) introduced a unified methodology that combines virtualization, speed scaling, and powering off servers, so as to efficiently operate data centers while incorporating the inherent variability and uncertainty.
Although all of the aforementioned research is complementary to our work, the key difference is that our study suggests an approach that aims not only to save energy but also to achieve time-stability for providing performance guarantees. To the best of our knowledge, the problem of achieving time-stability under time-varying traffic in data center operation has received little attention and has not been effectively addressed. Although time-stability has received little attention in the context of data center operation, it has been studied in the queueing literature. Foley et al. (1986) and Barnes and Meili (1997) showed that the departure process from the M_t/G_t/∞ queue can be made stationary, and recently Whitt (2015) suggested the rate-matching control algorithm, which stabilizes the queue length distribution for a G_t/G_t/1 single-server queue where both the arrival rate and the service rate are time-varying. There is another body of literature that provides algorithms to determine appropriate staffing levels for call centers. Feldman et al. (2008) proposed a simulation-based iterative-staffing algorithm for time-stable delay probability, and Liu and Whitt (2012) suggested a formula-based algorithm to stabilize abandonment probabilities and expected delays using offered-load-based approximations for a queueing model with a non-homogeneous Poisson arrival process and customer abandonment.
In fact, our previous work (Kwon and Gautam, 2015) suggests an approach to achieve time-stability in both queue lengths and sojourn times for data center operations where time-varying arrivals are cyclic. In that paper, we considered a scenario of a company that owns a data center (e.g., Yahoo, Google, or Facebook) and operates a large number of servers to host applications. We assumed that each application request has a high Squared Coefficient of Variation (SCOV) of workloads. Also, each application needs to be hosted on a large number of servers to handle the load (in fact, we used an asymptotic scaling, the number of servers N_a → ∞ for each application a, which allowed us to attain time-stability). In comparison, this study considers a unique scenario for data centers that provide a hosting service for several other companies. In general, hosting data centers cluster the applications of each company and assign the cluster to a group of servers for the purpose of security and confidentiality. It is important for the data centers to monitor the performance experienced by the applications, and without a time-stable performance, monitoring would be difficult. This motivates our consideration of time-stability. We assume that the number of application classes is much larger than the number of servers and that many application classes have so little load that they could be hosted on just one server. Also, we assume that servers are heterogeneous and host a mixture of heterogeneous application requests, and that the processing speed of a server can be changed for energy efficiency (in Kwon and Gautam (2015), servers are homogeneous and every server runs at the same processing speed). Moreover, in this study, we formulate an MIP problem to optimally determine operational decisions, including assignment, routing, and speed scaling, for time-stability as well as energy efficiency, which were not considered in Kwon and Gautam (2015).
Therefore, our suggested model in this study is fundamentally different from the previously considered problem.
For the purpose of this study, we provide an analytical framework that decomposes a complex and non-stationary system into individual simpler stationary ones based on multiple strategies for assignment, sizing, routing, and speed scaling. Based on the suggested framework, our objective is to stabilize the queue length distributions of each powered-on server and provide performance guarantees on the waiting time of each application class. Moreover, for energy efficiency we propose an optimization model to minimize the total energy cost via powering on or off servers, routing, and speed scaling, while satisfying the time-homogeneity constraints. In fact, our suggested approach enables us to utilize standard stationary queueing analysis to obtain performance guarantees for time-varying, transient, and heterogeneous systems, and we believe that our study has significance and provides useful insights for practitioners. The main contributions of this study are to (i) provide an integrated framework that unifies sizing, assignment, routing, and speed scaling under heterogeneous conditions, which have seldom been implemented jointly; (ii) define time-homogeneity constraints that ensure time-stability; (iii) suggest an approach to provide performance guarantees based on their time-stability; and (iv) introduce an optimization problem with an MIP formulation to reduce energy cost while considering time-stability. The remainder of this article is organized as follows: Section 2 introduces our concept of time-stability and suggests an approach to obtain time-homogeneity constraints; Section 3 proposes an optimization problem to determine various decisions for energy conservation and time-stability; Section 4 proposes an approach to provide performance guarantees; Section 5 reports and analyzes the results of numerical experiments; and Section 6 presents conclusions and future research directions.

Notation
For the scenario considered in this study, we set up the notation needed to define the problem and describe our approach. An arriving request belongs to one of multiple classes in a discrete set A. We categorize incoming requests into several class types based on their mean workload, and each class has a small squared coefficient of variation. The amount of work a class a ∈ A request brings is independent and identically distributed according to a general distribution H_a(·) with mean 1/θ_a and SCOV C_a^2. Also, each class a has a requirement β_a^k for each resource type k ∈ K (this could be memory or storage, for example). Let T be a collection of contiguous time intervals; in each interval ℓ ∈ T the arrival rate for each class a ∈ A remains a constant λ_aℓ. Also, we consider N heterogeneous servers, and each server j ∈ N has a maximum processing speed φ_j^max and a capacity limit b_j^k for each resource type k ∈ K. Note that instead of defining a service time distribution, as is common in other studies based on queueing models, our approach uses the combination of workload distribution and (possibly varying) processing speed. In addition, we define the desired traffic intensity ρ for each powered-on server. In fact, we use the desired traffic intensity to create enough residual capacity for unforeseen surges by restricting the utilization of each server to ρ. Next, for time interval ℓ ∈ T, server j ∈ N, class a ∈ A, and resource type k ∈ K, we define the decision variables considered in this study as follows: (i) the assignment of class a to server j in time interval ℓ, x_ajℓ (x_ajℓ = 1 if class a is assigned to server j in time interval ℓ; otherwise, x_ajℓ = 0); (ii) whether server j is powered on or off in time interval ℓ, y_jℓ (y_jℓ = 1 if server j is powered on in time interval ℓ; otherwise, y_jℓ = 0); (iii) the processing speed of server j in time interval ℓ, φ_jℓ; and (iv) the fraction of jobs of class a routed to server j in time interval ℓ, v_ajℓ.
Each server j must be powered on or off in each time interval, and thus there will be N_ℓ = Σ_{j∈N} y_jℓ powered-on servers in each time interval ℓ. We summarize the set of indices, parameters, and decision variables used to define the optimization problem in Table 1.

Time-stability
Recall that our main objective is to provide performance guarantees for time-varying and heterogeneous data centers by achieving time-stability. In this section, we first introduce the notion of time-stability considered in this study and then suggest an approach to achieve it. We conclude the section by introducing a practical application of the suggested approach. Recall that the benefits of time-stability were discussed in Section 1.

Our notion of time-stability
As described in Section 1, we consider a scenario where each server may host multiple classes of applications, with the aggregate workload defined by a mixture of heterogeneous application classes. To achieve time-stability, we seek to manage the system so that any powered-on server receives arrivals with a target workload distribution H(·) whose Laplace-Stieltjes Transform (LST) is H̃(s) = ∫_0^∞ e^{−sx} dH(x). We denote by 1/θ the target average amount of work brought by each arrival. We seek to keep the aggregate workload distribution at any instant of time in each powered-on server time-stable, in order to ensure that the distribution of queue lengths at any time for every powered-on server follows the stationary distribution of a homogeneous M/G/1 queue. In Section 2.2, we will propose an approach to obtain a time-stable queue length distribution by stabilizing both the aggregate workload distribution and the arrival rates, and in Section 3 we will show that time-stability remains valid under speed scaling. For mathematical representation, for all j ∈ N, at time t let X_j(t) be the number of requests in the queue of server j and O_j(t) be the status of the server (O_j(t) = 1 if server j is powered on at time t; otherwise, O_j(t) = 0). Our approach indeed stabilizes the queue length distribution π(i) for all i ≥ 0, in the sense that P{X_j(t) = i | O_j(t) = 1} = π(i), where π(i) is constant and independent of t. Given time-stability of the queue length distribution, performance analysis is straightforward and probabilistic guarantees on waiting times can be provided.

Approach to achieve time-stability
In this section we suggest an approach to achieve the time-stability introduced in Section 2.1. For time-varying arrivals of requests from heterogeneous applications, the main idea of our suggested approach is to stabilize the queueing process of each powered-on server based on (i) a time-invariant aggregate workload distribution obtained by a moment-matching approximation and (ii) rate matching between arrival rates and processing speeds, achieved by appropriately selecting routing fractions. In the following subsections, we derive the time-homogeneity constraints for the proposed MIP problem.

2.2.1. Moment-matching approximation for time-invariant workload distribution
To achieve time-stability, we seek to stabilize the aggregate workload distribution of each powered-on server, with the servers hosting incoming requests of multiple heterogeneous applications that have different workload distributions H_a(·). To achieve the invariant target workload distribution H(·) with LST H̃(s) and average workload 1/θ, our objective is to appropriately control the system so that each powered-on server receives arrivals of requests with a homogeneous aggregate workload distribution H(·), based on a moment-matching approximation.
To begin, for the desired traffic intensity ρ and the target nominal speed φ, we determine the minimum number of powered-on servers for each time interval ℓ ∈ T, N_ℓ, that satisfies the following inequality:

Σ_{a∈A} λ_aℓ H̃_a(s) ≤ N_ℓ ρφθ H̃(s), for all s ≥ 0,   (1)

where the left-hand side is the arrival-rate-weighted LST of the aggregate workload distribution for the entire system and the right-hand side represents that of N_ℓ time-stable servers, each with the target workload distribution H(·) and arrival rate ρφθ. Based on inequality (1), N_ℓ can be determined so that the workload served by each of the N_ℓ servers is less than or equal to the target workload (which is defined by the traffic intensity ρ, processing speed φ, and mean workload 1/θ) when the incoming workload is distributed equally, with the same routing fractions, over the powered-on servers. In fact, we can restate inequality (1) as a moment-matching approximation over several moments,

Σ_{a∈A} λ_aℓ (−1)^m H̃_a^(m)(0) ≤ N_ℓ ρφθ (−1)^m H̃^(m)(0), m = 0, 1, 2, . . . ,

by taking derivatives of the LST H̃(s) at s = 0 (recall that (−1)^m H̃^(m)(0) is the m-th moment of H(·)). Our aim is to introduce additional traffic to match the difference between the left-hand side and the right-hand side of each inequality (each moment) so that the aggregate workload distribution at each of the N_ℓ servers is approximately equal to the target workload distribution H(·). As discussed in Xu and Liu (2012) and Yao et al. (2012), the additional traffic can be thought of as low-priority (delay-tolerant) jobs without time constraints that data centers need to process anyway. For the additional traffic, indexed by 0, the arrival rate in time interval ℓ, λ_0ℓ, and the moments of its workload distribution H_0ℓ(·) can be determined through the following equalities based on N_ℓ:

λ_0ℓ (−1)^m H̃_0ℓ^(m)(0) = N_ℓ ρφθ (−1)^m H̃^(m)(0) − Σ_{a∈A} λ_aℓ (−1)^m H̃_a^(m)(0), m = 0, 1, 2, . . . ,   (2)

and then we can approximately define H_0ℓ(·) by combining these moments.
Recall that the additional traffic has a low priority of QOS guarantees (e.g., non-interactive jobs are less sensitive to response time) and is CPU-intensive (it has no other resource requirements). In this case, it is reasonable to assume that the additional traffic can be fed into the system from front-end proxy servers or other data centers according to a Poisson process with parameter λ_0ℓ and workload distribution H_0ℓ(·) in each time interval ℓ. We note that our suggested approach allows for such additional traffic; the increase in workload caused by the added traffic does not degrade the desired performance but instead reduces variability and uncertainty in the aggregate workload, while also improving the CPU utilization of each powered-on server. Note that if we match sufficiently many moments, then the above equalities result in

Σ_{a∈A} λ_aℓ H̃_a(s) + λ_0ℓ H̃_0ℓ(s) = N_ℓ ρφθ H̃(s),

where H_0ℓ(·) has −H̃_0ℓ'(0), H̃_0ℓ''(0), −H̃_0ℓ'''(0), . . . as its moments. By selecting additional traffic that matches several moments, our suggested approach stabilizes the workload distribution of each powered-on server.
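The sizing-and-padding calculation of this subsection can be sketched numerically. The function below is a minimal sketch, matching only the zeroth, first, and second moments: it takes N_ℓ from a first-moment capacity bound (an assumption consistent with the desired traffic intensity ρ) and then solves the moment equalities of Equation (2) for the additional traffic's rate and first two workload moments; all numbers are illustrative, not from the article.

```python
import math

def size_and_pad(class_rates, class_m1, class_m2, rho, phi, theta, m2_target):
    """Sketch of the sizing-and-padding step for one time interval.

    class_rates : per-class arrival rates (lambda_a)
    class_m1/m2 : first/second workload moments of each class (H_a)
    rho, phi    : desired traffic intensity and target nominal speed
    theta       : target mean workload is 1/theta
    m2_target   : second moment of the target workload distribution H

    Returns the minimum number of powered-on servers N (first-moment
    capacity bound) plus the additional traffic's rate and first two
    workload moments, chosen so that N servers each see Poisson arrivals
    at rate rho*phi*theta with the target first two moments."""
    load = sum(l * m for l, m in zip(class_rates, class_m1))  # offered work per unit time
    n_servers = math.ceil(load / (rho * phi))                 # capacity (m = 1) bound
    lam0 = n_servers * rho * phi * theta - sum(class_rates)   # m = 0: rate matching
    if lam0 <= 1e-12:
        return n_servers, 0.0, None, None                     # no padding needed
    m1_0 = (n_servers * rho * phi - load) / lam0              # m = 1 equality
    m2_0 = (n_servers * rho * phi * theta * m2_target
            - sum(l * m for l, m in zip(class_rates, class_m2))) / lam0  # m = 2
    return n_servers, lam0, m1_0, m2_0

# Illustrative instance: two classes, target mean work 0.7 (theta = 1/0.7).
n, lam0, m1_0, m2_0 = size_and_pad(
    class_rates=[3.0, 2.0], class_m1=[0.5, 1.0], class_m2=[0.5, 2.0],
    rho=0.8, phi=1.0, theta=1 / 0.7, m2_target=1.1)
```

In this instance five servers suffice, and the padding traffic arrives at rate λ_0 = 5/7 with mean workload 0.7 and second moment 1.1, exactly filling the gap up to five identical time-stable servers.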

2.2.2. Selecting routing fractions for time-stable arrival rates
In addition to stabilizing the aggregate workload distribution as introduced in Section 2.2.1, our suggested approach stabilizes the aggregate arrival rate to each powered-on server so that each powered-on server behaves as a stationary time-homogeneous M/G/1 queue. Based on the minimum number of powered-on servers N_ℓ, arrival rates λ_0ℓ, and workload distributions H_0ℓ(·) of the additional traffic for each time interval ℓ ∈ T, we seek to route arriving requests so that the arrival rate into each powered-on server is time-homogeneous (i.e., constant across time intervals) while ensuring that the aggregate workload distribution is also stabilized at the target workload distribution H(·). Let v_ajℓ be the fraction of arriving requests of class a routed to server j in time interval ℓ; then v_ajℓ can be appropriately selected based on the following time-homogeneity constraints:

Σ_{j∈N} v_ajℓ = 1, for all a ∈ A ∪ {0}, ℓ ∈ T,   (3)
Σ_{a∈A∪{0}} λ_aℓ v_ajℓ (1/θ_a) = ρφ y_jℓ, for all j ∈ N, ℓ ∈ T,   (4)
Σ_{a∈A∪{0}} λ_aℓ v_ajℓ (−1)^m H̃_a^(m)(0) = ρφθ (−1)^m H̃^(m)(0) y_jℓ, m = 1, 2, . . . , for all j ∈ N, ℓ ∈ T.   (5)

Recall that ρ is the desired traffic intensity and φ is the target nominal speed. Equation (3) means that all application requests must be routed to servers; by Equation (4) the traffic intensity of each powered-on server is ρ; and Equation (5) ensures that the aggregate workload distribution at each powered-on server j is approximately equal to the target workload distribution H(·). In fact, Equation (5) is directly derived from Equation (2) by applying the routing fractions v_ajℓ to arriving requests instead of distributing them equally over the N_ℓ powered-on servers. Based on Equations (4) and (5), the arrival rate into each powered-on server remains constant at ρφθ across time intervals, since the mean amount of work brought by an incoming request is 1/θ by Equation (5) (i.e., −H̃'(0) = 1/θ), and thus

Σ_{a∈A∪{0}} λ_aℓ v_ajℓ = ρφθ y_jℓ.

In Section 3 we will formulate an MIP problem using time-homogeneity constraints (3), (4), and (5) to optimally determine v_ajℓ and achieve time-stability.
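A toy check of the rate-matching idea behind Equation (4): if each interval's (padded) total arrival rate equals N_ℓ·ρφθ and the routing fractions split it equally, then every powered-on server sees the same arrival rate ρφθ regardless of how many servers are on. The equal-split rule and the numbers below are illustrative, not the optimized fractions of the MIP.

```python
def per_server_rates(class_rates, n_on):
    """Equal-split routing sketch: each class sends fraction 1/n_on of its
    traffic to every powered-on server, so each server's aggregate arrival
    rate is (total rate) / n_on."""
    total = sum(class_rates)
    return [total / n_on] * n_on

rho, phi, theta = 0.8, 1.0, 1.0
target = rho * phi * theta   # desired constant per-server arrival rate
# Interval 1: 3 servers on, padded total rate 3 * target.
r1 = per_server_rates([1.2, 0.7, 0.5], n_on=3)
# Interval 2: 5 servers on, padded total rate 5 * target.
r2 = per_server_rates([2.0, 1.3, 0.7], n_on=5)
```

Even though the number of powered-on servers changes between the two intervals, each powered-on server's arrival rate stays at ρφθ = 0.8, which is exactly the time-homogeneity the constraints enforce.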

2.2.3. Adjusting the initial condition of the time intervals
The fundamental idea for reducing energy costs is to power servers on and off in response to time-varying workloads.
To that end, we will formulate an MIP problem in Section 3 to determine operational decisions (including powering servers on and off) so that the queue length distribution of every powered-on server is stabilized across time intervals. In this case, it is intuitive that time-stability (the stabilized queue length distribution) would be affected when servers are powered on afresh (from a powered-off state), since newly powered-on servers start processing jobs with an empty queue, whereas continuing servers process jobs with a stationary queue. Thus, the queue length distribution would not be stabilized when the length of each time interval is not long enough to reach steady-state operation. To address this problem, we add a batch of requests to servers that are powered on afresh at the beginning of a time interval, so that the queue length of every powered-on server is stochastically equivalent to that of a stationary M/G/1 queue. We adopt an approach proposed in our previous work (Kwon and Gautam, 2015), whose main idea is to add a number of jobs sampled from the stationary queue length distribution defined by the stabilized workload distribution H(·) and arrival rate ρφθ. If servers selected to be powered off are not idle at the end of a time interval, then incoming requests are no longer routed to those servers, and we wait until those servers complete service of the remaining requests before powering them off. As discussed in Kwon and Gautam (2015), the key benefit of the suggested approach is that performance bounds based on time-stability are provable and easily derived using standard queueing theory results. In addition, for the general case, we have the following remark on an extension of our suggested approach.
Remark 1 (Extension of time-stability to processor sharing): In Section 2.1, we introduced our notion of time-stability based on stationary M/G/1 queues where each server hosts a mixture of multiple heterogeneous applications with a First-Come, First-Served (FCFS) service discipline. Here our claim is that our suggested approach can be extended to a model using a Processor Sharing (PS) service policy. Under the PS regime, all of the jobs or entities in the system are served simultaneously and equally share the processor at any given time; it is also fairly common to model computer servers with a PS service policy. In fact, the M/G/1 queue with PS produces closed-form results identical to those of the M/M/1 queue with FCFS discipline (Gautam, 2012). Thus, our suggested approach for time-stability can be applied to a model that uses PS instead of FCFS as its service policy. Under the PS service policy, the mean queue length L of the M/G/1 queue would be L = ρ/(1 − ρ), and thus the mean sojourn time would be 1/(θφ(1 − ρ)).
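The batch-seeding step of Section 2.2.3 can be sketched as follows, using the PS setting of Remark 1, where the stationary queue length distribution is geometric with parameter ρ (a simplifying assumption; for a general M/G/1-FCFS server one would sample its stationary law instead, and the parameters below are illustrative):

```python
import random

def initial_batch(rho, rng):
    """Sample an initial queue length from pi(i) = (1 - rho) * rho**i, the
    stationary M/G/1-PS (equivalently M/M/1-FCFS) queue-length law at
    traffic intensity rho, by inverting the geometric CDF."""
    i, u = 0, rng.random()
    p = cdf = 1.0 - rho
    while u > cdf:
        i += 1
        p *= rho
        cdf += p
    return i

rng = random.Random(7)
# Seed 10000 hypothetical freshly powered-on servers at rho = 0.8.
batch_sizes = [initial_batch(0.8, rng) for _ in range(10000)]
mean_q = sum(batch_sizes) / len(batch_sizes)  # should be near rho/(1-rho) = 4
```

Seeding a freshly powered-on server with a batch drawn this way makes its queue stochastically equivalent to one already in steady state, and the empirical mean matches the L = ρ/(1 − ρ) of Remark 1.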

Practical application of the suggested approach
In Section 2.2, we introduced the notion of time-stability considered in this study and suggested an approach to achieve time-stability based on a moment-matching approximation. As previously mentioned, if we match sufficiently many moments, then the aggregate workload distribution at every powered-on server is approximately equal to H(·), and thus we could theoretically stabilize the queue length distribution. For practical application, however, we can choose to match only a few moments. In this section, we show that the mean queue length of every powered-on server can be reasonably stabilized across time intervals by matching only the first and second moments in practice. Recall that our aim is to manage each powered-on server as a stationary M/G/1 queue, and we use the Pollaczek-Khintchine (P-K) formula (Gautam, 2012) to determine the mean queue length L as follows:

L = ρ + ρ²(1 + C²) / (2(1 − ρ)),   (6)

where C² is the SCOV of the service time and ρ is the traffic intensity. Based on our suggested approach, we need to determine the target workload distribution H(·), and in fact H(·) can be determined from the desired mean queue length. For example, let L̄ be the desired (i.e., targeted) mean queue length; then it is possible to select the C² of H(·) based on Equation (6) for given L̄ and ρ (recall that ρ is the desired traffic intensity). Since C² is determined solely by the first and second moments of H(·) (−H̃'(0) and H̃''(0), respectively), the mean queue length can be stabilized as desired by determining the first and second moments of H(·) and matching them as follows:

Σ_{a∈A∪{0}} λ_aℓ v_ajℓ (−H̃_a'(0)) = ρφθ (−H̃'(0)) y_jℓ, for all j ∈ N, ℓ ∈ T,   (7)
Σ_{a∈A∪{0}} λ_aℓ v_ajℓ H̃_a''(0) = ρφθ H̃''(0) y_jℓ, for all j ∈ N, ℓ ∈ T.   (8)

In Section 3, we will formulate an MIP using constraints (7) and (8) instead of Equation (5) to stabilize the mean queue length and provide performance guarantees on the mean waiting time.
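Choosing the target SCOV from a desired mean queue length via Equation (6) is a one-line inversion; a sketch with illustrative numbers:

```python
def required_scov(L_target, rho):
    """Invert the P-K mean-queue-length formula
        L = rho + rho**2 * (1 + C2) / (2 * (1 - rho))
    for the service-time SCOV C2 that the target workload distribution H
    must have so a stationary M/G/1 server at intensity rho holds the
    desired mean queue length. Returns None if no nonnegative C2 works
    (i.e., L_target is below the deterministic-service mean)."""
    c2 = 2 * (1 - rho) * (L_target - rho) / rho**2 - 1
    return c2 if c2 >= 0 else None

# At rho = 0.8, an exponential workload (C2 = 1) yields L = rho/(1-rho) = 4,
# so a target mean queue length of 4 should give back C2 = 1.
c2 = required_scov(4.0, 0.8)
```

Once C² is fixed this way, constraints (7) and (8) pin down the first two moments of H(·), and the P-K formula then holds the mean queue length at L̄ in every interval.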

Optimization problem
In this section, we suggest an optimization problem to determine the operational decisions that correspond to the decision variables x_ajℓ, y_jℓ, v_ajℓ, and φ_jℓ introduced in Section 1.2. The key idea is to select these decision variables so that they not only enforce time-stability of the queue length distribution at every powered-on server but also result in an energy-efficient system.

Assumption for energy cost
To consider the energy cost of data center operation, we define a fixed energy cost f_j for a powered-on server j (i.e., the energy cost of server j in its "idle" state) and an operating cost c_j for processing jobs at server j at speed φ_jℓ. In fact, we assume that there is no cost and no service delay for switching server operations between time intervals. Next, we briefly explain why we assume there is no switching cost in our suggested problem. According to Lin et al. (2012) and Lin et al. (2013), the switching cost mainly consists of (i) the energy used to power servers on and off; (ii) a time delay during which the server is set up; and (iii) increased wear and tear from toggling the server. Regarding (i) and (ii), we first note that the length of the time intervals used in our suggested model is much longer than the duration of the setup time. For example, Gandhi et al. (2010), Guenter et al. (2011), and Gandhi et al. (2012) reported setup times ranging between 20 and 200 seconds, which are very small compared with a 1-hour (3600-second) time interval. In this case, the power consumed during setup is negligible compared with the power consumed running a server for a 1-hour time interval, and thus it is reasonable to consider only the operating cost without an extra setup cost. Also, our suggested approach is a static control algorithm, and server schedules (e.g., y_jℓ and φ_jℓ for all j ∈ N and ℓ ∈ T) are pre-determined, as opposed to the dynamic server provisioning considered in Lin et al. (2012) and Lin et al. (2013), which decides whether to put idle servers to sleep or wake up servers in real time. Thus, in the scenario considered in our problem, servers can be set up to process jobs in advance based on the pre-determined schedules, and the service delay is also negligible.
In terms of (iii), a body of literature (Chen et al., 2005; Guenter et al., 2011; Lin et al., 2012; Lin et al., 2013; Wang et al., 2015) considers the impact of server on/off cycles on server reliability. For example, the recent studies Lin et al. (2012), Lin et al. (2013), and Wang et al. (2015) proposed dynamic server provisioning models whose switching cost includes a server reliability cost for the wear-and-tear caused by toggling servers. However, in contrast to those dynamic provisioning algorithms, which change server states frequently, the servers in our problem stay in the powered-on or powered-off (or sleep) state for at least an hour; thus, the effect of toggling on reliability may not be significant. Moreover, as mentioned in Chase et al. (2001) and Bodik et al. (2008), powering servers off may even extend the lifetime of server components. Therefore, toggling servers should not significantly affect server reliability in our proposed problem. In addition, since the assignments of classes to servers change across time intervals in our suggested problem, one may argue that there is a cost for switching assignments. Such a cost can be safely ignored, as all applications essentially reside on all servers, and in a given time interval the applications in use are fired up while the others are off. It is also reasonable to assume that the assignment of applications to servers can be changed without service delay, as applications can be deployed on servers concurrently while those servers process other jobs. Based on the above discussion, we consider only the fixed energy cost and the operating cost, without a server switching cost, in our suggested problem.

Formulation
For time-stability, the target aggregate workload distribution H(·), the arrival rate λ_0 of the additional traffic, and its workload distribution H_0(·) are determined as described in Section 2.
Recall that we define time-homogeneity constraints to stabilize the mean queue lengths using constraints (7) and (8), based on the first and second moments of the workload distribution of the additional traffic (indexed as "0"), 1/θ_0 and H̄_0(0). Based on the sets of indices, parameters, and decision variables, we formulate an MIP optimization problem for each time interval ℓ ∈ T, with objective function (9) and constraints (10)-(17). The objective function (9) is the total energy cost in a cycle, consisting of the fixed cost and the operating cost, which we wish to minimize. We define the energy cost function as a combination of a fixed energy cost and an operating cost, using the model of power usage of typical servers adopted by Lin et al. (2013). For the operating cost, we note that modern CPUs can be operated at different speeds during runtime by employing DVFS, and recent studies by Raghavendra et al. (2008), Gandhi et al. (2009), and Kusic et al. (2009) reported that DVFS yields a linear power-frequency relationship without additional cost for ramping the processing speed up or down, as reflected in our objective function (9). Constraint (10) ensures that all class a traffic is divided across the servers, and constraint (11) ensures that the traffic intensity at server j is ρ, as the average work arriving at server j is ρφ_j^ℓ when the server runs at speed φ_j^ℓ. Constraint (12) forces server j's speed to be zero when it is off and limits its speed to its maximum value when it is on. Constraints (13) and (14) match the first and second moments, respectively, to stabilize the mean queue length as described in Section 2.3. Constraints (10), (11), (13), and (14) are derived from the homogeneity constraints defined in Section 2.2.2.
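Since the display equations are referenced above only by number, the following sketch reconstructs the per-interval formulation from the prose descriptions; the notation and exact form, in particular of the moment-matching constraints (13) and (14), are our assumptions and may differ from the published formulation.

```latex
\begin{align}
\min\;& \sum_{j \in N} \bigl( f_j\, y_j^{\ell} + c_j\, \phi_j^{\ell} \bigr) && \text{(9)}\\
\text{s.t.}\;& \sum_{j \in N} v_{aj}^{\ell} = 1, && \forall a \in A \quad \text{(10)}\\
& \sum_{a \in A} \lambda_a^{\ell}\, v_{aj}^{\ell} / \theta_a = \rho\, \phi_j^{\ell}, && \forall j \in N \quad \text{(11)}\\
& \phi_j^{\ell} \le \phi_j^{\max}\, y_j^{\ell}, && \forall j \in N \quad \text{(12)}\\
& \text{first- and second-moment matching at every powered-on server}, && \text{(13), (14)}\\
& \sum_{a \in A} \beta_a^{k}\, x_{aj}^{\ell} \le b_j^{k}, && \forall j \in N,\; k \in K \quad \text{(15)}\\
& v_{aj}^{\ell} \le x_{aj}^{\ell}, && \forall a \in A,\; j \in N \quad \text{(16)}\\
& v_{aj}^{\ell} \ge 0,\; \phi_j^{\ell} \ge 0,\; x_{aj}^{\ell},\, y_j^{\ell} \in \{0,1\}. && \text{(17)}
\end{align}
```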
Recall that we defined constraint (4) with a constant speed φ, whereas here we define constraint (11) to allow speed scaling with φ_j^ℓ in order to minimize energy cost, based on the theoretical foundations described in the online supplement. In fact, Theorem 1 in the online supplement shows that the queue length distribution of a time-varying M_t/G_t/1 queue is stochastically identical to that of a time-homogeneous M/G/1 queue when we adjust the processing speed (service rate) in response to the time-varying arrival rate according to constraint (4). In addition, constraint (15) limits resource k at server j to b_j^k across the classes assigned to it, with requirement β_a^k for class a, and constraint (16) ensures that if class a is not assigned to server j, then no fraction of its arriving requests is routed to that server. Constraint (17) ensures the non-negativity and binary nature of the decision variables. The decision variables x_aj^ℓ, y_j^ℓ, φ_j^ℓ, and v_aj^ℓ (for all ℓ ∈ T, j ∈ N, and a ∈ A) can be determined by solving the above MIP problem separately for each time interval ℓ ∈ T. Although we solve the optimization problem to determine the operational decisions (e.g., powering servers on/off, assignment, routing, and speed scaling) independently for each time interval ℓ ∈ T, the aggregate workload distribution and the queue length distribution of every powered-on server are stabilized across time intervals. In other words, our suggested approach simplifies a large-scale, transient, non-stationary problem by decomposing it into smaller individual problems (i.e., 24 MIP problems) that achieve time-stability. This is the key benefit of our suggested approach: we can solve smaller problems instead of a large-scale MIP problem while stabilizing the queue length distribution across time intervals.
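To make the structure of the per-interval problem concrete, the following sketch solves only the sizing and speed-scaling portion (objective (9) with constraints (11) and (12)) by brute force on a toy instance; the assignment and routing variables, moment matching, and resource capacities are omitted, and all parameter values are invented for illustration.

```python
from itertools import combinations

# Toy instance: all numbers illustrative, not from the paper.
f = [10.0, 12.0]        # fixed (idle) energy cost f_j of a powered-on server
c = [1.0, 1.2]          # operating cost c_j per unit of speed
phimax = [4.0, 5.0]     # maximum processing speeds
lam = [2.0, 1.5, 1.0]   # class arrival rates in this interval
m1 = [1.0, 1.2, 0.8]    # class mean workloads 1/theta_a
rho = 0.8               # desired traffic intensity at every powered-on server

# Constraint (11) pins the total speed: sum_j rho * phi_j = total arriving work.
work = sum(l * m for l, m in zip(lam, m1))
speed_needed = work / rho

best = None
servers = range(len(f))
for k in range(1, len(f) + 1):
    for on in combinations(servers, k):
        if sum(phimax[j] for j in on) < speed_needed:
            continue  # this on/off configuration cannot absorb the load
        # The LP over speeds is solved by a cheapest-first greedy allocation.
        remaining, cost = speed_needed, sum(f[j] for j in on)
        for j in sorted(on, key=lambda j: c[j]):
            phi = min(phimax[j], remaining)
            cost += c[j] * phi
            remaining -= phi
        best = min(best, cost) if best is not None else cost

print(best)  # minimum energy cost for this interval
```

Because constraint (11) fixes the total speed of the powered-on servers to the arriving work divided by ρ, the speed allocation inside each configuration reduces to a greedy step; a realistic instance with assignment and capacity constraints would require an MIP solver.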
Although we decompose the overall problem into individual MIP problems, each of them is NP-hard (the well-known NP-hard Capacitated Facility Location Problem can be reduced to our suggested MIP problem), so solving large instances is difficult and calls for an efficient algorithm. In this study we focus on introducing the key idea of our approach to achieve time-stability and on analyzing numerical results for smaller instances; developing an algorithm to solve the suggested MIP problem efficiently is beyond the scope of this study. Building on the suggested approach, we will study methods for solving large-scale problems in future work.

Performance guarantee
In Sections 2 and 3, we introduced the notion of time-stability and suggested an approach to stabilize the queue length distribution of each powered-on server. Specifically, we derived the constraints and formulated an MIP problem to stabilize the mean queue lengths. In this section, we show that a performance bound on waiting time can be obtained from the stabilized queue length distribution. In general, users of data centers commonly require QOS guarantees in terms of waiting times (or response times); for that matter, SLAs and their violations are also defined in terms of waiting times. Thus, a performance bound on waiting time is a preferred performance measure and much more useful for designing, monitoring, and controlling data center operations. Recall that, for energy efficiency, we apply speed scaling and show in the online supplement that the queue length distribution is stabilized when the processing speed of each powered-on server changes in proportion to the arrival rate. In this case, even though the queue length distribution is stabilized, the waiting time distribution is not, as the processing speed varies across time intervals. However, from the stabilized queue length distribution, we can derive the waiting time distribution and compute the mean waiting time of each application request for each time interval. For each time interval ℓ ∈ T, let X_a^ℓ be the waiting time of class a in time interval ℓ, with Cumulative Distribution Function (CDF) W_a^ℓ(·), and let X_j^ℓ be the waiting time of applications served by server j in time interval ℓ, with CDF W_j^ℓ(·). Then X_a^ℓ = X_j^ℓ with probability v_aj^ℓ, and the LST of the waiting time distribution for class a in time interval ℓ is the mixture W̃_a^ℓ(s) = Σ_{j∈N} v_aj^ℓ W̃_j^ℓ(s), where W̃_j^ℓ(s) is the LST of the waiting time distribution at server j in time interval ℓ.
For the stabilized M/G/1 queue, the Pollaczek-Khinchine (P-K) transform formula gives W̃_j^ℓ(s) = (1 − ρ)s / (s − λ_j^ℓ(1 − S̃_j^ℓ(s))). Note that the aggregate arrival rate λ_j^ℓ at server j in time interval ℓ is λ_j^ℓ = Σ_a v_aj^ℓ λ_a^ℓ. In addition, for a random variable Y with the target workload distribution H(·) and processing speed φ_j^ℓ of server j in time interval ℓ, the LST of the service time distribution of server j in time interval ℓ is S̃_j^ℓ(s) = H̃(s/φ_j^ℓ) (since the service time is Y/φ_j^ℓ). Combining these, the LST of the waiting time of class a in time interval ℓ is W̃_a^ℓ(s) = Σ_{j∈N} v_aj^ℓ (1 − ρ)s / (s − λ_j^ℓ(1 − H̃(s/φ_j^ℓ))), and the mean waiting time of class a in time interval ℓ is X̄_a^ℓ = Σ_{j∈N} v_aj^ℓ ρ(1 + C²)/(2(1 − ρ)θφ_j^ℓ), where C² is the SCOV of the target workload. Based on this analysis of waiting times, it is intuitive that waiting times of applications can be controlled in a straightforward manner by adjusting the processing speeds of powered-on servers. Furthermore, if we do not apply speed scaling and instead enforce a constant processing speed across time intervals, φ_j^ℓ = φ ∀j ∈ N, ℓ ∈ T, then the waiting time distributions are stabilized across time intervals along with the queue lengths. With speed scaling, the mean waiting time of each application class is bounded in terms of the minimum and maximum processing speeds of its assigned servers.
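As a concrete illustration (all parameter values below are invented), a class's mean waiting time is the routing-fraction-weighted sum of the standard M/G/1 mean waits ρ(1 + C²)/(2(1 − ρ)θφ_j) of the servers it is routed to:

```python
# Mean waiting time of one class in one interval, using the M/G/1 mean-wait
# expression E[W_j] = rho*(1 + C2) / (2*(1 - rho)*theta*phi_j).
# All numbers are illustrative.
rho = 0.8          # traffic intensity at every powered-on server
C2 = 0.5           # SCOV of the target workload distribution
theta = 1 / 2.3    # reciprocal of the target mean workload E(Y) = 2.3
phi = [2.0, 3.0]   # speeds of the two servers this class is routed to
v = [0.6, 0.4]     # routing fractions v_aj (sum to 1)

mean_wait = sum(
    va * rho * (1 + C2) / (2 * (1 - rho) * theta * ph)
    for va, ph in zip(v, phi)
)
print(round(mean_wait, 4))
```

Note how the faster server contributes a smaller wait, so routing fractions and speeds jointly determine the class-level mean.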
Since we have stabilized the aggregate workload distribution and the queue length distribution of each powered-on server, an upper bound τ_a on the mean waiting time of each class a (i.e., X̄_a^ℓ ≤ τ_a ∀a ∈ A, ∀ℓ ∈ T) can be easily obtained from the desired mean queue length L̄ introduced in Section 2.3. The upper bound τ_a is derived by assuming the worst-case scenario, in which class a workloads are routed solely to the powered-on server with the minimum speed min_j φ_j^ℓ. Consequently, an upper bound τ on the mean waiting times of the overall system across time intervals follows. Based on this analysis of the waiting time, we impose constraint (21) on the processing speed φ_j^ℓ of each server j in time interval ℓ to enforce the upper bound on the mean waiting time, where r is the bound ratio of the minimum speed to the maximum speed of the servers. We solve our proposed MIP problem with constraint (21) added and analyze the results in Section 5.
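A derivation consistent with this description (our reconstruction; the published form may differ) follows from Little's law: the mean queue length of every powered-on server is stabilized at L̄, and the arrival rate into server j in interval ℓ is λ_j^ℓ = ρθφ_j^ℓ, since the work arriving per unit time, λ_j^ℓ/θ, must equal ρφ_j^ℓ. Hence

```latex
\bar X_j^{\ell} \;=\; \frac{\bar L}{\lambda_j^{\ell}} \;=\; \frac{\bar L}{\rho\,\theta\,\phi_j^{\ell}},
\qquad
\tau_a \;=\; \max_{\ell \in T} \frac{\bar L}{\rho\,\theta\,\min_{j:\,y_j^{\ell}=1} \phi_j^{\ell}},
\qquad
\tau \;=\; \max_{a \in A} \tau_a .
```

A lower bound on the speeds of the form φ_j^ℓ ≥ r φ_j^max y_j^ℓ (constraint (21), in our assumed notation) then keeps the denominator, and hence τ, under control.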

Numerical evaluation
In this section we present simulation results showing that our suggested approach stabilizes the mean queue length, and we provide a performance bound on the mean waiting time. For the numerical evaluation, we selected two test instances from Gallego Arrubla et al. (2013): 10 servers with 20 applications (|N| = 10, |A| = 20) and 20 servers with 40 applications (|N| = 20, |A| = 40). Although data centers generally operate hundreds or thousands of servers, in most data centers the applications of a single client are clustered among a subset of servers for security and confidentiality; hence, test instances with 10 or 20 servers are large enough to show that our suggested approach can be applied successfully in practice. We modeled the arrival process of each application's requests as a non-homogeneous Poisson process with time-varying arrival rates λ_a^ℓ over 24 1-hour time intervals (i.e., 24 equally spaced time intervals for a 1-day cycle), such that ℓ ∈ T = {1, 2, . . . , 24}. In Figs. 2(a) and 2(b), we plot the arrival rates of both test instances (20 applications and 40 applications, respectively) across the 24 1-hour time intervals. For the workload distribution H_a(·) of each application a, we chose a uniform distribution with mean workload 1/θ_a and SCOV C_a². We also assumed that each application has requirements for three types of resources (K = {1, 2, 3}) and that each server has a capacity limit for each resource type. For the energy cost, we set the fixed energy cost f_j of a powered-on server j much higher than its unit operating cost c_j, reflecting the fact that server energy cost is dominated by fixed costs, as mentioned in Barroso and Hölzle (2007). We solved the suggested MIP problem using CPLEX 12.6 Concert Technology on an Intel Core i7-3740QM 2.70 GHz machine with 16 GB of memory, and we developed a discrete-event simulation framework in Java.
We ran simulation experiments using the decision variables x_aj^ℓ, y_j^ℓ, φ_j^ℓ, and v_aj^ℓ determined by solving the suggested MIP problem.
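The flavor of this experiment can be reproduced with a small virtual-waiting-time sketch for a single stabilized server (our simplification, not the authors' Java simulator): arrivals follow a piecewise-constant Poisson process, the speed is scaled as φ^ℓ = λ^ℓ/(ρθ) so that ρ stays fixed, each job's service time uses the speed of its arrival interval (leftover work is not rescaled at speed changes, a simplification), and λ^ℓ times the mean wait in interval ℓ estimates the mean queue length via Little's law. All rates and the exponential workload are invented for illustration.

```python
import random

random.seed(7)
rho, theta = 0.8, 1.0                        # target intensity; mean workload 1/theta = 1
rates = [2.0, 4.0, 8.0, 6.0, 3.0, 2.0] * 4   # piecewise-constant rates for 24 "hours"
hour = 500.0                                 # interval length in simulated time units

v = 0.0               # unfinished work (in time units) seen by an arriving job
queue_len = []        # Little's-law estimate of mean queue length per interval
for lam in rates:
    phi = lam / (rho * theta)   # speed scaling: rho = lam / (theta * phi) stays 0.8
    t, waits = 0.0, []
    while True:
        gap = random.expovariate(lam)
        if t + gap > hour:
            break
        t += gap
        v = max(0.0, v - gap)    # server drains work at rate 1 between arrivals
        waits.append(v)          # FCFS: the new job waits behind the leftover work
        v += random.expovariate(theta) / phi  # add its own service time Y / phi
    # Little's law: mean queue length ~ arrival rate * mean waiting time.
    queue_len.append(lam * sum(waits) / len(waits))

print([round(q, 2) for q in queue_len])
```

Despite the fourfold swings in the arrival rate, the per-interval estimates should hover around a common level, which is the time-stability effect the paper exploits.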

Computational time
Before analyzing the simulation results, to illustrate the complexity of our suggested MIP problem, we summarize in Table 2 the objective function values and the computation times obtained by solving the MIP across time intervals ℓ ∈ T for the two test instances. We obtained optimal objective function values for the instance of 10 servers with 20 applications within 3600 seconds; however, due to the complexity of the problem, for the instance of 20 servers with 40 applications we report the objective function values of the best feasible solutions and the MIP gaps under 3600-, 7200-, and 10,800-second time limits.

Analysis of time-stability
First, we analyze the mean queue lengths of each powered-on server to check whether our suggested approach achieves time-stability. Recall that, as described in Section 2.3, our suggested approach stabilizes the mean queue length by matching the first and second moments for the desired queue length L̄. For the numerical evaluation, we selected the desired mean queue length L̄ and the first and second moments (E(Y) and E(Y²), respectively) as L̄ = 3 with E(Y) = 2.3 and E(Y²) = 7.2737 for 10 servers with 20 applications, and L̄ = 4 with E(Y) = 3.7 and E(Y²) = 27.38 for 20 servers with 40 applications. In both cases, we selected the desired traffic intensity ρ = 0.8. Figures 3(a) and 4(a) depict the mean queue length of the powered-on servers across the 24 time intervals for both test instances. Although some variation exists, the mean queue lengths at each powered-on server are approximately stabilized as desired across the 24 time intervals, in contrast to the time-varying arrival rates shown in Figs. 2(a) and 2(b). We note that matching more moments of the target workload distribution would make the mean queue lengths even more stable across time intervals. We also compare against the mean queue lengths obtained from the decision variables determined by solving a modified MIP problem that replaces the time-homogeneity constraints (11), (13), and (14) with constraint (22), which only ensures that the average workload arriving at every powered-on server does not exceed the desired traffic intensity ρ; this provides a fair comparison with our suggested approach. As shown in Figs. 3(b) and 4(b), without the time-homogeneity constraints the mean queue length is not stabilized, in contrast to Figs. 3(a) and 4(a).
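The moment-matching idea behind constraints (13) and (14) can be checked numerically: the workload arriving at a server is a mixture of the routed classes' workloads, weighted by each class's share of the server's arrival stream, so the aggregate first and second moments are the correspondingly weighted class moments. All numbers below are invented for illustration.

```python
# Aggregate workload moments at one server from the classes routed to it.
lam_v = [1.2, 0.8, 2.0]   # lambda_a * v_aj for the classes routed to this server
m1 = [2.0, 3.0, 2.2]      # class first moments E[Y_a]
m2 = [5.0, 11.0, 6.0]     # class second moments E[Y_a^2]

total = sum(lam_v)
p = [w / total for w in lam_v]                 # mixture weights (arrival shares)
agg_m1 = sum(pa * m for pa, m in zip(p, m1))   # aggregate E[Y]
agg_m2 = sum(pa * m for pa, m in zip(p, m2))   # aggregate E[Y^2]
print(agg_m1, agg_m2)
```

Constraints (13) and (14) require these two mixture moments to equal the target workload moments at every powered-on server, which is what pins the mean queue length at L̄.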
We also compared the objective function values obtained by solving our suggested MIP problem with those obtained by solving the modified MIP problem, to check how much additional energy cost is required to stabilize the mean queue length and provide performance guarantees on the mean waiting time. As shown in Figs. 5(a) and 5(b), the total energy cost for stabilizing the mean queue length is only slightly higher, a premium that is reasonable given the provable performance bounds it buys.

Performance bounds on the mean waiting times
As mentioned in Section 4, a performance bound on the mean waiting time of requests can be derived by using the bound ratio r on the minimum processing speed of each powered-on server through constraint (21). Here, we selected the bound ratio r = 0.5 for 10 servers with 20 applications and r = 0.4 for 20 servers with 40 applications, and we contrasted the mean waiting times obtained with constraint (21) against the results obtained without it for both test instances. Figures 6(a) and 7(a) depict the mean waiting times and the performance bounds for the two test instances, respectively. Note that the mean waiting time is strictly bounded by τ, derived as in Equation (20), in contrast to the results obtained without constraint (21) shown in Figs. 6(b) and 7(b).

Benchmark
In addition, we compare our approach against the algorithm proposed by Lin et al. (2013) as a benchmark. Lin et al. (2013) considered a data center optimization problem and proposed an off-line algorithm to determine the number of active servers that minimizes the total energy cost, consisting of operating costs and switching costs. The main idea of their approach is to minimize energy costs while satisfying QOS constraints defined using standard queueing theory results, as in Van et al. (2009) and Rao et al. (2010). Lin et al. (2013) implemented a dispatching rule that delivers an equal amount of workload to each powered-on server (i.e., load balancing), and their model uses quasi-steady-state approximations (i.e., the metrics are piecewise constant over each time interval, which is long enough for the system to reach steady state) to define QOS constraints for a non-stationary system; it is worth pointing out that our suggested approach, in contrast, provides a provable performance bound for time-varying, transient systems. Since Lin et al. (2013) assumed that both workloads and servers are homogeneous and considered neither resource capacities nor speed scaling, we slightly modified both their model and ours for the comparison. Specifically, Lin et al. (2013) proposed a QOS constraint bounding the average waiting time by a threshold based on an M/G/1 processor-sharing queue, and we redefined their constraint using a multiclass M/G/1 FCFS queue (Gautam, 2012) for the heterogeneous workload. Also, since our suggested MIP problem does not consider switching costs (i.e., the cost of powering a server on and off), we modified our model by adding constraints (23) and (24) and cost function (25), so that our MIP problem determines operational decisions while minimizing both operating and switching costs.
To include the cost of powering servers on and off, we define additional decision variables u_j^on,ℓ (u_j^on,ℓ = 1 if server j is turned on at the beginning of time interval ℓ) and u_j^off,ℓ (u_j^off,ℓ = 1 if server j is turned off at the end of time interval ℓ − 1), and we define constraints (23) and (24) to determine u_j^on,ℓ and u_j^off,ℓ from the on/off schedule. Recall that the decision variable y_j^ℓ = 1 if server j is active in time interval ℓ, and y_j^ℓ = 0 otherwise. Using u_j^on,ℓ and u_j^off,ℓ, we can then define the cost function (25) for turning servers on and off, which is added to the objective function, where c_j^on and c_j^off are the costs of turning server j on and off, respectively. To compare the approaches, we solved both test instances without the resource capacity constraints (and with constraints (23) and (24) included) and set the cost of turning servers on/off as a fraction of the "idle" power cost f_j of each server j: 0.2, 0.4, 0.6, 0.8, and 1.0. For example, if the fraction is 0.2, then the cost incurred each time a server is turned on/off is 20% of the "idle" power cost. In Figs. 8(a) and 8(b), we compare the total energy costs of our approach against those of Lin et al. (2013), setting the thresholds on the mean waiting time equal to the upper bounds obtained by our suggested approach under the same traffic intensity. As shown in Figs. 8(a) and 8(b), the total energy cost of our suggested approach is smaller than that of Lin et al. (2013) for the same upper bounds on the mean waiting time. This is because both the variability of the queue length and the waiting time grow for multiclass jobs (heterogeneous workloads), so more servers are needed to provide the same performance bounds that our suggested approach can deliver.
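A reconstruction of constraints (23) and (24) and cost function (25) consistent with this description (our notation; the published form may differ), with ℓ − 1 read cyclically since the schedule repeats daily:

```latex
\begin{align}
& u_j^{\mathrm{on},\ell} \;\ge\; y_j^{\ell} - y_j^{\ell-1}, && \forall j \in N,\ \ell \in T, && \text{(23)}\\
& u_j^{\mathrm{off},\ell} \;\ge\; y_j^{\ell-1} - y_j^{\ell}, && \forall j \in N,\ \ell \in T, && \text{(24)}\\
& \sum_{\ell \in T} \sum_{j \in N} \bigl( c_j^{\mathrm{on}}\, u_j^{\mathrm{on},\ell} + c_j^{\mathrm{off}}\, u_j^{\mathrm{off},\ell} \bigr). && && \text{(25)}
\end{align}
```

Since (25) is minimized, the binary u variables take value 1 exactly when the schedule toggles, so (23) and (24) capture each on and off transition.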

Concluding remarks and future work
In this article, we considered a fairly common scenario in cloud computing in which requests from heterogeneous applications arrive at time-varying rates to a data center consisting of heterogeneous servers. For such a time-varying, heterogeneous system, it is difficult to achieve energy efficiency and provide performance guarantees simultaneously; to tackle this challenge, we suggested an approach to achieve time-stability. For performance guarantees, our suggested approach stabilizes (i) the aggregate workload distribution, based on a moment-matching approximation, and (ii) the arrival rates, based on routing fractions, to achieve a time-stable queue length distribution. We derived time-homogeneity constraints from this approach and formulated an MIP problem to optimally determine decisions about sizing, assignment, routing, and speed scaling that minimize energy costs while ensuring time-stability. Time-stability of the queue length distribution can be obtained by matching sufficiently many moments using the suggested approach, and we showed that reasonable time-stability of the mean queue length is already obtained by matching only the first and second moments. In addition, with a time-stable queue length distribution, the waiting times of applications are easily controlled by bounding the processing speeds. Our suggested approach thus enables performance guarantees on mean waiting times, an extremely useful QOS measure for both users and service providers. To evaluate our approach, we developed a simulation model and presented results that support our claims. In the present study we did not focus on developing an algorithm to efficiently solve the proposed MIP for large-scale problems; we leave that for future work.
We acknowledge that our approach is purely based on historical information, and further work is needed before it can be implemented in practice. We believe that the results of this research can be used by practitioners to develop an initial cut at setting up servers in their data centers. An analysis tailored to the type of applications hosted by the data center would then determine the number of stand-by servers to be positioned for unexpected surges in demand. Traffic can then be monitored, and significant real-time deviations from the time-stable queue lengths can trigger the use of the stand-by servers. Furthermore, one could develop control algorithms that dynamically change server settings on the fly, using our proposed approach as the baseline static setting. We plan to extend our study to develop an online algorithm for such real-time control problems in the future.