THE EFFECT OF PARALLEL EXECUTION ON MULTI-SITE COMPUTATION OFFLOADING IN MOBILE CLOUD COMPUTING

The demand for running complex applications on smart mobile devices is rapidly increasing. However, limited resources restrict the development of intensive applications on these devices. These restrictions can be overcome by offloading parts of an application's computation to powerful cloud servers. The objective of computation offloading is to offload parts of an application to cloud servers so as to minimize the response time, energy consumption, and monetary cost of the application. Unlike prior work in computation offloading, this work considers the effect of parallel execution, both on different devices (external parallelism) and on the different cores of a single device (internal parallelism). This work models each device as a multi-server queueing station and uses a genetic algorithm to determine a near-optimal offloading allocation. The results show that considering the effect of parallel execution yields better Pareto-optimal solutions for the allocation problem than excluding parallelism.


Chapter 1: Introduction

Motivation
The demand for mobile devices in our daily lives is continuously increasing, driven by impressive new features such as face recognition, augmented reality, and interactive gaming. However, these features are offered through applications that are resource-hungry, demanding intensive computation and high energy consumption. At the same time, mobile devices are severely resource constrained by their physical size, limited processing speed, and battery life. These limitations significantly hinder the development of such applications.
One promising approach to dealing with the resource limitations of mobile devices is computation offloading. Computation offloading improves the capability of mobile applications by migrating an application's heavy computation tasks to powerful servers in the cloud [18]. It can save energy and prolong the battery life of mobile devices by running on cloud servers the computation-intensive tasks that would drain the battery if executed locally. It can also improve the response time of a mobile application by running some tasks on cloud servers (assuming the processing speed of the cloud servers is higher than that of the mobile device). However, some factors adversely affect the efficiency of offloading, such as the amount of data that must be transferred between the mobile device and the cloud servers, and the communication bandwidth between them. Computation offloading also incurs monetary cost for the mobile user: (i) the user has to pay the renting cost of the cloud servers for the duration of the application's execution; (ii) in the case of excessive data exchange between the mobile device and the cloud servers, the user may have to pay for additional data usage if it exceeds the monthly subscription.
Thus, a mobile device should judiciously determine whether to offload computation, which tasks (i.e. parts) of an application should be offloaded, and to which servers in the cloud. Code offloading can be deployed by offloading a method, a thread, or a class of an application to the cloud server. The greatest benefit from computation offloading is obtained by finding the optimal allocation of an application's tasks to the different devices (i.e. the mobile device and the cloud servers) that minimizes the application's objectives. The objectives can be the total response time, the mobile battery energy required for the computation, or the monetary cost incurred by the user for execution on the cloud servers. The workflow (the execution sequence of tasks) of a mobile application may not be linear, i.e. it may contain tasks that can execute in parallel on multiple different resources. Computation offloading can be further enhanced by exploiting this non-linear property of the workflow through parallel task execution. Parallelism can be applied across the different computation resources (i.e. the mobile device and the cloud servers), referred to as external parallelism, or across the different processing cores of a single device, referred to as internal parallelism in this research. Both external and internal parallel execution can significantly improve the mobile application's response time depending on the offloading allocation. As a result, the energy consumption and the monetary cost needed to run the application are also affected by the offloading allocation.

Research Problem
Mobile Cloud Computing (MCC) has evolved dramatically from its origins in cyber foraging and continues to receive substantial attention from researchers, investors, and analysts due to its ability to move the execution of applications from a mobile device to powerful cloud server(s). Continuous research has been conducted in the area of MCC to make it more convenient and user-friendly for the end user. The current literature covers both single-site and multi-site code offloading between the mobile device and cloud servers. However, to the best of our knowledge, current models of code offloading still treat the computation of an application across different resources as sequential, adding up the execution times of all parallel tasks. This assumption of sequential execution of parallel tasks dramatically affects the prediction of the overall response time and energy consumption of the mobile application.
Further, there are diverse types of mobile device users whose objectives, needs, and perceptions of mobile applications differ. For example, some users are aggressive and mainly concerned with performance, some are conservative and concerned with the battery life of their device, and others are reluctant to spend any additional money on application execution.
Thus, to make code offloading more useful and acceptable to these diverse users, it must be multi-objective. To the best of our knowledge, the current literature focuses on application response time and energy consumption for computation offloading. However, the monetary cost arising from renting cloud servers, as well as the network service charges of the mobile device, is ignored. To achieve a more accurate estimate of code offloading, the network charges and the cloud server renting cost should also be considered. This research proposes a multi-site code offloading model that considers the external and internal parallel execution of application tasks together with multiple objectives, i.e. the response time, the energy consumption, and the monetary cost, to provide the user with a more realistic and near-optimal code offloading allocation.

Contributions
This thesis addresses the problem of multi-site offloading of mobile applications. Our work goes beyond existing approaches by considering the parallel execution of tasks in the offloading decision, in contrast to prior work that primarily focused on sequential execution.
The contributions of this thesis are as follows:
1. It proposes a theoretical framework for the near-optimal offloading allocation problem in a multi-site offloading scenario.
2. It uses a genetic algorithm to find a near-optimal allocation of tasks to different devices. The genetic algorithm iteratively evaluates multiple allocations to find the near-optimal solution.
3. To evaluate an offloading allocation, we propose a new algorithm that computes the application's response time, the energy consumption on the mobile device, and the monetary cost. Our algorithm accounts for the execution dependencies of the tasks and the parallel execution of tasks across the cores of a device as well as across different devices.
4. We implement our novel algorithm, which considers the parallel execution of tasks, and an existing algorithm that ignores it. We accomplish this implementation using an existing library of genetic algorithms in Java (the MOEA Framework [15]).
5. We compare and analyze our novel algorithm against the existing algorithm for a real-world face recognition application. The results show that accounting for the effect of parallel execution yields better near-optimal solutions for the allocation problem than not accounting for parallelism at all.

Research Overview
Response time and energy consumption are key elements in the performance and reliability of an application. Improving these two factors paves the way for the development of intensive applications such as face recognition and GPS services on mobile devices. A code offloading framework can dramatically improve both factors at a small processing cost by moving intensive execution from the resource-constrained mobile processor to powerful cloud servers. This research weighs the improvement in response time and energy consumption against the required additional processing cost in a multi-objective code offloading framework. The trade-off between the response time, the energy consumption, and the monetary cost is further examined by introducing parallel execution across the different available code offloading sites (VMs), as well as partitioning across the different cores of the processors, using a Genetic Algorithm (GA). The GA finds near-optimal values of the response time, the energy consumption, and the monetary cost by examining the solution population over all possible scenarios, such as complete offloading to the cloud server(s), total local execution (on the mobile device), or hybrid execution between the cloud server(s) and the mobile device.

Thesis Outline
This thesis consists of five chapters. A brief description of each chapter is as follows:

Chapter 1:
The first chapter provides the introduction. It summarizes the motivation behind this research and provides a brief overview of the research.

Chapter 2:
This chapter provides a background of available code offloading frameworks, their solution to the research problem, along with limitations and areas of improvement on each framework. It also summarizes our contribution to the literature.

Chapter 3:
This chapter illustrates our multi-objective framework that accounts for the effects of internal and external parallelism on offloading allocation problem.

Chapter 4:
This chapter compares the effect of external and internal parallel execution with the sequential execution proposed in the current literature through a real-world face recognition application.

Chapter 5:
This chapter provides the conclusions and the future work.

Chapter 2: Background

Mobile Cloud Computing
Mobile Cloud Computing (MCC) enables the execution of intensive applications, such as face recognition and augmented reality, on resource-constrained mobile devices. The primary role of MCC is to serve as a bridge between resource-rich cloud servers and the resource-constrained mobile device, improving application execution time and reducing application energy consumption [19]. One method of creating this server-client bridge is called code offloading.
There are several existing code offloading frameworks, which can be chosen based on the objective, such as improving the response time or reducing the energy consumption of a mobile application.

Single-Site Offloading Frameworks:
MAUI [7] is an offloading framework that reduces energy consumption by using integer linear programming to find a near-optimal offloading solution [14]. MAUI provides method-level code offloading and requires the developer to manually annotate the methods that can be offloaded to the cloud server. The framework maps the application as a call graph, where methods are represented by vertices and their invocations by edges [7].
CloneCloud [6] provides a transparent code-migration offloading framework driven by energy consumption and execution time [12]. It uses a combination of static analysis and dynamic profiling to automatically partition the application and migrate a thread of the application to the cloud server. The framework models the partitioning problem as a tree diagram.
The ThinkAir [17] framework focuses on the scalability of cloud VMs and dynamically scales cloud server instances to allow parallel execution of offloaded code on multiple instances [3]. As in MAUI, the application developer must manually annotate the parts of the code that can be offloaded to the cloud server. The framework contains an execution controller that estimates the execution time, energy consumption, and cost of offloading before generating the offloading policy.
COMET (Code Offload by Migrating Execution Transparently) provides transparent code migration through distributed shared memory (DSM) between the mobile device and the cloud server [13]. Similar to CloneCloud, this framework does not require manual annotation of the application code by the developer. It contains an automated code profiler that analyzes the application to derive the offloading policy.
The framework in [11] dynamically partitions the application by classifying each task as offloadable or unoffloadable to minimize the response time and energy consumption. Their model represents the application as a weighted consumption graph (WCG) to estimate the computational and communication costs and optimizes it using a min-cut offloading partitioning (MCOP) algorithm.

Multi-Site Offloading Framework:
Multi-site code offloading is a well-regarded approach for minimizing the energy consumption of mobile applications [26]. To the best of our knowledge, multi-site code offloading is considered by [24], [21], [14], and [26].
The multi-site code offloading framework of [24] assumes that each cloud server has a different computational capacity and network bandwidth. The application in this model is represented as a graph partitioning problem, where nodes refer to computation modules and edges refer to the interactions between modules. The model assigns weights to all nodes and edges and minimizes the computation and communication costs by solving a 0-1 integer linear programming (ILP) problem. The model is motivated by data-centric offloading, providing a solution for applications that require multiple sources of data.
The research in [21] developed an Energy Efficient Multisite Offloading (EMSO) algorithm by formulating the partitioning problem as a 0-1 integer linear programming (ILP) problem. It performs object-level offloading based on a Weighted Object Relation Graph (WORG) constructed through dynamic profiling. Using static analysis, weights are assigned to the nodes and edges of the graph to find the near-optimal offloading solution.
The energy-efficient multisite offloading policy (EMOP) in [26] optimizes the application's energy consumption using a discrete-time Markov chain (DTMC) model. It uses a value iteration algorithm (VIA) to determine the offloading policy for the Markov chain model. The model considers the heterogeneity of offloading sites and performs both data- and process-centric offloading.

Contribution to the literature:
To the best of our knowledge, all the single-site and multi-site frameworks mentioned in Section 2.4 and Section 2.5 use a binary decision variable to decide whether a task of an application should be offloaded to the cloud server or processed locally on the mobile device. Our model instead introduces a multi-state decision variable to decide which device should execute each task; the number of states of the decision variable equals the number of computing resources available for execution. Similar to the offloading framework of Sinha et al. [24], our model also allows each cloud server a different computational capacity and network bandwidth. Using a genetic algorithm, the multi-state decision variable determines which application tasks are offloaded to which cloud server.
In addition, all the single-site and multi-site offloading frameworks focus on minimizing the response time and energy consumption, while the monetary offloading cost arising from the mobile data network and from renting cloud servers is ignored. Our partitioning model optimizes the application with respect to three objectives: minimizing the response time, minimizing the energy consumption, and minimizing the operating cost; it produces Pareto-optimal solutions using a genetic algorithm. The user can choose any Pareto-optimal solution for code offloading based on the current state of the mobile battery and the network bandwidth.
In the current literature, the partitioning of an application with multiple parallel nodes adds up the computation times of those nodes. However, if the parallel nodes are assigned to multiple different cloud servers, the computation time is the maximum of the nodes' times, since the execution is parallel across the resources. Our model addresses this issue and introduces a queue at each cloud server to perform parallel execution of the parallel nodes, further improving the application response time prediction. Our model also takes multi-processor code offloading into consideration, since cloud servers are equipped with multiple processors that are available for code offloading.
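The difference between the two estimates can be sketched as follows; the task times and device names here are illustrative values, not taken from the thesis:

```python
# Completion time of a set of parallel-ready tasks under two models.

def sequential_estimate(times):
    """Current literature: parallel tasks are summed as if run one after another."""
    return sum(times)

def parallel_estimate(assignment):
    """Our model: tasks on different devices overlap; per device, tasks queue
    up sequentially, and the overall time is the busiest device's total."""
    per_device = {}
    for device, t in assignment:
        per_device[device] = per_device.get(device, 0.0) + t
    return max(per_device.values())

# Four parallel tasks of 2 s each, spread over three devices.
tasks = [("d0", 2.0), ("d1", 2.0), ("d1", 2.0), ("d2", 2.0)]
print(sequential_estimate([t for _, t in tasks]))  # 8.0
print(parallel_estimate(tasks))                    # 4.0 (d1 runs two tasks back to back)
```

The sequential model overestimates the completion time by a factor that grows with the number of devices the parallel tasks are spread over.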

Genetic Algorithm
The genetic algorithm (GA) is among the most popular techniques in computational research [25] and is widely used in the area of mobile computing. The research in [5], [9], and [28] uses genetic algorithms to find near-optimal solutions for the code offloading problem. A genetic algorithm starts with an initial population and produces new solutions based on the probabilities of crossover and mutation. It uses a fitness function to evaluate solutions and optimize the objectives of the application.

Initialization:
In initialization, the user defines the initial population size and the probabilities of crossover and mutation. The GA initializes a population of chromosomes of the user-defined size. In the computation offloading problem, each chromosome represents a unique offloading solution.

Selection:
Selection is the process of choosing two chromosomes from the population to recombine, generating a new population via crossover or mutation. The purpose of selection is to favor fitter individuals in the hope that their offspring (chromosomes) have higher fitness.

Crossover
The crossover operator combines two chromosomes to generate new offspring (chromosomes). It is applied to selected individuals in the hope that they produce children with better fitness. The recombination process is as follows:
1. The selection operator selects a pair of chromosomes at random to mate.
2. A random cross-site is selected in the gene sequence.
3. The position values following the cross-site are swapped between the two chromosomes to produce the new offspring.
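A minimal sketch of this single-cross-site recombination, assuming each chromosome is a list of device indices (one gene per task):

```python
import random

def single_point_crossover(parent1, parent2, rng=random):
    """Swap the tails of two chromosomes after a randomly chosen cross-site."""
    assert len(parent1) == len(parent2)
    site = rng.randrange(1, len(parent1))   # cross-site strictly inside the gene sequence
    child1 = parent1[:site] + parent2[site:]
    child2 = parent2[:site] + parent1[site:]
    return child1, child2

# Chromosome = offloading allocation: one device index per task (0 = mobile device).
c1, c2 = single_point_crossover([0, 1, 1, 2, 0], [2, 2, 0, 1, 1])
print(c1, c2)
```

At every position, the two children together carry exactly the two parents' genes; only the grouping changes.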

Mutation
The mutation operator slightly modifies chromosomes to improve fitness and avoid premature convergence; it prevents the algorithm from getting trapped in a local optimum. Crossover exploits the chromosomes to find better solutions, while mutation helps explore the whole search space. There are different forms of mutation for different kinds of representation. A simple mutation replaces the value of each gene with a user-defined probability.
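A sketch of such a per-gene mutation, assuming each gene is a device index bounded by the number of available devices:

```python
import random

def uniform_mutation(chromosome, num_devices, p=0.03, rng=random):
    """With probability p, replace each gene by a device index drawn
    uniformly at random from its bounds [0, num_devices - 1]."""
    return [rng.randrange(num_devices) if rng.random() < p else g
            for g in chromosome]

mutated = uniform_mutation([0, 1, 2, 1, 0], num_devices=3, p=0.03)
print(mutated)
```

With a small probability such as 0.03, most genes are left untouched, so the search moves in small steps around the current solution.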

GA Objectives:
A GA can optimize multiple objectives of the application simultaneously. In a multi-objective problem, there is no single best solution with respect to all objectives. Instead, the GA produces Pareto-optimal solutions, which cannot be directly compared with each other.

Fitness Function:
The fitness function consists of the mathematical model of the GA's objectives. In this research, the fitness function consists of three objectives: minimizing the response time, minimizing the energy consumption, and minimizing the monetary cost of the offloading problem.

Chapter 3: The Pareto-Optimal Solution
In this chapter, the theoretical research framework for finding a near-optimal offloading allocation for the multi-objective code offloading problem is discussed. The gain due to external and internal parallel execution, the relevant definitions, and the parameters computed to obtain the near-optimal offloading solution are also discussed, along with the conversion of a mobile application into a workflow graph and the role of the genetic algorithm.
This chapter is organized as follows. The set of computing devices is denoted D = {d0, d1, ..., dK}, where d0 is the mobile device and d1, d2, ..., dK represent the cloud servers.

Definition 3 (Mobile device).
A mobile device is a cell phone or any portable device that can connect to the internet and request the execution of application tasks from computing clouds. The mobile device d0 is a homogeneous multi-core device modeled as a six-tuple <B0, n0, s0, P0_exec, P0_comm, P0_idle>, where B0 is the current battery percentage of the mobile device, n0 is the number of processing cores, s0 is the processing speed of each core (in million instructions per second), P0_exec is the computation power consumption, P0_comm is the power consumption for communication (sending and receiving data), and P0_idle is the power consumption while the device is idle.

Definition 4 (Remote Cloud Servers).
In this work, a mobile device can offload its computation to more than one cloud server. A cloud server is a homogeneous multi-core computational resource (e.g. a virtual machine) that can execute tasks of a mobile application. A cloud server dh, h = 1, 2, ..., K, is modelled as a three-tuple <nh, sh, ch>, where nh is the number of cores of the cloud server, sh is the processing speed of each core (in million instructions per second), and ch is the monetary rate of renting the cloud server from the cloud provider (in dollars per minute). It is assumed that if a cloud server is used for executing certain tasks of a mobile application, the mobile user must rent the server from the cloud provider for the whole duration of the application's execution.

Definition 5 (Device-to-device bandwidth):
The current data bandwidth between any two devices must be known; this is necessary to estimate the communication time for data transfer between them. Let bandwidth(du, dv) be the bandwidth between device du and device dv, where u, v = 0, 1, 2, ..., K and u ≠ v. A task of the mobile application receives some input data and produces some output data. Not all tasks of a mobile application may be suitable for offloading to remote cloud servers: a task may be non-offloadable if it needs access to local components (such as a camera or other sensors) or if its execution on a remote cloud server might cause security problems.
In the workflow graph G, each task ti ∈ T is modelled as a two-tuple <oi, ci>, where oi is the type of the task (true for offloadable, false for non-offloadable) and ci is the amount of CPU cycles (in million instructions) required for the execution of task ti.
Each directed edge e = (ti, tj), where ti, tj ∈ T, represents the dependency of tj on ti for execution. Each edge (ti, tj) is associated with a value dataij, which represents the amount of data that needs to be communicated between the devices executing the tasks ti and tj.
This data transfer does not happen if the tasks ti and tj are executed on the same device. Let pred(ti) be the set of tasks on which task ti depends for execution, and let succ(ti) be the set of tasks that depend on task ti for execution. We define the level of task ti, level(ti), as the maximum of the levels of the tasks on which ti depends, plus 1, i.e. level(ti) = max{ level(tj) : tj ∈ pred(ti) } + 1, with level(ti) = 1 when pred(ti) is empty.
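The level definition can be sketched as a small recursive computation over the predecessor sets; the task names and dependencies below are illustrative:

```python
def task_levels(pred):
    """level(t) = 1 if t has no predecessors, else the maximum level of
    its predecessors plus 1. `pred` maps each task to its predecessor set."""
    levels = {}
    def level(t):
        if t not in levels:
            levels[t] = 1 if not pred[t] else max(level(p) for p in pred[t]) + 1
        return levels[t]
    for t in pred:
        level(t)
    return levels

# A small diamond-shaped workflow: t1 feeds t2 and t3, which both feed t4.
pred = {"t1": set(), "t2": {"t1"}, "t3": {"t1"}, "t4": {"t2", "t3"}}
print(task_levels(pred))  # {'t1': 1, 't2': 2, 't3': 2, 't4': 3}
```

Tasks sharing a level have no dependency path between them, so they are the candidates for parallel execution.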
For a given offloading allocation a = [a1, a2, ..., an], we can construct a time-weighted workflow graph G' = (T, E) from the workflow graph G as follows. Each vertex ti ∈ T is associated with a weight wi that represents the time to execute task ti on computation resource ai; wi can be computed by dividing the amount of CPU cycles ci required for the execution of task ti by the processing speed of a single core of device ai, i.e. wi = ci / s_ai. Each edge (ti, tj), with ti, tj ∈ T, is associated with a weight wij that represents the communication time needed for data transfer when task ti is executed on device du (i.e. ai = u) and task tj is executed on device dv (i.e. aj = v). This communication time depends on the amount of data that needs to be transferred and the bandwidth between the devices du and dv. Thus, wij can be computed as wij = dataij / bandwidth(du, dv).
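The two weight formulas can be sketched directly; the numeric values are illustrative:

```python
def vertex_weight(cycles_mi, speed_mips):
    """Execution time w_i = c_i / s_ai (million instructions / MIPS -> seconds)."""
    return cycles_mi / speed_mips

def edge_weight(data_mb, bandwidth_mbps, same_device):
    """Communication time w_ij = data_ij / bandwidth(d_u, d_v); zero when
    both tasks run on the same device, since no transfer takes place."""
    return 0.0 if same_device else data_mb / bandwidth_mbps

print(vertex_weight(4, 2))          # 2.0 s: a 4 MI task on a 2 MIPS core
print(edge_weight(16, 2, False))    # 8.0 s: 16 MB over a 2 MBPS link
print(edge_weight(16, 2, True))     # 0.0 s: co-located tasks
```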

Definition 8 (Offloading Allocation):
In a multi-server offloading scenario, each offloadable task of a mobile application can be allocated to run either on the mobile device or on one of the remote cloud servers. Each non-offloadable task must be allocated to run on the mobile device.
An offloading allocation is defined as one such assignment of the tasks of the workflow graph to devices. An offloading allocation a of the tasks in set T to the devices in set D is represented as a = [a1, a2, ..., an], where ai is the index of the device to which task ti is allocated.

Model without Considering Parallel Execution of Tasks
In this section, a mathematical model is presented that can be used to calculate the theoretical values of any offloading allocation a: the response time of the application, the battery energy consumption of the mobile device, and the monetary cost that will be incurred by the mobile user for executing the application according to allocation a. This model does not consider the parallel execution of tasks; here we follow the philosophy of the works in [24] and [27].

Response Time:
The response time RT_a of an offloading allocation a is the sum of the execution times of all tasks T = {t1, t2, ..., tn} and the communication times of all edges of the time-weighted workflow graph:

RT_a = Σ_{ti ∈ T} wi + Σ_{(ti,tj) ∈ E} wij

where wi is the execution time of task ti on its allocated device (computed from the CPU cycles ci required by ti) and wij is the communication time associated with the data transferred along edge (ti, tj).
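Under this sequential model the computation is a plain sum, which a short sketch makes explicit (the times passed in are placeholders):

```python
def response_time_sequential(exec_times, comm_times):
    """Model without parallel execution: RT_a is the plain sum of every
    task's execution time and every edge's communication time."""
    return sum(exec_times) + sum(comm_times)

print(response_time_sequential([4, 2], [3]))  # 9
```

With the introductory example's execution times later in this chapter (4 + 1 + 2 + 2 + 2 + 4 + 1 = 16 s), the stated 37-second response time implies a total of 21 s of communication time under this model.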

Energy Consumption:
The energy consumption of an offloading allocation is the sum of the execution energy, the communication energy, and the idle energy. For an offloading allocation a, it is computed as follows. The execution energy is the product of the execution time on the mobile processor and the user-defined execution power P0_exec. The communication energy is the product of the communication time of each edge (ti, tj) ∈ E involving the mobile device and the user-defined communication power P0_comm. The idle energy is the product of the idle power P0_idle and the total time the local processor stays idle, i.e. while execution is taking place on other computing resources (cloud servers) or while data is being communicated between two cloud servers. Thus, the energy consumption is:

E_a = E_exec + E_comm + E_idle
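A sketch of this three-term energy model; the power values match the introductory example later in this chapter, while the particular split of communication and idle time is an assumption chosen only for illustration:

```python
def energy_consumption(local_exec_time, mobile_comm_time, idle_time,
                       p_exec, p_comm, p_idle):
    """E_a = execution energy + communication energy + idle energy,
    all measured on the mobile device (times in s, powers in W -> Joules)."""
    return (local_exec_time * p_exec
            + mobile_comm_time * p_comm
            + idle_time * p_idle)

# 8 s of local execution, with an assumed split of 9 s communication and
# 20 s idle, at 0.5 / 0.25 / 0.15 W respectively.
print(energy_consumption(8, 9, 20, 0.5, 0.25, 0.15))  # 9.25 J
```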

Execution Cost:
The execution cost of an application depends on the user's mobile network package and on the cloud VM renting cost for the offloading allocation a. It is assumed that the monthly data subscription is limited; the user has some current remaining amount of data available for offloading, and α is the monetary rate for any additional amount of data (in dollars per MB).
The cost is calculated by first computing the additional data Da required by an offloading allocation, where Da is the amount by which the data required by the allocation exceeds the remaining data in the subscription (Da = 0 if the remaining data suffices).
Thus, the execution cost of an allocation a is the product of the total response time RT_a of the allocation with the operating charges of the rented cloud servers; if additional network data is required, the network charges (Da × α) are added to the overall cost:

Cost_a = RT_a × Σ_h ch + Da × α

where the sum ranges over the cloud servers rented for the allocation.
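A sketch of this cost computation, with server rates given in dollars per minute and the response time in seconds:

```python
def monetary_cost(response_time_s, server_rates_per_min, extra_data_mb, alpha):
    """Cost_a = RT_a * (sum of rented-server rates) + D_a * alpha."""
    minutes = response_time_s / 60.0
    return minutes * sum(server_rates_per_min) + extra_data_mb * alpha

# 37 s response time, servers rented at 0.03 and 0.05 $/min, no extra data.
print(round(monetary_cost(37, [0.03, 0.05], 0, 0.02), 4))  # 0.0493 dollars, i.e. about 4.9 cents
```

These numbers correspond to the introductory example of this chapter, where the allocation stays within the wireless plan's remaining data, so only the server renting charges contribute.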

Introductory Example
Workflow Graph: Figure 1 shows an example workflow graph consisting of seven tasks of a mobile application. There are three different resources available for execution: the local mobile device d0 and the cloud servers d1 and d2. Each of the seven tasks can either be offloaded to one of the cloud servers (d1, d2) or executed on the local processor. The goal is to process all seven tasks in the shortest possible time, with the minimum consumption of mobile energy and the lowest possible processing cost.
In Figure 1, task t1 must be executed first. When it finishes, tasks t2, t3, t4 and t5 can execute in parallel. When task t2 finishes, task t6 can start its execution. Similarly, when tasks t3, t4, t5 and t6 have finished, task t7 can begin execution. Once task t7 is finished, the execution of the mobile application is complete.

Figure 1: A Simple Workflow Graph
In Figure 1 above, task t1 is a non-offloadable task, whereas tasks t2, t3, t4, t5, t6 and t7 are offloadable. Each edge of the graph is labeled with the amount of data that must be transferred between the devices executing its two tasks; for example, if tasks t5 and t7 are executed on different devices, then 16 MB of data needs to be transferred between those devices.

Time-Weighted Workflow Graph for the allocation a = [d0, d2, d1, d1, d1, d0, d2]:
Let us refer to the example whose workflow is shown in Figure 1. Let the mobile user get 1 GB of data per month as part of the monthly subscription. Assume that, of this 1 GB, 500 MB is unused, and that the charge for using additional data is 2 cents per MB in the mobile user's profile. Further, assume the current state of the mobile device is as follows: the battery is at 95%, the device has only 1 processing core with a processing speed of 1 MIPS, and 0.5, 0.25 and 0.15 Watts are the execution, communication, and idle power consumption respectively. Let us consider that there are two cloud servers, d1 and d2, to which the offloadable tasks can be allocated. The cloud server d1 is described as <1, 2, 0.03>, i.e. it has 1 core, 2 MIPS is the speed of the core, and 0.03 dollars is the charge per minute for renting d1. Similarly, the cloud server d2 also has 1 core, with a 4 MIPS core speed and a 0.05 dollars per minute renting charge. The data transfer bandwidth among the different resources is assumed to be as follows: the bandwidth between the local mobile device d0 and cloud server d1 is 1 MBPS, the bandwidth between the mobile device d0 and cloud server d2 is 2 MBPS, and the data bandwidth between the two cloud servers d1 and d2 is 4 MBPS.
Let us consider the following offloading allocation a = [d0, d2, d1, d1, d1, d0, d2] for the tasks [t1, t2, t3, t4, t5, t6, t7] respectively. The time-weighted workflow graph corresponding to Figure 1 is shown in Figure 2, where all the weights (on the vertices and the edges) are in seconds. Tasks t1 and t6 are assigned to the local mobile device, so their execution times are 4 seconds each. Tasks t3, t4 and t5 have execution times of 2 seconds, since the speed of cloud server d1 is 2 MIPS. Similarly, tasks t2 and t7 have execution times of only 1 second, since the speed of cloud server d2 is 4 MIPS.

Evaluating a given allocation without Considering Parallel Execution of Tasks (for the introductory example)
In this section, for the introductory example of Section 3.3, we compute the response time, the energy consumption, and the monetary cost of the given allocation a = [d0, d2, d1, d1, d1, d0, d2] without considering parallel execution.

Response Time
Corresponding to each offloading allocation a, we can compute the response time of the mobile application, the battery energy consumption of the mobile device, and the monetary cost incurred by the mobile user for executing the application according to allocation a. For the allocation a, the response time is the sum of all vertex and edge weights of the time-weighted workflow graph in Figure 2, which is 37 seconds.

Energy Consumption:
The energy consumption of the offloading allocation a is the sum of the computation energy, the communication energy, and the idle energy. The computation energy can be calculated using equation (2): it equals the execution time of the tasks assigned to the local mobile device (t1 and t6) multiplied by the user-defined computation power of the mobile processor, which is 0.5 Watts. The total computation energy for the offloading allocation a is 4 Joules. Similarly, the communication energy can be calculated using equation (3): it equals the sum of the communication times of all edges on which the local mobile processor is either sending or receiving offloading data, multiplied by the communication power. The total energy consumption of the offloading allocation a, the sum of the computation energy, communication energy, and idle energy, is 9.25 Joules.

Monetary Cost:
The monetary cost of the offloading allocation a can be calculated using the cost equation (7) as follows. Since the remaining data limit in the user's wireless plan is assumed to be 500 MB and the offloading allocation a does not exceed it, there is no additional charge on the user's wireless plan, so Da is 0. However, renting the cloud servers d1 and d2 for the application duration of 37 seconds incurs a monetary processing cost: the renting rate of cloud server d1 (0.03 dollars/minute) and the renting rate of cloud server d2 (0.05 dollars/minute), each multiplied by the response time of the application (37 seconds). The total monetary cost of the offloading allocation a is therefore about 4.9 cents.
Thus, the response time of the workflow graph presented in Figure 2 for the offloading allocation a = [0, 2, 1, 1, 1, 0, 2] is 37 seconds, with an overall energy consumption of 9.25 Joules and a processing cost of about 4.9 cents.
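The cost arithmetic can be sketched as follows; expressing the renting rates in dollars per minute is an interpretation consistent with the totals reported here:

```python
# Monetary cost (equation (7)): renting rates of the two cloud servers applied
# over the application's response time, plus any additional data charge Da.
response_time_s = 37.0
rates_per_min = [0.03, 0.05]   # $/min for d1 and d2 (assumed unit)
Da = 0.0                       # allocation stays within the 500 MB data limit

cost = sum(response_time_s * (r / 60.0) for r in rates_per_min) + Da
# cost is about $0.049, i.e. roughly 4.9 cents
```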

Near-Optimal Allocation(s) using Genetic Algorithm without Considering Parallel Execution (for the Introductory Example)
In this section, we will apply a Genetic Algorithm (GA) to find the near-optimal offloading allocation for the introductory example provided earlier (Section 3.3). The GA starts with an initial population of offloading solutions whose size is defined by the user. Using the fitness function, it generates new offloading solutions (genes) through the mutation and crossover operators, applied with given probabilities. In exploring for the near-optimal solution, the GA considers multiple objectives to find an offloading allocation with minimum response time, energy consumption, and processing cost.

GA Parameters:
The genetic algorithm parameters are as follows. The GA starts with an initial population of 100 solutions and performs 10,000 evaluations on it. Each evaluation generates a new gene based on the mutation or crossover operator probability. The mutation operator is Uniform Mutation (UM) with a probability of 0.03; it mutates each decision variable in the gene by selecting a new value within its bounds uniformly at random [15]. The crossover operator is Subset Crossover (SSX) with a probability of 0.9; it swaps half of the non-matching decision variables between the two parent genes [15].
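The two operators can be sketched in a few lines; the function names and the list-of-integers gene encoding are illustrative, and the logic follows the textual descriptions above rather than any particular library implementation:

```python
import random

def uniform_mutation(gene, n_devices, p=0.03):
    """With probability p, replace each decision variable by a value drawn
    uniformly at random from its bound [0, n_devices)."""
    return [random.randrange(n_devices) if random.random() < p else g
            for g in gene]

def subset_crossover(parent1, parent2):
    """Swap half of the non-matching decision variables between two parents."""
    child1, child2 = list(parent1), list(parent2)
    diff = [i for i in range(len(parent1)) if parent1[i] != parent2[i]]
    for i in random.sample(diff, len(diff) // 2):
        child1[i], child2[i] = parent2[i], parent1[i]
    return child1, child2
```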
Next, we will show the results of applying the GA to the introductory example (Section 3.3). The GA will be applied for three cases: allocating all the tasks to only the local mobile processor (no offloading); allocating the tasks between the local mobile processor and one VM (single-site offloading); and allocating the tasks among the local mobile processor and two VMs (multi-site offloading).

No offloading
When applying the GA with only the local processor, there is only one offloading allocation: all tasks must be executed in the local processor. The GA reveals this offloading allocation for the local processor resource d0. Thus, processing all the tasks in Example 1 requires 28 seconds, consumes 14 Joules of mobile battery, and incurs no additional processing cost since no offloading is possible in this configuration.

Single Site Offloading
In this configuration of a local processor with one VM, since each element can have a value of either 0 (local) or 1 (VM), there are 2^7 = 128 available offloading allocations. The GA starts with an initial population of 100 solutions and generates new solutions to find the near-optimal one.
In this configuration, task 0 is forced to be executed in the local processor, since Example 1 refers to a mobile application and offloading starts from the local processor. All other tasks of the application can be processed in the local processor or the cloud server. The GA provides two offloading allocations, a1 and a2, for the user to choose between based on the application objective. The first allocation, a1, provides the minimum response time RTa1 = 23 seconds and energy consumption Ea1 = 5.55 Joules; however, the user is charged a processing fee. The second allocation, a2, has no processing cost (Ca2 = 0.00); however, the application takes longer to complete (RTa2 = 28 seconds) and consumes Ea2 = 14.0 Joules of the mobile battery.

Multi-Site offloading
In the configuration with the local processor and two VMs, each element of the offloading solution can take three values, 0 (local), 1 (VM1), or 2 (VM2), so there are 3^7 = 2187 available offloading allocations. The GA starts with an initial population of 100 solutions and generates a new gene using either crossover or mutation, observing the behavior of the new gene to explore for the near-optimal solution. As in the one-VM configuration, task 0 is forced to be processed in the local processor, and all other tasks can be processed in the local processor or any available cloud server. The GA provides two different offloading allocations. If the user is hesitant to spend any additional cost on the application, they can use the allocation a2, which has no processing cost (Ca2 = 0.00); however, the application takes longer to complete (RTa2 = 28 seconds) and consumes Ea2 = 14.0 Joules of the mobile battery.
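The search-space sizes quoted for the single-site and multi-site configurations follow directly from the encoding; a trivial check (assuming the 7 tasks of the example):

```python
# Each of the 7 tasks can be mapped to any of the available resources.
n_tasks = 7
single_site = 2 ** n_tasks   # local + one VM  -> 128 allocations
multi_site = 3 ** n_tasks    # local + two VMs -> 2187 allocations
```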

Our Model considering Parallel Execution of Tasks
In this section, we describe our new algorithm to evaluate an allocation considering both external and internal parallel execution of the application tasks. The external parallelism is among the different available computing resources, and the internal parallelism is among the different processing cores of a single computing resource. The goal is to search for an offloading allocation with minimum Response Time, Energy Consumption, and Monetary Cost.
The model of the internal parallelism of a device du (where u = 0, 1, …, K) with ru cores is shown in Figure 3 (Figure 3: the device du modeled as a multi-server queueing station with ru servers). Figure 3 represents a device as a single multi-server queueing station that consists of a job queue and ru identical servers. The ru cores of the queueing station provide the internal parallelism of the device du. For the model of external parallelism, we assume that parallelism can exist among different devices and that each device maintains its own queue.

Definitions:
1. Each action required in the workflow graph shown in Figure 2 above is referred to as a Job.
2. A job can be scheduled to execute on any available device.
3. There are three kinds of jobs with respect to the execution of task ti:
3.1. receiveJob(tj, ti): The execution of this job represents the receiving of data (produced by task tj) by the device hosting task ti. This data will be needed for executing the task ti. This job is relevant when tasks tj and ti are hosted on different devices.
3.2. executeJob(ti): The execution of this job represents the execution of the task ti.
3.3. sendJob(ti, tj): The execution of this job represents the sending of data (produced by task ti) from the device hosting task ti to the device hosting task tj. This data will be needed for executing the task tj. This job is relevant when tasks ti and tj are hosted on different devices.
6. Each core in a device can be in either a busy state or an idle state.
6.1. A core in the busy state is busy processing a job.
6.2. A core in the idle state is not processing any job.

Job Generation
The process of generating, assigning appropriate times to, and processing the executeJobs, receiveJobs, and sendJobs is as follows. Step-1: In this step, we generate the jobs and schedule them on the relevant devices, setting the arrival times, service times, and depths for the jobs.

totalJobsList = {}
For each task ti, processed in the order of increasing levels: generate its receiveJobs, executeJob, and sendJobs and add them to totalJobsList.

Step-2: In this step, we process all the jobs to set the start time and end time of each. Process each job in totalJobsList in the order of increasing depths as follows:
1. Obtain the time instant t at which one of the cores of the job's scheduled device will be free; let p be the core that becomes free at time t.

Step-3: Compute the required measures as follows:
1. Compute the response time RTa as the latest end time among all jobs.
2. Compute the energy consumption Ea using equation (5).
3. Compute the monetary cost Ca using equation (7).
3.1. Since the value of RTa may differ when parallel execution is considered, the cost Ca may differ in that case as well.
For a given allocation a, the values of the three performance measures will likely differ depending on whether parallelism in the execution is considered or ignored. Hence, the near-optimal allocation(s) that minimize one or more of these measures may also differ when parallelism is considered.
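Step-2 above can be sketched as a small multi-server queue simulation; the dictionary-based job representation and the function name are illustrative assumptions, not the thesis implementation:

```python
import heapq

def process(jobs, cores_per_device):
    """Set start/end times for each job and return the response time RTa.

    jobs: list of dicts with keys 'device', 'arrival', 'service', 'depth'.
    cores_per_device: device -> number of identical cores (servers).
    """
    # Each device's cores are tracked as a min-heap of the times at which
    # a core next becomes free (all cores start idle at time 0).
    free = {d: [0.0] * r for d, r in cores_per_device.items()}
    for heap in free.values():
        heapq.heapify(heap)

    for job in sorted(jobs, key=lambda j: j["depth"]):  # increasing depth
        t = heapq.heappop(free[job["device"]])  # core p becomes free at t
        job["start"] = max(t, job["arrival"])   # wait in queue if core busy
        job["end"] = job["start"] + job["service"]
        heapq.heappush(free[job["device"]], job["end"])

    return max(j["end"] for j in jobs)          # RTa = latest job end time
```

With one core, two simultaneous arrivals are serialized; with two cores they overlap, which is exactly the internal parallelism the model captures.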

External Parallel Execution
External parallel execution involves processing multiple tasks at the same instant of time among different devices. For example, the job receiveJob(t2, t6) arrives at time = 7 seconds but waits in the queue until time = 9, when the mobile device d0 becomes available.
The total time for all the jobs in the mobile device d0 is 17 seconds.
Table 5 below shows the jobs schedule for the cloud server d1. The total time the cloud server d1 takes to process all the jobs is 27 seconds.
Table 6 below shows the jobs schedule for the cloud server d2. Server d2 processes a total of 8 jobs, which includes five receiveJobs, two executeJobs, and one sendJob from task t2. Since t7 is the last task in the workflow graph, it does not generate any sendJob. The server d2 spends 16 seconds on receiveJobs, 2 seconds on executeJobs, and 2 seconds on the sendJob. In server d2, Job 6 must wait in the queue for availability; otherwise all jobs are processed upon their arrival. The total time the cloud server d2 takes to process all the jobs is 27 seconds.

As per Step-3 of our algorithm, we compute the three measures. Since the job executeJob(t7) has the highest depth in the task graph, the response time is the end time of this job, which is 30 seconds. The energy consumption can be calculated using equation (5); for the offloading allocation a with external parallel execution it is 8.2 Joules. Finally, the cost can be calculated using equation (7): Ca = 30 × (0.03/60) + 30 × (0.05/60) + 0 = $0.04, i.e., 4 cents.
Thus, the allocation a = [d0, d2, d1, d1, d1, d0, d2] with the server queue takes 30 seconds to process, requires 8.2 Joules of energy, and incurs a processing cost of only 4 cents. Table 7 shows that introducing the queue into the workflow graph reduces the response time from 37 seconds to 30 seconds, the energy consumption of the mobile processor from 9.25 Joules to 8.2 Joules, and the processing cost from 4.9 cents to 4.0 cents.

Near-Optimal Allocation(s) using Genetic Algorithm Considering External Parallelism (for the Introductory Example)
Let us apply the genetic algorithm to the above queueing model of the workflow graph presented in Figure 2, using the same GA parameters as before. As in the non-queueing model, we perform the GA on the following three configurations.

No offloading:
When applying the GA with only the mobile processor configuration, there is only one offloading allocation: all tasks must be executed in the local processor. The results of this offloading allocation for the workflow graph of Figure 2 above are shown in Table 8 below:

Single Site Offloading:
The number of available offloading allocations is the same in the parallel execution model and the no-queue model (2^7 = 128). However, the parallel execution model maintains a queue of jobs for each processor while the processor is busy. The GA starts with an initial population of 100 solutions and generates new solutions on each iteration to find the near-optimal solution. Task 0 is forced to be executed in the local processor, as in the no-queue model. The genetic algorithm provides three solutions after exploration, so the user decides which objective is most important for their current mobile device configuration. Table 9 below presents the offloading allocations suitable in the presence of only one VM.

Multi-Site Offloading
In this configuration of a local processor with two VMs, each element of the offloading solution can take three values, 0 (local), 1 (VM1), or 2 (VM2), so there are 3^7 = 2187 available offloading allocations, as in the non-queueing model. It is important to note that, in this model, each resource maintains a queue of upcoming jobs so that they can be processed in parallel. The GA starts with an initial population of 100 solutions and generates a new gene using either crossover or mutation, observing the behavior of the new gene to explore for the near-optimal solution. As in the one-VM configuration, task 0 is forced to be processed in the local processor, and all other tasks can be processed in the local processor or any available cloud server. The GA provides four different offloading allocations after exploring the problem, as follows:

Evaluating a given allocation Considering both Internal and External Parallelism (for the introductory example)
In this section, we further enhance the queueing model presented in Example 1 for the offloading allocation a = [d0, d2, d1, d1, d1, d0, d2] by adding internal parallel execution. We assume that each available computing resource has 2 cores that can be used for parallel execution. There are three available resources for the offloading allocation a; the sequence of jobs for each resource is shown in the tables below. In the mobile device d0, one job (sendJob(t1, t5)) arrives at time = 4 seconds but waits in the queue for 2 seconds until a processor (P1 or P2) becomes available. The total time for all the jobs in the mobile device d0 is 15 seconds.
Table 12 above shows the jobs schedule in the cloud server d1. The processor 1 of cloud server d1 processes a total of 6 jobs: three receiveJobs, one executeJob, and two sendJobs; it spends 2 seconds on computation and 11 seconds on communication. The processor 2 handles a total of three jobs: two executeJobs and one sendJob; it spends 4 seconds on computation and 4 seconds on communication. In server d1, only job number 6 (sendJob(t3, t7)) finds the server in the busy state and waits in the queue for 2 seconds. The total time the cloud server d1 takes to process all the jobs is 17 seconds.
Table 13 above shows the jobs schedule in the cloud server d2. The processor 1 of cloud server d2 processes a total of 6 jobs: three receiveJobs, two executeJobs, and one sendJob; it spends 2 seconds on computation and 12 seconds on communication. The processor 2 handles only two sendJobs and spends 6 seconds on communication. In server d2, all jobs start upon their arrival and no job waits in the queue for processor availability. The total time the cloud server d2 takes to process all the jobs is 18 seconds.

As per Step-3 of our algorithm, we compute the three measures. The job executeJob(t7) has the highest depth in the workflow graph, so the response time of the offloading allocation a is the end time of this job, which is only 18 seconds. The energy consumption can be calculated using equation (5); for the offloading allocation a with external and internal parallel execution it is reduced to 6.7 Joules. Finally, the cost can be calculated using equation (7): Ca = 18 × (0.03/60) + 18 × (0.05/60) + 0 = $0.024, i.e., 2.4 cents.
The renting cost of cloud server d1 is 0.03 and that of cloud server d2 is 0.05 dollars per minute in this model. Thus, Table 14 shows that performing internal and external parallel execution reduces the response time from 37 seconds (no-queue model) to 18 seconds. Also, the energy consumption of the mobile processor is reduced from 9.25 Joules to 6.7 Joules. Finally, the processing cost is reduced from 4.9 cents to 2.4 cents.

Near-Optimal Allocation(s) using Genetic Algorithm Considering both Internal and External Parallelism (for the Introductory Example)
Let us apply the genetic algorithm to the above model of the workflow graph presented in Example 1 to find the near-optimal offloading allocation, with the genetic algorithm parameters as before. As in the previous sections, the GA is performed on three configurations. To visualize the gain of internal parallel execution, we assign two processors to each resource in each of the three configurations: No_Offloading, Offloading_With_1VM, and Offloading_With_2VM.

No offloading
Applying the GA with only the mobile processor configuration provides only one offloading allocation: all tasks must be executed in the local processor. The offloading allocation details for the task graph are shown in Table 15 below.

Single Site Offloading
In this configuration, there are two available resources, the local mobile device and the cloud VM1, each with two available processors. Task 0 of this allocation is forced to be executed in the local mobile device, and all other tasks can be assigned to any resource. The results of this configuration are shown in Table 16 below, which presents two possible offloading allocations a1 and a2 for this configuration.

Multi-Site Offloading
In this configuration, two cloud servers are added to the mobile device, each with two processors for internal parallel execution. The offloading allocations for this configuration are shown in Table 17 below, which presents the offloading allocations of the two-VM configuration.
As with one-VM offloading, the GA explores two offloading allocations a1 and a2. The offloading allocation a1 assigns all tasks to the mobile device to reduce the operating cost, as in the no-offloading allocation. However, the offloading allocation a2 executes all tasks on VM2 except task 0, which is forced to execute in the mobile device. The mobile device processes one executeJob and four sendJobs; its processor 1 spends 4 seconds on computation and 2 seconds on communication.

Summary
In this section, we summarize the gains of all three configurations: without queue (Section 3.5), external parallel execution (Section 3.8), and both internal and external parallel execution (Section 3.10). The comparison is presented in Table 18.

Chapter 4: Case Study
In this chapter we will analyze a real-world face recognition application problem to answer the following questions: • Does consideration of parallel execution of different tasks of an application while solving the offloading allocation problem influence the near-optimal solution?
• What is the effect of multi-core devices on the near-optimal solution of the offloading allocation problem?

Mobile Application Specification
In this section, the proposed code offloading framework will be evaluated for external and internal parallel execution using a real-world face recognition application, based on the call graph presented in [27] and shown in Figure 4 below. Figure 4 represents the call graph of the face recognition application, which is built upon open-source code implementing the Eigenface recognition algorithm [27]. The call graph is constructed by analyzing the application with the Soot tool and building a network and energy profiler. Each step in the call graph has two bold lines: the first represents the class name, and the second (with a colon) represents the method name of that class. The execution time for each step is given in ms (milliseconds) and the data transfer between two steps is given in KB (kilobytes). For our analysis, we convert the call graph into a workflow graph, which is shown in Figure 5 below.
Figure 5 above represents the workflow graph of the face recognition example from [27]. The execution time wi for each task ti (1 ≤ i ≤ N) of the application is converted from milliseconds to million instructions (MI). We assume that the mobile device has one core with a processing speed of 1000 MIPS; using this assumption, we convert each task's execution time from milliseconds to million instructions. As an example, the task t1 (Jama.Matrix :time), which has a 68.6 ms execution time, converts to 68.6 MI.
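The ms-to-MI conversion can be sketched as follows, under the stated 1000 MIPS assumption:

```python
MOBILE_SPEED_MIPS = 1000.0   # assumed speed of the single mobile core

def ms_to_mi(exec_ms):
    """Convert an execution time in milliseconds to million instructions."""
    return (exec_ms / 1000.0) * MOBILE_SPEED_MIPS

w1 = ms_to_mi(68.6)   # task t1 (Jama.Matrix :time): 68.6 ms -> 68.6 MI
```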
The tasks which are not offloadable are shown in black, specifically tasks t14 and t15 in Figure 6 above.

Model Specification
In this section, the specification about mobile device, mobile user profile, cloud servers, GA configurations and system configuration will be specified.

Mobile Device:
The mobile device d0 is modeled as a tuple of six values: the current battery percentage of the mobile device, the number of processors in the mobile device, the processing speed of each processor (in million instructions per second), the computation power consumption, the power consumption for communication (sending and receiving data), and the power consumption while the device is idle.
For this analysis, the tuple is <10%, 1 core, 1000 MIPS, 0.9 W, 1.3 W, 0.3 W>: the mobile device d0 is currently at 10% battery, each core can process 1000 million instructions per second, and the computation, communication, and idle power consumption are 0.9 W, 1.3 W, and 0.3 W respectively (the number of cores is varied from 1 to 4 in the configurations of Section 4.3).

Mobile User Profile:
The mobile user profile specifies the user's network package, represented as a tuple of two values: the remaining amount of data left from the fixed portion of the plan, and the monetary rate for additional data (in dollars per MB).
For the analysis in this project, we assume the remaining data to be 1024 MB and the additional-data rate to be 0.3 dollars per MB.

Cloud Server d1:
The specification of the cloud server d1 is represented as a tuple of three values: the number of available cores in the cloud server, the processing speed of each core (in million instructions per second), and the monetary rate of renting the cloud server from the cloud provider (in dollars per minute).
For d1 the tuple is <4, 2000, 0.6 $/min>: the cloud server d1 has 4 available cores, and each core is twice as fast as the local mobile processor, processing 2000 million instructions per second, with a total renting cost of 0.6 dollars per minute.

Cloud Server d2:
The cloud server d2 has 4 available cores, and each core is 4 times as fast as the mobile processor, able to process 4000 million instructions per second, with the total renting cost of

Device to Device Bandwidth:
In the analysis of our framework, we consider a maximum resource availability of two cloud servers, and the network bandwidth between any two resources is as follows: the network bandwidth between the local mobile device d0 and cloud server d1, as well as cloud server d2, is 1 MBps. Similarly, the network bandwidth between the two cloud resources d1 and d2 is also 1 MBps.
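Given these bandwidths, an edge's communication time is simply its payload divided by the link bandwidth; a minimal sketch (the 1024 KB/MB convention is an assumption):

```python
BANDWIDTH_MBPS = 1.0   # MB/s between any pair of resources, as stated

def transfer_time_s(data_kb):
    """Time in seconds to ship a payload of data_kb kilobytes over one link."""
    return (data_kb / 1024.0) / BANDWIDTH_MBPS

t = transfer_time_s(512)   # a 512 KB payload takes 0.5 s
```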

Genetic Algorithm Configurations:
The GA is designed based on the NSGA-II optimization algorithm, which introduces fast non-dominated sorting and uses a more computationally efficient crowding-distance metric during survival selection compared to NSGA [15]. NSGA-II uses binary tournament selection with Pareto dominance and crowding distance, together with subset crossover and uniform mutation operators.
We apply subset crossover to recombine GA solutions and uniform mutation to mutate them. The probability of applying the subset crossover operator is 0.9, and the probability of applying the uniform mutation operator to a decision variable is 1 / (number of decision variables, i.e., number of tasks) = 1/15. The initial population size is 1,000 solutions. The GA performs 100,000 evaluations with the set crossover and mutation probabilities to find the near-optimal solution, optimizing three objective functions: response time, energy consumption, and processing cost.

System Configurations:
The evaluations of all configurations were performed on the Windows 10 (Redstone 4) operating system. The hardware consists of an Intel Core i7-6800K CPU with 6 cores and a base frequency of 3.4 GHz. The system has 16 GB of Random-Access Memory (RAM).

Results and Discussion:
In this section, based on the mobile face recognition application (Section 4.1) and the model specifications presented in Section 4.2, the offloading allocation problem will be evaluated in the following three cases to observe the effect of external parallel execution.
• No-offloading: In this case, we assume that there is no cloud server available for computation offloading. Thus, all the tasks must execute locally in the mobile device d0.
• Single-Site offloading: In this case, we assume that there is one cloud server d1 available for computation offloading so each task which is offloadable can be executed either in mobile device d0 or the cloud server d1.
• Multi-Site offloading: In this case, we assume that there are two cloud servers d1 and d2 available for computation offloading. Thus, the execution of all offloadable tasks can take place in any of the available resource, d0, d1 and d2.
Similarly, the effect of internal parallel execution will be represented through the following three configurations:
• 1-core in each computing resource: It is assumed that all available resources can perform execution on only one core of the processor.
• 2-core in each computing resource: In this configuration, the processor can perform internal parallel execution among only two cores.
• 4-core in each computing resource: This configuration allows the processor to divide the workload among four cores to reduce the total processing time of the application's tasks.

Including or excluding parallel execution to find the near-optimal offloading allocation
This section tries to answer the question: does consideration of parallel execution of an application's different tasks while solving the offloading allocation problem influence the near-optimal solution? We assume every device has one processing core; we therefore consider only external parallelism when comparing parallel versus non-parallel execution in this section. For each evaluation, the GA optimizes the problem based on three objectives: response time, energy consumption, and processing cost. For each run, the GA produces Pareto-optimal results across the application objectives, and it is up to the user to pick the right Pareto-optimal solution for their current mobile device configuration. In this research, we examine all three objectives individually.

Response Time:
The response time relates to the performance of the application, and optimizing the response time means reducing the time required to process all the tasks of the application. The Pareto-optimal solutions for response time in all three cases are shown in Table 19 below. In the No-Offloading case, the minimum response time is the same (5.5149 seconds) in both scenarios, whether or not parallel execution is considered. This is because in the No-offloading case all the tasks must be allocated to the mobile device, which has only one core, forcing the tasks to execute sequentially; thus, no parallel execution is performed in either scenario.
In conclusion, the Single-Site and Multi-Site offloading cases provide gains of 11.68% and 12.83%, respectively, when external parallel execution among different resources is considered, compared to sequential execution. It is also observed that the task allocations in both cases differ between the two scenarios.

Energy Consumption:
The energy consumption objective is aligned with the power-saving mode of the application. It suits scenarios where the mobile device's current battery percentage is low and the user wants the application to consume as little battery as possible. As mentioned in Section 4.2.1 (mobile device), the current battery level is assumed to be 10%, so the user chooses this option to make the battery last longer. The near-optimal solutions for the energy consumption objective are shown in Table 20 below, which presents the energy-consumption-based Pareto-optimal offloading allocations for parallel execution among computation resources versus sequential execution. The energy consumption for the no-offloading case is the same (4.9634 Joules) in both scenarios, since there is only one resource (the local mobile device d0) with a single available core; thus no external or internal parallel execution is performed in the no-offloading case.
In the Single-Site offloading case, parallel execution consumes 44.71% less energy compared to no parallel execution among different resources: 2.1422 Joules versus 3.8746 Joules when parallel execution is ignored. The near-optimal offloading allocation for the application tasks is the same in both the parallel and sequential scenarios: [d1, d1, d1, d1, d1, d1, d1, d0, d1, d1, d0, d0, d1, d0, d0].
Similarly, in Multi-Site offloading, the parallel execution scenario consumes 33% less energy compared to the sequential implementation. It is also observed that the GA's near-optimal solution for the Multi-Site offloading case differs between the two scenarios; the near-optimal offloading allocation with parallel execution is [d2, d2, d2, d2, d1, d2, d2, d0, d1, d2, d0, d0, d2, …]. Thus, the two scenarios reveal that if the user's current mobile battery percentage is low, the user should consider parallel execution to consume less power during execution: Single-Site and Multi-Site offloading consume 44.71% and 33.00% less power, respectively, compared to not considering parallel execution.

Monetary Cost:
Cost with Response Time: The processing cost incurred by the mobile user can also be chosen as an objective function. However, the minimum monetary cost (zero dollars) trivially corresponds to the No-offloading case, so GA optimization on this objective alone always leads toward the No-offloading case in any scenario. Solving the near-optimal offloading allocation problem with monetary cost as the objective therefore makes sense only when other objectives (such as response time and energy consumption) are considered as well. Let us consider the processing cost with respect to response time and energy consumption individually.
The processing cost is any additional cost required to achieve the near-optimal response time and energy consumption. Table 21 below presents the cost objective with respect to response time. Based on Table 21, it is safe to conclude that parallel execution of the tasks not only improves the response time of the application but also reduces the processing cost of the near-optimal solution. The cost for the no-offloading case is the same (0.0¢) in both scenarios. However, the Single-Site offloading case requires 10.81% less cost in the parallel execution scenario compared to sequential execution. Similarly, the cost difference for the Multi-Site offloading case is 12.5%, so parallel execution of the tasks incurs 12.5% less operating cost compared to not considering parallel execution.

Cost with Energy Consumption:
In this section, the relationship between operating cost and the application energy consumption will be discussed. The required operating cost with respect to mobile device energy consumption is shown the  Based on Table 22, one can easily visualize the behavior of cost and energy consumption for parallel execution and non-parallel execution. Similar to the response time, the no-offloading case does not require any additional operating cost since all tasks are processed in the local processor.
Further, the cost gain of parallel execution in Single-Site offloading is 8.108% compared with the non-parallel-execution scenario. Finally, parallel execution in the Multi-Site offloading case requires 23.287% less operating cost than the non-parallel-execution scenario.
Hence, it is safe to conclude that parallel execution of the application tasks improves all three objectives (response time, energy consumption and cost) compared with sequential execution.

4.3.2. Evaluating the effect of multi-core devices on the near-optimal offloading allocation:
This section tries to answer the question: what is the effect of multi-core devices on the near-optimal solution of the offloading allocation problem? We consider both internal and external parallelism here. In order to visualize the effect of internal parallel execution, we extend the three cases (No-offloading, Single-Site offloading, Multi-Site offloading) with three processor configurations: 1-core, 2-core and 4-core. The renting rate of any cloud server di (i = 1, 2) with m cores is m * ri, where ri is the renting rate of di with one core. Similar to external parallel execution (Section 4.3.1), there is no single best solution that minimizes all three objectives at the same time, since a small improvement in one objective may deteriorate at least one other objective [8]. Instead, we obtain a Pareto-optimal set of solutions. Pareto optimality considers a solution to be better or worse than another solution only if it is better with respect to all objectives or worse with respect to all objectives. Two solutions are nondominated if neither dominates the other, i.e. neither one is better than the other in all objectives. The set of all nondominated solutions forms the Pareto-optimal set [15]. For each pair of case (Case-1, Case-2 and Case-3) and core configuration (1-core, 2-core and 4-core), the Pareto-optimal set contains bold values representing the minimum value of each objective function among the solutions. For example, in the Pareto-optimal set for the case-core pair (Case-1, 1-core), solution 11 yields the minimum energy consumption of 4.9634 Joules, and solution 12 yields the minimum response time of 4.8431 seconds with no additional cost. We now look at the three cases (No-offloading, Single-Site offloading, Multi-Site offloading) individually to observe the gain of internal parallel execution.
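The dominance test described above can be stated compactly. The following sketch is our own illustration (not code from the framework): it filters a list of (response time, energy, cost) triples down to its nondominated subset, with all objectives minimized. The sample triples are illustrative, not taken from the tables.

```python
def dominates(a, b):
    """a dominates b if a is no worse in every objective and strictly
    better in at least one (all objectives are to be minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(solutions):
    """Return the nondominated subset of (time, energy, cost) triples."""
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o is not s)]

# Illustrative triples: the third is dominated by the first in all objectives,
# so only the first two are Pareto-optimal.
front = pareto_front([(4.8431, 4.9634, 0.0), (3.5, 6.0, 5.0), (6.0, 7.0, 1.0)])
```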
4.3.2.1. Case 1: No offloading:
In Case 1, only the local processor is available for execution, and all tasks are forced to be processed in the local processor, resulting in no additional processing cost. The only possible solution for this case is [d0, d0, d0, d0, d0, d0, d0, d0, d0, d0, d0, d0, d0, d0, d0], in which all tasks are processed on the local mobile device. However, varying the number of processor cores introduces internal parallel execution, which significantly improves the response time at the expense of energy consumption. The GA Pareto-optimal solutions for Case 1 (No offloading) are shown in Table 23 below. For 2-core processing, the response time and energy consumption are 4.8431 seconds and 6.2148 Joules respectively, which shows a gain of 12.18% in response time with a trade-off of 25.12% more energy consumption compared with the 1-core processing scenario. It is observed in Figure 7 that the response time for 2-core processing reaches its near-optimal value and does not improve any further when additional cores are added for the face recognition example. Hence, 4-core processing yields the same response time as 2-core processing, 4.8431 seconds. However, the energy consumption increases in every 4-core processing scenario. The energy loss is due to the processor being idle when there are not enough application tasks available. Thus, for the No-offloading case, the energy consumption is minimum for 1-core processing, the response time is near-optimal for 2-core processing, and there is no additional processing cost.
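The plateau beyond two cores arises because the width of the task precedence graph limits how many tasks can ever run concurrently. The toy scheduler below is our own illustration (not part of the framework and not the face recognition task graph): on a hypothetical precedence graph of width 2 with unit task durations, the makespan improves from one core to two, then stays flat at four cores.

```python
def makespan(durations, preds, m):
    """Greedy list scheduling of precedence-constrained tasks on m
    identical cores; returns the completion time of the last task."""
    n = len(durations)
    succ = [[] for _ in range(n)]
    indeg = [len(preds[t]) for t in range(n)]
    for t in range(n):
        for p in preds[t]:
            succ[p].append(t)
    cores = [0.0] * m            # time at which each core becomes free
    finish = [0.0] * n
    ready = [t for t in range(n) if indeg[t] == 0]
    while ready:
        t = ready.pop(0)
        c = min(range(m), key=lambda i: cores[i])   # earliest free core
        # A task starts when its core is free and all predecessors are done.
        start = max([cores[c]] + [finish[p] for p in preds[t]])
        finish[t] = start + durations[t]
        cores[c] = finish[t]
        for s in succ[t]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return max(finish)

# Hypothetical 6-task graph of width 2: tasks 0 and 1 are independent,
# 2 joins them, 3 and 4 fork from 2, and 5 joins 3 and 4.
preds = [[], [], [0, 1], [2], [2], [3, 4]]
durations = [1.0] * 6
for m in (1, 2, 4):
    print(m, makespan(durations, preds, m))   # 1 -> 6.0, 2 -> 4.0, 4 -> 4.0
```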

4.3.2.2. Case 2: Single-Site offloading:
In Case 2, Single-Site offloading, there are two available resources: the local mobile processor and one cloud server. In this case, all tasks except t14 and t15 can either be offloaded to the cloud server or be processed in the local mobile device. The Pareto-optimal solutions for the Case 2 (Single-Site offloading) scenario are shown in Table 24 below. Figure 8 shows the Pareto-optimal solutions for Single-Site offloading when executed in the 1-core, 2-core and 4-core configurations. Let us analyze each configuration individually to assess the gain of internal parallel execution.

2-Core each Resource:
In this scenario, each computing resource has the ability to perform internal parallel execution across both cores of its processor, as well as external parallel execution between both resources. Each core of the processor can handle the computation or communication of application tasks depending on their arrival and the processor availability. The GA reveals two Pareto-optimal solutions in this scenario, labelled 22a and 22b in Table 24. Solution 22a is Pareto-optimal for the cost objective and gives the user the option to process all tasks of the application without any additional processing cost. However, solution 22b provides the minimum response time and energy consumption of 3.1846 seconds and 3.0462 Joules respectively, with an additional processing cost of 6.36¢. In this scenario, the response time improves by a further 3.88% compared with the minimum response time of the Pareto-optimal solution in the 1-core scenario (21b). However, the energy consumption increases by 42.20% compared with the minimum energy consumption of the Pareto-optimal solution (21b) in the 1-core scenario. The additional processing cost of the application also increases to 6.36¢ compared with the 1-core Pareto-optimal solution (21b), which requires the maximum processing cost in the 1-core scenario.

4-Core each Resource:
In this scenario, each computing resource has the ability to execute application tasks in parallel across its different processor cores (internal parallel execution). Each core can perform computation or communication based on its availability and the arrival of application tasks.
Similar to the 2-core scenario, the GA reveals two Pareto-optimal solutions in this case, labelled 24a and 24b in Table 24. This case further improves the application response time at the expense of energy consumption and processing cost. Solution 24a provides the minimum operating cost of the application; however, solution 24b provides a response time of 3.1703 seconds with an energy consumption of 4.9398 Joules and an additional processing cost of 12.68¢. In this case, the gain in response time is just 0.45%, whereas the trade-off in energy consumption and cost is 38.33% and 49.84% respectively, compared with the 2-core scenario. Since the loss in energy consumption and cost is much higher than the gain in response time, it is safe to conclude that this case suits only those users whose main goal is to reduce the application response time without prioritizing the energy consumption or the additional processing cost of the application.
4.3.2.3. Case 3: Multi-Site offloading:
However, solution 31f consumes more battery, 1.9303 Joules compared with 1.8614 Joules for 31b, but requires a lower operating cost of 4.86¢ compared with 7.30¢ for 31b. Solution 31a does not offload any task to the cloud server and does not require any additional processing cost. All other Pareto-optimal solutions are intermediate solutions from which the user can choose based on their current device configuration.

2-Core each Resource:
In this scenario, each computing resource contains two processor cores and has the ability to perform internal parallel execution of all independent application tasks based on their availability. There are five different Pareto-optimal solutions in this configuration, labelled 32a, 32b, 32c, 32d and 32e in Table 25. Solutions 32b and 32d both provide the same application response time of 2.3715 seconds but differ in energy consumption and processing cost. Solution 32b provides the minimum energy consumption of 2.5554 Joules with a trade-off of 14.22¢ in processing cost. However, solution 32e provides a lower operating cost with slightly higher energy consumption. The user can choose any of these Pareto-optimal solutions to meet their desired goals based on the current situation of their mobile device.

4-Core each Resource:
In this scenario, each computing resource has four available processor cores for the internal parallel execution of tasks based on their availability. The GA reveals four different Pareto-optimal solutions for the (Case 3, 4-core) situation, labelled 34a, 34b, 34c and 34d in Table 25.
Solution 34b provides the lowest response time and energy consumption, 2.3644 seconds and 3.9684 Joules respectively, among all Pareto-optimal solutions, with the trade-off of a high operating cost of 28.37¢.
Compared with the 2-core scenario, the gain in response time is only 0.3%, whereas the energy loss is 55.38%. Thus, this solution provides the lowest response time but consumes more mobile energy and requires a higher operating cost than all other cases.

Summary:
In this section, we summarize the gain of internal and external parallel execution with respect to the response time, energy consumption and monetary cost of the application. Table 26 combines the results for all three cases (No-Offloading, Single-Site Offloading and Multi-Site Offloading) under the 1-core, 2-core and 4-core scenarios for each computing resource. Based on the results from Table 26, it is safe to conclude that the response time of an application is reduced by both external and internal parallel execution compared with sequential execution. Similarly, the energy consumption is reduced through external parallel execution. However, under internal parallel execution the work is divided across the several cores of the resource, and this division requires more energy to complete the tasks than sequential execution. Finally, the offloading monetary cost is considered with respect to response time and energy consumption separately. The user can trade additional monetary cost, paid for using cloud resources and the network service on the mobile device, for improvements in response time or energy consumption. The user can choose any Pareto-optimal solution based on their objectives and offloading needs.
The gain of our multi-objective algorithm, for external parallel over sequential execution as well as internal parallel over sequential execution, is verified through a real-world face recognition application from [27]. The results show that accounting for the effect of parallel execution yields a better near-optimal solution for the allocation problem than excluding parallelism from the analysis.

Future Work
Our proposed code offloading framework performs parallel execution of the parallel paths of an application, and parallel execution itself remains an open problem. The framework can be further enhanced by addressing some of its parallel-execution limitations. Future research directions to address these limitations are as follows:
• In internal parallel execution, data is accessed from the memory and cache by all processors of a single device. In the current framework, the data access time from the memory and cache is not included in the response time of the application.
• In regard to VMs (virtual machines), each VM has an initialization time that is currently not considered in the calculation. It is assumed that each VM is already initialized and ready to receive the offloaded tasks of an application from the mobile device.
• In external parallel execution, the communication rate among different devices is assumed to be constant. However, the data exchange rate between the mobile device and the cloud server changes continuously based on the user's wireless network plan and geographic location. Such dramatic changes in the wireless connection need to be addressed in the current framework.
• In the current framework, it is assumed that the mobile device is always connected to the internet and that there is no sudden interruption of the wireless connection. Further research is required to handle unexpected disconnection of the mobile device or cloud servers from the network.
• The current state of this framework depends heavily on user input for the model specification and the mobile device and cloud server configurations. The user has to manually enter all the details before performing the simulation. A user interface could be created to gather the model specifications from the mobile device profiler and the cloud servers' APIs.