Towards Optimal Parallelism-Aware Service Chaining and Embedding

Emerging 5G technologies can significantly reduce end-to-end service latency for applications requiring strict quality of service (QoS). With network function virtualization (NFV), completing a client's request for such applications requires the client's data to sequentially pass through multiple service functions (SFs) for processing/analysis, which introduces additional processing delay. To reduce the processing delay caused by serially-running SFs, network function parallelism (NFP), which allows multiple SFs to run in parallel, has been introduced. In this work, we study how to apply NFP to the SF chaining and embedding process such that the latency, including processing and propagation delays, is jointly minimized. We introduce a novel augmented graph to address the parallel relationship constraints among the required SFs. Considering these constraints, we formulate a novel problem called parallelism-aware service function chaining and embedding (PSFCE). For this problem, we propose a near-optimal maximum parallel block gain (MPBG) first optimization algorithm for the case where computing resources at each physical node are sufficient to host the required SFs. When computing resources are limited, we propose a logarithm-approximate algorithm, called parallelism-aware SFs deployment (PSFD), to jointly optimize processing and propagation delays. We conduct extensive simulations on multiple network scenarios to evaluate the performance of our schemes. Accordingly, we find that (i) MPBG is near-optimal, (ii) the optimization of end-to-end service latency largely depends on the processing delay in small networks and is impacted more by the propagation delay in large networks, and (iii) PSFD outperforms schemes directly extended from existing works in terms of end-to-end latency.


I. INTRODUCTION
Network function virtualization (NFV) implements network functions (e.g., firewall, parental control), which traditionally run on dedicated hardware, as software-based modules called virtual network functions (VNFs) or service functions (SFs) [1]-[3]. In the NFV paradigm, a client's NFV service request (NSR) includes the service source, the destination, a set of SFs, and the corresponding network resource demands (e.g., computing resources, bandwidth) [3]. To meet the client's request, the service provider can concatenate the required SFs into a service function chain (SFC) and embed it onto a shared physical network (PN) [4], [5]. The process of accommodating an NSR by composing and embedding an SFC onto a shared PN is referred to as service function chaining and embedding (SFCE). The physical forwarding path established by SFCE is called the service function path (SFP).
Recently, NFV techniques have been applied in 5G networks to facilitate low-latency service delivery [6]-[9], where 5G technologies are designed to significantly reduce latency (by as much as 10x) [10]-[12]. Under such scenarios, the total processing delay of the serially-running SFs in an SFC may be comparable to the propagation delay and can become the bottleneck when optimizing the overall end-to-end latency of SFC delivery [13], [14]. To mitigate this bottleneck, network function parallelism (NFP) was introduced to run multiple SFs from the same request parallelly at one physical node (e.g., a commercial/edge server) [13]. As a result, the processing delay of SFs working in parallel can be reduced from the sum of their processing delays (i.e., serially running the SFs) to the highest processing delay among them (i.e., parallelly running the SFs). According to [13], two SFs can be executed in parallel only if their operations do not conflict. For example, a flow monitor (FM) only monitors the client's data stream without any modifications, so it can operate in parallel with deep packet inspection (DPI). On the contrary, both DPI and encryption might modify packets; thus, they cannot work in parallel. The constraint on whether two SFs can work in parallel is referred to as the parallel relationship constraint.
When applying NFP to deliver SFC services with ultra-low latency requirements, we need to jointly optimize the SFs' processing delay and the SFP propagation delay. Ideally, a physical node with enough computing resources can run many SFs in parallel to reduce the SFs' processing delay, while the SFP propagation delay can be reduced by embedding the required SFs along the shortest path connecting the service source and the destination. In practice, due to the limited computing resources at each physical node, greedily minimizing SFs' processing delay may end up increasing the SFP propagation delay. For example, a physical node with enough computing resources (e.g., a datacenter) may be geographically far away from both the service source and the destination, which may require a long physical routing path. In the literature, to reduce the latency of delivering a client's service, many existing works have focused on minimizing the SFP length/propagation delay in SFCE, where SFs' processing delay is regarded as fixed or ignored [15]-[33]. The problem of how to apply NFP to SFCE to optimize the end-to-end latency remains challenging.
In this work, we investigate how to efficiently apply NFP to SFCE such that the end-to-end latency (including the SFs' processing delay and the SFP propagation delay) of delivering parallelism-based services is minimized. We introduce a novel augmented graph, called the parallel graph (PG), to address the parallel relationship constraints. Considering these constraints, we mathematically model the parallelism-aware service function chaining and embedding (PSFCE) problem with the goal of jointly minimizing the SFs' processing delay and the SFP propagation delay. Next, we prove that the PSFCE problem is NP-hard under various network scenarios. For this NP-hard problem, we propose two efficient heuristic algorithms, called maximum parallel block gain (MPBG) first optimization and parallelism-aware SFs deployment (PSFD), to optimize the end-to-end service latency. Meanwhile, we show that MPBG achieves the optimal performance in some scenarios and that PSFD is generally logarithm-approximate. We conduct extensive simulations to evaluate the performance of our proposed algorithms. Specifically, we show that (i) MPBG is near-optimal, (ii) the optimization of the end-to-end service latency largely depends on SFs' processing delay in small networks and is impacted more by the SFP propagation delay in large networks, (iii) to achieve latency-efficient service delivery in edge-cloud systems, short parallelism-aware SFCs (fewer than 10 SFs) should be deployed at the edge, while long parallelism-aware SFCs should be deployed in the cloud, and (iv) PSFD outperforms the schemes directly extended from existing works in terms of end-to-end latency.
The rest of this paper is organized as follows. Section II summarizes related work, while Section III introduces network function parallelism and parallelism-based SFC (P-SFC). We formulate the problem of parallelism-aware service function chaining and embedding (PSFCE) in Section IV. In Sections V, VI, and VII, we analyze the NP-hardness of PSFCE and present novel analysis and algorithms to optimize it in various network scenarios. Section VIII analyzes experimental results. We conclude our work in Section IX.

II. RELATED WORK
To satisfy ultra-low latency requirements, much work has been done to optimize SFP length/propagation delay [15]-[33]. When delivering a service as a traditional SFC (a linear logical structure), the authors in [15] proposed a heuristic algorithm that applies the betweenness centrality technique to minimize the number of hops and hence the propagation delay. When the client has specific QoS requirements, the authors in [16] formulated the QoS-aware and reliable traffic steering (QRTS) problem and proposed an approximation algorithm based on the primal-dual technique. When the SFC is given a priori, the authors in [17] developed SFC-constrained shortest-path schemes based on a transformation of the network graph. To optimize propagation delay in the scenario of online/continuous learning, the authors in [18], [19] investigated delivering a hybrid service function chain (HSFC). When the client identifies the HSFC, the authors in [18] proposed an optimal hybrid SFC embedding (Opt-HSFCE) algorithm, which optimizes the propagation delay of the constructed SFP. When the HSFC is not given a priori, the authors in [19] proposed a 2-approximation algorithm based on graph-theoretic techniques, called Eulerian circuit-based hybrid SFP optimization (EC-HSFP). Under latency limitations, the work in [21] proposed a heuristic algorithm to jointly optimize the resource utilization of both physical nodes and links. In [28], the authors proposed an architecture to reduce the latency of NFV systems with 5G techniques. The authors in [30] proposed a mixed integer linear program (MILP) and three heuristics to optimize the resource utilization for accommodating SFCs. In [33], the authors proposed a heuristic approach to deploy a given SFC under latency and mobility constraints.
In [38], the authors investigated the problem of composing, computing, and networking SFs by embedding them onto physical nodes such that the overall latency is minimized, and proposed a mathematical model that applies the resource-constrained shortest path technique on a proposed auxiliary layered graph.
To reduce SFs' processing delay, the authors in [13] proposed the technique of network function parallelism (NFP), which enables simultaneously executing multiple parallelizable SFs. Based on that, the works in [39]-[46] investigated how to apply NFP while considering dependencies among the required SFs. Here, dependencies represent the execution orders of the required SFs. In [44], the authors proposed a heuristic algorithm, called parallelism-aware residual capacity first placement (PARC), to maximize request acceptance while satisfying a specific latency constraint. In [45], the authors proposed a heuristic approach, called partial parallel chaining (PPC), to accommodate a P-NSR where SFs are mutually parallelizable; that is, if SF1 is parallelizable with SF2 and SF3, then SF2 can co-execute with SF3 as well. In [46], the authors proposed a scheme, called delay-balanced parallelism, to balance the overhead and the processing delay of running the required SFs. However, the above works hardly take the parallel relationship constraints into account. Notably, the parallel relationship constraint identifies whether two SFs can work in parallel, and it differs from the dependency constraints. For example, firewall (FW) and DPI can be executed in either sequential order (i.e., FW to DPI or DPI to FW), but they cannot work in parallel as they both modify packet contents. As the flow monitor (FM) does not modify the traffic, FM can be parallelly executed with DPI or FW. That is, even though SF1 is parallelizable with SF2 and SF3, SF2 and SF3 may not be able to co-execute. The parallel relationship constraint is further illustrated in the next section. Meanwhile, the SFC is given a priori in the above works, so the SFC composition process is not taken into consideration.
Compared to the above works, this work takes the parallel relationship constraints into account and studies how to jointly compose and embed a parallelism-aware SFC (P-SFC) in diverse network scenarios such that the end-to-end service latency (including the processing delay and the propagation delay) is minimized.

III. NETWORK FUNCTION PARALLELISM AND PARALLELISM-BASED SFC
A. Network Function Parallelism (NFP)
Different network functions perform diverse operations on a client's data packets, including writing, reading, dropping, and so on [13]. When two operations simultaneously change the client's packet content (e.g., dropping and dropping, writing and dropping), the relationship between these two operations is referred to as conflicting. When the operations of multiple SFs are non-conflicting, network function parallelism (NFP) allows these SFs to run in parallel and simultaneously process the packets of the same client [13]. Parallelizable SFs are those with non-conflicting operations.

B. Parallel Graph (PG)
We use a parallel graph (PG) to describe the parallel relationship constraints among the required SFs. A PG is defined as an undirected graph PG = (V, E), where V represents the set of SFs, and E is the set of parallel relationships among the SFs. Two SFs are parallelizable if and only if they are directly connected in the PG; otherwise, they are non-parallelizable. For example, in the PG of Fig. 1a, SF1 and SF2 can be executed in parallel, while SF5 is non-parallelizable with all other four SFs. The triangle in Fig. 1a (including SF1, SF2, and SF4) is a complete sub-graph, so these three SFs can work in parallel. For example, SF1, SF2, SF3, SF4, and SF5 can be the functionalities of load balancing (LB), flow monitor (FM), gaming core (GC), network address translation (NAT), and DPI, respectively [13]. Since DPI needs to perform the operations of reading, writing, and dropping on diverse fields in packets, it cannot be executed in parallel with any other SF. As FM only performs the operation of reading, it can parallelly execute with LB, GC, and NAT. Even though NAT and LB both perform writing operations, the packet fields they operate on are different; thus, NAT and LB can co-execute with FM, as the triangle (i.e., SF1, SF2, and SF4) shows. GC, however, performs reading and writing on packet fields that overlap with those of LB and NAT, so it can only be parallelly executed with FM.

C. Parallelism-Based Service Function Chain (P-SFC)
A service function chain (SFC) represents the sequential execution order of the required SFs. When applying NFP to SFCE, we define the parallelism-based SFC (P-SFC) to specify both the consecutive execution order and the parallelisms among the required SFs. A P-SFC is composed of a set of consecutively-ordered parallel blocks, each of which consists of either a set of parallelizable SFs or a single SF. Figs. 1b and 1c show two possible P-SFCs composed from the PG in Fig. 1a, where a blue dashed-line rectangle represents a parallel block. Specifically, the P-SFC in Fig. 1b contains three parallel blocks: the first includes SF1, SF2, and SF4, while the second and third include SF3 and SF5, respectively. The P-SFC in Fig. 1c includes five parallel blocks, each containing one SF, i.e., the five SFs will be sequentially executed in Fig. 1c. Note that the SFs assigned to the same parallel block must be parallelizable, i.e., the PG sub-graph formed by the SFs in the same parallel block must be a complete graph. For example, in Fig. 1a, the sub-graph of SF1, SF2, and SF4 is a complete graph.
Lemma 1: The sub-graph of SFs that can work in the same parallel block must be a complete graph in PG.
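As an illustration, the PG of Fig. 1a and the complete-sub-graph condition of Lemma 1 can be sketched in a few lines of Python. The adjacency map below is our reading of Fig. 1a (the triangle {SF1, SF2, SF4}, the edge SF2-SF3, and SF5 isolated), not taken verbatim from the paper:

```python
from itertools import combinations

# Parallel graph of Fig. 1a as an adjacency map (our reading of the figure).
PG = {
    "SF1": {"SF2", "SF4"},         # LB
    "SF2": {"SF1", "SF3", "SF4"},  # FM is parallelizable with LB, GC, NAT
    "SF3": {"SF2"},                # GC
    "SF4": {"SF1", "SF2"},         # NAT
    "SF5": set(),                  # DPI conflicts with every other SF
}

def parallelizable(u, v):
    """Two SFs are parallelizable iff directly connected in the PG."""
    return v in PG[u]

def is_valid_parallel_block(block):
    """Lemma 1: a parallel block is valid iff its SFs are pairwise
    parallelizable, i.e., they induce a complete sub-graph (clique)."""
    return all(parallelizable(u, v) for u, v in combinations(block, 2))
```

For instance, `is_valid_parallel_block(["SF1", "SF2", "SF4"])` holds, while any block pairing SF5 with another SF does not.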

D. Processing Delay and Propagation Delay
When parallelly executing multiple SFs, it is necessary to merge the results from these SFs to guarantee the correctness and uniqueness of the output [13]. It is worth noting that, in this work, the SFs in the same parallel block will be embedded onto the same physical node. Fig. 1d shows an example of embedding the P-SFC in Fig. 1b onto a PN, where the red-dotted arrow shows the embedding relationship between a parallel block and its physical node. To facilitate the description, the necessary notations are summarized in Table I. We use ξ_i to represent the i-th parallel block in the P-SFC. In Fig. 1d, ξ_1 = {SF1, SF2, SF4} is embedded onto physical node A, ξ_2 = {SF3} is embedded onto physical node B, and ξ_3 = {SF5} is embedded onto physical node C. We use L_{ξ_i} to represent the processing delay of ξ_i, which is calculated from Eq. (1), where L_v is the processing delay of SF v.
The processing delay of a P-SFC (L_PC) is the sum of its parallel blocks' processing delays. We use B to represent the set of parallel blocks (including the source and destination) in a P-SFC; the overall processing delay is then given by Eq. (2).
When ξ_i and ξ_j are embedded onto physical nodes n_i and n_j, there will be a physical forwarding path P_{n_i,n_j} composed of a series of physical links from node n_i to node n_j. For example, in Fig. 1d, the forwarding path P_{A,B} includes A → E → B, carrying the traffic from parallel block ξ_1 to ξ_2. Accordingly, we use L_{P_{n_i,n_j}} to represent the propagation delay from physical node n_i that hosts ξ_i to physical node n_j that hosts ξ_j. We use L_PD to represent the propagation delay of the forwarding path (i.e., the SFP in the physical network) for the P-SFC, which can be calculated by Eq. (3).
The overall processing delay and propagation delay of Fig. 1d follow from Eqs. (2) and (3), respectively. The overall end-to-end service latency in Fig. 1d is the sum of L_PC and L_PD.
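Eqs. (1)-(3) reduce to a few lines of code. The sketch below recomputes the latency of the Fig. 1d embedding; the per-SF and per-path delay values are hypothetical, since the figure's numbers are not reproduced here:

```python
# Parallel blocks of Fig. 1d: xi_1 = {SF1, SF2, SF4} on node A,
# xi_2 = {SF3} on node B, xi_3 = {SF5} on node C.
proc_delay = {"SF1": 40, "SF2": 25, "SF3": 30, "SF4": 35, "SF5": 50}  # hypothetical, in us
blocks = [["SF1", "SF2", "SF4"], ["SF3"], ["SF5"]]

# Eq. (1): a block's delay is the max delay among its parallel SFs.
# Eq. (2): L_PC sums the delays of all parallel blocks.
L_PC = sum(max(proc_delay[v] for v in b) for b in blocks)

# Eq. (3): L_PD sums the forwarding-path delays between consecutive
# blocks, e.g., P_{A,B} = A -> E -> B. Path delays are hypothetical.
path_delay = {("A", "B"): 12, ("B", "C"): 7}  # us
L_PD = path_delay[("A", "B")] + path_delay[("B", "C")]

end_to_end = L_PC + L_PD  # (40 + 30 + 50) + (12 + 7) = 139 us
```

Note how the first block contributes only 40 us (its maximum), not 40 + 25 + 35 us.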

IV. PROBLEM FORMULATION
A. Physical Network Model
The physical network (PN) is an undirected graph PN = (N, L), where N represents the set of physical nodes, each of which can be a datacenter, edge server, or point of presence (PoP), and L is the set of physical links connecting the physical nodes in the PN. Each physical node n ∈ N has a certain amount of computing resources C_n and can host any type of SF. Each link l_{m,n} ∈ L (m, n ∈ N) has a specific amount of available bandwidth bw_{l_{m,n}} and a propagation delay L_{l_{m,n}}. Note that the propagation delay of a link is fixed and determined by the length of the link. A physical forwarding path P_{m,n} is composed of a series of physical links from physical node m to n. We use L_{P_{m,n}} and BW_{P_{m,n}} to represent the propagation delay and the bandwidth of P_{m,n}, respectively.

B. Parallelizable NFV Service Requests
The parallelizable NFV service request (P-NSR) is defined as a 4-tuple P-NSR = <s, d, PG, BW>, where s and d are the service source and destination, PG = (V, E) represents the parallel graph, and BW is the bandwidth demand. Each SF v ∈ V specifies a required network function, a computing demand C_v, and a maximum processing delay L_v.

C. Problem Formulation
Given a P-NSR, PSFCE is defined as: how to compose a P-SFC for a P-NSR and embed it onto a given PN such that (i) the end-to-end service latency (i.e., the sum of processing delay and propagation delay) of the constructed SFP is minimized; and (ii) the following constraints are satisfied. Table I lists the notations used in the problem formulation. The objective function of the proposed PSFCE problem is shown as Eq. (4), which minimizes the overall service latency.
SF Node Embedding Constraint: Eq. (5) indicates whether an SF node v is assigned to the i-th parallel block, while Eq. (6) indicates whether the i-th parallel block is mapped onto physical node m. Eq. (7) ensures that u and v can work in parallel only if there is an edge between u and v in the given PG. If ξ_i is embedded onto physical node m, Eq. (8) requires that m have enough computing resources to host it. Eq. (9) and Eq. (10) guarantee that each SF node v is assigned to exactly one parallel block and that each parallel block is embedded onto exactly one physical node. Eq. (11) calculates the processing delay of a parallel block.
SFP Routing Constraint: We use P^{ξ_i,ξ_j}_{n_i,n_j} to denote whether P_{n_i,n_j} is the forwarding path from ξ_i to ξ_j, as in Eq. (12). Eq. (13) and Eq. (14) ensure that, if P_{n_i,n_j} is employed to transmit the traffic from ξ_i to ξ_j, then ξ_i and ξ_j must be embedded onto n_i and n_j, and P_{n_i,n_j} must have enough bandwidth. Eq. (15) and Eq. (16) require that (i) there must be a physical path starting from each parallel block (including the source), and (ii) there must be a physical path ending at each parallel block (including the destination). Eq. (17) computes the propagation delay of P_{n_i,n_j}.

V. NP-HARDNESS OF PSFCE
In this section, we analyze the NP-hardness of the proposed PSFCE problem in various network scenarios: (i) every physical node has enough computing resources to host the required SFs, and the PG is a complete graph, (ii) every physical node has enough computing resources to host the required SFs, and the PG is a non-complete graph, and (iii) every physical node has limited computing resources, and the required SFs cannot be hosted by one physical node.

A. Enough Computing Resources and Complete PG
With enough computing resources, any physical node can host all required SFs in a client's request. In practice, the length of an SFC is generally no more than 8 [47]. Therefore, with enough computing resources, it is possible that any physical node can host the SFs for a moderate SFC. In this case, SFP propagation delay can be optimized by embedding all required SFs onto the physical nodes along one bandwidth-aware shortest path from the service source to the destination. Since PG is a complete graph, all required SFs can work in parallel, and the optimally-constructed P-SFC only has one parallel block, whose processing delay is the highest processing delay among all SFs. As a result, the PSFCE problem in this scenario can be optimized by embedding and parallelly executing all SFs at one physical node along the bandwidth-aware shortest path connecting the service source to the destination. Fig. 2a shows an example of accommodating a P-NSR to a PN, where all nodes have enough computing resources while PG is a complete graph. In this case, all SFs can be executed in one parallel block, and this block is embedded onto one physical node along the bandwidth-aware shortest path, i.e., S → E → D. Similarly, when one physical node with enough computing resources is located along the bandwidth-aware shortest path as the physical node G in Fig. 2b, this shortest path is an optimal solution for PSFCE. Since the running time complexity of the bandwidth-aware shortest path algorithm is polynomial [48], the following theorem holds.
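The bandwidth-aware shortest path used above is ordinary Dijkstra restricted to links with sufficient residual bandwidth. A minimal sketch follows; the graph encoding (node -> list of (neighbor, delay, bandwidth) tuples) is our own assumption:

```python
import heapq

def bw_aware_shortest_path(links, src, dst, bw_demand):
    """Dijkstra on the sub-graph of links whose available bandwidth
    meets the demand; returns (propagation delay, path) or None."""
    dist = {src: 0.0}
    prev = {}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            path = [u]
            while u != src:
                u = prev[u]
                path.append(u)
            return d, path[::-1]
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, delay, bw in links.get(u, []):
            if bw < bw_demand:
                continue  # skip links without enough residual bandwidth
            nd = d + delay
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    return None  # no bandwidth-feasible path
```

On a toy instance mimicking Fig. 2a, the routine returns S → E → D when only those links can carry the demand.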
Theorem 1: When each physical node has enough computing resources and PG is a complete graph, the PSFCE problem can be optimized within polynomial time.
Lemma 2: When one physical node along the bandwidth-aware shortest path has enough computing resources, the PSFCE problem can be optimized within polynomial time.

B. Enough Computing Resources and Non-Complete PG
When each physical node has enough computing resources, the SFP propagation delay can be optimized. However, when PG is a non-complete graph, optimizing SFs' processing delay, and hence PSFCE, is NP-hard, as shown in Theorem 2.
Theorem 2: When PG is a non-complete graph, the PSFCE problem is NP-hard.
Proof: If each SF has the same processing delay, optimizing the P-SFC processing delay is equivalent to constructing the minimum number of parallel blocks in the P-SFC. We can then create the complement graph PG^C of PG, in which each edge represents a non-parallel relationship. If we label the SFs assigned to the same parallel block with the same color, minimizing the number of parallel blocks in the constructed P-SFC is equivalent to coloring PG^C with the minimum number of colors, which is the well-known NP-hard graph coloring problem [49]. Therefore, in this scenario, optimizing SFs' processing delay is NP-hard, and so is PSFCE.
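The reduction in the proof can be made concrete: build PG^C and color it, and each color class becomes one parallel block. The greedy coloring below is only a heuristic (optimal coloring is NP-hard), and the Fig. 1a adjacency is our reading of the figure:

```python
def complement(PG):
    """PG^C: connect two SFs iff they are NOT parallelizable."""
    nodes = set(PG)
    return {u: nodes - PG[u] - {u} for u in nodes}

def greedy_color(G):
    """Assign each node the smallest color unused by its neighbors."""
    color = {}
    for u in sorted(G):
        used = {color[v] for v in G[u] if v in color}
        color[u] = next(c for c in range(len(G)) if c not in used)
    return color

PG = {"SF1": {"SF2", "SF4"}, "SF2": {"SF1", "SF3", "SF4"},
      "SF3": {"SF2"}, "SF4": {"SF1", "SF2"}, "SF5": set()}
coloring = greedy_color(complement(PG))

# Group SFs by color: each class is a set of pairwise-parallelizable
# SFs, i.e., one parallel block of a P-SFC.
blocks = {}
for sf, c in coloring.items():
    blocks.setdefault(c, set()).add(sf)
```

On this instance the greedy order happens to recover the three blocks of Fig. 1b: {SF1, SF2, SF4}, {SF3}, and {SF5}.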

C. Limited Computing Resource
When computing resources at each physical node are limited, the physical nodes along the bandwidth-aware shortest path may not have enough computing resources to host all required SFs. In this case, the propagation delay is determined by a more complex routing process, and SFs' processing delay cannot be fully optimized regardless of whether PG is a complete graph. As a result, we need to jointly optimize the propagation and processing delays, which is NP-hard, as shown in Theorem 3.
Theorem 3: When each physical node has limited computing resources, PSFCE is NP-hard.
Proof: We assume that every SF requests the same amount of computing resources and that the physical network contains only |N| = |V| physical nodes, each of which has just enough computing resources to host one SF. As a result, SFs' processing delay is fixed as the sum of all SFs' processing delays, and optimizing the SFP propagation delay is equivalent to finding the shortest path spanning all physical nodes in the PN, which is the traveling salesman path problem (TSPP) [50]. As TSPP is a well-known NP-hard problem, PSFCE is also NP-hard in this scenario.
From the above discussion, we can see that there are three important scenarios for optimizing PSFCE. (i) When each physical node has enough computing resources and PG is a complete graph, PSFCE can be optimized by embedding and parallelly executing all SFs at one physical node along the bandwidth-aware shortest path. (ii) When each physical node has enough computing resources and PG is a non-complete graph, PSFCE becomes NP-hard. In this case, the SFP propagation delay can be minimized by applying the bandwidth-aware shortest path algorithm, and we propose an efficient maximum parallel block gain (MPBG) first optimization algorithm in Section VI to minimize SFs' processing delay. (iii) When computing resources at each physical node are limited, PSFCE remains NP-hard regardless of whether PG is a complete graph. To jointly optimize SFs' processing delay and SFP propagation delay, we propose and analyze a novel parallelism-aware SFs deployment (PSFD) algorithm in Section VII.

VI. PSFCE WITH ENOUGH COMPUTING RESOURCES
When each physical node has enough computing resources to host the required SFs, the SFP propagation delay can be minimized by embedding all required SFs onto physical nodes along the bandwidth-aware shortest path from the service source to the destination. However, minimizing the SFs' processing delay is challenging when PG is a non-complete graph, as proved by Theorem 2. To minimize SFs' processing delay, one needs to properly assign the required SFs to different parallel blocks while satisfying all parallel relationship constraints. Here, we propose the parallel block gain (PBG) metric and the maximum parallel block gain (MPBG) first optimization algorithm to efficiently construct a P-SFC while minimizing SFs' processing delay.

A. Parallel Block Gain (PBG)
A parallel block ξ_i is composed of a set of SFs that can work in parallel. The processing delay of a parallel block, denoted by L_{ξ_i}, is determined by the in-block SF with the highest processing delay. According to Eqs. (1) and (2), only the SF with the highest processing delay contributes to the overall P-SFC processing delay for that parallel block, while the processing delays of the other SFs are not counted and are thus saved. For a set of SFs, the worst-case processing delay is the sum of their processing delays (i.e., serially running these SFs). That is, minimizing the processing delay for a set of SFs is equivalent to maximizing the processing delay saved by the SF parallelism process. Accordingly, we define the amount of processing delay saved in a parallel block ξ_i as its parallel block gain (PBG_{ξ_i}) in Eq. (18). The parallel block gain of a P-SFC (PBG_{P-SFC}) is the sum of PBG_{ξ_i} over all parallel blocks, as in Eq. (19).
The P-SFC processing delay L PC can be calculated by Eq. (20), where L SFC is the processing delay of executing all required SFs without any parallelisms (i.e., the traditional SFC).

B. Maximum Parallel Block Gain (MPBG) First Optimization
Maximizing PBG ξ i for a parallel block ξ i will maximize parallelism(s) in this block, leading more SFs to operate in parallel, which can in turn increase the parallel block gain for P-SFC as shown in Eq. (19). As L SFC is fixed for a given P-NSR, Eq. (20) shows that minimizing the overall P-SFC processing delay is equivalent to maximizing PBG of P-SFC (i.e., PBG P-SFC ). To optimize P-SFC processing delay, we need to properly construct parallel blocks ξ ∈ B such that the PBG of the constructed P-SFC in Eq. (19) is maximized.
The PG in Fig. 3a identifies four SFs and their parallel relationship constraints, while the number beside each SF node represents its processing delay. There are multiple options to construct parallel blocks. Fig. 3b first selects the largest complete sub-graph in PG to construct ξ 1 = {SF 1, SF 2, SF 3} and then the remaining SF4 will be alone in ξ 2 . As a result, the PBG for ξ 1 and ξ 2 is 70 μs and 0, respectively. In total, PBG P-SFC in Fig. 3b is 70 μs. Fig. 3c identifies another way to construct parallel blocks via maximizing the PBG value of each parallel block. By doing so, parallel block ξ 1 will include SF1 and SF4, which has PBG of 100 μs, and parallel block ξ 2 includes SF2 and SF3, which has PBG of 30 μs. Overall, PBG P-SFC of the scheme in Fig. 3c is 130 μs.
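The comparison above can be checked numerically. The per-SF delays below are assumed values chosen to be consistent with the PBGs reported for Fig. 3 (70 μs vs. 130 μs); the paper's actual figures may differ:

```python
# Assumed processing delays (us) consistent with the PBGs in the text.
proc = {"SF1": 100, "SF2": 40, "SF3": 30, "SF4": 100}

def pbg(block):
    """Eq. (18): delay saved vs. serial execution = sum - max."""
    d = [proc[v] for v in block]
    return sum(d) - max(d)

# Fig. 3b: pick the largest complete sub-graph first.
pbg_3b = sum(pbg(b) for b in [["SF1", "SF2", "SF3"], ["SF4"]])
# Fig. 3c: maximize each block's PBG instead.
pbg_3c = sum(pbg(b) for b in [["SF1", "SF4"], ["SF2", "SF3"]])

# Eq. (20): L_PC = L_SFC - PBG_P-SFC, with L_SFC the all-serial delay.
L_SFC = sum(proc.values())   # 270 us when every SF runs serially
L_PC_3c = L_SFC - pbg_3c     # 270 - 130 = 140 us
```

With these values, `pbg_3b` evaluates to 70 and `pbg_3c` to 130, matching the text's conclusion that maximizing each block's PBG beats always taking the largest clique.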
In fact, optimizing PBG_{P-SFC} in Eq. (19) is NP-hard, which can be proved via reductions from the maximum clique problem (MCP) or the weighted maximum clique problem (WMCP) [51], [52]. Owing to the importance of MCP and WMCP in the field of data science, many existing works have proposed fast and exact optimization techniques [53], [54]. Based on the above examples in Fig. 3, to maximize PBG_{P-SFC}, MPBG iteratively constructs the parallel block with the maximum PBG from the SFs that have not yet been assigned, until every SF belongs to a parallel block.

C. Bound Analysis of MPBG Algorithm
An essential property of ξ i constructed by applying MPBG in the i th iteration is described in Lemma 3.
Lemma 3: Each parallel block (ξ i ) constructed by MPBG includes the SFs whose sub-graph in PG is a maximal clique.
Proof: We prove this lemma by contradiction. Assume that there exists a parallel block ξ′_i that is a superset of ξ_i, and define the SF that is included in ξ′_i but not in ξ_i as v′. If L_{v′} > L_{ξ_i}, then Eq. (21) holds, which contradicts the fact that PBG_{ξ_i} is the maximum value found.
As a result, there does not exist a parallel block ξ′_i that is a superset of the parallel block constructed by MPBG, and the parallel block ξ_i constructed by MPBG includes SFs that form a maximal clique in PG. With Lemma 3, we then prove that MPBG achieves optimal performance in the following scenarios.
Theorem 4: With enough computing resources, when there exist only two maximal cliques in PG, MPBG achieves optimal performance.
Proof: As proved by Lemma 3, the first parallel block ξ_1 constructed by MPBG is not a subset of any other possible parallel block. In other words, the set of SFs in ξ_1 is, in fact, the set of vertices of one of the maximal cliques in PG. Similarly, the parallel block constructed in the next iteration is composed of the SFs in the remaining maximal clique of the PG. As only two maximal cliques exist in the PG, MPBG optimizes the PBG value in Eq. (19). Hence, MPBG achieves optimal performance.
We list some examples where there exist only two different maximal cliques in a PG. For instance, when the complement graph of PG (PG C ) is a path, complete binary (ternary) tree, star, or complete bipartite graph, there exist only two maximal cliques in the PG. Additionally, when PG is one of the graphs listed in Lemma 4, the optimality of MPBG also holds.
Lemma 4: With enough computing resources, when the complement graph of PG is a trivial graph, cycle, wheel, or complete multi-partite graph, the MPBG algorithm achieves optimal performance.

VII. PSFCE WITH LIMITED COMPUTING RESOURCES
When each physical node has limited computing resources, optimizing the PSFCE problem is different from, and more challenging than, the case in Section VI. This is because (i) one cannot flexibly create a parallel block, as there might not exist a physical node with enough computing resources to host it, and (ii) the physical nodes along the bandwidth-aware shortest path might not have enough computing resources to accommodate the P-NSR. To demonstrate this, we use the PG in Fig. 3a as the request example. We assume that SF1, SF2, and SF3 each require 10 Gb of computing resources while SF4 needs 25 Gb. Fig. 4a shows the physical network, where the number in the square bracket represents the available computing resources at each physical node. Due to the limited computing resources at each physical node, none of the physical nodes can host the parallel block ξ = {SF1, SF4} that requires 10 + 25 = 35 Gb of computing resources, even though this ξ maximizes the PBG in Eq. (19). Hence, one may accommodate the request, as in Fig. 3b, by composing ξ_1 = {SF1, SF2, SF3} and ξ_2 = {SF4} and embedding them onto physical nodes A and C in Fig. 4b, where physical nodes C and D now have no computing resources left. Similarly, due to the limited computing resources at physical node E, we cannot route the forwarding path along the shortest path S → E → D.
The above example shows that limited computing resources affect both how parallel blocks are constructed and how they are deployed (i.e., embedded and routed) in the physical network. Hence, to optimize PSFCE, we have to jointly take the parallel block construction, physical node embedding, and routing process into account. As optimizing the processing delay and the propagation delay are interrelated, it is essential to identify physical node candidates that can facilitate the joint optimization of both delays.

A. Parallelism-Aware Betweenness Centrality (PBC)
The parallel block gain (PBG) in Section VI is defined as the processing delay saved by the parallel blocks when computing resources are sufficient at each physical node. Now, there might not exist a physical node n with enough computing resources to host the parallel block constructed by the maximum-PBG strategy. Accordingly, to evaluate the processing-delay impact of a physical node, we define PBG_n in Eq. (23), where V_sub is a set of parallelizable SFs that can be jointly hosted at n and L_{V_sub} is the processing delay of the parallel block constructed from the SFs in V_sub. This PBG_n measures how much a physical node n can facilitate saving processing delay by considering (i) the number of parallelizable SFs that n can host and (ii) the processing delay of the parallel block constructed from the SFs that are co-hosted at n.
Similarly, to optimize SFP propagation delay, we need a scheme to measure how much propagation delay a physical node n might introduce when it is selected to host a parallel block. Here, we define the potential propagation delay (L_PPD_n) to evaluate the propagation delay when selecting n to construct the SFP, as in Eq. (24), where Δ is the set of physical nodes that have been added into the SFP. Initially, Δ = {s, d}.
L_PPD_n = min_{n_i, n_j ∈ Δ, n_i ≠ n_j} ( L_P(n, n_i) + L_P(n, n_j) )   (24)

Traditionally, to optimize traffic latency, betweenness centrality (BC) measures node centrality based on the number of shortest paths passing through a physical node. Likewise, we propose the parallelism-based latency betweenness centrality (PBC) in Eq. (25) to measure how much a physical node can facilitate deploying parallel blocks with low processing and propagation delays. Note that α and β are two coefficients that balance the optimization priority between processing delay and propagation delay. A physical node with a high PBC value enables great parallelism and yields a low potential latency (i.e., the sum of the potential SFs' processing delay and the potential propagation delay).
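Eq. (24) can be evaluated directly once shortest-path propagation delays are available; the sketch below uses a hypothetical precomputed distance table (the names and numbers are ours):

```python
from itertools import permutations

def l_ppd(n, delta, dist):
    """Potential propagation delay of Eq. (24): cheapest way to splice
    node n between two distinct nodes already on the SFP (the set delta).

    dist[(u, v)] is the shortest-path propagation delay between u and v."""
    return min(dist[(n, ni)] + dist[(n, nj)]
               for ni, nj in permutations(delta, 2))

# toy symmetric distances on {s, d, A, B} (hypothetical values, in us)
dist = {}
for u, v, w in [("A", "s", 3), ("A", "d", 7), ("B", "s", 5),
                ("B", "d", 4), ("A", "B", 2)]:
    dist[(u, v)] = dist[(v, u)] = w

delta = {"s", "d"}          # initially only source and destination
print(l_ppd("A", delta, dist))  # 3 + 7 = 10
print(l_ppd("B", delta, dist))  # 5 + 4 = 9
```

Here node B would introduce less detour than node A when spliced between s and d, so, all else being equal, a PBC-style metric would favour B.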

B. Parallelism-Aware SFs Deployment (PSFD) Algorithm
Based on the technique of parallelism-based latency betweenness centrality, we now propose the parallelism-aware SFs deployment (PSFD) algorithm to efficiently accommodate a given P-NSR on the physical network while jointly taking the parallel block construction, physical node embedding, and routing process into account. To begin, PSFD initializes a node set Δ with {s, d}. Then, PSFD repeats the following operations until all SFs are satisfied: (i) update the PBC value of each physical node; (ii) select the node δ whose PBC value is highest and add it to Δ; (iii) create a parallel block that includes the set of SFs (V_sub) maximizing the PBC value at δ, and embed it onto δ; and (iv) remove the set of embedded SFs from V. Note that, to further utilize the available computing resources remaining in the selected physical nodes, if a physical node n is already in Δ, the L_PPD value of n is counted as 0. Next, the PSFD algorithm repeats the following operations until all the parallel blocks are embedded: (i) find the physical node x in Δ that has the smallest propagation delay to the SFP endpoint node n_e; and (ii) add x to S, set n_e = x, and remove x from Δ.

The corresponding steps of the PSFD algorithm read:
5: Update the PBC value of each physical node based on Eq. (25); if a node n is already in Δ, L_PPD_n is 0;
6: Select the physical node δ with the highest PBC value, and add δ to Δ;
7: Create a parallel block including the set of SFs (V_sub) that maximizes the PBC value of δ, and embed the SFs in the constructed parallel block onto δ;
8: Remove the set of SFs in V_sub from V;
9: end while
10: Initialize S = ∅, set the SFP endpoint node n_e as s, and add n_e to the SFP;
11: while Δ ≠ ∅ do
12: Find the node x in Δ with the smallest L_P(x, n_e);
13: Add x to S, set n_e = x, and Δ = Δ − x;
14: end while
15: Add d to S, construct the SFP by connecting each pair of adjacent nodes in S, and form the corresponding P-SFC;
return SFP, P-SFC;
Finally, the PSFD algorithm constructs the SFP by connecting each pair of adjacent physical nodes in S and forms the corresponding P-SFC.
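The two loops above can be sketched end-to-end in Python under simplifying assumptions: every required SF pair is parallelizable, all-pairs shortest-path delays are precomputed, and PBC is taken as α·PBG_n − β·L_PPD_n (Eq. (25) is not reproduced in this excerpt). All helper names and the toy instance are ours, not the paper's implementation:

```python
from itertools import combinations, permutations

def psfd(sfs, caps, dist, s, d, alpha=1.0, beta=1.0):
    """Compact PSFD sketch.
    sfs:  {sf_name: (cpu_demand, proc_delay)}
    caps: {server: available_cpu}  (s and d are not servers)
    dist: {(u, v): shortest-path propagation delay}"""
    remaining = set(sfs)            # SFs still waiting to be embedded
    delta = {s, d}                  # nodes already selected for the SFP
    placement = {}                  # sf -> hosting server

    def best_block(node):
        """Largest-gain SF subset that still fits on this node."""
        best = (0.0, ())
        for r in range(1, len(remaining) + 1):
            for sub in combinations(sorted(remaining), r):
                if sum(sfs[x][0] for x in sub) <= caps[node]:
                    delays = [sfs[x][1] for x in sub]
                    best = max(best, (sum(delays) - max(delays), sub))
        return best

    while remaining:
        blocks = {n: best_block(n) for n in caps}

        def pbc(n):                 # PBC taken as alpha*PBG_n - beta*L_PPD_n
            gain, sub = blocks[n]
            if not sub:
                return float("-inf")
            lppd = 0.0 if n in delta else min(
                dist[(n, a)] + dist[(n, b)]
                for a, b in permutations(delta, 2))
            return alpha * gain - beta * lppd

        node = max(caps, key=pbc)
        gain, sub = blocks[node]
        if not sub:
            raise RuntimeError("no node can host the remaining SFs")
        for x in sub:
            placement[x] = node
            caps[node] -= sfs[x][0]
        remaining -= set(sub)
        delta.add(node)

    # nearest-neighbour ordering of the selected nodes into an SFP
    order, at, todo = [s], s, delta - {s, d}
    while todo:
        at = min(todo, key=lambda x: dist[(at, x)])
        order.append(at)
        todo.remove(at)
    order.append(d)
    return order, placement

# toy instance (hypothetical numbers)
sfs = {"SF1": (10, 20), "SF2": (10, 15), "SF3": (25, 30)}
caps = {"A": 20, "B": 30}
dist = {}
for u, v, w in [("s", "A", 2), ("s", "B", 6), ("A", "d", 5),
                ("B", "d", 3), ("A", "B", 2), ("s", "d", 7)]:
    dist[(u, v)] = dist[(v, u)] = w
order, placement = psfd(sfs, dict(caps), dist, "s", "d")
print(order, placement)
```

On this instance PSFD first embeds the block {SF1, SF2} on node A (highest PBC), then SF3 on node B, and the nearest-neighbour pass yields the SFP s → A → B → d.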

C. Bound Analysis of PSFD Algorithm
Next, we prove that the PSFD algorithm is logarithm-approximate. Table II lists the notations used in the proof.
Proof: Eq. (26) shows the relationship between the numbers of satisfied and unsatisfied SFs in adjacent iterations (i.e., iterations i and i + 1).
For a physical node m in SFP_OPT_i, to maximize the PBC value, the sum of L_PPD_m and L_{V_sub_m} is not greater than the latency of SFP_OPT_i, as shown in Eq. (28).
Since different physical nodes can provide the same SF instance(s) to satisfy the P-NSR, Eq. (29) holds.
According to the properties of the harmonic series, Eq. (33) holds.
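The harmonic-series property invoked here is the standard logarithmic bound on partial sums:

```latex
H_k \;=\; \sum_{i=1}^{k} \frac{1}{i} \;\le\; \ln k + 1 ,
```

which is the step that turns the per-iteration charging argument, summed over at most |V| iterations, into a logarithmic approximation factor.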
Based on Eq. (24), the propagation delay of the created SFP is less than the sum of the L_PPD_δ_i values. Thus, Eq. (36) holds.
Therefore, the proposed PSFD algorithm achieves a logarithm-approximation performance.

VIII. PERFORMANCE EVALUATION

A. Simulation Settings
To evaluate the performances of the proposed schemes, we conduct our experiments in three network scenarios: (i) the 24-node-43-link USNET [55] (as Fig. 5), (ii) an edge-cloud system [56], and (iii) a random network. If not otherwise specified, the parameters are generated in terms of the state-of-the-art simulation settings as follows [13], [15], [19], [57], [58]. Each physical node's amount of computing resources is randomly set in a discrete-uniform range [20, 100] gigabit (Gb). Each physical link has a bandwidth in a discrete-uniform range [5, 10] gigabit per second (Gbps). The propagation delay of a physical link is randomly set in a uniform range [10, 50] microseconds (μs). In a P-NSR, the number of SF nodes is randomly set in a discrete-uniform range [5, 20], the edge density of the PG is randomly set in a uniform range [0, 1], and the bandwidth demand is randomly set in a discrete-uniform range [1, 5] Gbps. Each SF node requires computing resources in a discrete-uniform range [10, 25] Gb and has a maximum processing delay in a uniform range [10, 30] μs. The service source and destination of the P-NSR are randomly selected from the PN. The coefficients α and β in the PBC calculations are both set to 1. It is worth noting that the edge density of a PG is estimated by the ratio between the number of edges existing in this PG and the maximum number of edges that this PG can have (i.e., the number of edges in the complete graph on the same vertices). We evaluate the performances of the proposed algorithms in terms of (i) propagation delay, (ii) processing delay, (iii) end-to-end latency (i.e., the sum of overall processing delay and propagation delay), (iv) number of accepted requests, and (v) resource utilization ratio.
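For reproducibility, the request-generation settings above can be sampled as follows (a sketch; the field names are ours, not from the paper):

```python
import random

def random_pnsr(rng=random):
    """Draw one P-NSR using the parameter ranges stated above.
    The edge density of the PG is the ratio of its edges to the
    edges of the complete graph on the same SF nodes."""
    n = rng.randint(5, 20)                        # number of SF nodes
    density = rng.uniform(0.0, 1.0)               # PG edge density
    max_edges = n * (n - 1) // 2
    return {
        "num_sfs": n,
        "pg_edges": round(density * max_edges),   # implied PG edge count
        "bandwidth_gbps": rng.randint(1, 5),      # bandwidth demand
        "cpu_gb": [rng.randint(10, 25) for _ in range(n)],
        "proc_delay_us": [rng.uniform(10.0, 30.0) for _ in range(n)],
    }

req = random_pnsr()
print(req["num_sfs"], req["bandwidth_gbps"])
```

Physical-side parameters (node capacities in [20, 100] Gb, link bandwidths in [5, 10] Gbps, link delays in [10, 50] μs) can be drawn the same way.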

B. Performance Bound Analysis
To demonstrate the impact of SF parallelisms, we set each physical node with enough computing resources. As a result, propagation delay can be optimized by embedding all SFs onto one physical node along the bandwidth-aware shortest path. To minimize processing delay, we implement the integer program model under the constraints in Eqs. (5)-(11) to obtain the optimal processing delay of a P-SFC. We also implement the maximum parallel block size (MPBS) first optimization algorithm, which replaces Eq. (18) with PBG_ξ = |ξ| (i.e., maximizes the size of each parallel block). In Fig. 6, we show the performances of the integer program, MPBG, MPBS, and no parallelism (i.e., L_SFC in Eq. (20)), which are denoted by the red rhombus-dashed, blue triangle-solid, grey square-dotted, and dark circle-solid curves, respectively. In Fig. 6a, when increasing the number of SFs, all schemes need more processing delay to construct the P-SFC. This is because more required SFs likely construct more parallel blocks, thereby increasing the processing delay. As MPBG creates the parallel block that saves the most processing delay at each iteration, it achieves near-optimal performance. In Fig. 6b, when fixing the number of required SFs at 20 and increasing the edge density of the PG, all three proposed schemes need less processing delay, while the no-parallelism scheme always needs the same processing delay. Note that, when the edge density is 0 (i.e., no parallelism) or 1 (i.e., all SFs can work in parallel), the integer program, MPBG, and MPBS have the same performance. When the edge density is high (i.e., more than 0.6), MPBG has a performance similar to the integer program. High edge densities increase the connectivity of the PG and reduce the number of maximal cliques in the PG. Therefore, with a limited number of maximal cliques in the PG, MPBG achieves near-optimal performance. As shown in Fig. 3, assigning more parallelizable SFs to one parallel block reduces the processing delay of the constructed P-SFC, but may not maximize the PBG of the P-SFC in Eq. (19). Thus, when the edge density is 0.4, MPBG outperforms MPBS by as much as 58% in our examples.
Fig. 7. Impact of routing.
To show the impact of routing, each physical node is set to host only one SF. As a result, the optimization problem of PSFCE becomes finding the best path connecting the service source and destination while hosting all required SFs. We implement the branch-and-bound (B&B) method to obtain the optimal path with the least propagation delay. Fig. 7 shows the performances of B&B and PSFD (α = 1, β = 1), which are denoted by the red and blue bars, respectively. In Fig. 7a, when increasing the number of required SF nodes, both schemes need more propagation delay to construct the SFP. This is because more required SFs lead to more physical nodes in the SFP, thereby increasing the propagation delay. Similarly, when fixing the number of SF nodes at 15 and increasing the bandwidth demand, as shown in Fig. 7b, both the B&B method and PSFD need more propagation delay. From Figs. 7a and 7b, we see that the PSFD algorithm achieves near-optimal performance when the number of required SFs is small or the bandwidth demand is high. According to Eq. (24), when |V| = 1, PSFD can construct the optimal SFP by applying the PBC technique. Similarly, when increasing the bandwidth demand, the number of bandwidth-aware shortest paths reduces, and the SFP created by applying the PBC technique is likely the same as the one created by the B&B method. Thus, with a higher bandwidth demand, the performance of PSFD is closer to optimal.
We conduct experiments to evaluate the joint impact of SF parallelisms and routing by varying the physical network size (i.e., the number of physical nodes and links in the network) and the computing resources at each physical node in Fig. 8. Here, we compare the performances of PSFD (α = 1, β = 0), PSFD (α = 0, β = 1), and PSFD (α = 1, β = 1), where PSFD (α = 1, β = 0) focuses on optimizing the SFs' processing delay and PSFD (α = 0, β = 1) concentrates on optimizing the SFP propagation delay. In Fig. 8, the performances of PSFD (α = 0, β = 1), PSFD (α = 1, β = 0), and PSFD (α = 1, β = 1) are denoted by the yellow-solid, green-dashed, and grey-dotted curves, respectively. In Fig. 8a, when increasing the physical network size, all schemes require more latency to construct the SFP. This is because the service source is likely far away from the destination in a larger network, which increases the SFP propagation delay. As PSFD (α = 1, β = 1) takes the optimization of both processing delay and propagation delay into account, the physical node it selects to embed a parallel block yields the smallest sum of processing and propagation delays. Interestingly, when comparing PSFD (α = 1, β = 0) with PSFD (α = 0, β = 1) in Fig. 8a, we can see that PSFD (α = 1, β = 0) is better in smaller PNs while PSFD (α = 0, β = 1) is better in larger PNs. This indicates that the optimization of the end-to-end service latency largely depends on the SFs' processing delay in small networks, while the end-to-end service latency is impacted more by the SFP propagation delay in large networks. In Fig. 8b, when fixing the number of required SFs at 15 and increasing the computing resources at each physical node, all schemes need less latency. This is because, with more computing resources, more parallelisms are enabled to save processing delay, and fewer physical nodes are required in the SFP to host all SFs, which saves propagation delay.
When investigating the joint impact of SF parallelism and routing processes, the proposed PSFD (α = 1, β = 1) algorithm outperforms PSFD (α = 1, β = 0) and PSFD (α = 0, β = 1) by an average of 52% and 36%, respectively, in our examples.

C. Performance Evaluation in Edge-Cloud Systems
In this section, we evaluate the performances of the proposed schemes in edge-cloud systems. Similar to the network topology in [56], the cloud is directly connected to the clients' access network via fiber links, while edge servers are distributed around clients. The propagation delay from a client to the cloud is in a uniform range [50, 100] μs, and the propagation delay between edge servers and clients is in a uniform range [5, 15] μs. Servers in the cloud have high-performance computing hardware, and the processing delay of running an SF is in a uniform range [5, 15] μs. The edge servers have low-performance computing hardware, and the processing delay of running an SF is in a uniform range [10, 30] μs. Additionally, the cloud has enough computing resources to accommodate the requested P-NSR, while the edge has a limited number of servers in a discrete-uniform range [20, 40]. The edge servers are interconnected with a probability of 60%.

When increasing the number of required SFs, both algorithms need more service latency to deliver the P-NSR, as shown in Fig. 9a. For PSFD at the edge, both propagation and processing delays increase with the number of required SFs as more parallel blocks are created. This is because (i) more parallel blocks likely lead to a higher processing delay according to Eq. (2), and (ii) more edge servers are needed to host the parallel blocks, leading to a higher propagation delay for the connections among them. Interestingly, for MPBG in the cloud, the processing delay increases with the number of required SFs while the propagation delay fluctuates. This can be explained as follows. More SFs likely form more parallel blocks, leading to a higher processing delay; as the source and destination nodes are randomly located, the propagation delay fluctuates. Fig. 9b further verifies the above observations and analysis.
With enough computing resources and high-performance hardware in the cloud, MPBG creates fewer parallel blocks and needs much lower processing delay compared to running PSFD at the edge. Overall, deploying services at the edge by PSFD is better when the number of required SFs is smaller (less than 10), while deploying services in the cloud by MPBG is more latency-efficient when the number of required SFs is larger. Since the SFC length is no more than 8 in practice [47], most P-NSRs should be deployed at the edge to achieve latency-efficient service delivery.
In Fig. 10, when increasing the number of edge nodes, the latency of deploying services by MPBG in the cloud varies, while the latency of deploying services by PSFD at the edge increases. Again, as the source and destination nodes are randomly selected from the edge nodes, the propagation delay of deploying services by MPBG in the cloud fluctuates. Enlarging the size of the edge network likely places the source and destination far away from each other, leading to a higher propagation delay for PSFD. Since the same set of SFs is required, the processing delay of deploying services by MPBG in the cloud remains the same. When enlarging the network size, as different sets of edge servers are employed to host the SFs in a P-NSR, the processing delay of deploying services by PSFD at the edge fluctuates.

D. Performance Evaluation With Online Traffic
In this section, we evaluate PSFD when P-NSRs are generated in an online fashion and stay in the network once accommodated. We use the 24-node-43-link USNET and a 40-node-180-link randomly-generated mesh network topology in our experiments. Note that the random network is generated in Java by first generating a tree with 40 nodes and then adding links until 180 links exist in the network. The computing resources of each physical node and the bandwidth resources of each physical link are randomly set in discrete-uniform ranges [100, 300] Gb and [50, 100] Gbps, respectively. The number of required SFs is set to {5, 10, 15, 20, 25} with an even distribution [18], [19], [44].
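The stated tree-then-densify construction (originally implemented in Java) can be sketched as follows; the helper names are ours:

```python
import random

def random_mesh(num_nodes=40, num_links=180, rng=random):
    """Grow a random spanning tree (guaranteeing connectivity), then
    add random links until the target link count is reached."""
    edges = set()
    nodes = list(range(num_nodes))
    rng.shuffle(nodes)
    for i in range(1, num_nodes):        # random tree: connect each new
        j = rng.randrange(i)             # node to a previously added one
        edges.add(frozenset((nodes[i], nodes[j])))
    while len(edges) < num_links:        # densify to num_links links
        u, v = rng.sample(range(num_nodes), 2)
        edges.add(frozenset((u, v)))
    return edges

g = random_mesh()
print(len(g))  # 180
```

Since the tree already spans all 40 nodes, the resulting 180-link mesh is always connected.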
We implement and improve the proposed MPBS algorithm, partial parallel chaining (PPC) [45], and the parallelism-aware residual capacity first placement (PARC) algorithm [44] to compare with PSFD. We name the algorithm improved from MPBS as MPBS with nearest-neighbor (MPBS-NN).
MPBS-NN is implemented by (i) selecting the node with the highest betweenness centrality value from the source to the destination; (ii) creating and embedding the largest parallel block that this physical node can host; (iii) repeating steps (i) and (ii) until all SFs are embedded; and (iv) applying the nearest-neighbor algorithm to connect the source with the destination going through all physical nodes that host parallel block(s). We extend PPC by (i) generating all possible size vectors to indicate all possible P-SFCs; (ii) selecting the one with the least processing delay; (iii) if the PN can accommodate the constructed P-SFC, creating the corresponding parallelism-aware layered graph, and otherwise denying the request; and (iv) if the parallelism-aware layered graph is created in step (iii), finding the shortest path in the constructed layered graph as the SFP. PARC is extended by (i) randomly generating an SFC based on the P-NSR, and setting the source s as the endpoint n_e; (ii) creating the P-SFC with the least processing latency by exhaustive search; (iii) embedding as many parallel blocks as possible onto the physical node δ with the highest residual computing resources, connecting δ to n_e, and updating n_e as δ; (iv) repeating step (iii) until all SFs are embedded; and (v) connecting n_e with the destination to form the SFP. Note that, if the extended algorithms cannot accommodate an incoming P-NSR, it is denied and counted as unaccepted. We evaluate the performances of PSFD, MPBS-NN, PPC, and PARC regarding: (i) service end-to-end latency, (ii) number of accepted P-NSRs, and (iii) resource utilization ratio.
Figs. 11 and 12 present the performances of PSFD, MPBS-NN, PPC, and PARC in USNET and 40-node network, respectively. The latency and acceptance performances of PARC, MPBS-NN, PPC, PSFD (α = 1, β = 1) are denoted by the green circled-solid, red rhombus-dashed, blue triangle-dashed, and grey square-dotted curves, respectively. The resource utilization performances of those algorithms are represented by the green bar, red bar, blue bar, and grey bar, respectively.
In Fig. 11a, when increasing the number of P-NSRs, the performances of all schemes show a trend of increasing at first and then decreasing. When the algorithms embed more P-NSRs onto the PN, the network resources of some key links or nodes run out. Then, an algorithm needs to accommodate a P-NSR via a longer SFP, thus increasing the average service latency. As the network resources run out, all schemes begin accommodating the P-NSRs with fewer SFs while denying the P-NSRs with many SFs. As the SFP generated for a small number of SFs is generally short, the average service latency of all SFPs later decreases. Even though the PARC algorithm creates the P-SFC with the least processing latency using exhaustive search, the embedding policy of PARC accommodates a parallel block on the physical node with the most residual computing resources, which likely introduces detour routing and thus leads to high service latency. MPBS-NN creates large parallel blocks and embeds each block on the physical node with a high betweenness centrality value to reduce both processing and propagation delays. When the system load is low, MPBS-NN accommodates P-NSRs with a relatively lower service latency than PARC does. When more P-NSRs are accommodated, the network resources of the links and nodes with high betweenness centrality values run out, and MPBS-NN has to employ longer routing paths for creating SFPs, leading to a sharp increase in average service latency. PPC generates the P-SFC with the least processing delay and accommodates it by establishing a layered graph to reduce the propagation latency. It outperforms PARC and MPBS-NN in terms of service latency. However, PPC optimizes the processing delay and propagation delay independently: a large parallel block might be deployed far away from the others, leading to a relatively higher overall latency than PSFD.
PSFD jointly takes the processing delay, propagation delay, and computing resource usage (i.e., V_sub) into account and can dynamically pick the better-fit physical node to reduce the service latency. In Figs. 11b and 11c, PARC has the highest acceptance and resource utilization, while PPC has the worst acceptance. This is because PPC greedily creates large parallel blocks to optimize processing delay, and there might not exist a physical node with enough computing resources to host such large blocks. Interestingly, MPBS-NN accepts fewer P-NSRs but has higher resource utilization compared to PSFD. This is because the static path-selection method (i.e., selecting the node with the highest betweenness centrality value) quickly fills the capacities of key links and nodes, which forces later P-NSRs to be accommodated via detours, thus leading to higher service latency and resource utilization. Fig. 12 shows the performances of PSFD, MPBS-NN, PPC, and PARC in the 40-node mesh network. Similarly, PSFD outperforms MPBS-NN, PPC, and PARC in service latency, while PARC has the highest acceptance and resource utilization. It is worth noting that all schemes have higher acceptance in this larger network than in USNET. Higher connectivity mitigates the pressure of running out of bandwidth resources on key links because more physical links can be used as substitutes. Numerically, in our examples, the average acceptance ratios over all four schemes are 66.7% and 80.6% in the USNET and 40-node networks, respectively. In addition, the performance gap between PSFD and MPBS-NN/PPC/PARC becomes smaller from Fig. 11a to Fig. 12a. Regarding the service latency in our examples, PSFD, on average, outperforms MPBS-NN by 32% and 15%, PPC by 29.4% and 10.1%, and PARC by 58% and 42% in the USNET and 40-node networks, respectively.

IX. CONCLUSION
This paper comprehensively investigated how to minimize the latency in the parallelism-aware service function chaining and embedding (PSFCE) problem. When each physical node has enough computing resources to host all required SFs, the SFP propagation delay can be optimized by embedding the required SFs along the bandwidth-aware shortest path. We have proposed the maximum parallel block gain (MPBG) first optimization algorithm to efficiently create a parallelism-based SFC (P-SFC) with a low processing delay. When the computing resources at each physical node are limited such that the required SFs have to be accommodated by multiple physical nodes, we proposed the parallelism-aware SFs deployment (PSFD) algorithm to jointly optimize processing and propagation delays. Through thorough analysis, we showed that PSFD is in general logarithm-approximate. For different network settings, we have shown that PSFD can effectively optimize the end-to-end latency and outperform the schemes directly extended from existing works. Additionally, we have the following findings: (i) MPBG achieves near-optimal performance when computing resources are enough at every physical node, (ii) when the computing resources are limited, the end-to-end latency largely depends on the optimization of the SFs' processing delay in small networks and is impacted more by the SFP propagation delay in large networks, and (iii) to achieve latency-efficient service delivery, short P-SFCs (fewer than 10 SFs) should be deployed at the edge, while long P-SFCs should be deployed in the cloud. Future work should investigate the end-to-end service latency optimization problem of P-SFC composition and embedding when considering both SF dependencies and parallel relationships in latency-deterministic network scenarios.