AUTOSAR Extensions for Predictable Task Synchronization in Multi-Core ECUs

Multi-core processors are becoming increasingly prevalent, with multiple multi-core solutions being offered for the automotive sector. Recognizing this trend, the AUTomotive Open System ARchitecture (AUTOSAR) standard version 4.0 has introduced support for multi-core embedded real-time operating systems. A key element of the AUTOSAR multi-core specification is the spinlock mechanism for inter-core task synchronization. In this paper, we study this spinlock mechanism from the standpoint of timing predictability. We describe the timing uncertainties introduced by standard test-and-set spinlock mechanisms, and provide a predictable priority-driven solution for inter-core task synchronization. The proposed solution is to arbitrate critical sections using the well-established Multi-processor Priority Ceiling Protocol [3], which is the multiprocessor version of the ceiling protocol for uniprocessors [1, 2] used by AUTOSAR. We also present the associated analysis that can be used in conjunction with the AUTOSAR task model to bound the worst-case waiting times for accessing shared resources. The timing predictability provided by our protocol is an important requirement for automotive applications from both certification and validation standpoints.


INTRODUCTION
Processor architectures have reached a turning point in their evolution, with manufacturers now placing an emphasis on achieving parallelism through multi-cores as opposed to increasing the underlying clock frequency. Excessive power consumption, increasing hardware complexity, clock synchronization problems, and saturation of pipeline optimizations represent reasons for this shift in paradigm towards multi-core processors. Although the introduction of multi-core processors promises a linear increase in raw throughput, it poses many daunting challenges for software development. Existing applications may have to be evolved and parallelized to exploit the available performance from multiple cores. Also, the true parallelism exposed by such architectures violates many assumptions currently made in developing embedded real-time software for single-core processors. Specifically, mutual exclusion protocols such as the single-core priority ceiling protocol [2] and highest-locker priority implementations [7] do not directly apply in the multi-core scenario.
Automotive systems have recently seen a tremendous growth in terms of advanced software features such as driver-assist technologies, active safety systems, and interactive entertainment platforms. Multi-core processors constitute an effective means for realizing such computationally-intensive applications in future automotive platforms. Recognizing these changes in application characteristics and processor capabilities, the AUTomotive Open System ARchitecture (AUTOSAR) Version 4.0 standard has introduced support for multi-core embedded real-time operating systems. New concepts such as locatable entities (LEs), multi-core startup/shutdown, Inter-OS-Application Communicator (IOC), and SpinlockTypes have been introduced in the AUTOSAR multi-core OS architecture specification to extend the single-core OS specifications.
In this paper, we study the SpinlockType mechanism specified in AUTOSAR 4.0 from the perspective of timing predictability. We specifically focus on non-nested spinlocks, which are recommended by the AUTOSAR 4.0 specification [1]. Given the context of non-nested spinlocks, we review the sources of unbounded timing introduced by the current specification of AUTOSAR SpinlockType: (i) deadlocks, and (ii) starvation. We show example scenarios to illustrate both these problems, and discuss priority-driven approaches to solve them. Based on these discussions, we develop our solution where critical sections can be arbitrated using the well-established Multi-processor Priority Ceiling Protocol. This latter protocol is the multiprocessor version of the ceiling protocol for single-core processors used by AUTOSAR. This solution guarantees bounded delays in accessing critical sections, and avoids the timing unpredictability issues introduced by the current specification of AUTOSAR SpinlockType.

ASSUMPTIONS AND SYSTEM MODEL
We first provide the context for our work by describing the underlying assumptions and our system model. In this work, we specifically focus on basic tasks as defined in AUTOSAR 4.0. As defined by the standard, basic tasks do not block by themselves or wait on any OS events. We denote a basic task executing in the system as τ i . Each task τ i is assumed to be a periodic task, which implies that each task repeats itself after a time frame or period denoted by T i . Each occurrence of the task τ i is assumed to require no more than C i units of execution time on its processor. Each occurrence of the task τ i is assumed to have a relative deadline equal to the task period, i.e. each occurrence of τ i should complete before the next occurrence of τ i . Any task τ i in the system can therefore be represented by the tuple (C i , T i ). We define the task set as a collection of n tasks {τ 1 , τ 2 , ..., τ n } ordered in increasing order of task periods. We also assume a priority-driven OS scheduler with task priorities assigned in a rate-monotonic fashion, i.e. tasks with shorter periods (and relative deadlines) have higher scheduling priorities [9]. We use the notation hp(τ i ) to denote the set of all tasks with higher priority than τ i , and lp(τ i ) to denote the set of all tasks with lower priority than τ i . For simplicity of presentation, we will assume that all tasks have unique priorities. We will represent the m processor cores in the system as {P 1 , P 2 , ..., P m }. The priority of task τ i is denoted as π i . We will also use the AUTOSAR convention with a higher value of π i denoting a higher priority.
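As an illustration of this task model, the following minimal sketch represents each task by its (C i , T i ) tuple plus a core index, assigns rate-monotonic priorities, and derives the hp and lp sets. The class name, function names, and the numeric values are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    wcet: int    # C_i: worst-case execution time per occurrence
    period: int  # T_i: period, also the relative deadline
    core: int    # index k of the core P_k the task is assigned to


def assign_rate_monotonic(tasks):
    """Map task name -> priority; a shorter period gets a larger value."""
    ordered = sorted(tasks, key=lambda t: t.period)
    n = len(ordered)
    return {t.name: n - rank for rank, t in enumerate(ordered)}


def hp(task, tasks, prio):
    """Names of all tasks with higher priority than `task`."""
    return {t.name for t in tasks if prio[t.name] > prio[task.name]}


def lp(task, tasks, prio):
    """Names of all tasks with lower priority than `task`."""
    return {t.name for t in tasks if prio[t.name] < prio[task.name]}


# Hypothetical task set, already ordered by increasing period.
tasks = [Task("tau1", 1, 5, 2), Task("tau2", 2, 10, 1), Task("tau3", 3, 20, 1)]
prio = assign_rate_monotonic(tasks)
```

With unique periods, as in this example, the assignment yields unique priorities, matching the simplifying assumption made above.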
In this paper, we are interested in tasks that share resources among each other. We will use the notation R k to denote the k th shared resource. The lock protecting the shared resource R k will be denoted as S k . The maximum time for which resource R k can be held is denoted by C(R k ). We will now leverage this system model in the context of AUTOSAR to provide response-time bounds when resources can be shared across tasks executing on different cores.

AUTOSAR SPINLOCKS
Mutual exclusion in single-core processors is facilitated by AUTOSAR using a function called GetResource(). This function leverages the priority ceiling protocol: when a task acquires a resource, its priority is temporarily promoted to the highest priority among all the tasks that can use the resource. This bounds and reduces the duration for which any task waits for a lower-priority task to release a shared resource. In other words, it bounds and significantly reduces priority inversion. Priority inversion cannot be completely eliminated when logical resource sharing is required, but it can be minimized by using appropriate protocols. In the current context, priority inversion cannot be eliminated since the lower-priority task holding the resource still has to release it before the higher-priority task can proceed, due to the non-preemptive and mutually-exclusive nature of shared resources. However, GetResource() uses the priority ceiling protocol, which ensures that the priority inversion lasts no longer than a single resource-holding duration. This mechanism does not scale to multi-core processors since priorities are insufficient in preventing access from tasks executing on other cores.
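The ceiling promotion described above can be sketched in a few lines. This is not the AUTOSAR implementation of GetResource(); it only models the priority arithmetic on a single core, and the function names and priority values are assumptions made for illustration.

```python
def resource_ceiling(users, prio):
    """Ceiling of a resource: highest priority among the tasks that use it."""
    return max(prio[u] for u in users)


def running_priority(task, held, ceilings, prio):
    """Priority a task runs at while holding the resources listed in `held`."""
    return max([prio[task]] + [ceilings[r] for r in held])


# Hypothetical single-core example: tau1 and tau3 share resource R.
prio = {"tau1": 3, "tau2": 2, "tau3": 1}
ceilings = {"R": resource_ceiling({"tau1", "tau3"}, prio)}

# While tau3 holds R it runs at the ceiling, so tau2 cannot preempt it.
boosted = running_priority("tau3", ["R"], ceilings, prio)
can_tau2_preempt = prio["tau2"] > boosted
```

Because τ 2 cannot run while τ 3 holds R, the time for which τ 1 waits on R is limited to one critical section of τ 3 .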
A key extension in the AUTOSAR 4.0 multi-core specification is a new mechanism for mutual exclusion across tasks on different cores called the SpinlockType. This is a busy-waiting mechanism that polls a (lock) variable until it becomes available, using an atomic test-and-set operation. Once a lock variable S l is obtained by a task, other tasks on other cores will be unable to obtain the lock variable S l , effectively ensuring the constraint of mutually exclusive access to the resource R l protected by the lock S l . This mechanism works with multi-core processors, since it relies on shared memory locks as opposed to the priority attribute that does not span cores.
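The busy-waiting behaviour can be illustrated with a small emulation. Real spinlocks rely on a hardware atomic instruction; in this sketch a `threading.Lock` merely stands in for that atomicity, and the class and function names are hypothetical rather than part of any AUTOSAR API.

```python
import threading


class SpinlockSketch:
    """Emulated test-and-set spinlock (illustration only)."""

    def __init__(self):
        self._flag = False
        self._atomic = threading.Lock()  # stands in for hardware atomicity

    def test_and_set(self):
        """Atomically set the flag and return its previous value."""
        with self._atomic:
            previous = self._flag
            self._flag = True
            return previous

    def acquire(self):
        # Busy-wait: keep polling until the previous value was False.
        while self.test_and_set():
            pass

    def try_acquire(self):
        """Single attempt; True if the lock was obtained."""
        return not self.test_and_set()

    def release(self):
        with self._atomic:
            self._flag = False


def run_demo(n_threads=2, iterations=500):
    """Increment a shared counter under the spinlock from several threads."""
    lock = SpinlockSketch()
    shared = {"count": 0}

    def worker():
        for _ in range(iterations):
            lock.acquire()
            shared["count"] += 1
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return shared["count"]
```

Note that nothing in `acquire` orders the waiting contenders: whichever poll happens to succeed first wins, which is the root of the starvation problem discussed later.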
The SpinlockType is an approach to address the shortcomings of the priority ceiling protocol when extending AUTOSAR for use in multi-core processors. However, it introduces two problems from a timing standpoint: (i) deadlocks, and (ii) starvation. We will now describe these problems and provide priority-driven approaches to avoid them and achieve response-time guarantees.

DEADLOCKS
In the context of non-nested spinlocks, the SpinlockType mechanism could potentially lead to deadlocks, as noted in the standard specification itself:

Deadlock due to preemption
The deadlock happens when a lower priority task holding a resource protected by SpinlockType A gets preempted by a higher priority task that later tries to acquire the same SpinlockType A (see Figure 1).

Figure 1 - Deadlock due to Preemption
The current solution presented in the AUTOSAR specification is to (a) return an error to TASK/ISR2 trying to acquire a spinlock assigned to another TASK/ISR2 on the same core (per MCOS0112 and MCOS0113 of the specification), or (b) protect a TASK by wrapping the spinlock with a SuspendAllInterrupts() call so that the task cannot be preempted.
With respect to (a), returning an error to the second task trying to acquire the spinlock A is not very useful. It would be desirable to avoid this scenario in the first place, since application developers should not have to worry about such errors resulting from operating system scheduling decisions.
The advantage of (b) is that using SuspendAllInterrupts() reduces the amount of remote blocking suffered during multi-core synchronization. In the context of multi-core mutual exclusion, remote blocking is defined as the duration for which a task waits for a shared resource to be released on a remote core. Consider an example with three tasks τ 1 , τ 2 , and τ 3 in decreasing order of task priorities. Let tasks τ 2 and τ 3 be assigned to core P 1 , and task τ 1 be assigned to core P 2 . Tasks τ 1 and τ 3 share a resource R protected by a SpinlockType S. Figure 2 shows the scenario when task τ 3 acquires the resource R and gets preempted by task τ 2 on core P 1 before releasing R. In this case, task τ 1 executing on core P 2 might try to acquire the resource R and get blocked indirectly by τ 2 . This remote blocking could potentially be very large if there are many tasks with higher priority than τ 3 on core P 1 .

Figure 2 - Remote Blocking without SuspendAllInterrupts
The use of SuspendAllInterrupts() prevents any preemption for the duration of time in which the shared resource is held. In this example the use of SuspendAllInterrupts() leads to a significant reduction in the remote blocking suffered by τ 1 as shown in Figure 3. The caveat here is that the use of SuspendAllInterrupts() converts the duration for which a shared resource R is held into a non-preemptible section, which in turn could block other shared resources (say R' ) potentially required by tasks with higher priority than all tasks that access R. This is the main disadvantage of using recommendation (b) since any high priority task τ 0 requiring shared resource R' and having higher priority than both τ 2 and τ 3 will also suffer from the non-preemptive blocking due to R.
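A rough arithmetic sketch of the two cases in Figures 2 and 3 follows. It assumes, purely for illustration, that each higher-priority task on the lock holder's core preempts the holder at most once while the resource is held; the function name and the time values are hypothetical.

```python
def remote_blocking(remaining_cs, preempting_exec_times, suspend_interrupts):
    """Time a remote task waits for the resource to be released.

    remaining_cs:          remaining critical-section time of the lock holder
    preempting_exec_times: execution times of higher-priority tasks on the
                           holder's core that preempt it (one job each, assumed)
    suspend_interrupts:    True if the holder wrapped the critical section
                           with SuspendAllInterrupts()
    """
    if suspend_interrupts:
        return remaining_cs  # the holder cannot be preempted
    return remaining_cs + sum(preempting_exec_times)


# tau3 still needs 2 time units of R; tau2 would run for 7 time units.
without_suspend = remote_blocking(2, [7], suspend_interrupts=False)
with_suspend = remote_blocking(2, [7], suspend_interrupts=True)
```

The price of the shorter wait is that τ 2 , and any task above it on the same core, is delayed by the non-preemptible section even if it never uses R.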

Figure 3 - Remote Blocking with SuspendAllInterrupts
The second source of unbounded timing with AUTOSAR SpinlockType arises due to starvation, as we will describe next.

STARVATION
Consider 3 tasks τ 1 , τ 2 , and τ 3 , running on three separate cores P 1 , P 2 , and P 3 , respectively. All three tasks access the same shared resource R, which is protected using AUTOSAR Spinlocks. In this scenario, we show that task τ 3 can suffer from starvation. As can be seen in Figure 4, depending on the hardware implementation of the test-and-set mechanism and the task set characteristics, contention for the spinlock could lead to starvation, with task τ 3 potentially never getting the Spinlock. In this scenario, the test-and-set hardware implementation may not guarantee that the test-and-set request from core P 3 will succeed before the test-and-set requests from cores P 1 and P 2 . Even if such a guarantee is provided, there are still scenarios where task τ 3 might be starved, as we will show next.
Another possible scenario is shown in Figure 5, where task τ 1 runs on core P 1 and tasks τ 2 and τ 3 run on core P 2 . Tasks τ 1 and τ 3 share a resource. In the sequence presented, task τ 3 gets preempted by τ 2 before the shared resource is released by τ 1 on core P 1 . In effect, τ 3 would be starved of the shared resource even though it is available periodically. This scenario is independent of the test-and-set implementation.
A bound on the waiting time for the spinlock can be obtained if the following two conditions hold. (A) Both the GetSpinlock() call used to acquire the spinlock and the actual duration of holding the spinlock are wrapped with SuspendAllInterrupts(). (B) When there are m(S) contenders for spinlock S, which can be held for C(S) units of time, each task issues no more than one request for the spinlock every m(S)C(S) units of time. Condition (A) prevents any preemption when spinning for the lock, thus preventing the scenario described in Figure 5. Condition (B) ensures that the hardware test-and-set implementation does not result in starvation as illustrated in the example given in Figure 4.
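The effect of condition (B) on the Figure 4 scenario can be explored with a small discrete-time sketch. The arbitration rule used here (the lowest core index always wins) is a hypothetical worst case for the task on core P 3 , not a description of any real test-and-set hardware; condition (A) is implicit, since a spinning task is never preempted in this model. The function name and all time values are assumptions.

```python
def simulate(requests, hold, horizon):
    """Contention for one test-and-set lock, one task per core.

    requests: {core_index: (offset, period)} times at which the task on that
              core issues a lock request
    hold:     C(S), time units the lock is held once granted
    horizon:  number of time steps to simulate
    Returns {core_index: [grant times]}.
    """
    pending = set()
    holder, free_at = None, 0
    grants = {core: [] for core in requests}
    for t in range(horizon):
        if holder is not None and t >= free_at:
            holder = None  # lock released
        for core, (offset, period) in requests.items():
            if t >= offset and (t - offset) % period == 0 and core != holder:
                pending.add(core)  # task starts spinning
        if holder is None and pending:
            holder = min(pending)  # unfair, fixed-order arbitration (assumed)
            pending.discard(holder)
            free_at = t + hold
            grants[holder].append(t)
    return grants


C, m = 2, 3  # hold time C(S) and number of contenders m(S)

# Cores 1 and 2 re-request every 4 time units; core 3 requests once at t = 0.
starved = simulate({1: (0, 4), 2: (1, 4), 3: (0, 10**6)}, hold=C, horizon=60)

# Same scenario, but requests are spaced m(S)C(S) apart as in condition (B).
spaced = simulate({1: (0, m * C), 2: (1, m * C), 3: (0, 10**6)}, hold=C, horizon=60)
```

In the first run the task on core 3 is never granted the lock within the horizon, because one of the other two cores always has a request pending when the lock is freed. In the second run the spacing leaves a gap in which its request succeeds.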

Figure 5 - Starvation with Spinlocks due to Preemption
Although extending the AUTOSAR 4.0 SpinlockType mechanism with conditions (A) and (B) will enable us to achieve bounded timing properties for spinlocks, the priority inversion resulting from non-preemptible sections can be quite large, and this can be an especially high overhead for high-priority tasks with very short periods and deadlines. We next describe the multiprocessor priority ceiling protocol. This protocol is an extension of the priority ceiling protocol used in AUTOSAR for single-processor mutual exclusion that provides bounded timing properties, while reducing the priority inversion and utilization loss from remote blocking.

MULTIPROCESSOR PRIORITY CEILING PROTOCOL
The multiprocessor priority ceiling protocol (MPCP) was developed in [3] specifically to deal with the mutual exclusion problem in the context of shared-memory multiprocessors. Modern multi-core processors largely resemble shared-memory multiprocessors. Thus, MPCP is also a good fit for mutual exclusion in multi-core processors. In the context of MPCP, a global mutex is defined as a mutual exclusion lock shared across tasks executing on different cores. A local mutex is defined as a mutual exclusion lock that is only shared across tasks executing on the same core. A brief description of MPCP from [3], as applicable in the AUTOSAR context, is presented next.

MPCP WITH AUTOSAR SPINLOCKS
• Tasks use their assigned scheduling priority unless holding a mutex.
• The single-core priority ceiling protocol is used for all requests to local mutexes.
• The priority ceiling of a global mutex M is defined as π(M) = π G + π C , where π G is a priority higher than all normal scheduling priorities assigned to tasks in the system, and π C is the highest normal priority of any task accessing M.
• Any task holding a global mutex M executes at the priority ceiling of global mutex M.
• Any task holding a global mutex M 1 can preempt another task holding a global mutex M 2 , if the priority ceiling of M 1 is higher than the priority ceiling of M 2 .
• When a task τ requests a global mutex M, M can be granted by means of an atomic transaction on shared memory, if M is not already held by any other task.
• If a request for a global mutex M cannot be granted, then the requesting task τ is added to a prioritized queue on M. In a suspension-based implementation, task τ will be blocked on the event that M is released. In a spinning-based implementation, task τ will spin on a local variable until M is released.
• When a task releases a global mutex M, the highest priority task τ waiting in the prioritized queue for M is signaled on its local core, and M is marked as being held by τ. If there are no tasks waiting for M, then it is just marked as released.
Figure 6 illustrates an example scenario where MPCP is applied. There are 3 tasks τ 1 , τ 2 , and τ 3 , with tasks τ 1 and τ 2 executing on core P 1 and task τ 3 executing on core P 2 . Tasks τ 2 and τ 3 share a resource R, which is arbitrated using the Multi-processor Priority Ceiling Protocol with a global mutex M.
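The protocol rules listed above can be sketched as a small bookkeeping model. It covers the global ceiling, the prioritized queue, and the hand-off on release, but not the scheduler or the atomic shared-memory transaction. The class name, the function name, and the choice π G = 4 are hypothetical; the task-to-core assignment follows the Figure 6 scenario.

```python
import heapq


class GlobalMutex:
    """Bookkeeping for one MPCP global mutex (illustration only)."""

    def __init__(self, users, prio, pi_g):
        # pi(M) = pi_G + pi_C, with pi_C the highest normal priority of any user.
        self.ceiling = pi_g + max(prio[u] for u in users)
        self.holder = None
        self._prio = prio
        self._waiters = []  # heap of (-priority, task): highest priority first

    def request(self, task):
        """Grant M if it is free, otherwise enqueue by priority."""
        if self.holder is None:
            self.holder = task
            return True
        heapq.heappush(self._waiters, (-self._prio[task], task))
        return False

    def release(self):
        """Hand M to the highest-priority waiter, or mark it released."""
        if self._waiters:
            _, self.holder = heapq.heappop(self._waiters)
        else:
            self.holder = None
        return self.holder


def effective_priority(task, prio, held):
    """Normal priority unless a global mutex is held, then its ceiling."""
    return max([prio[task]] + [mutex.ceiling for mutex in held])


# Figure 6 scenario with hypothetical priority values.
prio = {"tau1": 3, "tau2": 2, "tau3": 1}
pi_g = 4  # assumed: any value above every normal priority
M = GlobalMutex({"tau2", "tau3"}, prio, pi_g)

granted_to_tau3 = M.request("tau3")  # M is free, so tau3 gets it
granted_to_tau2 = M.request("tau2")  # M is held, so tau2 is queued
next_holder = M.release()            # tau3 releases; M is handed to tau2
tau2_priority = effective_priority("tau2", prio, [M])
```

With these values the ceiling of M is above the normal priority of τ 1 , so τ 2 runs ahead of τ 1 on core P 1 for as long as it holds M.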

Figure 6 - Illustration of MPCP
In this example scenario, task τ 2 starts executing on processor P 1 while task τ 3 starts executing on processor P 2 . Task τ 3 acquires the mutex M and hence gets exclusive access to resource R. Meanwhile, task τ 2 requests the resource R and gets blocked on mutex M. When task τ 1 gets released, it continues to execute. When the mutex M is released on processor P 2 and given to task τ 2 , task τ 2 acquires a priority greater than all normal execution priorities, thereby preempting task τ 1 . When τ 2 releases the mutex M, it reverts to its normal priority, thereby enabling τ 1 to preempt τ 2 after the release of M.

Figure 7 - Preemption of Critical Sections under MPCP
A perhaps more interesting scenario is illustrated in Figure 7. In this example, there are 4 tasks τ 1 , τ 2 , τ 3 and τ 4 , with tasks τ 1 and τ 2 executing on core P 1 , task τ 3 executing on core P 2 , and task τ 4 executing on core P 3 . Tasks τ 2 and τ 3 share a resource R 1 , which is arbitrated using the Multi-processor Priority Ceiling Protocol with a mutex M 1 . Tasks τ 1 and τ 4 share a resource R 2 , which is arbitrated using the Multi-processor Priority Ceiling Protocol with a mutex M 2 .
For the scenario in Figure 7, all the tasks are released simultaneously on their respective processor cores. Task τ 1 preempts task τ 2 since it has a higher scheduling priority. First, task τ 4 acquires the resource R 2 by locking the mutex M 2 . Task τ 3 then acquires the resource R 1 by locking the mutex M 1 . When task τ 1 tries to acquire R 2 , it gets blocked on mutex M 2 . Task τ 2 therefore starts executing on processor P 1 while task τ 3 continues executing on processor P 2 . Later, task τ 2 requests the resource R 1 and gets blocked on mutex M 1 . When mutex M 1 is released on processor P 2 and given to task τ 2 , task τ 2 acquires a priority greater than all normal execution priorities. However, when the mutex M 2 is released on processor P 3 and given to task τ 1 , this mutex M 2 has a higher global priority ceiling than the mutex M 1 , since M 2 is shared between τ 1 and τ 4 , while M 1 is shared between τ 2 and τ 3 . Task τ 1 therefore preempts task τ 2 , which is holding mutex M 1 . This preemption ensures that the mutex M 2 required by a task with higher priority does not get blocked by mutex M 1 shared among lower priority tasks. This preemption is not possible when using the standard AUTOSAR SpinlockType with SuspendAllInterrupts(). MPCP thus provides a priority-driven approach to dealing with shared resources, as opposed to the AUTOSAR SpinlockType, which would effectively lead to non-preemptive sections.
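The ceiling comparison that decides the Figure 7 outcome can be checked with a few lines. The priority values and the value chosen for π G are hypothetical; only their ordering (τ 1 highest, τ 4 lowest, π G above all normal priorities) follows the text.

```python
def global_ceiling(users, prio, pi_g):
    """pi(M) = pi_G + pi_C, with pi_C the highest normal priority of any user."""
    return pi_g + max(prio[u] for u in users)


# Hypothetical priorities for the four tasks of Figure 7.
prio = {"tau1": 4, "tau2": 3, "tau3": 2, "tau4": 1}
pi_g = 5  # assumed: any value above every normal priority

ceiling_m1 = global_ceiling({"tau2", "tau3"}, prio, pi_g)  # guards R1
ceiling_m2 = global_ceiling({"tau1", "tau4"}, prio, pi_g)  # guards R2

# On core P1: tau1 holding M2 runs at ceiling_m2, tau2 holding M1 at ceiling_m1.
tau1_preempts_tau2 = ceiling_m2 > ceiling_m1
```

Because both critical sections run above every normal priority but keep their relative order, the section belonging to the higher-priority task still wins on the shared core.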
From an implementation perspective, the prioritized queues can be located in shared memory. The task releasing the resource is responsible for signaling the highest-priority task waiting on the priority queue. If no task is waiting on the priority queue, then the task can simply mark the resource as available. The latency of locking this priority queue is dependent on the processor implementation and the memory system.

RELATED WORK
The mutual exclusion problem in the context of single-core processors and static priority scheduling was addressed in [2]. The proposed solution known as the Priority Ceiling Protocol (PCP) has been adopted in AUTOSAR for mutual exclusion in single-core processors. For dynamic priority scheduling systems, an alternative mechanism known as the Stack Resource Policy (SRP) was proposed in [8]. These mutual exclusion techniques have been specifically designed for the single-core context, and do not extend directly to the multi-core scenario. As noted by the AUTOSAR 4.0 [1] specification, the key underlying problem here is that priorities are effective in preventing access to shared resources in a single-core context but do not work across multiple cores. The AUTOSAR 4.0 specification therefore proposes the SpinlockType mechanism for handling mutual exclusion in multi-core processors. The contributions of our work in this regard are (i) highlighting the timing issues such as deadlock and starvation that can occur when using SpinlockTypes, and (ii) developing extensions and associated analysis for achieving predictable timing with SpinlockTypes.
The issue of extending PCP to the multiprocessor context was considered in [3], and the multiprocessor priority ceiling protocol (MPCP) was proposed. MPCP was designed for shared memory multiprocessor systems, and also applies to multi-core processors which have similar architectural characteristics. In this work, we have reviewed MPCP in the context of AUTOSAR SpinlockTypes to reduce priority inversion and contain the utilization loss resulting from remote blocking. We have also leveraged the analysis provided in [6], which considers both suspension and spinning based versions of the MPCP protocol. The key contribution of our work in this regard is to adapt the existing MPCP results in the AUTOSAR context, and provide the required timing analysis.
A flexible approach for mutual exclusion was studied in [4], where short resource requests use a busy-wait mechanism, while long resource requests are handled using a suspension approach. There are two key differences between our MPCP-based approach and the one used in [4] for long resource requests: (i) [4] uses a First-In-First-Out queue for determining the task that gets the resource, whereas MPCP uses prioritized queues, and (ii) [4] uses the standard priority ceiling for long request locks, whereas MPCP uses a global priority ceiling that is higher than all normal execution priorities and reduces remote blocking. Under many conditions [6], the MPCP-based approach performs better.
In the Appendix, we capture the analyses that can be performed for the various schemes we have discussed.

SUMMARY/CONCLUSIONS
Multi-core processors are becoming increasingly prevalent in the general-purpose computing market. Chip vendors in the embedded market have also started offering multi-core chips due to their power and performance benefits. These trends have influenced the AUTOSAR 4.0 release to add support for multi-core processors. The proposed multi-core extension identifies that the single-core priority ceiling protocol does not directly extend to multi-core processors, and introduces a new SpinlockType mechanism for mutual exclusion when resources are shared across cores. In this work, we studied different sources of unbounded timing such as deadlock and starvation introduced by the SpinlockType mechanism as currently defined in AUTOSAR 4.0. We then discussed extensions to achieve timing predictability with the SpinlockType. Subsequently, we have developed our approach where resources can be arbitrated using the Multi-processor Priority Ceiling Protocol (MPCP). The resulting solution provides bounded timing support, and is thus beneficial from both certification and validation standpoints.

APPENDIX
where I i,i (C(S l )) denotes the worst-case preemption suffered by task τ i when holding the spinlock S l at its normal scheduling priority.
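A form of the lock-holding time that is consistent with this definition, and with the later comparison of Equations (5) and (2), is sketched below. It assumes that the holding time is simply the critical-section length plus the preemption suffered while the lock is held at the task's normal priority; the additive structure is an assumption made here for illustration.

```latex
L_i(S_l) = C(S_l) + I_{i,i}\bigl(C(S_l)\bigr)
```

Under this reading, every task on the same core with a priority above π i can extend the time for which τ i holds S l , which is the remote-blocking effect illustrated in Figure 2.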

Remote Blocking with SuspendAllInterrupts()
For remote blocking with SuspendAllInterrupts(), the worst-case lock holding time L i (S l ) for any task τ i for the spinlock S l is:
However, any task τ h executing on core P k needs to account for non-preemptive blocking B n h from lower-priority tasks given by:

Remote Blocking with Standard Single-Processor Priority Ceiling Protocol:
In this case, the worst-case lock holding time L i (S l ) for any task τ i for the spinlock S l is given by:
where π(S l ) denotes the highest scheduling priority among all tasks that can acquire the spinlock S l , and I i,π(S l ) (C(S l )) denotes the worst-case preemption suffered by task τ i when holding the spinlock S l at its priority ceiling of π(S l ). Comparing Equations (5) and (2), we see that (5) ≤ (2), since the priority ceiling of S l is higher than or equal to the priority of task τ i , and hence the interference from higher priority tasks when executing at the priority ceiling will be no greater:
$I_{i,\pi(S_l)}(C(S_l)) \le I_{i,i}(C(S_l))$
Any task τ h executing on core P k must account for non-preemptive blocking B n h from lower-priority tasks, which could be holding a spinlock and blocking τ h from getting scheduled on the processor. This non-preemptive blocking B n h is given by:
Comparing Equations (4) and (6), we find that (6) ≤ (4) since (4) considers all lower priority tasks τ j ∈ lp(τ h ) assigned to processor P k and acquiring a spinlock S l , while (6) only considers the lower priority tasks τ j that hold a spinlock S l shared with a task τ r with higher priority or equal priority to τ h ({τ j , τ r } ∈ S l & τ j ∈ lp(τ h ) & τ r ∉ lp(τ h )). From a designer's perspective, the conclusion is that the blocking duration under the single-processor priority ceiling protocol is less than or equal to the blocking term from spinlocks.

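The comparison between the two blocking terms can be illustrated with a sketch that follows the verbal description of the two candidate sets. It assumes that the blocking term is the longest critical section C(S l ) over the respective set; that choice, the function names, and the numeric values are assumptions made for illustration.

```python
def lower_on_core(task_h, core_tasks, prio):
    """Tasks on the same core with lower priority than task_h."""
    return {t for t in core_tasks if prio[t] < prio[task_h]}


def blocking_all_spinlocks(task_h, core_tasks, prio, locks):
    """Considers every spinlock acquired by a lower-priority task on the core."""
    lower = lower_on_core(task_h, core_tasks, prio)
    return max((c for c, users in locks.values() if users & lower), default=0)


def blocking_with_ceiling(task_h, core_tasks, prio, locks):
    """Considers only spinlocks that a lower-priority task on the core shares
    with some task whose priority is higher than or equal to that of task_h."""
    lower = lower_on_core(task_h, core_tasks, prio)
    return max(
        (c for c, users in locks.values()
         if users & lower and any(prio[r] >= prio[task_h] for r in users)),
        default=0,
    )


# Hypothetical setup: tau0, tau2, tau3 on core P1; tau1 on another core.
prio = {"tau0": 4, "tau1": 3, "tau2": 2, "tau3": 1}
core_p1 = {"tau0", "tau2", "tau3"}
# locks: name -> (C(S), set of tasks that acquire S)
locks = {"S": (5, {"tau1", "tau3"}), "S_prime": (1, {"tau0"})}

b_all_tau0 = blocking_all_spinlocks("tau0", core_p1, prio, locks)
b_ceil_tau0 = blocking_with_ceiling("tau0", core_p1, prio, locks)
```

In this example τ 0 is blocked by the critical section on S when every spinlock section is non-preemptible, but not when only the ceiling applies, because no task sharing S has a priority at or above that of τ 0 .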
Waiting Time Bound with SuspendAllInterrupts():
If conditions (A) and (B) are in place, the worst-case waiting time ω i (S l ) of a task τ i for a lock S l can be bounded, and is given by:
where m(S l ) is the number of tasks that share the spinlock S l .
The duration of the non-preemptible section N(S l ) for spinlock S l wrapped by the SuspendAllInterrupts() call is then given by:
Inequality (1) can be extended for the task τ i accessing shared resources protected using spinlocks and surrounded (including the GetSpinlock() call) by SuspendAllInterrupts(). If we assume that C i includes the time C(S l ) spent holding the resource S l but excludes the waiting time for S l (reasonable if the WCET is defined as truly the worst-case execution time obtained by running the task in isolation), then the response time W i for a task τ i on core P k that can acquire a spinlock S l is given by the convergence, with initial condition W i = 0, of Equation (9), whose summations range over ∀S l | τ j ∈ S l and ∀τ j | τ j ∈ P k & τ j ∈ hp(τ i ),
where ω i (S l ) is the maximum waiting time for the lock S l when using SuspendAllInterrupts().
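The "convergence" computations referred to in this appendix follow the usual fixed-point pattern of response-time analysis. The sketch below shows only that generic pattern: all waiting and blocking contributions are lumped into a single `extra` parameter, so it is not a restatement of Equation (9). The function name and the numeric task parameters are hypothetical.

```python
def response_time(c_i, extra, higher, deadline):
    """Iterate W <- C_i + extra + sum(ceil(W / T_j) * C_j) from W = 0.

    c_i:      execution time of the task under analysis
    extra:    lumped waiting and blocking time (assumed to be given)
    higher:   list of (C_j, T_j) for higher-priority tasks on the same core
    deadline: give up once the estimate exceeds this value
    Returns the converged response time, or None if the deadline is exceeded.
    """
    w = 0
    while True:
        # -(-w // t_j) is integer ceil(w / t_j)
        nxt = c_i + extra + sum(-(-w // t_j) * c_j for c_j, t_j in higher)
        if nxt == w:
            return w
        if nxt > deadline:
            return None
        w = nxt


# Hypothetical task (C = 3, T = 20) below two tasks (1, 5) and (2, 10).
no_blocking = response_time(3, 0, [(1, 5), (2, 10)], deadline=20)
with_blocking = response_time(3, 2, [(1, 5), (2, 10)], deadline=20)
```

Any additional waiting or blocking time enters the iteration once per job of the task under analysis, but it can also pull in further higher-priority preemptions, which is why the terms have to be evaluated inside the fixed point rather than added afterwards.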

Waiting Time Bound with MPCP:
When tasks only use MPCP for synchronization on shared resources, the worst-case response time W i for a task τ i executing on a core P k that can acquire shared resources guarded by spinlocks is given by the convergence with initial condition W i = 0:
where μ i (S l ) is the maximum waiting time for the lock S l when using MPCP (see [6] for more details regarding MPCP analysis).
The duration H i (C(S l )) for which a critical section of length C(S l ) can be held by a task τ i executing on processor P k is given by:
From a practical perspective, Equation (11) leads to higher-priority tasks having shorter waiting times on global mutexes, while Equation (7) leads to a uniform waiting duration across all tasks.

Waiting Time Bound with a Hybrid Approach:
Let ξ denote the set of all spinlocks that are held for a short duration of time and protected using SuspendAllInterrupts(), while λ denotes the set of all spinlocks that use MPCP. In this case, the worst-case response time W i for a task τ i executing on a core P k is given by the convergence with initial condition W i = 0:

Figure 4 - Task τ 3 suffers starvation with Spinlocks since it can in principle always lose the lock contest.