Parallel Algorithms for Generating Harmonised State Identifiers and Characterising Sets

Many automated finite state machine (FSM) based test generation algorithms require that a characterising set or a set of harmonised state identifiers is first produced. The only previously published algorithms for partial FSMs were brute-force algorithms with exponential worst case time complexity. This paper presents polynomial time algorithms and also massively parallel implementations of both the polynomial time algorithms and the brute-force algorithms. In the experiments the parallel algorithms scaled better than the sequential algorithms and took much less time. Interestingly, while the parallel version of the polynomial time algorithm was fastest for most sizes of FSMs, the parallel version of the brute-force algorithm scaled better due to lower memory requirements.


INTRODUCTION
Software testing is an important part of software development but typically is expensive, manual and error prone. One possible solution is to base testing on a formal model or specification [1], [2], allowing rigorous test generation algorithms to be used. In this context, one of the most widely used types of formal model is the finite state machine (FSM). The tester might devise an FSM to drive test generation and execution or an FSM might be produced from the semantics of a model in a more expressive language, such as specification and description language (SDL) or state-charts [3], [4], [5].
Approaches that derive a test sequence (an input sequence) from an FSM model have been developed in various application domains such as sequential circuits [6], lexical analysis [7], software design [8], communication protocols [5], [9], [10], [11], object-oriented systems [12], and web services [13], [14]. Such techniques have also been shown to be effective when used in important industrial projects [15]. Once a test sequence has been devised from an FSM specification M, the test sequence is applied to the implementation N. The output sequences produced by N and M are then compared and if they differ then we can deduce that N is faulty.
There are many techniques that automate the generation of test sequences from an FSM specification M, with this work going back to the seminal papers of Moore [16] and Hennie [17]. Most such techniques use state identification sequences that distinguish the states of M [8], [17], [18], [19], [20], [21], [22], [23]. For example, a test technique can check that the input of $x$ in state $s$ leads to output $y$ and state $s'$ as follows: start with a preamble (input sequence) $\bar{x}$ that takes M to $s$, then apply $x$, check that the output produced is $y$, and finally use one or more input sequences to distinguish the expected state $s'$ from all other states of M. The test technique might also check the state reached by $\bar{x}$.
One approach to state identification is to use a characterising set (CS): a set of input sequences that distinguish all pairs of states [8], [24]. Harmonised state identifiers (HSIs) improve on CSs by allowing different sets of input sequences for different states [25]. One of the benefits of CSs and HSIs is that every minimal deterministic FSM has a CS and an HSI and so test generation techniques that use these are generally applicable. This paper focuses on generating CSs and HSIs.
The main focus of previous work has been complete FSMs, where the response to input $x$ in state $s$ is defined for every input $x$ and state $s$. However, it has been observed that often FSM specifications are not complete (they are partial) [26], [27], [28], [29]. Complete FSMs are a special class of partial FSMs and so partial FSMs model a wider range of systems. However, the traditional state identification methodologies are not directly applicable to partial FSMs [17], [30]. The assumption that FSMs are complete is typically justified by assuming that missing transitions can be completed by, for example, adding transitions with null output.
Although it is sometimes possible to complete a partial FSM, this is not a general solution [30]. For example, an FSM M might specify a component that receives inputs from another component $M'$; the behaviour of $M'$ influences what input sequences can be received by M. In addition, it might not be possible for certain input sequences to be provided by the environment due to physical constraints. Furthermore, ensuring completeness may introduce redundancy. For example, during the experiments we found a benchmark FSM 'scf' provided in [31] that has 97 states and 13 million inputs but only 166 transitions. Clearly, it is not sensible to complete such an FSM. There has thus been interest in techniques for testing from a partial FSM [20], [21], [25], [30], [32], [33], [34], [35], [36], [37].
While one might expect a tester to use quite small models, an FSM M might represent the semantics of a model $M'$ written in a more expressive language such as state-charts or SDL. A state of M will represent a combination of a logical state for $M'$ and a tuple of values for the variables in $M'$. Thus, even small models can lead to large FSMs. However, the scalability of deriving CSs and HSIs from partial FSMs has received little attention despite FSM specifications often being partial and many FSM-based test generation methods using a CS or HSI. Note that by 'scalability' we refer to the size (number of states) of the largest partial FSMs that can be processed by an algorithm in a reasonable amount of time. To our knowledge there exists only one paper that proposes algorithms for deriving CSs and HSIs for partial FSMs [25] despite there being test generation algorithms, for testing from a partial FSM, that use such state identifiers [20], [21], [25], [30], [36], [37]. The CS/HSI generation algorithms presented there are sequential algorithms that operate on a single thread. The sequential CS algorithm has exponential worst case time complexity and the sequential HSI algorithm requires CSs to first be constructed.
Despite the increasing interest in graphics processing units (GPUs) [38], [39], [40], [41], [42], [43], no previous work has utilised GPU technology to generate CSs or HSIs. In this work, our primary motivation is to address the scalability problem raised when constructing CSs and HSIs and to do so through using GPU technology.
As noted, the only published algorithm for generating CSs and HSIs from partial FSMs has exponential worst case time complexity. This paper tackles scalability from two directions. First, we devise a polynomial time sequential algorithm for generating CSs. We also produced a parallel implementation of this. We devise a parallel HSI construction algorithm and also a parallel implementation of the brute-force algorithms for generating CSs and HSIs (based on previous work). We also present the results of experiments. In experiments with randomly generated FSMs, the parallel algorithms scaled much better than the sequential algorithms. Such improvements in performance should help test generation techniques scale to larger FSMs. Interestingly, although the parallel brute-force (worst case exponential time) algorithms were slower than the parallel versions of the polynomial time algorithms, they scaled better. This is because the polynomial time algorithms have greater memory requirements. The parallel algorithms also outperformed the sequential algorithms on benchmark FSMs.
There were several challenges in designing a scalable massively parallel algorithm for deriving CSs and HSIs from partial FSMs. First, it was necessary to develop data structures that (1) can be processed quickly; (2) can be efficiently stored in GPU memory; and (3) can encapsulate enough data for constructing CSs and HSIs. It is also important that the parallel algorithm maximises GPU utilisation (occupancy). There is a trade-off between these factors since, for example, storing information in local GPU memory reduces memory access time but can also reduce the number of threads that can run in parallel.
This paper is organised as follows. We provide preliminary material in Section 2 and in Section 3 we develop a polynomial time sequential algorithm. In Section 4 we present the proposed parallel HSI and CS algorithms. Section 5 outlines the experiments and the results of these. Section 6 describes threats to validity and in Section 7 we draw conclusions. The appendix explains how the algorithms were implemented on a GPU.

Finite State Machines
An FSM is defined by a tuple $M = (S, s_0, X, Y, h)$ where $S = \{s_1, s_2, \ldots, s_n\}$ is the finite set of states; $s_0 \in S$ is the initial state; $X = \{x_1, x_2, \ldots, x_p\}$ is the finite set of inputs; $Y = \{y_1, y_2, \ldots, y_q\}$ is the finite set of outputs (we assume that $X$ is disjoint from $Y$); and $h \subseteq S \times X \times Y \times S$ is the set of transitions of M. We let $h = \{t_1, t_2, \ldots, t_k\}$.
If $(s, x, y, s') \in h$ and input $x \in X$ is applied when M is in state $s$ then M can change its state to $s'$ and produce output $y$. Here $t = (s, x, y, s')$ is a transition of M with starting state $s$, ending state $s'$, and label $x/y$. An FSM M is deterministic if for all $s \in S$ and $x \in X$ we have that M has at most one transition of the form $(s, x, y, s')$.
An FSM can be represented by a directed graph. Fig. 1 gives two FSMs with state sets $\{s_1, s_2, s_3, s_4\}$, inputs $\{x_1, x_2, x_3, x_4, x_5, x_6\}$, and outputs $\{y_1, y_2\}$. A node represents a state and a directed edge from a node labelled $s$ to a node labelled $s'$ with label $x/y$ represents the transition $t = (s, x, y, s')$.
An input $x$ is defined at state $s$ if there exists a transition of the form $(s, x, y, s') \in h$ and we then use the notation $\delta(s, x) = s'$ and $\lambda(s, x) = y$. If $x$ is not defined at state $s$ then we write $\delta(s, x) = e$ (for a special symbol $e \notin S$) and $\lambda(s, x) = \varepsilon$. Thus $\delta$ is a function from $S \times X$ to $S \cup \{e\}$ and $\lambda$ is a function from $S \times X$ to $Y \cup \{\varepsilon\}$. If for every state $s \in S$ and input $x \in X$, there exists a transition of the form $(s, x, y, s')$ then M is a complete FSM, otherwise it is a partial FSM. In this paper, we consider partial deterministic FSMs and from now on we use the term FSM to refer to partial deterministic FSMs.
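The definitions of $\delta$, $\lambda$ and completeness can be made concrete with a small sketch. The dictionary layout, the helper names `delta`, `lam` and `is_complete`, and the three-state machine `TRANS` are all assumptions made for illustration; the machine is hypothetical and is not one of the FSMs in Fig. 1.

```python
from typing import Dict, Optional, Tuple

# (state, input) -> (output, next_state); a missing key means the input
# is not defined at that state, which is what makes the FSM partial.
Transitions = Dict[Tuple[str, str], Tuple[str, str]]

# Hypothetical three-state machine, not one of the FSMs in Fig. 1.
TRANS: Transitions = {
    ("s1", "a"): ("0", "s2"),
    ("s2", "a"): ("0", "s3"),
    ("s3", "a"): ("1", "s1"),
    ("s2", "b"): ("0", "s1"),
    ("s3", "b"): ("1", "s3"),
}


def delta(trans: Transitions, s: str, x: str) -> Optional[str]:
    """Next state, or None (standing in for the symbol e) if x is undefined at s."""
    entry = trans.get((s, x))
    return entry[1] if entry is not None else None


def lam(trans: Transitions, s: str, x: str) -> str:
    """Output, or "" (standing in for epsilon) if x is undefined at s."""
    entry = trans.get((s, x))
    return entry[0] if entry is not None else ""


def is_complete(trans: Transitions, states, inputs) -> bool:
    """True when every input is defined at every state."""
    return all((s, x) in trans for s in states for x in inputs)
```

Here input `b` is not defined at `s1`, so the machine is partial: `delta(TRANS, "s1", "b")` yields `None` and `is_complete` reports `False`.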
We use juxtaposition to denote concatenation: if $x_1$, $x_2$, and $x_3$ are inputs then $x_1 x_2 x_3$ is an input sequence. We use $\varepsilon$ to represent the empty sequence and use $pref(.)$ to denote the set of all the prefixes of parameter $(.)$ longer than 0.
Fig. 1. CS for $M_1$ has $4(4-1)/2$ elements $W = \{\{x_1\}, \{x_2\}, \{x_3\}, \{x_4\}, \{x_5\}, \{x_6\}\}$. The length of separating sequence for states $s_1, s_2$ of $M_2$ is $4(4-1)/2$ and is $x_1 x_2 x_3 x_4 x_5 x_6$.
The behaviour of an FSM M is defined in terms of the labels of walks of M. A walk is a sequence $t = (s_1, x_1, y_1, s_2)(s_2, x_2, y_2, s_3) \ldots (s_m, x_m, y_m, s_{m+1})$ of consecutive transitions. This walk has starting state $s_1$, ending state $s_{m+1}$, and label $x_1/y_1\, x_2/y_2 \ldots x_m/y_m$.
Here $\bar{z} = x_1/y_1\, x_2/y_2 \ldots x_m/y_m$ is an input/output sequence, $x_1 x_2 \ldots x_m$ is the input portion ($in(\bar{z})$) of $\bar{z}$, and $y_1 y_2 \ldots y_m$ is the output portion ($out(\bar{z})$) of $\bar{z}$. For example, $M_1$ has the walk $(s_2, x_4, y_1, s_3)(s_3, x_6, y_1, s_4)$ that has starting state $s_2$ and ending state $s_4$. The label of this walk is $x_4/y_1\, x_6/y_1$ and this has input portion $x_4 x_6$ and output portion $y_1 y_1$. FSM M is strongly connected if for every ordered pair $(s, s')$ of states of M there is a walk with starting state $s$ and ending state $s'$.
An input sequence $\bar{x} = x_1 x_2 \ldots x_m$ is a defined input sequence for state $s$ if M has a walk with starting state $s$ and label $\bar{z}$ such that $in(\bar{z}) = x_1 \ldots x_m$. For example, since $(s_2, x_4, y_1, s_3)(s_3, x_6, y_1, s_4)$ is a walk of $M_1$, we have that $x_4 x_6$ is a defined input sequence for $s_2$. We will let $L^I_M(s)$ denote the set of defined input sequences for $s$ and $L^I_M(S') = \bigcap_{s \in S'} L^I_M(s)$ (the set of input sequences defined in all states in $S'$). In an abuse of notation we use $\lambda$ and $\delta$ with input sequences: if $x$ and $\bar{x}$ denote an input and an input sequence respectively such that $x\bar{x}$ is a defined input sequence for $s$ then $\delta(s, \varepsilon) = s$, $\delta(s, x\bar{x}) = \delta(\delta(s, x), \bar{x})$, $\lambda(s, \varepsilon) = \varepsilon$, and $\lambda(s, x\bar{x}) = \lambda(s, x)\lambda(\delta(s, x), \bar{x})$.
Given a set $X$ we let $X^*$ denote the set of finite sequences of elements of $X$ and let $X^k$ denote the set of sequences in $X^*$ of length $k$. An input sequence $\bar{x} \in X^*$ is a separating sequence for states $s$ and $s'$ if $\bar{x}$ is a defined input sequence for $s$ and $s'$ and $\lambda(s, \bar{x}) \neq \lambda(s', \bar{x})$. Consider, for example, $M_1$ (Fig. 1a) and states $s_1$ and $s_2$. Then $x_1$ is a separating sequence for this pair since $x_1$ is defined in both states and $\lambda(s_1, x_1) = y_1 \neq y_2 = \lambda(s_2, x_1)$. In contrast, no input sequence starting with input $x_4$ can be a separating sequence for a pair that contains $s_1$ since $x_4$ is not defined in $s_1$. If every pair of states of FSM M has a separating sequence then M is a minimal FSM.
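The two conditions in the definition of a separating sequence (defined in both states, different output sequences) can be checked directly. The following is a minimal sketch; the helper names `run` and `separates` and the machine `TRANS` are hypothetical and chosen only for illustration.

```python
# Hypothetical partial FSM: (state, input) -> (output, next_state).
TRANS = {
    ("s1", "a"): ("0", "s2"),
    ("s2", "a"): ("0", "s3"),
    ("s3", "a"): ("1", "s1"),
    ("s2", "b"): ("0", "s1"),
    ("s3", "b"): ("1", "s3"),
}


def run(trans, s, seq):
    """Output sequence produced by seq from s, or None if seq is not defined at s."""
    out = []
    for x in seq:
        if (s, x) not in trans:
            return None
        y, s = trans[(s, x)]
        out.append(y)
    return tuple(out)


def separates(trans, s, t, seq):
    """True if seq is defined at both s and t and the outputs differ."""
    a, b = run(trans, s, seq), run(trans, t, seq)
    return a is not None and b is not None and a != b
```

In this machine `a` alone does not separate `s1` from `s2` (both produce `0`), `b` cannot be used because it is undefined at `s1`, and `aa` separates the pair.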
In this work, we consider only minimal FSMs. If an FSM is not minimal then a minimal FSM can be formed by merging pairs of compatible states in an iterative manner, where two states are compatible if they cannot be distinguished. It is possible to check whether two states are compatible in polynomial time and so it is also possible to construct a minimal FSM $M'$ from a partial FSM M in polynomial time [44].
There are a number of approaches to state verification and this paper focuses on two that are applicable to any minimal FSM. A characterising set is a set of input sequences that, between them, distinguish all of the states of M.
Definition 2.1. A CS for FSM $M = (S, s_0, X, Y, h)$ is a set $W \subseteq X^*$ such that for all $s_i, s_j \in S$ with $i \neq j$ there exists $\bar{x} \in W$ such that a prefix of $\bar{x}$ distinguishes $s_i$ and $s_j$.
We use $\mathcal{S}_M = \{(s_i, s_j) \mid s_i, s_j \in S, i < j\}$ to denote a set of distinct pairs $(s_i, s_j)$ of states of M. The restriction that $i < j$ ensures that a pair of states is represented exactly once. We will use $\mathcal{S}$ to denote $\mathcal{S}_M$ and $\gamma_{ij}$ will denote a pair $(s_i, s_j)$ in $\mathcal{S}_M$. Since $|S| = n$, set $\mathcal{S}$ contains $n(n-1)/2$ pairs. A CS $W$ is a set of input sequences that distinguish the pairs in $\mathcal{S}$.
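Definition 2.1 can be read as a check over all $n(n-1)/2$ pairs of states. The sketch below is one way to express that check; the function names and the machine `TRANS` are hypothetical, and the code favours clarity over speed.

```python
from itertools import combinations

# Hypothetical partial FSM: (state, input) -> (output, next_state).
TRANS = {
    ("s1", "a"): ("0", "s2"),
    ("s2", "a"): ("0", "s3"),
    ("s3", "a"): ("1", "s1"),
    ("s2", "b"): ("0", "s1"),
    ("s3", "b"): ("1", "s3"),
}


def run(trans, s, seq):
    """Output sequence for seq from s, or None if seq is not defined at s."""
    out = []
    for x in seq:
        if (s, x) not in trans:
            return None
        y, s = trans[(s, x)]
        out.append(y)
    return tuple(out)


def some_prefix_separates(trans, s, t, seq):
    """True if a non-empty prefix of seq is defined at s and t with different outputs."""
    for k in range(1, len(seq) + 1):
        a, b = run(trans, s, seq[:k]), run(trans, t, seq[:k])
        if a is not None and b is not None and a != b:
            return True
    return False


def is_characterising_set(trans, states, W):
    """Check Definition 2.1 for every pair (s_i, s_j) with i < j."""
    return all(
        any(some_prefix_separates(trans, s, t, seq) for seq in W)
        for s, t in combinations(states, 2)
    )
```

For this machine the single sequence `aa` is a CS, because its prefix `a` separates every pair involving `s3` and `aa` itself separates `s1` from `s2`.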
When considering a minimal complete FSM M with state set $S$ ($|S| = n$), a set $A$ of input sequences defines an equivalence relation $\sim_A$ over $S$, with two states $s, s'$ being equivalent under $\sim_A$ if $\lambda(s, \bar{x}) = \lambda(s', \bar{x})$ for all $\bar{x} \in A$. Let us suppose that we add an input sequence $\bar{x}$ to $A$, to form $A'$, and this makes $A$ more effective at distinguishing the states of M: there exist $s, s' \in S$ such that $s \sim_A s'$ and $\lambda(s, \bar{x}) \neq \lambda(s', \bar{x})$. Then $\sim_{A'}$ has more equivalence classes than $\sim_A$. Since an equivalence relation $\sim_A$ on $S$ can have at most $n$ equivalence classes, it is thus straightforward to see that, starting from the empty set, we can add at most $n-1$ input sequences if each input sequence is to make the set more effective. Thus, M has a CS with at most $n-1$ sequences. Further, we can use the following result, which is straightforward to prove, to deduce that there is such a CS where each input sequence is of length at most $n-1$.

Proposition 2.1. If input sequence $\bar{x}$ distinguishes states $s, s'$, no proper prefix of $\bar{x}$ distinguishes these states, and $\bar{x} = \bar{x}'\bar{x}''$ for input sequences $\bar{x}'$ and $\bar{x}''$ then $\bar{x}''$ distinguishes states $\delta(s, \bar{x}')$ and $\delta(s', \bar{x}')$.
For minimal partial FSMs we have different bounds since a set $A$ of input sequences need not define an equivalence relation on the states of a partial FSM M if different input sequences from $A$ are defined from different states of M. In particular, it is possible to construct an FSM M such that for each pair of states $s, s'$ there is a unique input $x$ such that $x$ is the only input defined in both $s$ and $s'$ (and so a characterising set must contain at least $n(n-1)/2$ input sequences). However, for a minimal partial FSM M we require at most one input sequence for each pair of states in $\mathcal{S}_M$ and so at most $n(n-1)/2$ sequences. We can use Proposition 2.1 to deduce that we require input sequences of length at most $n(n-1)/2$ and so a CS requires $O(n^4)$ memory. Fig. 1 contains two FSMs. FSM $M_1$ has $n(n-1)/2$ inputs and for every input $x_i$ there is a pair of states $s, s'$ such that $x_i$ is the only input that is defined in both $s$ and $s'$ and so a CS must contain an input sequence that starts with $x_i$. For FSM $M_2$ there is a pair of states whose shortest separating sequence is of length $n(n-1)/2$.
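The $O(n^4)$ memory bound is the product of the two worst-case quantities just stated. As a short derivation, counting input symbols and assuming (for illustration only) that each symbol occupies one unit of storage:

```latex
\[
\underbrace{\frac{n(n-1)}{2}}_{\text{number of sequences}}
\times
\underbrace{\frac{n(n-1)}{2}}_{\text{maximum length}}
= \frac{n^{2}(n-1)^{2}}{4}
= O(n^{4}).
\]
```

For the four-state machines of Fig. 1 this gives at most $6 \times 6 = 36$ input symbols.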
It may not be necessary to execute all sequences from a CS to distinguish a state $s$ from all other states [25].
Definition 2.2. A state identifier (SI) for a state $s_i$ of FSM $M = (S, s_0, X, Y, h)$ is a set $H_i \subseteq X^*$ such that for all $s_j \in S \setminus \{s_i\}$, there exists $\bar{x} \in H_i$ such that $\bar{x}$ is a separating sequence for $s_i$ and $s_j$.
This leads to HSIs.
Definition 2.3. A set of harmonised state identifiers for FSM $M = (S, s_0, X, Y, h)$ is a set of state identifiers $H = \{H_1, H_2, \ldots, H_n\}$ such that for all $s_i, s_j \in S$ with $i \neq j$, there exists $\bar{x} \in pref(H_i) \cap pref(H_j)$ that is a separating sequence for $s_i$ and $s_j$.
As suggested in [25] one can derive HSIs from a CS. This provides an upper bound $n^4$ on the size of HSIs.
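The harmonisation condition of Definition 2.3 is that each pair of states shares a prefix that separates them. A minimal sketch of that check follows; the helper names and the machine `TRANS` are hypothetical.

```python
from itertools import combinations

# Hypothetical partial FSM: (state, input) -> (output, next_state).
TRANS = {
    ("s1", "a"): ("0", "s2"),
    ("s2", "a"): ("0", "s3"),
    ("s3", "a"): ("1", "s1"),
    ("s2", "b"): ("0", "s1"),
    ("s3", "b"): ("1", "s3"),
}


def run(trans, s, seq):
    """Output sequence for seq from s, or None if seq is not defined at s."""
    out = []
    for x in seq:
        if (s, x) not in trans:
            return None
        y, s = trans[(s, x)]
        out.append(y)
    return tuple(out)


def separates(trans, s, t, seq):
    a, b = run(trans, s, seq), run(trans, t, seq)
    return a is not None and b is not None and a != b


def prefixes(identifier):
    """pref(H_i): all non-empty prefixes of the sequences in one state identifier."""
    return {seq[:k] for seq in identifier for k in range(1, len(seq) + 1)}


def is_hsi(trans, H):
    """H maps each state to its identifier (a set of input sequences as tuples)."""
    for s, t in combinations(sorted(H), 2):
        common = prefixes(H[s]) & prefixes(H[t])
        if not any(separates(trans, s, t, seq) for seq in common):
            return False
    return True
```

With identifiers `{aa}` for `s1` and `s2` and `{a}` for `s3`, every pair shares a separating prefix. Shrinking all three identifiers to `{a}` breaks the condition for the pair `s1`, `s2`.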

Previous HSI Generation Method
The HSI construction algorithm given in [25] takes an FSM M and CS $W$ for M and in the first step the algorithm constructs an SI for the initial state ($s_0$). To do so the algorithm generates a subset $H_0$ of $W$ such that for all $s_j \in S \setminus \{s_0\}$ there exists at least one input sequence $\bar{x}$ in $H_0$ such that $\bar{x}$ is a separating sequence for $s_0$ and $s_j$. The remaining SIs are computed in the second phase. For state $s_i$, the algorithm finds a subset $H_i$ of $W$ such that
1) For $j < i$, there exists $\bar{x} \in H_i$ and $\bar{x}' \in H_j$ such that some input sequence in $pref(\bar{x}) \cap pref(\bar{x}')$ distinguishes $s_i$ and $s_j$.
2) For all $s_j$ with $i < j$, there exists $\bar{x} \in H_i$ and a prefix $\bar{x}'$ of $\bar{x}$ with $\bar{x}' \in L^I_M(s_j)$ (the set of input sequences defined in $s_j$) such that $\bar{x}'$ distinguishes $s_i$ and $s_j$.

NOVEL CS GENERATION ALGORITHM
The previously devised CS generation algorithm, for partial FSMs, has exponential worst case execution time. In this section we devise a polynomial time algorithm. The first step of the proposed algorithm computes separating sequences of length one. The following is an immediate consequence of Proposition 2.1.
Proposition 3.1. Let M be a minimal FSM. Then there exists a pair of states $(s, s') \in \mathcal{S}_M$ and an input $x \in X$ such that $x$ distinguishes $(s, s')$.
After the first step we have two sets of pairs of states: a set $P_{\checkmark}$ of pairs with separating sequences of length one and a set $P_{\times}$ of pairs whose separating sequences have not yet been computed. In the second step, the algorithm computes new separating sequences through using the previously computed separating sequences.
Proposition 3.2. Let M be a minimal FSM, let $P_{\checkmark} \neq \emptyset$ be the set of pairs of states with known separating sequences, and let $P_{\times} = \mathcal{S}_M \setminus P_{\checkmark}$. Then there exists $(s, s') \in P_{\times}$, $(s'', s''') \in P_{\checkmark}$ with separating sequence $\bar{x}$, and a defined input $x \in X$ for $s, s'$ such that $x\bar{x}$ is a separating sequence for $(s, s')$.
We now present terminology used in this section. A pair-node $\eta$ is a tuple $(s_i, s_j, \vartheta)$ such that $(s_i, s_j) \in \mathcal{S}_M$ ($i < j$) and $\vartheta \in X^*$ is a separating sequence for $s_i$ and $s_j$ or is $\varepsilon$ if such a separating sequence has not yet been found. We use $\lambda(\eta, x)$ to denote $\{\lambda(s_i, x), \lambda(s_j, x)\}$. If $\vartheta = \varepsilon$, $\delta(\{s_i, s_j\}, x) = \{s, s'\}$ and there is a pair-node $\eta' = (s, s', \vartheta')$ with $\vartheta' \neq \varepsilon$ then we say that $\eta$ 'evolves to' (becomes) $(s_i, s_j, x\vartheta')$. In such a situation, $x\vartheta'$ is a separating sequence for $s_i, s_j$.
The algorithm (Algorithm 1) initialises $P$: for each $(s, s') \in \mathcal{S}$ it adds pair-node $\eta = (s, s', \varepsilon)$ (Line 1). This requires $O(n^2)$ time. The algorithm then enters a loop (first-loop) in which it computes separating sequences of length one, iterating over $P$ and $X$ (Lines 2-4). The first-loop requires $O(n^2 p)$ time. The algorithm then enters a while loop (main-loop) and computes separating sequences through evolving the elements of $P$. At each iteration the algorithm enters a for-loop that iterates over $P \times X$. For a pair-node $\eta$ from set $P$ with $\vartheta = \varepsilon$ and input $x \in X$, the algorithm checks whether $\eta$ evolves to a pair-node $\eta'$ (with $x$) that has separating sequence $\vartheta' \neq \varepsilon$. If so, the algorithm uses the separating sequence $\vartheta = x\vartheta'$ (Lines 6-8). If, after the for-loop, no pair-node has evolved, the algorithm declares that M is not minimal (Lines 9-10) and otherwise it continues. If all items in $P$ have non-empty separating sequences, the algorithm returns $P$ and terminates. To find a pair to be evolved the main loop can iterate $O(n^2 p)$ times; since there are $O(n^2)$ pairs, the main-loop requires $O(n^4 p)$ time. The main-loop is the most expensive component. The following results, which demonstrate that Algorithm 1 is correct, follow immediately from the construction of the algorithm and Proposition 2.1.
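The first-loop and main-loop described above can be sketched as follows. This is an illustrative reading of the prose, not a transcription of Algorithm 1: the function name, the dictionary of pair-nodes and the machine `TRANS` are assumptions, and `None` stands in for an empty separating sequence.

```python
from itertools import combinations

# Hypothetical partial FSM: (state, input) -> (output, next_state).
TRANS = {
    ("s1", "a"): ("0", "s2"),
    ("s2", "a"): ("0", "s3"),
    ("s3", "a"): ("1", "s1"),
    ("s2", "b"): ("0", "s1"),
    ("s3", "b"): ("1", "s3"),
}


def separating_sequences(trans, states, inputs):
    """Map every pair (s_i, s_j), i < j, to a separating sequence."""
    # Initialisation: one pair-node per pair, no separating sequence yet.
    sep = {pair: None for pair in combinations(states, 2)}

    # First-loop: inputs defined at both states that yield different outputs.
    for (s, t) in sep:
        for x in inputs:
            if (s, x) in trans and (t, x) in trans:
                if trans[(s, x)][0] != trans[(t, x)][0]:
                    sep[(s, t)] = (x,)
                    break

    # Main-loop: evolve unresolved pair-nodes through already resolved ones.
    while any(seq is None for seq in sep.values()):
        evolved = False
        for (s, t), seq in list(sep.items()):
            if seq is not None:
                continue
            for x in inputs:
                if (s, x) not in trans or (t, x) not in trans:
                    continue
                s2, t2 = trans[(s, x)][1], trans[(t, x)][1]
                if s2 == t2:
                    continue
                tail = sep.get((s2, t2)) or sep.get((t2, s2))
                if tail is not None:
                    sep[(s, t)] = (x,) + tail
                    evolved = True
                    break
        if not evolved:
            raise ValueError("FSM is not minimal")
    return sep
```

On the example machine the pair `s1`, `s2` has no separating input of length one, so it is resolved in the main-loop by prefixing `a` to the sequence already known for `s2`, `s3`.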

PARALLEL CS AND HSI ALGORITHMS
In this section we first provide the approach employed to address scalability problems and then the parallel CS and HSI generation algorithms.

Design Choices
There are two main strategies: the Fat Thread strategy and the Thin Thread strategy [50]. The fat thread approach minimises data access latency by having threads process a large amount of data on shared memory [50]. However, the number of threads may be restricted by the available shared memory and this may reduce performance.
In contrast, the thin thread approach aims to maximise the number of threads by storing only small amounts of data in shared memory. Although global memory transactions are relatively slow, it has been reported that the high global memory transaction latency can be hidden when there are many threads [50]. In this work we employed the thin thread strategy.

Parallel CS Algorithm

Parallel CS Algorithm: Parallel Design
To implement a thin thread based algorithm we propose what we call a conditional pair-node vector (CPn-vector for short), and later we will see that a scalable CS generation algorithm can be based on this. A CPn-vector $P$ captures information related to pair-nodes plus additional information that will allow us to evolve its elements.
Definition 4.1. A conditional pair-node vector (CPn-vector) $P$ for an FSM $M = (S, s_0, X, Y, h)$ with $n$ states is a vector with $n(n-1)/2$ conditional pair-nodes. Each element $\rho$ in the vector $P$ is a tuple $(f, \eta)$ for a pair-node $\eta = (s, s', \vartheta)$, and a flag $f \in \{T, F\}$ that states whether states $s, s'$ are distinguished by $\vartheta$.
The following are based on Theorem 3.3 and the definition of evolution of CPns and show how CPn-vectors are related to the states distinguished.
Lemma 4.1. Given CPn-vector $P$, if $P$ contains $\rho = (T, (s, s', \vartheta))$ then $\vartheta$ is a separating sequence for $(s, s')$.
Theorem 4.1. Given FSM $M = (S, s_0, X, Y, h)$, if the flags of all elements of CPn-vector $P$ of M are set to true then the input sequences retrieved from pair-nodes of $P$ define a Characterising Set for M.
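A flat vector with one element per pair needs a mapping between a position in the vector (which a thread could use as its index) and the pair of states stored there. The row-major layout below is an assumption made for illustration and is not necessarily the layout used in the implementation; the function names are hypothetical.

```python
def pair_index(i, j, n):
    """Position of pair (i, j), with 0 <= i < j < n, in row-major order."""
    return i * n - i * (i + 1) // 2 + (j - i - 1)


def index_pair(k, n):
    """Inverse of pair_index: the pair stored at position k."""
    i = 0
    while k >= n - 1 - i:  # skip whole rows until k falls inside row i
        k -= n - 1 - i
        i += 1
    return i, i + 1 + k


def init_cpn_vector(n):
    """One element (flag, i, j, sequence) per pair; flag False means not yet distinguished."""
    return [(False, i, j, ()) for i in range(n) for j in range(i + 1, n)]
```

For $n = 4$ the vector has six elements, ordered (0,1), (0,2), (0,3), (1,2), (1,3), (2,3).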

Parallel CS Algorithm: An Overview
The parallel-CS algorithm first initialises the CPn-vector $P$ in parallel (Line 1). A naïve implementation would require $O(n(n-1)/2)$ time to initialise $P$. However, as we will see later, the initialisation of $P$ can be achieved in $O(n)$ time if $G \geq n$, where $G$ is the number of threads used by the GPU. Afterwards, the algorithm enters the first-loop in which it finds all elements that are distinguished by a single input (Lines 2-4). This step takes $O(n(n-1)p/(2G))$ time. The algorithm then enters the main-loop. In the main-loop the algorithm applies all inputs to elements of $P$ whose flag is $F$ and evolves elements. If an evolution to a distinguished pair of states is possible then the flag values are set to $T$ (Lines 6-8) (can be achieved in $O(n(n-1)p/(2G))$ time). If no elements of $P$ have changed then the algorithm terminates, otherwise it continues (Lines 9-10). This step may also require $O(n(n-1)p/(2G))$ time. As the length of a separating sequence is bounded above by $n(n-1)/2$, the main-loop iterates $O(n^2)$ times and hence the algorithm requires $O(n^4 p/G)$ time.
The existing HSI generation algorithm takes as input a CS for the FSM M. However, we require $O(n^4)$ space to store a CS for an FSM with $n$ states. This has at least two practical implications: (1) it may be impossible to keep data in the main memory and (2) threads will process a very large amount of data.
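The round structure of the parallel-CS algorithm can be emulated sequentially: in each round every unflagged element reads a snapshot of the vector and writes only its own slot, which is what one GPU thread per element would do. The sketch below is a simplification under stated assumptions: it folds the first-loop into the same round function, and the names `parallel_round` and `TRANS` are hypothetical.

```python
# Hypothetical partial FSM: (state, input) -> (output, next_state).
TRANS = {
    ("s1", "a"): ("0", "s2"),
    ("s2", "a"): ("0", "s3"),
    ("s3", "a"): ("1", "s1"),
    ("s2", "b"): ("0", "s1"),
    ("s3", "b"): ("1", "s3"),
}


def parallel_round(trans, inputs, vec):
    """Run one synchronous round over vec; return True if any element changed.

    Each element is (flag, s, t, sequence). Reads use a snapshot taken at the
    start of the round, so the result does not depend on processing order.
    """
    index = {(s, t): k for k, (_, s, t, _) in enumerate(vec)}
    snapshot = list(vec)
    changed = False
    for k, (flag, s, t, _) in enumerate(snapshot):  # one iteration per "thread"
        if flag:
            continue
        for x in inputs:
            if (s, x) not in trans or (t, x) not in trans:
                continue
            (ys, s2), (yt, t2) = trans[(s, x)], trans[(t, x)]
            if ys != yt:  # distinguished by a single input
                vec[k] = (True, s, t, (x,))
                changed = True
                break
            if s2 == t2:
                continue
            key = (s2, t2) if (s2, t2) in index else (t2, s2)
            flag2, _, _, seq2 = snapshot[index[key]]
            if flag2:  # evolve through an already distinguished pair
                vec[k] = (True, s, t, (x,) + seq2)
                changed = True
                break
    return changed
```

Calling `parallel_round` repeatedly until it returns `False` mirrors the termination test described above; if some flag is still `False` at that point, the machine is not minimal.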
Recently, Hierons and Türker proposed a heuristic to construct HSIs for complete FSMs that overcomes this bottleneck [51]. Instead of using an existing CS, they construct HSIs from incomplete distinguishing sequences. Their algorithm keeps a list ($Q$) of pairs of states (a list for the items of set $\mathcal{S}$) such that at each iteration an input sequence that removes the maximum number of pairs from $Q$ is selected and the algorithm terminates when $Q$ is empty. In the parallel-HSI algorithm, we adopted this strategy by using state-trace vectors. A state-trace vector contains the information regarding which pairs from $\mathcal{S}$ have been distinguished by an input sequence $\bar{x}$.
Definition 4.2. A state-trace vector (ST-vector) $D$ for an FSM $M = (S, s_0, X, Y, h)$ with $n$ states is a vector associated with input sequence $\bar{x} \in X^*$ and having $n$ elements such that: an element $d$ of $D$ is a tuple $(s_i, s_c, y)$ such that $s_i$ is an initial state, $s_c = \delta(s_i, \bar{x})$ is a current state, and $y = \lambda(s_i, \bar{x})$ is an output sequence.
We may need to construct a set of ST-vectors since there may not be a single input sequence that distinguishes all pairs of states.
Theorem 4.3. A set of state-trace vectors for an FSM M defines an HSI for M if for every pair of states $s_i, s_j$ there exists a state-trace vector $D$ associated with input sequence $\bar{x}$ such that there exists $\bar{x}' \in pref(\bar{x})$ that is a separating sequence for $s_i$ and $s_j$.
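An ST-vector can be built by running one input sequence from every state and recording where each run ends and what it outputs. The sketch below assumes that a run stops at the first undefined input and keeps the outputs of the defined prefix; that convention, the function names and the machine `TRANS` are illustrative assumptions.

```python
# Hypothetical partial FSM: (state, input) -> (output, next_state).
TRANS = {
    ("s1", "a"): ("0", "s2"),
    ("s2", "a"): ("0", "s3"),
    ("s3", "a"): ("1", "s1"),
    ("s2", "b"): ("0", "s1"),
    ("s3", "b"): ("1", "s3"),
}


def st_vector(trans, states, seq):
    """One element (initial state, current state, outputs) per state."""
    vec = []
    for s in states:
        cur, out = s, []
        for x in seq:
            if (cur, x) not in trans:
                cur = None  # seq is not defined here; stop evolving
                break
            y, cur = trans[(cur, x)]
            out.append(y)
        vec.append((s, cur, tuple(out)))
    return vec


def distinguished_pairs(vec):
    """Pairs of initial states separated by some prefix defined at both states."""
    found = set()
    for i in range(len(vec)):
        for j in range(i + 1, len(vec)):
            m = min(len(vec[i][2]), len(vec[j][2]))
            if vec[i][2][:m] != vec[j][2][:m]:
                found.add((vec[i][0], vec[j][0]))
    return found
```

For the example machine the sequence `aa` distinguishes all three pairs, whereas `b` distinguishes only `s2` from `s3` because it is undefined at `s1`.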

Parallel-HSI Algorithm: An Overview
We now give a brief overview of the algorithm and in the Appendix we show how the parallel-HSI algorithm was implemented using GPUs. For an FSM with $n$ states, the parallel HSI algorithm uses a boolean valued vector $B$ of length $n(n-1)/2$. The elements of $B$ correspond to pairs in $\mathcal{S}$: the $i$th pair of $\mathcal{S}$ corresponds to the $i$th item of $B$ and is set to 1 if these states have been distinguished. Initially all elements in $B$ are set to 0 and when all elements are 1 the algorithm terminates.
The parallel HSI algorithm (Algorithm 3) has three nested loops. The first loop (main-loop) iterates as long as at least one element in $B$ is 0. In every iteration, the algorithm checks whether the upper-bound on the length of the input sequence $\bar{x}$ has been reached. If so, the algorithm terminates. Otherwise, the algorithm enters the middle-loop and increments $\ell$ (initially $\ell = 0$). The middle-loop iterates as long as there exists an unprocessed input sequence of length $\ell$ and not all elements in $B$ are 1.
In the middle-loop the algorithm first resets the ST-vector $D$. It then receives the next input sequence $\bar{x}$ of length $\ell$ and evolves elements in $D$ with $\bar{x}$. Since the FSM is partially specified, the algorithm evolves an element if $\bar{x}$ is defined at the associated state. Then the algorithm executes the inner loop (for-loop). The for-loop iterates over the states and for state $s_i$ it compares the output sequence obtained from $s_i$ and all other states (in parallel). If there exists a state $s_j$ that produces an output sequence that is different from that produced from $s_i$, then the algorithm writes 1 to the corresponding element of $B$ (corresponding to $\gamma_{ij}$ or $\gamma_{ji}$). Later it writes $\bar{x}$ to the corresponding SIs ($H_i$s) of distinguished pairs. For state $s_i$ there are $n-1$ pairs in $\mathcal{S}$ (and in $B$) with state $s_i$ and so the size of $H_i$ cannot be larger than $n-1$.
The process of finding a separating sequence for a pair of states might require all possible input sequences to be considered. Therefore the worst case time complexity of the algorithm is exponential.
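The three nested loops can be sketched sequentially as below. This is a simplified illustration, not Algorithm 3: it only credits a sequence to a pair when the whole sequence is defined at both states, and the names `brute_force_hsi`, `trace`, `max_len` and `TRANS` are assumptions.

```python
from itertools import combinations, product

# Hypothetical partial FSM: (state, input) -> (output, next_state).
TRANS = {
    ("s1", "a"): ("0", "s2"),
    ("s2", "a"): ("0", "s3"),
    ("s3", "a"): ("1", "s1"),
    ("s2", "b"): ("0", "s1"),
    ("s3", "b"): ("1", "s3"),
}


def trace(trans, s, seq):
    """Outputs along the longest prefix of seq that is defined at s."""
    out = []
    for x in seq:
        if (s, x) not in trans:
            break
        y, s = trans[(s, x)]
        out.append(y)
    return tuple(out)


def brute_force_hsi(trans, states, inputs, max_len):
    """Return {state: set of sequences}, or None if max_len is too small."""
    pairs = list(combinations(range(len(states)), 2))
    done = [False] * len(pairs)  # plays the role of the vector B
    H = {s: set() for s in states}
    for length in range(1, max_len + 1):               # main-loop
        for seq in product(inputs, repeat=length):     # middle-loop
            traces = [trace(trans, s, seq) for s in states]
            for k, (i, j) in enumerate(pairs):         # inner loop
                if done[k]:
                    continue
                both_defined = len(traces[i]) == length == len(traces[j])
                if both_defined and traces[i] != traces[j]:
                    done[k] = True
                    H[states[i]].add(seq)
                    H[states[j]].add(seq)
            if all(done):
                return H
    return None
```

Because sequences are enumerated in order of increasing length, each pair is credited with one of its shortest separating sequences, and the number of candidate sequences grows as $p^{\ell}$ with the length $\ell$.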

Generating Characterising Sets from HSIs
We first show how HSIs relate to characterising sets.
Lemma 4.2. Let $M = (S, s_0, X, Y, h)$ be an FSM and $H$ be a set of harmonised state identifiers for M. Then $\bigcup_{H_i \in H} H_i$ defines a characterising set of M.
Following the intuition provided by Lemma 4.2 we can construct a CS by using the parallel-HSI algorithm through replacing Line 18 with the following line:
It is possible to parallelise the process of taking the union of harmonised state identifiers through two steps. In the first step we sort all the input sequences in parallel and in the second step we pick unique state identifiers to form the CS. We present details in the Appendix.
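The sort-then-pick-unique scheme for taking the union can be illustrated sequentially. In the sketch below Python's built-in `sorted` stands in for the parallel sort, and the function name and example identifiers are hypothetical.

```python
def cs_from_hsi(H):
    """Union of the state identifiers in H, via sorting and dropping adjacent duplicates."""
    # Step 1: gather and sort every sequence from every identifier.
    all_seqs = sorted(seq for identifier in H.values() for seq in identifier)
    # Step 2: after sorting, duplicates are adjacent, so keep a sequence
    # only when it differs from its predecessor.
    cs = []
    for seq in all_seqs:
        if not cs or cs[-1] != seq:
            cs.append(seq)
    return cs
```

Comparing each element only with its predecessor means the second step needs no shared state between elements, which is what makes it straightforward to run element-wise in parallel.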

Experimental Design
The experiments had two main aims.
1) To explore how well the algorithms scaled. Therefore, we recorded the time taken.
2) To explore properties of the CSs and HSIs constructed. We recorded the number of sequences and the lengths of these sequences since fewer/shorter sequences lead to cheaper testing.
Initial experiments used FSMs generated by the tool used in [52]. This randomly assigns $\delta(s, x)$ and $\lambda(s, x)$ for each $s \in S$ and $x \in X$, discarding the FSM if it is not strongly connected and minimal. After constructing M, the tool randomly selects $K$ with $1 \leq K \leq np$ and then $K$ state-input pairs. For a pair $(s, x)$ it erases the transition of M with start state $s$ and input $x$. If deleting a transition $t$ disconnects M then $t$ is retained and another pair chosen. We used the tool to construct three test suites.
In test suite one (TS1), for each $n \in \{2^6, 2^7, \ldots, 2^{17}\}$ we had 100 FSMs with number of inputs/outputs $p/q = 3/3$. These experiments explored the performance of the algorithms under varying numbers of states. To see the effect of the numbers of inputs and outputs we constructed TS2 and TS3. In TS2 we set $n = 1{,}024$ and $q = 3$ and for each of $p \in \{2^4, 2^5, \ldots, 2^8\}$ we had 100 FSMs. For TS3 we set $n = 1{,}024$ and $p = 3$ and for each of $q \in \{2^4, 2^5, \ldots, 2^8\}$ we again had 100 FSMs. Therefore we used 2,200 FSMs (see footnote 3).
It is possible that FSM specifications of real life systems differ from randomly generated FSMs. Therefore we also performed experiments on case studies: FSMs from the ACM/SIGDA benchmarks. This is a set of FSMs used in workshops in 1989-91-93 [31]. In Fig. 2 we present the specifications.

Experimental Settings
Throughout this section we use SEQ-BF-CS to refer to the sequential brute force CS generation algorithm [25] and SEQ-BF-HSI to refer to the sequential brute force HSI generation algorithm [25]. PAR-BF-HSI, PAR-BF-CS, SEQ-PLY-CS, PAR-PLY-CS will denote the parallel-HSI algorithm, the CS generation algorithm based on the parallel-HSI algorithm, the sequential polynomial time CS generation algorithm, and the parallel polynomial time CS generation algorithm respectively. We set a bound of $n$ on the length of sequences considered in the PAR-BF-CS algorithm. However, this did not affect the results regarding scalability since there were no cases where the PAR-BF-CS algorithm failed to find separating sequences and other algorithms returned separating sequences of length greater than $n$.
We used an Intel Core 2 Extreme CPU (Q6850) with 8 GB RAM and NVIDIA TESLA K40 GPU under 64 bit Windows Server 2008 R2. During the experiments we stored the generated CSs and HSIs on the hard disk drive as for large FSMs the available CPU/GPU memory becomes insufficient. The timing information does not include the time for storing the sequences.
Finally, to perform the experiments in an acceptable amount of time, we set 1,500 seconds as the limiting time. The bottleneck for PAR-BF-HSI is the $B$ vector, requiring $n(n-1)/2$ boolean variables. PAR-BF-HSI was able to process FSMs with 131,072 states, making PAR-BF-HSI 128 times more scalable than the existing HSI construction algorithm. The bottleneck for PAR-PLY-CS is the $P$ vector, requiring $2n(n(n-1)/2)$ integer values plus $n(n-1)/2$ boolean values. PAR-PLY-CS was able to process FSMs with 32,768 states making PAR-PLY-CS 16 times more scalable than the sequential brute-force characterising set generation algorithm.

The Effect of the Number of States
Table 1 shows the mean sequence lengths. For CSs there are no differences. In addition, the mean sequence length for PAR-BF-HSI was slightly less than that for SEQ-BF-HSI. However, when we applied the non-parametric Kruskal-Wallis [53] significance test, we found that the difference was not statistically significant.
3. After an FSM is generated we do not check whether it is minimal. However, during the experiments we found that 316 FSMs were not minimal. When such an FSM was found we simply generated a replacement and so each test suite contained 100 minimal FSMs.
Table 2 shows the mean size of state identifiers. PAR-BF-HSI tended to generate SIs that, on average, were slightly smaller than those returned by SEQ-BF-HSI. According to the Kruskal-Wallis test there is a statistically significant difference when $n \geq 256$. This indicates that as the number of states increases, the parallel HSI generation algorithm tends to find fewer SIs. Moreover, we can see that the number of CSs generated does not depend on whether the algorithm is sequential or parallel. Overall, the results suggest that PAR-BF-HSI constructs more compact HSIs and does so faster, and that SEQ-PLY-CS and PAR-PLY-CS are faster than the brute-force versions.

The Effect of the Number of Inputs and Outputs
The results for TS2 and TS3 are in Fig. 5. As the number of inputs increases (Figs. 5a and 5c), the time required to construct CSs and HSIs increases regardless of the algorithm used. However, the mean lengths of state identifiers reduce as we increase the number of inputs (Fig. 5e). This reflects there being more transitions that might be used in finding separating sequences. Fig. 5g shows the mean number of separating sequences in HSIs, the differences between the HSIs constructed by SEQ-BF-HSI and PAR-BF-HSI being limited. The number of separating sequences constructed by SEQ-BF-CS, PAR-BF-CS, SEQ-PLY-CS and PAR-PLY-CS is shown in Fig. 5i. The number of separating sequences reduces with the number of inputs, indicating that when there are more inputs it was possible to find separating sequences that distinguish more states.
Consider now the experiments where we increased the number of outputs (TS3). The time required, the mean length of the separating sequences and the mean number of separating sequences produced are in Figs. 5b, 5d, 5f, 5h, and 5j. The results are as expected: as the number of outputs increases it is easier to find separating sequences (one expects separating sequences to be shorter) and so the time taken reduces.

Benchmark FSMs
We found that PAR-BF-CS, PAR-PLY-CS, and SEQ-BF-CS generated identical CSs for each FSM except for scf. Only the PAR-BF-CS and PAR-PLY-CS algorithms were able to construct CSs for scf (Fig. 6 gives the times). Moreover, PAR-BF-HSI and SEQ-BF-HSI generated identical state identifiers except for scf, for which SEQ-BF-HSI was not able to construct HSIs.
The results are promising. Although the FSMs are relatively small, we see that the parallel algorithms outperformed the sequential algorithms.

Discussion
The proposed algorithms accelerated the construction of CSs and HSIs and increased scalability. However, randomly generated FSMs will tend to have separating sequences that are much shorter than the upper bound. Sokolovskii introduced a special class of (complete) FSMs that we call s-FSMs [54]. The shortest separating sequence for states $s_1$ and $s_2$ of an s-FSM with $n$ states has length $n - 1$. We performed additional experiments with a set of s-FSMs to explore how the algorithms perform when the separating sequences are relatively long. The transition and the output functions of an s-FSM are defined as follows, in which $n' = n/2$ and $n > 2$:
We generated 600 s-FSMs with $n$ states, $n \in \{10, 20, \ldots, 6{,}000\}$. The results for the brute-force approaches (Table 3) indicate that scalability drops drastically for FSMs with long separating sequences. SEQ-BF-CS, PAR-BF-HSI, and PAR-BF-CS can process s-FSMs with up to 30 states. While SEQ-BF-HSI is very fast, this is because it does not construct separating sequences: for state $s$ it picks state identifiers from CSs that have previously been constructed. However, it requires the output of SEQ-BF-CS and so SEQ-BF-HSI also is not scalable for such FSMs.
In contrast, SEQ-PLY-CS and PAR-PLY-CS were able to process all 600 s-FSMs. When n = 30, SEQ-PLY-CS required 150 milliseconds and PAR-PLY-CS required 2.98 milliseconds, and when n = 6,000 SEQ-PLY-CS required 307 seconds and PAR-PLY-CS required 2.2 seconds. We present the results of these experiments in Fig. 7. These suggest that for s-FSMs, SEQ-PLY-CS and PAR-PLY-CS are at least 200 times more scalable than SEQ-BF-CS, PAR-PLY-CS is 537,000 times faster than SEQ-BF-CS, and SEQ-PLY-CS is 10,357 times faster than SEQ-BF-CS.
We also investigated the effect of K by examining the relationship between performance and K for PAR-BF-HSI when n = 4,096 in TS1. In Table 4 we observe that as we increase the number of transitions we increase the number of separating sequences but reduce the mean separating sequence length and the time required by PAR-BF-HSI. These results are unsurprising since there being many transitions typically allows pairs of states to be distinguished using shorter sequences.

THREATS TO VALIDITY
Threats to internal validity concern factors that might introduce bias and so largely concern the tools used in the experiments. The tool that generated the FSMs used as experimental subjects has previously been used, and we also tested it. We carefully checked and tested the implementations of the algorithms. We used C++ to code the sequential CS and HSI methods and CUDA C++ to implement the parallel algorithms. We used the CUDA Thrust library for sorting output sequences while constructing characterising sets.
Threats to construct validity refer to the possibility that the properties measured are not those of interest in practice. Our main concern was scalability and this is important if FSM based test techniques are to be applied to larger systems. However, the sets of separators will typically be generated to be used within a test generation algorithm and so we are also interested in the potential effect on the size of the resultant test. There are two aspects that affect this: the number of sequences in a CS/HSI and the lengths of these sequences. We therefore measured mean values for these in the experiments.
Threats to external validity concern the degree to which we can generalise from the results. One cannot avoid such threats since the set of 'real FSMs' is not known and we cannot uniformly sample from this. To reduce this threat we varied the number of states, inputs and outputs. We also used FSMs from industry.

CONCLUSIONS AND FUTURE WORK
This paper explored the use of GPU computing to devise massively parallel algorithms for generating CSs/HSIs. The main motivation was to make such algorithms scale to larger FSMs. An FSM $M$ used in test generation could represent the semantics of a model $M'$ written in a more expressive language such as state-charts or SDL, and even quite modest models can result in large FSMs. We used the thin thread strategy in which relatively little data is stored in shared memory, this allowing there to be many threads. We devised polynomial time algorithms and massively parallel versions of the brute-force and polynomial time algorithms.
We performed experiments with randomly generated partial FSMs, the parallel algorithms being much quicker than the sequential algorithms and scaling much better. Interestingly, the parallel brute-force algorithm, with exponential worst case complexity, outperformed the sequential polynomial time algorithm. The parallel version of the polynomial time algorithm was fastest but did not scale as well as the parallel brute-force algorithm due to its memory requirements. When the techniques were applied to FSMs with relatively long separating sequences, only the polynomial time algorithms scaled to FSMs with 40 states or more. However, these algorithms scaled quite well, easily handling FSMs with 6,000 states. As expected, the parallel version of the polynomial time algorithm was much faster than the sequential version.
There are several lines of future work. First, experiments might explore how the results change when using different GPU cards. Second, it would be interesting to run additional experiments with FSMs generated from models in languages such as SDL and state-charts. Third, there is the challenge of designing and implementing parallel algorithms for CS and HSI algorithms that can run on multi-core systems. Finally, there is the potential to investigate new approaches that are capable of constructing HSIs for larger FSMs.

APPENDIX
We first discuss issues affecting performance and then describe how the parallel algorithms were implemented. Table 5 gives the terms used.

Performance Considerations
In implementing a massively parallel algorithm, one needs to consider coalesced memory transactions and thread divergence. As we followed the thin thread strategy, we needed to perform many global memory transactions. In a GPU, global memory is accessed in aligned chunks of 32, 64 or 128 bytes. If the threads of a block access global memory in the right pattern, the GPU can pack several accesses into one memory transaction. This is a coalesced memory transaction [55]. For example, whenever a thread $t_i$ requests a single item, the entire line from global memory is brought to cache. If thread $t_{i+1}$ requests the neighbouring item, $t_{i+1}$ can read this data from the cache. As reading global memory is hundreds of times slower than reading cache memory, coalesced memory access may drastically improve performance. Therefore, to reduce the time spent on global memory transactions, we needed an appropriate storage layout.
All threads in a multiprocessor execute the same code (kernel). If the code has branching statements such as if or switch then some of the threads may follow different branches and hence different threads need to execute different lines of code. In such cases the GPU will serialise execution: a GPU is not capable of executing if or switch like statements in parallel. This problem is known as thread divergence [55]. However, if one can guarantee that the threads execute the same sequence of instructions then one can use branching statements.

Parallel-CS Algorithm Data Structures
In implementing the Parallel-CS algorithm we used the P vector and the FSM vector. The P vector holds $n(n-1)/2$ CPn-vectors and we use P[i] to denote the $i$th element of P. Such an element holds a pair-node and a flag, where a pair-node consists of a pair of states $(s, s')$ and possibly a separating sequence $\vartheta$ for this pair. Although the maximum length of $\vartheta$ is $n(n-1)/2$, we set this bound as $n$, and therefore the memory required for a CPn-vector was of $O(n^3)$.
The transition structure of the FSM is kept in the FSM vector. For state $s$ and input $x$, the FSM vector returns output $y \in Y \cup \{\varepsilon\}$ and next state $s' \in S \cup \{e\}$. The size of the FSM vector is therefore $2n|X|$. For a thread $t_i$ and an input sequence $\bar{x}$ of length greater than 1, reads on the FSM vector may not be coalesced as the memory access pattern on FSM is data dependent. For example, let us assume that we need to apply $\bar{x} = x_1 x_2$ to $s_i$. For $x_1$, thread $t_i$ will retrieve output and next state ($s_j$) information from the $i$th location of FSM and it will then apply input $x_2$ to $s_j$, which will cause thread $t_i$ to access a different location of the FSM vector.
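The flat layout and the data-dependent access pattern can be illustrated with a small sequential sketch in Python. The interleaved (output, next state) layout, the `UNDEF` encoding of undefined entries, and the helper names `build_fsm_vector` and `run` are assumptions made for illustration; the paper does not specify them.

```python
UNDEF = -1  # assumed encoding for both the undefined output and the undefined next state


def build_fsm_vector(n, num_inputs, transitions):
    """Flatten a partial FSM into a list of 2 * n * num_inputs integers.

    transitions maps (state, input) -> (next_state, output); missing keys
    are undefined transitions.
    """
    vec = [UNDEF] * (2 * n * num_inputs)
    for (s, x), (nxt, out) in transitions.items():
        base = 2 * (s * num_inputs + x)
        vec[base] = out
        vec[base + 1] = nxt
    return vec


def run(vec, num_inputs, s, seq):
    """Apply seq from state s; return (final_state, outputs) or None if undefined."""
    outputs = []
    for x in seq:
        # The address read here depends on the state reached so far, which is
        # why neighbouring threads need not read neighbouring locations.
        base = 2 * (s * num_inputs + x)
        out, nxt = vec[base], vec[base + 1]
        if nxt == UNDEF:
            return None
        outputs.append(out)
        s = nxt
    return s, outputs


transitions = {(0, 0): (1, 0), (1, 1): (0, 1)}
fsm = build_fsm_vector(2, 2, transitions)
```

With this layout the vector has exactly $2n|X|$ entries, and the second read of `run` lands wherever the first transition leads.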

Initiating the P Vector
First note that for a pair $(s_i, s_j)$ there are $n - i$ elements in P that start with $s_i$. Therefore, if an initialisation kernel receives the P vector and the integers $i$ and $n$ as its parameters and is launched with $n - i$ threads, it can set $n - i$ elements of P. Since $i$ varies from 1 to $n - 1$, the Host can call the kernel $O(n)$ times. If $G \geq n$ the time required for initialisation is of $O(n)$ as stated in Section 4.2.
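One way to realise this initialisation is sketched below in Python, with each inner loop iteration standing in for one GPU thread. The row-major ordering of pairs, the closed-form `pair_index`, and the dictionary used for each element are assumptions for illustration, not the authors' data layout.

```python
def pair_index(i, j, n):
    """0-based position of pair (s_i, s_j), with 1 <= i < j <= n, in row-major order."""
    if i > j:
        i, j = j, i
    # Pairs starting with s_1, ..., s_{i-1} occupy (n-1) + ... + (n-i+1) slots.
    return (i - 1) * n - i * (i - 1) // 2 + (j - i - 1)


def init_pairs(n):
    """Fill the P vector with one simulated kernel launch per value of i."""
    P = [None] * (n * (n - 1) // 2)
    for i in range(1, n):          # the host calls the kernel n - 1 times
        for t in range(n - i):     # n - i simulated threads per launch
            j = i + 1 + t
            P[pair_index(i, j, n)] = {"pair": (i, j), "seq": [], "flag": 0}
    return P
```

Each launch touches a contiguous block of $n - i$ elements, so neighbouring threads write neighbouring locations.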

Computing Initial Separating Sequences
To compute the initial separating sequences, we can launch a kernel with $n(n-1)/2$ threads and with the FSM and the P vectors, inputs $X$ and a boolean variable isMin (set to F prior to the kernel call) as parameters. Thread $t_i$ processes one element $P[i] = ((s, s', \vartheta), f)$ in a for-loop. The for-loop iterates over inputs $X$ and in each iteration the kernel applies input $x$ to states $s, s'$, retrieves outputs from the FSM vector, and then checks whether $\lambda(s, x) \neq \lambda(s', x)$, $\lambda(s, x) \neq \varepsilon$, and $\lambda(s', x) \neq \varepsilon$. If these conditions are met, the thread sets the flag to 1, the separating sequence to $x$, and isMin to T, and exits from the for-loop. More than one thread can set isMin; as the only value to be written to isMin is T, this does not cause a race. After all threads have finished, the kernel returns and the algorithm checks the value of isMin. If it is F, the algorithm declares that M is not minimal and terminates; otherwise it continues.
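The per-thread logic of this step can be sketched sequentially in Python, with one loop iteration per pair playing the role of one thread. The dictionary-based output function `out` (missing keys meaning undefined) and the element layout are assumptions chosen for readability.

```python
from itertools import combinations


def initial_separating_sequences(n, inputs, out):
    """out maps (state, input) -> output; a missing key means the output is undefined."""
    P = [{"pair": (s, t), "seq": None, "flag": 0}
         for s, t in combinations(range(n), 2)]
    is_min = False
    for entry in P:                       # one simulated thread per pair
        s, t = entry["pair"]
        for x in inputs:
            ys, yt = out.get((s, x)), out.get((t, x))
            # x separates s and t if it is defined in both and the outputs differ.
            if ys is not None and yt is not None and ys != yt:
                entry["seq"] = [x]
                entry["flag"] = 1
                is_min = True             # only True is ever written, so no race
                break
    return P, is_min
```

If `is_min` is still `False` afterwards, no pair was separated by a single input and the sketch mirrors the case in which M is declared not minimal.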

Evolving the P Vector
After some separating sequences are set, the algorithm uses these to compute additional separating sequences. To achieve this we implemented two kernels. The first kernel is launched with $n(n-1)/2$ threads, and the FSM vector, the P vector, $n$, and inputs $X$ as parameters. A thread $t_i$ processes $P[i] = ((s, s', \vartheta), f)$ through a for-loop if and only if $f = 0$. The for-loop iterates over inputs; at each iteration it applies an input $x$ to $s$ and $s'$ and receives $s_i = \delta(s, x)$ and $s_j = \delta(s', x)$ from the FSM vector. Thread $t_i$ then computes the index $k$ for the pair $(s_i, s_j)$. If the flag of P[k] is 1 then the thread retrieves the separating sequence $\vartheta'$, assigns $x\vartheta'$ to $\vartheta$, and sets $f = 2$. The thread sets the flag value to 2 rather than 1 to avoid unnecessarily long separating sequences being constructed. For example, consider the scenario in which thread $t_i$ tries to evolve P[i] and there is a freshly evolved element P[j] (evolved in the current iteration) and an element $P[\omega]$ that has already evolved. As $t_i$ checks input symbols one by one, if the flags of P[j] and $P[\omega]$ are 1 then $t_i$ may use $\vartheta_j$ in forming its separating sequence. However, $|\vartheta_j| > |\vartheta_\omega|$. By setting the flags of freshly evolved elements to 2, we ensure that these cannot be used to form new separating sequences until the next iteration.
After the first kernel returns, the algorithm launches another kernel with P and isMin to convert 2s to 1s. A thread $t_i$ checks if the flag of P[i] is 2; if so it sets the flag to 1 and sets isMin to T. After the kernel returns, the algorithm checks isMin and if it reads F it terminates, declaring that M is not minimal; otherwise it continues.
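The two-kernel scheme with flag values 0, 1 and 2 can be sketched sequentially in Python as follows. The dictionary-based transition table `trans`, the function name, and the use of `None` as the "not minimal" result are assumptions for illustration; the sketch does not model GPU memory or thread scheduling.

```python
from itertools import combinations


def separating_sequences(n, inputs, trans):
    """trans maps (state, input) -> (next_state, output); missing keys are undefined.

    Returns {(s, t): sequence} for s < t, or None if some pair cannot be separated.
    """
    flag = {p: 0 for p in combinations(range(n), 2)}
    seq = {}
    # Initial step: a single input defined in both states with different outputs.
    for (s, t) in flag:
        for x in inputs:
            a, b = trans.get((s, x)), trans.get((t, x))
            if a is not None and b is not None and a[1] != b[1]:
                seq[(s, t)], flag[(s, t)] = [x], 1
                break
    if not seq:
        return None
    while any(f == 0 for f in flag.values()):
        # First kernel: extend via a successor pair whose flag is already 1.
        for (s, t), f in flag.items():
            if f != 0:
                continue
            for x in inputs:
                a, b = trans.get((s, x)), trans.get((t, x))
                if a is None or b is None or a[0] == b[0]:
                    continue
                k = (min(a[0], b[0]), max(a[0], b[0]))
                if flag[k] == 1:
                    seq[(s, t)] = [x] + seq[k]
                    flag[(s, t)] = 2   # not usable by other pairs until the next round
                    break
        # Second kernel: convert 2s to 1s and record whether anything changed.
        is_min = False
        for p, f in flag.items():
            if f == 2:
                flag[p] = 1
                is_min = True
        if not is_min:
            return None
    return seq
```

Because freshly evolved pairs carry flag 2 during a round, every sequence built in round r has length r + 1, which is the property the flag value 2 is meant to preserve.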

Parallel-HSI Algorithm Data Structures
The parallel-HSI algorithm uses a boolean vector B, of size $n(n-1)/2$, to record which pairs of states have known separating sequences. Recall that elements of an ST-vector are associated with an input sequence, initial and current states and output sequences. We simulated an ST-vector by using the CurrentStates, InputSequence and OutputSequences vectors. The InputSequence vector holds the input sequence $\bar{x}$ to be applied and so has length at most $n(n-1)/2$. CurrentStates[i] gives the current state $\delta(s_i, \bar{x})$ for initial state $s_i$ and so CurrentStates has size $n$. The OutputSequences vector holds the output sequences produced from the different states and so its length can vary from $n$ to $n(n(n-1)/2)$.
The parallel-HSI algorithm uses the FSM vector. The algorithm also uses a vector Flags: for state $s$, Flags[i] is larger than 0 if and only if $s_i$ has been distinguished from $s$. As the FSM and InputSequence vectors never change, they are kept in texture memory.

Resetting/Initiating Vectors
The parallel-HSI algorithm first initialises the B vector using a kernel in which thread $t_i$ writes 0 to B[i]. The algorithm then enters a loop and in every iteration it resets vector D. As D is simulated by different vectors, we used more than one kernel to achieve this. The first kernel receives CurrentStates and its length, and during execution thread $t_i$ writes $s_i$ to CurrentStates[i]. The second kernel receives a vector (InputSequence, Flags, OutputSequences, etc.) and its size. During execution a thread $t_i$ writes 0 to the $i$th index of the vector.

Evolving the ST Vector
A thread $t_i$ applies the input sequence $\bar{x}$ to state $s_i$, iterating over a loop (kernel-loop). The number of iterations of the kernel-loop is equal to the length of the input sequence. At each iteration of the kernel-loop, thread $t_i$ reads the next input $x$ from the InputSequence vector and retrieves the next state $s$ and the observed output $y$ from the FSM vector. It writes $s$ to CurrentStates[i] and writes the output observed in the current iteration to the corresponding index of the OutputSequences vector.
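A sequential Python sketch of this step is given below, with the outer loop standing in for the threads. The dictionary `trans`, the use of `None` for an undefined state or output, and the row-per-state shape of `outputs` are assumptions made for illustration.

```python
def evolve(n, trans, input_sequence):
    """Apply input_sequence from every state of a partial FSM.

    trans maps (state, input) -> (next_state, output); missing keys are undefined.
    Returns (current_states, output_sequences); None marks an undefined entry.
    """
    current = list(range(n))             # CurrentStates[i] starts at s_i
    outputs = [[] for _ in range(n)]     # one output row per initial state
    for i in range(n):                   # simulated thread t_i
        for x in input_sequence:         # kernel-loop
            step = None
            if current[i] is not None:
                step = trans.get((current[i], x))
            if step is None:
                # The sequence is not defined from s_i beyond this point.
                current[i] = None
                outputs[i].append(None)
            else:
                current[i], y = step
                outputs[i].append(y)
    return current, outputs
```

Every row of `outputs` has the same length as the input sequence, which keeps the indexing uniform across threads.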

Evaluating the Output Sequences
After the elements of D have been evolved, the parallel-HSI algorithm evaluates the output sequences through a loop (states-loop) called by the CPU. At each iteration, the states-loop chooses a state $s_i$ and calls a kernel that writes 0 to all elements of the Flags vector. It then executes another kernel in which a thread $t_j$ compares the output sequences produced by states $s_i$ and $s_j$ through a loop (kernel-loop-2). At each iteration of kernel-loop-2, thread $t_j$ retrieves outputs $y_i$ and $y_j$ and checks whether the value $(\varepsilon \oplus y_i) \wedge (\varepsilon \oplus y_j)$ is 0 (here $\oplus$ denotes XOR). If so the thread is suspended since the input sequence is not a defined sequence (from at least one of $s_i$ and $s_j$). Otherwise it writes $(y_i \oplus y_j) \vee \mathrm{Flags}[j]$ to Flags[j].
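The comparison performed by each thread can be sketched in Python as below. The integer encoding of outputs, with `EPS = 0` reserved for the undefined output, is an assumption; the conjunction is treated as a logical test on the two XOR results, since a bitwise AND of two non-zero integers can itself be 0.

```python
EPS = 0  # assumed encoding of the undefined output; defined outputs are 1, 2, ...


def compare_outputs(i, outputs):
    """Return Flags, where Flags[j] > 0 iff the applied sequence separates s_i and s_j."""
    n = len(outputs)
    flags = [0] * n
    for j in range(n):                                  # simulated thread t_j
        for yi, yj in zip(outputs[i], outputs[j]):      # kernel-loop-2
            if (EPS ^ yi) == 0 or (EPS ^ yj) == 0:
                break            # undefined from s_i or s_j: the thread stops here
            flags[j] |= yi ^ yj  # non-zero as soon as the two outputs differ
    return flags
```

Only the prefix that is defined from both states contributes, so a flag is raised exactly when some defined prefix of the sequence yields different outputs.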

Gathering State Identifiers
After the output sequences have been compared, a kernel compares the values of Flags and B. An input sequence is added to $H_i$ and $H_j$ if it distinguishes $s_i$ and $s_j$ (Flags[j] is true) and the states were not previously distinguished (!B[index] is true). Thus, $t_j$ writes InputSequence to $H_j$ and $H_i$, and 1 to B[index], if Flags[j] $\wedge$ !B[index] is larger than 0.
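A sequential Python sketch of the gathering step follows. The 0-based `pair_index` formula, the representation of each $H_i$ as a set of tuples, and the function names are assumptions for illustration only.

```python
def pair_index(i, j, n):
    """0-based position of the unordered pair {s_i, s_j} (0-based states) in B."""
    if i > j:
        i, j = j, i
    return i * n - i * (i + 1) // 2 + (j - i - 1)


def gather(i, flags, B, H, input_sequence):
    """Record input_sequence for every pair (s_i, s_j) that it newly distinguishes."""
    n = len(flags)
    for j in range(n):                     # simulated thread t_j
        if j == i:
            continue
        idx = pair_index(i, j, n)
        if flags[j] > 0 and not B[idx]:
            H[i].add(tuple(input_sequence))
            H[j].add(tuple(input_sequence))
            B[idx] = True                  # the pair now has a known separating sequence
```

Because B is consulted before writing, a pair that was already separated by an earlier sequence does not cause a further sequence to be stored.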

Constructing a Characterising Set from HSIs
While constructing the CS $W$ from HSIs we need to eliminate duplicates. Thus, once HSIs have been computed by the parallel-HSI algorithm, we form a group $G$ by collecting state identifiers from every $H_i$. We then sort these input sequences (sort($G$)) in parallel and remove duplicates using the generateW($G$) kernel. We used the Thrust sort function [56] to sort input sequences.
The generateW kernel receives $G$ and an empty $W$. A loop iterates $|\bar{x}_i|$ times and at each iteration a thread $t_i$ compares the $j$th input of neighbouring input sequences: it compares the $j$th input of $\bar{x}_i$ with those of both $\bar{x}_{i+1}$ and $\bar{x}_{i-1}$. If $t_i$ finds that $\bar{x}_i$ is different from a neighbouring input sequence then it adds this input sequence to $W$.
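The sort-then-compare-neighbours idea can be sketched in Python as below. This is a simplified sequential variant: it uses Python's built-in `sorted` in place of the Thrust sort and compares each sequence only with its predecessor so that exactly one copy of each sequence survives, which is an assumption about how ties are resolved.

```python
def generate_w(G):
    """Return the distinct input sequences of G in sorted order."""
    G = sorted(tuple(seq) for seq in G)    # stands in for the parallel sort
    W = []
    for i, seq in enumerate(G):            # simulated thread t_i
        # After sorting, equal sequences are adjacent, so comparing with the
        # left neighbour is enough to detect a duplicate.
        if i == 0 or seq != G[i - 1]:
            W.append(seq)
    return W
```

For example, `generate_w([['a', 'b'], ['a'], ['a', 'b'], ['b'], ['a']])` keeps one copy each of `('a',)`, `('a', 'b')` and `('b',)`.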

Theorem 3.1.
If M is an FSM with n states and p inputs then Algorithm 1 requires $O(n^4 p)$ time.

Theorem 3.2.
If Algorithm 1 is given a minimal FSM M then it returns a set P.

Theorem 3.3.
If Algorithm 1 returns P when given minimal FSM M then P defines a CS for M.

Figs. 3 and 4 present the mean construction times in ms. The results are promising: on average PAR-BF-HSI was 420 times faster than SEQ-BF-HSI and 1,836 times faster when n = 1,024. On average, PAR-BF-CS was 605 times faster than SEQ-BF-CS; SEQ-PLY-CS was three times faster than SEQ-BF-CS; and PAR-PLY-CS was 6,316 times faster than SEQ-BF-CS. With n = 2,048, PAR-BF-CS was 3,118 times faster than SEQ-BF-CS and PAR-PLY-CS was 33,241 times faster than SEQ-BF-CS.

Fig. 7. $\log_{10}$ scale of time (in ms) required to construct CSs for the s-FSMs with PAR-PLY-CS and SEQ-PLY-CS.
For example, in $M_1$ we have that $\delta(s_2, x_4 x_6) = s_4$ and $\lambda(s_2, x_4 x_6) = y_1 y_1$. For example, $L_{M_1}(s_2)$ contains the input/output sequence $x_4/y_1\, x_6/y_1$. FSM $M$ defines the language $L(M) = L_M(s_0)$ of labels of walks with starting state $s_0$. Given $S' \subseteq S$ we let $L_M(S') = \bigcup_{s \in S'} L_M(s)$. States $s, s'$ are equivalent if $L_M(s) = L_M(s')$ and FSMs $M$ and $N$ are equivalent if $L(M) = L(N)$.
Algorithm 3. Parallel HSI Construction Algorithm (highlighted lines are executed in parallel)
Input: An FSM $M = (S, s_0, X, Y, h)$ where $|S| = n$ and $n > 1$

TABLE 1
Average Length of State Identifiers Generated for TS1

TABLE 3
The s-FSMs Time in Milliseconds

TABLE 4
The Effect of Number of Transitions on the Distinguishability of FSMs Where n = 4,096, p/q = 3/3

TABLE 5
Nomenclature for the Algorithms