Learning by sampling: learning behavioral family models from software product lines

Family-based behavioral analysis operates on a single specification artifact, referred to as a family model, annotated with feature constraints to express behavioral variability in terms of conditional states and transitions. Family-based behavioral modeling paves the way for efficient model-based analysis of software product lines. Family-based behavioral model learning incorporates feature model analysis and model learning principles to efficiently unify product models into a family model and integrate the behavior of various products into a behavioral family model. Albeit reasonably effective, the exhaustive analysis of product lines is often infeasible due to the potentially exponential number of valid configurations. In this paper, we first present a family-based behavioral model learning technique, called FFSMDiff. Subsequently, we report on our experience with learning family models by employing product sampling. Using 105 products of six product lines expressed in terms of Mealy machines, we evaluate the precision of family models learned from products selected using different settings of the T-wise product sampling criterion. We show that product sampling can lead to models as precise as those learned by exhaustive analysis and hence reduce the costs of family model learning.


Introduction
Several technology companies, such as ABB (Svendsen et al. 2010), Boeing (Sharp 2000), Philips (van der Linden et al. 2007b; 2007a), and Siemens (van der Linden et al. 2007c), have been facing an increasing demand for mass production and customization of software products (Pohl et al. 2005). To cope with this need, they have been investing in establishing common platforms to build software families using production line principles (SPLC 2020). Software product lines (SPL) provide a means to support the mass production and customization of software systems (Clements and Northrop 2001). Unlike traditional systems, which are tailored for a specific use, SPLs are developed for reuse and with reuse. Thus, products are not created anew but derived from assets managed as commonalities and variabilities (Linden et al. 2007).
Analysing (e.g., validating and verifying) the system-level functionalities of an SPL on a product-by-product basis is very demanding due to the potentially exponential number of valid products, e.g., the Linux kernel and its 6,320 features (Berger et al. 2013). Explicit models have been used to support the analysis and development of high-quality systems. They help software engineers in program comprehension (Bauer et al. 2014), software refactoring (Schuts et al. 2016), model checking (Baier and Katoen 2008), and model-based testing (Utting et al. 2012). Family-based modeling approaches have been developed to facilitate SPL analysis without going through each and every product individually (Thüm et al. 2014). Such approaches typically involve two types of family-based models: structural and behavioral (Pohl et al. 2005). Family-based structural models, such as feature models (Kang et al. 1990), capture the presence and absence of features in various products. Family-based behavioral models, such as featured finite state machines (Hafemann Fragal et al. 2017), capture the functionality of features and their interactions. Family-based behavioral models are often referred to as a family model (Thüm et al. 2014; Oster 2012) or 150% model (Schaefer et al. 2012; Beuche et al. 2016) and are the cornerstone of efficient model-based SPL behavioral analysis.
More specifically, family-based behavioral analysis techniques have been developed for efficient test case generation (Atlee et al. 2015; Beohar and Mousavi 2016; Hafemann Fragal et al. 2017) and model checking (Sabouri and Khosravi 2013; ter Beek et al. 2017) of SPLs. Family models have been used for conformance analysis (Fragal et al. 2018), probabilistic model checking (Varshosaz and Khosravi 2013; Chrszon et al. 2018), and real-time software testing (Luthmann et al. 2019). Nevertheless, the creation and maintenance of family models are difficult and time consuming due to crosscutting features (Oster 2012), and the traceability between the family and feature models can become hard to maintain (Schaefer et al. 2012). Thus, as new requirements emerge and products evolve, the lack of maintenance may lead to outdated and incomplete models (Walkinshaw 2013). Additionally, in practice, many software development environments do not have a structured SPL development process in place and rely on individual product models without knowledge about their commonalities and variabilities (Holthusen et al. 2014). To remedy these issues, we propose an approach for learning behavioural family models of SPLs.
In recent years, we have seen a resurgence of interest in model learning (Vaandrager 2017; Aichernig et al. 2018), particularly supervised techniques for learning state-based models (Angluin 1987; Shahbaz and Groz 2009). This has led to successful applications in industrial practice (Aichernig et al. 2018) and empirical studies to evaluate the performance of new algorithms and tools for model learning (Neider et al. 2019). Our approach builds upon these recent attempts and aims to abstract a family-based state machine model from individually learned or hand-crafted product models.
We introduce the Featured Finite State Machine Difference (FFSMDiff) algorithm, a technique that employs a similarity measure for state-based models (Walkinshaw and Bogdanov 2013) to identify similar behavior in various products specified as Mealy machines (Gill 1962), annotate conditional states and transitions with feature constraints, and integrate them into a succinct family model. Our technique is discussed in terms of Featured Finite State Machines (FFSM), a family-based formalism that unifies Mealy machines of SPLs into a single representation to enable efficient model-based analysis of SPLs (Hafemann Fragal et al. 2017; Fragal et al. 2018). However, the ideas surrounding our algorithm can be extended to other family-based notations (Benduhn et al. 2015), such as Modal Transition Systems (MTS) (Larsen and Thomsen 1988), various extensions of MTS (Fischbein et al. 2006; Larsen et al. 2007; ter Beek et al. 2016; ter Beek et al. 2019a), and Featured Transition Systems (FTS) (Classen et al. 2013; Beohar and Mousavi 2016).
Additionally, we evaluate the use of product sampling (Perrouin et al. 2010; Johansen et al. 2011) to efficiently choose the individual products to be analyzed and still learn precise family models. Product sampling techniques, such as T-wise (Varshosaz et al. 2018), should collectively cover the behavior of an SPL using a subset of all valid combinations of T selected features (Perrouin et al. 2010; Johansen et al. 2011). Hence, they should enable family model learning with reasonable precision and execution costs lower than in an exhaustive analysis. To evaluate the precision and efficiency of models learned by sampling, we compare the sampling and exhaustive approaches.
To evaluate our approach, we perform an empirical study of its efficiency on a benchmark set of SPLs (Hafemann Fragal et al. 2017; Classen 2010; Samih et al. 2014; Devroey et al. 2015; 2016). Through this empirical evaluation, we aim to answer the following research questions: (RQ1) Is our approach effective in learning succinct family models with respect to the total size of the products under learning? (RQ2) Is our approach effective in learning succinct family models with respect to the total size of the hand-crafted models? (RQ3) Is the size of learned family models influenced by the configuration similarity degree of the products under learning? (RQ4) Is our approach effective in learning precise family models compared to those obtained by exhaustive analysis?
Regarding (RQ1) and (RQ2), we evaluate the succinctness of the learned family model with respect to the individual product models and the hand-crafted family-based specifications. We describe succinctness in terms of the number of transitions and states as these are factors that influence the complexity of model-based techniques (Broy et al. 2005; Baier and Katoen 2008) and that are used to interpret the language and structure of state-based models (Walkinshaw and Bogdanov 2013). Regarding (RQ3), we show that our approach is effective when it can identify reuse and leads to more succinct family models by integrating the reused features. Hence, we set out to test the correlation between the degree of reuse and the succinctness of learned models. Finally, regarding (RQ4), we test the effectiveness of various sampling techniques in learning precise family models.
This paper builds upon and extends preliminary results of a conference paper published in the proceedings of the 23rd International Systems and Software Product Line Conference (SPLC 2019) (Damasceno et al. 2019b). Besides providing a more detailed explanation throughout the paper, we have introduced new parameters for model comparison and merging; we have incorporated three extra models in our benchmark, including a state machine model from a real system (Samih et al. 2014); and we have evaluated the precision of family models learned by sampling. We briefly summarize our contributions as follows:
1. We introduce a technique to learn family models from individual product specifications by means of state-based model comparison and feature model analysis;
2. We present an experiment evaluating our technique and showing its effectiveness for learning succinct family models in terms of numbers of states and transitions;
3. We show that the amount of feature reuse is a factor that affects the size of learned family models;
4. We evaluate the effectiveness of family models learned by sampling against those learned by exhaustive analysis.
The remainder of this paper is structured as follows: In Section 2, we introduce the fundamental background concepts used in this study, such as SPLs, sample-based analysis for SPLs, finite state machines, and structural comparison of state-based models. In Section 3, we introduce our family model learning approach and how it incorporates feature model analysis into the process of structural comparison of state machines. In Section 4, we present a process that employs product sampling to reduce the costs of family model learning. In Section 5, we discuss an empirical evaluation of the effectiveness of our approaches for family model learning and the precision of models learned by sampling. In Section 6, we discuss work related to our paper. In Section 7, we conclude this paper by presenting our conclusions and future work. To support the reader, a glossary of symbols is available in the Appendix.

Background
This section presents the background concepts and formalisms used in this study. We introduce SPLs, finite state machines, featured finite state machines, and the Labeled Transition System difference (LTSDiff) algorithm for structural comparison of state-based models represented as Labeled Transition Systems (LTS), a well-known variant of FSM (Keller 1976). We follow this particular ordering so as not to break the logical flow of the presentation on behavioral modeling for SPLs (i.e., the concepts of FSMs and FFSMs are strongly related by definition) and because no prior study has associated the LTSDiff algorithm with family models.

Software Product Lines
A software product line is a family of products sharing a common and managed set of features developed in a prescribed way to satisfy the needs of a particular market segment. Pohl et al. (2005) introduce an SPL engineering framework with two key processes: domain engineering and application engineering. This separation of concerns makes it possible to build robust platforms and develop customer-specific applications in a shorter time, at lower cost, and with improved quality.
During domain engineering, the common and variable artifacts, as well as the scope of an SPL, are defined, managed, and constructed. During application engineering, commonalities and variabilities of an SPL are exploited to achieve the highest possible reuse of domain artifacts. Artifacts generated during domain engineering are used to support the creation of software products. Thus, most application artifacts are not developed anew but reused from domain engineering artifacts with the support of software generators and valid product configurations. In Fig. 1, we illustrate the software product line engineering framework and the processes of domain and application engineering.
Let F be the set of features of an SPL. A product p is defined by a subset of features p ⊆ F selected from a variability model, such as a feature model (Kang et al. 1990). A feature model captures the structural information and dependencies between common and variant features of an SPL. Features are concrete if they are mapped to an implementation artifact, or abstract if they are only used to group other features (Thüm et al. 2011). These dependencies are denoted as a hierarchically arranged set of interconnected features, where parent-child relationships indicate dependency relations among features, and cross-hierarchy constraints are typically denoted by propositional logic formulas (Don Batory 2005).
There are four basic types of parental relationships among features: Mandatory, if a child feature is included in all products in which its parent appears; Optional, if a child is optionally included; Alternative, when only one child feature can be selected; and Or, when one or more child features can be included. For cross-hierarchy relationships, we have two typical forms: Requires, if the implementation of a feature A demands another feature B; and Excludes, if two features cannot be part of the same product. Propositional logic formulas can be used to describe more complex and advanced cross-hierarchy constraints among features (Don Batory 2005). In fact, propositional logic formulas have been extensively used in the automated analysis of feature models.
Boolean satisfiability solvers have been used as key elements under the hood of many feature model analysis tools (Benavides et al. 2010). The SAT4J project (Le Berre and Parrain 2010) is an example of a satisfiability solver widely used in feature model analysis. It is part of the FeatureIDE (Thüm et al. 2014) library, an Eclipse-based IDE that supports all phases of feature-oriented software development for SPLs (FeatureIDE 2004).
Feature models have also been extended with cardinality constraints and attributes to cope with the need for richer specifications (Czarnecki et al. 2002; Czarnecki et al. 2005; Berger et al. 2013). In this paper, we investigate the problem of family model learning using extensive academic benchmarks (Hafemann Fragal et al. 2017; Classen 2010) of SPLs that include non-trivial aspects, such as the possibility of infinite behavior and the existence of states with similar or identical behavior in different products. Let the set of features of a feature model be F; the powerset P(F) of all feature combinations is constrained to a subset of valid products P ⊆ P(F) that satisfy its feature constraints. Feature constraints are propositional logic formulae that interpret the elements from F as propositional variables. SAT solvers (Le Berre and Parrain 2010) are often used to detect valid feature models, feature combinations, core features (i.e., features that are part of all products), and redundancies (Benavides et al. 2010). We denote by B(F) the set of all feature constraints. The subset Λ ⊆ B(F) defines all valid product configurations of an SPL. We interchangeably refer to products as sets of features and propositions.
The configuration ξ ∈ B(F) of a product p ∈ P is a feature constraint that expresses the conjunction of all features included in p and the conjunction of negated features absent from it, i.e., ξ = (⋀_{f ∈ p} f) ∧ (⋀_{f ∉ p} ¬f). Given a feature constraint χ ∈ B(F), a configuration ξ ∈ Λ satisfies χ, denoted by ξ ⊨ χ, iff the feature constraint ξ ∧ χ is satisfiable. Given two feature constraints ω_a and ω_b from a feature model FM, and Λ_a, Λ_b ⊆ Λ satisfying ω_a and ω_b, respectively, we say that ω_a and ω_b are equivalent under FM if Λ_a = Λ_b. To illustrate the concepts of SPLs, we use the Arcade Game Maker feature model as our running example.
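To make these definitions concrete, the following sketch enumerates the powerset P(F) of a small feature set and filters it with a validity predicate standing in for the AGM feature model; in the paper this role is played by a feature model and a SAT solver such as SAT4J, so the predicate, the feature names, and the helper functions below are illustrative assumptions only.

```python
from itertools import combinations

# Hypothetical encoding of the AGM feature model from Example 1:
# one alternative group {Brickle, Pong, Bowling} plus the optional feature Save.
FEATURES = ["Brickle", "Pong", "Bowling", "Save"]

def is_valid(product):
    """Feature-model constraint of the running example: exactly one game, Save optional."""
    games = {"Brickle", "Pong", "Bowling"} & product
    return len(games) == 1

def valid_configurations(features):
    """Enumerate the powerset P(F) and keep only the products satisfying the constraints."""
    for r in range(len(features) + 1):
        for combo in combinations(features, r):
            product = set(combo)
            if is_valid(product):
                yield product

def satisfies(product, constraint):
    """ξ ⊨ χ, with the feature constraint χ given as a predicate over the selected features."""
    return constraint(product)

configs = list(valid_configurations(FEATURES))
print(len(configs))                                                    # 6 valid products (Example 1)
print(sum(satisfies(p, lambda q: "Save" not in q) for p in configs))   # 3 products satisfy ¬S
```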
Example 1 (The Arcade Game Maker SPL) The Arcade Game Maker (AGM) SPL includes three alternative features (i.e., Brickle, Pong, and Bowling) and one optional feature (i.e., Save). In Fig. 2, we depict the AGM feature model. In the feature expressions to come, we typically use the abbreviated names of features as shown in Fig. 2. The AGM feature model has six valid product configurations, among which three satisfy the feature constraint ¬S, indicating that the Save feature is absent.

Product-Based Analysis of SPL
In product-based strategies, valid products of an SPL are individually specified and analyzed. While theoretically possible, these strategies are impractical due to the potentially exponential number of feature combinations (Thüm et al. 2014). Hence, one should avoid exhaustive and redundant analyses and cater for valid feature interactions (Apel et al. 2013). To tackle these issues, combinatorial interaction testing and similarity analysis have been relevant approaches to optimize the analysis of product lines. In the next sections, we present two criteria that have been employed to optimize the analysis of SPLs.

Configuration Sampling
Combinatorial interaction testing (CIT) aims at using interaction coverage to sample product configurations (Kuhn et al. 2013). It is based on the observation that most faults emerge from the interaction between a small number of features (Kuhn et al. 2004). For interactions between any t features of an SPL, CIT is often referred to as T-wise testing (Perrouin et al. 2010). The T-wise sampling criterion, defined below, aims at sampling valid configurations from all possible combinations of selected and unselected features. Such combinations are called t-sets.
Definition 1 (Valid t-set) A valid t-set is a set of features {±f_1, ±f_2, ..., ±f_t} satisfying the constraints defined by the feature model FM over the set of features F, where t < |F|, +f_i indicates that feature f_i is selected, and −f_i that it is unselected. A t-set is invalid if it does not satisfy the constraints of FM.
Definition 2 (T-wise coverage) The t-wise coverage of a set of configurations {PC_1, ..., PC_n} is given by #(⋃_{i=1}^{n} T_{t,PC_i}) / #T_{t,FM}, where T_{t,PC_i} is the set of t-sets included within the configuration PC_i, T_{t,FM} is the set of all possible valid t-sets in FM, and #A denotes the cardinality of a set A.
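As a sketch of Definition 2, the snippet below enumerates the t-sets covered by each configuration and computes the coverage ratio. Valid t-sets are approximated as those occurring in at least one valid configuration, and the feature names and sampled configurations are illustrative assumptions rather than the output of a real sampling tool such as FeatureIDE.

```python
from itertools import combinations

def t_sets(config, all_features, t):
    """All t-sets (sign-annotated combinations of t features) covered by one configuration."""
    signed = [(f, f in config) for f in all_features]   # (feature, selected?)
    return {frozenset(c) for c in combinations(signed, t)}

def t_wise_coverage(sampled_configs, valid_configs, all_features, t):
    """Fraction of valid t-sets covered by the sampled configurations (Definition 2).
    Valid t-sets are approximated as those occurring in at least one valid configuration."""
    valid = set().union(*(t_sets(c, all_features, t) for c in valid_configs))
    covered = set().union(*(t_sets(c, all_features, t) for c in sampled_configs))
    return len(covered & valid) / len(valid)

FEATURES = ["Brickle", "Pong", "Bowling", "Save"]
VALID = [{"Brickle"}, {"Brickle", "Save"}, {"Pong"}, {"Pong", "Save"},
         {"Bowling"}, {"Bowling", "Save"}]
SAMPLE = [{"Brickle"}, {"Pong", "Save"}]
print(round(t_wise_coverage(SAMPLE, VALID, FEATURES, 2), 2))   # pairwise coverage of the sample
```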
Sample-based techniques are known to improve the efficiency of SPL analysis by discarding products that may already be covered by other products (Thüm et al. 2014). However, such an analysis may be incomplete and miss product-specific behaviors. Higher-order feature interaction coverage is known for its improved fault detection capabilities (Steffens et al. 2012; Petke et al. 2013). Thus, for larger values of T, the T-wise coverage should lead to a more complete analysis. The Chvatal algorithm (Chvatal 1979) is an example of a technique for T-wise product sampling (Johansen et al. 2011) that is available in the FeatureIDE workbench (Thüm et al. 2014).

Configuration Similarity
Studies in software testing have shown that similar test cases tend to have equivalent fault detection capabilities, and no additional gain should be expected when these are simultaneously executed (Cartaxo et al. 2011). To mitigate these issues, similarity metrics have been used as test prioritization criteria (Yoo and Harman 2012; Cartaxo et al. 2011) for access control systems (Bertolino et al. 2015; Damasceno et al. 2018) and SPLs (Henard et al. 2014; Al-Hajjaji et al. 2017).
In configuration similarity, a similarity metric describes the similarity relation between two configurations as a numeric value. Similarity metrics typically range from zero, if the product configurations are totally distinct, to one, if they implement the same set of features.
The Hamming distance is a well-known measure (Deza and Deza 2013) that has been used in the context of SPLs to calculate the similarity between product configurations. It is represented as the normalized number of common selected and unselected features of the two configurations, as follows: Definition 3 (Configuration similarity) The configuration similarity between two product configurations p_i, p_j from a feature model FM with the set of features F is defined as confSim(p_i, p_j) = (|p_i ∩ p_j| + |(F\p_i) ∩ (F\p_j)|) / |F| (3.1).
In the confSim() metric, |p_i ∩ p_j| denotes the number of common selected features between p_i and p_j, and |(F\p_i) ∩ (F\p_j)| represents the number of common unselected features between them. These two values are normalized by the total number of features |F|.
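The metric can be computed directly from the two feature sets. The sketch below assumes configurations are given as sets of selected feature names; the AGM feature names are used only for illustration.

```python
def conf_sim(p_i, p_j, features):
    """Hamming-distance-based configuration similarity (Definition 3): common selected plus
    common unselected features, normalized by the total number of features |F|."""
    f = set(features)
    common_selected = len(p_i & p_j)
    common_unselected = len((f - p_i) & (f - p_j))
    return (common_selected + common_unselected) / len(f)

# Two AGM products from Example 1: Bowling without Save vs. Bowling with Save
print(conf_sim({"Bowling"}, {"Bowling", "Save"}, ["Brickle", "Pong", "Bowling", "Save"]))  # 0.75
```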

Family-Based Analysis of SPLs
Family-based analysis relies on domain artifacts that incorporate knowledge about valid feature combinations to perform efficient model-based analysis of SPLs, e.g., model-based testing (Beohar et al. 2016) and model checking (Sabouri and Khosravi 2013; ter Beek et al. 2017). Thus, not every individual product has to be analyzed, and redundant computations are minimized or avoided (Thüm et al. 2014). The performance of family-based strategies is mainly influenced by the number of features, the size of feature implementations, and the amount of reuse during feature combinations (Brabrand et al. 2012).
In this section, we introduce the Featured Finite State Machine notation (Hafemann Fragal et al. 2017; Fragal 2017) to express individual features and feature combinations as finite state machines extended with feature constraints. We start by defining finite state machines (Gill 1962) as a behavioral model to specify the products of a family and then present their featured extension for family-based modeling.
Definition 4 (Finite state machine) A finite state machine (FSM) is a septuple M = ⟨S, s_0, I, O, D, δ, λ⟩ where S is the finite set of states, s_0 ∈ S is the initial state, I is the set of inputs, O is the set of outputs, D ⊆ S × I is the specification domain, and δ : D → S and λ : D → O are the transition and output functions, respectively.
Initially, an FSM is in the initial state s_0. Given a current state s_i ∈ S, when a defined input x ∈ I, such that (s_i, x) ∈ D, is applied, the FSM responds by moving to state s_j = δ(s_i, x) and producing output y = λ(s_i, x). The concatenation of two input sequences α and ω is denoted by α · ω. An input sequence α is a prefix of β, denoted by α ≤ β, when β = α · ω for some sequence ω. An input sequence α is a proper prefix of β, denoted by α < β, when β = α · ω for some ω ≠ ε. The prefixes of a set T of input sequences are denoted by pref(T) = {α | ∃β ∈ T, α ≤ β}. When T = pref(T), it is prefix-closed.
An input sequence α = x_1 · x_2 · ... · x_n ∈ I* is defined in state s ∈ S if there are states s_1, s_2, ..., s_{n+1} such that s = s_1 and δ(s_i, x_i) = s_{i+1}, for all 1 ≤ i ≤ n. Transitions are often represented as a quadruple (s_i, x, y, s_j) with the source state, input, output, and destination state, respectively, as their components; or by directed edges labeled with input and output symbols, i.e., s_i −x/y→ s_j. Transition and output functions are lifted to sequences of inputs in the standard way. Namely, for the empty input sequence ε, δ(s, ε) = s and λ(s, ε) = ε. For an input sequence α · x defined in state s, we have δ(s, α · x) = δ(δ(s, α), x) and λ(s, α · x) = λ(s, α) · λ(δ(s, α), x). An input sequence α ∈ I* is a transfer sequence from s to s′ when δ(s, α) = s′. An input sequence γ is a separating sequence for s_i, s_j ∈ S when λ(s_i, γ) ≠ λ(s_j, γ). Two states s_i, s_j ∈ S are equivalent when, for all α ∈ I* defined in both s_i and s_j, we have λ(s_i, α) = λ(s_j, α); otherwise they are distinguishable. An FSM is complete when D = S × I; otherwise it is partial.
An FSM is deterministic when, for each state s_i and input x, there is at most one possible state s_j = δ(s_i, x) and output y = λ(s_i, x). When all states of an FSM are pairwise distinguishable, it is minimal. When all states of an FSM are reachable from s_0, it is initially connected. When every state is reachable from all states, it is strongly connected.
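A minimal sketch of the Mealy-style FSM of Definition 4 is shown below, with the specification domain D given implicitly by the keys of a transition dictionary; the two-state machine at the bottom is a hypothetical example, not one of the benchmark models.

```python
class MealyFSM:
    """Deterministic Mealy machine: `transitions` maps (state, input) -> (next_state, output).
    Pairs outside the dictionary are undefined, so the machine may be partial (D ⊆ S × I)."""

    def __init__(self, initial_state, transitions):
        self.initial_state = initial_state
        self.transitions = dict(transitions)

    def run(self, inputs, state=None):
        """Lift δ and λ to an input sequence: return the output sequence and the reached state."""
        state = self.initial_state if state is None else state
        outputs = []
        for x in inputs:
            if (state, x) not in self.transitions:
                raise KeyError(f"input {x!r} is undefined in state {state!r}")
            state, y = self.transitions[(state, x)]
            outputs.append(y)
        return outputs, state

fsm = MealyFSM("s0", {("s0", "a"): ("s1", 1), ("s1", "a"): ("s0", 0), ("s0", "b"): ("s0", 0)})
print(fsm.run(["a", "a", "b"]))   # ([1, 0, 0], 's0')
```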
Example 2 (FSM for an AGM product) In Fig. 3, we show an FSM for the AGM product derived from the feature constraint ξ = B ∧ ¬S.
In this study, we focus on complete, deterministic, minimal, and initially connected FSMs, which are hereafter simply referred to as finite state machines. This is a reasonable assumption as FSMs are suitable abstraction models for testing reactive systems (Gill 1962; Broy et al. 2005; Chow 1978) and for specifying the semantics of richer notations (Harel 1987; Cassel et al. 2015). Furthermore, the ideas surrounding our proposal can be extended to non-connected, non-minimal, and non-deterministic product models (Walkinshaw and Bogdanov 2013).
Definition 5 (Featured Finite State Machine) An FFSM is a septuple ⟨F, Λ, C, c_0, Y, O, Γ⟩, where: F is a finite set of features; Λ is the set of product configurations; C ⊆ S × B(F) is a finite set of conditional states, where S is a finite set of state labels and B(F) is the set of all feature constraints, and C satisfies the condition ∀(s, φ) ∈ C, ∃ξ ∈ Λ : ξ ⊨ φ (5.1); c_0 ∈ C is the initial conditional state; Y ⊆ I × B(F) is the set of conditional inputs; O is the set of outputs; and Γ ⊆ C × Y × O × C is the set of conditional transitions, which satisfies the analogous condition for its feature constraints (5.2). Conditions (5.1) and (5.2) ensure that all conditional states and transitions are present in at least one valid product of the SPL. A conditional state c = (s, φ) ∈ C is alternatively denoted by s[φ].
A conditional transition (c, (x, φ), o, c′) ∈ Γ goes from conditional state c to conditional state c′ when the conditional input (x, φ) is applied and produces output o. Given an FFSM FF = ⟨F, Λ, C, c_0, Y, O, Γ⟩ and a configuration ξ ∈ Λ, the product derivation operator (Hafemann Fragal et al. 2017) parameterized by the configuration ξ derives a product FSM FF|_ξ = (S, s_0, I, O, D, δ_ξ, λ_ξ) in which only those conditional states and transitions whose feature constraints are satisfied by ξ are kept, and the feature constraints are dropped.
Example 3 (The Arcade Game Maker FFSM) Figure 4 depicts an FFSM for the AGM SPL. In this example, the conditional state Save Game[S] and all conditional transitions reaching or leaving it are implemented by all products implementing feature S. The FSM in Fig. 3 is an example of a product derived using the configuration ξ from Example 2.
To make FFSMs suitable for model-based testing (Utting et al. 2012), Fragal, Simão, and Mousavi (Hafemann Fragal et al. 2017) proposed a validation technique to check whether an FFSM satisfies the basic properties of FSMs, i.e., determinism, completeness, initial connectedness, and minimality, at the product-line level. In addition, they show that this SPL-level validation is sound, i.e., if an FFSM satisfies these properties, so do all FSM products that can be derived from it.
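The product derivation operator can be sketched as a simple filter over conditional transitions, with feature constraints represented as predicates over the configuration; the state labels and the fragment of the AGM FFSM below are illustrative assumptions, not the exact model of Fig. 4.

```python
def derive_product(conditional_transitions, configuration):
    """Sketch of product derivation: keep only the conditional transitions whose feature
    constraint is satisfied by the configuration, and drop the constraints afterwards.
    A conditional transition is (state_label, input, constraint, output, next_state_label)."""
    transitions = {}
    for (src, x, phi, y, dst) in conditional_transitions:
        if phi(configuration):               # ξ ⊨ φ
            transitions[(src, x)] = (dst, y)
    return transitions

# Hypothetical fragment of the AGM FFSM: Save Game[S] is guarded by feature S
GAMMA = [
    ("Start Game", "save",  lambda cfg: "S" in cfg, 1, "Save Game"),
    ("Save Game",  "exit",  lambda cfg: "S" in cfg, 1, "Start Game"),
    ("Start Game", "pause", lambda cfg: True,       1, "Pause Game"),
]
print(derive_product(GAMMA, {"W"}))        # Save-related transitions are filtered out
print(derive_product(GAMMA, {"W", "S"}))   # all three conditional transitions are kept
```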
Recently, FFSMs have been employed to generate configurable test suites that can be pruned using feature constraints and product configurations (Fragal et al. 2018). The readability of FFSMs has also been improved by grouping conditional states and transitions into hierarchical entities (Fragal et al. 2019). Thus, FFSMs have the prospect of serving as a suitable basis for family-based analysis.

Structural Comparison of State-Based Models
According to Walkinshaw and Bogdanov (2013), structurally comparing two state machines is a difficult task which involves establishing equivalence relationships between states and transitions. To achieve this goal, they proposed LTSDiff, an algorithm to compute the precise difference between two LTSs, a well-known variant of FSM (Keller 1976). In this section, we discuss the LTSDiff algorithm in terms of FSMs.

Similarity Score
In the LTSDiff algorithm, the differences between two FSM models M_r = ⟨S_r, s_0^r, I_r, O_r, D_r, δ_r, λ_r⟩ and M_u = ⟨S_u, s_0^u, I_u, O_u, D_u, δ_u, λ_u⟩ are described in terms of states and their surrounding transitions matching input and output symbols. To achieve this, the algorithm first calculates, for every pair of states a ∈ S_r, b ∈ S_u, the set of matching outgoing transitions, i.e., the pairs of successor states that can be reached from a and b via transitions with identical input/output labels. Then, a global similarity score S^G_Succ(a, b) is calculated by aggregating the scores of the state pairs connected to the original pair: each matching transition contributes its own match plus the (attenuated) score of the successor pair it leads to, normalized by the number of matching and non-matching outgoing transitions. An attenuation ratio k is used to give precedence to state pairs that are closer to the original pair of states, and the notation out_r(a) refers to the set of input/output labels of the outgoing transitions of state a in M_r. Thus, the expression |out_r(a) \ out_u(b)| + |out_u(b) \ out_r(a)| denotes the number of outgoing transitions of a and b that do not match each other.
Given two FSMs M_r and M_u, the global similarity score S^G_Succ(a, b) is used to build a system of linear equations, such that each equation corresponds to S^G_Succ(a, b) for one specific pair of states (a, b) ∈ S_r × S_u.
The global similarity is calculated both in terms of future behavior (i.e., outgoing transitions) and past behavior (i.e., incoming transitions). The global similarity score for incoming transitions, S^G_Prev(a, b), is calculated in a similar manner. Considering the systems of equations for S^G_Succ(a, b) and S^G_Prev(a, b), the overall similarity score of each pair (a, b) is the average S(a, b) = (S^G_Succ(a, b) + S^G_Prev(a, b)) / 2.
Example 4 (Illustration of a system of linear equations) In Table 1, we depict the coefficients of the system of equations for the comparison of the two FSMs shown in Figs. 3 and 5. State pairs are represented by the first two letters of their respective names.
In the leftmost column, we indicate the state pair of each row. In the middle columns, we denote the coefficients for the state pairs reachable via matching transitions outgoing from the row's state pair. In the rightmost column, we indicate the number of matching transitions for that row's state pair, which corresponds to the sum of the unitary terms in the numerator of the global similarity score. A solution of this system of equations gives the similarity degree of each state pair.
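The following sketch builds and solves such a system with NumPy: one equation per state pair, where every matching outgoing transition contributes 1 plus k times the score of the successor pair, normalized by the number of matching and non-matching labels. The coefficient layout is a simplification of the exact LTSDiff formula, and the two toy machines are assumptions made for illustration; the sketch only shows how the linear system is assembled and solved.

```python
import numpy as np

def successor_similarity(trans_r, trans_u, k=0.5):
    """Sketch of the global successor-similarity scores: one linear equation per state pair.
    `trans_r` / `trans_u` map (state, input) -> (next_state, output)."""
    states_r = sorted({s for (s, _) in trans_r} | {d for (d, _) in trans_r.values()})
    states_u = sorted({s for (s, _) in trans_u} | {d for (d, _) in trans_u.values()})
    pairs = [(a, b) for a in states_r for b in states_u]
    index = {p: i for i, p in enumerate(pairs)}
    A = np.eye(len(pairs))          # coefficient matrix of the linear system
    rhs = np.zeros(len(pairs))      # right-hand side (normalized number of matches)

    def out(trans, s):              # outgoing transitions of s as {(input, output): successor}
        return {(x, y): d for (src, x), (d, y) in trans.items() if src == s}

    for (a, b) in pairs:
        out_a, out_b = out(trans_r, a), out(trans_u, b)
        matches = set(out_a) & set(out_b)
        total = len(set(out_a) | set(out_b)) + len(matches)   # matching + non-matching labels
        if total == 0:
            continue
        i = index[(a, b)]
        rhs[i] = len(matches) / total
        for label in matches:       # matching transitions link the score to the successor pair
            A[i, index[(out_a[label], out_b[label])]] -= k / total
    scores = np.linalg.solve(A, rhs)
    return {p: round(scores[index[p]], 3) for p in pairs}

M_r = {("s0", "a"): ("s1", 1), ("s1", "a"): ("s0", 0)}
M_u = {("t0", "a"): ("t1", 1), ("t1", "a"): ("t0", 0), ("t0", "b"): ("t0", 1)}
print(successor_similarity(M_r, M_u))
```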

The LTS Diff Algorithm
The comparison of two FSMs is performed in a similar fashion to how we manually navigate in an unfamiliar landscape using a map.In Algorithm 1, we describe this process as proposed by Walkinshaw and Bogdanov (2013).
First, we compute the similarity scores for all state pairs of the models, as indicated in the previous section. Then, in Line 3, we use a filtering function denoted by identifyLandmarks() to select the most equivalent pairs, i.e., those pairs with a score above a threshold t. If one state is matched to several others, a ratio r retains only those pairs that are at least r times as good as any other match. If no state pair is identified as a landmark, then in Lines 4-6 the initial states are mapped and selected as the initial landmark. The parameters t, k, and r can be adapted depending on how many similar transitions the models have. If there are many similar transitions, the threshold should be higher, to make sure that we start the matching process from state pairs that clearly stand out compared to the other pairs.
Second, in Line 7, the algorithm starts from the initial landmarks to find surrounding states reachable via incoming and outgoing transitions matching input/output labels. Once the initial members of the KPairs set are found, we use the Surr() function to search for all surrounding state pairs reachable via incoming and outgoing transitions with matching labels. These surrounding states are added to the NPairs set of matched state pairs to be analyzed.
Third, we begin an iterative process where we pick the state pair (a, b) ∈ NPairs with the highest similarity degree, incorporate (a, b) into the KPairs set, and remove every state pair conflicting with (a, b) from the NPairs set, i.e., every pair that shares a state with (a, b). Each of these steps is shown in Lines 10, 11, and 12, respectively. In Line 14, the surrounding state pairs of (a, b) reachable via matching transitions are collected, discarding those already covered by KPairs, and added to NPairs. This iterative process is repeated until there are no pairs left in the NPairs set, as indicated between Lines 8-15. At the end of this process, the KPairs set is used to derive the sets of transitions added, removed, and kept by checking matching transitions for all states in the KPairs set.
The worst-case complexity for solving this system of linear equations is O((|S_r| × |S_u|)^3), where |S_r| and |S_u| are the numbers of states of the compared FSM models. However, in practice, the average complexity is lower due to the sparse nature of the produced matrices (Walkinshaw and Bogdanov 2013).
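The pairing loop of Algorithm 1 can be sketched as follows, assuming the pairwise scores have already been computed (for instance by solving the linear system above) and that out_r / out_u map every state to its outgoing transitions; the landmark test is a simplified stand-in for the identifyLandmarks() step described in the text, with the same t and r parameters.

```python
def compute_key_pairs(scores, out_r, out_u, t=0.4, r=1.4, initial_pair=None):
    """Sketch of the LTSDiff pairing loop: select landmark pairs whose score exceeds the
    threshold t and beats competing matches by the ratio r, then repeatedly absorb the
    best-scoring surrounding pair that does not conflict with the pairs already chosen.
    `scores` maps (a, b) -> similarity; `out_r` / `out_u` map a state to {(input, output): successor}."""
    def surrounding(a, b):
        return {(out_r[a][lbl], out_u[b][lbl]) for lbl in set(out_r[a]) & set(out_u[b])}

    landmarks = {(a, b) for (a, b), s in scores.items()
                 if s > t and all(s >= r * scores[(x, y)]
                                  for (x, y) in scores if (x == a) ^ (y == b))}
    if not landmarks and initial_pair is not None:
        landmarks = {initial_pair}                       # fall back to the pair of initial states
    key_pairs = set(landmarks)
    n_pairs = set().union(*(surrounding(a, b) for (a, b) in key_pairs)) - key_pairs
    while n_pairs:
        best = max(n_pairs, key=lambda p: scores[p])     # highest-scoring candidate pair
        key_pairs.add(best)
        conflict_free = {p for p in n_pairs | surrounding(*best)
                         if p[0] != best[0] and p[1] != best[1]}
        n_pairs = conflict_free - key_pairs
    return key_pairs
```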

Model Precision
Originally, the LTSDiff algorithm (Walkinshaw and Bogdanov 2013) was proposed to identify structural differences between two state-based models reverse engineered by model learning algorithms (Lang et al. 1998; Cook and Wolf 1998). This structural difference is categorized in terms of a confusion matrix (Sokolova and Lapalme 2009). In Table 2, we show the confusion matrix used to compute the structural difference between two FSMs, namely the reference model M_r and the target model M_u, which respectively represent the system's internal behavior and the reverse-engineered model.
In this confusion matrix, the set of true positives TP is derived from the set of correctly learned transitions, i.e., D_r − Rem. The set of false positives FP is the set of extra transitions Add, i.e., those that were incorrectly hypothesized. The set of false negatives FN is defined as the set of removed transitions Rem, i.e., those that should be in the learned model M_u but are missing. The set of true negatives TN is empty because it refers to impossible transitions, i.e., those that should be in neither of the models M_r and M_u.
Based on these sets, performance metrics, such as Precision, Recall, and F-measure, can be computed for model learning algorithms (Walkinshaw and Bogdanov 2013). Precision tells the proportion of transitions in D_u that are also in D_r, and Recall tells the proportion of transitions in D_r that are also in D_u. In Table 3, we show the formulas used to calculate these performance metrics.
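Given the sets of added and removed transitions, the metrics of Table 3 reduce to simple set arithmetic. The sketch below assumes the transitions of the two models have already been expressed over matched state labels, and the two transition sets are illustrative only.

```python
def model_precision_metrics(reference_transitions, learned_transitions):
    """Precision, recall and F-measure following the confusion matrix of Table 2:
    TP = D_r - Rem (transitions in both models), FP = Add (extra learned transitions),
    FN = Rem (reference transitions missing from the learned model)."""
    tp = len(reference_transitions & learned_transitions)
    fp = len(learned_transitions - reference_transitions)    # Add
    fn = len(reference_transitions - learned_transitions)    # Rem
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    f_measure = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_measure

# transitions as (source, input, output, destination) quadruples
reference = {("s0", "a", 1, "s1"), ("s1", "a", 0, "s0")}
learned = {("s0", "a", 1, "s1"), ("s0", "b", 0, "s0")}
print(model_precision_metrics(reference, learned))   # (0.5, 0.5, 0.5)
```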

Learning Family Models from Product Specifications
Family models have been exploited as a theoretical foundation for efficient SPL analysis techniques, e.g., model-based testing (Beohar et al. 2016) and model checking (Sabouri and Khosravi 2013; ter Beek et al. 2017). Albeit reasonably efficient, family-based analysis is a challenging task because the creation and maintenance of family models is time consuming and error-prone, especially if there are crosscutting features and large models (Schaefer et al. 2012). Additionally, as requirements change and product instances evolve, the lack of maintenance may render family models outdated (Walkinshaw 2013).
In this section, we introduce the FFSMDiff algorithm, a family model learning technique that builds succinct FFSM models (Hafemann Fragal et al. 2017; Fragal et al. 2018) for an SPL by integrating feature model analysis into the process of structural comparison of state-based models (Walkinshaw and Bogdanov 2013). Although our technique is discussed in terms of FFSMs, it can be extended to non-connected, non-minimal, and non-deterministic models (Walkinshaw and Bogdanov 2013) and other family-based modeling approaches (Benduhn et al. 2015), such as FTSs (Classen et al. 2013; Beohar and Mousavi 2016), as the FSM notation is a variant of LTS where labels indicate input/output pairs.
The FFSMDiff algorithm allows us to (i) learn a new FFSM model from two product FSMs, or (ii) include a product FSM into an existing FFSM. The former approach is applicable when no FFSM exists a priori, and the latter when there is a new configuration ξ_u ∉ Λ_r not yet included in an FFSM FF_r specifying a set of configurations Λ_r. In both cases, we assume that the feature model, the FSMs, and the configurations of each product under learning are known a priori. This means that the product FSMs shall be previously hand-crafted or learned using some variant of model learning (Angluin 1987; Shahbaz and Groz 2009; Damasceno et al. 2019a). Furthermore, the product FSMs shall satisfy the basic FSM properties introduced in Section 2. These properties are also assumed to hold for any existing FFSM that will incorporate new product behavior (Hafemann Fragal et al. 2017).

The FFSM Diff Algorithm
In the FFSMDiff algorithm, we employ the structural comparison of state-based models proposed by Walkinshaw and Bogdanov (2013) to match product models. Then, we learn FFSMs by merging the matched models into a unified family model where differences are indicated by feature constraints. To date, this is the first study to employ automata learning principles for learning family models. In Algorithm 2, we depict the pseudocode of the FFSMDiff algorithm and discuss the main changes that we incorporated to turn state-based model comparison into a family model learning algorithm.
First, to identify the landmarks between product models, we adapted the identifyLandmarks() function from Algorithm 1 to take the state pair (s_0^r, s_0^u) as a default landmark. Then, we search for all pairs likely to be equivalent, given the threshold t and ratio r parameters. The state pairs satisfying the threshold t and ratio r are returned to the KPairs set of matched state pairs (i.e., commonalities), as shown in Line 3. Since the state pair (s_0^r, s_0^u) is taken by default as an initial landmark, any other pair involving s_0^r or s_0^u is discarded. Additionally, this eliminates the risk of having an empty KPairs set, which was handled in Lines 4-6 of Algorithm 1.
Second, we employ the resulting sets Add and Rem to identify product-specific states, which receive a presence condition indicating the particular product they are associated with, and the set Kpt to indicate the matched states, which are annotated with the disjunction of both simplified configurations. Conditional transitions departing from matching states that also match I/O labels must be unified; otherwise, they are represented by distinct transitions with their respective simplified configurations. The process of matching and annotating states and transitions is indicated by the mergeAndAnnotate() function shown in Line 16. Like LTSDiff, FFSMDiff has a worst-case complexity of O((|S_r| × |S_u|)^3) which, in practice, is often lower due to the sparse nature of the produced matrices.
In the next section, we formally describe how this mergeAndAnnotate() process is performed to learn a new FFSM from two products. Afterwards, we extend this idea to the task of incorporating new product-specific behavior into an existing FFSM.

Learning a New FFSM
Let M_r = ⟨S_r, s_0^r, I_r, O_r, D_r, δ_r, λ_r⟩ and M_u = ⟨S_u, s_0^u, I_u, O_u, D_u, δ_u, λ_u⟩ be the FSMs of two products p_r and p_u that implement the configurations ξ_r = (⋀_{f ∈ p_r} f) ∧ (⋀_{f ∉ p_r} ¬f) and ξ_u = (⋀_{f ∈ p_u} f) ∧ (⋀_{f ∉ p_u} ¬f). To learn a new FFSM from M_r and M_u, there are two assumptions: (i) M_r and M_u are FSMs built a priori (e.g., using automata learning (Angluin 1987; Vaandrager 2017)), and (ii) their respective feature model and configurations ξ_r and ξ_u are known a priori. To learn a new FFSM from the two product FSMs, we proceed as follows:
Definition 6 (FFSM learned from M_r and M_u) An FFSM learned from M_r and M_u is a septuple FF = ⟨F, Λ, C, c_0, Y, O, Γ⟩ where:
– F = p_r ∪ p_u is the set of features implemented by the two products,
– Λ = {ξ_r, ξ_u} is the smallest set composed of the two configurations,
– C ⊆ (S_r ∪ S_u ∪ (S_r × S_u)) × B(F) is the set of conditional states, where matched states are unified into one conditional state annotated with the disjunction of the two configurations and unmatched states become distinct conditional states annotated with their individual configuration (6.1),
– transitions of matched states with identical input/output labels are unified into one conditional transition (6.2); otherwise, for two transitions (s_i, x) ∈ D_r and (s_j, y) ∈ D_u, there are two independent conditional transitions, one for each configuration (6.3).
Condition (6.1) ensures that product states are either unified into one conditional state or kept as two distinct conditional states, annotated with the disjunction or the individual configurations, respectively. Condition (6.2) denotes when two transitions shall be unified due to their matching labels and conditional states. Finally, Condition (6.3) describes the case where two transitions cannot be merged and hence there are two distinct conditional transitions, one for each configuration.
To guarantee the mapping between the initial states of the products, we set the state pair (s_0^r, s_0^u) as the initial conditional state of the learned FFSM model. This state pair also helps to steer the identification of commonalities between product FSMs. To reduce the complexity of feature constraints, the product configurations are simplified by discarding the core features (Benavides et al. 2010) expressed in their associated formulas.
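A minimal sketch of the mergeAndAnnotate() step for two product FSMs is shown below: states matched by the comparison become shared conditional states, and every transition is annotated either with its own (simplified) configuration or with the disjunction of both when it occurs in both products. Feature constraints are kept as plain strings and the two tiny machines and configurations are assumptions for illustration; the actual implementation manipulates propositional formulas over the feature model via SAT4J.

```python
def merge_and_annotate(trans_r, trans_u, xi_r, xi_u, key_pairs):
    """Merge two product FSMs into one set of conditional transitions.
    `trans_*` map (state, input) -> (next_state, output); `key_pairs` are the matched state
    pairs found by the structural comparison; `xi_*` are the simplified configurations."""
    match_r = {a: (a, b) for (a, b) in key_pairs}   # rename matched states to the shared pair
    match_u = {b: (a, b) for (a, b) in key_pairs}
    conditional = {}                                # (src, input, output, dst) -> constraint

    def add(src, x, y, dst, cfg):
        key = (src, x, y, dst)
        existing = conditional.get(key)
        conditional[key] = f"({existing}) or ({cfg})" if existing and existing != cfg else cfg

    for (src, x), (dst, y) in trans_r.items():
        add(match_r.get(src, src), x, y, match_r.get(dst, dst), xi_r)
    for (src, x), (dst, y) in trans_u.items():
        add(match_u.get(src, src), x, y, match_u.get(dst, dst), xi_u)
    return conditional

M_r = {("Start", "exit"): ("Start", 1)}
M_u = {("Start", "exit"): ("Start", 1), ("Start", "save"): ("Save", 1)}
print(merge_and_annotate(M_r, M_u, "W and not S", "W and S", {("Start", "Start")}))
```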
Example 5 (FFSM learned from two product configurations) In Fig. 6, we depict a fragment of the FFSM learned by comparing and merging the two product FSMs shown in Figs. 3 and 5.
In this example, the states Pong Game and Bowling Game were merged into one state Bowling*Pong, where there is one conditional transition with input symbol Exit for each configuration. The feature constraint (W ∧ ¬S ∧ ¬B ∧ ¬N) is an example of a simplified configuration for the product in Fig. 3.

Including New Product Behavior into an Existing FFSM
Let the model FF_r = ⟨F_r, Λ_r, C_r, c_0^r, Y_r, O_r, Γ_r⟩ be an FFSM learned from a set of product configurations Λ_r. If the FFSM FF_r does not include the behavior of a product FSM M_u = ⟨S_u, s_0^u, I_u, O_u, D_u, δ_u, λ_u⟩ specifying a configuration ξ_u ∉ Λ_r, a new FFSM FF that includes the product behavior of ξ_u can be learned by matching and merging the models FF_r and M_u.
To include a new product into an existing FFSM, we adapted Definition 6 to compare product models against FFSMs and introduce another definition of how an existing family model incorporates novel product behavior described in terms of a product FSM. There are three required assumptions: (i) FF_r and M_u are state machine models built a priori, (ii) the configuration ξ_u is known in advance, and (iii) the FSM and FFSM under learning share a feature model that is known a priori. To include new product-specific behavior into an existing FFSM, we proceed as follows:
Definition 7 (FFSM learned from FF_r and configuration ξ_u) An FFSM learned from FF_r and M_u is a septuple FF = ⟨F, Λ, C, c_0, Y, O, Γ⟩, where FF_r is a reference FFSM and M_u is the FSM specifying an updated product p_u, and where:
– F = F_r ∪ p_u is the set of features in FF_r and implemented by p_u,
– Λ = Λ_r ∪ {ξ_u} is the set of configurations in FF_r and implemented by p_u,
– C ⊆ (S_r ∪ S_u ∪ (S_r × S_u)) × B(F) is the set of conditional states, built analogously to Definition 6 by matching the conditional states of FF_r against the states of M_u.
In addition to the procedure of including new product behavior, we have also extended our FFSMDiff algorithm (Damasceno et al. 2019b) to identify the sets of transitions added, removed, and kept. Thus, we can quantify the behavioral overlap between an updated product model M_u and a reference FFSM FF_r. To identify the amount of behavioral overlap, we use the concept of precision between the two models M_u and FF_r in terms of their sets of common transitions (Walkinshaw and Bogdanov 2013).
Family model learning has been proposed as an approach to build featured finite state machine models and learn presence conditions indicating feature-specific and product-specific behavior in terms of conditional states and transitions (Damasceno et al. 2019b). Since family models are expected to represent all product line variants within the same artifact (Schaefer et al. 2012), the most straightforward method would be to analyze all valid configurations in a brute-force fashion. However, this is only feasible for product lines that do not have too many members (Thüm et al. 2014). To address this issue, we present an approach that employs product sampling in family model learning.

Incorporating Product Sampling into Family Model Learning
Analysing an SPL on a product-by-product basis is very demanding and cumbersome as there is a huge number of possible product configurations. Thus, we propose to incorporate product sampling into family model learning so that it can be run without the need for exhaustive learning.
For large SPLs, a typical software analysis approach is to sample product configurations such that reasonable statements on the behavior of the entire product line are possible (Thüm et al. 2014; Varshosaz et al. 2018). Product sampling techniques, such as T-wise (Perrouin et al. 2010; Johansen et al. 2011), shall collectively cover the behavior of a product line. Hence, they should pave the way for learning family models with reasonable precision and at a cost lower than that of exhaustive analysis.
In our approach, we assume there is an arbitrary product sampling technique sample(), such as the Chvatal algorithm (Johansen et al. 2011), that generates a minimal subset of configurations C_smpl to be considered during learning. In Algorithm 3, we depict our approach for learning family models by sampling.
Let C_smpl = {ξ_1, ξ_2, ..., ξ_n} be the list of product configurations under learning, sampled by an arbitrary product sampling technique sample(). In the family model learning by sampling process, we start by building a new family model FF_1 from the two product models M_1 and M_2. Second, given the initial FFSM FF_1, learning by sampling enters an iterative stage where novel product-specific behavior, expressed in terms of a state machine M_{j+1}, is included in a partial family model FF_j. This family model is said to be partial as it describes only a subset of the valid product instances. Again, the product FSMs M_{j+1} may be nonexistent and hence buildFSM() may be required to build them. In the buildFSM() step, product-specific FSM models can be either hand-crafted or built by means of automata learning (Angluin 1987; Vaandrager 2017). At the end of this iterative stage, a family model FF_n learned from all product configurations ξ_i ∈ C_smpl has been constructed. To evaluate the benefits of the product sampling criteria, in the next section we present a few experiments to quantify the precision of the models learned by sampling.
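Algorithm 3 can be summarized by the loop below. All five arguments are assumed to be supplied by the caller: sample() implements a criterion such as T-wise, build_fsm() hand-crafts or learns a product FSM, learn_ffsm() corresponds to learning a new FFSM from two products (Definition 6), and include_product() to extending an existing FFSM (Definition 7); the function names are ours and the sample is assumed to contain at least two configurations.

```python
def learn_family_model_by_sampling(feature_model, sample, build_fsm, learn_ffsm, include_product):
    """Sketch of family model learning by sampling (Algorithm 3)."""
    configs = sample(feature_model)                  # e.g., Chvatal-based T-wise sampling
    fsms = [build_fsm(cfg) for cfg in configs]       # hand-crafted or learned product FSMs
    family_model = learn_ffsm(fsms[0], configs[0], fsms[1], configs[1])
    for cfg, fsm in zip(configs[2:], fsms[2:]):      # fold the remaining products, one at a time
        family_model = include_product(family_model, fsm, cfg)
    return family_model
```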

Empirical Evaluation
Several studies have underscored the importance of feature interaction coverage in product sampling (Varshosaz et al. 2018). Thus, we designed a set of experiments to analyze our family-based learning technique with the purpose of evaluating its effectiveness in learning succinct models using exhaustive analysis. We then extended our investigation to evaluate whether feature interaction coverage criteria can alleviate the cost of family model learning and still collectively cover the behavior of SPLs. In particular, we applied the T-wise coverage criteria and analyzed the precision of models learned by sampling (Perrouin et al. 2010; Johansen et al. 2011).
In this section, we present the context of our experiment, selected variables, formulated hypotheses, experiment design, and subject systems. Next, we present the analysis and interpretation of results and the threats to validity of our empirical evaluation. We close this section by discussing the implications and limitations of our study. This section is organized based on recommendations by Wohlin et al. (2012). For the sake of reproducibility, we have made a web page describing the artifacts (e.g., source code, scripts, FFSMs, FTSs, FSMs, feature models) used and generated in this study available at https://damascenodiego.github.io/learningFFSM/. The repository has been structured based on recommendations by Mendez et al. (2020).

Methodology
According to Thüm et al. (2014), the effectiveness of family-based analysis should be mainly influenced by the number of features, the size of feature implementations (e.g., modeling and coding artifacts), and the amount of reuse among configurations, rather than the number of valid configurations. Therefore, for our technique to qualify as an effective family-based learning technique, we expect to learn succinct FFSMs where states and transitions are annotated with simplified configurations.
By succinct, we mean that the learned FFSMs are smaller than the products under learning and the hand-crafted models, especially if there is a high degree of feature sharing. By simplified, we mean that product configurations are modified by discarding core features from feature constraints using SAT solvers (Le Berre and Parrain 2010).
Additionally, we expect that family models learned by sampling product configurations shall collectively cover the behavior of a product line and be at least as precise as those models recovered by exhaustive analysis. Thus, we designed a set of experiments to measure the succinctness and precision of the learned family models and answer our research questions.
In Table 4, we present our hypotheses for each proposed research question. The null and alternative hypotheses are, respectively:
– RQ1: The size of learned FFSMs is equal to the total size of the pairs of products under learning / The size of learned FFSMs is smaller than the total size of the pairs of products under learning;
– RQ2: The learned FFSMs are larger than hand-crafted models / The learned FFSMs have at most the same size as hand-crafted models;
– RQ3: The size of learned FFSMs is not influenced by configuration similarity / The size of learned FFSMs is influenced by configuration similarity;
– RQ4: The FFSMs learned by sampling configurations are less precise than those learned by exhaustive analysis / The FFSMs learned by sampling configurations can be as precise as those learned by exhaustive analysis.
As a measure of succinctness, we used the size of the FFSMs learned from product pairs. We describe size in terms of the number of transitions, as it is one of the factors that influence the complexity of model-based techniques (Broy et al. 2005; Baier and Katoen 2008) and that is used to interpret the language and structure of FSMs (Walkinshaw and Bogdanov 2013). To complement our analysis, we also measured the number of states.
To measure statistical significance, we used the Mann-Whitney test to check whether there was a significant difference (p < 0.05) between the sizes of the learned FFSM and the reference model, i.e., the product pair or the hand-crafted family model. To measure scientific significance (Kampenes et al. 2007; Arcuri and Briand 2011), we used the Vargha-Delaney Â effect size (Vargha and Delaney 2000; Wohlin et al. 2012) to assess the probability of the learned FFSMs being more succinct than the reference model. If Â < 0.5, the learned FFSM is smaller than the pair of products. If Â = 0.5, they have equivalent sizes. To categorize the magnitude of the Â effect size, we used the intervals for the distance between Â and 0.5 implemented in the effsize package (Hess and Kromrey 2004; Torchiano 2017): negligible < 0.147 ≤ small < 0.33 ≤ medium < 0.474 ≤ large.
As a measure of configuration similarity, we applied the Hamming distance between product configurations with respect to the normalized number of common selected and unselected features (Al-Hajjaji et al. 2017). Thus, we analyzed the impact of configuration similarity on family model succinctness by calculating the Pearson correlation coefficient between the ratio of the size of the learned FFSM to the total size of the product pair, on one hand, and the similarity between configurations, on the other hand.
As a measure of precision, we used the concept of model precision proposed by Walkinshaw and Bogdanov (2013) for evaluating the performance of reverse engineering techniques. Inspired by their observations, we gradually changed our parameter values to constrain the possibilities of matches and applied the same values to all product lines. If product states were too homogeneous, with many similar transitions, we increased the thresholds to make sure that we started from state pairs that clearly stood out, as Walkinshaw and Bogdanov (2013) indicate. We set the learning parameters as follows: the attenuation ratio k = 0.5, the threshold for most equivalent pairs t = 0.4, and the ratio for best matches r = 1.4.

Experiment Design
To answer RQ1 and RQ2, we implemented the FFSMDiff algorithm on top of the LearnLib framework (Isberner et al. 2015) for dealing with the state machine models, the SAT4J solver (Le Berre and Parrain 2010) for feature model analysis, the FeatureIDE library (Thüm et al. 2014) for product sampling and configuration similarity, and the Apache Commons Mathematics library (Apache 2016) for solving the systems of linear equations. We used the FFSMDiff implementation to combine the FSM models into FFSMs for all pairs of product configurations. Then, we checked whether there were significant and relevant differences between the sizes of the learned FFSM, the pair of products under learning, and the hand-crafted models. In Fig. 7, we illustrate our experiment to answer RQ1 and RQ2.
To answer RQ3, we normalized the size of the learned FFSMs by the total size of the product pairs, yielding values in the interval from 0.5, if both product FSMs are equivalent, to 1.0 otherwise. Based on this normalized size, we calculated the Pearson correlation coefficient between the normalized size of learned FFSMs and configuration similarity to measure the impact of the similarity between product configurations on the size of learned family models. For the statistical analysis, we used the R statistical package (RStudio 2019).
To answer RQ4, we used the FeatureIDE workbench (Thüm et al. 2014) to generate subsets of valid products satisfying the feature-wise (a.k.a. 1-wise), pair-wise (a.k.a. 2-wise), 3-wise, 4-wise, and all-valid-configurations criteria. In particular, we used the Chvatal algorithm (Chvatal 1979) to perform T-wise product sampling (Johansen et al. 2011), which is available in the FeatureIDE workbench (Thüm et al. 2014). In Fig. 8, we illustrate our experiment to answer RQ4. Let {ξ_0, ξ_1, ..., ξ_m} ⊆ B(F) be a subset of valid configurations generated by some arbitrary sampling criterion, sorted by configuration similarity (Al-Hajjaji et al. 2017). For each sampled subset, we iteratively learned partial FFSMs by merging the partial family model learned from the configurations ξ_0, ..., ξ_{j−1} with the next configuration ξ_j. To evaluate the precision of the partial family models learned by sampling, we used FFSMDiff to measure how many transitions of all valid products were included in the FFSMs learned by sampling. The same t, k, r parameters used for learning family models were used to calculate the precision of models learned by sampling.

Subject Systems
In order to evaluate our hypotheses, we searched the literature on model-based SPL analysis, existing open source projects, and benchmarks for subject systems accompanied by 1) a feature model, 2) models of individual products, and, preferably, 3) a behavioral family model. Items 1 and 2 form the basis for the application of our technique, and item 3 was used to evaluate our learning technique against the provided models.
We selected 105 Mealy machines derived from six abstract representations of SPLs (Hafemann Fragal et al. 2017; Classen 2010; Samih et al. 2014; Devroey et al. 2015; 2016). While one of these abstract representations has already been made available as a set of FSMs (Hafemann Fragal et al. 2017), the other five sets of FSMs had to be hand-crafted from LTS models instantiated from academic benchmarks of SPLs (Classen 2010; Samih et al. 2014; Devroey et al. 2015; 2016). In Table 5, we present the SPLs in terms of their numbers of features, valid configurations, and total numbers of states and transitions in their family models.
To instantiate these product FSMs, we used the VIBeS tool (Devroey and Perrouin 2014) to derive LTSs for every valid product of each SPL. For each LTS state, we created one FSM state. For every valid input of an LTS state, we added an FSM transition returning 1. For every missing transition, we added a self-loop transition returning 0. This process was hand-crafted by the first author of this paper, who has prior experience in modeling software systems for model-based testing (Damasceno et al. 2016, 2018) and automata learning (Damasceno et al. 2019a). Moreover, this process was partially automated using Bash and Python scripts that are included in our lab package.
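The LTS-to-FSM conversion described above amounts to completing each state with 0-output self-loops. The sketch below assumes the LTS is given as a transition dictionary, whereas in the study the derivation was done with VIBeS and the Bash/Python scripts of the lab package.

```python
def lts_to_complete_fsm(lts_transitions, states, input_alphabet):
    """Turn an LTS into a complete Mealy FSM: enabled actions become transitions with output 1,
    every missing input becomes a self-loop with output 0. `lts_transitions` maps
    (state, action) -> next_state."""
    fsm = {}
    for s in states:
        for x in input_alphabet:
            if (s, x) in lts_transitions:
                fsm[(s, x)] = (lts_transitions[(s, x)], 1)
            else:
                fsm[(s, x)] = (s, 0)    # complete missing behaviour with a 0-output self-loop
    return fsm

# a tiny hypothetical LTS over two states and actions {a, b}
lts = {("q0", "a"): "q1", ("q1", "b"): "q0"}
print(lts_to_complete_fsm(lts, ["q0", "q1"], ["a", "b"]))
```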
Although these are not fully realistic systems, we believe these academic benchmarks are more representative than random models, which may constitute very rare and special cases in which our techniques do not perform well. They comprise many non-trivial aspects, such as the existence of infinite behaviour, states with similar or identical behaviour in different products (Damasceno et al. 2019b), and distinct input alphabets (Classen 2010), which can make family model learning more difficult. Additionally, these models constitute widely used benchmarks for family-based analysis techniques (Classen et al. 2010; Asirelli et al. 2011; Classen et al. 2013; Beohar et al. 2016; Beohar and Mousavi 2016; Devroey et al. 2016; ter Beek et al. 2019b). Since the AGM has been discussed before, in the next sections we briefly present the other five subject systems used in this study.

The Vending Machine SPL
The Vending Machine (VM) is an SPL that we hand-crafted (Damasceno et al. 2019b) based on LTSs derived from a collection of illustrative examples of FTS models (Classen 2010). In Fig. 9, we depict the VM feature model.
In the VM SPL, product instances shall feature at least one and at most three beverages (i.e., Coffee - COF, Tea - TEA, and Cappuccino - CAP), support exactly one currency (i.e., Dollar - DOL or Euro - EUR), and can play one optional Ringtone - TON. The VM SPL constitutes an interesting case as it can derive FSMs with distinct input alphabets and languages. Among the main characteristics of the derived product FSMs, we highlight two main differences: the possibility of adding extra states for each beverage, and changes in the valid input symbols of the outgoing transitions departing from the initial state depending on the supported currency. Finally, the VM SPL also includes a "requires" relationship that is made explicit in the feature model and in its corresponding propositional formula.
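As a hypothetical illustration of how such a "requires" relationship is reflected in the propositional encoding (the concrete constraint of the VM feature model may differ), a constraint stating that Cappuccino requires Coffee would contribute the clause $\mathit{CAP} \Rightarrow \mathit{COF}$, i.e., $\lnot\mathit{CAP} \lor \mathit{COF}$.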

The Wiper System SPL
The Wiper System (WS) is another SPL that we hand-crafted (Damasceno et al. 2019b) based on models from the aforementioned collection of SPL examples (Classen 2010). In Fig. 10, we depict the feature model of the WS SPL.
Our WS SPL has two subsystems, the Sensor to detect rain and the Wiper itself, each available in two qualities (high and low), plus one optional feature for permanent movement, PermanentWiper. A high-quality sensor sHigh can discriminate between heavy and light rain, whereas a low-quality sensor sLow can only distinguish between rain and no rain. Similarly, the wHigh and wLow quality wipers can operate at two and one speeds, respectively. Each of these features leads to significant changes in the structure and language of the derived product FSM models.

The Aero UC5 SPL
The Aero UC5 (AEROUC5) model was originally presented by Samih et al. (2014) as a set of extended Markov models designed by engineers. It is an industrial situational awareness system for helicopters flying in degraded visual environments that has been employed as a benchmark in SPL research studies (Devroey et al. 2015, 2016). The AEROUC5 feature model originally comprised 25 features and more than 5 million valid configurations (VIBeS 2016a). We adapted the AEROUC5 SPL because only four features were used in the behavioural models of the products (VIBeS 2016a); hence, the original model yielded a large number of identical product models. We have thus restricted the feature model to only those four concrete features used in the behavioural models, as shown in Fig. 11.
Our adapted version of the feature model comprises four features related to displaying (i) real objects or (ii) 3D conformal visual cues on a head-tracked Helmet, and marking (iii) intended landing positions or (iv) obstacles on the ground using an Obstacle Warning System. This SPL is intended to be a more realistic subject, as it has been designed by engineers and is one of our largest behavioral models in terms of the number of states and transitions.

The Card Payment Terminal SPL
The Card Payment Terminal (CPTERMINAL) is another SPL originally designed as an FTS (Devroey et al. 2015, 2016). In Fig. 12, we depict the feature model for the Card Payment Terminal product line.
This product line has been defined by a software engineer based on EMV and PCI norms (Devroey et al. 2015, 2016). The Card Payment Terminal FTS describes the behavior of one terminal that accepts card payments with DirectDebit and/or CreditCard. It accepts a card owner authentication method (i.e., Signature and optionally PIN code), and a synchronous (Online) or asynchronous (Offline) connection to the payment service (VIBeS 2016b). The CPTERMINAL SPL also includes a "requires" relationship that is made explicit in its corresponding propositional formula. To derive FSMs, we used the same approach applied to the AEROUC5 SPL.

The Minepump SPL
The Minepump (MINEPUMP) product line has been presented by Classen et al. (2013). The purpose of this system is to keep a mine shaft clear of water while avoiding the danger of methane-related explosions. In Fig. 13, we show the feature model for the Minepump SPL.
It monitors the mine shaft using the WaterRegulator and MethaneDetect features. The system is activated once the water level reaches a preset threshold, but only if the methane is below a critical limit. Similarly to the AEROUC5 and CPTERMINAL SPLs, the FSMs for the Minepump SPL were derived from an FTS model (VIBeS 2016c).

Analysis of Results
In this section, we discuss the main results of our experiments in terms of the four defined RQs and the hypotheses shown in Table 4. For the sake of space, we only plot and highlight the main findings of our experiments. The full set of plots and tabulated results is available in our online repository (Nascimento Damasceno 2020). In the boxplots, the red dashed lines indicate the sizes of the original hand-crafted family models.

RQ1 - Is our Approach Effective in Learning Succinct Family Models with Respect to the Total Size of the Products Under Learning?
Regarding the succinctness of the learned FFSMs, we observed that, on average, all learned FFSMs presented fewer transitions than their respective pairs of products under learning. In Fig. 14, we show boxplots for the sizes of the learned FFSMs and the total size of the pairs of products under learning in terms of the number of transitions. The number of transitions of the original hand-crafted family model is indicated by a red dashed line.
In terms of the number of states, we also found that the learned FFSMs had fewer states than their pairs of products under learning. Figure 15 shows the boxplots for the numbers of states in the learned FFSMs and the total number of states in the pairs of products under learning.
To assess the statistical difference and significance of our results, we ran the Mann-Whitney test and computed Vargha-Delaney's Â effect size to check the significance (p < 0.05) and magnitude of the difference between the sizes of the learned FFSMs and the pairs of products under learning. In Table 6, we present the p-values and effect sizes comparing the sizes of our learned family models against the pairs of products under learning in terms of states and transitions.
As indicated by Figs. 14 and 15, as well as by Table 6, there were statistically significant differences between the sizes of the learned FFSMs and the pairs of products under learning. For the effect sizes, we also found that the differences had large magnitude. Thus, our results support the hypothesis $H_1^{RQ1}$ that the size of learned FFSMs is at most equal to the total size of the products under learning.
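For reference, this statistical comparison can be reproduced along the following lines; the sketch uses SciPy's Mann-Whitney test and a direct implementation of Vargha-Delaney's Â, and the size values are made up purely for illustration.

```python
from scipy.stats import mannwhitneyu

def vargha_delaney_a(xs, ys):
    """A-hat: probability that a value drawn from xs exceeds one drawn from ys,
    counting ties as 1/2."""
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)
    return wins / (len(xs) * len(ys))

ffsm_sizes = [30, 28, 35, 31]          # transitions in learned FFSMs (made-up values)
product_pair_sizes = [52, 49, 60, 55]  # total transitions in product pairs (made-up values)

u_stat, p_value = mannwhitneyu(ffsm_sizes, product_pair_sizes, alternative="two-sided")
a_hat = vargha_delaney_a(ffsm_sizes, product_pair_sizes)
print(p_value, a_hat)  # significance and magnitude of the size difference
```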

RQ2 - Is our Approach Effective in Learning Succinct Family Models with Respect to the Total Size of the Hand-Crafted Models?
To evaluate the succinctness of the learned FFSMs, we also compared the size of the hand-crafted models against the FFSMs learned from pairs of products. In Figs. 14 and 15, the size of the hand-crafted models is indicated by the red dashed lines. To compare the sizes of the hand-crafted models and the FFSMs learned from product pairs, we used the Mann-Whitney test and the Â effect size. Table 7 shows the results for the Mann-Whitney test and effect size comparing the size of the learned FFSMs against the size of the hand-crafted models.
By analyzing the results of the Mann-Whitney test, we found statistically significant differences (p < 0.01) between the sizes of the FFSMs learned from all SPLs. The Vargha-Delaney effect sizes indicated differences of large magnitude, where FFSMs learned from product pairs included fewer transitions than their hand-crafted versions. These findings persisted for the number of states, except for the VM SPL, where we found a difference of small magnitude between the numbers of states of the FFSM models learned from product pairs and the hand-crafted one. Thus, our results support the hypothesis $H_1^{RQ2}$ that learned FFSMs have at most the same size as hand-crafted FFSMs.

RQ3 - Is the Size of Learned Family Models Influenced by the Configuration Similarity Degree of the Products Under Learning?
In addition to comparing the size of the learned FFSMs against the size of the products under learning, we analyzed the relationship between learned family model size and configuration similarity using Pearson's correlation coefficient. In Fig. 16, we show scatter plots of the configuration similarity degree against the size of the learned FFSMs for all pairs of products of each SPL.
A configuration similarity equal to 1.0 means that both products have the same feature configuration. A ratio of 0.5 between the size of the learned FFSM and the total size of the products means that the analyzed products implement equivalent behavior; larger ratios indicate variability expressed by mismatching transitions.
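A minimal sketch of this analysis is given below; it correlates configuration similarity with the normalized FFSM size using SciPy's Pearson correlation, with made-up values purely for illustration.

```python
from scipy.stats import pearsonr

# One entry per product pair: configuration similarity and the ratio
# |learned FFSM| / (|product 1| + |product 2|); the numbers are made up.
similarities = [1.0, 0.9, 0.7, 0.5, 0.3]
size_ratios  = [0.50, 0.55, 0.68, 0.80, 0.93]

r, p = pearsonr(similarities, size_ratios)
print(r, p)  # a strongly negative r means: more similar configurations, smaller FFSMs
```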
By analyzing the Pearson correlation coefficient, we found strong negative correlations between FFSM size and configuration similarity for the VM, WS, AEROUC5, and MINEPUMP product lines; a very strong negative correlation for the AGM product line; and a moderate negative correlation for the CPTERMINAL product line. These results indicate that FFSMs learned from product models with high configuration similarity tend to be smaller than those built from products implementing distinct sets of features. Therefore, our results support the hypothesis $H_1^{RQ3}$ that the size of FFSMs is influenced by configuration similarity, and our approach can exploit common features and produce more succinct FFSM models when these are prone to behavioral similarity.

RQ4 - Can Family Models Learned by Sampling be as Precise as Those Learned by Exhaustive Analysis?

For each T ∈ {1, 2, 3, 4}, we used the T-wise sampling criterion to generate subsets of valid product configurations and learn FFSM models by sampling. To evaluate the precision of learning by sampling, we used the all-valid-configurations criterion to derive all product FSMs and build reference FFSMs for each SPL. In Table 8, we report the sizes of the subsets of products generated by each configuration sampling criterion.
To evaluate the precision of the models learned by sampling, we used FFSMDiff to measure the proportion of transitions of the reference models (i.e., the individual FSMs of all valid products) that are also present in the analyzed models (i.e., those learned by sampling). Thus, a precision equal to 1 indicates that all transitions from all valid products are included in the FFSM learned by sampling. Figure 17 shows the precision of the FFSMs learned by each sampling criterion compared against the full set of models from all valid products.
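Under the same simplified FFSM representation used in the earlier sketch (transitions annotated with the configurations exhibiting them), this precision metric could be approximated as follows; note that FFSMDiff actually evaluates each transition's feature constraint, which the membership test below does not capture.

```python
def precision(ffsm, all_products):
    """ffsm: dict mapping a transition to its presence condition (here, a set of
    configurations); all_products: dict mapping each valid configuration to the
    transitions of its product FSM. Returns the share of product transitions
    that also occur in the learned FFSM."""
    total = covered = 0
    for cfg, transitions in all_products.items():
        for tr in transitions:
            total += 1
            covered += tr in ffsm  # simplified check; FFSMDiff evaluates feature constraints
    return covered / total if total else 1.0
```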
As our results indicate, model precision turned out to be higher for larger values of T. For most of the product lines, excluding the AGM, we found a significant difference between the models learned by feature-wise sampling and by the all-valid-configurations criterion, i.e., exhaustive analysis. By comparing exhaustive analysis against feature-wise sampling, we found effect sizes categorized as medium to large, with the exhaustive criterion reaching higher precision.
Higher interaction strengths are known for their improved fault detection capabilities (Steffens et al. 2012; Petke et al. 2013). Our results corroborate these findings, as they indicate that family models learned by 3- and 4-wise sampling tend to be more precise than those built by feature-wise and pair-wise sampling.
As shown in Table 8, the 3- and 4-wise sampling criteria generated the same number of configurations as the exhaustive criterion for the AGM, WS, and AEROUC5 product lines. Thus, similar precision levels should be expected. The results for the Mann-Whitney and Vargha-Delaney tests corroborate these findings, as they indicated no significant difference between the precision of models learned by the 3-wise, 4-wise, and all-valid-configurations sampling criteria.
For the VM, CPTERMINAL, and MINEPUMP product lines, we found that models learned by 3-wise and 4-wise sampling reached precision levels similar to those learned by using the exhaustive criterion. For these product lines, we found either no significant differences or effect sizes categorized as negligible to small between the models learned by the 3-wise, 4-wise, and all-valid-configurations sampling criteria. These findings indicate that product sampling can help reduce the costs of recovering family models from product families without analysing all valid products. For the AEROUC5 and MINEPUMP product lines, we found that FFSMs learned by exhaustive analysis did not reach precision levels equal to 1. We attribute this to a possibly high number of state pairs with equal scores returned by the identifyLandmarks() function. As a consequence, our algorithm found multiple possible mappings between state pairs, and the selected pairs deemed some transitions as removed, which affected precision. These results support our hypothesis $H_1^{RQ4}$ that FFSMs learned by sampling can be at least as precise as those learned by running exhaustive analysis.

Threats to Validity
In this section, we discuss the threats to the validity of the methods used in this paper. To do so, we follow the recommendations by Wohlin et al. (2012).

Conclusion Validity
These threats concern the relationship between the treatment and the outcome of our investigation. To avoid the risk of violating assumptions of statistical tests, we opted for the nonparametric Mann-Whitney statistical test. Despite these actions, there are still threats to conclusion validity due to the risk of random heterogeneity in our subject systems, as the majority of them are academic models.

External Validity
These concern the generalization of our results to industrial SPLs. Our results are based on six subjects, only one of which is inspired by a real system (Samih et al. 2014); the small number of real product lines and the fact that most feature models did not have complex constraints pose a threat to external validity. Another threat to external validity is the variability inherent to the valid products of our subject systems. For some of our SPLs, the behavioral difference between products made exhaustive analysis the only criterion able to recover fully precise family models. In such cases, sampling techniques may not be applicable and, hence, configuration prioritization (Henard et al. 2014) may be required. The impact of prioritization techniques on family model learning is out of the scope of this study.

Internal Validity

These threats concern issues that may indicate a causal relationship when there is none. As the validity of experiments is highly dependent on the reliability of the measures and the treatment implementation, we designed our experiments on top of three widely used tools for state-machine learning (Raffelt and Steffen 2006), SAT solving (Le Berre and Parrain 2010), and SPL analysis (Thüm et al. 2014). The number of product models and the diverse characteristics of the academic benchmarks used in our study support the internal validity of our results.

Construct Validity
These concern the ability to draw correct conclusions about the treatment and the outcomes. Two factors that pose threats to construct validity are the nature of the hand-crafted FFSMs used as ground-truth models and the subsets of product configurations sampled using T-wise criteria. Highly experienced modellers are able to produce more concise representations and better subsets of product configurations than less experienced professionals. In addition, configuration subsets sampled by T-wise criteria may still be large compared to the set of all valid products. In these cases, domain-specific expertise may be useful to optimize family model learning. The fact that the modeller in our case was an expert both in SPLs and in the formal modelling language mitigates this risk for our results.

Discussion
What are the implications for practitioners and researchers? While exhaustive learning may be suitable for small product lines, it becomes impractical in large SPL projects. Our proposal aims to recover domain-level artifacts (i.e., family models) from application-level artifacts (i.e., finite state machines). Thus, we believe that our technique can enable model-based analysis techniques, such as regression testing, performance analysis, and product sampling, in cases where family models are missing or incomplete. To employ our technique, we expect engineers to have skills in model learning, reverse engineering, and feature model analysis.
In regression testing, family model learning could be employed to support test suite optimization and reduce the potentially large number of test cases generated by product-based techniques (Fragal et al. 2018). In performance analysis, family model learning could be employed to support non-functional testing by incorporating conditional probabilities for family-based probabilistic model checking (Varshosaz and Khosravi 2013) and conditional time guards for stochastic real-time analysis of software product lines (Luthmann et al. 2017). In product sampling, iterative techniques, such as IncLing (Al-Hajjaji et al. 2016), could incorporate partial family models to check for feature interactions, e.g., by testing whether richer product variants subsume the behavior/properties of their constituents without unexpected behavioral changes, i.e., interaction problems (Varshosaz et al. 2018).
What types of systems may it work or not work for? Our learning-by-sampling technique assumes that the sampled products collectively cover the behavior of the product family and have their models specified a priori. If there is no such behavioral overlap, family models learned by sampling may never be precise enough and exhaustive learning may be required. In that case, an iterative sampling process could be employed to prioritize product configurations for learning novel, unseen behavior and to test whether a partial family model already includes the behavior of a given product.
Regarding the size of product models, the worst-case complexity indicates a cubic growth in the cost of learning. However, due to the sparse nature of the produced matrices, the FFSMDiff algorithm tends to scale well. Figure 18 shows the distribution of the times required to learn an FFSM for all pairs of product models. The average time to learn from products with a total size of 40 states lies below 3000 milliseconds; these results corroborate those of Walkinshaw and Bogdanov (2013), whose algorithm was comparatively cheap.
How are the different notions of variability represented? Currently, our approach annotates states and transitions using the disjunction of simplified configurations. As a result of this design decision, the representation of feature constraints is limited to a unique format (i.e., an OR of ANDs). To overcome this limitation, more sophisticated presence-condition simplification techniques (von Rhein et al. 2015) could be used to reduce the complexity of feature constraints. Other possible solutions are the use of feature model refactoring and specialization (Benavides et al. 2010) to come up with the constraints for conditional states and transitions.
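To illustrate why such simplification helps, the sketch below builds a presence condition as a disjunction of three full VM configurations (the concrete configurations are made up) and lets an off-the-shelf Boolean simplifier, here SymPy, reduce it; this is only an illustration, not the mechanism used by FFSMDiff.

```python
from sympy import symbols
from sympy.logic.boolalg import simplify_logic

COF, TEA, CAP, TON = symbols("COF TEA CAP TON")

# Disjunction of three (made-up) full configurations that all share COF and TON
presence = ((COF & ~TEA & ~CAP & TON) |
            (COF &  TEA & ~CAP & TON) |
            (COF & ~TEA &  CAP & TON))

print(simplify_logic(presence))  # e.g. COF & TON & (~CAP | ~TEA), a much shorter condition
```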
To which other models can this technique be applied? We expect that our technique can be applied to learn other behavioral models for software product lines, such as MTSs (Larsen and Thomsen 1988; Fischbein et al. 2006; Larsen et al. 2007; ter Beek et al. 2016; ter Beek et al. 2019a) and FTSs (Classen et al. 2013; Beohar and Mousavi 2016). For deterministic subsets of these two formalisms, we expect that our technique can be readily applied without much modification. Learning non-deterministic models, however, requires further investigation.

Related Work
In this section, we discuss our approach in terms of related work and how it can be helpful in the respective context. Studies related to ours are in the fields of state-machine learning, product sampling, family-based analysis, comparison of state models, reverse engineering feature models, and SPL evolution.

State-Machine Learning
As software requirements change and systems evolve, the lack of maintenance may render models outdated and incomplete (Walkinshaw 2013) and hamper the application of model-based techniques (Mariani et al. 2015). To tackle these issues, state-machine learning, also known as automata learning (Angluin 1987), has become a popular technique to automate the construction of behavioral models.
State-machine learning has been harnessed for black-box model checking (Peled et al. 1999), real-world protocols (Aarts et al. 2012; Fiterău-Broștean and Howar 2017), software evolution (Hungar et al. 2003; De Ruiter and Poll 2015), automatic test generation (Raffelt et al. 2009), and the generalization of failure models (Chapman et al. 2015; Kunze et al. 2016). For an overview of state-machine learning and its applications, we refer the reader to (Irfan et al. 2013; Stevenson and Cordy 2014; Aichernig et al. 2018). The problem of learning models from SPLs becomes more complex, as it has to cope with products and features that may have their own models, requirements, and code.
Our study improves upon the state of the art by evaluating the quality of models learned by sampling subsets of valid products. Thus, we pave the way for more efficient and precise family model learning approaches, a topic that is still understudied (Damasceno et al. 2019b).

Product Sampling for SPLs
Because the number of valid configurations usually grows exponentially with the number of features, the exhaustive analysis of SPLs is impractical (Thüm et al. 2014). To alleviate this issue, sampling techniques that provide subsets of all valid products are used to cover the behavior of SPLs and hence reveal most faults in all other products (Perrouin et al. 2010).
In our work, we used product sampling to generate subsets of valid configurations satisfying T-wise coverage. We used the Chvatal algorithm (Chvatal 1979) implemented in the FeatureIDE workbench (Thüm et al. 2014); this algorithm has been adapted by Johansen et al. (2011) for product sampling by generating all T-wise feature combinations. Incremental product sampling algorithms, such as IncLing (Al-Hajjaji et al. 2016), could also be employed for family model learning, but this has been left as future work.

Family-Based Analysis of SPLs
Family-based analysis operates on domain artifacts and incorporates knowledge about valid feature combinations, given a feature model. Thus, not every individual product has to be analyzed (Thüm et al. 2014), as opposed to traditional analysis strategies, which are influenced by the number of valid feature combinations (Brabrand et al. 2012). To achieve this goal, family-based analysis techniques rely on family models. For an overview of techniques for family-model analysis, testing, and modeling, we refer the reader to recent surveys (Thüm et al. 2014; Benduhn et al. 2015; Beohar et al. 2016).

Family models have been exploited as a theoretical foundation to perform efficient model-based testing of SPLs (Atlee et al. 2015; Beohar and Mousavi 2016), family model checking (Sabouri and Khosravi 2013; ter Beek et al. 2017), to automate the generation of specifications for individual products (Asirelli et al. 2012), to efficiently validate families of products (Hafemann Fragal et al. 2017), and to describe fine-grained differences among product variants (Schaefer et al. 2010). We believe that our approach is complementary to the aforementioned techniques, as it can give insights into optimizing family model learning in scenarios with a large number of valid product configurations. Our technique is discussed in terms of FFSMs, but it can be extended to other family-based notations, such as FTSs (Classen et al. 2013; Beohar and Mousavi 2016), as FSMs can be represented as a variant of LTSs labeled with input/output pairs.

Comparison of State-Based Models
The comparison of FSMs is an important task in software engineering (Walkinshaw and Bogdanov 2013), e.g., for conformance testing (Broy et al. 2005) and for the performance analysis of state-machine learning techniques (Angluin 1987; Vaandrager 2017). Studies related to ours are by Damasceno et al. (2019b), Nejati et al. (2012), and Walkinshaw and Bogdanov (2013).

Damasceno et al. (2019b) introduced an approach to compare product FSMs and build family models. In this paper, we evaluate how product sampling can help to reduce the costs of learning family models by sampling product configurations. Product lines may have an exponential number of valid configurations and, hence, sampling techniques can help reduce the effort required to recover family models.

Nejati et al. (2012) presented an approach for matching and merging Statecharts (Harel 1987). Their approach relies on two operators, one for matching and one for merging transitions. The former uses static and behavioral properties to match state pairs. The latter produces a combined model in which variant behaviors are parameterized using guards on their transitions and temporal properties are preserved. The authors showed that relying on both operators produces higher precision than relying on either of them independently.

Walkinshaw and Bogdanov (2013) evaluated two approaches to compute the precise difference between LTSs in terms of their language and structure. To compare the languages of state-based models, the authors proposed an approach based on the proportion of test sequences (Chow 1978; Vasilevskii 1973) that are classified in the same way by two models M_r and M_u. Thus, performance metrics (e.g., precision, recall, and F-measure) can be used to compare the languages of LTS models. A major issue in comparing the languages of FSMs is that minor differences can mask structural similarities. To tackle this issue, the authors proposed an algorithm to compare the structure of FSMs. The two approaches are complementary, as two models may have a similar state transition structure but completely different languages, or vice versa.
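As a rough sketch of that language-level comparison (not Walkinshaw and Bogdanov's actual implementation), precision, recall, and the F-measure can be derived from how a set of test sequences is classified by a reference model and the model under comparison; the accepts predicates and the test sequences below are placeholders.

```python
def language_scores(test_sequences, accepts_ref, accepts_other):
    """Classify each test sequence with both models and derive precision,
    recall, and F-measure for the model under comparison."""
    tp = sum(1 for s in test_sequences if accepts_ref(s) and accepts_other(s))
    fp = sum(1 for s in test_sequences if not accepts_ref(s) and accepts_other(s))
    fn = sum(1 for s in test_sequences if accepts_ref(s) and not accepts_other(s))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    f_measure = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_measure
```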
The family model learning process may face scalability issues in large SPLs as a result of the worst-case complexity required to solve the system of linear equations. Alternatively, search-based techniques could be used for product sampling (Ensan et al. 2012) and for matching and merging states and transitions of product and family models (Al-Khiaty and Ahmed 2017). Furthermore, expert knowledge (Varshosaz et al. 2018) or prioritization techniques (Henard et al. 2014) could also be incorporated to identify what products should be analyzed first. These are left as future work.

Reverse Engineering Feature Models
Feature models play a central role in the variability management of SPLs (Benavides et al. 2010). By using SAT solvers (Le Berre and Parrain 2010), feature models can be analyzed to detect invalid relationships or product configurations, core or dead features, and redundancies, and to enumerate or quantify all valid products of an SPL. Unfortunately, companies often develop software variants in an unstructured way and may lack feature models, as their construction is time-consuming and error-prone (Haslinger et al. 2011).
In this context, several approaches have been proposed to automatically build feature models from sets of product configurations (Haslinger et al. 2011; Ryssel et al. 2011; Al-Msie'deen et al. 2014). Approaches based on Formal Concept Analysis (FCA) show promising possibilities for reverse engineering feature models, as they can detect interdependencies and hierarchies between features (Al-Msie'deen et al. 2014).
Our proposal focuses on the problem of "reverse engineering" family models from sampled product configurations. In our study, we assume that the feature model is known a priori. However, we believe that our technique can be extended to cope with non-existent feature models and learn family and feature models at once, but the succinctness of the feature constraints may be compromised. Thus, investigations combining feature model and behavioral model learning are still required.

SPL Evolution
The tasks of SPL reengineering and refactoring are vital to the maintenance and evolution of their software products. For an overview of product line evolution, refactoring, and reengineering, we refer the readers to Laguna and Crespo (2013), Fenske et al. (2014), and Marques et al. (2019).
A large variety of artifacts have been considered in SPL evolution, but feature models are by far the most researched ones (Marques et al. 2019). Moreover, recent studies have shown that there is a need for reengineering approaches specifically tailored for agile processes (Marques et al. 2019) and for the migration of SPL paradigms (Laguna and Crespo 2013).
Several studies have investigated model learning techniques to cope with traditional software evolution and regression testing (Sery et al. 2015; Huistra et al. 2018). However, to the best of our knowledge, there is no work investigating model learning in the setting of SPLs. Combined with state-machine learning (Angluin 1987), we believe that our algorithm can support model-based regression testing of SPLs (Runeson and Engström 2012) and family model checking (Sabouri and Khosravi 2013; ter Beek et al. 2017) in agile processes (Neubauer et al. 2012).

Conclusion
In this paper, we present a technique for learning behavioral family models in terms of Featured Finite State Machines (FFSMs). Our technique builds upon a known feature model for a product line and its individually learned or hand-crafted finite state machines, corresponding to product-specific models with their respective known sets of features. We presented the FFSMDiff algorithm, which unifies these product models into an FFSM by employing a state-based model comparison technique and feature model analysis. Furthermore, we combined our approach with product sampling to reduce the cost of exhaustive learning and integrate sampled product models into an accurate family model.
We performed an empirical study of the effectiveness of our approach by analyzing the succinctness and the accuracy of the learned models. We showed that the learned family models are more succinct than the total size of the individual product models, particularly when there is a high degree of reuse among these products. In addition, we performed a set of experiments to investigate whether feature interaction criteria (e.g., T-wise) can alleviate the costs of family model learning by sampling valid products that collectively cover the behavior of product families. Our empirical analysis showed that family models learned by sampling can be as precise as those learned from exhaustive analysis. These results pave the way for reducing the costs of recovering family models from product lines.
This paper extends our previous conference publication (Damasceno et al. 2019b) by including three extra models in our empirical evaluation. Our results corroborate our previous findings, where product models were effectively merged into succinct FFSMs with fewer states, especially if there is high feature sharing among products. Also, the integration of product sampling into the learning process and the empirical study of the accuracy of the learned models in this respect are novel in the present paper.
As future work, we envision investigating three problems: how to incorporate family models in adaptive model learning, how to learn family models incrementally, and how to improve the readability of our family models.
Adaptive model learning is a variant of automata learning (Angluin 1987) that attempts to reuse input sequences from existing models to speed up state coverage and identification (Huistra et al. 2018; Damasceno et al. 2019a). We believe that the performance of automata learning algorithms could be improved by reusing partial family models describing subsets of valid products, in a similar fashion to standard adaptive model learning.
For incremental family model learning, we believe that search-based or interactive techniques could be used to recommend product configurations to be analyzed and pave the way for an incremental family model learning framework. Incremental product sampling algorithms, e.g., IncLing (Al-Hajjaji et al. 2016), could be employed in combination with model-based testing techniques to test-and-learn the behavioral variability of black-box product instances and incorporate new product behavior into partial family models.
Finally, to improve the readability of the learned family models, we aim to investigate alternative approaches for presence-condition simplification. Currently, our approach annotates conditional states and transitions using the disjunction of simplified configurations. As a result of this process, the representation of feature constraints is limited to a unique format (i.e., an OR of ANDs). To overcome this limitation, more sophisticated presence-condition simplification techniques (von Rhein et al. 2015) could be used to reduce the complexity of feature constraints. Alternatively, feature model refactoring and specialization (Benavides et al. 2010) could be employed to redesign constrained feature models as conditions of conditional states and transitions.

Fig. 3 Example of FSM (Hafemann Fragal et al. 2017)

Fig. 5 FSM of an alternative product from the AGM SPL

Fig. 6 Fragment of the FFSM learned for the AGM SPL

Fig. 14 Number of transitions in the learned FFSMs and pairs of products

Fig. 16 Scatter plots for the relationship between the normalized size of the learned FFSM and configuration similarity

Fig. 18 Average times for learning FFSMs from all product pairs

Table 3 Performance metrics for comparing FSMs

Table 5 Description of the SPLs under learning - feature and family models

Table 6 Results for the Mann-Whitney test and Vargha-Delaney's effect size: learned FFSM vs. products under learning

Table 7 Results for the Mann-Whitney test and Vargha-Delaney's effect size: learned FFSM vs. hand-crafted models

Table 8 Number of configurations in the subsets generated by each criterion