Optimizing Parallel Belief Propagation in Junction Trees using Regression

The junction tree approach, with applications in artificial intelligence, computer vision, machine learning, and statistics, is often used for computing posterior distributions in probabilistic graphical models. One of the key challenges associated with junction trees is computational, and several parallel computing technologies, including many-core processors, have been investigated to meet this challenge. Many-core processors (including GPUs) are now programmable; unfortunately, their complexity makes it hard to manually tune their parameters in order to optimize software performance. In this paper, we investigate a machine learning approach to minimize the execution time of parallel junction tree algorithms implemented on a GPU. By carefully allocating a GPU's threads to different parallel computing opportunities in a junction tree, and treating this thread allocation problem as a machine learning problem, we find in experiments that regression, specifically support vector regression, can substantially outperform manual optimization.


INTRODUCTION
Parallel processing is becoming increasingly important in all areas of computing, including in knowledge discovery and machine learning. This is to a large extent due to recent developments in hardware, and in particular a key difference between Moore's law and the clock frequency of CPUs. Moore's law, which states that the number of transistors that can be placed on an integrated circuit will increase exponentially, is as of 2013 still going strong. However, the clock frequency of CPUs has stalled, due to physical limits on heat dissipation in integrated circuits. As a consequence, computers are now multi-core (CPUs) or many-core (GPUs), and algorithms that take advantage of this fact will be at an advantage. The importance of parallel, and also distributed, computing in data mining is further increased by Big Data; the size of the data sets available for analytical processing has recently been increasing drastically.
In this paper, we discuss parallel computing for belief propagation in junction trees. Belief propagation (BP) over a junction tree can be used to compute posterior marginals in Bayesian networks (BNs) [9]. However, belief propagation is computationally hard, and the computational difficulty increases dramatically with the density of the BN, the number of states of each network node, and the BN treewidth, which is upper bounded by the generated junction tree [12]. This computational issue may hinder the application of BNs in cases where real-time inference is required. Parallelization of Bayesian network computation is a feasible way of addressing this computational issue [8,13,17,16,7,10,6,11,2]. These parallel BP algorithms are implemented on various state-of-the-art parallel computing platforms. However, due to the complexity of these modern platforms, the junction tree algorithm, and the way they interact with each other, it is not trivial to make parallel BP algorithms work efficiently on these platforms.
Many-core computers including GPUs, which are built around an array of processors running many threads of execution in parallel, are among the most popular platforms. However, it is non-trivial to optimize GPU programs. The challenges of GPU optimization for parallel BP include:
• Junction tree clique and separator sizes vary significantly, resulting in unbalanced workloads.
• The junction tree algorithm and a GPU platform have distinct sets of parameters. A mismatch between algorithm and platform parameters can result in poor performance.
• The relationship between the input workload and the output performance metrics is generally unknown.
• In the two-dimensional parallel BP algorithm (see Section 3), allocation of threads to each parallel dimension requires great care.
In this paper, we focus on minimizing the compute time of node-level parallel BP on many-core systems, in particular GPUs, using system performance modeling. We expand on research using GPUs to implement node-level parallelism of belief propagation [18] [6]. Our work builds on an existing parallel BP algorithm [18]. Our contributions include:
• We investigate another dimension of parallelism, arithmetic parallelism, and integrate it with element-wise parallelism [18].
• We use statistical models for GPU parameter optimization, resulting in an average cross-platform speedup of 10.70x (arithmetic average) or 8.68x (geometric average), as opposed to the 3.43x (arithmetic average) or 2.44x (geometric average) obtained previously [18].
• We propose two new metrics, "squared deviance" and "miss rate", to measure the quality of statistical models for our problem.
This work is relevant to data mining and machine learning in two distinct ways. First, we investigate parallel computation using probabilistic graphical models, in particular junction trees [1], which are applied in many data mining contexts, including Expectation Maximization. Second, in order to solve the central problem of thread allocation for parallel junction tree computation, we take a machine learning approach.
Our paper is organized as follows. In Section 2, we briefly review previous research and formulate the problem we are going to solve. In Section 3, we describe the two-dimensional parallel belief propagation algorithm, for which the GPU parameter optimization discussed in Section 4 is key to good performance. In Section 5, we describe our approach of using statistical models for system performance prediction and use it for GPU parameter optimization. Experimental results are discussed in Section 6. We conclude in Section 7.

Belief Propagation in Junction Tree
A BN is a compact representation of a joint distribution over a set of random variables $X$. A BN is structured as a directed acyclic graph (DAG) whose vertices are the random variables and whose directed edges represent dependency relationships among the random variables. The evidence in a Bayesian network consists of instantiated variables.
The junction tree algorithm propagates beliefs (or posteriors) over a derived graph called a junction tree. A junction tree is generated from a BN by means of moralization and triangulation [9]. Each vertex $C_i$ of the junction tree contains a subset of the random variables that forms a clique in the moralized and triangulated BN, denoted by $X_i \subseteq X$. Associated with each vertex of the junction tree there is a potential table $\phi_{X_i}$. With the above notation, a junction tree can be defined as $J = (T, \Phi)$, where $T$ represents a tree and $\Phi$ represents all the potential tables associated with this tree. Assuming $C_i$ and $C_j$ are adjacent, a separator $S_{ij}$ is induced on the connecting edge. The variables contained in $S_{ij}$ are defined to be $X_i \cap X_j$.
The computation of belief propagation can be measured by treewidth, which is defined as the minimum, over the junction trees of the BN, of the size of the largest clique, minus one. For a junction tree with treewidth $tw$, the amount of computation is lower bounded by $O(\exp(c \cdot tw))$, where $c$ is a constant.
Belief propagation is invoked when we get new evidence $e$ for a set of variables $E \subseteq X$. We need to update the potential tables $\Phi$ to reflect this new information. To do this, belief propagation over the junction tree is used. It is a two-phase procedure: evidence collection and evidence distribution. In the evidence collection phase, messages are collected from the leaf vertices all the way up to a designated root vertex. In the evidence distribution phase, messages are distributed from the root vertex to the leaf vertices. A recursive algorithm for collecting and distributing evidence is shown in Algorithms 1 and 2. In the following sections, we focus on parallelizing each message passing in Algorithms 1 and 2.
Algorithm 1: Collect Evidence(J, Ci)
  for each child of Ci do
    Message Passing(Ci, Collect Evidence(J, child))
  end for
  return(Ci)

Algorithm 2: Distribute Evidence(J, Ci)
  for each child of Ci do
    Message Passing(Ci, child)
    Distribute Evidence(J, child)
  end for
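The two recursive phases can be sketched in Python as follows. This is only an illustration under stated assumptions: the tree `TREE`, the vertex names, and the `message_passing` stub (which records the direction of each message instead of updating potential tables) are hypothetical and not taken from the paper's implementation.

```python
# Hypothetical junction tree: vertex -> list of children; "C1" is the root.
TREE = {"C1": ["C2", "C3"], "C2": ["C4"], "C3": [], "C4": []}

def message_passing(source, target, log):
    # Stand-in for reduction + scattering: only record (source, target).
    log.append((source, target))

def collect_evidence(tree, vertex, log):
    # Messages flow from the leaves up towards `vertex`.
    for child in tree[vertex]:
        collect_evidence(tree, child, log)
        message_passing(child, vertex, log)
    return vertex

def distribute_evidence(tree, vertex, log):
    # Messages flow from `vertex` down towards the leaves.
    for child in tree[vertex]:
        message_passing(vertex, child, log)
        distribute_evidence(tree, child, log)

def belief_propagation(tree, root):
    log = []
    collect_evidence(tree, root, log)
    distribute_evidence(tree, root, log)
    return log

messages = belief_propagation(TREE, "C1")
```

For this tree with M = 4 vertices, the sketch produces 2(M − 1) = 6 messages, three in each phase.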

Graphics Processing Units (GPU)
GPUs are designed for compute-intensive, highly parallel computations. In GPUs, more transistors are devoted to data processing rather than data caching and flow control. GPUs are especially well-suited to problems that can be expressed as data-parallel computations where data elements are mapped to parallel processing threads. GPUs are mainly used as accelerators for compute-intensive parts of an application, and are therefore attached to a host CPU that performs control-dominant computations.
The GPU is programmed using the CUDA programming framework [14]. An application is organized into a sequential host program that is run on a CPU, and one or more parallel GPU kernels that are run on a GPU.
In GPUs, the threads launched are partitioned into thread blocks. There is a limit on the number of threads per block, since all threads of a block are expected to reside on the same processor core and must share the limited memory resources of that core. On current GPUs, a thread block may contain up to 1024 threads [14]. For a GPU to run efficiently and effectively, the concurrency provided by the platform should match the parallel opportunities in the application. The challenge in our work, detailed in Section 3, is that we have two dimensions of parallelism and therefore need to allocate the threads in each block to the two dimensions of parallel opportunities. A poor split may result in wasted computing resources and low efficiency.

TWO-DIMENSIONAL PARALLEL BELIEF PROPAGATION
We parallelize the atomic operation of belief propagation, namely message passing, for junction trees. The advantage of doing so is that this node-level parallelism can be embedded in different belief propagation algorithms unobtrusively, without any change to those algorithms.
Associated with each junction tree vertex $C_i$ and the contained set of variables $X_i$, there is a potential table $\phi_{X_i}$ containing non-negative real numbers that are proportional to the joint distribution of $X_i$. If the $j$-th variable can take $s_j$ states, the size of the potential table is $|\phi_{X_i}| = \prod_{j=1}^{|X_i|} s_j$, where $|X_i|$ is the cardinality of $X_i$.
Message passing from vertex $C_i$ to an adjacent vertex $C_k$, with separator $S_{ik}$, involves two steps:
1. Reduction. The potential table $\phi_{S_{ik}}$ of the separator is updated to $\phi^*_{S_{ik}}$ by reducing the potential table $\phi_{X_i}$:
$$\phi^*_{S_{ik}} = \sum_{X_i \setminus S_{ik}} \phi_{X_i} \qquad (1)$$
2. Scattering. The potential table of $C_k$ is updated using both the old and the new table of $S_{ik}$:
$$\phi^*_{X_k} = \phi_{X_k} \, \frac{\phi^*_{S_{ik}}}{\phi_{S_{ik}}} \qquad (2)$$
We define $0/0 = 0$ in this case; that is, if the denominator in (2) is zero, we simply set the corresponding entries of $\phi^*_{X_k}$ to zero.
A close look at Equations (1) and (2) reveals two dimensions of parallelism opportunity in a message passing. The first dimension is the separator potential table (SPT) element-wise parallelism. The second dimension is the arithmetic parallelism.
Element-Wise Parallelism: An index mapping table $\mu_{X,S}$ stores the index mappings from $\phi_X$ to $\phi_S$ [5]. We create $|\phi_{S_{ik}}|$ mapping tables. In each mapping table $\mu_{X_i,\phi_{S_{ik}}}(j)$ we store the indices of the elements of $\phi_{X_i}$ mapping to the $j$-th element of $\phi_{S_{ik}}$. With the index mapping table, element-wise parallelism can be obtained by assigning a specific group of threads to handle the computation related to a specific separator potential table element.
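As an illustration of the index mapping tables, the following Python sketch builds them for a small clique and uses them for the reduction step. The row-major table layout, the function names, and the example clique (variables A and B with 2 and 3 states, separator {B}) are assumptions made for this sketch, not details from the paper.

```python
from itertools import product

def build_mapping_tables(clique_vars, clique_states, sep_vars):
    """Return mu, where mu[j] lists the indices of clique-table entries that
    map to the j-th separator-table entry (row-major layout assumed)."""
    sep_pos = [clique_vars.index(v) for v in sep_vars]
    sep_states = [clique_states[p] for p in sep_pos]
    sep_size = 1
    for s in sep_states:
        sep_size *= s
    mu = [[] for _ in range(sep_size)]
    assignments = product(*(range(s) for s in clique_states))
    for idx, assignment in enumerate(assignments):
        j = 0
        for p, s in zip(sep_pos, sep_states):
            j = j * s + assignment[p]  # row-major index into the separator
        mu[j].append(idx)
    return mu

def reduce_with_mapping(phi_x, mu):
    # One separator element per entry of mu; each sum is independent work.
    return [sum(phi_x[i] for i in rows) for rows in mu]

mu = build_mapping_tables(["A", "B"], [2, 3], ["B"])
phi_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
phi_s_new = reduce_with_mapping(phi_x, mu)
```

Each list `mu[j]` is the work assigned to one group of threads under element-wise parallelism; the groups do not share any clique-table entries.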
Arithmetic Parallelism: Arithmetic parallelism needs to be explored in different ways for reduction and scattering, and also integrated with element-wise parallelism, as we will discuss now.
For reduction, given a fixed element $j$, Equation (1) is essentially a summation over all the clique potential table (CPT) $\phi_{X_i}$ elements indicated by the corresponding mapping table $\mu_{X_i,\phi_{S_{ik}}}(j)$. The number of terms in the sum is $|\mu_{X_i,\phi_{S_{ik}}}(j)|$. We compute the summation in parallel by using an existing approach [4]. The summation is done in several iterations. In each iteration, the numbers are divided into two groups and the corresponding two numbers in each group are added in parallel.
For scattering, note that (2) updates the elements of $\phi_{X_k}$ independently, even though $\phi_{S_{ik}}$ and $\phi^*_{S_{ik}}$ are re-used to update different elements. Therefore, we can compute each multiplication in (2) with a single thread.
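A sequential Python sketch of the scattering step, including the 0/0 = 0 convention, is given below. The mapping table `mu_k` (from the table of the receiving clique to the separator) and all numeric values are hypothetical; on a GPU each inner update would be handled by its own thread.

```python
def scatter(phi_xk, mu, phi_s_old, phi_s_new):
    """Return the updated clique table of C_k; 0/0 is treated as 0."""
    updated = list(phi_xk)
    for j, rows in enumerate(mu):
        # Ratio of new to old separator entry, with 0/0 defined as 0.
        ratio = 0.0 if phi_s_old[j] == 0 else phi_s_new[j] / phi_s_old[j]
        for i in rows:
            # Every (i, j) update is independent of all the others.
            updated[i] = phi_xk[i] * ratio
    return updated

mu_k = [[0, 2], [1, 3]]  # hypothetical mapping tables for C_k
phi_xk = [0.5, 0.25, 0.5, 0.75]
result = scatter(phi_xk, mu_k, phi_s_old=[1.0, 0.0], phi_s_new=[2.0, 0.0])
```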
Given the two dimensions of parallelism, our parallel message passing approach is illustrated in Figure 1. Let $\mu_{X_i}[m]$ denote the $m$-th element in table $\mu_{X_i}$ and $\phi_{S_{ik}}[n]$ the $n$-th element in table $\phi_{S_{ik}}$. If the size of the mapping table $\mu_{X_i,S_{ik}}$ is an integer power of 2 (say $|\mu_{X_i,S_{ik}}| = 2^d$), the parallel reduction algorithm can be written as in Algorithm 3. If $|\mu_{X_i,S_{ik}}|$ is not an integer power of 2, we can use techniques such as zero-padding to make it one. Algorithm 3 integrates the two dimensions of parallelism for the reduction step: the element-wise parallelism determined by $|\phi_{S_{ik}}|$ and the arithmetic parallelism determined by the mapping table size $|\mu_{X_i,S_{ik}}|$. The scattering step is shown in Algorithm 4.
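The combination of the two dimensions for the reduction step can be simulated sequentially as in the following Python sketch. It is not the paper's Algorithm 3; the function names are hypothetical, and the loops that would run concurrently on a GPU are executed one after another here.

```python
def tree_sum(values):
    """Pairwise summation in log2(n) iterations after zero-padding the
    input to a power-of-two length n. Returns (sum, iterations)."""
    n = 1
    while n < len(values):
        n *= 2
    buf = list(values) + [0.0] * (n - len(values))  # zero-padding
    iterations = 0
    stride = n // 2
    while stride >= 1:
        # On a GPU, the `stride` additions of one iteration run concurrently.
        for t in range(stride):
            buf[t] += buf[t + stride]
        stride //= 2
        iterations += 1
    return buf[0], iterations

def parallel_reduction(phi_x, mu):
    # Outer level: element-wise parallelism (one separator element each).
    # Inner level: arithmetic parallelism (tree-structured summation).
    return [tree_sum([phi_x[i] for i in rows])[0] for rows in mu]

total, steps = tree_sum([1.0, 2.0, 3.0, 4.0, 5.0])
reduced = parallel_reduction([1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                             [[0, 3], [1, 4], [2, 5]])
```

Five values are padded to eight, so the summation takes three iterations.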

PARAMETER OPTIMIZATION FOR PARALLEL MESSAGE PASSING
Belief propagation is essentially a sequence of message passings $\{m_i\}_{i=1}^{2M-2}$ over the edges of a junction tree, where $M$ is the number of nodes in the junction tree. Each message passing has a reduction step and a scattering step. The parallel algorithms for reduction (Algorithm 3) and scattering (Algorithm 4) assume that infinitely many threads are available for parallel computing. However, a parallel computing platform such as a GPU has only a limited number of threads available.
GPU message passing is repeatedly called by the CPU. Each time the GPU is called, the thread block size as well as the thread allocation can be set from the CPU side. Therefore, when implementing Algorithm 3 and Algorithm 4 on a GPU, a programmer faces the problem of how to set the thread block size and allocate the GPU's parallel threads to the two dimensions of parallelism; that is, we need to find a sequence of GPU run-time parameters, one set for each message passing.
GPU Parameters: Before invoking a GPU kernel function for message passing, two questions should be resolved on the CPU side: (1) What should the thread block size be? (2) In each thread block, how should threads be divided between element-wise and arithmetic parallelism? To compute the message passing efficiently, we need to carefully choose the set of GPU parameters $P_{gpu} = \{K_r, K_s, p_a^r, p_e^r, p_a^s, p_e^s\}$, where $K_r$ and $K_s$ are the total numbers of threads in one thread block for reduction and scattering, respectively; $p_a^r$ and $p_a^s$ are the numbers of threads used for arithmetic parallelism in reduction and scattering, respectively; and $p_e^r$ and $p_e^s$ are the numbers of threads used for element-wise parallelism by each thread block in reduction and scattering.
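To give a sense of the size of this parameter space, the following Python sketch enumerates candidate settings for one step (reduction or scattering). The restriction to power-of-two values, the minimum block size of 32, and the use of all threads of a block (the product of the two allocations equals the block size) are assumptions of this sketch; the function name is hypothetical.

```python
def candidate_configs(min_block_size=32, max_block_size=1024):
    """Enumerate (K, p_a, p_e) with power-of-two sizes and p_a * p_e == K."""
    configs = []
    k = min_block_size
    while k <= max_block_size:
        p_a = 1
        while p_a <= k:
            configs.append((k, p_a, k // p_a))  # p_e = K / p_a
            p_a *= 2
        k *= 2
    return configs

configs = candidate_configs()
```

Under these assumptions there are only a few dozen candidates per step, which is small enough to try exhaustively for a single message passing.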

GPU Optimization Examples
Thread allocation optimization is very important for parallel message passing in junction trees. Due to the variability of the cliques and separators involved in message passing, the performance surfaces have greatly varying shapes.
Figure 1 suggests that if a mapping table is large, it is better to assign more threads to arithmetic parallelism; if a separator table is large, it is better to assign more threads to element-wise parallelism.
Figure 2 shows three examples of different performance surfaces with respect to the GPU parameters $p_e^r$ and $K_r$. From these examples, we get several intuitions about choosing GPU parameters:
1) The search space is so diverse that using the same set of parameters for all message passings can seldom achieve optimal performance. Good parameters for one message passing could work inefficiently for another.
2) For a given message passing, optimal GPU parameters can result in a huge improvement over a poor choice of GPU parameters (in some cases, more than a 20x difference).

GPU Optimization
The metrics for measuring system performance vary. In our work, we use the execution time for belief propagation to all the cliques in the junction tree as our metric. Since belief propagation over a junction tree is a sequence of message passings, in our node-level parallel BP algorithm, minimizing the total BP execution time can be broken down into a sequence of tasks, each minimizing the execution time of one message passing.
It requires $N = 2(M - 1)$ message passings to complete belief propagation over a junction tree $J$ with $M$ cliques. Let $f_n : P_{intr} \times P_{gpu} \to \mathbb{R}$ be the execution time for one message passing. The total BP time is
$$T = \sum_{n=1}^{N} f_n(P^n_{intr}, P^n_{gpu}),$$
where $P^n_{intr}$ and $P^n_{gpu}$ are the parameters of the $n$-th message passing.
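Because the GPU parameters of each message passing can be set independently, the total time is minimized by minimizing each term separately. The following Python sketch illustrates this with made-up timings; the table layout (one dictionary per message, keyed by a hypothetical pair of block size and arithmetic threads) is an assumption of the sketch.

```python
def best_total_time(time_tables):
    """time_tables[n] maps a GPU configuration to the execution time of the
    n-th message passing. Returns the per-message choices and total time."""
    choices = [min(table, key=table.get) for table in time_tables]
    total = sum(table[c] for table, c in zip(time_tables, choices))
    return choices, total

# Hypothetical timings (milliseconds) for two message passings.
tables = [
    {(256, 16): 4.0, (256, 128): 1.5, (128, 4): 3.0},
    {(256, 16): 0.8, (256, 128): 2.5, (128, 4): 0.6},
]
choices, total = best_total_time(tables)

# Best total when one fixed configuration is used for all messages.
fixed_best = min(sum(t[c] for t in tables) for c in tables[0])
```

In this toy example the per-message choice gives a total of about 2.1 ms, whereas the best single fixed configuration gives 3.6 ms.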
Thus, the GPU optimization problem can be modeled as:
$$\min_{P^1_{gpu}, \dots, P^N_{gpu}} \; \sum_{n=1}^{N} f_n(P^n_{intr}, P^n_{gpu})$$
Unfortunately, traditional optimization techniques cannot be applied to this optimization problem, since an analytical form of $f_n(\cdot)$ is usually not available due to the complexity of the GPU platform.

Theoretical Model for Message Passing
Before proceeding to the black-box modeling approach, which relates pre-execution parameters to post-execution performance, we develop a simple mathematical model as a first attempt to characterize the relationship between the message passing workload, the GPU parameters, and the output performance (execution time).
For simplicity, assume that the GPU takes constant times $\tau_a$ and $\tau_m$ for the add and multiply operations, respectively. We ignore memory access time, device setup time, etc. Suppose the GPU can accommodate $N_b$ thread blocks running simultaneously. Under these assumptions, choosing the thread allocation for reduction and for scattering leads to the constrained optimization problems (7) and (8). It is analytically and numerically hard to optimize (7) and (8) due to the irregular form of both the objective and the constraint functions. In addition, the overly simplified assumptions make them of little practical use for parameter selection. Consequently, we turn to a regression approach, as discussed below.

Regression Models for Message Passing
In this subsection, we develop statistical models for message passing. The models used are Polynomial Regression and Support Vector Regression (SVR). Essentially, we want to establish a statistical relationship between the GPU configuration parameters and the performance, as illustrated in Figure 3.
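Polynomial Regression: For the polynomial model, in order to get better insight into how the thread allocation affects the GPU execution time, we use the Lasso method to shrink the model and compare the resulting model with (7) and (8). A polynomial Lasso has the form
$$\hat{\beta}^{Lasso} = \arg\min_{\beta} \sum_{i} \Big( y_i - \beta_0 - \sum_{j} \beta_j \, Poly_j(x_i) \Big)^2 \quad \text{s.t.} \quad \sum_{j} |\beta_j| \le t,$$
where $y_i$ is the measured execution time of the $i$-th training sample, $x_i$ is its feature vector, and $t$ bounds the size of the coefficients.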

Support Vector Regression (SVR):
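A second regression model we use is support vector regression [3]. In SVR, the input is first mapped onto a high-dimensional feature space using some fixed (nonlinear) mapping, and then a linear model is constructed in this feature space. The linear model (in the feature space) $f(x, \omega)$ is given by
$$f(x, \omega) = \sum_{j=1}^{m} \omega_j \, g_j(x) + b,$$
where $g_j(x)$, $j = 1, \dots, m$, denotes a set of nonlinear transformations, and $b$ is the bias term. The loss function of SVR is called the $\epsilon$-insensitive loss function, defined as
$$L_\epsilon(y, f(x, \omega)) = \begin{cases} 0 & \text{if } |y - f(x, \omega)| \le \epsilon \\ |y - f(x, \omega)| - \epsilon & \text{otherwise} \end{cases}$$
SVR performs linear regression in the high-dimensional feature space using the $\epsilon$-insensitive loss and, at the same time, tries to reduce model complexity by minimizing $\|\omega\|^2$. Thus SVR is formulated as minimization of the following function:
$$\min \; \frac{1}{2} \|\omega\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) \quad \text{s.t.} \quad y_i - f(x_i, \omega) \le \epsilon + \xi_i^*, \quad f(x_i, \omega) - y_i \le \epsilon + \xi_i, \quad \xi_i, \xi_i^* \ge 0,$$
where $\xi_i$ and $\xi_i^*$ are slack variables measuring the deviation of training samples outside the $\epsilon$-insensitive zone, and $C$ is the weight for the slack terms.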
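The role of the ε-insensitive loss can be illustrated with a few lines of Python. This sketch only evaluates the loss and the resulting objective for given predictions; it does not train an SVR. The function names and the numeric values in the example are hypothetical, and the objective is written in its slack-free form, in which each pair of slack variables is replaced by the loss of the corresponding sample.

```python
def eps_insensitive_loss(y, prediction, eps=0.001):
    """Zero inside the eps-tube around y, linear outside it."""
    return max(0.0, abs(y - prediction) - eps)

def svr_objective(weights, targets_and_predictions, c, eps=0.001):
    """0.5 * ||w||^2 + C * (sum of eps-insensitive losses)."""
    complexity = 0.5 * sum(w * w for w in weights)
    loss = sum(eps_insensitive_loss(y, f, eps)
               for y, f in targets_and_predictions)
    return complexity + c * loss

inside_tube = eps_insensitive_loss(1.0, 1.0005)        # error below eps
outside_tube = eps_insensitive_loss(2.0, 1.0, eps=0.5)
objective = svr_objective([3.0, 4.0], [(2.0, 1.0)], c=2.0, eps=0.5)
```

A larger C penalizes deviations outside the tube more heavily relative to model complexity.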

Features for Regression Models
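The features we collected for regression model training are shown in Table 1. There are two possible ways in which statistical modeling can help with run-time GPU parameter selection: 1) We can directly approximate the function (12), which involves the GPU execution time in terms of $K$, $p_a$, $|\phi_S|$ and $|\phi_X|$; because the number of possible values of $K$ and $p_a$ is finite, this can be modeled as a classification problem. 2) Alternatively, we can take an indirect approach: first train a regression model for GPU execution time, and then search the regression model to obtain the best run-time GPU parameters. In this paper, we use the regression method.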
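The indirect approach amounts to evaluating the trained model on every candidate configuration and keeping the minimizer. The Python sketch below shows this search. The function `toy_model` is a hypothetical stand-in for a trained regression model (it is not fitted to any data and is not the model used in the paper), and the candidate list is likewise an assumption.

```python
import math

def select_parameters(predict_time, sep_size, clique_size, candidates):
    """Return the candidate (K, p_a) with the smallest predicted time."""
    return min(candidates,
               key=lambda kp: predict_time(sep_size, clique_size, *kp))

def toy_model(sep_size, clique_size, k, p_a):
    # Hypothetical predictor: rounds of element-wise work multiplied by
    # the cost of one tree-structured summation per separator element.
    p_e = k // p_a
    mapping_size = clique_size // sep_size
    rounds = math.ceil(sep_size / p_e)
    return rounds * (math.ceil(mapping_size / p_a) + math.log2(p_a))

candidates = [(256, p_a) for p_a in (4, 16, 64, 128)]
long_mapping = select_parameters(toy_model, 8, 4096, candidates)
short_mapping = select_parameters(toy_model, 512, 4096, candidates)
```

With this toy predictor, a long mapping table favors many arithmetic threads and a large separator favors few, in line with the intuition drawn from Figure 1.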

Metrics for Regression Model Quality
Since the trained model is used to optimize GPU parameters, we want to find a regression model whose minimum point is the same as or close to that of the real GPU performance surface. The residual squared sum (RSS) is often used to measure the quality of a regression model's fit to the training data. However, RSS is not a direct metric for the quality of a regression model in our thread allocation optimization problem. In other words, a small RSS does not necessarily guarantee a good model.
Our goal is to find optimal estimated parameters $K^*$ and $p_a^*$ for reduction and scattering given $|\phi_S|$ and $|\phi_X|$. Thus, we propose to use the squared deviance (SD) from the real optimal value as a metric for model training quality, that is, the squared difference between the measured GPU time at the parameters selected by the model and the minimum measured GPU time. Here $T_r(\cdot)$/$T_s(\cdot)$ is the measured GPU reduction/scattering time with respect to the junction tree parameters $|\phi_S|$, $|\phi_X|$ and the GPU parameters $K$ and $p_a$, and $K^*$ and $p_a^*$ are the optimal parameters obtained from the statistical model for the given $|\phi_S|$ and $|\phi_X|$.
Aside from the squared deviance from the optimal value, we also use the miss rate (MR) as a measurement of model quality. The miss rate is the fraction of the $N$ message passings in the training set for which the parameters selected by the model differ from the real optimal GPU parameters; it is computed with the indicator function $1(\cdot)$, where $K^*$ and $p_a^*$ here denote the real optimal GPU parameters for a given $|\phi_S|$ and $|\phi_X|$. In practice, we exhaustively try the small number of possible GPU configurations on the GPU to find $K^*$ and $p_a^*$.
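Both metrics can be computed from measured timings as in the following Python sketch. The data layout (one dictionary of measured times per message passing), the function names, and the choice to sum rather than average the squared deviances are assumptions of this sketch, and the timings are made up.

```python
def squared_deviance(measured, model_choice):
    """Sum over message passings of the squared gap between the measured
    time at the model-selected configuration and the measured optimum."""
    return sum((times[pick] - min(times.values())) ** 2
               for times, pick in zip(measured, model_choice))

def miss_rate(measured, model_choice):
    """Fraction of message passings where the model-selected configuration
    differs from the measured optimal configuration."""
    misses = sum(1 for times, pick in zip(measured, model_choice)
                 if pick != min(times, key=times.get))
    return misses / len(measured)

# Hypothetical measured times, keyed by (K, p_a), for two message passings.
measured = [
    {(256, 16): 2.0, (256, 128): 1.0},
    {(256, 16): 0.5, (256, 128): 3.0},
]
model_choice = [(256, 16), (256, 16)]
sd = squared_deviance(measured, model_choice)
mr = miss_rate(measured, model_choice)
```

A model can have a nonzero miss rate and still a small squared deviance when its misses land on configurations that are nearly as fast as the optimum.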

EXPERIMENTS
In this section, we address the following questions:
• How accurately can a statistical model emulate GPU performance?
• How much GPU execution time can be saved as a result of using statistical model-based parameter optimization compared to manual parameter optimization?

Experimental Data and Platforms
Our implementation is tested on a number of BNs from different problem domains, with varying structures and state spaces. In our experiments, we compile a BN into a junction tree offline and then run belief propagation over the junction tree.
As a baseline, we implement a sequential junction tree program on an Intel CPU, whose execution time is comparable to that of SMILE [15], a widely used C++ software package for BN inference. As a second baseline, we use the SMILE junction tree algorithm. Detailed information on the CPU and GPU platforms is given in Table 2. We have performed sanity checks on the parallel junction tree algorithm to ensure the correctness of our implementation.

Regression Results
We use both polynomial-lasso regression and support vector regression to fit the data. For the polynomial model, the terms we include in the model are $|\phi_S|$, $|\phi_X|$, $K$, $K^2$, $p_e$, $p_e^2$, and all the interaction terms of the above-mentioned terms. We set the Lagrangian multiplier $\lambda$ in (9) to either $\lambda_{min}$, the value of $\lambda$ that gives the minimum mean cross-validated error, or $\lambda_{1se}$, the largest value of $\lambda$ for which the mean cross-validated error is within 1 standard error of the minimum.
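The two choices of the multiplier can be read off from cross-validation summaries as in the Python sketch below. The function name, the input layout (parallel lists of candidate values, mean cross-validated errors, and their standard errors), and the numbers are hypothetical.

```python
def select_lambdas(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from cross-validation summaries."""
    best = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    # Errors within one standard error of the minimum are acceptable.
    threshold = cv_mean[best] + cv_se[best]
    lambda_1se = max(lam for lam, err in zip(lambdas, cv_mean)
                     if err <= threshold)
    return lambdas[best], lambda_1se

lambdas = [0.001, 0.01, 0.1, 1.0]
cv_mean = [0.52, 0.50, 0.55, 0.90]
cv_se = [0.05, 0.06, 0.05, 0.08]
lambda_min, lambda_1se = select_lambdas(lambdas, cv_mean, cv_se)
```

Since the second choice is never smaller than the first, it yields a sparser, more heavily penalized model.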
For SVR, we use a radial basis kernel with kernel bandwidth $\gamma$. Other parameters for SVR training are the weight $C$ for the slack terms and the $\epsilon$ in the $\epsilon$-insensitive loss function, $\epsilon = 0.001$.
Figure 7: GPU execution times for different parameter optimization methods, both manual (0-threshold and 1-threshold) and regression (lasso and SVR). Optimization using SVR is best in all cases.

In Figure 4, both SVR and lasso approximate the GPU time nicely. However, in Figure 5, neither SVR nor lasso approximates the GPU time well, because of an abrupt drop in GPU time when $p_a$ increases. Nevertheless, a closer look shows that the minimum points of both the SVR and lasso models are located not far from the minimum point of the measured GPU time. In Figure 6, lasso approximates the GPU time better than SVR, but the minimum points of both statistical models are close to that of the real GPU execution time surface.

Table 3 shows the residual sum of squares (RSS), the squared deviance (SD) from the optimal value, and the miss rate (MR) of the models. From the table, we see that GPU execution time for reduction is harder to emulate than that for scattering. This suggests that reduction is likely to be the bottleneck of parallel message passing. We also see that even though SVR does not have the lowest miss rate, its squared deviance is the smallest for reduction. That is to say, even though SVR might have missed some optimal parameters, its parameter choices are not much worse than the optimal ones. Thus we expect SVR to perform best in thread allocation, which is confirmed experimentally in Table 4.

Manual versus Regression-Based GPU Parameter Optimization
In this section, we compare different GPU parameter settings and show that regression-based GPU parameter optimization can achieve much better performance than (extensive) manual parameter optimization. We use the 0-threshold and 1-threshold parameter selection schemes, which are suitable for manual optimization, as benchmarks. In the 0-threshold parameter setting, where there is no threshold on the mapping table size, we set $K = 256$ and $p_a = 16$ for all the message passings throughout belief propagation. The 0-threshold parameter setting is perhaps the most straightforward and widely used GPU parameter selection scheme: a programmer simply sets a group of reasonable GPU parameters, which do not change at run time.
In the 1-threshold parameter selection with threshold $t_1 = 90$, we still keep $K = 256$, but we set $p_a$ in the following way:
$$p_a = \begin{cases} 4 & \text{if } |\mu_X| \le t_1 \\ 128 & \text{if } |\mu_X| > t_1 \end{cases} \qquad (17)$$
where $|\mu_X|$ is the mapping table size. The rationale behind this 1-threshold parameter optimization is that in reduction/scattering cases where there is a long mapping table (see Figure 1), there are many opportunities for arithmetic parallelism and therefore we should assign many threads to it (in (17), $p_a = 128$). In cases where the mapping table is "short", many threads should be assigned to element-wise parallelism, leaving "few" for arithmetic parallelism (in (17), $p_a = 4$). In (17), $t_1 = 90$ is chosen as a reasonable threshold to differentiate short and long mapping tables.
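The two manual schemes are simple enough to state as a short Python function. The function name and the scheme labels are hypothetical, and the handling of a mapping table whose size equals the threshold exactly (treated as "short" here) is an assumption of this sketch.

```python
def manual_parameters(mapping_table_size, scheme, t1=90):
    """Return (K, p_a) for the 0-threshold or 1-threshold scheme."""
    k = 256
    if scheme == "0-threshold":
        # Same parameters for every message passing.
        return k, 16
    if scheme == "1-threshold":
        # Long mapping table: many arithmetic threads; short: few.
        return k, (128 if mapping_table_size > t1 else 4)
    raise ValueError(f"unknown scheme: {scheme}")

fixed = manual_parameters(512, "0-threshold")
long_table = manual_parameters(512, "1-threshold")
short_table = manual_parameters(8, "1-threshold")
```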
In experiments, we used the SVR and polynomial-lasso models to select GPU parameters for each BP message passing. Results are summarized in Table 4 and Figure 7. On average over all data sets, we get a speedup of 10.70x (arithmetic average) or 8.68x (geometric average), compared to the 3.43x (arithmetic average) or 2.44x (geometric average) achieved previously using only element-wise parallelism [18]. We also compare the parallel junction tree algorithm with SMILE; the improvement is still significant, as shown in Table 4.
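The two kinds of average reported above differ in how strongly they are influenced by a few large speedups. The Python sketch below computes both for a made-up list of per-network speedups; the values are hypothetical and are not the paper's measurements.

```python
import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    # exp of the mean of the logarithms; requires positive values.
    return math.exp(sum(math.log(v) for v in values) / len(values))

speedups = [1.0, 4.0, 16.0]  # hypothetical per-network speedups
arith = arithmetic_mean(speedups)
geo = geometric_mean(speedups)
```

The geometric average is never larger than the arithmetic average, which is consistent with the geometric figures quoted in the text being the smaller of each pair.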
We highlight several points in Table 4:
1) Across the columns, we see that parallelism opportunity determines the GPU performance. The GPU in general performs well for data sets that have big cliques (which means more arithmetic parallelism) or big separators (which means more element-wise parallelism). For the data sets that have neither big cliques nor big separators, such as "Munin2" and "Munin3", the GPU speedup is smaller.
2) Across the different statistical models, we see that SVR is best for all the data sets we have, which coincides with our observations on the squared deviance from the optimal value and the miss rate in Table 3. Lasso (λ = 0) is comparable to the 1-threshold parameter optimization; Lasso (λ = 1se), with its severe penalty on model complexity, performs the worst of all. The difference between the best and worst statistical model parameter settings can be as large as 5-7x.

CONCLUSION AND FUTURE WORK
In this paper, we discuss a two-dimensional parallel algorithm for belief propagation over junction trees, and implement the algorithm on a GPU. Due to the great variety in clique and separator sizes in junction trees from applications, the parallel opportunity for both dimensions of parallelism varies. Since the GPU performs best when the concurrency provided by the GPU matches the parallel opportunity in the algorithm, it is necessary to carefully optimize the thread allocation for both dimensions of parallelism as well as for thread blocks. Experiments show a large difference in GPU performance given different thread allocations. Therefore, we use statistical models to approximate the parameter space, for the purpose of searching for optimal parameters. Among the models we used, SVR performs best, and outperforms manual GPU optimization. We show that our approach is an effective way to improve GPU performance when seeking fast junction tree belief propagation.
Intrinsic Junction Tree Parameters: Consider a message passing from $C_i$ to $C_k$ through a separator $S$. From the computational perspective, the input cliques and separators can be characterized by a set of intrinsic junction tree parameters $P_{intr} = \{|\phi_{X_i}|, |\phi_{X_k}|, |\phi_S|\}$, where $|\phi_{X_i}|$, $|\phi_{X_k}|$ and $|\phi_S|$ are the sizes of the potential tables of $C_i$, $C_k$ and $S$, respectively.

Figure 1: Two parallelism opportunities in a junction tree: element-wise and arithmetic parallelism. Arithmetic parallelism (tree structure at the bottom) is added on top of the element-wise parallelism (look-up table at the top).

Figure 2: Examples of how GPU execution time (y-axis) varies with different GPU parameters, specifically the thread block size (TBS) and the number of arithmetic threads (x-axis).

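Figure 3: Statistical model of many-core system performance.

Let $N_b$ be the number of thread blocks that the GPU can run simultaneously. Consider a message passing from $C_i$ to $C_k$ using GPU parameters $p_a^s$, $p_e^s$, $p_a^r$ and $p_e^r$. Define $g(\phi, p) = |\phi| / p$. The time for reduction is $T_r = \frac{1}{N_b} \, g(\phi_S, p_e^r) \, g(\phi_{X_i}, p_a^r) \lfloor (\log_2 p_a^r + \dots$ and the time for scattering is $T_s = \dots, p_e^s) \, g(\phi_{X_k}, p_a^s) \, \tau_a$ (6). The total message passing time between $C_i$ and $C_k$ is given by $T = T_r + T_s$. Due to this decomposition of message passing time into reduction time and scattering time, we can optimize the GPU parameters related to reduction and scattering separately. For reduction, we have:
$$\min_{p_e^r, \, p_a^r} T_r \quad \text{s.t.} \quad p_e^r \cdot p_a^r \le K_r \qquad (7)$$
and for scattering we get:
$$\min_{p_e^s, \, p_a^s} T_s \quad \text{s.t.} \quad p_e^s \cdot p_a^s \le K_s \qquad (8)$$

In the polynomial Lasso, $Poly_j(x)$ is a polynomial function of the feature vector $x$. The Lasso in the equivalent Lagrangian form is
$$\hat{\beta}^{Lasso} = \arg\min_{\beta} \Big\{ \sum_{i} \Big( y_i - \beta_0 - \sum_{j} \beta_j \, Poly_j(x_i) \Big)^2 + \lambda \sum_{j} |\beta_j| \Big\} \qquad (9)$$
where $y_i$ is the measured execution time of the $i$-th training sample and $\lambda$ is the Lagrangian multiplier.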

Figure 7 legend: 0-threshold GPU parameters; 1-threshold GPU parameters; GPU parameters with lasso model (λ = 1SE); GPU parameters with lasso model (λ = 0); GPU parameters with SVR.

Other parameters for SVR training:
• The weight for slack terms, $C$.
• The $\epsilon$ in the $\epsilon$-insensitive loss function: $\epsilon = 0.001$.

In Figures 4, 5 and 6, we plot three examples of statistical models emulating the measured GPU execution time $T_r$ for the reduction step. In each figure, $|\phi_X|$ and $|\phi_S|$ are fixed and the GPU parameters $K$ and $p_a$ change. These three examples correspond to the three examples in Figure 2.

Table 1: Features used in regression models.
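Equation (12) is expressed in terms of the GPU execution time as a function of $K$, $p_a$, $|\phi_S|$ and $|\phi_X|$. In practice, the number of possible values of $K$ and $p_a$ is finite; therefore, we can model (12) as a classification problem. Alternatively, we can take an indirect approach by first training a regression model for GPU execution time and then searching that model for the best run-time GPU parameters.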

Table 2: Experimental Platforms: GPU and CPU

Table 3: Residual Squared Sum (RSS), Squared Deviance (SD) and Miss Rate (MR) of the polynomial and SVR models for GPU reduction and scattering execution time

Table 4: The GPU execution time (in milliseconds) for different GPU parameter optimization methods, using different junction trees (Pigs, Water, ...) with very different clique potential table (CPT) and separator potential table (SPT) characteristics. The table shows junction tree information (Data Statistics); varying GPU optimization methods (Manual Optimization: 0-threshold and 1-threshold versus Regression-based Optimization: SVR and lasso); Previous results [18] and Speedup (current SVR versus previous CPU [18]).