Learning Probability Distributions over Partially-Ordered Human Everyday Activities

We propose a method to learn the partially-ordered structure inherent in human everyday activities from observations by exploiting variability in the data. Using statistical relational learning, the system extracts a full-joint probability distribution over the actions that form a task, their (partial) ordering, and their parameters. Relevant action properties and relations among actions are learned as those that are consistent across the observations. The models can be used for classifying action sequences, but also for determining which actions are relevant for a task, which objects are usually manipulated, or which action parameters are typical for a person. We evaluate the approach on synthetic data sampled from partial-order trees as well as on two real-world data sets of human activities: the TUM kitchen data set and the CMU MMAC data set. The results show that our approach outperforms sequence-based models like Conditional Random Fields for activities that allow a large degree of variation.


I. INTRODUCTION
When observing people cooking a meal, one will notice a large variability in how they perform the different actions: Some people first prepare all the tools and ingredients, others start to cook right away and get the things they need just in time. In addition, people get distracted, perform irrelevant actions in between, or forget something they need to make up for later on. As a result, the observations differ in terms of which actions have been performed, in which order they have been done, and what their parameters have been. This high degree of variability stems from the relative freedom in how many tasks can be performed: Though humans tend to describe them as sequences, many tasks are in fact much less constrained, and only impose a few partial ordering constraints among their sub-actions instead of a total ordering among all of them. These ordering constraints may be due to causal dependencies, e.g. if one action depends on the outcome of another, but may also result from person-specific habits or preferences. Both kinds of constraints can be useful: When planning robot actions, a model of dependencies among actions can serve for computing a suitable ordering and for exploiting freedom for optimization. When observing human actions, such models describe different styles of performing an activity and can be used to spot differences and anomalies, for example caused by medical conditions.
In this paper, we propose a method for learning such action models from observation. Given a diverse training set of observed actions, we can exploit the variability in the data to learn about the structure and properties of the task. The more diverse the training set is, the more alternative ways of how to perform a task can be learned. Those actions, properties and relations that consistently appear in many examples will have a higher likelihood of being relevant for the task than those that are only incidentally observed.
The models represent a joint probability distribution over the types of actions, their parameters (like the hand that is used or the object that is manipulated), and their pairwise ordering. Combined, these pairwise ordering constraints result in a partial order imposed on all actions in a task. In order to learn such models from noisy, uncertain observations, one needs to be able to represent both relational knowledge and uncertainty, which is why we employ statistical relational learning techniques. Our implementation uses Bayesian Logic Networks (BLNs, [1]), which are a relational extension of Bayesian Networks. The learned full-joint probability distribution can be used for various inference tasks:
• Classification of activities by checking which constraints are satisfied
• Verification that an action has been performed correctly with respect to a reference model
• Identification of relevant actions in the activity as those that consistently appear in the training data
• Inference of missing information like the type of an action or object given the overall activity model
• Manual analysis of the learned models, which can give important insight into how a person performs a task
The remainder of the paper is organized as follows: We start with a review of related work on modeling and recognizing partially ordered activities, and formally describe the representation of actions in the system and the applied statistical relational learning techniques. We then evaluate the approach on a synthetic and two real-world data sets and finish with a discussion of its scalability and generalization.

II. RELATED WORK
The common approach for describing and recognizing human activities is to model the observed action sequences using techniques such as Hidden Markov Models (HMMs) [2], Conditional Random Fields (CRFs) [3] or Suffix Trees [4]. These models describe the sequences in terms of local action transitions, which is particularly suited to largely sequential activities. Once the order of actions is not that well-determined any more, this approach shows its limitations: Even if only a few actions can be shifted around in the task context, this creates much variation and a large number of possible local transitions that confuse sequence-based methods. Also, the Markov assumption that the subsequent action depends only on the current one does not hold for many such activities; instead, the history of which actions have already been done needs to be taken into account.
There are few other systems in the area of action recognition that also address the problem of learning models of partially ordered tasks: Shi [5] uses (manually specified) Dynamic Bayesian Networks to represent the partial order. Gupta [6] describes a method for learning story lines of actions in baseball games using an AND/OR graph. Ekvall [7] learns deterministic ordering constraints from multiple observations in a blocks-world setting. All these approaches focus only on the ordering among atomic action entities, while our system learns a distribution over the order as well as the action parameters.
In the fields of planning (e.g. [8]) and plan recognition, there is much work on partially ordered plans. Kautz and Allen's seminal paper [9] formalizes plan recognition as a logical inference problem. Goldman et al. [10] extend this work to a probabilistic model that can handle partially ordered and interleaved plans. Both approaches, as well as more recent ones, rely on manually created models of the complete task and have mainly been applied to synthetic problems. Research in preference learning also deals with learning and representing orderings, though 'partial order' in that context usually refers to a total order among the top-k elements in a set, as opposed to a partial ordering of the complete set.

III. DESCRIBING THE STRUCTURE OF TASKS
The proposed system is to learn a partially-ordered model of a task T from a set of observed action sequences S that are instantiations of the abstract task. Action sequences are described as sequences of observed actions s = (a_1, ..., a_k). These sequences can have different lengths due to missed actions as well as noise actions in between the relevant ones, and are also expected to show strong variation regarding the order of actions. The abstract model learned from these sequences describes a set of tasks T, each of which is described by a set of actions A_t, a possibly empty set of action properties P_t, and an ordering relation O_t among the actions.
Observed actions in an action sequence are marked with a subscript index a_i; the prototypical actions in a task model have a superscript index a^i. Action sequences are related to tasks via the activityT predicate. Each task model comprises a set of n actions, which have one of m different types A_0, ..., A_m that correspond to classes in the robot's action ontology.
Actions may have different properties like the object manipulated or the hand used. P_t assigns a probability value to each property π_j ∈ Π of each action a_i. The ordering relation O_t for a task T describes the probability that an action a_i is executed before an action a_j in the respective task context.
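The components of a task model described above can be sketched as a small data structure. This is an illustrative sketch only; the names Action and TaskModel are ours, not identifiers from the described system.

```python
# Illustrative sketch of a task model T = (A_t, P_t, O_t); names are ours,
# not from the original system.
from dataclasses import dataclass, field

@dataclass
class Action:
    a_type: str                                 # action class, e.g. "Reaching"
    props: dict = field(default_factory=dict)   # e.g. {"objectActedOn": "Cup"}

@dataclass
class TaskModel:
    actions: set    # A_t: prototypical actions of the task
    props: dict     # P_t: maps (action, property, value) to a probability
    ordering: dict  # O_t: maps (a_i, a_j) to P(a_i executed before a_j)

t = TaskModel(actions={"Reach", "Take"},
              props={("Reach", "objectActedOn", "Cup"): 0.8},
              ordering={("Reach", "Take"): 1.0})
```

Here the ordering dictionary plays the role of the precedes relation: a value of 1.0 encodes a hard ordering constraint, intermediate values encode soft, person- or style-specific preferences.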
In our system, both P_t and O_t are described using probabilistic relations that are learned from the training set of sequences S_train^T and represented as predicates combined with a probability that the relation holds. The relative ordering of two actions is expressed using the precedes predicate (Figure 1). Observations of actions are also described in terms of the predicates used in the action models, like actionT, precedes, and optional predicates for action parameters like objectActedOn. In the example, N1, N3, and N4 are action classes, while O3 is an object class.

IV. BAYESIAN LOGIC NETWORKS
In this paper, we apply Bayesian Logic Networks (BLNs) [1] to represent the aforementioned action structures. BLNs are a form of probabilistic logic and combine the expressiveness of first-order logic, required to describe the complex interactions between actions and their parameters, with the representation of uncertainty. A BLN serves as a template for the construction of a ground mixed network to which standard Bayesian network (BN) inference techniques can be applied. For our experiments, we use Backward Sampling [11], an approximate BN inference algorithm. Due to space limitations, we only briefly describe BLNs and refer to [1] for details.
Formally, a BLN is described as a tuple B = (D, F, L) consisting of the declarations of types and functions D, a set of fragments of conditional probability distributions F, and a set of hard logical constraints L given as formulas in first-order logic. The fragments F describe dependencies between abstract random variables, in our case for instance between precedes(a_i, a_j, S_s) and actionT(a_i). Compared to Bayesian Networks, BLNs abstract away from concrete entities and represent generic relations between classes of entities, similar to the way predicate logic abstracts away from the concrete entities of propositional logic. Examples of the BLN fragments are shown in Figure 3, where the oval nodes denote random variables and the rectangular nodes contain preconditions for the respective fragments to be applicable. For a given set of entities (in our case observations of actions), the BLN gets instantiated into a ground mixed network, expanding the abstract relations with the concrete domains of, e.g., actions and objects. Learning BLNs requires determining the conditional probability tables in the fragments F, which reduces to simply counting the relative frequencies of the relations in the training set. While the declarations D, the fragments F and the logical constraints L are defined manually, they only describe the form of the observed actions and that a partial order among them may exist. The actual action types, their properties, their relations to objects, as well as their ordering relations are learned from data.
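The counting step can be illustrated with a minimal sketch (this is not the BLN implementation, only an illustration of the principle): for each pair of action types (A, B), P(precedes(A, B)) is estimated as the relative frequency with which an A-action occurs before a B-action in the training sequences.

```python
# Minimal sketch of parameter learning by counting relative frequencies of
# the precedes relation; the toy sequences below are illustrative.
from collections import defaultdict
from itertools import combinations

def learn_precedes(sequences):
    before = defaultdict(int)  # count of sequences where A occurs before B
    total = defaultdict(int)   # count of sequences where A and B co-occur
    for seq in sequences:
        for a, b in combinations(seq, 2):  # a appears before b in seq
            before[(a, b)] += 1
            total[(a, b)] += 1
            total[(b, a)] += 1
    return {pair: before[pair] / total[pair] for pair in total}

probs = learn_precedes([["N1", "N2", "N5"], ["N2", "N1", "N5"]])
# N5 follows both N1 and N2 in every sequence, while N1/N2 alternate,
# so precedes(N1, N5) gets probability 1 and precedes(N1, N2) gets 0.5.
```

In the actual system, these relative frequencies fill the conditional probability tables of the fragments in F; the sketch only conveys why learning reduces to counting.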

V. EXPERIMENTS
We evaluate the system first on synthetic data, and then on two real-world data sets of human activities. Due to space limitations, we cannot show every aspect of the evaluation for each of them, but concentrate on the most interesting aspects, respectively.

A. Synthetic Data
First, we tested the approach with synthetic data sequences that have been sampled from the two precedence graphs in Figure 2. Note that both graphs consist of the same basic actions, i.e. no single action can serve as a hint as to which activity is performed; only the order carries information. This is certainly more difficult than most real-world applications, but for instance required when distinguishing between different styles of performing the same activity. The sampling is performed using the following procedure: Let N_i represent the set of nodes whose ordering constraints are met and which can thus be selected in step i, and let prereq(n) be the set of nodes that are prerequisites for node n. The sampling starts with the set of nodes N_0 = {n | prereq(n) = ∅}, i.e. the set of all actions that have no prerequisites. At each sampling step i, a random element n_i ∈ N_i is chosen, and the sampling continues with N_{i+1} = (N_i \ {n_i}) ∪ {n not yet chosen | prereq(n) ⊆ {n_1, ..., n_i}}. All actions occur exactly once in this data set, thus m = n = 8 for both graphs, and there are no action properties, i.e. P = ∅. The data can be modeled with the very simple BLN in Figure 3 (left). 1) Learning the partial order: The learning algorithm should be able to recover the partial order from the data. Figure 4 visualizes the conditional probabilities inside the precedes-node of the BLN. In this visualization, redundant relations have been pruned, i.e. when P(precedes(A, B)) = 1, P(precedes(A, C)) = 1 and P(precedes(B, C)) = 1, we did not draw the edge A-C, to improve clarity. As can be seen in the picture, the algorithm successfully recovered the partial-order structure the data was sampled from.
Interconnections between, for instance, the nodes N1, N2, N3, and N4, which are not present in the original graph, reflect the properties of the sampling algorithm. It is equally likely to switch to a different branch of the activity (i.e. between N1-N2 and N3-N4) as it is to continue on the same branch. If observations of humans show such structures, these interconnections can reflect an alternating behavior as opposed to a stringent execution of a sequence of actions.
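The sampling procedure used to generate these sequences can be sketched as follows. This is a minimal illustration under our own naming; the small graph below is a stand-in, not the actual Act-1 or Act-2 graph.

```python
# Minimal sketch of sampling an action sequence from a precedence graph:
# start from the nodes without prerequisites, repeatedly pick a random
# executable node, and unlock its successors. The graph is illustrative.
import random

def sample_sequence(prereq):
    """prereq maps each node to the set of nodes that must precede it."""
    done, seq = set(), []
    pending = set(prereq)
    while pending:
        # nodes whose ordering constraints are already met
        ready = [n for n in sorted(pending) if prereq[n] <= done]
        n = random.choice(ready)
        pending.discard(n)
        done.add(n)
        seq.append(n)
    return seq

graph = {"N1": set(), "N2": {"N1"}, "N3": set(), "N4": {"N2", "N3"}}
seq = sample_sequence(graph)
# Every sampled sequence respects the partial order, e.g. N1 before N2.
```

Because the choice among the ready nodes is uniform, switching to a parallel branch is exactly as likely as continuing on the current one, which explains the interconnections discussed above.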
2) Classification in the presence of noise: Observations of activities often comprise irrelevant actions that are performed in between the essential actions, like wiping up spilled liquids or drinking a glass of water while cooking a meal. Similar action noise can result from errors in the segmentation of observations into single actions.
To test the influence of irrelevant actions in between the important ones, we modified the sampling algorithm described earlier so that, in each step, a noise action may be chosen instead of one of the relevant actions with a certain probability. Formally, the update of the candidate set changes to N_{i+1} = (N_i \ {n_i}) ∪ {n not yet chosen | prereq(n) ⊆ {n_1, ..., n_i}} ∪ X, where X is a set of noise actions, i.e. actions that are irrelevant to the activity. In the experiments, we sampled from |X| = 10 noise actions, denoted x_0, ..., x_9, with a probability of 10%, 20% and 50%, respectively. The sequences in both the training and the testing set comprised these noise actions, so the system did not know a priori which actions are actually relevant.
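The noise-augmented sampling step can be sketched as a small variation of the basic procedure; the parameter names and the example graph are illustrative, not taken from the original experiments.

```python
# Hedged sketch of noise-augmented sampling: with probability p, a noise
# action x0..x9 is emitted instead of a relevant one (names illustrative).
import random

def sample_with_noise(prereq, p=0.2,
                      noise=tuple("x%d" % i for i in range(10))):
    done, seq = set(), []
    pending = set(prereq)
    while pending:
        if random.random() < p:
            seq.append(random.choice(noise))  # irrelevant action inserted
            continue
        ready = [n for n in sorted(pending) if prereq[n] <= done]
        n = random.choice(ready)
        pending.discard(n)
        done.add(n)
        seq.append(n)
    return seq

noisy_seq = sample_with_noise({"N1": set(), "N2": {"N1"}}, p=0.5)
```

Filtering out the x-actions from any sampled sequence recovers a valid ordering of the relevant actions, which mirrors the experimental setup: the noise obscures, but does not violate, the underlying partial order.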
Figure 5 (right) shows the classification performance (F1 score) of our system. The results were obtained by approximate inference on the BLN model using Backward Sampling with 5000 samples (note that this is not the size of the training or testing database, but the number of samples drawn by the Bayesian network inference algorithm). Even with the very noisy sequences, in which about half of the actions are irrelevant to the activity, the system is still able to learn a model that allows for good classification. If there is little noise (lines without markers), as few as five example sequences suffice for reasonable performance, while the noisier data requires about 15 sequences to obtain similar results.
We compare the classification results to those obtained using Hidden Conditional Random Fields (HCRFs, [12]), which have been shown to outperform Hidden Markov Models and Conditional Random Fields, probably the most commonly used methods in action recognition. HCRFs model the sequence of actions, but cannot take longer-range dependencies like global ordering constraints into account. The results in Figure 5 suggest that the model gets confused by the large variation in the data and the significant amount of noise. While the results are still rather stable for low-noise data (lines without markers), they get much worse when the proportion of irrelevant actions increases.
3) Inferring the types of single actions: Since the models learn a joint probability distribution over all aspects of the action, they can be used for different inferences, for example to infer the most likely type of a single action in a sequence: We randomly sampled sequences from the noisiest version of both activities (50% noise actions), removed the type of an arbitrary action in the test sequence, and inferred this type given the rest of the sequence. The exemplary results in Table I show that it is possible to infer the type of an action given the type of the activity and the surrounding actions. The results also indicate that the model has learned which actions are easy to identify. Action N8, for example, is always the last non-noise action in every sequence and can thus easily be identified (seq. 12, 33). When there is confusion, it is mostly between actions on a similar level of the precedence graph (e.g. N4 and N1 in seq. 37) or between direct predecessors and successors (as in seq. 25, where N5 and N6 are direct predecessors of N7).
4) Identifying (ir)relevant actions: A priori, the system does not know which actions are relevant and which are just noise. Using the proposed models, the probability of an action given the activity can be calculated:

P(actionT(a_i) | activityT(S_s) = T)
Table II shows that, even in the extreme case of 50% noise actions, the relevant actions are more consistent across the observed episodes and therefore have a higher probability. Since both activities consist of the same number of actions, the results are identical for the Act-1 and the Act-2 activity.
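Under the counting interpretation used throughout the paper, such a relevance score can be sketched as the fraction of observed episodes of an activity that contain a given action. The function and the toy data below are illustrative, not the paper's implementation.

```python
# Illustrative sketch of an action-relevance query: estimate P(a | activity)
# as the fraction of episodes containing action a, so consistently occurring
# actions score high and incidental noise actions score low.
def action_relevance(sequences):
    n = len(sequences)
    actions = {a for seq in sequences for a in seq}
    return {a: sum(a in seq for seq in sequences) / n for a in actions}

rel = action_relevance([["N1", "x3", "N2"],
                        ["N1", "N2"],
                        ["N1", "x7", "N2"]])
# The consistent actions N1 and N2 score 1.0; the incidental noise
# actions x3 and x7 each score 1/3.
```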

B. TUM Kitchen Data Set
As a real-world data set, we use the TUM Kitchen Data Set [13] for evaluation, which contains several observations of different subjects performing a table-setting task. In addition to motion-capture data, it also provides information about objects that are manipulated (from RFID readings) and doors and drawers being opened (via magnetic sensors). All subjects perform the same activity (setting the table for one person), using the same objects, but in different order: Some behave like an (inefficient) robot that transports the objects one by one, others are more human-like in carrying several objects at once. On the one hand, this makes the data set quite structured; on the other hand, it creates a difficult classification challenge since all objects and actions are identical for both classes. In total, there are m = 8 types of actions like Reaching or OpeningACupboard, and the observation sequences have a length of about 70 action segments. P = {objectActedOn}, the object an action is performed on, is the only property. The BLN structure for this data set is shown in Figure 3 (right). In this paper, we do not deal with the problem of segmenting the continuous motion, but rather use the manually created labels provided with the data set. Inferring these segments from the data is a challenge in itself, and some first work on this topic has been presented by the authors of the data set [13].
Visualizing the learned model is difficult since the object type influences the order. However, when plotting, for each action a_1, the conditional probability over the combinations o_1 × a_2 × o_2 of its object, the subsequent action, and that action's object, a peaked, sparse distribution can be observed (Figure 6). Many values are zero because several object-action pairs never occur (like opening a knife). Some actions always occur before others (conditional probability of one), others have softer ordering constraints, as can be seen by the lower peaks in the diagram. We noticed in our experiments that such sparse, peaked distributions are typical for problems that show a distinct partial order.
1) Classification performance: We tested the model by discriminating between two different styles of setting the table, in the following referred to as robot-like (transporting one object at a time) and human-like (a more natural behavior, including e.g. grasping all pieces of silverware at once). Due to a lack of data, the test sequences were manually created to be typical examples of each activity style by changing the order of the transported objects, adding noise actions, and shortening sequences where some object interactions were omitted. One sequence (HumanRobot) was constructed by concatenating the first half of a human-like and the second half of a robot-like sequence.
Table III presents the inference results obtained using Backward Sampling with 5000 samples and, as a comparison, the classification obtained from the HCRF (identical results for m = 3, 5, 10, 20 hidden states).Features for the classification were the action class and the object.
The HCRF fails to classify the sequences and labels all of them as Human, presumably because it did not learn the subtle differences in the ordering. Our system correctly classified almost all of the sequences; only the HumanRobot sequence was classified as Human, whereas an indecisive result would have been expected. Apparently, the parts of the Human sub-sequence are more salient than those in the Robot part of the sequence.
As mentioned before, all actions and objects are identical for both classes and only the order differs. In other cases, the distinction between different activities would obviously be much easier.

C. CMU MMAC Data Set
The CMU MMAC Data Set [14] provides observations of 43 subjects cooking 5 different recipes. So far, only part of the data has been labeled, namely a subset of the 'making brownies' and the 'cooking an omelette' recipes, which we use for learning the models. On this data, we present some queries which show that the models do not only represent the ordering, but a complete joint probability distribution over different aspects of the observed actions.
1) Identifying (ir)relevant actions and objects: A priori, the system does not know which actions or objects are relevant for a task. Using the learned models, the probability of an action or object given the activity can be calculated. Those actions that occur several times per activity obviously have a higher probability, and those that are only rarely performed are much less likely.

VI. DISCUSSION
As we demonstrated in this paper, human everyday activities like household chores, assembly tasks in a factory, or games show a significant partial ordering among their actions. However, this is not reflected in many data sets, which have often been recorded in very controlled settings in which the sequence of actions is completely determined, resulting in an artificially imposed total ordering. Lower-level data, e.g. observations on the motion level, also shows a more linear structure since, in smooth motions, subsequent poses mainly depend on the previous ones and less on the global task context. This is why models that are based on the Markov assumption (HMMs, CRFs) perform well on this kind of data.
Regarding scalability, models that represent a partial order are more complex than those describing only a sequence, theoretically scaling quadratically with the length of the sequence, the number of actions, and the number of parameters. In practice, however, the conditional probability table representing the precedence relation is often sparse: Many combinations of actions and objects do not make sense and thus have zero probability (Figure 6), so that the table can be represented efficiently using decision trees [15]. Even without such optimizations, our implementation smoothly handles inference in models of about 40 segments with about 10 action and object classes. Compared to the inference, learning BLNs is generally much less of a problem because parameter learning of Bayesian networks comes down to counting; training on 20,000 sequences runs very fast without problems.

VII. CONCLUSIONS
In this paper, we presented a system for modeling human activities based on Bayesian Logic Networks. The models are learned from observations and represent a full-joint probability distribution over the actions, their properties and their (partial) ordering. Therefore, they can be used not only for classifying activities, but also for more advanced reasoning about action-related properties. We evaluated the system on two real-world data sets of human activities as well as synthetic data in order to analyze in detail the properties of the learned models. This evaluation shows that the approach outperforms models often used in activity recognition, like Conditional Random Fields, for common tasks since it is much less confused by the variation inherent in human activities.

Fig. 2. Precedence graphs for the fictional activities Act-1 (left) and Act-2 (right) which were used for sampling the synthetic action data.

Fig. 3. The model structure for the synthetic data (left) and the TUM kitchen data (right), with dependencies as conditional probability distribution fragments.

Fig. 6. Conditional probability distribution of the precedes-node in the TUM data set. Each curve corresponds to the first action in a pair (a_1), the values on the x-axis denote the set o_1 × a_2 × o_2, and the value of the curve is the conditional probability that a_1 performed on o_1 precedes a_2 performed on o_2. The very peaked distribution indicates distinct ordering constraints.

TABLE II. RELEVANCE OF AN ACTION AS ITS PROBABILITY GIVEN AN ACTIVITY.