Inference Machines: Parsing Scenes via Iterated Predictions

2018-07-01T03:18:02Z (GMT) by Daniel Munoz
<p>Extracting a rich representation of an environment from visual sensor readings can benefit many tasks in robotics, e.g., path planning, mapping, and object manipulation. While important progress has been made, effectively parsing entire scenes, i.e., recognizing semantic objects, man-made structures, and landforms, remains a difficult problem. This process requires not only recognizing individual entities but also understanding the contextual relations among them.</p>

<p>The prevalent approach to encoding such relationships is a joint probabilistic or energy-based model, which lets one naturally write down these interactions. Unfortunately, performing exact inference over these expressive models is often intractable, and solutions can only be approximated. While there exists a set of sophisticated approximate inference techniques to choose from, the combination of learning and approximate inference for these expressive models is still poorly understood in theory and limited in practice. Furthermore, using approximate inference on any learned model often leads to suboptimal predictions due to the inherent approximations.</p>

<p>As we ultimately care about predicting the correct labeling of a scene, and not necessarily learning a joint model of the data, this work instead proposes to view the approximate inference process as a modular procedure that is directly trained to produce a correct labeling of the scene. Inspired by early hierarchical models for scene parsing in the computer vision literature, the proposed inference procedure is structured to incorporate both feature descriptors and contextual cues computed at multiple resolutions within the scene. We demonstrate that this inference machine framework for parsing scenes via iterated predictions offers the best of both worlds: state-of-the-art classification accuracy and computational efficiency when processing images and/or unorganized 3-D point clouds. Additionally, we address critical problems that arise in practice when parsing scenes on board real-world systems: integrating data from multiple sensor modalities and efficiently processing data that is continuously streaming from the sensors.</p>
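The core idea above — replacing joint-model inference with a sequence of directly trained predictors, each consuming node features plus contextual cues (the previous stage's predicted beliefs aggregated over neighbors) — can be illustrated with a toy sketch. Everything below is hypothetical: a 1-D chain "scene", a hand-rolled logistic-regression classifier, and a simple neighbor-averaging context function stand in for the real descriptors, classifiers, and multi-resolution context of the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "scene": a 1-D chain of nodes, class 0 on the left half and class 1 on
# the right, with noisy unary features so neighbor context genuinely helps.
n = 200
labels = (np.arange(n) >= n // 2).astype(float)
feats = labels + rng.normal(0.0, 1.0, size=n)   # noisy per-node descriptor

def neighbor_context(beliefs):
    """Average the previous stage's predicted beliefs over each node's two
    chain neighbors -- the contextual cue fed to the next predictor."""
    padded = np.pad(beliefs, 1, mode="edge")
    return (padded[:-2] + padded[2:]) / 2.0

def fit_logistic(X, y, lr=0.5, steps=500):
    """Plain gradient-descent logistic regression; a stand-in for whatever
    classifier each stage of the inference machine actually uses."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def design(feats, context):
    # Input to each stage: bias, the node's own feature, neighbor beliefs.
    return np.column_stack([np.ones_like(feats), feats, context])

# Stage-wise training: each stage's classifier is trained on node features
# concatenated with the beliefs produced by the previous stage.  (In practice
# stacking / held-out predictions are used to keep later stages from
# overfitting to earlier ones; omitted here for brevity.)
beliefs = np.full(n, 0.5)     # uninformative initial beliefs
stages = []
for _ in range(3):
    X = design(feats, neighbor_context(beliefs))
    w = fit_logistic(X, labels)
    beliefs = 1.0 / (1.0 + np.exp(-(X @ w)))
    stages.append(w)

accuracy = np.mean((beliefs > 0.5) == labels)
```

At test time the same stages would be replayed in order on a new scene: compute features, then alternate predicting beliefs and recomputing neighbor context. Note how inference here is just a fixed number of forward passes through trained classifiers, which is the source of the framework's computational efficiency.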