System overview.
(A) Action generation mode. Inputs to the system were the proprioception m̂_t and the vision sense ŝ_t. Based on the current m̂_t and ŝ_t, the system generated predictions of the proprioception m_{t+1} and the vision sense s_{t+1} for the next time step. The predicted proprioception m_{t+1} was sent to the robot in the form of target joint angles, which acted as motor commands driving the robot's movements and its interaction with the physical environment. Changes in the environment were sent back to the system as sensory feedback. The main components of the system were modeled by a continuous-time recurrent neural network (CTRNN) consisting of input-output units and context units. The context units were divided into two groups based on the value of the time constant τ: fast context units (τ = 5) and slow context units (τ = 70). Every unit of the CTRNN was connected to every other unit, including itself, except that the input units had no direct connections to the slow context units (see Method).

(B) Training mode. During training, the network generated behavior sequences based on the synaptic weights at a given point in the learning process, and the synaptic weights were updated based on the error between the generated predictions (m_{t+1}, s_{t+1}) and the teaching signals (m*_{t+1}, s*_{t+1}). In training mode, the robot did not interact with the physical environment; instead of actual sensory feedback, the predicted proprioception and vision served as the input for the following time step (mental simulation). Through this mental simulation process, the network could autonomously reproduce behavior sequences without producing actual movements. In addition to this virtual sensory feedback, and in order to accelerate convergence, a small amount of the corresponding teaching signal m*_{t+1}, s*_{t+1} was also mixed into the prediction m_{t+1}, s_{t+1} used as the next input (see Method for details). In both the generation mode and the training mode, the initial state of the slow context units was set according to the task goal.
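To make the architecture concrete, the following is a minimal sketch of a leaky-integrator (CTRNN) update with two context groups running at the time constants given above (τ = 5 and τ = 70) and with the input-to-slow-context connections removed. The unit counts, the tanh activation, the time constant assumed for the input-output units, and the weight initialization are illustrative assumptions, not the settings of the original model.

```python
import numpy as np

# Sketch of one CTRNN step with fast and slow context units.
# All sizes and initializations below are assumptions for illustration.

N_IO   = 10   # input-output units (proprioception + vision)
N_FAST = 30   # fast context units, tau = 5
N_SLOW = 10   # slow context units, tau = 70
N      = N_IO + N_FAST + N_SLOW

tau = np.concatenate([np.full(N_IO, 2.0),      # assumed tau for IO units
                      np.full(N_FAST, 5.0),    # fast context
                      np.full(N_SLOW, 70.0)])  # slow context

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, size=(N, N))          # full recurrent connectivity

# Remove the direct connections from input units to slow context units.
io   = slice(0, N_IO)
slow = slice(N_IO + N_FAST, N)
W[slow, io] = 0.0

def ctrnn_step(u, x_t, W=W, tau=tau):
    """One leaky-integrator update.

    u   : internal states of all units, shape (N,)
    x_t : external input (m̂_t and ŝ_t), shape (N_IO,)
    Returns the new internal states and the unit activations.
    """
    y = np.tanh(u)                        # current activations
    ext = np.zeros(N)
    ext[io] = x_t                         # external input enters the IO units only
    u_new = (1.0 - 1.0 / tau) * u + (1.0 / tau) * (W @ y + ext)
    return u_new, np.tanh(u_new)
```

In generation mode, the input-output part of the activation at each step would be read out as the prediction (m_{t+1}, s_{t+1}), with m_{t+1} sent to the robot as target joint angles.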
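The closed-loop "mental simulation" used in training can likewise be sketched as feeding each prediction back as the next input, lightly pulled toward the teaching signal. The mixing ratio, the `network` interface, and the name `closed_loop_rollout` are hypothetical; the actual mixing scheme and weight update are described in the Method.

```python
import numpy as np

MIX = 0.1  # assumed fraction of the teaching signal mixed into the feedback

def closed_loop_rollout(network, u0, x0, teaching):
    """Roll the network forward without physical sensory feedback.

    network  : callable (u, x) -> (u_next, prediction), where prediction
               is the next-step (m_{t+1}, s_{t+1}) read from the IO units
    u0       : initial internal state (the slow-context part encodes the task goal)
    x0       : first input (m̂_0, ŝ_0)
    teaching : teaching signals (m*_{t+1}, s*_{t+1}), shape (T, dim)
    Returns the predicted sequence, to be compared with `teaching` when
    computing the prediction error used for the weight update.
    """
    u, x = u0, x0
    predictions = []
    for target in teaching:
        u, pred = network(u, x)
        predictions.append(pred)
        # Virtual feedback: the prediction, mixed with a small amount of
        # the teaching signal, becomes the input at the next time step.
        x = (1.0 - MIX) * pred + MIX * target
    return np.array(predictions)
```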