Robotic manipulation, owing to its multi-modal nature, often faces significant training ambiguity, necessitating explicit instructions to clearly delineate the manipulation details in tasks. In this work, we highlight that vision instruction is naturally more comprehensible to recent robotic policies than the commonly adopted text instruction, as these policies are born with some vision understanding ability, like human infants. Building on this premise and drawing inspiration from cognitive science, we introduce the robotic imagery paradigm, which realizes large-scale robotic data pre-training without text annotations. Additionally, we propose the robotic gaze strategy, which emulates the human eye gaze mechanism, thereby guiding subsequent actions and focusing the attention of the policy on the manipulated object. Leveraging these innovations, we develop VIRT, a fully Transformer-based policy. We design comprehensive tasks using both a physical robot and simulated environments to assess the efficacy of VIRT. The results indicate that VIRT can complete highly challenging tasks such as ``opening the lid of a tightly sealed bottle'', and the proposed techniques boost the success rates of the baseline policy on diverse challenging tasks from nearly 0% to more than 65%.
We first pre-train VIRT on large-scale robotic manipulation data using the proposed robotic imagery paradigm. Then, we fine-tune the pre-trained policy on specific downstream tasks with the robotic gaze strategy. After these two training phases, VIRT is able to complete diverse challenging tasks in both real-robot and simulated environments.
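As a rough illustration of the gaze idea, the observation fed to the policy can be re-centered on the manipulated object, mimicking how human gaze foveates on the target. The sketch below is hypothetical and not from the VIRT codebase: `gaze_crop` is an assumed helper that crops a fixed-size window around a predicted gaze center, zero-padding at image borders so the policy always receives a fixed-resolution, object-centered view.

```python
import numpy as np

def gaze_crop(image: np.ndarray, center: tuple, size: int) -> np.ndarray:
    """Crop a size x size window around a predicted gaze center (cy, cx).

    Regions of the window that fall outside the image are zero-padded,
    so the output resolution is constant regardless of where the
    manipulated object lies in the frame.
    """
    h, w = image.shape[:2]
    half = size // 2
    cy, cx = center
    # Fixed-size output buffer; zeros act as padding near image borders.
    out = np.zeros((size, size) + image.shape[2:], dtype=image.dtype)
    # Intersection of the desired window with the image bounds.
    y0, y1 = max(cy - half, 0), min(cy + half, h)
    x0, x1 = max(cx - half, 0), min(cx + half, w)
    # Offset inside the output buffer where the valid region starts.
    oy, ox = y0 - (cy - half), x0 - (cx - half)
    out[oy:oy + (y1 - y0), ox:ox + (x1 - x0)] = image[y0:y1, x0:x1]
    return out
```

In such a setup, the gaze center would come from the policy's own object-localization head during fine-tuning; the crop then narrows the visual input so attention concentrates on the object being manipulated.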
In this video, we show how VIRT completes the task of opening the lid of a tightly sealed bottle. This task mainly tests the precise manipulation ability of a policy. Interestingly, we can observe that when the gripper grasps at the wrong position, the policy automatically adjusts it. The video is played at its original speed.
In this video, we show how VIRT completes the task of pouring blueberries into the juicer cup. In this task, the robot needs to perform diverse kinds of actions and manipulate multiple objects, thereby evaluating the long-term manipulation capability of the policy. The video is played at its original speed.
In this video, the robot needs to rearrange the three plates on the small table according to a random order given at test time. In the shown trial, the order is purple, pink, and blue, and this order information is provided to VIRT through our proposed vision instruction. This task verifies the instruction-following performance of a policy. The video is played at its original speed.
In this video, we show how VIRT completes the Move a Single Box task. This task is established in Isaac Gym and is relatively easy. The robot needs to move the box on the table into the container. The video is played at its original speed.
In this video, we show how VIRT completes the Transfer the Specified Box task. This task is established in Isaac Gym. As shown, there are five boxes of different colors on the table. At test time, a box color is randomly specified, and the robot needs to move the box of that color into the container. In the shown case, green is specified. The video is played at its original speed.
In this video, we show how VIRT completes the Stack the Specified Boxes task. This task is established in Isaac Gym. At test time, a random instruction is generated, such as ``first place the purple box in the container and then stack the blue box on this purple box''. The robot needs to follow the given instruction to complete this task. The video is played at its original speed.
@article{li2024virt,
title={VIRT: Vision Instructed Transformer for Robotic Manipulation},
author={Li, Zhuoling and Ren, Liangliang and Yang, Jinrong and Zhao, Yong and others},
journal={arXiv preprint arXiv:2410.07169},
year={2024}
}