Robotic manipulation, owing to its multi-modal nature, often faces significant training ambiguity, necessitating explicit instructions to clearly delineate the manipulation details in tasks. In this work, we highlight that vision instruction is naturally more comprehensible to recent robotic policies than the commonly adopted text instruction, as these policies are born with some vision understanding ability, like human infants. Building on this premise and drawing inspiration from cognitive science, we introduce the robotic imagery paradigm, which realizes large-scale robotic data pre-training without text annotations. Additionally, we propose the robotic gaze strategy, which emulates the human eye gaze mechanism, thereby guiding subsequent actions and focusing the attention of the policy on the manipulated object. Leveraging these innovations, we develop VIRT, a fully Transformer-based policy. We design comprehensive tasks using both a physical robot and simulated environments to assess the efficacy of VIRT. The results indicate that VIRT can complete highly challenging tasks such as ``opening the lid of a tightly sealed bottle'', and the proposed techniques boost the success rates of the baseline policy on diverse challenging tasks from nearly 0% to more than 65%.
We first pre-train VIRT on large-scale robotic manipulation data using the proposed robotic imagery paradigm. Then, we fine-tune the pre-trained policy on specific downstream tasks with the robotic gaze strategy. After these two training phases, VIRT is able to complete diverse challenging tasks in both real-robot and simulated environments.
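As a rough illustration of the gaze idea, the observation fed to the policy can be re-centered on the manipulated object, mimicking how human gaze foveates on the target. The sketch below is hypothetical and not from the VIRT codebase: `gaze_crop` is an assumed helper that crops a fixed-size window around a predicted gaze center, zero-padding at image borders so the policy always receives a fixed-resolution, object-centered view.

```python
import numpy as np

def gaze_crop(image: np.ndarray, center: tuple, size: int) -> np.ndarray:
    """Crop a size x size window around a predicted gaze center (cy, cx).

    Regions of the window that fall outside the image are zero-padded,
    so the output resolution is constant regardless of where the
    manipulated object lies in the frame.
    """
    h, w = image.shape[:2]
    half = size // 2
    cy, cx = center
    # Fixed-size output buffer; zeros act as padding near image borders.
    out = np.zeros((size, size) + image.shape[2:], dtype=image.dtype)
    # Intersection of the desired window with the image bounds.
    y0, y1 = max(cy - half, 0), min(cy + half, h)
    x0, x1 = max(cx - half, 0), min(cx + half, w)
    # Offset inside the output buffer where the valid region starts.
    oy, ox = y0 - (cy - half), x0 - (cx - half)
    out[oy:oy + (y1 - y0), ox:ox + (x1 - x0)] = image[y0:y1, x0:x1]
    return out
```

In such a setup, the gaze center would come from the policy's own object-localization head during fine-tuning; the crop then narrows the visual input so attention concentrates on the object being manipulated.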
In this video, we show how VIRT completes the task of opening the lid of a tightly sealed bottle. This task mainly tests the precise manipulation ability of a policy. Interestingly, we can observe that when the gripper grasps at the wrong position, the policy automatically adjusts it. The video is played at its original speed.
In this video, we show how VIRT completes the task of pouring blueberries into the juicer cup. In this task, the robot needs to perform diverse kinds of actions and manipulate multiple objects, thereby evaluating the long-term manipulation capability of the policy. The video is played at its original speed.
In this video, the robot needs to rearrange the three plates on the small table according to a random order given at test time. In the shown trial, the order is purple, pink, and blue, and this order information is provided to VIRT through our proposed vision instruction. This task verifies the instruction-following performance of a policy. The video is played at its original speed.
In this video, we show how VIRT completes the Move a Single Box task. This task is established in Isaac Gym and is relatively easy. The robot needs to move the box on the table into the container. The video is played at its original speed.
In this video, we show how VIRT completes the Transfer the Specified Box task. This task is established in Isaac Gym. As shown, there are five boxes of different colors on the table. At test time, a box color is randomly specified, and the robot needs to move the box of that color into the container. In the shown case, green is specified. The video is played at its original speed.
In this video, we show how VIRT completes the Stack the Specified Boxes task. This task is established in Isaac Gym. At test time, a random instruction is generated, such as ``first place the purple box in the container and then stack the blue box on this purple box''. The robot needs to follow the given instruction to complete this task. The video is played at its original speed.
@article{li2024virt,
title={VIRT: Vision Instructed Transformer for Robotic Manipulation},
author={Li, Zhuoling and Ren, Liangliang and Yang, Jinrong and Zhao, Yong and others},
journal={arXiv preprint arXiv:2410.07169},
year={2024}
}