Building a Synthetic Motion Generation Pipeline for Humanoid Robot Learning

General-purpose humanoid robots are designed to adapt quickly to existing human-centric urban and industrial workspaces, tackling tedious, repetitive, or physically demanding tasks. Because these mobile robots are built for environments designed around people, they are becoming increasingly valuable everywhere from factory floors to healthcare facilities.

Imitation learning, a subset of robot learning, enables humanoids to acquire new skills by observing and mimicking expert human demonstrations, whether from real videos of humans, from teleoperation demonstrations, or from simulated data. Because it learns from labeled demonstration datasets, imitation learning is well suited to teaching robots complex actions in diverse environments that are difficult to specify programmatically.

While recording a demonstration is often simpler than specifying a reward function, creating a perfect demonstration can be challenging, and robots may still struggle in unforeseen scenarios. Collecting extensive, high-quality datasets in the real world is tedious, time-consuming, and prohibitively expensive. However, synthetic data generated from physically accurate simulations can accelerate the data-gathering process.

The NVIDIA Isaac GR00T blueprint for synthetic manipulation motion generation is a reference workflow built on NVIDIA Omniverse and NVIDIA Cosmos. It generates exponentially large numbers of synthetic motion trajectories for robot manipulation from a small number of human demonstrations.

Using the first components available for the blueprint, NVIDIA was able to generate 780K synthetic trajectories—the equivalent of 6.5K hours, or 9 continuous months, of human demonstration data—in just 11 hours. Then, combining the synthetic data with real data, NVIDIA improved the GR00T N1 performance by 40%, compared to using only real data.

In this post, we describe how to use a spatial computing device, such as the Apple Vision Pro, or another capture device such as a SpaceMouse, to portal into a simulated robot’s digital twin and record motion demonstrations by teleoperating the simulated robot. These recordings are then used to generate a larger set of physically accurate synthetic motion trajectories. The blueprint further augments the dataset by producing an exponentially large, photorealistic, and diverse set of training data. We then post-train a robot policy model using this data.

The workflow begins with data collection, where a high-fidelity device like the Apple Vision Pro is used to capture human movements and actions in a simulated environment. The Apple Vision Pro streams hand tracking data to a simulation platform such as Isaac Lab, which simultaneously streams an immersive view of the robot’s environment back to the device. This setup enables intuitive, interactive control of the robot and facilitates the collection of high-quality teleoperation data.

With the simulation streamed to the Apple Vision Pro, you see the robot’s environment directly in the headset and, by moving your hands, control the robot to perform various tasks, making teleoperation immersive and interactive.
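To make this concrete, the following is a minimal sketch of what a teleoperation-recording loop looks like in Isaac Lab. It uses the SpaceMouse device class that ships with Isaac Lab in place of the Apple Vision Pro hand-tracking device (which follows the same advance()-style interface). The task ID, sensitivities, and gripper sign convention are placeholders, and module paths vary between Isaac Lab releases; the teleoperation and demo-recording examples bundled with Isaac Lab cover the full workflow, including exporting the recording to HDF5.

```python
"""Sketch of a teleoperation-recording loop in Isaac Lab (illustrative only).

Assumptions: Isaac Lab 1.x module paths (newer releases rename omni.isaac.lab to isaaclab),
a relative-pose (IK-Rel) manipulation task, and a SpaceMouse as the input device.
"""
from omni.isaac.lab.app import AppLauncher

# The simulation app must be launched before importing other Isaac Lab modules.
app_launcher = AppLauncher(headless=False)
simulation_app = app_launcher.app

import gymnasium as gym
import torch

from omni.isaac.lab.devices import Se3SpaceMouse
from omni.isaac.lab_tasks.utils import parse_env_cfg

TASK = "Isaac-Stack-Cube-Franka-IK-Rel-v0"  # placeholder task ID; use the task registered in your install

env_cfg = parse_env_cfg(TASK)
env = gym.make(TASK, cfg=env_cfg)
teleop = Se3SpaceMouse(pos_sensitivity=0.05, rot_sensitivity=0.005)

obs, _ = env.reset()
episode = []  # (observation, action) pairs to export as one demonstration

while simulation_app.is_running():
    delta_pose, gripper_command = teleop.advance()  # 6-DoF delta pose + binary gripper command
    # Pack the command into the env's action tensor: [dx, dy, dz, droll, dpitch, dyaw, gripper].
    # The gripper sign convention depends on the task definition.
    action = torch.tensor(
        [[*delta_pose, -1.0 if gripper_command else 1.0]],
        dtype=torch.float32, device=env.unwrapped.device,
    )
    episode.append((obs, action))
    obs, _, terminated, truncated, _ = env.step(action)
    if terminated.any() or truncated.any():
        break

env.close()
```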

Synthetic manipulation motion trajectory generation using GR00T-Mimic

After the data is collected, the next step is synthetic trajectory generation. Isaac GR00T-Mimic is used to extrapolate from a small set of human demonstrations to create a vast number of synthetic motion trajectories.

This process involves annotating key points in the demonstrations and using interpolation to ensure that the synthetic trajectories are smooth and contextually appropriate. The generated data is then evaluated and refined to meet the criteria required for training.
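Conceptually, this follows a MimicGen-style recipe: each demonstration is segmented into object-centric subtasks, the end-effector poses of each segment are re-expressed relative to the manipulated object, and the segments are replayed against new object poses with interpolated transitions in between. The sketch below illustrates that idea with homogeneous transforms; it is a schematic, not the blueprint's actual implementation, and the function names are purely illustrative.

```python
"""Schematic sketch of MimicGen-style trajectory adaptation (conceptual, not the blueprint's code).

A recorded end-effector segment is re-expressed relative to the object it manipulates,
then replayed against that object's new pose in a randomized scene. A simple interpolation
bridges the robot's current pose to the start of the adapted segment; rotations would
normally be interpolated with slerp, which is omitted here for brevity.
"""
import numpy as np


def adapt_segment(ee_poses_src: np.ndarray, obj_pose_src: np.ndarray, obj_pose_new: np.ndarray) -> np.ndarray:
    """Re-target a source end-effector segment (N x 4 x 4) to a new object pose (4 x 4)."""
    # Rigid transform that maps the object's source pose onto its new pose.
    T_rel = obj_pose_new @ np.linalg.inv(obj_pose_src)
    # Applying the same transform to every end-effector pose preserves the
    # hand-to-object relationship demonstrated by the human operator.
    return np.einsum("ij,njk->nik", T_rel, ee_poses_src)


def interpolate_translation(start_pose: np.ndarray, target_pose: np.ndarray, num_steps: int) -> np.ndarray:
    """Linearly interpolate the translation between two 4 x 4 poses (rotation held fixed)."""
    poses = np.repeat(start_pose[None, ...], num_steps, axis=0)
    for i, alpha in enumerate(np.linspace(0.0, 1.0, num_steps)):
        poses[i, :3, 3] = (1.0 - alpha) * start_pose[:3, 3] + alpha * target_pose[:3, 3]
    return poses
```

Each adapted candidate trajectory is then replayed in simulation and kept only if the task still succeeds, which corresponds to the evaluation and refinement step described above.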

Augmenting and generating a large and diverse dataset


To reduce the simulation-to-real gap, it’s critical to augment the synthetically generated images to achieve the necessary photorealism and to increase diversity by randomizing parameters such as lighting, color, and background.
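In Omniverse, the randomization half of this is commonly scripted with Replicator. The following is a minimal sketch of that idea, assuming a running Isaac Sim session with Replicator enabled; the prim path and value ranges are placeholders for your own scene, and this step is separate from the Cosmos Transfer step described next.

```python
"""Minimal domain-randomization sketch with Omniverse Replicator (illustrative only)."""
import omni.replicator.core as rep

with rep.new_layer():
    # Spawn a pair of sphere lights with randomized intensity, color temperature, and position.
    def randomize_lights():
        lights = rep.create.light(
            light_type="Sphere",
            intensity=rep.distribution.uniform(30000, 70000),
            temperature=rep.distribution.uniform(3000, 7500),
            position=rep.distribution.uniform((-3.0, -3.0, 2.0), (3.0, 3.0, 4.0)),
            count=2,
        )
        return lights.node

    # Randomize the color of a background/table prim (placeholder path).
    def randomize_background_color():
        prims = rep.get.prims(path_pattern="/World/Table")
        with prims:
            rep.randomizer.color(colors=rep.distribution.uniform((0.1, 0.1, 0.1), (0.9, 0.9, 0.9)))
        return prims.node

    rep.randomizer.register(randomize_lights)
    rep.randomizer.register(randomize_background_color)

    # Re-randomize the scene for each of 100 captured frames.
    with rep.trigger.on_frame(num_frames=100):
        rep.randomizer.randomize_lights()
        rep.randomizer.randomize_background_color()
```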

Achieving photorealism, however, typically entails building detailed 3D scenes and objects, which requires considerable time and expertise. With Cosmos Transfer world foundation models (WFMs), that process can be accelerated from hours to minutes using simple text prompts.

Figures 5 and 6 show an example of the photorealism that can be achieved by passing a synthetically generated image through the NVIDIA Cosmos Transfer WFM.

Post-training in Isaac Lab using imitation learning

Finally, the synthetic dataset is used to train the robot using imitation learning techniques. In this stage, a policy, such as a recurrent Gaussian mixture model (GMM) from the Robomimic suite, is trained to mimic the actions demonstrated in the synthetic data. The training is conducted in a simulation environment such as Isaac Lab, and the performance of the trained policy is evaluated through multiple trials.

To show how this data can be used, we trained a Franka robot with a gripper to perform a stacking task in Isaac Lab. We used Behavioral Cloning with a recurrent GMM policy from the Robomimic suite. The policy uses two long short-term memory (LSTM) layers with a hidden dimension of 400.

The input to the network consists of the robot’s end-effector pose, gripper state, and relative object poses, while the output is a delta-pose action used to step the robot in the Isaac Lab environment.
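As a point of reference, this kind of recurrent GMM policy can be specified through robomimic's configuration system. The snippet below is a minimal sketch, assuming demonstrations already exported in robomimic's HDF5 format; the dataset path, epoch count, and observation key names are placeholders rather than the blueprint's exact values.

```python
"""Minimal robomimic BC-RNN-GMM configuration sketch (illustrative, not the blueprint's exact setup)."""
from robomimic.config import config_factory

# Start from robomimic's default behavioral cloning config.
config = config_factory(algo_name="bc")

# Dataset exported from Isaac Lab in robomimic's HDF5 format (placeholder path).
config.train.data = "datasets/franka_stack_demos.hdf5"
config.train.num_epochs = 2000          # number of training epochs (illustrative)
config.train.seq_length = 10            # length of the sub-sequences fed to the recurrent policy

# GMM action head, giving the recurrent GMM policy described above.
config.algo.gmm.enabled = True

# Two LSTM layers with a hidden dimension of 400.
config.algo.rnn.enabled = True
config.algo.rnn.rnn_type = "LSTM"
config.algo.rnn.num_layers = 2
config.algo.rnn.hidden_dim = 400

# Low-dimensional observations: end-effector pose, gripper state, relative object poses
# (placeholder key names; they must match the dataset's observation keys).
config.observation.modalities.obs.low_dim = [
    "eef_pos", "eef_quat", "gripper_state", "object_rel_poses",
]
```

The populated config is then handed to robomimic's standard training entry point, which builds the network, loads the HDF5 dataset, and runs the behavioral cloning loop.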

With a dataset consisting of 1K successful demonstrations and 2K iterations, we achieved a training speed of approximately 50 iterations/sec (equivalent to approximately 0.5 hours of training time on the NVIDIA RTX 4090 GPU). Averaging over 50 trials, the trained policy achieved an 84% success rate for the stacking task.
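For completeness, evaluating the trained policy follows the usual pattern of repeated rollouts. The sketch below loads a robomimic checkpoint and counts successes over 50 trials; the checkpoint path is a placeholder, and reset_environment and step_environment are hypothetical wrappers around the Isaac Lab environment's reset and step calls.

```python
"""Sketch of evaluating the trained policy's success rate over repeated trials (illustrative only)."""
import robomimic.utils.file_utils as FileUtils
import robomimic.utils.torch_utils as TorchUtils

device = TorchUtils.get_torch_device(try_to_use_cuda=True)

# Load the trained recurrent GMM policy from a robomimic checkpoint (placeholder path).
policy, _ = FileUtils.policy_from_checkpoint(ckpt_path="models/bc_rnn_gmm_stack.pth", device=device)

NUM_TRIALS, MAX_STEPS = 50, 400
num_successes = 0

for _ in range(NUM_TRIALS):
    obs = reset_environment()   # hypothetical wrapper around the Isaac Lab env reset
    policy.start_episode()      # resets the policy's LSTM hidden state
    for _ in range(MAX_STEPS):
        action = policy(ob=obs)                         # delta-pose + gripper action
        obs, success, done = step_environment(action)   # hypothetical wrapper around env.step
        if success:
            num_successes += 1
            break
        if done:
            break

print(f"Success rate over {NUM_TRIALS} trials: {num_successes / NUM_TRIALS:.0%}")
```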