Translation as a Bridging Action: Transferring Manipulation Skills from Humans to Robots

Abstract

Human manipulation data is cheap, abundant, and diverse, making it one of the most promising resources for scaling up robot learning. Yet transferring skills from humans to robots remains hard. Most prior work treats humans as just another bi-manual 6DoF embodiment, which suffers from two problems. First, hand-pose estimates are noisy. Second, the contact patterns of human fingers differ fundamentally from those of a parallel gripper, so human wrist rotation is semantically misaligned with gripper manipulation. We argue that learning rotation-inclusive action signals from human data is therefore sub-optimal, and instead propose a bridging action representation: the relative wrist translation within the initial head-camera frame, an action space shared by humans and robots. To handle action components that are absent in some embodiments, we build a π₀-like vision-language-action model with interleaved action tokens and attention masking. On a suite of novel bi-manual manipulation tasks, our bridging action transfers human manipulation knowledge to robots far more effectively than noisy 6DoF human actions and scales with the amount of human data.

Imitation Learning · Ego-Centric · Generalization · Cross-Embodiment

Method

We train a vision-language-action policy on human action data and robot tele-operation data with different available action components. Instead of using 6DoF wrist actions for human data, we adopt only the wrist translation, defined over a chunk of k = 30 future steps.
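As a minimal sketch of how such a translation-only action chunk could be computed, the snippet below expresses future wrist positions in the head-camera frame of the current (initial) step and subtracts the current wrist position. The function names, the world-to-camera extrinsics `(R0, t0)`, and the choice to measure every future step relative to the current wrist position (rather than step-to-step deltas) are illustrative assumptions, not details confirmed by the text above.

```python
from typing import List, Sequence

Vec3 = Sequence[float]
Mat3 = Sequence[Sequence[float]]


def to_camera(p_world: Vec3, R: Mat3, t: Vec3) -> List[float]:
    """Map a world-frame point into a camera frame: p_cam = R @ p_world + t."""
    return [sum(R[i][j] * p_world[j] for j in range(3)) + t[i] for i in range(3)]


def bridging_action(
    wrist_world: Sequence[Vec3], R0: Mat3, t0: Vec3, k: int = 30
) -> List[List[float]]:
    """Translation-only action chunk (hypothetical helper).

    wrist_world: k + 1 wrist positions in the world frame; index 0 is the
                 current step, indices 1..k are the future steps.
    R0, t0:      assumed world-to-camera extrinsics of the head camera at the
                 current step (the "initial head-camera frame").
    Returns k translations, each relative to the current wrist position and
    expressed in the initial head-camera frame.
    """
    if len(wrist_world) != k + 1:
        raise ValueError(f"expected {k + 1} wrist positions, got {len(wrist_world)}")
    wrist_cam = [to_camera(p, R0, t0) for p in wrist_world]
    start = wrist_cam[0]
    return [[p[i] - start[i] for i in range(3)] for p in wrist_cam[1:]]
```

Because only differences of positions are kept, the camera offset `t0` cancels and the result depends on the camera orientation alone. No wrist rotation enters the computation, so the same routine applies to human hand tracks and to robot end-effector tracks.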

**Architecture Overview.** (a) We train on a mixture of human and robot action data. (b) Learning from 6DoF human action is challenging because of differing contact patterns and noisy hand-pose estimations. (c) We adopt a π₀-like vision-language-action model with interleaved action sequences to handle missing action components across human and robot data sources.

Per-embodiment action supervision

Each data source supervises only the action components that can be reliably extracted.

Data source	a^3D-wrist	a^6D-eef	a^gripper
In-the-wild human (EgoDex + out-sourced)	✓	–	–
In-lab human action data	✓	–	✓
Robot tele-operation	✓	✓	✓
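One way to read the table above is as a per-source loss mask: a component contributes to the training objective only if its data source provides it. The sketch below is illustrative only. The source and component keys, the `masked_loss` helper, and the choice to average (rather than sum or weight) the available terms are assumptions, not the training objective used in this work.

```python
from typing import Dict

# Which action components each data source supervises (from the table above).
AVAILABLE = {
    "human_in_the_wild": {"wrist_3d"},
    "human_in_lab": {"wrist_3d", "gripper"},
    "robot_teleop": {"wrist_3d", "eef_6d", "gripper"},
}
COMPONENTS = ("wrist_3d", "eef_6d", "gripper")


def masked_loss(per_component_loss: Dict[str, float], source: str) -> float:
    """Average the losses of the components that `source` supervises.

    Components that the source does not provide are ignored, so whatever
    the model predicts for them does not affect the objective.
    """
    available = AVAILABLE[source]
    terms = [per_component_loss[c] for c in COMPONENTS if c in available]
    return sum(terms) / len(terms)
```

For example, with per-component losses `{"wrist_3d": 1.0, "eef_6d": 100.0, "gripper": 3.0}`, an in-lab human sample contributes `(1.0 + 3.0) / 2`, and the large `eef_6d` term is dropped because that component is not supervised for human data.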

Results

We conduct extensive real-world experiments on the ByteMini bi-manual robot to answer: is the bridging representation helpful and scalable for transferable robot skills?

Finding I — Transfer beyond pick-and-place

Training only on pick-and-place data achieves much lower task progress and success rate, showing that generalized PnP data alone is insufficient for the evaluation tasks.

Finding II — Scales with pre-training

Pre-training on large-scale human actions with a^3D-wrist yields substantial improvements, indicating the scalability and effectiveness of the bridging action.

Why not 6DoF human wrist actions?

The mainstream practice extracts relative 6DoF wrist actions, treating humans as another robotic embodiment. We compare this baseline with our bridging action, both trained from scratch. Our representation consistently outperforms the 6DoF baseline, which produces distorted, off-target wrist poses.

Human Actions	Overall Prog. (%)	Overall Succ. (%)
a^6D-eef	34.67	12.50
a^3D-wrist (Ours)	44.58	22.50
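For reference, the gaps implied by the table can be worked out directly from the reported numbers (absolute gains in percentage points, relative gains with respect to the 6DoF baseline):

```latex
\begin{aligned}
\Delta_{\text{prog}} &= 44.58 - 34.67 = 9.91 \text{ points}, &
\frac{9.91}{34.67} &\approx 28.6\% \text{ relative},\\
\Delta_{\text{succ}} &= 22.50 - 12.50 = 10.00 \text{ points}, &
\frac{10.00}{12.50} &= 80\% \text{ relative}.
\end{aligned}
```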

**Qualitative comparison.** 6DoF wrist supervision (left) yields a distorted, off-target wrist pose; our bridging actions (right) yield a natural pose aligned with the handle.

Video Demos

Across 15 manipulation tasks, the robot reproduces behaviors learned from human demonstrations under the same language instructions in the 10-shot setting, that is, with only 10 task-specific robot episodes per task.

Human action video → Shared 3D wrist translation in head-camera coordinates → Robot skill rollout

Each task pairs a human demonstration video with a robot rollout video.

Task 1: Open the microwave door
Task 2: Close the microwave door
Task 3: Take the bowl out of the microwave
Task 4: Place the bowl into the microwave
Task 5: Wipe the microwave top from left to right
Task 6: Wipe the microwave top from right to left
Task 7: Open the drawer
Task 8: Close the drawer
Task 9: Hang the left mug on the mug holder
Task 10: Hang the right mug on the mug holder
Task 11: Stack the left cup onto the other cup
Task 12: Stack the right cup onto the other cup
Task 13: Insert the straw into cup
Task 14: Take the toast from the toaster and put it on a plate
Task 15: Unplug the charger

Bridging Action Visualization

Given the same vision and language input, we ask the model to produce both the bridging action a^3D-wrist and the 6DoF end-effector action a^6D-eef, then project both onto the head camera. The two align closely across diverse tasks, showing that the bridging action is a reliable proxy for executable robot actions.
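The projection step can be sketched with a standard pinhole camera model. The snippet below is a hedged illustration, not the visualization code behind these results. The intrinsics `fx, fy, cx, cy`, the helper names, and the assumption that each predicted translation is measured from the starting wrist position in the head-camera frame are all hypothetical. For a^6D-eef, only the translational part of the action would be passed in.

```python
from typing import List, Sequence, Tuple

Vec3 = Sequence[float]


def project(p_cam: Vec3, fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float]:
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    x, y, z = p_cam
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (fx * x / z + cx, fy * y / z + cy)


def trajectory_pixels(
    start_cam: Vec3,
    translations: Sequence[Vec3],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Tuple[float, float]]:
    """Turn predicted translations into a 2D pixel track on the head camera.

    start_cam:    current wrist position in the head-camera frame.
    translations: predicted offsets from start_cam, one per future step
                  (assumed to be relative to the start, not step-to-step).
    """
    points = [[start_cam[i] + d[i] for i in range(3)] for d in translations]
    return [project(p, fx, fy, cx, cy) for p in points]
```

Running the same function on the translations from both action heads gives two pixel tracks that can be overlaid on the head-camera image and compared point by point.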

Legend: a^3D-wrist (bridging action), a^6D-eef (end-effector action)

Tasks shown:
Open the microwave door
Close the microwave door
Take the bowl out of the microwave
Wipe the microwave top from left to right
Open the drawer
Close the drawer
Hang the left mug on the mug holder
Stack the left cup onto the other cup
Unplug the charger
Take the toast from the toaster and put it on a plate

Conclusion

We investigate the feasibility of learning task-specific bi-manual robot skills from human motion data. We propose a translation-based bridging action representation compatible with both robot and human action data, and introduce an interleaved action sequence to address potentially missing action components. Experiments show that human action co-training and human-only pre-training improve real-robot performance across the 15 evaluation tasks.

Limitations. Using only wrist translation limits transfer on tasks that require rotational adjustments. After co-training, the robot can also struggle to pick up thin objects because of observation and embodiment gaps and the inevitable noise in human actions. Larger-scale, more diverse robot action data can further narrow the human–robot domain gap.

BibTeX

@article{chen2026translation,
  title   = {Translation as a Bridging Action: Transferring Manipulation Skills from Humans to Robots},
  author  = {Chen, Sijin and Jiang, Kaixuan and Shi, Haixin and Wang, Yanhui and Zhong, Weiheng and Li, Haosheng and Jiang, Bo and Liu, Yuxiao and Liu, Xihui},
  journal = {arXiv preprint arXiv:2606.28133},
  year    = {2026}
}