Not all robots 🤖 are equally UMI-able - they can't all track UMI trajectories perfectly.

Our key idea 💡: allow the robot's controller to steer UMI's diffusion policy, leading to more robust cross-embodiment deployment. 🚀

Vanilla Diffusion Policy

Standard UMI policies generate action trajectories from visual observations, which are then tracked by controllers like MPC. But this is a one-way process — the policy tells the robot how to move, while the robot cannot signal back whether those actions are feasible. This mismatch often leads to failures, especially for embodiments like aerial manipulators.
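
As a concrete picture of this one-way flow, here is a minimal sketch; policy, controller, and the trajectory shape are hypothetical placeholders, not the released UMI or MPC interface.

def vanilla_rollout(policy, controller, obs, horizon=16):
    # The policy generates an end-effector action trajectory from visual observations.
    action_traj = policy.sample(obs)  # assumed shape: (horizon, action_dim)

    # The controller tracks each waypoint as best it can, but any tracking
    # error it incurs is never reported back to the policy.
    for waypoint in action_traj:
        cmd = controller.track(waypoint)
        obs = controller.step(cmd)
    return obs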

Embodiment-Aware Diffusion Policy

Our solution is two-way communication. We compute the gradient of the controller's tracking cost with respect to the policy's actions, and feed it back into the diffusion process as guidance. This steers trajectory generation toward actions that the embodiment can actually execute, closing the gap between general policies and embodiment-specific constraints. We call this approach Embodiment-Aware Diffusion Policy (EADP).
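
Concretely, the guidance can be applied inside the reverse-diffusion loop. Below is a minimal sketch assuming a diffusers-style scheduler and a differentiable tracking cost exposed by the controller; denoiser, tracking_cost, guidance_scale, and the trajectory shape are illustrative assumptions, not the paper's exact interface.

import torch

def eadp_sample(denoiser, scheduler, tracking_cost, obs, guidance_scale=1.0):
    # Start from Gaussian noise over the action trajectory.
    actions = torch.randn(1, 16, 10)  # (batch, horizon, action_dim), illustrative shape

    for t in scheduler.timesteps:
        # Standard reverse-diffusion (denoising) step, conditioned on observations.
        with torch.no_grad():
            noise_pred = denoiser(actions, t, obs)
        actions = scheduler.step(noise_pred, t, actions).prev_sample

        # Embodiment-aware guidance: nudge the partially denoised trajectory
        # down the gradient of the controller's tracking cost, steering sampling
        # toward trajectories this embodiment can actually execute.
        actions = actions.detach().requires_grad_(True)
        cost = tracking_cost(actions)  # scalar cost from the low-level controller
        grad = torch.autograd.grad(cost, actions)[0]
        actions = (actions - guidance_scale * grad).detach()

    return actions

The guidance scale, and the portion of the denoising schedule over which guidance is applied, are tuning knobs; the point is that the controller's cost gradient steers sampling itself rather than merely filtering finished trajectories.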

UMI Benchmark Suite

We evaluate EADP in a simulation suite covering four tasks: open-and-retrieve (long-horizon manipulation), peg-in-hole (precision control), rotate-valve (articulated interaction), and pick-and-place. Videos are shown at 2x speed.

Oracle

Open-and-Retrieve

Peg-in-Hole

Rotate-Valve

Pick-and-Place

Across these benchmarks, performance consistently declines as embodiments become less capable of tracking policy trajectories, revealing the embodiment gap. EADP improves success rates across all tasks and narrows the embodiment gap compared to standard diffusion policies.

Simulation Results

Real-World Results

We evaluate the Peg-in-Hole task in the real world with a 2 cm peg and a 4 cm hole, randomized starting positions, and a 3-minute timeout. The baseline diffusion policy failed on 3 of 5 trials due to a dropped peg or timeout, while EADP succeeded on all five trials (5/5), demonstrating that EADP improves precision in the real world. Videos are shown at 8x speed.

Diffusion Policy

Embodiment-Aware Diffusion Policy

We evaluate the Pick-and-Place task in the real world. The unmanned aerial manipulator (UAM) must harvest a lemon from a randomized location and place it into a basket. EADP completed 4/5 trials successfully, with the only failure occurring when an unripe (green) lemon was selected.

Trial 1

Trial 2

Trial 3

Trial 4

Trial 5

We evaluate the Lightbulb-Insertion task in the real world. This long-horizon task requires threading a bulb into its socket until tight, followed by flipping the switch to confirm success. The task spans over 3 minutes of wall-clock time, underscoring the need for stability over extended horizons. EADP succeeded in all trials (3/3).

Trial 1

Trial 2

Trial 3

Technical Summary Video

Our Team


@misc{gupta2025umionairembodimentawareguidanceembodimentagnostic,
      title={UMI-on-Air: Embodiment-Aware Guidance for Embodiment-Agnostic Visuomotor Policies}, 
      author={Harsh Gupta and Xiaofeng Guo and Huy Ha and Chuer Pan and Muqing Cao and Dongjae Lee and Sebastian Scherer and Shuran Song and Guanya Shi},
      year={2025},
      eprint={2510.02614},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2510.02614}, 
}

If you have any questions, please contact Harsh Gupta.

🧠 Questions and Answers

Q: 🚁 Teleoperating drones is hard—how did you collect enough data?

We didn't directly teleoperate drones. Instead, we leveraged the Universal Manipulation Interface (UMI) — a lightweight handheld gripper that lets humans record diverse manipulation demonstrations without using any robot hardware. This decouples data collection from specific embodiments, making it possible to gather large, in-the-wild datasets quickly while keeping the action space consistent across robots.

Q: 🤖 Isn't this just UMI-on-Legs but with a drone controller?

Not quite. UMI-on-Air introduces ✨ embodiment-aware guidance ✨, where the low-level controller (like MPC for drones) provides gradient feedback during the diffusion sampling process. This two-way communication lets the policy adapt its trajectories to the dynamics of the embodiment in real time — something even the UR10e arm benefits from. So while UMI-on-Legs made UMI policies mobile, UMI-on-Air makes them embodiment-adaptive.

Q: ⚙️ What can't the Embodiment-Aware Diffusion Policy (EADP) do?

EADP currently relies on analytical gradients from model-based controllers (e.g., MPC), which limits its use to systems with known dynamics. However, recent progress in learning-based control — especially reinforcement learning for legged and whole-body systems (e.g., UMI-on-Legs, OmniH2O) — suggests an exciting path forward 🚀. Future versions could use learned or RL-based controllers to provide the same kind of guidance, making EADP more general and scalable.

Q: 🤔 What does "UMI-ability" mean?

"UMI-ability" measures how well a robot can execute trajectories produced by a UMI-trained policy. Robots like fixed-base arms are highly UMI-able—they can closely follow human demonstrations—while drones or legged robots face dynamic and control constraints that make them less so. UMI-on-Air helps bridge this gap by steering trajectories toward what's feasible for each embodiment.

The term actually came up when Prof. Guanya Shi was describing why some embodiments “just listen better” to UMI policies than others. It stuck — and later, Huy Ha decided to formalize it in the paper as a measurable notion.