Animated DAJI chicken running

DAJI: Dynamics-Aligned Joint Intent

Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control

An anticipatory joint-intent interface for streaming language-conditioned humanoid control, built on a simple claim: like human motor behavior, control should first form intent, then execute.

Haozhe Jia1,2,† Honglei Jin1,† Yuan Zhang3,† Youcheng Fan1,† Shaofeng Liang1 Lei Wang5,6 Shuxu Jin3 Kuimou Yu1 Zinuo Zhang3 Jianfei Song2 Wenshuo Chen1,2,* Yutao Yue1,4,*
1HKUST(GZ) 2LimX Dynamics 3Shandong University 4JITRI 5Data61/CSIRO 6Griffith University

Equal contribution  ·  * Corresponding author

Teaser of DAJI showing the anticipatory joint-intent concept.

Instead of using reference trajectories as the deployment interface, DAJI predicts executable and anticipatory joint-intent latents that improve future prediction and long-horizon humanoid control.

Abstract

Natural language is an intuitive interface for humanoid robots, yet streaming whole-body control requires control representations that are executable now and anticipatory of future physical transitions. Existing language-conditioned humanoid systems typically generate kinematic references that a low-level tracker must repair reactively, or use latent or action policies whose outputs do not explicitly encode upcoming contact changes, support transfers, and balance preparation.

We propose DAJI, a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. Experiments show strong results in anticipatory latent learning, single-instruction generation, and streaming instruction following.

A New Generation-to-Execution Paradigm

The paper's central claim is not just that DAJI is a better generator. It argues that language-conditioned humanoid control needs a different interface: one that helps the body prepare for what comes next. The future paradigm is not to generate a brittle reference and repair it afterwards, but to generate intent first and let execution decode that intent under feedback, much closer to how humans prepare movement before it appears.

Anticipatory Intent

DAJI encodes upcoming support shifts, contact transitions, and balance preparation before the visible motion transition is complete. The system prepares the body before the body visibly moves.

Flow-Generated, Directly Deployable

DAJI-Flow generates joint-intent latents directly, and those latents are consumable by DAJI-Act at deployment. The generator is not producing a display-only motion description; it is producing a deployable control interface.

Intent Then Execution

DAJI-Flow predicts future intent chunks while DAJI-Act executes them with live proprioception, supporting long-horizon instruction following through a clean two-stage pattern: intention first, action second.

Framework

DAJI separates generation and execution through a shared joint-intent space. The generator produces low-frequency future intent chunks, and the action policy decodes each latent into stable whole-body actions at control time. This is the key design difference from reference-driven pipelines: the output of generation is already the control interface used at deployment.

DAJI framework diagram showing DAJI-Flow and DAJI-Act.

DAJI-Act

Distills a future-aware privileged teacher into a deployable diffusion action policy.

DAJI-Flow

Autoregressively predicts future joint-intent chunks from language and recent intent history.

Joint-Intent Interface

Replaces reactive reference tracking with an executable, anticipatory latent command space.

Results

The evaluation focuses on whether the learned latent behaves as a deployable anticipatory command: executable under feedback control, predictive of upcoming motion, and stable over long-horizon rollouts. DAJI improves language-motion quality while preserving a key systems property: the flow-generated latent can be directly deployed through DAJI-Act rather than converted into a fragile external reference.

94.42% rollout success on HumanML3D-style generation
0.147 FID on single-instruction generation
0.152 subsequence FID on BABEL
87% 60s rollout success with future-aware latent design

Main Comparison on HumanML3D-Style Robot Motion Generation

Method MM-D ↓ R@1 ↑ R@2 ↑ R@3 ↑ FID ↓ Div ↔ MM ↑ Succ. (%) ↑
GT 0.968 0.720 0.878 0.919 0.000 1.372 -- 100.00
GT (Sim) 1.002 0.687 0.847 0.900 0.059 1.320 -- 94.50
LangWBC 1.462 0.205 0.328 0.412 0.510 1.045 0.365 70.50
ECHO 1.318 0.288 0.432 0.528 0.378 1.148 0.462 82.00
FRoM-W1 1.392 0.245 0.380 0.471 0.445 1.095 0.412 76.00
RoboGhost 1.275 0.325 0.468 0.562 0.335 1.178 0.501 85.00
MotionStreamer 1.225 0.378 0.542 0.645 0.278 1.205 0.585 87.50
TextOp 1.175 0.425 0.595 0.698 0.225 1.228 0.628 90.00
DAJI 1.093 0.549 0.711 0.796 0.147 1.291 0.847 94.42

DAJI improves both language-motion alignment and rollout executability because the generator produces controller-distilled intents rather than external kinematic references.

Long-Horizon Streaming Evaluation on BABEL

Method R@3 ↑ FID ↓ Div ↔ MM-D ↓ Transition FID ↓ Transition Div ↔ PJ ↓ MJ ↓
GT 0.654 0.000 1.299 1.299 0.000 1.182 0.043 0.015
GT (Sim) 0.625 0.059 1.287 1.157 0.054 1.153 0.090 0.022
TextOp 0.287 0.538 0.985 1.514 0.594 0.784 0.231 0.156
DAJI 0.443 0.152 1.219 1.237 0.236 1.033 0.115 0.054

The advantage of DAJI becomes clearer in streaming rollout, where anticipatory intent helps the controller prepare support transfer, posture adaptation, and transition smoothness before the visible change is complete.

Demonstrations

Single-instruction generation and long-horizon streaming instruction following across real-world and simulated humanoid deployments. Each video is conditioned on a natural language instruction and executed through the DAJI joint-intent interface.

BibTeX

@misc{jia2026bodymoveslearninganticipatory,
  title         = {Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control},
  author        = {Haozhe Jia and Honglei Jin and Yuan Zhang and Youcheng Fan and Shaofeng Liang and Lei Wang and Shuxu Jin and Kuimou Yu and Zinuo Zhang and Jianfei Song and Wenshuo Chen and Yutao Yue},
  year          = {2026},
  eprint        = {2605.14417},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2605.14417}
}