DAJI: Dynamics-Aligned Joint Intent

Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control

An anticipatory joint-intent interface for streaming language-conditioned humanoid control, built on a simple claim: like human motor behavior, control should first form intent, then execute.

Haozhe Jia^1,2,† Honglei Jin^1,† Yuan Zhang^3,† Youcheng Fan^1,† Shaofeng Liang¹ Lei Wang^5,6 Shuxu Jin³ Kuimou Yu¹ Zinuo Zhang³ Jianfei Song² Wenshuo Chen^1,2,* Yutao Yue^1,4,*

¹HKUST(GZ) ²LimX Dynamics ³Shandong University ⁴JITRI ⁵Data61/CSIRO ⁶Griffith University

^† Equal contribution · ^* Corresponding author

Paper Code Video BibTeX

Teaser of DAJI showing the anticipatory joint-intent concept.

Instead of using reference trajectories as the deployment interface, DAJI predicts executable and anticipatory joint-intent latents that improve future prediction and long-horizon humanoid control.

Abstract

Natural language is an intuitive interface for humanoid robots, yet streaming whole-body control requires control representations that are executable now and anticipatory of future physical transitions. Existing language-conditioned humanoid systems typically generate kinematic references that a low-level tracker must repair reactively, or use latent or action policies whose outputs do not explicitly encode upcoming contact changes, support transfers, and balance preparation.

We propose DAJI, a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. Experiments show strong results in anticipatory latent learning, single-instruction generation, and streaming instruction following.

A New Generation-to-Execution Paradigm

The paper's central claim is not just that DAJI is a better generator. It argues that language-conditioned humanoid control needs a different interface: one that helps the body prepare for what comes next. The future paradigm is not to generate a brittle reference and repair it afterwards, but to generate intent first and let execution decode that intent under feedback, much closer to how humans prepare movement before it appears.

Anticipatory Intent

DAJI encodes upcoming support shifts, contact transitions, and balance preparation before the visible motion transition is complete. The system prepares the body before the body visibly moves.

Flow-Generated, Directly Deployable

DAJI-Flow generates joint-intent latents directly, and those latents are consumable by DAJI-Act at deployment. The generator is not producing a display-only motion description; it is producing a deployable control interface.

Intent Then Execution

DAJI-Flow predicts future intent chunks while DAJI-Act executes them with live proprioception, supporting long-horizon instruction following through a clean two-stage pattern: intention first, action second.

Framework

DAJI separates generation and execution through a shared joint-intent space. The generator produces low-frequency future intent chunks, and the action policy decodes each latent into stable whole-body actions at control time. This is the key design difference from reference-driven pipelines: the output of generation is already the control interface used at deployment.

DAJI-Act

Distills a future-aware privileged teacher into a deployable diffusion action policy.

DAJI-Flow

Autoregressively predicts future joint-intent chunks from language and recent intent history.

Joint-Intent Interface

Replaces reactive reference tracking with an executable, anticipatory latent command space.

Results

The evaluation focuses on whether the learned latent behaves as a deployable anticipatory command: executable under feedback control, predictive of upcoming motion, and stable over long-horizon rollouts. DAJI improves language-motion quality while preserving a key systems property: the flow-generated latent can be directly deployed through DAJI-Act rather than converted into a fragile external reference.

94.42% rollout success on HumanML3D-style generation

0.147 FID on single-instruction generation

0.152 subsequence FID on BABEL

87% 60s rollout success with future-aware latent design

Qualitative deployment results for DAJI. — DAJI produces executable motion under both streaming instruction switches and single-instruction generation.

Qualitative rollout results for DAJI. — Long-horizon rollouts remain coherent because the latent interface prepares upcoming transitions.

Main Comparison on HumanML3D-Style Robot Motion Generation

Method	MM-D ↓	R@1 ↑	R@2 ↑	R@3 ↑	FID ↓	Div ↔	MM ↑	Succ. (%) ↑
GT	0.968	0.720	0.878	0.919	0.000	1.372	--	100.00
GT (Sim)	1.002	0.687	0.847	0.900	0.059	1.320	--	94.50
LangWBC	1.462	0.205	0.328	0.412	0.510	1.045	0.365	70.50
ECHO	1.318	0.288	0.432	0.528	0.378	1.148	0.462	82.00
FRoM-W1	1.392	0.245	0.380	0.471	0.445	1.095	0.412	76.00
RoboGhost	1.275	0.325	0.468	0.562	0.335	1.178	0.501	85.00
MotionStreamer	1.225	0.378	0.542	0.645	0.278	1.205	0.585	87.50
TextOp	1.175	0.425	0.595	0.698	0.225	1.228	0.628	90.00
DAJI	1.093	0.549	0.711	0.796	0.147	1.291	0.847	94.42

DAJI improves both language-motion alignment and rollout executability because the generator produces controller-distilled intents rather than external kinematic references.

Long-Horizon Streaming Evaluation on BABEL

Method	R@3 ↑	FID ↓	Div ↔	MM-D ↓	Transition FID ↓	Transition Div ↔	PJ ↓	MJ ↓
GT	0.654	0.000	1.299	1.299	0.000	1.182	0.043	0.015
GT (Sim)	0.625	0.059	1.287	1.157	0.054	1.153	0.090	0.022
TextOp	0.287	0.538	0.985	1.514	0.594	0.784	0.231	0.156
DAJI	0.443	0.152	1.219	1.237	0.236	1.033	0.115	0.054

The advantage of DAJI becomes clearer in streaming rollout, where anticipatory intent helps the controller prepare support transfer, posture adaptation, and transition smoothness before the visible change is complete.

Demonstrations

Single-instruction generation and long-horizon streaming instruction following across real-world and simulated humanoid deployments. Each video is conditioned on a natural language instruction and executed through the DAJI joint-intent interface.

BibTeX

@misc{jia2026bodymoveslearninganticipatory,
  title         = {Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control},
  author        = {Haozhe Jia and Honglei Jin and Yuan Zhang and Youcheng Fan and Shaofeng Liang and Lei Wang and Shuxu Jin and Kuimou Yu and Zinuo Zhang and Jianfei Song and Wenshuo Chen and Yutao Yue},
  year          = {2026},
  eprint        = {2605.14417},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2605.14417}
}