Overview

Main Teaser Figure
While general-purpose robots require both broad semantic reasoning and high-precision execution, existing Vision-Language-Action (VLA) models often struggle with the trade-off between these two critical capabilities. To address this, we present GenieReasoner, a unified framework that co-optimizes high-level embodied intelligence and low-level control within a single autoregressive transformer. Our work first establishes ERIQ, a comprehensive benchmark designed to quantify the reasoning bottleneck in robotic manipulation. To bridge the gap from reasoning to precise action, we introduce FACT, a novel flow-based action tokenizer that preserves high-fidelity trajectories in a discrete space. This unified design allows GenieReasoner to significantly outperform both diffusion-based and discrete-action baselines across simulated and real-world tasks.
Demos
Embodied Reasoning
- Logic Reasoning
Identifies that the target is blocked and plans a "remove-then-retrieve" action sequence.
- Open-Set Shelf Resetting
Continuously identifies and restores misplaced items, generalizing to arbitrary SKUs.
- Semantic Following
Precise semantic alignment for both single-step actions and full task execution.
- Spatial Understanding
Reasoning over geometric relationships to place objects in distinct relative orientations.
Robust Generalization
- Water Bottle
I'm thirsty, but I'm trying to lose weight. Please give me a suitable drink.
- Water Bottle (FPV)
- Red Pen
I want to write; please give me the corresponding item.
- Red Pen (FPV)
- White Wadded Paper
Pick up the trash on the desk.
- White Wadded Paper (FPV)
- Remote Control
I want to watch TV and switch channels. Please give me the corresponding item.
- Remote Control (FPV)
- Charging Cable
My phone is out of battery; please give me the corresponding item.
- Charging Cable (FPV)
Human-Robot Interaction
- Cooperation
Always At Your Service
- Adversarial
Dynamic Following
- Interaction
Human Intention Understanding & Action Interaction
Methodology: GenieReasoner

Framework
We seek a unified "Action as Language" paradigm that enables action sequences to inherit the compositional generalization of large Vision-Language Models (VLMs), while simultaneously preserving the high-precision continuous control required for reliable physical execution.
Most prior VLA systems (e.g., π0.5-style) couple a discrete VLM backbone with a continuous policy head to preserve control precision. However, the separation between token-level reasoning and continuous regression can introduce knowledge insulation, which may hinder tight reasoning-to-action alignment and lead to weaker generalization in complex, unseen scenarios.
GenieReasoner addresses this via two complementary directions:
- Unifying perception, reasoning, and action into a single discrete representation to remove cross-objective interference.
- Introducing FACT to discretize continuous trajectories with high-fidelity reconstruction.
This synergistic approach allows the entire model to be optimized with a single autoregressive objective, ensuring that abstract semantic reasoning flows seamlessly into precise physical control without compromise.
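As a concrete illustration of this single-objective training, here is a minimal sketch assuming a decoder-only transformer over a shared vocabulary; the function name and tensor layouts are illustrative assumptions, not the released GenieReasoner interface.

```python
import torch
import torch.nn.functional as F

# Minimal sketch (illustrative names, not the released GenieReasoner API):
# text, observation, and FACT action tokens share one vocabulary, so a single
# next-token cross-entropy co-optimizes reasoning and control.

def unified_ar_loss(model, text_ids, obs_ids, action_ids):
    seq = torch.cat([text_ids, obs_ids, action_ids], dim=1)  # (B, T) one stream
    logits = model(seq[:, :-1])                              # (B, T-1, vocab)
    targets = seq[:, 1:]                                     # shift-by-one labels
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )
```

Because every modality is supervised by the same loss, there is no separate regression head whose gradients can conflict with the language objective.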
FACT: Bridging Discrete Thought & Continuous Action
FACT encodes continuous actions into discrete tokens aligned with the VLM vocabulary, and uses Flow Matching to reconstruct high-fidelity trajectories from the quantized representation.

Tokenizer
Components
- VQ-Encoder
Built on the MM-DiT architecture, it compresses continuous action chunks into compact discrete tokens, transforming complex physical dynamics into a unified vocabulary compatible with the VLM.
- Flow-Matching Decoder
Built on the same MM-DiT backbone, it uses Flow Matching to reconstruct high-fidelity trajectories from discrete tokens, ensuring smooth, precise motion recovery despite the discrete bottleneck.
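A minimal sketch of these two components, assuming a standard VQ bottleneck with a straight-through estimator and a rectified-flow-style velocity target; the class names, shapes, and decoder signature below are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQBottleneck(nn.Module):
    """Nearest-neighbor quantization of encoder latents into discrete tokens."""
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                   # z: (B, N, D) latents
        d = (z.unsqueeze(2) - self.codebook.weight).pow(2).sum(-1)  # (B, N, K)
        idx = d.argmin(dim=-1)                              # discrete action tokens
        zq = self.codebook(idx)
        zq = z + (zq - z).detach()                          # straight-through grads
        return zq, idx                                      # (VQ aux losses omitted)

def flow_matching_loss(decoder, zq, actions):
    """Train the decoder to predict the noise-to-action velocity field."""
    x0 = torch.randn_like(actions)                          # noise endpoint
    t = torch.rand(actions.size(0), 1, 1)                   # per-sample time in [0, 1)
    xt = (1 - t) * x0 + t * actions                         # linear interpolant
    v_pred = decoder(xt, t.flatten(), zq)                   # conditioned on tokens
    return F.mse_loss(v_pred, actions - x0)                 # target velocity x1 - x0
```

At inference, integrating the learned velocity field from noise recovers a continuous action chunk from the discrete tokens alone.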
Why It Works Better

MSE Comparison
- High Fidelity
FACT achieves an order-of-magnitude lower MSE than FAST+ and a markedly more compact token sequence than simple binning. Its MM-DiT Flow-Matching decoder eliminates quantization artifacts, reconstructing smooth, continuous trajectories with sub-millimeter precision (see the reconstruction sketch after this list).
- Unified Space
Resolves the gradient interference that plagues hybrid architectures. By treating actions as discrete tokens, FACT aligns the motor-control space with the VLM's reasoning space, so both can be co-optimized within a shared gradient space.
- Real-World Success
By using FACT to bridge discrete reasoning and continuous control, GenieReasoner outperforms the discrete baseline (π0-FAST) in instruction following while surpassing continuous models (e.g., π0.5) in task success rate, achieving state-of-the-art aggregate performance.
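As a companion to the MSE comparison above, here is a hedged sketch of how round-trip reconstruction fidelity can be measured, reusing the hypothetical `decoder(x_t, t, zq)` interface from the FACT snippet and plain Euler integration of the learned flow.

```python
import torch

@torch.no_grad()
def reconstruct(decoder, zq, shape, steps=10):
    """Integrate the learned velocity field from noise (t=0) to actions (t=1)."""
    x = torch.randn(shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        x = x + dt * decoder(x, t, zq)      # Euler step along predicted velocity
    return x

@torch.no_grad()
def reconstruction_mse(decoder, zq, actions):
    """Round-trip error between ground-truth actions and decoded trajectories."""
    recon = reconstruct(decoder, zq, actions.shape)
    return ((recon - actions) ** 2).mean().item()
```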

Language Following

Task Success
ERIQ: A Large-Scale Benchmark for Embodied Reasoning
Design Motivation
We posit that advancing embodied reasoning is pivotal for achieving generalization and robustness in unstructured environments. Unlike traditional simulation-based VLA evaluations, ERIQ is designed to decouple and quantify task-specific reasoning capabilities, measuring the abstract cognition essential for generalization. This embodied intelligence evaluation suite lets us rigorously assess the abstract reasoning proficiency of pretrained models, making multi-stage training easier to optimize and turning iterative VLA development into a more controllable, guided process.
Design Principles
ERIQ employs a standardized Visual Question Answering (VQA) protocol (multiple-choice or binary) to ensure deterministic, rule-based evaluation, eliminating the ambiguity of open-ended generation metrics; a minimal scoring sketch appears below.
The framework assesses four pillars of embodied intelligence:
- Spatial Perception and Grounding
- Error Detection and Recovery
- Planning and Monitoring
- Human-Robot Collaboration
Spanning 15 fine-grained sub-tasks and over 100 scenarios, it tests multi-modal reasoning across static, sequential, and interleaved image-text contexts.
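Since every ERIQ item is multiple-choice or binary, evaluation reduces to exact option matching. Below is a minimal hypothetical scorer in that spirit; the field names (`response`, `answer`) and parsing rules are assumptions, not the released benchmark schema.

```python
import re

# Hypothetical rule-based scorer (field names are illustrative, not the
# released ERIQ schema): deterministic accuracy over option letters or yes/no.

def extract_choice(response):
    m = re.search(r"\b([A-D])\b", response)  # first standalone option letter
    if m:
        return m.group(1)
    m = re.search(r"\b(yes|no)\b", response, re.IGNORECASE)
    return m.group(1).upper() if m else None

def accuracy(items):
    """items: [{'response': model output, 'answer': gold 'A'-'D'/'YES'/'NO'}]"""
    hits = sum(extract_choice(it["response"]) == it["answer"].upper()
               for it in items)
    return hits / max(len(items), 1)

# e.g. accuracy([{"response": "The answer is B.", "answer": "B"}]) == 1.0
```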

ERIQ Benchmark Overview
Scene Distribution
- Home: 35%
- Restaurant: 20%
- Supermarket: 20%
- Industrial: 15%
- Office: 10%
Vision Source
- Static: 52.9%
- Sequential: 26.6%
- Interleaved: 20.6%
15 Diagnostic Tasks (question count per task)
- Scene Understanding: 967
- Success Detection: 800
- Action Understanding: 600
- Subtask Planning: 600
- Trajectory Understanding: 505
- Task Grounding: 486
- Dualview Matching: 445
- Mistake Existence: 334
- Task Progress: 281
- Relative Pos. Grounding: 249
- Human-Robot Interaction: 227
- Human Intention: 215
- Mistake Classify: 116
- Fine-grained Planning: 112
- Mistake Recovery: 105
Closing Statement
We have fully open-sourced the ERIQ benchmark to provide a reproducible, measurable technical foundation for the embodied-AI community. We sincerely invite developers and researchers to use the benchmark, share feedback on real-world edge cases, and help refine the metrics for embodied reasoning, accelerating progress toward general-purpose embodied intelligence.

