Imagination as Policy: 1X Director Daniel Ho on Leveraging the World Model Flywheel

Daniel Ho and NEO. Image: 1X/YouTube

The robotics industry is currently locked in a foundational debate: should the "brain" of a humanoid be built like a chatbot that speaks actions, or like a director that visualizes the future? For Daniel Ho, Director of Evaluations at 1X Technologies, the answer lies in the latter.

Speaking on a recent episode of the Robo Papers podcast, Ho detailed the company’s shift toward World Models as the primary driver for robot intelligence. The core thesis is that while Vision-Language-Action (VLA) models benefit from the semantic knowledge of Large Language Models (LLMs), they often lack an inherent "intuitive physics." By contrast, 1X is betting that a robot that can "imagine" a task through video generation can generalize to the messy, unpredictable reality of a human home far more effectively than one trained on motor-command regression.

Beyond the "VLA Wall"

The current state of the art often relies on VLAs—systems that "graft" an action head onto a pre-trained Vision-Language Model. However, as Ho noted during the discussion, these models typically require tens of thousands of hours of costly, teleoperated robot data to learn even basic tasks. This "data bottleneck" is a common critique, recently echoed by Yann LeCun's AMI Labs, which argues that the field is too "LLM-pilled."

Ho explained that 1X’s approach, the 1X World Model (1XWM), treats robotics as a video-prediction problem. "A world model is trained on video pre-training as the main objective function," Ho said. "It really allows you to zero-shot to new tasks because of the generalizability of video pre-training."

By training on "internet-scale" video, 1X allows its NEO humanoid to leverage the vast amount of human movement data already available online. This mirrors the "GPT-2 moment" described by NVIDIA researchers regarding their own DreamZero world model.

The 900-Hour Bridge

One of the most revealing technical details shared by Ho was the specific "recipe" 1X uses to move from web video to robot action. The pipeline involves three distinct stages:

  1. Web-Scale Pre-training: Learning general visual priors from the internet.
  2. Egocentric Mid-training: 900 hours of first-person human video.
  3. Robot Fine-tuning: Only 70 hours of robot-specific data, primarily focused on "pick and place" tasks.

According to Ho, the 70 hours of robot data act as a "shim" that teaches the model the robot's specific morphology and kinematics. The actual intelligence for complex chores—like steaming a shirt or scrubbing a dish—comes from the mid-training on human video. In ablation studies, 1X found that omitting the egocentric human data caused the model to overfit to simple grasping and fail at more nuanced tasks like scrubbing.
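For a concrete picture of that staged recipe, the sketch below lays it out as a simple training curriculum. The stage names, dataset labels, and the `train_stage` stub are illustrative assumptions rather than 1X's actual code or data mix; only the 900-hour and 70-hour figures come from Ho's description.

```python
# Hypothetical sketch of the three-stage curriculum described above.
# Dataset labels and the train_stage stub are illustrative, not 1X's pipeline.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Stage:
    name: str
    data_source: str
    hours: Optional[float]  # None = effectively unbounded web-scale corpus
    objective: str

CURRICULUM = [
    Stage("pretrain", "web_video", None, "next-frame video prediction"),
    Stage("midtrain", "egocentric_human_video", 900, "next-frame video prediction"),
    Stage("finetune", "teleoperated_robot_episodes", 70, "video prediction + action decoding"),
]

def train_stage(model_state: dict, stage: Stage) -> dict:
    """Placeholder: each stage resumes from the previous checkpoint,
    so robot fine-tuning only has to adapt morphology and kinematics,
    not relearn world dynamics."""
    print(f"[{stage.name}] {stage.hours or 'web-scale'} hrs of {stage.data_source} "
          f"-> objective: {stage.objective}")
    model_state["stages_completed"].append(stage.name)
    return model_state

if __name__ == "__main__":
    state = {"stages_completed": []}
    for stage in CURRICULUM:
        state = train_stage(state, stage)
```

The design point Ho emphasizes is that each stage builds on the previous checkpoint, so the small robot-specific dataset only has to close the morphology gap rather than relearn how the world behaves.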

The World Model as a "Learned Simulator"

Beyond acting as a policy, the 1XWM is being utilized as a "learned simulator" to solve the evaluation problem. Traditionally, testing a new AI "checkpoint" requires running it on a physical robot—a slow and expensive process.

As previously detailed in 1X’s digital twin announcement, the world model can predict whether an action sequence will lead to success or failure with high correlation to real-world results. Ho noted that while a physics engine is better for high-throughput "standard" grasping, the world model excels at "long-tail" scenarios, such as interacting with soft, compliant objects like towels or sponges, which are notoriously difficult to model in classical simulators.

"You always have a limited budget for real-world eval," Ho explained. "We need some higher throughput signals to create the champion models that we go and evaluate every day." This aligns with DeepMind’s strategy of using Genie and SIMA to create "infinite training loops" in simulation.
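As a rough illustration of how a learned simulator can stretch a limited real-world evaluation budget, the sketch below ranks candidate checkpoints by their imagined success rate and forwards only the winner to physical testing. The `WorldModel` interface, the `rollout_success_prob` method, and the `FakeWorldModel` stand-in are hypothetical, not 1X's API.

```python
# Illustrative sketch: use a world model as a "learned simulator" to
# pre-screen policy checkpoints before scarce real-robot evaluation.
import random
from typing import Protocol

class WorldModel(Protocol):
    def rollout_success_prob(self, checkpoint: str, task: str) -> float:
        """Imagine the checkpoint attempting the task and score the outcome."""
        ...

class FakeWorldModel:
    def rollout_success_prob(self, checkpoint: str, task: str) -> float:
        # Stand-in for a generative rollout plus a success classifier.
        return random.random()

def pick_champion(wm: WorldModel, checkpoints: list[str], tasks: list[str]) -> str:
    """Rank candidates by mean imagined success; only the winner goes to the robot."""
    def score(ckpt: str) -> float:
        return sum(wm.rollout_success_prob(ckpt, t) for t in tasks) / len(tasks)
    return max(checkpoints, key=score)

if __name__ == "__main__":
    champion = pick_champion(
        FakeWorldModel(),
        checkpoints=["ckpt_1001", "ckpt_1002", "ckpt_1003"],
        tasks=["fold_towel", "scrub_dish", "pick_and_place"],
    )
    print("Send to real-world eval:", champion)
```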

Reality Check: Latency and Success Rates

Despite the optimism, Ho was candid about the hurdles. The 1XWM currently requires roughly 11 seconds of compute to "imagine" 5 seconds of video (more than twice as slow as real time), making it too slow for the reactive, high-speed control required in dynamic environments.

Furthermore, "zero-shot" does not yet mean "perfect." While 1X showed impressive videos of NEO scrubbing dishes—a task it was never explicitly trained for—the success rate for that specific behavior currently hovers around 20%. Ho views this as a starting point rather than a ceiling, suggesting that the "flywheel" effect will take over as they begin training the model on its own successful autonomous rollouts.

The Path to $20,000 General Labor

The ultimate goal for 1X remains the deployment of its $20,000 NEO android into consumer homes. Ho argued that the humanoid form factor is a deliberate choice to maximize this "video-to-action" transfer. Because NEO is "kinematically congruent" with humans, the world dynamics inherent in human-centric internet video apply almost directly to the robot.

"If it’s possible for us to deploy humanoids today with reasonable success rates... that’s a place where the [system] will only get better," Ho concluded. "It’s a platform which can scale by itself."


Watch the episode on YouTube below.
