BitRobot and Hugging Face Drop HIW-500: A Massive 10TB Real-Home Humanoid Dataset

- BitRobot Network has released HIW-500 (Humanoids-in-the-Wild 500), the largest open-source humanoid teleoperation dataset collected in real household environments.
- Developed in partnership with Hugging Face and Unitree Robotics, the dataset captures over 500 hours of whole-body motion, 23,000 episodes, and 10+ TB of raw data.
- Hugging Face's LeRobot team re-encoded the dataset, compressing it from 10 TB to 2 TB with zero loss of fidelity to facilitate streaming and policy training.
- The data was gathered in 12 homes across Southeast Asia using the Unitree G1 humanoid platform, targeting whole-body mobile manipulation rather than simple locomotion.
- The release represents a major step toward addressing the "generalization bottleneck" and reaching the "80/80" target for unfamiliar real-world environments.
While humanoid robots have made immense leaps in dynamic locomotion, teaching them to reliably navigate and manipulate objects inside unstructured human homes remains one of the most stubborn bottlenecks in physical AI. Most robotics datasets are gathered in highly sterile, controlled laboratory environments. Real homes, by contrast, are messy, unpredictable, and full of chaotic edge cases.
To bridge this gap, BitRobot Network, in partnership with Hugging Face and Unitree Robotics, has officially released HIW-500 (Humanoids-in-the-Wild 500). It stands as the largest open-source humanoid teleoperation dataset collected in real residential settings, providing researchers with a massive, foundational repository for training imitation and Vision-Language-Action (VLA) models.

Training for the Messy Real World
The sheer scale of HIW-500 is designed to address the industry's critical data deficit. Collected across 12 distinct homes in Southeast Asia, the dataset spans more than 500 hours of footage, encompassing 23,000 individual episodes and over 10 terabytes of raw data.
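To put those headline figures in perspective, the short Python sketch below derives a few averages from them. It treats the quoted numbers as exact (the article gives them as lower bounds) and assumes decimal terabytes (1 TB = 1,000 GB). The function and key names are illustrative and not part of any released tooling.

```python
# Headline figures quoted for HIW-500, treated here as exact values.
TOTAL_HOURS = 500
TOTAL_EPISODES = 23_000
RAW_TERABYTES = 10
NUM_HOMES = 12


def scale_summary(hours=TOTAL_HOURS, episodes=TOTAL_EPISODES,
                  terabytes=RAW_TERABYTES, homes=NUM_HOMES):
    """Derive simple averages from the headline dataset figures."""
    return {
        # Average length of one teleoperated episode, in seconds.
        "mean_episode_seconds": hours * 3600 / episodes,
        # Average recording time per home (a mean, not a reported split).
        "mean_hours_per_home": hours / homes,
        # Raw data volume per recorded hour, assuming 1 TB = 1,000 GB.
        "raw_gb_per_hour": terabytes * 1000 / hours,
    }


if __name__ == "__main__":
    for key, value in scale_summary().items():
        print(f"{key}: {value:.1f}")
```

Under these assumptions the mean episode lasts a little over 78 seconds, and each recorded hour corresponds to roughly 20 GB of raw data.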
The project targets long-horizon household tasks, including navigating rooms, manipulating diverse objects, and executing multi-step activities, with some individual episodes lasting more than eight minutes. According to BitRobot, the dataset covers more than 10 core household tasks, each with thousands of demonstrations and detailed sub-task annotations. Annotating at both the task and sub-task level lets researchers train and evaluate AI models at different levels of granularity.
The release targets the "lack of generalization" that industry leaders have flagged as the primary barrier to consumer robotics. Unitree CEO Wang Xingxing previously defined embodied AI's "ChatGPT moment" as hitting an "80/80" target: an 80% task completion rate across 80% of unfamiliar, real-world scenes. By moving data collection out of the lab and into real homes, HIW-500 represents a coordinated push toward that target.
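One way to read the 80/80 target is as a two-threshold test: a scene counts as passed if its task completion rate reaches 80%, and the target is met if at least 80% of the evaluated scenes pass. The sketch below encodes that reading. It is an interpretation for illustration, not an official benchmark definition, and the function name, scene labels, and success rates are made up.

```python
def meets_80_80(success_by_scene, task_threshold=0.8, scene_threshold=0.8):
    """Check one possible reading of the 80/80 target.

    success_by_scene maps a scene label to its task completion rate (0..1).
    Returns (target_met, fraction_of_scenes_passing).
    """
    if not success_by_scene:
        raise ValueError("need at least one evaluated scene")
    # A scene passes if its completion rate reaches the task threshold.
    passing = sum(1 for rate in success_by_scene.values()
                  if rate >= task_threshold)
    coverage = passing / len(success_by_scene)
    return coverage >= scene_threshold, coverage


if __name__ == "__main__":
    # Hypothetical per-scene completion rates for an unseen-home evaluation.
    example = {
        "kitchen_a": 0.90,
        "kitchen_b": 0.85,
        "hallway": 0.80,
        "bedroom": 0.95,
        "laundry": 0.40,
    }
    met, coverage = meets_80_80(example)
    print(f"target met: {met}, scenes passing: {coverage:.0%}")
```

Under this reading, a policy that completes 79% of tasks in every scene fails the test despite a respectable average, which is why the two thresholds are tracked separately.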

The Hardware and Teleoperation Stack
To collect this high-fidelity data, the team deployed a standardized fleet of Unitree G1 humanoid robots. The G1, which has rapidly become a favorite budget hardware target for global research labs due to its highly competitive sub-$30,000 enterprise pricing, was configured with a specific sensor array:
- Head Vision: A stereo head camera capturing RGB data at 480p and 30 FPS.
- Wrist Vision: Infrared (IR) stereo wrist cameras on both arms capturing RGB + IR data at 480p and 30 FPS to mitigate visual occlusion during manipulation.
- Kinematics & State: Full robot state and action logging across 29 degrees of freedom (DoF), alongside onboard IMU and odometry tracking.
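The sensor list above can be summarized as a small configuration sketch. The class, the stream names, and the assumption that state and action are each 29-element vectors are illustrative choices made here; the article does not describe the released schema at this level of detail.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

NUM_DOF = 29  # degrees of freedom logged for state and action


@dataclass(frozen=True)
class CameraStream:
    name: str                    # hypothetical stream label
    modalities: Tuple[str, ...]  # image types captured by this stream
    resolution: str = "480p"
    fps: int = 30


# The three camera streams described in the article.
STREAMS = (
    CameraStream("head_stereo", ("rgb",)),
    CameraStream("left_wrist_stereo", ("rgb", "ir")),
    CameraStream("right_wrist_stereo", ("rgb", "ir")),
)


def frames_per_stream(hours: float, fps: int = 30) -> int:
    """Number of timesteps one stream records over the given duration."""
    return int(hours * 3600 * fps)


def check_step(state: Sequence[float], action: Sequence[float]) -> None:
    """Raise if a logged step does not match the assumed 29-DoF layout."""
    for label, vec in (("state", state), ("action", action)):
        if len(vec) != NUM_DOF:
            raise ValueError(
                f"{label} has {len(vec)} values, expected {NUM_DOF}")


if __name__ == "__main__":
    check_step([0.0] * NUM_DOF, [0.0] * NUM_DOF)
    print(frames_per_stream(500))
```

At 30 FPS, 500 hours works out to 54 million timesteps per camera stream, which gives a sense of how much image data a training pipeline has to decode.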
Teleoperating a 29-DoF humanoid robot through delicate bimanual tasks is notoriously difficult. Gathering thousands of clean, whole-body trajectories inside narrow residential hallways and kitchens required months of coordinated effort and relied heavily on Unitree's hardware support to keep the robots running through the wear and tear of real-world deployment.

The LeRobot Compression Breakthrough
While a 10 TB dataset is an invaluable research asset, its size poses a major infrastructure hurdle for smaller academic labs that want to stream the data and train models on it. To resolve this, Hugging Face's LeRobot team re-encoded the entire dataset into the open-source LeRobot format.
By re-encoding the data into the LeRobot format, the team compressed the dataset from roughly 10 TB to roughly 2 TB with zero loss of fidelity. The trajectories, camera feeds, and annotations remain identical, but the much smaller storage footprint makes the dataset far easier to stream, manage, and ingest into deep learning pipelines.
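The practical effect of the smaller footprint is easy to quantify. The sketch below computes the compression ratio and an idealized transfer time for both versions. The 1 Gbit/s link speed is an arbitrary example, decimal terabytes are assumed, and protocol overhead is ignored.

```python
def compression_summary(raw_tb: float = 10.0, encoded_tb: float = 2.0):
    """Return (compression ratio, fraction of storage saved)."""
    ratio = raw_tb / encoded_tb
    saved_fraction = 1 - encoded_tb / raw_tb
    return ratio, saved_fraction


def transfer_hours(size_tb: float, gbit_per_s: float) -> float:
    """Idealized download time in hours at a sustained link speed."""
    bits = size_tb * 1e12 * 8           # decimal terabytes to bits
    seconds = bits / (gbit_per_s * 1e9)
    return seconds / 3600


if __name__ == "__main__":
    ratio, saved = compression_summary()
    print(f"ratio: {ratio:.1f}x, storage saved: {saved:.0%}")
    for label, size in (("raw", 10.0), ("LeRobot", 2.0)):
        hours = transfer_hours(size, gbit_per_s=1.0)
        print(f"{label}: {hours:.1f} h at 1 Gbit/s")
```

With the article's round numbers, that is a 5x reduction, or 80% less storage. On a sustained 1 Gbit/s connection the download drops from about 22 hours to about 4.4 hours.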
This partnership highlights Hugging Face's accelerating momentum in physical AI. Over the past year, the platform has systematically built out an open-source robotics ecosystem, ranging from the sub-$3,000 HOPEJr humanoid baseline to the recently unveiled LeRobot Humanoid 3D-printed bipedal platform. By hosting HIW-500, Hugging Face adds the crucial third pillar to its ecosystem: large-scale, high-quality training data.

Open-Source Accessibility
The full dataset has been made publicly available on Hugging Face in both native ROSbag and compressed LeRobot formats. Furthermore, developers can explore the dataset directly within their web browsers using the LeRobot Visualizer, which provides a live 3D render of the robot synced with the camera feeds, language instructions, and subtask annotations.
Crucially for the developer community, the Unitree G1 is natively supported within the LeRobot library. This means roboticists can load HIW-500 and begin training behavioral cloning or end-to-end VLA policies for deployment on their own G1 hardware, lowering the barrier to entry for real-world home automation research.




