The Apple Peeling Milestone: How Sharpa’s "MoDE-VLA" Unlocks Bimanual Dexterity

Written by P.A.
The setup for the 'ultimate task': SharpaNorth utilizes its 63 degrees of freedom and bimanual coordination to tackle the complex, contact-rich challenge of apple peeling.

Peeling an apple is a trivial morning task for a human, but for a robot, it represents a "final boss" of bimanual coordination. It requires one hand to maintain a stable grip while rotating the fruit, and the other to guide a blade with precise force—all while processing constant tactile feedback to prevent the apple from slipping or the blade from digging too deep.

This week, Sharpa Robotics released research detailing how it has bridged this gap. By combining a new AI architecture called MoDE-VLA with a shared-autonomy "copilot" system, the company has demonstrated what it claims is the first autonomous dual-dexterous-hand apple peeling sequence. The breakthrough moves the needle for Vision-Language-Action (VLA) models, which have historically been limited to simple "pick-and-place" tasks using low-degree-of-freedom grippers.

Video: "The Peeling of an Apple"

Solving the Data Bottleneck with "IMCopilot"

The primary hurdle in training robots for human-like manipulation is data. While humans can easily teleoperate a simple robotic claw, controlling the 63 degrees of freedom (DoF) of Sharpa’s SharpaNorth robot—which features two SharpaWave hands—is cognitively overwhelming even for expert operators.

To solve this, Sharpa introduced IMCopilot (In-hand Manipulation Copilot). During the data collection phase, the system operates in a shared-autonomy mode: a human operator uses an exoskeleton to control the robot's "gross" arm movements, but delegates "fine" in-hand rotation to the AI via a foot pedal.
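In control terms, IMCopilot's handoff resembles an authority-blending loop: the operator always owns the arms, while the pedal gates how much of the hand command comes from the policy. Below is a minimal Python sketch of that idea, assuming a simple joint-space command structure; the class names, field layout, and ramp constant are illustrative assumptions, not Sharpa's actual interface.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Command:
    """Joint-space command for one control tick (field names illustrative)."""
    arm: np.ndarray   # gross arm joints, driven by the exoskeleton
    hand: np.ndarray  # fine in-hand joints, where the AI can take over

class SharedAutonomyMixer:
    """Blend operator and policy commands once per control tick.

    The operator always owns the gross arm motion. While the foot pedal
    is held, authority over the in-hand joints ramps toward the AI
    policy; releasing the pedal ramps it back. Ramping rather than
    hard-switching avoids a step change in the hand joints at handover.
    """

    def __init__(self, ramp: float = 0.05):
        self.alpha = 0.0  # 0 = operator authority, 1 = policy authority
        self.ramp = ramp  # maximum authority change per tick

    def step(self, exo: Command, policy: Command, pedal: bool) -> Command:
        target = 1.0 if pedal else 0.0
        self.alpha += float(np.clip(target - self.alpha, -self.ramp, self.ramp))
        hand = (1 - self.alpha) * exo.hand + self.alpha * policy.hand
        return Command(arm=exo.arm, hand=hand)
```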

Bridging the skill gap: The system integrates exoskeleton teleoperation (a) with immersive VR feedback (b) to collect high-fidelity data for the SharpaNorth platform (c).

This hybrid approach allowed Sharpa to collect high-fidelity demonstrations that would be impossible to gather via traditional teleoperation alone. It marks a significant evolution since Sharpa first began shipping the hardware last year, shifting the focus from raw mechanical specs to the software intelligence required to drive it.

MoDE-VLA: A Mixture of Specialists

Once the data is collected, the robot is governed by MoDE-VLA (Mixture-of-Dexterous-Experts VLA). Standard VLA models often struggle when force and tactile data are simply "tacked on" to visual inputs, as these modalities have different temporal scales and physical meanings.

Sharpa’s solution is a dedicated "pathway" for touch. The architecture uses:

  • Sparse MoE Routing: A team of "specialist" neural networks that activate depending on the task phase—such as a "contact-onset" expert for the moment the blade touches the skin.
  • Residual Injection: Contact-aware corrections are "injected" into the robot’s movements without overwriting the general-purpose knowledge the model gained during pretraining (see the sketch after this list).
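To make those two mechanisms concrete, here is a minimal PyTorch-style sketch of a sparse top-k router whose expert outputs are added as a bounded residual to a frozen base policy's action. The dimensions, expert architecture, and tanh bound are assumptions for illustration; the actual MoDE-VLA implementation may differ substantially.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TactileMoEHead(nn.Module):
    """Sparse expert routing with residual injection (illustrative sketch).

    A router reads fused visuo-tactile features and activates only the
    top-k phase specialists (e.g., a "contact-onset" expert); their
    weighted output is added as a bounded correction to the frozen base
    policy's action, adjusting pretrained behavior rather than overwriting it.
    """

    def __init__(self, feat_dim=512, act_dim=63, n_experts=8, top_k=2, scale=0.1):
        super().__init__()
        self.router = nn.Linear(feat_dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, 256), nn.GELU(), nn.Linear(256, act_dim))
            for _ in range(n_experts)
        )
        self.top_k, self.scale = top_k, scale

    def forward(self, feats: torch.Tensor, base_action: torch.Tensor) -> torch.Tensor:
        logits = self.router(feats)                    # (B, n_experts)
        topv, topi = logits.topk(self.top_k, dim=-1)   # only k experts run
        gates = F.softmax(topv, dim=-1)                # renormalize selected gates
        corrections = []
        for b in range(feats.size(0)):                 # per-sample dispatch
            delta = sum(gates[b, s] * self.experts[topi[b, s].item()](feats[b])
                        for s in range(self.top_k))
            corrections.append(delta)
        # Residual injection: a small, bounded delta on the pretrained action.
        return base_action + self.scale * torch.tanh(torch.stack(corrections))
```

The useful property of this pattern is that a zero correction recovers the pretrained policy exactly, so the specialists refine behavior during contact-rich phases instead of replacing what pretraining learned.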

This allows the robot to utilize the "feel by seeing" capabilities of the mass-produced SharpaWave hand, which uses internal cameras to detect minute fingertip deformations.
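As a rough illustration of the vision-based tactile principle, the sketch below compares the current in-finger camera frame against an unloaded reference using dense optical flow and treats the mean displacement as a contact-intensity proxy. This pipeline is an assumption for illustration; the article does not describe SharpaWave's actual deformation-to-force calibration.

```python
import cv2
import numpy as np

def contact_signal(ref_frame: np.ndarray, cur_frame: np.ndarray) -> float:
    """Crude "feel by seeing" estimate from an in-finger camera.

    Compares the current fingertip image against an unloaded reference
    via dense optical flow; the mean displacement magnitude serves as a
    proxy for contact intensity. A real sensor would apply a calibrated
    deformation-to-force model on top of a signal like this.
    """
    ref = cv2.cvtColor(ref_frame, cv2.COLOR_BGR2GRAY)
    cur = cv2.cvtColor(cur_frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(
        ref, cur, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    return float(np.linalg.norm(flow, axis=2).mean())
```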

Performance Gains

In testing across four contact-rich tasks—apple peeling, tube rearranging, gear assembly, and charger plugging—MoDE-VLA demonstrated a 34% average success rate, more than doubling the performance of the base model.

Task                Baseline Success (π₀)    MoDE-VLA Success
Apple Peeling       0%                       30%
Gear Assembly       40%                      60%
Tube Rearranging    15%                      30%
Charger Plugging    5%                       15%
Precision in-hand manipulation: MoDE-VLA allows the robot to coordinate tactile-guided apple rotation with the left hand while the right hand executes a vision-guided peeling stroke.

While a 30% success rate on apple peeling leaves room for improvement, the "Peel Completion Ratio" reached 73%, suggesting the robot is capable of sustained, complex sequences even when it doesn't reach the finish line every time.

The Road Ahead

Sharpa’s research (arXiv:2603.08122) suggests that one future of humanoid robotics lies in this hierarchy: high-level "planning" handled by large vision-language models, and low-level "reflexes" handled by reactive, RL-trained experts.
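A minimal sketch of that two-rate hierarchy, assuming a slow planner callable and a fast reactive expert (both are stand-ins for the models the paper describes, not a published API):

```python
import numpy as np

class HierarchicalController:
    """Two-rate hierarchy: slow "planning", fast "reflexes" (illustrative).

    A large vision-language model replans every few dozen ticks,
    emitting a subgoal; a reactive, RL-trained expert turns the latest
    subgoal plus the current observation into joint commands at full
    control rate.
    """

    def __init__(self, planner, expert, replan_every: int = 50):
        self.planner = planner          # obs -> subgoal (slow, e.g. 1-2 Hz)
        self.expert = expert            # (subgoal, obs) -> action (fast)
        self.replan_every = replan_every
        self.subgoal = None
        self.tick = 0

    def step(self, obs) -> np.ndarray:
        if self.subgoal is None or self.tick % self.replan_every == 0:
            self.subgoal = self.planner(obs)   # slow deliberation
        self.tick += 1
        return self.expert(self.subgoal, obs)  # fast reflex
```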

For those looking to see the hardware in person, Sharpa will be showcasing the SharpaNorth system at NVIDIA GTC Booth #1838, Hall 3. As the industry moves closer to deploying humanoids in domestic environments, the ability to handle delicate, slippery, and irregular objects like fruit remains a critical benchmark.
