
[2] Data Engineering: Why and How to Convert LeRobot (Parquet/MP4) to HDF5

Published
4 min read

In my previous post, I explained why the JEPA architecture is such a promising lead for robotics. But between Yann LeCun's theory and the first loss.backward(), there is a massive wall: the data.

For my POC on the Koch arm (SO-ARM101), I’m using the LeRobot ecosystem. It’s a goldmine of data, but its default storage format isn't built for the intensive training cycles required by World Models. Here is why I had to build a technical "bridge" to the HDF5 format.

🔗 Bridge repository

1. The Format War: Storage vs. Training

| Feature | LeRobot Format (Parquet + MP4) | HDF5 Format (LeWM Optimized) |
| --- | --- | --- |
| Ideal Use | Lightweight distribution and archiving. | Intensive training (GPU datasets). |
| Pros | Highly compressed, easy to visualize, HF standard. | Ultra-fast random access to frames and actions. |
| Cons | Heavy CPU video decoding for every batch, potential desync. | Large file size, less "standard" for sharing. |

The Problem: Training a World Model (JEPA) requires sampling random time windows across thousands of episodes. Attempting a seek in an MP4 file for every single frame in a batch of 256 is performance suicide. HDF5 allows us to treat the dataset as one massive tensor living on the disk.


2. The "Little Mac" Challenge: Optimize or Die

Fetching datasets via the lerobot library is a breeze. The real challenge began during conversion. My Mac, with its limited resources, suffered several Kernel Panics before I got it right.

To succeed without saturating RAM, I had to implement a "Lean & Mean" pipeline:

The "Low-Memory" Conversion Strategy

  • Linear Pipeline: I abandoned aggressive parallelism. We process one episode at a time, one camera at a time. It’s slower, but it’s predictable.

  • Micro-Batching (64 frames): Instead of loading an entire episode, we decode and write to the HDF5 in small chunks.

  • Video Streaming: Using iterators (PyAV/OpenCV) so that a full video never materializes in RAM.

  • LZF Compression: The perfect compromise. It's ultra-fast, CPU-light, and significantly reduces the final file weight.

Safety and Reliability (Diagnostic Mode)

Because a 2-hour conversion crashing at 99% is unacceptable, I integrated several safeguards:

  • Pre-validation: We check the integrity of metadata (episodes, frames, flags) before touching the video files.

  • Watchdog & Heartbeat: If the script shows no progress for 120 seconds, it fails fast rather than burning CPU for nothing.

  • RAM Estimation: The script calculates and displays the estimated memory footprint of the batch before starting.
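The watchdog idea is simple enough to fit in a few lines. This is a minimal sketch, not the bridge's actual implementation; only the 120-second timeout comes from above:

```python
# Hypothetical fail-fast watchdog: the conversion loop calls heartbeat()
# after each micro-batch, and a background thread aborts the process if
# the loop goes silent for longer than the timeout.
import os
import threading
import time

class Watchdog:
    def __init__(self, timeout_s=120.0):
        self.timeout_s = timeout_s
        self._last_beat = time.monotonic()
        self._stop = threading.Event()
        threading.Thread(target=self._watch, daemon=True).start()

    def heartbeat(self):
        self._last_beat = time.monotonic()

    def stop(self):
        self._stop.set()

    def _watch(self):
        # Poll once a second until stop() is called.
        while not self._stop.wait(1.0):
            if time.monotonic() - self._last_beat > self.timeout_s:
                print(f"No progress for {self.timeout_s:.0f}s, aborting.")
                os._exit(1)  # fail fast: don't leave a zombie conversion
```

Failing hard with `os._exit` is deliberate: a hung conversion that looks alive is worse than a dead one.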

3. Alignment for LeWM (World Model)

The LeWM model is demanding. Conversion isn't enough; we need adaptation:

  • Resize to 224×224: The standard input size for modern vision backbones. Resizing is done on-the-fly during conversion.

  • Key Normalization: LeRobot names columns one way, while LeWM expects another (e.g., pixels, action, state, done). My bridge handles the translation automatically.
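The key translation boils down to a lookup table. The LeRobot-side names below are examples (they vary per dataset and camera), so treat this as a sketch of the idea rather than the bridge's exact mapping:

```python
# Hypothetical key-translation table from LeRobot column names to the
# LeWM convention. Source-side names are illustrative; real LeRobot
# datasets use per-camera keys like "observation.images.<camera>".
KEY_MAP = {
    "observation.images.main": "pixels",
    "observation.state": "state",
    "action": "action",
    "next.done": "done",
}

def normalize_keys(frame: dict) -> dict:
    """Rename a LeRobot frame's keys to the LeWM names, dropping extras."""
    return {dst: frame[src] for src, dst in KEY_MAP.items() if src in frame}
```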


4. The Golden Rule: No "Dirty" Data

In low-cost robotics, datasets are often imperfect: truncated episodes or missing done flags are common.

My Policy: dirty_episode_policy=fail

On small datasets, the model is extremely sensitive to overfitting. Introducing inconsistent trajectories or ill-defined episode endings condemns the model to learn nonsense. I would rather have a converter that refuses to work than one that produces toxic data.
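In spirit, the `fail` policy is just a validator that raises instead of skipping. The metadata fields below are hypothetical names I'm using for illustration; the real bridge checks its own schema:

```python
# Hypothetical episode validator behind dirty_episode_policy="fail":
# raise on any inconsistency instead of silently converting it.
# The metadata field names are illustrative assumptions.
def validate_episode(meta: dict) -> None:
    idx = meta.get("index")
    if meta.get("num_frames", 0) <= 0:
        raise ValueError(f"episode {idx}: empty episode")
    if meta.get("num_frames") != meta.get("expected_frames"):
        raise ValueError(f"episode {idx}: truncated episode")
    if not meta.get("has_done_flag", False):
        raise ValueError(f"episode {idx}: missing done flag")
```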


Tanguy's Advice

If you attempt this: watch your HDF5 chunks. A mismatch between your chunk size and your conversion micro-batches can turn your hard drive into a massive bottleneck. Yes, I learned this the hard way—remember, I only have a little Mac!
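To show what "watch your chunks" means in practice, here is a tiny sketch (layout and names are made up): when the chunk length matches the micro-batch size, every flush writes whole chunks; when it doesn't, each write straddles chunk boundaries and HDF5 must read-modify-write the partial chunks at both ends.

```python
# Rough illustration of chunk/batch alignment, on a hypothetical
# (frames, dim) layout. Only the chunks= argument differs.
import h5py

MICRO_BATCH = 64

def make_stream_dataset(f, name, chunk_frames, dim=8):
    return f.create_dataset(
        name, shape=(0, dim), maxshape=(None, dim),
        chunks=(chunk_frames, dim), dtype="f4", compression="lzf",
    )

with h5py.File("chunks.h5", "w", driver="core", backing_store=False) as f:
    # Misaligned: every 64-frame flush straddles 48-frame chunk boundaries,
    # forcing read-modify-write cycles on partial chunks.
    slow = make_stream_dataset(f, "misaligned", chunk_frames=48)
    # Aligned: one micro-batch == one chunk, so each flush is a clean write.
    fast = make_stream_dataset(f, "aligned", chunk_frames=MICRO_BATCH)
```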

Next Step: We launch training on a RunPod instance with an RTX 4090 and see if our latent space survives physical reality.