
[2] Data Engineering: Why and How to Convert LeRobot (Parquet/MP4) to HDF5

Published
4 min read

In my previous post, I explained why the JEPA architecture is such a promising lead for robotics. But between Yann LeCun's theory and the first loss.backward(), there is a massive wall: the data.

For my POC on the Koch arm (SO-ARM101), I’m using the LeRobot ecosystem. It’s a goldmine of data, but its default storage format isn't built for the intensive training cycles required by World Models. Here is why I had to build a technical "bridge" to the HDF5 format.

🔗 Bridge repository

1. The Format War: Storage vs. Training

| Feature | LeRobot Format (Parquet + MP4) | HDF5 Format (LeWM Optimized) |
| --- | --- | --- |
| Ideal Use | Lightweight distribution and archiving. | Intensive training (GPU datasets). |
| Pros | Highly compressed, easy to visualize, HF standard. | Ultra-fast random access to frames and actions. |
| Cons | Heavy CPU video decoding for every batch, potential desync. | Large file size, less "standard" for sharing. |

The Problem: Training a World Model (JEPA) requires sampling random time windows across thousands of episodes. Attempting a seek in an MP4 file for every single frame in a batch of 256 is performance suicide. HDF5 allows us to treat the dataset as one massive tensor living on the disk.


2. The "Little Mac" Challenge: Optimize or Die

Fetching datasets via the lerobot library is a breeze. The real challenge began during conversion. My Mac, with its limited resources, suffered several Kernel Panics before I got it right.

To succeed without saturating RAM, I had to implement a "Lean & Mean" pipeline:

The "Low-Memory" Conversion Strategy

  • Linear Pipeline: I abandoned aggressive parallelism. We process one episode at a time, one camera at a time. It’s slower, but it’s predictable.

  • Micro-Batching (64 frames): Instead of loading an entire episode, we decode and write to the HDF5 in small chunks.

  • Video Streaming: Using iterators (PyAV/OpenCV) so that a full video never materializes in RAM.

  • LZF Compression: The perfect compromise. It's ultra-fast, CPU-light, and significantly reduces the final file weight.

Safety and Reliability (Diagnostic Mode)

Because a 2-hour conversion crashing at 99% is unacceptable, I integrated several safeguards:

  • Pre-validation: We check the integrity of metadata (episodes, frames, flags) before touching the video files.

  • Watchdog & Heartbeat: If the script shows no progress for 120 seconds, it fails fast rather than burning CPU for nothing.

  • RAM Estimation: The script calculates and displays the estimated memory footprint of the batch before starting.
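The watchdog idea is simple enough to fit in a few lines. This is a minimal sketch, not the bridge's actual implementation; only the 120-second timeout comes from above:

```python
# Hypothetical fail-fast watchdog: the conversion loop calls heartbeat()
# after each micro-batch, and a background thread aborts the process if
# the loop goes silent for longer than the timeout.
import os
import threading
import time

class Watchdog:
    def __init__(self, timeout_s=120.0):
        self.timeout_s = timeout_s
        self._last_beat = time.monotonic()
        self._stop = threading.Event()
        threading.Thread(target=self._watch, daemon=True).start()

    def heartbeat(self):
        self._last_beat = time.monotonic()

    def stop(self):
        self._stop.set()

    def _watch(self):
        # Poll once a second until stop() is called.
        while not self._stop.wait(1.0):
            if time.monotonic() - self._last_beat > self.timeout_s:
                print(f"No progress for {self.timeout_s:.0f}s, aborting.")
                os._exit(1)  # fail fast: don't leave a zombie conversion
```

Failing hard with `os._exit` is deliberate: a hung conversion that looks alive is worse than a dead one.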

3. Alignment for LeWM (World Model)

The LeWM model is demanding. Conversion isn't enough; we need adaptation:

  • Resize to 224×224: The standard input size for modern vision backbones. Resizing is done on-the-fly during conversion.

  • Key Normalization: LeRobot names columns one way, while LeWM expects another (e.g., pixels, action, state, done). My bridge handles the translation automatically.
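The key translation boils down to a lookup table. The LeRobot-side names below are examples (they vary per dataset and camera), so treat this as a sketch of the idea rather than the bridge's exact mapping:

```python
# Hypothetical key-translation table from LeRobot column names to the
# LeWM convention. Source-side names are illustrative; real LeRobot
# datasets use per-camera keys like "observation.images.<camera>".
KEY_MAP = {
    "observation.images.main": "pixels",
    "observation.state": "state",
    "action": "action",
    "next.done": "done",
}

def normalize_keys(frame: dict) -> dict:
    """Rename a LeRobot frame's keys to the LeWM names, dropping extras."""
    return {dst: frame[src] for src, dst in KEY_MAP.items() if src in frame}
```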


4. The Golden Rule: No "Dirty" Data

In low-cost robotics, datasets are often imperfect: truncated episodes or missing done flags are common.

My Policy: dirty_episode_policy=fail

On small datasets, the model is extremely sensitive to overfitting. Introducing inconsistent trajectories or ill-defined episode endings condemns the model to learn nonsense. I would rather have a converter that refuses to work than one that produces toxic data.
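In spirit, the `fail` policy is just a validator that raises instead of skipping. The metadata fields below are hypothetical names I'm using for illustration; the real bridge checks its own schema:

```python
# Hypothetical episode validator behind dirty_episode_policy="fail":
# raise on any inconsistency instead of silently converting it.
# The metadata field names are illustrative assumptions.
def validate_episode(meta: dict) -> None:
    idx = meta.get("index")
    if meta.get("num_frames", 0) <= 0:
        raise ValueError(f"episode {idx}: empty episode")
    if meta.get("num_frames") != meta.get("expected_frames"):
        raise ValueError(f"episode {idx}: truncated episode")
    if not meta.get("has_done_flag", False):
        raise ValueError(f"episode {idx}: missing done flag")
```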


Tanguy's Advice

If you attempt this: watch your HDF5 chunks. A mismatch between your chunk size and your conversion micro-batches can turn your hard drive into a massive bottleneck. Yes, I learned this the hard way—remember, I only have a little Mac!
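To show what "watch your chunks" means in practice, here is a tiny sketch (layout and names are made up): when the chunk length matches the micro-batch size, every flush writes whole chunks; when it doesn't, each write straddles chunk boundaries and HDF5 must read-modify-write the partial chunks at both ends.

```python
# Rough illustration of chunk/batch alignment, on a hypothetical
# (frames, dim) layout. Only the chunks= argument differs.
import h5py

MICRO_BATCH = 64

def make_stream_dataset(f, name, chunk_frames, dim=8):
    return f.create_dataset(
        name, shape=(0, dim), maxshape=(None, dim),
        chunks=(chunk_frames, dim), dtype="f4", compression="lzf",
    )

with h5py.File("chunks.h5", "w", driver="core", backing_store=False) as f:
    # Misaligned: every 64-frame flush straddles 48-frame chunk boundaries,
    # forcing read-modify-write cycles on partial chunks.
    slow = make_stream_dataset(f, "misaligned", chunk_frames=48)
    # Aligned: one micro-batch == one chunk, so each flush is a clean write.
    fast = make_stream_dataset(f, "aligned", chunk_frames=MICRO_BATCH)
```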

Next Step: We launch training on a RunPod instance with an RTX 4090 and see if our latent space survives physical reality.