<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Technical Blueprints]]></title><description><![CDATA[Engineering insights on world models, robotics, and complex system diagnostics. Documenting the logic and mechanics behind modern technology.]]></description><link>https://blog.tanguy-pauwels.fr</link><image><url>https://cdn.hashnode.com/uploads/logos/69df4f9b74b22138755e755f/958a4242-7462-479c-b424-504c74b2e30a.png</url><title>Technical Blueprints</title><link>https://blog.tanguy-pauwels.fr</link></image><generator>RSS for Node</generator><lastBuildDate>Tue, 21 Apr 2026 08:19:22 GMT</lastBuildDate><atom:link href="https://blog.tanguy-pauwels.fr/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[[3] Post-Mortem Analysis: Why My First World Model (JEPA) Is "Blind"]]></title><description><![CDATA[The pipeline is ready, the data is converted, and the GPU has completed its first "stress test." Now comes the most critical phase for any research engineer: the autopsy of the latent space.
In this t]]></description><link>https://blog.tanguy-pauwels.fr/why-my-jepa-is-blind</link><guid isPermaLink="true">https://blog.tanguy-pauwels.fr/why-my-jepa-is-blind</guid><category><![CDATA[robotics]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[AI Research]]></category><category><![CDATA[Computer Vision]]></category><category><![CDATA[Jepa]]></category><category><![CDATA[openscience]]></category><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Tanguy Pauwels]]></dc:creator><pubDate>Wed, 15 Apr 2026 12:30:10 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69df4f9b74b22138755e755f/a4b4eeac-f7b5-4d61-8f87-fa232ca80754.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The pipeline is ready, the data is converted, and the GPU has completed its first "stress test." Now comes the most critical phase for any research engineer: <strong>the autopsy of the latent space.</strong></p>
<p>In this third installment of our series, we dive into the results of our first training run on the <strong>Koch SO-ARM101</strong> using a JEPA (Joint-Embedding Predictive Architecture). With a limited dataset of only 12 episodes, we weren't expecting a miracle, but rather a clear diagnostic.</p>
<p>Is the model learning the laws of physics or just memorizing pixels? Does our latent space show signs of "collapse," or is it ready to support a high-level planner like a GFlowNet? By analyzing <strong>Latent MSE</strong>, <strong>PCA projections</strong>, <strong>t-SNE clustering</strong>, and <strong>Linear Probing</strong>, we will map the boundaries of our model's "internal world" and define the roadmap for our next scaling phase.</p>
<details>
<summary>Setup detail</summary>
<ul><li><p><strong>Data:</strong> 12 episodes / 9,000 rows</p></li><li><p>Dataset used: <a target="_blank" rel="noopener noreferrer nofollow" class="text-primary underline underline-offset-2 hover:text-primary/80 cursor-pointer" href="https://huggingface.co/datasets/Tpauwels/lerobot-hdf5-koch_pick_place_1_lego">https://huggingface.co/datasets/Tpauwels/lerobot-hdf5-koch_pick_place_1_lego</a></p></li><li><p>Checkpoint used: <a target="_blank" rel="noopener noreferrer nofollow" class="text-primary underline underline-offset-2 hover:text-primary/80 cursor-pointer" href="https://huggingface.co/Tpauwels/lewm-koch">https://huggingface.co/Tpauwels/lewm-koch</a></p><details class="editor-details"><summary class="details-summary">Hardware:</summary><div data-type="details-content" class="details-content"><ul><li><p>GPU: RTX 4090</p></li><li><p>vCPU: 16 (AMD EPYC 75F3 32-Core Processor)</p></li><li><p>Memory: 62 GB</p></li><li><p>Container Disk: 50 GB</p></li></ul></div></details><details class="editor-details"><summary class="details-summary"><strong>Training parameters</strong></summary><div data-type="details-content" class="details-content"><ul><li><p>Batch Size: 128</p></li><li><p>Num Workers: 6</p></li><li><p>Frame Skip: 5</p></li><li><p>SIGReg Weight: 0.09</p></li><li><p>history_size: 3</p></li><li><p>num_preds: 1</p></li><li><p>max_epoch: 100 (actual: 45)</p></li><li><p>precision: bf16</p></li><li><p>train_split: 0.9</p></li><li><p>seed: 3072</p></li></ul></div></details></li></ul>
</details>

<p>The training was performed on <strong>RunPod</strong> using the following template:<br /><a href="https://console.runpod.io/deploy?template=f83357qr5r&amp;ref=7x06vrca">https://console.runpod.io/deploy?template=f83357qr5r&amp;ref=7x06vrca</a><br />(Note: This is not a commercial collaboration or a paid partnership).</p>
<h2>Data analysis</h2>
<h3><strong>Fig 1: Predictive Drift — Stability or Inertia?</strong></h3>
<img src="https://cdn.hashnode.com/uploads/covers/69df4f9b74b22138755e755f/d05713cd-95a7-4da3-9310-d0cc9e17028b.png" alt="Line chart - Latent prediction Error by Step" style="display:block;margin:0 auto" />

<p>This graph measures the error between what the model "imagines" (prediction) and what it "sees" (real encoding) as the prediction horizon extends further into the future.</p>
<ul>
<li><p><strong>X-Axis</strong> <code>(step_idx)</code><strong>:</strong> Represents the number of "steps" into the future. At <strong>Step 0</strong>, the model starts from a ground-truth image. By <strong>Step 100</strong>, it has generated 100 successive states autoregressively, relying solely on its own previous latent predictions and the provided action vectors.</p>
</li>
<li><p><strong>Theory vs. Reality:</strong> In a standard predictive architecture, the <code>MSE</code> (Mean Squared Error) should increase linearly or even exponentially. This is due to the inevitable accumulation of temporal errors—a phenomenon to which JEPA models are particularly sensitive.</p>
</li>
</ul>
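<p>The measurement itself is cheap to sketch. Below is a minimal numpy toy (the helper and the linear "world" are illustrative stand-ins, not the LeWM evaluation code): starting from a ground-truth latent, we roll the predictor forward on its own outputs and score the MSE against the real encodings at each step.</p>

```python
import numpy as np

def rollout_latent_mse(z_true, actions, predictor):
    """Autoregressive rollout: start from the ground-truth latent at step 0,
    then feed the predictor its own outputs, scoring MSE against the real
    encodings at every step (illustrative helper, not the LeWM API)."""
    z_hat = z_true[0]
    errors = []
    for t in range(1, len(z_true)):
        z_hat = predictor(z_hat, actions[t - 1])      # prediction feeds on itself
        errors.append(float(np.mean((z_hat - z_true[t]) ** 2)))
    return np.array(errors)

# Toy world: latents drift linearly with the action, and the predictor knows
# the rule exactly, so the error curve stays flat, like the first phase of Fig 1.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 8)) * 0.01                    # toy dynamics matrix
actions = rng.normal(size=(99, 4))
z_true = np.zeros((100, 8))
for t in range(1, 100):
    z_true[t] = z_true[t - 1] + actions[t - 1] @ A

mse = rollout_latent_mse(z_true, actions, lambda z, a: z + a @ A)
```

<p>Note that in the toy the predictor knows the true dynamics, so the curve is flat by construction: this is exactly why a flat curve alone proves nothing about what was learned.</p>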
<h4><strong>Analysis of the Two Phases:</strong></h4>
<ol>
<li><p><strong>The Stagnation Phase (0 to 400 steps):</strong> The error remains abnormally low and flat. This is the sign of a "lazy" model. With a limited dataset of only 12 episodes, the JEPA has learned that the static background and noise are the easiest signals to predict. By predicting almost no change, it minimizes its Loss without actually learning the arm's physics. We are witnessing <strong>background overfitting</strong>.</p>
</li>
<li><p><strong>The Rupture Phase (After 400 steps):</strong> We see a sudden explosion in <code>MSE</code>. This occurs when the model exceeds the average sequence length of its training data. Having no more known structure to cling to, its "imaginary world" completely collapses.</p>
</li>
</ol>
<blockquote>
<p><strong>The Diagnostic:</strong> The apparent stability in the first phase is a statistical illusion. The predictor and encoder are "consistent" only because they have both agreed to ignore the dynamics of the robot in favor of a static world.</p>
</blockquote>
<hr />
<h3><strong>Fig 2: Consistency Analysis — Latent Neighbors Topology</strong></h3>
<img src="https://cdn.hashnode.com/uploads/covers/69df4f9b74b22138755e755f/edd19a48-5ddc-471c-8609-4cd2624da734.png" alt="Scatter Chart - Nearest Neighbors in PCA Space" style="display:block;margin:0 auto" />

<p>Visualizing a <strong>PCA</strong> (Principal Component Analysis) on a model using the <strong>SIGReg</strong> regularizer may seem paradoxical. Where PCA seeks to isolate the axes of maximum variance, SIGReg aims to uniformize it into a Gaussian distribution (isotropy).</p>
<p>However, it is precisely this "conflict" that makes this visualization valuable: it allows us to observe the <strong>struggle between the model's dynamics and its statistical constraints</strong>.</p>
<h4><strong>The Visual Grammar of Latent Space</strong></h4>
<p>To interpret this graph, we use three "health" indicators:</p>
<ul>
<li><p><strong>The Perfect Circle:</strong> The regularization has won. The model has "killed" the physical signal in favor of Gaussian statistics.</p>
</li>
<li><p><strong>Filaments or Loops:</strong> The physics is "resisting." Episode trajectories are forcing the space to structure itself despite the constraint.</p>
</li>
<li><p><strong>Clustering (Point or Line):</strong> A sign of total <strong>collapse</strong>. The encoder is no longer producing useful information.</p>
</li>
</ul>
<h4><strong>Observations from Run 1</strong></h4>
<p>On this graph, we observe:</p>
<ol>
<li><p><strong>Local Consistency:</strong> The neighbors (blue points) are tightly grouped around the anchors (red crosses). This indicates that the model respects a certain mathematical continuity.</p>
</li>
<li><p><strong>"Spider Web" Structure:</strong> The distribution is messy but filamentous. This demonstrates that a physical structure is attempting to emerge from the constraints imposed by SIGReg, even if it hasn't yet formed clear semantic clusters (e.g., "closed gripper" vs. "open gripper").</p>
</li>
</ol>
<h4><strong>Limitations of the Current Analysis</strong></h4>
<p>Technical honesty is required here: my current evaluation script does not yet visually isolate the frames selected by the <strong>Nearest Neighbor (NN)</strong>. While mathematical proximity is evident, I cannot yet state with absolute certainty that two neighbors share identical visual characteristics. This is a key improvement planned for the next diagnostic pipeline.</p>
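<p>For reference, the mechanics behind this figure are easy to reproduce. Here is a numpy-only sketch (the helper names are ours; the actual evaluation script may differ) that projects latents onto their top two principal components and pulls the nearest neighbors of an anchor:</p>

```python
import numpy as np

def pca_project(latents, k=2):
    """Top-k principal components via SVD (numpy-only stand-in for a PCA library)."""
    centered = latents - latents.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

def nearest_neighbors(latents, anchor_idx, n=5):
    """Indices of the n closest latents to an anchor, excluding the anchor itself."""
    dist = np.linalg.norm(latents - latents[anchor_idx], axis=1)
    dist[anchor_idx] = np.inf
    return np.argsort(dist)[:n]

rng = np.random.default_rng(0)
latents = rng.normal(size=(500, 128))   # stand-in for [N, 128] encoder outputs
coords = pca_project(latents)           # the 2D points of the scatter plot
nn_idx = nearest_neighbors(latents, anchor_idx=0)
```

<p>The missing piece, as noted above, is fetching the source frames behind <code>nn_idx</code> to check visual similarity by eye.</p>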
<hr />
<h3><strong>Fig 3: Global Latent Mapping (PCA Projection)</strong></h3>
<img src="https://cdn.hashnode.com/uploads/covers/69df4f9b74b22138755e755f/9b6ccc06-7d4a-4fa3-98cf-7692fcb1481a.png" alt="Scatter chart - PCA Encore and Predicted Latents colored by episode" style="display:block;margin:0 auto" />

<p>In this visualization, we use the same PCA projection but color the points according to their <strong>source episode</strong>. This allows us to track how the model separates individual demonstrations.</p>
<h4><strong>1. Sequential Victory: Filaments over Clouds</strong></h4>
<p>We observe that each episode forms a semi-independent "filament" rather than an exploded cloud of points. This is a significant victory for the encoder: it demonstrates a clear understanding that state <code>t+1</code> is closely related to state <code>t</code>. The model has successfully captured the <strong>sequential structure</strong> of the video.</p>
<h4><strong>2. Chaotic Entanglement = Absence of Semantics</strong></h4>
<p>However, this is where the "Post-Mortem" gets gritty. In a model that truly "understands" the task (a generalist model), we shouldn't see isolated filaments by episode. Instead, we should see these filaments break apart to form <strong>thematic nodes</strong>:</p>
<ul>
<li><p>All "rest positions" (starting points) should cluster in the same area.</p>
</li>
<li><p>All "grasping phases" should form a compact cluster, regardless of which episode they come from.</p>
</li>
</ul>
<p>Currently, we have chaos because:</p>
<ul>
<li><p>The model identifies each episode by its <strong>unique noise</strong> (slight lighting differences, imperceptibly different starting positions).</p>
</li>
<li><p>It fails to extract the <strong>invariant</strong> (the act of grasping) to merge these trajectories into a shared latent meaning.</p>
</li>
</ul>
<h4><strong>3. SIGReg and the "Tangled Cables" Effect</strong></h4>
<p>Without <strong>SIGReg</strong>, these filaments would likely be drifting far apart in an immense, unorganized space. SIGReg forces them all into the same "small box" (the Gaussian distribution).</p>
<p>The result? They all end up in the same mathematical coordinate space, but because they lack semantic ties, they cross and intertwine without any physical logic. It is the latent equivalent of a <strong>drawer full of tangled cables</strong>.</p>
<hr />
<h3><strong>Fig 4: Linear Probing R² Scores – The Latent-to-Physics Bridge</strong></h3>
<img src="https://cdn.hashnode.com/uploads/covers/69df4f9b74b22138755e755f/e347d0d1-ea63-4ee4-9344-2296381e71eb.png" alt="Bar Chart - Linear Probe R2 by Signal" style="display:block;margin:0 auto" />

<p>If the PCA was our "X-ray," the <strong>Linear Probing</strong> is our "lie detector." It measures exactly how much physical information is actually accessible within the latent space for a controller or a planner.</p>
<p>We perform a simple linear regression from the latent state to predict real-world values. A high score indicates high <strong>linear availability</strong>; a low score means the information is either missing or "twisted" in a way that a simple system cannot use.</p>
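<p>Concretely, the probe is just a least-squares fit from latents to a physical signal. A minimal sketch, assuming a plain linear fit scored on the training data (a real probe would typically use ridge regression and a held-out split):</p>

```python
import numpy as np

def linear_probe_r2(z, y):
    """Least-squares linear map z -> y, scored by R^2 on the fit
    (minimal stand-in; not the exact evaluation pipeline)."""
    X = np.hstack([z, np.ones((len(z), 1))])   # append a bias column
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ w) ** 2)
    ss_tot = np.sum((y - y.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 128))                    # latent states
w_true = rng.normal(size=(128, 6))                  # hidden linear "physics"
joints = z @ w_true + rng.normal(scale=5.0, size=(1000, 6))  # noisy joint angles

r2 = linear_probe_r2(z, joints)   # high here, because the signal is linearly available
```

<p>With a multi-dimensional target (six joints here), this returns a single pooled R², which is how a per-signal bar like "Proprioception" is summarized.</p>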
<h4><strong>1. Component Breakdown:</strong></h4>
<ul>
<li><p><strong>Action:</strong> This is the highest score. The model captures "intent"—the correlation between the visual context and the command sent. It confirms that the link between images and actions is the loudest signal, yet it remains significantly underdeveloped.</p>
<p>$$R^2 \approx 0.35$$</p>
</li>
<li><p><strong>Proprioception:</strong> This is the weak link. The model struggles to locate the arm’s joints in space. A score of 0.30 means it is effectively "blind" to 70% of the variance in the actual joint positions. For a robot, this is the equivalent of driving with your eyes closed, trying to guess the steering wheel's position by touch alone.</p>
<p>$$R^2 \approx 0.30$$</p>
</li>
<li><p><strong>State:</strong> The "State" (a combination of image and proprioception) stagnates. This confirms the encoder’s failure to extract a stable world representation from a dataset of only 12 repetitive episodes.</p>
<p>$$R^2 \approx 0.30$$</p>
</li>
</ul>
<h4><strong>2. The "Glass Ceiling" Effect</strong></h4>
<p>Seeing all scores clustered around the 0.3 mark reveals a fundamental <strong>semantic resolution problem</strong>:</p>
<ol>
<li><p><strong>Non-Linearity:</strong> Since the probing is linear, a low score suggests the information might be present but is stored non-linearly. It is "trapped" in a format the model cannot easily translate into action.</p>
</li>
<li><p><strong>The Predictor Paradox:</strong> As noted in our Epoch 10 analysis, the predictor often outperforms the encoder. This proves the architecture is <em>trying</em> to learn the physics, but the encoder isn't receiving enough visual diversity to "calibrate" its vision.</p>
</li>
</ol>
<blockquote>
<p><strong>The Verdict:</strong> A global score of <strong>0.31</strong> is an admission of powerlessness for any high-level planner (like a GFlowNet). A system cannot make rational decisions if its perception of reality is 70% wrong. This graph mathematically validates our intuition: the model has learned the "noise" of the trajectory, but it remains blind to the "precision" of the gesture.</p>
</blockquote>
<hr />
<h3><strong>Fig 5: Non-Linear Clustering Analysis (t-SNE)</strong></h3>
<img src="https://cdn.hashnode.com/uploads/covers/69df4f9b74b22138755e755f/0083baba-40f0-4a4c-8c69-c8b8d3de8d3d.png" alt="Scatter Chart - t-SNE Latents colored by encode/pred" style="display:block;margin:0 auto" />

<p>While PCA looks for global variance, <strong>t-SNE</strong> focuses on local structures and manifold geometry. This visualization is where the model's struggle with generalization becomes most apparent.</p>
<h4><strong>What is t-SNE?</strong></h4>
<p>Before diving into the visualization, let’s define the tool. <strong>t-Distributed Stochastic Neighbor Embedding (t-SNE)</strong> is a non-linear dimensionality reduction technique specifically designed to visualize high-dimensional data.</p>
<p>Unlike PCA, which is linear and focuses on preserving the global spread (variance) of the data, t-SNE is "obsessed" with <strong>local structure</strong>. It works by converting the distances between points in the high-dimensional latent space into probabilities of being neighbors. It then tries to recreate those same neighborhood relationships in a 2D map.</p>
<p>In short: if two points are close on a t-SNE plot, they are very likely to be "neighbors" in the model's mind. In robotics, we use it to see if the model naturally groups similar physical states (like "closed grip") into distinct clusters.</p>
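<p>The "neighbor probability" step, the heart of t-SNE, fits in a few lines. This sketch uses a fixed bandwidth instead of the per-point perplexity search real t-SNE performs (for actual plots, use a library implementation such as <code>sklearn.manifold.TSNE</code>), and shows how two well-separated toy "episodes" keep nearly all of their neighbor mass internal:</p>

```python
import numpy as np

def neighbor_probabilities(latents, sigma=1.0):
    """First step of t-SNE: convert pairwise distances into per-point
    'probability of being neighbors'. Fixed bandwidth for readability;
    real t-SNE tunes sigma per point to match a target perplexity."""
    d2 = np.sum((latents[:, None] - latents[None, :]) ** 2, axis=-1)
    p = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(p, 0.0)                    # a point is not its own neighbor
    return p / p.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
episode_a = rng.normal(size=(20, 8))            # two toy "episodes", far apart,
episode_b = rng.normal(size=(20, 8)) + 10.0     # like the filaments in the figure
p = neighbor_probabilities(np.vstack([episode_a, episode_b]))
```

<p>Because cross-episode distances dwarf within-episode ones, each point's probability mass stays inside its own episode, which is precisely how separate filaments emerge on the 2D map.</p>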
<h4><strong>1. The "Spaghetti" Syndrome: Episode Overfitting</strong></h4>
<p>The most striking feature of this graph is the presence of distinct, colorful "curves" or filaments. Each curve represents a single episode from the training set.</p>
<ul>
<li><strong>The Diagnostic:</strong> Instead of grouping similar physical states together (e.g., all "gripper closing" moments in one cluster), the model has isolated each episode into its own trajectory. This is a clear sign of <strong>overfitting</strong>: the JEPA has memorized the specific "flavor" of each recording rather than extracting the universal laws of the robot's physics.</li>
</ul>
<h4><strong>2. Latent Stability: Encoder vs. Predictor</strong></h4>
<p>There is, however, a silver lining. Notice how closely the <strong>Encoder</strong> points (blue) and <strong>Predictor</strong> points (orange) overlap within each filament.</p>
<ul>
<li><strong>The Good News:</strong> The internal "dialogue" between the encoder and the predictor is highly stable. The predictor has successfully learned to mimic the encoder's latent transitions. Within its own "hallucinated" world, the model is consistent; it’s just that this world is currently a collection of 12 separate stories rather than one unified physical reality.</li>
</ul>
<h4><strong>3. Implications for GFlowNet Planning</strong></h4>
<p>For a <strong>GFlowNet</strong> to function as a System 2 (reasoning) layer, it needs an environment where it can jump between states based on logic, not just follow a pre-recorded path. The current "blurred" boundaries and dispersed clusters confirm that the latent space is not yet "navigable."</p>
<blockquote>
<p><strong>Conclusion:</strong> The model is an excellent "historian" of its 12 episodes, but a poor "physicist." To break these filaments and force the model to merge these trajectories into meaningful semantic clusters, we need to flood the system with the <strong>Massive Diversity</strong> planned for Run 3.</p>
</blockquote>
<hr />
<h3><strong>Nota Bene: The Multi-View Experiment</strong></h3>
<p>A second training run was conducted by doubling the dataset size through a multi-view approach (using the same episodes but captured from two different camera angles).</p>
<p>After a thorough analysis, I decided not to dedicate a full article to this second run. The results were strictly identical, exhibiting the same patterns and limitations as the first run. Adding another camera angle on the same 12 episodes did not provide the "physical diversity" the model craved.</p>
<h4><strong>Key Takeaway: The 10-Epoch Convergence</strong></h4>
<p>Interestingly, I compared the metrics at <strong>Epoch 10</strong> versus <strong>Epoch 40</strong> during this second run. The results were slightly better at Epoch 10. This confirms the findings of the original <em>LeWorldModel</em> (LeWM) paper: this architecture converges extremely fast, typically within 10 epochs.</p>
<h4><strong>The Pivot for Run 3</strong></h4>
<p>Moving forward, we will stop "wasting" compute on long training cycles. For all future experiments, we will limit training to <strong>10 epochs</strong>. Instead of depth, we will focus on <strong>breadth</strong>: drastically increasing the number of episodes by building a massive, merged dataset from various heterogeneous sources.</p>
]]></content:encoded></item><item><title><![CDATA[[2] Data Engineering: Why and How to Convert LeRobot (Parquet/MP4) to HDF5]]></title><description><![CDATA[In my previous post, I explained why the JEPA architecture is such a promising lead for robotics. But between Yann LeCun’s theory and the first \(loss.backward()\), there is a massive wall: the data.
]]></description><link>https://blog.tanguy-pauwels.fr/from-mp4-to-hdf5</link><guid isPermaLink="true">https://blog.tanguy-pauwels.fr/from-mp4-to-hdf5</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[Python]]></category><category><![CDATA[HDF5]]></category><category><![CDATA[robotics]]></category><category><![CDATA[lerobot]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Tanguy Pauwels]]></dc:creator><pubDate>Wed, 15 Apr 2026 09:38:59 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69df4f9b74b22138755e755f/10d091b6-683c-4293-93d1-2eac404d553d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In my previous post, I explained why the <strong>JEPA</strong> architecture is such a promising lead for robotics. But between Yann LeCun’s theory and the first <code>\(loss.backward()\)</code>, there is a massive wall: <strong>the data.</strong></p>
<p>For my POC on the <strong>Koch arm (SO-ARM101)</strong>, I’m using the <strong>LeRobot</strong> ecosystem. It’s a goldmine of data, but its default storage format isn't built for the intensive training cycles required by World Models. Here is why I had to build a technical "bridge" to the <strong>HDF5</strong> format.</p>
<p>🔗 <a href="https://github.com/tanguy-pauwels/lerobot-dataset-to-HDF5">Bridge repository</a></p>
<h2>1. The Format War: Storage vs. Training</h2>
<table style="min-width:75px"><colgroup><col style="min-width:25px"></col><col style="min-width:25px"></col><col style="min-width:25px"></col></colgroup><tbody><tr><td><p><strong>Feature</strong></p></td><td><p><strong>LeRobot Format (Parquet + MP4)</strong></p></td><td><p><strong>HDF5 Format (LeWM Optimized)</strong></p></td></tr><tr><td><p><strong>Ideal Use</strong></p></td><td><p>Lightweight distribution and archiving.</p></td><td><p>Intensive Training (GPU Datasets).</p></td></tr><tr><td><p><strong>Pros</strong></p></td><td><p>Highly compressed, easy to visualize, HF standard.</p></td><td><p>Ultra-fast <strong>Random Access</strong> to frames and actions.</p></td></tr><tr><td><p><strong>Cons</strong></p></td><td><p>Heavy CPU video decoding for every batch, potential desync.</p></td><td><p>Large file size, less "standard" for sharing.</p></td></tr></tbody></table>

<p><strong>The Problem</strong>: Training a World Model (JEPA) requires sampling random time windows across thousands of episodes. Attempting a <code>seek</code> in an MP4 file for every single frame in a batch of 256 is performance suicide. HDF5 allows us to treat the dataset as one massive tensor living on the disk.</p>
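<p>To make the contrast concrete, here is an h5py sketch of the access pattern in question (the tiny file built here is a stand-in for a converted dataset, using the <code>pixels</code>/<code>action</code> key naming LeWM expects): a random time window becomes a pure disk slice, with no video container to open and nothing to decode.</p>

```python
import os
import tempfile

import h5py
import numpy as np

# A tiny stand-in file; a converted dataset has the same layout, just bigger.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("pixels", data=np.zeros((1000, 8, 8, 3), dtype=np.uint8))
    f.create_dataset("action", data=np.zeros((1000, 6), dtype=np.float32))

def sample_window(f, length=4, rng=np.random.default_rng(0)):
    """A random contiguous time window is just a slice: HDF5 reads
    only the requested frames from disk, no seek-and-decode dance."""
    start = int(rng.integers(0, f["pixels"].shape[0] - length))
    return f["pixels"][start:start + length], f["action"][start:start + length]

with h5py.File(path, "r") as f:
    frames, actions = sample_window(f)
```

<p>Multiply this by a batch of 256 windows per step and the difference with per-frame MP4 seeks becomes the whole ballgame.</p>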
<hr />
<h2>2. The "Little Mac" Challenge: Optimize or Die</h2>
<p>Fetching datasets via the <code>lerobot</code> library is a breeze. The real challenge began during conversion. My Mac, with its limited resources, suffered several <strong>Kernel Panics</strong> before I got it right.</p>
<p>To succeed without saturating the RAM, I had to implement a <strong>"Lean &amp; Mean"</strong> pipeline:</p>
<h3>The "Low-Memory" Conversion Strategy</h3>
<ul>
<li><p><strong>Linear Pipeline</strong>: I abandoned aggressive parallelism. We process one episode at a time, one camera at a time. It’s slower, but it’s <strong>predictable</strong>.</p>
</li>
<li><p><strong>Micro-Batching (64 frames)</strong>: Instead of loading an entire episode, we decode and write to the HDF5 in small chunks.</p>
</li>
<li><p><strong>Video Streaming</strong>: Using iterators (PyAV/OpenCV) to ensure a full video never materializes in the RAM.</p>
</li>
<li><p><strong>LZF Compression</strong>: The perfect compromise. It's ultra-fast, CPU-light, and significantly reduces the final file weight.</p>
</li>
</ul>
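<p>Put together, the strategy looks roughly like this (an illustrative h5py sketch; the generator stands in for a PyAV/OpenCV frame iterator, and the names are ours, not the bridge repo's API):</p>

```python
import os
import tempfile

import h5py
import numpy as np

def frame_stream(n_frames, h=224, w=224):
    """Stand-in for a PyAV/OpenCV iterator: yields one decoded frame
    at a time, so the full video never materializes in RAM."""
    for _ in range(n_frames):
        yield np.zeros((h, w, 3), dtype=np.uint8)

def convert_episode(out_path, frames, batch=64):
    """Append frames to a resizable HDF5 dataset in micro-batches of 64,
    with LZF compression and chunks aligned to the micro-batch size."""
    with h5py.File(out_path, "w") as f:
        ds = f.create_dataset(
            "pixels", shape=(0, 224, 224, 3), maxshape=(None, 224, 224, 3),
            dtype=np.uint8, chunks=(batch, 224, 224, 3), compression="lzf")
        buf = []
        for frame in frames:
            buf.append(frame)
            if len(buf) == batch:                 # flush one micro-batch
                ds.resize(ds.shape[0] + batch, axis=0)
                ds[-batch:] = np.stack(buf)
                buf.clear()
        if buf:                                   # flush the leftover tail
            ds.resize(ds.shape[0] + len(buf), axis=0)
            ds[-len(buf):] = np.stack(buf)
        return ds.shape[0]

path = os.path.join(tempfile.mkdtemp(), "episode_000.h5")
n_written = convert_episode(path, frame_stream(150))
```

<p>Note the chunk shape matching the micro-batch: aligned chunks keep each flush a single contiguous write instead of a scatter of partial-chunk updates.</p>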
<h3>Safety and Reliability (Diagnostic Mode)</h3>
<p>Because a 2-hour conversion crashing at 99% is unacceptable, I integrated several safeguards:</p>
<ul>
<li><p><strong>Pre-validation</strong>: We check the integrity of metadata (episodes, frames, flags) <em>before</em> touching the video files.</p>
</li>
<li><p><strong>Watchdog &amp; Heartbeat</strong>: If the script shows no progress for 120 seconds, it fails fast rather than hanging and wasting compute.</p>
</li>
<li><p><strong>RAM Estimation</strong>: The script calculates and displays the estimated memory footprint of the batch before starting.</p>
</li>
</ul>
<h2>3. Alignment for LeWM (World Model)</h2>
<p>The <strong>LeWM</strong> model is demanding. Conversion isn't enough; we need adaptation:</p>
<ul>
<li><p><strong>Resize 224x224</strong>: The standard for modern vision backbones. Resizing is done on-the-fly during conversion.</p>
</li>
<li><p><strong>Key Normalization</strong>: LeRobot names columns one way, while LeWM expects another (e.g., <code>pixels</code>, <code>action</code>, <code>state</code>, <code>done</code>). My bridge handles the translation automatically.</p>
</li>
</ul>
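<p>The translation itself can be as simple as a lookup table. A sketch (the LeRobot column names below are illustrative and vary by dataset version and camera name; only the target keys are fixed by LeWM):</p>

```python
# Hypothetical mapping: actual LeRobot column names depend on the dataset.
KEY_MAP = {
    "observation.images.laptop": "pixels",
    "observation.state": "state",
    "action": "action",
    "next.done": "done",
}

def normalize_keys(row):
    """Translate a LeRobot-style row into the keys LeWM expects,
    silently dropping columns the world model does not use."""
    return {KEY_MAP[k]: v for k, v in row.items() if k in KEY_MAP}

row = {
    "observation.images.laptop": b"<jpeg bytes>",
    "observation.state": [0.0] * 6,
    "action": [0.0] * 6,
    "next.done": False,
    "timestamp": 0.04,              # present in the source data, unused by LeWM
}
normalized = normalize_keys(row)
```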
<hr />
<h2>4. The Golden Rule: No "Dirty" Data</h2>
<p>In low-cost robotics, datasets are often imperfect: truncated episodes or missing <code>done</code> flags are common.</p>
<blockquote>
<p>My Policy: <code>dirty_episode_policy=fail</code></p>
</blockquote>
<p>On small datasets, the model is extremely sensitive to overfitting. Introducing inconsistent trajectories or ill-defined episode endings condemns the model to learn nonsense. I would rather have a converter that refuses to work than one that produces toxic data.</p>
<hr />
<h3>Tanguy's Advice</h3>
<blockquote>
<p>If you attempt this: <strong>watch your HDF5 chunks.</strong> A mismatch between your chunk size and your conversion micro-batches can turn your hard drive into a massive bottleneck. Yes, I learned this the hard way—remember, I only have a little Mac!</p>
</blockquote>
<p><strong>Next Step</strong>: We launch the training on an <strong>RTX 4090</strong> runpod instance and see if our latent space survives physical reality.</p>
]]></content:encoded></item><item><title><![CDATA[[1] Rethinking Robotics: Why I’m Betting on JEPA over VLAs]]></title><description><![CDATA[To test my research intuitions, I’m documenting my work on a JEPA (Joint-Embedding Predictive Architecture) world model, starting with low-cost robotics datasets like LeRobot and the Koch SO-ARM101. T]]></description><link>https://blog.tanguy-pauwels.fr/jepa-vs-vla-rethinking-robotics</link><guid isPermaLink="true">https://blog.tanguy-pauwels.fr/jepa-vs-vla-rethinking-robotics</guid><category><![CDATA[robotics]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[#jepa #worldmodel]]></category><category><![CDATA[openscience]]></category><category><![CDATA[Computer Vision]]></category><category><![CDATA[Jepa]]></category><dc:creator><![CDATA[Tanguy Pauwels]]></dc:creator><pubDate>Wed, 15 Apr 2026 09:19:51 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69df4f9b74b22138755e755f/2cad8c2b-ca91-4b64-be98-421b6c3c64bf.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>To test my research intuitions, I’m documenting my work on a <strong>JEPA</strong> (Joint-Embedding Predictive Architecture) world model, starting with low-cost robotics datasets like <strong>LeRobot</strong> and the <strong>Koch SO-ARM101</strong>. This series follows the journey from raw pixels to physical understanding.</p>
<h2>1. The Current Deadlock: The Limits of VLA (Vision-Language-Action)</h2>
<p>Today, state-of-the-art robotics relies heavily on <strong>VLA models</strong>. The concept is straightforward: give the robot a text instruction and an image, and it generates an action. However, this approach hits two major walls:</p>
<ul>
<li><p><strong>The "Weight" Problem</strong>: Integrating a Large Language Model (LLM) into the control loop makes the embedding extremely heavy. We end up with models boasting billions of parameters just to decide whether to squeeze a 2cm gripper.</p>
</li>
<li><p><strong>Physical Hallucination</strong>: Language is discrete and symbolic; physics is continuous and unforgiving. By relying on LLM-style architectures, VLAs are prone to hallucinations: the robot "thinks" it has successfully grabbed an object because the statistical probability of the next word is high, even as the object slips away.</p>
</li>
</ul>
<p><em>"Even when not 'complete,' VLAs rely on massive language model backbones. They inherit the bloat and probabilistic nature of LLMs, whereas JEPA offers a compact architecture dedicated solely to world dynamics."</em></p>
<h2>2. What is the JEPA Architecture?</h2>
<p>The <strong>JEPA</strong> (Joint-Embedding Predictive Architecture), championed by Yann LeCun, proposes a shift away from the "reconstruction" paradigm and toward "understanding."</p>
<h3>The Fundamental Intuition</h3>
<p>Instead of predicting pixels—like a generative model wasting energy trying to reconstruct the exact reflection of a light bulb on a table—JEPA predicts <strong>abstract representations</strong>.</p>
<blockquote>
<p>The model isn't trying to draw the future; it’s trying to comprehend it.</p>
</blockquote>
<img src="https://cdn.hashnode.com/uploads/covers/69df4f9b74b22138755e755f/9e8c9e04-cb4e-4ac6-ba11-4aad81e7541e.png" alt="Technical schema of JEPA architecture" style="display:block;margin:0 auto" />

<blockquote>
<p><em>Architecture inspired by "LeWorldModel: Stable End-to-End JEPA from Pixels" (Maes et al., 2024).</em></p>
<p>This schema is licensed under <strong>CC-BY 4.0</strong>. You are free to share and adapt it for any purpose, even commercially, as long as you keep the watermark or credit the original article.</p>
</blockquote>
<h3>1. <strong>Encoding Images into Latent Vectors</strong></h3>
<p>The process starts with a shared <strong>Encoder</strong> (e.g., a ViT-Tiny). Its job is to condense a high-dimensional <code>224×224</code> image into a compact <code>[B, 128]</code> <strong>latent vector</strong>. We encode both the initial observation at <code>Step t</code> and the "target" observation at <code>Step t+k</code>. This compressed representation is what we call the <strong>Latent Space</strong>.</p>
<h3><strong>2. Preventing Model Collapse via Statistical Regularization</strong></h3>
<p>Older JEPA architectures often suffered from "model collapse," where the encoder would output the same constant value for every image to artificially minimize the loss. To prevent this, <strong>LeWorldModel (LeWM)</strong> introduces <strong>SIGReg</strong> (Statistical Isotropic Gradient Regularization). This forces the Latent Space to follow an <strong>Isotropic Gaussian distribution</strong>:</p>
<ul>
<li><p><strong>Mean = 0</strong></p>
</li>
<li><p><strong>Variance = 1</strong> in every dimension/direction.</p>
</li>
</ul>
<h3>3. Predicting the Future</h3>
<p>We feed the regularized Latent Space <code>z_t</code> and the <strong>Action Tensor</strong> (shape <code>[B, k, n]</code>) into a <strong>Transformer-based Predictor</strong>. The goal is to predict the future state in the latent space:</p>
<p>$$\hat{z}_{t+k}$$</p>
<h3><strong>4. The Optimization Challenge</strong></h3>
<p>The training involves minimizing two competing losses:</p>
<ol>
<li><p><strong>Consistency Loss (PredLoss):</strong> Ensures the predicted latent matches the real encoded future.</p>
</li>
<li><p><strong>SIGReg Loss:</strong> Prevents collapse by enforcing the Gaussian distribution.</p>
</li>
</ol>
<p><strong>The core challenge lies in this tension:</strong> To achieve good predictions, the encoder wants to "group" similar physical states together. However, SIGReg constantly tries to "spread" the points out to satisfy the statistical constraint. This "struggle" is exactly what defines the geometry of our model's internal world.</p>
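<p>In code, the tension is easy to see. Here is a simplified numpy sketch of the two terms (the real SIGReg is a statistical test on the latent distribution, not the plain mean/variance penalty used below as a readable stand-in; the 0.09 weight matches this series' training setup):</p>

```python
import numpy as np

def pred_loss(z_pred, z_target):
    """Consistency term: predicted latent vs. the real encoded future."""
    return float(np.mean((z_pred - z_target) ** 2))

def sigreg_like_loss(z):
    """Readable stand-in for SIGReg: pull every latent dimension toward
    mean 0 and variance 1 (the paper's actual statistical test differs)."""
    return float(np.mean(z.mean(axis=0) ** 2) + np.mean((z.var(axis=0) - 1.0) ** 2))

rng = np.random.default_rng(0)
z_future = rng.normal(size=(128, 128))            # [B, 128] encoded futures
z_pred = z_future + rng.normal(scale=0.1, size=z_future.shape)

# The two losses compete: prediction wants structure, SIGReg wants isotropy.
total = pred_loss(z_pred, z_future) + 0.09 * sigreg_like_loss(z_future)
```

<p>Note how a collapsed batch (all latents identical) would score the maximum regularization penalty: that is precisely the failure mode this term exists to prevent.</p>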
<h2>3. Critical Analysis</h2>
<h3>Advantages</h3>
<ul>
<li><p><strong>Efficiency</strong>: With only 15M parameters, LeWM outperforms giant models in planning speed (up to 48x faster).</p>
</li>
<li><p><strong>Intuitive Physics</strong>: The model naturally learns concepts like gravity and object permanence.</p>
</li>
</ul>
<h3>Disadvantages</h3>
<ul>
<li><p><strong>Black Box</strong>: It is impossible to "see" the prediction without adding a third-party decoder.</p>
</li>
<li><p><strong>Short Horizon</strong>: Error accumulation in autoregressive mode limits long-term planning.</p>
</li>
</ul>
<h2>4. The GFN &gt; JEPA Intuition: Marrying System 1 &amp; 2</h2>
<p>The major hurdle for JEPA in robotics is reaching distant goals. My intuition is to use <strong>GFlowNets (GFN)</strong> to decompose the task:</p>
<ul>
<li><p><strong>JEPA (System 1 - Instinct)</strong>: Handles local physics and immediate execution within the latent space.</p>
</li>
<li><p><strong>GFN (System 2 - Reason)</strong>: Explores the diversity of possible trajectories and defines intermediate <strong>sub-goals</strong> that JEPA can then easily reach.</p>
</li>
</ul>
<h2>My Roadmap: From Signal to Sense</h2>
<p>To test this intuition, my work is divided into two major experimental phases:</p>
<h3>Phase 1: World Model Stress-Test</h3>
<p>I will begin by training a JEPA (based on the LeWM architecture) on specific Low-Cost robotics datasets (<strong>LeRobot</strong> ecosystem, <strong>Koch SO-ARM101</strong> arm). The goal is to study the "scaling laws" of the representation:</p>
<ul>
<li><p><strong>Data Density</strong>: What is the impact of increasing the number of episodes?</p>
</li>
<li><p><strong>Multi-View</strong>: Does the model become more stable by synchronizing multiple angles (e.g., laptop camera + mobile camera) for the same task?</p>
</li>
<li><p><strong>Generalization</strong>: How does the model react to non-robotic videos or different robotic arms?</p>
</li>
</ul>
<h3>Phase 2: GFN-JEPA Alignment</h3>
<p>If the latent space proves robust, I will tackle the "bridge" to deliberate planning. We will seek to define the limits of GFlowNets and align their workspace with JEPA’s. The end goal: a user provides a textual command, and the GFN translates it into a sequence of latent sub-goals for the JEPA to execute.</p>
<hr />
<h2>Follow the Progress</h2>
<p>This project is an <strong>Open-Science</strong> exploration. I will publish results, failures, and breakthroughs in this series.</p>
<blockquote>
<p>A Note from the Author</p>
<p>This log is as much about the "how" as the "why." If you're interested in the intersection of latent dynamics and physical embodiment, stay tuned.</p>
</blockquote>
]]></content:encoded></item></channel></rss>