[1] Rethinking Robotics: Why I’m Betting on JEPA over VLAs
![[1] Rethinking Robotics: Why I’m Betting on JEPA over VLAs](/_next/image?url=https%3A%2F%2Fcdn.hashnode.com%2Fuploads%2Fcovers%2F69df4f9b74b22138755e755f%2F2cad8c2b-ca91-4b64-be98-421b6c3c64bf.png&w=3840&q=75)
To test my research intuitions, I’m documenting my work on a JEPA (Joint-Embedding Predictive Architecture) world model, starting with low-cost robotics datasets like LeRobot and the Koch SO-ARM101. This series follows the journey from raw pixels to physical understanding.
1. The Current Deadlock: The Limits of VLA (Vision-Language-Action)
Today, state-of-the-art robotics relies heavily on VLA models. The concept is straightforward: give the robot a text instruction and an image, and it generates an action. However, this approach hits two major walls:
The "Weight" Problem: Integrating a Large Language Model (LLM) into the control loop makes the resulting system extremely heavy. We end up with models boasting billions of parameters just to decide whether to close a 2 cm gripper.
Physical Hallucination: Language is discrete and symbolic; physics is continuous and unforgiving. By relying on LLM-style architectures, VLAs are prone to hallucinations: the robot "thinks" it has successfully grabbed an object because the statistical probability of the next word is high, even as the object slips away.
"Even when not 'complete,' VLAs rely on massive language model backbones. They inherit the bloat and probabilistic nature of LLMs, whereas JEPA offers a compact architecture dedicated solely to world dynamics."
2. What is the JEPA Architecture?
The JEPA (Joint-Embedding Predictive Architecture), championed by Yann LeCun, proposes a shift away from the "reconstruction" paradigm and toward "understanding."
The Fundamental Intuition
Instead of predicting pixels—like a generative model wasting energy trying to reconstruct the exact reflection of a light bulb on a table—JEPA predicts abstract representations.
The model isn't trying to draw the future; it’s trying to comprehend it..
Architecture inspired by "LeWorldModel: Stable End-to-End JEPA from Pixels" (Maes et al., 2024).
This schema is licensed under CC-BY 4.0. You are free to share and adapt it for any purpose, even commercially, as long as you keep the watermark or credit the original article.
1. Encoding Images into Latent Vectors
The process starts with a shared Encoder (e.g., a ViT-Tiny). Its job is to condense a high-dimensional 224×224 image into a compact [B, 128] latent vector. We encode both the initial observation at Step t and the "target" observation at Step t+k. This compressed representation is what we call the Latent Space.
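As a rough sketch of this encoding step, here is a minimal ViT-style encoder in PyTorch that maps a batch of 224×224 frames to [B, 128] latents. All layer sizes and names here are my own illustrative assumptions, not the LeWM code:

```python
import torch
import torch.nn as nn

# Hypothetical encoder sketch: patch embedding + a small transformer,
# projecting each 224x224 frame to a 128-d latent vector.
class LatentEncoder(nn.Module):
    def __init__(self, patch=16, dim=192, latent_dim=128, depth=4):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, latent_dim)

    def forward(self, x):                                     # x: [B, 3, 224, 224]
        tokens = self.patchify(x).flatten(2).transpose(1, 2)  # [B, 196, dim]
        tokens = self.blocks(tokens)
        return self.head(tokens.mean(dim=1))                  # [B, 128] latent

enc = LatentEncoder()
z_t = enc(torch.randn(2, 3, 224, 224))
print(z_t.shape)  # torch.Size([2, 128])
```

The same (shared) encoder would be applied to both the Step t and Step t+k observations.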
2. Preventing Model Collapse via Statistical Regularization
Older JEPA architectures often suffered from "model collapse," where the encoder would output the same constant value for every image to artificially minimize the loss. To prevent this, LeWorldModel (LeWM) introduces SIGReg (Statistical Isotropic Gradient Regularization). This forces the Latent Space to follow an Isotropic Gaussian distribution:
Mean = 0
Variance = 1 in every dimension/direction.
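To make the constraint concrete, here is a simplified stand-in for that regularizer: it only penalizes deviation of the first two moments (per-dimension batch mean from 0, variance from 1), whereas the actual SIGReg test is more sophisticated. Note how a collapsed encoder, which outputs a constant, is heavily penalized:

```python
import torch

def isotropic_gaussian_penalty(z):
    """Simplified stand-in for SIGReg: push each latent dimension toward
    mean 0 and variance 1 across the batch (first two moments only)."""
    mean = z.mean(dim=0)                  # [D] per-dimension batch mean
    var = z.var(dim=0, unbiased=False)    # [D] per-dimension batch variance
    return (mean ** 2).mean() + ((var - 1.0) ** 2).mean()

torch.manual_seed(0)
z = torch.randn(256, 128)            # already ~N(0, 1): penalty near 0
collapsed = torch.zeros(256, 128)    # constant output: mean ok, variance 0
print(isotropic_gaussian_penalty(z).item()
      < isotropic_gaussian_penalty(collapsed).item())  # True
```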
3. Predicting the Future
We feed the regularized Latent Space z_t and the Action Tensor (shape [B, k, n]) into a Transformer-based Predictor. The goal is to predict the future state in the latent space:
$$\hat{z}_{t+k}$$
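A minimal sketch of such an action-conditioned predictor follows. The action dimension (n = 6, roughly one value per joint of a low-cost arm) and the layer choices are assumptions for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical predictor sketch: conditions the current latent z_t on a
# window of k actions (each n-dimensional) and predicts z_{t+k}.
class LatentPredictor(nn.Module):
    def __init__(self, latent_dim=128, action_dim=6, dim=128, depth=2):
        super().__init__()
        self.z_proj = nn.Linear(latent_dim, dim)
        self.a_proj = nn.Linear(action_dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, latent_dim)

    def forward(self, z_t, actions):      # z_t: [B, 128], actions: [B, k, n]
        seq = torch.cat([self.z_proj(z_t).unsqueeze(1),
                         self.a_proj(actions)], dim=1)   # [B, 1+k, dim]
        seq = self.blocks(seq)
        return self.head(seq[:, 0])       # predicted z_{t+k}: [B, 128]

pred = LatentPredictor()
z_hat = pred(torch.randn(2, 128), torch.randn(2, 8, 6))
print(z_hat.shape)  # torch.Size([2, 128])
```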
4. The Optimization Challenge
The training involves minimizing two competing losses:
Consistency Loss (PredLoss): Ensures the predicted latent matches the real encoded future.
SIGReg Loss: Prevents collapse by enforcing the Gaussian distribution.
The core challenge lies in this tension: To achieve good predictions, the encoder wants to "group" similar physical states together. However, SIGReg constantly tries to "spread" the points out to satisfy the statistical constraint. This "struggle" is exactly what defines the geometry of our model's internal world.
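One training step combining the two losses can be sketched as a weighted sum; `encoder` and `predictor` stand for the modules discussed above, `lambda_reg` is an assumed weighting hyperparameter, and the regularizer is again a first-two-moments stand-in for SIGReg:

```python
import torch
import torch.nn.functional as F

def training_step(encoder, predictor, obs_t, obs_tk, actions, lambda_reg=0.1):
    z_t = encoder(obs_t)                   # [B, 128]
    z_tk = encoder(obs_tk)                 # [B, 128] target latent
    z_hat = predictor(z_t, actions)        # [B, 128] predicted latent

    pred_loss = F.mse_loss(z_hat, z_tk)    # consistency: prediction vs. reality
    # simplified SIGReg stand-in applied to both latents
    reg = sum((z.mean(0) ** 2).mean() + ((z.var(0, unbiased=False) - 1) ** 2).mean()
              for z in (z_t, z_tk))
    return pred_loss + lambda_reg * reg
```

The gradient of `pred_loss` pulls similar states together while the gradient of the regularizer spreads them apart; the equilibrium between the two is what shapes the latent geometry.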
3. Critical Analysis
Advantages
Efficiency: With only 15M parameters, LeWM outperforms giant models in planning speed (up to 48x faster).
Intuitive Physics: The model naturally learns concepts like gravity and object permanence.
Disadvantages
Black Box: It is impossible to "see" the prediction without adding a third-party decoder.
Short Horizon: Error accumulation in autoregressive mode limits long-term planning.
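The short-horizon limitation can be illustrated with a toy rollout: feeding each prediction back as the next input means even a tiny fixed per-step bias compounds linearly with the horizon. The "predictor" here is a deliberately trivial stand-in, not the actual model:

```python
import torch

# Autoregressive rollout: the predictor's output becomes the next input,
# so any systematic one-step error accumulates over the horizon.
def rollout(predictor, z0, action_seq):
    z = z0
    trajectory = [z]
    for a in action_seq:          # each step consumes its own prediction
        z = predictor(z, a)
        trajectory.append(z)
    return trajectory

# toy predictor with a small constant bias of 0.01 per step
biased = lambda z, a: z + a + 0.01
traj = rollout(biased, torch.zeros(128), [torch.zeros(128)] * 50)
print(traj[-1].max().item())      # ~0.5 after 50 steps: the bias compounded
```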
4. The GFN > JEPA Intuition: Marrying System 1 & 2
The major hurdle for JEPA in robotics is reaching distant goals. My intuition is to use GFlowNets (GFN) to decompose the task:
JEPA (System 1 - Instinct): Handles local physics and immediate execution within the latent space.
GFN (System 2 - Reason): Explores the diversity of possible trajectories and defines intermediate sub-goals that JEPA can then easily reach.
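The division of labor above can be sketched as a planning loop. Both components are placeholders for this conceptual outline: `propose_subgoals` stands in for a trained GFlowNet sampler, and `reach_locally` for JEPA-based local planning in latent space:

```python
import torch

def plan(z_start, z_goal, propose_subgoals, reach_locally):
    # System 2 (GFN): sample a diverse chain of latent waypoints
    subgoals = propose_subgoals(z_start, z_goal)
    z = z_start
    actions = []
    # System 1 (JEPA): solve each short hop with local latent dynamics
    for sg in subgoals + [z_goal]:
        actions += reach_locally(z, sg)
        z = sg
    return actions
```

The key property is that JEPA never has to plan past its reliable horizon: the GFN keeps every hop short.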
My Roadmap: From Signal to Sense
To test this intuition, my work is divided into two major experimental phases:
Phase 1: World Model Stress-Test
I will begin by training a JEPA (based on the LeWM architecture) on specific Low-Cost robotics datasets (LeRobot ecosystem, Koch SO-ARM101 arm). The goal is to study the "scaling laws" of the representation:
Data Density: What is the impact of increasing the number of episodes?
Multi-View: Does the model become more stable by synchronizing multiple angles (e.g., laptop camera + mobile camera) for the same task?
Generalization: How does the model react to non-robotic videos or different robotic arms?
Phase 2: GFN-JEPA Alignment
If the latent space proves robust, I will tackle the "bridge" to deliberate planning. We will seek to define the limits of GFlowNets and align their workspace with JEPA’s. The end goal: a user provides a textual command, and the GFN translates it into a sequence of latent sub-goals for the JEPA to execute.
Follow the Progress
This project is an Open-Science exploration. I will publish results, failures, and breakthroughs in this series.
A Note from the Author
This log is as much about the "how" as the "why." If you're interested in the intersection of latent dynamics and physical embodiment, stay tuned.

