[3] Post-Mortem Analysis: Why My First World Model (JEPA) Is "Blind"
![[3] Post-Mortem Analysis: Why My First World Model (JEPA) Is "Blind"](/_next/image?url=https%3A%2F%2Fcdn.hashnode.com%2Fuploads%2Fcovers%2F69df4f9b74b22138755e755f%2Fa4b4eeac-f7b5-4d61-8f87-fa232ca80754.png&w=3840&q=75)
The pipeline is ready, the data is converted, and the GPU has completed its first "stress test." Now comes the most critical phase for any research engineer: the autopsy of the latent space.
In this third installment of our series, we dive into the results of our first training run on the Koch SO-ARM101 using a JEPA (Joint-Embedding Predictive Architecture). With a limited dataset of only 12 episodes, we weren't expecting a miracle, but rather a clear diagnostic.
Is the model learning the laws of physics or just memorizing pixels? Does our latent space show signs of "collapse," or is it ready to support a high-level planner like a GFlowNet? By analyzing Latent MSE, PCA projections, t-SNE clustering, and Linear Probing, we will map the boundaries of our model's "internal world" and define the roadmap for our next scaling phase.
Setup details
Data: 12 episodes / 9,000 rows
Dataset used: https://huggingface.co/datasets/Tpauwels/lerobot-hdf5-koch_pick_place_1_lego
Checkpoint used: https://huggingface.co/Tpauwels/lewm-koch
Hardware:
GPU: RTX 4090
vCPU: 16 (AMD EPYC 75F3 32-Core Processor)
Memory: 62 GB
Container Disk: 50 GB
Training parameters
Batch Size: 128
Num Workers: 6
Frame Skip: 5
SIGReg Weight: 0.09
history_size: 3
num_preds: 1
max_epoch: 100 (actual: 45)
precision: bf16
train_split: 0.9
seed: 3072
The training was performed on RunPod using the following template:
https://console.runpod.io/deploy?template=f83357qr5r&ref=7x06vrca
(Note: This is not a commercial collaboration or a paid partnership).
Data analysis
Fig 1: Predictive Drift — Stability or Inertia?
This graph measures the error between what the model "imagines" (prediction) and what it "sees" (real encoding) as the prediction horizon extends further into the future.
X-Axis (step_idx): Represents the number of "steps" into the future. At Step 0, the model starts from a ground-truth image. By Step 100, it has generated 100 successive states autoregressively, relying solely on its own previous latent predictions and the provided action vectors.
Theory vs. Reality: In a standard predictive architecture, the MSE (Mean Squared Error) should increase linearly or even exponentially. This is due to the inevitable accumulation of temporal errors, a phenomenon to which JEPA models are particularly sensitive.
Analysis of the Two Phases:
The Stagnation Phase (0 to 400 steps): The error remains abnormally low and flat. This is the sign of a "lazy" model. With a limited dataset of only 12 episodes, the JEPA has learned that the static background and noise are the easiest signals to predict. By predicting almost no change, it minimizes its Loss without actually learning the arm's physics. We are witnessing background overfitting.
The Rupture Phase (After 400 steps): We see a sudden explosion in MSE. This occurs when the model exceeds the average sequence length of its training data. Having no more known structure to cling to, its "imaginary world" completely collapses.
The Diagnostic: The apparent stability in the first phase is a statistical illusion. The predictor and encoder are "consistent" only because they have both agreed to ignore the dynamics of the robot in favor of a static world.
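This drift curve is straightforward to reproduce. Below is a minimal sketch of the measurement, assuming `encoder` and `predictor` are placeholders for the trained JEPA modules (names are illustrative, not the actual training code): roll the predictor forward from a single ground-truth latent and compare each imagined state to the encoder's latent of the real frame.

```python
import numpy as np

def rollout_latent_mse(encoder, predictor, frames, actions, horizon=100):
    """Measure predictive drift: starting from a ground-truth latent,
    autoregressively imagine `horizon` steps and compare each predicted
    latent to the encoder's latent of the real frame at that step."""
    z = encoder(frames[0])                  # ground-truth starting latent
    errors = []
    for t in range(horizon):
        z = predictor(z, actions[t])        # imagine the next state
        z_true = encoder(frames[t + 1])     # what the model actually sees
        errors.append(float(np.mean((z - z_true) ** 2)))
    return errors
```

A flat curve that suddenly explodes, as in Fig 1, means the rollout stays "lazy" until the model runs out of memorized structure.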
Fig 2: Consistency Analysis — Latent Neighbors Topology
Visualizing a PCA (Principal Component Analysis) on a model using the SIGReg regularizer may seem paradoxical. Where PCA seeks to isolate the axes of maximum variance, SIGReg aims to uniformize it into a Gaussian distribution (isotropy).
However, it is precisely this "conflict" that makes this visualization valuable: it allows us to observe the struggle between the model's dynamics and its statistical constraints.
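One cheap way to quantify who is winning that struggle is to inspect the eigenvalue spread of the latent covariance. This is a rough diagnostic sketch, not part of SIGReg itself: if the Gaussian/isotropy pressure fully "won", all eigenvalues are nearly equal; a few dominant eigenvalues mean physical structure is resisting.

```python
import numpy as np

def isotropy_ratio(latents):
    """Ratio of largest to smallest covariance eigenvalue of the
    latent batch. 1.0 means perfectly isotropic (regularization won);
    large values mean a few directions dominate (structure resists)."""
    z = np.asarray(latents)
    z = z - z.mean(axis=0)                  # center before covariance
    eig = np.linalg.eigvalsh(np.cov(z, rowvar=False))
    return float(eig.max() / max(eig.min(), 1e-12))
```

Tracking this ratio across checkpoints gives a single scalar companion to the PCA plots.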
The Visual Grammar of Latent Space
To interpret this graph, we use three "health" indicators:
The Perfect Circle: The regularization has won. The model has "killed" the physical signal in favor of Gaussian statistics.
Filaments or Loops: The physics is "resisting." Episode trajectories are forcing the space to structure itself despite the constraint.
Clustering (Point or Line): A sign of total collapse. The encoder is no longer producing useful information.
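The projection behind these three indicators can be sketched with scikit-learn (a minimal version, assuming latents arrive as an `(N, D)` array):

```python
import numpy as np
from sklearn.decomposition import PCA

def latent_pca_2d(latents):
    """Project (N, D) latents to 2D for the visual 'health' read-out:
    a near-perfect disc means regularization won, filaments mean the
    physics is resisting, a tight blob means collapse. Also returns
    the explained variance ratio of the two kept components."""
    pca = PCA(n_components=2)
    xy = pca.fit_transform(np.asarray(latents))
    return xy, pca.explained_variance_ratio_
```

The returned `xy` can then be scattered, coloring points by episode or by anchor/neighbor role as in Fig 2.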
Observations from Run 1
On this graph, we observe:
Local Consistency: The neighbors (blue points) are tightly grouped around the anchors (red crosses). This indicates that the model respects a certain mathematical continuity.
"Spider Web" Structure: The distribution is messy but filamentous. This demonstrates that a physical structure is attempting to emerge from the constraints imposed by SIGReg, even if it hasn't yet formed clear semantic clusters (e.g., "closed gripper" vs. "open gripper").
Limitations of the Current Analysis
Technical honesty is required here: my current evaluation script does not yet visually isolate the frames selected by the Nearest Neighbor (NN). While mathematical proximity is evident, I cannot yet state with absolute certainty that two neighbors share identical visual characteristics. This is a key improvement planned for the next diagnostic pipeline.
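The planned improvement is simple in principle. A sketch of what that diagnostic could look like (names are hypothetical, not the current evaluation script): retrieve the indices of each anchor's nearest latent neighbors, so the corresponding frames can be displayed side by side for visual verification.

```python
import numpy as np

def nearest_neighbor_frames(latents, anchor_indices, k=5):
    """For each anchor, return the indices of the k closest latents
    (excluding the anchor itself). The matching frames can then be
    rendered next to the anchor frame for a visual sanity check."""
    latents = np.asarray(latents)
    out = {}
    for i in anchor_indices:
        d = np.linalg.norm(latents - latents[i], axis=1)
        d[i] = np.inf                       # never match the anchor itself
        out[i] = np.argsort(d)[:k].tolist()
    return out
```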
Fig 3: Global Latent Mapping (PCA Projection)
In this visualization, we use the same PCA projection but color the points according to their source episode. This allows us to track how the model separates individual demonstrations.
1. Sequential Victory: Filaments over Clouds
We observe that each episode forms a semi-independent "filament" rather than an exploded cloud of points. This is a significant victory for the encoder: it demonstrates a clear understanding that state t+1 is closely related to state t. The model has successfully captured the sequential structure of the video.
2. Chaotic Entanglement = Absence of Semantics
However, this is where the "Post-Mortem" gets gritty. In a model that truly "understands" the task (a generalist model), we shouldn't see isolated filaments by episode. Instead, we should see these filaments break apart to form thematic nodes:
All "rest positions" (starting points) should cluster in the same area.
All "grasping phases" should form a compact cluster, regardless of which episode they come from.
Currently, we have chaos because:
The model identifies each episode by its unique noise (slight lighting differences, imperceptibly different starting positions).
It fails to extract the invariant (the act of grasping) to merge these trajectories into a shared latent meaning.
3. SIGReg and the "Tangled Cables" Effect
Without SIGReg, these filaments would likely be drifting far apart in an immense, unorganized space. SIGReg forces them all into the same "small box" (the Gaussian distribution).
The result? They all end up in the same mathematical coordinate space, but because they lack semantic ties, they cross and intertwine without any physical logic. It is the latent equivalent of a drawer full of tangled cables.
Fig 4: Linear Probing R2 Scores – The Latent-to-Physics Bridge
If the PCA was our "X-ray," the Linear Probing is our "lie detector." It measures exactly how much physical information is actually accessible within the latent space for a controller or a planner.
We perform a simple linear regression from the latent state to predict real-world values. A high score indicates high linear availability; a low score means the information is either missing or "twisted" in a way that a simple system cannot use.
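A minimal sketch of such a probe, assuming latents and targets are aligned arrays (this mirrors the method described above, not the exact evaluation script; Ridge is used here as a regularized stand-in for plain linear regression):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

def linear_probe_r2(latents, targets, seed=0):
    """Fit a linear map from latent states to a physical quantity
    (actions, joint angles, ...) and report held-out R^2: how much
    of that signal is *linearly* readable from the latent space."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        latents, targets, test_size=0.2, random_state=seed)
    probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
    return float(r2_score(y_te, probe.predict(X_te)))
```

Run once per target (action, proprioception, state) to reproduce the per-component breakdown below.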
1. Component Breakdown:
Action: This is the highest score. The model captures "intent"—the correlation between the visual context and the command sent. It confirms that the link between images and actions is the loudest signal, yet it remains significantly underdeveloped.
$$R^2 \approx 0.35$$
Proprioception: This is the weak link. The model struggles to locate the arm’s joints in space. A score of 0.30 means it is effectively "blind" to 70% of the actual joint positions. For a robot, this is the equivalent of driving with your eyes closed, trying to guess the steering wheel's position by touch alone.
$$R^2 \approx 0.30$$
State: The "State" (a combination of image and proprioception) stagnates. This confirms the encoder’s failure to extract a stable world representation from a dataset of only 12 repetitive episodes.
$$R^2 \approx 0.30$$
2. The "Glass Ceiling" Effect
Seeing all scores clustered around the 0.3 mark reveals a fundamental semantic resolution problem:
Non-Linearity: Since the probing is linear, a low score suggests the information might be present but is stored non-linearly. It is "trapped" in a format the model cannot easily translate into action.
The Predictor Paradox: As noted in our Epoch 10 analysis, the predictor often outperforms the encoder. This proves the architecture is trying to learn the physics, but the encoder isn't receiving enough visual diversity to "calibrate" its vision.
The Verdict: A global score of 0.31 is an admission of powerlessness for any high-level planner (like a GFlowNet). A system cannot make rational decisions if its perception of reality is 70% wrong. This graph mathematically validates our intuition: the model has learned the "noise" of the trajectory, but it remains blind to the "precision" of the gesture.
Fig 5: Non-Linear Clustering Analysis (t-SNE)
While PCA looks for global variance, t-SNE focuses on local structures and manifold geometry. This visualization is where the model's struggle with generalization becomes most apparent.
What is t-SNE?
Before diving into the visualization, let’s define the tool. t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear dimensionality reduction technique specifically designed to visualize high-dimensional data.
Unlike PCA, which is linear and focuses on preserving the global spread (variance) of the data, t-SNE is "obsessed" with local structure. It works by converting the distances between points in the high-dimensional latent space into probabilities of being neighbors. It then tries to recreate those same neighborhood relationships in a 2D map.
In short: if two points are close on a t-SNE plot, they are very likely to be "neighbors" in the model's mind. In robotics, we use it to see if the model naturally groups similar physical states (like "closed grip") into distinct clusters.
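Producing the map itself is a few lines with scikit-learn (a minimal sketch; perplexity must stay below the number of samples):

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_map(latents, perplexity=30, seed=0):
    """Non-linear 2D embedding that preserves local neighborhoods.
    Per-episode 'spaghetti' filaments indicate memorization; shared
    clusters across episodes would indicate real semantic grouping."""
    tsne = TSNE(n_components=2, perplexity=perplexity,
                init="pca", random_state=seed)
    return tsne.fit_transform(np.asarray(latents))
```

Coloring the result by episode, and overlaying encoder vs. predictor latents, reproduces the read-outs discussed below.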
1. The "Spaghetti" Syndrome: Episode Overfitting
The most striking feature of this graph is the presence of distinct, colorful "curves" or filaments. Each curve represents a single episode from the training set.
- The Diagnostic: Instead of grouping similar physical states together (e.g., all "gripper closing" moments in one cluster), the model has isolated each episode into its own trajectory. This is a clear sign of overfitting: the JEPA has memorized the specific "flavor" of each recording rather than extracting the universal laws of the robot's physics.
2. Latent Stability: Encoder vs. Predictor
There is, however, a silver lining. Notice how closely the Encoder points (blue) and Predictor points (orange) overlap within each filament.
- The Good News: The internal "dialogue" between the encoder and the predictor is highly stable. The predictor has successfully learned to mimic the encoder's latent transitions. Within its own "hallucinated" world, the model is consistent; it’s just that this world is currently a collection of 12 separate stories rather than one unified physical reality.
3. Implications for GFlowNet Planning
For a GFlowNet to function as a System 2 (reasoning) layer, it needs an environment where it can jump between states based on logic, not just follow a pre-recorded path. The current "blurred" boundaries and dispersed clusters confirm that the latent space is not yet "navigable."
Conclusion: The model is an excellent "historian" of its 12 episodes, but a poor "physicist." To break these filaments and force the model to merge these trajectories into meaningful semantic clusters, we need to flood the system with the Massive Diversity planned for Run 3.
Nota Bene: The Multi-View Experiment
A second training run was conducted by doubling the dataset size through a multi-view approach (using the same episodes but captured from two different camera angles).
After a thorough analysis, I decided not to dedicate a full article to this second run. The results were strictly identical, exhibiting the same patterns and limitations as the first run. Adding another camera angle on the same 12 episodes did not provide the "physical diversity" the model craved.
Key Takeaway: The 10-Epoch Convergence
Interestingly, I compared the metrics at Epoch 10 versus Epoch 40 during this second run. The results were slightly better at Epoch 10. This confirms the findings of the original LeWorldModel (LeWM) paper: this architecture converges extremely fast, typically within 10 epochs.
The Pivot for Run 3
Moving forward, we will stop "wasting" compute on long training cycles. For all future experiments, we will limit training to 10 epochs. Instead of depth, we will focus on breadth: drastically increasing the number of episodes by building a massive, merged dataset from various heterogeneous sources.