<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Technical Blueprints]]></title><description><![CDATA[Engineering insights on world models, robotics, and complex system diagnostics. Documenting the logic and mechanics behind modern technology.]]></description><link>https://blog.tanguy-pauwels.fr</link><image><url>https://cdn.hashnode.com/uploads/logos/69df4f9b74b22138755e755f/958a4242-7462-479c-b424-504c74b2e30a.png</url><title>Technical Blueprints</title><link>https://blog.tanguy-pauwels.fr</link></image><generator>RSS for Node</generator><lastBuildDate>Tue, 21 Apr 2026 08:19:22 GMT</lastBuildDate><atom:link href="https://blog.tanguy-pauwels.fr/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[[3] Post-Mortem Analysis: Why My First World Model (JEPA) Is "Blind"]]></title><description><![CDATA[The pipeline is ready, the data is converted, and the GPU has completed its first "stress test." Now comes the most critical phase for any research engineer: the autopsy of the latent space.
In this t]]></description><link>https://blog.tanguy-pauwels.fr/why-my-jepa-is-blind</link><guid isPermaLink="true">https://blog.tanguy-pauwels.fr/why-my-jepa-is-blind</guid><category><![CDATA[robotics]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[AI Research]]></category><category><![CDATA[Computer Vision]]></category><category><![CDATA[Jepa]]></category><category><![CDATA[openscience]]></category><category><![CDATA[Data Science]]></category><dc:creator><![CDATA[Tanguy Pauwels]]></dc:creator><pubDate>Wed, 15 Apr 2026 12:30:10 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69df4f9b74b22138755e755f/a4b4eeac-f7b5-4d61-8f87-fa232ca80754.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The pipeline is ready, the data is converted, and the GPU has completed its first "stress test." Now comes the most critical phase for any research engineer: <strong>the autopsy of the latent space.</strong></p>
<p>In this third installment of our series, we dive into the results of our first training run on the <strong>Koch SO-ARM101</strong> using a JEPA (Joint-Embedding Predictive Architecture). With a limited dataset of only 12 episodes, we weren't expecting a miracle, but rather a clear diagnostic.</p>
<p>Is the model learning the laws of physics or just memorizing pixels? Does our latent space show signs of "collapse," or is it ready to support a high-level planner like a GFlowNet? By analyzing <strong>Latent MSE</strong>, <strong>PCA projections</strong>, <strong>t-SNE clustering</strong>, and <strong>Linear Probing</strong>, we will map the boundaries of our model's "internal world" and define the roadmap for our next scaling phase.</p>
<details>
<summary>Setup detail</summary>
<ul><li><p><strong>Data:</strong> 12 episodes / 9,000 rows</p></li><li><p>Dataset used: <a target="_blank" rel="noopener noreferrer nofollow" class="text-primary underline underline-offset-2 hover:text-primary/80 cursor-pointer" href="https://huggingface.co/datasets/Tpauwels/lerobot-hdf5-koch_pick_place_1_lego">https://huggingface.co/datasets/Tpauwels/lerobot-hdf5-koch_pick_place_1_lego</a></p></li><li><p>Checkpoint used: <a target="_blank" rel="noopener noreferrer nofollow" class="text-primary underline underline-offset-2 hover:text-primary/80 cursor-pointer" href="https://huggingface.co/Tpauwels/lewm-koch">https://huggingface.co/Tpauwels/lewm-koch</a></p><details class="editor-details"><summary class="details-summary">Hardware:</summary><div data-type="details-content" class="details-content"><ul><li><p>GPU: RTX 4090</p></li><li><p>vCPU: 16 (AMD EPYC 75F3 32-Core Processor)</p></li><li><p>Memory: 62 GB</p></li><li><p>Container Disk: 50 GB</p></li></ul></div></details><details class="editor-details"><summary class="details-summary"><strong>Training parameters</strong></summary><div data-type="details-content" class="details-content"><ul><li><p>Batch Size: 128</p></li><li><p>Num Workers: 6</p></li><li><p>Frame Skip: 5</p></li><li><p>SIGReg Weight: 0.09</p></li><li><p>history_size: 3</p></li><li><p>num_preds: 1</p></li><li><p>max_epoch: 100 (actual: 45)</p></li><li><p>precision: bf16</p></li><li><p>train_split: 0.9</p></li><li><p>seed: 3072</p></li></ul></div></details></li></ul>
</details>

<p>The training was performed on <strong>RunPod</strong> using the following template:<br /><a href="https://console.runpod.io/deploy?template=f83357qr5r&amp;ref=7x06vrca">https://console.runpod.io/deploy?template=f83357qr5r&amp;ref=7x06vrca</a><br />(Note: This is not a commercial collaboration or a paid partnership).</p>
<h2>Data analysis</h2>
<h3><strong>Fig 1: Predictive Drift — Stability or Inertia?</strong></h3>
<img src="https://cdn.hashnode.com/uploads/covers/69df4f9b74b22138755e755f/d05713cd-95a7-4da3-9310-d0cc9e17028b.png" alt="Line chart - Latent prediction Error by Step" style="display:block;margin:0 auto" />

<p>This graph measures the error between what the model "imagines" (prediction) and what it "sees" (real encoding) as the prediction horizon extends further into the future.</p>
<ul>
<li><p><strong>X-Axis</strong> <code>(step_idx)</code><strong>:</strong> Represents the number of "steps" into the future. At <strong>Step 0</strong>, the model starts from a ground-truth image. By <strong>Step 100</strong>, it has generated 100 successive states autoregressively, relying solely on its own previous latent predictions and the provided action vectors.</p>
</li>
<li><p><strong>Theory vs. Reality:</strong> In a standard predictive architecture, the <code>MSE</code> (Mean Squared Error) should increase linearly or even exponentially. This is due to the inevitable accumulation of temporal errors—a phenomenon to which JEPA models are particularly sensitive.</p>
</li>
</ul>
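<p>The measurement itself is cheap to sketch. Below is a minimal numpy toy (the helper and the linear "world" are illustrative stand-ins, not the LeWM evaluation code): starting from a ground-truth latent, we roll the predictor forward on its own outputs and score the MSE against the real encodings at each step.</p>

```python
import numpy as np

def rollout_latent_mse(z_true, actions, predictor):
    """Autoregressive rollout: start from the ground-truth latent at step 0,
    then feed the predictor its own outputs, scoring MSE against the real
    encodings at every step (illustrative helper, not the LeWM API)."""
    z_hat = z_true[0]
    errors = []
    for t in range(1, len(z_true)):
        z_hat = predictor(z_hat, actions[t - 1])      # prediction feeds on itself
        errors.append(float(np.mean((z_hat - z_true[t]) ** 2)))
    return np.array(errors)

# Toy world: latents drift linearly with the action, and the predictor knows
# the rule exactly, so the error curve stays flat, like the first phase of Fig 1.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 8)) * 0.01                    # toy dynamics matrix
actions = rng.normal(size=(99, 4))
z_true = np.zeros((100, 8))
for t in range(1, 100):
    z_true[t] = z_true[t - 1] + actions[t - 1] @ A

mse = rollout_latent_mse(z_true, actions, lambda z, a: z + a @ A)
```

<p>Note that in the toy the predictor knows the true dynamics, so the curve is flat by construction: this is exactly why a flat curve alone proves nothing about what was learned.</p>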
<h4><strong>Analysis of the Two Phases:</strong></h4>
<ol>
<li><p><strong>The Stagnation Phase (0 to 400 steps):</strong> The error remains abnormally low and flat. This is the sign of a "lazy" model. With a limited dataset of only 12 episodes, the JEPA has learned that the static background and noise are the easiest signals to predict. By predicting almost no change, it minimizes its Loss without actually learning the arm's physics. We are witnessing <strong>background overfitting</strong>.</p>
</li>
<li><p><strong>The Rupture Phase (After 400 steps):</strong> We see a sudden explosion in <code>MSE</code>. This occurs when the model exceeds the average sequence length of its training data. Having no more known structure to cling to, its "imaginary world" completely collapses.</p>
</li>
</ol>
<blockquote>
<p><strong>The Diagnostic:</strong> The apparent stability in the first phase is a statistical illusion. The predictor and encoder are "consistent" only because they have both agreed to ignore the dynamics of the robot in favor of a static world.</p>
</blockquote>
<hr />
<h3><strong>Fig 2: Consistency Analysis — Latent Neighbors Topology</strong></h3>
<img src="https://cdn.hashnode.com/uploads/covers/69df4f9b74b22138755e755f/edd19a48-5ddc-471c-8609-4cd2624da734.png" alt="Scatter Chart - Nearest Neighbors in PCA Space" style="display:block;margin:0 auto" />

<p>Visualizing a <strong>PCA</strong> (Principal Component Analysis) on a model using the <strong>SIGReg</strong> regularizer may seem paradoxical. Where PCA seeks to isolate the axes of maximum variance, SIGReg aims to uniformize it into a Gaussian distribution (isotropy).</p>
<p>However, it is precisely this "conflict" that makes this visualization valuable: it allows us to observe the <strong>struggle between the model's dynamics and its statistical constraints</strong>.</p>
<h4><strong>The Visual Grammar of Latent Space</strong></h4>
<p>To interpret this graph, we use three "health" indicators:</p>
<ul>
<li><p><strong>The Perfect Circle:</strong> The regularization has won. The model has "killed" the physical signal in favor of Gaussian statistics.</p>
</li>
<li><p><strong>Filaments or Loops:</strong> The physics is "resisting." Episode trajectories are forcing the space to structure itself despite the constraint.</p>
</li>
<li><p><strong>Clustering (Point or Line):</strong> A sign of total <strong>collapse</strong>. The encoder is no longer producing useful information.</p>
</li>
</ul>
<h4><strong>Observations from Run 1</strong></h4>
<p>On this graph, we observe:</p>
<ol>
<li><p><strong>Local Consistency:</strong> The neighbors (blue points) are tightly grouped around the anchors (red crosses). This indicates that the model respects a certain mathematical continuity.</p>
</li>
<li><p><strong>"Spider Web" Structure:</strong> The distribution is messy but filamentous. This demonstrates that a physical structure is attempting to emerge from the constraints imposed by SIGReg, even if it hasn't yet formed clear semantic clusters (e.g., "closed gripper" vs. "open gripper").</p>
</li>
</ol>
<h4><strong>Limitations of the Current Analysis</strong></h4>
<p>Technical honesty is required here: my current evaluation script does not yet visually isolate the frames selected by the <strong>Nearest Neighbor (NN)</strong>. While mathematical proximity is evident, I cannot yet state with absolute certainty that two neighbors share identical visual characteristics. This is a key improvement planned for the next diagnostic pipeline.</p>
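<p>For reference, the mechanics behind this figure are easy to reproduce. Here is a numpy-only sketch (the helper names are ours; the actual evaluation script may differ) that projects latents onto their top two principal components and pulls the nearest neighbors of an anchor:</p>

```python
import numpy as np

def pca_project(latents, k=2):
    """Top-k principal components via SVD (numpy-only stand-in for a PCA library)."""
    centered = latents - latents.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

def nearest_neighbors(latents, anchor_idx, n=5):
    """Indices of the n closest latents to an anchor, excluding the anchor itself."""
    dist = np.linalg.norm(latents - latents[anchor_idx], axis=1)
    dist[anchor_idx] = np.inf
    return np.argsort(dist)[:n]

rng = np.random.default_rng(0)
latents = rng.normal(size=(500, 128))   # stand-in for [N, 128] encoder outputs
coords = pca_project(latents)           # the 2D points of the scatter plot
nn_idx = nearest_neighbors(latents, anchor_idx=0)
```

<p>The missing piece, as noted above, is fetching the source frames behind <code>nn_idx</code> to check visual similarity by eye.</p>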
<hr />
<h3><strong>Fig 3: Global Latent Mapping (PCA Projection)</strong></h3>
<img src="https://cdn.hashnode.com/uploads/covers/69df4f9b74b22138755e755f/9b6ccc06-7d4a-4fa3-98cf-7692fcb1481a.png" alt="Scatter chart - PCA Encore and Predicted Latents colored by episode" style="display:block;margin:0 auto" />

<p>In this visualization, we use the same PCA projection but color the points according to their <strong>source episode</strong>. This allows us to track how the model separates individual demonstrations.</p>
<h4><strong>1. Sequential Victory: Filaments over Clouds</strong></h4>
<p>We observe that each episode forms a semi-independent "filament" rather than an exploded cloud of points. This is a significant victory for the encoder: it demonstrates a clear understanding that state <code>t+1</code> is closely related to state <code>t</code>. The model has successfully captured the <strong>sequential structure</strong> of the video.</p>
<h4><strong>2. Chaotic Entanglement = Absence of Semantics</strong></h4>
<p>However, this is where the "Post-Mortem" gets gritty. In a model that truly "understands" the task (a generalist model), we shouldn't see isolated filaments by episode. Instead, we should see these filaments break apart to form <strong>thematic nodes</strong>:</p>
<ul>
<li><p>All "rest positions" (starting points) should cluster in the same area.</p>
</li>
<li><p>All "grasping phases" should form a compact cluster, regardless of which episode they come from.</p>
</li>
</ul>
<p>Currently, we have chaos because:</p>
<ul>
<li><p>The model identifies each episode by its <strong>unique noise</strong> (slight lighting differences, imperceptibly different starting positions).</p>
</li>
<li><p>It fails to extract the <strong>invariant</strong> (the act of grasping) to merge these trajectories into a shared latent meaning.</p>
</li>
</ul>
<h4><strong>3. SIGReg and the "Tangled Cables" Effect</strong></h4>
<p>Without <strong>SIGReg</strong>, these filaments would likely be drifting far apart in an immense, unorganized space. SIGReg forces them all into the same "small box" (the Gaussian distribution).</p>
<p>The result? They all end up in the same mathematical coordinate space, but because they lack semantic ties, they cross and intertwine without any physical logic. It is the latent equivalent of a <strong>drawer full of tangled cables</strong>.</p>
<hr />
<h3><strong>Fig 4: Linear Probing R² Scores – The Latent-to-Physics Bridge</strong></h3>
<img src="https://cdn.hashnode.com/uploads/covers/69df4f9b74b22138755e755f/e347d0d1-ea63-4ee4-9344-2296381e71eb.png" alt="Bar Chart - Linear Probe R2 by Signal" style="display:block;margin:0 auto" />

<p>If the PCA was our "X-ray," the <strong>Linear Probing</strong> is our "lie detector." It measures exactly how much physical information is actually accessible within the latent space for a controller or a planner.</p>
<p>We perform a simple linear regression from the latent state to predict real-world values. A high score indicates high <strong>linear availability</strong>; a low score means the information is either missing or "twisted" in a way that a simple system cannot use.</p>
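<p>Concretely, the probe is just a least-squares fit from latents to a physical signal. A minimal sketch, assuming a plain linear fit scored on the training data (a real probe would typically use ridge regression and a held-out split):</p>

```python
import numpy as np

def linear_probe_r2(z, y):
    """Least-squares linear map z -> y, scored by R^2 on the fit
    (minimal stand-in; not the exact evaluation pipeline)."""
    X = np.hstack([z, np.ones((len(z), 1))])   # append a bias column
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ w) ** 2)
    ss_tot = np.sum((y - y.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 128))                    # latent states
w_true = rng.normal(size=(128, 6))                  # hidden linear "physics"
joints = z @ w_true + rng.normal(scale=5.0, size=(1000, 6))  # noisy joint angles

r2 = linear_probe_r2(z, joints)   # high here, because the signal is linearly available
```

<p>With a multi-dimensional target (six joints here), this returns a single pooled R², which is how a per-signal bar like "Proprioception" is summarized.</p>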
<h4><strong>1. Component Breakdown:</strong></h4>
<ul>
<li><p><strong>Action:</strong> This is the highest score. The model captures "intent"—the correlation between the visual context and the command sent. It confirms that the link between images and actions is the loudest signal, yet it remains significantly underdeveloped.</p>
<p>$$R^2 \approx 0.35$$</p>
</li>
<li><p><strong>Proprioception:</strong> This is the weak link. The model struggles to locate the arm’s joints in space. A score of 0.30 means it is effectively "blind" to 70% of the variance in the actual joint positions. For a robot, this is the equivalent of driving with your eyes closed, trying to guess the steering wheel's position by touch alone.</p>
<p>$$R^2 \approx 0.30$$</p>
</li>
<li><p><strong>State:</strong> The "State" (a combination of image and proprioception) stagnates. This confirms the encoder’s failure to extract a stable world representation from a dataset of only 12 repetitive episodes.</p>
<p>$$R^2 \approx 0.30$$</p>
</li>
</ul>
<h4><strong>2. The "Glass Ceiling" Effect</strong></h4>
<p>Seeing all scores clustered around the 0.3 mark reveals a fundamental <strong>semantic resolution problem</strong>:</p>
<ol>
<li><p><strong>Non-Linearity:</strong> Since the probing is linear, a low score suggests the information might be present but is stored non-linearly. It is "trapped" in a format the model cannot easily translate into action.</p>
</li>
<li><p><strong>The Predictor Paradox:</strong> As noted in our Epoch 10 analysis, the predictor often outperforms the encoder. This proves the architecture is <em>trying</em> to learn the physics, but the encoder isn't receiving enough visual diversity to "calibrate" its vision.</p>
</li>
</ol>
<blockquote>
<p><strong>The Verdict:</strong> A global score of <strong>0.31</strong> is an admission of powerlessness for any high-level planner (like a GFlowNet). A system cannot make rational decisions if its perception of reality is 70% wrong. This graph mathematically validates our intuition: the model has learned the "noise" of the trajectory, but it remains blind to the "precision" of the gesture.</p>
</blockquote>
<hr />
<h3><strong>Fig 5: Non-Linear Clustering Analysis (t-SNE)</strong></h3>
<img src="https://cdn.hashnode.com/uploads/covers/69df4f9b74b22138755e755f/0083baba-40f0-4a4c-8c69-c8b8d3de8d3d.png" alt="Scatter Chart - t-SNE Latents colored by encode/pred" style="display:block;margin:0 auto" />

<p>While PCA looks for global variance, <strong>t-SNE</strong> focuses on local structures and manifold geometry. This visualization is where the model's struggle with generalization becomes most apparent.</p>
<h4><strong>What is t-SNE?</strong></h4>
<p>Before diving into the visualization, let’s define the tool. <strong>t-Distributed Stochastic Neighbor Embedding (t-SNE)</strong> is a non-linear dimensionality reduction technique specifically designed to visualize high-dimensional data.</p>
<p>Unlike PCA, which is linear and focuses on preserving the global spread (variance) of the data, t-SNE is "obsessed" with <strong>local structure</strong>. It works by converting the distances between points in the high-dimensional latent space into probabilities of being neighbors. It then tries to recreate those same neighborhood relationships in a 2D map.</p>
<p>In short: if two points are close on a t-SNE plot, they are very likely to be "neighbors" in the model's mind. In robotics, we use it to see if the model naturally groups similar physical states (like "closed grip") into distinct clusters.</p>
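<p>The "neighbor probability" step, the heart of t-SNE, fits in a few lines. This sketch uses a fixed bandwidth instead of the per-point perplexity search real t-SNE performs (for actual plots, use a library implementation such as <code>sklearn.manifold.TSNE</code>), and shows how two well-separated toy "episodes" keep nearly all of their neighbor mass internal:</p>

```python
import numpy as np

def neighbor_probabilities(latents, sigma=1.0):
    """First step of t-SNE: convert pairwise distances into per-point
    'probability of being neighbors'. Fixed bandwidth for readability;
    real t-SNE tunes sigma per point to match a target perplexity."""
    d2 = np.sum((latents[:, None] - latents[None, :]) ** 2, axis=-1)
    p = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(p, 0.0)                    # a point is not its own neighbor
    return p / p.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
episode_a = rng.normal(size=(20, 8))            # two toy "episodes", far apart,
episode_b = rng.normal(size=(20, 8)) + 10.0     # like the filaments in the figure
p = neighbor_probabilities(np.vstack([episode_a, episode_b]))
```

<p>Because cross-episode distances dwarf within-episode ones, each point's probability mass stays inside its own episode, which is precisely how separate filaments emerge on the 2D map.</p>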
<h4><strong>1. The "Spaghetti" Syndrome: Episode Overfitting</strong></h4>
<p>The most striking feature of this graph is the presence of distinct, colorful "curves" or filaments. Each curve represents a single episode from the training set.</p>
<ul>
<li><strong>The Diagnostic:</strong> Instead of grouping similar physical states together (e.g., all "gripper closing" moments in one cluster), the model has isolated each episode into its own trajectory. This is a clear sign of <strong>overfitting</strong>: the JEPA has memorized the specific "flavor" of each recording rather than extracting the universal laws of the robot's physics.</li>
</ul>
<h4><strong>2. Latent Stability: Encoder vs. Predictor</strong></h4>
<p>There is, however, a silver lining. Notice how closely the <strong>Encoder</strong> points (blue) and <strong>Predictor</strong> points (orange) overlap within each filament.</p>
<ul>
<li><strong>The Good News:</strong> The internal "dialogue" between the encoder and the predictor is highly stable. The predictor has successfully learned to mimic the encoder's latent transitions. Within its own "hallucinated" world, the model is consistent; it’s just that this world is currently a collection of 12 separate stories rather than one unified physical reality.</li>
</ul>
<h4><strong>3. Implications for GFlowNet Planning</strong></h4>
<p>For a <strong>GFlowNet</strong> to function as a System 2 (reasoning) layer, it needs an environment where it can jump between states based on logic, not just follow a pre-recorded path. The current "blurred" boundaries and dispersed clusters confirm that the latent space is not yet "navigable."</p>
<blockquote>
<p><strong>Conclusion:</strong> The model is an excellent "historian" of its 12 episodes, but a poor "physicist." To break these filaments and force the model to merge these trajectories into meaningful semantic clusters, we need to flood the system with the <strong>Massive Diversity</strong> planned for Run 3.</p>
</blockquote>
<hr />
<h3><strong>Nota Bene: The Multi-View Experiment</strong></h3>
<p>A second training run was conducted by doubling the dataset size through a multi-view approach (using the same episodes but captured from two different camera angles).</p>
<p>After a thorough analysis, I decided not to dedicate a full article to this second run. The results were strictly identical, exhibiting the same patterns and limitations as the first run. Adding another camera angle on the same 12 episodes did not provide the "physical diversity" the model craved.</p>
<h4><strong>Key Takeaway: The 10-Epoch Convergence</strong></h4>
<p>Interestingly, I compared the metrics at <strong>Epoch 10</strong> versus <strong>Epoch 40</strong> during this second run. The results were slightly better at Epoch 10. This confirms the findings of the original <em>LeWorldModel</em> (LeWM) paper: this architecture converges extremely fast, typically within 10 epochs.</p>
<h4><strong>The Pivot for Run 3</strong></h4>
<p>Moving forward, we will stop "wasting" compute on long training cycles. For all future experiments, we will limit training to <strong>10 epochs</strong>. Instead of depth, we will focus on <strong>breadth</strong>: drastically increasing the number of episodes by building a massive, merged dataset from various heterogeneous sources.</p>
]]></content:encoded></item><item><title><![CDATA[[2] Data Engineering: Why and How to Convert LeRobot (Parquet/MP4) to HDF5]]></title><description><![CDATA[In my previous post, I explained why the JEPA architecture is such a promising lead for robotics. But between Yann LeCun’s theory and the first \(loss.backward()\), there is a massive wall: the data.
]]></description><link>https://blog.tanguy-pauwels.fr/from-mp4-to-hdf5</link><guid isPermaLink="true">https://blog.tanguy-pauwels.fr/from-mp4-to-hdf5</guid><category><![CDATA[data-engineering]]></category><category><![CDATA[Python]]></category><category><![CDATA[HDF5]]></category><category><![CDATA[robotics]]></category><category><![CDATA[lerobot]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Tanguy Pauwels]]></dc:creator><pubDate>Wed, 15 Apr 2026 09:38:59 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69df4f9b74b22138755e755f/10d091b6-683c-4293-93d1-2eac404d553d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In my previous post, I explained why the <strong>JEPA</strong> architecture is such a promising lead for robotics. But between Yann LeCun’s theory and the first <code>\(loss.backward()\)</code>, there is a massive wall: <strong>the data.</strong></p>
<p>For my POC on the <strong>Koch arm (SO-ARM101)</strong>, I’m using the <strong>LeRobot</strong> ecosystem. It’s a goldmine of data, but its default storage format isn't built for the intensive training cycles required by World Models. Here is why I had to build a technical "bridge" to the <strong>HDF5</strong> format.</p>
<p>🔗 <a href="https://github.com/tanguy-pauwels/lerobot-dataset-to-HDF5">Bridge repository</a></p>
<h2>1. The Format War: Storage vs. Training</h2>
<table style="min-width:75px"><colgroup><col style="min-width:25px"></col><col style="min-width:25px"></col><col style="min-width:25px"></col></colgroup><tbody><tr><td><p><strong>Feature</strong></p></td><td><p><strong>LeRobot Format (Parquet + MP4)</strong></p></td><td><p><strong>HDF5 Format (LeWM Optimized)</strong></p></td></tr><tr><td><p><strong>Ideal Use</strong></p></td><td><p>Lightweight distribution and archiving.</p></td><td><p>Intensive Training (GPU Datasets).</p></td></tr><tr><td><p><strong>Pros</strong></p></td><td><p>Highly compressed, easy to visualize, HF standard.</p></td><td><p>Ultra-fast <strong>Random Access</strong> to frames and actions.</p></td></tr><tr><td><p><strong>Cons</strong></p></td><td><p>Heavy CPU video decoding for every batch, potential desync.</p></td><td><p>Large file size, less "standard" for sharing.</p></td></tr></tbody></table>

<p><strong>The Problem</strong>: Training a World Model (JEPA) requires sampling random time windows across thousands of episodes. Attempting a <code>seek</code> in an MP4 file for every single frame in a batch of 256 is performance suicide. HDF5 allows us to treat the dataset as one massive tensor living on the disk.</p>
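<p>To make the contrast concrete, here is an h5py sketch of the access pattern in question (the tiny file built here is a stand-in for a converted dataset, using the <code>pixels</code>/<code>action</code> key naming LeWM expects): a random time window becomes a pure disk slice, with no video container to open and nothing to decode.</p>

```python
import os
import tempfile

import h5py
import numpy as np

# A tiny stand-in file; a converted dataset has the same layout, just bigger.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("pixels", data=np.zeros((1000, 8, 8, 3), dtype=np.uint8))
    f.create_dataset("action", data=np.zeros((1000, 6), dtype=np.float32))

def sample_window(f, length=4, rng=np.random.default_rng(0)):
    """A random contiguous time window is just a slice: HDF5 reads
    only the requested frames from disk, no seek-and-decode dance."""
    start = int(rng.integers(0, f["pixels"].shape[0] - length))
    return f["pixels"][start:start + length], f["action"][start:start + length]

with h5py.File(path, "r") as f:
    frames, actions = sample_window(f)
```

<p>Multiply this by a batch of 256 windows per step and the difference with per-frame MP4 seeks becomes the whole ballgame.</p>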
<hr />
<h2>2. The "Little Mac" Challenge: Optimize or Die</h2>
<p>Fetching datasets via the <code>lerobot</code> library is a breeze. The real challenge began during conversion. My Mac, with its limited resources, suffered several <strong>Kernel Panics</strong> before I got it right.</p>
<p>To succeed without saturating the RAM, I had to implement a <strong>"Lean &amp; Mean"</strong> pipeline:</p>
<h3>The "Low-Memory" Conversion Strategy</h3>
<ul>
<li><p><strong>Linear Pipeline</strong>: I abandoned aggressive parallelism. We process one episode at a time, one camera at a time. It’s slower, but it’s <strong>predictable</strong>.</p>
</li>
<li><p><strong>Micro-Batching (64 frames)</strong>: Instead of loading an entire episode, we decode and write to the HDF5 in small chunks.</p>
</li>
<li><p><strong>Video Streaming</strong>: Using iterators (PyAV/OpenCV) to ensure a full video never materializes in the RAM.</p>
</li>
<li><p><strong>LZF Compression</strong>: The perfect compromise. It's ultra-fast, CPU-light, and significantly reduces the final file weight.</p>
</li>
</ul>
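<p>Put together, the strategy looks roughly like this (an illustrative h5py sketch; the generator stands in for a PyAV/OpenCV frame iterator, and the names are ours, not the bridge repo's API):</p>

```python
import os
import tempfile

import h5py
import numpy as np

def frame_stream(n_frames, h=224, w=224):
    """Stand-in for a PyAV/OpenCV iterator: yields one decoded frame
    at a time, so the full video never materializes in RAM."""
    for _ in range(n_frames):
        yield np.zeros((h, w, 3), dtype=np.uint8)

def convert_episode(out_path, frames, batch=64):
    """Append frames to a resizable HDF5 dataset in micro-batches of 64,
    with LZF compression and chunks aligned to the micro-batch size."""
    with h5py.File(out_path, "w") as f:
        ds = f.create_dataset(
            "pixels", shape=(0, 224, 224, 3), maxshape=(None, 224, 224, 3),
            dtype=np.uint8, chunks=(batch, 224, 224, 3), compression="lzf")
        buf = []
        for frame in frames:
            buf.append(frame)
            if len(buf) == batch:                 # flush one micro-batch
                ds.resize(ds.shape[0] + batch, axis=0)
                ds[-batch:] = np.stack(buf)
                buf.clear()
        if buf:                                   # flush the leftover tail
            ds.resize(ds.shape[0] + len(buf), axis=0)
            ds[-len(buf):] = np.stack(buf)
        return ds.shape[0]

path = os.path.join(tempfile.mkdtemp(), "episode_000.h5")
n_written = convert_episode(path, frame_stream(150))
```

<p>Note the chunk shape matching the micro-batch: aligned chunks keep each flush a single contiguous write instead of a scatter of partial-chunk updates.</p>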
<h3>Safety and Reliability (Diagnostic Mode)</h3>
<p>Because a 2-hour conversion crashing at 99% is unacceptable, I integrated several safeguards:</p>
<ul>
<li><p><strong>Pre-validation</strong>: We check the integrity of metadata (episodes, frames, flags) <em>before</em> touching the video files.</p>
</li>
<li><p><strong>Watchdog &amp; Heartbeat</strong>: If the script shows no progress for 120 seconds, it fails fast rather than hanging and wasting compute.</p>
</li>
<li><p><strong>RAM Estimation</strong>: The script calculates and displays the estimated memory footprint of the batch before starting.</p>
</li>
</ul>
<h2>3. Alignment for LeWM (World Model)</h2>
<p>The <strong>LeWM</strong> model is demanding. Conversion isn't enough; we need adaptation:</p>
<ul>
<li><p><strong>Resize 224x224</strong>: The standard for modern vision backbones. Resizing is done on-the-fly during conversion.</p>
</li>
<li><p><strong>Key Normalization</strong>: LeRobot names columns one way, while LeWM expects another (e.g., <code>pixels</code>, <code>action</code>, <code>state</code>, <code>done</code>). My bridge handles the translation automatically.</p>
</li>
</ul>
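<p>The translation itself can be as simple as a lookup table. A sketch (the LeRobot column names below are illustrative and vary by dataset version and camera name; only the target keys are fixed by LeWM):</p>

```python
# Hypothetical mapping: actual LeRobot column names depend on the dataset.
KEY_MAP = {
    "observation.images.laptop": "pixels",
    "observation.state": "state",
    "action": "action",
    "next.done": "done",
}

def normalize_keys(row):
    """Translate a LeRobot-style row into the keys LeWM expects,
    silently dropping columns the world model does not use."""
    return {KEY_MAP[k]: v for k, v in row.items() if k in KEY_MAP}

row = {
    "observation.images.laptop": b"<jpeg bytes>",
    "observation.state": [0.0] * 6,
    "action": [0.0] * 6,
    "next.done": False,
    "timestamp": 0.04,              # present in the source data, unused by LeWM
}
normalized = normalize_keys(row)
```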
<hr />
<h2>4. The Golden Rule: No "Dirty" Data</h2>
<p>In low-cost robotics, datasets are often imperfect: truncated episodes or missing <code>done</code> flags are common.</p>
<blockquote>
<p>My Policy: <code>dirty_episode_policy=fail</code></p>
</blockquote>
<p>On small datasets, the model is extremely sensitive to overfitting. Introducing inconsistent trajectories or ill-defined episode endings condemns the model to learn nonsense. I would rather have a converter that refuses to work than one that produces toxic data.</p>
<hr />
<h3>Tanguy's Advice</h3>
<blockquote>
<p>If you attempt this: <strong>watch your HDF5 chunks.</strong> A mismatch between your chunk size and your conversion micro-batches can turn your hard drive into a massive bottleneck. Yes, I learned this the hard way—remember, I only have a little Mac!</p>
</blockquote>
<p><strong>Next Step</strong>: We launch the training on an <strong>RTX 4090</strong> runpod instance and see if our latent space survives physical reality.</p>
]]></content:encoded></item><item><title><![CDATA[[1] Rethinking Robotics: Why I’m Betting on JEPA over VLAs]]></title><description><![CDATA[To test my research intuitions, I’m documenting my work on a JEPA (Joint-Embedding Predictive Architecture) world model, starting with low-cost robotics datasets like LeRobot and the Koch SO-ARM101. T]]></description><link>https://blog.tanguy-pauwels.fr/jepa-vs-vla-rethinking-robotics</link><guid isPermaLink="true">https://blog.tanguy-pauwels.fr/jepa-vs-vla-rethinking-robotics</guid><category><![CDATA[robotics]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[#jepa #worldmodel]]></category><category><![CDATA[openscience]]></category><category><![CDATA[Computer Vision]]></category><category><![CDATA[Jepa]]></category><dc:creator><![CDATA[Tanguy Pauwels]]></dc:creator><pubDate>Wed, 15 Apr 2026 09:19:51 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69df4f9b74b22138755e755f/2cad8c2b-ca91-4b64-be98-421b6c3c64bf.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>To test my research intuitions, I’m documenting my work on a <strong>JEPA</strong> (Joint-Embedding Predictive Architecture) world model, starting with low-cost robotics datasets like <strong>LeRobot</strong> and the <strong>Koch SO-ARM101</strong>. This series follows the journey from raw pixels to physical understanding.</p>
<h2>1. The Current Deadlock: The Limits of VLA (Vision-Language-Action)</h2>
<p>Today, state-of-the-art robotics relies heavily on <strong>VLA models</strong>. The concept is straightforward: give the robot a text instruction and an image, and it generates an action. However, this approach hits two major walls:</p>
<ul>
<li><p><strong>The "Weight" Problem</strong>: Integrating a Large Language Model (LLM) into the control loop makes the embedding extremely heavy. We end up with models boasting billions of parameters just to decide whether to squeeze a 2cm gripper.</p>
</li>
<li><p><strong>Physical Hallucination</strong>: Language is discrete and symbolic; physics is continuous and unforgiving. By relying on LLM-style architectures, VLAs are prone to hallucinations: the robot "thinks" it has successfully grabbed an object because the statistical probability of the next word is high, even as the object slips away.</p>
</li>
</ul>
<p><em>"Even when not 'complete,' VLAs rely on massive language model backbones. They inherit the bloat and probabilistic nature of LLMs, whereas JEPA offers a compact architecture dedicated solely to world dynamics."</em></p>
<h2>2. What is the JEPA Architecture?</h2>
<p>The <strong>JEPA</strong> (Joint-Embedding Predictive Architecture), championed by Yann LeCun, proposes a shift away from the "reconstruction" paradigm and toward "understanding."</p>
<h3>The Fundamental Intuition</h3>
<p>Instead of predicting pixels—like a generative model wasting energy trying to reconstruct the exact reflection of a light bulb on a table—JEPA predicts <strong>abstract representations</strong>.</p>
<blockquote>
<p>The model isn't trying to draw the future; it’s trying to comprehend it.</p>
</blockquote>
<img src="https://cdn.hashnode.com/uploads/covers/69df4f9b74b22138755e755f/9e8c9e04-cb4e-4ac6-ba11-4aad81e7541e.png" alt="Technical schema of JEPA architecture" style="display:block;margin:0 auto" />

<blockquote>
<p><em>Architecture inspired by "LeWorldModel: Stable End-to-End JEPA from Pixels" (Maes et al., 2024).</em></p>
<p>This schema is licensed under <strong>CC-BY 4.0</strong>. You are free to share and adapt it for any purpose, even commercially, as long as you keep the watermark or credit the original article.</p>
</blockquote>
<h3>1. <strong>Encoding Images into Latent Vectors</strong></h3>
<p>The process starts with a shared <strong>Encoder</strong> (e.g., a ViT-Tiny). Its job is to condense a high-dimensional <code>224×224</code> image into a compact <code>[B, 128]</code> <strong>latent vector</strong>. We encode both the initial observation at <code>Step t</code> and the "target" observation at <code>Step t+k</code>. This compressed representation is what we call the <strong>Latent Space</strong>.</p>
<h3><strong>2. Preventing Model Collapse via Statistical Regularization</strong></h3>
<p>Older JEPA architectures often suffered from "model collapse," where the encoder would output the same constant value for every image to artificially minimize the loss. To prevent this, <strong>LeWorldModel (LeWM)</strong> introduces <strong>SIGReg</strong> (Statistical Isotropic Gradient Regularization). This forces the Latent Space to follow an <strong>Isotropic Gaussian distribution</strong>:</p>
<ul>
<li><p><strong>Mean = 0</strong></p>
</li>
<li><p><strong>Variance = 1</strong> in every dimension/direction.</p>
</li>
</ul>
<h3>3. Predicting the Future</h3>
<p>We feed the regularized Latent Space <code>z_t</code> and the <strong>Action Tensor</strong> (shape <code>[B, k, n]</code>) into a <strong>Transformer-based Predictor</strong>. The goal is to predict the future state in the latent space:</p>
<p>$$\hat{z}_{t+k}$$</p>
<h3><strong>4. The Optimization Challenge</strong></h3>
<p>The training involves minimizing two competing losses:</p>
<ol>
<li><p><strong>Consistency Loss (PredLoss):</strong> Ensures the predicted latent matches the real encoded future.</p>
</li>
<li><p><strong>SIGReg Loss:</strong> Prevents collapse by enforcing the Gaussian distribution.</p>
</li>
</ol>
<p><strong>The core challenge lies in this tension:</strong> To achieve good predictions, the encoder wants to "group" similar physical states together. However, SIGReg constantly tries to "spread" the points out to satisfy the statistical constraint. This "struggle" is exactly what defines the geometry of our model's internal world.</p>
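<p>In code, the tension is easy to see. Here is a simplified numpy sketch of the two terms (the real SIGReg is a statistical test on the latent distribution, not the plain mean/variance penalty used below as a readable stand-in; the 0.09 weight matches this series' training setup):</p>

```python
import numpy as np

def pred_loss(z_pred, z_target):
    """Consistency term: predicted latent vs. the real encoded future."""
    return float(np.mean((z_pred - z_target) ** 2))

def sigreg_like_loss(z):
    """Readable stand-in for SIGReg: pull every latent dimension toward
    mean 0 and variance 1 (the paper's actual statistical test differs)."""
    return float(np.mean(z.mean(axis=0) ** 2) + np.mean((z.var(axis=0) - 1.0) ** 2))

rng = np.random.default_rng(0)
z_future = rng.normal(size=(128, 128))            # [B, 128] encoded futures
z_pred = z_future + rng.normal(scale=0.1, size=z_future.shape)

# The two losses compete: prediction wants structure, SIGReg wants isotropy.
total = pred_loss(z_pred, z_future) + 0.09 * sigreg_like_loss(z_future)
```

<p>Note how a collapsed batch (all latents identical) would score the maximum regularization penalty: that is precisely the failure mode this term exists to prevent.</p>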
<h2>3. Critical Analysis</h2>
<h3>Advantages</h3>
<ul>
<li><p><strong>Efficiency</strong>: With only 15M parameters, LeWM outperforms giant models in planning speed (up to 48x faster).</p>
</li>
<li><p><strong>Intuitive Physics</strong>: The model naturally learns concepts like gravity and object permanence.</p>
</li>
</ul>
<h3>Disadvantages</h3>
<ul>
<li><p><strong>Black Box</strong>: It is impossible to "see" the prediction without adding a third-party decoder.</p>
</li>
<li><p><strong>Short Horizon</strong>: Error accumulation in autoregressive mode limits long-term planning.</p>
</li>
</ul>
<h2>4. The GFN &gt; JEPA Intuition: Marrying System 1 &amp; 2</h2>
<p>The major hurdle for JEPA in robotics is reaching distant goals. My intuition is to use <strong>GFlowNets (GFN)</strong> to decompose the task:</p>
<ul>
<li><p><strong>JEPA (System 1 - Instinct)</strong>: Handles local physics and immediate execution within the latent space.</p>
</li>
<li><p><strong>GFN (System 2 - Reason)</strong>: Explores the diversity of possible trajectories and defines intermediate <strong>sub-goals</strong> that JEPA can then easily reach.</p>
</li>
</ul>
<h2>My Roadmap: From Signal to Sense</h2>
<p>To test this intuition, my work is divided into two major experimental phases:</p>
<h3>Phase 1: World Model Stress-Test</h3>
<p>I will begin by training a JEPA (based on the LeWM architecture) on specific Low-Cost robotics datasets (<strong>LeRobot</strong> ecosystem, <strong>Koch SO-ARM101</strong> arm). The goal is to study the "scaling laws" of the representation:</p>
<ul>
<li><p><strong>Data Density</strong>: What is the impact of increasing the number of episodes?</p>
</li>
<li><p><strong>Multi-View</strong>: Does the model become more stable by synchronizing multiple angles (e.g., laptop camera + mobile camera) for the same task?</p>
</li>
<li><p><strong>Generalization</strong>: How does the model react to non-robotic videos or different robotic arms?</p>
</li>
</ul>
<h3>Phase 2: GFN-JEPA Alignment</h3>
<p>If the latent space proves robust, I will tackle the "bridge" to deliberate planning. We will seek to define the limits of GFlowNets and align their workspace with JEPA’s. The end goal: a user provides a textual command, and the GFN translates it into a sequence of latent sub-goals for the JEPA to execute.</p>
<hr />
<h2>Follow the Progress</h2>
<p>This project is an <strong>Open-Science</strong> exploration. I will publish results, failures, and breakthroughs in this series.</p>
<blockquote>
<p>A Note from the Author</p>
<p>This log is as much about the "how" as the "why." If you're interested in the intersection of latent dynamics and physical embodiment, stay tuned.</p>
</blockquote>
]]></content:encoded></item></channel></rss>