taykolasinski's comments

taykolasinski · 2026-01-16T15:50:25 1768578625

OP here. This is Part 2 of my reproduction series. I scaled the experiment from 10M params (MacBook) to 1.7B params (8x H100s) to test DeepSeek's instability claims.

The paper reported 3,000x signal amplification. I found 10,924x.

The "Instability Bomb" findings:

- The Scaling Law: It's strictly worse at scale. 10M → 9x, 1.7B → 10k x.

- The Culprit: It's Layer 0. The first mixing matrix eats raw embeddings without LayerNorm and immediately amplifies them.

- The Twist: Despite 10,000x amplification, the model didn't diverge. It kept learning, likely saved by gradient clipping.

I’ve posted the full logs and Amax graphs in the post. Happy to answer questions about the H100 cluster setup or the Sinkhorn projection math.

taykolasinski · 2026-01-12T16:28:19 1768235299

That's interesting.

I suspect your intuition about scale is correct. The theoretical benefit of mHC is that it acts as a sort of relief valve/router for information flow in very deep/wide networks where the standard residual bottleneck becomes an issue. At 8M params, the standard residual stream is likely already perfectly adequate, so mHC might just be adding parameter overhead without solving a real signal propagation problem yet.

Quick question on your run: did you see the signal amplification/instability I saw (values growing during the forward pass)? or was it stable for you, just neutral on loss?

Scene_Cast2 · 2026-01-12T18:47:51 1768243671

My baseline was non-HC "vanilla" residuals; I didn't do a meaningful HC run to compare.

My application has some particularities (important and easy to identify per-token signals) that result in values growing (about 3x to 10x) through layers even in the baseline.

astrange · 2026-01-12T23:55:42 1768262142

> Quick question on your run: did you see the signal amplification/instability I saw (values growing during the forward pass)? or was it stable for you, just neutral on loss?

I think your brain may have been taken over by ChatGPT.

taykolasinski · 2026-01-12T16:16:33 1768234593

This is a fantastic catch. I hadn't realized Gemma 3n was already shipping with a variant of this in production.

It feels like we are entering the era of residual stream engineering. For a long time, the standard x + F(x) additive backbone was treated as untouchable. Now, between mHC (weighted scaling) and LAuReL (low-rank projections), labs are finally finding stable ways to make that signal path more dynamic.

I'm curious if the Low-Rank constraint in LAuReL acts as a natural stabilizer against the gradient explosion I saw with unconstrained hyper-connections.

Thanks for the paper link, definitely reading that tonight.

cpldcpu · 2026-01-12T17:46:46 1768240006

Thanks! Would be quite interesting to see how this fares compared to mHC.

I noted that LAuReL is cited in the mHC paper, but they refer to it as "expanding the width of the residual stream", which is rather odd.

taykolasinski · 2026-01-12T15:54:52 1768233292

Thanks! Manually created Astro components with inline SVG and CSS animations.

taykolasinski · 2026-01-12T15:08:14 1768230494

That initialization strategy (effectively starting as identity to match the standard residual stream) is clever. It would let you surgery an existing model like Llama-3 and fine-tune it into an mHC architecture.

The main risk I see is that the 7x signal amplification happens very aggressively. Even with a gentle initialization, you’d likely need very strict gradient clipping or a tiny learning rate on those new routing matrices to prevent them from blowing up the pre-trained features in the first few steps.

Also, I think there's a mix-up here between mHC (this paper, expressivity) and MLA (latent attention, which provides the massive context efficiency). mHC doesn't save memory, but it might make the model 'smarter' per parameter.

solarkraft · 2026-01-12T19:01:12 1768244472

You’re right, I totally mixed this up with MLA.

taykolasinski · 2026-01-12T14:09:12 1768226952

OP here. I spent the last few days reproducing the mHC architecture from the recent DeepSeek paper (2512.24880).

Two key takeaways from the reproduction:

Unconstrained Hyper-Connections really do explode (7x amplification even at 10M scale).

I hit a nasty "stream persistence" bug where my tensors were the right shape, but the architecture was functionally broken.

This is Part 1 (10M scale). Part 2 (scaling to 1B on A100s) is coming later this week. Happy to answer questions about the implementation.

WiSaGaN · 2026-01-12T14:48:08 1768229288

How do you know "GPT-5, Claude, Llama, Gemini. Under the hood, they all do the same thing: x+F(x)."?

taykolasinski · 2026-01-12T14:51:39 1768229499

I’m referring specifically to the fundamental residual connection backbone that defines the transformer architecture (x_{l+1} = x_l + F(x_l)).

While the sub-modules differ (MHA vs GQA, SwiGLU vs GeLU, Mixture-of-Depths, etc.), the core signal propagation in Llama, Gemini, and Claude relies on that additive residual stream.

My point here is that DeepSeek's mHC challenges that fundamental additive assumption by introducing learnable weighted scaling factors to the residual path itself.

WiSaGaN · 2026-01-12T14:54:04 1768229644

I guess I am asking how we know Gemini and Claude relies on the additive residual stream. We don't know the architecture details for these closed models?

taykolasinski · 2026-01-12T14:58:44 1768229924

That's a fair point. We don't have the weights or code for the closed models, so we can't be 100% certain.

However, transformer-based (which their technical reports confirm they are) implies the standard pre-norm/post-nnorm residual block structure. Without those additive residual connections, training networks of that depth (100+ layers) becomes difficult due to the vanishing gradient problem.

If they had solved deep signal propagation without residual streams, that would likely be a bigger architectural breakthrough than the model itself (akin to Mamba/SSMs). It’s a very high-confidence assumption, but you are right that it is still an assumption.