Commit 683fc91
fix: use exp2 for norm_qk_scale to correctly exponentiate log-space parameter
Bug: In MultiHeadAttention, when normalize_qk=True, the learned scale parameter 'norm_qk_scale' was initialized with nn.initializers.constant(jnp.log2(seq_len_kv**2 - ...)), storing a log2-space value, but was then passed directly to _dot_product_attention as a linear scale factor.
This means:
- The initial scale ≈ 2*log2(seq_len_kv) instead of seq_len_kv^2 as intended
- The parameter semantics are broken: gradient updates act on the wrong manifold
- For seq_len_kv=64, the actual initial scale is ~12, not 4096 (see the quick check below)
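
A quick numeric check of that last bullet. This is plain JAX arithmetic; `seq_len_kv` is the only name taken from the commit, the rest is illustrative:

```python
import jax.numpy as jnp

seq_len_kv = 64
intended_scale = seq_len_kv ** 2                 # 4096, the linear scale the init was meant to produce
stored_value = float(jnp.log2(seq_len_kv ** 2))  # 12.0, the log2 value actually stored in the parameter
# The buggy code passed stored_value straight through as the attention scale,
# so the effective initial scale was ~12 rather than 4096.
print(intended_scale, stored_value)  # 4096 12.0
```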
Fix:
1. Change the initializer to nn.initializers.zeros_init() so the parameter represents a log2-space scale (exp2(0) = 1 is a sensible default)
2. Add scale = jnp.exp2(scale) after the self.param() call to correctly convert to linear space before use
This matches the intent of storing the scale in log space for unconstrained optimization while ensuring the linear scale is always positive.
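
A minimal sketch of the corrected parameter handling. Only the 'norm_qk_scale' lines mirror the fix described above: `norm_qk_scale`, `normalize_qk`, `nn.initializers.zeros_init()`, and `jnp.exp2` come from the commit message, while the surrounding module body and the inlined stand-in for _dot_product_attention are assumptions for illustration:

```python
import jax
import jax.numpy as jnp
import flax.linen as nn


class MultiHeadAttentionSketch(nn.Module):
    """Illustrative module; only the norm_qk_scale handling mirrors the fix."""
    normalize_qk: bool = True

    @nn.compact
    def __call__(self, q, k, v):
        if self.normalize_qk:
            # Normalize q/k so the learned scale controls logit magnitude.
            q = q / (jnp.linalg.norm(q, axis=-1, keepdims=True) + 1e-6)
            k = k / (jnp.linalg.norm(k, axis=-1, keepdims=True) + 1e-6)
            # Fix part 1: initialize the parameter at 0 in log2 space
            # (exp2(0) == 1, a sensible default linear scale).
            scale = self.param('norm_qk_scale', nn.initializers.zeros_init(), ())
            # Fix part 2: exponentiate to convert the log2-space parameter
            # to an always-positive linear scale before use.
            scale = jnp.exp2(scale)
        else:
            scale = 1.0 / jnp.sqrt(q.shape[-1])
        # Hypothetical stand-in for _dot_product_attention.
        logits = jnp.einsum('...qd,...kd->...qk', q, k) * scale
        weights = jax.nn.softmax(logits, axis=-1)
        return jnp.einsum('...qk,...kd->...qd', weights, v)
```

Keeping the parameter in log2 space leaves the optimizer unconstrained while guaranteeing exp2(scale) > 0, which is exactly the property the concluding sentence describes.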
1 file changed: 3 additions & 4 deletions