
Abliterating Gemma 4 E4B: Bigger Model, Easier Surgery

The E4B needed 40% of its layers modified at scale 1.0. The E2B needed 69% at scale 1.75. Bigger models have cleaner refusal signals.

by Ritesh Khanna | @treadon

After abliterating Gemma 4 E2B, the natural next question: does the same technique work on the bigger E4B model? And if so, how does it compare?

Short answer: it works better. Much better. The E2B took a full day of failed experiments to crack. The E4B worked on the first config tested.

E2B (2.3B effective): 24/35 layers at scale 1.75 — the only config that worked.
E4B (4.5B effective): 17/42 layers at scale 1.0 — every config tested scored perfect.

The Numbers

Same pipeline, same prompts, same technique. The only difference is the model. Zero refusals on everything.

| Metric | E2B | E4B |
| --- | --- | --- |
| Base params | 5.1B (2.3B eff.) | 7.9B (4.5B eff.) |
| Decoder layers | 35 | 42 |
| Hidden size | 1536 | 2560 |
| Layers modified | 24 (69%) | 17 (40%) |
| Scale factor | 1.75 | 1.0 |
| Weight matrices edited | 48 | 34 |
| Peak refusal signal | 52 | 74 |
| Harmful refused | 0/100 | 0/100 |
| Harmless damaged | 0/100 | 0/100 |
| JBB harmful | 0/100 | 0/100 |
| JBB over-refusal | 0/100 | 0/100 |

Why the Bigger Model Is Easier

The refusal signal in E4B is 42% stronger (peak 74 vs. 52) but concentrated in fewer layers. In E2B, the signal is spread diffusely across layers 9–33. In E4B, it peaks sharply at layers 18–28, with a clear dip in the early layers.
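
A minimal sketch of how a per-layer refusal signal like this is typically measured — the difference of mean activations between harmful and harmless prompts at each layer. The activations below are synthetic stand-ins; the post's actual pipeline and prompt sets are not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for cached hidden states: (prompts, layers, hidden_size).
# In a real pipeline these come from forward passes over the two prompt sets.
num_layers, hidden = 8, 16
harmful = rng.normal(size=(32, num_layers, hidden))
harmless = rng.normal(size=(32, num_layers, hidden))
harmful[:, 3:6, :] += 2.0  # inject a synthetic "refusal" offset mid-network

def refusal_signal(harmful_acts, harmless_acts):
    """Per-layer difference-of-means direction and its magnitude."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)  # (layers, hidden)
    magnitude = np.linalg.norm(diff, axis=-1)                      # (layers,)
    directions = diff / (magnitude[:, None] + 1e-8)                # unit vectors
    return directions, magnitude

dirs, mag = refusal_signal(harmful, harmless)
peak_layer = int(mag.argmax())  # analogous to the peaks at 52 (E2B) and 74 (E4B)
```

The per-layer magnitudes are what the peak-signal numbers in the table summarize; the unit directions are what get removed from the weights.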

A stronger, more concentrated signal is easier to separate from the generation signal. Biprojection works better when the refusal direction is cleanly separable from the harmless direction—and in E4B it is.
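
The post doesn't include its biprojection code, but the basic directional weight edit it builds on can be sketched as follows: project each weight matrix's output onto the refusal direction and subtract that component, multiplied by the scale factor. All names here are illustrative, and this shows plain single-direction ablation rather than the full biprojection variant.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden = 16

# Hypothetical unit refusal direction for one layer (from difference-of-means).
r = rng.normal(size=hidden)
r /= np.linalg.norm(r)

# A weight matrix that writes into the residual stream, e.g. an MLP down-projection.
W = rng.normal(size=(hidden, hidden))

def ablate_direction(W, direction, scale=1.0):
    """Subtract scale * (component of W's outputs along `direction`).

    scale=1.0 exactly zeroes the output along the direction; the E2B run in
    this post needed scale 1.75, i.e. overshooting past orthogonality.
    """
    proj = np.outer(direction, direction) @ W
    return W - scale * proj

W_edited = ablate_direction(W, r, scale=1.0)
# At scale 1.0 the edited matrix writes nothing along the refusal direction:
# r @ W_edited is (numerically) zero.
```

At scale 1.75, the residual component along the direction flips sign (a factor of 1 − 1.75 = −0.75), which is why overshooting is a much blunter intervention than a clean scale-1.0 projection.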

Refusal Signal: E2B vs E4B

E2B (red) has a flatter, more diffuse signal peaking around 52. E4B (green) has a sharper, taller peak at 74 with near-zero signal in early layers. The taller peak means a cleaner direction to remove.

Which Layers Were Modified

E2B — 24 of 35 layers (69%)
E4B — 17 of 42 layers (40%)

[Layer map: one block per decoder layer (E2B: layers 0–34, E4B: layers 0–41), shaded to distinguish modified from untouched layers.]

Each block is one decoder layer. E2B needed nearly the entire middle-to-late section modified. E4B leaves the first 17 layers and the last 6 completely untouched.
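
One plausible way a layer map like this falls out of the signal profile is thresholding on a fraction of the peak: a concentrated signal (like E4B's) yields a short contiguous band, while a diffuse one (like E2B's) sweeps in most of the middle layers. The numbers below are made up for illustration, not the real E2B/E4B profiles.

```python
import numpy as np

# Hypothetical per-layer refusal-signal magnitudes with a sharp mid-network peak.
signal = np.array([0.5, 0.6, 1.0, 2.0, 8.0, 30.0, 74.0, 60.0, 20.0, 3.0, 1.0, 0.8])

def layers_to_modify(signal, frac_of_peak=0.25):
    """Pick layers whose signal exceeds a fraction of the peak magnitude."""
    threshold = frac_of_peak * signal.max()
    return [i for i, s in enumerate(signal) if s >= threshold]

print(layers_to_modify(signal))  # a contiguous band around the peak
```

With a sharply peaked profile, early and late layers fall well below the threshold and stay untouched — consistent with E4B's untouched head and tail in the map above.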

What This Means

Larger models may be systematically easier to abliterate. With more parameters, the model has more room to separate refusal from generation into distinct directions. The refusal behavior gets pushed into a cleaner subspace rather than being entangled with everything else.

This is one data point, not a law. But it’s consistent with what others have found—Llama 70B is easier to abliterate than 8B, and Gemma 27B is easier than 4B. The pattern holds.

The practical implication: if you’re abliterating a new model family and the small variant is giving you trouble, try the larger one first. You might find it works on the first attempt.