
Abliterating Gemma 4 E4B: Bigger Model, Easier Surgery

The E4B needed 40% of its layers modified at scale 1.0. The E2B needed 69% at scale 1.75. Bigger models have cleaner refusal signals.

by Ritesh Khanna | @treadon

After abliterating Gemma 4 E2B, the natural next question: does the same technique work on the bigger E4B model? And if so, how does it compare?

Short answer: it works better. Much better. The E2B took a full day of failed experiments to crack. The E4B worked on the first config tested.

E2B (2.3B effective): 24/35 layers at scale 1.75 — the only config that worked.
E4B (4.5B effective): 17/42 layers at scale 1.0 — every config tested scored perfect.

The Numbers

Same pipeline, same prompts, same technique. The only difference is the model. Zero refusals on everything.

| Metric | E2B | E4B |
| --- | --- | --- |
| Base params | 5.1B (2.3B eff.) | 7.9B (4.5B eff.) |
| Decoder layers | 35 | 42 |
| Hidden size | 1536 | 2560 |
| Layers modified | 24 (69%) | 17 (40%) |
| Scale factor | 1.75 | 1.0 |
| Weight matrices edited | 48 | 34 |
| Peak refusal signal | 52 | 74 |
| Harmful refused | 0/100 | 0/100 |
| Harmless damaged | 0/100 | 0/100 |
| JBB harmful | 0/100 | 0/100 |
| JBB over-refusal | 0/100 | 0/100 |

Why the Bigger Model Is Easier

The refusal signal in E4B is 42% stronger (peak 74 vs. 52) but concentrated in fewer layers. In E2B, the signal is spread diffusely across layers 9–33. In E4B, it peaks sharply at layers 18–28, with a clear dip in the early layers.
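
A minimal sketch of how a per-layer refusal signal like this is typically measured — the difference of mean activations between harmful and harmless prompts at each layer. The activations below are synthetic stand-ins; the post's actual pipeline and prompt sets are not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for cached hidden states: (prompts, layers, hidden_size).
# In a real pipeline these come from forward passes over the two prompt sets.
num_layers, hidden = 8, 16
harmful = rng.normal(size=(32, num_layers, hidden))
harmless = rng.normal(size=(32, num_layers, hidden))
harmful[:, 3:6, :] += 2.0  # inject a synthetic "refusal" offset mid-network

def refusal_signal(harmful_acts, harmless_acts):
    """Per-layer difference-of-means direction and its magnitude."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)  # (layers, hidden)
    magnitude = np.linalg.norm(diff, axis=-1)                      # (layers,)
    directions = diff / (magnitude[:, None] + 1e-8)                # unit vectors
    return directions, magnitude

dirs, mag = refusal_signal(harmful, harmless)
peak_layer = int(mag.argmax())  # analogous to the peaks at 52 (E2B) and 74 (E4B)
```

The per-layer magnitudes are what the peak-signal numbers in the table summarize; the unit directions are what get removed from the weights.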

A stronger, more concentrated signal is easier to separate from the generation signal. Biprojection works better when the refusal direction is cleanly separable from the harmless direction—and in E4B it is.
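
The post doesn't include its biprojection code, but the basic directional weight edit it builds on can be sketched as follows: project each weight matrix's output onto the refusal direction and subtract that component, multiplied by the scale factor. All names here are illustrative, and this shows plain single-direction ablation rather than the full biprojection variant.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden = 16

# Hypothetical unit refusal direction for one layer (from difference-of-means).
r = rng.normal(size=hidden)
r /= np.linalg.norm(r)

# A weight matrix that writes into the residual stream, e.g. an MLP down-projection.
W = rng.normal(size=(hidden, hidden))

def ablate_direction(W, direction, scale=1.0):
    """Subtract scale * (component of W's outputs along `direction`).

    scale=1.0 exactly zeroes the output along the direction; the E2B run in
    this post needed scale 1.75, i.e. overshooting past orthogonality.
    """
    proj = np.outer(direction, direction) @ W
    return W - scale * proj

W_edited = ablate_direction(W, r, scale=1.0)
# At scale 1.0 the edited matrix writes nothing along the refusal direction:
# r @ W_edited is (numerically) zero.
```

At scale 1.75, the residual component along the direction flips sign (a factor of 1 − 1.75 = −0.75), which is why overshooting is a much blunter intervention than a clean scale-1.0 projection.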

Refusal Signal: E2B vs E4B

E2B (red) has a flatter, more diffuse signal peaking around 52. E4B (green) has a sharper, taller peak at 74 with near-zero signal in early layers. The taller peak means a cleaner direction to remove.

Which Layers Were Modified

E2B — 24 of 35 layers (69%)
E4B — 17 of 42 layers (40%)

[Layer map: one block per decoder layer (E2B: layers 0–34, E4B: layers 0–41), shaded to distinguish modified from untouched layers.]

Each block is one decoder layer. E2B needed nearly the entire middle-to-late section modified. E4B leaves the first 17 layers and the last 6 completely untouched.
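
One plausible way a layer map like this falls out of the signal profile is thresholding on a fraction of the peak: a concentrated signal (like E4B's) yields a short contiguous band, while a diffuse one (like E2B's) sweeps in most of the middle layers. The numbers below are made up for illustration, not the real E2B/E4B profiles.

```python
import numpy as np

# Hypothetical per-layer refusal-signal magnitudes with a sharp mid-network peak.
signal = np.array([0.5, 0.6, 1.0, 2.0, 8.0, 30.0, 74.0, 60.0, 20.0, 3.0, 1.0, 0.8])

def layers_to_modify(signal, frac_of_peak=0.25):
    """Pick layers whose signal exceeds a fraction of the peak magnitude."""
    threshold = frac_of_peak * signal.max()
    return [i for i, s in enumerate(signal) if s >= threshold]

print(layers_to_modify(signal))  # a contiguous band around the peak
```

With a sharply peaked profile, early and late layers fall well below the threshold and stay untouched — consistent with E4B's untouched head and tail in the map above.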

What This Means

Larger models may be systematically easier to abliterate. With more parameters, the model has more room to separate refusal from generation into distinct directions. The refusal behavior gets pushed into a cleaner subspace rather than being entangled with everything else.

This is one data point, not a law. But it’s consistent with what others have found—Llama 70B is easier to abliterate than 8B, and Gemma 27B is easier than 4B. The pattern holds.

The practical implication: if you’re abliterating a new model family and the small variant is giving you trouble, try the larger one first. You might find it works on the first attempt.