Porting the First MoE Image Model (17B) to Apple Silicon
Nucleus-Image packs 17B parameters into a model that only activates 2B per token. Porting it to MLX meant solving expert-choice routing, CausalConv3d, and 13 bugs between black output and photorealism.
Nucleus-Image is a 17 billion parameter text-to-image model, and the first open MoE image model I’m aware of. It uses Mixture-of-Experts to pack 17B parameters into a model that only activates ~2B per token. The idea: get the quality of a large model at the cost of a small one.
There’s no MLX port for it. Most MLX image work targets dense models like FLUX or Stable Diffusion. MoE image models are new, and porting one means solving problems nobody has solved before in MLX: expert-choice routing, packed expert weights, capacity-based token dispatch, and a VAE that uses CausalConv3d.
This is the story of porting it. It took 13 bugs to get from black output to photorealistic images.
Why MoE for Images?
Dense models use every parameter for every token. MoE models route each token to a small subset of “expert” subnetworks. This means you can scale parameters without scaling compute:
Total vs Active Parameters
Nucleus-Image has 17B total parameters but only activates 2B per token. That's fewer active parameters than ERNIE's 8B dense model.
The trade-off: MoE models need more memory (all 17B parameters must be loaded) but less compute per forward pass. This is perfect for Apple Silicon, where unified memory is plentiful (64-128GB) but GPU compute is limited. MLX’s 4-bit quantization makes the memory manageable.
The Pipeline
The text encoder stays in PyTorch. Qwen3-VL-8B uses a complex architecture with vision-language features that would take weeks to reimplement in MLX, and it runs in ~2 seconds. Not worth porting. Everything else runs in MLX.
The DiT is the interesting part. Each of the 29 MoE layers has 64 routed experts and 1 shared expert. The routing is “expert-choice”: instead of each token picking its top-2 experts (like GPT-4), each expert picks its top-C tokens. C is the “capacity”: how many tokens each expert can handle, determined by a capacity factor from the config.
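A minimal numpy sketch of expert-choice routing (not the model's actual code; the function name and the capacity formula are illustrative assumptions). The key inversion: we sort along the token axis per expert, so every expert processes exactly C tokens.

```python
import numpy as np

def expert_choice_route(scores, capacity_factor=2.0):
    """scores: [num_tokens, num_experts] router logits."""
    num_tokens, num_experts = scores.shape
    # Capacity C: how many tokens each expert keeps.
    capacity = int(capacity_factor * num_tokens / num_experts)
    # Softmax over experts for each token.
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # Each expert (a column) picks its top-C tokens -- the inverse of
    # token-choice, where each token picks its top-k experts.
    picked = np.argsort(-probs, axis=0)[:capacity]     # [C, num_experts]
    gates = np.take_along_axis(probs, picked, axis=0)  # [C, num_experts]
    return picked, gates

logits = np.random.randn(1024, 64)  # e.g. 1024 image tokens, 64 experts
picked, gates = expert_choice_route(logits)
print(picked.shape)  # (32, 64): every expert handles exactly C=32 tokens
```

Note that with expert-choice, a popular token can be picked by many experts and an unlucky token by none; that is the behavior the dispatch code has to reproduce exactly.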
The Port: Weight Loading
Step one is always the same: define the MLX architecture so that every layer name matches the PyTorch weight names exactly. If names match, weights load with zero mapping code.
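A sketch of the sanity check this implies, with plain dicts standing in for the safetensors checkpoint and the MLX module's flattened parameter names (the layer names here are made up for illustration):

```python
# Verify checkpoint keys and model parameter names match exactly,
# so weights load with zero mapping code.
def check_weight_names(checkpoint_keys, model_param_names):
    ckpt, model = set(checkpoint_keys), set(model_param_names)
    return sorted(ckpt - model), sorted(model - ckpt)  # (unused, missing)

# Toy example: one mismatched layer name shows up in both directions.
ckpt_keys = ["blocks.0.attn.q_proj.weight", "blocks.0.moe.gate.weight"]
model_keys = ["blocks.0.attn.q_proj.weight", "blocks.0.moe.router.weight"]
unused, missing = check_weight_names(ckpt_keys, model_keys)
print(unused)   # ['blocks.0.moe.gate.weight']
print(missing)  # ['blocks.0.moe.router.weight']
```

Both lists empty means every weight has a home, which is exactly the "loads perfectly" state, and exactly why loading is not the same as being correct.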
Weights loaded perfectly. The output was black.
13 Bugs to Photorealism
Getting weights to load is the easy part. Getting the model to produce correct output took finding and fixing 13 separate bugs across three debugging sessions. Each fix moved the output from black → gray → noisy color → over-saturated → photorealistic.
The hardest bug was the VAE. The original model uses CausalConv3d for video support. For single-frame images, you’d think only the center temporal slice of the 3D kernel matters. Wrong. Causal convolutions pad before the frame, not symmetrically. The padding is (2p, 0), which means the input is [0, 0, x] and only the last kernel slice fires. The center slice had 20× less weight energy. The VAE was producing near-zero output for any input.
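The padding asymmetry is easy to demonstrate in a toy numpy version (deep-learning convs are cross-correlations, so no kernel flip; the function is a stand-in, not the VAE's code). With a temporal kernel of size 3 and one frame, causal padding makes only the last slice fire:

```python
import numpy as np

def temporal_conv_single_frame(x, kernel, causal=True):
    """x: scalar frame value; kernel: [3] temporal weights."""
    if causal:
        frames = np.array([0.0, 0.0, x])  # pad (2, 0): two zeros BEFORE
    else:
        frames = np.array([0.0, x, 0.0])  # symmetric pad (1, 1)
    return float(frames @ kernel)         # conv-layer style: no flip

kernel = np.array([0.1, 0.2, 5.0])  # most weight energy in the LAST slice
print(temporal_conv_single_frame(1.0, kernel, causal=True))   # 5.0
print(temporal_conv_single_frame(1.0, kernel, causal=False))  # 0.2
```

Grab the center slice instead of the last and you get the 0.2 path: with ~20x less energy there, the "conv" collapses toward zero, which is the black-output failure mode described above.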
The most surprising bug was the negative embeddings. Classifier-free guidance (CFG) works by computing neg + scale × (pos - neg). The reference encodes an empty string as the negative, which produces a non-zero embedding with L2 norm of 15,824. We used zero vectors. That’s not “no guidance.” It’s guidance in a completely wrong direction. The result was over-saturated images with cyan fringing at every edge.
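The failure is visible even with toy vectors in place of real text embeddings (values below are illustrative, not actual embeddings). With an encoded negative, CFG extrapolates a small correction; with zeros, it just multiplies the raw signal by the guidance scale:

```python
import numpy as np

def cfg(pos, neg, scale):
    """Classifier-free guidance: neg + scale * (pos - neg)."""
    return neg + scale * (pos - neg)

pos = np.array([1.0, 2.0, 3.0])
neg_encoded = np.array([0.9, 1.8, 2.7])  # stand-in for an encoded ""
neg_zeros = np.zeros(3)

print(cfg(pos, neg_encoded, 4.0))  # [1.3 2.6 3.9]: gentle extrapolation
print(cfg(pos, neg_zeros, 4.0))    # [ 4.  8. 12.]: 4x the raw signal
```

The second case is a uniform 4x gain on everything, prompt-relevant or not, which is where the over-saturation and edge fringing came from.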
Debugging Method: Drop-In Testing
The breakthrough debugging technique: start with the fully working reference (diffusers) pipeline and swap in one MLX component at a time until quality degrades.
- MLX VAE in the reference pipeline: pixel-perfect match. VAE is correct.
- MLX post-processing in the reference pipeline: pixel-perfect match. Post-processing is correct.
- Reference DiT in the MLX pipeline: still over-saturated! The bug is in the scheduler or CFG, not the DiT.
This immediately ruled out the DiT as the source of the quality issues and pointed directly to the sigma schedule (no shift needed) and negative embeddings (encode empty string, not zeros).
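The harness behind this is almost trivially simple. A hedged sketch with toy stages standing in for the real VAE/DiT/scheduler components (names and the deliberately buggy stage are invented for illustration):

```python
import numpy as np

def run(stages, x):
    """Run a pipeline: each stage is a callable taking and returning x."""
    for stage in stages:
        x = stage(x)
    return x

reference = {"dit": lambda x: x * 2.0, "vae": lambda x: x + 1.0}
ported    = {"dit": lambda x: x * 2.0, "vae": lambda x: x + 1.5}  # buggy VAE

x0 = np.ones(4)
baseline = run(list(reference.values()), x0)
for name in reference:
    swapped = {**reference, name: ported[name]}  # swap exactly ONE stage
    out = run(list(swapped.values()), x0)
    ok = np.allclose(out, baseline)
    print(f"swap {name}: {'match' if ok else 'DEGRADED'}")
```

The first stage whose swap degrades the output is where the bug lives; everything that swaps cleanly is exonerated in one run each.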
Precision and MoE
Even with all bugs fixed, there’s a subtle quality difference from the reference. The reason: precision compounding across 29 MoE layers.
Output Correlation vs Reference (by block)
Dense blocks (0-2) match at 0.9999. Each MoE block drops ~1-2%, compounding to ~0.75 by block 31.
Each MoE block introduces ~1% error from bfloat16 precision differences in the routing decisions. 96% of token-to-expert assignments match the reference exactly. The 4% that differ come from slightly different softmax scores causing different top-C selections. Over 29 blocks, this compounds to ~0.75 overall correlation.
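The arithmetic behind that compounding is just a repeated product: if each MoE block independently keeps ~99% correlation with the reference, 29 blocks land near the observed ~0.75.

```python
# Back-of-envelope: per-block correlation retention, compounded.
per_block = 0.99   # ~1% error per MoE block from routing differences
blocks = 29        # number of MoE layers in the DiT
print(round(per_block ** blocks, 3))  # 0.747
```

This is also why the dense blocks sit at 0.9999: with no routing decision to flip, there is almost nothing to compound.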
In practice, the output is visually identical to the reference. The precision difference shows up as a slightly different “interpretation” of the prompt, not as artifacts or degradation.
Performance
512×512, 20 steps, M4 Pro 64GB
bf16 is ~12% faster than 4-bit. Memory savings of 4-bit matter more for smaller machines.
Unlike LLM inference, where quantization usually speeds things up, 4-bit is slightly slower than bf16 here. The dequantization overhead on the attention and modulation projections outweighs the memory bandwidth savings (the expert weights, which are the bulk of the model, stay in bf16 regardless).
So why use 4-bit at all? Memory. The full bf16 model takes ~34GB of RAM just for weights. Add the text encoder (~16GB) and VAE (~1GB) and you’re at 50GB. On a 64GB machine this fits but barely. 4-bit quantization brings the DiT down to ~8GB, total footprint ~25GB. That’s the difference between “runs on my 32GB laptop” and “needs a 64GB machine.” MLX’s built-in nn.quantize() makes this a one-line change.
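The memory numbers above follow from simple byte-counting (a rough estimate: 2 bytes/param for bf16, ~0.5 bytes/param for 4-bit, ignoring the small per-group scale/bias overhead of quantization):

```python
# Back-of-envelope weight memory for the 17B-parameter DiT.
params = 17e9
bf16_gb = params * 2 / 1e9    # bf16: 2 bytes per parameter
q4_gb = params * 0.5 / 1e9    # 4-bit: ~0.5 bytes per parameter
print(f"bf16 DiT: ~{bf16_gb:.0f} GB, 4-bit DiT: ~{q4_gb:.1f} GB")
```

That ~34 GB vs ~8.5 GB gap is the whole case for 4-bit here: not speed, just fitting the model next to the text encoder on a 32GB machine.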
The Results
512×512, 30 steps, CFG 4.0, 4-bit quantized on M4 Pro:
What I Learned
- MoE image models are different from MoE language models. Expert-choice routing (experts pick tokens) vs token-choice (tokens pick experts) changes the entire dispatch implementation. And the SwiGLU split convention differs between dense FFN and routed experts in the same model.
- CausalConv3d is not Conv3d with center slice. Causal padding is one-sided. For T=1, the last kernel slice fires, not the center. This was a 20× magnitude error that made the VAE look broken.
- Drop-in debugging is powerful. Swapping one component at a time between reference and port immediately isolates where quality degrades. It found two bugs in 10 minutes that I’d spent hours on.
- Negative embeddings matter enormously for CFG. Zero vectors are not “unconditional.” The encoded empty string has L2 norm of 15,824. Using zeros makes CFG amplify the raw signal instead of the prompt-specific signal.
- Precision compounds in MoE. Each MoE block introduces ~1% error from routing differences. Over 29 blocks, this compounds. Dense blocks are nearly exact (0.9999 correlation). The MoE routing’s sensitivity to softmax precision is the bottleneck.
Code & weights: huggingface.co/treadon/mlx-nucleus-image
Source: github.com/treadon/mlx-nucleus-image
Base model: NucleusAI/Nucleus-Image (Apache 2.0)