March 26, 2026aimlmusic

AI Music Is Good But Not Great. Can We Fix That?

I generated 230 AI songs across 10 genres, built a model to rate them, and used the data to 4x the hit rate of producing good tracks. All on a MacBook.

by Ritesh Khanna|@treadon

AI music generators can now produce full songs—vocals, instruments, structure—in about two minutes on a laptop. I’ve been playing with ACE-Step 1.5 and the output is genuinely impressive. But there’s a problem: most of it lands in this zone of competent mediocrity. It sounds like music. It has all the right parts. But it doesn’t move you.

I generated 230 songs across 10 genres and 6 languages. Maybe 10% were songs I’d actually want to listen to again. The rest were technically fine but emotionally flat—like elevator music that went to music school.

So I built a system to fix the hit rate.

Songs generated

Genres tested

Training time

Hit rate improvement

The Idea: Score Everything, Keep the Best

What if you had a model that could rate songs automatically? Generate a batch, score them all in seconds, keep the top scorers. A learned quality filter for music.

How It Works

The architecture is deliberately simple. Two components:

Audio

MP3 / WAV

→

MERT

330M params

→

1024-dim

embedding

→

MLP

30s to train

→

Score

0 - 10

MERT is a 330M parameter model pretrained on 160,000 hours of music. It compresses any song into 1024 numbers capturing rhythm, harmony, timbre, and structure. I freeze it and use it as a feature extractor.

The MLP scorer is four linear layers trained in 30 seconds on 8,000 songs from the Free Music Archive.

AI vs Human Music: The Quality Gap

Here’s what happens when you score AI-generated music on a scale trained on real human music:

AI songs cluster at the FMA median of 3.2. Real music has a long tail of truly great songs reaching 8–10. The best AI track (5.29, melodic techno) sits at the 67th percentile. Above average, not exceptional.

Can you tell which is the banger?

Listen to both songs on HuggingFace, then pick which one scored higher.

Song A

EDM · 130 BPM · Eb minor

Play →B

Song B

Pop · 135 BPM · F major

Play →

Which one scored higher?

200 Songs, 10 Genres, 6 Languages

I tested across Hip Hop, Pop, R&B, Latin, Bollywood, Punjabi, C-Pop, Rock, EDM, and Folk. Twenty songs each, varied BPMs and keys.

EDMmean 3.71 · best 5.29

Punjabi / Bhangramean 3.77 · best 4.26

Bollywoodmean 3.53 · best 4.38

C-Popmean 3.20 · best 4.47

Latin / Reggaetonmean 3.19 · best 3.90

Pop / Dancemean 3.05 · best 4.31

Rock / Altmean 3.03 · best 3.66

Hip Hopmean 2.92 · best 3.38

Acoustic / Folkmean 2.63 · best 3.31

R&B / Soulmean 2.62 · best 3.21

Pop quiz

Which genre had the highest mean score across 20 songs? (Not the highest single song — the most consistently good.)

EDM dominated with the highest single score (5.29). Punjabi/Bhangra had the highest mean (3.77). R&B and Folk scored lowest—both require emotional subtlety AI can’t deliver yet.

Minor keys beat major keys (3.26 vs 3.03). All top-5 songs were in minor keys.

The Optimization: 4x Hit Rate

I analyzed which parameters scored highest, then generated 30 more songs using only the winning combos. Toggle to see the difference:

Random vs Optimized

Random parameters

3.17

Mean score

20%

Above 3.5

Above 4.0

20% of songs score 3.5+

Listen for Yourself

All 230 songs are playable on HuggingFace:

Play the best songs →Play the worst songs →

Try It Yourself

Upload any song and get a banger score. Runs in your browser via HuggingFace Spaces:

Loading demo...

What This Means Beyond Music

The pattern generalizes to any generative AI pipeline:

Build a cheap scorer using transfer learning.
Generate a batch and score everything.
Feed scores back into generation parameters.

Surprises

The scorer trained in 30 seconds. Transfer learning made the hard part trivial.

The genre bias was huge. Same generator, wildly different quality by genre.

Local hardware is enough. Everything ran on a MacBook Pro M4 Pro. No cloud GPUs.

What’s Next

AI music needs to get better. The scorer is a bridge—filtering helps now, but long term it can serve as a reward model to RL-tune the generator itself.

If you can score it, you can optimize it. And you can build a useful scorer with surprisingly little.

Try the demo

Score any song on HuggingFace

View the code

GitHub repo with all scripts + data

Listen to 230 songs

Full dataset with audio player

Get the embeddings

Pre-computed MERT vectors for 8K songs