ImageBench

The only generative image benchmark that shows the images

12 models, 192 prompts, 6 categories, and every output published. Judge with your own eyes which model best fits your use case, your budget, and your quality bar.

Text Rendering / Typography Style / Easy / openai/gpt-image-2

Prompt: The word 'CHAPTER ONE' typed on aged paper with a vintage typewriter font, complete with slightly uneven ink

V1 Leaderboard

192 prompts, 6 categories, graded pass/fail by VLM judges.

| # | Model | Pass Rate | Pass / Fail | Avg Latency |
|---|---|---|---|---|
| 1 | openai/gpt-image-2 | 96.4% | 185/7 | 45.3s |
| 2 | fal/google/nano-banana-2 | 95.3% | 183/9 | 28.1s |
| 3 | bfl/flux-2-max | 91.7% | 176/16 | 26.7s |
| 4 | fal/google/nano-banana-pro | 91.1% | 175/17 | 23.4s |
| 5 | bfl/flux-2-pro | 83.3% | 160/32 | 11.8s |
| 6 | bfl/flux-2-klein-9b | 78.6% | 151/41 | 4.1s |
| 7 | gx10/bonsai-image-4b | 76.0% | 146/46 | 4.1s |
| 8 | z-image-local/z-image-turbo | 75.5% | 145/47 | 18.1s |
| 9 | bfl/flux-2-klein-4b | 74.0% | 142/50 | 3.8s |
| 10 | qwen-image-local/qwen-image-gen | 70.8% | 136/56 | 80.2s |
| 11 | nucleus-local/nucleus-image | 67.2% | 129/63 | 39.1s |
| 12 | sana-local/sana-1.5-1.6b | 53.1% | 102/90 | 11.1s |
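
The pass rates in the leaderboard appear to follow directly from the pass/fail counts, and the latency column invites a quality-versus-speed comparison. The sketch below is only an illustration under stated assumptions: it assumes pass rate = passes / (passes + fails), and the names `Row`, `pass_rate`, and `pareto_frontier` are hypothetical helpers, not part of ImageBench. It recomputes the pass rate from the published counts and lists the models that no other model beats on both pass count and average latency.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    """One leaderboard row, copied from the published V1 figures."""
    model: str
    passed: int
    failed: int
    latency_s: float


ROWS = [
    Row("openai/gpt-image-2", 185, 7, 45.3),
    Row("fal/google/nano-banana-2", 183, 9, 28.1),
    Row("bfl/flux-2-max", 176, 16, 26.7),
    Row("fal/google/nano-banana-pro", 175, 17, 23.4),
    Row("bfl/flux-2-pro", 160, 32, 11.8),
    Row("bfl/flux-2-klein-9b", 151, 41, 4.1),
    Row("gx10/bonsai-image-4b", 146, 46, 4.1),
    Row("z-image-local/z-image-turbo", 145, 47, 18.1),
    Row("bfl/flux-2-klein-4b", 142, 50, 3.8),
    Row("qwen-image-local/qwen-image-gen", 136, 56, 80.2),
    Row("nucleus-local/nucleus-image", 129, 63, 39.1),
    Row("sana-local/sana-1.5-1.6b", 102, 90, 11.1),
]


def pass_rate(row: Row) -> float:
    """Assumed definition: percentage of prompts graded as a pass."""
    return 100.0 * row.passed / (row.passed + row.failed)


def pareto_frontier(rows: List[Row]) -> List[Row]:
    """Keep rows that no other row beats on both pass count and latency."""

    def dominates(a: Row, b: Row) -> bool:
        # a is at least as good on both axes and strictly better on one.
        no_worse = a.passed >= b.passed and a.latency_s <= b.latency_s
        better = a.passed > b.passed or a.latency_s < b.latency_s
        return no_worse and better

    return [r for r in rows if not any(dominates(o, r) for o in rows)]


if __name__ == "__main__":
    for row in ROWS:
        print(f"{row.model:34s} {pass_rate(row):5.1f}%  {row.latency_s:5.1f}s")
    print("Not dominated on pass count and latency:")
    for row in pareto_frontier(ROWS):
        print("  " + row.model)
```

Treating latency as the only cost is a simplification; price per image or local hardware requirements, which the leaderboard does not list, could change which models are worth considering.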

What we evaluate

Each model is tested across 6 categories with 192 prompts spanning easy to extreme difficulty.

Text Rendering: Typography accuracy, writing correctness across difficulty levels
Spatial Reasoning: Compositionality, counting, relative position, scale & proportions
Human Realism: Faces, expressions, hands, full body, multi-subject coherence
Truthfulness: Physics, reflections, photorealism, world knowledge
Professional Studio: Camera & lighting, color precision, photorealistic quality
Graphical Design: Layout, data visualisation, style diversity

See how every model performs

Compare models side-by-side with our interactive benchmark explorer.