Learn
Comprehensive guides on evaluating AI image generation — from automated metrics to human judgment.
01 · 8 min
Introduction to Image Evaluation
Why evaluating image generation matters, and the two sides: automated metrics vs human judgment.
02 · 12 min
Automated Metrics
FID, CLIP Score, LPIPS, VQAScore — what they measure, when to use them, and common pitfalls.
03 · 10 min
Human Evaluation
Elo rankings, pairwise preference, rater calibration, and LLM-as-a-Judge approaches.
04 · 9 min
Comparing Image Models
Quality, speed, cost, consistency — how to compare fairly and find the Pareto frontier.
05 · 11 min
Prompt Fidelity & Compositionality
Does the image match the text? Measuring attribute binding, spatial reasoning, and counting.
06 · 7 min
Consistency & Reproducibility
Same prompt, different outputs — measuring variance and why it matters for production.
07 · 8 min
Common Failures
Bad hands, text rendering, artifacts — why they happen and failure rates by model.
08 · 6 min
Cost, Speed & Deployment
API pricing, latency, throughput — the cost × quality × speed tradeoff.
09 · 10 min
Safety & Bias
NSFW content, demographic bias, IP concerns, red teaming, and the EU AI Act.
10 · 5 min
Glossary
A–Z reference of every term used across the guides.