ImageBench

Glossary

Adversarial prompts — Carefully designed prompts intended to expose model weaknesses, such as attribute binding failures, spatial reasoning errors, or counting mistakes.

Arena ranking — A ranking system derived from pairwise comparisons where models compete in head-to-head matchups, commonly using ELO or Bradley-Terry models to aggregate preferences.

CFG (Classifier-Free Guidance) — A technique that steers diffusion models toward text alignment by interpolating between conditional and unconditional predictions using a guidance scale parameter.
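
The guidance step can be sketched as a one-line extrapolation. This toy version uses plain Python lists standing in for noise-prediction tensors; `cfg_prediction` and its argument names are illustrative, not from any particular library.

```python
def cfg_prediction(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: start from the unconditional prediction
    and push toward the conditional one by the guidance scale.
    A scale of 1.0 recovers the conditional prediction exactly;
    larger scales amplify the text-conditioned direction."""
    return [u + guidance_scale * (c - u)
            for u, c in zip(eps_uncond, eps_cond)]
```

With a scale above 1.0 the result overshoots the conditional prediction, which is why high guidance scales trade diversity for stronger prompt adherence.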

CLIP — Contrastive Language-Image Pre-training, a model trained to align text and image embeddings in a shared latent space, widely used for zero-shot classification and text-image similarity.

CLIP Score — A metric measuring text-image alignment by computing cosine similarity between CLIP embeddings of a prompt and generated image; higher scores indicate stronger prompt adherence.
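
The metric reduces to a cosine similarity between two embedding vectors. A minimal sketch on toy vectors, assuming the common convention of clamping negative similarity to zero and scaling by a constant (implementations vary; 100 is used here for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def clip_score(text_emb, image_emb, scale=100.0):
    """Toy CLIP Score: scaled, non-negative cosine similarity between
    a text embedding and an image embedding."""
    return scale * max(0.0, cosine_similarity(text_emb, image_emb))
```

In practice the embeddings come from CLIP's text and image encoders; the arithmetic above is all that happens after encoding.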

Compositionality — The ability of a model to correctly combine multiple concepts, attributes, and relationships in a single image, such as binding colors to specific objects or understanding spatial arrangements.

ControlNet — An architecture that adds spatial conditioning controls (e.g., edge maps, depth maps, pose skeletons) to diffusion models, enabling precise structural guidance during generation.

DDPM (Denoising Diffusion Probabilistic Models) — A foundational class of generative models that learn to reverse a gradual noising process, generating images by iteratively denoising pure noise.

Deterministic — A process that produces the same output given identical inputs; in image generation, using the same seed and parameters yields identical results.

Diffusion — A class of generative models that produce images by gradually removing noise from random samples, guided by learned denoising steps conditioned on text or other inputs.

DrawBench — A benchmark containing 200 prompts designed to test compositionality, spatial reasoning, attribute binding, and other challenging aspects of text-to-image generation.

ELO — A rating system originally developed for chess that ranks models or images based on pairwise comparison outcomes, commonly used in preference-based evaluation.
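
A single rating update after one head-to-head comparison can be sketched as follows; the K-factor of 32 is a conventional choice, not a fixed standard.

```python
def elo_update(r_a, r_b, a_wins, k=32):
    """One ELO update after a pairwise comparison between A and B.
    The expected score follows the logistic curve with a 400-point
    scale; the winner gains what the loser forfeits."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if a_wins else 0.0
    new_a = r_a + k * (score_a - expected_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b
```

Two equally rated models swap exactly k/2 points on a decisive result; an upset against a much stronger opponent transfers nearly the full K-factor.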

EU AI Act — European Union legislation regulating high-risk AI systems, including requirements for transparency, safety, and human oversight that affect deployment of generative models.

FID (Fréchet Inception Distance) — A distributional metric comparing generated and real image sets by measuring the distance between Gaussian fits of Inception v3 features; lower is better.
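
For intuition, the Fréchet distance simplifies nicely when the Gaussians have diagonal covariance. This is a pedagogical sketch only: real FID fits full covariance matrices over Inception v3 features and needs a matrix square root.

```python
import math

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Fréchet distance between two Gaussians with diagonal covariance.
    Per dimension this is (m1 - m2)^2 + v1 + v2 - 2*sqrt(v1*v2),
    i.e. a squared mean gap plus a squared std-dev gap."""
    return sum(
        (m1 - m2) ** 2 + v1 + v2 - 2.0 * math.sqrt(v1 * v2)
        for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2)
    )
```

Identical distributions score exactly zero; either a shifted mean or a mismatched spread pushes the score up, which is why FID penalizes both fidelity and diversity failures.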

GenEval — A benchmark evaluating compositional generation across multiple dimensions including object count, attribute binding, spatial relationships, and complex prompts.

Guardrails — Safety mechanisms that filter or block inappropriate outputs, enforce content policies, or prevent misuse of generative models.

HPS (Human Preference Score) — A learned metric trained on 798K human preferences that predicts which images humans would prefer given a text prompt.

Human evaluation — Assessment of generated images by human raters, considered the gold standard but expensive and requiring careful design to minimize bias.

ImageReward — A reward model trained on 137K expert comparisons to predict human preference for text-image pairs, built on the BLIP architecture.

Inference — The process of generating outputs from a trained model; in diffusion models, this involves iterative denoising from random noise to a final image.

Inter-annotator agreement — The degree to which multiple human raters assign consistent scores or preferences to the same images, measured by metrics like Krippendorff's alpha or Fleiss' kappa.
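
Fleiss' kappa can be computed from a table of per-item category counts. A minimal sketch, assuming every item is rated by the same number of raters:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters assigning item i to category j. Returns 1.0 for perfect
    agreement, 0.0 for chance-level agreement, negative below chance."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    # Observed agreement: fraction of agreeing rater pairs per item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    # Chance agreement from the marginal category proportions.
    n_cats = len(counts[0])
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(n_cats)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1.0 - p_e)
```

Kappa corrects raw agreement for what raters would achieve by guessing according to the overall category frequencies, which is why it is preferred over simple percent agreement.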

Latent diffusion — Diffusion models that operate in a compressed latent space rather than pixel space, enabling faster generation and lower memory usage while maintaining quality.

Latent space — A compressed, learned representation space where diffusion or other generative models perform computations before decoding to pixel space.

Likert scale — A rating system where annotators assign discrete scores (e.g., 1-5) to measure quality dimensions; requires careful anchor definitions and suffers from subjective variation.

LoRA (Low-Rank Adaptation) — A parameter-efficient fine-tuning method that adapts pretrained models by training small low-rank weight updates, commonly used for style or subject customization.

LPIPS (Learned Perceptual Image Patch Similarity) — A perceptual distance metric between two images computed using deep network features; lower values indicate greater perceptual similarity.

Mode collapse — A failure mode where a generative model produces limited diversity, repeatedly generating similar outputs instead of covering the full distribution.

NSFW (Not Safe For Work) — Content containing nudity, violence, or other material inappropriate for general audiences; typically filtered by safety guardrails in production systems.

Pairwise preference — An evaluation method where annotators choose the better of two images rather than assigning absolute scores, reducing bias and improving consistency.

Pareto frontier — The set of models or configurations where no alternative is strictly better across all dimensions; represents optimal trade-offs between competing objectives like quality, speed, and cost.
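
The frontier can be extracted with a direct dominance check. This sketch assumes each model is a tuple of scores where higher is better on every axis (so a cost axis would be negated first):

```python
def pareto_frontier(points):
    """Keep the points not strictly dominated by any other point.
    q dominates p if q is at least as good everywhere and strictly
    better somewhere."""
    def dominates(q, p):
        return (all(a >= b for a, b in zip(q, p))
                and any(a > b for a, b in zip(q, p)))
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For example, with axes (quality, speed), a model at (2, 2) is dominated by one at (3, 3) and drops off the frontier, while (1, 5) and (5, 1) both survive as distinct trade-offs.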

PartiPrompts — A benchmark of 1,600+ prompts, organized by category and difficulty, that tests various aspects of text-to-image generation including complex scenes and abstract concepts.

Prompt adherence — The degree to which a generated image accurately reflects the content, attributes, and relationships specified in the text prompt.

PSNR (Peak Signal-to-Noise Ratio) — A pixel-level metric measuring reconstruction quality in decibels; rarely useful for text-to-image evaluation due to poor correlation with perceptual quality.
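
The computation is a log-scaled mean squared error. A minimal sketch over flattened pixel lists:

```python
import math

def psnr(img_a, img_b, max_val=255.0):
    """Peak signal-to-noise ratio in decibels between two equal-size
    images given as flat lists of pixel values."""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)
```

Because PSNR only sums per-pixel differences, a one-pixel shift of an otherwise perfect image tanks the score, which is the root of its poor correlation with perceived quality.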

Red teaming — Systematic adversarial testing to identify safety failures, bias amplification, harmful outputs, or policy violations in generative models before deployment.

Seed — A random number initializing the noise distribution in stochastic generation; fixing the seed enables reproducible outputs.
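
The reproducibility property can be demonstrated with the standard library; real pipelines seed their framework's RNG (and often each device) instead, but the idea is the same.

```python
import random

def generate_noise(seed, n=4):
    """Draw a reproducible 'noise' vector: the same seed always yields
    the same Gaussian samples."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]
```

Fixing the seed turns an otherwise stochastic generation into a deterministic one, which is essential for side-by-side comparisons of prompts or model versions.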

SSIM (Structural Similarity Index) — A metric comparing luminance, contrast, and structure between two images; like PSNR, it's pixel-focused and poorly suited for generative model evaluation.
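
A simplified, single-window version of the SSIM formula makes the three ingredients visible. Real SSIM averages this statistic over small local windows with Gaussian weighting; computing it once over the whole image, as here, is for illustration only.

```python
def ssim_global(x, y, max_val=255.0):
    """SSIM computed once over whole images (flat pixel lists) instead
    of averaged over local windows. Compares mean luminance, contrast
    (variance), and structure (covariance)."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * max_val) ** 2  # stabilizers from the standard formula
    c2 = (0.03 * max_val) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```

Identical images score exactly 1.0; structurally anti-correlated images can score below zero.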

Stochastic — A process involving randomness that produces different outputs across runs; in image generation, different seeds yield different results even with identical prompts.

T2I-CompBench — Text-to-Image Compositional Benchmark, a suite evaluating attribute binding, object relationships, and complex compositions through automated and human metrics.

VLM (Vision-Language Model) — A model jointly trained on images and text that can perform tasks requiring understanding of both modalities, such as image captioning or visual question answering.

VQAScore — A metric reframing image evaluation as visual question answering, measuring whether an image correctly answers questions derived from the prompt; strong for compositional evaluation.