ImageBench

Common Failures

Every image generation model fails. The interesting questions are how they fail, how often, and why. Understanding failure modes is essential for setting realistic expectations, designing evaluation benchmarks, and choosing the right model for production use.

Anatomy Errors

Anatomy errors — particularly hands and fingers — are the most widely recognized failure mode in AI image generation. They're also among the most persistent, surviving multiple architecture generations.

Hands and Fingers

Hand generation remains the single most common anatomy failure, accounting for an estimated 42% of anatomy-related errors across diffusion models. Typical errors include:

  • Wrong finger count: 4 or 6 fingers instead of 5
  • Fused fingers: adjacent fingers merged into a single mass
  • Impossible articulation: joints bending in anatomically impossible directions
  • Missing or extra joints: knuckles that don't align with finger structure
  • Asymmetric hands: left and right hands with different proportions in the same image

Why hands are so hard: human hands have 27 bones and 20+ degrees of freedom, creating enormous pose variability. Training data contains hands at every angle, scale, and occlusion level — but the frequency of any specific hand configuration is low. The model must generalize from sparse examples of each pose to the full continuous space of valid configurations.

State of the art (2025): Flux Pro and GPT Image 1 have reduced hand failure rates to roughly 5–10% on standard portrait prompts, down from 30–40% for SDXL. The improvement comes primarily from higher-resolution training data and improved attention mechanisms, not hand-specific fixes.

Other Anatomy Issues

  • Teeth and mouths: overcrowded, misaligned, or melting teeth in close-up portraits
  • Eyes: asymmetric pupil sizes, misaligned gaze direction, floating irises
  • Limbs: arms emerging from wrong positions on the torso, legs misaligned with hips
  • Proportions: heads too large for bodies, child-like proportions on adult faces

Text Rendering Failures

Generating readable text within images is one of the hardest tasks for diffusion models — and one of the most commercially important for product mockups, signs, UI screenshots, and marketing materials.

Common Text Failures

  • Character substitution: wrong letters, especially in words longer than 4–5 characters
  • Omission and duplication: missing letters or repeated sequences ("COFFEEE", "RESTRANT")
  • Spatial distortion: letters that warp, overlap, or float off baselines
  • Style inconsistency: mixed fonts within a single word
  • Illegibility: characters that resemble letterforms but aren't readable

Why Text Is Hard

Diffusion models learn in pixel/latent space, not character space. They don't have a concept of "the letter A" as a discrete symbol — they learn visual patterns that correlate with text-containing images in training data. This means:

  1. Long text sequences are exponentially harder: if each character renders correctly with probability p, an n-character word succeeds at roughly p^n
  2. Uncommon words have fewer training examples to learn from
  3. The model has no spell-checking mechanism — it generates shapes, not characters
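Because the model generates shapes rather than characters, text-rendering benchmarks typically score outputs by running OCR on the image and comparing the recognized string to the prompted string with edit distance. A minimal sketch of that scoring step (the strings below stand in for real OCR output; the function names are illustrative, not from any specific benchmark):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def text_render_score(prompted: str, recognized: str) -> float:
    """1.0 = rendered exactly; 0.0 = nothing recoverable."""
    if not prompted:
        return 1.0
    d = edit_distance(prompted.upper(), recognized.upper())
    return max(0.0, 1.0 - d / len(prompted))

# The duplication and omission failures from the list above:
print(text_render_score("COFFEE", "COFFEEE"))       # one extra letter
print(text_render_score("RESTAURANT", "RESTRANT"))  # two dropped letters
```

Normalizing by prompt length lets a single score compare short and long strings, which matters because failure rates climb steeply with string length.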

Current State

As of 2025, Ideogram 2 and GPT Image 1 lead in text rendering. Ideogram specifically optimized for text placement with glyph-aware training. DALL·E 3 and Midjourney v6 handle short text (1–3 words) reasonably but degrade on longer strings. Stable Diffusion models without text-specific fine-tuning fail on nearly all text rendering tasks.

Spatial Reasoning Breakdowns

Spatial reasoning requires the model to correctly place objects relative to each other based on the text prompt. This is fundamentally a compositionality challenge.

Typical Failures

  • Attribute binding: "A red sphere and a blue cube" produces a blue sphere and red cube — attributes swap between objects
  • Counting: "Three cats on a sofa" produces 2 or 4 cats. Exact counts above 3 are unreliable for most models
  • Spatial relations: "A cat sitting on top of a table" may place the cat beside or under the table
  • Relative size: "A large elephant and a small mouse" may produce similar-sized animals
  • Negation: "A room with no people" frequently includes people. Models are notoriously poor at interpreting "not" and "without"
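Compositional benchmarks probe attribute binding by generating prompt sets systematically, pairing each prompt with the bindings an evaluator (human or VQA model) should check. A minimal sketch in the spirit of T2I-CompBench (the word lists and output schema are illustrative, not the benchmark's actual data):

```python
import itertools

COLORS = ["red", "blue", "green", "yellow"]
OBJECTS = ["sphere", "cube", "cone"]

def binding_prompts():
    """Two-object prompts where each object gets a distinct color.

    The expected bindings travel with the prompt, so an evaluator can
    check each (object, color) pair and catch attribute swaps.
    """
    cases = []
    for c1, c2 in itertools.permutations(COLORS, 2):
        for o1, o2 in itertools.permutations(OBJECTS, 2):
            cases.append({
                "prompt": f"a {c1} {o1} and a {c2} {o2}",
                "expected": {o1: c1, o2: c2},
            })
    return cases

cases = binding_prompts()
print(len(cases))            # 12 color orderings x 6 object orderings = 72
print(cases[0]["prompt"])    # "a red sphere and a blue cube"
```

Ordering matters: including both "a red sphere and a blue cube" and "a blue cube and a red sphere" separates binding errors from position bias in the prompt.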

Benchmark Data

T2I-CompBench (2023) evaluated spatial reasoning across leading models with these findings:

  • Attribute binding accuracy: 35–65% across models (DALL·E 3 highest at ~65%)
  • Spatial relation accuracy: 15–40% (all models struggle significantly)
  • Counting accuracy (1–4 objects): 40–70%, dropping sharply above 3

These numbers mean that even the best models get spatial relationships wrong more often than they get them right for complex prompts.

Artifact Types

Visual artifacts are rendering imperfections that break immersion or reduce image quality.

Blur and Softness

  • Global blur: entire image lacks sharpness, common at lower inference step counts
  • Local blur: specific regions (typically backgrounds or fine detail) are soft while the foreground is sharp
  • Motion blur artifacts: streak-like patterns in static scenes

Color Bleeding

Colors from one object leak into adjacent regions. Common in scenes with strong color contrasts — red flowers on green stems, bright clothing against skin. This stems from imperfect attention, where feature maps for adjacent regions interfere with each other.

Tiling and Repetition

Visible repeating patterns, especially in textures like grass, fabric, or walls. The model generates a small patch and tiles it, creating unnatural periodicity. More common in models with lower-resolution latent spaces.
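One heuristic for flagging this failure automatically is autocorrelation: a tiled texture correlates almost perfectly with itself at the tile offset, while natural textures do not. A NumPy sketch (the mask radius and decision threshold are illustrative choices, not established values):

```python
import numpy as np

def periodicity_score(patch: np.ndarray) -> float:
    """Ratio of the strongest off-center autocorrelation peak to the
    zero-lag peak. Near 1.0 suggests a repeating (tiled) texture;
    natural textures score much lower. A rough heuristic, not a
    production detector.
    """
    x = patch - patch.mean()
    f = np.fft.fft2(x)
    ac = np.fft.ifft2(f * np.conj(f)).real   # circular autocorrelation
    ac = np.fft.fftshift(ac)                 # move zero lag to the center
    h, w = ac.shape
    zero_lag = ac[h // 2, w // 2]
    masked = ac.copy()                       # suppress the trivial peak
    masked[h // 2 - 2:h // 2 + 3, w // 2 - 2:w // 2 + 3] = -np.inf
    return float(masked.max() / zero_lag)

rng = np.random.default_rng(0)
tile = rng.random((8, 8))
tiled = np.tile(tile, (4, 4))   # perfectly repeating texture
noise = rng.random((32, 32))    # non-repeating texture
print(periodicity_score(tiled), periodicity_score(noise))
```

The tiled patch scores near 1.0 because shifting it by one tile period reproduces it exactly; the random patch's off-center peaks stay small.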

Edge Artifacts

  • Halo effects: bright or dark outlines around objects where they meet the background
  • Aliasing: jagged edges on curves and diagonals
  • Seam lines: visible boundaries where inpainted or tiled regions meet

Uncanny Valley Effects

Generated faces that are almost right but trigger discomfort: overly smooth skin, glassy eyes, pixel-perfect facial symmetry, or subtly wrong lighting on skin. These artifacts are hard to quantify but significantly impact human preference ratings.

Object Hallucination and Duplication

Hallucination

The model generates objects that were not requested and don't belong in the scene:

  • Extra animals or people in the background
  • Phantom objects partially visible at frame edges
  • Watermarks or logos from training data bleeding through
  • UI elements or text overlays that weren't requested

Duplication

The model generates the same object multiple times when only one was requested. Particularly common with prominent subjects (people, animals) where the model "hedges" by including multiple instances.

Why These Failures Happen

The root causes map to fundamental properties of how diffusion models work.

Training Data Limitations

Models learn the distribution of their training data. If hands appear in 10,000 different poses across 100 million images, each specific pose is represented by only 0.01% of the data. Rare configurations — unusual finger positions, complex interlocking hands — have even sparser representation.

Latent Space Compression

Most modern diffusion models operate in a compressed latent space via a VAE. The image is compressed from, say, 1024×1024×3 to 128×128×4 before the diffusion process operates on it. Fine spatial details — individual fingers, text characters, thin lines — may not survive this compression cleanly.
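The arithmetic makes the bottleneck concrete (using the SD/SDXL-style shapes from the example above: 8× spatial downsampling, 4 latent channels):

```python
# Pixel space vs. a typical latent space for a 1024x1024 RGB image.
pixels = 1024 * 1024 * 3     # 3,145,728 values
latents = 128 * 128 * 4      # 65,536 values
ratio = pixels / latents     # 48x compression

# Equivalently: every 8x8 RGB patch (192 values) must round-trip
# through just 4 latent values — thin structures like finger
# boundaries and letter strokes are what gets lost.
patch_values = 8 * 8 * 3
print(ratio, patch_values)   # 48.0 192
```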

Loss Function Limitations

The standard training objective (predicting noise or velocity) treats all pixels equally. A one-pixel error in a finger joint has the same loss magnitude as a one-pixel error in a sky gradient — but the perceptual impact is vastly different. Perceptual losses partially address this but aren't standard practice across all architectures.
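The pixel-equality problem can be seen directly in the loss itself. A minimal NumPy sketch contrasting uniform MSE with a perceptually weighted variant (the weight map here is a made-up stand-in for a real saliency or detail-region mask):

```python
import numpy as np

def mse(pred, target, weights=None):
    """Noise-prediction-style MSE, optionally with per-pixel weights.

    With weights=None every pixel counts equally: a one-pixel error
    in a finger joint costs the same as one in a sky gradient.
    """
    err = (pred - target) ** 2
    if weights is not None:
        err = err * weights / weights.mean()  # keep the overall scale comparable
    return err.mean()

target = np.zeros((16, 16))
pred = target.copy()
pred[3, 3] = 1.0                  # identical one-pixel error in both cases
weights = np.ones((16, 16))
weights[:8, :8] = 4.0             # pretend the top-left quadrant is a hand region

print(mse(pred, target))          # uniform: error location is irrelevant
print(mse(pred, target, weights)) # weighted: the "hand" error costs more
```

The point of the sketch is only that weighting changes the gradient signal the model receives; actual perceptual losses (e.g. LPIPS-style feature distances) are more involved than a static pixel mask.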

Attention Mechanism Bottlenecks

Cross-attention between text tokens and spatial positions is how the model associates words with image regions. With complex prompts, attention must correctly route multiple attributes to multiple objects — a many-to-many mapping that current architectures often get wrong.
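A stripped-down, single-head version of that routing step shows where binding can go wrong. In this NumPy sketch (shapes and token count are illustrative), each row of the attention matrix is a distribution over text tokens for one spatial position, and nothing in the mechanism itself forces "red" to route to the sphere's region:

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Minimal single-head cross-attention: each spatial query attends
    over all text-token keys; the softmax rows are the routing weights
    deciding which word influences which image region.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)           # (positions, tokens)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ values, attn

rng = np.random.default_rng(0)
positions, tokens, d = 64, 6, 8  # e.g. an 8x8 latent grid, a six-token prompt
q = rng.standard_normal((positions, d))
k = rng.standard_normal((tokens, d))
v = rng.standard_normal((tokens, d))
out, attn = cross_attention(q, k, v)
print(out.shape, attn.shape)     # (64, 8) (64, 6)
```

With multiple attributed objects in the prompt, correct generation requires these rows to partition cleanly by object; diffusion models learn that partitioning implicitly, which is why complex many-to-many prompts fail so often.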

State of Fixes

ControlNet and Conditioning

ControlNet (Zhang et al., 2023) allows conditioning on pose skeletons, depth maps, edge maps, and other structural guides. For anatomy errors, conditioning on a correct hand skeleton dramatically reduces malformations. The tradeoff: you need the conditioning input, which limits spontaneous generation.

Reference Image Approaches

IP-Adapter and similar methods inject reference image features into the generation process. This helps with consistency and style but can also ground anatomy to a correct reference pose.

Architecture Improvements

Transformer-based architectures (DiT, used in Flux and SD3) show improved coherence over U-Net architectures, likely due to better global attention. Higher-resolution latent spaces also help — less compression means fewer fine-detail losses.

Post-Processing and Refinement

Inpainting pipelines detect and re-generate failed regions, especially hands and faces. Two-pass approaches — generate at low resolution, then upscale with detail correction — are increasingly standard in production workflows.

None of these are complete solutions. Failure rates have decreased significantly from 2022 to 2025, but no model has eliminated them. The most honest evaluation acknowledges failure rates as a first-class metric alongside quality scores.