Guide to Multimodal Benchmarks

Categories: LLMs, AI

Author: Branden Collingsworth

Published: May 10, 2025

A compact field manual for evaluating vision-language models.

Why benchmarks matter

Modern multimodal LLMs are judged by how well they see, read, reason, and chat. Public benchmarks give you:

  • a yardstick (direct comparability),
  • a stress-test (spotting failure modes), and
  • a fast feedback loop for model or prompt tweaks.

The landscape below is grouped by task family, followed by the metrics and leaderboards you’ll meet most often.

1 Core Task Families & Flagship Datasets

| Task | Representative benchmarks | One-liner |
|---|---|---|
| Visual QA | VQAv2 (Visual Question Answering) · GQA · OK-VQA (okvqa.allenai.org) | Open-ended Q&A; GQA stresses compositionality, OK-VQA injects external knowledge. |
| Image Captioning | MS-COCO Captions · NoCaps | Describe images; NoCaps probes novel-object generalisation. |
| Diagnostic Reasoning | CLEVR | Synthetic 3-D scenes that isolate counting, comparison, logic. |
| Real-image Reasoning | NLVR2 (lil.nlp.cornell.edu) · VCR (visualcommonsense.com) | NLVR2 pairs two images with true/false statements; VCR adds commonsense inference & rationales. |
| Grounding / Localisation | RefCOCO family · Flickr30k Entities | Map phrases to boxes; evaluate phrase-level grounding. |
| Document & OCR QA | DocVQA (docvqa.org) | Read forms, IDs, invoices; combines OCR and layout understanding. |
| Holistic Diagnostics | MME · MMBench | 14-subtask perception + cognition suite (MME); skill-tagged test bank (MMBench). |
| Exam-style Multidiscipline | MMMU (mmmu-benchmark.github.io) | 11.5K college-level problems spanning six domains. |
| Multimodal Chat | LLaVA-Bench | Free-form conversations grounded in images, memes, sketches. |
| Math-in-Vision | MathVista | 6,141 diagram-based problems drawn from 28 sources plus newly created IQ-test sets. |
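
Most of these datasets are mirrored on the Hugging Face Hub, so inspecting the raw examples takes only a few lines of Python. The sketch below is a minimal example, assuming the `HuggingFaceM4/VQAv2` mirror and its column names; substitute whichever copy of the benchmark you actually trust.

```python
# Minimal sketch: peek at a benchmark straight from the Hugging Face Hub.
# The repo ID "HuggingFaceM4/VQAv2" and its column names are assumptions --
# check the Hub page of the mirror you use (some mirrors also need
# trust_remote_code=True on recent versions of the datasets library).
from datasets import load_dataset

# streaming=True avoids downloading the full image archive up front
vqa = load_dataset("HuggingFaceM4/VQAv2", split="validation", streaming=True)

for i, example in enumerate(vqa):
    print(example["question"], "->", example["multiple_choice_answer"])
    if i >= 4:  # inspect only the first five examples
        break
```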

2 Metrics Cheat-Sheet

| Metric | What it scores | Use cases |
|---|---|---|
| BLEU | n-gram precision vs. references | classic MT & captions |
| CIDEr / CIDEr-D | TF-IDF-weighted consensus with references | COCO caption leaderboard metric |
| SPICE | scene-graph semantic overlap | counts/attributes in captions |
| CLIPScore | image-text cosine similarity (reference-free) | caption quality without references |
| Exact Match & F1 | string or token overlap | VQA / span QA |
| Accuracy / soft VQA accuracy | correct answers (full credit once ≥3 of 10 annotators agree) | classification, VQA |
| mAP | mean average precision over IoU thresholds | detection / grounding |
| IoU | box overlap ratio | localisation (typically thresholded at ≥0.5) |
| Recall@k, MRR, nDCG, Hits@k | ranking relevance | retrieval & KG tasks |
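
Two of the simpler metrics above are easy to reimplement yourself, which helps when sanity-checking an evaluation harness. The sketch below uses the common simplified form of VQA soft accuracy (min(matches/3, 1), without the official answer normalisation) and corner-format IoU; the function names are illustrative.

```python
# Illustrative reference implementations of two metrics from the table above.

def vqa_soft_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA soft accuracy: min(#annotators giving this answer / 3, 1)."""
    pred = prediction.strip().lower()
    matches = sum(ans.strip().lower() == pred for ans in human_answers)
    return min(matches / 3.0, 1.0)

def iou(box_a, box_b) -> float:
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) corner format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# 4 of 10 annotators said "red", so the prediction earns full credit (1.0).
print(vqa_soft_accuracy("red", ["red", "red", "red", "red", "maroon",
                                "crimson", "brick", "dark red", "pink", "rust"]))
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.14, below the 0.5 threshold
```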

3 Where models rank today

| Leaderboard | Scope | Why check it |
|---|---|---|
| HF Open VLM Leaderboard | Open- and closed-source VLMs on ≈20 public tests | Quick scorecard for GPT-4o, Gemini Ultra, Claude 3, etc. |
| MME & MMEB leaderboards | Fine-grained sub-skills | Spot perception vs. cognition gaps. |

Take-aways for practitioners

  1. Match task to metric. Use CIDEr/SPICE for captions, IoU/AP for grounding, and BLEU/F1 for free-text QA.
  2. Cross-check robustness. Pass VQAv2? Try OK-VQA or MMMU to probe knowledge depth.
  3. Follow the boards. New models leap-frog weekly; the Open VLM board is the fastest pulse-check.
  4. Look beyond averages. Skill-tagged suites (MME, MMBench) reveal hidden weaknesses even when headline scores look strong.
  5. Re-run locally. Most datasets are on Hugging Face; integrate them into CI to catch drift after each fine-tune (see the sketch after this list).
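
As a concrete illustration of point 5, here is a bare-bones regression gate you could drop into CI. It is a sketch under assumptions: `my_model` stands in for your own inference call, the dataset ID mirrors the earlier example, and the exact-match threshold is a placeholder you would pin from a known-good run.

```python
# Sketch of a CI regression check: re-score a fixed benchmark slice after each
# fine-tune and fail the build if exact-match accuracy drops below a pinned value.
import sys
from datasets import load_dataset

EM_THRESHOLD = 0.55  # placeholder; pin this from your last known-good run

def my_model(image, question: str) -> str:
    """Placeholder for your VLM inference call."""
    raise NotImplementedError("call your model or API here")

def normalise(text: str) -> str:
    return text.strip().lower()

# A small fixed slice keeps the check fast and deterministic.
ds = load_dataset("HuggingFaceM4/VQAv2", split="validation[:200]")

correct = 0
for ex in ds:
    pred = my_model(ex["image"], ex["question"])
    correct += int(normalise(pred) == normalise(ex["multiple_choice_answer"]))

accuracy = correct / len(ds)
print(f"exact match: {accuracy:.3f}")
sys.exit(0 if accuracy >= EM_THRESHOLD else 1)  # non-zero exit fails the CI job
```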