Why This Benchmark Matters
Understanding the importance of accurate text extraction in AI systems
While extracting subtitles is an easy task for humans, it presents significant challenges for AI. Models often struggle with exact word recognition, with reproducing spelling errors exactly as they appear in the image rather than silently correcting them, with strictly following task instructions, and with preserving formatting.
Large Language Models are increasingly used for these multimodal tasks, but even state-of-the-art models can fail to capture the precise details that humans take for granted.
When working with hardcoded subtitles in videos, maintaining precise formatting is critical. Line breaks aren't arbitrary; they're carefully placed for readability and timing. Special characters carry meaning. Even small deviations can render the extracted text unusable for downstream applications.
Formatting Preservation
Can the model maintain original text structure, including capitalization, spacing, and punctuation exactly as shown?
Line Break Accuracy
Does the model preserve intentional line breaks? This is crucial for subtitle timing and readability.
Special Characters
Can the model correctly extract emojis, symbols, and non-standard characters without modification or loss?
Accurate text extraction isn't just about correctness; it's about usability. Content creators need reliable tools for subtitle extraction, accessibility features depend on precise formatting, and automated workflows break when text formatting changes unexpectedly.
This benchmark provides a standardized way to evaluate and compare models on this critical capability, helping developers choose the right model for their text extraction needs.
This benchmark demonstrates how capable modern models have become at multimodal tasks. It also highlights which models are particularly efficient, delivering high accuracy relative to their size.
Each model is tested on the same set of hardcoded-subtitle images across multiple categories. Scores represent exact-match accuracy: a character-by-character comparison in which even a single deviation fails the test case.
This strict evaluation ensures that only models capable of truly preserving the original text formatting score well on the benchmark.
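The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration of exact-match evaluation, not the benchmark's actual harness; the function names, the sample cases, and the tuple-based input format are all assumptions made for the example.

```python
# Illustrative sketch of exact-match scoring. Each test case pairs a model's
# extracted text with the ground-truth subtitle text. Names and data are
# hypothetical, not the benchmark's real API.

def exact_match(prediction: str, reference: str) -> bool:
    """Character-by-character comparison: any deviation in capitalization,
    spacing, punctuation, line breaks, or special characters is a failure."""
    return prediction == reference

def accuracy(cases: list[tuple[str, str]]) -> float:
    """Fraction of test cases whose extraction matches the reference exactly."""
    if not cases:
        return 0.0
    return sum(exact_match(pred, ref) for pred, ref in cases) / len(cases)

# Hypothetical test cases: (model output, ground truth).
cases = [
    ("Hello,\nworld! 🌍", "Hello,\nworld! 🌍"),  # exact match: passes
    ("Hello, world! 🌍",  "Hello,\nworld! 🌍"),  # lost line break: fails
    ("hello,\nworld! 🌍", "Hello,\nworld! 🌍"),  # capitalization drift: fails
]
print(accuracy(cases))  # 1 of 3 cases passes
```

Note that under this rule there is no partial credit: a model that recovers every word but collapses a line break scores the same on that case as a model that returns nothing.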