Why This Benchmark Matters
Understanding the importance of accurate text extraction in AI systems
While extracting subtitles is an easy task for humans, it presents significant challenges for AI. Models often struggle with exact word recognition, with reproducing spelling errors exactly as they appear in the image rather than silently correcting them, with strictly following task instructions, and with preserving formatting.
Large Language Models are increasingly used for these multimodal tasks, but even state-of-the-art models can fail to capture the precise details that humans take for granted.
When working with hardcoded subtitles in videos, maintaining precise formatting is critical. Line breaks aren't arbitrary; they're carefully placed for readability and timing. Special characters carry meaning. Even small deviations can render the extracted text unusable for downstream applications.
Formatting Preservation
Can the model maintain original text structure, including capitalization, spacing, and punctuation exactly as shown?
Line Break Accuracy
Does the model preserve intentional line breaks? This is crucial for subtitle timing and readability.
Special Characters
Can the model correctly extract emojis, symbols, and non-standard characters without modification or loss?
Accurate text extraction isn't just about correctness; it's about usability. Content creators need reliable tools for subtitle extraction, accessibility features depend on precise formatting, and automated workflows break when text formatting changes unexpectedly.
This benchmark provides a standardized way to evaluate and compare models on this critical capability, helping developers choose the right model for their text extraction needs.
This benchmark demonstrates how capable modern models have become at multimodal tasks. It also highlights which models are particularly efficient, delivering high accuracy relative to their size.
Each model is tested on the same set of hardcoded-subtitle images across multiple categories. Scores represent exact-match accuracy: a character-by-character comparison in which even a single deviation fails the test case.
This strict evaluation ensures that only models capable of truly preserving the original text formatting score well on the benchmark.
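The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration of exact-match evaluation, not the benchmark's actual harness; the function names, the sample cases, and the tuple-based input format are all assumptions made for the example.

```python
# Illustrative sketch of exact-match scoring. Each test case pairs a model's
# extracted text with the ground-truth subtitle text. Names and data are
# hypothetical, not the benchmark's real API.

def exact_match(prediction: str, reference: str) -> bool:
    """Character-by-character comparison: any deviation in capitalization,
    spacing, punctuation, line breaks, or special characters is a failure."""
    return prediction == reference

def accuracy(cases: list[tuple[str, str]]) -> float:
    """Fraction of test cases whose extraction matches the reference exactly."""
    if not cases:
        return 0.0
    return sum(exact_match(pred, ref) for pred, ref in cases) / len(cases)

# Hypothetical test cases: (model output, ground truth).
cases = [
    ("Hello,\nworld! 🌍", "Hello,\nworld! 🌍"),  # exact match: passes
    ("Hello, world! 🌍",  "Hello,\nworld! 🌍"),  # lost line break: fails
    ("hello,\nworld! 🌍", "Hello,\nworld! 🌍"),  # capitalization drift: fails
]
print(accuracy(cases))  # 1 of 3 cases passes
```

Note that under this rule there is no partial credit: a model that recovers every word but collapses a line break scores the same on that case as a model that returns nothing.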