Hardcoded Subtitle Benchmark

This benchmark measures how accurately large language models extract hardcoded (burned-in) subtitles, scoring both word accuracy and format fidelity. We evaluate how well models preserve line breaks, special characters, and the original formatting structure.
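The scorer itself is not shown on this page; as a minimal sketch, assuming a sample passes only when the extracted text matches the reference verbatim (including line breaks and special characters), the comparison might look like this in Python:

def exact_match(reference: str, model_output: str) -> bool:
    # Compare line by line so that missing or extra line breaks fail the
    # check. Forgiving trailing whitespace per line is an assumption here,
    # not the benchmark's published rule.
    ref_lines = [line.rstrip() for line in reference.strip().splitlines()]
    out_lines = [line.rstrip() for line in model_output.strip().splitlines()]
    return ref_lines == out_lines

ref = "Wie wäre es dann, wenn ich dir eine Woche\ndas Essen für die Pause mitbringe?"
print(exact_match(ref, ref))                     # True: line break preserved
print(exact_match(ref, ref.replace("\n", " ")))  # False: line break collapsed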

Leaderboard

[Performance chart: overall accuracy across 64 models]


Model Rankings
Detailed breakdown by category
Rank  Model                                        Overall
   1  Qwen 3 VL 8B Instruct (4 variants)           100%
   2  google/gemini-3-pro-preview                  100%
   3  anthropic/claude-opus-4.5                    100%
   4  openai/gpt-4.1-mini                          98.46%
   5  anthropic/claude-3.7-sonnet                  98.46%
   6  qwen/qwen-vl-max                             98.44%
   7  anthropic/claude-sonnet-4.5                  98.44%
   8  openai/gpt-5-mini                            98.44%
   9  openai/gpt-5.1-codex-mini                    96.88%
  10  perplexity/sonar                             96.88%
  11  anthropic/claude-haiku-4.5                   95.31%
  12  openai/gpt-5.1                               95.31%
  13  deepcogito/cogito-v2-preview-llama-109b-moe  95.31%
  14  qwen3-vl-32b-instruct                        95.31%
  15  openai/o4-mini-high                          93.85%
  16  google/gemini-3-flash-preview                93.85%
  17  openai/gpt-5-chat                            93.85%
  18  qwen/qwen3-vl-235b-a22b-instruct             93.85%
  19  openai/gpt-5.2                               93.75%
  20  z-ai/glm-4.5v                                93.75%
  21  z-ai/glm-4.6v                                93.75%
  22  qwen3-vl-4b-instruct                         93.75%
  23  google/gemini-2.5-flash                      92.31%
  24  openai/gpt-4.1                               92.31%
  25  google/gemini-2.0-flash-001                  92.19%
  26  meta-llama/llama-4-maverick                  92.19%
  27  qwen/qwen3-vl-8b-thinking                    92.19%
  28  openai/gpt-5.2-chat                          92.19%
  29  openai/o4-mini                               90.77%
  30  mistralai/ministral-14b-2512                 90.62%
  31  Noah/qwen3-vl-30b-a3b-v3                     90.62%
  32  openai/gpt-5.1-codex-max                     90.62%
  33  Noah/qwen3-vl-30b-a3b-v2                     90.62%
  34  openai/gpt-4o                                89.23%
  35  google/gemini-2.5-flash-lite                 89.06%
  36  Qwen 3 VL 30B Instruct (2 variants)          89.06%
  37  qwen/qwen2.5-vl-32b-instruct                 89.06%
  38  meta-llama/llama-4-scout                     87.69%
  39  qwen3-vl-2b-instruct                         85.94%
  40  openai/gpt-4o-mini                           84.62%
  41  openai/gpt-5-nano                            84.38%
  42  google/gemma-3-27b-it                        84.38%
  43  Ministral 3B (2 variants)                    82.81%
  44  mistralai/pixtral-large-2411                 82.81%
  45  mistralai/mistral-large-2512                 82.81%
  46  qwen/qwen3-vl-30b                            80%
  47  allenai/olmocr-2-7b                          79.69%
  48  Noah/qwen3-vl-30b-a3b-v1                     79.69%
  49  nvidia/nemotron-nano-12b-v2-vl:free          78.12%
  50  camel-doc-ocr-080125                         78.12%
  51  gliese-ocr-7b-post2.0-final                  75%
  52  chandra-ocr                                  75%
  53  openai/gpt-4.1-nano                          73.85%
  54  qwen3-visioncaption-2b                       73.44%
  55  bytedance/ui-tars-1.5-7b                     70.77%
  56  baidu/ernie-4.5-vl-28b-a3b                   70.31%
  57  nanonets-ocr2-3b-aio                         70.31%
  58  anthropic/claude-3.5-haiku                   67.69%
  59  google/gemma-3-4b-it                         67.19%
  60  x-ai/grok-4.1-fast                           46.88%
  61  ln                                           34.38%
  62  tencent/HunyuanOCR                           33.85%
  63  meta-llama/llama-3.2-11b-vision-instruct     32.31%
  64  x-ai/grok-4-fast                             27.69%

The Hardcoded Subtitle Benchmark tests LLMs on their ability to extract text exactly as presented, including formatting, line breaks, and special characters.
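The rankings above report a single Overall score per model. A hypothetical aggregation, assuming each sample is scored pass/fail within one of the categories shown below (Formatting, Line Breaks, Special Characters), could be computed like this; the category names and pass/fail scoring are illustrative assumptions, not the benchmark's published method:

from collections import defaultdict

def category_breakdown(results: list[tuple[str, bool]]) -> dict[str, float]:
    # results holds (category, passed) pairs, one per evaluated sample.
    totals = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for category, passed in results:
        totals[category][0] += int(passed)
        totals[category][1] += 1
    return {cat: correct / total for cat, (correct, total) in totals.items()}

def overall(results: list[tuple[str, bool]]) -> float:
    # Overall taken as the plain fraction of passed samples across categories.
    return sum(passed for _, passed in results) / len(results)

results = [("Formatting", True), ("Formatting", False),
           ("Line Breaks", True), ("Special Characters", True)]
print(category_breakdown(results))         # {'Formatting': 0.5, 'Line Breaks': 1.0, 'Special Characters': 1.0}
print(f"Overall: {overall(results):.2%}")  # Overall: 75.00%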

Samples

View example outputs for each category

Formatting

Line Breaks

Sample 1
Model Output:
Wie wäre es dann, wenn ich dir eine Woche
das Essen für die Pause mitbringe?
(English: "How about I bring you food for the break for a week, then?")

Sample 2
Model Output:
Ich sehe deine Welt durch Glas...
(English: "I see your world through glass...")

Sample 3
Model Output:
Licht blitzt auf. Ein Signal anzufangen?
(English: "Light flashes. A signal to begin?")

Special Characters