Extraction Tasks

4 benchmark tasks with side-by-side model comparisons

Invoice Data Extraction

MEDIUM
11 runs · Last: Mar 31
o4 Mini: 5500ms · $0.000000
o3 Mini: 2568ms · $0.000000
o3: 2037ms · $0.000000
GPT-4.1 Nano: 728ms · $0.000000
GPT-4.1 Mini: 1822ms · $0.000000
GPT-4.1: 1003ms · $0.000000
Claude Opus 4.6: 2929ms · $0.000000
GPT-4o Mini: 2113ms · $0.000055
GPT-4o: 1960ms · $0.000920
Claude Sonnet 4.6: 1282ms · $0.001482
Claude Haiku 4.5: 769ms · $0.000499

Ambiguous Date Parsing

HARD
11 runs · Last: Mar 31
o4 Mini: 11003ms · $0.000000
o3 Mini: 9086ms · $0.000000
o3: 8128ms · $0.000000
GPT-4.1 Nano: 435ms · $0.000000
GPT-4.1 Mini: 989ms · $0.000000
GPT-4.1: 898ms · $0.000000
Claude Opus 4.6: 5647ms · $0.000000
Claude Sonnet 4.6: 8469ms · $0.007005
GPT-4o: 3634ms · $0.002720
GPT-4o Mini: 1756ms · $0.000032
Claude Haiku 4.5: 1564ms · $0.000730

Extract Invoice Line Items

MEDIUM
11 runs · Last: Mar 31
o4 Mini: 9697ms · $0.000000
o3 Mini: 10927ms · $0.000000
o3: 13285ms · $0.000000
GPT-4.1 Nano: 2783ms · $0.000000
GPT-4.1 Mini: 7814ms · $0.000000
GPT-4.1: 3796ms · $0.000000
Claude Opus 4.6: 6750ms · $0.000000
Claude Sonnet 4.6: 7933ms · $0.010521
GPT-4o Mini: 7736ms · $0.000272
GPT-4o: 6276ms · $0.004543
Claude Haiku 4.5: 1950ms · $0.002607

Extract Invoice Line Items

MEDIUM
11 runs · 4 votes · Last: Mar 31
o4 Mini: 11850ms · $0.000000
o3 Mini: 21058ms · $0.000000
o3: 15289ms · $0.000000
GPT-4.1 Nano: 16091ms · $0.000000
GPT-4.1 Mini: 6560ms · $0.000000
GPT-4.1: 3667ms · $0.000000
Claude Opus 4.6: 8110ms · $0.000000
GPT-4o Mini: 14067ms · $0.000281
Claude Sonnet 4.6: 9032ms · $0.010554
Claude Haiku 4.5: 4933ms · $0.002653 · 1/1
GPT-4o: 4676ms · $0.004595 · 1/1