Question & Answer Tasks

4 benchmark tasks with side-by-side model comparisons

Factual Recall with Distractor

MEDIUM
11 runs · Last: Mar 31
o4 Mini: 1333ms · $0.000000
o3 Mini: 1259ms · $0.000000
o3: 1290ms · $0.000000
GPT-4.1 Nano: 332ms · $0.000000
GPT-4.1 Mini: 840ms · $0.000000
GPT-4.1: 664ms · $0.000000
Claude Opus 4.6: 1842ms · $0.000000
Claude Sonnet 4.6: 1967ms · $0.000000
GPT-4o: 1372ms · $0.000000 · 1/1
GPT-4o Mini: 1256ms · $0.000000 · 1/1
Claude Haiku 4.5: 850ms · $0.000000

Multi-hop Reasoning

HARD
11 runs · Last: Mar 31
o4 Mini: 1918ms · $0.000000
o3 Mini: 1845ms · $0.000000
o3: 1341ms · $0.000000
GPT-4.1 Nano: 696ms · $0.000000
GPT-4.1 Mini: 727ms · $0.000000
GPT-4.1: 906ms · $0.000000
Claude Opus 4.6: 3465ms · $0.000000
Claude Sonnet 4.6: 2105ms · $0.001500
Claude Haiku 4.5: 1369ms · $0.000415
GPT-4o: 1286ms · $0.000470
GPT-4o Mini: 1241ms · $0.000026

Contract Clause Liability Interpretation

HARD
11 runs · Last: Mar 31
o4 Mini: 11735ms · $0.000000
o3 Mini: 13213ms · $0.000000
o3: 10932ms · $0.000000
GPT-4.1 Nano: 6651ms · $0.000000
GPT-4.1 Mini: 7689ms · $0.000000
GPT-4.1: 7583ms · $0.000000
Claude Opus 4.6: 17005ms · $0.000000
Claude Sonnet 4.6: 23873ms · $0.016293
GPT-4o Mini: 7793ms · $0.000242
GPT-4o: 7420ms · $0.004155
Claude Haiku 4.5: 3821ms · $0.002151

Contract Clause Liability Reasoning

HARD
11 runs · Last: Mar 31
o4 Mini: 7827ms · $0.000000
o3 Mini: 6993ms · $0.000000
o3: 9216ms · $0.000000
GPT-4.1 Nano: 3342ms · $0.000000
GPT-4.1 Mini: 7523ms · $0.000000
GPT-4.1: 8916ms · $0.000000
Claude Opus 4.6: 13674ms · $0.000000
Claude Sonnet 4.6: 14393ms · $0.011586
GPT-4o Mini: 7462ms · $0.000246
GPT-4o: 4797ms · $0.004180
Claude Haiku 4.5: 4044ms · $0.002482