Benchmarking Lateral Reasoning
LateralBench
Compare model strategy and performance on multi-turn lateral-thinking tasks.
Leaderboard
* Displayed scores treat context-window-exhaustion errors ("I chewed through my entire context window") as incorrect answers; raw scores exclude them from weighting. Error bars show an approximate 95% CI (±1.96·stderr).
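The error-bar formula above (±1.96·stderr) can be sketched in a few lines; this assumes stderr is the standard error of the mean over per-question scores, and all names are illustrative rather than taken from the actual leaderboard code:

```python
import math

def confidence_interval_95(per_question_scores: list[float]) -> tuple[float, float]:
    """Approximate 95% CI for the mean score: mean ± 1.96 * stderr."""
    n = len(per_question_scores)
    mean = sum(per_question_scores) / n
    # Sample variance (n - 1 denominator), then standard error of the mean.
    var = sum((s - mean) ** 2 for s in per_question_scores) / (n - 1)
    stderr = math.sqrt(var / n)
    return (mean - 1.96 * stderr, mean + 1.96 * stderr)
```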
Score vs Cost
Cost multiple is relative to the cheapest selected model. Hover points for details.
Score vs Output Tokens
Token multiple is relative to the tersest selected model. Hover points for details.
Benchmark methodology and sample questions
LateralBench is a multi-turn lateral-thinking test that measures self-awareness, the ability to link disparate subjects, and strategic thinking.
Models are given 100 questions, each of which may not contain enough information to identify a single correct answer. At each turn a model has two options: request a hint or answer. It can request up to 5 hints, each more obvious than the last. A correct answer earns 6 − (number of hints used) points; an incorrect answer earns 0 points for that question. Models are told this scoring scheme, which encourages them to decide strategically how many hints they need before answering each question confidently.
A perfect score of 600 would require answering every question correctly without using any hints. Scores are normalized to a percentage.
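The scoring scheme above can be sketched as a short function; this is a minimal illustration of the described rules, and the function and variable names are assumptions, not LateralBench's actual implementation:

```python
def question_points(correct: bool, hints_used: int) -> int:
    """Points for one question: 6 - hints used if correct, else 0."""
    if not 0 <= hints_used <= 5:
        raise ValueError("a model may request at most 5 hints")
    return 6 - hints_used if correct else 0

def normalized_score(results: list[tuple[bool, int]]) -> float:
    """Percentage of the maximum (6 points per question, 600 over 100 questions)."""
    total = sum(question_points(correct, hints) for correct, hints in results)
    return 100.0 * total / (6 * len(results))

# Example: one question answered correctly with no hints (6 points) and one
# answered correctly after 3 hints (3 points) → 9/12 = 75.0%.
```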
To minimize contamination, LateralBench uses a private question set; while the questions are necessarily sent to provider APIs, the answers never are.