⚠️ Spoiler Warning: This post contains spoilers for the game show OnlyConnect and the podcast Lateral.
TL;DR: I test LLMs on lateral-thinking and connections-style puzzles using two datasets I curated, connections-2k and lateral-1k.
Here, try one yourself from connections-2k:
An example puzzle from connections-2k.
And try one from lateral-1k:
An example puzzle from lateral-1k.
One of my favorite game shows is OnlyConnect, and one of my favorite podcasts is Lateral by Tom Scott (and what a wonderful crossover when Tom was on OnlyConnect!). I always wondered how well an LLM would do if it were a contestant on either of these shows, since these types of puzzles seem to combine trivia-like knowledge with reasoning and System 2-style thinking.
And so I curated about 1000 lateral puzzles from the podcast and 2000 connections puzzles from the game show, which I call lateral-1k and connections-2k, respectively. Back when gpt-3.5-turbo was all the rage, these benchmarks served as a nice private way for me to quickly test the state of the models, and back then the models were not very good at these types of puzzles.
However, newer reasoning-focused models have significantly improved at solving these puzzles. Given this progress, I decided to document their performance. I am unsure of the legal implications of releasing these datasets publicly, so for now they remain on my disk. If you're interested in benchmarking LLMs on the full sets and have the necessary resources or connections, I'd love to collaborate!
Benchmarking LLMs on these types of puzzles is a bit tricky, since deciding whether to accept an answer involves a fair amount of subjectivity. I used a smaller model (specifically, claude-3.5-haiku) as a grader, and while it largely did a great job, there were definitely places where I had to manually regrade an answer because the grader model got confused or injected its own bias.
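To make the grading step concrete, here is a minimal sketch of the LLM-as-grader idea, assuming the Anthropic Python SDK; the grading prompt and the exact model identifier are illustrative, not my exact harness:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def grade_answer(question: str, reference: str, candidate: str) -> bool:
    """Ask a small grader model whether the candidate answer matches the reference."""
    grading_prompt = (
        "You are grading an answer to a puzzle.\n"
        f"Puzzle: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Does the candidate answer express the same solution as the reference? "
        "Reply with exactly YES or NO."
    )
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",  # illustrative identifier for claude-3.5-haiku
        max_tokens=5,
        messages=[{"role": "user", "content": grading_prompt}],
    )
    return response.content[0].text.strip().upper().startswith("YES")
```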
The prompt I used for connections was:
What's the connection between the following four things?
<clue1>
---
<clue2>
---
<clue3>
---
<clue4>
---
First provide your reasoning, then provide your answer as the last line.
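Mechanically, a single connections query boils down to string formatting plus one API call. Here is a rough sketch assuming the OpenAI Python SDK; the solver model name and the clue layout are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CONNECTIONS_TEMPLATE = (
    "What's the connection between the following four things?\n"
    "{clue1}\n---\n{clue2}\n---\n{clue3}\n---\n{clue4}\n---\n"
    "First provide your reasoning, then provide your answer as the last line."
)

def ask_connections(clues: list[str], model: str = "gpt-4o") -> str:
    """Send one connections puzzle to a solver model and return its final line."""
    prompt = CONNECTIONS_TEMPLATE.format(
        clue1=clues[0], clue2=clues[1], clue3=clues[2], clue4=clues[3]
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    # The answer is expected to be the last non-empty line of the completion.
    text = response.choices[0].message.content
    return [line for line in text.splitlines() if line.strip()][-1]
```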
Of course, for reasoning models, I omitted the "first provide your reasoning" part.
For lateral, I used a similar prompt, but with a bit more context:
What's the answer to this lateral thinking puzzle?
<question>
First provide your reasoning, then provide your answer as the last line.
Evaluation results of various models on connections-2k. Out of approximately 2000 puzzles, the models were given a uniform random subset of 100 puzzles to solve.
Evaluation results of various models on lateral-1k. Out of approximately 1000 puzzles, the models were given a uniform random subset of 100 puzzles to solve. Human performance is not comparable to the models', since on the podcast the contestants are allowed to ask follow-up questions until they get it right.
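Each number in the plots above is just accuracy over a random 100-puzzle subset. Roughly, assuming the dataset is a list of puzzle dicts and using the solve/grade helpers sketched earlier:

```python
import random

def evaluate(puzzles: list[dict], solve, grade, n: int = 100, seed: int = 0) -> float:
    """Accuracy of one solver on a uniform random subset of a dataset.

    `puzzles` is assumed to hold the clues/question and reference answer;
    `solve(puzzle)` returns the model's final-line answer and
    `grade(puzzle, answer)` returns True if the grader accepts it.
    """
    rng = random.Random(seed)
    subset = rng.sample(puzzles, n)
    correct = sum(grade(p, solve(p)) for p in subset)
    return correct / n
```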
Here are some examples from lateral-1k:
Examples from lateral-1k.
And here are some examples from connections-2k:
Examples from connections-2k.
o1-preview is very impressive indeed. I speculate that OpenAI probably hasn't optimized their model for lateral thinking puzzles like these, and it's really cool to see this style of cross-domain transfer happening in the wild. Of course, these puzzles are actually very much in line with the kind of thing that o1-preview should ostensibly be good at: combining trivia-like knowledge with reasoning and System 2-style thinking. I think this is a big win for test-time compute.
I would love to run a future benchmark where the models are allowed a couple of rounds of follow-up questions to an oracle model that knows the correct answer, and to measure the models' performance in that more agentic setting.
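Purely as a sketch of what that protocol could look like (I have not implemented this; the "FINAL:" convention, round limit, and helper signatures are all assumptions):

```python
def solve_with_followups(puzzle, solver_ask, oracle_answer, max_rounds: int = 3) -> str:
    """Multi-round solving against an oracle model that knows the correct answer.

    `solver_ask(puzzle, transcript)` is assumed to return either a follow-up
    question or a final answer prefixed with "FINAL:"; `oracle_answer(puzzle, q)`
    is assumed to answer a follow-up truthfully without revealing the solution.
    """
    transcript: list[tuple[str, str]] = []
    for _ in range(max_rounds):
        move = solver_ask(puzzle, transcript)
        if move.startswith("FINAL:"):
            return move[len("FINAL:"):].strip()
        # Record the question/answer pair so the solver sees it next round.
        transcript.append((move, oracle_answer(puzzle, move)))
    # Out of rounds: force the solver to commit to an answer.
    final = solver_ask(puzzle, transcript + [("No more questions; give your final answer.", "")])
    return final[len("FINAL:"):].strip() if final.startswith("FINAL:") else final.strip()
```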
Unfortunately, I couldn’t evaluate Gemini models due to aggressive rate limits on my account. If you're interested in collaborating on a more comprehensive benchmarking effort, please reach out!
Thanks to Lisa Dunlap for technical discussions. Thanks to Erik Jenner and Kathy Jang for giving feedback on drafts of this post.
@misc{kapur2025lateral,
url={https://shreyaskapur.com/blogs/lateral},
journal={Lateral Thinking Benchmarks},
author={Kapur, Shreyas},
year={2025},
month={Feb}
}