Can LLMs do Lateral Thinking Puzzles?

5 February 2025 · 3 mins

⚠️ Spoiler Warning: This post contains spoilers for the game show Only Connect and the podcast Lateral.


TL;DR: I test LLMs on lateral-thinking and connections-style puzzles using two benchmarks I curated, connections-2k and lateral-1k.

Here, try one yourself from connections-2k:

What is the connection between these?
Beechcraft airliner (post & pre 1996)
Cigarette (84mm & 100mm)
IPL team (Punjab & Chennai)
Bed sizes (5ft & 6ft)
Answer: Kings and Super Kings
Gemini 2.0 Flash
The connection is the use of the words "King" and "Queen".
Claude 3.5 Sonnet
KING
o1-preview
They are linked by “King” and “Super King”.
gpt-4o
King and Queen
DeepSeek-R1
Each pair consists of a "King" and "Super King" variant.
Figure 1: An example puzzle from connections-2k.

And try one from lateral-1k:

Answer the following lateral thinking question:
In 2007, how did one moment of bad behaviour from 1-year-old Charlie Davies-Carr cause his family to move to a new house?
Answer: Charlie bit his brother's finger, and the video went viral, generating over $100,000 in ad revenue. This allowed the family to purchase a new home.
Gemini 2.0 Flash
Charlie Davies-Carr bit his brother Harry's finger, and the resulting "Charlie Bit My Finger" video went viral on YouTube, earning the family significant income and fame, but also unwanted attention, leading them to move to a new house for privacy.
Claude 3.5 Sonnet
Charlie biting his brother's finger became a viral video sensation, earning the family enough money to buy a new house.
o1-preview
Charlie bit his brother's finger in a home video that went viral ("Charlie bit my finger"), and the resulting fame and income allowed his family to move to a new house.
gpt-4o
Charlie's "bad behavior" was biting his brother's finger in a viral video, which led to financial gain and allowed the family to move.
DeepSeek-R1
Charlie’s viral "Charlie Bit My Finger" video, stemming from his playful bite, earned the family enough money to move to a new house.
Figure 2: An example puzzle from lateral-1k.

Background

One of my favorite game shows is Only Connect and one of my favorite podcasts is Lateral by Tom Scott (and what a wonderful crossover when Tom was on Only Connect!). I've always wondered how well an LLM would do as a contestant on either show, since these types of puzzles seem to combine trivia-like knowledge with reasoning and System 2-style thinking.

And so I curated about 1000 lateral puzzles from the podcast and 2000 connections puzzles from the game show, which I call lateral-1k and connections-2k respectively. Back when gpt-3.5-turbo was all the rage, these benchmarks served as a nice private way for me to quickly gauge the state of the models, which at the time were not very good at these kinds of puzzles.

However, newer reasoning-focused models have significantly improved at solving these puzzles. Given this progress, I decided to document their performance. I am unsure of the legal implications of releasing these datasets publicly, so for now they remain on my disk. If you're interested in benchmarking LLMs on the full sets and have the necessary resources or connections, I'd love to collaborate!

Method

Benchmarking LLMs on these types of puzzles is a bit tricky, since deciding whether to accept an answer involves a fair amount of subjectivity. I used a smaller model (specifically, claude-3.5-haiku) as a grader, and while it largely did a great job, there were definitely places where I had to manually regrade an answer because the grader model got confused or injected its own bias. A minimal sketch of the grading step follows.
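
This is roughly what the grading call looked like. The sketch below assumes the Anthropic Python SDK; the grading prompt and the query_model and grade_answer helpers are illustrative reconstructions, not the exact code I ran.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def query_model(model: str, prompt: str) -> str:
    """Send a single user prompt to `model` and return its text response."""
    response = client.messages.create(
        model=model,
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

def grade_answer(question: str, reference: str, candidate: str) -> bool:
    """Ask the grader model whether `candidate` matches the reference answer."""
    prompt = (
        "You are grading a puzzle answer.\n"
        f"Puzzle: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Does the candidate express the same solution as the reference? "
        "Reply with exactly YES or NO."
    )
    verdict = query_model("claude-3-5-haiku-20241022", prompt)
    return verdict.strip().upper().startswith("YES")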

The prompt I used for connections was:

What's the connection between the following four things?

<clue1>
---
<clue2>
---
<clue3>
---
<clue4>
---

First provide your reasoning, then provide your answer as the last line.

Of course, for reasoning models, I omitted the "first provide your reasoning" part.

For lateral, I used a similar prompt, but with a bit more context:

What's the answer to this lateral thinking puzzle?

<question>

First provide your reasoning, then provide your answer as the last line.
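
Putting it together, the evaluation loop is conceptually simple: format each puzzle into the appropriate prompt, query the model, and grade the last line of its response. The sketch below reuses the query_model and grade_answer helpers from above; the puzzle dictionaries and the REASONING_MODELS set are my own illustrative scaffolding, not the exact harness I used.

import random

REASONING_MODELS = {"o1-preview", "deepseek-r1"}  # illustrative list

def build_prompt(puzzle: dict, model: str) -> str:
    """Format a connections or lateral puzzle for the given model."""
    if puzzle["kind"] == "connections":
        clues = "".join(f"{clue}\n---\n" for clue in puzzle["clues"])
        prompt = "What's the connection between the following four things?\n\n" + clues
    else:
        prompt = "What's the answer to this lateral thinking puzzle?\n\n" + puzzle["question"]
    if model in REASONING_MODELS:
        # Reasoning models do their own chain of thought, so skip the instruction.
        prompt += "\nProvide your answer as the last line."
    else:
        prompt += "\nFirst provide your reasoning, then provide your answer as the last line."
    return prompt

def evaluate(puzzles: list[dict], model: str, n: int = 100) -> float:
    """Accuracy of `model` on a uniform random subset of `n` puzzles."""
    correct = 0
    for puzzle in random.sample(puzzles, n):
        response = query_model(model, build_prompt(puzzle, model))
        answer = response.strip().splitlines()[-1]  # the answer is the last line
        context = puzzle.get("question") or "; ".join(puzzle["clues"])
        correct += grade_answer(context, puzzle["answer"], answer)
    return correct / n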

Results

Connections-2k
Figure 3: Evaluation results of various models on connections-2k. Out of approximately 2000 puzzles, each model was given a uniform random subset of 100 puzzles to solve.

Lateral-1k
Figure 4: Evaluation results of various models on lateral-1k. Out of approximately 1000 puzzles, each model was given a uniform random subset of 100 puzzles to solve. Human performance is not directly comparable to the models', since on the show contestants are allowed to ask follow-up questions until they get it right.

Examples

Here are some examples from lateral-1k:

Answer the following lateral thinking question:
In a 2010 study, two sets of biology facts were given to a randomized student study group. One set was significantly easier to remember than the other, even though it ought to have been more difficult. Why?
Answer: Facts presented in a harder-to-read font, like Comic Sans, were retained better than those in a standard, easier-to-read font like Arial.
Gemini 2.0 Flash
The "more difficult" facts were presented in a way that made them more relatable or interesting to the students.
Claude 3.5 Sonnet
The easier-to-remember set included evolutionary explanations for why the biological facts were true, while the other set just presented the facts alone. The "why" made the information more memorable, even though it added complexity.
o1-preview
Because the easier-to-remember facts were presented in a difficult-to-read font, which led to better memory retention due to increased cognitive processing.
gpt-4o
The easier-to-remember set was presented in a more engaging or memorable way.
DeepSeek-R1
The easier-to-remember set was written in a harder-to-read font, which promoted deeper processing and better retention.
Figure 5: An example from lateral-1k.

And here are some examples from connections-2k:

What is the connection between these?
Laurence (O) and Frank (I)
Laurence (O) and Kenneth (I)
Adrian (O) and Rory (I)
Chiwetel (O) and Ewan (I)
Answer: Othello and Iago
Gemini 2.0 Flash
The first name on the left contains an "O" sound, and the first name on the right contains an "I" sound.
Claude 3.5 Sonnet
Oscar Winners (O) vs Non-Oscar Winners (I)
o1-preview
They are pairs of actors who played Othello (O) and Iago (I) together in productions of Shakespeare's "Othello."
gpt-4o
Each pair of actors has played the same character in different adaptations or productions.
DeepSeek-R1
Each pair consists of actors who played Othello (O) and Iago (I) in different productions of Shakespeare's *Othello*.
Figure 6: An example from connections-2k.

Discussion

o1-preview is very impressive indeed. I speculate that OpenAI probably hasn't optimized their model for lateral thinking puzzles like these, and it's really cool to see this style of cross-domain transfer happening in the wild. Of course, these puzzles are very much in line with what o1-preview should ostensibly be good at: combining trivia-like knowledge with reasoning and System 2-style thinking. I think this is a big win for test-time compute.

I would love to run a future benchmark where the models are allowed a couple of rounds of follow-up questions to an oracle model that holds the correct answer, and to measure their performance in that more agentic setting. A rough sketch of what that loop might look like is below.
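
None of this is implemented yet; the sketch below is just one way the protocol could work, reusing the hypothetical query_model helper from the Method section. The "FINAL:" convention and the yes/no oracle prompt are assumptions of mine, not a finalized design.

def solve_with_followups(puzzle: dict, solver: str, oracle: str, rounds: int = 3) -> str:
    """Let `solver` ask an answer-holding `oracle` a few yes/no questions."""
    transcript = f"Puzzle: {puzzle['question']}\n"
    for _ in range(rounds):
        turn = query_model(
            solver,
            transcript + "\nAsk one yes/no question, or reply 'FINAL: <answer>' if you are ready.",
        )
        if turn.strip().upper().startswith("FINAL:"):
            return turn.split(":", 1)[1].strip()
        reply = query_model(
            oracle,
            f"The secret answer is: {puzzle['answer']}\n"
            "Without revealing the answer, reply to this question with YES, NO, or IRRELEVANT:\n"
            f"{turn}",
        )
        transcript += f"\nQ: {turn}\nA: {reply}"
    # Out of rounds: force a final answer from the accumulated transcript.
    return query_model(solver, transcript + "\nGive your final answer now.")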

Unfortunately, I couldn't fully evaluate the Gemini models due to aggressive rate limits on my account. If you're interested in collaborating on a more comprehensive benchmarking effort, please reach out!

Acknowledgements

Thanks to Lisa Dunlap for technical discussions. Thanks to Erik Jenner and Kathy Jang for giving feedback on drafts of this post.

Citation

@misc{kapur2025lateral,
  title={Can LLMs do Lateral Thinking Puzzles?},
  author={Kapur, Shreyas},
  journal={Lateral Thinking Benchmarks},
  url={https://shreyaskapur.com/blogs/lateral},
  year={2025},
  month={Feb}
}