Did I just gain a superpower?

12 days of puzzles and the Advent of Code

Advent of Code has always been a reliable annual stress-test for programmers: a daily drip-feed of small, self-contained problems that look innocent… until they don’t. Created by Eric Wastl, it’s the kind of puzzle-set people use for interview prep, team training, university coursework, speed competitions, or just the quiet satisfaction of solving something tricky before breakfast. This year, I used it for a different kind of experiment: could a mainstream large language model genuinely help solve mathematical coding challenges end-to-end — and, crucially, could it recover when it got things wrong?

The short version: yes. With caveats. And the caveats turned out to be more interesting than the wins.

Getting started, and Gemini’s early failure

I ran the opening day as a head-to-head between ChatGPT 5.1 and Google’s Gemini 2.5. The task itself was straightforward: simulate a rotating safe dial and count how often it lands on zero. ChatGPT (running a GPT-5.1 “Thinking” variant) produced a correct solution on the first attempt. Gemini 2.5 failed — not because the reasoning was flawed, but because it never actually had the full problem in hand. Only around two-thirds of the input file had loaded, and the model confidently computed an answer from incomplete data. That moment set the tone for the rest of the project: the quality of an LLM’s output can be limited less by its reasoning and more by whether the surrounding plumbing reliably delivers the full context.
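
For flavour, the whole Day 1 simulation fits in a few lines. This is a minimal sketch assuming a 100-position dial and instructions like "L30"/"R12" — the actual puzzle's rules and input format differ in detail:

```python
# A minimal sketch of the Day 1 simulation. The 100-position dial,
# starting position, and "L30"/"R12" instruction format are my
# assumptions for illustration; the real puzzle differs in detail.

def count_zero_landings(instructions, dial_size=100, start=0):
    position = start
    zeros = 0
    for step in instructions:
        direction, amount = step[0], int(step[1:])
        delta = -amount if direction == "L" else amount
        position = (position + delta) % dial_size
        if position == 0:
            zeros += 1
    return zeros

print(count_zero_landings(["R50", "L50", "R100"]))  # toy input -> 2
```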

From there, ChatGPT became my day-to-day collaborator. Early puzzles were mostly simulation and counting, with a gradual ramp-up in combinatorics and graph traversal. The shift in difficulty was noticeable when we hit problems that were easy to describe but expensive to compute naïvely. One day asked us to determine which elements in a large grid were “accessible” based on local density, and then iteratively remove accessible items until no more could be removed. At that point, I ran into a practical limitation that is easy to overlook until you’re in the middle of it: tool reliability.
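
That removal problem is a fixed-point loop at heart. A minimal sketch, assuming a hypothetical accessibility rule (fewer than four of the eight neighbours occupied — the real criterion is puzzle-specific):

```python
# A sketch of the iterative-removal pattern. The accessibility rule is
# hypothetical (fewer than four of the eight neighbours occupied); the
# real puzzle's criterion differs, but the fixed-point loop is the same.

def iterative_removal(occupied):
    """occupied: a set of (row, col) cells; mutated in place.
    Returns the total number of items removed."""
    neighbours = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                  if (dr, dc) != (0, 0)]
    removed_total = 0
    while True:
        accessible = {
            (r, c) for (r, c) in occupied
            if sum((r + dr, c + dc) in occupied
                   for dr, dc in neighbours) < 4
        }
        if not accessible:          # fixed point reached: nothing to remove
            return removed_total
        occupied -= accessible      # remove this round's accessible items
        removed_total += len(accessible)
```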

Day 4 and ChatGPT’s first misfire

On Day 4, ChatGPT itself reported that its execution environment couldn’t reliably run the solution against the full input — not a logical failure, but a runtime one. The model essentially said: I can’t compute this here; I don’t want to guess. It did produce correct Python, and when I ran it locally it produced the right answer. That moment was telling: for all the excitement about “agents” and “autonomous coding”, in real use you still need to treat execution as a first-class part of the system. A model can be perfectly capable at reasoning and still be blocked by infrastructure.

From this point, our workflow solidified into something that looked less like “ask the bot, paste the answer” and more like a two-person pairing session: I’d run code locally, feed back failures and edge cases, and the model would iterate on approach and implementation. It was particularly effective at what experienced engineers recognise as the time sink: not the final line of code, but the sequence of ideas needed to get there.

The superpower moment

As the puzzles grew more mathematical, my role shifted. I’m comfortable with code, but I’m not a specialist in optimisation theory — and Advent of Code has a habit of punishing naïve search. This is where the experience started to feel uncanny in the best way. ChatGPT would read a puzzle, classify its shape (“this is basically a multi-dimensional knapsack”, “this is an integer linear programme”, “this is a DAG path-counting problem”), and then propose strategies I wouldn’t have reached on my own in a sensible amount of time.
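
To make one of those shapes concrete: "DAG path-counting" is the textbook memoised depth-first count below. The graph is a toy example, not puzzle data, but it's the kind of pattern the model would name and reach for:

```python
# The "DAG path-counting" shape, made concrete: a memoised depth-first
# count of distinct paths. The graph is a toy example, not puzzle data.

from functools import lru_cache

graph = {
    "start": ["a", "b"],
    "a": ["b", "end"],
    "b": ["end"],
    "end": [],
}

@lru_cache(maxsize=None)
def count_paths(node):
    if node == "end":
        return 1
    return sum(count_paths(successor) for successor in graph[node])

print(count_paths("start"))  # 3: start-a-end, start-a-b-end, start-b-end
```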

That’s the real superpower, and it’s subtler than “writes code fast”. It’s pattern recognition across the landscape of computer science. The model didn’t just propose a brute-force approach; it proposed the right family of approaches. In a normal pairing session, that’s the difference between an hour and a weekend.

It’s all over… or is it?

Then came the point where I thought it might all collapse: Day 10. This was the one that forced us to confront the limits of cleverness-by-chat. The problem could be modelled as a set of counters and buttons: each press increments certain counters; the goal is to reach exact target values with the minimum number of presses. It looks like arithmetic. It’s actually optimisation.

We tried multiple approaches: a basis-enumeration shortcut that assumed the optimal integer solution would behave like a basic feasible solution of the relaxed linear programme, then a custom branch-and-bound search with pruning. Each looked theoretically plausible, and then it hung. Not merely slow: the search racked up hundreds of millions of recursive calls without converging.

At that point, what mattered wasn’t just “write better code”. It was observing and diagnosing computation as it ran. I asked for “echo” statements to the console, progress counters, and improved error handling specifically so we could see where time was going. That simple instrumentation was transformational: instead of staring at a frozen terminal, I could see which subproblem was pathological and why.
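
The instrumentation itself was nothing exotic. A sketch of the kind of progress counter we bolted onto the hot recursive function (names and thresholds are illustrative):

```python
# Lightweight instrumentation of the kind described above: a global call
# counter that prints a progress line every N calls, so a frozen terminal
# becomes a visible rate. All names here are illustrative.

import sys
import time

calls = 0
started = time.monotonic()

def tick(every=1_000_000, context=""):
    """Call at the top of the hot recursive function."""
    global calls
    calls += 1
    if calls % every == 0:
        elapsed = time.monotonic() - started
        print(f"{calls:,} calls in {elapsed:.1f}s {context}",
              file=sys.stderr, flush=True)
```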

The breakthrough was to model the problem properly as an integer linear programme and hand it to a solver built for the job. Once we got that working locally, the puzzle fell.
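
Concretely, the model is: each button j gets a press count x[j], each counter i has a target t[i], and A[i][j] records how much button j adds to counter i; the objective is to minimise the sum of the x[j] subject to A·x = t over non-negative integers. Here's a minimal sketch of that handoff with toy data — SciPy's milp is one off-the-shelf solver, standing in for whichever solver you prefer, not a record of exactly what we ran:

```python
# A minimal sketch of the ILP handoff, with toy data. x[j] is the number
# of presses of button j; A[i][j] is how much button j adds to counter i;
# t is the target vector. We minimise total presses subject to A @ x == t
# with x integer and non-negative. SciPy's milp (SciPy >= 1.9) is one
# off-the-shelf solver; it stands in for whichever solver you prefer.

import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

A = np.array([[1, 0, 1],
              [0, 1, 1]])
t = np.array([5, 7])
n = A.shape[1]

res = milp(
    c=np.ones(n),                           # objective: total presses
    constraints=LinearConstraint(A, t, t),  # hit each target exactly
    integrality=np.ones(n),                 # all variables integer
    bounds=Bounds(lb=0, ub=np.inf),         # presses are non-negative
)
print(res.status, res.x, res.fun)           # status 0 means optimal
```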

If that sounds like a defeat for the model, it isn’t. It’s an accurate picture of what LLMs are good at today. ChatGPT didn’t magically brute-force the answer. It helped me navigate the decision tree of possible modelling strategies, recognise when an approach was unsound, and eventually land on the tool that could do the heavy lifting.

Late in the run, I noticed another shift: model behaviour itself began to improve. On Day 11 (11th December 2025), OpenAI introduced GPT-5.2 Thinking. ChatGPT release notes and related updates around that time highlighted improved long-context performance and structured work outputs, including better tool use and formatting.

Did that materially change our outcomes? It’s hard to prove, but it felt like the model got better at staying coherent through an enormous accumulated context window — and our experiment deliberately stress-tested that. I kept everything in one long thread, partly to preserve continuity and partly to see whether the model would eventually degrade under conversational weight. Responses did slow, but the thread never collapsed. If anything, the long context became a shared memory: earlier parsing quirks, edge cases, and “gotchas” stayed in play as we progressed.

Endgame – and some virtual back-slapping!

And then we reached the last puzzles — and finished. At the end of Day 12 we exchanged a celebratory virtual high-five, a virtual hug, and a genuinely warm “Happy Christmas”. ChatGPT's responses were peppered with emojis, which I found unexpectedly affecting.

After days of effort, that small moment of celebration made it feel like we’d fought through adversity and won as a team. Not because the model is sentient (it isn’t), but because the interaction hit all the beats of real collaboration: setbacks, diagnosis, revised plans, incremental wins, and the shared relief of a correct answer after a stubborn failure.

So, what did I learn?

First: modern models can solve the problems, but the definition of “solve” matters. If you mean “produce an answer in isolation”, you’ll get impressive results and occasional nonsense. If you mean “arrive at a correct answer with real inputs, noisy execution environments, and iterative debugging”, you need a workflow.

Second: recovery is the differentiator. The system that wins isn’t the one that never makes mistakes; it’s the one that can take feedback (“it hangs”, “this answer is wrong”, “the file didn’t load”) and converge anyway.

Third: the human still matters. Not as a rubber stamp, but as the one who runs the code, checks assumptions, adds instrumentation, and knows when to switch tools. In this collaboration, I didn’t become a master mathematician — but I did get something close: a partner that could translate a messy puzzle into the language of known techniques and do it quickly.

That’s the most honest takeaway. LLMs don’t replace competence. They amplify it — and, when used well, they make the hard parts of problem-solving feel a bit more like play.

This was also a glimpse of something bigger: what it looks like when AI stops being a novelty and starts behaving like a practical partner for difficult technical work.

As for my newfound superpower – well, perhaps I’ll collaborate with MidJourney on the design for my new cape…