May 2026 feels like a reset month for AI models. Not because one lab ended the race, but because the shape of the race changed again. OpenAI pushed GPT-5.5 into the center of the conversation. Anthropic answered with Claude Opus 4.7 and a smaller experimental thread called Claude Mythos. Google kept Gemini 3.1 Pro in the mix while making Flash-Lite harder to ignore on price and speed. DeepSeek V4 Pro reminded everyone that open-weight models are still compressing the gap.
The honest summary is simple: GPT-5.5 looks like the strongest general model on the public benchmark mix I could verify by May 18, 2026. Claude Opus 4.7 looks especially attractive for coding, agents, and long work sessions. Gemini 3.1 Pro remains very competitive on reasoning and browsing-style tasks. Gemini 3.1 Flash-Lite is the model I would reach for when budget and latency matter. DeepSeek V4 Pro is not at the frontier, but it is important because it raises the floor for open systems.
The chart uses OpenAI's reported ARC AGI 2 verified scores for the left panel and NIST CAISI aggregate Elo for the DeepSeek V4 Pro evaluation on the right panel.
The new center of gravity
OpenAI's GPT-5.5 announcement is the cleanest public signal from this batch. The model improves on GPT-5.4 in every row of the benchmark table OpenAI published, with the largest gains on hard reasoning (ARC AGI 2 verified, 73.3 to 85.0) and terminal-style tool use (Terminal-Bench 2.0, 75.1 to 82.7).
The numbers worth paying attention to:
| Benchmark | GPT-5.5 | GPT-5.4 | Claude Opus 4.7 | Gemini 3.1 Pro |
|---|---|---|---|---|
| ARC AGI 2 verified | 85.0 | 73.3 | 75.8 | 77.1 |
| GPQA Diamond | 93.6 | 92.8 | 94.2 | 94.3 |
| BrowseComp | 84.4 | 82.7 | 79.3 | 85.9 |
| SWE-Bench Pro public | 58.6 | 57.7 | 64.3 | 54.2 |
| Terminal-Bench 2.0 | 82.7 | 75.1 | 69.4 | 68.5 |
That table tells a more interesting story than "one model wins." GPT-5.5 has the broadest profile, with clear leads on ARC AGI 2 verified and Terminal-Bench 2.0. Claude Opus 4.7 still beats it on SWE-Bench Pro public, 64.3 to 58.6, in OpenAI's own table. Gemini 3.1 Pro is slightly ahead on BrowseComp and GPQA Diamond, where Claude Opus 4.7 also sits just above GPT-5.5. The practical reading is that frontier models are now close enough that product fit matters more than leaderboard placement.
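To make the "no single winner" reading concrete, here is a minimal sketch that picks the top model per benchmark and its margin over the runner-up. The scores are copied from the table above; the names `SCORES`, `leaders`, and `LEADERS` are purely illustrative.

```python
# Scores copied from the comparison table above (higher is better).
SCORES = {
    "ARC AGI 2 verified": {
        "GPT-5.5": 85.0, "GPT-5.4": 73.3,
        "Claude Opus 4.7": 75.8, "Gemini 3.1 Pro": 77.1,
    },
    "GPQA Diamond": {
        "GPT-5.5": 93.6, "GPT-5.4": 92.8,
        "Claude Opus 4.7": 94.2, "Gemini 3.1 Pro": 94.3,
    },
    "BrowseComp": {
        "GPT-5.5": 84.4, "GPT-5.4": 82.7,
        "Claude Opus 4.7": 79.3, "Gemini 3.1 Pro": 85.9,
    },
    "SWE-Bench Pro public": {
        "GPT-5.5": 58.6, "GPT-5.4": 57.7,
        "Claude Opus 4.7": 64.3, "Gemini 3.1 Pro": 54.2,
    },
    "Terminal-Bench 2.0": {
        "GPT-5.5": 82.7, "GPT-5.4": 75.1,
        "Claude Opus 4.7": 69.4, "Gemini 3.1 Pro": 68.5,
    },
}


def leaders(scores):
    """Return {benchmark: (best_model, margin_over_runner_up)}."""
    result = {}
    for bench, by_model in scores.items():
        # Sort models by score, highest first.
        ranked = sorted(by_model.items(), key=lambda kv: kv[1], reverse=True)
        (best, top), (_, second) = ranked[0], ranked[1]
        result[bench] = (best, round(top - second, 1))
    return result


LEADERS = leaders(SCORES)
for bench, (model, margin) in LEADERS.items():
    print(f"{bench}: {model} leads by {margin} points")
```

Three different models lead at least one row, and two of the five margins are under two points, which is the case for testing on your own workload rather than reading the table as a ranking.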
If I were choosing a model for a new product today, I would not start with a single benchmark. I would start with the workflow:
- Long-chain reasoning with tools: GPT-5.5 is the first model I would test.
- Coding agents and refactors: Claude Opus 4.7 deserves a serious run.
- Research, browsing, and multimodal workflows: Gemini 3.1 Pro is still in the room.
- High volume extraction, summarization, and routing: Gemini 3.1 Flash-Lite changes the cost math.
- Private or self-hosted experiments: DeepSeek V4 Pro is the model to watch.
Claude Opus 4.7 looks built for work, not demos
Anthropic's Claude Opus 4.7 announcement reads less like a magic trick and more like a model built for long, annoying, real jobs. The claim is better sustained performance on coding, agent workflows, research, and writing.
The most concrete public signals are partner and benchmark numbers Anthropic published:
| Signal | Claude Opus 4.7 result | Why it matters |
|---|---|---|
| CursorBench | 70 | Coding agent work, above Opus 4.6 at 58 |
| SWE-Bench Pro public | 64.3 | Strong public coding benchmark result |
| Finance Agent benchmark | 0.813 | Better than Opus 4.6 at 0.767 |
| BigLaw Bench | 90.9 | High score on legal workflow tasks |
I would read those numbers as a pattern: Claude is trying to be the dependable model for messy professional work. That is different from winning every synthetic leaderboard. If you are building an AI coding assistant, internal knowledge worker, or agent that has to keep state across many steps, Opus 4.7 is one of the models you benchmark first.
Claude Mythos is more experimental. Anthropic describes it as a restricted preview focused on long-horizon reasoning, but it is not a model most teams can actually deploy today. I would keep it out of production comparisons until availability and evaluation details are less fuzzy.
Gemini's story is two models, not one
Google's Gemini line has a split personality in the best way. Gemini 3.1 Pro is the frontier competitor. Gemini 3.1 Flash-Lite is the practical machine you use when a product has real traffic and a real bill.
Gemini 3.1 Pro remains competitive in the OpenAI comparison table:
- 77.1 on ARC AGI 2 verified
- 94.3 on GPQA Diamond
- 85.9 on BrowseComp
- 54.2 on SWE-Bench Pro public in OpenAI's comparison table
Those are not sleepy numbers. The weaker spot is coding compared with Claude Opus 4.7 and GPT-5.5. The stronger spot is browsing and general reasoning.
Flash-Lite is the more interesting product move. Google priced it at $0.25 per million input tokens and $1.50 per million output tokens, with a focus on low latency and high throughput. Google also reported an Arena Score of 1432 for Flash-Lite, along with 86.9 on GPQA Diamond and 76.8 on MMMU. If those numbers hold in your own evals, it becomes a strong default for large-scale features that do not need the most expensive model on every request.
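The pricing is easier to reason about with a quick back-of-the-envelope calculation. This sketch uses only the two prices quoted above; the workload (one million requests a month, 2,000 input tokens and 300 output tokens each) is a made-up example, and the function names are illustrative.

```python
# Prices quoted above for Gemini 3.1 Flash-Lite, in USD per million tokens.
INPUT_USD_PER_MTOK = 0.25
OUTPUT_USD_PER_MTOK = 1.50


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single request at the quoted per-token prices."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


def monthly_cost_usd(requests: int, avg_input: int, avg_output: int) -> float:
    """Monthly bill for a workload with fixed average request sizes."""
    return requests * request_cost_usd(avg_input, avg_output)


# Hypothetical workload: these volumes are assumptions, not measurements.
ESTIMATE = monthly_cost_usd(requests=1_000_000, avg_input=2_000, avg_output=300)
print(f"Estimated monthly cost: ${ESTIMATE:,.2f}")
```

For that assumed workload the bill comes to roughly $950 a month: $500 for two billion input tokens and $450 for 300 million output tokens. Output tokens cost six times as much as input tokens here, so trimming response length moves the total more than trimming prompts.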
This is the direction I expect more apps to take: one frontier model for hard cases, one cheaper model for routing, drafting, extraction, and background work.
DeepSeek V4 Pro is not the winner, but it matters
NIST CAISI's DeepSeek V4 Pro evaluation is useful because it is less promotional than a launch post. Their report compared DeepSeek V4 Pro against GPT-5.5, Claude Opus 4.6, and GPT-5.4 mini across their aggregate evaluation suite.
The headline numbers:
| Model | CAISI aggregate Elo |
|---|---|
| GPT-5.5 | 1260 |
| Claude Opus 4.6 | 999 |
| DeepSeek V4 Pro | 800 |
| GPT-5.4 mini | 749 |
That puts DeepSeek V4 Pro behind the most capable closed models, but above GPT-5.4 mini in CAISI's aggregate. CAISI also described the model as roughly eight months behind the proprietary frontier on performance, while still being much cheaper for certain workloads.
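One way to get a feel for those gaps is to convert them to expected head-to-head scores. The sketch below assumes the conventional logistic Elo formula with a 400-point scale, as used in chess ratings. That is an assumption on my part: CAISI may define its aggregate Elo differently, so treat the outputs as intuition, not as figures from the report.

```python
# Aggregate Elo values copied from the table above.
CAISI_ELO = {
    "GPT-5.5": 1260,
    "Claude Opus 4.6": 999,
    "DeepSeek V4 Pro": 800,
    "GPT-5.4 mini": 749,
}


def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the conventional Elo formula.

    ASSUMPTION: the 400-point logistic scale is the chess convention and
    may not match how CAISI computes its aggregate.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def gaps_from(model: str, ratings: dict) -> dict:
    """Return {other_model: (rating_gap, expected_score_of_model_vs_other)}."""
    base = ratings[model]
    return {
        other: (rating - base, expected_score(base, rating))
        for other, rating in ratings.items()
        if other != model
    }


GAPS = gaps_from("DeepSeek V4 Pro", CAISI_ELO)
for other, (gap, score) in GAPS.items():
    print(f"vs {other}: gap {gap:+d}, expected score {score:.2f}")
```

Under that assumed scale, a 460-point deficit to GPT-5.5 is a large gap, while the 51-point lead over GPT-5.4 mini is a modest one.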
That is the part to watch. Open-weight models do not need to beat GPT-5.5 tomorrow to matter. They need to be good enough for private deployment, fine-tuning, local control, and cost-sensitive inference. DeepSeek V4 Pro looks like another step in that direction.
What I would actually compare before choosing
Public benchmarks are a starting point. They are not procurement. If I were choosing a model stack for a product in May 2026, I would run a small internal benchmark covering five test areas:
| Test area | What to measure | Models I would start with |
|---|---|---|
| Hard reasoning | Correctness on domain problems | GPT-5.5, Gemini 3.1 Pro |
| Coding | Multi-file change quality and test pass rate | Claude Opus 4.7, GPT-5.5 |
| Agent work | Tool calls, recovery, state tracking | GPT-5.5, Claude Opus 4.7 |
| Volume tasks | Cost, latency, acceptable accuracy | Gemini 3.1 Flash-Lite, DeepSeek V4 Pro |
| Private workflows | Control, data posture, hosting fit | DeepSeek V4 Pro, local variants |
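A small internal benchmark along these lines does not need much machinery. The sketch below scores any callable that maps a prompt to a string on exact-match accuracy and mean latency. Everything in it is hypothetical: the two "models" are local stand-in functions, not real API clients, and exact match is only one possible grading rule.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    prompt: str
    expected: str


@dataclass
class Score:
    model: str
    accuracy: float
    mean_latency_s: float


def evaluate(name: str, model: Callable[[str], str], cases: list) -> Score:
    """Run every case through `model`; record accuracy and mean latency."""
    if not cases:
        raise ValueError("need at least one test case")
    correct = 0
    elapsed = 0.0
    for case in cases:
        start = time.perf_counter()
        answer = model(case.prompt)
        elapsed += time.perf_counter() - start
        # Exact match after normalization; swap in your own grader here.
        correct += int(answer.strip().lower() == case.expected.strip().lower())
    return Score(name, correct / len(cases), elapsed / len(cases))


# Stand-in "models" so the sketch runs offline. Replace with real clients.
def always_four(prompt: str) -> str:
    return "4"


def lookup_model(prompt: str) -> str:
    answers = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "")


CASES = [Case("What is 2 + 2?", "4"), Case("Capital of France?", "Paris")]
RESULTS = [
    evaluate("always_four", always_four, CASES),
    evaluate("lookup_model", lookup_model, CASES),
]
for score in RESULTS:
    print(f"{score.model}: accuracy={score.accuracy:.2f}, "
          f"latency={score.mean_latency_s:.6f}s")
```

In practice the cases would come from your own domain, and you would add columns for cost per request and failure recovery to match the table.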
The mistake is treating the leaderboard as the architecture. The better pattern is a model router: cheap model first, frontier model when needed, human review where the cost of being wrong is high.
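That router pattern can be sketched in a few lines. This is a minimal illustration, not a production design: the `confidence` field is an assumption (most model APIs do not return a calibrated confidence, so you would need your own signal, such as a validator or a self-check), and the threshold and stand-in models are made up.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Answer:
    text: str
    confidence: float  # ASSUMPTION: 0..1 score from your own validator


@dataclass
class Decision:
    answer: Answer
    tier: str           # "cheap" or "frontier"
    needs_review: bool  # True when a human should check the result


Model = Callable[[str], Answer]


def route(prompt: str, cheap: Model, frontier: Model, *,
          high_stakes: bool = False, escalate_below: float = 0.8) -> Decision:
    """Cheap model first, frontier on low confidence, human review if high stakes."""
    answer, tier = cheap(prompt), "cheap"
    if answer.confidence < escalate_below:
        # The cheap model was unsure, so pay for the stronger one.
        answer, tier = frontier(prompt), "frontier"
    return Decision(answer, tier, needs_review=high_stakes)


# Hypothetical stand-ins: the cheap model is only confident on short prompts.
def cheap_stub(prompt: str) -> Answer:
    return Answer("cheap answer", 0.9 if len(prompt) < 40 else 0.4)


def frontier_stub(prompt: str) -> Answer:
    return Answer("frontier answer", 0.95)


EASY = route("Summarize this ticket.", cheap_stub, frontier_stub)
HARD = route("Refactor the billing module across these twelve files.",
             cheap_stub, frontier_stub)
RISKY = route("Approve this refund?", cheap_stub, frontier_stub, high_stakes=True)
print(EASY.tier, HARD.tier, RISKY.needs_review)
```

The design choice worth arguing about is the escalation signal. A fixed confidence threshold is the simplest option; task type, input length, or a failed schema validation are equally reasonable triggers.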
My read on the race
The model race is less about a single king and more about specialization. GPT-5.5 is the best overall signal. Claude Opus 4.7 feels like the professional workhorse. Gemini 3.1 Pro is still a serious frontier model, and Flash-Lite is a quiet threat because cost wins more products than people admit. DeepSeek V4 Pro is the reminder that open models keep making the floor higher.
The next wave will not be decided by who can write the flashiest launch post. It will be decided inside actual products: fewer retries, better tool use, lower latency, cleaner failure modes, and bills that do not make teams afraid to ship.
That is the benchmark I care about most.
