Introducing GPT-5.5

Tags: gpt-5-5, openai, agentic-ai, coding-agents, computer-use, llm-benchmarks, codex, token-efficiency
Originally from openai.com

My notes

Summary

OpenAI released GPT-5.5 on 23 April 2026, positioning it as their strongest agentic model to date: better than GPT-5.4 at sustained multi-step tasks, coding, computer use, and knowledge work, while matching its latency and using fewer tokens for the same work. API pricing lands at $5/1M input and $30/1M output tokens, with GPT-5.5 Pro at $30/$180. Over 85% of OpenAI’s internal teams already use it weekly across finance, comms, and engineering.

Key Insight

Token efficiency matters more than raw benchmark scores. GPT-5.5 uses fewer tokens than GPT-5.4 to complete the same Codex tasks while delivering better results. For API-based workflows this directly cuts cost; the “more capable = more expensive” assumption no longer holds for this generation.
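The cost effect is easy to quantify. A minimal sketch using the listed GPT-5.5 pricing ($5/1M input, $30/1M output); the token counts are hypothetical, chosen only to show how output-token efficiency dominates the bill:

```python
# GPT-5.5 list pricing from the announcement, converted to per-token USD.
PRICE_IN = 5.00 / 1_000_000    # $5 per 1M input tokens
PRICE_OUT = 30.00 / 1_000_000  # $30 per 1M output tokens

def task_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one API task at GPT-5.5 list pricing."""
    return input_tokens * PRICE_IN + output_tokens * PRICE_OUT

# Hypothetical example: the same Codex task, same prompt, fewer output tokens.
verbose = task_cost(input_tokens=20_000, output_tokens=12_000)
efficient = task_cost(input_tokens=20_000, output_tokens=8_000)

print(f"verbose:   ${verbose:.2f}")    # $0.46
print(f"efficient: ${efficient:.2f}")  # $0.34
print(f"saved:     ${verbose - efficient:.2f}")  # $0.12 per task
```

Because output tokens cost 6x input tokens, a model that finishes the same task with less chatter saves more than a headline price cut would.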

The shift is from Q&A to sustained execution. GPT-5.5 is engineered to persist across multi-step loops (plan, use tools, check output, iterate) without human babysitting. The real unlock is not smarter single responses but reliable long-horizon task completion. The NVIDIA engineer quote (“losing access feels like a limb amputation”) signals how fast dependency forms once a model can actually finish work end-to-end.
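The plan → act → check → iterate loop above can be sketched as control flow. This is not the OpenAI API; `call_model`, the tool table, and the toy driver are all hypothetical stand-ins to show the shape of the loop:

```python
from typing import Callable

def run_agent(goal: str,
              call_model: Callable[[str], dict],
              tools: dict[str, Callable[[str], str]],
              max_steps: int = 10) -> str:
    """Drive a model through repeated tool calls until it reports done."""
    transcript = f"Goal: {goal}"
    for _ in range(max_steps):
        step = call_model(transcript)      # model plans the next action
        if step["action"] == "finish":
            return step["result"]          # task completed end-to-end
        tool = tools[step["action"]]
        observation = tool(step["input"])  # execute the tool, capture output
        # Feed the observation back so the model can check and iterate.
        transcript += f"\n{step['action']}({step['input']}) -> {observation}"
    return "gave up after max_steps"

# Toy "model": runs the tests once, sees the result, then declares done.
def toy_model(transcript: str) -> dict:
    if "run_tests" not in transcript:
        return {"action": "run_tests", "input": "suite"}
    return {"action": "finish", "result": "all tests pass"}

print(run_agent("fix the bug", toy_model, {"run_tests": lambda x: "ok"}))
```

The announcement's claim is that GPT-5.5 stays coherent across many more iterations of exactly this loop before drifting or stalling.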

Concrete benchmark numbers to know:

  • SWE-Bench Pro (real GitHub issue resolution): 58.6%
  • Terminal-Bench 2.0 (complex CLI workflows): 82.7%
  • OSWorld-Verified (autonomous computer use): 78.7%
  • GDPval (44-occupation knowledge work): 84.9%
  • Tau2-bench Telecom (customer service workflows): 98.0%
  • FrontierMath Tier 4 (hardest math): 35.4% (vs 27.1% for 5.4)
  • Long-context 1M token retrieval (BFS): 45.4% vs Claude Opus 4.6 at 41.2%

Computer use is now practically viable. 78.7% on OSWorld-Verified is the number that matters: this is real GUI automation (click, type, navigate) without special APIs. It opens browser-automation options that don’t need Playwright scripting.

Load balancing self-optimisation is a noteworthy precedent. GPT-5.5’s inference team used Codex to analyse production traffic and write custom heuristic algorithms for GPU partitioning, yielding a >20% token-generation speedup. AI-written infra optimisations shipped to production; the loop is closing.
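The actual Codex-written heuristics aren’t public, so as a stand-in, here is the classic greedy longest-processing-time heuristic for balancing load across partitions, the kind of baseline such traffic-shaping work would start from:

```python
import heapq

def balance(loads: list[float], n_partitions: int) -> list[list[float]]:
    """Greedy LPT: assign each job to the currently least-loaded partition."""
    # Min-heap of (total_load, partition_index); pop the lightest each time.
    heap = [(0.0, i) for i in range(n_partitions)]
    heapq.heapify(heap)
    partitions: list[list[float]] = [[] for _ in range(n_partitions)]
    for load in sorted(loads, reverse=True):  # place the biggest jobs first
        total, i = heapq.heappop(heap)
        partitions[i].append(load)
        heapq.heappush(heap, (total + load, i))
    return partitions

# Hypothetical request loads split across two GPU partitions.
buckets = balance([4.0, 3.0, 3.0, 2.0], n_partitions=2)
print([sum(b) for b in buckets])  # totals come out even: [6.0, 6.0]
```

LPT is a textbook approximation for makespan scheduling; the point of the anecdote is that the production version was derived by the model from real traffic data rather than hand-tuned.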

OpenAI internal usage is the real signal. 85%+ weekly active use across non-engineering teams (Finance reviewed 71,637 pages of K-1 tax forms; Comms automated Slack triage; GTM saved 5-10 hrs/week on reports) suggests the productivity gains are real and broad, not just for developers.