On March 5, 2026, OpenAI announced “Introducing GPT‑5.4,” positioning it as a major step forward for people who use AI to do real work: writing, analysis, coding, and tool-driven tasks that span multiple apps and files. The release rolls out across ChatGPT, the API, and Codex, and OpenAI describes it as the first “mainline reasoning model” that also incorporates GPT‑5.3‑Codex coding capabilities.
Beyond the headline capability claims, OpenAI and third-party coverage point to a practical theme: fewer mistakes, better long-horizon deliverables, and stronger performance in computer-use and knowledge-work benchmarks. In other words, GPT‑5.4 is framed less as a novelty and more as an efficiency upgrade for professional workflows.
1) What OpenAI actually launched on March 5, 2026
OpenAI’s launch announcement makes GPT‑5.4 notable for how it consolidates strengths that were often split across “reasoning” and “coding” variants. OpenAI calls it the first mainline reasoning model that incorporates GPT‑5.3‑Codex coding capabilities, aiming to reduce the trade-off between deep thinking and high-quality code generation.
Availability is broad: OpenAI says the rollout spans ChatGPT, the API, and Codex. This matters because teams often prototype in ChatGPT, then move into product via the API, and finally rely on Codex-style workflows for repository-scale coding; GPT‑5.4 is meant to feel consistent across those surfaces.
OpenAI Academy materials also clarify the lineup. Alongside GPT‑5.3 Instant (framed as the fast, everyday option), GPT‑5.4 “Thinking” targets difficult professional workflows and is available in ChatGPT, the API, and Codex; GPT‑5.4 Pro is positioned as the highest-capability option for Pro and Enterprise users as well as the API and Codex.
2) Factuality improvements: fewer false claims, fewer error-containing responses
OpenAI makes concrete factuality claims versus GPT‑5.2, using a dataset of de-identified prompts where users flagged factual errors. In that evaluation, OpenAI reports that individual claims are 33% less likely to be false with GPT‑5.4.
The company also reports a response-level metric: full responses are 18% less likely to contain any errors compared to GPT‑5.2. This distinction between claim-level and response-level accuracy suggests GPT‑5.4 improves both the “micro” correctness of individual statements and the “macro” reliability of an answer end-to-end.
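To make the two metrics concrete, here is a small illustrative calculation. The baseline error rates below are hypothetical placeholders (OpenAI reports only the relative reductions, not absolute rates):

```python
# Illustrative arithmetic only: the baseline rates are hypothetical
# assumptions, not figures from OpenAI's evaluation.
def improved_rate(baseline_rate: float, relative_reduction: float) -> float:
    """Apply a relative reduction, e.g. 0.33 for '33% less likely'."""
    return baseline_rate * (1.0 - relative_reduction)

# Suppose a hypothetical 10% claim-level false rate under GPT-5.2:
claim_rate = improved_rate(0.10, 0.33)     # about 0.067 under GPT-5.4
# And a hypothetical 20% response-level error rate under GPT-5.2:
response_rate = improved_rate(0.20, 0.18)  # about 0.164 under GPT-5.4
```

The point of the sketch: a 33% claim-level reduction compounds across long documents, while the 18% response-level figure measures whether the whole deliverable escapes error.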
For professional use, these numbers are meaningful because many workflows fail not due to a single hallucinated fact but because one wrong assumption contaminates a spreadsheet, a policy draft, or a technical plan. OpenAI’s framing implies GPT‑5.4 is targeted at reducing those cascading failures rather than only improving style or fluency.
3) Knowledge-work performance: GDPval and deliverables that look like real jobs
OpenAI highlights a knowledge-work evaluation called GDPval, covering 44 occupations and tasks such as sales presentations, spreadsheets, schedules, diagrams, and short videos. On this benchmark, GPT‑5.4 “wins or ties” in 83.0% of comparisons, up from 70.9% for GPT‑5.2.
What’s notable is the breadth of outputs: not just text answers, but multi-format artifacts that people actually submit to colleagues or clients. This matches OpenAI’s product positioning for GPT‑5.4 as a frontier model optimized for professional work across documents, spreadsheets, and presentations.
External voices echo the “deliverables” angle. In the launch materials, Mercor CEO Brendan Foody is quoted saying: “GPT‑5.4 is the best model we’ve ever tried… It excels at creating long-horizon deliverables such as slide decks, financial models, and legal analysis…” TechCrunch also reports this statement, adding that it’s described as faster and lower cost than some competing frontier models.
4) Spreadsheets and presentations: measurable gains in business artifacts
OpenAI reports a substantial uplift on an internal spreadsheet modeling benchmark designed around junior investment banking analyst-style tasks. The mean score is listed as 87.3% for GPT‑5.4 versus 68.4% for GPT‑5.2, suggesting fewer formula mistakes, more coherent assumptions, or better end-to-end modeling accuracy.
Presentation quality is also directly evaluated. OpenAI says human raters preferred GPT‑5.4 presentations 68.0% of the time over GPT‑5.2, based on criteria that include aesthetics, variety, and use of image generation.
Taken together, these metrics point to a specific competitive battleground: not just “write a paragraph,” but “produce an artifact someone would actually send.” If your workflow includes turning messy notes into slides or translating business logic into a spreadsheet, OpenAI is signaling GPT‑5.4 as a more dependable first draft, and, increasingly, a near-final draft.
5) Agents and computer use: OSWorld, WebArena, and beyond
One of the most striking benchmark callouts is OSWorld-Verified, a computer-use evaluation. OpenAI reports GPT‑5.4 at a 75.0% success rate versus 47.3% for GPT‑5.2, and notes this surpasses human performance reported as 72.4% (citing the OSWorld paper).
For browser use, WebArena-Verified shows a smaller lift: 67.3% for GPT‑5.4 versus 65.4% for GPT‑5.2. OpenAI also reports an “Online-Mind2Web screenshot-only” success rate of 92.8%, contrasted with “ChatGPT Atlas’s Agent Mode” at 70.9%, emphasizing progress in screenshot-grounded browsing tasks.
These results map to OpenAI’s broader positioning of GPT‑5.4 as an “agentic workflows” model, able to plan and execute across tools and software environments rather than simply answer questions. In practice, that could mean more reliable multi-step actions: finding information, updating a document, filling a form, or executing a repeatable process with tool calls and UI interactions.
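The “plan, then execute across tools” pattern can be sketched as a simple dispatch loop. Everything here is a hypothetical illustration: the tool names, the plan format, and the dispatcher are our own stand-ins, not OpenAI's agent API:

```python
# Hypothetical sketch of an agentic dispatch loop. The tools and the
# plan schema are illustrative assumptions, not an OpenAI interface.
from typing import Callable

def find_information(query: str) -> str:
    return f"results for {query!r}"          # stand-in for a search tool

def update_document(doc: str, text: str) -> str:
    return f"appended to {doc}: {text}"      # stand-in for a document tool

TOOLS: dict[str, Callable[..., str]] = {
    "find_information": find_information,
    "update_document": update_document,
}

def run_plan(plan: list[dict]) -> list[str]:
    """Execute a model-produced plan step by step, collecting each tool's result."""
    results = []
    for step in plan:
        tool = TOOLS[step["tool"]]           # the model names the tool...
        results.append(tool(**step["args"]))  # ...and supplies its arguments
    return results
```

In a real deployment the plan steps would come from model tool calls and the stand-in functions would hit actual software; the benchmarks above measure how reliably the model drives exactly this kind of loop.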
6) Coding and tool benchmarks: steady gains, plus stronger browsing and tooling
On SWE-Bench Pro (Public), OpenAI reports GPT‑5.4 at 57.7%, compared with 55.6% for GPT‑5.2. That’s a modest improvement, but it aligns with the launch message that GPT‑5.4 merges mainline reasoning with Codex-grade coding capability rather than chasing a single coding metric at the expense of everything else.
Tool and retrieval-heavy evaluations show larger deltas. BrowseComp is reported at 82.7% for GPT‑5.4 versus 65.8% for GPT‑5.2, indicating stronger performance in tasks where browsing, selecting sources, and integrating found information matters.
OpenAI also reports Toolathlon at 54.6% for GPT‑5.4 versus 46.3% for GPT‑5.2. Interpreted practically, this suggests better orchestration: choosing the right tool, calling it correctly, and incorporating outputs coherently, core requirements for “agentic” professional workflows.
7) API details: context window, modalities, endpoints, and model IDs
According to OpenAI API docs (snapshot dated March 5, 2026), GPT‑5.4 supports a 1,050,000-token context window with a maximum output of 128,000 tokens. The same documentation lists a knowledge cutoff of August 31, 2025, which is important for teams that require awareness of post-cutoff events (often addressed via browsing or retrieval tools rather than pretraining alone).
In terms of modalities, GPT‑5.4 accepts text and image as input and produces text output. Availability spans multiple API surfaces: Responses, Chat Completions, Realtime, Assistants, and Batch, with tool support listed for capabilities such as web search, file search, code interpreter, computer use, MCP, and more.
For integration, OpenAI lists both an alias and a pinned snapshot: `gpt-5.4` and `gpt-5.4-2026-03-05`. This gives developers a typical choice between “latest” behavior via the alias and reproducibility via the dated snapshot.
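In practice that choice looks something like the following sketch. The `build_request` helper is our own convenience wrapper (not an SDK API), and the commented-out call assumes the official OpenAI Python SDK:

```python
# Sketch: choosing between the alias and the dated snapshot listed in the
# docs. build_request is a hypothetical helper, not part of any SDK.
ALIAS = "gpt-5.4"                  # tracks "latest" behavior
SNAPSHOT = "gpt-5.4-2026-03-05"    # pinned for reproducibility

def build_request(prompt: str, reproducible: bool = True) -> dict:
    """Assemble request parameters; pin the snapshot for stable evals."""
    return {
        "model": SNAPSHOT if reproducible else ALIAS,
        "input": prompt,
    }

# With the official SDK, the parameters would be passed along these lines:
#   from openai import OpenAI
#   client = OpenAI()
#   response = client.responses.create(**build_request("Summarize this memo."))
```

Pinning the snapshot is the usual choice for evaluations and regression tests; the alias suits products that want improvements automatically.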
8) Pricing and operational considerations: what GPT‑5.4 costs to run
OpenAI’s pricing (as crawled March 5, 2026) lists GPT‑5.4 standard API rates at $2.50 per 1M input tokens, $0.25 per 1M cached input tokens, and $15.00 per 1M output tokens. For many production workloads, output cost dominates, so controlling verbosity and using structured outputs can have material budget impact.
OpenAI’s model documentation also notes batch pricing and cached-input pricing, encouraging patterns like reusing stable system prompts, retrieval scaffolds, or shared policy text. That matters when you operate at scale and want predictable cost curves across repeated tasks.
There are additional pricing nuances for very large contexts: prompts with more than 272K input tokens on 1.05M-context models are priced at 2× input and 1.5× output rates for the full session, and regional processing carries a +10% uplift. In practice, teams running near-megacontext sessions (for large codebases or multi-quarter project archives) will want guardrails such as chunking, retrieval, and caching to avoid surprise bills.
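A back-of-envelope cost model makes the surcharge cliff visible. This sketch uses only the rates quoted above and deliberately simplifies the rules (e.g. it does not model the "full session" billing nuance or batch discounts):

```python
# Back-of-envelope cost estimator built from the quoted GPT-5.4 rates.
# Simplified assumption: the long-context surcharge triggers on total
# prompt tokens; real "full session" billing may differ.
RATE_INPUT = 2.50 / 1_000_000    # USD per input token
RATE_CACHED = 0.25 / 1_000_000   # USD per cached input token
RATE_OUTPUT = 15.00 / 1_000_000  # USD per output token
LONG_CONTEXT_THRESHOLD = 272_000 # tokens before 2x-input / 1.5x-output kicks in

def estimate_cost(input_tokens: int, cached_tokens: int,
                  output_tokens: int, regional: bool = False) -> float:
    in_mult, out_mult = 1.0, 1.0
    if input_tokens + cached_tokens > LONG_CONTEXT_THRESHOLD:
        in_mult, out_mult = 2.0, 1.5          # long-context surcharge
    cost = (input_tokens * RATE_INPUT * in_mult
            + cached_tokens * RATE_CACHED * in_mult
            + output_tokens * RATE_OUTPUT * out_mult)
    if regional:
        cost *= 1.10                          # +10% regional processing uplift
    return round(cost, 4)
```

For example, a 1M-token prompt with 100K output tokens crosses the threshold and costs an estimated $7.25 under these assumptions, versus $0.40 for a 100K-in / 10K-out request; the jump is why chunking and retrieval pay for themselves.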
GPT‑5.4 arrives as a consolidation release: a model OpenAI says combines mainline reasoning with Codex-grade coding and is designed for agentic work across tools. The announcement is backed by claims of lower factual error rates (33% fewer false individual claims and 18% fewer error-containing responses versus GPT‑5.2) and by benchmark improvements that target professional outputs, spreadsheets, presentations, and knowledge-work tasks.
At the same time, the practical story for users and builders is about scaling workflows: a 1,050,000-token context window, text+image inputs, broad endpoint coverage, and clearer production economics via caching and batch options. If GPT‑5.4’s promise holds in real deployments, it won’t just answer better; it will complete more of the work you would otherwise do across documents, browsers, and software interfaces.