GPT-5 has been live in enterprise for sixty days. The picture is sharper than the launch narrative.
OpenAI shipped GPT-5 to enterprise customers on 6 February 2026 with three headline claims: more reliable tool use, a 2 million token context window, and a cost story that, measured per completed task rather than per token, would undercut GPT-4o for high-volume reasoning workloads. Two months in, the teams that migrated have data. The teams still on GPT-4o have questions.
This is the debrief.
What actually shipped
GPT-5 enterprise is not a single product. It is three things bundled.
First, a new tool-calling architecture. Function calls now run inside a structured planner that resolves multi-step tasks before executing. OpenAI's published benchmarks show tool-call accuracy rising from 87 per cent on GPT-4o to 96 per cent on GPT-5 on the Berkeley Function Calling Leaderboard. Migrated teams report numbers in line with that. One mid-market financial services customer told us their internal pipeline saw spurious tool calls fall from roughly 1 in 12 to 1 in 60 over March.
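OpenAI has not published the planner's internals, so treat the sketch below as an illustration of the plan-then-execute pattern as described, not the actual interface. The tool names and plan format are invented.

```python
# Plan-then-execute, in miniature. Tool names and the plan format are
# invented for illustration; the real planner is internal to the model.

def plan(task: str) -> list[dict]:
    """Stand-in for the planning pass: the full tool-call sequence is
    resolved up front instead of being improvised call by call."""
    return [
        {"tool": "fetch_filing", "args": {"ticker": "ACME", "year": 2025}},
        {"tool": "extract_table", "args": {"section": "liquidity"}},
        {"tool": "summarise", "args": {"max_words": 120}},
    ]

def execute(steps: list[dict], tools: dict) -> list:
    """Run the resolved plan in order, carrying results forward."""
    results = []
    for step in steps:
        results.append(tools[step["tool"]](**step["args"]))
    return results
```

Resolving the whole sequence before anything runs is the plausible mechanism behind the drop in spurious calls: fewer decisions improvised mid-chain.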
Second, the 2 million token window. This is technically real. It is not always practically useful. We will return to this.
Third, a new pricing tier. GPT-5 Standard runs at USD 8 per million input tokens and USD 24 per million output tokens. GPT-5 Reasoning, which exposes extended thinking, runs at USD 12 input and USD 48 output. GPT-4o, for comparison, sits at USD 5 input and USD 15 output (OpenAI pricing page, accessed April 2026).
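Worth writing the per-call arithmetic down before the migration debate starts. The rates below are the published ones; the token counts are ours, purely for illustration.

```python
# Per-call cost at the published rates (USD per million tokens).
# The 30K-in / 2K-out call below is illustrative, not a real workload.
RATES = {
    "gpt-4o":          {"in": 5.0,  "out": 15.0},
    "gpt-5-standard":  {"in": 8.0,  "out": 24.0},
    "gpt-5-reasoning": {"in": 12.0, "out": 48.0},
}

def call_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    rate = RATES[model]
    return (tokens_in * rate["in"] + tokens_out * rate["out"]) / 1_000_000

for model in RATES:
    print(f"{model}: ${call_cost(model, 30_000, 2_000):.3f}")
# gpt-4o: $0.180 | gpt-5-standard: $0.288 | gpt-5-reasoning: $0.456
```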
What teams are reporting
We spoke with five enterprise customers across financial services, professional services and government. The pattern is consistent enough to call.
Tool use is the win. Every team that migrated their agent or RAG pipeline saw measurable lift on tool-call success rates. For workflows that chain three or more tool calls, the difference between GPT-4o and GPT-5 is the difference between something you can ship and something you have to babysit.
The 2M context is situational. Teams running large document review use it. Teams running ordinary chat assistants do not. One legal services customer reported genuine value loading entire matter files into a single prompt. A consulting firm's internal tooling team reported the longer context introduced more attention dilution than they expected on tasks that did not need it. The lesson: long context is a tool, not a default.
Pricing maths is harder than it looks. On the headline rate, GPT-5 Reasoning is more expensive than GPT-4o. The argument for migration is that fewer retries and fewer prompt-engineering work-arounds compensate. For some workloads this holds. For others it does not. One ops team modelled the same pipeline on both and found GPT-5 was 18 per cent cheaper on end-to-end task cost despite higher per-token rates because the retry rate dropped from 9 per cent to 2 per cent.
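We cannot reconstruct that team's exact model, but the shape of the calculation is worth pinning down. Note that a retry drop from 9 to 2 per cent cannot close a 60 per cent rate gap on its own; the savings also need the shorter prompts that fewer work-arounds buy. The token counts below are invented to show the mechanics, not to reproduce that team's result.

```python
# Task-level cost: what a completed task costs once retries are priced in.
# If each attempt fails with probability p, expected attempts = 1 / (1 - p).

def task_cost(rate_in, rate_out, tokens_in, tokens_out, retry_rate):
    attempts = 1 / (1 - retry_rate)
    per_attempt = (tokens_in * rate_in + tokens_out * rate_out) / 1_000_000
    return attempts * per_attempt

# Hypothetical pipeline: GPT-4o carries heavier scaffolding prompts
# (the work-arounds); GPT-5 needs fewer tokens and retries less.
gpt4o = task_cost(5, 15, 45_000, 3_000, retry_rate=0.09)
gpt5  = task_cost(8, 24, 25_000, 2_500, retry_rate=0.02)
print(f"GPT-4o per task: ${gpt4o:.3f}")   # ~$0.297
print(f"GPT-5  per task: ${gpt5:.3f}")    # ~$0.265
```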
What this actually means
Enterprise AI in April 2026 is no longer about one model winning. It is about which workloads sit on which model. Three patterns separate teams getting value from teams burning credits.
First, the teams getting value have a clear taxonomy of their own workloads. Tool-heavy agentic workflows on GPT-5. Long-context document tasks on GPT-5. Routine drafting and summarisation still on GPT-4o or Claude Sonnet for cost reasons. Teams that swept everything to GPT-5 are now reverting selectively.
Second, the teams getting value are measuring task-level cost, not token cost. Per-token pricing is a misleading anchor. Cost per completed task, including retries and human review, is the real number. Most internal dashboards we have seen are still tracking the wrong metric.
Third, the teams getting value have logged tool-use traces from day one. When a tool call fails or returns a wrong answer, they have the evidence to debug it. Teams that did not instrument early are now flying blind on the very capability they migrated for.
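Instrumenting this does not need a platform. A minimal sketch, assuming your pipeline can wrap each tool call; the field names and the JSON-lines format are our choices, not a standard.

```python
import json
import time
import uuid

LOG_PATH = "tool_traces.jsonl"  # append-only; retention period is a GRC call

def log_tool_call(workflow: str, tool: str, args: dict,
                  ok: bool, error: str | None = None) -> None:
    """Append one tool-call outcome as a JSON line."""
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps({
            "id": str(uuid.uuid4()),
            "ts": time.time(),
            "workflow": workflow,
            "tool": tool,
            "args": args,
            "ok": ok,
            "error": error,
        }) + "\n")

def success_rate_by_workflow(path: str = LOG_PATH) -> dict[str, float]:
    """Tool-call success rate per workflow, straight off the log."""
    counts: dict[str, list[int]] = {}
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            pair = counts.setdefault(record["workflow"], [0, 0])
            pair[0] += int(record["ok"])
            pair[1] += 1
    return {wf: ok / total for wf, (ok, total) in counts.items()}
```

The same log answers the "success rate by workflow" question in the checklist at the end of this piece, and gives GRC the retention artefact a CPS 230 review will ask for.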
Who should care
If you sponsor an enterprise AI program, you have a budget reckoning coming. The headline pricing is not the operating cost. Make sure your finance team understands the difference between rate card and run cost.
If you build internal tools, your job for the next quarter is workload triage. Stop arguing about which model is best. Start mapping which model fits which task.
If you sit in GRC, this is the moment to update your vendor risk file. GPT-5 sits behind more agentic workflows than its predecessor. That changes your control surface. Tool-use logs become material evidence under a CPS 230 lens. Now is the time to specify retention and review obligations in your service contract.
Hype check
The 2 million token context is the most oversold part of GPT-5 enterprise. The marketing positions it as the answer to retrieval. It is not. For most production workloads, well-built RAG with a 200K context still outperforms loading everything into a 2M prompt. Long context is a complement to retrieval, not a replacement. Vendors who sold you "we don't need RAG anymore" should be politely asked to revise.
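The rate card makes the same point. At GPT-5 Standard's USD 8 per million input tokens, the arithmetic below (context sizes are illustrative) shows why stuffing the window is a per-call tax that retrieval avoids.

```python
RATE_IN = 8.0 / 1_000_000   # GPT-5 Standard, USD per input token

full_window = 2_000_000 * RATE_IN   # stuff the whole 2M window: $16.00/call
rag_context = 20_000 * RATE_IN      # retrieve ~20K relevant tokens: $0.16/call
print(f"2M prompt: ${full_window:.2f} | RAG context: ${rag_context:.2f}")
```

A hundred-fold input-cost gap per call, before you count the attention dilution the consulting team hit above.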
The tool-use story, conversely, is undersold in OpenAI's own messaging. The teams getting the most value from GPT-5 are not the ones using the long context. They are the ones whose agents finally stopped hallucinating tool calls.
What to do this week
If you have not yet migrated, do not migrate everything. Pick the one workload most painful on GPT-4o and run it on GPT-5 in shadow mode for a fortnight.
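A minimal shadow harness, assuming a generic call_model(model, request) wrapper standing in for whatever client you already run. The one rule: GPT-5's output is logged, never served.

```python
import json
import time

def shadow_run(request: dict, call_model, log_path: str = "shadow.jsonl"):
    """Serve GPT-4o as normal; run GPT-5 on the same request for the
    record only. call_model(model, request) is a stand-in for your
    existing client wrapper."""
    served = call_model("gpt-4o", request)      # users see this
    candidate = call_model("gpt-5", request)    # this is only logged
    with open(log_path, "a") as f:
        f.write(json.dumps({
            "ts": time.time(),
            "request": request,
            "served": served,
            "candidate": candidate,
        }) + "\n")
    return served  # production behaviour unchanged for the fortnight
```

In production you would fire the candidate call asynchronously so it adds no latency, but the shape is the same: two answers per request, one customer-facing, two weeks of comparison data.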
If you have migrated, instrument tool-call success rates today. If you cannot answer the question "what is our tool-call success rate by workflow", you are not yet getting GPT-5's main benefit.
If you sit in GRC or risk, ask your AI program owner two questions. What is our task-level cost on GPT-5 versus GPT-4o by workload? Where do tool-use logs sit and how long are they retained? If the answers are vague, that is the gap you need to close before your next operational resilience review.
GPT-5 is not a magic upgrade. It is a better tool for some jobs and a worse one for others. The teams winning are the ones who treat it that way.
TheAICommand. Intelligence, At Your Command.
