OpenAI deploys Codex on Cerebras chips

By auto-post.io
02-20-2026
7 min read

OpenAI has begun serving a new Codex experience on specialized inference hardware from Cerebras, marking a notable shift in how cutting-edge developer tools can be delivered at interactive speed. On Feb 12, 2026, OpenAI confirmed a research preview called GPT‑5.3‑Codex‑Spark, describing it as optimized to feel “near‑instant” while running on Cerebras’ Wafer Scale Engine 3 (WSE‑3).

The headline promise is throughput: OpenAI says Codex‑Spark can deliver more than 1,000 tokens per second, a figure echoed by Cerebras and widely reported by outlets like TechCrunch and Tom’s Hardware. Beyond raw speed, OpenAI also highlighted serving improvements, such as persistent WebSockets and substantial overhead reductions, aimed at making coding collaboration feel less like batch inference and more like real-time interaction.

1) What OpenAI actually launched: GPT‑5.3‑Codex‑Spark

OpenAI’s announcement positions Codex‑Spark as a research preview rather than a full, universal replacement for existing Codex experiences. The company explicitly framed it as the “first milestone” in the OpenAI↔Cerebras partnership, served on the WSE‑3 platform.

Technically, OpenAI states that Spark ships with a 128k context window and is text-only at launch. That combination is aimed squarely at large codebases and long conversational threads, while keeping the inference path streamlined for speed and responsiveness.

The naming, GPT‑5.3‑Codex‑Spark, also signals intent: this is a Codex-specialized model variant in the GPT‑5.3 family, tuned for a particular user experience. OpenAI describes it as optimized to feel “near‑instant,” with “more than 1000 tokens per second,” strongly implying that the product goal is interactive iteration rather than maximum deliberation time.

2) Why Cerebras WSE‑3 matters for Codex

Cerebras’ Wafer Scale Engine approach differs from conventional GPU deployments by using a single, wafer-scale chip system designed for large neural network workloads. In the Codex‑Spark rollout, OpenAI confirmed Spark “runs on Cerebras’ Wafer Scale Engine 3,” and Cerebras separately emphasized that it is “powered by Cerebras” and “runs at over 1,000 tokens/s.”

For developers, the practical meaning of this architecture is less about the novelty of the silicon and more about what it enables at the product layer: latency that feels immediate, and throughput that can keep up with rapid prompts, autocompletions, refactors, and iterative debugging sessions.

Multiple reports framed this as a meaningful supply-chain and deployment milestone. Tom’s Hardware characterized it as OpenAI’s first production deployment away from Nvidia hardware, while Swedish outlet Omni (citing Bloomberg) also described it as OpenAI’s first model running on Cerebras chips, underscoring the broader narrative that OpenAI is expanding beyond a single hardware ecosystem.

3) The speed story: >1,000 tokens per second and “near-instant” feel

The most repeated metric in coverage is throughput. OpenAI’s own post claims “delivering more than 1000 tokens per second,” and Cerebras’ announcement similarly says Codex‑Spark runs at “over 1,000 tokens/s.” Forbes likewise spotlighted 1,000 tokens per second as a marquee number for the launch.

Throughput alone doesn’t guarantee a great developer experience, but it can fundamentally change how tools feel, especially when coding workflows involve short, frequent exchanges. In those settings, being able to generate and stream tokens extremely quickly can make the difference between a “waiting for inference” loop and a “conversing with a collaborator” loop.
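The difference between those two loops can be made concrete with back-of-envelope arithmetic. The 1,000 tokens/s figure comes from OpenAI’s and Cerebras’ announcements; the response size and the slower baseline rate below are illustrative assumptions, not reported numbers.

```python
# Sketch: how decode throughput translates into perceived wait time.
# Only the 1,000 tokens/s rate is from the announcements; the 400-token
# response and the 100 tokens/s baseline are assumptions for illustration.

def stream_time_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream a full response at a steady decode rate."""
    return num_tokens / tokens_per_second

# A short refactoring reply of ~400 tokens:
fast = stream_time_seconds(400, 1000)  # at the reported Spark-class rate
slow = stream_time_seconds(400, 100)   # at an assumed slower baseline

print(f"Spark-rate stream: {fast:.1f}s, baseline stream: {slow:.1f}s")
```

At these assumed sizes, a full reply arrives in under half a second instead of several seconds, which is roughly the threshold where a tool stops feeling like a queued request.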

TechCrunch described Spark as a lighter, faster inference version designed for “rapid iteration,” and reported it will be powered by WSE‑3. That description aligns with OpenAI’s own positioning: Spark is intended to emphasize responsiveness, turning Codex into something that behaves more like a real-time system than a queued request.

4) Serving and networking upgrades: where the latency reductions come from

OpenAI didn’t attribute the improved experience solely to hardware. The company also detailed changes to the serving stack, including the use of a persistent WebSocket connection, an architectural choice that reduces the repeated setup cost that can accumulate in interactive sessions.

In the same Feb 12, 2026 post, OpenAI claimed it reduced overhead per client/server roundtrip by 80%, per-token overhead by 30%, and time-to-first-token by 50%. These are product-facing metrics: they reflect the parts of the system users actually feel, such as the delay before the first streamed words appear.

Together, these networking and serving optimizations help explain how Codex‑Spark targets that “near‑instant” sensation. Even with extremely fast inference hardware, poorly optimized roundtrips and token streaming can blunt perceived performance, so the rollout pairs WSE‑3 throughput with software-layer latency cuts.
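To see why these software-layer cuts compound in an interactive session, consider a toy latency model that applies the quoted reductions (80% less per-roundtrip overhead, 30% less per-token overhead, 50% lower time-to-first-token) to a multi-exchange session. The baseline millisecond values below are invented for illustration; only the percentage reductions come from OpenAI’s post.

```python
# Toy model of overhead-dominated latency across an interactive session.
# Baseline millisecond figures are assumptions; the 80% / 30% / 50%
# reductions are the figures OpenAI quoted for its serving-stack changes.

def session_latency_ms(exchanges: int, tokens_per_exchange: int,
                       roundtrip_overhead_ms: float,
                       per_token_overhead_ms: float,
                       ttft_ms: float) -> float:
    """Total overhead across a session of repeated prompt/response exchanges."""
    per_exchange = (ttft_ms + roundtrip_overhead_ms
                    + tokens_per_exchange * per_token_overhead_ms)
    return exchanges * per_exchange

# Assumed baseline: 10 exchanges of ~200 tokens each.
baseline = session_latency_ms(10, 200, roundtrip_overhead_ms=100,
                              per_token_overhead_ms=1.0, ttft_ms=400)
# Apply the quoted reductions to the same assumed baseline.
improved = session_latency_ms(10, 200, roundtrip_overhead_ms=100 * 0.2,  # -80%
                              per_token_overhead_ms=1.0 * 0.7,           # -30%
                              ttft_ms=400 * 0.5)                         # -50%

print(f"baseline: {baseline:.0f} ms, improved: {improved:.0f} ms")
```

The point of the sketch is that per-exchange overhead multiplies across a session: shaving fixed costs matters most precisely in the short, frequent exchanges that coding workflows generate.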

5) Where Codex‑Spark is available (and who gets it first)

OpenAI’s rollout is intentionally scoped. Codex‑Spark is being introduced as a research preview for ChatGPT Pro users, accessible via the Codex app, the CLI, and a VS Code extension. TechCrunch echoed this initial availability, emphasizing that the Pro tier is the first audience.

OpenAI also noted that Spark uses separate rate limits from other experiences, implying capacity planning and traffic shaping specific to this new serving path. That separation is important in early deployments, where a product team may be actively tuning performance, reliability, and cost.

API access is more restricted: OpenAI described limited API availability to design partners. This kind of phased distribution suggests OpenAI is gathering feedback on real-world coding workloads before broadening access, especially given the model’s speed-focused behavior and new hardware footprint.

6) A mixed hardware strategy: GPUs remain foundational, Cerebras complements

OpenAI explicitly framed Codex‑Spark as part of a broader hardware strategy rather than a clean break. In its launch post, OpenAI wrote that “GPUs remain foundational,” while “Cerebras complements that foundation,” and added that “GPUs and Cerebras can be combined for single workloads.”

This matters because the industry often treats hardware choices as zero-sum. OpenAI’s language suggests a portfolio approach: GPUs for generality and broad ecosystem support, and wafer-scale systems where they deliver compelling advantages, such as ultra-high token throughput and interactive developer tooling.

Several reports interpreted this moment as infrastructure diversification. The Financial Times reported a multiyear deal, described as $10B for 750 MW through 2028, as part of expanding supply beyond Nvidia. Even if product deployments remain heterogeneous, Codex‑Spark makes that strategy visible to end users for the first time.

7) The partnership scale: 750 MW and a staged rollout starting in 2026

Cerebras has described the partnership in unusually large infrastructure terms. In a Jan 14, 2026 post, the company said OpenAI and Cerebras signed a multi-year agreement to deploy “750 megawatts” of wafer-scale systems for OpenAI customers, rolling out in stages beginning in 2026.

That figure is significant because it suggests the Codex‑Spark milestone is not a one-off experiment but part of a capacity roadmap. Large-scale inference capacity can translate to broader availability, better latency at peak times, and the ability to support more interactive sessions concurrently, assuming the serving software and product packaging keep pace.

External reporting adds more context. The Financial Times pegged the arrangement at $10B and extending through 2028, characterizing it as part of OpenAI’s efforts to diversify infrastructure supply. Regardless of exact financial terms, the public messaging from both companies indicates a long-horizon commitment rather than a short pilot.

8) Product direction: two complementary Codex modes

Speed is not the only goal OpenAI outlined. The company described Codex‑Spark as “the first step toward a Codex that works in two complementary modes,” spanning “real-time collaboration” and “long-running tasks.” That framing implies a future Codex experience that can switch between immediacy and depth depending on the job.

Codex‑Spark clearly targets the “real-time” end of that spectrum: fast streaming, reduced overhead, and an experience designed for rapid iteration. If you think of pair programming, quick code review, or tight feedback loops during refactoring, the “near-instant” emphasis directly supports those behaviors.

At the same time, OpenAI’s mention of long-running tasks suggests a complementary workflow where the model can take more time, perhaps to run extended reasoning, multi-file changes, or multi-step plans. The two-mode idea hints that different serving stacks or hardware profiles (including combined GPU+Cerebras workloads) could be orchestrated behind the scenes to match user intent.

Codex‑Spark’s debut shows how model capability, serving architecture, and hardware choice can converge into a noticeably different product feel. With OpenAI claiming major improvements to time-to-first-token and roundtrip overhead, plus throughput above 1,000 tokens per second on Cerebras WSE‑3, the launch is as much about interaction design as it is about silicon.

Whether this becomes the new default for code-focused AI will depend on reliability, cost, and how well OpenAI executes the broader vision of two complementary Codex modes. For now, the research preview for ChatGPT Pro users is a concrete first step: OpenAI has not only talked about diversifying inference infrastructure, it has shipped a developer-facing experience that puts Cerebras chips directly in the loop.
