Inception Labs just released Mercury 2, a diffusion-based language model that breaks traditional AI speed limits while still handling real reasoning tasks.
Instead of generating text one token at a time, Mercury 2 refines entire responses in parallel, allowing it to break the latency wall and push past one thousand tokens per second in real-world use.
This architectural shift changes how inference behaves at scale, collapsing the usual tradeoff between speed, cost, and reasoning quality. With OpenAI-compatible APIs, tool calling, structured outputs, and a one hundred twenty eight thousand token context window, Mercury 2 is built for production systems where latency and reliability matter.
This launch positions diffusion as a serious alternative to autoregressive language models and signals a broader shift in how future LLMs may be designed.