Google's DiffusionGemma generates 256 tokens in parallel and self-corrects as it goes

🤖 Yapay Zekâ 📰 VentureBeat 🕐 3 saat önce

GenAI image generators like Stable Diffusion do not draw a picture pixel by pixel from left to right. They start with noise and iteratively refine the entire image in parallel until it converges, in a process known as diffusion. For years, applying that same principle to text generation had remained out of reach at scale. Standard language models work like a typewriter: one token at a time, left to right, with no ability to revise a committed output. That pattern works in the cloud, where batch sizes keep GPUs saturated. For local inference or low-concurrency deployments, the GPU is idle most of the time. Google's DiffusionGemma, released this week, is an open source experimental model that applies diffusion to text generation at production scale. Built on the Gemma 4 backbone and released under the Apache 2.0 license, it is the first diffusion language model natively supported in the open source vLLM inference platform. It generates a 256-token block in parallel rather than sequentially, with every token position attending to every other. Google says DiffusionGemma generates text up to 4x faster than standard models on GPUs. At batch size 1 on a single Nvidia H100, the FP8 version reaches 1,008 tokens per second. On H200, it hits 1,288 — roughly six times a standard autoregressive baseline, according to vLLM benchmark results published today. Despite the speed gains, Google did not oversell the release. The company's launch post acknowledged directly that DiffusionGemma's overall output quality is lower than standard Gemma 4, adding "For applications that demand maximum quality, we recommend deploying standard Gemma 4." What DiffusionGemma does DiffusionGemma does not generate tokens in order. It starts with a block of 256 random placeholder tokens, effectively a blank canvas, and runs multiple refinement passes over the entire block at once. On each pass, it evaluates every position and locks in the ones it is most confident about. Uncertain positions get randomized and reconsidered on the next pass, with the model using what it resolved in the previous round to inform the next attempt. The block converges progressively until enough positions stabilize to anchor the rest. Two things follow from that architecture. Self-correction. An autoregressive model that commits to a wrong token is stuck with it, because subsequent tokens are already conditioned on the mistake. DiffusionGemma can identify low-confidence positions and re-evaluate them on the next pass. Bidirectional context. Every position attends to every other position in the block simultaneously, including tokens that appear later in the sequence. That makes the model structurally better suited to constrained generation tasks where left-to-right generation fails. Google demonstrated both properties with a fine-tuned Sudoku solver. The base model solved zero puzzles. After fine-tuning on a Sudoku dataset, it reached an 80% success rate and converged in 12 denoising steps rather than 48. The efficiency gain came directly from the model's ability to self-correct and stop early. How it was built DiffusionGemma runs as a 26B Mixture of Experts model that activates only 3.8B parameters during inference. Quantized, it fits within 18GB VRAM on consumer hardware including the Nvidia RTX 4090 and 5090. Google and NVIDIA also optimized for enterprise Hopper and Blackwell servers using NVFP4 kernels. The vLLM integration required new work because DiffusionGemma does not fit the standard serving model. A typical vLLM batch applies the same attention type to every request. DiffusionGemma requests alternate between causal and bidirectional attention as they cycle through prompt reading, canvas refinement and block commit. The team built per-request attention switching into both the Triton and FlashAttention 4 backends and reused the existing speculative decoding path for the refinement loop. The new ModelState interface the team built for this integration is designed to support additional diffusion models in vLLM as they emerge. Where the speed wins and where it does not DiffusionGemma's speed advantage is real but conditional. Where it applies depends entirely on deployment context. The numbers. At batch size 1 on a single H100, vLLM's published benchmarks put the FP8 model at roughly five times a standard autoregressive baseline. On H200, roughly six times. Those peak figures reflect optimal conditions: single user, dedicated hardware, FP8 quantization. Where it wins. Local inference, single-user applications and low-concurrency serving. In those conditions the GPU has spare compute and memory bandwidth is the bottleneck. DiffusionGemma's parallel block generation fills that gap. Where it does not. High-throughput cloud serving. When a server is batching hundreds of concurrent requests, autoregressive models already saturate available compute and DiffusionGemma's

#llm#openai#research#experiment#hardware

📌 Kaynak

Bu özet VentureBeat kaynağından otomatik derlenmiştir. Tamamı için orijinal habere gidin.

Orijinal haberi oku →

📱

News AI World — Mobil uygulama

Bu haberleri 45 dilde, anlık çeviriyle cebinde. Erken erişim için Gmail adresini bırak.

← Tüm haberlere dön

Google's DiffusionGemma generates 256 tokens in parallel and self-corrects as it goes

📌 Kaynak

📰 Önerilen haberler