Why Weibo’s tiny VibeThinker-3B has the AI world arguing over benchmarks again


On Sunday, a team of nine researchers at Sina Weibo, the Chinese social media giant better known for its microblogging platform than for cutting-edge artificial intelligence, quietly posted a 14-page technical report to arXiv that sent shockwaves through the AI research community. Their claim: a language model with just 3 billion parameters can match or exceed the reasoning performance of flagship systems from Google DeepMind, OpenAI, Anthropic, and DeepSeek that are hundreds of times larger.

The model, called VibeThinker-3B, scored 94.3 on AIME 2026 (the American Invitational Mathematics Examination, one of the most demanding standardized math competitions in the world). That figure places it alongside DeepSeek V3.2, a model with 671 billion parameters, and ahead of Gemini 3 Pro, Google's high-performance flagship reasoning system, which scored 91.7. With a test-time scaling technique the team calls Claim-Level Reliability Assessment, the score climbs to 97.1, edging past virtually every system in the public record.

Within hours of publication, the paper had drawn 62 upvotes on Hugging Face's daily papers feed, the model repository had accumulated 130 likes, and the GitHub repository had reached 685 stars. But the reaction on social media was not uniformly celebratory. It was, in many cases, deeply skeptical.

"WHAT THE HELL is happening in AI?" wrote the user @orcus108 on X, in a post that accumulated over 161,000 views. "A 3B parameter model just put up coding benchmark scores in the same league as Claude Opus 4.5… I genuinely don't know if this is a breakthrough or if the benchmarks are broken."

That tension, between genuine scientific advancement and the growing suspicion that AI benchmarks have become gameable to the point of meaninglessness, sits at the heart of the VibeThinker-3B story.
And the answer matters enormously, not just for academic bragging rights, but for the multibillion-dollar question of whether the AI industry's relentless push toward ever-larger models is the only path to intelligence.

Benchmark scores that defy the scaling laws of modern AI

The results reported in the technical report are, by any conventional standard, extraordinary. On the mathematics side, VibeThinker-3B achieved 91.4 on AIME 2025, 94.3 on AIME 2026, 89.3 on HMMT 2025 (the Harvard-MIT Mathematics Tournament), 93.8 on BruMO 2025 (the Brown University Math Olympiad), and 76.4 on IMO-AnswerBench, a benchmark comprising 400 problems at the level of the International Mathematical Olympiad. In coding, it posted an 80.2 Pass@1 on LiveCodeBench v6, a benchmark designed to test executable code generation, and achieved a 96.1 percent acceptance rate on unseen LeetCode weekly and biweekly contests from late April through late May 2026. On instruction following, it scored 93.4 on IFEval.

To put the parameter disparity in perspective: DeepSeek V3.2 has 671 billion parameters, roughly 224 times the size of VibeThinker-3B. GLM-5, from Zhipu AI, has 744 billion parameters. Kimi K2.5, from Moonshot AI, exceeds 1 trillion. VibeThinker-3B's 3 billion parameters could run on a consumer laptop.

The researchers frame this result not as an anomaly but as evidence for a broader theoretical claim. They introduce what they call the "Parametric Compression-Coverage Hypothesis," which argues that different types of AI capability have fundamentally different relationships to model size. Verifiable reasoning (the kind tested by math competitions and coding challenges, where answers can be definitively checked) is what the paper calls a "parameter-dense" capability: one that can be compressed into a compact core. Open-domain knowledge, by contrast, is "parameter-expansive," requiring broad coverage across facts, concepts, and edge cases that inherently demands more parameters.
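The parameter comparison quoted above can be checked with quick arithmetic. The sketch below is a back-of-envelope illustration, not anything taken from the technical report: the function names are hypothetical, the bytes-per-parameter figures are the usual values for each numeric precision, and the 4-bit row assumes a quantized copy of the weights exists, which the article does not state. It counts weight storage only and ignores activations and the KV cache.

```python
# Back-of-envelope arithmetic for the parameter counts quoted in the article.
# Assumption: weight storage only; activations and KV cache are ignored.

# Parameter counts in billions, as reported in the article.
PARAMS_BILLIONS = {
    "VibeThinker-3B": 3,
    "DeepSeek V3.2": 671,
    "GLM-5": 744,
}

# Standard storage cost per parameter at each precision.
# The int4 entry assumes a quantized checkpoint (hypothetical here).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16/bf16": 2.0, "int4": 0.5}


def size_ratio(larger: str, smaller: str = "VibeThinker-3B") -> float:
    """How many times more parameters `larger` has than `smaller`."""
    return PARAMS_BILLIONS[larger] / PARAMS_BILLIONS[smaller]


def weight_gigabytes(model: str, precision: str) -> float:
    """Approximate weight size in GB (10^9 bytes).

    Billions of parameters times bytes per parameter gives GB directly.
    """
    return PARAMS_BILLIONS[model] * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for name in ("DeepSeek V3.2", "GLM-5"):
        print(f"{name}: about {size_ratio(name):.0f}x VibeThinker-3B")
    for precision in BYTES_PER_PARAM:
        gb = weight_gigabytes("VibeThinker-3B", precision)
        print(f"VibeThinker-3B weights at {precision}: about {gb:.1f} GB")
```

At 16-bit precision the weights alone come to about 6 GB, which is consistent with the laptop claim, though actual memory use during inference would be higher.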
The paper acknowledges the distinction between parameter-dense and parameter-expansive capabilities directly. On GPQA-Diamond, a graduate-level science knowledge benchmark, VibeThinker-3B scored just 70.2, well behind the 91.9 achieved by Gemini 3 Pro and the 87.0 scored by Claude Opus 4.5. The authors write that this gap "is consistent with our claim rather than a contradiction to it: the main finding is not that a 3B model has fully replaced leading general-purpose models, but that a small model can reach first-tier performance on many verifiable reasoning tasks."

Inside the four-stage training pipeline that powers a tiny reasoning engine

VibeThinker-3B is not built from scratch. It is post-trained on top of Qwen2.5-Coder-3B, a compact foundation model from Alibaba's Qwen team, through what the Weibo AI researchers call the "Spectrum-to-Signal Principle," a multi-stage pipeline first introduced in the team's earlier VibeThinker-1.5B work in November 2025.

The training unfolds in four major phases. The first is a two-stage supervised fine-tuning process that uses curriculum learning: the model first train
