Why Weiboโs tiny VibeThinker-3B has the AI world arguing over benchmarks again
On Sunday, a team of nine researchers at Sina Weibo โ the Chinese social media giant better known for its microblogging platform than for cutting-edge artificial intelligence โ quietly posted a 14-page technical report to arXiv that sent shockwaves through the AI research communi
On Sunday, a team of nine researchers at Sina Weibo โ the Chinese social media giant better known for its microblogging platform than for cutting-edge artificial intelligence โ quietly posted a 14-page technical report to arXiv that sent shockwaves through the AI research community. Their claim: a language model with just 3 billion parameters can match or exceed the reasoning performance of flagship systems from Google DeepMind , OpenAI , Anthropic , and DeepSeek that are hundreds of times larger. The model, called VibeThinker-3B , scored 94.3 on AIME 2026 โ the American Invitational Mathematics Examination, one of the most demanding standardized math competitions in the world. That figure places it alongside DeepSeek V3.2 , a model with 671 billion parameters, and ahead of Gemini 3 Pro , Google's high-performance flagship reasoning system, which scored 91.7. With a test-time scaling technique the team calls Claim-Level Reliability Assessment, the score climbs to 97.1, edging past virt
This report comes from VentureBeat. The story centres on Why Weiboโs tiny VibeThinker-3B has the AI world arguing over benchmarks again. Full coverage and background context is available at the original source. Readers seeking more detail on this developing topic are encouraged to follow updates from VentureBeat and related outlets covering this beat.

