Zhejiang University Team Creates Visual Reasoning System That Lets Robots 'Think With Their Eyes' — 22x Faster Than Text
Researchers at Zhejiang University, in collaboration with Cornell University, the National University of Singapore, and Xidian University, have developed a visual reasoning system that enables robots to "think with their eyes" rather than processing language-based internal monologues. The system, called VisualThink-VLA, achieves a 22.8x speed improvement over text-based reasoning approaches while also delivering higher accuracy.

The fundamental insight behind VisualThink-VLA is that traditional Vision-Language-Action (VLA) models rely on text-based chain-of-thought reasoning, where the robot essentially writes an internal essay describing each step before acting. This process takes an average of 8.377 seconds per step, which is painfully slow for real-time manipulation tasks. VisualThink-VLA replaces text tokens with visual reasoning tokens, reducing processing time to just 0.367 seconds per step.

The system employs a four-channel visual evidence architecture comprising Bounding Box, Edge, Motion, and Relation channels. Rather than using all four channels indiscriminately, VisualThink-VLA features an adaptive routing mechanism that selects only 2.22 channels per step on average, balancing computational efficiency against reasoning quality.

Testing across eight benchmarks yielded a 92.63 percent average success rate, outperforming the text-based ECoT approach, which achieved 85.09 percent. The system is therefore both 22.8x faster and more accurate, a rare combination in AI systems, where speed and quality are typically traded off against each other.
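The headline speedup can be checked directly against the two per-step latencies quoted above. The short Python sketch below does only that arithmetic; the constants are the figures reported in the article, while the variable and function names are illustrative assumptions and do not come from the paper or its code.

```python
# Sanity check of the figures reported in the article.
# All names here are illustrative; only the numeric values come from the text.

TEXT_COT_SECONDS_PER_STEP = 8.377   # reported average for text-based chain-of-thought
VISUAL_SECONDS_PER_STEP = 0.367     # reported average for VisualThink-VLA
VISUALTHINK_SUCCESS_PCT = 92.63     # reported average success rate over eight benchmarks
ECOT_SUCCESS_PCT = 85.09            # reported average success rate for ECoT


def speedup(baseline_seconds: float, new_seconds: float) -> float:
    """Return how many times faster the new per-step latency is than the baseline."""
    return baseline_seconds / new_seconds


def steps_per_second(seconds_per_step: float) -> float:
    """Convert a per-step latency into a reasoning rate in steps per second."""
    return 1.0 / seconds_per_step


speedup_ratio = speedup(TEXT_COT_SECONDS_PER_STEP, VISUAL_SECONDS_PER_STEP)
success_gap_points = VISUALTHINK_SUCCESS_PCT - ECOT_SUCCESS_PCT

if __name__ == "__main__":
    print(f"Speedup: {speedup_ratio:.1f}x")
    print(f"Text-based rate: {steps_per_second(TEXT_COT_SECONDS_PER_STEP):.2f} steps/s")
    print(f"Visual rate: {steps_per_second(VISUAL_SECONDS_PER_STEP):.2f} steps/s")
    print(f"Success-rate gap: {success_gap_points:.2f} percentage points")
```

Under these numbers the ratio rounds to the reported 22.8x, and the accuracy difference is about 7.5 percentage points. In practical terms, the text-based approach completes roughly one reasoning step every eight seconds, while the visual approach completes between two and three steps per second.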
The researchers validated the system on a PIPER NERO 7-degree-of-freedom robotic arm, demonstrating success in multi-object pick-and-place operations, relation-sensitive placement where the spatial relationships between objects matter, contact-sensitive reorientation, and two-stage compound tasks that require sequential reasoning. The training data, dubbed "VisualEvidence-Set," contains 754,700 instructions covering diverse manipulation scenarios.

A key design advantage is that VisualThink-VLA operates as a plug-and-play module for existing VLA systems. This means robots currently using text-based reasoning can be upgraded without entirely replacing their underlying architecture. The paper is available on arXiv under identifier 2605.30011.

The work represents a shift from "write an essay then act" to "see-think-act," moving robot reasoning closer to how humans naturally operate: processing visual information directly rather than translating it through language. As robots are deployed in increasingly dynamic environments, the ability to reason visually at near-instant speeds could be a critical enabler for widespread adoption.
📌 Source
This summary was automatically compiled from Pandaily. See the original article for the full story.