
Why is a model that can write coherent essays and debug complex React applications failing at football betting?
Ross Taylor, the CEO of General Reasoning and a former Meta AI researcher, points out the flaw in today's testing methodologies:
"There is so much hype about AI automation, but there's not a lot of measurement of putting AI into a long time-horizon setting."
The benchmarks we use to rate AI (like coding competitions or trivia quizzes) are static environments. They don't change once the model starts. But the real world is dynamic. Factors like weather, player morale, injuries, referee bias, and luck shift every minute.
Run AI on a static benchmark and it performs brilliantly. But as the KellyBench report suggests, when you expose these models to the chaos and complexity of real-world economics, they systematically struggle with long-term decision-making, over-betting and ignoring the risk that a run of losses wipes out their bankroll.
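The benchmark's name points at the Kelly criterion, the classic formula for sizing repeated bets. A minimal sketch below (with a hypothetical 55% win probability at even odds, not figures from the report) shows why over-aggression is ruinous: staking the Kelly fraction grows the bankroll in expectation, while staking three times that fraction shrinks it, even though every individual bet has a positive edge.

```python
import math

def kelly_fraction(p: float, b: float) -> float:
    """Optimal stake as a fraction of bankroll, for win probability p
    and net odds b (win pays b per unit staked)."""
    return p - (1 - p) / b

def expected_log_growth(f: float, p: float, b: float) -> float:
    """Expected log-growth of the bankroll per bet when staking fraction f.
    Long-run compounding is governed by the sign of this quantity."""
    return p * math.log(1 + f * b) + (1 - p) * math.log(1 - f)

# Hypothetical edge: 55% win probability at even odds (b = 1).
p, b = 0.55, 1.0
f_star = kelly_fraction(p, b)  # 0.10, i.e. stake 10% of the bankroll

print(expected_log_growth(f_star, p, b) > 0)      # Kelly stake: compounds upward
print(expected_log_growth(3 * f_star, p, b) < 0)  # 3x Kelly: bankroll decays
```

The asymmetry is the whole point: a model that correctly picks winners but stakes too aggressively still goes broke over a long horizon, which is exactly the kind of failure a static benchmark never surfaces.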
This study doesn't prove AI is useless; far from it. It excels at software engineering and short-term pattern recognition. However, it highlights a specific gap in AI capabilities: long time-horizon management.
We are currently building "bots," attaching them to LLMs, and assuming they can run whole companies. Until researchers find a way to test these models in truly dynamic, chaotic scenarios rather than static datasets, AI agents might be better at writing the betting strategy on a napkin than actually placing the bets.
For now, it seems human intuition is still a better predictor of a 3-1 scoreline than a 100-trillion-parameter model. ⚽️🔚