GPT-5 Flops in Business CEO Simulation: Reality Check for AGI Hype
Everyone's talking about artificial general intelligence as if it's right around the corner. But a new benchmark just dropped a reality check: GPT-5, OpenAI's latest powerhouse, isn't close to running a business like a human. The results aren't subtle: humans outperformed GPT-5 by a factor of 9.8 in a RollerCoaster Tycoon-style simulation. That's a huge gap, and the details are telling.
Inside the MAPs Benchmark: Theme Parks, Not Theory
The MAPs benchmark is about more than just answering trivia or chatting. It's a real test of business acumen, challenging AI agents to operate a virtual theme park. Think maintenance schedules, inventory management, planning for slow seasons, and making sure the park doesn't go bankrupt. In other words, all the messy stuff that comes with running a business in the real world.
The results? GPT-5 failed at almost every practical skill:
- Maintenance: Rides broke down, repairs lagged, and guest satisfaction tanked.
- Inventory: Food stands ran out, shops overstocked, and waste piled up.
- Planning: No coherent long-term strategy, just reactive, short-sighted moves.
- Causal Reasoning: The model struggled to link actions and outcomes, leading to random decisions.
Human participants, by contrast, juggled these variables with ease. The average human score was nearly ten times higher than GPT-5's. That's not a rounding error; that's a wall.
What's Going Wrong for LLMs?
Large language models like GPT-5 excel at tasks with clear, structured goals: writing code, summarizing articles, or answering questions. But business simulations are messy. They demand continuous planning, adapting to uncertainty, and connecting cause and effect over time. These are the exact areas where GPT-5 broke down in the MAPs paper.
Instead of acting like a savvy CEO, GPT-5 got lost in the weeds: fixating on short-term problems, missing the big picture, and failing to keep the business afloat. It couldn't maintain an effective feedback loop or adapt strategy as conditions changed. The model's "intelligence" just isn't robust enough for the dynamic, high-stakes world of business management.
Takeaways: AGI Isn't Here (Yet)
The MAPs benchmark is a wake-up call for anyone betting on AGI in 2024. There's no magic leap happening in language models. Running real businesses, even virtual ones, is still a uniquely human skill. For AI founders, this means plenty of headroom for building products that combine human expertise with narrow AI tools. For researchers, it's a signal to double down on benchmarks that reflect real-world messiness, not just test scores.
Curious how you'd stack up? Try the simulation yourself at MAPs. For a deep dive into the research, check the project's paper.