AI Agents in the Workplace: A Reality Check on Their Readiness
In 2024, Microsoft CEO Satya Nadella forecast a future in which artificial intelligence (AI) would revolutionize knowledge work, potentially automating roles traditionally held by professionals such as lawyers, investment bankers, librarians, accountants, and IT specialists. Despite significant advances in AI, that transformation of white-collar work has been slower to arrive than expected. While AI models have demonstrated proficiency in tasks like in-depth research and strategic planning, their integration into everyday professional workflows remains limited.
A recent study by Mercor, a leading training-data company, sheds light on this phenomenon. The research evaluates the performance of top AI models in executing real-world tasks from fields like consulting, investment banking, and law. The outcome is the APEX-Agents benchmark, which reveals that current AI systems are not yet up to par. When presented with queries from actual professionals, even the most advanced models managed to answer correctly only about 25% of the time. More often than not, the responses were incorrect or absent altogether.
Brendan Foody, CEO of Mercor and a contributor to the study, identified a significant challenge for these AI models: gathering and synthesizing information from multiple sources, a fundamental aspect of many professional roles. Foody explained, “One of the big changes in this benchmark is that we built out the entire environment, modeled after real professional services. The way we do our jobs isn’t with one individual giving us all the context in one place. In real life, you’re operating across Slack and Google Drive and all these other tools.” This kind of multi-domain reasoning remains a hurdle for many AI agents.
The benchmark scenarios were crafted based on inputs from professionals within Mercor’s expert network, who not only formulated the queries but also defined the criteria for successful responses. Reviewing these publicly available questions offers insight into the complexity of the tasks involved.
For instance, one scenario involves a legal professional seeking guidance on whether a client’s data-sharing practices comply with the European Union’s General Data Protection Regulation (GDPR). Providing an accurate answer necessitates a thorough understanding of the client’s internal policies and the relevant EU privacy laws. Such intricate tasks can challenge even seasoned human professionals, underscoring the high standards set for AI performance in these contexts.
While OpenAI’s GDPval benchmark assesses general knowledge across various professions, the APEX-Agents benchmark focuses on a model’s capacity to perform sustained tasks within specific high-value professions. This makes it a more rigorous test for AI models and ties it more directly to their potential to automate particular jobs.
The initial results indicate that AI models are not yet ready to replace professionals in fields like investment banking. However, some models are making notable progress. Google’s Gemini 3 Flash led the group with a 24% accuracy rate on first attempts, closely followed by OpenAI’s GPT-5.2 at 23%. Other models, including Opus 4.5, Gemini 3 Pro, and GPT-5, scored around 18%.
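To make the arithmetic behind a first-attempt accuracy figure concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that each task is graded against a list of expert-defined pass/fail criteria and that a task counts as correct only when every criterion is met on the first attempt. The Attempt class, the task IDs, and the sample data are all hypothetical; the study's actual grading procedure may differ.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One model's first attempt at one task (hypothetical record format)."""
    task_id: str
    # One flag per expert-defined criterion; an empty list means no answer was given.
    criteria_met: list[bool]


def first_attempt_accuracy(attempts: list[Attempt]) -> float:
    """Return the fraction of tasks whose first attempt meets every criterion."""
    if not attempts:
        return 0.0
    # Assumption: a missing answer (empty criteria list) counts as a failure.
    passed = sum(1 for a in attempts if a.criteria_met and all(a.criteria_met))
    return passed / len(attempts)


# Made-up illustrative data, not results from the benchmark.
attempts = [
    Attempt("law-001", [True, True, True]),      # meets every criterion
    Attempt("banking-001", [True, False]),       # misses one criterion
    Attempt("consulting-001", []),               # no answer returned
    Attempt("law-002", [False, False, True]),    # mostly wrong
]

print(f"{first_attempt_accuracy(attempts):.0%}")  # prints "25%"
```

Under this all-or-nothing assumption, a response that is mostly right still counts as a miss, which is one reason scores on agentic benchmarks can look low next to partial-credit measures.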
Despite these modest scores, the AI industry has a history of rapidly overcoming challenging benchmarks. Now that the APEX-Agents test is publicly available, it stands as an open challenge for AI developers aiming to improve their models’ performance. Foody remains optimistic about the pace of improvement, stating, “It’s improving really quickly. Right now it’s fair to say it’s like an intern that gets it right a quarter of the time, but last year it was the intern that gets it right five or 10% of the time. That kind of improvement year after year can have an impact so quickly.”
In summary, while AI agents have made significant strides, they are not yet fully prepared to take over complex professional tasks. The APEX-Agents benchmark highlights the current limitations and provides a roadmap for future advancements. As AI continues to evolve, it holds the promise of transforming the workplace, but for now, human expertise remains indispensable in many professional domains.