The ARC Prize Foundation, co-founded by AI researcher François Chollet, has introduced a new evaluation called ARC-AGI-2 to assess the general intelligence of advanced AI models. The test presents puzzle-like problems in which an AI system must infer a visual transformation rule from a handful of demonstration examples and apply it correctly to a new input, measuring its ability to adapt to novel challenges rather than recall patterns from its training data.
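To make the task structure concrete, the sketch below follows the publicly documented ARC format: each task is a set of "train" input/output grid pairs plus "test" inputs, where grids are 2D arrays of color indices 0 through 9. The toy puzzle and one-line solver here are invented purely to illustrate the shape of the problem; real ARC-AGI-2 tasks are far harder and are distributed by the ARC Prize Foundation.

```python
# Illustrative sketch of the ARC task format. The toy task below is
# invented for illustration; real ARC-AGI-2 tasks are much harder.
from typing import List

Grid = List[List[int]]  # each cell is a color index 0-9

# Toy task: the hidden rule is "mirror each grid left-to-right".
task = {
    "train": [
        {"input": [[1, 0], [2, 3]], "output": [[0, 1], [3, 2]]},
        {"input": [[5, 5, 0], [0, 4, 4]], "output": [[0, 5, 5], [4, 4, 0]]},
    ],
    "test": [
        {"input": [[7, 0, 0], [0, 8, 0]]},
    ],
}

def solve(grid: Grid) -> Grid:
    """A solver for this toy task only: mirror each row."""
    return [list(reversed(row)) for row in grid]

# Scoring is exact match: the predicted output grid must equal the
# hidden answer cell for cell.
for pair in task["train"]:
    assert solve(pair["input"]) == pair["output"]

print(solve(task["test"][0]["input"]))  # [[0, 0, 7], [0, 8, 0]]
```

The difficulty lies not in applying a known rule, as the toy solver does, but in discovering an unfamiliar rule from just a few examples.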
Initial results indicate that leading AI models are struggling with ARC-AGI-2. For instance, OpenAI’s o1-pro and DeepSeek’s R1 achieved scores between 1% and 1.3%, while models like GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash scored around 1%. In contrast, human participants averaged a 60% success rate on the same test, underscoring a significant gap between current AI capabilities and human-level performance.
ARC-AGI-2 builds on its predecessor, ARC-AGI-1, and addresses a key shortcoming: the original test could be partially overcome by brute-force search over candidate solutions. The new test therefore emphasizes efficiency, requiring models to interpret patterns on the fly rather than rely on memorization or exhaustive computation, with the aim of providing a more accurate measure of a system's adaptability.
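To illustrate why efficiency matters as a metric, here is a hypothetical sketch of cost-aware scoring in Python. The record layout, the per-task cost budget, and all numbers are assumptions made for illustration; the ARC Prize Foundation defines its own scoring and cost-reporting rules.

```python
# Hypothetical sketch of efficiency-aware scoring: accuracy alone is
# not meaningful if it is bought with unbounded compute. The budget
# and the example figures below are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class Result:
    model: str
    tasks_solved: int
    tasks_total: int
    total_cost_usd: float  # total inference spend across all tasks

    @property
    def accuracy(self) -> float:
        return self.tasks_solved / self.tasks_total

    @property
    def cost_per_task(self) -> float:
        return self.total_cost_usd / self.tasks_total

def qualifies(r: Result, max_cost_per_task: float = 2.0) -> bool:
    """A run counts only if it stays under the per-task cost budget."""
    return r.cost_per_task <= max_cost_per_task

runs = [
    Result("brute-force-search", 40, 100, 5_000.0),  # high score, huge spend
    Result("efficient-solver", 30, 100, 80.0),
]

for r in runs:
    status = "counts" if qualifies(r) else "over budget"
    print(f"{r.model}: {r.accuracy:.0%} at ${r.cost_per_task:.2f}/task ({status})")
```

Under a scheme like this, a brute-force run that buys a higher raw score with enormous compute is disqualified, while a cheaper solver with a lower score still counts, which captures the intuition that intelligence includes doing more with less.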
The development of ARC-AGI-2 reflects a broader industry effort to create more rigorous benchmarks for evaluating AI progress. As AI systems continue to advance, establishing reliable and challenging tests like ARC-AGI-2 is crucial for understanding their true capabilities and limitations.