OpenAI and Paradigm Launch EVMbench to Test AI’s Smart Contract Security Skills

OpenAI Unveils EVMbench: A Benchmark for AI in Smart Contract Security

OpenAI, in partnership with crypto investment firm Paradigm, has introduced EVMbench, a comprehensive benchmark designed to assess the proficiency of AI agents in identifying, rectifying, and exploiting critical vulnerabilities within smart contracts. This initiative represents a significant advancement in evaluating AI capabilities in environments where smart contracts manage assets exceeding $100 billion.

Overview of EVMbench

EVMbench is constructed from a curated selection of 120 vulnerabilities sourced from 40 security audits, many of which originate from open code audit competitions on platforms like Code4rena. Additionally, the benchmark includes scenarios from the security auditing process of the Tempo blockchain—a Layer 1 network optimized for high-throughput stablecoin transactions—thereby extending its applicability to payment-focused smart contract code.

Evaluation Modes

EVMbench assesses AI agents across three distinct modes, each corresponding to a critical phase in the smart contract security lifecycle:

1. Detect Mode: Agents audit smart contract repositories and are scored on how many of the known vulnerabilities they recall, weighted by the audit rewards associated with each finding.

2. Patch Mode: Agents modify vulnerable contracts to eliminate flaws while maintaining intended functionality, with effectiveness verified through automated tests and exploit checks.

3. Exploit Mode: Agents execute comprehensive fund-draining attacks against deployed contracts within a controlled blockchain environment, with performance graded via transaction replay and on-chain verification.
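The three modes can be pictured as distinct scoring rules over a single task result. The sketch below is purely illustrative (EVMbench's actual harness is Rust-based, and its real scoring weights findings by audit rewards); all class and field names here are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum

class Mode(Enum):
    DETECT = "detect"    # recall known vulnerabilities from an audit
    PATCH = "patch"      # remove the flaw while preserving behavior
    EXPLOIT = "exploit"  # drain funds on a sandboxed local chain

@dataclass
class TaskResult:
    mode: Mode
    found: set = field(default_factory=set)   # vulnerability IDs the agent reported
    known: set = field(default_factory=set)   # ground-truth IDs from the audit
    tests_pass: bool = False       # Patch: functional test suite still green
    exploit_blocked: bool = False  # Patch: the known exploit no longer works
    funds_drained: bool = False    # Exploit: replayed attack drained the funds

def score(r: TaskResult) -> float:
    """Toy per-task score in [0, 1] for each EVMbench-style mode."""
    if r.mode is Mode.DETECT:
        # Unweighted recall over the audit's known findings.
        return len(r.found & r.known) / len(r.known) if r.known else 0.0
    if r.mode is Mode.PATCH:
        # A patch counts only if it blocks the exploit AND keeps tests passing.
        return 1.0 if (r.tests_pass and r.exploit_blocked) else 0.0
    # Exploit mode is binary: did the replayed transaction drain the funds?
    return 1.0 if r.funds_drained else 0.0
```

This framing also suggests why the modes differ in difficulty: Exploit mode has a single binary success signal an agent can iterate against, while Detect and Patch scores depend on coverage and on preserving behavior.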

To ensure reproducibility, OpenAI developed a Rust-based framework that deploys contracts deterministically and restricts unsafe Remote Procedure Call (RPC) methods. All exploit tasks are conducted in an isolated local Anvil environment, avoiding live network interactions.
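One way to restrict unsafe RPC methods is to screen each JSON-RPC request before it reaches the local node. The guard below is a minimal Python sketch of that idea, not OpenAI's Rust implementation; the blocked prefixes and allowed exceptions are assumptions chosen for illustration (Anvil and Hardhat expose state-tampering methods under the `anvil_`, `evm_`, and `hardhat_` namespaces).

```python
# Hypothetical guard: reject state-tampering JSON-RPC calls before they
# reach a local Anvil node, so an agent cannot mint itself a balance or
# rewrite contract code instead of finding a real exploit.
UNSAFE_PREFIXES = ("anvil_", "evm_", "hardhat_")   # node-control namespaces
HARNESS_EXCEPTIONS = {"evm_snapshot", "evm_revert"}  # reserved for the harness

def is_allowed(method: str) -> bool:
    """Return True if an agent may invoke this JSON-RPC method."""
    if method in HARNESS_EXCEPTIONS:
        return True
    return not method.startswith(UNSAFE_PREFIXES)
```

Ordinary read/write methods such as `eth_call` or `eth_sendRawTransaction` pass through, while cheat-code methods like `anvil_setBalance` are refused, keeping exploit runs honest and reproducible.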

Performance Insights

Initial evaluations using EVMbench have revealed notable behavioral differences among AI models. In Exploit Mode, GPT‑5.3‑Codex achieved a score of 72.2%, a significant improvement over the 31.9% recorded by GPT‑5 six months earlier. Agents consistently performed best in Exploit Mode, where the objective is unambiguous: drain funds and iterate until the attack succeeds. Detect and Patch Modes proved harder; agents often stopped after identifying a single vulnerability rather than completing a comprehensive audit, and struggled to eliminate subtle flaws without breaking existing contract functionality.

Acknowledging Limitations

OpenAI recognizes that EVMbench does not fully encapsulate the complexities of real-world smart contract security. The current grading system may not effectively distinguish between actual vulnerabilities and false positives when agents identify issues beyond the baseline established by human auditors.

Commitment to Cybersecurity Research

In conjunction with the release of EVMbench, OpenAI has committed $10 million in API credits through its Cybersecurity Grant Program to accelerate defensive security research, particularly focusing on open-source software and critical infrastructure. The company also announced the expansion of Aardvark, its security research agent, through a private beta program. EVMbench’s tasks, tools, and evaluation framework have been made publicly available to support ongoing research into AI-driven cybersecurity capabilities.