Cloudflare’s March 2025 Service Disruption: A Detailed Analysis

On March 21, 2025, Cloudflare experienced a significant service disruption that affected multiple products and services globally for just over an hour. During the incident, which lasted from 21:38 UTC to 22:45 UTC, all write operations to the R2 object storage service failed and approximately 35% of read operations failed. The disruption also impacted other Cloudflare services, including Cache Reserve, Images, Log Delivery, Stream, and Vectorize.

Root Cause of the Incident

The outage was traced to an error during a routine credential rotation. Cloudflare’s engineering team inadvertently deployed the new authentication credentials to a non-production environment instead of the production environment. When the old credentials were subsequently deleted from the storage infrastructure, the production R2 Gateway service lost authentication access to its backend storage, leading to widespread service failures.

Cloudflare acknowledged both the human error and the lack of visibility into which credentials the Gateway Worker was using to authenticate with the storage infrastructure; this blind spot prolonged the incident.

Impact on Services

The credential mismanagement had a cascading effect on various Cloudflare services:

– R2 Object Storage: All write operations failed, and approximately 35% of read operations were unsuccessful during the incident window.

– Cache Reserve: Customers experienced increased origin traffic as cached objects became unavailable.

– Images and Stream: Both services saw 100% failure rates for uploads. Image delivery success dropped to approximately 25%, while Stream delivery success fell to about 94%.

– Vectorize: Cloudflare’s vector database experienced a 75% query failure rate and complete failure for insert operations.

– Log Delivery: The service faced delays of up to 70 minutes in processing logs.

Other services, including Email Security, Billing, and Key Transparency Auditor, also experienced disruptions due to the incident.

Technical Details

The error stemmed from the omission of a critical command-line flag during the credential rotation. Engineers running the `wrangler secret put` and `wrangler deploy` commands did not include the `--env production` flag, so the new credentials were deployed to a non-production Worker instead of the intended production environment. When the previous credentials were then removed, authentication failures cascaded across services, producing the widespread outage.
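
In schematic terms, the difference comes down to a single flag on the Wrangler commands. The secret name below is a placeholder chosen for illustration, not Cloudflare’s actual binding:

```sh
# Intended rotation: target the Worker's "production" environment explicitly.
# (R2_BACKEND_CREDENTIALS is a hypothetical secret name used for illustration.)
wrangler secret put R2_BACKEND_CREDENTIALS --env production
wrangler deploy --env production

# What actually ran: without --env, Wrangler targets the Worker's default
# (non-production) environment, so the production R2 Gateway never received
# the new credentials before the old ones were deleted.
wrangler secret put R2_BACKEND_CREDENTIALS
wrangler deploy
```

Once the old key pair was revoked on the storage side, the production Worker was left holding credentials that no longer existed.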

Response and Remediation

Upon identifying the root cause at 22:36 UTC, nearly an hour after the impact began, Cloudflare’s engineering team promptly deployed the correct credentials to the production environment. Service was restored by 22:45 UTC.

To prevent future occurrences, Cloudflare has implemented several technical and procedural changes:

– Enhanced Logging: Explicit identification of credential IDs used for authentication to improve visibility.

– Verification Procedures: Mandatory checks to confirm credential usage before decommissioning old credentials.

– Automated Deployment Tools: Requirement to use automated hotfix release tooling instead of manual command entry to reduce human error.

– Dual-Human Validation: Explicit requirement for two-person validation during credential rotation processes to ensure accuracy.

– Health Checks: Development of closed-loop health checks to validate credential propagation before releases (a sketch of the idea follows this list).
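
As a rough illustration of how the last three items could fit together, the sketch below wraps the rotation in a script that always targets production and refuses to sign off until a health endpoint confirms the new credential is the one actually in use. The endpoint URL, secret name, and credential-ID check are invented for this sketch and are not Cloudflare’s actual tooling:

```sh
#!/usr/bin/env bash
# Hypothetical rotation wrapper: force the production environment, then verify
# the new credential is live before the old one may be retired.
set -euo pipefail

NEW_KEY_ID="$1"   # access key ID of the newly issued credential

# 1. Push the new secret and redeploy, always targeting production explicitly.
wrangler secret put R2_BACKEND_CREDENTIALS --env production
wrangler deploy --env production

# 2. Closed-loop check: poll a health endpoint (invented for this sketch) that
#    reports which credential ID the production Gateway Worker is using.
for _ in $(seq 1 30); do
  live_key_id="$(curl -fsS https://r2-gateway.example.com/health/credential-id || true)"
  if [ "${live_key_id}" = "${NEW_KEY_ID}" ]; then
    echo "Credential ${NEW_KEY_ID} confirmed in production; old key may be retired."
    exit 0
  fi
  sleep 10
done

echo "New credential never became active in production; do NOT delete the old key." >&2
exit 1
```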

Context and Industry Implications

This incident follows a similar outage in February 2025, where an employee mistakenly disabled the entire R2 Gateway service while attempting to block a phishing URL. These recurring configuration-related outages highlight the challenges of managing complex cloud infrastructure while maintaining rigorous security practices like credential rotation.

Cloudflare’s prompt, detailed response to these issues demonstrates its commitment to service reliability and security. Even so, these incidents are a reminder of the importance of robust operational procedures and of continuous improvement in managing cloud services.

Conclusion

Cloudflare’s March 2025 service disruption underscores the critical importance of meticulous operational procedures in cloud service management. While the company has taken significant steps to address the root causes and prevent future incidents, this event serves as a valuable lesson for the broader industry on the complexities and challenges inherent in maintaining large-scale cloud infrastructures.