On May 29, 2025, SentinelOne, a prominent cybersecurity firm, experienced a significant global service disruption lasting approximately 20 hours. The incident prevented customers from accessing the SentinelOne management console and related services, although endpoint protection remained operational throughout, so client systems stayed protected. Importantly, the company confirmed that this was not a security-related event and that no customer data was compromised.
Incident Overview
The outage began at 13:37 UTC on May 29, triggered by a software flaw in an infrastructure control system that was slated for deprecation. The flaw surfaced during the creation of a new account as part of SentinelOne’s transition to a new Infrastructure-as-Code (IaC) architecture: the control system’s configuration comparison function misidentified discrepancies and overwrote the established network settings, automatically deleting critical network routes and DNS resolver rules and restoring an empty route table in their place. The result was widespread loss of network connectivity across all regions.
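SentinelOne has not published the code involved, but the failure mode it describes is a familiar one in drift-detection tooling: a reconciliation routine that treats an empty "desired" configuration as authoritative and deletes live state to match it. The minimal Python sketch below is purely illustrative (all names are hypothetical) and contrasts that naive behavior with a guard that refuses destructive plans.

```python
# Hypothetical sketch of the failure mode, not SentinelOne's code: a drift
# routine that treats an empty "desired" configuration as authoritative and
# deletes every live route to make the two sides match.

from typing import Dict

RouteTable = Dict[str, str]  # destination CIDR -> attachment/target id


def naive_reconcile(desired: RouteTable, live: RouteTable) -> RouteTable:
    """Blindly converge 'live' onto 'desired'.

    If 'desired' is empty (e.g. read from a brand-new account with no
    configuration yet), every live route is treated as drift and dropped.
    """
    return dict(desired)  # live routes absent from 'desired' are removed


def guarded_reconcile(desired: RouteTable, live: RouteTable) -> RouteTable:
    """Refuse to apply a destructive plan without explicit approval."""
    if not desired and live:
        raise RuntimeError(
            f"Refusing to reconcile: desired state is empty but live state "
            f"has {len(live)} routes; applying it would delete everything."
        )
    removed = sorted(set(live) - set(desired))
    if removed:
        raise RuntimeError(
            f"Plan would delete {len(removed)} routes ({removed}); "
            "destructive changes require manual approval."
        )
    return dict(desired)


if __name__ == "__main__":
    live = {"10.0.0.0/8": "tgw-attach-prod", "172.16.0.0/12": "tgw-attach-dr"}
    new_account_config: RouteTable = {}  # freshly created account, nothing defined yet

    print(naive_reconcile(new_account_config, live))   # {} -> all routes gone
    try:
        guarded_reconcile(new_account_config, live)
    except RuntimeError as err:
        print(err)                                      # destructive plan blocked
```

The design point is simply that plans deleting existing state deserve a hard stop or an explicit approval step, which is the class of safeguard the post-incident actions below aim at.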
Impact on Services
The primary impact of the outage was the inability of security teams to access and manage operations through the SentinelOne management console. However, the core endpoint protection services continued to function without interruption, ensuring that client devices remained protected throughout the incident.
Response and Communication
SentinelOne’s engineering teams responded promptly. By 14:27 UTC, they had identified the missing routes on the Transit Gateways and initiated restoration efforts. The company maintained transparent communication with its customers through multiple channels, including announcements on its Customer Portal, email notifications, social media updates, and blog posts. Console access was restored by 20:05 UTC, with full service restoration achieved approximately 14 hours later.
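The report does not describe SentinelOne's actual recovery tooling, but re-creating static Transit Gateway routes from a saved snapshot is the general shape of the recovery automation it mentions. The boto3 sketch below is an assumption-laden illustration (file path and structure are invented), showing one way such routes could be exported and later restored.

```python
# Hypothetical sketch of backing up and restoring static Transit Gateway routes
# with boto3; not SentinelOne's tooling, only the general approach.

import json
import boto3

ec2 = boto3.client("ec2")
BACKUP_FILE = "tgw_routes_backup.json"  # illustrative path


def backup_static_routes() -> None:
    """Export every static route from all TGW route tables to a JSON file."""
    snapshot = {}
    tables = ec2.describe_transit_gateway_route_tables()["TransitGatewayRouteTables"]
    for table in tables:
        table_id = table["TransitGatewayRouteTableId"]
        routes = ec2.search_transit_gateway_routes(
            TransitGatewayRouteTableId=table_id,
            Filters=[{"Name": "type", "Values": ["static"]}],
        )["Routes"]
        snapshot[table_id] = [
            {
                "DestinationCidrBlock": r["DestinationCidrBlock"],
                "TransitGatewayAttachmentId": r["TransitGatewayAttachments"][0][
                    "TransitGatewayAttachmentId"
                ],
            }
            for r in routes
            if r.get("TransitGatewayAttachments")
        ]
    with open(BACKUP_FILE, "w") as fh:
        json.dump(snapshot, fh, indent=2)


def restore_static_routes() -> None:
    """Re-create every static route in the snapshot (error handling for
    already-existing routes is omitted for brevity)."""
    with open(BACKUP_FILE) as fh:
        snapshot = json.load(fh)
    for table_id, routes in snapshot.items():
        for route in routes:
            ec2.create_transit_gateway_route(
                TransitGatewayRouteTableId=table_id,
                DestinationCidrBlock=route["DestinationCidrBlock"],
                TransitGatewayAttachmentId=route["TransitGatewayAttachmentId"],
            )
```

Keeping such a snapshot current and restorable without manual reconstruction is what turns a multi-hour routing outage into a scripted recovery, which is the stated goal of item 3 in the list below.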
Preventative Measures and Future Steps
In the aftermath of the incident, SentinelOne has implemented or initiated several corrective actions to prevent similar occurrences:
1. Auditing Automated Functions: The company is reviewing EventBridge and other automatically triggered functions to prevent deprecated control code from being activated during architectural transitions (a sketch of such an audit follows this list).
2. Accelerating Infrastructure Migration: SentinelOne is expediting its migration to the new IaC infrastructure to eliminate risks associated with running split architectures.
3. Enhancing Recovery Automation: The company has backed up all Transit Gateway configurations and is improving recovery automation to prevent manual restoration delays in future incidents.
4. Developing a Public Status Page: An independently operated public status page is being developed to provide real-time updates during incidents.
5. Updating Incident Playbooks: High-severity incident playbooks have been updated to ensure better customer communication during critical events.
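As a rough illustration of the audit described in item 1, the sketch below enumerates EventBridge rules and their targets and flags anything still pointing at known retired components. The DEPRECATED_MARKERS values are invented placeholders; SentinelOne's actual review process is not public.

```python
# Hypothetical EventBridge audit: list every enabled rule and its targets,
# flagging targets whose ARNs match markers for deprecated components.
# The marker strings are illustrative placeholders only.

import boto3

events = boto3.client("events")
DEPRECATED_MARKERS = ("legacy-control", "old-iac")  # substrings naming retired code


def audit_eventbridge_rules() -> list:
    """Return enabled rules whose targets reference deprecated components."""
    findings = []
    paginator = events.get_paginator("list_rules")
    for page in paginator.paginate():
        for rule in page["Rules"]:
            if rule.get("State") != "ENABLED":
                continue
            targets = events.list_targets_by_rule(Rule=rule["Name"])["Targets"]
            for target in targets:
                if any(marker in target["Arn"] for marker in DEPRECATED_MARKERS):
                    findings.append({"Rule": rule["Name"], "TargetArn": target["Arn"]})
    return findings


if __name__ == "__main__":
    for finding in audit_eventbridge_rules():
        print(f"{finding['Rule']} -> {finding['TargetArn']}")
```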
Conclusion
This incident underscores the complexities technology companies face when modernizing critical infrastructure while maintaining service continuity. It also highlights the importance of robust incident response procedures in cybersecurity operations. Notably, Federal customers using GovCloud environments were unaffected by this incident, demonstrating the effectiveness of segregated infrastructure designs for different customer segments.