Intermittent failures across multiple API endpoints

Incident Report for PolygonIO

Resolved

What Went Wrong
On January 29th at approximately 5:50 PM EST, customers experienced intermittent outages and delays across multiple API services. The disruptions were caused by an infrastructure upgrade that unexpectedly impacted critical systems.

Detection
The issue was detected through internal monitoring alerts and engineering reports. Services would go down briefly and then recover, leading to inconsistent availability.

Mitigation
Once the upgrade was identified as the cause, it was halted, and a rollback was initiated. Services were gradually restored, and system stability was confirmed.

Root Cause
The outage occurred because a scheduled infrastructure upgrade required rebooting key systems, which inadvertently disrupted API services.

Next Steps
Improve planning around infrastructure upgrades.
Enhance monitoring and alerting to detect issues faster.
Implement processes to minimize service disruptions during future updates.

We apologize for the inconvenience and are taking steps to prevent similar issues in the future. Thank you for your patience and for being a valued customer.

Posted Jan 29, 2025 - 19:40 EST