v2 Snapshot Endpoint Errors

Incident Report for PolygonIO

Postmortem

Postmortem for v2 Snapshots timeouts

What went wrong

Detection

At approximately 10:01 AM EST on September 11th, we noticed increased timeouts on our Snapshots endpoints, which degraded the experience for our Stocks Business customers. The status page was updated to reflect degraded performance, and Engineering began investigating the cause and possible mitigations.

Mitigation

We determined that several pods were having scaling issues due to an influx of API requests. We increased the memory requests on the affected pods, which mitigated the timeouts.

Root cause

After service functionality was restored, we reviewed the timeline and observability artifacts to better understand what went wrong and when it began. We determined that the timeouts began at approximately 9:37 AM EST. A sharp spike in requests caused scaling issues for several pods, which had a cascading effect of timeouts across the Snapshots API, affecting all Stocks Business customers. The out-of-memory alerts were not triggered until 10:00 AM, which meant we lost 23 minutes of awareness. Out-of-memory warnings are not uncommon in most modern stacks, and most of our services can scale horizontally; Snapshots is an exception.
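The 23-minute figure follows directly from the two timestamps in the timeline. A minimal sketch of that arithmetic is shown here; the function name `detection_gap_minutes` is illustrative only and is not part of our tooling.

```python
from datetime import datetime


def detection_gap_minutes(onset: datetime, first_alert: datetime) -> int:
    """Whole minutes between when the impact began and when the first alert fired."""
    return int((first_alert - onset).total_seconds() // 60)


# Timestamps taken from the timeline above (both in the same local time zone).
onset = datetime(2024, 9, 11, 9, 37)        # timeouts began (approximate)
first_alert = datetime(2024, 9, 11, 10, 0)  # out-of-memory alerts triggered

gap = detection_gap_minutes(onset, first_alert)
print(f"Detection gap: {gap} minutes")
```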

Calls to Action

Immediate

Fix the alerting delay we observed in this incident. A solution has not yet been identified, but several ideas have been discussed. The most likely candidate is to add another set of external synthetics on the Snapshots API so that we alert when the service appears to be affected from an external perspective (not just internal alerts on out-of-memory conditions).
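To illustrate what an external synthetic check could look like, here is a minimal sketch. It is not the solution we have chosen. The URL, the 5-second timeout, the three-failure threshold, and every name in the code (`probe`, `ConsecutiveFailureAlert`, and so on) are assumptions made for the example. The idea is to request the endpoint from outside the cluster on a schedule, treat a timeout or a non-200 response as a failure, and alert only after several consecutive failures so a single slow request does not page anyone.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    ok: bool
    latency_s: float
    detail: str


def http_fetch(url: str, timeout_s: float) -> int:
    """GET `url` and return the HTTP status code; raises on timeout or network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses still carry a status code we want to inspect.
        return exc.code


def probe(url: str, timeout_s: float = 5.0,
          fetch: Callable[[str, float], int] = http_fetch) -> ProbeResult:
    """Run one synthetic request and classify it as healthy or failed."""
    start = time.monotonic()
    try:
        status = fetch(url, timeout_s)
    except Exception as exc:  # timeouts, DNS failures, connection resets
        return ProbeResult(False, time.monotonic() - start, f"error: {type(exc).__name__}")
    latency = time.monotonic() - start
    if status != 200:
        return ProbeResult(False, latency, f"status {status}")
    return ProbeResult(True, latency, "ok")


class ConsecutiveFailureAlert:
    """Fires once `threshold` probes in a row have failed (hypothetical policy)."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0

    def record(self, result: ProbeResult) -> bool:
        """Update the failure streak; return True if an alert should fire."""
        self.failures = 0 if result.ok else self.failures + 1
        return self.failures >= self.threshold


if __name__ == "__main__":
    # Placeholder URL: substitute a real, authenticated Snapshots request.
    url = "https://example.invalid/snapshot-check"
    alert = ConsecutiveFailureAlert(threshold=3)
    for _ in range(3):
        result = probe(url, timeout_s=5.0)
        if alert.record(result):
            print(f"ALERT: Snapshots probe failing ({result.detail})")
        time.sleep(1)
```

Because the probe measures what a customer would see, it would fire on timeouts even when no internal out-of-memory alert has triggered yet. The `fetch` parameter is injectable so the classification logic can be exercised without network access.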

Future

Revamp Snapshots. This has been under discussion for a few months. We are working on a new datastore that will make these point-in-time captures easier to perform, less memory-intensive, and more performant.

We deeply apologize for any disruptions and inconvenience caused by this incident. We do not take this event lightly. Our team worked diligently to address all problems and restore normal functionality to the affected services as quickly as possible. By implementing these mitigation measures and refining our incident response strategy, we aim to improve the reliability and availability of our services and prevent future outages.

Please don’t hesitate to reach out with any additional questions about this matter.

Thank you for being a loyal customer of Polygon.io,

Polygon Engineering Team

Posted Sep 16, 2024 - 15:59 EDT

Resolved

v2 Snapshot Endpoints have remained stable after mitigations were put in place. We appreciate your patience while this issue was resolved.

Posted Sep 11, 2024 - 13:26 EDT

Monitoring

The services have been stabilized and we are currently looking into the root cause.

Posted Sep 11, 2024 - 10:30 EDT

Investigating

We are aware of an issue impacting v2 Snapshot Endpoints, causing increased latency and failed requests.

v3 Snapshots and other REST endpoints do not appear to be impacted at this time.

Posted Sep 11, 2024 - 10:18 EDT

This incident affected: Stocks (Market Data REST Endpoints), Options (Market Data REST Endpoints), Indices (Market Data REST Endpoints), Forex (Market Data REST Endpoints), and Crypto (Market Data REST Endpoints).