Postmortem for v2 Snapshots timeouts
Detection
At approximately 10:01 AM EST on September 11th, we noticed increased timeouts on our Snapshots endpoints, resulting in a negative experience for our Stocks Business customers. We updated the Status Page to reflect degraded performance, and Engineering began investigating the cause and possible mitigations.
Mitigation
We determined that several pods were having scaling issues due to an influx of API requests. We increased memory requests on the affected pods, and the timeouts were mitigated.
Root cause
After service functionality was restored, we reviewed the timeline and observability artifacts to better understand what went wrong and when it began. We determined that the timeouts began at approximately 9:37 AM EST. A sharp spike in requests caused scaling issues for several pods, which produced a cascading effect of timeouts across the Snapshots API and affected all Stocks Business customers. The out-of-memory alerts were not triggered until 10:00 AM EST, which meant we were unaware of the problem for roughly 23 minutes. Out-of-memory warnings are not uncommon in most modern stacks, and most of our services have the ability to scale horizontally; Snapshots is an exception.
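The timeline arithmetic above can be checked with a short sketch. It uses only the timestamps stated in this postmortem; the helper name `minutes_between` and the variable names are illustrative and do not come from any internal tooling.

```python
from datetime import datetime

FMT = "%H:%M"

def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both 'HH:MM' on the same day."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)

# Approximate times from this postmortem (EST, September 11th).
timeouts_began = "09:37"      # first timeouts seen in observability data
oom_alert_fired = "10:00"     # out-of-memory alerts triggered
engineering_noticed = "10:01" # timeouts noticed, Status Page updated

# Gap between the start of the timeouts and the first alert.
detection_gap = minutes_between(timeouts_began, oom_alert_fired)
print(f"Detection gap: {detection_gap} minutes")
```

The 23-minute gap is the window in which customers saw timeouts before any alert fired.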
Immediate
Future
We deeply apologize for any disruptions and inconvenience caused by this incident. We do not take this event lightly. Our team worked diligently to address all problems and restore normal functionality to the affected services as quickly as possible. By implementing these mitigation measures and refining our incident response strategy, we aim to improve the reliability and availability of our services and prevent future outages.
Please don’t hesitate to reach out with any additional questions about this matter.
Thank you for being a loyal customer of Polygon.io,
Polygon Engineering Team