Increased latency leading to delayed pricing for v2/snapshot/
Incident Report for PolygonIO
Postmortem

What went wrong

Detection

At approximately 10:34 am EST on September 16th, we started receiving reports of FMV data inaccuracy on several tickers in the v2/snapshot endpoint. We began triage and observed wide deltas between FMV and actual price. While triaging, we noticed a large amount of lag when reading messages from Kafka in the snapshot service.
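
For readers unfamiliar with the term, consumer lag is the gap between the newest offset written to a Kafka partition and the offset the consumer has committed. The sketch below is a minimal illustration of that arithmetic only. The function names and the plain-dictionary inputs are assumptions made for this example, not the snapshot service's actual monitoring code.

```python
from typing import Dict


def consumer_lag(end_offsets: Dict[int, int],
                 committed_offsets: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag: latest log offset minus the committed offset."""
    lag = {}
    for partition, end in end_offsets.items():
        # Assumption: a partition with no commit yet counts as fully unread.
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end - committed, 0)
    return lag


def total_lag(end_offsets: Dict[int, int],
              committed_offsets: Dict[int, int]) -> int:
    """Sum of lag across all partitions."""
    return sum(consumer_lag(end_offsets, committed_offsets).values())
```

A total that keeps growing during market hours means the consumer is reading more slowly than messages arrive, which matches the delayed FMV values described above.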

Mitigation

We traced the origin of the consumer lag back to a change we made to the service the prior week. Because this is a difficult service to roll back during market hours, we made the decision to apply updated settings to the service after market close. The goal was to bring its Kafka consumer settings into parity with other services. We deployed those settings after hours and observed the changes at market open on Wednesday morning. When we did not see improvements in consumer lag, we made the decision to roll the service back to a prior version.
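
Checking settings parity between two services can be sketched as a simple dictionary comparison. This is a hypothetical helper written for this example: `diff_settings` is an assumed name, and the report does not say which settings were changed or what their values were.

```python
from typing import Any, Dict, Tuple


def diff_settings(reference: Dict[str, Any],
                  candidate: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (reference_value, candidate_value)} for every differing key.

    A key missing on one side is reported with None on that side.
    """
    keys = sorted(set(reference) | set(candidate))
    return {
        key: (reference.get(key), candidate.get(key))
        for key in keys
        if reference.get(key) != candidate.get(key)
    }


# Illustrative values only; these are not the settings from this incident.
reference = {"fetch.min.bytes": 1, "max.poll.records": 500}
candidate = {"fetch.min.bytes": 1, "max.poll.records": 100}
mismatches = diff_settings(reference, candidate)
```

An empty result means the two configurations match. As this incident showed, matching settings did not by themselves remove the lag.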

Root cause / Resolution

We have been slowly migrating services to use an updated Kafka library. This new library was allowed to soak in our staging environment for over a month without any negative observations. We applied this new library to the v2/snapshot service over the weekend of Sep. 14th/15th. We have performed this upgrade several times and did not expect adverse effects. When we could not determine the reason for consumer lag with this new library, we rolled the service back to use a prior version with an older Kafka library.

Calls to Action

Immediate

  • Deploy the service with the newer Kafka library to a secondary region and observe production data flowing through. Determine why the service and Kafka settings are causing consumer lag when reading messages. Gain confidence with this service in the secondary environment and then promote it to the production environment.
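
One way to make "gain confidence" concrete is a promotion gate over lag samples collected in the secondary region. The sketch below assumes a hypothetical rule: promote only after enough samples have been observed and none exceeds a chosen threshold. The function name, the threshold, and the sample count are all illustrative assumptions, not Polygon's actual release criteria.

```python
from typing import Sequence


def ready_to_promote(lag_samples: Sequence[int],
                     max_lag: int,
                     min_samples: int) -> bool:
    """True only if enough samples were observed and all stay within max_lag."""
    if len(lag_samples) < min_samples:
        # Too little production traffic observed to judge.
        return False
    return all(sample <= max_lag for sample in lag_samples)
```

For example, `ready_to_promote([10, 20, 5], max_lag=50, min_samples=3)` passes the gate, while a single spike above the threshold or too few samples fails it.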

We deeply apologize for any disruptions and inconvenience caused by this incident. We do not take this event lightly. Our team worked diligently to address all problems and restore normal functionality to the affected services as quickly as possible. By implementing these mitigation measures and refining our incident response strategy, we aim to improve the reliability and availability of our services and prevent future outages.

Please don’t hesitate to reach out with any additional questions about this matter.

Thank you for being a loyal customer of Polygon.io,

Polygon Engineering Team

Posted Sep 25, 2024 - 10:53 EDT

Resolved
During this time frame, we saw increased latency in Fair Market Value (FMV) data, which led to delayed pricing for calculations based on FMV via the v2/snapshot/ endpoint.
Posted Sep 18, 2024 - 07:45 EDT