Detection
At approximately 10:34 am ET on September 16th, we started receiving reports of inaccurate FMV data on several tickers in the v2/snapshot endpoint. We began triage and observed wide deltas between FMV and actual price. While triaging, we noticed significant lag when reading messages from Kafka in the snapshot service.
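To illustrate the kind of check that surfaces a wide delta between FMV and actual price, here is a minimal sketch. The field names (`ticker`, `fmv`, `last_price`) and the 1% threshold are assumptions made for this example; they are not the actual v2/snapshot schema or the alerting rule we use.

```python
def fmv_deviation(fmv: float, last_price: float) -> float:
    """Relative gap between FMV and the last traded price."""
    if last_price <= 0:
        raise ValueError("last_price must be positive")
    return abs(fmv - last_price) / last_price


def flag_wide_deltas(snapshots, threshold=0.01):
    """Return tickers whose FMV deviates from price by more than `threshold`.

    `snapshots` is a list of dicts with hypothetical keys
    'ticker', 'fmv', and 'last_price'.
    """
    return [
        s["ticker"]
        for s in snapshots
        if fmv_deviation(s["fmv"], s["last_price"]) > threshold
    ]


# Illustrative data only, not values from the incident.
sample = [
    {"ticker": "AAA", "fmv": 100.0, "last_price": 100.2},
    {"ticker": "BBB", "fmv": 50.0, "last_price": 55.0},
]
flagged = flag_wide_deltas(sample)
```

A relative threshold is used here so that one setting can apply across tickers at very different price levels.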
Mitigation
We traced the consumer lag back to a change we had made to the service the prior week. Because this service is difficult to roll back during market hours, we decided to apply updated settings after market close. The goal was to bring its Kafka consumer settings into parity with our other services. We deployed those settings after hours and observed their effect at market open on Wednesday morning. When consumer lag did not improve, we decided to roll the service back to a prior version.
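For readers unfamiliar with the term, consumer lag on a partition is the gap between the newest offset written to that partition and the offset the consumer has committed. The sketch below shows one way to compute it and to compare lag before and after a settings change. The function names, the plain-dict inputs, and the 50% reduction threshold are all assumptions for illustration; this is not our monitoring code and it does not call any Kafka client library.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: latest offset minus committed offset.

    Both arguments map partition id -> offset. A partition with no
    committed offset is treated as committed at 0 (an assumption).
    """
    lag = {}
    for partition, end in end_offsets.items():
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end - committed, 0)
    return lag


def lag_improved(before, after, min_reduction=0.5):
    """True if total lag dropped by at least `min_reduction` (a fraction)."""
    total_before = sum(before.values())
    total_after = sum(after.values())
    if total_before == 0:
        return total_after == 0
    return total_after <= total_before * (1 - min_reduction)


# Illustrative numbers only, not measurements from the incident.
before = consumer_lag({0: 1000, 1: 500}, {0: 400, 1: 500})
after = consumer_lag({0: 2000, 1: 900}, {0: 1450, 1: 900})
improved = lag_improved(before, after)
```

Comparing totals at the same point in the trading day (for example, shortly after market open) keeps the before and after measurements under similar load.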
Root cause / Resolution
We have been gradually migrating services to an updated Kafka library. The new library soaked in our staging environment for over a month without any negative observations. We applied it to the v2/snapshot service over the weekend of September 14th and 15th. We have performed this upgrade several times and did not expect adverse effects. When we could not determine the reason for the consumer lag with the new library, we rolled the service back to a prior version that uses the older Kafka library.
Immediate
We deeply apologize for any disruptions and inconvenience caused by this incident. We do not take this event lightly. Our team worked diligently to address all problems and restore normal functionality to the affected services as quickly as possible. By implementing these mitigation measures and refining our incident response strategy, we aim to improve the reliability and availability of our services and prevent future outages.
Please don’t hesitate to reach out with any additional questions about this matter.
Thank you for being a loyal customer of Polygon.io,
Polygon Engineering Team