Live pricing data degraded performance

Incident Report for PolygonIO

Postmortem

Postmortem on increased API latency and errors (Cassandra outage)

What went wrong

Polygon uses an open source database called Cassandra to store our intraday trading data. It plays a crucial part in our data storage solution.

On November 14th at 3:48 PM EST we began receiving alerts from our database writers that they were having read/write errors against our production Cassandra cluster (“Blue”). These errors propagated to our serving stack, which resulted in 5xx HTTP errors for our customers. Customers began reporting increased latencies and these errors to our support team. The errors began to dissipate around market close but did not go away completely. Errors, alerts and system instability continued throughout the night. Around 7:00 AM on November 15th we made the decision to hydrate a backup Cassandra cluster with data and route traffic to it before market open.

Impact

Customers were adversely affected from approximately 3:48 PM November 14th until 9:00 AM on November 15th. Impact ranged from elevated latencies to 5xx errors.

Mitigation

Before market open on November 15th we moved traffic to a backup Cassandra cluster (“Green”). This temporarily fixed the problem and allowed us to handle traffic after market open. Several engineers continued to investigate the affected cluster to determine why it had started experiencing errors the day before.

Resolution

After more analysis over the course of the day on November 15th we determined that the compaction strategy on the Blue cluster did not match our historical runtime configuration. This was adversely affecting performance on the cluster at indeterminate times. The compaction strategy on the Blue cluster was updated, and on November 17th traffic was diverted back to the Blue cluster.

Root cause

On the weekend of November 10th to November 12th we carried out a planned exercise of standing up a new Cassandra cluster with more storage provisioned for it (the “Blue” cluster). We thought this cluster was a verbatim match of the old production cluster. However, there was some runtime configuration drift in the compaction strategy for the tables in the cluster. Our production services began having read/write issues on November 14th at 3:48 PM EST as Cassandra was compacting data behind the scenes.

Remediation and Calls to Action

  • We will be taking a closer look at runtime configuration drift in our infrastructure repos to make sure those configs are maintained and applied properly during each new infra deployment.
  • We are changing our procedures around updating StatusPage to better use the “monitoring” state so that customers have more awareness that the incident is still open and not resolved.
  • We are evaluating an alert that is more finely tuned to Cassandra schema issues.
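
As an illustration of the kind of drift check described above, the following is a minimal sketch and not our production tooling. It assumes the per-table compaction options of each cluster have already been exported into a plain dictionary (Cassandra exposes them in the compaction column of system_schema.tables). The function name, the snapshot shape, and the keyspace, table and strategy values in the usage note are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical snapshot shape: {(keyspace, table): compaction options map}.
# In practice the map could be read from system_schema.tables on each cluster.
CompactionSnapshot = Dict[Tuple[str, str], Dict[str, str]]


def find_compaction_drift(
    expected: CompactionSnapshot, actual: CompactionSnapshot
) -> List[str]:
    """Describe every table whose compaction options differ between clusters."""
    problems: List[str] = []
    for key in sorted(set(expected) | set(actual)):
        name = f"{key[0]}.{key[1]}"
        if key not in actual:
            problems.append(f"{name}: missing from new cluster")
        elif key not in expected:
            problems.append(f"{name}: not present in reference cluster")
        else:
            want, got = expected[key], actual[key]
            # Compare option by option so the report names the exact setting.
            for opt in sorted(set(want) | set(got)):
                if want.get(opt) != got.get(opt):
                    problems.append(
                        f"{name}: {opt} expected {want.get(opt)!r}, "
                        f"found {got.get(opt)!r}"
                    )
    return problems
```

For example, calling find_compaction_drift with a reference snapshot of {("market_data", "trades"): {"class": "TimeWindowCompactionStrategy"}} and a new-cluster snapshot that reports "SizeTieredCompactionStrategy" for the same table returns a single entry naming the class option. These values are made up for illustration and are not the strategies involved in this incident. A check like this could run as a gate before traffic is routed to a newly provisioned cluster.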

We deeply apologize for any disruptions and inconvenience caused by this incident. We do not take this event lightly. Our team worked diligently to address all problems and restore normal functionality to the affected services as quickly as possible. By implementing these mitigation measures and refining our incident response strategy, we aim to improve the reliability and availability of our services and prevent future outages.

Please don’t hesitate to reach out with any additional questions about this matter.

Thank you for being a loyal customer of Polygon.io,

Polygon Engineering Team

Posted Nov 22, 2023 - 09:59 EST

Resolved

This incident has been resolved.
Posted Nov 21, 2023 - 17:03 EST

Monitoring

A fix has been implemented and we are monitoring the state.
Posted Nov 15, 2023 - 09:35 EST

Identified

The issue has been identified and a fix is being implemented.
Posted Nov 15, 2023 - 09:03 EST

Investigating

We are currently investigating this issue.
Posted Nov 15, 2023 - 02:33 EST
This incident affected: Stocks (Market Data REST Endpoints), Options (Market Data REST Endpoints), Indices (Market Data REST Endpoints), Forex (Market Data REST Endpoints), and Crypto (Market Data REST Endpoints).