Write-up
All API Endpoints Delayed
Timeline
  • At ~8:16 PM PDT (UTC-7), the error rate started to climb on our main CloudFront distribution, which hosts the central graph.codex.io API. Requests to our origin server were starting to time out.

  • The on-call engineer was notified and started to investigate.

  • At 8:31 PM the error rate had reached ~22%.

  • The rest of the engineering team was called in.

  • Most of the failing requests were for the filterTokens resolver in our tokens subgraph. Other resolvers, such as prices / bars and other non-token data, were unaffected.

  • A specific cache cluster (graph-resolvers-cache), used to cache identical requests, was identified as having elevated latency. This cache sat in front of the search cluster to de-duplicate identical requests and reduce load on the search cluster. However, after more than a year of growing the filterTokens feature set, response sizes had grown, and combined with our increased scale this opened the door to a potential DoS: if enough requests with different inputs arrived concurrently, they could overload the cache. This vulnerability, coupled with a more than 300% increase in load on that resolver in under 10 minutes, caused cache latency to spike, which sent requests trying to reach that cache into a death spiral.
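
To illustrate why a de-duplication cache helps with identical requests but not with many distinct ones, here is a minimal sketch. It is not our production code: the DedupCache class, the fake_search backend, the 50 KB response size, and the phrase / limit / offset variables are all assumptions made up for the example. Only the filterTokens resolver name comes from the incident.

```python
import hashlib
import json


class DedupCache:
    """Toy in-memory stand-in for a request de-duplication cache (hypothetical)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0
        self.bytes_written = 0

    @staticmethod
    def key_for(resolver, variables):
        # Identical inputs must map to the same key, so serialize deterministically.
        payload = json.dumps({"resolver": resolver, "variables": variables}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, resolver, variables, compute):
        key = self.key_for(resolver, variables)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: the backend is queried and the full response is written to the cache.
        self.misses += 1
        response = compute(variables)
        self._store[key] = response
        self.bytes_written += len(response)
        return response


def fake_search(variables):
    # Hypothetical backend returning a large (~50 KB) response body.
    return "x" * 50_000


def run(requests):
    cache = DedupCache()
    for variables in requests:
        cache.get_or_compute("filterTokens", variables, fake_search)
    return cache


# 100 identical requests: one miss, then every request is served from the cache.
identical = run([{"phrase": "eth", "limit": 50}] * 100)

# 100 requests with distinct inputs: every request misses and writes a full response.
distinct = run([{"phrase": "eth", "limit": 50, "offset": i} for i in range(100)])

print(identical.hits, identical.misses, identical.bytes_written)
print(distinct.hits, distinct.misses, distinct.bytes_written)
```

In this toy model the distinct-input run gets no benefit from the cache and writes 100 times as much data to it. With large responses and a sudden burst of varied inputs, the cache does more work per request while saving the search cluster nothing.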

Mitigation

  • At 9:01 PM we deployed a fix that removed the use of the cache for de-duplication, and the error rate immediately dropped back below 0.001%.

  • Because we had since scaled our search infrastructure considerably, this cache was no longer needed. Query de-duplication now happens in another area of the stack.

Future-proofing

  • This specific issue was caused by an over-utilized cache with extremely volatile load. We are auditing every cache in use and ensuring that they are not vulnerable to this type of problem.

  • In addition, we have identified some metrics that will notify us of an issue's severity faster, which should help reduce mitigation time for any future issues.
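
This write-up does not name the specific metrics, so the following is only an illustrative sketch of one common approach: tracking the error rate over a sliding time window and alerting when it crosses a threshold. The ErrorRateMonitor class, the 60-second window, the 5% threshold, and the minimum request count are all hypothetical values chosen for the example.

```python
from collections import deque


class ErrorRateMonitor:
    """Sliding-window error-rate tracker (hypothetical example, not our alerting stack)."""

    def __init__(self, window_seconds, threshold, min_requests=1):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_requests = min_requests
        self._events = deque()  # (timestamp, is_error) pairs, oldest first
        self._errors = 0

    def record(self, timestamp, is_error):
        self._events.append((timestamp, is_error))
        if is_error:
            self._errors += 1
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, was_error = self._events.popleft()
            if was_error:
                self._errors -= 1

    def error_rate(self):
        total = len(self._events)
        return self._errors / total if total else 0.0

    def should_alert(self):
        # Require a minimum sample size so a single failed request does not page anyone.
        enough_traffic = len(self._events) >= self.min_requests
        return enough_traffic and self.error_rate() >= self.threshold


monitor = ErrorRateMonitor(window_seconds=60, threshold=0.05, min_requests=20)

# Simulate 100 requests over 50 seconds where every fifth request fails.
for i in range(100):
    monitor.record(timestamp=i * 0.5, is_error=(i % 5 == 0))

print(monitor.error_rate(), monitor.should_alert())
```

Keeping one such monitor per resolver, rather than one for the whole API, would surface a failure concentrated in a single resolver sooner than an aggregate error rate would.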