At ~8:16 PM PDT (UTC-7), the error rate began rising on our main CloudFront distribution, which hosts the central graph.codex.io API. Requests to our origin server were starting to time out.
The on-call engineer was notified and started to investigate.
By 8:31 PM, the error rate had reached ~22%.
The rest of the engineering team was called in.
The majority of the failing resolvers were filterTokens from our tokens subgraph; other resolvers, such as those serving prices, bars, and other non-token data, were unaffected.
A specific cache cluster (graph-resolvers-cache), used to cache identical requests, was identified as having elevated latency. This cache sat in front of the search cluster to de-duplicate identical requests and reduce load on the search cluster. However, as the filterTokens feature set grew over more than a year, response sizes grew with it, and combined with our increased scale this opened the door to a potential DoS: if enough requests arrived concurrently with different inputs, it was possible to overload the cache. This vulnerability, coupled with an over 300% increase in load on that resolver in under 10 minutes, caused latency to spike, which in turn created a death spiral for requests trying to reach the cache.
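For illustration only, the sketch below shows the general shape of this pattern: a shared cache keyed on request inputs sitting in front of the search cluster. The client interfaces, key scheme, and TTL are assumptions for the example, not our actual code.

```ts
// Simplified sketch of the pre-incident pattern (illustrative names only):
// identical filterTokens requests are de-duplicated through a shared cache
// cluster sitting in front of the search cluster.
import { createHash } from "crypto";

interface CacheClient {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

interface SearchClient {
  filterTokens(input: unknown): Promise<unknown>;
}

function cacheKey(input: unknown): string {
  // Every distinct input hashes to a distinct key, so a burst of varied
  // inputs bypasses de-duplication and lands on the cache as new writes.
  return "filterTokens:" + createHash("sha256").update(JSON.stringify(input)).digest("hex");
}

async function filterTokensResolver(
  input: unknown,
  cache: CacheClient,
  search: SearchClient,
): Promise<unknown> {
  const key = cacheKey(input);
  const hit = await cache.get(key); // a latency spike here stalls every request
  if (hit !== null) return JSON.parse(hit);

  const result = await search.filterTokens(input);
  // As the feature set grew, these payloads grew too; writing large values
  // for many distinct keys under concurrent load is what overloaded the cache.
  await cache.set(key, JSON.stringify(result), 30);
  return result;
}
```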
Mitigation
At 9:01 PM we deployed a fix that removed the use of the cache for de-duplication, and the error rate immediately dropped back below 0.001%.
Because we have since scaled our search infrastructure significantly, this cache was no longer needed; query de-duplication now happens in another area of the stack.
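As a rough sketch of what in-process de-duplication can look like without an external cache cluster, the snippet below coalesces identical concurrent requests onto a single upstream call. The helper names are illustrative and not the actual implementation.

```ts
// Minimal in-flight request coalescing: concurrent callers with the same key
// share one upstream promise instead of going through a shared cache cluster.
const inFlight = new Map<string, Promise<unknown>>();

async function dedupe<T>(key: string, run: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing) return existing as Promise<T>;

  const promise = run().finally(() => inFlight.delete(key));
  inFlight.set(key, promise);
  return promise;
}

// Usage (hypothetical names): identical concurrent filterTokens calls
// collapse into a single search request.
// const result = await dedupe(cacheKey(input), () => search.filterTokens(input));
```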
Future-proofing
This specific issue was caused by an over-utilized cache with extremely volatile load. We are auditing every cache in use to ensure none of them are vulnerable to this type of problem.
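As one example of the kind of safeguard such an audit might add (a sketch under assumed names and thresholds, not our actual code), a cache read can be made fail-open with a bounded timeout so that a slow cache degrades to a miss instead of stalling requests:

```ts
// Fail-open guard around cache reads: if the cache is slow or erroring,
// treat it as a miss rather than letting its latency cascade upstream.
// The 50 ms threshold is an illustrative assumption.
async function guardedCacheGet(
  cache: { get(key: string): Promise<string | null> },
  key: string,
  timeoutMs = 50,
): Promise<string | null> {
  const timeout = new Promise<null>((resolve) => setTimeout(() => resolve(null), timeoutMs));
  try {
    // Whichever settles first wins; a production version would also clear the timer.
    return await Promise.race([cache.get(key), timeout]);
  } catch {
    return null; // cache errors also degrade to a miss
  }
}
```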
In addition, we have identified metrics that will surface the severity of an issue sooner, which should reduce mitigation time for any future incidents.
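A minimal sketch of the kind of signal involved: a rolling per-resolver error-rate window with an alert threshold well below the ~22% seen here. The window size, sample minimum, and threshold are illustrative assumptions rather than our production values.

```ts
// Rolling error-rate window for a single resolver; alerts once the failure
// rate over the window crosses a threshold.
class ErrorRateWindow {
  private samples: { ok: boolean; at: number }[] = [];

  constructor(private windowMs = 60_000, private alertThreshold = 0.05) {}

  record(ok: boolean): void {
    const now = Date.now();
    this.samples.push({ ok, at: now });
    this.samples = this.samples.filter((s) => now - s.at <= this.windowMs);

    const failures = this.samples.filter((s) => !s.ok).length;
    const rate = failures / this.samples.length;
    if (this.samples.length >= 20 && rate >= this.alertThreshold) {
      // In practice this would page on-call through the alerting pipeline.
      console.error(`error rate ${(rate * 100).toFixed(1)}% over last ${this.windowMs / 1000}s`);
    }
  }
}
```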