Symptoms
Starting at 17:28 UTC, most token data (enhanced events, aggregates for both pairs and tokens, etc.) was degraded. Updates would lag by ~1-2 minutes and then catch back up.
Causes
We quickly found that our Kafka producers were timing out when trying to produce messages. When a producer times out, the services that publish through it back up. Specifically, the services that generate large volumes of Kafka messages (price, walletBalances, enhancer) fell behind.
A rolling security patch on the AWS MSK Kafka cluster rebooted the brokers, which triggered a connection leak in the shared Kafka producer service. The leaked connections exhausted the available broker connections, which led to failures when publishing price data.
Mitigation
We performed a rolling restart of the producer services to clear leaked connections and restore timely price data updates.
Fix
An upgrade to @platformatic/kafka v2.2.3 is planned to permanently fix the connection leak. We will also improve error handling and asynchronous processing around message production to lessen the impact of similar failures.
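To make the error-handling and async-processing idea concrete, here is a minimal sketch of one way a service loop could be decoupled from a slow or failing producer. It is an illustration under stated assumptions, not our actual implementation: `SendFn`, `withTimeout`, and `PublishBuffer` are hypothetical names, the real publish call is injected rather than taken from any client library's API, and the drop-oldest policy is only one possible choice.

```typescript
// Hypothetical signature for whatever actually publishes to Kafka.
// It is injected so the sketch does not assume any client library's API.
type SendFn<T> = (batch: T[]) => Promise<void>;

// Reject if `work` does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Bounded in-memory buffer between the service loop and the publisher.
class PublishBuffer<T> {
  private items: T[] = [];
  private capacity: number;
  dropped = 0;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  get size(): number {
    return this.items.length;
  }

  // Never blocks the caller: when the buffer is full, the oldest message is dropped.
  enqueue(item: T): void {
    this.items.push(item);
    this.trim();
  }

  // Try to publish up to `maxBatch` buffered messages within `timeoutMs`.
  // On failure or timeout the batch goes back to the front for a later retry.
  async flush(send: SendFn<T>, maxBatch: number, timeoutMs: number): Promise<boolean> {
    const batch = this.items.splice(0, maxBatch);
    if (batch.length === 0) return true;
    try {
      await withTimeout(send(batch), timeoutMs);
      return true;
    } catch {
      this.items.unshift(...batch);
      this.trim();
      return false;
    }
  }

  // Enforce the capacity bound by discarding the oldest entries.
  private trim(): void {
    while (this.items.length > this.capacity) {
      this.items.shift();
      this.dropped++;
    }
  }
}
```

In this sketch the service calls `enqueue` and moves on, while a separate loop calls `flush` on an interval, so a producer timeout costs at most `timeoutMs` per flush attempt instead of stalling message generation. Dropping the oldest entries suits data where only the latest value matters; streams that cannot tolerate loss would need a different policy, such as applying backpressure or spilling to disk.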
Resolved