The Snowflake service experienced a two-and-a-half-hour service disruption on the evening of January 18, 2017. The incident was triggered by an unexpected overload of the Snowflake metadata store, a core component of the Snowflake database.
The initial root cause of the metadata store overload was a failure in one of the metadata processes that caused it to consume excessive CPU. Concurrent with this event, an increase in query job activity generated a high load of read requests to the metadata service. Snowflake is designed to handle occasional spikes in query activity by reducing the write load on the metadata store. In this case, the combination of an errant metadata process and a spike in read requests significantly degraded the performance of the metadata store.
During this overload window, excessive read requests to the metadata store created a backlog in its work queues, which our internal operations team focused on remediating. At 10:14 PM PST, the team detected external availability issues with the Snowflake service, resulting in query execution and access failures. Gradual remediation was attempted, but continued service activity kept the queues at unacceptably high levels. Ultimately, the team decided to pause the service clusters and incoming queries. At that point, the metadata store quickly cleared its queues and the system recovered within a few minutes. By 11:46 PM PST, internal operations had begun the recovery process by starting new service clusters, and by 12:49 AM PST full external availability was restored.
To prevent future incidents, the Snowflake engineering team has implemented and deployed to production throttling of heavy metadata read activity. Our operations playbook has been updated with new procedures that allow us to isolate affected components and recover them, enabling us to respond more effectively to similar situations. We are also enhancing our telemetry to provide early indication not only of this particular event but of the general class of events that can cause system overload.
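To illustrate the kind of throttling described above, here is a minimal token-bucket rate limiter sketch. This is an illustrative example only; the class and parameter names are hypothetical and do not reflect Snowflake's actual implementation.

```python
import threading
import time


class TokenBucket:
    """Illustrative token-bucket limiter for metadata read requests.

    Tokens refill at a steady rate; each read consumes one token.
    When the bucket is empty, the read is throttled instead of being
    allowed to overload the backing store.
    """

    def __init__(self, rate: float, capacity: int):
        self.rate = rate              # tokens replenished per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Return True if a read may proceed, False if it should be throttled."""
        with self.lock:
            now = time.monotonic()
            # Refill tokens based on elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False


# Hypothetical usage: allow ~100 metadata reads/sec with bursts of up to 20.
limiter = TokenBucket(rate=100.0, capacity=20)
if not limiter.try_acquire():
    pass  # reject or queue the read rather than forwarding it to the store
```

A limiter like this lets brief bursts through (up to `capacity` requests) while capping sustained read pressure at `rate` requests per second, which is the general property a metadata-store throttle needs.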
Most importantly, Snowflake has been developing an architectural enhancement that improves compile-time performance for some queries and will dramatically reduce the load on the metadata store. This enhancement is in the final stages of testing and will be rolled out to production in the coming weeks.
We are very sorry for the problems this service disruption may have caused our customers. The Snowflake architecture is designed to prevent these situations, but sometimes a combination of simultaneous events creates an unanticipated situation. When that happens, we learn and improve.
If you have any questions or issues related to this summary report, please submit a support ticket to firstname.lastname@example.org.