On the morning of January 23, 2017, the Snowflake service experienced a 97-minute service disruption for customers with non-dedicated instances and a 163-minute disruption for customers with dedicated instances. A configuration change during a service upgrade, combined with a bug in the virtual warehouse health check code, caused the virtual warehouse servers to enter a repair state and be de-provisioned, which initiated the service disruption incident.
As part of the mitigation efforts during the 1/18/17 service outage, Snowflake operations decided to roll back the release (1.79) that had been installed earlier that day. After detailed post-mortem analysis, it was determined that the 1.79 release was not the cause of the metadata store overload, so the release was rescheduled for installation on Monday morning.
Because this was the first major release following last week’s outage, the load on the metadata store was monitored more closely during the service upgrade. When the new release was provisioned and background services were initiated on the new version, spikes in metadata usage were observed. Because customer workloads continued to run on the older version, the background services on the new version were paused to avoid any impact on those workloads. As an additional precaution, servers running the new version were also halted.
These operational steps were unusual, but we were taking extra precautions with this release. Because all customer workloads were connected to the older version of the services, all of the steps taken by Snowflake operations were considered safe.
Unfortunately, while the virtual warehouses were connected to the active (older version) service cluster, a bug in the virtual warehouse health check code caused the health check connection to be established and maintained only with the new-version service cluster, which was now halted.
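To make the failure mode concrete, the sketch below illustrates the class of bug described above. It is hypothetical code written for this report, not Snowflake's implementation; the Cluster type, its fields, and the selection logic are assumptions used purely for illustration.

    # Illustrative sketch only: the Cluster type, fields, and selection logic are
    # hypothetical assumptions, not Snowflake's actual health check code. It shows
    # the class of bug described above: the warehouse health check pins its
    # connection to the newest service cluster rather than the cluster that is
    # actually serving customer traffic.

    from dataclasses import dataclass

    @dataclass
    class Cluster:
        version: str
        is_active: bool   # serving customer traffic
        is_halted: bool   # servers stopped by operations

        def ping(self) -> bool:
            # Stub: a halted cluster never answers its health check.
            return not self.is_halted

    def target_cluster_buggy(clusters):
        # Bug: always health-check the most recently provisioned cluster,
        # even when customer traffic is still routed to an older one.
        return clusters[-1]

    def target_cluster_fixed(clusters):
        # Fix: health-check the cluster the warehouses are actually connected to.
        return next(c for c in clusters if c.is_active)

    old = Cluster(version="1.78", is_active=True, is_halted=False)
    new = Cluster(version="1.79", is_active=False, is_halted=True)  # halted as a precaution

    # With the buggy targeting, halting the new cluster makes every warehouse
    # appear unhealthy and eligible for "repair", even though the active
    # cluster is fine.
    assert target_cluster_buggy([old, new]).ping() is False
    assert target_cluster_fixed([old, new]).ping() is True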
At 9:26 AM PST, within a minute of halting the new service version, the failed health checks caused an avalanche of virtual warehouse servers to be flagged as needing repair.
At that point, the system immediately began re-provisioning the servers. However, because these servers were deemed “dead”, the recovery code required them to be re-provisioned from Amazon. With thousands of affected servers requiring replacement, this process was very time consuming, and given the volume of EC2 instance requests, Amazon (very reasonably) throttled the rate at which new servers were made available. As servers became available, the virtual warehouses were restored. At 10:54 AM PST, the Snowflake internal operations team confirmed that customer virtual warehouses were restored, undergoing normal server provisioning, and operating normally.
Following Monday’s service disruption, we analyzed the observed spike in metadata usage during the upgrade and compared it to previous upgrades. We determined that the spike was expected and not an issue. The extra caution of our operations team following last week’s service outage led us to take steps that, upon further analysis, were found to be unnecessary and that ultimately resulted in an additional service disruption.
However, the underlying issue of the health check not being directed at the correct cluster was a latent problem in the Snowflake service that needed to be rectified.
We have subsequently fixed that issue and are taking multiple steps to prevent further disruptions. The purpose of our health check code is to detect failed servers and instantiate replacements for them without customer disruption. That code is being changed so that it can never de-provision all instances at once.
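As a rough illustration of such a safeguard, the sketch below caps how much of the fleet automated repair may de-provision in one pass. The threshold, function name, and structure are hypothetical assumptions, not the actual change made to the Snowflake service.

    # Hypothetical guardrail sketch, not Snowflake's actual remediation code.
    # The threshold and names are assumptions chosen for illustration.

    MAX_REPAIR_FRACTION = 0.10  # assumed cap on fleet-wide automatic repair

    def servers_safe_to_deprovision(flagged_servers, fleet_size):
        """Return the servers that automated repair may de-provision now.

        If an implausibly large share of the fleet fails health checks at once,
        treat it as a likely monitoring or configuration problem and stop,
        requiring operator review instead of mass de-provisioning.
        """
        limit = int(fleet_size * MAX_REPAIR_FRACTION)
        if len(flagged_servers) > limit:
            raise RuntimeError(
                f"{len(flagged_servers)} of {fleet_size} servers flagged as failed; "
                "exceeds repair threshold, halting automatic de-provisioning"
            )
        return flagged_servers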
Additionally, our server provisioning process is being improved to give Snowflake operations a way to re-provision a large number of servers without being rate-throttled by Amazon.
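For context, the sketch below shows one generic way to request a large number of EC2 instances while coping with API throttling: paced batches with exponential backoff on the RequestLimitExceeded error. This is an assumption-laden illustration, not Snowflake's provisioning code, and the actual improvement may work differently; the AMI, instance type, and batch size are placeholders.

    # Illustrative only: a generic pattern for launching many EC2 instances in
    # paced batches, backing off when the API throttles. Not Snowflake's code.

    import time

    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2")

    def provision_servers(total, ami_id, instance_type, batch_size=50):
        launched = 0
        delay = 1.0
        while launched < total:
            count = min(batch_size, total - launched)
            try:
                ec2.run_instances(
                    ImageId=ami_id,
                    InstanceType=instance_type,
                    MinCount=count,
                    MaxCount=count,
                )
                launched += count
                delay = 1.0  # reset backoff after a successful batch
            except ClientError as err:
                if err.response["Error"]["Code"] == "RequestLimitExceeded":
                    # EC2 is throttling this account; back off and retry the batch.
                    time.sleep(delay)
                    delay = min(delay * 2, 60.0)
                else:
                    raise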
We are also changing our release process to avoid upgrades on Monday morning, which we know is a critical time for many of our customers.
Snowflake is completely focused on delivering a highly available service to our customers, and this is something we take extremely seriously. In this case, additional caution following last week’s issues initiated a series of events that resulted in a subsequent service disruption.
While Monday’s event was very unfortunate, it did reveal an underlying problem that we have subsequently corrected. We are deeply sorry for the issues this caused our customers and are taking all possible steps to prevent a similar recurrence.
If there are any questions or issues related to this summary report, please submit a support request ticket to firstname.lastname@example.org.