The Snowflake service experienced two incidents of service degradation: the first on April 24, 2017 from 2:02 PM PDT to 3:50 PM PDT, and the second on April 25, 2017 from 1:58 AM PDT to 2:44 AM PDT. A subset of new query requests experienced intermittent failures during these incidents. Both incidents were caused by high CPU usage and memory exhaustion in a couple of processes in the Snowflake metadata store; the metadata store is designed to favor data consistency over service performance under high load.
Our investigation found that the high CPU usage and memory exhaustion in these metadata store processes triggered workload throttling across the production environment. The throttling is intended to maintain data consistency, and it resulted in degraded service performance.
During the April 24 incident, the resource issues subsided by 3:50 PM PDT as we were mitigating the impacted production processes. At that point, workload throttling was disabled and system stability was fully restored. We also performed a full health check review and confirmed that there were no other potential issues as we transitioned to the root cause investigation for this incident. Unfortunately, the same issue recurred on April 25 at 1:58 AM PDT, and the service recovered by 2:44 AM PDT.
The root cause investigation identified that the processes running at high CPU were background processes in our metadata store that are responsible for data cleaning to improve workload efficiency. Their workload increases as the data in the metadata store becomes fragmented over time. We rebuilt the data storage to drastically reduce the fragmentation, using an approach that has zero customer impact. This action not only resolved the root cause but is also expected to make the background processes run more efficiently in the near future.
We are creating a new periodic operational procedure in which we proactively eliminate data fragmentation, using the same approach that has zero customer impact, to make the background processes more efficient. Previously, we only needed to do this twice a year. As the workload continues to grow significantly, executing this maintenance task more frequently will help ensure that the Snowflake service keeps running at peak performance.
If there are any questions or issues related to this summary report, please submit a support request ticket to email@example.com.