Snowflake Dynamic Tables are wonderful! But how do you test them?

In this article we cover our experience helping an Analytics Engineering team create a next-level data infrastructure based on Snowflake Dynamic Tables with a robust and responsive data quality testing framework.

DATA QUALITY · DATA PIPELINES

Lucian

10/16/2024 · 4 min read

Context

The analytics engineering team for one of my consulting clients is exploring a move from DBT data pipelines (DBT core orchestrated by Prefect and using EKS runners) towards Snowflake Dynamic Tables.

What are Snowflake Dynamic Tables? They are declarative SQL transformations where Snowflake handles all the pipeline management behind the scenes for you, with some restrictions.
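
As a minimal illustration (table, column, and warehouse names are hypothetical), a Dynamic Table is just a SELECT plus a freshness target; Snowflake decides when and how to refresh it:

```sql
-- A declarative transformation: Snowflake manages refreshes so the result
-- stays at most 10 minutes behind the source tables.
CREATE OR REPLACE DYNAMIC TABLE orders_enriched
  TARGET_LAG = '10 minutes'
  WAREHOUSE = transforming_wh
AS
  SELECT o.order_id, o.customer_id, c.segment, o.amount
  FROM raw.orders o
  JOIN raw.customers c ON c.customer_id = o.customer_id;
```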

Why Snowflake Dynamic Tables for us? Several reasons:

  • the team is slowly becoming a bottleneck in the organization, so automated pipelines are expected to relieve the operational burden: less effort spent on materialization strategies, on debugging incremental refreshes, and on rerunning pipelines affected by source data issues that sneaked out of sight

  • for cost optimization - the organization is considering a move off Prefect, so removing the need for an orchestrator for the data pipelines would go hand-in-hand with this; the team is already using serverless ingestion via Snowpipes, so adopting Dynamic Tables for downstream transformations would remove the need for imperative data pipelines

The testing gap

So far so good, and our Dynamic Tables functional prototypes proved a success after overcoming an initial learning curve. Yet we soon realized that there would be a significant gap relative to our existing DBT + Prefect setup: testing infrastructure.

With DBT + Prefect, 80% of our testing relied on build-time tests orchestrated by DBT:

  • not_null and accepted_values, for column assertions

  • for testing primary keys and foreign keys we adopted the Snowflake-labs/dbt_constraints package, as it not only runs the tests but also handles pushing constraints to Snowflake automatically (a yml sketch of both follows this list)
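
A minimal sketch of how this looked in a schema yml (model and column names are hypothetical; test definitions follow the dbt and dbt_constraints conventions):

```yaml
version: 2

models:
  - name: orders_enriched
    columns:
      - name: order_id
        tests:
          - not_null
          - dbt_constraints.primary_key   # runs a uniqueness test and pushes a PK constraint to Snowflake
      - name: status
        tests:
          - accepted_values:
              values: ['open', 'shipped', 'cancelled']
      - name: customer_id
        tests:
          - dbt_constraints.foreign_key:  # also pushes an FK constraint
              pk_table_name: ref('dim_customers')
              pk_column_name: customer_id
```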


The remaining 20% of our testing covered:

  • source_freshness tests for key data providers

  • some isolated singular DBT tests, covering complex expectations about the relationships between business entities in our models (a sketch follows this list)

  • soft data quality tests — a custom-built, non-blocking, row-level, propagated violations flagging mechanism

  • data distribution profiles, for detecting unusual patterns in the data and its distribution without explicit expectations or assertions
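
To make the singular-test idea concrete, here is a minimal sketch (entity names are hypothetical): a singular DBT test is just a SQL file that fails whenever it returns rows.

```sql
-- tests/orders_within_contract_period.sql
-- Fails if any order falls outside its customer's contract period.
SELECT o.order_id
FROM {{ ref('orders_enriched') }} o
JOIN {{ ref('dim_contracts') }} c ON c.customer_id = o.customer_id
WHERE o.ordered_at NOT BETWEEN c.contract_start AND c.contract_end
```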

Why is testing a gap? Because Dynamic Tables are built automatically by Snowflake, without a hook mechanism for adding tests. Additionally, given that our tests are defined in DBT yml files, even if Snowflake provided a proper hook in SQL we would still have to build functionality for binding to such hooks, as the out-of-the-box materialization dynamic_table doesn’t cover it (yet).

Trying out DMFs to close the testing gap

Well, we were aware that the switch to Dynamic Tables would leave testing off the table, yet we were also expecting that a mechanism Snowflake recently introduced — Data Metric Functions — would be close to a drop-in replacement for this gap. That expectation is what really surprised us — and not in a good way.

Data Metric Functions were recently introduced in Snowflake as part of the data governance/data quality pillar:

These insights enhance your data governance posture by enabling the following:
* Data stewards to know the current state of their data based on a particular metric.
* Data engineers to take immediate action on important tables and views.
* Platform administrators to ensure data quality monitoring is done with cost, consistency, and performance.

The concept was for Snowflake to introduce a centralized, vertically-integrated mechanism to define and collect data quality metrics, upon which notifications can then be added. Some typical needs can be covered out-of-the-box with system DMFs: freshness, null count, unique/duplicate count etc.
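
Mechanically, this looks roughly as follows (a hedged sketch: object names are hypothetical, syntax per Snowflake's documentation at the time of writing). The table first gets a metric schedule, then DMFs are attached to its columns:

```sql
-- Metrics are collected on a schedule defined at the table level
ALTER TABLE orders_enriched
  SET DATA_METRIC_SCHEDULE = 'USING CRON 0 * * * * UTC';

-- Attach a system DMF to a column
ALTER TABLE orders_enriched
  ADD DATA METRIC FUNCTION SNOWFLAKE.CORE.NULL_COUNT ON (customer_id);

-- Measurements land in a shared results view
SELECT *
FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
WHERE table_name = 'ORDERS_ENRICHED';
```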

Designing the future testing framework

Architecturally, this data quality testing concept introduced by Snowflake separates the metric collection from the metric assertion, which is not a bad thing! Except that the implementation lacks significant functionality, because:

  • DMFs don't allow the customizations we need (for example, test parametrization or running custom DML), despite being built as special database objects similar to UDFs.

  • The test assertion mechanism seems completely left out, as Snowflake only provides a mechanism to create tasks but no standard mechanism for assertions and alerting. Additionally, the tasks cannot even be triggered by test executions, but rather just on CRON.

On the latter point, I would argue that even the usage of the word “metric” is borderline, given that DMFs rather implement something closer to a “measurement”. I personally see assertions and thresholds as an integral part of defining a “metric”.

The lack of DMF parametrization (mentioned in the former point) was an issue for us because our soft-quality tests sometimes flagged several potential violations for each data record, for example: “referenced customer not matching existing customers” (a foreign-key relationship issue), or “deviceType and deviceId expected to be populated both or none” (a more refined data consistency issue).

Yet, when evaluating how many violations there are by issue type, we couldn’t simply raise a test failure mentioning that “we have 100 soft quality issues” — it was important to understand that “we have 72 issues of type A and 28 issues of type B”, yet A & B weren’t known in advance when the DMF was called. In other words, we needed one metric for each soft quality issue, yet test violations were “dynamic” — embedded in the SQL logic — and not declared in the DMF definition/signature. The list of violations only came out at runtime (when a table was updated) and not at compile-time (when the table, and the corresponding DMF, would be declared).
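
To make this concrete, here is a rough sketch of the kind of SQL behind our soft-quality checks (table and violation names are hypothetical); note that the violation types emerge from the query logic at runtime:

```sql
SELECT 'orphan_customer_reference' AS violation_type, COUNT(*) AS violations
FROM events e
LEFT JOIN dim_customers c ON c.customer_id = e.customer_id
WHERE c.customer_id IS NULL

UNION ALL

SELECT 'device_fields_inconsistent', COUNT(*)
FROM events
WHERE (device_type IS NULL) <> (device_id IS NULL);
```

A DMF with a static scalar signature has no natural way to return this per-type breakdown.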

Given that DMFs expect a static signature, Snowflake expects you to have one DMF per metric. Period.
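
For reference, a custom DMF looks roughly like this (hypothetical names, syntax as documented at the time of writing): one fixed table signature in, one scalar number out.

```sql
-- One DMF per metric: a fixed signature in, a single number out
CREATE DATA METRIC FUNCTION orphan_customer_count(
  t TABLE (customer_id NUMBER)
)
RETURNS NUMBER
AS
$$
  SELECT COUNT(*)
  FROM t
  WHERE customer_id NOT IN (SELECT customer_id FROM dim_customers)
$$;
```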

Additionally, think of how DMFs would be triggered in the future we were heading towards, in which Dynamic Tables are refreshed not once per day but once every 10 minutes (we needed to reap some benefits from moving to close-to-real-time data with Dynamic Tables, right?). This would mean the testing suite is triggered once every 10 minutes, yet DMFs don't come with an out-of-the-box mechanism to test incremental data only. Imagine a table with 50 billion records, refreshed with a bunch of records every 10 minutes, for which you need to test the uniqueness of primary keys and the referential integrity of foreign keys across all 50 billion records every 10 minutes.
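
For contrast, here is a hedged sketch of what an incremental check could look like (not something DMFs offer; object names are hypothetical), using a stream to restrict a uniqueness test to keys touched since the last run:

```sql
-- Capture changes on the Dynamic Table
CREATE STREAM orders_dt_changes ON DYNAMIC TABLE orders_enriched;

-- Test uniqueness only for keys that changed since the last run,
-- instead of scanning all 50 billion rows every 10 minutes
SELECT order_id, COUNT(*) AS occurrences
FROM orders_enriched
WHERE order_id IN (SELECT order_id FROM orders_dt_changes)
GROUP BY order_id
HAVING COUNT(*) > 1;
```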

Long story short — we couldn't use DMFs to properly cover our testing needs.

Coming next

Luckily, this was a “challenge accepted!” situation for a team of seasoned data engineers.

We had the opportunity to build what Snowflake didn't — a testing mechanism that covered all our use cases, that was incremental in nature, dynamic (parametrizable), and that allowed custom “hooks” for adding conditional logic.

Article also available on Medium.com.