Take data snapshots with DBT (1)
In this case study, we present a custom solution for reshaping historical data into a snapshot format. In a modeling paradigm relying on 3NF and SCD-Type 4, analysts often find it hard to report on entity dynamics. Snapshots, to the rescue, provide an alternative perspective on the historical data that is easy to consume.
Lucian
5/15/2023 · 5 min read

One of the clients we were working with on an Analytics Engineering (AE) project, in our capacity as a data consultancy, was constantly receiving internal challenges from the end users of its datasets: the data was hard to query for further analytics.
For context, the client is a digital marketing company and a DBT shop running on AWS, with a Snowflake backend, Prefect for orchestration, and a frontend of PowerBI (for citizen analysts) or Snowsight (for power users). The AE department is gradually taking over ownership of several datasets across the organization and driving the standards for data modeling.
Context (data modeling)
The key modeling paradigm is 3NF, a decision I advocated for and drove throughout the implementation, given that one of the core pillars when founding a dedicated AE department was to improve data quality and trust.
Despite having achieved significant milestones on the AE mandate of data quality and trust, the decision to model and expose 3NF data put the burden of denormalization and of building “point-in-time snapshots” on downstream organizations (e.g. analytics), forcing several consumers of the data to apply the same reshaping logic over and over again.
This is where AE stepped in: since the entire organization relies on the models built by AE, and some typical query patterns do not sit well with the baseline 3NF modeling paradigm, we decided to proactively expose some of the key datasets in an alternative “point-in-time” (snapshot) form.
The design of “point-in-time” snapshots
We envision “point-in-time snapshots” (further on, simply called “snapshots”) as periodic samples taken to represent the state of a specific business entity at a specific point-in-time. Imagine looking at a dataset and running a simple task such as:
insert into <entity_snapshot>
select date(...) as effective_on, * from <entity>
In practice, the task is rarely as simple as selecting from a table: doing so would impose constraints on the exact timing of the snapshot run, and there would be no way to reproduce history. To overcome this, I usually build a parametrizable structure (a parametrized view, e.g. a Tabular SQL UDF in Snowflake) that takes an `effective_on` date as input and returns the snapshot as-of that specific point in time. The snapshot logic is then responsible only for the materialization of the snapshots; see more details below.
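To make this concrete, here is a minimal sketch of such a parametrized view as a Snowflake Tabular SQL UDF. The `customer_history` table and its SCD-2 style `valid_from`/`valid_to` columns are assumptions for illustration, not the actual client schema:
-- Hypothetical parametrized view: reconstructs the state of a customer
-- entity as-of an arbitrary date. Table and column names are illustrative.
create or replace function customer(effective_on date)
returns table (effective_on date, customer_id number, status varchar)
as
$$
    select effective_on, customer_id, status
    from customer_history
    where valid_from <= effective_on
      and (valid_to > effective_on or valid_to is null)
$$;
-- Usage: reconstruct the entity for any historical date.
select * from table(customer('2023-01-31'::date));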
In order to keep complexity from exploding, we discussed and agreed on a set of guiding principles for the implementation:
snapshots MUST NOT become a parallel modeling paradigm maintained by AE for its datasets; they are only deployed on a case-by-case basis
snapshots MUST NOT incorporate transformation business logic; if such logic is required or requested by consumers, it should first be incorporated into one of the generic datasets managed by AE, and only afterwards should a snapshot be built on top of it in a minimalistic, technical way, as indicated above
Solution prototype in DBT
We focused on a simple and elegant solution, based solely on configuration. DBT, with its Jinja macro support, was our friend here, letting us build reusable logic and keep it separate from the transformation business logic (thumbs-up for maintainability):
[/dbt/macros/execute_snapshot.sql]
{# Manage the materialization of a periodic snapshot for an existing model. #}
{% macro execute_snapshot(this, defer_days = 1) -%}
    {% if execute %}
        {# Obtain the list of columns of the underlying entity automatically #}
        {% set query %}
            select trim(replace(replace(DATA_TYPE, 'TABLE (', ''), ')', ''))
            from {{ this.database }}.information_schema.functions
            where function_schema = upper('{{ this.schema }}')
            and function_name = upper('{{ this.identifier }}')
        {% endset %}
        {# print(query) #}
        {% set results = run_query(query) %}
        {% set funcsig = results.columns[0].values()[0] %}

        {# Materialization infrastructure (idempotent) #}
        {% set query %}
            create table if not exists {{ this }} ({{ funcsig }})
        {% endset %}
        {# print(query) #}
        {% set results = run_query(query) %}

        {# Snapshot materialization logic #}
        {% set query %}
            insert into {{ this }}
            select * from table({{ this }}(CURRENT_DATE - {{ defer_days }}))
        {% endset %}
        {{ print(query) }}
        {% set results = run_query(query) %}
    {% endif %}
{%- endmacro %}
At a minimum, the macro must handle the materialization infrastructure (idempotent creation of the target table) and the materialization logic (the actual insert for each snapshot run). Creating the table on the first run would normally require passing the full list of columns of the target entity to the macro, which would be noisy, verbose and error-prone; instead, we kept a clean interface by inferring the columns and data types of the snapshot source from the Snowflake `INFORMATION_SCHEMA` metadata.
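For illustration, the metadata round-trip works roughly as follows (the signature below is an assumed example, not the actual project schema):
-- For a Tabular SQL UDF, INFORMATION_SCHEMA.FUNCTIONS.DATA_TYPE holds the
-- full return signature, for example:
--   TABLE (EFFECTIVE_ON DATE, CUSTOMER_ID NUMBER, STATUS VARCHAR)
-- After stripping 'TABLE (' and the trailing ')', the macro is left with
--   EFFECTIVE_ON DATE, CUSTOMER_ID NUMBER, STATUS VARCHAR
-- which can be spliced directly into the idempotent DDL:
create table if not exists customer (EFFECTIVE_ON DATE, CUSTOMER_ID NUMBER, STATUS VARCHAR)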
The `defer_days` parameter makes it possible to defer building a snapshot (for example, when data is expected to be incomplete or to receive updates within a defined time window after the initial ingestion). We use this because we want our snapshots to be stable: if we know that data for the last business day is incomplete or unreliable, we don't want to build a snapshot that would change later. Immutability is always beneficial in data processing.
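As an illustration (with an assumed target name; per the macro, the UDF shares its name with the target table), a run on 2023-05-15 with `defer_days = 2` would generate:
insert into analytics.marts.customer
select * from table(analytics.marts.customer(CURRENT_DATE - 2))
-- i.e. the snapshot captures the state as-of 2023-05-13, staying clear
-- of the two most recent, potentially unstable, days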
Source types
While implementing the prototype, we realized that there are two types of sources that we want to support:
sources for which there is no possibility to reconstruct historical point-in-time data (because there is no data collected in the DWH supporting that use case; for example, data is refreshed in-place); we will call these “un-historized sources”, and
sources for which we can reconstruct any arbitrary point-in-time; we will call these “historized sources”.
For un-historized sources, the orchestration for taking the snapshot is paramount, as there is no possibility to travel back in time — a snapshot which is not taken at the appropriate time is a snapshot that is lost, with a direct impact on downstream consumers. In a properly designed DWH, one should not encounter these often.
For historized sources, building snapshots is simply a problem of reshaping the data. Consumers already have the possibility to reconstruct a point-in-time of an entity (a query which is sometimes simple, but other times gets prohibitively complex for unsophisticated consumers). Still, we found it beneficial to cut down on the resulting complexity (code duplication and inconsistent outcomes) by standardizing the process and centralizing some of the key entity snapshots from across the organization at the level of the AE department.
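To give a feel for the complexity being centralized, below is the kind of as-of reconstruction each consumer would otherwise have to repeat; the SCD-2 style tables and columns are assumptions for illustration:
-- Reconstruct orders joined to customer state, both as-of 2023-01-31
select o.order_id, o.amount, c.status as customer_status
from orders_history o
join customer_history c
  on c.customer_id = o.customer_id
 and c.valid_from <= '2023-01-31'::date
 and (c.valid_to > '2023-01-31'::date or c.valid_to is null)
where o.valid_from <= '2023-01-31'::date
  and (o.valid_to > '2023-01-31'::date or o.valid_to is null)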
Implementation
Our frontrunner case was based on a “historized source”, and the implementation we chose was a Snowflake Tabular SQL UDF.
Why a UDF? Because, in line with the principles stated above, we wanted to keep business logic separate from the technical implementation. For us, “business logic” was the process of preparing, filling and reshaping data coming from different sources, and providing the ability to reconstruct the point-in-time state for any arbitrary date (provided as a parameter). Parametrized views in Snowflake come in the form of Tabular SQL UDFs.
As a benefit of having access to historized sources, we were able not only to build snapshots going forward, but also to construct snapshots retroactively and provide immediate value to downstream consumers. We backfilled the data manually for the frontrunner, but put robust backfilling at the top of our backlog for v2.
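For reference, the manual backfill amounted to replaying the macro's insert for each missing date, along the lines of the sketch below (names assumed, one statement per historical `effective_on`):
insert into analytics.marts.customer
select * from table(analytics.marts.customer('2023-01-01'::date))

insert into analytics.marts.customer
select * from table(analytics.marts.customer('2023-01-02'::date))
-- ...and so on, one insert per missing effective_on date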
Conclusion
Taking a snapshot of a model in DBT is now configured in a few lines of code, as a post-hook for an existing model that incorporates the business logic:
[<model_name>.sql]
{{
    config(
        post_hook = "{{ execute_snapshot(this, 2) }}"
    )
}}
select
    date(...) as effective_on,
    current_timestamp as snapshot_at,
    <other_payload_fields>
from <underlying_model>
Simple? Yes.
Elegant? Pretty much so.
Production-ready? Not yet.
In part 2, I will cover the challenges that we encountered with the initial design and how we overcame them, as well as some improvements like robust data backfilling. Stay tuned!
Article also available on Medium.com.