A new object arrives in Snowflake... the Data Metric Function!
You can use standard functions or create your own to measure the quality of your data.
What better way to discover this new feature than with an example?
Let’s start with the basic but recurring need for automatic duplicate detection.
As so often with Snowflake, we start by granting the necessary roles and privileges:
use role accountadmin;
grant database role SNOWFLAKE.DATA_METRIC_USER to role sysadmin;
grant execute data metric function on account to role sysadmin;
grant application role snowflake.data_quality_monitoring_viewer to role sysadmin;
Let's switch to our working context:
USE ROLE SYSADMIN;
USE DATABASE demo_db;
USE SCHEMA public;
Then, let's create a table with 1 duplicate:
CREATE OR REPLACE TABLE region
AS
SELECT * FROM snowflake_sample_data.tpch_sf1.region
UNION ALL
(SELECT * FROM snowflake_sample_data.tpch_sf1.region limit 1)
;
Manually invoke the system function for detecting duplicates:
select snowflake.core.duplicate_count(select r_name from region);
Here is the result:
SNOWFLAKE.CORE.DUPLICATE_COUNT(SELECT R_NAME FROM PUBLIC.REGION)
1
We indeed find 1 duplicate.
Let's add a trigger on our table so that quality checks are executed when the data changes:
ALTER TABLE region SET
DATA_METRIC_SCHEDULE = 'TRIGGER_ON_CHANGES';
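TRIGGER_ON_CHANGES runs the checks whenever DML touches the table. If you would rather run them on a fixed schedule, the same property also accepts an interval or a cron expression; a sketch (the schedules below are just examples to adapt):

```sql
-- Run the attached metrics every 5 minutes:
ALTER TABLE region SET
  DATA_METRIC_SCHEDULE = '5 MINUTE';

-- Or with a cron expression, e.g. every day at 08:00 UTC:
ALTER TABLE region SET
  DATA_METRIC_SCHEDULE = 'USING CRON 0 8 * * * UTC';
```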
Finally, let’s add the duplicate detection function to our table:
ALTER TABLE region
ADD DATA METRIC FUNCTION snowflake.core.duplicate_count
ON (r_name);
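To double-check which metrics are now attached to the table, you can query the INFORMATION_SCHEMA table function, a sketch (see the docs for the exact output columns):

```sql
SELECT *
FROM TABLE(INFORMATION_SCHEMA.DATA_METRIC_FUNCTION_REFERENCES(
  REF_ENTITY_NAME => 'demo_db.public.region',
  REF_ENTITY_DOMAIN => 'table'
));
```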
All that remains is to inject new duplicates:
INSERT OVERWRITE INTO region
SELECT * FROM snowflake_sample_data.tpch_sf1.region
UNION ALL
(SELECT * FROM snowflake_sample_data.tpch_sf1.region limit 3);
Tip 🎁: the OVERWRITE keyword truncates the table before inserting.
Wait a few minutes for the metric to run before querying the result.
Another tip 🎁: the TABLE keyword saves you from writing SELECT * FROM ...
So handy ♥️
Let’s now retrieve the result of our check:
TABLE snowflake.local.data_quality_monitoring_results;
This table contains many columns.
Let's keep only the ones that interest us:
select measurement_time,
metric_database,
table_schema,
table_name,
metric_schema,
metric_name,
value
from snowflake.local.data_quality_monitoring_results;
And we indeed find our 3 duplicates:
MEASUREMENT_TIME               METRIC_DATABASE  TABLE_SCHEMA  TABLE_NAME  METRIC_SCHEMA  METRIC_NAME      VALUE
2024-04-09 10:35:22.290 -0400  SNOWFLAKE        PUBLIC        REGION      CORE           DUPLICATE_COUNT  3
Of course, you can create alerts or dashboards from this data.
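For instance, a scheduled alert that fires as soon as the metric reports a non-zero value might look like this (the warehouse name, schedule, notification integration, and email address are placeholders to adapt):

```sql
CREATE OR REPLACE ALERT duplicate_alert
  WAREHOUSE = my_wh            -- placeholder warehouse
  SCHEDULE = '60 MINUTE'
  IF (EXISTS (
    SELECT 1
    FROM snowflake.local.data_quality_monitoring_results
    WHERE metric_name = 'DUPLICATE_COUNT'
      AND table_name  = 'REGION'
      AND value > 0
      -- only look at measurements since the last successful run:
      AND measurement_time > SNOWFLAKE.ALERT.LAST_SUCCESSFUL_SCHEDULED_TIME()
  ))
  THEN CALL SYSTEM$SEND_EMAIL(
    'my_email_integration',    -- placeholder notification integration
    'me@example.com',
    'Data quality alert',
    'Duplicates detected in REGION'
  );

ALTER ALERT duplicate_alert RESUME;  -- alerts are created suspended
```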
We could already do all this programmatically. It’s just become easier! 🎉
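And since the standard functions won't cover every need, here is a minimal sketch of a custom DMF (the function name and the rule it checks are illustrative):

```sql
-- Counts rows where the column is NULL or blank (illustrative rule):
CREATE OR REPLACE DATA METRIC FUNCTION demo_db.public.blank_count(
  arg_t TABLE(arg_c1 STRING)
)
RETURNS NUMBER
AS
'SELECT COUNT_IF(arg_c1 IS NULL OR TRIM(arg_c1) = '''') FROM arg_t';

-- Attach it exactly like a system metric:
ALTER TABLE region
  ADD DATA METRIC FUNCTION demo_db.public.blank_count ON (r_comment);
```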
Don't forget to clean up afterwards:
DROP TABLE region;
I hope you will use and take full advantage of these handy functions to monitor data quality.
To learn more about DMFs... RTFM 😁
https://docs.snowflake.com/en/user-guide/data-quality-working