The Context
Imagine you're the new CIO of an international group.
One of the first projects you want to undertake is to harmonize the data ecosystem.
You understand that the Data Mesh approach must provide strong autonomy to teams, but you don’t forget that one of the most important aspects of Data Mesh is Federated Governance.
This article discusses the implementation of Federated Governance over Snowflake.
A Bit of Vocabulary
In Snowflake, you have several containers at your disposal to “organize” your data.
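Here is their hierarchy, from broadest to narrowest: organization, account, database, schema, then objects such as tables and views.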
That’s all for the technical part. Let’s now talk about what makes data architectures exciting.
Snowflake Governance
To ensure proper data usage and management within Snowflake, you may want to monitor the following:
- Data classification.
- Application of mandatory tags (e.g., PRIVACY_CATEGORY and SEMANTIC_CATEGORY) to identify sensitive data.
- Adherence to naming conventions.
- Implementation of masking rules (masking policy).
- Usage of data access restrictions (row access policy, aggregation policy, projection policy).
- Access history to certain sensitive objects.
- Financial tracking of resource usage.
- Network security rules, and management of compute units (warehouses) through resource monitors.
- Secure access management (SSO, MFA, key-pair authentication).
All governance elements are now under a new umbrella called Snowflake Horizon, which continues to expand.
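To make one of these checks concrete, here is a minimal sketch of a naming-convention audit. The convention itself (a department prefix followed by upper snake case) and the prefixes FIN, HR and MKT are assumptions made up for illustration, not a Snowflake rule.

```python
import re

# Hypothetical convention: <DEPARTMENT>_<UPPER_SNAKE_CASE>.
# The prefixes are illustrative; replace them with your own departments.
NAME_PATTERN = re.compile(r"^(FIN|HR|MKT)_[A-Z][A-Z0-9_]*$")


def naming_violations(object_names):
    """Return the object names that do not match the assumed convention."""
    return [name for name in object_names if not NAME_PATTERN.match(name)]
```

In practice the list of names would come from the account's metadata; here it is just a Python list so the check can be run anywhere.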
3 Options for Managing Your Governance
We have three options for managing accounts:
1️⃣ Use a single account and separate departments into different databases.
2️⃣ Use multiple accounts and deploy your governance from your CI/CD.
3️⃣ Use multiple accounts, including a Zero Data Account that carries the governance.
Option 1: Single Account
It's possible, and indeed it has been done for years, to use a single Snowflake account and isolate departments into separate databases.
Example in a diagram:
Advantages:
- Joins are possible directly.
- Governance is very simple: direct access to all metadata.
Disadvantages:
- Data isolation relies solely on role management.
- Roles will proliferate, quickly leading to several dozen roles.
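A quick way to see how fast roles multiply is to enumerate them. The sketch below assumes one role per department, environment and access level; the role naming scheme and the example values are hypothetical.

```python
from itertools import product


def functional_roles(departments, environments, access_levels):
    """Build one role name per (department, environment, access level) combination."""
    return [
        f"{dept}_{env}_{level}"
        for dept, env, level in product(departments, environments, access_levels)
    ]


# Illustrative values only: 4 departments x 2 environments x 3 access levels.
roles = functional_roles(
    ["FINANCE", "HR", "MARKETING", "SALES"],
    ["DEV", "PROD"],
    ["READ", "WRITE", "ADMIN"],
)
```

With these assumed values the single account already carries 24 roles, before adding any service or project-specific role.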
Option 2: Multiple Accounts with CI/CD Deployment
We can use one CIO account plus one account per department, provided each department is capable of operating its own Snowflake account. If that isn't the case, it might be preferable to keep objects in the CIO account and provide a complete service to subsidiaries.
To deploy governance elements (tags, masking policies, etc.), we won't manually execute scripts. No, not here.
We will use CI/CD (e.g., GitHub Actions + schemachange) to automatically execute scripts across the different Snowflake accounts.
We can also use Snowflake's direct Git integration and code a procedure that manages deployments. I'll talk about this in a future article.
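As a rough illustration of what such a pipeline has to produce, the sketch below renders the same idempotent tag statements for every account. The account names, the GOVERNANCE.TAGS schema and the allowed values are assumptions for the example, and the code only builds SQL text; actually running it (through schemachange or any other tool) is left out.

```python
# Hypothetical mandatory tags and allowed values, for illustration only.
MANDATORY_TAGS = {
    "PRIVACY_CATEGORY": ["IDENTIFIER", "QUASI_IDENTIFIER", "SENSITIVE", "INSENSITIVE"],
    "SEMANTIC_CATEGORY": ["EMAIL", "NAME", "PHONE_NUMBER"],
}


def render_tag_statements(schema, tags):
    """Render one idempotent CREATE TAG statement per tag, in a stable order."""
    statements = []
    for tag_name, allowed in sorted(tags.items()):
        values = ", ".join("'" + value + "'" for value in allowed)
        statements.append(
            f"CREATE TAG IF NOT EXISTS {schema}.{tag_name} ALLOWED_VALUES {values};"
        )
    return statements


def deployment_plan(accounts, schema, tags):
    """Map each target account to the identical list of statements to run there."""
    statements = render_tag_statements(schema, tags)
    return {account: list(statements) for account in accounts}
```

Using IF NOT EXISTS keeps the script safe to replay, which matters when the same change set is pushed to accounts that may already hold some of the objects.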
Note: You will not be able to perform joins from one account to another directly in your queries.
Instead, we will publish the shared object (table, view, etc.) in a private listing accessible to one or more accounts.
I see this constraint as an opportunity to manage the publication of data products in a more controlled way. Once you know you are a data producer, you must provide your consumers with a quality experience (documentation, quality data, no changes to the interface contract, etc.).
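The "no changes to the interface contract" part can be checked mechanically before each publication. Here is a minimal sketch, assuming the contract is simply a mapping of column names to types; the function name and the rule that added columns are harmless are my own assumptions.

```python
def breaking_changes(published, candidate):
    """Compare two {column: type} mappings and list changes that would break consumers.

    Removed columns and changed types are reported; added columns are treated
    as non-breaking in this sketch.
    """
    problems = []
    for column, col_type in published.items():
        if column not in candidate:
            problems.append(f"removed column {column}")
        elif candidate[column] != col_type:
            problems.append(
                f"type of {column} changed from {col_type} to {candidate[column]}"
            )
    return problems
```

A producer could run this against the currently published version and refuse to update the listing when the returned list is not empty.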
Advantages:
- Clearer separation of responsibilities and data.
- Fewer roles.
Disadvantages:
- Joins across accounts are not possible directly; the data must be published first.
Option 3: Multiple Accounts Including a Zero Data Account
The Zero Data Account approach is now possible in Snowflake thanks to the introduction of replication groups. The principle itself is simple and already widespread in DevOps practice (e.g., AWS Control Tower).
Instead of using CI/CD, we will deploy governance elements directly from Snowflake through replication groups that are replicated to other accounts in the same organization.
We can centralize all previously mentioned governance information at this Zero Data Account level, which, as the name suggests, does not intend to host data.
The Zero Data Account must be on the Business Critical edition to use governance object replication.
I haven’t found in the documentation whether the target accounts also need to be Business Critical, but it seems they do not.
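To give an idea of what the setup looks like, the sketch below renders the two statements involved: one on the Zero Data Account (source) and one on each target account. It follows my understanding of the CREATE REPLICATION GROUP syntax, so check it against the Snowflake documentation before use. The group, database, organization and account names are hypothetical, and the sketch assumes the governance objects (tags, policies) live in a dedicated database that is replicated as a whole.

```python
def source_statement(group, database, org, target_accounts, schedule="60 MINUTE"):
    """Statement to run on the Zero Data Account: create the primary group."""
    targets = ", ".join(f"{org}.{account}" for account in target_accounts)
    return (
        f"CREATE REPLICATION GROUP IF NOT EXISTS {group} "
        f"OBJECT_TYPES = DATABASES "
        f"ALLOWED_DATABASES = {database} "
        f"ALLOWED_ACCOUNTS = {targets} "
        f"REPLICATION_SCHEDULE = '{schedule}';"
    )


def target_statement(group, org, source_account):
    """Statement to run on each target account: create the secondary group."""
    return (
        f"CREATE REPLICATION GROUP IF NOT EXISTS {group} "
        f"AS REPLICA OF {org}.{source_account}.{group};"
    )
```

For example, `source_statement("GOVERNANCE_RG", "GOVERNANCE", "MYORG", ["FINANCE", "HR"])` yields the statement for the source side, and `target_statement("GOVERNANCE_RG", "MYORG", "ZERO_DATA")` the one for each department account.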
To verify that accounts have properly applied the governance elements, we could ask them to share views like SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES with us.
But we can also talk to each other during federated governance meetings. I prefer that.
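For the table-sharing route, the check itself is straightforward once the rows are consolidated. The sketch below assumes each row is a dict with a subset of the TAG_REFERENCES columns (OBJECT_NAME, COLUMN_NAME, TAG_NAME) plus an ACCOUNT field that we add ourselves when gathering the shares; the list of sensitive columns is also assumed to be known in advance.

```python
# Tags every sensitive column is expected to carry (taken from the list above).
MANDATORY_TAGS = {"PRIVACY_CATEGORY", "SEMANTIC_CATEGORY"}


def missing_tags(sensitive_columns, tag_rows):
    """Report, per sensitive column, which mandatory tags are not applied.

    sensitive_columns: iterable of (account, table, column) tuples.
    tag_rows: iterable of dicts with keys ACCOUNT, OBJECT_NAME, COLUMN_NAME, TAG_NAME.
    """
    applied = {}
    for row in tag_rows:
        key = (row["ACCOUNT"], row["OBJECT_NAME"], row["COLUMN_NAME"])
        applied.setdefault(key, set()).add(row["TAG_NAME"])

    report = {}
    for key in sensitive_columns:
        gaps = MANDATORY_TAGS - applied.get(key, set())
        if gaps:
            report[key] = sorted(gaps)
    return report
```

An empty report means every listed column carries both tags; anything else makes a good agenda item for the next federated governance meeting.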
Advantages:
- Deployment is managed from Snowflake.
- Supervision and auditing are centralized.
Disadvantages:
- A certain number of accounts are needed to justify creating the Zero Data Account.
Conclusion
Using Snowflake is simple.
Administering Snowflake within a large company while ensuring governance consistency across various accounts requires a bit more thought.
Without that, where would the fun be?!
Sources
Webinar: Gouvernance des données dans un data mesh (DCWT 23), Jade Le Van, Nicolas Lerose
Masterclass: Deliver a Domain-Driven Data Mesh Architecture Successfully