Data flow isn’t one-way – it’s circular. While the analytics team may have the deepest understanding of the data and the most experience translating raw data into insights, data is both generated and used outside of the analytics team. This circularity adds complexity whenever data moves, whether through ELT or reverse ETL.

In both ELT and reverse ETL pipelines, data moves across team boundaries. What happens, though, when something behaves unexpectedly? All of a sudden, an email that’s supposed to have 1,000 recipients has five – or worse, it doesn’t go out at all.

Data hygiene: What it is and why it matters

Tools like Census and Fivetran have expanded the impact of analytics teams. Where they once spent their days writing shell scripts, they can now easily transform data and influence downstream automations run by other teams.

The goal of data hygiene is to make sure expectations across all of these teams align. Without this alignment, the value of data (and insight from it) suffers from one key faux pas: Miscommunication. Making decisions with a partial or incorrect understanding of existing behavior is worse than making decisions without any information at all.

Good data hygiene means data is correct and easy to draw insight from. This definition raises the question: How do you achieve it?

Let’s break down data hygiene into three distinct principles:

Quality. Having quality data means the information accurately represents reality. If a user received an email once, there should be one and only one record of the user receiving the email.
Architecture. Having good data architecture means having clear relationships defined between objects and data sources, linked through unique identifiers.
Efficiency. Good architecture and quality data together lead to efficient querying. When relationships are based on specific foreign keys and the data has no duplicates, fewer aggregations are needed to gain insight.
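
To make the first two principles concrete, here is a minimal Python sketch of what checking them might look like. The records, field names and user IDs are all hypothetical, invented for illustration; in practice these checks would more likely run as queries or tests against the warehouse.

```python
from collections import Counter

# Hypothetical email-send events; the field names are illustrative only.
email_events = [
    {"user_id": "u1", "email_id": "welcome"},
    {"user_id": "u2", "email_id": "welcome"},
    {"user_id": "u1", "email_id": "welcome"},  # duplicate record
    {"user_id": "u9", "email_id": "welcome"},  # user missing from the users table
]
known_user_ids = {"u1", "u2", "u3"}


def find_duplicates(events, key_fields):
    """Return keys that appear more than once (a quality problem)."""
    counts = Counter(tuple(e[f] for f in key_fields) for e in events)
    return sorted(k for k, n in counts.items() if n > 1)


def find_orphans(events, fk_field, valid_ids):
    """Return foreign-key values with no matching parent record (an architecture problem)."""
    return sorted({e[fk_field] for e in events} - set(valid_ids))


duplicates = find_duplicates(email_events, ("user_id", "email_id"))
orphans = find_orphans(email_events, "user_id", known_user_ids)
print(duplicates)  # [('u1', 'welcome')]
print(orphans)     # ['u9']
```

When both lists come back empty, a simple join on user_id is enough to count sends per user, which is the efficiency payoff described above.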

Remember: Data is both generated and used outside of the analytics team. Unfortunately, this leaves ownership of the various aspects of data hygiene unclear. To combat this complexity, the analytics team and data stakeholders need to collaborate closely to bridge the gap from raw data to insight. Data hygiene is not limited to raw data or data in a warehouse; it also applies to data living in every operational tool, whichever team owns it.

Data hygiene in action

Consider your typical marketing team using HubSpot. Data is generated by the marketing team from within HubSpot through the creation of email campaigns and lifecycle campaign triggers. Using an ELT tool like Fivetran, the analytics team pipes HubSpot data into a data warehouse, where it can be modeled and enriched. 🌱

All of this happens alongside other data sources, centralizing everything. Logic for flagging users that are most likely to make a purchase from an email, for example, is developed in the same place where the raw data lives, shortening the development lifecycle.

Using an operational analytics tool like Census, the analytics team can then pipe the newly generated flag back into HubSpot (yes, where the data originated!). Finally, all of this effort positively impacts the performance of the email campaigns themselves: The marketing team can use the additional information to engage users more personally.

The process of operationalizing data also creates centralization, but for the marketing team. After all, each team should have access to data where they already work. For data teams, this means it should be in the warehouse, but for marketing teams, it should be actionable in their CRM.

And, when a process works as expected, everything is grand: Skies are blue, birds chirp and the marketing team has better conversions from their emails. 🐦

The stakeholder’s role in designing data flow

While the analytics team may own data transformations, the stakeholder generates the raw data in operational tools, as in the previous example with HubSpot. The process of achieving good data hygiene creates alignment between the analytics team and the stakeholder (the marketing team in our example).

A lack of clear ownership is the equivalent of no ownership. Moving data from an operational tool into the analytics team’s world, transforming it and then moving it back involves a lot of steps.

In recent years, some of this complexity has actually come from ownership becoming more clearly defined (which is a good thing). Analytics teams own the data transformations that live in the warehouse, while business teams own the processes for how they actually use the data.

Considering the three principles of data hygiene outlined above, let’s break them down into who owns what:

Quality:

Stakeholder: Keep test accounts for running through scenarios that generate new data, and QA the results within the team before rolling out a change. For instance, send a new email to yourself through HubSpot and partner with the analytics team to make sure the data generated meets expectations. Document these expectations so that seemingly unrelated changes don’t accidentally alter the data generated.
Analytics team: Run tests and set up alerts on business rules defined in the data warehouse, communicating back to the stakeholder when tests fail. The more places data is tested, the more likely poor data quality is to be caught and resolved quickly. These tests and alerts should exist on both raw and transformed data to better identify exactly where failures happen.
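
As a rough illustration of testing both raw and transformed data, the Python sketch below labels each check with the stage it covers, so a failure points at where the problem was introduced. The table contents, rule names and the run_checks helper are assumptions made for this example, not part of any particular testing tool.

```python
def run_checks(checks):
    """Run each (stage, name, predicate) check and collect the failures."""
    failures = []
    for stage, name, passed in checks:
        if not passed():
            failures.append(f"{stage}: {name}")
    return failures


# Hypothetical raw rows, as they might land in the warehouse from an ELT sync.
raw_sends = [
    {"user_id": "u1", "email_id": "welcome", "sent_at": "2023-01-10"},
    {"user_id": None, "email_id": "welcome", "sent_at": "2023-01-10"},
]

# Hypothetical transformed model: sends per user, skipping rows with no user.
sends_per_user = {}
for row in raw_sends:
    if row["user_id"] is not None:
        sends_per_user[row["user_id"]] = sends_per_user.get(row["user_id"], 0) + 1

checks = [
    ("raw", "user_id is never null",
     lambda: all(r["user_id"] is not None for r in raw_sends)),
    ("transformed", "every user has at least one send",
     lambda: all(n >= 1 for n in sends_per_user.values())),
]

failures = run_checks(checks)
print(failures)  # ['raw: user_id is never null']
```

Here the transformed model looks healthy on its own, and only the raw-stage check reveals that a row was silently dropped. That is the case for testing at both stages, and the failure message is what gets communicated back to the stakeholder.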

Architecture:

Stakeholder: Document all properties used in business logic, especially ones tied to mission-critical KPIs. Share this with the data team to inform tests and increase data quality. Document these properties in a central location that is accessed by both analytics and business teams.
Analytics team: When it comes to activating data through reverse ETL, outline the structure of the synced data with the stakeholder and make sure it fits the use case. Operational tools can be tricky and often require data to be formatted in a specific way.
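
One lightweight way to capture that agreed structure is a small, shared contract that outgoing records are checked against before a sync. The Python sketch below is only an illustration: the field names and types are hypothetical, and it does not reflect HubSpot’s or Census’s actual schemas or APIs.

```python
# Hypothetical contract agreed with the marketing team; not a real HubSpot schema.
EXPECTED_FIELDS = {
    "email": str,
    "likely_to_purchase": bool,
}


def validate_record(record, expected_fields):
    """Return a list of human-readable problems with one outgoing record."""
    problems = []
    for field, expected_type in expected_fields.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"{field} should be {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return problems


good = {"email": "a@example.com", "likely_to_purchase": True}
bad = {"email": "b@example.com", "likely_to_purchase": "yes"}

print(validate_record(good, EXPECTED_FIELDS))  # []
print(validate_record(bad, EXPECTED_FIELDS))   # ['likely_to_purchase should be bool, got str']
```

Keeping the contract in a place both teams can read lets it double as the documentation described in the stakeholder’s half of this section.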

Efficiency:

Analytics team: This one’s all you. With proper architecture and quality data in place, it all comes down to writing efficient queries before sending data back into the tool used by the stakeholder. Work with the engineering team and data pipeline tools to make sure ERDs align with how the data is going to be used.

Ultimately, data hygiene relies on communication and clearly defined processes. Then, everyone can execute on business and technical requirements. But the stakes are high: Failure in process or execution means the business suffers.

With activated data, hygiene is business critical

Data hygiene sounds nice, but prioritizing it starts with understanding what happens when it falls by the wayside.

ELT tools pipe raw data from operational tools to a warehouse. They don’t guarantee, however, the quality of data coming from the source; that’s where the users of the operational tool lead the charge.

In reverse ETL, automations are built on top of data sent from the analytics team. If this data is incomplete or poorly tested, the automations built on it will underperform.
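
A simple guard against the scenario from the start of this piece, where an email meant for 1,000 recipients reaches only five, is to compare the audience size against the previous run before syncing. The Python sketch below assumes a hypothetical audience_size_ok helper and an arbitrary 50% threshold; the right threshold depends on how much an audience is expected to vary.

```python
def audience_size_ok(previous_count, current_count, max_drop_ratio=0.5):
    """Return False when the audience shrank by more than max_drop_ratio."""
    if previous_count == 0:
        return True  # nothing to compare against yet
    drop = (previous_count - current_count) / previous_count
    return drop <= max_drop_ratio


print(audience_size_ok(1000, 980))  # True: a normal fluctuation
print(audience_size_ok(1000, 5))    # False: hold the sync and investigate
```

A check like this doesn’t say what went wrong, only that the sync should pause until someone on the analytics or marketing team has looked.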

Data hygiene needs to be considered at each step in the circular data flow between operational tools and the warehouse.

Data insights

Data hygiene in ETL and reverse ETL

January 16, 2023

Sean Lynch

Chief Product Officer

Census