This post is adapted from Chartio’s Data Governance web book. The author, Matthew David, is Product Lead at Chartio, and formerly developed data science courses for Udacity.
As more people depend on data in their daily workflow, organizations need to think critically about the quality of the data being provided. Having a small team field all data questions will not scale, so companies must move from a centralized data organization to a decentralized one.
Why You Need Data Governance
According to Gartner, data governance consists of assigning decision rights and accountability in order to properly manage and control the creation and consumption of data. It ensures data quality and adherence to compliance standards. This procedure is best fulfilled by designating a data governor who owns the data governance process.
If your organization has built a data warehouse and meets any of the following conditions, we strongly recommend appointing at least one data governor:
- Uses self-service dashboards
- Operates in an industry with regulations and compliance procedures
- Has large data sources spanning different departments
- Strives for operational intelligence
Without oversight from a data governor, sensitive data may be shared inappropriately, employees may lack access to data or misinterpret it, and analyses are more likely to be incorrect. A data governor maintains and improves the quality of data and ensures your company complies with applicable regulations. The role is vital to keeping your company data-literate.
Data Governors for Data Governance
As the volume of data at a company explodes, it becomes extremely difficult for a small technical team to simultaneously govern an organization’s data and field a report for every question. Data scientists and analysts should shift from traditional reporting responsibilities to data governance.
In a traditional reporting role, analysts answer questions as needed for other members of the organization. The shift to data governance instead calls for analysts to create clean, documented data products that end users can explore themselves. This is called democratized data governance: well-governed data empowers people across the entire organization to conduct their own analytics.
The Roles of the Data Governor
As a data governor, your role changes at each stage of sophistication. You bravely lead your company from struggling to get value out of its data to producing accurate insights consistently. Let’s step through each of the roles you will play.
1. Data Cleanup and Maintenance
The majority of the technical work of data governance is around collecting, cleaning and maintaining various data sets. This is a multi-part activity; here are the subtypes:
Data Piping (ETL/ELT) and Warehousing
Data will exist in many different silos across your organization. A big part of your job may consist of moving these disparate data sets into an accessible central environment, typically a data warehouse such as Google BigQuery or Amazon Redshift. Extract, Transform and Load (ETL) and Extract, Load and Transform (ELT) tools such as Stitch and Fivetran can handle this movement.
For most companies, the team collecting the data reports on the data. Its analysts understand idiosyncrasies such as where certain fields are stored, how to exclude deleted or expired accounts, how to join certain records together, and more.
As your organization grows, the importance of data accessibility grows, and the people exploring the data aren’t always the ones who collect and store it. This means that you must clean, structure, summarize and clearly label the data for the benefit of non-technical users. These preparatory steps are collectively called transformation.
Some BI products feature tools to handle transformations, but it’s generally more sustainable to transform at the database level. Create new schemas in your database containing views for particular consumers. This is both a usability best practice and a security best practice.
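As a minimal sketch of transforming at the database level, the example below builds a curated view that applies the "exclude deleted accounts" rule once, so consumers never have to remember it. It uses Python's built-in sqlite3 module purely for illustration. The table, column, and view names are hypothetical, and SQLite has no separate schemas, so in a warehouse such as Redshift or BigQuery the view would instead live in a schema or dataset dedicated to its consumers.

```python
import sqlite3

# Hypothetical raw table; all names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_accounts (
        id INTEGER PRIMARY KEY,
        company_name TEXT,
        plan TEXT,
        deleted_at TEXT  -- NULL while the account is active
    );
    INSERT INTO raw_accounts VALUES
        (1, 'Acme', 'pro', NULL),
        (2, 'Globex', 'free', '2020-01-15'),
        (3, 'Initech', 'pro', NULL);

    -- Curated view for a sales audience: the deleted-account rule is
    -- applied once here rather than in every downstream query.
    CREATE VIEW sales_active_accounts AS
    SELECT id AS account_id, company_name, plan
    FROM raw_accounts
    WHERE deleted_at IS NULL;
""")

active_accounts = conn.execute(
    "SELECT account_id, company_name, plan "
    "FROM sales_active_accounts ORDER BY account_id"
).fetchall()
print(active_accounts)  # [(1, 'Acme', 'pro'), (3, 'Initech', 'pro')]
```

Consumers who query only the view get clearly labeled columns and cannot accidentally count deleted accounts, which is the usability benefit; restricting them to the view rather than the raw table is the security benefit.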
Process and Auditing
Manually entered data, such as from your CRM, can be messy and confusing. Sales reps may mistype their data or enter the same information differently in two separate places. There might not be a consistent way to track cancellations. Manual data entry means discrepancies.
The way to handle this is to audit the data, ensure that it’s recorded properly for the needed reports, and identify and develop missing processes with the managers of the relevant teams.
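One way an audit might begin is by flagging records where the same entity was entered with conflicting values. The sketch below assumes a hypothetical CRM export held as a list of dictionaries; the field names and the `find_conflicts` helper are illustrative, not part of any particular CRM.

```python
from collections import defaultdict

def find_conflicts(records, key, field):
    """Return {key value: sorted distinct field values} where records disagree."""
    seen = defaultdict(set)
    for record in records:
        # Normalize case and whitespace so only real disagreements are flagged.
        seen[record[key]].add(record[field].strip().lower())
    return {k: sorted(v) for k, v in seen.items() if len(v) > 1}

# Hypothetical CRM export; the rows are made up for illustration.
crm_rows = [
    {"account": "Acme", "industry": "Retail"},
    {"account": "Acme", "industry": "retail "},
    {"account": "Globex", "industry": "Energy"},
    {"account": "Globex", "industry": "Utilities"},
]

conflicts = find_conflicts(crm_rows, key="account", field="industry")
print(conflicts)  # {'Globex': ['energy', 'utilities']}
```

A report like this does not fix anything by itself; it gives you a concrete list to bring to the managers of the relevant teams when agreeing on a process.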
Documentation
Again, under democratized data governance, the people exploring the data are no longer the people who put it there in the first place. Even after you have created clean, curated, simple models for specific teams, you’ll still benefit from documenting each table and column.
This can be done with a wiki or even by leaving comments inside the database schema.
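Whichever place you choose, it helps to check that the documentation keeps up with the schema. The sketch below assumes a hypothetical data dictionary kept as a Python dict (it could equally be loaded from a wiki export or a file) and uses SQLite's `PRAGMA table_info` to list a table's columns. The table and column names are made up. Some databases, PostgreSQL among them, can store such descriptions alongside the schema with `COMMENT ON`; this example does not rely on that.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_id INTEGER, plan TEXT, mrr REAL)")

# Hypothetical data dictionary, as it might be kept in a wiki or a file.
data_dictionary = {
    "accounts": {
        "account_id": "Unique identifier for a customer account.",
        "plan": "Current subscription plan name.",
    }
}

def undocumented_columns(conn, table, dictionary):
    """List columns of `table` that have no description in the dictionary."""
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    documented = dictionary.get(table, {})
    return [c for c in columns if not documented.get(c)]

missing = undocumented_columns(conn, "accounts", data_dictionary)
print(missing)  # ['mrr']
```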
2. Permissions and Organization
Data security is critical in its own right. Beyond that, permissions can be leveraged for proper organization. Not everyone needs access to absolutely everything, especially if there is a clear process for requesting additional data. Without proper permissions, data projects can get messy fast.
Radical transparency is a good thing, of course, but it must be balanced with concise and effective communication. You must curate your team’s data experience.
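The idea of default access plus a clear request process can be sketched as a simple mapping from teams to the schemas they may read. Everything here is hypothetical: the team and schema names are made up, and in practice the grants would be enforced by the database or BI tool rather than by application code.

```python
# Hypothetical team-to-schema grants; names are illustrative only.
GRANTS = {
    "sales": {"sales_reporting"},
    "finance": {"finance_reporting", "sales_reporting"},
}

def can_read(team, schema, grants=GRANTS):
    """Return True if the team has been granted read access to the schema."""
    return schema in grants.get(team, set())

def grant(team, schema, grants=GRANTS):
    """Record an approved access request."""
    grants.setdefault(team, set()).add(schema)

before = can_read("sales", "finance_reporting")
grant("sales", "finance_reporting")  # after an approved request
after = can_read("sales", "finance_reporting")
print(before, after)  # False True
```

Keeping the default narrow means each team sees only the data curated for it, while the request step leaves a record of who asked for what.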
3. Integrity Handling
It happens all the time: two people exploring the data end up with two different values for the same metric. For anyone working with data, few experiences are more disheartening, because they breed mistrust in the integrity of the data.
You can minimize this problem, but you cannot eliminate it. If the data is kept clean and well-documented, these events should be rare and simple to fix.
The best approach is to make everyone aware of the possibility and to accept and embrace it. Just as every product has bugs, every data set will, too. Shift your team’s mental frame so that any inconsistency you discover is an exciting challenge to fix, solve, or clarify as soon as possible.
Ensure there’s a clear process for people to report and resolve these integrity issues. Be responsive and helpful when an issue is reported. Thank the person who reported it, and be kind even when the error turns out to be theirs.
Maintaining a data set is like maintaining a garden. There will always be weeds growing and more to do. It will never be perfect, but it can be beautiful.
4. Tool Selection
The data governor must decide which tools your organization needs. Be wary of tools that have steep learning curves or that create vendor lock-in through proprietary languages. Consider all the pieces of your data analytics stack and make sure the tools you select work well together.
5. Education
No matter how well you’ve done your data cleaning, documentation and tool selection, you’ll still have to teach your organization how to use the data to get accurate and actionable insights.
Here are the things you must teach your organization:
- What’s in the models
- How to use the tool
- Your process for prioritizing data requests and for data sharing and access
- Data basics: databases, tables, data structures, and SQL
- Quality versus vanity metrics
- Chart and dashboard best practices
Remember to organize and communicate how people should come to you with integrity issues, data needs, access requests, training needs, etc.