A data catalog is an inventory of all of an enterprise’s data assets, bundled with tools to maintain the catalog. The data catalog (or data catalogue) not only enumerates the data but also describes it. Everything you’d find in a database schema (database and table names, column names, column descriptions, and access rights) is part of a data catalog. But there are a couple of major differences between a schema and a data catalog.
A database schema applies to data from a single source, such as your Salesforce data. When you replicate a data source using an ELT tool like Fivetran into a data warehouse like Amazon Redshift, Google BigQuery, Snowflake, or Microsoft Azure Synapse Analytics, you need a schema to describe the data in the new repository.
A data catalog, by contrast, describes all the data in your organization, across all data sources and repositories. We’re most interested in what's in the analytics repository — everything that comes from every SaaS platform and internal database you’ve replicated.
You use a database schema to define the data; you use a data catalog to maintain and access it.
Just as a library has a catalog to help readers discover books they’re interested in, an enterprise can develop a data catalog that gives an overarching view of its data assets. A data catalog can discover data sets automatically, both when building the initial catalog and as new data sets appear later, and it uses that discovery process to build data dictionaries.
A data dictionary contains definitions of data sets and elements along with metadata that describes them, such as field data types and constraints.
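To make the idea concrete, here is a minimal sketch of building a data dictionary by autodiscovery. It assumes a SQLite database purely for illustration (the `accounts` table and the `build_data_dictionary` helper are made up for this example); a commercial catalog would read the equivalent metadata from your warehouse's own system tables.

```python
import sqlite3

def build_data_dictionary(conn):
    """Return {table: [column metadata, ...]} for every user table in the database."""
    dictionary = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        dictionary[table] = [
            {
                "name": name,
                "type": col_type,
                "not_null": bool(notnull),
                "primary_key": bool(pk),
            }
            for _cid, name, col_type, notnull, _default, pk in columns
        ]
    return dictionary

# Illustrative table standing in for replicated source data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, annual_revenue REAL)"
)

data_dictionary = build_data_dictionary(conn)
for table, columns in data_dictionary.items():
    for column in columns:
        print(table, column)
```

The output is only the raw material: field names, data types, and constraints. Descriptions and business meaning still have to come from people who know the data.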
But for the data catalog, the data dictionary is just a base to build on. The catalog encompasses relationships between data sets, data lineage (where each data element comes from and where it’s used), and a glossary entry for each field that explains how it’s used in the organization. It provides a visual interface to the data for analysts, with search capabilities so they can find exactly what data they’re interested in.
One of the biggest benefits of a data catalog is that it can make analytics easier for folks who aren’t experts. You can certainly do analytics without one, but a data catalog is a useful reference tool for people who write reports and dashboards. It helps self-service analysts see what data is available for analytics, where it comes from, and what other data it’s related to.
Finding the right data catalog
You could build your own data catalog, just like you could build your own data pipeline, but there’s no need. Dozens of software vendors offer data catalog tools that run on-premises or as a service. (We recommend the latter.) It makes sense to use these products and let your developers spend their valuable time working on software that improves your own products instead of building infrastructure tools.
However, a data catalog isn’t one of those products that just works once you install it. You need to designate a database administrator or data engineer to keep the data catalog up to date and available.
Different data catalog services are suited for different use cases, so it’s a good idea to narrow your search by seeing how well each one fits your use case. Some data catalogs handle data in data lakes, which makes them suitable for data science use cases. Others are more business-oriented, and therefore probably what anyone reading this post is looking for.
You want a data catalog that can read metadata from transactional databases, NoSQL databases, and data warehouses — in other words, anywhere that your organization stores data.
Best practices for data catalog development
Whether you build a data catalog or deploy a vendor’s, the usual advice for rolling out software applies. Start small, with a catalog oriented for use by a team that’s already happy with you.
Once you’ve worked out any difficulties, move on to more widely used data sets that hold core customer transaction data.
Finally, expand availability to other groups and departments in the organization, starting with people whose roles resemble those of the users you’ve already deployed the software to.
Your data catalog should generate a data dictionary by autodiscovery. Then the people who know the data best should profile the data to let people creating reports see how each data element is used in the organization.
Between the built-in capabilities of the software and the skills of a database administrator, you should be able to define the relationships between tables, fields, and other elements in the catalog. You can then build a representation of data lineage that tracks data from its origin to its destination, which enables the organization to trace data issues back to their root cause.
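As a rough illustration of tracing an issue back to its origin, lineage can be modeled as a graph in which each data set points to the data sets it was derived from. The data set names and the `trace_to_sources` helper below are hypothetical; real catalogs typically derive these edges from pipeline metadata or query logs rather than a hand-written dictionary.

```python
from collections import deque

# Hypothetical lineage edges: each data set maps to the data sets it is derived from.
UPSTREAM = {
    "revenue_dashboard": ["monthly_revenue"],
    "monthly_revenue": ["orders", "accounts"],
    "orders": ["shop_db.orders"],
    "accounts": ["salesforce.account"],
}

def trace_to_sources(dataset, upstream=UPSTREAM):
    """Walk upstream breadth-first and return the root sources (nodes with no parents)."""
    seen = {dataset}
    roots = []
    queue = deque([dataset])
    while queue:
        node = queue.popleft()
        parents = upstream.get(node, [])
        if not parents:
            roots.append(node)  # nothing feeds this node, so it is an original source
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return roots

sources = trace_to_sources("revenue_dashboard")
print(sources)
```

If a number on the dashboard looks wrong, a walk like this tells you which source systems to check first.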
Across all of that functionality, security is paramount. After all, much of an organization’s value is locked up in its data. Your data catalog should offer data encryption, role-based security, and audit logs that show who accessed what data when.
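Here is a simplified sketch of how role-based security and audit logging fit together. The roles, table names, and `can_read` function are assumptions made up for this example, and the in-memory list stands in for the tamper-resistant log storage a real product would use.

```python
from datetime import datetime, timezone

# Hypothetical role grants: role -> set of tables that role may read.
ROLE_GRANTS = {
    "analyst": {"orders", "accounts"},
    "finance": {"orders", "accounts", "payroll"},
}

audit_log = []  # in-memory stand-in for an append-only audit store

def can_read(user, role, table):
    """Check the role's grants and record the attempt, whether allowed or denied."""
    allowed = table in ROLE_GRANTS.get(role, set())
    audit_log.append({
        "user": user,
        "role": role,
        "table": table,
        "allowed": allowed,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return allowed

first = can_read("ana", "analyst", "orders")    # granted to analysts
second = can_read("ana", "analyst", "payroll")  # not granted to analysts

for entry in audit_log:
    print(entry["at"], entry["user"], entry["table"], entry["allowed"])
```

Note that denied attempts are logged too, which is what lets the log answer who accessed (or tried to access) what data when.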
Once you have the data catalog rolled out, be sure to offer training to everyone in the organization who might use it, and make the training materials available as documentation.
Make your analysts happy
How do you know that you’ve done a good job implementing a data catalog? If your team uses it regularly and expresses satisfaction (or at least you don’t hear complaints), then you’ve succeeded in creating a useful enterprise data catalog.