The Rise and Fall of Data Governance (Again)
Data governance has had more than its fair share of twists and turns through the hype cycle. It burst onto the scene in the late 90s, with metadata management as a seeming silver bullet for making data actionable and trustworthy.
A decade and a half later, the industry was littered with failed C-suite-led initiatives that tried to manually catalog every data asset. So many data teams drowned that it was unfathomable that anyone would dare to embark on such a hubristic odyssey again.
And yet, many data teams are convinced that today the tide has turned!
Data governance remains vital, perhaps even more so as data volumes grow and disruptive regulatory tidal waves such as GDPR sweep through the industry.
Driven by these outside forces, data teams have begun to convince themselves that maybe, just maybe, machine learning automation can tame the storm and make cataloging data assets possible this time around.
Unfortunately, many of these new data governance initiatives are doomed to founder by focusing on technology at the expense of culture and process.
The reality is that to improve their data governance posture, teams must not only have visibility into their data, but also treat it as a product, be domain-focused, and establish data quality as a prerequisite.
Treat data governance like a product. Don’t treat a product like it’s data governance
Data governance is a big challenge, so it’s tempting to try to tackle it with a big solution.
Typically, data governance initiatives begin with a data steward decreeing a seemingly reasonable goal: “We’re going to catalog all the things and assign owners to all of our data assets, end to end, so they’re accessible, meaningful, compliant, and reliable.”
The first problem with this initiative is how it came about. Just as successful companies are customer-centric, data teams must also focus on their data consumers and internal customers.
I guarantee you that no one in the marketing department has asked you for a data catalog. They asked for useful reports and more reliable dashboards.
Nor did anyone in the compliance department request a data catalog. They demanded visibility into the location of regulated and personally identifiable information and who has access to it.
But rather than setting a course for those achievable destinations, some data teams are looking beyond the horizon with no business requirements in sight. There is no minimum viable product. There is no customer feedback or iteration. There are only big ideas and broken promises.
And make no mistake: catalogs still have an important role to play. But even the best technologies are no substitute for good processes.
There is too much emphasis on tactics (cataloging data assets) and not enough on goals (accessible, meaningful, compliant, and reliable data). It’s no wonder the wind goes out of the sails once teams realize they need more precise coordinates.
Let’s review the earlier decree: “We’re going to catalog all the things and assign owners to all of our data assets, end to end, so they’re accessible, meaningful, compliant, and reliable.”
- What is meant by “catalog”? How will the data be organized? Who will it be built for? What level of detail will it include? Will it include real-time lineage? At what level?
- What exactly are “all the things”? What is a “data asset”? Is it only tables, or does it also include downstream SQL queries and reports?
- What do we mean by “owners”? Who owns the catalog? How will owners be assigned, and what are they responsible for? Are we talking about the centralized data stewards of yesteryear?
- What is “end to end”? What is the scope of the catalog? Does it include both structured and unstructured data? If so, how can unstructured data be cataloged before it is processed into a form that has intent, meaning, and purpose?
Cataloging data without answers to these questions is like cataloging water: it is constantly moving and changing state, which makes it nearly impossible to document.
Be domain first
The reason these points are so difficult to plot is that teams are navigating without a compass: the needs of the business. Specifically, the needs of the different business domains that will actually use the data.
Without business context, there is no right answer, let alone prioritization. Mitigating governance gaps is a monumental undertaking, and prioritizing them is impossible without a full understanding of what data assets your business is actually accessing and for what purpose.
Just as we have moved toward cloud-first and mobile-first approaches, data teams are beginning to adopt a domain-first approach, often referred to as data mesh. This decentralized approach distributes data ownership to data teams within different departments that develop and maintain data products. And in the process, it brings data teams closer to the business.
A modern approach to data governance must federate the meaning of data across these domains. It is important to understand how these data domains relate to each other and which aspects of the aggregated view are important.
This type of data discovery can provide a dynamic, domain-specific understanding of your data based on how it is ingested, stored, aggregated, and used by a specific set of consumers.
Data governance must also go beyond describing the data to agreeing on its purpose. How a data producer describes an asset can be very different from how a consumer of that data understands its function, and even two data consumers may attribute very different meanings to the same data.
A domain-driven approach can give data shared meaning and requirements within the operational workflow of the business.
Data quality is a prerequisite for data governance
No technology can fix sloppy data processes or organizational culture. Even as more data assets are automatically documented and cataloged, more problems are generated below the surface. If you take on more water than you bail out, you sink.
Software engineering and the discipline of site reliability engineering have moved to an uptime standard of five 9s (as in 99.999%) for their SLAs. Unfortunately, most data teams do not have internal SLAs detailing the expected performance of their data products, and they may struggle to define and document data quality metrics such as data downtime.
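To make the five 9s figure concrete, here is a minimal sketch that converts an availability target into a yearly downtime budget and computes the availability actually achieved from a list of incidents. The function names and the sample incident durations are hypothetical, chosen only for illustration; a real data SLA would need an agreed definition of what counts as downtime for each data product.

```python
from datetime import timedelta

MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years for simplicity


def downtime_budget(availability_pct: float) -> timedelta:
    """Maximum downtime per year allowed by an availability target."""
    allowed_fraction = 1 - availability_pct / 100
    return timedelta(minutes=MINUTES_PER_YEAR * allowed_fraction)


def achieved_availability(incident_minutes: list[float]) -> float:
    """Availability (in percent) over one year, given incident durations in minutes."""
    total_down = sum(incident_minutes)
    return 100 * (1 - total_down / MINUTES_PER_YEAR)


if __name__ == "__main__":
    budget = downtime_budget(99.999)
    print(f"Five 9s budget: {budget.total_seconds() / 60:.2f} minutes per year")
    # Hypothetical incidents: a stale dashboard (90 min) and a broken load job (240 min).
    print(f"Achieved availability: {achieved_availability([90, 240]):.3f}%")
```

Five 9s leaves a budget of roughly five minutes of downtime per year, so even a single multi-hour data incident puts a data product far outside that standard.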
It’s hard to blame data teams for having sloppy habits when data was moving too fast, the consequences of disorganized data were too small, and data engineers were too few. However, data reliability engineering must be a priority for any data governance initiative to have a reasonable chance of success.
It must also come first. To put it simply, if you catalog, document, and organize a broken system, you’ll just have to redo the work once the system is fixed.
Instilling good data quality practices can also give teams a head start in achieving data governance goals by moving visibility from the ideal state to the current (real-time) state.
For example, without real-time lineage, it is impossible to know how PII or other regulated data is spreading. Think about it for a second: even if you use the most sophisticated data catalog on the market, your governance is only as good as your knowledge of where this data is going. If your pipelines aren’t reliable, your data catalog isn’t either.
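As a rough illustration of why lineage matters here, the sketch below walks a lineage graph to list every asset downstream of a table assumed to hold PII. The asset names and the `LINEAGE` mapping are hypothetical; in practice the graph would come from whatever lineage tooling a team uses, and the result is only as current as that graph.

```python
from collections import deque

# Hypothetical lineage graph: each asset maps to the assets built directly from it.
LINEAGE = {
    "raw.customers": ["staging.customers"],
    "staging.customers": ["analytics.customer_360", "analytics.churn_features"],
    "analytics.customer_360": ["dashboards.marketing_overview"],
    "analytics.churn_features": [],
    "raw.orders": ["analytics.customer_360"],
    "dashboards.marketing_overview": [],
}


def downstream_assets(lineage, source):
    """Return every asset reachable from `source`, using breadth-first search."""
    seen = set()
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


if __name__ == "__main__":
    # Assume raw.customers holds PII; list everywhere that data may have spread.
    for asset in sorted(downstream_assets(LINEAGE, "raw.customers")):
        print(asset)
```

This treats every downstream asset as potentially carrying PII, which is deliberately conservative. Column-level lineage would narrow the list, but the traversal stays the same.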
Data governance with a purpose
My recommendation to data teams is to flip the data governance mission statement. Launch multiple smaller initiatives, each focused on a specific goal of making data more accessible, meaningful, compliant, or trustworthy.
Treat your data governance initiatives like a product and listen to your consumers to understand priorities, workflows and goals. Ship and iterate.
About the Author: Barr Moses is the CEO and co-founder of Monte Carlo, the data reliability company and creator of the industry’s first end-to-end data observability platform.