Crossing the Data Divide: Data Catalogs and the Generative AI Wave



The “Tide” of LLMs Has Reached Our Shores

As a career data professional who has spent most of the last 10 years with Collibra, then Alation, and now as an independent, I’ve come to a somewhat painful yet exciting realization: large language models (LLMs) will completely disrupt and displace data cataloging as we know it!

That may seem like a wild assertion, but our corner of the world — the world of metadata, governance, stewardship, data quality, etc. — will not be “different” and remain untouched.

In fact, it will be radically changed.

Revisiting the Data Catalog Value Proposition

The central purpose of a data catalog has been to act as an authoritative system of reference and knowledge base for data and data-related assets. The core value proposition has been to increase the productivity of a broad set of actors by sharing contextual knowledge (metadata) about assets, how assets are related, and how the assets are permitted to be used. These actors include business leadership, analysts, data engineers, data scientists, and risk professionals.

To achieve this core value proposition, catalogs must be continuously populated, curated, and, most importantly, adopted.

How Should We Grade the Success of Data Catalogs?

We have seen several generations of catalogs over the past 20 years. First, there was a generation of very technical metadata-oriented platforms, then a generation of data governance-centric platforms. More recently, we have seen analyst and data-consumer-oriented catalog platforms.

From a big picture perspective, great strides have been made to fulfill the vision of a catalog serving as a core knowledge base that supports leaders trying to embed the use of data and analytics into their culture. But technological evolution has not equaled overwhelming success or the emergence of a formulaic approach.

Achieving broad adoption across an enterprise is still quite rare, and the means of achieving it is a black art of human psychology, world-class communication, a heavy training orientation, and a healthy dose of senior leadership inspiring and driving a clear vision.

I think it’s important that we ask ourselves why success relies so heavily on threading this very difficult needle of leadership and soft skills. In part, it’s because enterprises are still moving up the data and analytics maturity curve. That’s fine and normal, but it’s also because catalogs, as good as they are, still don’t make it easy enough to find, understand, explore, trust, and govern data.

My simple litmus test for “easy” is being able to open a file, spreadsheet, or report and be offered insight related to the origin of its data, business definitions of terms and metrics, an assessment of its trustworthiness, instructions for how it can and cannot be shared, etc. Catalogs get a D+ on this test because they require people, who are trying to do their day jobs, to switch context into the catalog to search and navigate to the data assets of interest. For most people, it’s still far easier to pop a quick question in a chat to a colleague and let the human “network” produce an answer.

I will add that data catalogs grade out much higher, probably a B-, for more technical and data-oriented roles such as data engineers, data analysts, and data scientists. That’s because those roles don’t see catalogs as friction, but as an accelerant for gaining deeper knowledge.

Catalog Front-Ends and Back-Ends

I am going to take a moment to decompose the elementary capabilities of modern catalogs. This is important to understanding how generative AI will eat their lunch.

The modern catalog platform is an interesting beast and has what I think of as a multiple personality disorder. The first personality is that of a collector and ‘stitcher’ of assets. The platform continuously connects to and collects metadata from a vast array of data assets that live in data stores, reporting systems, and applications across cloud, on-premise, and hybrid environments. In addition, it sports metadata augmentation capabilities such as classification, lineage identification, sensitive asset detection, and data quality measurement. All of this is maintained in the catalog’s internal data structure.

The second personality is that of an application that provides “views” of the assets in ways that drive productivity for different roles, as described above. This commonly includes capabilities such as searching, tagging, chatting, reviewing, and approving. Basically, these are things close to what we expect from consumer-grade social media apps, but for data.

The third personality, which is less well developed, is that of an active participant in the enforcement of governance policies. This involves maintaining policy definitions and then acting as a system of record for granting and restricting access to data across a broad scope of systems. This is less well developed because, frankly, it’s hard to unify technologies owned by vendors who all want to be the center of the universe.

Metadata as Language

All the metadata a catalog collects and stores — which includes data characteristics, profiles, classifications, usage, popularity, and relationships to all kinds of data-related assets like reports, metrics, terms, policies, etc. — can easily be expressed as language. That might seem like a very strange assertion, so I offer a simple scenario:

Imagine we ingested Tableau reports into a catalog, and they include a report called “Payment Forecast Report” for finance. Pretend it was tagged by the supply chain steward as also being important to that domain. Also assume that we have ingested tables from the data lake (the source of the report data), NetSuite (the originating source), and Azure Data Factory (the pipeline that moves the data). Finally, let’s assume that some of the catalog’s intelligent augmentation capabilities are being used: all the assets have been classified, scanned for potentially sensitive data, and assigned governance sharing and usage policies. After being notified, the finance business steward associates key metric descriptions for supplier payment thresholds and financial terms to the report and tags it as authoritative.

All of this is now tucked into the catalog data store, waiting for someone to consume it through the search and navigation capabilities of the catalog’s traditional user interface.

Now, consider how the same metadata might be expressed as language: 

“Our finance department’s accounts payable analysts and supply chain organization use a Tableau report called ‘Payment Forecast Report’ to understand accounts payable requirements for cash on hand to pay suppliers. The report is created using both straight-line and moving average statistical methods. The source of the data is the payment history fact table and the time and supplier tables in the accounts payable schema in the lakehouse. Those tables are populated from the NetSuite application using Azure Data Factory pipelines. The report is used monthly and has been certified to comply with our data quality standards.”

As you can clearly see, all the necessary context is present to create this narrative version. But why would we want to do that? Why would we want to express everything we know about our data assets as narrative? The obvious answer is to increase accessibility and reduce the friction I described above. And the way that happens is by unleashing that knowledge to the enterprise through a large language model.
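To make the transformation concrete, here is a minimal sketch in Python of how structured catalog metadata might be flattened into narrative text. The record structure, field names, and the to_narrative helper are all hypothetical illustrations, not any catalog vendor’s actual schema or API.

```python
# A minimal sketch: rendering a hypothetical catalog metadata record as narrative
# text. The record structure and field names are illustrative only.

report = {
    "name": "Payment Forecast Report",
    "type": "Tableau report",
    "domains": ["finance", "supply chain"],
    "purpose": "understand accounts payable requirements for cash on hand to pay suppliers",
    "sources": ["payment history fact", "time", "supplier"],
    "schema": "accounts payable",
    "origin_system": "NetSuite",
    "pipeline": "Azure Data Factory",
    "certified": True,
}

def to_narrative(asset: dict) -> str:
    """Flatten one asset's metadata into sentences an LLM can consume."""
    domains = " and ".join(asset["domains"])
    sources = ", ".join(asset["sources"])
    certification = (
        "It has been certified to comply with our data quality standards."
        if asset["certified"]
        else "It has not yet been certified against our data quality standards."
    )
    return (
        f"The {domains} organizations use a {asset['type']} called "
        f"'{asset['name']}' to {asset['purpose']}. The source of the data is the "
        f"{sources} tables in the {asset['schema']} schema, populated from "
        f"{asset['origin_system']} via {asset['pipeline']} pipelines. {certification}"
    )

print(to_narrative(report))
```

In practice, a generator like this (or an LLM prompted with the raw metadata) would run across every asset in the catalog, producing a corpus of narrative descriptions.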

Intelligent Data Assistants

Microsoft announced Copilot for its virtual agents in November. OpenAI launched its private GPT models in November, and Google announced its AI Studio for its Gemini LLM and Bard chat in early December. All of these are lowering the barrier to creating specialized, chat-driven assistants/agents. They also promise to maintain a secure and private barrier between an enterprise’s intellectual property and what is consumed by their public-facing LLMs.

The clear opportunity is for the metadata being collected in the catalog to be converted to narrative text and then consumed by the private LLMs that sit behind an intelligent data assistant chat interface.
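For illustration, here is a minimal sketch of that pattern: narrative metadata is embedded, the most relevant description is retrieved for a user’s question, and it is injected as context into a chat completion. The OpenAI Python SDK is used purely as an example stack; the model names, the tiny in-memory index, and the sample narratives are all assumptions, and a real deployment would sit behind the enterprise’s private, access-controlled endpoint.

```python
# A sketch of a catalog-backed data assistant, assuming the OpenAI Python SDK.
# Model names and the in-memory "index" are illustrative, not a recommendation.
from math import sqrt
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Narrative descriptions generated from catalog metadata (see the earlier sketch).
narratives = [
    "Finance and supply chain use the 'Payment Forecast Report' to understand "
    "cash needed to pay suppliers. It is certified against data quality standards.",
    "The payment history fact table is populated from NetSuite via Azure Data "
    "Factory pipelines into the accounts payable schema of the lakehouse.",
]

def embed(texts: list[str]) -> list[list[float]]:
    """Turn text into vectors so we can find the narrative closest to a question."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

index = list(zip(narratives, embed(narratives)))

def ask(question: str) -> str:
    # Retrieve the most relevant narrative and hand it to the model as context.
    q_vec = embed([question])[0]
    context = max(index, key=lambda item: cosine(q_vec, item[1]))[0]
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using this catalog context: {context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

print(ask("Where does the Payment Forecast Report get its data?"))
```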

The experience for users will be revolutionary. Imagine having a chat with a digital expert about a report’s origin, meaning, provenance, trustworthiness, and data quality. Extend that thought to the perennial topics of glossaries and terms. With digital assistance, users can simply ask what terms mean, ask for suggested terms, and discuss term conflicts and overlaps.

That may sound imaginary, like rocket science, but it’s already starting to happen. For instance, a user can already drop a report PDF or spreadsheet into a chat and ask for help analyzing it. If the contextual metadata exists in the LLM, then the analysis can be broader and deeper than what simply appears in the document.

We Need the Catalog Back-End

I suppose this is a bit philosophical, but if you change the front-end of a catalog to be an LLM-driven virtual assistant, is it still a catalog? That is a bit like asking “if a tree falls in the woods and no one is there to hear it.” From a pragmatic perspective, I am of the opinion that the idea of data assistants is much more attractive to an enterprise, that it is the terminology that should be embraced, and that the word “catalog” should come to mean part of the infrastructure.

Regardless of the semantics used to describe the solution, what’s clear is that the rich array of back-end metadata collection, classification, data quality, and governance capabilities that catalogs offer is going to remain incredibly valuable and necessary.

What Should a Data Leader Do Right Now?

My advice to data leaders is to do the following things:

  • Continue to invest in catalogs with an emphasis on the collection of rich metadata, lineage, data quality measurements, usage, and popularity. Building strength around this will not be a wasted investment.
  • Make sure your catalog’s data store is open and can be accessed in order to systematically convert the rich metadata into narrative text to feed an LLM. In fact, it’s likely that leaders will want to use an LLM to create the text (if the catalog vendors are not quick enough to offer it on their own), but the data store has to be accessible and not hidden behind a proprietary barrier.
  • Consider a pilot rollout of a data assistant in parallel with rolling out the traditional catalog UI. Use that pilot to decide which roles are best served by each, both immediately and in the longer term.
  • Partner with InfoSec and the broader enterprise risk management team to evaluate and vet the security of Microsoft, Google, OpenAI, and others that claim they will keep data private and secure. It will take some time to get everyone comfortable, so start now.
  • If the leap to a general consumption data assistant is too big, consider something on a smaller scale, such as piloting a virtual stewardship assistant that helps stewards work more efficiently.

The Indispensable Future

I considered mourning what I see as the inevitable loss of the cataloging world as I’ve known it. But if I’m being honest with myself, that would just be trying to hang on to the familiar. Instead, I’ve decided to be more excited about the opportunity to help increase the role of data and analytics as an indispensable part of the corporate culture in ways that have never been possible before. I hope you make that same decision and join me.