
LLM integration takes Cloudera data lakehouse from Big Data to Big AI



 

 

Cloudera is not building its own LLMs; rather, it is making it easier for enterprises to use LLMs to gain insights from data they already hold in a data lakehouse.

Cloudera already has a catalog of reference architectures that it provides to its users; existing use cases have included AI models for customer churn and fraud analytics. Now the company is expanding with architectures for conversational AI and LLMs. Venkatesh explained that CDP users can select the new LLM reference architecture from the catalog and have it installed in their environment in a few minutes.

Cloudera is embracing what is known as a zero-shot learning approach, in which an existing LLM can quickly benefit from an existing data source. The initial LLMs that Cloudera is integrating are open-source models that can run entirely inside the Cloudera platform. Venkatesh noted that by running the LLM in the same platform as the data, organizations can ensure that no data ever leaves the enterprise’s purview and no external API calls are made. He emphasized that keeping data under tight control is critical for some enterprises.
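One common reading of a zero-shot setup is that the model itself is left unchanged and relevant enterprise passages are supplied in the prompt at query time. The article does not spell out Cloudera's mechanism, so the following is only a minimal sketch under that assumption. The function names and the `generate` callable (standing in for an open-source model hosted alongside the data) are hypothetical, not part of any Cloudera API.

```python
def build_zero_shot_prompt(question, passages):
    """Assemble a prompt that hands retrieved passages to an unmodified LLM."""
    # Number each passage so the model (and a reader) can refer back to it.
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the context below.\n"
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


def answer(question, passages, generate):
    """Call a locally hosted model; `generate` is any prompt -> text callable.

    Passing the model in as a callable keeps the sketch free of external
    API calls: nothing here sends data outside the process.
    """
    return generate(build_zero_shot_prompt(question, passages))


if __name__ == "__main__":
    # Hypothetical stub in place of a real in-platform model.
    stub_model = lambda prompt: f"(stub model received {len(prompt)} characters)"
    print(answer(
        "Which customers are at risk of churn?",
        ["Account 17 opened three support tickets this month."],
        stub_model,
    ))
```

Because the model is passed in as a plain callable, swapping the stub for a real locally hosted model would not change how the prompt is built.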

The intersection between vector databases and Cloudera’s data lakehouse platform

Part of the Cloudera LLM reference architecture is the integration of open-source vector databases into the stack. 

Venkatesh said that Cloudera is enabling its users to choose which open-source vector database to use. Among the options are Milvus, Weaviate and Qdrant.

Data lakehouse technology relies on object storage, which Venkatesh said is often a great way for organizations to store unstructured and semi-structured data. To make that data usable by AI, it needs to be organized with a vector database.

“You really need a database engine that can take a semantic search query, run it in vector space, and return the most relevant results back to you,” he said.

Venkatesh emphasized that creating a vector database for an LLM deployment with Cloudera does not mean enterprises are duplicating data, with one set in the lakehouse and another in the vector database. Instead, the vector database provides a functional index of the data as vectors.
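To make the "index, not a copy" idea concrete, here is a minimal sketch in Python. It is not how Milvus, Weaviate or Qdrant are implemented. The `embed` function is a toy word-count stand-in for a real embedding model, so it matches on shared words rather than true meaning. The `VectorIndex` class and the object-storage keys are hypothetical. What it illustrates is that the index holds only vectors and pointers, while the documents themselves stay in the lakehouse.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: sparse word-count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class VectorIndex:
    """Stores (object key, vector) pairs; the source text is never kept."""

    def __init__(self):
        self.entries = []

    def add(self, object_key, text):
        self.entries.append((object_key, embed(text)))

    def search(self, query, k=2):
        q = embed(query)
        scored = [(cosine(q, vec), key) for key, vec in self.entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [key for score, key in scored[:k] if score > 0]


# A dict stands in for lakehouse object storage; keys are hypothetical.
lakehouse = {
    "s3://example-bucket/support/ticket-001.txt":
        "Customer reports login failure after password reset",
    "s3://example-bucket/support/ticket-002.txt":
        "Invoice total does not match the purchase order",
    "s3://example-bucket/support/ticket-003.txt":
        "Password reset email never arrives for the customer",
}

index = VectorIndex()
for key, text in lakehouse.items():
    index.add(key, text)  # each document is embedded once, at ingestion

hits = index.search("password reset problem", k=2)
for key in hits:
    # The index returns pointers; the text is read back from the lakehouse.
    print(key, "->", lakehouse[key])
```

Each document is embedded once when it is added, and every later query reuses those stored vectors.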

How LLMs are the logical path forward from Big Data

When Cloudera got started in 2008, Big Data, in the form of the open-source Hadoop project, was the company’s foundation.

The Big Data market has shifted over the years into the data lakehouse space, where organizations use query engines, typically SQL-based, for data analytics on data stored in cloud object storage repositories. Venkatesh now sees LLMs as the next logical step on the path forward from Big Data.

“A bunch of us came to work in Big Data, not because we were all excited about SQL, but to look at fundamentally different ways to analyze data,” Venkatesh said.

He explained that Big Data created a pyramid-like approach to data analytics, in which the bulk of the data sits at the bottom and only a small amount can be analyzed at the top. With LLMs, that pyramid has flattened out: significantly more data is available for analysis, and the methods for analyzing it are easier.

“What I see with LLMs and the new wave of AI is an era where you can now analyze all the data at the topmost layer and instead of querying with just SQL or Spark, it’s English or natural language queries,” Venkatesh said. “You only need to ingest the data once and you can get the benefits of that ingestion from a vectorized embedding multiple times, so all of your queries can take advantage of the semantic store.”


 

 

 

 


