The Future of Data, AI, and Governance: The Next Evolution in Data Management

August 13, 2024

Large Language Models (LLMs): The Layman’s Guide

Image of a glowing digital brain interconnected with circuitry and data lines, symbolizing AI-driven cognitive processes.

Let’s start with the basics. A Large Language Model (LLM) is like the brain behind modern AI chatbots and virtual assistants. Imagine it as an extremely well-read person who has consumed billions of pages of text from books, websites, and more. This vast knowledge allows it to generate human-like responses, but it's also why it sometimes "hallucinates" or goes off-script, especially when the data it’s fed isn’t up to par.

LLMs are machines that predict the next word of a given text, based on the text they have been trained on.
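A toy sketch of that idea: the model below simply counts, in its training text, which word most often follows each word, then "predicts" accordingly. Real LLMs use neural networks with billions of parameters, but the predict-the-next-word framing is the same. (This example and its tiny corpus are purely illustrative.)

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    successors = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        successors[current][nxt] += 1
    return successors

def predict_next(successors, word):
    """Return the word most frequently seen after `word`, or None if unseen."""
    counts = successors.get(word.lower())
    return counts.most_common(1)[0][0] if counts else None

corpus = "the model predicts the next word and the next word follows"
model = train_bigram(corpus)
print(predict_next(model, "the"))  # "next" (seen twice after "the")
```

Notice that the model happily predicts something even when the statistics are thin, which is the seed of the hallucination problem described below.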

We train LLMs on a vast amount of text from the Internet and from books (carefully curated by humans), and then ask questions about the information they have ingested. It’s like compressing all that information into a search engine that feels like a conversation.

The problem is that since the model “predicts” words, it sometimes makes them up. This is what we call hallucinations: the prediction felt human, but it was factually wrong.

Fine-Tuning LLMs

To address these challenges and improve accuracy, fine-tuning becomes essential. Fine-tuning Large Language Models (LLMs) is a critical process that tailors these powerful AI systems to specific tasks or domains, enhancing their accuracy and relevance. By refining how LLMs generate responses, fine-tuning allows organizations to align the AI’s outputs with their unique needs, making the models more effective and reliable.

When it comes to fine-tuning LLMs, there are several approaches. To name the most common ones:

Graph showing the trade-off between implementation effort and customization result for each technique: Prompt Engineering requires the least effort but offers the least customization; RAG delivers a high customization result at relatively low effort; Fine-Tuning and Reinforcement Learning require the most effort to implement.
  • Fine-Tuning: involves taking a pre-trained LLM and training it further on a specific dataset or domain. This helps the model specialize in a particular area, making its predictions more accurate and relevant to that context. In this case, we are actually modifying the parameters of the LLM itself.
  • Reinforcement Learning: In this method, an LLM learns by receiving feedback in the form of rewards or penalties based on the actions it takes. The model iteratively improves its performance by maximizing rewards, which makes it better suited for tasks like dialogue generation and decision-making. We don’t modify the LLM’s parameters here; instead, we add a new layer of neurons on top of the model and train those parameters. This way we always know what the foundation model produced and what our modification was.
  • Reinforcement Learning from Human Feedback (RLHF): RLHF uses human-generated feedback to guide the fine-tuning process of an LLM. By incorporating human judgment, the model learns to produce more accurate and contextually appropriate outputs, reducing errors and improving its overall reliability.
  • Retrieval-Augmented Generation (RAG): Instead of relying solely on the model’s internal knowledge, RAG enhances LLMs by pairing them with an information retrieval system: relevant information is retrieved from documents or databases and incorporated into the prompt. It is basically adding extra context from other data sources to the user prompt before sending it to the LLM (e.g., the user’s data in a database, or the list of internal documents that address the user’s question).
Diagram depicting the RAG flow: a user prompt initiates a search in a data source; the result is combined with the prompt into an augmented prompt (User Prompt + Result), which is sent as the LLM request and produces the LLM response.
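The RAG flow above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the `retrieve` function here is naive keyword matching, where a real system would query a database, search index, or vector store, and the final prompt would then be sent to an actual LLM API.

```python
def retrieve(query, documents):
    """Naive keyword retrieval: return documents sharing words with the query.
    A real system would use a search index, database, or vector store."""
    query_words = set(query.lower().split())
    return [d for d in documents if query_words & set(d.lower().split())]

def build_augmented_prompt(user_prompt, documents):
    """Combine the retrieved context with the user prompt, as in the diagram:
    Augmented Prompt = User Prompt + Result."""
    context = retrieve(user_prompt, documents)
    return ("Answer using only this context:\n"
            + "\n".join(f"- {c}" for c in context)
            + f"\n\nQuestion: {user_prompt}")

docs = ["Invoices are stored in the billing warehouse.",
        "Employee records live in the HR database."]
prompt = build_augmented_prompt("Where are invoices stored?", docs)
# `prompt` now contains only the invoice document plus the question,
# and would be sent to the LLM as the request.
```

The key point is that the LLM never sees the whole document collection, only the pieces retrieved for this particular question.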

Welcome to the RAG Party

Of all the options above, RAG gives us a lot of superpowers while being very easy to implement from a technology point of view. RAG is a game-changer because it allows for easy customization of LLMs by pulling in relevant, context-specific data from your existing documents. It’s like having a well-organized library at your AI’s disposal. This approach not only enhances accuracy but also helps avoid those infamous hallucinations by grounding the AI in reliable data.

In simple terms, if you’re interested in RAG, computing vector embeddings of your data enables semantic search (also called vector similarity search). This allows the LLM to access the most relevant information from your data sources, leading to more accurate and tailored responses.

Diagram illustrating RAG with embeddings: documents are split into chunks and converted into vectors by an embedding model; a user prompt triggers a vector search over the stored vectors, the retrieved results are combined with the prompt into an augmented prompt, and the LLM request produces an assistant answer that links back to the documents or tables used.

Now, here’s where it gets interesting. While RAG from documents like PDFs, Word files, and Excel sheets is relatively straightforward, doing the same from databases, data warehouses, and streams is a whole different ball game. These data assets require top-notch data quality and governance. Without the right metadata and governance in place, you’re not just risking hallucinations—you’re flirting with legal non-compliance and security risks. And let's face it, most organizations aren’t exactly known for having their metadata ducks in a row. This brings us to the critical topic of Data Stewardship, specifically how to effectively implement it from the ground up to ensure accuracy, compliance, and security in your data ecosystem.

How can we leverage generative AI for Data Stewardship?

Digital artwork of rubber ducks floating on a sea of data waves, illustrating the idea of keeping your “metadata ducks in a row.”

At the heart of every well-executed data strategy is data stewardship. It’s the cornerstone of effective data management, quality, and governance. But here’s the kicker: traditional data stewardship is resource-intensive, a job many find boring, and it can quickly become a bottleneck. That’s where AI comes into play. Imagine one steward resolving 30 times more issues because AI augmentation automates most of the tedious work. Not only does this make the human Data Steward’s job more enjoyable, it also enhances the entire data ecosystem by providing continuous, 24/7 support.

In a world of rapidly evolving regulations and growing data complexity, governance and metadata can’t be afterthoughts. They must be integrated into the design process from the very beginning. In data engineering terms, we need to “shift left” on governance, quality, and metadata efforts. This means treating metadata, compliance, and quality as fundamental components of your data strategy, not just as items to address after the fact. Building a solid foundation from the start is far easier—and far more effective—than trying to retrofit one later.

The Transformative Role of Generative AI in Data Governance

Generative AI is already making a significant impact on data governance, with the potential to completely transform how organizations manage and oversee their data. Implementing a metadata solution, such as a data catalog or a comprehensive data governance framework, is a complex but essential task. It involves organizing and managing all the data within an organization to ensure it is easily accessible, reliable, and compliant with regulations.

Futuristic scene of a human data steward collaborating with AI robots that flag data issues and manage stewardship tasks, illustrating human-AI collaboration in data governance.

A key part of this process is defining and implementing roles like data stewards or data governors. These roles are critical, serving as the guardians of data quality, integrity, and compliance. They ensure that the data is well-organized, properly classified, and that it adheres to the necessary standards and regulations. However, creating these roles and integrating them into an organization can be challenging. It often requires individuals to take on new responsibilities, sometimes as a part-time role, which can be resource-intensive and demanding.

Rather than diminishing the importance of these roles, AI offers a way to enhance and support the work of data stewards and governors. By automating some of the more routine and time-consuming tasks, AI can free up these professionals to focus on more strategic, high-value activities. For example, an AI system can continuously monitor the metadata ecosystem, flagging areas that need attention, and providing suggestions for data classification, tagging, or quality assessments. This not only reduces the burden on human stewards but also helps maintain a high standard of data governance across the organization.
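As a concrete flavor of that monitoring idea, here is a deliberately simple sketch of flagging columns that likely need a steward’s attention. The rule-based classifier is a stand-in for the AI step: a real implementation would ask an LLM to classify each column and suggest tags, with the steward reviewing the suggestions.

```python
import re

# Rule-based stand-in for the AI classification step (illustrative only;
# a real system would send column names and sample values to an LLM).
RULES = {
    "pii":       re.compile(r"email|ssn|phone|birth", re.IGNORECASE),
    "financial": re.compile(r"salary|invoice|amount", re.IGNORECASE),
}

def flag_columns(columns):
    """Return columns that likely need a steward's attention, with a suggested tag."""
    flags = {}
    for name in columns:
        for tag, pattern in RULES.items():
            if pattern.search(name):
                flags[name] = tag
    return flags

print(flag_columns(["customer_email", "order_total", "invoice_amount"]))
# {'customer_email': 'pii', 'invoice_amount': 'financial'}
```

The human steward stays in the loop to accept or reject each suggested tag; the automation only narrows thousands of columns down to the handful worth a look.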

In this way, AI becomes an ally to Data Stewards and Governors, augmenting their capabilities and allowing them to work more efficiently and effectively. The goal is not to replace these vital roles but to empower them, enabling organizations to achieve better data management and governance outcomes. As AI continues to evolve, its role in supporting data governance will only grow, offering new opportunities for collaboration between technology and the skilled professionals who guide it.

This is just the beginning of a new era of AI-assisted data ecosystems, where technology and human expertise come together to drive unprecedented productivity and significantly reduce the risks associated with data regulation compliance. The future is bright, and by embracing these advancements, we can turn challenges into powerful opportunities for growth and innovation. Let’s seize this moment and lead the way into a more efficient and compliant data-driven world!