MAS Is All You Need: Supercharge Your Retrieval-Augmented Generation (RAG) with a Multi-Agent System
How to build a Multi-Agent RAG with AG2 and ChromaDB
Retrieval-Augmented Generation (RAG) systems have improved rapidly in recent years. Ideally, we can distinguish their evolution into three phases: in the pre-LLM era, information retrieval systems primarily relied on traditional search algorithms and indexing techniques. These systems were limited in their ability to understand context and generate human-like responses. Then, LLMs entered the scene, resulting in a drastic paradigm shift. Now, there are agents and another paradigm shift is happening.
But let’s take a step back: what is RAG?
How RAG Works
To understand how a RAG system works, it can be helpful to compare its processes to those of a library.
Basic components of a RAG
- Ingestion. This phase is similar to stocking a library. Just as a librarian organizes books and creates an index, a RAG system prepares data by converting it into numerical representations called embeddings. These embeddings are stored in a vector database, making it easy to find relevant information later.
- Retrieval. When a user asks a question, it’s like asking a librarian for information. The RAG system uses the query to search the indexed data and retrieve the most relevant documents or pieces of information from the database. This step grounds the answer in the content of the knowledge base, which can be kept accurate and up to date independently of the model.
- Generation. With the retrieved information, the system generates a response by combining this information with its internal knowledge. This is similar to how a librarian synthesizes information from multiple sources to provide an answer to a question.
It is important to clarify that, although the ingestion phase is not strictly a component of RAG, which stands for Retrieval-Augmented Generation, I always prefer to include ingestion as a crucial part of the process. Without proper organization of knowledge, the subsequent phases are unlikely to function effectively.
RAG systems traditionally operate through sequential workflows, where distinct pipelines handle the ingestion of data, retrieval of relevant information based on user queries, and generation of responses using the retrieved data. While this architecture is straightforward and effective for many applications, it poses significant limitations in scenarios that demand complex and non-linear interactions.
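To make the sequential workflow concrete, here is a minimal sketch of the three phases in plain Python. It is only an illustration: the bag-of-words `embed()` is a stand-in for a real embedding model, the in-memory list replaces a vector database, and `generate()` merely builds the augmented prompt instead of calling an LLM. All function names are hypothetical and do not come from the repository of this article.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ingest(documents: list[str]) -> list[tuple[str, Counter]]:
    """Ingestion: store every document together with its embedding."""
    return [(doc, embed(doc)) for doc in documents]


def retrieve(index: list[tuple[str, Counter]], query: str, n_results: int = 1) -> list[str]:
    """Retrieval: return the n_results documents most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda item: cosine_similarity(query_vector, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:n_results]]


def generate(query: str, context: list[str]) -> str:
    """Generation: placeholder for the LLM call, it only builds the augmented prompt."""
    joined = "\n".join(f"- {chunk}" for chunk in context)
    return f"Context:\n{joined}\n\nQuestion: {query}"


index = ingest([
    "Amsterdam is the capital of the Netherlands.",
    "Turin is a city in northern Italy.",
])
question = "What is the capital of the Netherlands?"
prompt = generate(question, retrieve(index, question))
print(prompt)
```

Each phase feeds the next in a fixed order, which is exactly the linearity that becomes limiting when a request needs several retrievals, or an ingestion step in the middle of a conversation.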
For a comprehensive understanding of how to implement a traditional LLM-based Retrieval-Augmented Generation (RAG) system, I encourage you to read one of my previous articles.
Build your own RAG and run it locally on your laptop: ColBERT + DSPy + Streamlit
Unfortunately, progress in the field of Generative AI is rapid, and many aspects of that article are already outdated. However, it still serves as a valuable resource for understanding the fundamentals of the topic we are discussing. In this tutorial, we aim to combine Retrieval-Augmented Generation (RAG) systems with Multi-Agent Systems (MAS).
… Multi-Agent System… bla bla bla… I know, today, everyone is buzzing about Multi-Agent Systems (MAS) just like they once did about Generative AI, Reinforcement Learning, Machine Learning and Big Data (can you relate?). However, I will try to make this tutorial valuable for those who are approaching this field for the first time. By the end of the article, I will also share some of my thoughts regarding the limitations of multi-agent systems.
MAS = ?
In the context of artificial intelligence, an agent is defined as a system or program that perceives its environment, makes decisions, and takes actions autonomously to achieve specific goals. For example, a librarian can be considered an agent: they organize books, research information, and formulate responses to inquiries. Much like an AI agent, a librarian navigates through vast amounts of information, curating and providing access to knowledge while adapting to the needs of users. The agents we will develop primarily delegate the decision-making component to Large Language Models (LLMs), leveraging their advanced capabilities for processing and generating human-like text.
A (LLM-based) Multi-Agent System (MAS) consists of a collection of such agents that collaborate to achieve common objectives or solve complex problems. In a MAS, each agent operates independently but can communicate, debate and coordinate with other agents to share information, delegate tasks, and enhance overall system performance.
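The perceive, decide, act cycle and the idea of agents passing messages to each other can be sketched in a few lines. This is a toy model, not how AG2 works internally: `ToyAgent`, `run_conversation` and the lambda "decision functions" are hypothetical names, and the lambdas stand in for LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToyAgent:
    name: str
    decide: Callable[[str], str]  # stand-in for the LLM that makes the decision
    inbox: list[str] = field(default_factory=list)

    def receive(self, message: str) -> None:
        """Perceive: store the incoming message."""
        self.inbox.append(message)

    def act(self) -> str:
        """Decide and act: produce a reply to the latest message."""
        return self.decide(self.inbox[-1])


def run_conversation(agents: list[ToyAgent], message: str) -> list[tuple[str, str]]:
    """Pass a message through the agents in order and record who said what."""
    transcript = []
    for agent in agents:
        agent.receive(message)
        message = agent.act()
        transcript.append((agent.name, message))
    return transcript


researcher = ToyAgent("researcher", lambda m: f"notes on '{m}'")
writer = ToyAgent("writer", lambda m: f"answer based on {m}")
transcript = run_conversation([researcher, writer], "universities in Amsterdam")
for speaker, text in transcript:
    print(f"{speaker}: {text}")
```

In this sketch the speaking order is fixed; in the system built later in the article, the order is chosen at run time within a set of allowed transitions.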
Don’t worry, we are not going to write a Multi-Agent System (MAS) from scratch in Python. There are several frameworks available that simplify the development process. It is important to emphasize that the goal of this tutorial is not to build the ultimate Multi-Agent Retrieval-Augmented Generation system, but rather to demonstrate how easily we can construct a relatively complex system using the tools available to us.
Every piece of code shown here is also reported in the GitHub repository of this article.
Ready? Let’s go!
Environment setup
We use Anaconda in this tutorial. If you do not have it on your machine, please download it from the official website and install it (just follow the installation script instructions).
Then, within a terminal session, we can create the environment and install the packages we will use during the process:
conda create -n "mas" python=3.12.8
conda activate mas
git clone https://github.com/ngshya/mas-is-all-you-need.git
cd mas-is-all-you-need
pip install -r requirements.txt
We need a .env file inside the project folder where we put the OpenAI API key and the ChromaDB configuration. For instance, mine looks like:
OPENAI_API_KEY="sk-proj-abcdefg..."
CHROMA_DB_HOST="localhost"
CHROMA_DB_PORT=8001
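As an illustration of how such a file can be consumed, the sketch below parses KEY=VALUE lines with the standard library only. It is an assumption about the mechanism, not the repository's code, which may well rely on a package such as python-dotenv; `parse_env` and `load_env` are hypothetical helpers.

```python
import os


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and comments, stripping quotes."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env(path: str = ".env") -> dict[str, str]:
    """Read a .env file and export its values without overriding existing ones."""
    with open(path, encoding="utf-8") as handle:
        values = parse_env(handle.read())
    for key, value in values.items():
        os.environ.setdefault(key, value)
    return values


sample = 'CHROMA_DB_HOST="localhost"\nCHROMA_DB_PORT=8001\n'
config = parse_env(sample)
port = int(config["CHROMA_DB_PORT"])  # values are strings, so cast the port
print(config["CHROMA_DB_HOST"], port)
```

Note that every value arrives as a string, so the port has to be converted to an integer before it is passed to a database client.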
Data Ingestion
In the repository, we have already prepared some sample data located in the kb folder. The contents of these text files come from Wikipedia. To facilitate the ingestion of the text files within this folder, we have implemented some functions in the tools_ingestion.py file:
- get_txt_file_content() reads the content of a text file.
- process_text() transforms a long text into chunks through an LLM call.
- text_to_list() reduces the output of the previous function into an effective Python list.
- save_chunks_to_db() saves the output of text_to_list() to a persistent DB (ChromaDB in our case).
- path_to_db() calls in sequence get_txt_file_content() → process_text() → text_to_list() → save_chunks_to_db().
- text_to_db() calls in sequence process_text() → text_to_list() → save_chunks_to_db().
I won’t comment on them here because they are already documented in the script. We can now start ChromaDB:
chmod +x start_chroma
./start_chroma
and in a separate terminal, in the same project folder, run the ingestion pipeline (remember to use the same Python environment you have created before):
chmod +x ingest
./ingest
If we look at the content of the ingest file, we can see that the last line, the one about Turin, is commented out. There is a reason for this, and we will find out soon.
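To give a feeling for what the ingestion chain does, here is a simplified, self-contained sketch. It deliberately departs from the repository in several ways, all of them assumptions made for illustration: chunks are obtained by splitting on blank lines rather than through an LLM call, the "database" is a plain dictionary rather than ChromaDB, and chunk IDs are name-based UUIDs so that ingesting the same text twice does not create duplicates. The function names are hypothetical.

```python
import uuid
from datetime import datetime, timezone


def split_into_chunks(text: str) -> list[str]:
    """Stand-in for the LLM-based chunking: one chunk per blank-line-separated block."""
    return [block.strip() for block in text.split("\n\n") if block.strip()]


def build_records(text: str, source: str) -> list[dict]:
    """Attach an ID and some metadata to every chunk."""
    now = datetime.now(timezone.utc).isoformat()
    records = []
    for chunk in split_into_chunks(text):
        records.append({
            # Deterministic ID: the same source and chunk always map to the same UUID.
            "uuid": str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}:{chunk}")),
            "source": source,
            "last_update": now,
            "chunk": chunk,
        })
    return records


def save_records(store: dict, records: list[dict]) -> None:
    """Upsert the records into the store, keyed by their UUID."""
    for record in records:
        store[record["uuid"]] = record


store: dict[str, dict] = {}
text = "Amsterdam is the capital.\n\nIt has two large universities."
save_records(store, build_records(text, "kb/example.txt"))
save_records(store, build_records(text, "kb/example.txt"))  # second run only overwrites
print(len(store))
```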
Retrieve
Before starting to build our MAS, we need to define some functions to retrieve information from ChromaDB. You can find the implementations of these functions in the tools_retrieve.py file. Basically, the function retrieve() connects to the DB, computes the embedding of the input query, looks up the most similar chunks and returns the results. We can test this script by searching for “Universities in Amsterdam”:
python tools_retrieve.py --query "Universities in Amsterdam" --n_results 2
The output should look similar to this:
[
{
"uuid": "9a921695-9310-53b4-9f52-c42d7c6432ef",
"distance": 0.5576044321060181,
"source": "kb/cities/europe/amsterdam.txt",
"last_update": "12 January 2025 07:11:18 UTC +0000",
"chunk": "\nEducational Institutions \nThe University of Amsterdam (abbreviated as UvA, Dutch: Universiteit van Amsterdam) is a public research university located in Amsterdam, Netherlands. Established in 1632 by municipal authorities, it is the fourth-oldest academic institution in the Netherlands still in operation. The UvA is one of two large, publicly funded research universities in the city, the other being the Vrije Universiteit Amsterdam (VU). It is also part of the largest research universities in Europe with 31,186 students, 4,794 staff, 1,340 PhD students and an annual budget of \u20ac600 million. It is the largest university in the Netherlands by enrollment. \n"
},
{
"uuid": "5ce692ab-b762-53f7-84bc-f95fc6585015",
"distance": 0.561765730381012,
"source": "kb/cities/europe/amsterdam.txt",
"last_update": "12 January 2025 07:11:18 UTC +0000",
"chunk": "\nUniversity Structure and Achievements \nThe main campus is located in central Amsterdam, with a few faculties located in adjacent boroughs. The university is organised into seven faculties: Humanities, Social and Behavioural Sciences, Economics and Business, Science, Law, Medicine, Dentistry. Close ties are harbored with other institutions internationally through its membership in the League of European Research Universities (LERU), the Institutional Network of the Universities from the Capitals of Europe (UNICA), European University Association (EUA) and Universitas 21. The University of Amsterdam has produced six Nobel Laureates and five prime ministers of the Netherlands. \n"
}
]
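A list of dictionaries like the one above is convenient for code but not for an LLM prompt. The sketch below shows one possible way to turn such results into a plain-text context block, keeping the closest chunks first (a smaller distance means a more similar chunk). The `format_context` helper and the distance threshold are hypothetical and are not taken from tools_retrieve.py.

```python
import json


def format_context(results: list[dict], max_distance: float = 1.0) -> str:
    """Drop distant chunks, sort by distance and join the rest with their source."""
    kept = [r for r in results if r["distance"] <= max_distance]
    kept.sort(key=lambda r: r["distance"])
    blocks = [f"[source: {r['source']}]\n{r['chunk'].strip()}" for r in kept]
    return "\n\n".join(blocks)


# Shortened, made-up results with the same keys as the output above.
raw = json.dumps([
    {"distance": 0.56, "source": "kb/cities/europe/amsterdam.txt", "chunk": "\nUniversity Structure \n"},
    {"distance": 0.55, "source": "kb/cities/europe/amsterdam.txt", "chunk": "\nEducational Institutions \n"},
    {"distance": 1.40, "source": "kb/other.txt", "chunk": "Unrelated text"},
])
context = format_context(json.loads(raw))
print(context)
```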
Building MAS with AG2
AG2 (formerly known as AutoGen) is an innovative open-source programming framework designed to facilitate the development of AI agents and enhance collaboration among multiple agents to tackle complex tasks. Its primary goal is to simplify the creation and research of agentic AI. While the official AG2 website claims that the framework is ready to “build production-ready multi-agent systems in minutes,” I personally believe that there is still some work needed before it can be considered fully production-ready. However, it is undeniable that AG2 provides a very user-friendly environment for creating experiments aimed at research. It is important to emphasize that there are many other frameworks available for creating multi-agent systems. For example: Letta, LangGraph, CrewAI, etc.
In this tutorial we are going to implement a MAS with:
- Human → a proxy for human input.
- Agent Ingestion → responsible for ingesting information from text files or directly from text inputs.
- Agent Retrieve → responsible for extracting relevant information from the internal database to assist other agents in answering user questions.
- Agent Answer → responsible for providing answers to user queries using information retrieved by the Agent Retrieve.
- Agent Router → responsible for facilitating communication between the human user and other agents.
Human will interact only with Agent Router, which will be responsible for an internal chat group that includes Agent Retrieve, Agent Answer and Agent Ingestion. Agents inside the chat group collaborate, using their knowledge and tools, to provide the best possible answer.
# Agents' Topology
Human <-> Agent Router <-> [Agent Ingestion, Agent Retrieve, Agent Answer]
The complete code for the MA-RAG (Multi-Agent Retrieval-Augmented Generation) system can be found in the mas.py file. In this section, we will discuss some key components and features of the code that are particularly noteworthy.
Agents Definition
To define an agent in AG2, we use the ConversableAgent() class. For instance, to define the Agent Ingestion:
from autogen import ConversableAgent

agent_ingestion = ConversableAgent(
    name="agent_ingestion",
    system_message=SYSTEM_PROMPT_AGENT_INGESTION,
    description=DESCRIPTION_AGENT_INGESTION,
    llm_config=llm_config,
    human_input_mode="NEVER",
    silent=False,
)
We specify:
- a name (agent_ingestion);
- the system prompt that defines the agent (SYSTEM_PROMPT_AGENT_INGESTION is a variable defined in prompts.py);
SYSTEM_PROMPT_AGENT_INGESTION = '''
You are the **Ingestion Agent** tasked with acquiring new knowledge from various sources. Your primary responsibility is to ingest information from text files or directly from text inputs.
### Key Guidelines:
- **No New Information**: You do not contribute new information to conversations; your role is strictly to ingest and store knowledge.
- **Evaluation of Information**: Before ingesting any new knowledge, carefully assess whether the information provided is genuinely novel and relevant.
- **Step-by-Step Approach**: Take a moment to reflect and approach each task methodically. Breathe deeply and focus on the process.
### Tools Available:
1. **`path_to_db()`**: Use this tool to ingest knowledge from a specified text file.
2. **`text_to_db()`**: Utilize this tool to ingest knowledge directly from provided text.
Your mission is to enhance the database with accurate and relevant information while ensuring that you adhere to the guidelines above.
'''
- the description that will help during the routing of messages (DESCRIPTION_AGENT_INGESTION is a variable defined in prompts.py);
DESCRIPTION_AGENT_INGESTION = '''
I am the **Ingestion Agent** responsible for acquiring new knowledge from text files or directly from user-provided text.
'''
- the configuration for LLM;
llm_config = {
    "config_list": [
        {
            "model": "gpt-4o-mini",
            "api_key": os.environ["OPENAI_API_KEY"],
            "temperature": 0.7,
        }
    ]
}
- whether to ask for human inputs every time a message is received (by setting human_input_mode = “NEVER” the agent will never prompt for human input);
- whether to suppress the printing of the messages sent (with silent=False, the messages are printed).
Similarly, we can define all other agents (human, agent_retrieve, agent_answer, agent_router).
Adding Tools
So far, we have defined various agents; however, as they are currently configured, these agents can only receive text inputs and respond with text outputs. They are not equipped to perform more complex tasks that require specific tools. For instance, an agent in its current state cannot access the database we created in the first part of this tutorial to conduct searches.
To enable this functionality, we need to “tell” the agent that it has access to a tool capable of performing certain tasks. Our preference for implementing a tool deterministically, rather than asking the agent to figure it out on its own, is based on efficiency and reliability. A deterministic approach reduces the likelihood of errors, as the process can be clearly defined and coded. Nevertheless, we will still give the agent the responsibility and autonomy to select which tool to use, determine the parameters for its use, and decide how to combine multiple tools to address complex requests. This balance between guidance and autonomy will enhance the agent’s capabilities while maintaining a structured approach.
I hope it is clear by now that, contrary to the claims made by many non-experts who suggest that agents are “so intelligent” that they can effortlessly handle complex tasks, there is actually a significant amount of work happening behind the scenes. The foundational tools that agents rely on require careful study, implementation, and testing. Nothing occurs “automagically,” even in the realm of generative AI. Understanding this distinction is crucial for appreciating the complexity and effort involved in developing effective AI systems. While these agents can perform impressive tasks, their capabilities are the result of meticulous engineering and thoughtful design rather than innate intelligence.
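The division of labour described above, where the tool itself is deterministic code and the agent only chooses which tool to call and with which arguments, can be sketched without any framework. Everything in this example is hypothetical: the `TOOLS` registry, the `register_tool` decorator and the JSON format of the tool call are illustrative and do not reproduce AG2's internals. The JSON string plays the role of what an LLM would emit.

```python
import json
from typing import Callable

TOOLS: dict[str, Callable[..., str]] = {}


def register_tool(name: str, description: str):
    """Register a plain Python function under a name the 'agent' can refer to."""
    def decorator(func):
        func.description = description  # would be shown to the LLM to help it choose
        TOOLS[name] = func
        return func
    return decorator


@register_tool("word_count", "Count the words in a piece of text.")
def word_count(text: str) -> str:
    return str(len(text.split()))


def execute_tool_call(tool_call_json: str) -> str:
    """Run the tool requested by the agent; the execution itself is deterministic."""
    call = json.loads(tool_call_json)
    name, arguments = call["name"], call.get("arguments", {})
    if name not in TOOLS:
        return f"Error: unknown tool '{name}'"
    return TOOLS[name](**arguments)


# In a real system this JSON would be produced by the LLM.
result = execute_tool_call('{"name": "word_count", "arguments": {"text": "MAS is all you need"}}')
print(result)
```

The agent's freedom is limited to the content of the JSON request; whatever happens after that is ordinary, testable code.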
Remember the functions text_to_db() and path_to_db() we created before for the ingestion? We can “register” them to Agent Ingestion in this way:
register_function(
    path_to_db,
    caller=agent_ingestion,
    executor=agent_ingestion,
    name="path_to_db",
    description="Ingest new knowledge from a text file given its path.",
)
register_function(
    text_to_db,
    caller=agent_ingestion,
    executor=agent_ingestion,
    name="text_to_db",
    description="Ingest new knowledge from a piece of conversation.",
)
Similarly, we can add the retrieve tool to Agent Retrieve:
register_function(
    retrieve_str,
    caller=agent_retrieve,
    executor=agent_retrieve,
    name="retrieve_str",
    description="Retrieve useful information from internal DB.",
)
MAS Topology
So far, we have defined each agent, their roles, and the tools they can utilize. What remains is how these agents are organized and how they communicate with one another. We aim to create a topology in which the Human interacts with the Agent Router, which then participates in a nested chat group with other agents. This group collaborates to address the human query, autonomously determining the order of operations, selecting the appropriate tools, and formulating responses. In this setup, the Agent Router acts as a central coordinator that directs the flow of information among the agents (Agent Ingestion, Agent Retrieve, and Agent Answer). Each agent has a specific function: Agent Ingestion processes incoming data, Agent Retrieve accesses relevant information from the database, and Agent Answer proposes the final response based on the gathered insights.
To create a group chat, we can use the GroupChat() class.
group_chat = GroupChat(
    agents=[
        agent_router,
        agent_ingestion,
        agent_retrieve,
        agent_answer,
    ],
    messages=[],
    send_introductions=False,
    max_round=10,
    speaker_selection_method="auto",
    speaker_transitions_type="allowed",
    allowed_or_disallowed_speaker_transitions={
        agent_router: [agent_ingestion, agent_retrieve, agent_answer],
        agent_ingestion: [agent_router],
        agent_retrieve: [agent_answer],
        agent_answer: [agent_router],
    },
)
In this instantiation, we list the agents that will be part of the group (agents), decide that they don’t need to introduce themselves at the beginning of the chat (send_introductions), set the maximum number of conversation rounds to 10 (max_round), delegate the selection of the speaker at each round to the chat manager (speaker_selection_method), and constrain the conversation transitions to a particular scheme (allowed_or_disallowed_speaker_transitions). Setting speaker_transitions_type to "allowed" means that the dictionary is read as the list of permitted transitions rather than forbidden ones.
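The transition scheme is simply a directed graph over the agents. The following sketch mirrors the dictionary above using agent names as plain strings and checks whether a given sequence of speakers respects it. It illustrates the constraint only; the helper functions are hypothetical and this is not how AG2 selects the speaker (with the "auto" method, an LLM picks among the permitted candidates).

```python
ALLOWED_TRANSITIONS = {
    "agent_router": ["agent_ingestion", "agent_retrieve", "agent_answer"],
    "agent_ingestion": ["agent_router"],
    "agent_retrieve": ["agent_answer"],
    "agent_answer": ["agent_router"],
}


def candidate_speakers(last_speaker: str) -> list[str]:
    """Agents that are allowed to speak right after last_speaker."""
    return ALLOWED_TRANSITIONS.get(last_speaker, [])


def is_valid_conversation(speakers: list[str]) -> bool:
    """True if every consecutive pair of speakers is an allowed transition."""
    return all(nxt in candidate_speakers(cur) for cur, nxt in zip(speakers, speakers[1:]))


# A typical question-answering round: retrieve, answer, then back to the router.
print(is_valid_conversation(["agent_router", "agent_retrieve", "agent_answer", "agent_router"]))
# Agent Retrieve cannot hand over directly to Agent Router.
print(is_valid_conversation(["agent_retrieve", "agent_router"]))
```

With this scheme, retrieved information always passes through Agent Answer before it gets back to Agent Router, while Agent Ingestion reports directly to the router.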
Having created the group, we need a group manager that manages the order of the conversation:
group_chat_manager = GroupChatManager(
    groupchat=group_chat,
    llm_config=llm_config,
    silent=False,
    is_termination_msg=lambda msg: "(to human)" in msg["content"].lower()
)
It is important to note the lambda function used for the is_termination_msg parameter. This function determines when the chat should terminate by checking if the last message contains the substring "(to human)." This mechanism is crucial because, in the system prompt for the Agent Router, it specifies: "Clearly indicate your message's intended recipient. For example, use (to human) when addressing the user." This approach provides a clear signal for when to exit the nested chat and return a response to the human user.
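The termination check can be tried in isolation. The version below is a slightly more defensive variant of the lambda, written under the assumption that some messages (for example tool calls) may arrive with no text content; in that case calling .lower() directly on the content would raise an error. The function name simply mirrors the parameter and is otherwise arbitrary.

```python
def is_termination_msg(msg: dict) -> bool:
    """Return True when the message is addressed to the human user."""
    content = msg.get("content") or ""  # tolerate a missing or None content
    return "(to human)" in content.lower()


print(is_termination_msg({"content": "(To Human) Amsterdam has two large universities."}))
print(is_termination_msg({"content": "(to agent_retrieve) Search for universities."}))
print(is_termination_msg({"content": None}))
```

Because the comparison is done on the lowercased text, the marker is recognized regardless of how the LLM capitalizes it.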
Now, we need to turn the group chat we have just created into a nested chat that starts from Agent Router.
nested_chats = [
    {
        "recipient": group_chat_manager,
        "summary_method": "last_msg",
    }
]
agent_router.register_nested_chats(
    nested_chats,
    trigger=lambda sender: sender in [human],
)
By leveraging a structured communication framework and predefined transitions between agents, we ensure efficient collaboration between agents and, at the same time, allow flexibility in decision-making.
Let’s start chatting
We are really ready now. To start chatting with Agent Router:
chat_results = human.initiate_chat(
agent_router,
message=input("Ciao! How can I assist you today?