Ricky Loynd

Teachable Agent Architecture


  • We introduce Teachable Agents so that users can teach their LLM-based assistants new facts, preferences, and skills.
  • We showcase examples of teachable agents learning and later recalling facts, preferences, and skills in subsequent chats.


Conversational assistants based on LLMs can remember the current chat with the user, and can also demonstrate in-context learning of user teachings during the conversation. But the assistant's memories and learnings are lost once the chat is over, or when a single chat grows too long for the LLM to handle effectively. Then in subsequent chats the user is forced to repeat any necessary instructions over and over.

Teachability addresses these limitations by persisting user teachings across chat boundaries in long-term memory implemented as a vector database. Instead of copying all of memory into the context window, which would eat up valuable space, individual memories (called memos) are retrieved into context as needed. This allows the user to teach frequently used facts and skills to the teachable agent just once, and have it recall them in later chats.
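To make the retrieval idea concrete, here is a minimal sketch of a memo store that keeps teachings outside the context window and pulls back only the closest matches. It is not how Teachability is implemented: the `MemoStore` class, the word-count "embedding", and the `max_distance` threshold are hypothetical stand-ins for a real embedding model and vector database.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (a real system would use a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 if norm == 0 else 1.0 - dot / norm


class MemoStore:
    """Hypothetical long-term memory: stores (input, output) memos and retrieves by similarity."""

    def __init__(self):
        self.memos = []  # list of (embedding, input_text, output_text)

    def add(self, input_text, output_text):
        self.memos.append((embed(input_text), input_text, output_text))

    def retrieve(self, query, n_results=2, max_distance=0.8):
        """Return up to n_results memo outputs whose inputs are close enough to the query."""
        q = embed(query)
        scored = sorted((cosine_distance(q, e), out) for e, _inp, out in self.memos)
        return [out for dist, out in scored[:n_results] if dist <= max_distance]


store = MemoStore()
store.add("What is the user's name?", "The user's name is Ricky.")
store.add("How should abstracts be summarized?", "Use three bullets: title, innovation, key results.")

# Only the memo related to the query is pulled into context.
relevant = store.retrieve("Please tell me the name of the user")
print(relevant)
```

The distance threshold is what keeps unrelated memos out of the prompt: a query that shares no words with any stored memo retrieves nothing at all.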

Any instantiated agent that inherits from ConversableAgent can be made teachable by instantiating a Teachability object and calling its add_to_agent(agent) method. In order to make effective decisions about memo storage and retrieval, the Teachability object calls an instance of TextAnalyzerAgent (another AutoGen agent) to identify and reformulate text as needed for remembering facts, preferences, and skills. Note that this adds extra LLM calls involving a relatively small number of tokens, which can add a few seconds to the time a user waits for each response.

Run It Yourself

AutoGen contains four code examples that use Teachability.

  1. Run the interactive chat script to converse with a teachable agent.

  2. Run the test script for quick unit testing of a teachable agent.

  3. Use the Jupyter notebook agentchat_teachability.ipynb to step through examples discussed below.

  4. Use the Jupyter notebook agentchat_teachable_oai_assistants.ipynb to make arbitrary OpenAI Assistants teachable through GPTAssistantAgent.

Basic Usage of Teachability

  1. Install dependencies

Please install autogen with the [teachable] option before using Teachability.

pip install "autogen[teachable]"
  2. Import agents
from autogen import UserProxyAgent, config_list_from_json
from autogen.agentchat.contrib.capabilities.teachability import Teachability
from autogen import ConversableAgent # As an example
  3. Create llm_config
# Load LLM inference endpoints from an env variable or a file.
# See OAI_CONFIG_LIST_sample for the expected format.
filter_dict = {"model": ["gpt-4"]}  # GPT-3.5 is less reliable than GPT-4 at learning from user feedback.
config_list = config_list_from_json(env_or_file="OAI_CONFIG_LIST", filter_dict=filter_dict)
llm_config = {"config_list": config_list, "timeout": 120}
  4. Create the agents

# Start by instantiating any agent that inherits from ConversableAgent, which we use directly here for simplicity.
teachable_agent = ConversableAgent(
    name="teachable_agent",  # The name can be anything.
    llm_config=llm_config,
)

# Instantiate a Teachability object. Its parameters are all optional.
teachability = Teachability(
    reset_db=False,  # Use True to force-reset the memo DB, and False to use an existing DB.
    path_to_db_dir="./tmp/interactive/teachability_db",  # Can be any path, but teachable agents in a group chat require unique paths.
)

# Now add teachability to the agent.
teachability.add_to_agent(teachable_agent)

# For this test, create a user proxy agent as usual.
user = UserProxyAgent("user", human_input_mode="ALWAYS")
  5. Chat with the teachable agent
# This function will return once the user types 'exit'.
teachable_agent.initiate_chat(user, message="Hi, I'm a teachable user assistant! What's on your mind?")

Example 1 - Learning user info

A user can teach the agent facts about themselves. (Note that due to their finetuning, LLMs can be reluctant to admit that they know personal information.)

Loading previous memory (if any) from disk.
teachable_agent (to user):

Greetings, I'm a teachable user assistant! What's on your mind today?

Provide feedback to teachable_agent. Press enter to skip and use auto-reply, or type 'exit' to end the conversation: My name is Ricky
user (to teachable_agent):

My name is Ricky


>>>>>>>> USING AUTO REPLY...
teachable_agent (to user):

Hello Ricky! How can I assist you today?

Provide feedback to teachable_agent. Press enter to skip and use auto-reply, or type 'exit' to end the conversation: exit

In a later conversation, the user can check whether the teachable agent remembers their name. (For readability, the user prompts and some logged notices are not repeated below.)

user (to teachable_agent):

what's my name?

teachable_agent (to user):

Hi Ricky! How can I help you today?


Example 2 - Learning new facts

A user can teach the agent more complex, related facts.

user (to teachable_agent):

What is the Vicuna model?

teachable_agent (to user):

I'm sorry, but I don't have information about the "Vicuna model" in my current database. Could you please provide some context, description, or details about this model so I can better assist you?

user (to teachable_agent):

Vicuna is a 13B-parameter language model released by Meta.

teachable_agent (to user):

Thank you for providing the information. The Vicuna model is a large-scale language model developed by Meta. It consists of 13 billion parameters, which is what makes it able to understand and generate human-like text across various topics. Parameters in a neural network are the weights and biases assigned to each connection between the neurons, and having more parameters enables the model to learn and capture more complex patterns and relationships in data. As an AI language model, Vicuna can be utilized in various language-related tasks, such as text completion, translation, summarization, and more.

user (to teachable_agent):

What is the Orca model?

teachable_agent (to user):

The Orca model is a yet-to-be-released language model by Meta, referring to the Research Preview API release of Meta's advanced language model. While I don't have specific details about the Orca model at this time, it is expected that the model would have an even larger number of parameters to further enhance its capabilities compared to its predecessor, Vicuna.

As AI language models evolve, they generally become more efficient at understanding and generating human-like text. This makes them more effective in completing a wide range of tasks, including text completion, translation, summarization, and Q&A, as well as providing more accurate and coherent results in various applications.

user (to teachable_agent):

Orca is a 13B-parameter language model released by Microsoft. It outperforms Vicuna on most tasks.

teachable_agent (to user):

Thank you for the clarification. Orca is a 13-billion-parameter language model developed by Microsoft. It is designed to outperform Vicuna, the language model released by Meta. With a large number of parameters, Orca is capable of understanding and generating human-like text across diverse subjects, making it highly useful in various applications.

Orca's advanced capabilities allow it to excel in tasks such as text completion, translation, summarization, Q&A, and more. Its performance improvement over Vicuna highlights the rapid advancements in natural language processing (NLP) research and the development of increasingly capable AI language models.

Then in a later chat the teachable agent can answer questions about the facts it has been taught. (Remember to first close the previous chat by typing 'exit'.)

user (to teachable_agent):

How does the Vicuna model compare to the Orca model?

teachable_agent (to user):

The Vicuna and Orca models are both 13B-parameter language models released by Meta and Microsoft, respectively. In terms of performance, Orca has been reported to outperform Vicuna on most tasks. However, without detailed information about specific tasks and benchmarks, it is difficult to provide a more comprehensive comparison. Generally speaking, both models are advanced language models that aim to provide high-quality natural language understanding and generation, but Orca appears to have an edge in terms of overall performance.


Example 3 - Learning user preferences

A user can teach the agent how they prefer to have things done.

Be aware that a message like the next one cannot be entered as a single message through a command line because it contains a newline character. Such messages can be entered in a Jupyter notebook, or through some UI layer like that of ChatGPT.

user (to teachable_agent):

Please summarize this abstract.

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, Chi Wang
AutoGen is an open-source framework that allows developers to build LLM applications via multiple agents that can converse with each other to accomplish tasks. AutoGen agents are customizable, conversable, and can operate in various modes that employ combinations of LLMs, human inputs, and tools. Using AutoGen, developers can also flexibly define agent interaction behaviors. Both natural language and computer code can be used to program flexible conversation patterns for different applications. AutoGen serves as a generic infrastructure to build diverse applications of various complexities and LLM capacities. Empirical studies demonstrate the effectiveness of the framework in many example applications, with domains ranging from mathematics, coding, question answering, operations research, online decision-making, entertainment, etc.

teachable_agent (to user):

AutoGen is an open-source framework designed to enable developers to create LLM applications with multiple customizable agents that can converse with each other to complete tasks. These agents can operate using different combinations of LLMs, human inputs, and tools, allowing developers to define agent interaction behaviors flexibly. AutoGen supports programming flexible conversation patterns using both natural language and code, making it suitable for building diverse applications with varying complexities and LLM capacities. Its effectiveness has been demonstrated through empirical studies across various domains including mathematics, coding, operations research, decision-making, and entertainment.

user (to teachable_agent):

Please summarize this abstract. When I'm summarizing an abstract, I try to make the summary contain just three short bullet points: the title, the innovation, and the key empirical results.

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, Chi Wang
AutoGen is an open-source framework that allows developers to build LLM applications via multiple agents that can converse with each other to accomplish tasks. AutoGen agents are customizable, conversable, and can operate in various modes that employ combinations of LLMs, human inputs, and tools. Using AutoGen, developers can also flexibly define agent interaction behaviors. Both natural language and computer code can be used to program flexible conversation patterns for different applications. AutoGen serves as a generic infrastructure to build diverse applications of various complexities and LLM capacities. Empirical studies demonstrate the effectiveness of the framework in many example applications, with domains ranging from mathematics, coding, question answering, operations research, online decision-making, entertainment, etc.

teachable_agent (to user):

- Title: AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
- Innovation: Open-source framework for creating customizable LLM applications through agent conversations, supporting various modes and interaction behaviors.
- Key Empirical Results: Demonstrated effectiveness across diverse application domains, including mathematics, coding, question answering, and more.


Then in later chats the teacher doesn't need to reiterate their detailed preferences.

user (to teachable_agent):

Please summarize this abstract.

Sparks of Artificial General Intelligence: Early experiments with GPT-4
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, Yi Zhang
Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs) that exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. The latest model developed by OpenAI, GPT-4, was trained using an unprecedented scale of compute and data. In this paper, we report on our investigation of an early version of GPT-4, when it was still in active development by OpenAI. We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google's PaLM for example) that exhibit more general intelligence than previous AI models. We discuss the rising capabilities and implications of these models. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4's performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system. In our exploration of GPT-4, we put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction. We conclude with reflections on societal influences of the recent technological leap and future research directions.

teachable_agent (to user):

- Title: Sparks of Artificial General Intelligence: Early experiments with GPT-4
- Innovation: GPT-4, an LLM with remarkable capabilities, demonstrates human-level performance across various domains, like math, coding, vision, medicine, law, and psychology.
- Key results: GPT-4 significantly surpasses prior models, suggesting it may be an early version of AGI; limitations and challenges toward deeper AGI are also discussed.


Example 4 - Learning new skills

Users can extend the teachable agent's capabilities by teaching it new skills for accomplishing challenging tasks. It usually works best to first describe the task, then (in the same turn) provide a hint or advice for approaching the task.

The Sparks of AGI paper evaluated GPT-4 on math problems like the following, which it could only solve 32% of the time. We first show a failure case, then teach the agent a strategy which lifts GPT-4's success rate above 95%.

user (to teachable_agent):

Consider the identity:
9 * 4 + 6 * 6 = 72
Can you modify exactly one integer (and not more than that!) on the left hand side of the equation so the right hand side becomes 99?
-Let's think step-by-step, write down a plan, and then write down your solution as: "The solution is: A * B + C * D".

teachable_agent (to user):

Step 1: Identify the current equation and the target value on the right-hand side.
Current equation: 9 * 4 + 6 * 6 = 72
Target value: 99

Step 2: Determine what numbers can be changed on the left-hand side.
Possible changes: 9, 4, 6 (first occurrence), and 6 (second occurrence)

Step 3: Evaluate whether each change in the numbers on the left-hand side can result in the target value.
9 -> If changed, it would create a larger product.
4 -> If changed, the product would increase or decrease by more than the amount needed to reach 99.
6 (first occurrence) -> If increased to 7, then the equation would be 9 * 4 + 7 * 6 = 94, which is still not 99.
6 (second occurrence) -> If increased to 7, then the equation would be 9 * 4 + 6 * 7 = 99.

Step 4: Write down the solution.
The solution is: 9 * 4 + 6 * 7

user (to teachable_agent):

Consider the identity:
9 * 4 + 6 * 6 = 72
Can you modify exactly one integer (and not more than that!) on the left hand side of the equation so the right hand side becomes 99?
-Let's think step-by-step, write down a plan, and then write down your solution as: "The solution is: A * B + C * D".

Here's some advice that may help:
1. Let E denote the original number on the right.
2. Let F denote the final number on the right.
3. Calculate the difference between the two, G = F - E.
4. Examine the numbers on the left one by one until finding one that divides evenly into G, where negative integers are allowed.
5. Calculate J = G / H. This is the number of times that H divides into G.
6. Verify that J is an integer, and that H * J = G.
7. Find the number on the left which is multiplied by H, and call it K.
8. Change K to K + J.
9. Recompute the value on the left, and verify that it equals F.
Finally, write down your solution as: "The solution is: A * B + C * D".

teachable_agent (to user):

Step 1: Identify the original number (E) and the final number (F) on the right-hand side.
E = 72
F = 99

Step 2: Calculate the difference between the two, G = F - E.
G = 99 - 72 = 27

Step 3: Examine the numbers on the left one by one until finding one that divides evenly into G, where negative integers are allowed.
9 (divides evenly into 27)
4 (does not)
6 (first occurrence, does not)
6 (second occurrence, does not)

Step 4: Calculate J = G / H. This is the number of times that H divides into G.
H = 9 (since 9 divides evenly into 27)
J = 27 / 9 = 3

Step 5: Verify that J is an integer, and that H * J = G.
H * J = 9 * 3 = 27
J = 3 (which is an integer)

Step 6: Find the number on the left which is multiplied by H, and call it K.
Number on the leftside of the equation that is multiplied by 9: 4
Therefore, K = 4

Step 7: Change K to K + J.
K + J = 4 + 3 = 7

Step 8: Recompute the value on the left, and verify that it equals F.
9 * 7 + 6 * 6 = 63 + 36 = 99 (which is equal to F)

Step 9: Write down the solution.
The solution is: 9 * 7 + 6 * 6


In a later chat the user doesn't need to repeat the detailed advice.

user (to teachable_agent):

Consider the identity:
8 * 3 + 7 * 9 = 87
Can you modify exactly one integer (and not more than that!) on the left hand side of the equation so the right hand side becomes 59?
-Let's think step-by-step, write down a plan, and then write down your solution as: "The solution is: A * B + C * D".

teachable_agent (to user):

Let's follow the advice to modify one integer on the left hand side of the equation to make the right hand side become 59:

1. Let E denote the original number on the right: E = 87.
2. Let F denote the final number on the right: F = 59.
3. Calculate the difference between the two, G = F - E: G = 59 - 87 = -28.
4. Examine the numbers on the left one by one to find one that divides evenly into G: 8, 3, 7, 9. Since 7 divides -28 evenly, we will work with 7 (H = 7).
5. Calculate J = G / H: J = -28 / 7 = -4.
6. Verify that J is an integer, and that H * J = G: J is an integer, and 7 * (-4) = -28.
7. Find the number on the left which is multiplied by H (7), and call it K: K = 9.
8. Change K to K + J: K_new = 9 + (-4) = 5.
9. Recompute the value on the left, and verify that it equals F: (8 * 3) + (7 * 5) = 24 + 35 = 59.

The solution is: 8 * 3 + 7 * 5
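
The advice taught in this example is a small deterministic procedure, which the following sketch spells out in code for reference. It is only an illustration of the nine steps given to the agent; the function name `modify_one_integer` and the `partner` lookup table are our own, and the agent itself follows the advice in natural language rather than by running code.

```python
def modify_one_integer(a, b, c, d, target):
    """Change exactly one of a, b, c, d so that a*b + c*d equals target.

    Returns the new (a, b, c, d) tuple, or None if no single change works.
    """
    terms = [a, b, c, d]
    partner = [1, 0, 3, 2]               # index of the factor each term is multiplied by
    g = target - (a * b + c * d)         # steps 1-3: G = F - E
    if g == 0:
        return None                      # the identity already holds; nothing to change
    for i, h in enumerate(terms):        # step 4: look for an H that divides G evenly
        if h != 0 and g % h == 0:
            j = g // h                   # steps 5-6: J is an integer with H * J = G
            new_terms = list(terms)
            new_terms[partner[i]] += j   # steps 7-8: K becomes K + J
            na, nb, nc, nd = new_terms
            assert na * nb + nc * nd == target  # step 9: verify the new left side
            return tuple(new_terms)
    return None


print(modify_one_integer(9, 4, 6, 6, 99))  # (9, 7, 6, 6)
print(modify_one_integer(8, 3, 7, 9, 59))  # (8, 3, 7, 5)
```

Both results match the solutions the teachable agent wrote down in the transcripts above.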


Planned improvements

  • Understanding user instructions distributed over multiple turns.
  • Learning from the agent's own experience, to reduce dependence on explicit user teachings.
  • Learning skills built on top of previously learned skills.


Teachability is still under active research and development. For any problems you find or improvements you have in mind, please join our discussions in this repo and on our Discord channel. We look forward to seeing how you and the rest of the community can use and improve teachable agents in AutoGen!

Li Jiang

Last update: August 14, 2024; AutoGen version: v0.2.35

RAG Architecture


  • We introduce RetrieveUserProxyAgent, the AutoGen RAG agent that enables retrieval-augmented generation, and its basic usage.
  • We showcase customizations of RAG agents, such as customizing the embedding function, the text split function, and the vector database.
  • We also showcase two advanced usages of RAG agents: integrating with group chat and building a chat application with Gradio.


Retrieval augmentation has emerged as a practical and effective approach for mitigating the intrinsic limitations of LLMs by incorporating external documents. In this blog post, we introduce the RAG agents of AutoGen, which enable retrieval-augmented generation. The system consists of two agents: a retrieval-augmented user proxy agent, called RetrieveUserProxyAgent, and an assistant agent, called RetrieveAssistantAgent. RetrieveUserProxyAgent extends AutoGen's built-in agents, while RetrieveAssistantAgent can be any conversable agent with an LLM configured. The overall architecture of the RAG agents is shown in the figure above.

To use Retrieval-augmented Chat, one needs to initialize two agents: the Retrieval-Augmented User Proxy and the Retrieval-Augmented Assistant. Initializing the Retrieval-Augmented User Proxy requires specifying a path to the document collection. The Retrieval-Augmented User Proxy can then download the documents, segment them into chunks of a specific size, compute embeddings, and store them in a vector database. Once a chat is initiated, the agents collaborate on code generation or question answering, following the procedure below:

  1. The Retrieval-Augmented User Proxy retrieves document chunks based on the embedding similarity, and sends them along with the question to the Retrieval-Augmented Assistant.
  2. The Retrieval-Augmented Assistant employs an LLM to generate code or text as answers based on the question and context provided. If the LLM is unable to produce a satisfactory response, it is instructed to reply with “Update Context” to the Retrieval-Augmented User Proxy.
  3. If a response includes code blocks, the Retrieval-Augmented User Proxy executes the code and sends the output as feedback. If there are no code blocks or instructions to update the context, it terminates the conversation. Otherwise, it updates the context and forwards the question along with the new context to the Retrieval-Augmented Assistant. Note that if human input solicitation is enabled, the user can proactively send any feedback, including “Update Context”, to the Retrieval-Augmented Assistant.
  4. If the Retrieval-Augmented Assistant receives “Update Context”, it requests the next most similar chunks of documents as new context from the Retrieval-Augmented User Proxy. Otherwise, it generates new code or text based on the feedback and chat history. If the LLM fails to generate an answer, it replies with “Update Context” again. This process can be repeated several times. The conversation terminates if no more documents are available for the context.
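
The control flow in the four steps above can be sketched with a toy loop. Everything here is a stand-in: `answer_from_context` plays the role of the LLM-backed assistant, the chunk list is assumed to be already sorted by embedding similarity, and none of these names come from AutoGen.

```python
UPDATE_CONTEXT = "Update Context"


def answer_from_context(question, context):
    """Stand-in for the assistant: answers only if a chunk mentions the question's last word."""
    keyword = question.rstrip("?").split()[-1].lower()
    for chunk in context:
        if keyword in chunk.lower():
            return chunk
    return UPDATE_CONTEXT  # the "LLM" could not answer from this context


def retrieve_chat(question, ranked_chunks, n_results=2):
    """Feed chunks to the assistant in batches, moving to the next batch on Update Context."""
    offset = 0
    while offset < len(ranked_chunks):
        context = ranked_chunks[offset:offset + n_results]
        reply = answer_from_context(question, context)
        if reply != UPDATE_CONTEXT:
            return reply          # a satisfactory answer ends the conversation
        offset += n_results       # otherwise request the next most similar chunks
    return None                   # no more documents are available for the context


chunks = [
    "AutoGen supports multi-agent conversation.",
    "Agents can use tools and human input.",
    "Install the retrievechat extra with pip.",
]
print(retrieve_chat("How do I install with pip?", chunks))
```

In this run the first batch of two chunks cannot answer the question, so the loop requests the next batch, which contains the answer.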

Basic Usage of RAG Agents

  1. Install dependencies

Please install autogen with the [retrievechat] option before using RAG agents.

pip install "autogen[retrievechat]"

RetrieveChat can handle various types of documents. By default, it can process plain text and PDF files, including formats such as 'txt', 'json', 'csv', 'tsv', 'md', 'html', 'htm', 'rtf', 'rst', 'jsonl', 'log', 'xml', 'yaml', 'yml' and 'pdf'. If you install unstructured, additional document types such as 'docx', 'doc', 'odt', 'pptx', 'ppt', 'xlsx', 'eml', 'msg', 'epub' will also be supported.

  • Install unstructured on Ubuntu
sudo apt-get update
sudo apt-get install -y tesseract-ocr poppler-utils
pip install "unstructured[all-docs]"

You can find a list of all supported document types by using autogen.retrieve_utils.TEXT_FORMATS.

  2. Import Agents
import autogen
from autogen import AssistantAgent
from autogen.agentchat.contrib.retrieve_user_proxy_agent import RetrieveUserProxyAgent
  3. Create an 'AssistantAgent' instance named "assistant" and a 'RetrieveUserProxyAgent' instance named "ragproxyagent"

Refer to the doc for more information on the detailed configurations.

assistant = AssistantAgent(
    name="assistant",
    system_message="You are a helpful assistant.",
    llm_config=llm_config,  # an LLM configuration such as {"config_list": config_list}
)

ragproxyagent = RetrieveUserProxyAgent(
    name="ragproxyagent",
    retrieve_config={"task": "qa", "docs_path": ""},
)
  4. Initialize Chat and ask a question
ragproxyagent.initiate_chat(assistant, message=ragproxyagent.message_generator, problem="What is autogen?")

The output looks like this:

assistant (to ragproxyagent):

AutoGen is a framework that enables the development of large language model (LLM) applications using multiple agents that can converse with each other to solve tasks. The agents are customizable, conversable, and allow human participation. They can operate in various modes that employ combinations of LLMs, human inputs, and tools.

  5. Create a UserProxyAgent and ask the same question
userproxyagent = autogen.UserProxyAgent(name="userproxyagent")
userproxyagent.initiate_chat(assistant, message="What is autogen?")

The output looks like this:

assistant (to userproxyagent):

In computer software, autogen is a tool that generates program code automatically, without the need for manual coding. It is commonly used in fields such as software engineering, game development, and web development to speed up the development process and reduce errors. Autogen tools typically use pre-programmed rules, templates, and data to create code for repetitive tasks, such as generating user interfaces, database schemas, and data models. Some popular autogen tools include Visual Studio's Code Generator and Unity's Asset Store.


You can see that the answer obtained with UserProxyAgent is unrelated to AutoGen, because up-to-date information about AutoGen is not in ChatGPT's training data. The answer obtained with RetrieveUserProxyAgent is correct because it performs retrieval-augmented generation based on the given documentation file.

Customizing RAG Agents

RetrieveUserProxyAgent is customizable through retrieve_config, which has several parameters to configure for different use cases. In this section, we show how to customize the embedding function, the text split function, and the vector database.

Customizing Embedding Function

By default, Sentence Transformers and its pretrained models are used to compute embeddings. You may instead want to use OpenAI, Cohere, HuggingFace, or other embedding functions.

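  • OpenAI
from chromadb.utils import embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="YOUR_API_KEY",  # placeholder; supply your own key
)

ragproxyagent = RetrieveUserProxyAgent(
    name="ragproxyagent",
    retrieve_config={
        "task": "qa",
        "docs_path": "",
        "embedding_function": openai_ef,
    },
)
  • HuggingFace
huggingface_ef = embedding_functions.HuggingFaceEmbeddingFunction(
    api_key="YOUR_API_KEY",  # placeholder; supply your own key
)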

More examples can be found here.

Customizing Text Split Function

Before we can store the documents in a vector database, we need to split the text into chunks. Although autogen implements a flexible text splitter, you may still want to use a different one, and several existing text-splitting tools are worth reusing.

For example, you can use all the text splitters in langchain.

from langchain.text_splitter import RecursiveCharacterTextSplitter

recur_spliter = RecursiveCharacterTextSplitter(separators=["\n", "\r", "\t"])

ragproxyagent = RetrieveUserProxyAgent(
    name="ragproxyagent",
    retrieve_config={
        "task": "qa",
        "docs_path": "",
        "custom_text_split_function": recur_spliter.split_text,
    },
)

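If pulling in langchain is more than you need, a custom splitter can also be a plain function. The sketch below assumes, based on the `split_text` method used in the langchain example, that `custom_text_split_function` accepts a callable that takes one string and returns a list of strings; `split_by_words` and its parameters are hypothetical.

```python
def split_by_words(text, max_words=50, overlap=10):
    """Split text into chunks of at most max_words words, overlapping by `overlap` words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(12))
result = split_by_words(sample, max_words=5, overlap=2)
print(result)
```

Under that assumption, it could be supplied as "custom_text_split_function": split_by_words in retrieve_config. Overlapping chunks are a common choice because a sentence cut at a chunk boundary still appears whole in one of the two neighbors.
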
Customizing Vector Database

chromadb is the default vector database. You can also use mongodb, pgvectordb, or qdrantdb by setting vector_db to mongodb, pgvector, or qdrant, respectively, in retrieve_config.

To plug in any other database, you can extend the class defined in agentchat.contrib.vectordb.base; check out the code here.

Advanced Usage of RAG Agents

Integrate with other agents in a group chat

Using RetrieveUserProxyAgent in a group chat is almost the same as using it in a two-agent chat. The only difference is that you need to initialize the chat with RetrieveUserProxyAgent. The RetrieveAssistantAgent is not necessary in a group chat.

However, you may want to initialize the chat with another agent in some cases. To leverage the best of RetrieveUserProxyAgent, you'll need to call it from a function.

from typing import Annotated

boss = autogen.UserProxyAgent(
    # ...
    system_message="The boss who asks questions and gives tasks.",
)

boss_aid = RetrieveUserProxyAgent(
    # ...
    system_message="Assistant who has extra content retrieval power for solving difficult problems.",
    retrieve_config={
        "task": "qa",
        # ...
    },
    code_execution_config=False,  # we don't want to execute code in this case.
)

coder = autogen.AssistantAgent(
    # ...
    system_message="You are a senior python engineer. Reply `TERMINATE` in the end when everything is done.",
    llm_config={"config_list": config_list, "timeout": 60, "temperature": 0},
)

pm = autogen.AssistantAgent(
    # ...
    system_message="You are a product manager. Reply `TERMINATE` in the end when everything is done.",
    llm_config={"config_list": config_list, "timeout": 60, "temperature": 0},
)

reviewer = autogen.AssistantAgent(
    # ...
    system_message="You are a code reviewer. Reply `TERMINATE` in the end when everything is done.",
    llm_config={"config_list": config_list, "timeout": 60, "temperature": 0},
)

def retrieve_content(
    message: Annotated[
        str,
        "Refined message which keeps the original meaning and can be used to retrieve content for code generation and question answering.",
    ],
    n_results: Annotated[int, "number of results"] = 3,
) -> str:
    boss_aid.n_results = n_results  # Set the number of results to be retrieved.
    _context = {"problem": message, "n_results": n_results}
    ret_msg = boss_aid.message_generator(boss_aid, None, _context)
    return ret_msg or message

# Let the LLM-backed agents propose calls to retrieve_content.
for caller in [pm, coder, reviewer]:
    d_retrieve_content = caller.register_for_llm(
        description="retrieve content for code generation and question answering.", api_style="function"
    )(retrieve_content)

# Let these agents execute the proposed calls.
for executor in [boss, pm]:
    executor.register_for_execution()(d_retrieve_content)

groupchat = autogen.GroupChat(
    agents=[boss, pm, coder, reviewer],
    # ...
)

llm_config = {"config_list": config_list, "timeout": 60, "temperature": 0}
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)

# Start chatting with the boss as this is the user proxy agent.
boss.initiate_chat(
    manager,
    message="How to use spark for parallel training in FLAML? Give me sample code.",
)

Build a Chat application with Gradio

Now, let's wrap it up and make a Chat application with AutoGen and Gradio.

RAG ChatBot with AutoGen

import multiprocessing as mp
import os

import gradio as gr

TIMEOUT = 60  # seconds to wait for a reply; assumed value


# Initialize Agents
def initialize_agents(config_list, docs_path=None):
    ...
    return assistant, ragproxyagent


# Initialize Chat
def initiate_chat(config_list, problem, queue, n_results=3):
    ...
    try:
        ragproxyagent.initiate_chat(
            assistant, problem=problem, silent=False, n_results=n_results
        )
        messages = ragproxyagent.chat_messages
        messages = [messages[k] for k in messages.keys()][0]
        messages = [m["content"] for m in messages if m["role"] == "user"]
        print("messages: ", messages)
    except Exception as e:
        messages = [str(e)]
    queue.put(messages)  # hand the result back to the parent process


# Wrap AutoGen part into a function
def chatbot_reply(input_text):
    """Chat with the agent through terminal."""
    queue = mp.Queue()
    process = mp.Process(
        target=initiate_chat,
        args=(config_list, input_text, queue),
    )
    process.start()
    try:
        messages = queue.get(timeout=TIMEOUT)
    except Exception as e:
        messages = [str(e) if len(str(e)) > 0 else "Invalid Request to OpenAI, please check your API keys."]
    return messages


# Set up UI with Gradio
with gr.Blocks() as demo:
    assistant, ragproxyagent = initialize_agents(config_list)

    chatbot = gr.Chatbot(
        # ...
        avatar_images=(None, (os.path.join(os.path.dirname(__file__), "autogen.png"))),
        # height=600,
    )

    txt_input = gr.Textbox(
        # ...
        placeholder="Enter text and press enter",
    )

    with gr.Row():
        txt_model = gr.Dropdown(
            # ...
        )
        txt_oai_key = gr.Textbox(
            label="OpenAI API Key",
            placeholder="Enter key and press enter",
            value=os.environ.get("OPENAI_API_KEY", ""),
            # ...
        )

    clear = gr.ClearButton([txt_input, chatbot])

    # ...


if __name__ == "__main__":
    demo.launch()

The online app and the source code are hosted in HuggingFace. Feel free to give it a try!

Read More

You can check out more example notebooks for RAG use cases:

Jiale Liu

TL;DR: We demonstrate how to use autogen for local LLM applications. As an example, we will initiate an endpoint using FastChat and perform inference on ChatGLMv2-6b.


Clone FastChat

FastChat provides OpenAI-compatible APIs for its supported models, so you can use FastChat as a local drop-in replacement for OpenAI APIs. However, its code needs minor modification in order to function properly.

git clone
cd FastChat

Download checkpoint

ChatGLM-6B is an open bilingual language model based on the General Language Model (GLM) framework, with 6.2 billion parameters. ChatGLM2-6B is its second-generation version.

Before downloading from HuggingFace Hub, you need to have Git LFS installed.

git clone

Initiate server

First, launch the controller

python -m fastchat.serve.controller

Then, launch the model worker(s)

python -m fastchat.serve.model_worker --model-path chatglm2-6b

Finally, launch the RESTful API server

python -m fastchat.serve.openai_api_server --host localhost --port 8000

Normally this will work. However, if you encounter an error like this, commenting out all the lines containing finish_reason in fastchat/protocol/ and fastchat/protocol/ will fix the problem. The modified code looks like:

class CompletionResponseChoice(BaseModel):
    index: int
    text: str
    logprobs: Optional[int] = None
    # finish_reason: Optional[Literal["stop", "length"]]

class CompletionResponseStreamChoice(BaseModel):
    index: int
    text: str
    logprobs: Optional[float] = None
    # finish_reason: Optional[Literal["stop", "length"]] = None

Interact with model using oai.Completion (requires openai<1)

Now the models can be accessed directly through the openai-python library as well as autogen.oai.Completion and autogen.oai.ChatCompletion.

from autogen import oai

# create a text completion request
response = oai.Completion.create(
    config_list=[
        {
            "model": "chatglm2-6b",
            "base_url": "http://localhost:8000/v1",
            "api_type": "openai",
            "api_key": "NULL",  # just a placeholder
        }
    ],
    prompt="Hi",
)
print(response)

# create a chat completion request
response = oai.ChatCompletion.create(
    config_list=[
        {
            "model": "chatglm2-6b",
            "base_url": "http://localhost:8000/v1",
            "api_type": "openai",
            "api_key": "NULL",
        }
    ],
    messages=[{"role": "user", "content": "Hi"}],
)
print(response)

If you would like to switch to different models, download their checkpoints and specify model path when launching model worker(s).

Interacting with multiple local LLMs

If you would like to interact with multiple LLMs on your local machine, replace the model_worker step above with a multi-model variant:

python -m fastchat.serve.multi_model_worker \
--model-path lmsys/vicuna-7b-v1.3 \
--model-names vicuna-7b-v1.3 \
--model-path chatglm2-6b \
--model-names chatglm2-6b

The inference code would be:

from autogen import oai

# create a chat completion request
response = oai.ChatCompletion.create(
    config_list=[
        {
            "model": "chatglm2-6b",
            "base_url": "http://localhost:8000/v1",
            "api_type": "openai",
            "api_key": "NULL",
        },
        {
            "model": "vicuna-7b-v1.3",
            "base_url": "http://localhost:8000/v1",
            "api_type": "openai",
            "api_key": "NULL",
        },
    ],
    messages=[{"role": "user", "content": "Hi"}],
)
print(response)

For Further Reading

Yiran Wu

MathChat WorkFlow TL;DR:

  • We introduce MathChat, a conversational framework leveraging Large Language Models (LLMs), specifically GPT-4, to solve advanced mathematical problems.
  • MathChat improves LLM's performance on challenging math problem-solving, outperforming basic prompting and other strategies by about 6%. The improvement was especially notable in the Algebra category, with a 15% increase in accuracy.
  • Despite the advancement, GPT-4 still struggles to solve very challenging math problems, even with effective prompting strategies. Further improvements are needed, such as the development of more specific assistant models or the integration of new tools and prompts.

Recent Large Language Models (LLMs) like GPT-3.5 and GPT-4 have demonstrated astonishing abilities over previous models on various tasks, such as text generation, question answering, and code generation. Moreover, these models can communicate with humans through conversations and remember previous contexts, making it easier for humans to interact with them. These models play an increasingly important role in our daily lives, assisting people with different tasks, such as writing emails, summarizing documents, and writing code.

In this blog post, we probe into the problem-solving capabilities of LLMs. Specifically, we are interested in their capabilities to solve advanced math problems, which could be representative of a broader class of problems that require precise reasoning and also have deterministic solutions.

We introduce MathChat, a conversational framework designed for solving challenging math problems with LLMs. This framework takes advantage of the chat-optimized feature of state-of-the-art LLMs, where a user proxy agent and an LLM assistant work together to tackle math problems. We also test previous prompting techniques for comparison.

The MathChat Framework

MathChat simulates a conversation between the LLM assistant and a user proxy agent. As the name indicates, the user proxy agent acts as a proxy for the user, which is responsible for communicating with the LLM assistant and continuing the conversation in a desired manner.

The proxy agent first presents a math problem to the LLM assistant, framed by an initial prompt that guides the assistant in how it should work collaboratively to solve the problem. Existing prompting methods such as CoT (chain-of-thought) and tool-using are incorporated in the initial prompt. Key components of the prompt include:

  • Tool-using Prompt: This guides the LLM assistant to use Python code in the correct format to address the problem, allowing the user proxy agent to parse the code and provide the appropriate results.

  • Problem-Solving Strategy Selection Prompt: The assistant is instructed to choose one of three potential problem-solving strategies, including:

    1. Writing a Python program to solve the problem directly.
    2. Solving the problem directly without using Python, which exercises LLMs' inherent reasoning capabilities.
    3. Solving the problem step by step with Python. This should be used when the first two strategies aren't suitable.
  • Final Answer Encapsulation Prompt: This part instructs the assistant to put the final answer in \boxed.

The prompt design in MathChat is flexible and can easily be refined to include other tools, such as Wolfram Alpha.
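The final-answer convention is what lets a user proxy agent or an evaluation script pull the answer out of a free-form response. Below is a minimal sketch of that extraction, assuming the answer appears as \boxed{...} in the response text. The function name extract_boxed_answer is hypothetical and is not part of MathChat's API.

```python
from typing import Optional


def extract_boxed_answer(response: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in response, or None."""
    marker = "\\boxed{"
    start = response.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    while i < len(response):
        ch = response[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # unbalanced braces: no complete answer found


print(extract_boxed_answer(r"So the answer is \boxed{\frac{1}{2}}."))
```

Tracking brace depth, rather than using a simple regular expression, keeps nested LaTeX such as \frac{1}{2} intact inside the extracted answer.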

Let's take a look at an example conversation between the User Proxy Agent and the LLM Assistant (GPT-4). The conversation focuses on how to solve an inequality using Python. (The conversation is modified for readability.)

User Proxy Agent: (Initial Prompt: Let's solve a math problem with Python...) Problem: Find all x that satisfy the inequality (2x+10)(x+3) < (3x+9)(x+8). Express your answer in interval notation.

LLM assistant: The key idea to solve this problem is to find the critical points, which are the points where the inequality is equal. Then, test the intervals between the critical points to find the interval for which the inequality holds.

LLM assistant (continued): Here's Python code to find the critical points:

    from sympy import symbols, Eq, solve
    x = symbols("x")
    lhs = (2*x + 10)*(x + 3)
    rhs = (3*x + 9)*(x + 8)
    equation = Eq(lhs, rhs)
    critical_points = solve(equation, x)
    critical_points

User Proxy Agent: The critical points are: [-14, -3]

LLM assistant: We found the critical points to be x = -14 and x = -3. Now we'll test the intervals between these critical points as well as points before and after the critical points. The intervals are: x < -14, -14 < x < -3, x > -3.

LLM assistant (continued): Here's Python code to test the inequality for each interval:

    from sympy import symbols
    x = symbols("x")
    inequality = (2*x + 10)*(x + 3) < (3*x + 9)*(x + 8)
    test_points = [-15, -5, 0]
    intervals = []
    for point in test_points:
        if inequality.subs(x, point):
            intervals.append(point)
    intervals

User Proxy Agent: The intervals where the inequality holds true are for test points: [-15, 0]

LLM assistant: We have tested the intervals and found that the inequality holds for: 1. x < -14 (test point: -15) 2. x > -3 (test point: 0). Therefore, the solution to the inequality (2x+10)(x+3) < (3x+9)(x+8) is x ∈ (-∞, -14) ∪ (-3, ∞)
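
The result of this conversation can also be checked by hand. Expanding both sides and collecting terms reduces the inequality to a factored quadratic:

```latex
\begin{align*}
(2x+10)(x+3) &< (3x+9)(x+8) \\
2x^2 + 16x + 30 &< 3x^2 + 33x + 72 \\
0 &< x^2 + 17x + 42 \\
0 &< (x+14)(x+3)
\end{align*}
```

The product is positive exactly when both factors have the same sign, that is, when x < -14 or x > -3. This agrees with the critical points [-14, -3] and the interval answer (-∞, -14) ∪ (-3, ∞) reached in the conversation.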

Experiment Setup

We evaluate the improvement brought by MathChat.

For the experiment, we focus on the level-5 problems from the MATH dataset, which are composed of high school competition problems. These problems include the application of theorems and complex equation derivation and are challenging even for undergraduate students. We evaluate 6 of 7 categories from the dataset (excluding Geometry): Prealgebra, Algebra, Number Theory, Counting and Probability, Intermediate Algebra, and Precalculus.

We evaluate GPT-4 and use the default configuration of the OpenAI API. To assess the final performance, we manually compare the final answer with the correct answer. For the vanilla prompt, Program Synthesis, and MathChat, we have GPT-4 enclose the final answer in \boxed, and for PoT we take the return value of the function as the final answer.

We also evaluate the following methods for comparison:

  1. Vanilla prompting: Evaluates GPT-4's direct problem-solving capability. The prompt used is: " Solve the problem carefully. Put the final answer in \boxed".

  2. Program of Thoughts (PoT): Uses a zero-shot PoT prompt that requests the model to create a Solver function to solve the problem and return the final answer.

  3. Program Synthesis (PS) prompting: Like PoT, it prompts the model to write a program to solve the problem. The prompt used is: "Write a program that answers the following question: {Problem}".

Experiment Results

The accuracy on all the problems with difficulty level-5 from different categories of the MATH dataset with different methods is shown below:


We found that compared to basic prompting, which demonstrates the innate capabilities of GPT-4, utilizing Python within the context of PoT or PS strategy improved the overall accuracy by about 10%. This increase was mostly seen in categories involving more numerical manipulations, such as Counting & Probability and Number Theory, and in more complex categories like Intermediate Algebra and Precalculus.

For categories like Algebra and Prealgebra, PoT and PS showed little improvement, and in some instances, even led to a decrease in accuracy. However, MathChat was able to enhance total accuracy by around 6% compared to PoT and PS, showing competitive performance across all categories. Remarkably, MathChat improved accuracy in the Algebra category by about 15% over other methods. Note that categories like Intermediate Algebra and Precalculus remained challenging for all methods, with only about 20% of problems solved accurately.

The code for experiments can be found at this repository. We now provide an implementation of MathChat using the interactive agents in AutoGen. See this notebook for example usage.

Future Directions

Despite MathChat's improvements over previous methods, the results show that complex math problems are still challenging for recent powerful LLMs, like GPT-4, even with help from external tools.

Further work can be done to enhance this framework or math problem-solving in general:

  • Although enabling the model to use tools like Python can reduce calculation errors, LLMs are still prone to logic errors. Methods like self-consistency (sample several solutions and take a majority vote on the final answer) or self-verification (use another LLM instance to check whether an answer is correct) might improve the performance.
  • Sometimes, whether the LLM can solve the problem depends on the plan it uses. Some plans require less computation and logical reasoning, leaving less room for mistakes.
  • MathChat has the potential to be adapted into a copilot system, which could assist users with math problems. This system could allow users to be more involved in the problem-solving process, potentially enhancing learning.

For Further Reading

Chi Wang

An adaptive way of using GPT-3.5 and GPT-4 outperforms GPT-4 in both coding success rate and inference cost


  • A case study using the HumanEval benchmark shows that an adaptive way of using multiple GPT models can achieve both much higher accuracy (from 68% to 90%) and lower inference cost (by 18%) than using GPT-4 for coding.

GPT-4 is a big upgrade in foundation model capability, e.g., in code and math, accompanied by a much higher (more than 10x) price per token than GPT-3.5-Turbo. On a code completion benchmark, HumanEval, developed by OpenAI, GPT-4 can successfully solve 68% of tasks while GPT-3.5-Turbo solves 46%. It is possible to increase the success rate of GPT-4 further by generating multiple responses or making multiple calls. However, that will further increase the cost, which is already nearly 20 times that of using GPT-3.5-Turbo and comes with a more restricted API call rate limit. Can we achieve more with less?

In this blog post, we will explore a creative, adaptive way of using GPT models which leads to a big leap forward.


  • GPT-3.5-Turbo can already solve 40%-50% of tasks. If we never use GPT-4 for these tasks, we can save nearly 40%-50% of the cost.
  • If we use the saved cost to generate more responses with GPT-4 for the remaining unsolved tasks, it is possible to solve some more of them while keeping the amortized cost down.

The obstacle to leveraging these observations is that we do not know a priori which tasks can be solved by the cheaper model, which tasks can be solved by the expensive model, and which tasks can be solved by paying even more to the expensive model.

To overcome that obstacle, one may want to predict which task requires what model to solve and how many responses are required for each task. Let's look at one example code completion task:

def vowels_count(s):
    """Write a function vowels_count which takes a string representing
    a word as input and returns the number of vowels in the string.
    Vowels in this case are 'a', 'e', 'i', 'o', 'u'. Here, 'y' is also a
    vowel, but only when it is at the end of the given word.

    >>> vowels_count("abcde")
    2
    >>> vowels_count("ACEDY")
    3
    """

Can we predict whether GPT-3.5-Turbo can solve this task or do we need to use GPT-4? My first guess is that GPT-3.5-Turbo can get it right because the instruction is fairly straightforward. Yet, it turns out that GPT-3.5-Turbo does not consistently get it right, if we only give it one chance. It's not obvious (but an interesting research question!) how to predict the performance without actually trying.

What else can we do? We notice that it's "easier" to verify a given solution than to find a correct solution from scratch.

Some simple example test cases are provided in the docstring. If we already have a response generated by a model, we can use those test cases to filter out wrong implementations, and either use a more powerful model or generate more responses, until the result passes the example test cases. Moreover, this step can be automated by asking GPT-3.5-Turbo to generate assertion statements from the examples given in the docstring (a simpler task where we can place our bet) and executing the code.
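
The filtering step can be sketched as follows. This is only an illustration, not the implementation shipped in autogen: the helper name passes_assertions is hypothetical, the assertions are written by hand here instead of being generated by a model, and the expected values 2 and 3 follow from the vowel rule stated in the vowels_count docstring.

```python
from typing import List


def passes_assertions(candidate_code: str, assertions: List[str]) -> bool:
    """Run candidate_code, then each assertion; True only if nothing raises."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # define the candidate function
        for assertion in assertions:
            exec(assertion, namespace)  # raises AssertionError on mismatch
    except Exception:
        return False
    return True


# Hand-written stand-ins for model-generated assertions (assumption).
assertions = [
    'assert vowels_count("abcde") == 2',
    'assert vowels_count("ACEDY") == 3',
]

# One candidate that forgets the trailing-'y' rule, and one that handles it.
wrong = "def vowels_count(s):\n    return sum(c in 'aeiou' for c in s.lower())"
right = (
    "def vowels_count(s):\n"
    "    s = s.lower()\n"
    "    return sum(c in 'aeiou' for c in s) + s.endswith('y')"
)

print(passes_assertions(wrong, assertions))
print(passes_assertions(right, assertions))
```

Note that exec runs the candidate with full privileges; model-generated code should be executed in a sandbox or a separate process in any real setting.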


Combining these observations, we can design a solution with two intuitive ideas:

  • Make use of auto-generated feedback, i.e., code execution results, to filter responses.
  • Try inference configurations one by one, until one response can pass the filter.


This solution works adaptively without knowing or predicting which task fits which configuration. It simply tries multiple configurations one by one, starting from the cheapest configuration. Note that one configuration can generate multiple responses (by setting the inference parameter n larger than 1). And different configurations can use the same model and different inference parameters such as n and temperature. Only one response is returned and evaluated per task.

An implementation of this solution is provided in autogen. It uses the following sequence of configurations:

  1. GPT-3.5-Turbo, n=1, temperature=0
  2. GPT-3.5-Turbo, n=7, temperature=1, stop=["\nclass", "\ndef", "\nif", "\nprint"]
  3. GPT-4, n=1, temperature=0
  4. GPT-4, n=2, temperature=1, stop=["\nclass", "\ndef", "\nif", "\nprint"]
  5. GPT-4, n=1, temperature=1, stop=["\nclass", "\ndef", "\nif", "\nprint"]
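
The control flow behind this sequence can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the autogen implementation: adaptive_respond, fake_generate, and the canned responses are hypothetical stand-ins for real model calls, and returning the last failing response when nothing passes the filter is an assumed fallback.

```python
from typing import Callable, List, Optional, Sequence


def adaptive_respond(
    configs: Sequence[dict],
    generate: Callable[[dict], List[str]],
    passes_filter: Callable[[str], bool],
) -> Optional[str]:
    """Try configs in order (cheapest first); return the first passing response."""
    fallback = None
    for config in configs:
        for response in generate(config):
            if passes_filter(response):
                return response
            fallback = response  # remember the most recent failing response
    return fallback  # nothing passed: return the last response tried


# Canned outputs standing in for real model calls (hypothetical).
canned = {
    ("gpt-3.5-turbo", 1): ["bad"],
    ("gpt-3.5-turbo", 7): ["bad"] * 7,
    ("gpt-4", 1): ["good"],
}

configs = [
    {"model": "gpt-3.5-turbo", "n": 1, "temperature": 0},
    {"model": "gpt-3.5-turbo", "n": 7, "temperature": 1},
    {"model": "gpt-4", "n": 1, "temperature": 0},
]

tried = []  # records which model each attempt used


def fake_generate(config: dict) -> List[str]:
    tried.append(config["model"])
    return canned[(config["model"], config["n"])]


result = adaptive_respond(configs, fake_generate, lambda r: r == "good")
print(result, tried)
```

Because the loop stops at the first response that passes, the more expensive configurations are only paid for on tasks where the cheaper ones fail the filter.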

Experiment Results

The first figure in this blog post shows the success rate and average inference cost of the adaptive solution compared with default GPT-4. The inference cost includes the cost for generating the assertions in our solution. The generated assertions are not always correct, and programs that pass/fail the generated assertions are not always right/wrong. Despite that, the adaptive solution can increase the success rate (referred to as pass@1 in the literature) from 68% to 90%, while reducing the cost by 18%.

Here are a few examples of function definitions which are solved by different configurations in the portfolio.

  1. Solved by GPT-3.5-Turbo, n=1, temperature=0:
def compare(game,guess):
    """I think we all remember that feeling when the result of some long-awaited
    event is finally known. The feelings and thoughts you have at that moment are
    definitely worth noting down and comparing.
    Your task is to determine if a person correctly guessed the results of a number of matches.
    You are given two arrays of scores and guesses of equal length, where each index shows a match.
    Return an array of the same length denoting how far off each guess was. If they have guessed correctly,
    the value is 0, and if not, the value is the absolute difference between the guess and the score.

    compare([1,2,3,4,5,1],[1,2,3,4,2,-2]) -> [0,0,0,0,3,3]
    compare([0,5,0,0,0,4],[4,1,1,0,0,-2]) -> [4,4,1,0,0,6]
    """
  2. Solved by GPT-3.5-Turbo, n=7, temperature=1, stop=["\nclass", "\ndef", "\nif", "\nprint"]: the vowels_count function presented earlier.
  3. Solved by GPT-4, n=1, temperature=0:
def string_xor(a: str, b: str) -> str:
    """ Input are two strings a and b consisting only of 1s and 0s.
    Perform binary XOR on these inputs and return result also as a string.
    >>> string_xor('010', '110')
    '100'
    """
  4. Solved by GPT-4, n=2, temperature=1, stop=["\nclass", "\ndef", "\nif", "\nprint"]:
def is_palindrome(string: str) -> bool:
    """ Test if given string is a palindrome """
    return string == string[::-1]

def make_palindrome(string: str) -> str:
    """ Find the shortest palindrome that begins with a supplied string.
    Algorithm idea is simple:
    - Find the longest postfix of supplied string that is a palindrome.
    - Append to the end of the string reverse of a string prefix that comes before the palindromic suffix.
    >>> make_palindrome('')
    ''
    >>> make_palindrome('cat')
    'catac'
    >>> make_palindrome('cata')
    'catac'
    """
  5. Solved by GPT-4, n=1, temperature=1, stop=["\nclass", "\ndef", "\nif", "\nprint"]:
def sort_array(arr):
    """
    In this Kata, you have to sort an array of non-negative integers according to
    number of ones in their binary representation in ascending order.
    For similar number of ones, sort based on decimal value.

    It must be implemented like this:
    >>> sort_array([1, 5, 2, 3, 4]) == [1, 2, 3, 4, 5]
    >>> sort_array([-2, -3, -4, -5, -6]) == [-6, -5, -4, -3, -2]
    >>> sort_array([1, 0, 2, 3, 4]) [0, 1, 2, 3, 4]
    """

The last problem is an example with wrong example test cases in the original definition. It misleads the adaptive solution because a correct implementation is regarded as wrong and more trials are made. The last configuration in the sequence returns the right implementation, even though it does not pass the auto-generated assertions. This example demonstrates that:

  • Our adaptive solution has a certain degree of fault tolerance.
  • The success rate and inference cost for the adaptive solution can be further improved if correct example test cases are used.

It is worth noting that the reduced inference cost is the amortized cost over all the tasks. For each individual task, the cost can be either larger or smaller than directly using GPT-4. This is the nature of the adaptive solution: The cost is in general larger for difficult tasks than that for easy tasks.

An example notebook to run this experiment can be found at: The experiment was run when AutoGen was a subpackage in FLAML.


Our solution is quite simple to implement using a generic interface offered in autogen, yet the result is quite encouraging.

While the specific way of generating assertions is application-specific, the main ideas are general in LLM operations:

  • Generate multiple responses to select from - especially useful when selecting a good response is relatively easier than generating a good response in one shot.
  • Consider multiple configurations to generate responses - especially useful when:
    • The choice of model and other inference parameters affects the utility-cost tradeoff; or
    • Different configurations have complementary effects.

A previous blog post provides evidence that these ideas are relevant in solving math problems too. autogen uses a technique called EcoOptiGen to support inference parameter tuning and model selection.

There are many directions of extensions in research and development:

  • Generalize the way to provide feedback.
  • Automate the process of optimizing the configurations.
  • Build adaptive agents for different applications.

For Further Reading

Chi Wang

level 2 algebra


  • Just by tuning the inference parameters like model, number of responses, temperature etc. without changing any model weights or prompt, the baseline accuracy of untuned gpt-4 can be improved by 20% in high school math competition problems.
  • For easy problems, the tuned gpt-3.5-turbo model vastly outperformed untuned gpt-4 in accuracy (e.g., 90% vs. 70%) and cost efficiency. For hard problems, the tuned gpt-4 is much more accurate (e.g., 35% vs. 20%) and less expensive than untuned gpt-4.
  • AutoGen can help with model selection, parameter tuning, and cost-saving in LLM applications.

Large language models (LLMs) are powerful tools that can generate natural language text for various applications, such as chatbots, summarization, translation, and more. GPT-4 is currently the state-of-the-art LLM in the world. Does that make model selection irrelevant? What about inference parameters?

In this blog post, we will explore how the model and inference parameters matter in LLM applications, using a case study for MATH, a benchmark for evaluating LLMs on advanced mathematical problem solving. MATH consists of 12K math competition problems from AMC-10, AMC-12 and AIME. Each problem is accompanied by a step-by-step solution.

We will use AutoGen to automatically find the best model and inference parameter for LLMs on a given task and dataset given an inference budget, using a novel low-cost search & pruning strategy. AutoGen currently supports all the LLMs from OpenAI, such as GPT-3.5 and GPT-4.

After tuning, we compare the performance and inference cost on algebra problems against untuned gpt-4. We also analyze how different difficulty levels affect the results.

Experiment Setup

We use AutoGen to select between the following models with a target inference budget of $0.02 per instance:

  • gpt-3.5-turbo, a relatively cheap model that powers the popular ChatGPT app
  • gpt-4, the state-of-the-art LLM that costs more than 10 times as much as gpt-3.5-turbo

We adapt the models using 20 examples in the train set, using the problem statement as the input and generating the solution as the output. We use the following inference parameters:

  • temperature: The parameter that controls the randomness of the output text. A higher temperature means more diversity but less coherence. We search for the optimal temperature in the range of [0, 1].
  • top_p: The parameter that controls the probability mass of the output tokens. Only the most likely tokens whose cumulative probability stays within top-p are considered. A lower top-p means less diversity but more coherence. We search for the optimal top-p in the range of [0, 1].
  • max_tokens: The maximum number of tokens that can be generated for each output. We search for the optimal max length in the range of [50, 1000].
  • n: The number of responses to generate. We search for the optimal n in the range of [1, 100].
  • prompt: We use the template: "{problem} Solve the problem carefully. Simplify your answer as much as possible. Put the final answer in \boxed{{}}." where {problem} will be replaced by the math problem instance.

In this experiment, when n > 1, we find the answer with highest votes among all the responses and then select it as the final answer to compare with the ground truth. For example, if n = 5 and 3 of the responses contain a final answer 301 while 2 of the responses contain a final answer 159, we choose 301 as the final answer. This can help with resolving potential errors due to randomness. We use the average accuracy and average inference cost as the metric to evaluate the performance over a dataset. The inference cost of a particular instance is measured by the price per 1K tokens and the number of tokens consumed.
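
The voting step and the cost metric can be illustrated with a short sketch. The function names majority_vote and inference_cost are hypothetical, the answers are assumed to have already been extracted from the responses as strings, and the tie-breaking rule (first answer seen wins) is an assumption, since the text does not specify one.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(answers: List[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-None answer; ties go to the one seen first."""
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def inference_cost(num_tokens: int, price_per_1k_tokens: float) -> float:
    """Cost of one instance: tokens consumed times the price per 1K tokens."""
    return num_tokens / 1000 * price_per_1k_tokens


# The example from the text: n = 5, three votes for 301 and two for 159.
votes = ["301", "159", "301", "159", "301"]
print(majority_vote(votes))  # prints 301
```

Responses from which no final answer could be extracted are represented as None and simply do not vote.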

Experiment Results

The first figure in this blog post shows the average accuracy and average inference cost of each configuration on the level 2 Algebra test set.

Surprisingly, the tuned gpt-3.5-turbo model is selected as a better model and it vastly outperforms untuned gpt-4 in accuracy (92% vs. 70%) with equal or 2.5 times higher inference budget. The same observation can be obtained on the level 3 Algebra test set.

level 3 algebra

However, the selected model changes on level 4 Algebra.

level 4 algebra

This time gpt-4 is selected as the best model. The tuned gpt-4 achieves much higher accuracy (56% vs. 44%) and lower cost than the untuned gpt-4. On level 5 the result is similar.

level 5 algebra

We can see that AutoGen has found different optimal model and inference parameters for each subset of a particular level, which shows that these parameters matter in cost-sensitive LLM applications and need to be carefully tuned or adapted.

An example notebook to run these experiments can be found at: The experiments were run when AutoGen was a subpackage in FLAML.

Analysis and Discussion

While gpt-3.5-turbo demonstrates competitive accuracy with voted answers in relatively easy algebra problems under the same inference budget, gpt-4 is a better choice for the most difficult problems. In general, through parameter tuning and model selection, we can identify the opportunity to save the expensive model for more challenging tasks, and improve the overall effectiveness of a budget-constrained system.

There are many other alternative ways of solving math problems, which we have not covered in this blog post. When there are choices beyond the inference parameters, they can be generally tuned via flaml.tune.

The need for model selection, parameter tuning and cost saving is not specific to math problems. The Auto-GPT project is an example where high cost can easily prevent a generic complex task from being accomplished, as it needs many LLM inference calls.

For Further Reading

