
agentchat.contrib.agent_eval.agent_eval

generate_criteria

```python
def generate_criteria(llm_config: Optional[Union[Dict, Literal[False]]] = None,
                      task: Task = None,
                      additional_instructions: str = "",
                      max_round=2,
                      use_subcritic: bool = False)
```

Creates a list of criteria for evaluating the utility of a given task.

Arguments:

  • llm_config dict or bool - LLM inference configuration.
  • task Task - The task to evaluate.
  • additional_instructions str - Additional instructions for the criteria agent.
  • max_round int - The maximum number of rounds to run the conversation.
  • use_subcritic bool - Whether to use the subcritic agent to generate subcriteria.

Returns:

  • list - A list of Criterion objects for evaluating the utility of the given task.
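The sketch below shows one way this function might be called. It is illustrative only: the full import paths, the Task fields (name, description, successful_response, failed_response), the llm_config contents, and the Criterion attributes read at the end are assumptions about the surrounding agent_eval models, not guarantees made by this reference.

```python
# Minimal usage sketch (assumed import paths and Task/Criterion fields).
from autogen.agentchat.contrib.agent_eval.agent_eval import generate_criteria
from autogen.agentchat.contrib.agent_eval.task import Task

# Standard AutoGen-style LLM configuration; model and key are placeholders.
llm_config = {"config_list": [{"model": "gpt-4", "api_key": "<your-key>"}]}

# A toy task; the field names here are assumed from the Task model.
task = Task(
    name="Math problem solving",
    description="Solve grade-school math word problems step by step.",
    successful_response="The answer is 42.",
    failed_response="I don't know.",
)

criteria = generate_criteria(
    llm_config=llm_config,
    task=task,
    additional_instructions="Focus on correctness and clarity of reasoning.",
    max_round=2,
    use_subcritic=False,
)

# Inspect the generated criteria (assumes Criterion exposes name and
# accepted_values attributes).
for criterion in criteria:
    print(criterion.name, criterion.accepted_values)
```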

quantify_criteria

```python
def quantify_criteria(llm_config: Optional[Union[Dict, Literal[False]]] = None,
                      criteria: List[Criterion] = None,
                      task: Task = None,
                      test_case: str = "",
                      ground_truth: str = "")
```

Quantifies the performance of a system using the provided criteria.

Arguments:

  • llm_config dict or bool - LLM inference configuration.
  • criteria List[Criterion] - A list of criteria for evaluating the utility of the given task.
  • task Task - The task to evaluate.
  • test_case str - The test case to evaluate.
  • ground_truth str - The ground truth for the test case.

Returns:

  • dict - A dictionary keyed by criterion, where each value is the assessed performance drawn from that criterion's accepted values.
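A minimal sketch of quantifying a single hand-written criterion follows. The Criterion and Task field names, the import paths, and the test_case/ground_truth strings are illustrative assumptions; in practice criteria would typically come from generate_criteria above and test_case would be the logged output of the system under evaluation.

```python
# Minimal usage sketch (assumed import paths and model fields).
from autogen.agentchat.contrib.agent_eval.agent_eval import quantify_criteria
from autogen.agentchat.contrib.agent_eval.criterion import Criterion
from autogen.agentchat.contrib.agent_eval.task import Task

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "<your-key>"}]}

task = Task(
    name="Math problem solving",
    description="Solve grade-school math word problems step by step.",
    successful_response="The answer is 42.",
    failed_response="I don't know.",
)

# A single hand-written criterion; generate_criteria would normally
# produce this list.
criteria = [
    Criterion(
        name="Accuracy",
        description="Whether the final answer matches the ground truth.",
        accepted_values=["correct", "incorrect"],
    )
]

result = quantify_criteria(
    llm_config=llm_config,
    criteria=criteria,
    task=task,
    test_case="Q: What is 6 * 7?\nA: The answer is 42.",
    ground_truth="42",
)

# Per the Returns section above, result maps each criterion to its
# assessed value.
print(result)
```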