
How Chat-Agent-Simulator works

IntellAgent is a multi-agent framework that provides fine-grained diagnostics for chatbot systems by simulating thousands of edge-case scenarios. This document walks through the simulation pipeline step by step.

Pipeline Overview

1. Input Context Analysis

The framework begins by analyzing the agent's core inputs:

  • Prompt: Chat-agent system prompt or a document outlining the company policies
  • Tools: Chat-agent tools
  • Database Schema: Underlying data structure
  • Database Validators: (Optional) Custom validation functions
  • Task Description: (Optional) Automatically inferred from system prompt

This analysis forms the baseline for understanding the agent's capabilities and constraints.
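As a rough illustration, the inputs listed above could be bundled into a single object before analysis. This is a minimal sketch, not IntellAgent's actual configuration format: the `AgentContext` class, its field names, and the `get_reservation` tool stub are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class AgentContext:
    """Hypothetical container for the inputs the analysis step consumes."""
    prompt: str                                  # system prompt or policy document
    tools: list[Callable]                        # chat-agent tools
    database_schema: dict[str, dict[str, str]]   # table -> {column: type}
    database_validators: list[Callable[[dict], bool]] = field(default_factory=list)
    task_description: Optional[str] = None       # optional; may be inferred later

    def missing_optional_inputs(self) -> list[str]:
        """Report which optional inputs were not supplied."""
        missing = []
        if not self.database_validators:
            missing.append("database_validators")
        if self.task_description is None:
            missing.append("task_description")
        return missing


def get_reservation(reservation_id: str) -> dict:
    """Hypothetical tool stub used only for this example."""
    return {"id": reservation_id}


context = AgentContext(
    prompt="You are an airline support agent.",
    tools=[get_reservation],
    database_schema={"reservations": {"id": "TEXT", "cabin": "TEXT"}},
)
print(context.missing_optional_inputs())
```

Keeping the optional inputs explicit makes it easy to see which ones a later step would have to infer or skip.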

[Figure: Input Context]

2. Flow and Policies Generation

The system automatically:

  • Decomposes the agent's primary task into several main flows
  • For each flow, extracts the relevant policies, each with a category and a challenge score

| Flow | Policy Example | Category | Challenge Score |
|---|---|---|---|
| Book Flight | "Each reservation can use at most one travel certificate, one credit card, and three gift cards." | Payment Handling / Financial | 3 |
| Book Flight | "Flights booked within 7 days of departure are non-refundable unless a travel insurance policy is purchased." | Policy Enforcement / Restriction | 4 |
| Modify Flights | "Basic economy flights cannot be modified. Other reservations can be modified without changing the origin or trip." | Policy Enforcement / Restriction | 4 |
| Modify Flights | "Modifications made less than 24 hours before departure will incur a $75 fee per passenger." | Fee Policy / Upsell Handling | 3 |
| Cancel Flight | "All reservations can be cancelled within 24 hours of booking or if the airline cancelled the flight." | Policy Enforcement / Restriction | 3 |
| Cancel Flight | "Cancellations made after 24 hours will result in a travel credit valid for one year." | Refund Handling / Credit Policy | 3 |
| Refund | "Refunds will go to original payment methods within 5 to 7 business days." | Payment Handling / Financial | 2 |
| Refund | "Refunds for non-refundable tickets are available only if the flight is cancelled by the airline." | Policy Enforcement / Restriction | 4 |
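One way to picture the output of this step is as a list of policy records grouped by flow. The sketch below uses three rows from the example policies above; the `Policy` class and the `group_by_flow` helper are hypothetical names, not part of IntellAgent's API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical record for one extracted policy."""
    flow: str
    text: str
    category: str
    challenge_score: int


POLICIES = [
    Policy("Book Flight",
           "Each reservation can use at most one travel certificate, "
           "one credit card, and three gift cards.",
           "Payment Handling / Financial", 3),
    Policy("Book Flight",
           "Flights booked within 7 days of departure are non-refundable "
           "unless a travel insurance policy is purchased.",
           "Policy Enforcement / Restriction", 4),
    Policy("Refund",
           "Refunds will go to original payment methods within 5 to 7 business days.",
           "Payment Handling / Financial", 2),
]


def group_by_flow(policies: list[Policy]) -> dict[str, list[Policy]]:
    """Collect policies under the flow they were extracted from."""
    grouped: dict[str, list[Policy]] = defaultdict(list)
    for policy in policies:
        grouped[policy.flow].append(policy)
    return dict(grouped)


by_flow = group_by_flow(POLICIES)
for flow, items in by_flow.items():
    print(flow, [p.challenge_score for p in items])
```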

3. Policies Graph Generation

IntellAgent creates a weighted graph structure where:

  • Nodes represent individual policies
  • Each node is assigned a weight that reflects the complexity of its associated policy
  • Edge weights reflect (on a scale of 1–10) the likelihood of the two policies co-occurring in a conversation

This graph facilitates creation of natural user requests that cover various policies at various levels of complexity.
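To make the graph concrete, the sketch below stores node weights and edge weights in plain dictionaries and picks a set of policies with a weighted random walk that stops once a target challenge level is reached. The walk is an assumed sampling strategy chosen for illustration; the source does not specify how policies are selected from the graph. The policy identifiers and edge weights are invented for the example.

```python
import random

# Node weight = policy complexity (challenge score). Identifiers are hypothetical.
node_weight = {
    "one_certificate_per_reservation": 3,
    "non_refundable_within_7_days": 4,
    "refund_to_original_payment": 2,
}

# Edge weight (1-10) = likelihood that two policies co-occur in a conversation.
edge_weight = {
    frozenset({"one_certificate_per_reservation", "non_refundable_within_7_days"}): 8,
    frozenset({"non_refundable_within_7_days", "refund_to_original_payment"}): 6,
    frozenset({"one_certificate_per_reservation", "refund_to_original_payment"}): 2,
}


def sample_policies(start: str, target_challenge: int, rng: random.Random):
    """Walk the graph, preferring likely co-occurring policies, until the
    summed node weights reach target_challenge or no neighbor remains."""
    chosen = [start]
    total = node_weight[start]
    while total < target_challenge:
        candidates = [p for p in node_weight if p not in chosen]
        weights = [edge_weight.get(frozenset({chosen[-1], p}), 0) for p in candidates]
        if not candidates or sum(weights) == 0:
            break  # nothing reachable from the current policy
        nxt = rng.choices(candidates, weights=weights, k=1)[0]
        chosen.append(nxt)
        total += node_weight[nxt]
    return chosen, total


chosen, total = sample_policies("one_certificate_per_reservation", 7, random.Random(0))
print(chosen, total)
```

Because neighbors are drawn in proportion to edge weight, policies that plausibly appear together are combined more often, which is what keeps the resulting user requests natural.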

[Figure: Policies Graph Generation]

4. Dataset Event Generation

The system generates comprehensive test scenarios including:

  • Scenario Content: Specific user interaction cases
  • Expected Behavior: Desired agent behavior based on policies
  • Data Ingestion: Required database records and state prerequisites

Scenarios are generated at varying challenge scores so that testing covers the full range of complexity levels.
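A generated event can be thought of as a scenario, its expected behavior, and the database rows that must exist before the conversation starts. The following is a minimal sketch using an in-memory SQLite database; the `Event` class, the `ingest` helper, the table layout, and the scenario text are all hypothetical and only illustrate the three parts listed above.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Event:
    """Hypothetical test event: scenario, expected behavior, required data."""
    scenario: str
    expected_behavior: str
    rows: dict[str, list[dict]] = field(default_factory=dict)  # table -> records
    challenge_score: int = 0


def ingest(event: Event, conn: sqlite3.Connection) -> None:
    """Insert the event's prerequisite records so the scenario is reproducible."""
    for table, records in event.rows.items():
        for record in records:
            columns = ", ".join(record)
            placeholders = ", ".join("?" for _ in record)
            # Table and column names come from the event itself (trusted here).
            conn.execute(
                f"INSERT INTO {table} ({columns}) VALUES ({placeholders})",
                tuple(record.values()),
            )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reservations (id TEXT, cabin TEXT)")

event = Event(
    scenario="User asks to move a basic economy flight to the next day.",
    expected_behavior="Agent refuses: basic economy flights cannot be modified.",
    rows={"reservations": [{"id": "R1", "cabin": "basic_economy"}]},
    challenge_score=4,
)
ingest(event, conn)
print(conn.execute("SELECT id, cabin FROM reservations").fetchall())
```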

[Figure: Generate Dataset Event]

5. Dialog Simulation

Dialog simulation runs each generated event as a conversation among several cooperating components to test the chatbot's capabilities:

5.1 User-Chatbot Interaction

For each event in the database, IntellAgent orchestrates a dynamic conversation between:

  • A User Agent that simulates user behavior
  • The Target Chatbot being evaluated
  • A Dialog Critique component that monitors the interaction

The simulation follows these steps:

  1. The User Agent receives:
    • Event description and context
    • Database state information
    • Expected chatbot behaviors based on policy requirements
  2. The interaction proceeds step-by-step with the User Agent making decisions based on the chatbot's responses
  3. The User Agent can terminate the conversation when either:
    • The task is successfully completed
    • The chatbot violates a policy or deviates from expected behavior
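The steps above amount to a turn-taking loop that the User Agent can end. The sketch below shows that control flow with stub agents in place of LLM calls; the sentinel strings, the `simulate` function, and the `max_turns` cap are assumptions made for illustration, and the Dialog Critique component is left out.

```python
from typing import Callable

Turn = tuple[str, str]  # (speaker, message)

# Hypothetical sentinels the user agent returns to end the conversation.
STOP_SUCCESS = "###SUCCESS"
STOP_VIOLATION = "###VIOLATION"


def simulate(user_agent: Callable[[list[Turn]], str],
             chatbot: Callable[[list[Turn]], str],
             max_turns: int = 10) -> tuple[list[Turn], str]:
    """Alternate user and chatbot turns until the user agent stops the dialog."""
    history: list[Turn] = []
    for _ in range(max_turns):
        user_msg = user_agent(history)
        if user_msg in (STOP_SUCCESS, STOP_VIOLATION):
            return history, user_msg
        history.append(("user", user_msg))
        history.append(("chatbot", chatbot(history)))
    return history, "max_turns"


def user_agent(history: list[Turn]) -> str:
    """Stub: expects the chatbot to refuse modifying a basic economy flight."""
    if not history:
        return "Please move my basic economy flight to tomorrow."
    last_reply = history[-1][1]
    return STOP_SUCCESS if "cannot be modified" in last_reply else STOP_VIOLATION


def chatbot(history: list[Turn]) -> str:
    """Stub for the target chatbot under evaluation."""
    return "Basic economy flights cannot be modified."


history, outcome = simulate(user_agent, chatbot)
print(outcome, len(history))
```

In a real run both callables would wrap model calls, and the user agent's stopping decision would be informed by the event description, the database state, and the expected behaviors it received.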

[Figure: Run Simulation Events]

6. Dialog Critique System & Comprehensive Evaluation

The critique component performs real-time monitoring and post-conversation analysis:

  1. Real-Time Dialog Monitoring

    • Tracks conversation flow and policy compliance in real time
    • Validates each interaction against defined policies
    • Flags potential issues for immediate intervention
  2. Policy Analysis

    • Identifies which event policies were actively tested during the conversation
    • Documents any policy violations or deviations from expected behavior
    • Generates a detailed compliance report
  3. Performance Metrics

    • Calculates quantitative metrics (success rate, completion time, policy coverage)
    • Evaluates qualitative aspects (conversation flow, user satisfaction)
    • Identifies patterns in successful vs. failed interactions
    • Provides actionable recommendations for improvement

The system generates two primary types of analysis:

  • Category-based Analysis: Success rates and common issues across different policy types
  • Complexity-based Analysis: Performance metrics across varying challenge levels
[Figure: Policy Categories Performance (Success Rates Across Policy Categories)]
[Figure: Challenge Levels Performance (Success Rates Across Challenge Levels)]
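Both analyses reduce to grouping per-conversation outcomes by a key and computing the share of successes. The sketch below shows that aggregation; the `success_rate_by` helper and the record layout are hypothetical, and the sample records are invented inputs for the example, not measured results.

```python
from collections import defaultdict

# Invented sample outcomes, one record per simulated conversation.
results = [
    {"category": "Payment Handling / Financial", "challenge": 3, "success": True},
    {"category": "Payment Handling / Financial", "challenge": 2, "success": True},
    {"category": "Policy Enforcement / Restriction", "challenge": 4, "success": False},
    {"category": "Policy Enforcement / Restriction", "challenge": 4, "success": True},
    {"category": "Policy Enforcement / Restriction", "challenge": 3, "success": True},
]


def success_rate_by(records: list[dict], key: str) -> dict:
    """Return the fraction of successful conversations for each value of key."""
    totals: dict = defaultdict(int)
    wins: dict = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        wins[record[key]] += int(record["success"])
    return {value: wins[value] / totals[value] for value in totals}


by_category = success_rate_by(results, "category")
by_challenge = success_rate_by(results, "challenge")
print(by_category)
print(by_challenge)
```

Grouping by `category` gives the category-based view and grouping by `challenge` gives the complexity-based view, so the same helper serves both reports.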