In today’s rapidly evolving AI landscape, effective observability into agent systems has become a critical requirement for enterprise applications. This technical guide explores the newly announced integration between Arize AI and Amazon Bedrock Agents, which provides developers with powerful capabilities for tracing, evaluating, and monitoring AI agent applications.
Understanding the Integration Components
Amazon Bedrock Agents provides a fully managed framework for building AI agents capable of understanding natural language requests, breaking down complex tasks, retrieving information, and taking actions across enterprise systems and APIs. The framework streamlines agent orchestration, allowing developers to focus on designing robust agent capabilities.
Arize AI delivers comprehensive observability tools specifically designed for AI applications. The platform is available in two versions:
- Arize AX: An enterprise solution offering advanced monitoring capabilities
- Arize Phoenix: An open-source platform making tracing and evaluation accessible to all developers
The integration between Arize AI and Amazon Bedrock Agents delivers three primary benefits:
- Comprehensive Traceability: Gain visibility into every step of your agent’s execution path, from initial user query through knowledge retrieval and action execution
- Systematic Evaluation Framework: Apply consistent evaluation methodologies to measure and understand agent performance
- Data-Driven Optimization: Run structured experiments to compare different agent configurations and identify optimal settings
Technical Implementation Guide
This walkthrough will focus on using Phoenix, Arize’s open-source platform, to trace and evaluate an Amazon Bedrock Agent.
Prerequisites
To follow this tutorial, you will need:
- An AWS account with access to Bedrock
- A Phoenix API key (available at app.phoenix.arize.com)
Step 1: Install Required Dependencies
Begin by installing the necessary libraries:
!pip install -q arize-phoenix-otel boto3 anthropic openinference-instrumentation-bedrock
Next, import the required modules:
import os
import time
from getpass import getpass
import boto3
import nest_asyncio
from phoenix.otel import register
nest_asyncio.apply()
Step 2: Configure Phoenix Environment
This tutorial uses the Phoenix Cloud environment; Phoenix can also be self-hosted on AWS (a self-hosted configuration sketch follows the snippet below).
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
if not os.environ.get("PHOENIX_CLIENT_HEADERS"):
    os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=" + getpass("Enter your Phoenix API key: ")
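If you self-host Phoenix rather than using Phoenix Cloud, only the collector endpoint changes; the endpoint below is illustrative (Phoenix listens on port 6006 by default), and the API-key header can be skipped unless authentication is enabled on your instance:
# Self-hosted alternative: point the collector at your own Phoenix instance
# instead of Phoenix Cloud. The URL below is illustrative.
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "http://localhost:6006"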
Connect your notebook to Phoenix with auto-instrumentation enabled:
project_name = "Amazon Bedrock Agent Example"
tracer_provider = register(project_name=project_name, auto_instrument=True)
The auto_instrument parameter automatically locates the openinference-instrumentation-bedrock library and instruments all Bedrock and Bedrock Agent calls without requiring additional configuration.
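If you prefer explicit control over instrumentation, the same tracing can be enabled manually; a minimal sketch, assuming the standard OpenInference instrumentor interface:
from openinference.instrumentation.bedrock import BedrockInstrumentor

# Attach the Bedrock instrumentor to the tracer provider registered above.
# With only the Bedrock instrumentation installed, this is roughly equivalent
# to passing auto_instrument=True to register().
BedrockInstrumentor().instrument(tracer_provider=tracer_provider)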
Step 3: Create a Bedrock Agent
In the AWS console, create a new Bedrock agent. For this demonstration, you can create an agent with:
- A knowledge base created using the web scraper tool to gather information about Phoenix
- Action group functions that retrieve information about Phoenix traces and experiments
Bedrock Agents also supports Guardrails, Prompts, and other features, all of which are automatically traced by Phoenix.
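If you would rather script this setup than click through the console, the agent can also be created with the boto3 bedrock-agent control-plane client. A minimal sketch, assuming you already have an IAM role for the agent; the agent name, model ID, instruction, and role ARN below are placeholders:
import boto3

# Control-plane client (distinct from bedrock-agent-runtime, which invokes agents).
bedrock_agent = boto3.Session(profile_name="phoenix").client("bedrock-agent", region_name="us-east-2")

created = bedrock_agent.create_agent(
    agentName="phoenix-docs-agent",  # placeholder name
    foundationModel="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model ID
    instruction="Answer questions about Arize Phoenix traces, experiments, and datasets.",
    agentResourceRoleArn="arn:aws:iam::123456789012:role/BedrockAgentRole",  # placeholder role
)
agent_id = created["agent"]["agentId"]

# After adding your knowledge base and action groups, prepare the agent and
# create an alias so it can be invoked at runtime.
bedrock_agent.prepare_agent(agentId=agent_id)
alias = bedrock_agent.create_agent_alias(agentId=agent_id, agentAliasName="prod")
agent_alias_id = alias["agentAlias"]["agentAliasId"]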
Step 4: Connect to AWS and Configure Agent Parameters
Create an AWS SSO profile with appropriate permissions to access Bedrock agents:
# SSO Profile Configuration
PROFILE_NAME = "phoenix" # Replace with your AWS SSO profile name
REGION = "us-east-2" # Replace with your Bedrock agent region
SERVICE_NAME = "bedrock-agent-runtime" # Service name for Bedrock agent
# Bedrock Agent Configuration
AGENT_ID = "" # Enter your agent ID from the Bedrock Agents console
AGENT_ALIAS_ID = "" # Enter your agent alias ID from the Bedrock Agents console
Initialize the AWS client with your profile:
session = boto3.Session(profile_name=PROFILE_NAME)
bedrock_agent_runtime = session.client(SERVICE_NAME, region_name=REGION)
Step 5: Run Your Agent with Tracing Enabled
Create a function to run your agent and capture its outputs:
def run(input_text):
    session_id = f"default-session1_{int(time.time())}"
    attributes = dict(
        inputText=input_text,
        agentId=AGENT_ID,
        agentAliasId=AGENT_ALIAS_ID,
        sessionId=session_id,
        enableTrace=True,
    )
    response = bedrock_agent_runtime.invoke_agent(**attributes)

    # Stream the response, printing text chunks and trace events as they arrive
    for event in response["completion"]:
        if "chunk" in event:
            chunk_data = event["chunk"]
            if "bytes" in chunk_data:
                output_text = chunk_data["bytes"].decode("utf8")
                print(output_text)
        elif "trace" in event:
            print(event["trace"])
Test your agent with a few sample queries:
run("Tell me about my recent Phoenix traces")
run("How do I run evaluations in Arize Phoenix?")
run("Tell me about my recent Phoenix experiments")
After executing these commands, you should see your agent’s responses in the notebook output. Phoenix instrumentation automatically captures detailed traces of these interactions, including all knowledge base lookups, orchestration steps, and tool calls.
Step 6: View Your Traces in Phoenix
Navigate to your Phoenix dashboard to view the captured traces. You’ll see a comprehensive visualization of each agent invocation, including:
- The full conversation context
- Knowledge base queries and results
- Tool/action group calls and responses
- Agent reasoning and decision-making steps

Step 7: Add Evaluations to Your Agent Traces
Now that you’ve traced your agent, the next step is to add evaluations to measure its performance. A common evaluation metric for agents is function calling accuracy: how reliably the agent chooses the right tool for a given request.
To implement evaluations, install the full Phoenix library:
!pip install -q arize-phoenix
Import the necessary evaluation components:
import json
import phoenix as px
from phoenix.evals import (
    TOOL_CALLING_PROMPT_RAILS_MAP,
    TOOL_CALLING_PROMPT_TEMPLATE,
    BedrockModel,
    llm_classify,
)
from phoenix.trace import SpanEvaluations
from phoenix.trace.dsl import SpanQuery
Query your Phoenix traces to pull the data needed for evaluation. The query below selects all LLM spans from your Phoenix project.
query = (
    SpanQuery()
    .where(
        # Filter for the `LLM` span kind.
        "span_kind == 'LLM'",
    )
    .select(
        question="input.value",
        outputs="output.value",
    )
)
trace_df = px.Client().query_spans(query, project_name=project_name)
Next, you need to prepare these traces into a dataframe with columns for input, tool call, and tool definitions. Parse the JSON input and output data to create these columns:
# Parse input questions from JSON
trace_df["question"] = trace_df["question"].apply(
    lambda x: json.loads(x).get("messages", [{}])[0].get("content", "") if isinstance(x, str) else x
)

# Function to extract tool call names from the output
def extract_tool_calls(output_value):
    tool_calls = []
    try:
        o = json.loads(output_value)
        # Check if the output has 'content' which is a list of message components
        if "content" in o and isinstance(o["content"], list):
            for item in o["content"]:
                # Check if this item is a tool_use type
                if isinstance(item, dict) and item.get("type") == "tool_use":
                    # Extract the name of the tool being called
                    tool_name = item.get("name")
                    if tool_name:
                        tool_calls.append(tool_name)
    except (json.JSONDecodeError, TypeError, AttributeError):
        pass
    return tool_calls

# Apply the function to each row
trace_df["tool_call"] = trace_df["outputs"].apply(
    lambda x: extract_tool_calls(x) if isinstance(x, str) else []
)

# Filter to only include traces with tool calls
trace_df = trace_df[trace_df["tool_call"].apply(lambda x: len(x) > 0)]
Add tool definitions for evaluation:
trace_df["tool_definitions"] = (
    "phoenix-traces retrieves the latest trace information from Phoenix, "
    "phoenix-experiments retrieves the latest experiment information from Phoenix, "
    "phoenix-datasets retrieves the latest dataset information from Phoenix"
)
Now, with your dataframe prepared, you can use Phoenix’s built-in LLM-as-a-Judge template for tool calling to evaluate your app. Run the tool calling evaluation:
rails = list(TOOL_CALLING_PROMPT_RAILS_MAP.values())

eval_model = BedrockModel(session=session, model_id="anthropic.claude-3-5-haiku-20241022-v1:0")

response_classifications = llm_classify(
    data=trace_df,
    template=TOOL_CALLING_PROMPT_TEMPLATE,
    model=eval_model,
    rails=rails,
    provide_explanation=True,
)

response_classifications["score"] = response_classifications.apply(
    lambda x: 1 if x["label"] == "correct" else 0, axis=1
)
Finally, log the evaluation results to Phoenix:
px.Client().log_evaluations(
    SpanEvaluations(eval_name="Tool Calling Eval", dataframe=response_classifications),
)
After running these commands, you will see your evaluation results in the Phoenix dashboard, providing insights into how effectively your agent is using its available tools.

Next Steps
With the foundation established, consider these advanced implementation options:
- Expand Agent Capabilities: Extend your Bedrock agent by attaching custom tools and Lambda functions
- Enhance Evaluation Framework: Develop more sophisticated evaluation metrics tailored to your specific use case
- Implement Experimentation Workflows: Use Phoenix’s Experiments feature to compare different prompt strategies and agent configurations (see the sketch after this list)
- Optimize Prompt Engineering: Utilize Phoenix’s Prompts module to systematically improve agent instructions
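For example, an experimentation workflow might look like the sketch below. The dataset name, test queries, and evaluator are illustrative; it assumes the run function from Step 5 is adapted to return the agent’s final response text instead of printing it; and the phoenix.experiments API may vary slightly between versions:
import pandas as pd
import phoenix as px
from phoenix.experiments import run_experiment

# Illustrative dataset pairing test questions with the tool we expect the agent to use.
dataset = px.Client().upload_dataset(
    dataset_name="bedrock-agent-tool-calls",  # hypothetical dataset name
    dataframe=pd.DataFrame(
        {
            "question": [
                "Tell me about my recent Phoenix traces",
                "Tell me about my recent Phoenix experiments",
            ],
            "expected_tool": ["phoenix-traces", "phoenix-experiments"],
        }
    ),
    input_keys=["question"],
    output_keys=["expected_tool"],
)

# Task: invoke the Bedrock agent for each dataset example. Assumes run() has been
# adapted to return the agent's final response text rather than print it.
def task(input):
    return run(input["question"])

# Crude evaluator: does the expected tool name appear in the agent's final answer?
def references_expected_tool(output, expected) -> bool:
    return expected["expected_tool"] in (output or "")

run_experiment(
    dataset,
    task,
    evaluators=[references_expected_tool],
    experiment_name="baseline-agent-config",  # hypothetical experiment name
)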
Conclusion
The integration between Arize AI and Amazon Bedrock Agents represents a significant advancement for organizations developing and deploying AI agent applications. By combining Bedrock’s robust agent development capabilities with Arize’s comprehensive observability tools, developers can build more reliable, transparent, and high-performing AI agents.
For additional resources on Agents and Evaluation, visit the Arize website.