Building an AI observability platform at massive scale and complexity fundamentally changes what support looks like.
Customers rarely arrive with a clear root cause. Instead, they come with symptoms, while the actual issue may span distributed services, customer-specific configuration, third-party LLM providers, ingestion pipelines, or other layers of the platform. Debugging that kind of issue requires more than fast ticket responses. It requires fast access to the right technical context.
At Arize, we have been rebuilding our support operations to match that reality.
Last quarter, Arize handled thousands of support conversations across hundreds of customer accounts, and ticket volume kept rising month over month as new customers came on board.
At the same time, median support resolution time roughly halved from one month to the next:
- February: 22 hours
- March: 11 hours
- April: 5 hours
- Current: roughly 2.5 hours
That improvement did not come from replacing support engineers with bots. It came from building AI-native internal workflows that reduce the manual work required to investigate, escalate, and resolve technical issues.
The real bottleneck was context gathering
For complex production AI issues, the slowest part of support is often not the fix itself but getting enough context to know where to look.
A support engineer may need to:
- read through Slack threads
- understand the customer account and environment
- inspect traces
- search application logs
- reproduce the issue
- correlate timestamps across tools
- identify the affected user, workflow, or model call
- summarize the issue for engineering
- create a GitHub issue with enough detail for someone else to act
That work is necessary, but also repetitive. It burns time, increases cognitive load, and makes every escalation depend on how much context a person can manually gather and communicate.
So we started packaging those investigation patterns into internal agent skills.
Internal skills as support infrastructure
At Arize, internal skills are reusable workflows that help teams gather context, run investigations, and move work across systems without rebuilding the process from scratch every time.
“We intentionally made our skills composable so engineers don’t have to reinvent the wheel every time they want to add something new,” says Chris Hendel, head of engineering at Arize. “We even have a ‘create skill’ skill to leverage that composability, bake in security best practices, and help with ongoing maintenance.”
A support engineer can start from a ticket link, customer account, issue ID, affected user, or timeframe. From there, a skill can pull the relevant context into the agentic coding environment where the engineer is already working, such as Cursor, Claude Code, or Codex.
Instead of manually copying details across tools, the workflow can collect the background needed to begin debugging:
- customer account context
- ticket history
- Slack discussion
- relevant traces
- application logs
- reproduction details
- known related issues
- suspected failure points
- links to source artifacts
That does not remove the engineer from the workflow, but it does give them a better starting point.
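To make the idea of composable skills concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Arize's implementation: the `InvestigationContext` fields mirror the list above, while the `Skill` type, the `compose` helper, and the two example skills are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class InvestigationContext:
    """Background gathered before a human starts debugging."""
    ticket_id: str
    account: str = ""
    ticket_history: list[str] = field(default_factory=list)
    slack_threads: list[str] = field(default_factory=list)
    trace_links: list[str] = field(default_factory=list)
    log_excerpts: list[str] = field(default_factory=list)
    related_issues: list[str] = field(default_factory=list)
    suspected_failure_points: list[str] = field(default_factory=list)


# In this sketch a "skill" is a function that enriches the context and returns it.
Skill = Callable[[InvestigationContext], InvestigationContext]


def compose(*skills: Skill) -> Skill:
    """Chain skills left to right into one reusable workflow."""
    def run(ctx: InvestigationContext) -> InvestigationContext:
        for skill in skills:
            ctx = skill(ctx)
        return ctx
    return run


# Hypothetical skills: real ones would call ticketing, Slack, and tracing APIs.
def load_account(ctx: InvestigationContext) -> InvestigationContext:
    ctx.account = f"account-for-{ctx.ticket_id}"
    return ctx


def load_traces(ctx: InvestigationContext) -> InvestigationContext:
    ctx.trace_links.append(f"traces?account={ctx.account}")
    return ctx


gather_context = compose(load_account, load_traces)
ctx = gather_context(InvestigationContext(ticket_id="TICKET-123"))
print(ctx.account, ctx.trace_links)
```

Because each skill takes and returns the same context object, a new skill can be added to a workflow without changing the ones that already exist, which is the property the composability quote describes.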
As Arize’s Head of Global Customer Experience Trevor LaViale put it: “These AI workflows really, really significantly reduce the mental load that application engineers, forward-deployed engineers, and customer support members need to do. We can kick off workflows that help us figure out an issue within minutes rather than within a couple of hours.”
From ticket link to investigation workspace
One of the most useful patterns is simple: start with the support artifact, then build the investigation context automatically.
Previously, a support engineer might open a ticket, search Slack, find the customer account, look up traces, inspect logs, reproduce the issue, and then summarize everything manually for engineering.
Now, a workflow can do much of that setup directly from a ticket link or issue ID.
The workflow pulls the relevant customer context, gathers debugging artifacts, and prepares the investigation inside the tools engineers already use. That means the first human pass starts with the right background instead of a blank page.
For an AI observability platform, the initial bug report is rarely enough on its own. Resolving issues often requires understanding how data moved through the platform — from ingestion and traces to evaluations, model outputs, retrieval behavior, and customer-specific configuration — and identifying where behavior diverged from expectations.
Debugging workflows that move from symptoms to likely causes
Arize also built debugging skills that retrieve logs, traces, and reproduction details using a small set of inputs, such as:
- customer account
- affected user
- approximate timeframe
- relevant workflow or feature
- ticket or issue ID
From there, the workflow can chain together the steps an engineer would otherwise perform manually: search logs, inspect traces, correlate errors, identify relevant code paths, and surface likely root causes.
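As a rough illustration of the "search logs, correlate errors, surface likely root causes" chain, the sketch below filters log records to one account and user around a reported time, then ranks components by error count. The `LogRecord` shape, the `likely_causes` function, the 15-minute window, and the sample records are all assumptions made for this example, not Arize's log schema or tooling.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LogRecord:
    timestamp: datetime
    account: str
    user: str
    level: str
    component: str
    message: str


def likely_causes(records, account, user, around, window=timedelta(minutes=15)):
    """Rank components by error count for one user near the reported time."""
    start, end = around - window, around + window
    errors = [
        r for r in records
        if r.account == account
        and r.user == user
        and r.level == "ERROR"
        and start <= r.timestamp <= end
    ]
    # most_common() sorts by count, highest first.
    return Counter(r.component for r in errors).most_common()


# Made-up sample data for the sketch.
t0 = datetime(2024, 1, 1, 12, 0)
records = [
    LogRecord(t0, "acme", "u1", "ERROR", "ingestion", "span batch rejected"),
    LogRecord(t0 + timedelta(minutes=2), "acme", "u1", "ERROR", "ingestion", "span batch rejected"),
    LogRecord(t0 + timedelta(minutes=3), "acme", "u1", "ERROR", "evals", "timeout"),
    LogRecord(t0 + timedelta(minutes=4), "acme", "u1", "INFO", "evals", "retry ok"),
    LogRecord(t0 + timedelta(hours=3), "acme", "u1", "ERROR", "evals", "timeout"),
    LogRecord(t0 + timedelta(minutes=1), "other", "u9", "ERROR", "ingestion", "span batch rejected"),
]

print(likely_causes(records, "acme", "u1", around=t0 + timedelta(minutes=5)))
# → [('ingestion', 2), ('evals', 1)]
```

A ranking like this is only a starting hypothesis. An engineer still has to read the underlying traces and code paths to confirm the cause.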
The goal is to compress the time between “something is wrong” and “the right person is looking at the right evidence.”
This has changed the shape of support investigations at Arize. Workflows that previously took hours can now often be completed in minutes because engineers are no longer starting every investigation by manually collecting context across systems.
Better escalations between support and engineering
The same workflows also improved handoffs.
Once an investigation is complete, the workflow can generate a concise summary, attach relevant traces and logs, and prepare a structured GitHub issue for engineering.
That means escalations are less likely to arrive as vague summaries like “customer is seeing bad behavior.” Instead, they can include:
- what the customer experienced
- when it happened
- which account, user, or workflow was affected
- what evidence has already been gathered
- links to traces and logs
- reproduction steps if available
- suspected root cause
- what still needs engineering review
This makes support faster and engineering time more effective. The receiving engineer does not have to reconstruct the entire story from Slack, tickets, and screenshots before deciding what to do next.
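One way to picture a structured escalation is a small template that turns the fields listed above into an issue body. The sketch below is hypothetical: the `Escalation` class, the `render_issue_body` function, the section headings, and the sample values are invented for illustration, and nothing here calls the GitHub API.

```python
from dataclasses import dataclass, field


@dataclass
class Escalation:
    title: str
    experienced: str
    occurred_at: str
    affected: str
    evidence: list[str] = field(default_factory=list)
    links: list[str] = field(default_factory=list)
    repro_steps: list[str] = field(default_factory=list)
    suspected_root_cause: str = ""
    needs_review: str = ""


def _bullets(items):
    """Render a list as Markdown bullets, or a placeholder when it is empty."""
    return "\n".join(f"- {item}" for item in items) if items else "_None provided_"


def render_issue_body(e: Escalation) -> str:
    """Build a Markdown issue body with one section per escalation field."""
    sections = [
        ("What the customer experienced", e.experienced),
        ("When it happened", e.occurred_at),
        ("Affected account, user, or workflow", e.affected),
        ("Evidence gathered", _bullets(e.evidence)),
        ("Traces and logs", _bullets(e.links)),
        ("Reproduction steps", _bullets(e.repro_steps)),
        ("Suspected root cause", e.suspected_root_cause or "_Unknown_"),
        ("Needs engineering review", e.needs_review or "_Not specified_"),
    ]
    return "\n\n".join(f"## {heading}\n{body}" for heading, body in sections)


# Made-up example values.
issue = Escalation(
    title="Spans missing after ingestion",
    experienced="Traces from one workflow stopped appearing in the UI.",
    occurred_at="Approximate window supplied by the customer",
    affected="Hypothetical account 'acme', workflow 'checkout-agent'",
    evidence=["Ingestion errors found in application logs"],
    links=["<trace link>", "<log query link>"],
    suspected_root_cause="Batch rejected by schema validation",
    needs_review="Confirm whether the validation rule changed recently",
)
body = render_issue_body(issue)
print(f"# {issue.title}\n\n{body}")
```

Keeping every section present, even when it only holds a placeholder, makes gaps in the investigation visible to the receiving engineer instead of silently omitting them.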
“These workflows have dramatically improved how our customer-facing teams hand issues off to engineering,” Chris says. “We’re getting better context, more focused investigations, and often the root cause is already included in the ticket. That allows our engineers to spend less time diagnosing and more time fixing.”
Humans stay in the loop
A key design choice is that Arize always keeps people in the support loop.
The point of these workflows is not to automate customer relationships or push users into a bot-only support path. Enterprise AI support requires judgment, communication, and technical ownership.
The AI workflows reduce the operational overhead around that work. They help support engineers, application engineers, and product engineers spend less time gathering context and more time resolving the actual issue.
That distinction matters. In production AI, customers need partners who can reason through model behavior, system design, data pipelines, and application logic. Automation helps most when it gives technical teams more leverage without removing accountability.
Dogfooding Arize on our own support agents
We also trace these internal skills and agentic workflows into Arize.
That gives us visibility into how the workflows behave in practice:
- Which skills are used most often?
- Where do agents get stuck?
- Which workflows loop or call the wrong tools?
- Which steps still require too much manual intervention?
- Which skills improve resolution time?
- Which workflows need better evals, prompts, or tool access?
As Trevor explained: “We’re big believers in dogfooding our own product. Every one of these skills and agentic workflows is traced directly into Arize so we can understand which skills are working well, where agents get stuck, whether workflows are looping, and what needs improvement. We’re continuously using Arize to improve the very systems that power our support operations.”
This creates the same feedback loop we believe every production AI team needs: observe the workflow, evaluate where it fails, improve the system, and repeat.
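Two of the questions above, which skills are used most and which workflows loop, can be answered from trace data with very little code. The sketch below assumes a simplified span format of `(run_id, skill, tool)` tuples, already ordered by call time and grouped by run. The span format, the skill and tool names, and the threshold of three repeated calls are all hypothetical, not Arize's trace schema.

```python
from collections import Counter
from itertools import groupby

# Made-up spans: (run_id, skill, tool), in call order and contiguous per run.
spans = [
    ("run-1", "debug_ticket", "search_logs"),
    ("run-1", "debug_ticket", "fetch_traces"),
    ("run-2", "debug_ticket", "search_logs"),
    ("run-2", "debug_ticket", "search_logs"),
    ("run-2", "debug_ticket", "search_logs"),
    ("run-3", "escalate", "create_issue"),
]


def skill_usage(spans):
    """Count how many distinct runs invoked each skill."""
    runs_per_skill = {(run_id, skill) for run_id, skill, _ in spans}
    return Counter(skill for _, skill in runs_per_skill)


def looping_runs(spans, threshold=3):
    """Return run IDs where one tool is called `threshold` or more times in a row."""
    flagged = []
    for run_id, run_spans in groupby(spans, key=lambda s: s[0]):
        tools = [tool for _, _, tool in run_spans]
        # Length of the longest streak of identical consecutive tool calls.
        longest = max((len(list(group)) for _, group in groupby(tools)), default=0)
        if longest >= threshold:
            flagged.append(run_id)
    return flagged


print(skill_usage(spans).most_common())  # → [('debug_ticket', 2), ('escalate', 1)]
print(looping_runs(spans))               # → ['run-2']
```

A consecutive-call streak is a crude loop signal. Legitimate retries can trigger it, so a flagged run is a prompt to inspect the trace rather than proof of a bug.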
What this means for teams running AI in production
Production AI support has to account for failure modes that traditional support systems were not designed around.
A customer might need help with an error message, or with understanding why an agent selected a tool, why retrieval returned the wrong context, why latency spiked, why an eval regressed, or why a workflow behaved differently for one user than for another.
That requires a support model with:
- fast access to technical context
- shared visibility across support and engineering
- clear escalation paths
- strong debugging workflows
- observability across agent behavior
- humans who can reason through complex production systems
For Arize, internal skills became a way to operationalize that model.
They helped reduce repetitive work, improve escalation quality, and make support investigations faster without lowering the technical bar.
The result is a support organization that scales more like an engineering system: instrumented, iterative, and built around tight feedback loops.
The future of support is operational leverage
As AI moves deeper into production, support teams will need to work differently.
The best support organizations will not simply respond faster. They will build systems that make every investigation easier to start, easier to escalate, and easier to learn from.
That is the direction Arize is building toward: support workflows that combine human technical judgment with AI-native operational infrastructure.
The headline metric is resolution time. But the lesson is that better tooling, stronger feedback loops, and well-designed internal skills can change how support organizations actually run.