
o1-preview Time Series Evaluations

Aparna Dhinakaran

Co-founder & Chief Product Officer

Time series anomaly detection is one of the most challenging tasks we tackle at Arize. Using large language models (LLMs) for time series analysis, especially in our AI co-pilot assistant, has proven invaluable for uncovering intricate patterns, correlations, and potential issues in complex datasets. In this post, we’ll dive into the results of our recent evaluation of several LLMs, showing how these models measure up in detecting anomalies across vast time series data.

The Challenge

Our latest evaluation focused on the o1-preview model, which performed well in terms of accuracy but had limitations in processing speed. For context, we analyzed hundreds of time series, each representing a metric over time (in JSON format) for a different global city. The models were asked to detect significant deviations in these metrics, identify the affected time series (city), and specify the date of any detected anomaly.
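
To make the setup concrete, here is a minimal sketch of what an input payload for this task might look like. The structure follows the JSON format described above, but the helper names, example values, and prompt wording are illustrative assumptions, not the exact prompt used in our evaluation.

```python
import json

# Hypothetical payload: one daily metric series per city (illustrative values).
series_by_city = {
    "Tokyo":  {"2024-09-17": 0.12, "2024-09-18": 0.11, "2024-09-19": 0.13, "2024-09-20": 0.71},
    "Berlin": {"2024-09-17": 0.09, "2024-09-18": 0.10, "2024-09-19": 0.08, "2024-09-20": 0.09},
}

# The model must name the affected series (city) and the date of any deviation.
prompt = (
    "You are given daily metric values for several cities as JSON. "
    "Report any significant deviations as (city, date) pairs.\n\n"
    + json.dumps(series_by_city, indent=2)
)
```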

This is no easy feat. The models had to:

  1. Analyze patterns across large context windows while handling multiple metrics.
  2. Perform mathematical calculations on time series data (something LLMs typically struggle with).
  3. Avoid false positives by attributing each anomaly to the correct city, without conflating data from unrelated cities.

Given these demands, o1-preview significantly outperformed other models in anomaly detection—marking a leap forward for time series analysis in LLMs.

Evaluation Results: How Did Each Model Perform?

We conducted tests across four different context window sizes, measuring how well each model detected anomalies within each setup.
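
Scoring is conceptually simple: compare the (city, date) pairs a model reports against the anomalies we injected, then compute the fraction it found at each context-window size. The sketch below is our own illustration of that bookkeeping, not Arize's internal harness.

```python
def detection_rate(reported: set[tuple[str, str]],
                   ground_truth: set[tuple[str, str]]) -> float:
    """Fraction of injected anomalies the model actually found."""
    if not ground_truth:
        return 0.0
    return len(reported & ground_truth) / len(ground_truth)

# Example: 2 of 2 injected anomalies found at one window size -> 100% detection.
truth = {("Tokyo", "2024-09-20"), ("Lagos", "2024-09-03")}
found = {("Tokyo", "2024-09-20"), ("Lagos", "2024-09-03")}
print(detection_rate(found, truth))  # 1.0
```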

o1-Preview Results:

  • 110k context window: Detected 85% of anomalies
  • 80k context window: Detected 80% of anomalies
  • 56k context window: Detected 95% of anomalies
  • 30k context window: Detected 100% of anomalies

Claude-Sonnet Results:

  • 110k context window: Detected 55% of anomalies
  • 80k context window: Detected 75% of anomalies
  • 56k context window: Detected 85% of anomalies
  • 30k context window: Detected 60% of anomalies

o1-Mini Results:
o1-Mini struggled with this task, detecting only 20-45% of anomalies depending on the context window size.

Clearly, o1-preview’s near-perfect accuracy at the smaller context windows (95% at 56k, 100% at 30k) highlights its potential, though speed remains a challenge. For example, o1-preview currently takes around 2-3 minutes per query, a significant drawback for a real-time application like Arize Co-pilot.

Why These Results Matter for Arize Co-pilot

Our AI co-pilot assistant has a dedicated skill for time series analysis, which is crucial for detecting anomalies in complex data. The enhanced detection accuracy of o1-preview could improve our product’s ability to identify subtle patterns and anomalies. However, the long processing time isn’t practical for regular use today. As a solution, we’re exploring an opt-in mode for users who prioritize accuracy over speed, as well as an option to selectively swap in stronger models for tougher analysis tasks (sketched below).
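
A minimal sketch of that kind of routing, assuming a per-request accuracy preference and a difficulty label; the model names and flags here are illustrative, not our production configuration.

```python
def pick_model(prefer_accuracy: bool, difficulty: str) -> str:
    """Route hard, accuracy-critical analyses to the slower but stronger model."""
    if prefer_accuracy and difficulty == "hard":
        return "o1-preview"   # highest detection rate, ~2-3 min per query
    return "claude-sonnet"    # faster default for interactive use

print(pick_model(prefer_accuracy=True, difficulty="hard"))  # o1-preview
```

In practice, the difficulty label could come from simple heuristics, such as the number of series involved or the size of the context window required.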

Future versions of o1-preview with optimized latency could unlock an even more powerful anomaly detection tool for Arize Co-pilot, paving the way for smarter, faster debugging of time series data.

The “Thought Process” of o1-Preview

One of the unique aspects of o1-preview is its “thought process” feature—a behind-the-scenes view of how the model maps out daily changes, spots anomalies, and sets boundaries for data patterns. While this insight doesn’t directly impact anomaly detection accuracy, it adds a layer of transparency and helps users verify if the model’s decision-making aligns with their expectations.

How Other Models Compare: Claude-Sonnet and Claude-Opus

In previous tests, Anthropic’s Claude models, such as Opus and Sonnet, consistently performed well in time series anomaly detection. However, our latest, more aggressive evaluation method (using decimal values in the range of 0 to 1) revealed that while Claude-Sonnet and Claude-Opus still provide reliable anomaly detection, o1-preview’s edge in accuracy at larger context windows is noticeable.

For example:

  • Claude-Opus achieved 50-90% detection rates across context windows, depending on the window size.
  • Claude-Sonnet ranged from 55-85% in anomaly detection.

This reinforces the potential for model selection based on task complexity—swapping to o1-preview or similar models for challenging anomaly detection tasks when accuracy is paramount.
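
As a rough sketch of what the “more aggressive” test data mentioned above might look like, the generator below produces a series of decimal values in the 0-1 range with a single injected spike. This is our assumption about the general shape of such data, not the exact procedure used in the evaluation.

```python
import random

def make_series(days: int, spike_day: int) -> dict[str, float]:
    """Baseline noise in roughly [0.05, 0.20] with one injected spike near 0.6-0.9."""
    series = {f"2024-09-{d:02d}": round(random.uniform(0.05, 0.20), 2)
              for d in range(1, days + 1)}
    series[f"2024-09-{spike_day:02d}"] = round(random.uniform(0.60, 0.90), 2)  # anomaly
    return series

print(make_series(days=10, spike_day=4))
```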

What the Data Looks Like

To give a sense of what these models are analyzing, here’s an example of a JSON data snippet:

{"2024-09-17": 0.17, "2024-09-18": 0.09, "2024-09-19": 0.11, "2024-09-20": 0.66, "2024-09-21": 0.1}

In this example, a spike of 0.66 on 2024-09-20 is the detected anomaly. The models analyze such numeric patterns, looking for spikes or other irregularities over a given period.
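
For intuition on why 0.66 stands out, a conventional statistical baseline flags it easily, for example a robust z-score based on the median absolute deviation. This illustrates the statistics of the spike, not how the LLM actually reasons.

```python
import statistics

series = {"2024-09-17": 0.17, "2024-09-18": 0.09, "2024-09-19": 0.11,
          "2024-09-20": 0.66, "2024-09-21": 0.10}

values = list(series.values())
med = statistics.median(values)
mad = statistics.median([abs(v - med) for v in values])  # median absolute deviation

# Flag points whose robust z-score exceeds the common 3.5 cutoff.
for date, v in series.items():
    z = 0.6745 * (v - med) / mad if mad else 0.0
    if abs(z) > 3.5:
        print(f"anomaly on {date}: value={v}, robust z-score={z:.1f}")
# -> anomaly on 2024-09-20: value=0.66, robust z-score=18.5
```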

A Promising Future for o1-Preview

This evaluation marks a significant step forward. We’re now seriously considering o1-preview as a potential replacement for specific high-difficulty tasks within Arize Co-pilot, provided its response time can be improved. The future likely holds opportunities for swapping models in and out based on task difficulty and accuracy requirements.

For now, o1-preview’s performance on high-difficulty time series tasks is encouraging, and future versions seem poised to play a crucial role in our ongoing quest to bring smarter, more precise AI-powered tools to our users.