Large Language Model Performance At Time Series Analysis: GPT-4 versus Claude

Aparna Dhinakaran,  Co-founder & Chief Product Officer  | Published May 05, 2024

This piece is co-authored by Evan Jolley

While it’s clear that LLMs excel in natural language processing tasks, their ability to analyze patterns in non-textual data, such as time series, remains less explored. As more teams rush to deploy LLM-powered solutions without thoroughly testing their capabilities in basic pattern analysis, it’s becoming more important to evaluate the performance of these models in this context.

In this research, we set out to investigate the following question: Given a large set of time series data within the context window, how well can LLMs detect anomalies or movements in the data? In other words, should you trust a stock-picking GPT-4 or Claude 3 agent with your money? To answer this question, we conducted a series of experiments comparing the performance of large language models in detecting anomalous time series patterns.

All code needed to reproduce these results can be found in this GitHub repository.

Research: Methodology

[Figure: how we tested LLM performance on time series]
We tasked GPT-4 and Claude 3 with analyzing changes in data points across time. The data represented specific metrics for different world cities over time and was formatted as JSON before being passed to the models. We introduced random noise, ranging from 20-30% of the data range, to simulate real-world scenarios. The LLMs were asked to detect movements above a specific percentage threshold and to identify the city and date where each anomaly occurred. The data was included in this prompt template:

[Figure: prompt template for the time series test]
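As a rough illustration of the kind of input described above, the following Python sketch builds a small JSON payload of noisy per-city series with injected spikes. It is not the code from the repository: the function name `build_dataset`, the city names, the baseline values, the start date, and the decision to treat each city's baseline as its "data range" are all assumptions made for the example.

```python
import json
import random
from datetime import date, timedelta


def build_dataset(cities, n_days, spike_pct, anomalies, seed=0):
    """Build {city: {iso_date: value}} with random noise and injected spikes.

    `anomalies` maps a city name to the day index that should spike.
    """
    rng = random.Random(seed)
    start = date(2024, 1, 1)  # arbitrary start date for the illustration
    dataset = {}
    for city in cities:
        base = rng.uniform(100.0, 200.0)      # hypothetical baseline level
        noise_frac = rng.uniform(0.20, 0.30)  # noise band: 20-30% of baseline (assumption)
        series = {}
        for day in range(n_days):
            # Noise is centered on zero and spans noise_frac * base in total.
            noise = rng.uniform(-0.5, 0.5) * noise_frac * base
            value = base + noise
            if anomalies.get(city) == day:
                value *= 1 + spike_pct / 100  # inject the spike on this day
            series[(start + timedelta(days=day)).isoformat()] = round(value, 2)
        dataset[city] = series
    return dataset


data = build_dataset(
    cities=["Tokyo", "Lagos", "Lima"],
    n_days=30,
    spike_pct=50,
    anomalies={"Tokyo": 12, "Lima": 21},
)
prompt_payload = json.dumps(data, indent=2)  # text to paste into a prompt template
```

Keeping the generator seeded makes a run repeatable, which helps when comparing two models on exactly the same data.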

Analyzing patterns throughout the context window, detecting anomalies across a large set of time series simultaneously, synthesizing the results, and grouping them by date is no simple task for an LLM; we really wanted to push the limits of these models in this test. Additionally, the models were required to perform mathematical calculations on the time series, a task that language models generally struggle with.
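For comparison, the same task is straightforward to express in ordinary code, which is also one way to produce ground truth for scoring a model's answer. The sketch below flags any value that sits more than a given percentage above a city's median and groups the hits by date. Using the median as the baseline, and the names `detect_anomalies` and `threshold_pct`, are assumptions for illustration; the article does not specify how the expected answers were computed.

```python
from collections import defaultdict
from statistics import median


def detect_anomalies(dataset, threshold_pct):
    """Return {date: [cities]} for values above the city's median by more than threshold_pct."""
    flagged = defaultdict(list)
    for city, series in dataset.items():
        baseline = median(series.values())  # robust to a single spike
        for day, value in series.items():
            change_pct = (value - baseline) / baseline * 100
            if change_pct > threshold_pct:
                flagged[day].append(city)
    # Sort dates and city names so the output is stable.
    return {day: sorted(cities) for day, cities in sorted(flagged.items())}


example = {
    "Tokyo": {"2024-01-01": 100.0, "2024-01-02": 104.0, "2024-01-03": 152.0, "2024-01-04": 98.0},
    "Lima": {"2024-01-01": 80.0, "2024-01-02": 79.0, "2024-01-03": 82.0, "2024-01-04": 81.0},
}
result = detect_anomalies(example, threshold_pct=30)
print(result)  # {'2024-01-03': ['Tokyo']}
```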

We also evaluated the models’ performance under different conditions, such as extending the duration of the anomaly, increasing the percentage of the anomaly, and varying the number of anomaly events within the dataset. We should note that during our initial tests, we encountered an issue: synchronizing the anomalies (having them all occur on the same date) allowed the LLMs to perform better by recognizing the pattern based on the date rather than the data movement. When evaluating LLMs, careful test setup is extremely important to prevent the models from picking up on unintended patterns that could skew results.
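One simple way to avoid the date leak described above is to draw a separate anomaly date for each affected city. The sketch below goes a step further and forces every anomaly onto a distinct day, which is a stricter choice than the article describes; the function name `pick_anomaly_days` and its parameters are hypothetical.

```python
import random


def pick_anomaly_days(cities, n_days, n_events, seed=0):
    """Choose which cities spike and on which day index, with no shared days."""
    if n_events > len(cities) or n_events > n_days:
        raise ValueError("n_events must not exceed the number of cities or days")
    rng = random.Random(seed)
    chosen_cities = rng.sample(cities, n_events)       # cities that get an anomaly
    chosen_days = rng.sample(range(n_days), n_events)  # distinct day indices
    return dict(zip(chosen_cities, chosen_days))


plan = pick_anomaly_days(["Tokyo", "Lagos", "Lima", "Oslo"], n_days=30, n_events=3)
```

Varying `n_events` across runs is one way to reproduce the "number of anomaly events" condition mentioned above.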

GPT-4 versus Claude: Results

[Figure: time series anomaly detection results, OpenAI versus Anthropic]

The results were surprising: Claude 3 Opus significantly outperformed GPT-4 in detecting time series anomalies. It is highly unlikely that this specific evaluation was included in the training set of Claude 3, making its strong performance even more impressive. Claude did very well in our retrieval with generation testing, and the model continues to impress our team.

Results With 50% Spike

Our first set of results is based on data where each anomaly was a 50% spike in the data.

[Figures: GPT-4 time series results; Claude 3 time series results]

Claude 3 outperformed GPT-4 on three of the four 50% spike tests, achieving accuracies of 50%, 75%, 70%, and 60% across different test scenarios. In contrast, GPT-4 Turbo, which we used due to the limited context window of the original GPT-4, struggled with the task, producing results of 30%, 30%, 55%, and 70% across the same tests.

Results With 90% Spike

Claude 3’s dominance continued on the data where each anomaly was a 90% spike.

Claude 3 Opus consistently picked up the time series anomalies better than GPT-4 Turbo, achieving accuracies of 85%, 70%, 90%, and 85% across different test scenarios. If we were actually trusting a language model to analyze data and pick stocks to invest in, we would of course want close to 100% accuracy. However, we were impressed by these results and look forward to seeing how far we can take Claude 3 in further testing. GPT-4 Turbo’s performance was disappointing in these tests as well, ranging from 40% to 50% accuracy in detecting anomalies.

Results With Standard Deviation Pre-Calculated

To assess the impact of mathematical complexity on the models’ performance, we did additional tests where the standard deviation was pre-calculated and included in the data like this:

[Figure: standard deviation included in our prompt]

Since math isn’t a strong suit of large language models at this point, we wanted to see if helping the LLM complete a step of the process would help increase accuracy.
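A minimal sketch of this kind of preprocessing is shown below: it computes each city's standard deviation ahead of time and attaches it to the data before serializing. The field names `std` and `values`, the helper name `add_std`, and the use of the population standard deviation are assumptions; the article does not say which variant was used or how the field was named.

```python
import json
from statistics import pstdev


def add_std(dataset):
    """Return a copy where each city carries a precomputed standard deviation."""
    enriched = {}
    for city, series in dataset.items():
        enriched[city] = {
            # Population standard deviation of the city's values (assumed variant).
            "std": round(pstdev(list(series.values())), 2),
            "values": dict(series),
        }
    return enriched


raw = {
    "Tokyo": {
        "2024-01-01": 2.0, "2024-01-02": 4.0, "2024-01-03": 4.0, "2024-01-04": 4.0,
        "2024-01-05": 5.0, "2024-01-06": 5.0, "2024-01-07": 7.0, "2024-01-08": 9.0,
    }
}
enriched = add_std(raw)
payload = json.dumps(enriched, indent=2)  # enriched["Tokyo"]["std"] is 2.0
```

With the statistic supplied, the model only has to compare each value against a known spread instead of estimating that spread itself.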

[Figure: Claude 3 results with standard deviation included in our prompt]

The change did in fact increase accuracy across three of the four Claude 3 runs that we completed. Seemingly minor changes like this can help LLMs play to their strengths and greatly improve results.

Using LLMs for Time Series Analysis: Discussion and Takeaways

Claude 3 impressed in this experiment, and the model has truly emerged as a GPT-4 competitor. Our evaluation provides concrete evidence of Claude’s capabilities in a domain that requires a complex combination of retrieval, analysis, and synthesis, and the gap in performance between the models underscores the need for comprehensive evaluations before deploying LLMs in high-stakes applications like finance.

This research has significant implications for the use of LLMs in time series analysis, as it demonstrates the potential of these models to perform well in decision-making and data analysis tasks. Our findings also emphasize the importance of careful test design to ensure accurate and reliable results, as data leaks can lead to misleading conclusions about an LLM’s performance.

This is an important area to explore, as going forward, LLMs will be given more and more responsibility to make decisions in contexts where the accuracy and reliability of predictions can have massive consequences. By understanding the strengths and limitations of these models, we can harness their full potential while mitigating the risks associated with their deployment.