In our latest paper reading, we had the pleasure of hosting Grégoire Mialon — Research Scientist at Meta Superintelligence Labs — to walk us through Meta AI’s groundbreaking paper titled “ARE: scaling up agent environments and evaluations.”
Context on the Paper
Meta’s new ARE (Agents Research Environments) is a platform for building time-driven worlds where agents can act, adapt, and be verified. On top of ARE, the team introduces Gaia2, a large, verifiable benchmark of 1,120 scenarios inside a smartphone-like environment (“Mobile”). Unlike static tests, these scenarios run asynchronously, inject events and delays, and evaluate abilities like adaptability, handling ambiguity, collaboration with other agents, and temporal responsiveness. Early results show no model dominates across capability, cost, and latency; stronger reasoning often slows agents down, especially on time-sensitive tasks.
Watch
Dive In
- Gaia2 Benchmark Leaderboard
- More on ARE
- Get notified for future AI research paper readings.
Why Did Meta AI Build ARE and Gaia2?
Grégoire Mialon: We started to work on agents a while ago… there was no good library for building agents, nothing to create environments. Each time you wanted to work on agents, you needed to create a runtime, create tools, and create content — or try to connect actual tools — and things quickly became messy. Even creating tasks was difficult; ideally you want verifiable tasks, but the more complex the environment, the harder it is to find good verification.
Agent evaluation was getting saturated. You had nice evals like τ-bench that test multi-turn tool use, but many tool-use evals weren’t as holistic or complex as we wanted, and they focused mostly on search and execution.
From tasks to scenarios: time & verification in ARE
Grégoire Mialon: ARE is meant to be a platform to create diverse, complex environments. An environment is a set of apps—email, messages, etc.—with read/write tools. While the agent is working, time flows; the state of the world evolves even if the agent doesn’t act. You can connect users and agents, get notifications, and choose to process them or not. We go beyond single ‘tasks’ toward ‘scenarios’—represented as DAGs with user messages, environment events scheduled while the agent works, and conditional branches. Verification is part of this graph: after the agent acts, you can schedule checks and stop early if it failed, saving compute.
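To make the scenario-as-DAG idea concrete, here is a minimal sketch in Python. This is not the ARE API: `Event`, `World`, and `Scenario` are invented names, and real scenarios add user messages, conditional branches, and agent turns. What it illustrates is the key point above: environment events and verification checks share one timeline, so a failed check can stop the run early.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass(order=True)
class Event:
    """A node in the scenario graph: something happens at `time`,
    optionally followed by a verification check."""
    time: float                                   # simulated seconds from start
    action: Callable[["World"], None] = field(compare=False)
    check: Optional[Callable[["World"], bool]] = field(compare=False, default=None)

class World:
    """Mutable environment state (apps, inboxes, ...) that evolves over time."""
    def __init__(self) -> None:
        self.state: dict = {}

class Scenario:
    def __init__(self, events: list[Event]) -> None:
        self.queue = list(events)
        heapq.heapify(self.queue)                 # process events in time order

    def run(self, world: World) -> bool:
        while self.queue:
            event = heapq.heappop(self.queue)
            event.action(world)                   # inject a message / change the world
            if event.check is not None and not event.check(world):
                return False                      # verification failed: stop early
        return True
```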
What is Gaia2?
Grégoire Mialon: We created a new benchmark for agents, Gaia2, because the original Gaia (web browsing) was narrow and getting saturated. We made the benchmark harder not with longer questions but with a richer, more difficult action space in a complex environment where agents can modify the world.
Gaia2 checks write actions, the ones that modify the world (like sending an email), and doesn't explicitly verify pure reads. We check the core actions needed to consider the task completed; it's a design choice that also helps with safety and makes the verification harder to game.
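As a rough illustration of that design choice, write-only verification might look like the sketch below. The tool names are hypothetical, and the exact-match comparison is a deliberate simplification (the paper's verifier matches the agent's writes against annotated oracle actions with more tolerant argument comparison); the point is only the read/write asymmetry.

```python
# Hypothetical write-tool names; read-only tools are never checked.
WRITE_TOOLS = {"send_email", "send_message", "create_event", "delete_contact"}

def verify(agent_log: list[dict], expected_writes: list[dict]) -> bool:
    """Pass iff every expected write action appears in the agent's trace.
    Read-only calls (search, open, list, ...) are ignored entirely."""
    writes = [a for a in agent_log if a["tool"] in WRITE_TOOLS]
    return all(
        any(w["tool"] == e["tool"] and w["args"] == e["args"] for w in writes)
        for e in expected_writes
    )
```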
Beyond search (Gaia) and execution (τ-bench), agents must adapt while working; handle ambiguities that can have real consequences; collaborate with other agents; and deal with time—for example, ‘every five minutes check my inbox and alert me when Joe emails.’
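For that time-based example, here is a sketch of what the agent is effectively being asked to do. `poll_inbox` and `notify_user` are stand-ins for whatever read and write tools the environment actually exposes, not real Gaia2 tool names.

```python
import time

POLL_INTERVAL = 5 * 60  # "every five minutes", in seconds

def watch_inbox(sender: str, poll_inbox, notify_user) -> None:
    """Alert the user whenever a new email from `sender` arrives."""
    seen: set[str] = set()
    while True:                        # a real scenario would bound this loop
        for msg in poll_inbox():       # read tool: list current inbox messages
            if msg["from"] == sender and msg["id"] not in seen:
                seen.add(msg["id"])
                notify_user(f"New email from {sender}: {msg['subject']}")  # write tool
        time.sleep(POLL_INTERVAL)
```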
Gaia2-Time: inference speed, latency, and inverse scaling
Grégoire Mialon: We compared scores when scenarios execute in real time versus when generation is treated as instantaneous, simulating that inference is 'free': every model call is charged a flat one second regardless of its real latency. A model like GPT-5 gets 0 on Time because it's extremely slow: good reactions, but too late, missing a five-minute window by answering at six minutes. With instantaneous generation, GPT-5 jumps to ~34% on Time.
So on the Time capability you see an inverse scaling law: bigger models can perform worse on Time tasks because they're slower. In real life you often need fast answers; this changes what it means to 'serve' a model (best replies or fast replies) and how an API provider guarantees responses arrive in time.
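A toy calculation shows why the two settings diverge. The five-minute window and six-minute answer are taken from the example above; the flat one-second charge mirrors the "instantaneous generation" setting.

```python
WINDOW = 5 * 60  # the scenario's five-minute deadline, in seconds

def answered_in_time(generation_latency_s: float, instantaneous: bool) -> bool:
    # Under the instantaneous setting, real latency is replaced by a flat 1 s.
    elapsed = 1.0 if instantaneous else generation_latency_s
    return elapsed <= WINDOW

# Same six-minute generation, two settings:
print(answered_in_time(6 * 60, instantaneous=False))  # False: scores 0 in real time
print(answered_in_time(6 * 60, instantaneous=True))   # True: latency charged as 1 s
```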
From the paper: Gaia2 consists of 800 unique verifiable scenarios (plus augmentations totaling 1,120) across 10 universes in the Mobile environment, with 101 tools; it evaluates dynamic events, continuous time, and agent-to-agent collaboration.