Every web agent eventually hits the same wall: memory.
Web agents either try to remember every single page, click, and note until the context window is jammed with noise, or they summarize too aggressively, throwing away the one detail that turns out to be critical fifteen steps later. The AgentFold paper, “AgentFold: Long-Horizon Web Agents with Proactive Context Management” (arXiv:2510.24699), is a direct attack on that problem.
Instead of treating history as an append-only log, AgentFold teaches the agent to manage its own memory like a working scratchpad. At each step, the agent decides what to keep in full detail, what to compress, and what to “fold up” into a higher-level takeaway. The result: a web agent that can stay useful across hundreds of steps without needing a massive, expensive model or an absurdly long context window.
In this article, I am going to unpack what this paper is really saying in plain language, connect it to the broader world of agentic AI, and answer the core questions a builder should have: what problem AgentFold actually solves, how its “context folding” works under the hood, what the numbers really say, and what it means for the agents you want to ship.
What Problem Is AgentFold Actually Solving?
AgentFold aims to address the issue that today’s web agents either remember everything and run out of tokens or summarize too much, overlooking what matters.
Let’s take a basic prompt:
“Research three competitor tools, pull pricing, check recent reviews, confirm which ones have a WhatsApp integration, and draft a comparison for my sales deck.”
Most current web agents follow a ReAct-style pattern:
think → browse → read → think → repeat.
Every page they open, every snippet they read, and every intermediate thought gets appended to one giant history and sent back to the model again and again.
Two terrible things happen over long tasks:
- Context saturation — “Conversation history” grows to tens of thousands of tokens. The model has to reread its entire diary on every step. This not only exceeds context limits, but it also obscures key facts (such as “Tool A doesn’t support WhatsApp”) under a lot of irrelevant chatter.
- Throw-away summaries — Some web agents try the opposite trick: instead of keeping everything, they auto-summarize the entire history into a short paragraph at every step. That keeps context cheap, but now you have a different failure mode: the summary might silently drop product details, dates, or edge cases that become crucial 20 steps later.
So the core problem isn’t “web navigation” per se. It’s long-horizon context management:
- Tasks that require dozens or hundreds of actions (e.g., BrowseComp and WideSearch benchmarks) quickly exceed standard context windows.
- Naively scaling up to bigger models and longer contexts makes infra costs explode, without solving the underlying “keep vs compress vs forget” decision.
- Small mistakes early in the trajectory compound because the agent can’t reliably look back at what truly mattered.
AgentFold addresses this by teaching the agent to manage its memory dynamically as it works through a problem. Let’s see how in the next section.
How Does AgentFold’s “Context Folding” Memory System Work?
1. The Agent Has Two Kinds of Memory, Not One
Most agents keep a single log of everything they’ve seen and thought. AgentFold splits this into:
- The question: The original user task (“Compare these tools…”)
- Latest interaction: The complete, detailed record of what just happened (page viewed, snippet read, tool call, etc.)
- Multi-scale state summaries: A running list of compressed “state notes” about what’s been discovered so far
So at any given moment, the agent’s context looks like this:
- Question
- Summary 1: “Finished checking Tool A pricing.”
- Summary 2: “Tool B has no WhatsApp integration.”
- Latest step: “Currently on Tool C’s G2 reviews page…”
The agent always sees the complete latest step, plus a curated stack of past summaries.
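To make this layout concrete, here is a minimal sketch of the three-part context in Python. The class and field names are my own, chosen for illustration; the paper does not prescribe this exact structure.

```python
from dataclasses import dataclass, field

@dataclass
class FoldedContext:
    """Illustrative three-part context: task, compressed notes, latest step."""
    question: str
    summaries: list[str] = field(default_factory=list)  # multi-scale state notes
    latest_interaction: str = ""  # full detail of the most recent step only

    def render(self) -> str:
        """Assemble the prompt the agent sees at each turn."""
        notes = "\n".join(f"- {s}" for s in self.summaries)
        return (
            f"Task: {self.question}\n"
            f"State notes:\n{notes}\n"
            f"Latest step (full detail):\n{self.latest_interaction}"
        )

ctx = FoldedContext(
    question="Compare three competitor tools on pricing and WhatsApp support.",
    summaries=[
        "Finished checking Tool A pricing.",
        "Tool B has no WhatsApp integration.",
    ],
    latest_interaction="Currently on Tool C's G2 reviews page...",
)
print(ctx.render())
```

The important design point: only `latest_interaction` is ever verbose; everything older has already been folded into short notes.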
2. Every Step, It Does Two Things: Think and Fold
On each turn, AgentFold doesn’t just “think and act.” It does:
- Thinking: “Given the question, my summaries, and the latest interaction, what should I do next?”
- Folding directive: “How should I reshape my memory so it stays useful for later steps?”
This part is the key innovation. It’s an explicit, learned instruction about memory, not a hidden side effect.
3. Folding Comes in Two Flavors
AgentFold can fold its context in two different ways:
Granular condensation
- It takes the latest interaction and converts it into a concise, precise note, then adds it to the summary list.
- Example: After reading a pricing page, it might add: “Tool A: $99/mo base, WhatsApp support via add-on.”
Deep consolidation
- It looks back over several past summaries + the latest step and merges them into one higher-level “chapter.”
- Example: After finishing all research on Tool A, it collapses 10 small notes into: “Tool A: mid-range cost, no native WhatsApp, strong reviews on ease of use.”
In the deep consolidation case, the old detailed notes are removed and replaced by this more abstract “chapter.”
So the agent is constantly asking itself: “Is this a new small fact I should store as-is, or have I just finished a mini-project that I can compress into one takeaway?”
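The two folding operations can be sketched as simple list transformations. This is an illustrative model of the idea, not the paper’s implementation; in practice the note and chapter text is generated by the LLM itself.

```python
def condense(summaries: list[str], note: str) -> list[str]:
    """Granular condensation: keep all past notes, append one precise new note."""
    return summaries + [note]

def consolidate(summaries: list[str], start: int, end: int, chapter: str) -> list[str]:
    """Deep consolidation: replace summaries[start:end] with one higher-level chapter."""
    return summaries[:start] + [chapter] + summaries[end:]

notes = [
    "Tool A: $99/mo base",
    "Tool A: WhatsApp support via add-on",
    "Tool A: 4.5 stars on G2",
]
notes = condense(notes, "Opened Tool B homepage.")
notes = consolidate(
    notes, 0, 3,
    "Tool A: mid-range cost, WhatsApp via add-on, strong reviews.",
)
# The three detailed Tool A notes are gone, replaced by one chapter;
# the unrelated Tool B note survives untouched.
print(notes)
```

Condensation grows the list by one short entry per step; consolidation is what keeps the list from growing without bound.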
4. The Result: Context That Grows Slowly, But Stays Smart
Because AgentFold is always folding and consolidating:
- The context length grows sub-linearly, even over 100+ steps — it goes from ~3.5k tokens to ~7k tokens over 100 turns, instead of exploding linearly with every click.
- It can handle up to 500 turns with a 128k context model without becoming overwhelmed by its own history.
In human terms, it’s like maintaining a clean notebook with clear section summaries, instead of scrolling through your entire browser history every time you need to make a decision.
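A toy simulation makes the growth difference tangible. All the constants below are my own assumptions for illustration; they are not the paper’s measurements.

```python
# Toy comparison: append-only history vs. a folding agent that condenses
# each step to a short note and consolidates every 10 notes into one.
TOKENS_PER_PAGE = 800   # raw observation per step (assumed)
TOKENS_PER_NOTE = 20    # condensed note (assumed)
CONSOLIDATE_EVERY = 10  # merge 10 notes into one chapter (assumed)

append_only = 0
note_sizes = []
for step in range(1, 101):
    append_only += TOKENS_PER_PAGE          # append-only keeps every page
    note_sizes.append(TOKENS_PER_NOTE)      # folding keeps a short note
    if step % CONSOLIDATE_EVERY == 0:       # periodically fold 10 notes into 1
        note_sizes = note_sizes[:-CONSOLIDATE_EVERY] + [TOKENS_PER_NOTE]

folded = sum(note_sizes) + TOKENS_PER_PAGE  # notes + the full latest step
print(f"After 100 steps: append-only ≈ {append_only} tokens, folded ≈ {folded} tokens")
```

Under these assumptions the append-only agent carries 80,000 tokens after 100 steps, while the folding agent carries about 1,000. The absolute numbers are made up, but the shape of the curves is the paper’s point.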
5. How Does It Learn This Behavior?
AgentFold learns folding through supervised fine-tuning on trajectories collected from a stronger model:
- A “teacher” LLM not only chooses actions but also produces effective folding directives (i.e., what to condense versus consolidate).
- The authors then fine-tune a 30B model (based on Qwen3-30B-A3B) on these structured trajectories, without requiring RL or any additional pre-training.
So the model isn’t just learning how to browse. It’s learning: “When I finish this kind of subtask, here’s how I should clean up and reorganize my memory.”
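One way to picture the training data: each turn the model is supervised to emit its reasoning, an explicit fold directive, and an action together. The field names and structure below are my own guess at the shape, not the paper’s exact schema.

```python
import json

# Illustrative shape of one supervised turn. The key point: the folding
# directive is an explicit, trained output alongside the action, not a
# hidden side effect of summarization.
turn = {
    "thinking": "Tool A research is done; its three notes can merge into one chapter.",
    "fold": {
        "op": "consolidate",          # vs. "condense" for a single new note
        "range": [0, 3],              # which past summaries to merge
        "summary": "Tool A: mid-range cost, WhatsApp via add-on, strong reviews.",
    },
    "action": {"tool": "visit", "url": "https://toolb.example.com/pricing"},
}
print(json.dumps(turn, indent=2))
```

Because the fold directive appears in every training example, the fine-tuned model treats memory management as part of its job, not an afterthought.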
How does this change the model’s behavior? We’ll see that in the next section.
How Much Better Does AgentFold Perform Than Today’s Leading Open-Source And Proprietary Web Agents?
AgentFold’s 30B model is playing in the same league as (or beating) much larger open-source systems and even potent proprietary agents, including o4-mini.
OpenAI and other research groups have released benchmarks that measure a web agent’s ability to find specific answers on the web. Most of these results are accuracy scores: the fraction of questions the agent answers correctly.
For AgentFold, this looks like:
- BrowseComp (English): 36.2%
- BrowseComp-ZH (Chinese): 47.3%
- WideSearch: 62.1%
- GAIA (text-only subset): 67.0%
For context, DeepSeek achieves 30% on BrowseComp, while o4-mini achieves 28.3%. AgentFold beats both without any extra RL, and it is comparable to o3 on the GAIA benchmark.
How Can You Use This?
If you’re an engineer or founder looking to include web search in your product, AgentFold gives you a few advantages:
- You can use a small model to create your web agent and reduce your infrastructure costs.
- You can run more complex workflows without hitting context limits.
Now, this is not a cause for celebration just yet. There are some limitations to the powers of this:
- It’s practically untested in enterprise contexts and is limited to a small number of turns in experimental setups.
- It still falls short of proprietary web agents like Deep Research and o3.
- Since an LLM guides the folding, some subtasks within the browsing task may be deprioritized, which could affect the results you obtain.
If you zoom out, AgentFold is not “just another web agent paper.” It is an argument that the next real unlock in agent performance will not come from stuffing bigger models into longer contexts, but from teaching agents to manage their own memory like adults, rather than toddlers.
The paper begins with a very real pain point: long-horizon tasks where agents either hoard everything and choke on context, or summarize too aggressively and overlook the one fact that mattered. By turning memory into a first-class decision (“what do I keep as-is, what do I compress, what do I merge into a chapter?”), AgentFold demonstrates that a 30B open model with clever context folding can compete with, and sometimes surpass, much larger and more expensive systems.
At the same time, the work is a strong prototype, not a finished product recipe. It still fails on a significant portion of complex tasks, and its benchmark performance may not accurately translate to the intricacies of enterprise workflows. You would not drop this into a mission-critical customer-facing flow without guardrails, monitoring, and your own fine-tuning.
AgentFold does not solve the problem of long-horizon agents once and for all, but it provides a concrete pattern and vocabulary to work with. If ReAct was about giving agents tools, AgentFold is about giving them a working brain for those tools. The next generation of practical agentic systems will almost certainly borrow from this idea of proactive context management, whether they refer to it as “folding” or not.

