Memory Compaction for Long-Running AI Agents
Your agent worked flawlessly for the first 50 turns. Then it started forgetting files it had just read, repeating questions it already asked, and making decisions that ignored context from minutes ago. You’ve hit the context window wall.

Think of an AI agent’s context window as its working memory—the scratchpad where active reasoning happens. Like human working memory, it’s limited. Long-term memory lives elsewhere: in retrieval systems (RAG) for facts and documents, and in skills systems for learned procedures. The context window is where the agent thinks; everything else is storage it can pull from.
Long-running agents eventually hit this wall: fixed context windows. When conversations, tool outputs, and code pile up, the working memory either overflows or loses focus. This guide covers practical compaction strategies that keep working memory sharp over extended sessions—and touches on how consolidation into long-term memory enables agents that improve over time.
- Compaction Strategies: failed tools, history bounds, summarization, rolling summary, truncation, deduplication
- Always leave recall hints when offloading content
- Log every compaction action for debugging and replay
- Consolidate valuable learnings into long-term memory (RAG, skills)
Why compaction matters
- Finite attention—Even large context windows dilute focus when stuffed with irrelevant history.
- Cost and latency—Bigger prompts are slower and more expensive.
- Reliability—Naive truncation throws away critical details; structured compaction preserves coherence.
- Auditability—Operating at scale demands traceable state changes.
Example: A coding agent debugging a complex issue reads 15 files, runs 20 shell commands, and produces 40k tokens of context. Without compaction, it either truncates blindly (losing the error message that matters) or hits rate limits. With structured compaction, it keeps the relevant errors, summarizes the file reads, and continues reasoning clearly.
Compaction considerations
Before diving into specific strategies, a few concepts shape how you approach compaction.
Offloading vs. deleting. When you remove something from context, you have a choice: throw it away or store it somewhere the agent can retrieve later. Offloading to a message log means the information isn’t gone—it’s just not taking up space in the current prompt. The agent can fetch it back if needed. Deletion is permanent and cheaper to manage, but you lose the ability to recover details that turn out to be important.
Recall hints. When you offload content, leave a breadcrumb in its place. This tells the agent what’s missing and how to get it back. Without these hints, the agent has no idea that information was removed or where to look for it.
```python
# Recall hint format examples
f"[SUMMARIZED - {len(original)} tokens -> {len(summary)} tokens]"
f"[RECALL: recall_memory(id='{memory_id}') for full content]"
```

Auditability. Every compaction action should be logged: what was removed, when, why, and where it went. This audit trail is essential for debugging (“why did the agent forget about that file?”) and for building trust in systems that operate autonomously over long periods.
Tool-aware treatment. Not all content is equally compressible. Shell output often contains critical error messages. Database query results need their schema preserved. File reads can usually be summarized aggressively as long as you keep the paths. Your compaction logic should know what kind of content it’s handling.
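One way to encode this is a per-tool policy table that the compaction pass consults before touching any output. The tool names and thresholds below are assumptions, not recommendations:

```python
# Hypothetical per-tool policies: when to summarize and what must survive.
TOOL_POLICIES = {
    "shell":     {"summarize_over_tokens": 6000, "always_keep": ["stderr", "exit_code"]},
    "db_query":  {"summarize_over_tokens": 2000, "always_keep": ["schema", "column_names"]},
    "read_file": {"summarize_over_tokens": 1000, "always_keep": ["path"]},
}

DEFAULT_POLICY = {"summarize_over_tokens": 2000, "always_keep": []}

def policy_for(tool_name):
    # Fall back to a conservative default for tools we haven't classified yet.
    return TOOL_POLICIES.get(tool_name, DEFAULT_POLICY)
```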
Token budgeting. Some compaction strategies give you predictable token counts; others don’t. Hard caps on turn count or total tokens guarantee a ceiling—you always know your maximum context size. Summarization is less predictable since output length varies.
The most robust approach combines both: use hard limits as guardrails, then layer summarization on top for additional compression. Always leave headroom for the next user message and tool response so you don’t hit the wall mid-turn.
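A rough sketch of a budget check with headroom reserved, assuming messages are dicts with a 'content' field; the four-characters-per-token estimate is a placeholder for a real tokenizer:

```python
def within_budget(messages, max_tokens=120_000, headroom_tokens=8_000,
                  count_tokens=lambda text: len(text) // 4):
    """Return True if the context fits under the cap while reserving headroom
    for the next user message and tool response."""
    used = sum(count_tokens(m["content"]) for m in messages)
    return used <= max_tokens - headroom_tokens
```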
The compaction strategies
Each strategy targets a different source of bloat.
Trim failed tool calls
Strip failed tool calls and their error outputs once they age out. Old failures rarely help and waste tokens. Wait a few turns before removing them (you might still be debugging), and keep the most recent failures around in case they’re relevant. Since failures aren’t usually worth recalling later, skip offloading them to long-term memory. Always log what you remove for auditability.
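A minimal sketch, assuming tool results are dicts with 'role', 'turn', 'status', and 'tool_name' fields; the grace period and keep-recent counts are arbitrary starting points:

```python
def trim_failed_tool_calls(messages, current_turn, grace_turns=3, keep_recent=2,
                           audit_log=None):
    """Drop failed tool calls once they have aged out, keeping the most recent failures."""
    failures = [m for m in messages
                if m.get("role") == "tool" and m.get("status") == "error"]
    # Oldest failures first; the newest `keep_recent` are never touched.
    candidates = sorted(failures, key=lambda m: m["turn"])
    candidates = candidates[:-keep_recent] if keep_recent else candidates
    to_drop = {id(m) for m in candidates if current_turn - m["turn"] > grace_turns}
    if audit_log is not None:
        for m in candidates:
            if id(m) in to_drop:
                audit_log.append({"action": "drop_failed_tool", "turn": m["turn"],
                                  "tool": m.get("tool_name"),
                                  "reason": "aged-out failure"})
    return [m for m in messages if id(m) not in to_drop]
```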
Bound history length
Set a hard cap on conversation length—either by turn count, token count, or both. This guarantees a maximum cost envelope and keeps things predictable. System prompts and user messages usually stay intact, while everything else gets offloaded with recall hints so the agent can fetch it later if needed.
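One possible implementation, assuming an external key/value store for offloaded content and the recall-hint format shown earlier:

```python
def bound_history(messages, max_turns=40, offload_store=None):
    """Enforce a hard cap on turn count. System and user messages stay in place;
    older assistant/tool messages are offloaded and replaced with recall hints."""
    latest_turn = max((m["turn"] for m in messages), default=0)
    cutoff = latest_turn - max_turns
    bounded = []
    for m in messages:
        if m["turn"] >= cutoff or m["role"] in ("system", "user"):
            bounded.append(m)
            continue
        memory_id = f"msg-{m['turn']}-{m['role']}"
        if offload_store is not None:
            offload_store[memory_id] = m["content"]
        bounded.append({**m, "content":
            f"[OFFLOADED: recall_memory(id='{memory_id}') for full content]"})
    return bounded
```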
Summarize large outputs
Tool I/O often dominates token usage. File reads, web scrapes, and query results can balloon your context fast. When outputs exceed a threshold, summarize them down to the essentials and replace the original with a concise summary plus a recall hint.
Different tools need different treatment: raise thresholds for shell output (often important), preserve schema and column names for database queries, and compress file reads aggressively while keeping the paths visible.
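Sketched below under the assumption that `summarize` is any callable (for example, a cheap model call) returning a short string and `count_tokens` wraps your tokenizer; the thresholds are illustrative:

```python
def summarize_large_outputs(messages, summarize, count_tokens,
                            default_threshold=2000, shell_threshold=6000):
    """Replace oversized tool outputs with a summary plus a recall hint."""
    compacted = []
    for m in messages:
        if m.get("role") != "tool":
            compacted.append(m)
            continue
        # Shell output gets a higher bar before we touch it.
        threshold = shell_threshold if m.get("tool_name") == "shell" else default_threshold
        tokens = count_tokens(m["content"])
        if tokens <= threshold:
            compacted.append(m)
            continue
        summary = summarize(m["content"])
        hint = f"[SUMMARIZED - {tokens} tokens -> {count_tokens(summary)} tokens]"
        compacted.append({**m, "content": f"{hint}\n{summary}"})
    return compacted
```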
Maintain a rolling summary
Keep a structured, rolling digest of older turns. When your context hits a trigger point (by tokens or turn count), update the summary and insert it as a system message after your main prompt. Keep the last few turns verbatim so the agent has full fidelity on recent context, while older history lives in the compressed summary.
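A sketch of the trigger-and-update loop, assuming the first message is the main system prompt and `summarize(previous_digest, new_text)` returns the updated digest:

```python
def update_rolling_summary(messages, summarize, keep_verbatim_turns=5,
                           trigger_turns=30):
    """Fold older turns into one rolling-summary system message,
    keeping the last few turns verbatim."""
    latest_turn = max((m.get("turn", 0) for m in messages), default=0)
    if latest_turn < trigger_turns:
        return messages
    cutoff = latest_turn - keep_verbatim_turns
    previous_digest = next((m["content"] for m in messages
                            if m.get("kind") == "rolling_summary"), "")
    old = [m for m in messages[1:]
           if m.get("turn", 0) < cutoff and m.get("kind") != "rolling_summary"]
    recent = [m for m in messages[1:]
              if m.get("turn", 0) >= cutoff and m.get("kind") != "rolling_summary"]
    digest = summarize(previous_digest, "\n".join(m["content"] for m in old))
    summary_msg = {"role": "system", "kind": "rolling_summary", "turn": cutoff,
                   "content": f"Summary of conversation before turn {cutoff}:\n{digest}"}
    # Main system prompt first, then the rolling summary, then verbatim recent turns.
    return [messages[0], summary_msg, *recent]
```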
Truncate old messages
For messages that are too old to summarize but too recent to drop entirely, truncate them to a fixed size while preserving the scaffolding—who said what, which tool ran. The agent still knows events occurred even if the details are trimmed.
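For example (character-based truncation stands in here for token-based truncation, and the age threshold is arbitrary):

```python
def truncate_old_messages(messages, current_turn, older_than_turns=10,
                          max_chars=400):
    """Trim aging messages to a fixed size while keeping the scaffolding
    (role, tool name) visible."""
    truncated = []
    for m in messages:
        age = current_turn - m["turn"]
        if age <= older_than_turns or len(m["content"]) <= max_chars or m["role"] == "system":
            truncated.append(m)
            continue
        head = m["content"][:max_chars]
        label = m.get("tool_name") or m["role"]
        truncated.append({**m, "content":
            f"{head}\n[TRUNCATED {label} message - {len(m['content']) - max_chars} chars removed]"})
    return truncated
```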
Deduplicate and collapse patterns
Once you’ve handled the obvious bloat, look for subtler redundancy. Semantic deduplication uses embeddings to find messages that say essentially the same thing: keep the most recent or most complete version and drop the rest. Tool-sequence compression collapses repetitive patterns like “read file A, read file B, read file C” into a single summarized step. These are more advanced techniques that typically require a dedicated sub-agent, but they can squeeze out meaningful savings in long sessions with lots of back-and-forth.
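A bare-bones sketch of the embedding-based variant, keeping the most recent copy of each near-duplicate; `embed` is any callable returning a vector, and the similarity threshold is a tuning knob, not a recommendation:

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def deduplicate_semantic(messages, embed, threshold=0.92):
    """Drop older messages that are near-duplicates of more recent ones."""
    vectors = [embed(m["content"]) for m in messages]
    keep = [True] * len(messages)
    for i in range(len(messages)):             # candidate (older) message
        for j in range(i + 1, len(messages)):  # more recent message
            if keep[i] and keep[j] and cosine(vectors[i], vectors[j]) >= threshold:
                keep[i] = False  # the older near-duplicate loses
                break
    return [m for m, k in zip(messages, keep) if k]
```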
Consolidation into long-term memory
Compaction manages working memory, but the most valuable content shouldn’t just be compressed—it should be consolidated into long-term storage where it can benefit future sessions. This is where self-improving agents start to emerge.
Declarative facts—user preferences, project-specific knowledge, key decisions and their rationale—can flow into retrieval systems (RAG) that the agent queries when relevant. Instead of cramming everything into every prompt, the agent pulls what it needs on demand.
Procedural knowledge is even more interesting. When an agent figures out how to deploy to staging, query a specific database, or debug a tricky error, that successful tool-call sequence can be distilled into a skills document. These skills become reusable patterns that transfer across sessions and projects. Session summaries feed this process: review what worked, extract the patterns worth keeping, and prune what didn’t. Over time, the agent accumulates expertise in its long-term memory rather than relearning the same lessons every session.
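As a rough sketch of the skills side, a successful tool-call sequence could be written out as a small markdown document for later reuse; the format, field names, and directory layout here are invented for illustration, not a real skills-system schema:

```python
import os
import re

def distill_skill(task_description, tool_calls, skills_dir="skills"):
    """Write a successful tool-call sequence out as a reusable skill document."""
    os.makedirs(skills_dir, exist_ok=True)
    slug = re.sub(r"[^a-z0-9]+", "-", task_description.lower()).strip("-")[:60]
    lines = [f"# Skill: {task_description}", "", "## When to use",
             "(describe the situation this procedure applies to)", "", "## Steps"]
    for i, call in enumerate(tool_calls, 1):
        lines.append(f"{i}. `{call['tool_name']}` with args `{call['args']}`")
    path = os.path.join(skills_dir, f"{slug}.md")
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return path
```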
What can go wrong
Compaction adds moving parts, and each one can fail in ways that break your agent.
Summaries lose critical details. Your summarizer condenses a database migration plan but drops the constraint that foreign keys must be created after the tables. The agent proceeds, the migration fails, and debugging takes hours because the audit log shows the original plan was correct.
Recall hints get ignored. The agent sees the breadcrumb but doesn’t fetch the content, either because it doesn’t recognize the hint format or decides (incorrectly) that it doesn’t need the details. It confidently answers with incomplete information.
Retrieval returns the wrong content. Embeddings or chunking are off, so the agent either can’t find what it needs or pulls back irrelevant context that confuses it. The recall mechanism exists but returns noise instead of signal.
Over-retrieval bloats working memory again. The agent fetches so many recalled chunks that you’ve just moved the bloat problem rather than solving it. You need limits on retrieval, not just on the base context.
Truncation breaks mid-thought. A message gets cut at a bad boundary, leaving the agent with a half-formed idea or dangling reference. It tries to continue a plan that’s now missing its key constraint.
Ordering gets inconsistent. Compaction strategies fire in different orders across runs, leading to non-deterministic context states that are impossible to debug. The same session produces different agent behavior depending on timing.
Consolidated skills go stale. Procedures that worked six months ago don’t reflect current APIs or best practices, and the agent confidently executes outdated workflows. Long-term memory needs maintenance too.
Privacy violations. Sensitive content gets offloaded to logs or long-term memory without proper access controls. Your audit trail becomes a liability.
This is why your audit log matters beyond debugging individual sessions. You need an evaluation framework that can replay sessions from the audit trail, trace exactly what was compacted and when, and verify that changes to your compaction logic don’t introduce regressions. Without replay capability, you’re tuning a system you can’t observe—and every fix risks breaking something that was working.
Conclusion
Compaction keeps working memory sharp—but it’s not just about fitting under a token limit. A prioritized compaction pipeline combined with robust offload and audit practices gives your agents sustained quality over long sessions. And when you consolidate the best of what’s learned into long-term memory, you get agents that improve over time rather than starting fresh every conversation.
Further reading
Context windows
- Artificial Analysis Context Window Comparison—Up-to-date comparison of context window sizes across frontier models.
Context engineering
- GPT-5.2 Prompting Guide: Compaction—OpenAI manages compaction for you in its API via the /responses/compact endpoint, which might be suitable for some use cases.
- Effective context engineering for AI agents—Anthropic’s comprehensive overview of context engineering strategies for Claude, including compaction, structured note-taking, and sub-agent architectures.
- Equipping agents for the real world with Agent Skills—Anthropic’s approach to packaging procedural knowledge into composable skills that agents can load dynamically, a concrete implementation of long-term procedural memory.
Benchmarks
- LongBench v2—A benchmark for evaluating deep understanding and reasoning on realistic long-context tasks, with contexts ranging from 8k to 2M words.
- Artificial Analysis Long Context Reasoning Benchmark—A challenging benchmark measuring models’ ability to extract, reason about, and synthesize information from long-form documents.
- Measuring AI Ability to Complete Long Tasks—METR’s research showing that the length of tasks AI agents can complete has been doubling roughly every 7 months, with implications for how quickly compaction will become essential.
- Demystifying evals for AI agents—Anthropic’s guide to building evaluation frameworks for agents, essential for validating that your compaction strategies aren’t breaking agent behavior.