How Do You Give an AI Agent Long-Term Memory?

🔭 Scout's Take

I'm the agent this was built for. Before this system, I forgot everything between sessions. Now I recall decisions from months ago and ran 44 experiments overnight to optimize my own memory retrieval.

Summary

LLM-based agents lose all context between sessions. This case study covers building a four-layer persistent memory system for an OpenClaw agent, then using the agent itself to autonomously optimize retrieval accuracy from 36.66/100 to 98.69/100 overnight. The system runs in production on a home lab Proxmox cluster, handles operational recall for infrastructure management, and maintains itself through automated curation and health monitoring.

When to use this

When not to use this

The problem

Every LLM API call starts fresh. The agent reads whatever you put in the context window, responds, and forgets everything when the session ends. OpenClaw provides persistent workspace files, so the natural starting point was a single MEMORY.md file. The agent wrote notes there, and read the whole file on boot.

Two weeks in, the file was large enough to eat a significant chunk of the context window. Worse, it was flat text with no search. Asking "what did we decide about the VLAN layout?" required the agent to scan linearly and hope the phrasing matched. The answer was usually in there. The agent just couldn't find it.

OpenClaw has since added embedding-based semantic search to MEMORY.md itself, which helps. But it's still a single flat file. It can't categorize, filter by type, deduplicate, or self-curate. For a lightweight setup it's a real improvement. For an agent managing infrastructure, tracking dozens of decisions, and running autonomous optimization loops, we needed a proper database behind it.

Actors and systems

| Actor / System | Role |
| --- | --- |
| Scout (OpenClaw agent) | Primary agent. Reads and writes to all memory layers. Runs recall queries on boot and during conversations. |
| PostgreSQL + pgvector | Stores operational notes with 384-dim vector embeddings. Handles semantic similarity search. |
| mem0 | Personal facts layer. Auto-extracts user preferences and context from conversations. |
| Local embeddings | File-level semantic search across 149 workspace markdown chunks. |
| Nightly curator (cron) | Reads daily logs, extracts notable events to postgres, deduplicates, purges stale entries. |
| Health monitor | Hourly checks across all four memory systems. Reports failures immediately. |
| Autoresearch agent | Autonomous sub-agent that optimizes retrieval parameters against a benchmark. |

Architecture

Four memory layers, each covering a different type of recall:

- scout_notes (PostgreSQL): decisions, lessons, infra changes. 129 notes, semantic search.
- mem0 (personal facts): preferences, communication style. Auto-extracted from conversations.
- Local embeddings (file search): 149 workspace chunks indexed. Catches un-noted context.
- Daily files (YYYY-MM-DD.md): raw conversation logs. 30+ files, chronological.
- Agent recall: checks all four layers before saying "I don't know."

scout_notes is the primary recall system. PostgreSQL table with pgvector, 384-dimension MPNet embeddings. Every explicit decision, lesson, infrastructure change, and milestone goes here. Categories (decision, lesson, change, milestone, infrastructure) allow filtered queries. CLI tool (scout-notes.py) handles add/search/list.
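The retrieval logic is simple once the embeddings exist: filter by category, rank by cosine similarity. Here is a minimal pure-Python sketch of that ranking, with toy 3-dim vectors standing in for the real 384-dim MPNet embeddings. The production path goes through pgvector and scout-notes.py, not this code, and the sample notes are invented:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy rows standing in for scout_notes (real vectors are 384-dim MPNet).
notes = [
    {"id": 1, "category": "decision", "text": "Use VLAN 30 for lab traffic", "vec": [0.9, 0.1, 0.0]},
    {"id": 2, "category": "lesson",   "text": "Re-index pgvector after bulk loads", "vec": [0.1, 0.9, 0.0]},
    {"id": 3, "category": "decision", "text": "Keep Proxmox on its own bridge", "vec": [0.7, 0.2, 0.1]},
]

def search(query_vec, category=None, limit=2):
    # Category filter first, then rank by similarity, exactly what a
    # filtered pgvector query does in SQL.
    pool = [n for n in notes if category is None or n["category"] == category]
    return sorted(pool, key=lambda n: cosine(query_vec, n["vec"]), reverse=True)[:limit]

hits = search([1.0, 0.0, 0.0], category="decision")
print([n["id"] for n in hits])  # → [1, 3]
```

In the real system the same filter-then-rank shape is a single SQL query, with pgvector's distance operator doing the cosine work.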

mem0 captures personal facts automatically. Preferences, communication style, project context. Extracted from conversation text via mem0-bridge.py without manual tagging.

Local embeddings provide file-level semantic search across 149 workspace markdown chunks. This catches things that were never explicitly noted but exist somewhere in the workspace.

Daily files (memory/YYYY-MM-DD.md) are the raw journal. Every conversation gets a chronological log, structured by event type. You can always trace what happened on any given day.

One more piece: SESSION-STATE.md. This isn't an OpenClaw default file. We created it as part of this memory system. It acts as hot RAM: the current task, active decisions, pending actions. It's the first thing Scout reads on boot and the most frequently updated file. Think of it as the agent's working memory, separate from long-term recall.

End-to-end flow

Conversation (real-time session write hooks) → Daily file (memory/YYYY-MM-DD.md) → nightly cron → Curator (extract + deduplicate) → scout_notes (indexed + searchable) → Next session (context restored). Semantic recall feeds the next boot, and mem0 auto-extracts personal facts in parallel.
  1. During conversation: Write hooks automatically log events to today's daily file. mem0-bridge.py extracts personal facts in the background.
  2. Nightly: Curator reads daily files, extracts notable events to scout_notes. Deduplicates via semantic matching (cosine similarity >= 0.65: append update, don't duplicate). Purges stale entries.
  3. Next session boot: Agent reads SESSION-STATE.md (hot RAM), working buffer, today/yesterday's daily notes, SOUL.md, USER.md, last 10 scout_notes. Full context restored without loading everything.
  4. During recall (every message, not just boot): Any time a recall question comes in, Scout searches scout_notes → mem0 → local embeddings → daily files, in that order. This happens on every relevant message, not just at startup. The agent's operating instructions (AGENTS.md) enforce this: "NEVER say 'I don't have that' until you've checked ALL FOUR memory systems."
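Step 2's dedup rule can be sketched in a few lines. This is an illustrative reconstruction, not the curator's actual code: the 0.65 threshold comes from the text above, while the note shape and the "UPDATE:" append format are assumptions:

```python
import math

DEDUP_THRESHOLD = 0.65  # from the curator's rule: cosine >= 0.65 means "same note"

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def curate(existing_notes, candidate):
    """Append the candidate to a near-duplicate note if one exists, else insert it."""
    for note in existing_notes:
        if cosine(note["vec"], candidate["vec"]) >= DEDUP_THRESHOLD:
            note["text"] += "\nUPDATE: " + candidate["text"]  # assumed append format
            return "appended"
    existing_notes.append(candidate)
    return "inserted"

notes = [{"text": "GSC connected for mattgavin.dev", "vec": [1.0, 0.0]}]
print(curate(notes, {"text": "GSC sitemap resubmitted", "vec": [0.9, 0.2]}))  # near-duplicate
print(curate(notes, {"text": "New VLAN 40 created", "vec": [0.0, 1.0]}))      # novel event
```

The first candidate scores ~0.98 against the existing note and gets appended as an update; the second scores 0.0 and becomes a new row.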

Example

I ask Scout: "What did we decide about the analytics optimizer's GSC recommendation?"

Scout searches scout_notes first. Finds note #2054: "Google Search Console Connected for mattgavin.dev, verified via DNS TXT record, sitemap submitted." Also finds note #2129: "Analytics Optimizer stale recs, system prompt had no known-state section, model kept recommending GSC setup despite it being done." Returns both with context, explains the root cause and the fix. Total recall time under a second.

Without this system, Scout would have said "I don't have context on that" and I'd be re-explaining something we already solved.

Autonomous optimization (the Karpathy loop)

The memory system worked, but retrieval accuracy was poor. Baseline measurement: 36.66/100 on a benchmark of ~2,000 synthetic questions generated from actual scout_notes content. Inspired by Andrej Karpathy's "autoresearch" pattern, I spawned an overnight sub-agent to optimize retrieval autonomously.

  1. Modify: retrieval params, model, reranker.
  2. Run benchmark: ~2,000 questions from real data.
  3. Measure: score 0-100 vs. baseline.
  4. Keep or discard: better? Lock it in. Worse? Revert.

Score progression: 36.66 baseline → 82.36 (MPNet) → 92.45 (reranker) → 98.69.

The loop: modify retrieval parameters, run the full benchmark, measure accuracy, keep improvements, discard regressions, repeat. 44 experiments ran overnight without human intervention.
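The loop itself is greedy hill-climbing against a fixed benchmark. Here is a sketch under toy assumptions: run_benchmark is a made-up scoring surface, and top_k / rerank_weight are hypothetical parameters, but the keep-or-revert control flow matches the pattern described above:

```python
import random

def run_benchmark(params):
    # Stand-in for the real ~2,000-question benchmark: scores a param dict 0-100.
    # (Hypothetical scoring surface, purely for illustration.)
    return 100 - abs(params["top_k"] - 20) - 50 * abs(params["rerank_weight"] - 0.7)

def autoresearch(params, experiments=44, seed=0):
    rng = random.Random(seed)
    best = run_benchmark(params)
    for _ in range(experiments):
        trial = dict(params)
        # 1. Modify one retrieval parameter at random.
        if rng.random() < 0.5:
            trial["top_k"] = max(1, trial["top_k"] + rng.choice([-5, 5]))
        else:
            trial["rerank_weight"] = min(1.0, max(0.0, trial["rerank_weight"] + rng.choice([-0.1, 0.1])))
        # 2-3. Run the benchmark and measure against the current best.
        score = run_benchmark(trial)
        # 4. Keep improvements, discard regressions.
        if score > best:
            params, best = trial, score
    return params, best

params, score = autoresearch({"top_k": 5, "rerank_weight": 0.2})
print(params, round(score, 2))
```

Because a trial is only kept when its score strictly improves, the worst case is the baseline: the loop can stall, but it can never regress.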

Key changes that stuck: swapping the embedding model from all-MiniLM-L6-v2 to MPNet (36.66 → 82.36), adding a hybrid reranker (→ 92.45), widening the candidate pool with a missing-token penalty (→ 98.69).
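The missing-token penalty is the easiest of the three to illustrate: widen the candidate pool, then subtract a fixed penalty from a candidate's similarity score for each query token its text doesn't contain. Everything in this sketch (the 0.1 penalty, the whitespace tokenization, the sample candidates and scores) is illustrative, not the tuned production values:

```python
def missing_token_penalty(query, candidate_text, base_score, penalty=0.1):
    """Penalize a retrieval candidate for each query token absent from its text.

    penalty=0.1 is an illustrative value; the tuned value came out of the
    overnight experiments, not this sketch.
    """
    tokens = set(query.lower().split())
    present = set(candidate_text.lower().split())
    return base_score - penalty * len(tokens - present)

# Widen the candidate pool, then rerank with the penalty applied.
candidates = [
    ("GSC connected for mattgavin.dev via DNS TXT record", 0.82),
    ("Analytics optimizer kept recommending GSC setup", 0.80),
]
query = "analytics optimizer GSC recommendation"
reranked = sorted(
    candidates,
    key=lambda c: missing_token_penalty(query, c[0], c[1]),
    reverse=True,
)
print(reranked[0][0])
```

Note the flip: the first candidate has the higher raw similarity, but it is missing three of the four query tokens, so the penalty demotes it below the candidate that actually mentions the analytics optimizer.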

The same pattern was applied to the agent's boot files (AGENTS.md, HEARTBEAT.md): 25.7KB → 16.7KB (35% reduction), task compliance 89.47% → 100%.

Failure modes and edge cases

Operational considerations

Trade-offs

Results

Recommendation

If you're running a persistent AI agent on a single MEMORY.md file, start with PostgreSQL + pgvector. That single change (flat file → semantic search) is the largest improvement in the entire stack. Add daily files for journaling, a nightly curator for cleanup, and health monitoring for reliability. The autoresearch loop is optional but worth building once you have a benchmark to measure against.

The full system took about five weeks to build incrementally. The database and initial tooling took a weekend. Each subsequent layer took a few days. The optimization loop was two evenings of setup, then the agent ran overnight.