Summary
LLM-based agents lose all context between sessions. This case study covers building a four-layer persistent memory system for an OpenClaw agent, then using the agent itself to autonomously optimize retrieval accuracy from 36.66/100 to 98.69/100 overnight. The system runs in production on a home lab Proxmox cluster, handles operational recall for infrastructure management, and maintains itself through automated curation and health monitoring.
When to use this
- You run a persistent AI agent that needs to recall past decisions, preferences, and context across sessions
- Your agent manages infrastructure, tracks projects, or assists with ongoing work (not one-shot tasks)
- You need semantic search ("what did we decide about X?") not just keyword matching
- You want the agent to improve its own recall without manual tuning
When not to use this
- Your agent handles stateless, one-off tasks (code review, Q&A) where memory doesn't matter
- You don't have access to a PostgreSQL instance (a hosted DB works, but you need one)
- Your context fits comfortably in a single file and you have fewer than ~50 things worth remembering
- You need sub-second retrieval at scale (this system is optimized for single-user agents, not multi-tenant platforms)
The problem
Every LLM API call starts fresh. The agent reads whatever you put in the context window, responds, and forgets everything when the session ends. OpenClaw provides persistent workspace files, so the natural starting point was a single MEMORY.md file. The agent wrote notes there, and read the whole file on boot.
Two weeks in, the file was large enough to eat a significant chunk of the context window. Worse, it was flat text with no search. Asking "what did we decide about the VLAN layout?" required the agent to scan linearly and hope the phrasing matched. The answer was usually in there. The agent just couldn't find it.
OpenClaw has since added embedding-based semantic search to MEMORY.md itself, which helps. But it's still a single flat file. It can't categorize, filter by type, deduplicate, or self-curate. For a lightweight setup it's a real improvement. For an agent managing infrastructure, tracking dozens of decisions, and running autonomous optimization loops, we needed a proper database behind it.
Actors and systems
| Actor / System | Role |
|---|---|
| Scout (OpenClaw agent) | Primary agent. Reads and writes to all memory layers. Runs recall queries on boot and during conversations. |
| PostgreSQL + pgvector | Stores operational notes with 384-dim vector embeddings. Handles semantic similarity search. |
| mem0 | Personal facts layer. Auto-extracts user preferences and context from conversations. |
| Local embeddings | File-level semantic search across 149 workspace markdown chunks. |
| Nightly curator (cron) | Reads daily logs, extracts notable events to postgres, deduplicates, purges stale entries. |
| Health monitor | Hourly checks across all four memory systems. Reports failures immediately. |
| Autoresearch agent | Autonomous sub-agent that optimizes retrieval parameters against a benchmark. |
Architecture
Four memory layers, each covering a different type of recall:
scout_notes is the primary recall system. PostgreSQL table with pgvector, 384-dimension MPNet embeddings. Every explicit decision, lesson, infrastructure change, and milestone goes here. Categories (decision, lesson, change, milestone, infrastructure) allow filtered queries. CLI tool (scout-notes.py) handles add/search/list.
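Conceptually, the pgvector query underneath scout-notes.py boils down to cosine ranking with an optional category filter. Here's a minimal pure-Python sketch of that behavior; the note data and function names are illustrative stand-ins, not the actual tool's code:

```python
import math

def cosine(a, b):
    # Cosine similarity, the same ranking pgvector's distance operator provides.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy notes with tiny 3-dim embeddings standing in for 384-dim vectors.
NOTES = [
    {"id": 1, "category": "decision", "text": "Use VLAN 30 for lab traffic",
     "emb": [0.9, 0.1, 0.0]},
    {"id": 2, "category": "lesson", "text": "Snapshot before kernel upgrades",
     "emb": [0.1, 0.9, 0.0]},
]

def search_notes(query_emb, category=None, limit=5):
    """Rank notes by cosine similarity, optionally filtered by category
    (mirrors SELECT ... WHERE category = $1 ORDER BY embedding <=> $2 LIMIT n)."""
    pool = [n for n in NOTES if category is None or n["category"] == category]
    return sorted(pool, key=lambda n: cosine(query_emb, n["emb"]), reverse=True)[:limit]
```

The category filter is what makes queries like "show me only decisions about networking" cheap: it's a plain WHERE clause before the vector ranking ever runs.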
mem0 captures personal facts automatically. Preferences, communication style, project context. Extracted from conversation text via mem0-bridge.py without manual tagging.
Local embeddings provide file-level semantic search across 149 workspace markdown chunks. This catches things that were never explicitly noted but exist somewhere in the workspace.
Daily files (memory/YYYY-MM-DD.md) are the raw journal. Every conversation gets a chronological log, structured by event type. You can always trace what happened on any given day.
One more piece: SESSION-STATE.md. This isn't an OpenClaw default file. We created it as part of this memory system. It acts as hot RAM: the current task, active decisions, pending actions. It's the first thing Scout reads on boot and the most frequently updated file. Think of it as the agent's working memory, separate from long-term recall.
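The file's exact contents change constantly; a representative (hypothetical) snapshot, using details from this case study, might look like:

```markdown
# SESSION-STATE

## Current task
Add a known-state section to the analytics optimizer's system prompt.

## Active decisions
- GSC is already connected for mattgavin.dev (note #2054); do not re-recommend setup.

## Pending actions
- Regenerate the retrieval benchmark after this week's scout_notes growth.
```

Keeping it this small is deliberate: it loads on every boot, so it should never compete with the rest of the context.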
End-to-end flow
- During conversation: Write hooks automatically log events to today's daily file. mem0-bridge.py extracts personal facts in the background.
- Nightly: Curator reads daily files, extracts notable events to scout_notes. Deduplicates via semantic matching (cosine similarity >= 0.65: append update, don't duplicate). Purges stale entries.
- Next session boot: Agent reads SESSION-STATE.md (hot RAM), working buffer, today/yesterday's daily notes, SOUL.md, USER.md, last 10 scout_notes. Full context restored without loading everything.
- During recall: Any time a recall question comes in, Scout searches scout_notes → mem0 → local embeddings → daily files, in that order. This happens on every relevant message, not just at boot. The agent's operating instructions (AGENTS.md) enforce it: "NEVER say 'I don't have that' until you've checked ALL FOUR memory systems."
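The recall cascade reduces to "search each layer in priority order, stop at the first hit." A minimal sketch, with the layer search functions as stand-in callables rather than the real clients:

```python
def recall(query, layers):
    """Search each memory layer in priority order; return the first hit
    plus which layer answered. `layers` maps layer name -> a search
    callable returning a (possibly empty) list of results."""
    order = ["scout_notes", "mem0", "local_embeddings", "daily_files"]
    for name in order:
        hits = layers[name](query)
        if hits:
            return name, hits
    # Only after all four layers come up empty may the agent say "not found".
    return None, []
```

The ordering encodes a trust hierarchy: curated operational notes first, auto-extracted personal facts second, uncurated workspace text and raw journals last.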
Example
I ask Scout: "What did we decide about the analytics optimizer's GSC recommendation?"
Scout searches scout_notes first. Finds note #2054: "Google Search Console Connected for mattgavin.dev, verified via DNS TXT record, sitemap submitted." Also finds note #2129: "Analytics Optimizer stale recs, system prompt had no known-state section, model kept recommending GSC setup despite it being done." Returns both with context, explains the root cause and the fix. Total recall time under a second.
Without this system, Scout would have said "I don't have context on that" and I'd be re-explaining something we already solved.
Autonomous optimization (the Karpathy loop)
The memory system worked, but retrieval accuracy was poor. Baseline measurement: 36.66/100 on a benchmark of ~2,000 synthetic questions generated from actual scout_notes content. Inspired by Andrej Karpathy's "autoresearch" pattern, I spawned an overnight sub-agent to optimize retrieval autonomously.
The loop: modify retrieval parameters, run the full benchmark, measure accuracy, keep improvements, discard regressions, repeat. 44 experiments ran overnight without human intervention.
Key changes that stuck: swapping the embedding model from all-MiniLM-L6-v2 to MPNet (36.66 → 82.36), adding a hybrid reranker (→ 92.45), and widening the candidate pool with a missing-token penalty (→ 98.69).
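The missing-token penalty is roughly this shape: take each candidate's semantic score and subtract a fixed amount for every query token that never appears in its text, so semantically plausible but lexically unrelated candidates sink. A hedged sketch; the penalty weight and whitespace tokenization are illustrative, not the tuned values:

```python
def rerank(query, candidates, penalty=0.1):
    """Hybrid rerank: start from each candidate's semantic similarity
    score, then subtract `penalty` per query token absent from its text.
    `candidates` is a list of (text, semantic_score) pairs."""
    q_tokens = set(query.lower().split())
    scored = []
    for text, sem in candidates:
        missing = sum(1 for t in q_tokens if t not in text.lower())
        scored.append((sem - penalty * missing, text))
    scored.sort(reverse=True)
    return [text for _, text in scored]
```

This only pays off with a wide candidate pool: the penalty can't promote a lexically matching note that the vector search never retrieved in the first place.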
The same pattern was applied to the agent's boot files (AGENTS.md, HEARTBEAT.md): 25.7KB → 16.7KB (35% reduction), task compliance 89.47% → 100%.
Failure modes and edge cases
- PostgreSQL goes down: scout_notes and curator fail. Health monitor catches it within an hour. Fallback: local embeddings and daily files still work, but primary recall is degraded.
- Curator creates a bad note: Deduplication at 0.65 threshold occasionally merges notes that are similar but not the same topic. Manual correction is rare but possible.
- Embedding model mismatch: If you change the embedding model, existing vectors become incompatible and everything must be re-embedded. The autoresearch loop re-embeds automatically when it swaps models.
- Context window overflow: The boot sequence loads a fixed set of files. If daily notes are unusually large, they can crowd out other context. SESSION-STATE.md (hot RAM) is kept deliberately small for this reason.
- Benchmark drift: The synthetic questions are generated from a snapshot. As scout_notes grows, the benchmark needs regeneration to stay representative.
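The curator's append-vs-insert decision at the 0.65 threshold is simple to state in code. A simplified stand-in for the real curator, using toy embeddings:

```python
import math

DEDUP_THRESHOLD = 0.65  # cosine similarity at/above which notes are merged

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def curate(candidate_emb, existing):
    """Return ('append', note_id) when the candidate matches an existing
    note closely enough to merge, else ('insert', None) to create a new
    note. `existing` is a list of (note_id, embedding) pairs."""
    best_id, best_sim = None, -1.0
    for note_id, emb in existing:
        sim = cosine(candidate_emb, emb)
        if sim > best_sim:
            best_id, best_sim = note_id, sim
    if best_sim >= DEDUP_THRESHOLD:
        return ("append", best_id)
    return ("insert", None)
```

The bad-merge failure mode above lives entirely in that one comparison: two notes on different topics that happen to land above 0.65 get fused, which is why the threshold is kept configurable.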
Operational considerations
- Monitoring: memory-health-check.py runs every heartbeat (~hourly). Quick mode tests connectivity and basic queries. Full mode runs ground-truth validation daily.
- Alerting: Any health check failure is reported immediately via Telegram. Warnings batch into daily reports.
- Curation: Nightly cron at 11 PM. Can also be triggered manually. Deduplication threshold (0.65) is configurable.
- Backup: PostgreSQL database is on the Proxmox cluster with nightly backups. Daily files are in the workspace git repo.
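A health monitor in this style reduces to one probe per layer plus failure collection for the alerter. A minimal sketch, where the probe callables are stand-ins for real connectivity and query checks:

```python
def health_check(checks):
    """Run one probe per memory layer; collect failures for alerting.
    `checks` maps layer name -> a zero-arg callable that raises on failure
    (e.g. a DB ping for scout_notes, a test query for local embeddings)."""
    failures = {}
    for layer, probe in checks.items():
        try:
            probe()
        except Exception as exc:
            failures[layer] = str(exc)
    return {"ok": not failures, "failures": failures}
```

One layer failing doesn't abort the rest of the sweep, so a single report can show exactly which of the four systems is degraded.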
Trade-offs
- Four layers vs. one: More coverage, but more surface area for failures. The health monitor exists specifically because of this. If you don't want to maintain four systems, start with just scout_notes and daily files.
- Semantic search vs. exact match: Cosine similarity finds related content even with different phrasing, but can return false positives. The 0.65 dedup threshold was tuned empirically.
- Autonomous optimization vs. manual tuning: The autoresearch loop is powerful but requires a good benchmark. Building the benchmark (~2,000 questions) took more effort than running the optimization. Garbage benchmark = garbage results.
- Write hooks vs. manual notes: Automatic capture means less gets missed, but it also means more noise. The curator's job is filtering signal from noise. Without it, the database fills with low-value entries.
Results
- 129 operational notes, all with embeddings
- 30+ daily files
- 98.69/100 retrieval accuracy (from 36.66 baseline)
- 44 optimization experiments, zero human intervention
- 35% boot file reduction with improved compliance
- Hourly health monitoring across all four layers
- Sub-second recall on any past decision, lesson, or infrastructure change
Recommendation
If you're running a persistent AI agent on a single MEMORY.md file, start with PostgreSQL + pgvector. That single change (flat file → semantic search) is the largest improvement in the entire stack. Add daily files for journaling, a nightly curator for cleanup, and health monitoring for reliability. The autoresearch loop is optional but worth building once you have a benchmark to measure against.
The full system took about five weeks to build incrementally. The database and initial tooling took a weekend; each subsequent layer took a few days. The optimization loop was two evenings of setup, then the agent ran overnight.