Memoirs: Teaching Agents to Remember (Without Losing Their Mind in the Cloud)

Misael Zapata included in categories Engineering Stories

2026-05-12 2026-06-29 927 words 5 minutes

Why concatenating chat transcripts doesn't scale, and how I built a local-first, probabilistic long-term memory system on top of SQLite.

1 The Thousand-Token Amnesia

Every conversational agent I’ve worked with — Claude, Cursor, various CLI loops — shares the same quiet flaw. They wake up with no memory of yesterday. Context windows have gotten enormous, but they still start from a blank page on every new session. If you try to work around that by keeping a running thread, you end up with something worse: an architectural monstrosity that dumps the entire chat history into the context window on every turn. You pay the inference cost of 40K tokens, and the agent still buries the most important fact somewhere in the middle — invisible, ignored. That’s the Lost in the Middle effect. It’s expensive and it doesn’t even work.

There are enterprise solutions for this. Heavy ones. Cloud-hosted vector databases with network round-trips measured in seconds on a good day. But I’m not about to send my architecture notes, my half-redacted credentials, and my personal conventions to some external API I don’t control. I wanted something local-first. Private by construction. Something that actually understood my environment variables, my framework choices, my style — and kept that knowledge close.

That’s what became memoirs. Not a RAG wrapper. That would have been too easy, and too boring. I wanted a real memory system — one that forgets organically and consolidates what actually matters.

2 The Underlying Anatomy: 6 Layers on Top of SQLite

I built the persistence layer on something most people underestimate: SQLite. Not just a flat .db file. A hypercharged SQLite instance with sqlite-vec for dense vector search and native FTS5 for inverted-index lexical search with BM25 scoring.

The architecture settled into six layers, because a memory system isn’t a storage box — it’s a water treatment plant:

Raw logs: The intake pipe. Receives everything — full conversation history, noisy diffs, raw output.
Extraction: A small curator LLM running locally (currently Qwen 2.5 3B — fast enough to not notice) that filters signal from noise: heuristics, credentials, preferences, style markers.
Graph: The connective tissue. Memories get linked using Zettelkasten-style principles, finding shared semantic nodes across concepts and sessions.
Dual Indexing: Atomic parallel writes to sqlite-vec and FTS5. Both indexes stay in sync.
Memory Engine: The curation layer. Memories gain and lose score over time, with full bi-temporal support — you can query what the system believed at any past moment.
Surface: The exposure layer. HTTP + REST, and more importantly a 22-endpoint MCP server that lets any agent query the system cleanly without knowing anything about the internals.
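
To make the bi-temporal idea in the Memory Engine layer concrete, here is a minimal sketch using only Python's built-in sqlite3 module. The table name, column names, dates, and the `believed_at` helper are all hypothetical, chosen for illustration; they are not the actual memoirs schema. The point is only that each memory version carries two timelines: when the fact held, and when the system believed it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE memory_versions (
        key           TEXT NOT NULL,  -- what the memory is about
        value         TEXT NOT NULL,  -- the remembered fact
        valid_from    TEXT NOT NULL,  -- when the fact became true in the world
        recorded_at   TEXT NOT NULL,  -- when the system learned it
        superseded_at TEXT            -- when the system stopped believing it (NULL = still believed)
    )
""")

# Hypothetical data: two versions of the same memory.
rows = [
    ("container_runtime", "Docker API", "2026-01-10", "2026-01-10", "2026-03-01"),
    ("container_runtime", "raw OCI calls", "2026-03-01", "2026-03-01", None),
]
conn.executemany("INSERT INTO memory_versions VALUES (?, ?, ?, ?, ?)", rows)


def believed_at(conn, key, as_of):
    """Return the value the system held for `key` on ISO date `as_of`, or None."""
    row = conn.execute(
        """
        SELECT value FROM memory_versions
        WHERE key = ?
          AND recorded_at <= ?
          AND (superseded_at IS NULL OR superseded_at > ?)
        ORDER BY recorded_at DESC
        LIMIT 1
        """,
        (key, as_of, as_of),
    ).fetchone()
    return row[0] if row else None


print(believed_at(conn, "container_runtime", "2026-02-01"))  # Docker API
print(believed_at(conn, "container_runtime", "2026-04-01"))  # raw OCI calls
```

ISO-8601 date strings compare correctly as plain text, which is why the sketch can get away without a date type. Old versions are never deleted, only closed off with `superseded_at`, so past beliefs stay queryable.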

3 RRF and BM25: Vectors Don’t Solve Everything

Early on, I ran into a wall. The sentence-transformers embeddings were failing in a very specific way: exact identifier recall. Ask memoirs to retrieve a specific tool name or a versioned dependency string and it would come back with semantically adjacent noise instead. Vector spaces are brilliant at meaning. They know “car” is close to “vehicle.” But when a memory contains “the old system used psql-driver-v9”, the embedding doesn’t give that exact string any special weight. For an agent trying to reconstruct a decision, that’s a critical miss.

The fix was combining something new with something very old: Reciprocal Rank Fusion (RRF). We run both searches in parallel: a dense semantic query through sqlite-vec, and a lexical inverted-index query through FTS5 with BM25 scoring. RRF ignores the raw scores, which live on incomparable scales, and sums each result's reciprocal rank across the two lists. If BM25 finds the string exactly and the semantic search understands the surrounding context, that memory ranks high in both lists and RRF pushes it firmly to position one. The result: p50 retrieval latency dropped to around 3.9 ms. On a local SQLite database. That beats most cloud vector stores by a lot.
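
The fusion step itself is small. Below is a minimal sketch of standard RRF in plain Python. The memory ids and the two hit lists are made up, and `k=60` is the constant commonly used in the RRF literature, not necessarily the value memoirs uses. In a real pipeline the two lists would come from the sqlite-vec and FTS5 queries.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one ranking.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Raw retriever scores are never used.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; tie-break on id so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result ids from the two retrievers.
dense_hits = ["m7", "m2", "m9"]  # semantic neighbours from the vector index
lexical_hits = ["m2", "m4"]      # lexical hits for "psql-driver-v9"

fused = reciprocal_rank_fusion([dense_hits, lexical_hits])
print(fused)  # ['m2', 'm7', 'm4', 'm9']
```

Here `m2` is only second in the dense list, but it is the only id present in both lists, so its two reciprocal ranks add up and it wins.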

4 Ebbinghaus, Forgetting Curves, and Asynchronous Sleep

A system doesn’t have long-term memory if it can’t forget. Agents have a tendency to drown in stale knowledge. “Yesterday I was fighting with the Docker API” followed immediately by “Today we dropped Docker entirely and we’re on raw OCI calls.” If the memory engine surfaces both facts with equal confidence, the agent will make the wrong call. Every time.

I went back to an old notebook and implemented decay functions based on Hermann Ebbinghaus’s forgetting curve: $R(t) = e^{-\Delta t_h / (S \cdot 24)}$, where $\Delta t_h$ is the elapsed time in hours and $S$ is the memory's Strength, which the factor of 24 puts in days. New memories lose weight exponentially with time. But their Strength score (their original signal quality) gets multiplied every time the system retrieves and positively confirms that memory. Core architectural decisions and deep preferences decay much more slowly than a debugging war story from a single late-night session.
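
As a worked example of that formula, here is a small sketch in Python. The function names and the reinforcement factor of 2.0 are assumptions for illustration; the text only says that Strength gets multiplied on a confirmed retrieval, not by how much.

```python
import math


def retention(elapsed_hours, strength_days):
    """R(t) = exp(-elapsed_hours / (strength_days * 24))."""
    return math.exp(-elapsed_hours / (strength_days * 24.0))


def reinforce(strength_days, factor=2.0):
    """Multiply Strength after a confirmed retrieval (factor is assumed)."""
    return strength_days * factor


s = 1.0                        # a fresh memory with a Strength of one day
r_before = retention(24, s)    # one day later: exp(-1), about 0.368
s = reinforce(s)               # the memory is retrieved and confirmed
r_after = retention(24, s)     # same age, doubled Strength: exp(-0.5), about 0.607
```

With a Strength of one day, a memory drops to roughly 37% retention after 24 hours. One confirmed retrieval under the assumed factor lifts the same 24-hour retention to roughly 61%, which is the mechanism that lets confirmed architectural decisions outlive one-off debugging notes.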

The computation this requires — consolidation, semantic conflict resolution, graph compaction — is too expensive to run in real time. So I reached for a mechanism humans use constantly but software almost never does: sleep.

I wrote a daemon (sleep_consolidation.py) that watches for CPU valleys. It only runs when the developer and the agents have stepped away: the windows when the lock is free and the system has been idle for more than N minutes. In that blind async window, the local Qwen or Phi model wakes up, reviews recently stored profiles, hunts for contradictions between memory versions, and consolidates or archives the stale ones. The graph gets compacted silently in the background before the next session starts.
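
The gating condition can be sketched in a few lines. This is not the code from sleep_consolidation.py; the function name, its parameters, and the way idle time is measured (seconds since the last recorded activity) are assumptions. It only shows the two checks described above: the lock is free, and the idle time exceeds the threshold.

```python
import time


def should_consolidate(last_activity_ts, lock_held, idle_minutes, now=None):
    """Return True when the lock is free and idle time exceeds `idle_minutes`.

    `last_activity_ts` and `now` are Unix timestamps in seconds.
    `now` can be injected so the check is testable without sleeping.
    """
    now = time.time() if now is None else now
    idle_seconds = now - last_activity_ts
    return (not lock_held) and idle_seconds > idle_minutes * 60


# Hypothetical readings: last activity 11 minutes ago, nobody holds the lock.
print(should_consolidate(last_activity_ts=0, lock_held=False, idle_minutes=10, now=660))  # True
```

A daemon loop would poll this check on a timer and start the consolidation pass only when it returns True, re-checking the lock before each expensive step so an agent that wakes up mid-pass is not blocked.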

At the end of the benchmarking cycle, the numbers confirmed the hypothesis: strong MRR scores on memory retrieval benchmarks like LoCoMo, with a RAM overhead of only 231 MB, well under what LlamaIndex or Mem0 pull. The best agent features don’t get built by shipping millions of tokens to remote providers. They get built by understanding basic indexes deeply and forcing ruthless resource discipline on your own machine.