8 March 2025

The Context Window Is Not the Constraint

By Asgeir Albretsen · 5 min read

ai-memory, retrieval, context-window, knowledge-base

Everyone assumed bigger context windows would solve the AI memory problem. The research says otherwise. The bottleneck was always retrieval.

When Google announced Gemini 1.5 in February 2024, a lot of people quietly decided the personal AI memory problem was on its way to being solved. The new model supported up to one million tokens of context — the equivalent of roughly 700,000 words, or about seven novels. The obvious implication: just put everything in. Notes, emails, meeting records, every document you've ever written. Let the model figure out what matters.

It doesn't work like that.

What the research actually found

In 2023, Nelson Liu and colleagues at Stanford published a paper called "Lost in the Middle." They measured how language model performance changes depending on where relevant information appears inside a long context window. The finding was blunt: models are significantly better at using information near the start or end of their context than information buried in the middle.

Performance on multi-document question answering dropped by more than 30% when the answer document was positioned in the middle of twenty others versus first or last. The models weren't failing to read the information. They were reading it and then losing track of it.
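If position matters this much, one mitigation people reach for is to order retrieved documents so the strongest candidates sit at the edges of the prompt and the weakest land in the middle. The sketch below illustrates that idea only; it is not a method taken from the paper, and the function name `order_for_edges` is hypothetical. It assumes the input list is already ranked, most relevant first.

```python
def order_for_edges(ranked):
    """Place the most relevant items at both ends of the list.

    `ranked` is assumed to be sorted most relevant first. Items are dealt
    alternately to the front and the back, so the least relevant ones end
    up in the middle, where models tend to use information least reliably.
    """
    front, back = [], []
    for index, doc in enumerate(ranked):
        if index % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    # Reverse the back half so the second-best item is the very last one.
    return front + back[::-1]


ordered = order_for_edges(["a", "b", "c", "d", "e"])
```

With five documents ranked "a" (best) through "e" (worst), the best one goes first and the runner-up goes last. Whether this helps in practice depends on the model and the task, so treat it as something to measure rather than assume.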

This isn't specific to older or weaker models. On tests that ask for many separate facts at once, Gemini 1.5 at its full million-token capacity averages roughly 60% recall. That means four in ten of those facts are effectively invisible: not missing, just unreachable. The model processes them and then can't reliably act on them.

The context window, it turns out, is not a storage space. It's closer to working memory. And working memory degrades when you overfill it.

The size instinct

The belief that bigger context windows solve everything is understandable. It follows the same logic as buying more RAM. More capacity, fewer constraints, better results.

But language models don't work that way. A model handed 200,000 tokens of loosely related notes, preferences, and project history has to do real work before it can answer a simple question. It has to decide what's relevant. That process is imperfect, and it gets less reliable as context grows.

The "needle in a haystack" benchmark makes this visible. You embed a specific fact inside a large document and ask the model to find it. GPT-4 Turbo, which nominally supports 128,000 tokens, shows measurable degradation past about 12,800, roughly 10% of its stated capacity. More space doesn't straightforwardly mean more usable space.

The research comparing retrieval-augmented generation (RAG) against raw long-context approaches reaches a similar conclusion. Long-context models often win on general benchmarks. But RAG systems tend to win on citation accuracy and perform better when the question requires finding a specific, structured fact in a well-organized corpus. The reason isn't mysterious: RAG forces you to be deliberate about what goes into the context. You retrieve a few highly relevant chunks instead of flooding the window with everything you have. That density usually helps.

What retrieval actually does

A knowledge base's job isn't to be a context window. It's to be much larger than a context window — and to have strong opinions about what belongs in it.

Most personal knowledge accumulates without much structure. You write a note. Then another. Years later you have thousands of them, with no reliable way to find the three that matter for a specific question. Keyword search finds what you typed. Semantic search finds what you meant. Structured queries find facts — "what did I last agree with Maya about her contract deadline?" — that neither keyword nor vector search handles cleanly on its own. The most reliable retrieval combines all three: search by meaning, filter by entity type, sort by recency, return something dense and relevant rather than something vast and noisy.
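The combination described above (search by meaning, filter by entity type, sort by recency) can be sketched in a few lines. This is a toy under stated assumptions, not Harbor's implementation: the `Note` fields, the entity labels, the sample `NOTES` and the `retrieve` function are all hypothetical, and a bag-of-words cosine similarity stands in for a real embedding model so the example runs with the standard library alone.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from math import sqrt


@dataclass(frozen=True)
class Note:
    text: str
    entity_type: str  # hypothetical labels such as "person" or "project"
    updated: date


def similarity(a: str, b: str) -> float:
    """Cosine similarity over lowercase word counts.

    Assumption: this is only a stand-in for semantic search. A real system
    would compare embedding vectors instead of raw word overlap.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[word] * vb[word] for word in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def retrieve(query, notes, entity_type=None, min_score=0.1, k=3):
    """Filter by entity type, keep notes that match the query, newest first."""
    candidates = [n for n in notes if entity_type is None or n.entity_type == entity_type]
    relevant = [n for n in candidates if similarity(query, n.text) >= min_score]
    relevant.sort(key=lambda n: n.updated, reverse=True)
    return relevant[:k]


NOTES = [
    Note("Maya contract deadline moved to 14 June", "person", date(2025, 2, 20)),
    Note("Maya contract deadline is 1 May", "person", date(2025, 1, 10)),
    Note("Contract template for the design project", "project", date(2025, 3, 1)),
    Note("Grocery list: oats, coffee", "misc", date(2025, 3, 5)),
]

latest = retrieve("Maya contract deadline", NOTES, entity_type="person", k=1)
```

Asking for "person" notes returns the most recent agreement with Maya and nothing else. Drop the entity filter and the newer but unrelated contract template outranks it on recency, which is the kind of noise the filter exists to remove.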

This changes what the context window is actually for. It's not storage. It's a working surface — the place you assemble the specific, curated set of facts the model needs to reason about a particular question. The retrieval system decides what goes in. The context window is where the thinking happens.
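Treating the window as a working surface implies a budget. A minimal sketch of that assembly step follows, assuming the snippets arrive already ranked by relevance. The names `estimate_tokens` and `assemble_context` are hypothetical, and counting whitespace-separated words is only a crude proxy for tokens; a real system would use the model's own tokenizer.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


def assemble_context(ranked_snippets, budget_tokens):
    """Greedily pack the highest-ranked snippets that fit within the budget.

    Returns the assembled context string and the estimated tokens used.
    """
    chosen, used = [], 0
    for snippet in ranked_snippets:
        cost = estimate_tokens(snippet)
        if used + cost > budget_tokens:
            continue  # does not fit; a shorter, lower-ranked snippet still might
        chosen.append(snippet)
        used += cost
    return "\n\n".join(chosen), used


context, used = assemble_context(["a b c", "d e f g h", "i j"], budget_tokens=5)
```

Here the second snippet is skipped because it would exceed the budget, while the shorter third one still fits. The point of the design is that the budget is set deliberately, well below what the window could technically hold.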

Harbor is built around this distinction. The SQLite layer stores everything. The MCP server retrieves selectively — using tools like search_knowledge, read_document, and query_database to find what's actually relevant, rather than returning a full export of your notes. When Claude or another model queries Harbor, it doesn't receive a dump of your life. It receives a small, structured, precise answer to what it asked.

That's not a limitation. It's the design.

The constraint worth keeping

There's something counterintuitive worth sitting with here. Limits on what the AI sees aren't just a workaround for engineering constraints. They may be genuinely beneficial.

When I give an AI more context than it needs, the quality of its responses tends to get worse, not better. Not always — sometimes background matters. But more often, noise drowns out signal. The model picks up something tangential. A preference I mentioned once overrides a pattern I've expressed consistently for years. Recency in the context window matters more to the model than actual importance.

Controlled retrieval forces the question: what is actually relevant here? A knowledge base that can answer that question, and pass only the answer to the model, produces something more useful than a very long window filled arbitrarily.

Gemini 1.5's million-token context is a genuine engineering achievement, and context windows will keep expanding. But the bottleneck in personal AI memory was never size. It was always retrieval — and behind retrieval, structure. Knowledge that isn't organized can't be retrieved reliably. And knowledge that can't be retrieved reliably doesn't make the AI smarter. It just makes the context longer.

Asgeir Albretsen is the founder of Harbor.
