Treating the LLM context window like memory: a demand‑paging proxy that cuts wasted tokens
This paper argues that a large language model’s context window is not general memory but a tiny L1 cache. The authors measured 857 production sessions and about 4.45 million effective input tokens and found 21.8% of those tokens were “structural waste.” That waste comes from three sources the paper identifies: unused tool schemas (11.0%), duplicated content (2.2%), and stale tool results (8.7%) that get reprocessed with a median amplification of 84.4×.
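The waste accounting above can be sketched as a simple sum over per-category token counts. This is a minimal illustration, not the paper's measurement code; the category names and the example counts are invented, chosen only to mirror the ~21.8% aggregate figure.

```python
from dataclasses import dataclass

@dataclass
class TurnStats:
    unused_schema_tokens: int   # tool schemas never invoked this session
    duplicate_tokens: int       # content repeated verbatim across turns
    stale_result_tokens: int    # old tool results re-sent every turn
    total_input_tokens: int     # effective input tokens for the session

def structural_waste(stats: TurnStats) -> float:
    """Fraction of effective input tokens that carry no new information."""
    wasted = (stats.unused_schema_tokens
              + stats.duplicate_tokens
              + stats.stale_result_tokens)
    return wasted / stats.total_input_tokens

# Illustrative numbers scaled to match the reported breakdown.
s = TurnStats(unused_schema_tokens=110, duplicate_tokens=22,
              stale_result_tokens=86, total_input_tokens=1000)
print(f"{structural_waste(s):.1%}")  # → 21.8%
```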
To fix this, the team built Pichay, a demand‑paging system that sits between client programs and the model provider. Implemented as a transparent proxy, Pichay evicts stale context, watches for the model to request content that was evicted (a “page fault”), and then reloads it. Pages that cause faults are “pinned” so they stay available in a working set. In offline replay across 1.4 million simulated evictions the measured fault rate was 0.0254%. In a live deployment over a 681‑turn session Pichay reduced context consumption by up to 93% (from 5,038 KB to 339 KB) and kept the session running with a 97% eviction rate under normal pressure.
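The evict/fault/reload/pin cycle described above can be sketched in a few lines. This is a hedged illustration of the mechanism, not Pichay's actual implementation or API: the class, the page-ID format, and the fault-detection regex are all assumptions made for the example.

```python
import re

class PagedContext:
    """Toy model of a demand-paging context proxy."""
    def __init__(self):
        self.resident = {}   # page_id -> full text, sent to the model
        self.evicted = {}    # page_id -> full text, held by the proxy only
        self.pinned = set()  # pages that survive future eviction passes

    def evict(self, page_id: str) -> str:
        """Replace a resident page with a short summary handle."""
        text = self.resident.pop(page_id)
        self.evicted[page_id] = text
        return f"[evicted page {page_id}: {text[:40]}...]"

    def handle_fault(self, model_output: str) -> list[str]:
        """Detect references to evicted pages, then reload and pin them."""
        reloaded = []
        for page_id in re.findall(r"page (\w+)", model_output):
            if page_id in self.evicted:
                self.resident[page_id] = self.evicted.pop(page_id)
                self.pinned.add(page_id)  # fault-driven pinning: one fault pins
                reloaded.append(page_id)
        return reloaded

ctx = PagedContext()
ctx.resident["a1"] = "tool result: 500 lines of build log output ..."
handle = ctx.evict("a1")                        # page leaves the window
faults = ctx.handle_fault("I need page a1 again")
print(faults, "a1" in ctx.pinned)               # → ['a1'] True
```

After the fault, page `a1` is resident again and pinned, so a later eviction pass under the same policy would skip it; that is how repeated use grows the working set.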
At a high level the system borrows ideas from operating systems’ virtual memory. When the proxy removes content to save space, that removal is like evicting a page from cache. If the model later needs that content, the proxy detects the access and re‑injects it, which is like a page fault and reload. Repeated faults teach the proxy which pages are truly part of the working set; one design choice the authors report is “fault‑driven pinning,” where a single fault can promote a page to stay resident. The proxy also emits short eviction summaries that serve as retrieval handles the model can understand rather than sending the full evicted text every turn.
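The "eviction summary as retrieval handle" idea can be made concrete with a small sketch: instead of resending the full evicted text each turn, the proxy leaves a one-line stub the model can cite back to trigger a reload. The stub format, the hash-based page ID, and the `recall` phrasing here are invented for illustration; the paper's actual handle format is not specified in this summary.

```python
import hashlib

def make_handle(page_text: str) -> tuple[str, str]:
    """Return (page_id, stub) for an evicted page."""
    page_id = hashlib.sha1(page_text.encode()).hexdigest()[:8]
    first_line = page_text.splitlines()[0][:60]   # cheap human-readable hint
    stub = f"<evicted id={page_id} hint={first_line!r} say 'recall {page_id}' to reload>"
    return page_id, stub

page = "npm test output\nFAIL src/auth.test.ts\n... 400 more lines"
pid, stub = make_handle(page)
# The stub (tens of tokens) replaces the full page (potentially thousands of
# tokens) in every subsequent turn until the model asks for it back.
print(stub)
```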