RAG memory
Speculative decoding without rollbacks.
Speculative decoding accelerates large language model inference by drafting candidate tokens with a smaller model and verifying them with the target model. Existing approaches suffer from a fundamental tension: aggressive drafting produces high speedup but requires expensive rollbacks when speculation fails, while conservative drafting wastes the parallel verification budget.
We introduce a rollback-free variant that treats every draft as a soft commit — verification reshuffles the speculation tree in place without discarding context. Across LLaMA-3 70B, DeepSeek-V3, and Qwen-72B, we observe 2.1–3.4× speedup over vanilla speculative decoding, with no degradation in output quality.
1.1 The rollback problem
Prior work on speculative decoding assumes a strict accept-reject regime: every draft token is either accepted whole or fully rejected. This over-commits to draft-model uncertainty. In production traffic, we measure rollback rates of 18–34% for non-trivial prompts, meaning that between roughly a fifth and a third of draft work is thrown away.
The rollback cost compounds with model size. For 70B+ models the verifier KV cache must be rewound across multiple layers, which dominates inference cost beyond batch size 8.
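A minimal sketch of the strict accept-reject regime described above, for illustration only. It assumes greedy decoding and models each network as a plain next-token function; `speculative_step`, `toy_draft`, and `toy_target` are hypothetical names, and verification is written as a sequential loop for readability where a real system would use one parallel forward pass. It does not implement the rollback-free variant.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # hypothetical greedy next-token function


def speculative_step(
    context: List[Token], draft: NextToken, target: NextToken, k: int
) -> Tuple[List[Token], int]:
    """Run one draft-then-verify step.

    Returns (tokens emitted this step, number of draft tokens rolled back).
    """
    # Draft k candidate tokens autoregressively with the small model.
    drafted: List[Token] = []
    for _ in range(k):
        drafted.append(draft(context + drafted))

    # Verify: keep the longest prefix the target model agrees with.
    accepted: List[Token] = []
    for tok in drafted:
        expected = target(context + accepted)
        if tok != expected:
            # First mismatch: emit the target's token, discard this draft
            # token and every draft token after it.
            matched = len(accepted)
            accepted.append(expected)
            return accepted, len(drafted) - matched
        accepted.append(tok)
    return accepted, 0


def toy_target(ctx: List[Token]) -> Token:
    # Stand-in for the large model: next token depends only on context length.
    return len(ctx) % 5


def toy_draft(ctx: List[Token]) -> Token:
    # Stand-in for the small model: agrees with the target except at length 3.
    return 9 if len(ctx) == 3 else len(ctx) % 5


tokens, rolled_back = speculative_step([0, 1], toy_draft, toy_target, k=4)
print(tokens, rolled_back)  # [2, 3] 3
```

In this toy run, one draft token is accepted and three of the four are discarded, which is the wasted work the rollback rate measures.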
From this paper
Read-write loops in agentic systems.
01 What it is
An agentic system built on read-write loops treats its memory as a first-class data structure, not an afterthought. Every tool call writes back into the same memory store the planner reads from on the next turn (→ frag 0421), closing the gap between "what I just learned" and "what I think about next."
This differs from purely read-only retrieval (classical RAG), where the model fetches from a static corpus and discards what it generated → frag 0312. Read-write loops make the agent's own outputs part of the retrievable surface, which is what makes long-horizon tasks viable past ~10 turns.
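A rough sketch of the loop shape, under stated assumptions: `MemoryStore`, `run_turn`, and `echo_tool` are hypothetical names, and retrieval is a naive keyword match standing in for whatever a real system would use (likely embeddings). The point is only that the read and the write hit the same store.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MemoryStore:
    """Hypothetical shared store: the planner reads it, tool calls write to it."""

    entries: List[str] = field(default_factory=list)

    def write(self, text: str) -> None:
        self.entries.append(text)

    def read(self, query: str, limit: int = 3) -> List[str]:
        # Naive keyword match, newest first (assumption: stands in for real retrieval).
        words = query.lower().split()
        hits = [e for e in reversed(self.entries) if any(w in e.lower() for w in words)]
        return hits[:limit]


def run_turn(
    memory: MemoryStore, goal: str, tool: Callable[[str, List[str]], str]
) -> str:
    recalled = memory.read(goal)   # read: may include the agent's own earlier outputs
    result = tool(goal, recalled)  # act: placeholder for a real tool call
    memory.write(result)           # write back into the same store
    return result


def echo_tool(goal: str, recalled: List[str]) -> str:
    # Placeholder tool that just reports how much prior context it was given.
    return f"{goal}: seen {len(recalled)} prior notes"


memory = MemoryStore()
print(run_turn(memory, "index repo", echo_tool))  # index repo: seen 0 prior notes
print(run_turn(memory, "index repo", echo_tool))  # index repo: seen 1 prior notes
```

In a read-only setup the second turn would also see zero prior notes, because nothing the agent produced would be retrievable.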
02 What surprised past-you
You started reading this from a different angle — "how do agents avoid retracing their own steps?" The answer turned out to be the same as the memory question, just shaped differently.
The same primitive, soft-commit memory (→ frag 0843), appears in three places you weren't expecting:
- Speculative decoding: draft tokens are commits, verification reshuffles. No rollback needed.
- Agentic planning: tool outputs are commits, the next plan reshuffles. Same shape.
- Postgres MVCC: every write is a soft commit until VACUUM. Old idea, new context. → frag 0612
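The shared shape in the three bullets above can be sketched as a toy append-only store: writes add a version instead of overwriting, reads see the newest version, and a separate cleanup pass reclaims what was superseded. `SoftCommitStore` and its methods are hypothetical names; this is an analogy only, not how Postgres implements MVCC or how the paper reshuffles its speculation tree.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class SoftCommitStore:
    """Toy append-only store: a write never overwrites, it adds a version."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[str]] = defaultdict(list)

    def write(self, key: str, value: str) -> None:
        # Soft commit: the old value stays around until vacuum() runs.
        self._versions[key].append(value)

    def read(self, key: str) -> Optional[str]:
        versions = self._versions.get(key)
        return versions[-1] if versions else None

    def history(self, key: str) -> List[str]:
        return list(self._versions.get(key, []))

    def vacuum(self) -> int:
        """Drop superseded versions; return how many were reclaimed."""
        reclaimed = 0
        for key, versions in self._versions.items():
            reclaimed += len(versions) - 1
            self._versions[key] = versions[-1:]
        return reclaimed


store = SoftCommitStore()
store.write("plan", "search docs")
store.write("plan", "search docs, then summarize")
print(store.read("plan"))     # search docs, then summarize
print(store.history("plan"))  # ['search docs', 'search docs, then summarize']
print(store.vacuum())         # 1
```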
03 Open threads
- How do you bound the write rate? Pinecone v3 made per-user writes 4× cheaper, but the unit economics still flip at ~50 writes/turn. → frag 0721
- What does eviction look like? No paper you've captured discusses which memories to forget. Likely an open gap. Status: unanswered.
- Does your prompt-cache note apply here? simonw's piece argued caching solves hot-context. That's half of read-write. Worth merging into this page.
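On the first thread above, one possible way to bound the write rate is a fixed per-turn budget that keeps only the highest-scoring candidate writes. This is a sketch of one option, not an answer from any captured source: `select_writes` is a hypothetical helper, the scores are assumed to come from some relevance heuristic, and the budget of 50 simply mirrors the ~50 writes/turn figure mentioned in that thread.

```python
from typing import List, Tuple


def select_writes(candidates: List[Tuple[float, str]], budget: int) -> List[str]:
    """Keep at most `budget` candidate writes, highest score first.

    Each candidate is (score, text); how the score is computed is an assumption
    left to the caller.
    """
    if budget <= 0:
        return []
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    return [text for _, text in ranked[:budget]]


candidates = [(0.2, "retry log"), (0.9, "user preference"), (0.5, "tool result")]
print(select_writes(candidates, budget=2))  # ['user preference', 'tool result']
print(len(select_writes(candidates, budget=50)))  # 3
```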