Parent-friendly summaries without paying the LLM twice
The cheapest language-model call is the one you do not make. That sounds like a cost-cutting tip. It is actually the central design constraint of every parent-facing summary in this app, and getting it right turned out to be a question about cache keys, not about prompts.
This is the fourth part of a series on a school-assignment assistant built for one child. By now the raw notes are scraped and the genuinely future items are extracted. This part is about the layer a parent actually reads: a warm weekly summary, a plain-language version of a dense classroom note, a digest of school notices, a quick "what's tonight" brief, a per-subject overview. Five small features that all generate text with a model, and all of which would be slow and needlessly expensive if I generated naively.
The problem refresh creates
A summary is attached to a page. A parent opens the weekly page, reads it, closes it, opens it again that evening to show the other parent. If each open triggers a fresh generation, I have paid for the same paragraph twice and made the parent wait twice, for output that did not change between the two visits. Multiply that across five summary types and a habit of checking the app a few times a day and the waste is most of the bill, for zero added value. None of the underlying data moved.
So every generator goes through one small function whose entire job is to not call the model when it does not have to:
    async function getOrGenerate({ kind, key, maxAgeHours = null }, builder) {
      const cached = db.getAiSummary({ kind, key });
      if (cached) {
        if (!maxAgeHours) return cached.content; // content-keyed: trust it
        const ageMs = Date.now() - new Date(cached.generated_at).getTime();
        if (ageMs < maxAgeHours * 3600 * 1000) return cached.content; // time-keyed: check age
      }
      const content = await builder(); // miss: pay once
      db.saveAiSummary({ kind, key, content });
      return content;
    }

A summary is identified by a kind (which generator) and a key (which instance). On a hit it returns stored text. On a miss it runs the expensive builder once and stores the result. Trivial so far. The interesting part is the key.
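For readers who want to run the function, here is a minimal sketch of the storage contract it relies on. It is an assumption-laden stand-in, not the app's storage layer: `createSummaryStore` and the injectable `now` clock are hypothetical names, and a `Map` replaces whatever table the real `db` writes to. Only the two method names and the `content` and `generated_at` fields come from the code above.

```javascript
// Hypothetical in-memory stand-in for the `db` object used by getOrGenerate.
// A Map is enough to show the contract; the real store is presumably a table.
function createSummaryStore(now = () => new Date()) {
  const rows = new Map(); // composite id -> { kind, key, content, generated_at }
  const idOf = (kind, key) => JSON.stringify([kind, key]);

  return {
    // Returns the stored row, or undefined on a miss.
    getAiSummary({ kind, key }) {
      return rows.get(idOf(kind, key));
    },
    // Inserts or overwrites the row and stamps it with the generation time.
    saveAiSummary({ kind, key, content }) {
      rows.set(idOf(kind, key), {
        kind,
        key,
        content,
        generated_at: now().toISOString(),
      });
    },
  };
}

// Example wiring: a module-level `db`, as getOrGenerate expects.
const db = createSummaryStore();
```

Because a miss comes back as `undefined`, the `if (cached)` guard in `getOrGenerate` works unchanged against this store.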
The key is the invalidation strategy
The naive key for, say, a subject overview would be the subject itself: "mathematics". But then the summary never updates: new maths work lands all term and the cached paragraph is frozen at whatever was true the first time anyone looked. The usual fix is a background job that expires caches when data changes, which is a second moving part to build, run, and debug.
I did not want a second moving part. So instead of invalidating the cache, I derive the key from the data so that new data simply asks a different question:
    const key = `${subjectKey}_${assignments.length}_${events.length}`;

When a new assignment lands, the count changes, the key changes, and the lookup misses, which regenerates against the now-complete data. No invalidation logic, no job, no event bus. As long as every change that matters moves one of those counts, the cache is never wrong, because a stale entry is keyed to a state of the world that no longer exists and is therefore never asked for again. The old rows sit in the table as dead weight, which for a single-family app is a rounding error I am happy to ignore. If it ever mattered, a one-line periodic delete of anything but the latest key per kind would handle it, but it has never mattered.
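The counts only move when items are added or removed. If an edit to an existing item (a moved due date, say) should also force a regeneration, one option is to derive the key from a hash of the content instead. The sketch below is illustrative only: `countKey` and `contentKey` are hypothetical helper names, the sample assignments are made up, and the hash variant assumes the items are plain JSON-serializable objects that arrive in a stable order.

```javascript
const { createHash } = require("node:crypto");

// Count-based key, as in the article: changes when items are added or removed.
function countKey(subjectKey, assignments, events) {
  return `${subjectKey}_${assignments.length}_${events.length}`;
}

// Hypothetical content-based variant: changes when any serialized field changes.
function contentKey(subjectKey, assignments, events) {
  const digest = createHash("sha256")
    .update(JSON.stringify({ assignments, events }))
    .digest("hex")
    .slice(0, 16); // shortened so the key stays readable
  return `${subjectKey}_${digest}`;
}

// Made-up data: the same assignment before and after its due date moves.
const before = [{ title: "Fractions worksheet", due: "2024-03-08" }];
const edited = [{ title: "Fractions worksheet", due: "2024-03-11" }];

// true: the edit is invisible to the count-based key
console.log(countKey("mathematics", before, []) === countKey("mathematics", edited, []));
// false: the edit changes the hash, so the lookup misses
console.log(contentKey("mathematics", before, []) === contentKey("mathematics", edited, []));
```

The trade-off is that a hashed key regenerates on any change at all, including ones that would not alter the summary, so the count-based key remains the cheaper choice when additions are the only changes that matter.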
Two kinds of summary do not fit the content-keyed pattern, and they are the reason getOrGenerate also accepts a max age. The "what's tonight" brief is the clearest case: it is keyed to the date, but I want it to refresh through the day as the evening's context shifts, so it carries a time-to-live. The content-derived key handles "regenerate when the work changes"; the TTL handles "regenerate as the day moves." Most generators want the first. A couple want the second. The same small function serves both because the only difference is whether you also check the age.
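Pulled out on its own, the two-mode freshness check looks roughly like the sketch below. It mirrors the branch inside getOrGenerate but is not the app's code: `isFresh`, the injectable `nowMs` parameter, the sample rows, and the TTL values of 6 and 3 hours are all illustrative assumptions.

```javascript
const HOUR_MS = 3600 * 1000;

// Decides whether a cached row can be served.
// No maxAgeHours  -> content-keyed: any hit is valid.
// maxAgeHours set -> time-keyed: valid only while younger than the limit.
function isFresh(cached, maxAgeHours = null, nowMs = Date.now()) {
  if (!cached) return false;
  if (!maxAgeHours) return true;
  const ageMs = nowMs - new Date(cached.generated_at).getTime();
  return ageMs < maxAgeHours * HOUR_MS;
}

// Hypothetical rows: a subject overview and a "what's tonight" brief.
const noon = Date.parse("2024-03-04T12:00:00Z");
const overview = { content: "Overview text", generated_at: "2024-02-01T09:00:00Z" };
const tonight = { content: "Tonight text", generated_at: "2024-03-04T08:00:00Z" };

console.log(isFresh(overview, null, noon)); // true: weeks old, but content-keyed
console.log(isFresh(tonight, 6, noon));     // true: 4 hours old, limit is 6
console.log(isFresh(tonight, 3, noon));     // false: 4 hours old, limit is 3
```

Passing the clock in as a parameter is only there to make the behaviour easy to check; the function in the article reads `Date.now()` directly.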
What the prompts are actually for
With the economics handled by one function, the prompts get to be about voice rather than plumbing. These summaries are read by a tired parent, not an analyst, so every generator's system prompt pushes hard in the same direction: warm, specific, concise, and grounded. The weekly summary is told to "sound like a thoughtful teacher, kind and observant, never generic." The plain-language one is told to rewrite a dense note so a parent understands it in thirty seconds and can ask the child about it at dinner. The notices digest is told to surface only what a parent must act on and to skip the thank-yous.
The recurring instruction across all of them is "be specific to the notes, do not invent." A warm summary that hallucinates a detail is worse than no summary, because it sounds exactly as confident as a true one. So the prompts ask for concrete observations drawn from the actual text, and the structured output shape keeps them honest: short headline, a couple of sentences of body, a few labeled highlights. There is not much room to wander when the schema only has room for the truth.
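To make "the schema only has room for the truth" concrete, here is a sketch of what a shape check for that output could look like. Everything specific in it is an assumption for illustration: the field names `headline`, `body`, and `highlights`, the `label` and `text` keys, and the limits of 80 characters and 4 highlights are hypothetical, not the app's actual schema. A check like this cannot verify facts; it only keeps the output inside a small frame.

```javascript
// Hypothetical shape check for a generated summary.
// Returns a list of problems; an empty array means the shape is valid.
function validateSummary(value) {
  if (value === null || typeof value !== "object") {
    return ["summary must be an object"];
  }
  const errors = [];
  const isText = (v) => typeof v === "string" && v.trim().length > 0;

  if (!isText(value.headline)) {
    errors.push("headline must be a non-empty string");
  } else if (value.headline.length > 80) {
    errors.push("headline must be 80 characters or fewer");
  }
  if (!isText(value.body)) errors.push("body must be a non-empty string");

  if (!Array.isArray(value.highlights)) {
    errors.push("highlights must be an array");
  } else {
    if (value.highlights.length > 4) errors.push("at most 4 highlights");
    value.highlights.forEach((h, i) => {
      if (!h || !isText(h.label) || !isText(h.text)) {
        errors.push(`highlight ${i} needs a label and text`);
      }
    });
  }
  return errors;
}

// Made-up example of a well-formed summary.
const sample = {
  headline: "A steady week in maths",
  body: "Fractions practice continued. One worksheet is due Friday.",
  highlights: [{ label: "Due Friday", text: "Fractions worksheet" }],
};
console.log(validateSummary(sample)); // []
```

A builder could run a check like this before its result is saved, so a malformed generation is retried rather than cached.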
The shape of a small LLM feature
Strip these five features down and they are the same three pieces: a derived cache key that makes staleness impossible, a single guard that turns "should I generate" into one branch, and a prompt that spends its words on tone and grounding because the cost problem is already solved elsewhere. None of that is novel. The point is the order of operations. I see a lot of LLM features that start with an elaborate prompt and bolt caching on later as an optimization. Here the cache key was the design, and the prompt was the easy part that came after. For anything a user will refresh, that order is the right one.
There is one feature left that does not fit this caching model at all, because the parent's question is different every time and the right context cannot be guessed ahead of time. The next part is about that: a small question-answering agent that starts with a two-week window and reaches further back on its own when the question demands it.
