
Parent-friendly summaries without paying the LLM twice

June 20, 2026 · 7 min read

The cheapest language-model call is the one you do not make. That is not a cost-cutting tip; it is the central design constraint behind every parent-facing summary in this app, and getting it right turned out to be a question of LLM summary caching strategy, not of prompts.

This is the fourth part of a series on a school-assignment assistant built for one child. By now the raw notes are scraped and the genuinely-future items are extracted. This part is about the layer a parent actually reads: a warm weekly summary, a plain-language version of a dense classroom note, a digest of school notices, a quick "what's tonight" brief, a per-subject overview. These are five small features that all generate text with a model, and all of them would be slow and needlessly expensive if I generated them naively.

TL;DR

Generating LLM summaries on every page load is slow and wasteful when the underlying data has not changed. The fix is a single cache function, getOrGenerate, that keys each summary to a fingerprint derived from the data itself, so a stale entry is simply never requested again and no invalidation logic is needed. For time-sensitive summaries, the same function accepts a max-age TTL instead. The design work is in the cache key; the prompt is comparatively easy.

The problem refresh creates

A summary is attached to a page. A parent opens the weekly page, reads it, closes it, opens it again that evening to show the other parent. If each open triggers a fresh generation, I have paid for the same paragraph twice and made the parent wait twice, for output that did not change between the two visits. Multiply that across five summary types and a habit of checking the app a few times a day and the waste is most of the bill, for zero added value. None of the underlying data moved.

So every generator goes through one small function whose entire job is to not call the model when it does not have to:

// Returns a cached summary when one is usable; otherwise runs `builder` once and stores the result.
async function getOrGenerate({ kind, key, maxAgeHours = null }, builder) {
  const cached = db.getAiSummary({ kind, key });
  if (cached) {
    if (maxAgeHours == null) return cached.content;   // content-keyed: trust it (a TTL of 0 is not mistaken for "no TTL")
    const ageMs = Date.now() - new Date(cached.generated_at).getTime();
    if (ageMs < maxAgeHours * 3600 * 1000) return cached.content;  // time-keyed: check age
  }
  const content = await builder();                    // miss or expired: pay once
  db.saveAiSummary({ kind, key, content });           // assumed to overwrite any existing row for (kind, key)
  return content;
}

A summary is identified by a kind (which generator) and a key (which instance). On a hit it returns stored text. On a miss it runs the expensive builder once and stores the result. Trivial so far. The interesting part is the key.
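To make the hit and miss paths concrete, here is a minimal runnable sketch. The in-memory `db`, the `fakeBuilder`, and the example keys are all hypothetical stand-ins (the real storage layer and model call are not shown in this post), and the function body is a copy of the hit/miss logic so the sketch runs on its own.

```javascript
// Hypothetical in-memory stand-in for the db layer; the real
// getAiSummary/saveAiSummary presumably read and write a database table.
const rows = new Map();
const db = {
  getAiSummary: ({ kind, key }) => rows.get(`${kind}:${key}`) ?? null,
  saveAiSummary: ({ kind, key, content }) =>
    rows.set(`${kind}:${key}`, { content, generated_at: new Date().toISOString() }),
};

// Copy of the cache guard so this sketch is self-contained.
async function getOrGenerate({ kind, key, maxAgeHours = null }, builder) {
  const cached = db.getAiSummary({ kind, key });
  if (cached) {
    if (maxAgeHours == null) return cached.content;
    const ageMs = Date.now() - new Date(cached.generated_at).getTime();
    if (ageMs < maxAgeHours * 3600 * 1000) return cached.content;
  }
  const content = await builder();
  db.saveAiSummary({ kind, key, content });
  return content;
}

// Stand-in for the model call; counts how often it actually runs.
let modelCalls = 0;
const fakeBuilder = async () => {
  modelCalls += 1;
  return `summary #${modelCalls}`;
};

async function demo() {
  // Same key twice: the second call is served from the cache.
  const first = await getOrGenerate({ kind: "weekly", key: "2026-W25_4" }, fakeBuilder);
  const second = await getOrGenerate({ kind: "weekly", key: "2026-W25_4" }, fakeBuilder);
  // A different key (made-up example: one more item) misses and regenerates.
  const third = await getOrGenerate({ kind: "weekly", key: "2026-W25_5" }, fakeBuilder);
  return { first, second, third, modelCalls };
}
```

Under these assumptions, three reads cost two builder runs: the repeat visit is free, and only the changed key pays.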

Deriving the key from the data instead of invalidating it

The naive key for, say, a subject overview would be the subject: "mathematics". But then the summary never updates: new maths work lands all term and the cached paragraph is frozen at whatever was true the first time anyone looked. The usual fix is a background job that expires caches when data changes, which is a second moving part to build, run, and debug.

I did not want a second moving part. So instead of invalidating the cache, I derive the key from the data so that new data simply asks a different question:

const key = `${subjectKey}_${assignments.length}_${events.length}`;

When a new assignment lands, the count changes, the key changes, and the lookup misses, which regenerates against the now-complete data. No invalidation logic, no job, no event bus. Provided every change that matters also changes a count, the cache is never wrong, because a stale entry is keyed to a state of the world that no longer exists and is therefore never asked for again. The old rows sit in the table as dead weight, which for a single-family app is a rounding error I am happy to ignore. If it ever mattered, a one-line periodic delete of anything but the latest key per kind would handle it, but it has never mattered.
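A count-based key changes when items are added or removed, but an in-place edit, or one item swapped for another, leaves the counts alone. If that ever became a problem, one option (not what the app does, as far as this post describes) would be to hash the records themselves. The sketch below assumes each record has an `id` and an `updatedAt` field; both names are hypothetical.

```javascript
const { createHash } = require("node:crypto");

// Hypothetical record shape: { id, updatedAt }. Neither field name comes
// from the app; they stand in for "something that changes when the row does".
function fingerprintKey(subjectKey, assignments, events) {
  const parts = [
    ...assignments.map((a) => `a:${a.id}:${a.updatedAt}`),
    ...events.map((e) => `e:${e.id}:${e.updatedAt}`),
  ].sort(); // order-independent: the same set of rows gives the same key
  const digest = createHash("sha256").update(parts.join("|")).digest("hex");
  return `${subjectKey}_${digest.slice(0, 16)}`; // shortened for readability
}

// Made-up example data: the second call edits one assignment in place.
const before = fingerprintKey(
  "mathematics",
  [{ id: 1, updatedAt: "2026-06-01" }, { id: 2, updatedAt: "2026-06-02" }],
  [{ id: 1, updatedAt: "2026-06-03" }],
);
const afterEdit = fingerprintKey(
  "mathematics",
  [{ id: 1, updatedAt: "2026-06-01" }, { id: 2, updatedAt: "2026-06-10" }],
  [{ id: 1, updatedAt: "2026-06-03" }],
);
```

Here `before` and `afterEdit` differ even though both counts are unchanged, so the edited state misses the cache the same way a new assignment would. The trade-off is that the key is opaque, where the count-based key can be read at a glance.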

Two kinds of summary do not fit the content-keyed pattern, and they are the reason getOrGenerate also accepts a max age. The "what's tonight" brief, for one, is keyed to the date, but I want it to refresh through the day as the evening's context shifts, so it carries a time-to-live. The content-derived key handles "regenerate when the work changes"; the TTL handles "regenerate as the day moves." Most generators want the first. A couple want the second. The same small function serves both because the only difference is whether you also check the age.
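The difference between the two modes comes down to one freshness check. The sketch below pulls that check into a helper with an injectable clock so it can be exercised directly; `isFresh`, the three-hour TTL, and the sample row are illustrative assumptions, not values from the app.

```javascript
// Returns true when a cached row is still usable.
// maxAgeHours == null means content-keyed: any hit is trusted.
function isFresh(cached, maxAgeHours, now = Date.now()) {
  if (!cached) return false;
  if (maxAgeHours == null) return true;
  const ageMs = now - new Date(cached.generated_at).getTime();
  return ageMs < maxAgeHours * 3600 * 1000;
}

// Made-up row, generated at 15:00 UTC.
const row = { content: "Maths worksheet due tomorrow.", generated_at: "2026-06-20T15:00:00Z" };
const at4pm = Date.parse("2026-06-20T16:00:00Z");
const at8pm = Date.parse("2026-06-20T20:00:00Z");

const freshAfterOneHour = isFresh(row, 3, at4pm);   // 1 hour old, 3-hour TTL
const freshAfterFiveHours = isFresh(row, 3, at8pm); // 5 hours old, 3-hour TTL
const contentKeyed = isFresh(row, null, at8pm);     // no TTL: age is ignored
```

With a three-hour TTL the row is reused at 16:00 and regenerated at 20:00, while the content-keyed call ignores age entirely.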

What the prompts are actually for

With the economics handled by one function, the prompts get to be about voice rather than plumbing. These summaries are read by a tired parent, not an analyst, so every generator's system prompt pushes hard in the same direction: warm, specific, concise, and grounded. The weekly summary is told to "sound like a thoughtful teacher, kind and observant, never generic." The plain-language one is told to rewrite a dense note so a parent understands it in thirty seconds and can ask the child about it at dinner. The notices digest is told to surface only what a parent must act on and to skip the thank-yous.

The recurring instruction across all of them is "be specific to the notes, do not invent." A warm summary that hallucinates a detail is worse than no summary, because it sounds exactly as confident as a true one. So the prompts ask for concrete observations drawn from the actual text, and the structured output shape keeps them honest: short headline, a couple of sentences of body, a few labeled highlights. There is not much room to wander when the schema only has room for the truth.
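As a rough illustration of how a tight output shape can be enforced, the sketch below checks a parsed model response before it is cached. The field names (`headline`, `body`, `highlights` with `label` and `text`) and the limits are hypothetical; the post describes the shape but not the exact schema.

```javascript
// Hypothetical field names and limits; only the overall shape
// (headline, short body, labeled highlights) comes from the text.
function checkSummaryShape(value, { maxHighlights = 4, maxBodyChars = 400 } = {}) {
  if (value === null || typeof value !== "object") return ["summary must be an object"];
  const problems = [];
  if (typeof value.headline !== "string" || value.headline.trim() === "") {
    problems.push("headline must be a non-empty string");
  }
  if (typeof value.body !== "string" || value.body.length > maxBodyChars) {
    problems.push(`body must be a string of at most ${maxBodyChars} characters`);
  }
  if (!Array.isArray(value.highlights) || value.highlights.length > maxHighlights) {
    problems.push(`highlights must be an array of at most ${maxHighlights} items`);
  } else {
    value.highlights.forEach((h, i) => {
      if (typeof h?.label !== "string" || typeof h?.text !== "string") {
        problems.push(`highlights[${i}] needs a string label and text`);
      }
    });
  }
  return problems; // an empty array means the shape is acceptable
}
```

A check like this only limits how far the output can wander; it cannot tell whether a detail was invented, which is why the "do not invent" instruction still has to live in the prompt.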

The shape of a small LLM feature

Strip these five features down and they are the same three pieces: a derived cache key that makes staleness impossible, a single guard that turns "should I generate" into one branch, and a prompt that spends its words on tone and grounding because the cost problem is already solved elsewhere. None of that is novel. The point is the order of operations. I see a lot of LLM features that start with an elaborate prompt and bolt caching on later as an optimization. Here the cache key was the design, and the prompt was the easy part that came after. For anything a user will refresh, that order is the right one.

There is one feature left that does not fit this caching model at all, because the parent's question is different every time and the right context cannot be guessed ahead of time. The next part is about that: a small question-answering agent that starts with a two-week window and reaches further back on its own when the question demands it.

FAQ

How do I avoid calling an LLM twice for the same summary?

Cache the generated output and key it to the content state, such as a count of assignments. When the data changes the key changes, the cache misses, and the model is called exactly once for the new state.

How do I invalidate an LLM cache without a background job?

Derive the cache key from the data itself, for example by including record counts in the key. A stale entry is keyed to a state of the world that no longer exists, so it is never looked up again and no explicit invalidation is needed.

When should I use a TTL instead of a content-derived cache key for LLM output?

Use a TTL when the relevant context shifts with time rather than with data changes, such as a "what's tonight" brief that should refresh as the day progresses. Content-derived keys handle "regenerate when the work changes"; TTLs handle "regenerate as time moves."

Why does caching matter more than prompt engineering for refreshed LLM features?

Any feature a user will reload can silently multiply model calls with zero added value. Solving the cache key first means the prompt only has to handle tone and grounding, not cost control.

How do I structure a minimal LLM caching layer in JavaScript?

A single async function that accepts a kind, a derived key, and an optional maxAgeHours covers both invalidation strategies. On a cache hit it returns stored text; on a miss it runs the builder once, saves the result, and returns it.