
Teaching an LLM to ignore the recap and surface the homework

June 20, 2026 · 8 min read

Here is a real classroom note, lightly paraphrased: "Today learners revised two-digit addition with regrouping and completed pages 14 to 16. Most were able to carry over correctly. We will continue practice. Spelling assessment on Unit 5 words this Friday." Four sentences. Three of them are a recap of a day that has already happened. One of them is the only thing a parent needs to act on. Getting an LLM to extract action items and nothing else means reading all four and surfacing exactly one.

This is the third part of a series on building a school-assignment assistant for one child. Earlier parts covered why it exists and how the raw notes get scraped out of the school portal. This part is about the feature that makes the whole thing worth having, and it is almost entirely an exercise in teaching a language model to throw work away.

TL;DR

Building a school assignment assistant means teaching the LLM to refuse most of what it reads rather than extract everything. Teacher notes are written as after-the-fact recaps, so the prompt spends most of its length on exclusions and tells the model to surface only future action items such as tests, homework, or things to bring. A sentinel row with type "none" is written for every note that yields nothing actionable, which prevents re-processing and keeps the cost of each run from growing with the term's history. Output is normalized in plain code before it touches the database, so hallucinated dates and invented event types never persist.

The value of a school assignment assistant is in the refusal

It is tempting to describe this as "extract events with an LLM," as if the interesting part is the extraction. It is not. Pulling structured items out of text is the easy, well-trodden thing models do. The hard part, the part that decides whether a parent trusts the app or quietly stops opening it, is everything the model declines to surface.

Teachers write these notes after the school day, in the past tense, as a record of what happened. That is correct for them and a trap for me. If I extract generously, the parent gets a list dominated by "revised addition," "completed worksheet," "discussed the water cycle." None of it is wrong. All of it is noise, because none of it is a thing the child has to do. A few days of that and the list is just a feed, and a feed is exactly what I was trying to escape in Part 1. The product is not "show me what happened." It is "show me the handful of things I have to act on." Optimizing for that means optimizing for precision and being willing to pay for it in recall.

Prompting an LLM to extract action items means mostly listing exclusions

So the extraction prompt spends most of its length on exclusions, not instructions. It tells the model the crucial context up front, that these notes are after-the-fact recaps and most of the text is not actionable, and then it draws the line hard:

EXTRACT only genuinely future action items: a test scheduled for a future date, homework to complete at home, an item to bring, a project with a deadline.

DO NOT EXTRACT: anything taught or completed in class today, vague statements like "practice will continue" with no date, learning objectives, anything that already happened.

And then the instruction that does the heavy lifting: "Be strict. When in doubt, mark it as none. A parent should only see items they need to act on." I run it at a low temperature, because I want the same note to classify the same way every time, not creatively. The output is forced JSON: each item carries the source assignment id, a parent-actionable title, an optional date if one is genuinely stated, and an event type from a small fixed set (test, homework, bring_item, project, and so on). This is where it becomes useful: the prompt is less about finding every possible event and more about refusing the ones that do not belong.
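The prompt described above could be assembled roughly as follows. This is a minimal sketch and not the project's actual code: `buildExtractionRequest`, the `note` field, the shape of the returned request object, the `0.1` temperature, and the wording of the context and output-format lines are all assumptions. Only the EXTRACT, DO NOT EXTRACT, and "Be strict" text comes from the post, and `NAMED_EVENT_TYPES` lists only the types the post names, so the real set is larger.

```javascript
// Hypothetical sketch: names and request shape are illustrative assumptions.
// Only the types named in the post; the real fixed set contains more.
const NAMED_EVENT_TYPES = ["test", "homework", "bring_item", "project", "activity", "none"];

const SYSTEM_PROMPT = [
  // Context first (paraphrased): the model must know most of the text is recap.
  "These notes are after-the-fact recaps written by teachers. Most of the text is not actionable.",
  "",
  "EXTRACT only genuinely future action items: a test scheduled for a future date, homework to complete at home, an item to bring, a project with a deadline.",
  "",
  'DO NOT EXTRACT: anything taught or completed in class today, vague statements like "practice will continue" with no date, learning objectives, anything that already happened.',
  "",
  "Be strict. When in doubt, mark it as none. A parent should only see items they need to act on.",
  "",
  // Assumed wording for the output contract.
  "Respond with a JSON array. Each item has: assignment_id, title, " +
    "event_date (YYYY-MM-DD, or null if no date is stated), " +
    "event_type (one of: " + NAMED_EVENT_TYPES.join(", ") + ").",
].join("\n");

// Builds a provider-agnostic request; tagging each note with its id lets the
// model echo the source assignment id back on every item it returns.
function buildExtractionRequest(assignments) {
  const user = assignments
    .map(a => `[assignment_id: ${a.id}]\n${a.note}`)
    .join("\n\n");
  return { temperature: 0.1, system: SYSTEM_PROMPT, user };
}
```

Note how little of the prompt is about what to find. One line says what to extract; the rest narrows it.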

Marking the absence of events is the real design

The non-obvious part is not what happens when the model finds something. It is what happens when it finds nothing, which is most of the time.

Every assignment gets processed exactly once. Before each run I filter to assignments that have no extraction recorded yet, so a note is never sent to the model twice. But "I already looked and there was nothing actionable here" is itself a result I have to store, or I will re-examine the same empty notes forever and pay for the privilege. So when an assignment yields no future items, I write a sentinel row with the type none against that assignment id. It marks the note as seen-and-empty.

// After the model responds, any assignment that produced no real events
// still gets a sentinel so it is never reprocessed.
const processed = new Set(events.map(e => e.assignment_id));
for (const a of unprocessed) {
  if (!processed.has(a.id)) {
    db.insertEvents([{ assignment_id: a.id, event_type: "none", subject_key: a.subject_key }]);
  }
}

This sounds trivial, but it is the difference between a job that costs a few cents a day and one whose cost grows with the entire history of the term. It also means the "did I check this?" question has a definite answer in the database, which matters more than it sounds: when the app shows a quiet day, I can tell whether that is because nothing is due or because the extractor has not run, and those are very different things to a parent. That database answer only works because the project treated the data model as the spine from the start, not as an afterthought.
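The two uses of the sentinel, filtering before a run and telling a quiet day from an unchecked one, can be sketched as below. The function names, the status strings, and the in-memory arrays standing in for database queries are assumptions for illustration; the real project presumably does this with queries against its events table.

```javascript
// Hypothetical sketch: in-memory stand-ins for what would be database queries.

// extractedIds: Set of assignment ids that already have at least one event row,
// whether a real event or a "none" sentinel.
function selectUnprocessed(assignments, extractedIds) {
  return assignments.filter(a => !extractedIds.has(a.id));
}

// Classifies one day's notes from the event rows recorded against them.
function dayStatus(assignmentsForDay, eventRows) {
  const ids = new Set(assignmentsForDay.map(a => a.id));
  const rows = eventRows.filter(r => ids.has(r.assignment_id));
  const seen = new Set(rows.map(r => r.assignment_id));

  // Any note with no row at all means the extractor has not looked at it yet.
  if ([...ids].some(id => !seen.has(id))) return "not_checked";

  // Every note was examined; sentinels alone mean a genuinely quiet day.
  return rows.some(r => r.event_type !== "none") ? "has_items" : "nothing_due";
}
```

Without the sentinel rows, "not_checked" and "nothing_due" would be indistinguishable, since both would look like an assignment with no events.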

Trusting the model exactly as far as the schema

I do not trust the model's output shape, even with forced JSON. The response gets normalized before it touches the database. A date is only kept if it matches a strict YYYY-MM-DD pattern, otherwise it becomes null rather than a hallucinated "next week." The subject is backfilled from the source assignment when the model omits it, because I already know which assignment each item came from and the model does not need to be the authority on that. The type defaults to a safe generic if the model invents one outside the set.

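// EVENT_TYPES: Set of every type the prompt allows (test, homework, bring_item, project, ...).
const events = parsed.map(e => ({
  assignment_id: e.assignment_id || null,
  title: e.title || "Unknown event",
  // Keep a date only when it is a strict YYYY-MM-DD string; anything else becomes null.
  event_date: /^\d{4}-\d{2}-\d{2}$/.test(e.event_date) ? e.event_date : null,
  // Backfill the subject from the source assignment when the model omits it.
  subject_key: e.subject_key || lookupSubject(e.assignment_id),
  // A missing type, or one outside the allowed set, falls back to the generic.
  event_type: EVENT_TYPES.has(e.event_type) ? e.event_type : "activity",
}));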

The principle I keep coming back to with these features: the model is a good reader and an unreliable typist. Let it do the judgment, the reading and the classifying that I could not write rules for, and let plain code own the parts I can verify, the date format, the foreign keys, the enum. The boundary between "ask the model" and "check it in code" is where most of the reliability lives.
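The foreign-key half of that boundary could look something like the sketch below, which runs before normalization. It is an illustration under assumptions, not the project's code: `parseModelItems` and its `{ ok, items }` return shape are hypothetical. It parses the raw response defensively and keeps only items whose `assignment_id` belongs to the batch that was actually sent.

```javascript
// Hypothetical sketch: parse the model's raw text and enforce the foreign key
// in code instead of trusting the model to echo ids correctly.
function parseModelItems(rawText, batch) {
  let parsed;
  try {
    parsed = JSON.parse(rawText);
  } catch {
    // Unparseable output is a failed run, not an empty result.
    return { ok: false, items: [] };
  }
  if (!Array.isArray(parsed)) return { ok: false, items: [] };

  const knownIds = new Set(batch.map(a => a.id));
  const items = parsed.filter(
    e => e !== null && typeof e === "object" && knownIds.has(e.assignment_id)
  );
  return { ok: true, items };
}
```

The `ok` flag is there because of the sentinel logic: if a malformed response were treated as "no events," every note in the batch would be marked seen-and-empty and never retried. A caller would probably skip the sentinel write when `ok` is false.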

Where this leaves the project

This is a strict pipeline by design, and strictness has a real failure mode I should name: it will occasionally drop something genuinely actionable that was phrased ambiguously, because I told it to discard when unsure. I decided a missed item is a better failure than a noisy list, because a noisy list fails silently by training the parent to ignore the app, while a missed item is a single visible miss I can learn from. That trade is not free and it is not obviously right for every app. It is right for this one.

With future events now sitting in their own table, clean and de-noised, the next part is about turning them, and the raw notes behind them, into something warm enough to actually read: parent-facing summaries that do not cost me a model call every time someone refreshes the page.

FAQ

How do I stop an LLM from extracting past events as action items?

Give the model explicit exclusion rules up front: tell it these notes are after-the-fact recaps, then list what not to extract (things taught in class, completed work, vague continuations with no date). Running at low temperature keeps the classification consistent across identical notes.

How do I avoid reprocessing the same notes with an LLM every time the job runs?

Write a sentinel row to the database for every note that yields no actionable events, not just the ones that do. Filtering to assignments with no extraction record before each run means each note is sent to the model exactly once, keeping cost flat as history grows.

How do I prevent LLM hallucinated dates from entering my database?

Validate the model's output in plain code before any database write. Accept a date only if it matches a strict YYYY-MM-DD pattern; otherwise store null. Backfill fields you already know from the source record rather than trusting the model to repeat them correctly.

Should I optimize an LLM extraction pipeline for precision or recall?

For a parent-facing action-item list, precision is the right trade-off. A noisy list fails silently by training users to ignore the app, while an occasional missed item is a single visible failure you can learn from. That is why the prompt tells the model to mark an item as none when in doubt.

How do I force structured JSON output from an LLM and keep it safe?

Use forced JSON output with a fixed enum for event types, then normalize every field in code before it reaches the database. Default unknown types to a safe generic, backfill known foreign keys from the source record, and reject any value that fails format validation.