Internal AI

Ask the app: a Q&A agent over a kid's school year

June 20, 2026 · 8 min read

A parent types "What was that book the teacher mentioned in February?" and I have no idea what context to pre-fetch for that. Every feature I built before this one answered a question I could predict at build time: what is due, what happened this week, what does this note mean. I could fetch exactly the right data and cache the answer. A free-text box breaks that entirely. "How is she doing in maths this term?" "Did they ever send the field-trip form?" The pattern that made this tractable is a bounded LLM agent: a small, fixed set of tools, a hard cap on how many times it can loop, and a final iteration forced to return an answer no matter what.

This is the fifth part of a series on a school-assignment assistant for one child. The earlier features were retrieval I could plan. This one is a small agent pointed at a SQLite database of one kid's school year, allowed to look things up until it can answer. This is the part where I had to decide how much autonomy to give it, and the answer was "some, with a hard ceiling."

TL;DR

Building a Q&A agent over a single child's school year, I skipped the standard vector retrieval pipeline because the corpus is too small to need it. Instead, the agent gets three plain SQL-backed tools, a two-week preloaded context window, and a hard ceiling of four iterations, with the final iteration forced to return an answer rather than request another tool call. This keeps the agent cheap, fast, and guaranteed to terminate.

Why an agent and not a retrieval pipeline

The reflexive architecture for "answer questions over my documents" is retrieval-augmented generation with a vector store: embed everything, embed the question, pull the nearest chunks, stuff them in the prompt. I did not build that, and I want to be clear it was not laziness. It was scale.

The entire corpus is one child's school year. A few hundred classroom notes, some events, a couple hundred notices. That is small enough that I do not need approximate nearest-neighbor search over embeddings to find relevant material. A LIKE query over the text finds "multiplication" just fine. A date range pulls "February." The whole premise of vector retrieval, that the corpus is too large to scan and too fuzzy to query directly, simply does not hold at this size. Adding an embedding pipeline and a vector index would be solving a scale problem I do not have, and I would still have to maintain it. So the tools the agent gets are not semantic search. They are the plain database queries I would run by hand.
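To make "the plain database queries I would run by hand" concrete, here is a minimal sketch of a parameterized substring search and a date-range query. The table name items and the columns title, description, and date are assumptions for illustration, not the app's actual schema, and the functions only build the SQL and parameters rather than running them.

```javascript
// Hypothetical schema: items(title, description, date). Names are illustrative only.

// Escape LIKE wildcards so a term such as "50%" is matched literally.
function escapeLike(term) {
  return term.replace(/[\\%_]/g, (ch) => "\\" + ch);
}

// Build a parameterized substring search over title and description.
function buildSearchQuery(term) {
  const pattern = "%" + escapeLike(term) + "%";
  return {
    sql:
      "SELECT * FROM items " +
      "WHERE title LIKE ? ESCAPE '\\' OR description LIKE ? ESCAPE '\\' " +
      "ORDER BY date DESC",
    params: [pattern, pattern],
  };
}

// Build a parameterized date-range query. Dates are ISO strings (YYYY-MM-DD),
// which compare correctly as plain text.
function buildDateRangeQuery(startDate, endDate) {
  return {
    sql: "SELECT * FROM items WHERE date BETWEEN ? AND ? ORDER BY date DESC",
    params: [startDate, endDate],
  };
}
```

Under these assumptions, buildSearchQuery("multiplication") covers the keyword case and buildDateRangeQuery("2026-02-01", "2026-02-28") covers "February", with no embedding step anywhere.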

Start loaded, expand on demand

The design that worked is "give it a sensible default, let it ask for more." When a question comes in, I do not start the model empty. I pre-load the last two weeks of assignments, events, and notices into the first prompt, because most questions a parent asks are about the recent past and that context answers them in one shot, with no tool calls and no extra latency.
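A rough sketch of that preload step, assuming records shaped like { kind, date, title } with ISO dates. The record shape, the function names, and the prompt wording are all hypothetical; only the two-week window comes from the description above.

```javascript
// Hypothetical record shape: { kind, date: "YYYY-MM-DD", title }.
const PRELOAD_DAYS = 14; // the two-week default window

// Return the ISO date (YYYY-MM-DD) that falls `days` before `today`.
function daysBefore(today, days) {
  const d = new Date(today + "T00:00:00Z");
  d.setUTCDate(d.getUTCDate() - days);
  return d.toISOString().slice(0, 10);
}

// Keep only records inside the default window, newest first.
function preloadContext(records, today) {
  const start = daysBefore(today, PRELOAD_DAYS);
  return records
    .filter((r) => r.date >= start && r.date <= today)
    .sort((a, b) => b.date.localeCompare(a.date));
}

// Build the first prompt: instructions, preloaded context, then the question.
function buildInitialMessages(question, records, today) {
  const lines = preloadContext(records, today).map(
    (r) => `${r.date} [${r.kind}] ${r.title}`
  );
  return [
    {
      role: "system",
      content:
        "Answer from the context when it covers the question; otherwise call one focused tool.",
    },
    {
      role: "user",
      content: `Recent items:\n${lines.join("\n")}\n\nQuestion: ${question}`,
    },
  ];
}
```

Anything older than the window stays out of the first prompt and is reachable only through a tool call, which keeps the common case cheap.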

If the question reaches further than two weeks, the model has three tools to widen its own view:

const ASK_TOOLS = [
  { name: "expand_history",      // a wider date range when the question reaches back
    params: ["start_date", "end_date"] },
  { name: "search",              // substring match over all assignments and notices, any date
    params: ["query"] },
  { name: "get_subject_history", // everything for one subject, newest first
    params: ["subject_key"] },
];

Each tool is a thin wrapper over a query that already existed for other parts of the app. search runs a substring match across titles, descriptions, and notice content. get_subject_history pulls one subject's full record. The model reads the question, decides whether its loaded context is enough, and if not, picks a tool. "How is she doing in maths" triggers get_subject_history. "The book from February" triggers expand_history or search. The system prompt nudges it toward one focused tool call over guessing, and toward answering immediately when the initial window already covers the question.
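One way to wire the three tool names to their queries is a plain dispatch table. The sketch below uses a made-up in-memory array in place of the SQLite tables, so the rows, the field names, and the runTool helper are all illustrative assumptions.

```javascript
// Made-up sample rows standing in for the database.
const ITEMS = [
  { date: "2026-02-11", subject_key: "english", title: "Reading note", description: "Teacher mentioned a book for half term" },
  { date: "2026-04-15", subject_key: "maths", title: "Classwork", description: "Multiplication tables practice" },
  { date: "2026-05-06", subject_key: "maths", title: "Homework", description: "Long multiplication worksheet" },
];

// One handler per tool name; each is a thin wrapper over a simple filter.
const TOOL_HANDLERS = {
  expand_history: ({ start_date, end_date }) =>
    ITEMS.filter((r) => r.date >= start_date && r.date <= end_date),
  search: ({ query }) => {
    const needle = query.toLowerCase();
    return ITEMS.filter((r) =>
      (r.title + " " + r.description).toLowerCase().includes(needle)
    );
  },
  get_subject_history: ({ subject_key }) =>
    ITEMS.filter((r) => r.subject_key === subject_key).sort((a, b) =>
      b.date.localeCompare(a.date)
    ),
};

// Look up the handler by name; an unknown tool becomes an error result, not a throw.
function runTool(name, args) {
  if (!Object.prototype.hasOwnProperty.call(TOOL_HANDLERS, name)) {
    return { error: `unknown tool: ${name}` };
  }
  return { results: TOOL_HANDLERS[name](args) };
}
```

With these rows, a maths question maps to runTool("get_subject_history", { subject_key: "maths" }), and the February book question can be served by either expand_history or search.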

The hard ceiling is the whole point

An agent that can call tools in a loop can also loop forever, or rack up calls chasing a question that the data simply cannot answer. For a personal app, a runaway loop is not a catastrophe, but it is wasted money and a spinning page, so the loop is bounded hard at four iterations. More importantly, the last iteration is special: I take the tools away and force a JSON response.

const MAX_TOOL_ITERATIONS = 4; // hard ceiling on model turns
for (let i = 0; i < MAX_TOOL_ITERATIONS; i++) {
  const isLast = i === MAX_TOOL_ITERATIONS - 1;
  const response = await chat({
    messages,
    tools: ASK_TOOLS,
    tool_choice: isLast ? "none" : "auto",            // last turn: no more tools
    ...(isLast ? { response_format: { type: "json_object" } } : {}),
  });
  // ...if the model called tools, run them, append results, continue.
  // ...otherwise, parse and return the answer.
}

That single isLast branch is the difference between an agent that always terminates with an answer and one that can hang. Without it, the model could spend its fourth and final turn asking for yet another tool call, the loop would run out with no answer in hand, and the parent would be left staring at a spinner. With it, the worst case is "it gathered what it could in three rounds and then had to answer," which is exactly the behavior I want under uncertainty. The answer it returns carries short source citations, like "Wed 15 Apr classwork," so the parent can see what it leaned on rather than taking a confident paragraph on faith.
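To see the termination guarantee in isolation, here is a self-contained sketch of the loop with the elided branches filled in and a worst-case stub in place of the model. The message and tool-call shapes follow the common OpenAI-style chat format, which is an assumption about the client; askWithCeiling, makeGreedyChat, and the injected runTool are hypothetical names.

```javascript
const MAX_TOOL_ITERATIONS = 4;

// `chat` and `runTool` are injected so the sketch does not depend on a real client.
async function askWithCeiling(messages, { chat, runTool, tools }) {
  for (let i = 0; i < MAX_TOOL_ITERATIONS; i++) {
    const isLast = i === MAX_TOOL_ITERATIONS - 1;
    const message = await chat({
      messages,
      tools,
      tool_choice: isLast ? "none" : "auto", // last turn: no more tools
      ...(isLast ? { response_format: { type: "json_object" } } : {}),
    });
    messages.push(message);

    const calls = message.tool_calls ?? [];
    // No tool calls, or no turns left: whatever the model said is the answer.
    if (calls.length === 0 || isLast) {
      return { answer: message.content, iterations: i + 1 };
    }
    // Otherwise run each requested tool and append its result for the next turn.
    for (const call of calls) {
      const result = await runTool(call.function.name, call.function.arguments);
      messages.push({
        role: "tool",
        tool_call_id: call.id,
        content: JSON.stringify(result),
      });
    }
  }
  throw new Error("unreachable: the last iteration always returns");
}

// Worst-case stub model: asks for a tool on every turn until tools are taken away.
function makeGreedyChat() {
  let turns = 0;
  const chat = async ({ tool_choice }) => {
    turns++;
    if (tool_choice === "none") {
      return { role: "assistant", content: '{"answer":"Not much there.","sources":[]}' };
    }
    return {
      role: "assistant",
      content: null,
      tool_calls: [
        { id: `call_${turns}`, type: "function",
          function: { name: "search", arguments: '{"query":"book"}' } },
      ],
    };
  };
  return { chat, turns: () => turns };
}
```

Even against a stub that never volunteers an answer, the loop makes four model calls and three rounds of tool calls, then returns the forced JSON reply. Parsing that JSON and the citations inside it is left out of the sketch.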

Trusting tool arguments about as far as I trust dates

The same instinct from the extraction part applies here: the model is a good decider and a sloppy typist. It picks the right tool reliably; its arguments need defending. Tool argument JSON is parsed inside a try-catch that falls back to an empty object rather than throwing, an empty search query short-circuits to an error result instead of scanning everything, and a bad subject key returns nothing rather than a crash. The model drives, but every door it opens has been checked from the other side. None of these guards have fired often. All of them mean a malformed tool call degrades to a weaker answer instead of a failed request, which for a parent typing a question one-handed at bedtime is the difference between "hmm, not much there" and a broken page.
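The three guards described above might look roughly like this. It is a sketch under assumptions: the helper names are hypothetical, the subject keys are invented, and the allowlist is just one way to make a bad subject key return nothing.

```javascript
// Parse tool-call arguments; malformed or non-object JSON degrades to {}.
function safeParseArgs(raw) {
  try {
    const parsed = JSON.parse(raw);
    const isPlainObject =
      parsed !== null && typeof parsed === "object" && !Array.isArray(parsed);
    return isPlainObject ? parsed : {};
  } catch {
    return {};
  }
}

// Hypothetical set of valid subject keys.
const KNOWN_SUBJECTS = new Set(["maths", "english", "science"]);

// An empty or missing query short-circuits to an error result
// instead of matching every row.
function guardedSearch(rawArgs, runSearch) {
  const { query } = safeParseArgs(rawArgs);
  if (typeof query !== "string" || query.trim() === "") {
    return { error: "search needs a non-empty query" };
  }
  return { results: runSearch(query.trim()) };
}

// An unknown subject key returns an empty result rather than throwing.
function guardedSubjectHistory(rawArgs, runHistory) {
  const { subject_key } = safeParseArgs(rawArgs);
  if (!KNOWN_SUBJECTS.has(subject_key)) return { results: [] };
  return { results: runHistory(subject_key) };
}
```

In each failure case the model receives an ordinary tool result it can react to, so a malformed call costs one turn of the loop rather than the whole request.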

When the small version is the right version

This is maybe sixty lines of orchestration around three trivial queries, and it answers open-ended questions about a child's year better than a vector pipeline would, because the corpus is small and the queries are exact. The lesson I keep relearning across this whole project: match the machinery to the actual size of the problem. A two-week default window covers the common case for free, three targeted tools cover the long tail, and a hard four-iteration ceiling with a forced final answer keeps it terminating. That is the entire agent, and it is small on purpose.

One question remains, and it is the one that keeps an LLM side project from quietly becoming a budget problem: what does all of this cost, and how would I even know? The last part is about the plumbing that answers that, and the admin page that makes it visible.

FAQ

Why use a SQL agent instead of RAG with a vector store for small datasets?

When the corpus is small enough to query directly, approximate nearest-neighbor search over embeddings adds maintenance overhead without improving results. A substring LIKE query or a date-range filter finds the relevant rows just as well, with no embedding pipeline to manage.

How do you stop an LLM agent from looping forever on tool calls?

Set a hard maximum iteration count and make the final iteration special: remove the tools and force a structured JSON response. This guarantees the agent terminates with an answer even if it has not gathered everything it wanted.

How do I pre-load context for a Q&A agent to reduce tool calls?

Load the most likely relevant data, such as the last two weeks of records, into the first prompt before the agent runs. Most questions are about the recent past, so this default window answers them in one shot with no extra latency or tool calls.

How do I handle bad or malformed tool call arguments from an LLM?

Parse tool argument JSON inside a try-catch that falls back to an empty object, and short-circuit obviously invalid inputs like an empty search query before they hit the database. This degrades a bad tool call to a weaker answer rather than a crashed request.

When is a bounded agent loop better than a retrieval-augmented generation pipeline?

When the corpus is small and the queries are exact, a bounded agent with a few targeted database tools can answer as well as or better than a vector pipeline, without the approximation cost or the embedding maintenance. The agent pattern also handles open-ended, unpredictable questions that a fixed retrieval fetch cannot anticipate.