Bassam Ismail
Engineering

Tracking what the LLM costs in a one-person app

6 min read

The thing that kills a hobby project that talks to a paid API is not a single expensive call. It is the slow leak you do not notice: a loop that runs more often than you thought, a page that regenerates on every refresh, a feature you forgot was even calling the model. You find out at the end of the month, from the invoice, which is the worst possible place to learn it. So before I trusted this app to run unattended, I gave it a way to tell me what it was doing.

This is the last part of a series on a school-assignment assistant built for one child. The previous parts built features that call a language model: extraction, five kinds of summary, a question-answering agent. This part is about the unglamorous layer underneath all of them, the one that turns "what is this costing me" from a monthly surprise into a question I can answer at any moment.

One door, and everything goes through it

The single decision that made cost visible was refusing to call the model SDK directly anywhere in the app. Every call, without exception, goes through one thin wrapper. Nothing else talks to the API.

async function chatCompletion(openai, params, ctx = {}) {
  const start = Date.now();
  const response = await openai.chat.completions.create(params);
  const u = response.usage || {};
  db.recordTokenUsage({
    use_case: ctx.useCase || "unknown",
    model: response.model || params.model,
    prompt_tokens: [REDACTED:secret] || 0,
    completion_tokens: [REDACTED:secret] || 0,
    total_tokens: [REDACTED:secret] || 0,
    duration_ms: Date.now() - start,
    meta: ctx.meta || null,
  });
  return response;
}

That is the whole thing. It calls the model, reads the token usage off the response, and writes a row. The wrapper is boring by design: it adds no behavior, only bookkeeping. But because it is the only path to the model, that bookkeeping is complete. There is no call it misses, no feature that quietly escapes accounting, because there is no other way to reach the API. When I add a feature, I do not have to remember to instrument it. I physically cannot call the model without going through the meter.

The tag is what makes it useful

Recording total tokens would tell me the bill. Recording the use_case tag with each call tells me where the bill comes from, which is the part I can act on. Every call site passes a label: event_extraction, weekly_summary, ask_baheej, plain_language, and so on. The meta field carries a little structured context too, the date, the assignment count, the iteration number.

With that one tag, "what is this costing me" becomes a grouped query instead of a guess. I can see that extraction dominates because it processes every note, that the summaries are nearly free because they are cached and rarely regenerate, that the question-agent is spiky because some questions trigger tool calls and some do not. That is the difference between knowing my total and understanding my spend. A total tells me whether to panic. The breakdown tells me where to look first if I do.

The clearest signal sits right in the code, before a single token is spent: the output budget I allow each feature per call. Extraction gets the most room because it has the most to say across a day's notes. The agent gets a little less. The summaries are kept tight, both for cost and because a parent wants short.

OUTPUT BUDGET PER CALL, BY FEATUREextraction2000 tokask agent900 toksummaries800 tok

These are ceilings I set deliberately, not measured usage, and they are the first lever I would reach for if a feature ran hot: a smaller cap is a smaller worst case. Pairing the intended ceiling with the recorded actual is what tells me whether a feature is behaving, or quietly pushing against its limit every time.

Runs need a log too, for a different reason

Token usage answers "what did the model cost." A second table, the run log, answers a question that matters just as much for an unattended job: did it even run, and did it work. Each scheduled fetch writes a row when it starts and updates it when it finishes, with status and timing.

This is not about money. It is about trust in a system I am not watching. The whole premise from Part 1 was that this should not depend on me being sharp at 9:40pm. That promise is only real if the job actually runs every morning, and the only way I know that without babysitting it is that it leaves a trail. When the app shows a quiet day, the run log is how I tell the difference between "genuinely nothing due" and "the fetch failed and you are looking at stale data." Those look identical on the page and could not be more different to a parent, and the distinction lives entirely in whether the morning's run is recorded as having succeeded.

The admin page is just these two tables, made visible

There is a small admin view, behind a role check, that does nothing clever: it reads the token-usage and run-log tables and shows them. Spend grouped by use case and by day. Recent runs with their status and duration. It is the least sophisticated screen in the project and the one that lets me leave the rest of it alone. I glance at it occasionally, confirm extraction is still the biggest line and nothing new has crept onto the list, see a column of green runs, and close the tab. That glance is the entire return on building the layer, and it is enough.

What the whole series was really about

Six parts, and a pattern runs through all of them that I did not plan and only saw in hindsight. Build for one reader, so you can hard-code what a product would have to generalize. Reach for the dumbest tool that fits the actual size of the problem, the standard-library HTTP client over the headless browser, the substring search over the vector store. Let the model do judgment and let plain code own everything verifiable. Derive your cache keys so staleness cannot happen. And put one meter on the one door, so the thing can run without you and still tell you the truth about itself.

None of these are clever. The school's portal already had all the data; I just gave it a different shape, for one family, and made sure I would notice when it broke. That is the unromantic version of building software for the people you actually know, and after a term of my daughter not forgetting her art supplies, it is the version I would recommend.