Tracking what the LLM costs in a one-person app

June 20, 2026 · 7 min read

The thing that kills a hobby project that talks to a paid API is not a single expensive call. It is the slow leak you do not notice: a loop that runs more often than you thought, a page that regenerates on every refresh, a feature you forgot was even calling the model. You find out at the end of the month, from the invoice, which is the worst possible place to learn it. So before I trusted this app to run unattended, I gave it a way to tell me what it was doing.

This is the last part of a series on a school-assignment assistant built for one child. The previous parts built features that call a language model: extraction, five kinds of summary, a question-answering agent. This part is about the unglamorous layer underneath all of them, the one that turns "what is this costing me" from a monthly surprise into a question I can answer at any moment.

TL;DR

Hobby apps that call paid LLM APIs tend to leak money slowly through forgotten loops and uncached pages, not through a single expensive call. The fix is to route every model call through one thin wrapper that records token usage and a use-case tag to a database row, so spend is always a grouped query rather than a monthly invoice surprise. A second table, the run log, answers whether the scheduled job actually fired and succeeded; without a trail, a failed run looks identical to "nothing happened". Together, these two tables and a simple admin view let an unattended app tell you the truth about what it costs and whether it is working.

One door, and everything goes through it

The single decision that made cost visible was refusing to call the model SDK directly anywhere in the app. Every call, without exception, goes through one thin wrapper. Nothing else talks to the API.

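// The only function in the app that is allowed to call the model SDK.
// `db` is the app's database module; `ctx` carries the use-case tag and optional metadata.
async function chatCompletion(openai, params, ctx = {}) {
  const start = Date.now();
  const response = await openai.chat.completions.create(params);
  // Guard against a response that carries no usage block.
  const u = response.usage || {};
  // One row per call: which feature, which model, how many tokens, how long.
  db.recordTokenUsage({
    use_case: ctx.useCase || "unknown",
    model: response.model || params.model,
    prompt_tokens: u.prompt_tokens || 0,
    completion_tokens: u.completion_tokens || 0,
    total_tokens: u.total_tokens || 0,
    duration_ms: Date.now() - start,
    meta: ctx.meta || null,
  });
  return response;
}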

That is the whole thing. It calls the model, reads the token usage off the response, and writes a row. The wrapper is boring by design: it adds no behavior, only bookkeeping. But because it is the only path to the model, that bookkeeping is complete. There is no call it misses, no feature that quietly escapes accounting, because there is no other way to reach the API. When I add a feature, I do not have to remember to instrument it. I physically cannot call the model without going through the meter.

The tag is what makes it useful

Recording total tokens would tell me the bill. Recording the use_case tag with each call tells me where the bill comes from, which is the part I can act on. Every call site passes a label: event_extraction, weekly_summary, ask_baheej, plain_language, and so on. The meta field carries a little structured context too, the date, the assignment count, the iteration number.
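To make the tag and the meta field concrete, here is a minimal sketch of what a call site might look like. The stub client, the in-memory `db`, the `weeklySummary` function, the model name, and the meta keys are all assumptions made for illustration; only the wrapper itself comes from the listing above.

```javascript
// In-memory stand-in for the app's database module (hypothetical).
const rows = [];
const db = { recordTokenUsage: (row) => rows.push(row) };

// Stub with the same call shape the wrapper uses, so no real API is hit.
const fakeOpenai = {
  chat: {
    completions: {
      create: async (params) => ({
        model: params.model,
        choices: [{ message: { role: "assistant", content: "stubbed reply" } }],
        usage: { prompt_tokens: 120, completion_tokens: 30, total_tokens: 150 },
      }),
    },
  },
};

// The wrapper from the listing above, repeated so this snippet runs standalone.
async function chatCompletion(openai, params, ctx = {}) {
  const start = Date.now();
  const response = await openai.chat.completions.create(params);
  const u = response.usage || {};
  db.recordTokenUsage({
    use_case: ctx.useCase || "unknown",
    model: response.model || params.model,
    prompt_tokens: u.prompt_tokens || 0,
    completion_tokens: u.completion_tokens || 0,
    total_tokens: u.total_tokens || 0,
    duration_ms: Date.now() - start,
    meta: ctx.meta || null,
  });
  return response;
}

// A hypothetical call site: the tag names the feature, meta carries context.
async function weeklySummary(openai, assignments) {
  const response = await chatCompletion(
    openai,
    {
      model: "example-model", // placeholder, not a real model name
      messages: [
        { role: "user", content: `Summarize ${assignments.length} assignments.` },
      ],
    },
    {
      useCase: "weekly_summary",
      meta: { date: "2026-06-15", assignment_count: assignments.length },
    }
  );
  return response.choices[0].message.content;
}
```

The call site never touches the SDK; it only chooses a label and whatever context will make the row readable later.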

With that one tag, "what is this costing me" becomes a grouped query instead of a guess. I can see that extraction dominates because it processes every note, that the summaries are nearly free because they are cached and rarely regenerate, that the question-agent is spiky because some questions trigger tool calls and some do not. That is the difference between knowing my total and understanding my spend. A total tells me whether to panic. The breakdown tells me where to look first if I do.
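The grouped query can be sketched in plain JavaScript over rows shaped like the ones the wrapper writes. The sample numbers are invented, and the SQL in the comment assumes a table named `token_usage`, which the article does not name.

```javascript
// Rows shaped like the wrapper's output; the token counts are made up.
const sampleRows = [
  { use_case: "event_extraction", total_tokens: 1800 },
  { use_case: "event_extraction", total_tokens: 2200 },
  { use_case: "weekly_summary", total_tokens: 400 },
  { use_case: "ask_baheej", total_tokens: 900 },
];

// Roughly equivalent SQL, assuming a hypothetical table name:
//   SELECT use_case, COUNT(*) AS calls, SUM(total_tokens) AS total_tokens
//   FROM token_usage GROUP BY use_case ORDER BY total_tokens DESC;
function tokensByUseCase(rows) {
  const totals = new Map();
  for (const row of rows) {
    const entry =
      totals.get(row.use_case) ||
      { use_case: row.use_case, calls: 0, total_tokens: 0 };
    entry.calls += 1;
    entry.total_tokens += row.total_tokens;
    totals.set(row.use_case, entry);
  }
  // Biggest spender first, so the top line is where to look.
  return [...totals.values()].sort((a, b) => b.total_tokens - a.total_tokens);
}
```

Calling `tokensByUseCase(sampleRows)` puts `event_extraction` at the top of the list, which matches the shape of the breakdown described above.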

The clearest signal sits right in the code, before a single token is spent: the output budget I allow each feature per call. Extraction gets the most room because it has the most to say across a day's notes. The agent gets a little less. The summaries are kept tight, both for cost and because a parent wants short.

These are ceilings I set deliberately, not measured usage, and they are the first lever I would reach for if a feature ran hot: a smaller cap is a smaller worst case. Pairing the intended ceiling with the recorded actual is what tells me whether a feature is behaving, or quietly pushing against its limit every time.
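One way to pair the ceiling with the recorded actual is sketched below. The budget numbers are hypothetical and only preserve the ordering described above (extraction highest, the agent a little lower, summaries tight); the 90% threshold and the `budgetPressure` helper are likewise assumptions. Each cap would be passed as `max_tokens` on the corresponding call.

```javascript
// Hypothetical per-feature output ceilings; the article gives the order, not the values.
const OUTPUT_BUDGETS = {
  event_extraction: 1500,
  ask_baheej: 1000,
  weekly_summary: 400,
};

// For each budgeted use case, count how many calls came close to the cap.
// A feature whose calls are mostly "near cap" is pushing against its limit.
function budgetPressure(rows, budgets, threshold = 0.9) {
  const stats = {};
  for (const row of rows) {
    const cap = budgets[row.use_case];
    if (!cap) continue; // skip rows with no configured ceiling
    if (!stats[row.use_case]) {
      stats[row.use_case] = { calls: 0, nearCap: 0 };
    }
    stats[row.use_case].calls += 1;
    if (row.completion_tokens >= threshold * cap) {
      stats[row.use_case].nearCap += 1;
    }
  }
  return stats;
}
```

If `nearCap` stays close to `calls` for a feature, the output is probably being truncated, and the question becomes whether to raise the cap or ask for less.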

Runs need a log too, for a different reason

Token usage answers "what did the model cost." A second table, the run log, answers a question that matters just as much for an unattended job: did it even run, and did it work. Each scheduled fetch writes a row when it starts and updates it when it finishes, with status and timing.
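The start-then-update pattern might look something like the sketch below. The in-memory `runLog` array stands in for the real table, and `withRunLog`, the column names, and the status strings are assumptions rather than the app's actual schema.

```javascript
// In-memory stand-in for the run-log table (hypothetical schema).
const runLog = [];

// Wrap a scheduled job so it always leaves a row behind.
async function withRunLog(jobName, job, log = runLog) {
  const row = {
    job: jobName,
    started_at: new Date().toISOString(),
    finished_at: null,
    status: "running",
    duration_ms: null,
    error: null,
  };
  // Written before the work starts, so a crash still leaves a "running" row.
  log.push(row);
  const start = Date.now();
  try {
    const result = await job();
    row.status = "success";
    return result;
  } catch (err) {
    row.status = "failed";
    row.error = err && err.message ? err.message : String(err);
    throw err; // the failure is recorded, not swallowed
  } finally {
    row.finished_at = new Date().toISOString();
    row.duration_ms = Date.now() - start;
  }
}
```

Writing the row first is the point of the design: a job that dies halfway is still visible as a run that started and never finished.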

This is not about money. It is about trust in a system I am not watching. The whole premise from Part 1 was that this should not depend on me being sharp at 9:40pm. That promise is only real if the job actually runs every morning, and the only way I know that without babysitting it is that it leaves a trail. When the app shows a quiet day, the run log is how I tell the difference between "genuinely nothing due" and "the fetch failed and you are looking at stale data." Those look identical on the page and could not be more different to a parent, and the distinction lives entirely in whether the morning's run is recorded as having succeeded.
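The "quiet day versus stale data" check could be as small as the sketch below. It assumes run rows carry an ISO `started_at` timestamp and a `status` string, and it compares dates by string prefix, which treats "today" as a UTC date; a real version would need to respect the family's time zone. The function name and return values are made up for illustration.

```javascript
// Decide whether today's page can be trusted, given the run-log rows.
// `today` is a "YYYY-MM-DD" string; rows use ISO timestamps (UTC assumed).
function dataFreshness(runs, today) {
  const todays = runs.filter((r) => r.started_at.startsWith(today));
  if (todays.some((r) => r.status === "success")) {
    return "fresh"; // a quiet page really means nothing is due
  }
  if (todays.length === 0) {
    return "not_run"; // the job never fired today
  }
  return "not_succeeded"; // it started, but failed or is still running
}
```

Anything other than "fresh" is a reason to show a warning next to an empty list instead of letting it pass as good news.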

The admin page is just these two tables, made visible

There is a small admin view, behind a role check, that does nothing clever: it reads the token-usage and run-log tables and shows them. Spend grouped by use case and by day. Recent runs with their status and duration. It is the least sophisticated screen in the project and the one that lets me leave the rest of it alone. I glance at it occasionally, confirm extraction is still the biggest line and nothing new has crept onto the list, see a column of green runs, and close the tab. That glance is the entire return on building the layer, and it is enough.

What the whole series was really about

Six parts, and one pattern runs through all of them that I did not plan and only saw in hindsight: build for one reader, reach for the dumbest tool that fits, let the model do the judgment and plain code own everything verifiable, and put one meter on the one door so the thing can run without you and still tell you the truth about itself.

None of these are clever. The school's portal already had all the data; I just gave it a different shape, for one family, and made sure I would notice when it broke. That is the unromantic version of building software for the people you actually know, and after a term of my daughter not forgetting her art supplies, it is the version I would recommend.

FAQ

How do I track LLM token costs per feature in a small app?

Route every model call through a single wrapper function that reads the usage field off the API response and writes a row to a database, tagging each row with a use-case label. Grouping by that label later turns your total spend into a per-feature breakdown you can act on.

Why tag each LLM call with a use-case label?

A total token count tells you whether to panic; a use-case tag tells you which feature to investigate first. Without the label you can only see the bill, not where the bill comes from.

How do I know if my scheduled LLM job actually ran?

Write a run-log table that records a row when the job starts and updates it with status and duration when it finishes. Without this, a failed fetch and a genuinely quiet day look identical in the UI.

What is the simplest way to prevent untracked LLM API calls in a Node app?

Prohibit direct SDK calls everywhere in the codebase and expose only one wrapper function. Because there is no other path to the API, every call is automatically instrumented and no new feature can quietly escape accounting.

Should I set max_tokens limits per feature to control LLM costs?

Yes. Setting a per-feature output budget is the first lever to reach for if a feature runs hot, because a smaller cap means a smaller worst case. Comparing the intended ceiling against recorded actual usage is how you tell whether a feature is behaving or pushing against its limit every time.