Open-weights LLM evaluation for self-hosted use

June 22, 2026 · 10 min read

At 11:40 p.m., I was staring at a home server that had technically accepted a new open-weights model and practically refused to be useful. The demo prompt looked fine. The public benchmark table looked better than fine. Then my real workload hit it: summarizing clipped articles, drafting shell commands from messy notes, and answering questions over a small local archive. Tokens dribbled out, memory pressure climbed, and one answer confidently invented a flag for a command I run often enough to resent the creativity. That was the point where self-hosted LLM evaluation stopped being a vibes exercise for me. I built a repeatable harness around my own tasks, hardware ceiling, and failure modes, and it changed when I replace a model: not when a leaderboard moves, but when the model clears the jobs I actually run on the box I actually own.

TL;DR

Self-hosted LLM evaluation should measure useful behavior on your hardware, not leaderboard scores earned under cloud API conditions. Before testing a new open-weights release, build a small harness with fixed prompts, expected failure checks, latency and memory limits, and a swap rule. The goal is not a perfect benchmark. It is a repeatable decision process that tells you whether a model is worth running locally.

Self-hosted LLM evaluation needs a harness

Cloud benchmarks answer a respectable question: how does a model perform under controlled test conditions, usually with a provider handling infrastructure, batching, kernels, memory, routing, and the occasional invisible act of mercy. That is not my question.

My question is less glamorous: can this model run on a small server without turning my evening automations into a queueing theory lecture? Can it produce a boring, correct answer for the things I ask every week? Can it fit in memory while the rest of the machine still does machine things, such as serving files and pretending Docker is free?

That difference matters because open-weights models have two evaluation surfaces. One is intelligence in the abstract. The other is local usefulness under constraint. The second surface is where most self-hosting decisions get made, even when people pretend they are making the first one.

I do not need a universal ranking. I need a model swap gate.

The harness I ended up with is deliberately small. It tests about two dozen prompts across four categories: retrieval answers over local notes, summarization of noisy web clips, command drafting, and structured extraction. Each prompt has a short acceptance rule. Some are exact. Some are semantic. Some are anti-rules, which are often more useful: do not invent a CLI flag, do not cite a file that was not retrieved, do not output invalid JSON, do not exceed a latency budget.

That last part is where generic benchmarks quietly fall apart. A model that is 8 percent better on a public reasoning score but takes twice as long on my machine is not an upgrade. It is a very articulate delay.

What I measured instead of leaderboard position

The mistake I kept making was treating model choice like a single-axis problem. It is not. For local use, the model has to clear three gates at once: correctness on my tasks, resource fit on my hardware, and operational behavior under repeated runs.

Task fit

I started by writing down the jobs I actually use the server for. Not aspirational jobs. Not prompts from a launch thread. The ones in shell history and notes.

The categories looked like this:

Category	Example check	Failure that matters
Local Q&A	Answer from retrieved notes	Cites missing source
Summaries	Compress noisy article text	Drops the main caveat
Command help	Produce shell command	Invents option or path
Extraction	Return typed JSON	Invalid schema

The table is small because the harness has to survive contact with laziness. If adding a model takes an afternoon, I will stop doing it. The useful shape is closer to a preflight checklist than a research benchmark.

Hardware fit

For each run, I capture time to first token, total generation time, tokens per second, peak resident memory, and whether the model spilled into a slower path such as swap. My server is not exotic, which is the point. It has enough RAM to run useful quantized models and not enough RAM to forgive sloppy choices.
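The timing half of those metrics can be captured from the client with very little code. The sketch below is a minimal illustration rather than the harness itself: `timeStream` is a hypothetical helper, it assumes the server's streamed output has already been turned into an async iterable of text pieces, and it counts streamed pieces instead of true tokens, so the rate is only an approximation. Peak memory is not covered here because it has to be read from the host or container, not from the client.

```javascript
// Sketch: client-side timing for one streamed generation.
// `chunks` is any async iterable of text pieces (an assumption; how the
// stream is obtained from the local server is deliberately left out).
// `now` is injectable so the function can be tested with a fake clock.
async function timeStream(chunks, now = () => performance.now()) {
  const start = now();
  let firstPieceAt = null;
  let pieces = 0;
  let output = "";

  for await (const piece of chunks) {
    // The first piece marks time to first token as the client sees it.
    if (firstPieceAt === null) firstPieceAt = now();
    pieces += 1;
    output += piece;
  }

  const end = now();
  // Rate is measured from the first piece onward, so prompt processing
  // time does not drag the figure down.
  const generationMs = firstPieceAt === null ? 0 : end - firstPieceAt;
  return {
    output,
    ttft_ms: firstPieceAt === null ? null : firstPieceAt - start,
    elapsed_ms: end - start,
    pieces,
    pieces_per_second: generationMs > 0 ? (pieces * 1000) / generationMs : null,
  };
}
```

If the server reports its own token counts at the end of a response, those are a better numerator than the piece count used here.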

Here is the kind of runner I use. The endpoint is generic, but the shape is real: fixed model name, fixed prompt file, fixed generation parameters, and metrics emitted as JSON lines.

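cd ~/llm-eval
# Fixed endpoint and model name for the whole run
export LLM_HOST="http://127.0.0.1:11434"
export MODEL_NAME="local-model:latest"
# Make sure the output directory exists before the runner writes to it
mkdir -p ./runs
# One timestamped JSON-lines file per run (UTC, so runs sort by name)
node run-eval.mjs \
  --host "$LLM_HOST" \
  --model "$MODEL_NAME" \
  --cases ./cases/self-hosted.jsonl \
  --out "./runs/$(date -u +%Y%m%dT%H%M%SZ).jsonl"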
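For readers who want to see what sits behind a command like that, here is a rough sketch of a runner core. It is not the contents of my `run-eval.mjs`: `parseJsonl`, `runCases`, the injected `generate` and `emit` functions, and the parameter names inside `FIXED_PARAMS` are all assumptions made for illustration. The point is only the shape: the same frozen parameters for every case, and one JSON line of metrics per case.

```javascript
// Assumed generation settings, frozen so every model sees the same ones.
// The key names are illustrative; map them to whatever your server accepts.
const FIXED_PARAMS = Object.freeze({ temperature: 0, seed: 42 });

// Parse a JSON-lines string into case objects, skipping blank lines.
function parseJsonl(text) {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line, index) => {
      try {
        return JSON.parse(line);
      } catch (err) {
        // index counts non-blank lines, starting at 1 in the message
        throw new Error(`bad case on non-blank line ${index + 1}: ${err.message}`);
      }
    });
}

// Run every case once, in order, and emit one JSON line per case.
// `generate` hides the server call; `emit` hides where lines are written.
async function runCases({ model, cases, generate, emit }) {
  for (const testCase of cases) {
    const started = performance.now();
    let output = "";
    let error = null;
    try {
      output = await generate({ model, prompt: testCase.prompt, params: FIXED_PARAMS });
    } catch (err) {
      // A crashed request is recorded, not allowed to abort the run.
      error = err instanceof Error ? err.message : String(err);
    }
    emit(JSON.stringify({
      id: testCase.id,
      model,
      output,
      error,
      elapsed_ms: Math.round(performance.now() - started),
    }));
  }
}
```

Injecting `generate` keeps the loop testable without a model loaded, which matters more than it sounds when the harness itself is the thing being debugged.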

And the service runs with resource limits on purpose. If a model only works when it can bully the machine, that is useful information, not an inconvenience to hide.

services:
  local-llm:
    image: ollama/ollama:latest
    ports:
      - "127.0.0.1:11434:11434"
    volumes:
      - ./models:/root/.ollama
    deploy:
      resources:
        limits:
          memory: 24g
    environment:
      OLLAMA_KEEP_ALIVE: "10m"
      OLLAMA_NUM_PARALLEL: "1"

The single parallel request is not a moral position. It is how I isolate the model. After a model clears the gate, I test concurrency separately. Mixing those two questions early creates the usual debugging fog: is the model bad, is the runner overloaded, or did I accidentally benchmark impatience?

The mechanism that made the harness useful

The harness became useful only after I stopped scoring answers like a tiny professor. I needed decision signals, not elegance.

Keep prompts fixed

Every case is stored as a JSON line with an id, a kind, a prompt, expected properties, and limits. I do not tune prompts per model during the first pass, because that turns evaluation into courtship. A model can get a second pass if it is close, but the first run is cold.

{"id":"cmd-tar-exclude","kind":"command","limit_ms":12000,"prompt":"Create a tar command that archives ./site while excluding ./site/cache and ./site/tmp. Return only the command.","must_include":["tar","--exclude=./site/cache","--exclude=./site/tmp","./site"],"must_not_include":["--ignore-cache","--skip-tmp"]}
{"id":"json-contact","kind":"extract","limit_ms":9000,"prompt":"Extract name, email, and organization from this note as JSON with keys name, email, organization: Dana from Example Studio, [email protected].","json_schema":{"required":["name","email","organization"]}}

This is not academically pure. It is better: it catches failures that cost me time. The command case rejects invented flags because invented flags are not almost correct. They are a small productivity tax with a friendly explanation attached.
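Because each case is just a JSON line, a case with no acceptance rule will pass every model forever. A small validation pass before the run can catch that. The sketch below is one possible shape, assuming the field names from the two cases above; `validateCase` and `findDuplicateIds` are hypothetical helpers, not part of any library.

```javascript
// Return a list of problems for one case; an empty list means it is usable.
function validateCase(testCase) {
  const problems = [];
  if (typeof testCase.id !== "string" || testCase.id === "") {
    problems.push("id must be a non-empty string");
  }
  if (typeof testCase.prompt !== "string" || testCase.prompt === "") {
    problems.push("prompt must be a non-empty string");
  }
  if (!Number.isFinite(testCase.limit_ms) || testCase.limit_ms <= 0) {
    problems.push("limit_ms must be a positive number");
  }
  // A case with nothing to check would silently pass every model.
  const hasRule =
    (testCase.must_include?.length ?? 0) > 0 ||
    (testCase.must_not_include?.length ?? 0) > 0 ||
    testCase.json_schema != null;
  if (!hasRule) problems.push("case has no acceptance rule");
  return problems;
}

// Report ids that appear more than once, since results are keyed by id.
function findDuplicateIds(cases) {
  const seen = new Set();
  const duplicates = new Set();
  for (const { id } of cases) {
    if (seen.has(id)) duplicates.add(id);
    seen.add(id);
  }
  return [...duplicates];
}
```

Running this once at startup costs nothing and keeps a typo in the case file from turning into a suspiciously perfect score.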

Score failures before preferences

The runner records pass, fail, and warning states. Warnings are things I dislike but might tolerate, such as verbosity in a task that asked for a compact answer. Failures are things that make the model unsafe or useless for that workflow: invalid JSON, unsupported command options, missing citations, timeouts, or memory pressure beyond the limit.

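export function scoreCase(testCase, result) {
  const text = result.output.trim();
  const failures = [];

  // Required substrings: every token must appear verbatim in the output.
  for (const token of testCase.must_include ?? []) {
    if (!text.includes(token)) failures.push(`missing:${token}`);
  }

  // Anti-rules: any forbidden token, such as an invented flag, is a failure.
  for (const token of testCase.must_not_include ?? []) {
    if (text.includes(token)) failures.push(`forbidden:${token}`);
  }

  // Structured cases: the whole output must parse as JSON and contain
  // each required key. Extra prose around the JSON counts as invalid.
  if (testCase.json_schema) {
    try {
      const parsed = JSON.parse(text);
      // With required keys present, a primitive or null result makes `in`
      // throw, which also lands in the catch below.
      for (const key of testCase.json_schema.required ?? []) {
        if (!(key in parsed)) failures.push(`missing_key:${key}`);
      }
    } catch {
      failures.push("invalid_json");
    }
  }

  // Latency is a hard gate: a correct answer that arrives late still fails.
  if (result.elapsed_ms > testCase.limit_ms) failures.push("timeout");
  return { ok: failures.length === 0, failures };
}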

Notice what this does not do. It does not ask another model to grade the answer. I use judge models sometimes for exploratory work, but they were the wrong primitive here. A judge introduces another model, another prompt, another set of preferences, and another opportunity for the evaluation to admire fluent nonsense. For a home-server swap gate, deterministic checks cover more ground than they get credit for.
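The scoring function above only produces failures. Warnings can be layered on with the same deterministic approach. The following is a sketch under assumptions: `warnCase` and `stateFor` are hypothetical names, and `max_chars` is an optional case field I am inventing for the example, not one that appears in the case format shown earlier.

```javascript
// Collect soft complaints that should not block a model on their own.
function warnCase(testCase, result) {
  const text = result.output.trim();
  const warnings = [];

  // Verbosity: only checked when the case declares an assumed `max_chars`.
  if (testCase.max_chars != null && text.length > testCase.max_chars) {
    warnings.push(`verbose:${text.length}>${testCase.max_chars}`);
  }

  // "Return only the command" tasks should come back as a single line.
  if (testCase.kind === "command" && text.includes("\n")) {
    warnings.push("multiline_command");
  }
  return warnings;
}

// Failures always win; warnings only matter when nothing failed.
function stateFor(failures, warnings) {
  if (failures.length > 0) return "fail";
  return warnings.length > 0 ? "warning" : "pass";
}
```

Keeping the two lists separate means a warning can be promoted to a failure later by moving one check, without touching the report format.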

The most expensive local model is the one that makes you check its work twice.

Track the boring numbers

Once the correctness gate exists, the hardware metrics become actionable. Without correctness, speed is trivia. With correctness, speed decides whether the model belongs in a daily path or a weekend experiment.

I keep thresholds blunt. A short command task should return within about 12 seconds. A longer summarization can take up to about 30 seconds. Peak memory should leave enough headroom for the rest of the server. These numbers are not portable, but the act of writing them down is.
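Written down as code, blunt thresholds might look like the sketch below. The 12 and 30 second budgets come from the paragraph above; everything else is an assumption for illustration, including the `summary` kind name, the 20 GiB memory ceiling, the choice of p95, and the helper names `percentile` and `checkBudgets`.

```javascript
// Per-kind latency budgets in milliseconds (kind names partly assumed).
const BUDGETS = {
  command: { p95_elapsed_ms: 12000 },
  summary: { p95_elapsed_ms: 30000 },
};

// Assumed ceiling that leaves headroom below the container's memory limit.
const MEMORY_CEILING_MB = 20 * 1024;

// Nearest-rank percentile over a non-empty array of numbers.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Return a list of budget violations; an empty list means the run fits.
function checkBudgets(runs, peakMemoryMb, budgets = BUDGETS) {
  const violations = [];
  for (const [kind, budget] of Object.entries(budgets)) {
    const times = runs.filter((r) => r.kind === kind).map((r) => r.elapsed_ms);
    if (times.length === 0) continue; // no cases of this kind in the run
    const p95 = percentile(times, 95);
    if (p95 > budget.p95_elapsed_ms) violations.push(`${kind}:p95_elapsed_ms=${p95}`);
  }
  if (peakMemoryMb > MEMORY_CEILING_MB) violations.push(`peak_memory_mb=${peakMemoryMb}`);
  return violations;
}
```

With only a handful of cases per kind, p95 is effectively the slowest run, which is the behavior I would want from a gate anyway.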

What I rejected

I rejected three tempting approaches.

First, I rejected choosing by leaderboard rank. Leaderboards are useful for discovery, but they are weak as deployment criteria for local models. They rarely tell me whether a quantized version on my hardware will behave well inside my latency budget.

Second, I rejected full manual review. I still read samples, especially after a close run, but reading every answer does not scale and tends to reward confident prose. The point of the harness is to catch boring regressions before my taste gets involved.

Third, I rejected one giant score. A model that is excellent at summarization and bad at command help should not be averaged into ambiguity. I want a small report that tells me where it can be used.

{
  "model": "local-model:latest",
  "passed": 19,
  "failed": 3,
  "warnings": 4,
  "median_tokens_per_second": 18.7,
  "p95_elapsed_ms": 21400,
  "peak_memory_mb": 18432,
  "usable_for": ["summaries", "local_qa"],
  "blocked_for": ["command_help"]
}
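A report in that shape can be assembled from per-case states without any averaging. The sketch below is an illustration, not my actual reporting code: `buildReport` is a hypothetical function, it assumes each scored case carries a `category` and a `state` of `pass`, `warning`, or `fail`, it treats a warning as a pass that is also counted under warnings, and it blocks a category as soon as one of its cases fails.

```javascript
// Summarize scored cases into counts plus per-category usability.
function buildReport(model, scored) {
  const failedByCategory = new Map();
  let passed = 0;
  let failed = 0;
  let warnings = 0;

  for (const { category, state } of scored) {
    if (!failedByCategory.has(category)) failedByCategory.set(category, 0);
    if (state === "fail") {
      failed += 1;
      failedByCategory.set(category, failedByCategory.get(category) + 1);
    } else {
      // Assumption: a warning still counts as a pass, and is tallied separately.
      passed += 1;
      if (state === "warning") warnings += 1;
    }
  }

  const usable_for = [];
  const blocked_for = [];
  for (const [category, failures] of failedByCategory) {
    // Strict rule for illustration: one failure blocks the whole category.
    (failures === 0 ? usable_for : blocked_for).push(category);
  }
  return { model, passed, failed, warnings, usable_for, blocked_for };
}
```

A stricter or looser blocking rule is a one-line change, which is the advantage of keeping categories separate instead of folding them into a score.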

That report is the artifact I trust. It is not universal, but it is honest about the machine and the work.

Deep-dive: A practical swap rule

My current rule is simple: a new model replaces the current one only if it passes every existing critical case, improves at least one workflow I care about, and stays within the same memory class. If it needs more memory, it has to be materially better on a task that already causes pain.
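The rule is simple enough to encode. Here is one way it could be sketched, with several assumptions stated up front: `shouldSwap` is a hypothetical function, workflows are compared by pass rate, "memory class" is represented as a number of gigabytes, and "materially better" is pinned to an arbitrary gain of 0.2 in pass rate.

```javascript
// Assumed threshold for "materially better" on a painful workflow.
const MATERIAL_GAIN = 0.2;

// current:   { criticalIds: string[], passRate: {workflow: 0..1}, memoryClassGb }
// candidate: { passedIds: string[],   passRate: {workflow: 0..1}, memoryClassGb }
// painPoints: workflows that already cause trouble with the current model
function shouldSwap({ current, candidate, painPoints = [] }) {
  const reasons = [];

  // Gate 1: every existing critical case must still pass.
  const missed = current.criticalIds.filter((id) => !candidate.passedIds.includes(id));
  if (missed.length > 0) reasons.push(`fails critical: ${missed.join(",")}`);

  // Gate 2: at least one workflow must actually improve.
  const gains = Object.keys(current.passRate).map((workflow) => [
    workflow,
    (candidate.passRate[workflow] ?? 0) - current.passRate[workflow],
  ]);
  if (!gains.some(([, delta]) => delta > 0)) reasons.push("no workflow improved");

  // Gate 3: more memory is only acceptable for a material gain on a pain point.
  if (candidate.memoryClassGb > current.memoryClassGb) {
    const justified = gains.some(
      ([workflow, delta]) => painPoints.includes(workflow) && delta >= MATERIAL_GAIN
    );
    if (!justified) reasons.push("more memory without material gain on a pain point");
  }

  return { swap: reasons.length === 0, reasons };
}
```

Returning the reasons alongside the verdict is deliberate: a rejected model with a written reason is much easier to leave alone than one that merely lost.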

That rule prevents churn. New releases are frequent enough that curiosity can become maintenance. A local server should not become a shrine to release notes.

The remaining sharp edges

This approach has costs. The harness reflects my workload, so it can miss capabilities I have not encoded. It also makes prompt drift visible in an annoying way. If I improve a prompt in the actual application, I need to decide whether to update the test case, freeze it for comparability, or run both for a while.
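The "run both for a while" option is cheap to mechanize. As a sketch, assuming the application's current prompts can be exported as a map from case id to prompt text, a hypothetical `withDriftVariants` helper could expand the case list so that drifted prompts run next to their frozen originals:

```javascript
// Expand cases so any prompt that has drifted in the application is run
// twice: once frozen (for comparability) and once live (for realism).
// `livePrompts` is an assumed map of case id -> current application prompt.
function withDriftVariants(cases, livePrompts) {
  const expanded = [];
  for (const testCase of cases) {
    expanded.push({ ...testCase, variant: "frozen" });
    const live = livePrompts[testCase.id];
    if (live !== undefined && live !== testCase.prompt) {
      // Same checks and limits, new prompt, distinct id for the report.
      expanded.push({ ...testCase, id: `${testCase.id}@live`, prompt: live, variant: "live" });
    }
  }
  return expanded;
}
```

Once the live variant has been stable for a few runs, it can replace the frozen prompt in the case file and the extra entry disappears on its own.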

There is also a ceiling on deterministic checks. Some answers are genuinely qualitative. For those, I keep a small review set and compare outputs side by side after the automated gate. The key is sequencing: automation rejects obvious misses first, then human judgment handles the narrower question of usefulness.

The payoff is not that I found the best model. I found a way to stop asking that question. The better question is whether a model has earned a specific job on a specific machine.

FAQ

Why are cloud LLM benchmarks unreliable for self-hosted models?

They measure model behavior under infrastructure you do not control. For self-hosted LLM evaluation, latency, memory headroom, quantization, and local workload fit can matter as much as benchmark accuracy.

Where should I capture latency in a local LLM harness?

Capture time to first token and total elapsed time from the client that calls the local server. Server-side logs help debug, but the client view reflects what your workflow actually experiences.

What should I test before swapping an open-weights model?

Test fixed prompts from your real workflows, hard failure modes such as invalid JSON or invented flags, peak memory, and latency budgets. A model should clear the old model's critical cases before replacing it.

Should I use another LLM as a judge for local model evaluation?

Use judge models sparingly. For a swap gate, deterministic checks are cheaper, repeatable, and better at catching concrete failures such as missing fields, unsupported options, and timeouts.

How many prompts are enough for a home-server LLM benchmark?

Start with 15 to 30 prompts that cover the tasks you actually run. A small set maintained over time beats a large benchmark you avoid updating.