When Agents Can Write Their Own Skills, Retrieval Becomes the Product Boundary

June 25, 2026 · 11 min read

I was watching an internal agent solve a task it had never seen before, write down the procedure as a reusable skill, and then reach for that skill on a later run with just enough confidence to make me uncomfortable. The impressive part was not that it could create files, run commands, or commit changes. Plenty of automation can do that. The unnerving part was that, with skill persistence, the agent had started to carry its own operating manual forward, and one bad retrieval could turn yesterday's useful shortcut into today's wrong turn.

TL;DR

Agent skill persistence turns retrieval into the product boundary. A self-improving agent can save a solved procedure as a new skill, but the hard engineering problem is deciding when that procedure becomes durable memory, how it is named and indexed, and how the system detects when retrieval selects the wrong skill. The practical answer is to treat skills like a versioned product surface: gated creation, constrained retrieval, evaluation loops, and guardrails around tool expansion.

The argument I would now make plainly is this: once agents can write their own skills, the product is no longer the tool list. The product is the retrieval lifecycle around the skills.

That sounds abstract until the first misfire. A task arrives. The agent searches its saved skills. It finds something with overlapping words. It loads the wrong procedure, then uses real tools with fake relevance. The agent is not broken in the obvious way. It is worse: it is locally competent while globally misrouted. That is how a system gets expensive without first looking stupid.

Where agent skill persistence moved the boundary

The first version of the agent was simple enough to reason about. It had code execution, file access, and Git actions. Give it a novel task, and it could inspect the workspace, run a command, edit a file, and record what happened. The new step was persistence: after solving a task, it could save the procedure as a skill so future runs did not rediscover the same workflow from scratch.

That is a useful capability. It is also a category change.

Before skill persistence, the main question was whether the agent could execute a requested operation safely. After skill persistence, the question became whether the agent should remember the operation at all. A saved skill is not just a note. It is future control flow.

The known issue was blunt: the agent occasionally referenced the wrong skill. That does not sound dramatic in a status update. In practice it is the failure mode that matters most, because it sits between intent and execution. The wrong skill can still contain valid commands. It can still produce reasonable logs. It can still make a commit.

I have more sympathy now for boring lifecycle machinery than I did before. Naming, indexing, freshness, ownership, evaluation, and rollback are not administrative garnish. They are the controls that decide what kind of agent you actually shipped. That same instinct shows up in "One Policy Gate for an Autonomous Agent": the product boundary is often the place where a capable system is forced to slow down and prove what it is about to do.

A skill is a capability with a memory leak

A self-created skill usually starts as a trace of a successful run: what the agent observed, which tools it used, what order worked, and which checks proved the result. If that trace is promoted too casually, the system accumulates procedures that look reusable but only worked because the original context quietly supplied half the answer.

A useful skill needs a contract. I want these fields before I let an agent-created procedure into durable retrieval:

id: repo-test-runner
name: Repository test runner
summary: Run the local test suite and report failing files.
applies_when:
  - package.json exists
  - npm test script exists
  - no language-specific runner is requested
reject_when:
  - workspace has pyproject.toml and no package.json
  - user asks for deployment or release work
required_tools:
  - file_read
  - command_exec
risk_level: low
verification:
  command: npm test
  success_pattern: '\b0 failed\b'  # assumed to be a regex; the word boundary keeps "10 failed" from matching
owner: agent-platform
version: 3

The applies_when and reject_when fields matter more than the cheerful prose in summary. Summaries are for humans and embedding models. Applicability rules are for keeping the system from being charmingly wrong.
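One way to make those fields enforceable is to store them as named predicates instead of prose. What follows is a minimal sketch under my own assumptions, not the agent's actual schema: `WorkspaceFacts`, `Rule`, and `isApplicable` are hypothetical names, and the rules cover only part of the repo-test-runner example.

```typescript
// Hypothetical shape: facts the runtime can observe cheaply before retrieval.
type WorkspaceFacts = {
  files: string[];         // paths present in the workspace
  npmScripts: string[];    // script names found in package.json, if any
  requestedWork: string[]; // coarse intent tags derived from the task
};

// A rule is a named predicate so a failed check can be reported by name.
type Rule = { name: string; test: (facts: WorkspaceFacts) => boolean };

type Applicability = {
  applies: boolean;
  failedApplies: string[];
  firedRejects: string[];
};

function isApplicable(
  appliesWhen: Rule[],
  rejectWhen: Rule[],
  facts: WorkspaceFacts,
): Applicability {
  const failedApplies = appliesWhen.filter((r) => !r.test(facts)).map((r) => r.name);
  const firedRejects = rejectWhen.filter((r) => r.test(facts)).map((r) => r.name);
  // Every applies rule must hold and no reject rule may fire.
  const applies = failedApplies.length === 0 && firedRejects.length === 0;
  return { applies, failedApplies, firedRejects };
}

// Structured versions of some rules from the repo-test-runner example.
const appliesWhen: Rule[] = [
  { name: "package.json exists", test: (f) => f.files.includes("package.json") },
  { name: "npm test script exists", test: (f) => f.npmScripts.includes("test") },
];
const rejectWhen: Rule[] = [
  {
    name: "pyproject.toml and no package.json",
    test: (f) => f.files.includes("pyproject.toml") && !f.files.includes("package.json"),
  },
  {
    name: "deployment or release work requested",
    test: (f) => f.requestedWork.some((w) => w === "deploy" || w === "release"),
  },
];
```

Returning the names of the failed and fired rules, rather than a bare boolean, is the design choice that matters here: it gives the selector something concrete to cite when it declines a skill.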

The temptation is to treat skill creation as a documentation problem. Let the agent write a good title, a tidy description, and maybe a few steps. That is not enough. The agent needs to persist the boundary conditions that made the procedure valid.

Retrieval is where correctness gets slippery

Most retrieval bugs are not absurd. They are plausible. A skill for running JavaScript tests and a skill for running TypeScript type checks will share vocabulary. A skill for preparing a release and a skill for creating a Git commit may both mention branches, tags, changelogs, and verification. Similarity search is doing exactly what it was asked to do, which is not the same as doing what the product needs.

The retrieval path I prefer has three gates.

Gate one: lexical and structural filtering

Do not ask embeddings to solve every part of selection. First filter by cheap, explicit facts from the current task and workspace. If the workspace has package.json, JavaScript skills are candidates. If it has pyproject.toml, Python skills become candidates. If the user asked for a calendar workflow, repository skills should not even enter the room.

A simple candidate query can carry more product judgment than a large prompt:

-- Structural pre-filter: runs before any embedding lookup.
SELECT id, name, summary, risk_level, version
FROM agent_skills
WHERE enabled = true
  AND risk_level IN ('low', 'medium')
  -- <@ : the skill needs no tool outside the allowed set
  AND required_tools <@ ARRAY['file_read', 'command_exec', 'git_action']
  -- && : the two arrays share at least one tag
  AND applies_tags && ARRAY['repo', 'tests']
  AND NOT (reject_tags && ARRAY['deploy', 'calendar'])
ORDER BY last_eval_score DESC, updated_at DESC
LIMIT 20;

This is not glamorous. It is also where a lot of the damage is prevented.
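The tag arrays in a query like that have to be computed before retrieval runs. Here is a minimal sketch, assuming tags are derived from marker files at the workspace root. `MARKER_TAGS` and `deriveWorkspaceTags` are hypothetical names, and the mapping itself is illustrative rather than anything the agent ships with.

```typescript
// Hypothetical mapping from marker files to tags; extend it per ecosystem.
const MARKER_TAGS: Array<{ file: string; tags: string[] }> = [
  { file: "package.json", tags: ["repo", "javascript"] },
  { file: "pyproject.toml", tags: ["repo", "python"] },
];

// Returns a de-duplicated, sorted tag list for the files found in the workspace.
function deriveWorkspaceTags(files: string[]): string[] {
  const tags: string[] = [];
  for (const marker of MARKER_TAGS) {
    if (!files.includes(marker.file)) continue;
    for (const tag of marker.tags) {
      if (!tags.includes(tag)) tags.push(tag);
    }
  }
  return tags.sort();
}
```

A workspace with no known marker files yields an empty tag list, which means the structural filter returns nothing and the agent falls back to solving the task without a saved skill. I would rather have that than a confident guess.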

Gate two: semantic ranking with evidence

After filtering, embeddings are useful for ranking. I still do not want the agent to select a skill because the vector score looked warm and persuasive. The selector should produce evidence: which words matched, which workspace facts matched, which applicability rule passed, and which reject rule did not fire.

{
  "skill_id": "repo-test-runner",
  "decision": "select",
  "score": 0.82,
  "evidence": {
    "matched_files": ["package.json"],
    "matched_tags": ["repo", "tests"],
    "passed_rules": ["npm test script exists"],
    "blocked_rules": []
  }
}

That artifact is useful for debugging, but it also changes agent behavior. A selector that must explain why a skill applies tends to expose weak matches before execution. The point is not perfect introspection. The point is friction at the boundary where memory becomes action.
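To make "evidence before selection" concrete, here is a small sketch of a decision function that refuses a high-scoring skill when the evidence is thin. The names (`Evidence`, `decide`, `MIN_SCORE`) and the 0.75 threshold are my assumptions for illustration; the real selector may weigh these signals differently.

```typescript
type Evidence = {
  matchedFiles: string[];
  matchedTags: string[];
  passedRules: string[];
  blockedRules: string[];
};

type Decision = {
  skillId: string;
  decision: "select" | "reject";
  score: number;
  reason: string;
  evidence: Evidence;
};

// Assumed threshold; tune it against retrieval evals rather than by feel.
const MIN_SCORE = 0.75;

function decide(skillId: string, score: number, evidence: Evidence): Decision {
  let reason = "ok";
  if (evidence.blockedRules.length > 0) {
    reason = "reject rule fired";
  } else if (evidence.passedRules.length === 0) {
    reason = "no applicability rule passed";
  } else if (evidence.matchedFiles.length === 0 && evidence.matchedTags.length === 0) {
    reason = "no structural match";
  } else if (score < MIN_SCORE) {
    reason = "score below threshold";
  }
  return { skillId, decision: reason === "ok" ? "select" : "reject", score, reason, evidence };
}
```

The ordering is deliberate: the vector score is checked last, so a warm score can never rescue a candidate that failed an explicit rule.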

Gate three: bounded execution

Skill retrieval should not silently expand the agent's powers. If a skill requires Git actions and the current task only allows file reads, retrieval should fail closed or ask for a narrower plan. This becomes more important as the toolkit grows. Code execution plus file writes plus Git operations can achieve compound goals. They can also compound a mistake.
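Failing closed can be a very small function. This sketch assumes the task policy is available as a flat list of allowed tool names; `ScopeCheck` and `checkToolScope` are hypothetical names.

```typescript
type ScopeCheck =
  | { ok: true }
  | { ok: false; missingTools: string[] };

// A skill may run only if every tool it requires is already allowed for the task.
function checkToolScope(requiredTools: string[], allowedTools: string[]): ScopeCheck {
  const missingTools = requiredTools.filter((tool) => !allowedTools.includes(tool));
  return missingTools.length === 0 ? { ok: true } : { ok: false, missingTools };
}

// Example: the skill wants Git, but the task policy only allows file reads.
const scope = checkToolScope(["file_read", "git_action"], ["file_read"]);
if (!scope.ok) {
  // Stop here and surface the gap instead of widening permissions.
  console.log("skill blocked, missing tools: " + scope.missingTools.join(", "));
}
```

The check compares against the task's allowance, never against what the agent could technically do. That keeps the skill index from becoming a side door for permissions.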

When should an experience become durable memory?

The hardest product question is promotion. An agent can solve many tasks once. That does not mean each solution deserves to live in the shared skill index.

I ended up thinking in terms of three thresholds: repeatability, observability, and blast radius.

Threshold	Question	Promotion signal
Repeatability	Will this procedure likely recur?	Similar task seen at least twice
Observability	Can we verify success cheaply?	Command, assertion, or log pattern exists
Blast radius	What can a bad match damage?	Tool scope is low or explicitly bounded

A one-off workaround with vague verification should stay as an execution trace, not a skill. A repeated procedure with a clear success check can become a private skill. A procedure that touches releases, credentials, billing, or production data needs human review before it becomes generally retrievable.

This is where the self-improving story loses some romance. The agent can draft skills. It should not freely promote every draft into shared memory. Promotion is a product decision with engineering evidence attached.

Deep-dive: A minimal promotion check

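type CandidateSkill = {
  id: string;
  name: string;
  appliesWhen: string[];
  rejectWhen: string[];
  requiredTools: string[];
  verification?: { command: string; successPattern: string };
  similarTaskCount: number;
  riskLevel: 'low' | 'medium' | 'high';
};

// Tools a promoted skill may depend on; anything else blocks promotion.
const PROMOTABLE_TOOLS = ['file_read', 'command_exec', 'git_action'];

export function canPromote(skill: CandidateSkill): boolean {
  // Repeatability: the task has come up at least twice.
  if (skill.similarTaskCount < 2) return false;
  // Observability: there is a cheap way to verify success.
  if (!skill.verification) return false;
  // Boundary conditions must be explicit in both directions.
  if (skill.appliesWhen.length === 0) return false;
  if (skill.rejectWhen.length === 0) return false;
  // Blast radius: high-risk skills never promote automatically.
  if (skill.riskLevel === 'high') return false;
  return skill.requiredTools.every((tool) => PROMOTABLE_TOOLS.includes(tool));
}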

This is intentionally conservative. It rejects some useful skills, but it also prevents the index from becoming a junk drawer with a ranking model taped to the front.
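A boolean collapses the three outcomes I described earlier: stay a trace, become a private skill, or wait for human review. A sketch of the three-way version might look like this. `PromotionInput`, `PromotionOutcome`, and `classifyPromotion` are hypothetical names, and the sensitive-surface flag is assumed to be computed elsewhere.

```typescript
type PromotionInput = {
  similarTaskCount: number;
  hasVerification: boolean;
  // True when the procedure touches releases, credentials, billing, or production data.
  touchesSensitiveSurface: boolean;
};

type PromotionOutcome = "execution_trace" | "private_skill" | "human_review";

function classifyPromotion(input: PromotionInput): PromotionOutcome {
  // One-off or unverifiable work stays a trace, whatever it touches.
  if (input.similarTaskCount < 2 || !input.hasVerification) return "execution_trace";
  // Repeated and verifiable, but too risky to become retrievable without review.
  if (input.touchesSensitiveSurface) return "human_review";
  return "private_skill";
}
```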

The wrong-skill bug is an evaluation problem

Once the agent occasionally references the wrong skill, unit tests around tool execution are not enough. The selector itself needs evaluation data.

The useful eval set is not only happy paths. It needs near misses: tasks that sound similar but require different skills. If the agent has a skill for running tests, include a task about writing tests. If it has a release checklist, include a task about drafting release notes without publishing anything.

A basic eval runner can be painfully ordinary:

cd ~/agent-platform
node scripts/evaluate-skill-retrieval.mjs \
  --cases fixtures/skill-retrieval-cases.jsonl \
  --top-k 5 \
  --fail-on-wrong-top1

And the cases should name both the expected skill and the skills that must not be selected:

{"task":"Run the repository test suite and summarize failures","workspace":["package.json","src/app.ts"],"expected":"repo-test-runner","forbidden":["release-checklist"]}
{"task":"Prepare a changelog draft without committing files","workspace":["CHANGELOG.md","package.json"],"expected":"changelog-drafter","forbidden":["release-checklist"]}

This turns an anecdotal annoyance into a measurable product boundary. The goal is not to make retrieval flawless. The goal is to know which mistakes are getting more likely as the skill library grows.
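The core of such a runner is small. The sketch below is not the script from the command above; it is an assumed shape for the scoring step, with the retriever injected as a function so it can be tested against a stub. Treating a forbidden skill anywhere in the top k as a failure is a stricter reading than a top-1 check, and that strictness is my assumption.

```typescript
type RetrievalCase = {
  task: string;
  workspace: string[];
  expected: string;
  forbidden: string[];
};

// Assumed signature: returns skill ids ranked best first.
type Retriever = (task: string, workspace: string[], topK: number) => string[];

type CaseFailure = {
  task: string;
  kind: "wrong_top1" | "forbidden_in_top_k";
  got: string[];
};

function scoreCases(cases: RetrievalCase[], retrieve: Retriever, topK: number): CaseFailure[] {
  const failures: CaseFailure[] = [];
  for (const c of cases) {
    const ranked = retrieve(c.task, c.workspace, topK).slice(0, topK);
    if (ranked[0] !== c.expected) {
      failures.push({ task: c.task, kind: "wrong_top1", got: ranked });
    }
    // Near misses are recorded even when top-1 is correct.
    if (ranked.some((id) => c.forbidden.includes(id))) {
      failures.push({ task: c.task, kind: "forbidden_in_top_k", got: ranked });
    }
  }
  return failures;
}
```

Tracking the two failure kinds separately helps with the trend question: a rising count of forbidden skills in the top k is an early warning, even while top-1 accuracy still looks fine.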

The cost of doing this properly

There is a real cost here. A constrained skill system is slower to feel magical. The agent has to ask whether a skill applies. It may refuse to use a plausible procedure. It may leave a newly drafted skill in review instead of immediately reusing it.

I am fine with that trade. A self-improving agent that remembers too eagerly becomes a procedural hoarder. It accumulates enough memory to appear experienced and enough ambiguity to become difficult to trust.

The sharper edge is maintenance. Skills age. Tool permissions change. Repositories move from one test runner to another. A skill that was correct six weeks ago can keep running against a setup it no longer fits. Versioning and periodic evals are not optional paperwork in this design. They are the cost of letting the agent keep memory that can steer future action. I made a related argument in "Building Press, Part 3: The prompts are data, not code": once behavior is stored as data, the lifecycle around that data becomes part of the product.
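A staleness check is one way to turn "skills age" into something a scheduler can act on. This is a sketch under assumptions: the field names, the 30-day window, and the `needsReevaluation` function are all hypothetical.

```typescript
type StoredSkill = {
  id: string;
  version: number;
  evaluatedVersion: number; // the skill version the last eval ran against
  lastEvaluatedAt: Date;
  requiredTools: string[];
};

// Assumed policy: re-run retrieval evals at least every 30 days.
const MAX_EVAL_AGE_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the reasons a skill should be re-evaluated; empty means it is current.
function needsReevaluation(skill: StoredSkill, currentTools: string[], now: Date): string[] {
  const reasons: string[] = [];
  const ageDays = (now.getTime() - skill.lastEvaluatedAt.getTime()) / MS_PER_DAY;
  if (ageDays > MAX_EVAL_AGE_DAYS) reasons.push("last eval is too old");
  if (skill.evaluatedVersion !== skill.version) reasons.push("edited since last eval");
  const unavailable = skill.requiredTools.filter((t) => !currentTools.includes(t));
  if (unavailable.length > 0) reasons.push("tools no longer available: " + unavailable.join(", "));
  return reasons;
}
```

A skill with any reason in that list could be dropped from automatic selection until its evals pass again, which is cheaper than discovering the drift during a real run.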

What I would build first

If I were hardening this system from scratch, I would not start by adding more tools. I would start by making the skill boundary visible.

First, every selected skill gets logged with its candidate set, evidence, version, and allowed tools. Second, every newly created skill starts as a draft with promotion rules. Third, retrieval has negative tests, not just expected matches. Fourth, high-risk skills require explicit review before they can be selected automatically.
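The second and fourth points reduce to one gate in front of automatic selection. A minimal sketch, assuming each indexed skill carries a lifecycle status and an optional reviewer field; `IndexedSkill` and `isAutoSelectable` are hypothetical names.

```typescript
type SkillStatus = "draft" | "promoted" | "retired";

type IndexedSkill = {
  id: string;
  status: SkillStatus;
  riskLevel: "low" | "medium" | "high";
  reviewedBy?: string; // set only after an explicit human review
};

// Only promoted skills are eligible, and high-risk ones also need a reviewer on record.
function isAutoSelectable(skill: IndexedSkill): boolean {
  if (skill.status !== "promoted") return false;
  if (skill.riskLevel === "high" && !skill.reviewedBy) return false;
  return true;
}
```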

That gives the agent room to improve without letting its memory become folklore.

The interesting part of self-skill creation is not that the agent can write down what worked. Humans have been doing that in runbooks for a long time, usually after the third outage and one bleak meeting. The interesting part is that the agent can put the runbook back into the execution path. At that point retrieval is where the product has to decide which procedures the agent is allowed to reuse.

FAQ

Why does agent skill persistence make retrieval risky?

Because a retrieved skill can steer real tool use. A wrong match is not just a bad search result; it can become the procedure the agent executes.

Where should I capture skill applicability rules?

Capture them in the skill metadata, close to the procedure itself. Use explicit applies_when and reject_when fields so retrieval can filter before semantic ranking.

How do I evaluate skill retrieval for an agent?

Build an eval set with expected skills and forbidden near misses. Run it whenever skills are added, edited, or re-indexed, and fail the build when the wrong skill ranks first.

Should agents be allowed to create their own skills automatically?

They can draft them automatically, but promotion should be gated. Repeatability, verification, and blast radius decide whether a draft becomes durable memory.

What is the main guardrail for expanding agent tools?

Keep tool permissions bound to the current task and selected skill. Retrieval should not grant broader powers than the task policy already allows.