Engineering

A Slack Mention Becomes a Pull Request

June 22, 2026 · 10 min read

Someone drops @agent triage the latest Sentry spike into #payments; a minute later, Slack has a PR link. Before the agent can touch GitHub, it has to answer the question that decides whether this is useful or dangerous: does #payments mean the payments repo, and only the payments repo?

That question is really three questions compressed into one. Where is the request coming from? (That is authority.) When does the irreversible work happen? (That is time.) Who signs off on the result? (That is review.) Each one binds to a single primitive: the channel id, a Postgres row, and a Slack reaction. Channel resolution is the first example. The rest of the design follows from those three bindings.

This is the first of four parts on building an internal Slack agent. This one covers the whole path end to end. The next three go deep on the parts that turned out to matter most: the queue, the safety gate, and the seams. The shape follows the same discipline behind Submitting Timesheets from Slack Without Writing Too Early: receive the intent quickly, delay the irreversible work, make the final write explicit.

TL;DR

A Slack engineering agent maps each project to a Slack channel plus one or more GitHub repos and integrations. An @agent mention resolves to a project by channel id, parses a command, acknowledges with an :eyes: reaction, and enqueues a row in a Postgres tasks table. A worker claims the row, runs an agent loop, posts intermediate status back to the thread, and finishes by opening a PR. The review verdict comes back the same way it went out: as a Slack reaction. The system runs on three processes and one database.

What it actually is

A project is the core unit. A project is a Slack channel, one or more GitHub repos, and a set of optional integrations (issues, error tracking, docs, meeting transcripts). When you mention the agent in a channel, the channel id is what tells the system which repos it may touch, which error-tracker projects it can read, and which actions it is allowed to take. The "tenant" is not a header or a subdomain. It is the channel you happened to be standing in.

That design choice pays off immediately: permissions and context are a property of where you are, not of who you are or what you remember to pass. A mention in the payments channel can open a PR against the payments repo and nothing else, because that is the only repo wired to that project.
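To make the channel-as-tenant idea concrete, here is a minimal in-memory sketch. The Project shape, the ProjectRegistry class, the mayTouchRepo helper, and the example ids are all assumptions made for illustration; the real schema lives in Postgres and is not shown in this post.

```typescript
// Hypothetical shapes: the post does not show the real project schema.
interface Project {
  id: string;
  channelId: string;      // Slack channel id, e.g. "C0PAYMENTS"
  repos: string[];        // GitHub repos wired to this project
  integrations: string[]; // e.g. "sentry", "docs"
}

class ProjectRegistry {
  private byChannel = new Map<string, Project>();

  register(project: Project): void {
    this.byChannel.set(project.channelId, project);
  }

  // Same idea as store.getByChannelId: no match means no authority.
  getByChannelId(channelId: string): Project | undefined {
    return this.byChannel.get(channelId);
  }
}

// A repo is in scope only if the channel's project lists it.
function mayTouchRepo(registry: ProjectRegistry, channelId: string, repo: string): boolean {
  const project = registry.getByChannelId(channelId);
  return project !== undefined && project.repos.includes(repo);
}

const registry = new ProjectRegistry();
registry.register({
  id: "proj_payments",
  channelId: "C0PAYMENTS",
  repos: ["acme/payments"],
  integrations: ["sentry"],
});
```

The point of the sketch is that the caller never passes a repo as authority. It passes a channel id, and the scope falls out of the lookup.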

Three processes, no other moving infrastructure: a web app for the admin UI, a single long-running process that hosts the Slack listener plus the worker plus the scheduler, and a one-shot migrator. One Postgres behind all of it. No Redis, no queue broker, no cache. That constraint shaped everything downstream, and it is the subject of Part 2.

The portable mental model: bind authority to the place where work is requested, then cross an explicit async boundary before doing irreversible work.

Inside the Slack engineering agent path

Here is what happens between the mention and the PR.

Resolve the project before doing anything

The listener subscribes to app_mention over Slack Socket Mode, so there is no public webhook endpoint to secure. The first real work is stripping the mention text and looking the channel up:

// app_mention handler
const channelId = event.channel;
const project = await store.getByChannelId(channelId);
if (!project) {
  await say("This channel isn't wired to a project yet.");
  return;
}
const command = parseCommand(stripMention(event.text));

If the channel does not map to a project, the agent says so and stops. There is no global fallback, deliberately: an un-provisioned channel has no repos and no permissions, so there is nothing safe to do.
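The handler above calls stripMention and parseCommand without showing them. The following is one plausible sketch, not the production parser: the Command shape (a verb plus free-form args) is an assumption, and the only Slack-specific fact it relies on is that user mentions arrive in event text as `<@U…>`.

```typescript
// Hypothetical command shape; the post names parseCommand but not its output.
interface Command {
  verb: string; // first word, lowercased
  args: string; // everything after the verb
}

// Slack encodes user mentions in event text as <@U…> (optionally <@U…|name>).
function stripMention(text: string): string {
  return text.replace(/<@[A-Z0-9]+(\|[^>]*)?>/g, "").trim();
}

function parseCommand(text: string): Command | null {
  const trimmed = text.trim();
  if (trimmed === "") return null; // a bare mention carries no command
  const match = /^(\S+)\s*([\s\S]*)$/.exec(trimmed);
  if (match === null) return null;
  return { verb: match[1].toLowerCase(), args: match[2].trim() };
}
```

Returning null for a bare mention gives the handler an explicit "nothing to do" branch instead of enqueuing an empty command.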

A few costs are worth naming here. Shared channels (ones bridged across Slack workspaces) complicate the channel-id lookup because the id can vary by context. Renaming a channel keeps its id, so the lookup survives a rename, but a channel that is archived and replaced by a new one gets a new id, which silently breaks the mapping until someone re-provisions it. And if two repos are plausibly in scope (a monorepo split mid-migration, say), the system has no principled way to choose: it will use whatever the project record says, which may be stale. These are solvable, but they require active maintenance of the project configuration, not just the initial wiring.

Acknowledge with a reaction, then enqueue

Before any slow work, the agent reacts to the triggering message with :eyes:. That reaction is the acknowledgment protocol: the person who typed the command sees, within a second, that the agent picked it up, even though the actual work will take a minute or more. Then it writes a row to the queue and returns:

await client.reactions.add({ channel: channelId, timestamp: event.ts, name: "eyes" });
await queue.enqueue({
  projectId: project.id,
  command,
  slackChannel: channelId,
  slackThreadTs: event.ts,
});

The handler is now done. Running the agent does not happen inside the Slack event handler, which matters: Slack expects a fast ack, and an agent run can take minutes. The queue is the boundary between received and done. That same boundary shows up in Building Press, Part 4: Review is where the human gates the irreversible, where the system keeps the machine-generated work separate from the human approval step.

One failure mode to acknowledge: Slack can deliver the same event more than once if the Socket Mode connection drops mid-flight. On the second delivery the :eyes: react fails with Slack's already_reacted error, which is harmless as long as the handler tolerates it, but the enqueue call can create a second task row unless the insert is idempotent on (channel, thread_ts). Duplicate-event handling is something to add explicitly; the basic path above does not include it.
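One way to add that handling is to key the enqueue on the channel and the triggering message's timestamp. The sketch below does this in memory so the behaviour is easy to see; DedupingQueue and NewTask are hypothetical names. In Postgres the same effect would likely come from a unique index on the two columns plus INSERT ... ON CONFLICT DO NOTHING, with the column names being an assumption here.

```typescript
// Hypothetical task shape, trimmed to the fields used by the enqueue call.
interface NewTask {
  projectId: string;
  command: string;
  slackChannel: string;
  slackThreadTs: string;
}

class DedupingQueue {
  private seen = new Set<string>();
  readonly rows: NewTask[] = [];

  // Returns true if a row was written, false if this event was already enqueued.
  enqueue(task: NewTask): boolean {
    const key = `${task.slackChannel}:${task.slackThreadTs}`;
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    this.rows.push(task);
    return true;
  }
}
```

A redelivered event then becomes a no-op at the queue boundary, regardless of whether the reaction call succeeded or failed the second time.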

A worker claims it and runs the loop

A separate worker loop pulls the next task using SELECT ... FOR UPDATE SKIP LOCKED against the tasks table. That single clause is what makes concurrent workers safe: each worker atomically claims a row that no other worker holds, without any external lock manager. A task moves through states (pending, claimed, running, done, failed), and the claim only succeeds if the row is in pending. Here is the minimal shape of that query:

-- claim the next available task
UPDATE tasks
SET status = 'claimed', claimed_at = now(), worker_id = $1
WHERE id = (
  SELECT id FROM tasks
  WHERE status = 'pending'
  ORDER BY created_at
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
RETURNING *;

SKIP LOCKED means a worker never waits for a row another worker already holds. It moves on to the next pending row instead. If the process crashes after claiming but before writing done, the row stays in claimed. A scheduler sweeps for rows that have been claimed for longer than a timeout threshold and resets them to pending, which lets a worker pick them up again. That sweep is the crash-safety mechanism. Retries increment a counter on the row; the worker checks that counter before running and can dead-letter the task if it has exceeded a retry limit. Part 2 covers the full queue mechanics, but that is enough context to understand why the diagram shows claim (skip locked) rather than a message broker.
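The sweep and the retry check can be sketched as two pure functions over task rows. This is an illustration under assumptions, not the real scheduler: the TaskRow field names are hypothetical, the sweep is shown in memory rather than as an UPDATE, and dead-lettering is modelled as the existing failed state.

```typescript
type Status = "pending" | "claimed" | "running" | "done" | "failed";

// Hypothetical row shape; column names are assumptions.
interface TaskRow {
  id: number;
  status: Status;
  claimedAt: number | null; // epoch milliseconds
  attempts: number;
}

// Scheduler side: reset claims older than timeoutMs and count the retry.
function sweepStaleClaims(rows: TaskRow[], now: number, timeoutMs: number): TaskRow[] {
  return rows.map((row): TaskRow => {
    const stale =
      row.status === "claimed" && row.claimedAt !== null && now - row.claimedAt > timeoutMs;
    if (!stale) return row;
    return { ...row, status: "pending", claimedAt: null, attempts: row.attempts + 1 };
  });
}

// Worker side: check the counter before running; past the limit, dead-letter.
// Assumption: dead-lettering reuses the "failed" state.
function admit(row: TaskRow, maxAttempts: number): TaskRow {
  return row.attempts > maxAttempts ? { ...row, status: "failed" } : row;
}
```

Keeping the counter on the row means the retry limit survives a process restart, which an in-process counter would not.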

As the loop works, it posts intermediate status back into the same Slack thread, so the channel sees "looking at the stack trace," "found the handler," "opening a PR" rather than a minute of silence. The agent loop itself, and the single gate that decides what it is allowed to do, are Part 3.

When the loop finishes, the result is rendered from Markdown into Slack's mrkdwn and posted, and the PR link comes with it.
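The Markdown-to-mrkdwn step is mostly a handful of syntax swaps: Slack's mrkdwn uses single asterisks for bold, single tildes for strikethrough, and `<url|text>` for links, and it has no heading syntax. The function below is a rough sketch of that idea, assuming a small Markdown subset; toMrkdwn is a hypothetical name, and the sketch leaves out code spans, single-asterisk italics, and escaping of `&`, `<`, and `>`.

```typescript
// Converts a small Markdown subset to Slack mrkdwn. Not a full parser:
// code spans, nested emphasis, and tables are left untouched.
function toMrkdwn(markdown: string): string {
  return markdown
    // [text](url) -> <url|text>
    .replace(/\[([^\]]+)\]\(([^)\s]+)\)/g, "<$2|$1>")
    // "# Heading" -> "*Heading*" (mrkdwn has no heading syntax)
    .replace(/^#{1,6}\s+(.+)$/gm, "*$1*")
    // **bold** -> *bold*
    .replace(/\*\*([^*]+)\*\*/g, "*$1*")
    // ~~strike~~ -> ~strike~
    .replace(/~~([^~]+)~~/g, "~$1~");
}
```

For anything beyond short status messages, a real Markdown parser is probably the safer route than regular expressions.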

One partial-failure case: if the agent loop succeeds but the process crashes before writing the PR url back to the task row, a retry will re-run the loop and may open a second PR. Guarding against that requires checking for an existing open PR on the branch before creating one, which the current loop does not do automatically.
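A guard for that case could look like the sketch below. It assumes the branch name is deterministic per task (for example, derived from the task id), so a retry lands on the same branch. PullRequestApi and its two methods are a hypothetical port over whatever GitHub client is in use, not a real library's API, and FakePullRequestApi exists only to exercise the guard.

```typescript
interface PullRequest {
  url: string;
  branch: string;
}

// Hypothetical port over the GitHub client; method names are assumptions.
interface PullRequestApi {
  findOpenByBranch(repo: string, branch: string): Promise<PullRequest | null>;
  create(repo: string, branch: string, title: string): Promise<PullRequest>;
}

// Reuse an open PR for the branch if one exists; only create otherwise.
async function ensurePullRequest(
  api: PullRequestApi,
  repo: string,
  branch: string,
  title: string,
): Promise<PullRequest> {
  const existing = await api.findOpenByBranch(repo, branch);
  if (existing !== null) return existing;
  return api.create(repo, branch, title);
}

// In-memory fake used to exercise the guard without touching GitHub.
class FakePullRequestApi implements PullRequestApi {
  created = 0;
  private open = new Map<string, PullRequest>();

  async findOpenByBranch(repo: string, branch: string): Promise<PullRequest | null> {
    return this.open.get(`${repo}#${branch}`) ?? null;
  }

  async create(repo: string, branch: string, _title: string): Promise<PullRequest> {
    this.created += 1;
    const pr = { url: `https://example.invalid/${repo}/pull/${this.created}`, branch };
    this.open.set(`${repo}#${branch}`, pr);
    return pr;
  }
}
```

The check-then-create pair is not atomic, so two workers racing on the same task could still both create. The claim query already prevents that for a single task row, which is why the guard only needs to cover the retry case.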

The review verdict is a reaction

When the agent posts its result, it records that message's timestamp on the task row as slack_review_ts. A human reviews the PR and reacts to that result message with ✅ or ❌. The listener subscribes to reaction_added, matches the reaction's message timestamp against slack_review_ts, and captures the verdict:

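// reaction_added handler
if (event.item.type !== "message") return; // ignore reactions on non-message items
const task = await store.getTaskByReviewTs(event.item.ts);
if (!task) return; // not one of ours
// Only ✅ and ❌ are verdicts; any other emoji must not count as a rejection.
const verdict = event.reaction === "white_check_mark" ? "approved" : event.reaction === "x" ? "rejected" : null;
if (!verdict) return;
await store.recordReviewVerdict(task.id, verdict);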

Slack's reaction events are enough to turn a gesture people already use to signal approval into structured review state. There is no separate review UI, no button, no form.

The tradeoff is that reactions carry no access control. Anyone in the channel can add ✅ or ❌ to the result message, including the agent itself or a bot. If review authority matters, the reaction_added handler should check that the reactor's Slack user id maps to someone with explicit approval rights before writing the verdict. The current snippet skips that check.
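A sketch of that check, written as a pure function so it can sit in front of recordReviewVerdict. The ReviewPolicy shape (a set of approver ids plus the agent's own user id) is an assumption; where those ids come from, for example the project record, is left open. The user, reaction, and item fields follow Slack's reaction_added payload.

```typescript
// Subset of Slack's reaction_added payload that the check needs.
interface ReactionEvent {
  user: string;     // Slack user id of the reactor
  reaction: string; // emoji name without colons, e.g. "white_check_mark"
  item: { type: string; channel: string; ts: string };
}

type Verdict = "approved" | "rejected";

// Hypothetical policy input: who may approve for this project.
interface ReviewPolicy {
  approverIds: Set<string>;
  agentUserId: string;
}

// Returns a verdict only for ✅ or ❌ from an allowed human; otherwise null.
function verdictFrom(event: ReactionEvent, policy: ReviewPolicy): Verdict | null {
  if (event.item.type !== "message") return null;
  if (event.user === policy.agentUserId) return null;   // the agent cannot self-approve
  if (!policy.approverIds.has(event.user)) return null; // not an approver
  if (event.reaction === "white_check_mark") return "approved";
  if (event.reaction === "x") return "rejected";
  return null; // any other emoji is not a verdict
}
```

Returning null for everything that is not an explicit, authorized verdict keeps the default safe: an unrecognized reaction changes nothing.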

Why this shape

Two decisions define the system, and both are about reducing surface area.

Channel-as-tenant: identity, permissions, and context all derive from the channel id, so there is exactly one lookup (getByChannelId) at the front of every request. Everything downstream inherits the answer. No ambient credentials, no ambiguity about which repo was intended.

Queue as the only async primitive: the Slack handler does almost nothing. Resolve, react, enqueue. Everything expensive is a row that a worker will claim. That keeps the event handler fast (Slack is happy) and makes the slow work observable, retryable, and crash-safe, which is exactly what the next part covers.

Together they express the portable principle stated above: bind authority to the place where work is requested, cross an explicit async boundary before doing irreversible work. The costs are real (stale channel mappings, duplicate events, reaction spoofing, partial retries), but the surface each one acts on is narrow and traceable.

FAQ

Why resolve the tenant from the Slack channel instead of the user?

Because permissions should follow the place, not the person. A channel is wired to specific repos and a specific allowlist, so a mention there can only ever act on those. Tying it to the user would mean re-deciding scope on every message and risking a command acting on the wrong repo.

Why use a reaction for the review verdict instead of a button or a web form?

Reactions are the gesture people already use to signal approval in Slack, and they are free to capture. Recording the result message's timestamp on the task row lets reaction_added match the verdict back to the task with no extra UI. Less to build, and it meets reviewers where they already are.

Why does the Slack handler enqueue instead of running the agent inline?

Slack expects a fast acknowledgment, and an agent run can take minutes. Enqueuing returns immediately (with an :eyes: react as the ack) and hands the slow work to a worker, so the event never times out and the run becomes retryable and crash-safe.

Does Socket Mode mean there's no webhook to secure?

Yes. The listener opens an outbound WebSocket connection to Slack, authenticated with an app-level token, rather than receiving inbound webhooks, so there is no public endpoint to authenticate and protect. The trade is an outbound long-lived connection per process instead.