
Building Press, Part 1: Read the work, leave the secrets

June 19, 2026 · 9 min read

Press is an editorial engine I built to read my actual work: Slack threads, pull requests, meeting notes, and my AI session logs. It exists because the work leaves a trail worth telling: a tricky call in Slack, a debugging session that finally cracked, a migration, a meeting where someone reframed the whole problem. By the time a quiet hour arrives to write about any of it, the context has gone cold and the blank page wins. So Press mines that trail for stories and drafts posts in my voice.

This first piece in a six-part series starts with the most dangerous boundary in the whole system: confidentiality at egress, the line between what I do and what the engine is allowed to see. If you want a system that reads your real work, confidentiality cannot be a review step. It has to be a property of the egress path, enforced in code, before anything leaves the machine where the secret lives. Everything else in Press is downstream of getting that one boundary right. The next layer is the data model, which I cover in Building Press, Part 2: The data model is the spine.

TL;DR

Press is an editorial engine that ingests real work (Slack threads, pull requests, meeting notes, and AI session logs), so confidentiality has to be enforced at egress, in code, not as a manual review step. The collector strips secrets and replaces sensitive terms against a glossary before anything crosses the wire; the server only ever receives redacted text and metadata about what was redacted. Raw AI session logs never leave the laptop (only their redacted text does), while cloud sources are handled by an always-on box. A deterministic redaction guard runs at the source, and a second, LLM-based deep-scan runs later at the publish gate, so no single layer carries the full burden.

Five sources, two of which never leave the laptop

Press pulls from five places. Three are cloud APIs: the Slack threads I was actually in, my GitHub pull requests and commits, and Granola meeting summaries. Two are local files that exist nowhere but my laptop: my Claude and Codex session logs, the running record of what I asked an AI and what it did.

That split matters because the two kinds of source have completely different trust profiles. The cloud sources are already mediated by an API and a token. The local logs are raw. They contain half-formed thoughts, pasted snippets, client names, the occasional credential I was debugging against. If any source is going to leak something I would regret, it is the logs. So the design question is not "how do we clean the data once it is in the database." It is "how do we guarantee the data is clean before it is ever in the database."

Enforcing confidentiality at egress, not on review

The collector runs where the source lives. It reads a signal, strips known secrets, and replaces sensitive terms against a glossary I maintain. Only two things cross the wire to the server: the redacted text, and metadata about what was redacted (how many hits, of what kind). The server never sees the original. It cannot, because the original never left.
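As a rough sketch, the collector step described above could look like this in Python. The pattern list, the glossary entries, and the names `redact` and `Redacted` are hypothetical and chosen for illustration; they are not taken from Press.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Hypothetical patterns; a real collector would carry a longer, tested list.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "bearer_token": re.compile(r"\bBearer\s+[A-Za-z0-9._\-]{20,}"),
}

# Hypothetical glossary: sensitive term -> neutral replacement.
GLOSSARY = {"Acme Corp": "[CLIENT]", "Project Falcon": "[PROJECT]"}


@dataclass
class Redacted:
    text: str   # the only text that may cross the wire
    hits: dict  # category -> count; never the matched value


def redact(raw: str) -> Redacted:
    hits = Counter()
    text = raw
    # Pass 1: strip known secret patterns.
    for category, pattern in SECRET_PATTERNS.items():
        text, n = pattern.subn(f"[REDACTED:{category}]", text)
        if n:
            hits[category] += n
    # Pass 2: replace glossary terms, case-insensitively.
    for term, replacement in GLOSSARY.items():
        text, n = re.subn(re.escape(term), replacement, text, flags=re.IGNORECASE)
        if n:
            hits["glossary"] += n
    return Redacted(text=text, hits=dict(hits))
```

The payload sent to the server would be built from `Redacted` alone, so the raw string never has a path off the machine.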

There is a deliberate asymmetry in what gets logged. The collector records that a redaction happened and what category it fell into. It never records the value. A redaction log that stored the secret in order to tell you it found a secret would be the exact failure it was built to prevent. So the log answers "did the guard fire, and how often," and is useless to an attacker.
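To make that asymmetry concrete, here is a minimal sketch of a log line that reports category and count but has no field where the value could go. The token pattern, the field names, and `redaction_event` are assumptions for illustration only.

```python
import json
import re
from datetime import datetime, timezone

# Hypothetical single pattern, just to have something to detect.
TOKEN = re.compile(r"\bghp_[A-Za-z0-9]{36}\b")


def redaction_event(source: str, raw: str) -> str:
    """Return a JSON log line describing the redaction, without the matched value."""
    count = len(TOKEN.findall(raw))
    event = {
        "at": datetime.now(timezone.utc).isoformat(),
        "source": source,
        "category": "github_token",
        "count": count,
        # Deliberately no "value" or "match" field.
    }
    return json.dumps(event)
```

Because the event is assembled from counts rather than from matches, leaking the log reveals how often the guard fired and nothing else.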

I considered two simpler designs and rejected both. The first was a manual "remember to scrub it before publishing" step. That is not a control; it is a hope, and it fails the first genuinely busy week when you are tired and moving fast. The second was redacting on the server, after ingest, which is cleaner to build because the server has all the data in one place. But it inverts the trust model: now the raw secret has already traveled, already been written to a disk I do not fully control, already sat in a backup. Redaction after egress protects the report, not the data. The whole point is to protect the data.

Important

Confidentiality has to be automatic. A human "remember to scrub it" step fails the first busy week, and a server-side scrub runs after the secret has already traveled. The only redaction you can trust is the one that runs before egress and logs that it fired, never what it found.

The day the laptop went dark

The clean version of the egress story has an ugly footnote, and it is worth telling because it is where the architecture actually got decided.

Deep-dive: why the laptop went dark, and how the box took over

For a while the entire cycle was pinned to my laptop. The comment in the script said, plainly, "the server has no Codex auth and no session logs," and that was true, so the laptop ran everything: collection, mining, drafting. Then one morning the dashboard was stale. Nothing had run. The laptop had been asleep at the scheduled time, and because the laptop was the only thing that ran the cycle, the whole pipeline simply did not happen. There was no error, just silence, which is worse.

The fix was to notice that the comment was only half true. Slack, GitHub, and Granola are cloud APIs reachable with portable tokens. They never needed the laptop at all. Only the Claude and Codex session logs are genuinely laptop-local. So I split the cycle by source. The always-on box now runs SOURCES=slack,github,granola plus mining and drafting on a cron. The laptop's only remaining job is to push its local session logs when it happens to be awake. Moving GitHub off the gh CLI (not installed in the box's container) and onto the REST API was the one real porting cost.
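One way to picture the split is a small planner that reads SOURCES and decides what a given host should collect. This is a sketch under assumptions: the source names `claude` and `codex`, the `host_has_local_logs` flag, and the function names are invented here, and only the SOURCES variable and the three cloud source names come from the setup described above.

```python
import os

CLOUD_SOURCES = {"slack", "github", "granola"}  # reachable with portable tokens
LOCAL_SOURCES = {"claude", "codex"}             # assumed names for laptop-only logs


def parse_sources(value: str) -> list[str]:
    """Split a comma-separated SOURCES value and reject unknown names."""
    names = [s.strip().lower() for s in value.split(",") if s.strip()]
    unknown = set(names) - CLOUD_SOURCES - LOCAL_SOURCES
    if unknown:
        raise ValueError(f"unknown sources: {sorted(unknown)}")
    return names


def plan(value: str, host_has_local_logs: bool) -> list[str]:
    """Sources this host should collect: cloud anywhere, local only where the logs live."""
    return [
        s for s in parse_sources(value)
        if s in CLOUD_SOURCES or host_has_local_logs
    ]


def plan_from_env(host_has_local_logs: bool) -> list[str]:
    """Read the SOURCES environment variable, treating an unset value as empty."""
    return plan(os.environ.get("SOURCES", ""), host_has_local_logs)
```

On the box, `plan("slack,github,granola", host_has_local_logs=False)` returns all three cloud sources; asking the box for a local source simply drops it instead of failing the whole cycle.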

The deeper fix was making the failure visible. The Sources page now shows a per-source "last collected" timestamp and raises a stalled-cycle banner when something has not run in too long. A dark laptop used to be a mystery. Now it is a line on a page.
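The check behind such a banner can be very small. The sketch below assumes a mapping from source name to its last collection time and a 26-hour threshold; both the threshold and the name `stalled_sources` are illustrative guesses, not values from Press.

```python
from datetime import datetime, timedelta, timezone

STALL_AFTER = timedelta(hours=26)  # assumed threshold: one daily run plus slack


def stalled_sources(last_collected: dict, now: datetime,
                    limit: timedelta = STALL_AFTER) -> list:
    """Return sources that have never run or have not run within the limit."""
    stalled = []
    for source, ts in last_collected.items():
        if ts is None or now - ts > limit:
            stalled.append(source)
    return sorted(stalled)


now = datetime(2026, 6, 19, 12, 0, tzinfo=timezone.utc)
last = {
    "slack": now - timedelta(hours=1),
    "codex": now - timedelta(days=3),  # laptop has been asleep
    "granola": None,                   # never collected
}
print(stalled_sources(last, now))
```

Treating "never collected" as stalled matters: a source that silently never ran is the same failure as one that stopped.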

What that incident taught me was not really about cron. It was that "where does this run" is a confidentiality decision in disguise. The laptop ran everything because the laptop was where the secrets were. Once redaction moved to egress, most of the work no longer needed to touch the laptop at all, and the parts that did (the local logs) became a small, clearly bounded exception instead of the reason the entire pipeline was fragile. Drawing the egress boundary correctly is what made the box-and-laptop split obvious.

What this does not catch

Deterministic redaction has a real ceiling, and it would be dishonest to pretend otherwise. It catches what it has been told to catch: known secret patterns and the terms in the glossary. It will not catch a novel piece of personal information that matches no pattern and is in no glossary, for example a client's name mentioned for the first time in a meeting. The deterministic pass is fast, cheap, and guaranteed, which is exactly why it belongs at egress. It is not, on its own, sufficient.

That is why it is the first guard and not the only one. A second, slower check runs much later, at the publish gate: an LLM deep-scan that reads the drafted post for the things a regex cannot see. I will get to that gate in Part 4. The important property is layering. The cheap deterministic guard runs where the data is most dangerous (at the source, before egress) and the expensive judgment-based guard runs where the cost is affordable (once, on a finished draft).
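The shape of that second layer can be sketched without committing to any model or API. Here the deep scan is just a callable that takes a draft and returns a list of concerns; `publish_gate`, `DeepScan`, and the stand-in `fake_scan` are hypothetical names, and the real gate may work differently.

```python
from typing import Callable, List

# A deep scan takes a finished draft and returns a list of concerns.
# The real check is an LLM call; here it is any callable, so the sketch runs offline.
DeepScan = Callable[[str], List[str]]


def publish_gate(draft: str, deep_scan: DeepScan) -> bool:
    """Return True only when the deep scan reports nothing."""
    concerns = deep_scan(draft)
    for concern in concerns:
        print(f"blocked: {concern}")
    return not concerns


def fake_scan(draft: str) -> List[str]:
    # Stand-in for the LLM: flags one name a pattern list would not know about.
    return ["possible client name: Northwind"] if "Northwind" in draft else []
```

The gate runs once per finished draft, which is what keeps the expensive check affordable.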

The thing I would tell anyone building a system that ingests their real work is this: pick the moment of egress as the place you enforce confidentiality, and enforce it in code rather than by habit. Once confidentiality at egress is a line in the program instead of a note in your head, you stop spending attention on it, and you get to spend that attention on the writing instead.

FAQ

How do you prevent secrets from leaking when ingesting AI session logs into a pipeline?

Run redaction at the collector, on the machine where the logs live, before anything is sent over the wire. The server should only ever receive the redacted text and metadata about what was redacted, never the original. That way a compromised server or backup cannot expose the raw secret.

Why is server-side redaction after ingest a bad idea?

By the time server-side redaction runs, the raw secret has already traveled, been written to a disk you may not fully control, and potentially landed in a backup. Redacting after egress protects the report, not the data itself.

What is the difference between a deterministic redaction guard and an LLM deep-scan at a publish gate?

The deterministic guard is fast, cheap, and runs at egress; it catches known secret patterns and glossary terms but will miss novel personal information it has not been told about. The LLM deep-scan runs once on the finished draft and can catch things a regex cannot, like an unrecognized client name. Layering both is the point.

Why did splitting the pipeline between a laptop and an always-on box matter for confidentiality?

Once redaction moved to egress at the collector, the cloud sources (Slack, GitHub, Granola) no longer needed to touch the laptop at all. Only the genuinely local AI session logs remained laptop-bound, turning a fragile all-or-nothing dependency into a small, clearly bounded exception.

What should a redaction log record when it finds a secret?

It should record that a redaction happened and what category it fell into, never the value itself. A log that stores the secret in order to report finding it defeats its own purpose.