Bassam Ismail
Building Press · Part 1 of 6
Engineering

Building Press, Part 1: Read the work, leave the secrets


Confidentiality at egress became the boundary I had to draw before Press could safely read the work I actually do. I knew the cost of not writing before I knew the cost of writing. Every week the work left a trail worth telling: a Slack thread where a tricky call got made, a debugging session that finally cracked, a migration, a meeting where someone reframed the whole problem. By the time a quiet hour arrived to write about any of it, the context had gone cold and the blank page won. So I built Press, an editorial engine that reads the work I actually do, mines it for stories, and drafts them in my voice. This is Part 1 of how it works, and it starts with the most dangerous boundary in the whole system: the line between what I do and what the engine is allowed to see.

My thesis is narrow and a little uncomfortable: if you want a system that reads your real work, confidentiality cannot be a review step. It has to be a property of the egress path, enforced in code, before anything leaves the machine where the secret lives. Everything else in Press is downstream of getting that one boundary right. The next layer is the data model, which I cover in Building Press, Part 2: The data model is the spine.

Five sources, two of which never leave the laptop

Press pulls from five places. Three are cloud APIs: the Slack threads I was actually in, my GitHub pull requests and commits, and Granola meeting summaries. Two are local files that exist nowhere but my laptop: my Claude and Codex session logs, the running record of what I asked an AI and what it did.

That split matters because the two kinds of source have completely different trust profiles. The cloud sources are already mediated by an API and a token. The local logs are raw. They contain half-formed thoughts, pasted snippets, client names, the occasional credential I was debugging against. If any source is going to leak something I would regret, it is the logs. So the design question is not "how do we clean the data once it is in the database." It is "how do we guarantee the data is clean before it is ever in the database."

Confidentiality at egress, not on review

The collector runs where the source lives. It reads a signal, strips known secrets, and replaces sensitive terms against a glossary I maintain. Only two things cross the wire to the server: the redacted text, and metadata about what was redacted (how many hits, of what kind). The server never sees the original. It cannot, because the original never left.
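A minimal Python sketch of that pass, under stated assumptions: the pattern list, the glossary entry, the placeholder format, and the function name redact are all hypothetical, chosen only to illustrate the shape of a collector-side guard. It is not Press's actual code.

```python
import re
from collections import Counter

# Hypothetical secret patterns, keyed by the category reported as metadata.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "bearer_token": re.compile(r"\bBearer\s+[A-Za-z0-9._~+/-]+=*"),
}

# Hypothetical glossary: sensitive term -> neutral stand-in.
GLOSSARY = {"Acme Corp": "[client]"}


def redact(text, glossary=GLOSSARY):
    """Return (redacted_text, hits); hits counts redactions per category."""
    hits = Counter()
    # Pass 1: strip anything that matches a known secret pattern.
    for kind, pattern in SECRET_PATTERNS.items():
        text, n = pattern.subn(lambda m, k=kind: f"[redacted:{k}]", text)
        hits[kind] += n
    # Pass 2: swap glossary terms for stand-ins, case-insensitively.
    for term, stand_in in glossary.items():
        term_pattern = re.compile(re.escape(term), re.IGNORECASE)
        text, n = term_pattern.subn(lambda m, s=stand_in: s, text)
        hits["glossary"] += n
    # Keep only categories that actually fired.
    return text, Counter({k: v for k, v in hits.items() if v})


redacted, hits = redact("Debugging AKIAIOSFODNN7EXAMPLE for Acme Corp")
```

After the call, redacted carries placeholders in place of the key and the client name, and hits maps each category to a count. Nothing in the return value holds a matched secret, so only safe data is available to send.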

There is a deliberate asymmetry in what gets logged. The collector records that a redaction happened and what category it fell into. It never records the value. A redaction log that stored the secret in order to tell you it found a secret would be the exact failure it was built to prevent. So the log answers "did the guard fire, and how often," and is useless to an attacker.
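One way to make that asymmetry structural is to give the record no field that could hold a value at all. The sketch below assumes a JSON payload; the field names (source, text, redactions, kind, count) and the helper build_payload are illustrative guesses, not the real wire format.

```python
import json
from collections import Counter


def build_payload(source, redacted_text, hits):
    """Assemble the only data that leaves the machine (field names assumed)."""
    return {
        "source": source,
        "text": redacted_text,
        # One entry per category: what kind of thing was hit, and how often.
        # There is deliberately no field that could carry the matched value.
        "redactions": [
            {"kind": kind, "count": count}
            for kind, count in sorted(hits.items())
        ],
    }


payload = build_payload(
    "codex",
    "rotated [redacted:bearer_token] after the test",
    Counter({"bearer_token": 1}),
)
wire_body = json.dumps(payload)
```

Because the function only ever receives counts, a later change cannot quietly start logging the secret without also changing what the redaction pass hands over.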

I considered two simpler designs and rejected both. The first was a manual "remember to scrub it before publishing" step. That is not a control; it is a hope, and it fails in the first genuinely busy week, when you are tired and moving fast. The second was redacting on the server, after ingest, which is cleaner to build because the server has all the data in one place. But it inverts the trust model: now the raw secret has already traveled, already been written to a disk I do not fully control, already sat in a backup. Redaction after egress protects the report, not the data. The whole point is to protect the data.

Important

Confidentiality has to be automatic. A human "remember to scrub it" step fails the first busy week, and a server-side scrub runs after the secret has already traveled. The only redaction you can trust is the one that runs before egress and logs that it fired, never what it found.

The day the laptop went dark

The clean version of the egress story has an ugly footnote, and it is worth telling because it is where the architecture actually got decided.

Deep-dive: why the laptop went dark, and how the box took over

For a while the entire cycle was pinned to my laptop. The comment in the script said, plainly, "the server has no Codex auth and no session logs," and that was true, so the laptop ran everything: collection, mining, drafting. Then one morning the dashboard was stale. Nothing had run. The laptop had been asleep at the scheduled time, and because the laptop was the only thing that ran the cycle, the whole pipeline simply did not happen. There was no error, just silence, which is worse.

The fix was to notice that the comment was only half true. Slack, GitHub, and Granola are cloud APIs reachable with portable tokens. They never needed the laptop at all. Only the Claude and Codex session logs are genuinely laptop-local. So I split the cycle by source. The always-on box now runs SOURCES=slack,github,granola plus mining and drafting on a cron. The laptop's only remaining job is to push its local session logs when it happens to be awake. Moving GitHub off the gh CLI (not installed in the box's container) and onto the REST API was the one real porting cost.
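The split by source can be sketched as a small validation step that each host runs before collecting. Only the SOURCES=slack,github,granola value comes from the setup described above; the labels claude and codex, the has_local_logs flag, and both function names are assumptions made for illustration.

```python
# Source names other than slack, github, and granola are assumed labels.
CLOUD_SOURCES = ("slack", "github", "granola")  # reachable with portable tokens
LOCAL_SOURCES = ("claude", "codex")             # session logs on the laptop only


def parse_sources(raw):
    """Parse a comma-separated SOURCES value, rejecting unknown names."""
    names = [part.strip().lower() for part in raw.split(",") if part.strip()]
    unknown = [n for n in names if n not in CLOUD_SOURCES + LOCAL_SOURCES]
    if unknown:
        raise ValueError(f"unknown sources: {unknown}")
    return names


def sources_for_host(raw, has_local_logs):
    """Return the sources this host should collect, failing loudly on a bad split."""
    names = parse_sources(raw)
    misplaced = [n for n in names if n in LOCAL_SOURCES and not has_local_logs]
    if misplaced:
        raise ValueError(f"{misplaced} need the laptop's session logs")
    return names


# The box would read this string from its SOURCES environment variable.
box_sources = sources_for_host("slack,github,granola", has_local_logs=False)
```

Raising on a misplaced source, rather than skipping it, keeps a misconfigured host from going quiet in the way the sleeping laptop did.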

The deeper fix was making the failure visible. The Sources page now shows a per-source "last collected" timestamp and raises a stalled-cycle banner when something has not run in too long. A dark laptop used to be a mystery. Now it is a line on a page.
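The check behind such a banner can be very small. In this sketch the 26-hour threshold, the source labels, the timestamps, and the banner wording are all made up for the example; the text above does not say what "too long" is.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(hours=26)  # assumed: a daily cycle plus some slack


def stalled_sources(last_collected, now, stale_after=STALE_AFTER):
    """Return sources whose last collection is missing or older than the limit."""
    return sorted(
        source
        for source, seen in last_collected.items()
        if seen is None or now - seen > stale_after
    )


# Illustrative timestamps only.
now = datetime(2024, 1, 10, 9, 0, tzinfo=timezone.utc)
last_collected = {
    "slack": now - timedelta(hours=2),
    "github": now - timedelta(hours=2),
    "codex": now - timedelta(days=3),  # the laptop has been asleep
    "claude": None,                    # never collected
}
stalled = stalled_sources(last_collected, now)
banner = f"Stalled: {', '.join(stalled)}" if stalled else ""
```

Treating a missing timestamp as stalled matters: a source that has never run is exactly the silent case that an age comparison alone would miss.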

The lesson from that incident was not really about cron. It was that "where does this run" is a confidentiality decision in disguise. The laptop ran everything because the laptop was where the secrets were. Once redaction moved to egress, most of the work no longer needed to touch the laptop at all, and the parts that did (the local logs) became a small, clearly bounded exception instead of the reason the entire pipeline was fragile. Drawing the egress boundary correctly is what made the box-and-laptop split obvious.

What this does not catch

Deterministic redaction has a real ceiling, and it would be dishonest to pretend otherwise. It catches what it has been told to catch: known secret patterns and the terms in the glossary. It will not catch a novel piece of personal information that matches no pattern and is in no glossary, for example a client's name mentioned for the first time in a meeting. The deterministic pass is fast, cheap, and guaranteed, which is exactly why it belongs at egress. It is not, on its own, sufficient.

That is why it is the first guard and not the only one. A second, slower check runs much later, at the publish gate: an LLM deep-scan that reads the drafted post for the things a regex cannot see. I will get to that gate in Part 4. The important property is layering. The cheap deterministic guard runs where the data is most dangerous (at the source, before egress) and the expensive judgment-based guard runs where the cost is affordable (once, on a finished draft).

The thing I would tell anyone building a system that ingests their real work is this: pick the moment of egress as the place you enforce confidentiality, and enforce it in code rather than by habit. Once confidentiality at egress is a line in the program instead of a note in your head, you stop spending attention on it, and you get to spend that attention on the writing instead.