Engineering

Building Press, Part 2: The data model is the spine

June 19, 2026 · 8 min read

The first time Press, the editorial engine I built to draft from my own work, handed me a draft I wanted to change, I hesitated before touching it. Not because the writing was precious, but because I did not trust what "edit this" would cost me. If the post was a single blob of text in a single column, one careless save, or one over-eager AI rewrite, would overwrite the only copy. That hesitation is the tell I want to talk about. It means the data model is wrong.

After building the thing, I came away with this: the data model is the spine of an editorial engine, and it has to be made of distinct tables with one clear owner each. The moment you collapse "content" into a single fuzzy record, every stage of the pipeline starts reaching across lines it should not, and editing stops feeling safe.

TL;DR

When building an editorial engine, collapsing all content into a single table with a status column makes every stage of the pipeline unsafe to edit. In Press, the data model uses three distinct tables (signals, stories, posts) plus a post_revisions table so that each stage can fail or be corrected independently. A post's body is stored on a revision row, not on the post row itself, meaning an AI rewrite appends a new revision rather than overwriting the only copy. The shape of your tables determines which mistakes are even expressible.

Three tables, not one blob

Press models the lifecycle as three nouns that turn into each other: signals become stories become posts.

A signal is a single redacted snippet of work: a Slack message, a commit, a line from a meeting summary. A story is a cluster of signals the miner judged to be one tellable thing. A post is the published artifact a story becomes after drafting and review.

The temptation is to skip the ceremony and keep one content table with a status column that marches from "raw" to "published." I tried that shape in my head and rejected it, because it fails the test that matters: can each stage fail on its own, and can a human step in at any point without untangling the others? With three tables, mining can produce a bad cluster and I can fix or dismiss the story without touching any signal or post. Drafting can write a weak post and I can rewrite it without re-mining. The boundaries are not bureaucracy. They are the seams where a human or a retry can get a grip.
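As a rough sketch of the three-table shape, here is one way it could look, using Python's built-in sqlite3 with an in-memory database as a stand-in for whatever database Press actually runs on. The column names (source, snippet, status, the story_id on posts) are assumptions for illustration; the post only names the tables themselves.

```python
import sqlite3

# Hypothetical columns: only the table names come from the text above.
SCHEMA = """
create table stories (
    id      integer primary key,
    title   text not null,
    status  text not null default 'open'
);
create table signals (
    id        integer primary key,
    source    text not null,
    snippet   text not null,
    story_id  integer references stories(id)
);
create table posts (
    id        integer primary key,
    story_id  integer not null references stories(id),
    slug      text not null unique,
    title     text not null,
    status    text not null default 'draft'
);
"""


def open_db() -> sqlite3.Connection:
    """Open an in-memory database with the three lifecycle tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("pragma foreign_keys = on")
    conn.executescript(SCHEMA)
    return conn


conn = open_db()
with conn:
    conn.execute("insert into stories (id, title) values (1, 'Bad cluster')")
    conn.execute(
        "insert into signals (source, snippet, story_id) values (?, ?, ?)",
        ("commit", "fix autosave", 1),
    )
    # Dismissing the story is a one-row change on stories;
    # no signal row and no post row is rewritten.
    conn.execute("update stories set status = 'dismissed' where id = 1")
```

The point of the sketch is the blast radius: correcting a stage touches only that stage's table.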

story_id: the whole "have we processed this?" check

Mining needs to know which signals it has already consumed. The entire mechanism for that is one nullable column. A signal carries a story_id once it has been clustered, and mining selects the unprocessed ones with a single predicate:

select * from signals where story_id is null;

That is the whole bookkeeping system. No "mined" boolean, no separate queue table, no timestamp to reason about. A signal is either spoken for (it points at a story) or it is fair game (the column is null). The simplicity is the feature: there is exactly one place the truth lives, and it is impossible for "consumed" and "assigned to a story" to disagree, because they are the same fact.
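A minimal sketch of how a mining pass could use that predicate, again with sqlite3 as a stand-in. The helper names unprocessed and assign_to_story, and the snippet and title columns, are hypothetical and not taken from Press itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table stories (id integer primary key, title text not null);
create table signals (
    id       integer primary key,
    snippet  text not null,
    story_id integer references stories(id)
);
insert into signals (snippet) values
    ('commit: add revisions'),
    ('slack: autosave bug'),
    ('meeting: revisions');
""")


def unprocessed(conn):
    """Ids of signals that no story has claimed yet."""
    rows = conn.execute(
        "select id from signals where story_id is null order by id"
    )
    return [row[0] for row in rows]


def assign_to_story(conn, title, signal_ids):
    """Create a story and claim its signals in one transaction.

    The update repeats the 'story_id is null' guard, so a signal that
    already belongs to a story is left alone.
    """
    with conn:  # commits on success, rolls back on an exception
        story_id = conn.execute(
            "insert into stories (title) values (?)", (title,)
        ).lastrowid
        marks = ",".join("?" for _ in signal_ids)
        claimed = conn.execute(
            f"update signals set story_id = ? "
            f"where story_id is null and id in ({marks})",
            (story_id, *signal_ids),
        ).rowcount
    return story_id, claimed


story_id, claimed = assign_to_story(conn, "Revisions", [1, 3])
remaining = unprocessed(conn)  # only signal 2 is still fair game
```

Because the same column is both the marker and the assignment, there is nothing else to keep in sync after the update.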

I want to be honest about what this does not do. The story_id is null check tracks consumption, not novelty. It guarantees a signal is mined at most once. It does not guarantee that two stories are about genuinely different things. During one busy week, the miner clustered overlapping signals into two stories that were really the same topic, and both got drafted before I noticed. The column did its job perfectly and still let near-duplicates through, because deduplication of meaning is a different problem living one layer up, in the mining prompt. A clean column is not a substitute for clean judgment.

The body lives on a revision, not the post

Here is the decision that earned its keep more than any other: a post's body is not stored on the post row. The posts row holds identity and metadata (slug, title, pillar, status, SEO fields) and a pointer, current_revision_id. The actual Markdown lives in post_revisions, and every edit, human or AI, inserts a new revision rather than mutating the old one.

This is why "let the AI revise this" is a safe button instead of a scary one. An AI rewrite does not overwrite anything. It writes a new revision and, if I like it, I move the pointer. The worst case is a revision I simply never promote. The history is a diffable, rollback-able stack, and the question "what did this look like before the model touched it" always has an answer.
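The append-and-promote flow could be sketched like this, with sqlite3 standing in for the real database. The function names, the author column, and the text-typed uuid ids are assumptions for the example; only posts, post_revisions, and current_revision_id come from the description above.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table posts (
    id    text primary key,
    slug  text not null unique,
    title text not null,
    current_revision_id text
);
create table post_revisions (
    id         text primary key,
    post_id    text not null references posts(id),
    body       text not null,
    author     text not null,
    created_at text not null default current_timestamp
);
""")


def create_post(conn, slug, title):
    post_id = str(uuid.uuid4())
    with conn:
        conn.execute(
            "insert into posts (id, slug, title) values (?, ?, ?)",
            (post_id, slug, title),
        )
    return post_id


def append_revision(conn, post_id, body, author):
    """Insert a new revision row; existing revisions are never updated."""
    rev_id = str(uuid.uuid4())
    with conn:
        conn.execute(
            "insert into post_revisions (id, post_id, body, author) "
            "values (?, ?, ?, ?)",
            (rev_id, post_id, body, author),
        )
    return rev_id


def promote(conn, post_id, rev_id):
    """Move the pointer, but only to a revision that belongs to this post."""
    with conn:
        cur = conn.execute(
            "update posts set current_revision_id = ? where id = ? "
            "and exists (select 1 from post_revisions "
            "            where id = ? and post_id = ?)",
            (rev_id, post_id, rev_id, post_id),
        )
    if cur.rowcount != 1:
        raise ValueError("revision does not belong to this post")


def current_body(conn, post_id):
    row = conn.execute(
        "select r.body from posts p "
        "join post_revisions r on r.id = p.current_revision_id "
        "where p.id = ?",
        (post_id,),
    ).fetchone()
    return row[0] if row else None


post_id = create_post(conn, "data-model-spine", "The data model is the spine")
v1 = append_revision(conn, post_id, "Human draft.", "human")
promote(conn, post_id, v1)
v2 = append_revision(conn, post_id, "AI rewrite.", "ai")
before = current_body(conn, post_id)       # the AI row exists, nothing changed
promote(conn, post_id, v2)
after = current_body(conn, post_id)
promote(conn, post_id, v1)                 # rollback is just another promote
rolled_back = current_body(conn, post_id)
```

Nothing in this sketch issues an update against post_revisions, which is the property that matters: the only mutable thing is the pointer.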

Tip

Make the destructive operation impossible to express, not merely discouraged. When "edit" can only mean "append a revision," there is no code path that loses the previous version, so you never have to remember to be careful.

Deep-dive: body-on-revision, and the autosave sharp edge

The pointer is deliberately a soft reference. posts.current_revision_id is just a uuid column, not a hard foreign key, because a strict foreign key would create a circular constraint: a post points at a revision, and a revision points back at its post, so if both references are required, neither row can be inserted first without a deferred constraint dance. A soft pointer sidesteps that entirely, at the cost that the application is responsible for keeping it valid.
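One way the application could take on that responsibility is a periodic audit for pointers that no longer resolve. This is a sketch under assumptions: dangling_pointers is a hypothetical helper, the ids are short strings rather than uuids for readability, and sqlite3 stands in for the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table posts (
    id text primary key,
    current_revision_id text
);
create table post_revisions (
    id      text primary key,
    post_id text not null references posts(id),
    body    text not null
);
insert into posts values
    ('p1', 'r1'),
    ('p2', 'r-missing'),
    ('p3', 'r1'),
    ('p4', null);
insert into post_revisions values ('r1', 'p1', 'hello');
""")


def dangling_pointers(conn):
    """Posts whose pointer is set but does not resolve to one of their own revisions."""
    rows = conn.execute("""
        select p.id
        from posts p
        left join post_revisions r
               on r.id = p.current_revision_id
              and r.post_id = p.id
        where p.current_revision_id is not null
          and r.id is null
        order by p.id
    """)
    return [row[0] for row in rows]


broken = dangling_pointers(conn)
# p2 points at a revision that does not exist; p3 points at p1's revision.
# p4 has no pointer yet, which is a valid state, so it is not reported.
```

The join condition checks ownership as well as existence, which is the part a plain foreign key on the id alone would not cover either.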

The honest sharp edge is autosave. "Every edit is a new revision" is true for committed edits and AI revisions, the ones that go through the explicit save path. The live editor, though, autosaves your in-progress typing back onto the working revision in place, so it does not spawn a row on every keystroke. That is the right call for an editor (you do not want a thousand revisions from one writing session), but it means "reversible" applies to promoted and AI revisions, not to an open editing session you have been typing into for ten minutes. If you want a checkpoint mid-session, you take a snapshot explicitly. Knowing exactly where the guarantee starts and stops is more useful than pretending it is absolute.
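The difference between the two paths can be sketched as follows. The autosave and snapshot helpers are hypothetical names, and the schema is trimmed to the one table involved; sqlite3 is a stand-in for the real database.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table post_revisions (
    id      text primary key,
    post_id text not null,
    body    text not null
);
insert into post_revisions (id, post_id, body) values ('w1', 'p1', '');
""")


def autosave(conn, rev_id, body):
    """In-place update of the working revision: the previous text is gone."""
    with conn:
        conn.execute(
            "update post_revisions set body = ? where id = ?", (body, rev_id)
        )


def snapshot(conn, rev_id):
    """Copy the working revision into a new row as an explicit checkpoint."""
    copy_id = str(uuid.uuid4())
    with conn:
        conn.execute(
            "insert into post_revisions (id, post_id, body) "
            "select ?, post_id, body from post_revisions where id = ?",
            (copy_id, rev_id),
        )
    return copy_id


autosave(conn, "w1", "First paragraph.")
checkpoint = snapshot(conn, "w1")
autosave(conn, "w1", "First paragraph, rewritten.")
bodies = dict(conn.execute("select id, body from post_revisions"))
# bodies["w1"] holds the latest typing; bodies[checkpoint] holds the snapshot.
```

Only the text captured by snapshot survives the later autosave, which is exactly where the reversibility guarantee begins.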

The tables at the edges

Two more tables hang off posts, and keeping them separate follows the same logic. social_drafts holds the LinkedIn and X copy for a post: secondary artifacts with their own lifecycle (queued, posted, a captured URL) that should not bloat the post row or block the blog from rendering. engagement records the metrics those social posts pull back later. A post can exist and publish with no social drafts, a social draft can be posted and gather engagement independently, and none of it touches the post's body. Each table owns one concern, and the failure of any one of them leaves the others standing.
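A sketch of how those two edge tables might hang off posts, with made-up sample rows. Every column name here (platform, posted_url, metric, value) is an assumption, as are the numbers; the source only names the tables and what they are for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")
conn.executescript("""
create table posts (
    id     integer primary key,
    slug   text not null unique,
    status text not null
);
create table social_drafts (
    id         integer primary key,
    post_id    integer not null references posts(id),
    platform   text not null check (platform in ('linkedin', 'x')),
    status     text not null default 'queued',
    posted_url text
);
create table engagement (
    id              integer primary key,
    social_draft_id integer not null references social_drafts(id),
    metric          text not null,
    value           integer not null
);
insert into posts (id, slug, status) values
    (1, 'data-model-spine', 'published'),
    (2, 'no-social-yet', 'published');
insert into social_drafts (id, post_id, platform, status) values
    (1, 1, 'linkedin', 'posted');
insert into engagement (social_draft_id, metric, value) values
    (1, 'likes', 12),
    (1, 'comments', 3);
""")


def published_slugs(conn):
    """Rendering the blog reads posts only; social tables are not consulted."""
    rows = conn.execute(
        "select slug from posts where status = 'published' order by id"
    )
    return [row[0] for row in rows]


def engagement_by_post(conn):
    """Total engagement per post; posts without social drafts report 0."""
    rows = conn.execute("""
        select p.slug, coalesce(sum(e.value), 0)
        from posts p
        left join social_drafts d on d.post_id = p.id
        left join engagement e on e.social_draft_id = d.id
        group by p.id
        order by p.id
    """)
    return dict(rows.fetchall())


slugs = published_slugs(conn)
totals = engagement_by_post(conn)
```

The left joins are what let a post with no social drafts still show up, with a total of zero, instead of disappearing from the report.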

Design the tables so feared operations cannot be written down

The shape of your data decides which mistakes are even possible. I did not make editing safe by being careful; I made it safe by storing the body on a revision so that carelessness has nowhere to land. I did not make mining idempotent with discipline; I made it idempotent with a nullable column that cannot lie. When the model is right, distinct tables let it work; when it is wrong, they give a person somewhere to stand. Design the tables so that the operations you fear cannot be written down, and you stop needing to fear them.

FAQ

Why store post body on a revision table instead of the post row?

Keeping the body in a post_revisions table means every edit, human or AI, inserts a new row instead of mutating the existing one. The post row holds a pointer (current_revision_id) that you advance only when you accept a change, so the previous version is always recoverable. This makes AI rewrites safe by construction: the worst outcome is a revision you never promote.

How does Press track which signals have already been mined?

Each signal row has a nullable story_id column. Mining selects signals where story_id is null, and once a signal is clustered into a story the column is populated. There is no separate queue table or boolean flag, so consumed and assigned-to-a-story are the same fact and cannot disagree.

Why use three separate tables (signals, stories, posts) instead of one content table with a status column?

A single table with a status column means a bad cluster from mining and a weak drafted post are tangled in the same row. Three tables let each stage fail independently: you can fix or discard a story without touching any signal or post, and you can rewrite a post without re-running the miner.

Why is posts.current_revision_id a soft pointer rather than a foreign key?

A strict foreign key would create a circular constraint: the post points at a revision and the revision points back at the post, so if both references are required, neither row can be inserted first without a deferred constraint. A soft uuid pointer sidesteps that circular dependency, at the cost of the application being responsible for keeping the reference valid.

Does the story_id is null check prevent duplicate or near-duplicate stories from being mined?

No. The nullable story_id column guarantees a signal is consumed at most once, but it does not detect when two stories cover the same topic. Deduplication of meaning lives one layer up in the mining prompt, not in the column.