Measuring Performance Work Before Declaring Victory
I was looking at a stack of merged pull requests that all looked correct in isolation, which is the least comforting kind of correct. We had shipped 13 performance changes into a planning app: fewer correlated counts, narrower Prisma includes, SQL aggregation instead of JavaScript loops, better Apollo cache behavior, and a RAM resize. The work smelled right. The problem was that we could not use Sentry traces and database stats well enough to say which changes mattered, what they cost, or whether production users were actually seeing the win.
TL;DR
To measure performance work, I treated observability as a temporary operating posture, not a permanent setting. We sampled Sentry traces at 100% while debugging, enabled pg_stat_statements, added an environment-controlled Sentry debug switch, provisioned dashboards as code, then dropped trace sampling to 10% once the production cost was no longer worth the extra visibility.
The answer was not a prettier dashboard. It was a measurement loop with an exit ramp. Instrument hard while the system is under suspicion, use the data to rank fixes, commit the dashboard surface so the team can repeat the measurement, then turn the expensive knobs back down. Leaving every dial at maximum after the incident is over is not rigor. It is just a slow invoice.
Measuring performance work with Sentry traces and database stats
The first bad sign was social, not technical. We were about to describe a day of performance work using the language of belief: this should be faster, this removes a lot of work, this looks better locally. That is how teams accidentally turn merged code into folklore.
The app had two telemetry surfaces available but underused. Sentry could tell us which GraphQL transactions were slow at p95, and Postgres could tell us which SQL statements consumed the most total execution time. Neither was enough alone.
Sentry gave us the user-facing shape: GetProjectTasks at about 3.6 seconds p95, GetProjects around 639 ms, dashboardMetrics around 704 ms, GetResources around 398 ms. That made the pain legible.
Postgres gave us the database truth underneath it: repeated queries, mean time, call count, and total time. That helped separate a query that is individually awful from a query that is modest but called so often it quietly burns the room down.
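To make that distinction concrete, here is a minimal sketch of how I think about classifying rows pulled from the stats view. The row shape, the function name `rankStatements`, and both thresholds are assumptions for illustration, not something the app shipped.

```typescript
// Hypothetical row shape: a subset of pg_stat_statements columns, renamed to camelCase.
interface StatementStat {
  query: string;
  calls: number;
  totalExecTimeMs: number;
  meanExecTimeMs: number;
}

type CostShape = "slow-per-call" | "high-volume" | "minor";

interface RankedStatement extends StatementStat {
  pctOfTotal: number;
  shape: CostShape;
}

// Thresholds are illustrative assumptions, not recommendations.
function rankStatements(
  stats: StatementStat[],
  slowMeanMs = 100,
  significantPct = 5,
): RankedStatement[] {
  const grandTotal = stats.reduce((sum, s) => sum + s.totalExecTimeMs, 0);
  return stats
    .map((s): RankedStatement => {
      const pctOfTotal =
        grandTotal === 0 ? 0 : (100 * s.totalExecTimeMs) / grandTotal;
      let shape: CostShape = "minor";
      if (s.meanExecTimeMs >= slowMeanMs) {
        // Each individual call is expensive.
        shape = "slow-per-call";
      } else if (pctOfTotal >= significantPct) {
        // Each call is cheap, but the sheer call count adds up.
        shape = "high-volume";
      }
      return { ...s, pctOfTotal, shape };
    })
    .sort((a, b) => b.totalExecTimeMs - a.totalExecTimeMs);
}
```

Sorting by total time rather than mean time is the point: a 3 ms query called 20,000 times outranks an 800 ms query called 10 times.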
The useful mental model here is what I now call the measurement posture: a deliberately temporary configuration that answers a specific operational question. It has three parts:
| Part | Question it answers | Example |
|---|---|---|
| Exposure | What are we willing to pay to see right now? | 100% Sentry trace sampling during debugging |
| Attribution | Can we connect symptoms to mechanisms? | Sentry transaction p95 plus pg_stat_statements query totals |
| Exit | When do we stop paying the extra cost? | Drop trace sampling to 10% after dashboards are seeded |
That last row matters. Without an exit, observability changes become permanent infrastructure sediment. I used a similar cost-control instinct when tracking what the LLM costs in a one-person app: measure aggressively while the question is open, then make the expensive behavior explicit.
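One way to keep the exit from depending on memory is to write it down as data. The sketch below is hypothetical: we lowered the sample rate by hand, and the `MeasurementPosture` type, its fields, and `effectiveSampleRate` are names I am inventing here to show the idea.

```typescript
// Hypothetical: a posture records the question, the elevated setting, and when to stop paying.
interface MeasurementPosture {
  question: string;
  elevatedSampleRate: number;
  baselineSampleRate: number;
  expiresAt: Date;
}

// Returns the elevated rate until the posture expires, then the baseline.
function effectiveSampleRate(posture: MeasurementPosture, now: Date): number {
  return now.getTime() < posture.expiresAt.getTime()
    ? posture.elevatedSampleRate
    : posture.baselineSampleRate;
}

// Example values only; the date is made up.
const examplePosture: MeasurementPosture = {
  question: "Which GraphQL transactions dominate p95?",
  elevatedSampleRate: 1.0,
  baselineSampleRate: 0.1,
  expiresAt: new Date("2024-01-15T00:00:00Z"),
};
```

If the result is passed to a setting that is read once at startup, the expiry only takes effect on the next process start, so this documents the exit more than it enforces it.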
The database needed to speak first
pg_stat_statements is a PostgreSQL extension that records normalized query fingerprints with timing and call statistics. It is not magic. It does not explain your ORM. It just says, with admirable indifference, which SQL statements spent the most time executing. The PostgreSQL documentation covers the extension and its collected statistics at postgresql.org/docs/current/pgstatstatements.html.
For this app, we enabled it in the local and deployed Postgres command by adding the preload library and tracking all statements:
```yaml
services:
  postgres:
    image: postgres:16
    command:
      - "postgres"
      - "-c"
      - "shared_preload_libraries=pg_stat_statements"
      - "-c"
      - "pg_stat_statements.track=all"
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: [REDACTED:secret]
      POSTGRES_DB: app
    ports:
      - "5432:5432"
```

After the container restarted, the extension still had to be created inside the database:
```bash
cd ~/planning-app
docker compose exec postgres psql -U app -d app -c "CREATE EXTENSION IF NOT EXISTS pg_stat_statements;"
```

The query I kept coming back to was boring, which is usually a point in its favor:
```sql
SELECT
  calls,
  round(total_exec_time::numeric, 2) AS total_ms,
  round(mean_exec_time::numeric, 2) AS mean_ms,
  round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 2) AS pct,
  left(query, 220) AS query
FROM pg_stat_statements
WHERE dbid = (SELECT oid FROM pg_database WHERE datname = current_database())
ORDER BY total_exec_time DESC
LIMIT 20;
```

That view immediately changed the order of work. After several app-level fixes landed, the top database costs were not the old obvious monsters. They were permission-scope queries for team members: an OR of EXISTS subqueries against time_entries and resource_bookings. The list query and its pagination count had become the new top two statements by total execution time.
The fix was not to admire the SQL and hope the planner felt encouraged. We added the index that matched the access pattern:
```sql
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_time_entries_user_project
  ON time_entries (user_id, project_id);

CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_resource_bookings_user_project
  ON resource_bookings (user_id, project_id);
```

There is a sharp edge here: pg_stat_statements aggregates normalized SQL, so it can hide parameter-specific behavior. If one tenant, project, or role produces a skewed plan, the aggregate can make it look less dramatic than it feels to that user. I still want query plans for the worst cases. But as a ranking tool, it is excellent.
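A small worked example shows how much an aggregate can flatten. The durations below are made-up numbers, and `summarizeCalls` is a hypothetical helper, not part of the extension.

```typescript
interface CallSummary {
  calls: number;
  meanMs: number;
  maxMs: number;
}

// Collapses per-call durations into the kind of summary an aggregate view reports.
function summarizeCalls(durationsMs: number[]): CallSummary {
  if (durationsMs.length === 0) {
    return { calls: 0, meanMs: 0, maxMs: 0 };
  }
  const total = durationsMs.reduce((sum, d) => sum + d, 0);
  return {
    calls: durationsMs.length,
    meanMs: total / durationsMs.length,
    maxMs: Math.max(...durationsMs),
  };
}

// Invented data: 990 calls at 5 ms for typical tenants, 10 calls at 2000 ms for one skewed tenant.
const typicalCalls: number[] = Array.from({ length: 990 }, () => 5);
const skewedCalls: number[] = Array.from({ length: 10 }, () => 2000);
const combinedSummary = summarizeCalls([...typicalCalls, ...skewedCalls]);
// combinedSummary.meanMs is (990 * 5 + 10 * 2000) / 1000 = 24.95 ms,
// which looks healthy even though one tenant waits two seconds per call.
```

The view also exposes max_exec_time and stddev_exec_time columns, which can hint at this kind of skew before you reach for a query plan.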
The trace sampler was a dial, not a belief system
Sentry tracing was the other half of the measurement posture. We temporarily sampled traces at 100% because the immediate job was to debug missing transactions and rank hot paths. Sampling every trace in production is expensive, noisy, and occasionally necessary. The trick is remembering the word temporarily. Sentry's tracing docs describe the sampling controls at docs.sentry.io/platforms/javascript/tracing.
The client, server, and edge configs all carried the same sampling decision while we were measuring:
```typescript
import * as Sentry from "@sentry/nextjs";

Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  environment: process.env.NEXT_PUBLIC_APP_ENV ?? "development",
  tracesSampleRate: 1.0,
  replaysSessionSampleRate: 0.1,
  replaysOnErrorSampleRate: 1.0,
});
```

The 100% setting helped us see the transactions that mattered. It showed, for example, that GetProjectTasks was the heaviest GraphQL transaction at about 3.6 seconds p95. The mechanism was not subtle once the trace pointed at it: the task list included _count for subtasks and timeEntries, which generated correlated count subqueries per task row. Around 500 rows meant roughly 1,000 embedded sub-selects. The database was not being slow. We were asking it to do paperwork with a stapler taped to each page.
We removed the list-time counts and maintained logged minutes separately, without changing the GraphQL contract or the UI. That is the kind of performance change I prefer: keep the boundary stable, move the work to a cheaper place, and make the measurement prove the claim.
A similar pattern showed up in GetProjects. The app fetched individual billable time entries with a joined user, then iterated in JavaScript to compute actual fees. On real data, that meant thousands of rows per request, many differing only by issue and rate. We moved aggregation into SQL with groupBy, collapsing rows by roughly 3 to 5 times before they crossed the application boundary.
```typescript
const fees = await prisma.timeEntry.groupBy({
  by: ["projectId", "userId", "billRate"],
  where: {
    projectId: { in: projectIds },
    billable: true,
    deletedAt: null,
  },
  _sum: {
    minutes: true,
  },
});

return fees.map((row) => ({
  projectId: row.projectId,
  userId: row.userId,
  amount: ((row._sum.minutes ?? 0) / 60) * row.billRate,
}));
```

That code is less flexible than hydrating everything and looping in JavaScript. It also has the virtue of not hydrating everything and looping in JavaScript.
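To show what the grouping buys, here is an in-memory sketch of the same collapse. It is an illustration only: the real work happens in SQL, and the `TimeEntryRow` and `FeeGroup` shapes and the sample rows are assumptions.

```typescript
// Hypothetical shapes; billRate is treated as a plain number (dollars per hour).
interface TimeEntryRow {
  projectId: string;
  userId: string;
  billRate: number;
  minutes: number;
}

interface FeeGroup {
  projectId: string;
  userId: string;
  billRate: number;
  minutes: number;
  amount: number;
}

// Collapses rows that share (projectId, userId, billRate) and sums their minutes.
function groupFees(rows: TimeEntryRow[]): FeeGroup[] {
  const groups = new Map<string, FeeGroup>();
  for (const row of rows) {
    const key = JSON.stringify([row.projectId, row.userId, row.billRate]);
    const existing = groups.get(key);
    if (existing) {
      existing.minutes += row.minutes;
    } else {
      groups.set(key, {
        projectId: row.projectId,
        userId: row.userId,
        billRate: row.billRate,
        minutes: row.minutes,
        amount: 0,
      });
    }
  }
  // Compute the fee once per group instead of once per entry.
  return Array.from(groups.values()).map((g) => ({
    ...g,
    amount: (g.minutes / 60) * g.billRate,
  }));
}
```

Four entries where two share a project, user, and rate come back as three groups. The collapse ratio on real data depends entirely on how often those three fields repeat.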
Debug mode belonged in the environment
While tracing, we also had a missing-transactions problem. The Sentry SDK can emit verbose diagnostics through debug: true, but hard-coding that value, so every toggle requires a code change and a redeploy, is the wrong operational shape. When instrumentation is already suspect, adding a deployment cycle to inspect the inspector is a small private comedy.
So we put debug mode behind an environment variable controlled through runtime configuration:
```typescript
import * as Sentry from "@sentry/node";

const sentryDebug = process.env.SENTRY_DEBUG === "true";

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.APP_ENV ?? "production",
  tracesSampleRate: Number(process.env.SENTRY_TRACES_SAMPLE_RATE ?? "1"),
  debug: sentryDebug,
});
```

That made diagnostics a reversible operational action:
```bash
cd ~/planning-app
aws ssm put-parameter \
  --name "/planning-app/production/SENTRY_DEBUG" \
  --type "String" \
  --value "true" \
  --overwrite
```

Important: Debug flags are observability controls, not application features. They should be cheap to turn on, visible to operators, and just as cheap to turn off.
The cost is that environment-controlled behavior needs discipline. Someone has to know which runtime reads the value, when it refreshes, and whether a process restart is required. A flag that looks dynamic but is only read at boot can lie very politely.
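Part of that discipline is parsing the values defensively, because `Number("")` is `0` and `Number("abc")` is `NaN`, and neither is a sample rate anyone meant to set. A minimal sketch, assuming hypothetical helper names and taking the environment as a plain object so it can be exercised without a running process:

```typescript
// Hypothetical helper: falls back when the value is missing, blank, or not a number,
// and clamps anything else into the valid 0..1 range.
function parseSampleRate(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") {
    return fallback;
  }
  const parsed = Number(raw);
  if (!Number.isFinite(parsed)) {
    return fallback;
  }
  return Math.min(1, Math.max(0, parsed));
}

// Hypothetical helper: only the literal string "true" (any case, trimmed) enables the flag.
function parseBooleanFlag(raw: string | undefined): boolean {
  return raw?.trim().toLowerCase() === "true";
}

interface SentryRuntimeConfig {
  debug: boolean;
  tracesSampleRate: number;
}

// Variable names match the ones used above; the 0.1 fallback is an assumption.
function readSentryRuntimeConfig(
  env: Record<string, string | undefined>,
): SentryRuntimeConfig {
  return {
    debug: parseBooleanFlag(env.SENTRY_DEBUG),
    tracesSampleRate: parseSampleRate(env.SENTRY_TRACES_SAMPLE_RATE, 0.1),
  };
}
```

Logging the resolved object once at startup is a cheap way to see which values the running process read, which is exactly the thing a boot-time flag will otherwise hide.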
Dashboards as code kept the measurement repeatable
The next problem was dashboard drift. Clicking together a dashboard during a performance push is fine for the first hour. By the second change, you need the dashboard to be an artifact, not a memory.
We committed a provisioner script that deleted dashboards with matching titles and recreated them from code. Idempotent-ish was enough. The point was not perfect reconciliation. The point was that editing widgets meant editing a script and rerunning it, instead of manually reconstructing whatever I had clicked at 11 p.m.
```typescript
const dashboards = [
  {
    title: "Backend performance",
    widgets: [
      {
        title: "GraphQL p95",
        displayType: "line",
        queries: [
          {
            name: "p95 by transaction",
            fields: ["transaction", "p95(transaction.duration)"],
            conditions: "event.type:transaction transaction:GraphQL/*",
            orderby: "-p95(transaction.duration)",
          },
        ],
      },
    ],
  },
];

for (const dashboard of dashboards) {
  await deleteDashboardByTitle(dashboard.title);
  await createDashboard(dashboard);
}
```

```bash
cd ~/planning-app
SENTRY_AUTH_TOKEN="[REDACTED:secret]" \
SENTRY_ORG="example-org" \
SENTRY_PROJECT="planning-app" \
npm run setup:sentry-dashboards
```

This is where measurement became organizational rather than personal. A teammate could rerun the same dashboards, compare the same p95 charts, and inspect the same database rankings. That does not make the numbers perfect. It makes the argument reproducible. That same bias toward repeatable operational artifacts shows up in Building Press, Part 3: The prompts are data, not code, where the important thing was making behavior reviewable instead of ambient.
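The delete-and-recreate step can be reasoned about as a pure planning function, separate from any API calls. This is a sketch under assumptions: `planProvision` and the types are hypothetical, and it models only the title matching, not the real Sentry dashboard API.

```typescript
interface ExistingDashboard {
  id: string;
  title: string;
}

interface DashboardSpec {
  title: string;
  widgetTitles: string[];
}

interface ProvisionPlan {
  deleteIds: string[];
  create: DashboardSpec[];
}

// Every existing dashboard whose title matches a desired spec is deleted,
// then every desired spec is created fresh. Unrelated dashboards are left alone.
function planProvision(
  existing: ExistingDashboard[],
  desired: DashboardSpec[],
): ProvisionPlan {
  const desiredTitles = new Set(desired.map((d) => d.title));
  return {
    deleteIds: existing
      .filter((d) => desiredTitles.has(d.title))
      .map((d) => d.id),
    create: desired,
  };
}
```

Matching by title means accidental duplicates get cleaned up on the next run, and it also means renaming a dashboard in the script orphans the old one. That trade-off is what idempotent-ish was paying for.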
Turning Sentry traces back down was part of the fix
Once the dashboards existed and the worst transactions had been attacked, we lowered Sentry trace sampling from 100% to 10% across the client, server, and edge configs:
```typescript
Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  environment: process.env.NEXT_PUBLIC_APP_ENV ?? "production",
  tracesSampleRate: 0.1,
  replaysSessionSampleRate: 0.1,
  replaysOnErrorSampleRate: 1.0,
});
```

The replay settings stayed unchanged. Session replay was already sampled at 10%, and replay-on-error stayed at 100% because error context is a different value proposition than routine trace volume.
This is the part teams skip because it feels less heroic than shaving seconds off p95. But reducing sampling was not cleanup. It was the completion of the measurement posture. We had paid for visibility while the question was open. Once the dashboards and fixes gave us enough confidence, continuing to pay for every trace would have been an unbounded tax.
The final user-facing result was meaningful: the most-used pages got roughly 3 to 6 times faster. Slow GraphQL queries that had been sitting around 0.7 to 3.6 seconds at p95 moved into the rough range of 80 to 500 ms. Those numbers are not a universal law. They are the measured outcome of this app, this data shape, and this sequence of fixes.
Deep-dive: The fix pattern under the PRs
The performance PRs were not one technique repeated 13 times. They were a few boring patterns applied where measurement pointed.
| Hot path | Mechanism | Fix |
|---|---|---|
| GetProjectTasks | Correlated _count subqueries per list row | Remove list-time counts, maintain logged minutes |
| GetProjects | Fetch thousands of time-entry rows, aggregate in JS | Aggregate fees with groupBy before hydration |
| dashboardMetrics | Fetch active booking rows to sum one field | Sum booking percentage in SQL |
| GetResources | Hydrate full joined project rows | Select only id, code, and name |
| Project scope | OR EXISTS filters over large tables | Add composite indexes for user and project |
The common thread is not "use SQL for everything." It is to stop moving high-cardinality intermediate data across a boundary when the consumer needs a small result.
FAQ
How do I measure performance work after several PRs land?
Use one user-facing signal and one mechanism-facing signal. In this case, Sentry ranked GraphQL transactions by p95, while pg_stat_statements ranked normalized SQL by total execution time, calls, and mean time.
Why use 100% Sentry trace sampling temporarily?
A short 100% sampling window can make missing or rare transaction patterns visible while debugging. The key is to pair it with an explicit reduction plan, such as dropping to 10% after the hot paths and dashboards are understood.
Where should I capture slow database queries in Postgres?
Enable pg_stat_statements, create the extension in the database, and query it by total execution time. Use it to rank candidates, then inspect plans for specific slow statements before adding indexes or rewriting queries.
Why provision observability dashboards as code?
Dashboards as code make performance claims repeatable. A committed provisioner lets the team recreate the same widgets, compare the same transaction groups, and review changes like any other engineering artifact.
What is the main risk of this measurement approach?
Sampling and aggregate query stats can both hide edge cases. A 10% trace sample may miss low-volume pain, and pg_stat_statements can smooth over parameter-specific query plans, so important fixes still need targeted verification.
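The sampling half of that risk is easy to put a number on. As a rough sketch, assuming each transaction is sampled independently at a fixed rate (a simplification, and `chanceOfAtLeastOneTrace` is a name I made up), the chance of capturing at least one trace of a path that runs n times is 1 - (1 - rate)^n:

```typescript
// Probability that at least one of `occurrences` transactions is sampled,
// assuming independent sampling at `sampleRate` (between 0 and 1).
function chanceOfAtLeastOneTrace(sampleRate: number, occurrences: number): number {
  return 1 - Math.pow(1 - sampleRate, occurrences);
}

// A path hit 5 times in the window at a 10% rate: 1 - 0.9^5, about 0.41.
const rarePathChance = chanceOfAtLeastOneTrace(0.1, 5);
```

At 10%, a path that runs only a handful of times in the window is more likely to be missed than seen, which is why low-volume complaints still deserve a targeted look.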
Observability did its job when we stopped treating it like a shrine and started treating it like a tool rental: use the right one, pay for it while the work is real, return it before it becomes furniture.
