Bassam Ismail
Engineering

Walking Back from Serverless

The first version of this artifact host made its core operation worse: uploading one bundle became two code paths because API Gateway would not accept the bytes. What looked like the safe default turned out to be a runtime I kept designing around, and the way out was a move from serverless to a single Go binary.

The platform in Part 1 and Part 2 is one Go binary on one box. It did not start there. The first version was serverless, built the way the diagrams in the conference talks tell you to: API Gateway in front of a fleet of Lambdas, a managed user pool for sign-in, DynamoDB and S3 for state, all wired together in CDK. Scales to zero, no servers to patch, tear a whole environment down with one command. For a while it was fine. Then the platform's own job, taking a bundle of files and serving it at a URL, started fighting the runtime it was built on. This part is what went wrong, the decision to leave, and the scars the rewrite left behind.

TL;DR

The serverless v1 (API Gateway, a Lambda fleet, a Cognito user pool, DynamoDB, S3) hit three walls that matter for an artifact host: API Gateway's 10MB request-body cap forced the core "upload a bundle" operation into two separate code paths, the serving path paid cold starts on first load, and the MCP server's mid-deploy question wanted real streaming that the gateway made awkward. The fix was the one Go binary from Part 2. The migration was compute-only: DynamoDB and S3 never moved. Two rewrite scars stuck with me: the AWS SDK's own checksum headers silently breaking a presigned upload, and a Go nil slice serializing to JSON null and getting a tool rejected by the AI client's schema validator.

What the serverless version was

The shape was conventional. API Gateway was the front door. Behind it sat a fleet of small TypeScript Lambdas, one cluster per concern: creating and listing projects and groups, issuing and checking API keys, handling the upload, serving files back, the auth callbacks. Sign-in went through a managed user pool, with a pre-sign-up trigger that enforced the company email domain so only employees could get an account. Project and group records lived in DynamoDB; the uploaded bundles lived in S3. CDK described all of it, so a deploy was a stack update and an environment was reproducible.

None of that is wrong on its face. For a lot of services it is exactly right. The trouble was specific to what this service does.

The cap that shaped the core feature

Serverless stops being a straightforward default when your product's core interaction runs into one of the platform's limits: body size, connection lifetime, or first-byte latency. For this platform, all three applied, and the body-size wall hit first.

The whole point of the platform is uploading a bundle and getting a URL. So the operation that mattered most was the upload, and the upload was the one serverless made hardest.

API Gateway caps a request body at 10MB. An artifact bundle, a built single-page app with a few images or a wasm blob, clears that without trying; the platform was willing to accept tens of megabytes. So a single conceptual action, "publish this bundle," had to become two different code paths:

  • a small-bundle path, where the files come inline in the request, and
  • a large-bundle path, where the client first asks for a presigned S3 URL, uploads the bytes straight to S3, and then tells the API "they are there, index them."

That split is still visible in the code today, with the limits written down as constants: a couple thousand files maximum, a total-bytes ceiling in the tens of megabytes, a fifteen-minute presigned-URL lifetime, and a hard few-megabyte cap on an inline upload before the client is required to presign instead. The presigned path is not free. It is a second set of handlers, a second failure mode, and a permanent footnote in clients: small bundles go inline, big ones do the dance. The split originated because of the gateway, not because of the problem being solved.
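A minimal sketch of how such a split might be expressed, not the platform's actual code. The post gives only orders of magnitude, so every numeric value below except the fifteen-minute lifetime is an assumption, and the names (choosePath, maxInlineBytes, and so on) are hypothetical:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical limits. The article states only rough magnitudes, so the
// exact numbers here are assumptions chosen for illustration.
const (
	maxFiles        = 2000             // "a couple thousand files" (assumed)
	maxTotalBytes   = 50 << 20         // "tens of megabytes" (assumed 50 MiB)
	maxInlineBytes  = 4 << 20          // "few-megabyte" inline cap (assumed 4 MiB)
	presignLifetime = 15 * time.Minute // fifteen-minute presigned-URL lifetime
)

type uploadPath string

const (
	pathInline    uploadPath = "inline"
	pathPresigned uploadPath = "presigned"
)

var (
	errTooManyFiles = errors.New("bundle has too many files")
	errTooLarge     = errors.New("bundle exceeds total size limit")
)

// choosePath decides which of the two upload paths a bundle must take,
// or rejects it outright when it exceeds the hard limits.
func choosePath(fileCount int, totalBytes int64) (uploadPath, error) {
	switch {
	case fileCount > maxFiles:
		return "", errTooManyFiles
	case totalBytes > maxTotalBytes:
		return "", errTooLarge
	case totalBytes <= maxInlineBytes:
		return pathInline, nil
	default:
		return pathPresigned, nil
	}
}

func main() {
	fmt.Println("presigned URLs live for", presignLifetime)
	for _, size := range []int64{1 << 20, 20 << 20, 80 << 20} {
		p, err := choosePath(10, size)
		fmt.Println(size, p, err)
	}
}
```

The point of the sketch is the shape: one conceptual action, two return values the client has to care about.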

This is the moment the default stops being free. You are no longer building your feature; you are building around the shape of the runtime.

The presigned PUT that would not PUT

The large-bundle path also produced the single best bug of the project, the kind you only find by shipping.

When the MCP client publishes a big artifact, it asks the platform for a presigned PUT URL and uploads to it directly. That worked, and then at some point it stopped: the uploads started coming back rejected with a signature error, even though the URL was freshly minted and correct.

The cause was the AWS SDK protecting me. Recent SDKs add integrity checksums to S3 uploads by default, which means when you presign a PUT, the SDK folds checksum headers (x-amz-sdk-checksum-algorithm and an x-amz-checksum-crc32 value) into the set of signed headers. The signature now only validates if the client sends those exact headers. A browser or a plain HTTP client doing a simple PUT (something like curl -X PUT url --data-binary @bundle) does not send them, so S3 rejects the request as a signature mismatch. The presigned URL was valid; it was valid for a request nobody was making.

The fix is to opt the presign out of the SDK's checksum headers, so the signed request is a plain PUT that any client can satisfy:

// Presign a PUT that does NOT require the SDK's default checksum headers,
// so a simple client upload (no x-amz-checksum-* headers) still matches
// the signature.
presigner := s3.NewPresignClient(client, func(o *s3.PresignOptions) {
	o.ClientOptions = append(o.ClientOptions, func(so *s3.Options) {
		so.RequestChecksumCalculation = aws.RequestChecksumCalculationWhenRequired
	})
})
req, err := presigner.PresignPutObject(ctx, &s3.PutObjectInput{
	Bucket: aws.String(bucket),
	Key:    aws.String(key),
})

The lesson is the unglamorous kind: a managed SDK's helpful default became my bug the moment I handed part of the request to a client the SDK did not control. Presigned URLs are a contract with an arbitrary HTTP client, and each header the SDK signs is a header that client is now required to reproduce.
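One way to catch this class of bug early is to look at what a presigned URL actually commits the client to. The sketch below reads the SigV4 X-Amz-SignedHeaders query parameter and lists every signed header other than host. It assumes the extra headers appear in that list, as described above; the function names and the example URL are hypothetical:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// signedHeaders returns the header names listed in the SigV4
// X-Amz-SignedHeaders query parameter of a presigned URL.
// The value is a semicolon-separated list, percent-encoded in the URL.
func signedHeaders(presigned string) ([]string, error) {
	u, err := url.Parse(presigned)
	if err != nil {
		return nil, err
	}
	raw := u.Query().Get("X-Amz-SignedHeaders")
	if raw == "" {
		return []string{}, nil
	}
	return strings.Split(raw, ";"), nil
}

// extraSignedHeaders returns the signed headers a plain client would not
// send on its own, i.e. everything except host.
func extraSignedHeaders(presigned string) ([]string, error) {
	all, err := signedHeaders(presigned)
	if err != nil {
		return nil, err
	}
	extra := []string{}
	for _, h := range all {
		if strings.ToLower(h) != "host" {
			extra = append(extra, h)
		}
	}
	return extra, nil
}

func main() {
	// Hypothetical URL shaped like a presigned PUT; not a real signature.
	u := "https://example-bucket.s3.amazonaws.com/bundle.zip" +
		"?X-Amz-Algorithm=AWS4-HMAC-SHA256" +
		"&X-Amz-SignedHeaders=host%3Bx-amz-checksum-crc32%3Bx-amz-sdk-checksum-algorithm" +
		"&X-Amz-Signature=abc"
	extra, err := extraSignedHeaders(u)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("client must also send:", extra)
}
```

A check like this in a test, asserting the list is empty for URLs handed to arbitrary clients, would flag an SDK default change before a user's upload does.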

The two quieter edges

The body cap was the loud problem. Two others pushed the same way, and they are the other two limits: first-byte latency and connection lifetime.

The first was cold starts on the serving path. The first request to a Lambda that has scaled to zero pays for the wake-up, and "the first load is slow" is a bad look for a tool whose pitch is "open this link." For an artifact someone just shared, the first click is the one that matters most, and it was the slowest.

The second was streaming. As Part 1 described, the MCP server asks a question in the middle of a deploy: which group, then which project. That elicitation wants a long-lived, streamed connection. Holding a streamed response open cleanly through API Gateway and Lambda is a fight, and the result was an awkward non-streaming fallback. A long-running process streams by default; it is just what a process does.
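To make "streams by default" concrete, here is a minimal sketch of a long-running Go process writing a response in flushed chunks with the standard library's http.Flusher. It is not the MCP wire protocol or the platform's handler: the event framing, the /deploy path, and the function names are assumptions for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// streamQuestions returns a handler that writes each prompt as its own
// chunk and flushes it immediately, so the client sees it before the
// response is finished.
func streamQuestions(prompts []string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		flusher, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "text/event-stream")
		for _, p := range prompts {
			fmt.Fprintf(w, "data: %s\n\n", p)
			flusher.Flush() // push this chunk to the client now
		}
	}
}

// renderStream runs the handler against an in-memory recorder and reports
// the body written and whether Flush was called at least once.
func renderStream(prompts []string) (string, bool) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/deploy", nil)
	streamQuestions(prompts).ServeHTTP(rec, req)
	return rec.Body.String(), rec.Flushed
}

func main() {
	body, flushed := renderStream([]string{"which group?", "which project?"})
	fmt.Print(body)
	fmt.Println("flushed:", flushed)
}
```

In a real server the loop would block between prompts while waiting for the user's answer; the connection simply stays open, with no gateway in between to time it out.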

None of these is a knock on serverless in general. They are a knock on serverless for this workload. Large request bodies, latency-sensitive first loads, and long-lived streams are precisely the three things a scale-to-zero function model is worst at, and an artifact host needs all three.

Why moving from serverless to a single Go binary became the path

The rewrite to one Go binary sounds drastic and was not, because of one scoping decision: only the compute changed.

What changed, and what did not:

  • V1 serverless: API Gateway, Lambda fleet, Cognito user pool, DynamoDB, S3 bundles
  • V2 one box: Caddy, one Go binary, in-process OAuth, DynamoDB, S3 bundles

DynamoDB stayed. S3 stayed. The table shapes, the keys, the object layout, the URL patterns: all unchanged. What got replaced was the front edge. The gateway, the function fleet, and the managed user pool became one process, a reverse proxy, and a few hundred lines of in-process OAuth. The roughly sixty projects and the thousands of stored objects did not migrate anywhere; the new compute simply pointed at the same data.

That is what turned a scary rewrite into a contained one. The place these projects usually die is data migration, and there was none to do. I was swapping the part of the stack that was causing pain and leaving the part that worked alone.

The lesson that stuck: own your content types

One smaller habit carried directly from a v1 lesson into the v2 serving code. Files are stored with their content type set at upload time, inferred from the extension through a small lookup table, rather than left for the serving side to guess. Get this wrong and an object stored without a type defaults to application/octet-stream, and a browser handed application/octet-stream does the only thing it can: it downloads the file instead of rendering it. For a platform whose job is to render pages, a page that downloads is a total failure that looks like a small one. Setting the type once, at the point of upload, is what makes subsequent reads reliable.
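A lookup like that can be very small. The sketch below is an assumption about its shape, not the platform's actual table: the extensions listed, the charset choices, and the name contentTypeFor are all hypothetical.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// contentTypes maps a lowercase file extension to the type stored with
// the object at upload time. The entries are illustrative, not exhaustive.
var contentTypes = map[string]string{
	".html": "text/html; charset=utf-8",
	".css":  "text/css; charset=utf-8",
	".js":   "text/javascript; charset=utf-8",
	".json": "application/json",
	".svg":  "image/svg+xml",
	".png":  "image/png",
	".jpg":  "image/jpeg",
	".wasm": "application/wasm",
}

// contentTypeFor infers the content type from an object key's extension.
// Unknown extensions fall back to application/octet-stream, which a
// browser will download rather than render.
func contentTypeFor(key string) string {
	ext := strings.ToLower(path.Ext(key))
	if ct, ok := contentTypes[ext]; ok {
		return ct
	}
	return "application/octet-stream"
}

func main() {
	for _, key := range []string{"index.html", "assets/app.WASM", "LICENSE"} {
		fmt.Printf("%s -> %s\n", key, contentTypeFor(key))
	}
}
```

The fallback is deliberately the same failure described above, which makes a missing table entry easy to spot in a test.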

The empty list the AI client rejected

The other scar was pure porting tax, and it took down the MCP integration over a single optional field.

A tool definition exposed to the AI client has a field named required, which must be a JSON array of parameter names. Empty is fine. Missing is fine. null is not. In Node, an empty array serializes to [] and you never think about it. In Go, an uninitialized slice is nil, and nil marshals to JSON null:

var required []string   // nil  -> marshals to null   (rejected)
required = []string{}    // empty -> marshals to []     (accepted)

A tool with no required parameters built its required slice by appending in a loop that never ran, so the slice stayed nil, serialized to null, and the client's strict schema validator rejected the tool. One optional field came back as the wrong kind of empty and the connector would not register at all.

The fix is to initialize the slice so it is empty rather than nil, wherever a schema is built:

required := []string{}            // empty, never nil
for _, p := range params {
	if p.Required {
		required = append(required, p.Name)
	}
}

There is now a regression test that marshals a parameterless tool and asserts required is not null, with a comment naming the exact culprit: the client's Zod validator rejects null. It is the smallest possible bug with the largest possible blast radius, and it is the whole character of porting between runtimes. A default that goes unnoticed in one language becomes load-bearing in the next, and you find out because the thing simply refuses to connect.
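A sketch of what such a check might look like, under assumptions: the param and toolSchema types and the buildSchema helper are hypothetical stand-ins, and the real test presumably lives in a _test.go file rather than in main.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// param is a hypothetical tool parameter description.
type param struct {
	Name     string
	Required bool
}

// toolSchema is a hypothetical, trimmed-down input schema for a tool.
type toolSchema struct {
	Type     string   `json:"type"`
	Required []string `json:"required"`
}

// buildSchema collects required parameter names into a slice that is
// initialized empty, so it marshals to [] even when nothing is appended.
func buildSchema(params []param) toolSchema {
	required := []string{} // empty, never nil
	for _, p := range params {
		if p.Required {
			required = append(required, p.Name)
		}
	}
	return toolSchema{Type: "object", Required: required}
}

// marshalSchema returns the JSON a client would receive for the params.
func marshalSchema(params []param) (string, error) {
	b, err := json.Marshal(buildSchema(params))
	return string(b), err
}

func main() {
	out, err := marshalSchema(nil) // a tool with no parameters at all
	if err != nil {
		fmt.Println("marshal error:", err)
		return
	}
	if strings.Contains(out, `"required":null`) {
		fmt.Println("FAIL: required is null; a strict validator would reject it")
		return
	}
	fmt.Println(out) // {"type":"object","required":[]}
}
```

Changing the first line of buildSchema to var required []string is enough to make the check fail, which is exactly the regression it guards against.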

So when is serverless the wrong default?

Not often. Scale-to-zero and having no servers to patch are real wins, and for spiky, stateless, small-payload work they are hard to beat. The decision flips when your workload leans on the exact things functions are worst at. For an artifact host that was three at once: large uploads, a latency-sensitive first byte, and a long-lived stream. Once a meaningful share of your code exists only to dodge the runtime's limits, the runtime is no longer saving you work. It is the work, and a plain process on a plain box gives it all back.

FAQ

Why move from serverless to a single Go binary?

The workload needed large uploads, low-latency first loads, and long-lived streaming. Those were all awkward behind API Gateway and Lambda, while one long-running process made them ordinary again.

Did you actually save money leaving serverless?

That was not the driver. At this traffic the costs are close, and the small box rounds to a few dollars a month. What it saved was friction: the dual upload path, the streaming workaround, and the cold start were each a tax on building features, and one process removed all three. Cheaper was a side effect, not the goal.

Why keep DynamoDB and S3 instead of moving to Postgres and a disk?

Because they were not the problem, and moving them would have added the one genuinely risky kind of work (data migration) to a project that otherwise had none. The pain was entirely in the compute layer. Touching only the thing that hurt is what kept the rewrite contained.

Is the 10MB upload split still in the code?

Yes. The inline path for small bundles and the presigned-S3 path for large ones both survive into the one-binary version. The gateway originally forced the split, but accepting large HTTP bodies is its own practical constraint independent of Lambda, so both paths remain. The difference now is that one process owns both paths rather than separate functions sitting behind a gateway.