The Gocracker Chronicles: A microVM in Go, from Weekend Hack to Production Sandbox

1 Firecracker is great. I wrote another one anyway.
I love Firecracker. I would put a little Firecracker sticker on my laptop if AWS shipped one. The idea — boot a real Linux VM in milliseconds, peel away every device nobody needs, lock down the syscalls, call it a day — is one of the most elegant systems-engineering ideas of the last decade. It turned “container but actually isolated” from a meme into a product.
So naturally, one weekend, I decided to replace it.
Not because it is bad. Because every time I wanted to run ubuntu:22.04 as a microVM, there were eight manual steps between me and a prompt: pull the image, extract the rootfs, build an ext4 disk, generate an initrd, write a forty-line JSON config, create a TAP device, fight iptables, then call the API. Firecracker is a Rust specialist. It boots VMs gorgeously and assumes the rest is your problem. That is the right design for AWS, where every step is another specialist service. For a person running things on a laptop, it is a catastrophe of tabs.
I wanted a generalist. gocracker run --image ubuntu:22.04, and I wanted it to have already done all of that by the time I finished pressing Enter. So I wrote one. In Go. Because the language where “batteries included” is the culture is the obvious place to build a batteries-included microVM.
If Firecracker is CGI — a bare, principled interface that any smart component can drive — then gocracker is FastCGI: the same interface with a long-running, comfortable process wrapped around it that assumes you are a developer on a laptop and would like to get on with your day.
FIRECRACKER                  gocracker
(Rust specialist)            (Go generalist)
just a VMM                   VMM + OCI + initrd + TAP + Compose
you bring:                   you bring:
  - rootfs                     - one command
  - kernel                     - one kernel
  - initrd
  - tap + NAT
  - JSON config
  - 40 lines of bash
beautiful.                   I am lazy and I like it.
2 What KVM actually is
KVM — the Kernel-based Virtual Machine — is a character device, /dev/kvm, that exposes hardware virtualisation as ioctls. That’s the whole thing. Open a file, call ioctls on it, and a CPU starts running code on your behalf inside a hardware sandbox.
I like to describe KVM ioctls as the assembly language of VMs. Low-level, orthogonal, each one does exactly one thing, and you have to compose them into something useful yourself. There’s no scheduler, there’s no device model. There’s a small alphabet of primitives — create the VM, create the vCPU, map memory, run, get registers, set registers — and you are the one who turns those into a computer.
The heart of every VMM ever written is a loop. You call run on the vCPU. The host thread blocks. The guest runs. Eventually the guest does something that needs the host’s attention — touches an MMIO register, hits an I/O port, gets an interrupt, shuts down — and KVM returns control. That return is called a VMexit. Everything else — devices, boot loaders, snapshot engines — is sugar around this loop.
HOST                                 GUEST
+------------------------+           +--------------------+
| ioctl(KVM_RUN)         | --------> | guest instructions |
|                        |           |                    |
|                        | <-------- | VMexit             |
| dispatch(exit_reason)  |           | (MMIO / IO / IRQ)  |
| handle_mmio()          |           |                    |
| handle_io()            |           |                    |
| inject_irq()           | --------> | resume             |
+------------------------+           +--------------------+
The whole binding into KVM lives in one file, and the central number is exactly four hex digits:
// internal/kvm/kvm.go
kvmRun = 0xAE80
When userspace calls ioctl(vcpu_fd, 0xAE80, 0), the kernel transfers control to the guest CPU. The host thread blocks. The guest runs. That’s the heart of every VMM ever written.
3 One goroutine per vCPU is the whole model
Here is the joy of writing a VMM in Go, and the thing that made the weekend feel like a weekend. A virtual CPU is a file descriptor that blocks until the guest exits. If you want two vCPUs, you want two host threads. If you want sixteen, you want sixteen threads.
Go has a word for “a thing that looks like a thread that blocks on a syscall.” The annoying footnote is that Go’s scheduler normally moves goroutines between OS threads whenever it feels like it, and KVM really does not like that. The fix is three words:
runtime.LockOSThread()
That’s it. That’s the multiprocessing story. Each vCPU gets a goroutine. Each goroutine locks its thread and runs the exit loop. The runtime handles the rest. No thread pool to write. Not even a sync primitive; the channels in your toolbox already work.
per-vCPU goroutine (LockOSThread)
+--------------------------------------------------+
|                                                  |
|  for {                                           |
|      err := ioctl(vcpu_fd, KVM_RUN, 0)           |
|      if err == EINTR || err == EAGAIN {          |
|          continue  // transient; resume          |
|      }                                           |
|      switch run.exit_reason {                    |
|      case IO:       handleIO(run.io)             |
|      case MMIO:     handleMMIO(run.mmio)         |
|      case SHUTDOWN: return                       |
|      case INTR:     continue                     |
|      case HLT:      waitForIRQ(); continue       |
|      }                                           |
|  }                                               |
+--------------------------------------------------+
When a real guest boots, this loop runs millions of times a second. Most exits are fast (a virtio queue notify, a UART write, a timer tick), and the loop goes back into KVM_RUN before anyone notices. The art is making every case of the switch cheap.
By the end of the weekend, I had a green prompt, a running VM, and a cold boot under 400 ms. I would put Rust on my laptop sticker too. But Go gave me goroutines that made “one thread per vCPU” a three-word feature, an HTTP API server in maybe two hundred lines, JSON config that Just Works, and the most mature OCI library ecosystem of any language. Cross-compiling to ARM64 takes two environment variables. The fights I had with Go were real but bounded; the convenience compounded for months.
4 The next three weeks
The VM booted. One command. Life was beautiful. And then the same cheap-goroutines property that made the first weekend enjoyable started charging interest.
Each bug began as a different symptom and dead-ended in the same place: two goroutines disagreeing about who owned a piece of kernel state. The jailer kept leaving its shoes at the door — bind mounts surviving a crashed sandbox, poisoning the next VM. A naked close on a shared vsock file descriptor turned out to be a refcount operation, not a protocol message; the host blocked forever until the user pressed a key out of sheer desperation, which unblocked a background goroutine, which dropped the last reference, which finally sent the shutdown packet. A panic during cleanup left the terminal in raw mode. Each Go runtime release wanted one more syscall the seccomp filter didn’t know about, and “Bad system call” became the soundtrack of release week.
Fifteen goroutines in a trench coat is still fifteen goroutines. The lesson was less about any one bug and more about the shape of all of them: in a microVM, the host kernel keeps state about your VM, and if you don’t clean it up on the happy path, nobody is going to clean it up on the sad path. After the fifth race in three days, the structural fix was not “be more careful” (I was already being careful) but flipping -race on in CI and writing tests that deliberately raced producer and consumer so that race conditions showed up reproducibly. The race detector is the single most important setting you can turn on in a Go project. It typically makes a program 2 to 20 times slower and 5 to 10 times hungrier for memory, and it is worth every cycle.
Once CI stopped flaking, I finally had a VM I could trust enough to measure.
5 I was staring at the wrong box
For about two weeks, I thought gocracker had a 2× problem against Firecracker. The duration field I was printing went from ~30 ms without the jailer to ~55 ms with it. Two times worse. Fork-exec was the villain. The jailer was the villain. Seven REST PUTs were the villain.
Look at the wall clock though. About 860 ms vs about 880 ms. About twenty milliseconds of difference, on an 860 ms baseline. That isn’t 2×; it’s about 2%, and 2% is noise. The “2×” was entirely in a duration field that measured different things in the two code paths. The in-process path measured raw vmm.New. The worker path measured fork-exec, jailer setup, chroot, seven REST PUTs, plus vmm.New. Both stopped the clock before the guest kernel had printed a single byte. Neither number measured time to useful guest. Neither was comparable to the other.
Splitting the measurement into four honest phases — orchestration, VMM setup, start, guest first output — made the 2× evaporate:
// pkg/vmm/timings.go
//
// BootTimings is the per-phase breakdown of how long it took to
// bring a microVM to life.
//
// - Orchestration: host-side work *before* the guest kernel starts
// - VMMSetup: time inside vmm.New() — KVM_CREATE_VM, memory...
// - Start: KVM_RUN starts on the vCPU goroutines
// - GuestFirstOutput: first byte the guest prints on the UART
The jailer cost roughly 30 ms in orchestration, on top of a ~300 ms guest kernel boot that both paths shared. An honest ~10% orchestration tax, not a 2× penalty.
before the breakdown             after the breakdown
(one misleading number)          (four honest numbers)
+-----------------------+        +----------------------+
| duration = ~55ms      |        | orchestration  ~30ms |
| (in runViaWorker this |        | vmm_setup       ~8ms |
|  includes jailer,     |        | start           ~2ms |
|  fork, a few REST     |        | guest_first   ~320ms |
|  PUTs, then vmm.New,  |        | total         ~360ms |
|  then start)          |        +----------------------+
+-----------------------+                   |
                                            v
      "2x slower"                    "I was staring
                                      at the wrong box."
And then something else fell out: about three hundred of those milliseconds were Linux booting inside the VM. If I wanted a faster VM, my code wasn’t the problem. My kernel was.
Performance debugging is expense-account auditing. You stare at the line items. You keep staring until you find the one that says “business lunch $480” and the restaurant turns out to be a Costco. The gotcha is never where you expect it.
6 The kernel was the problem
I forked the guest kernel into two profiles: a generic one I ship by default, and a “minimal” one that rips out anything a VM with virtio and nothing else will never need. ACPI NUMA went. Hibernation went. The entire USB subsystem went. Power management, profiling, SCSI, loop devices, XFS, NFS — all gone. Virtio stayed. ext4 stayed. kvm-clock stayed. The kernel shrank by about 12% and the boot dropped a chunk just from running fewer initcalls.
Then came the small change that mattered more than any of the others. I added one parameter to the kernel command line: loglevel=4. It tells the kernel “only print messages more severe than warnings (errors and above) to the console; everything else still goes to the ring buffer so you can see it via dmesg.” The bulk of the boot output stopped going to the emulated UART.
It turns out that a virtualised UART is expensive per byte. Every byte the kernel writes to the serial console traps out of the guest into userspace (a port-I/O or MMIO exit, depending on the platform), which is a context switch, which is a few microseconds of wasted time. Multiply by a few thousand boot-time bytes, and boot was dominated by printing. Silencing the console knocked roughly 130 ms off the boot.
One line.
The smaller wins followed the same theme: stop fighting the kernel, and ask it to do its job. Cache a discard probe whose answer only depends on the host filesystem, not the guest. Route x86 interrupts through eventfd + IRQFD instead of an ioctl per assertion, the same way the ARM64 backend already did. Turn the Go garbage collector off in the short-lived VMM subprocess:
// cmd/gocracker-vmm/main.go
import "runtime/debug"
func init() {
	debug.SetGCPercent(-1) // short-lived process; let the OS reap memory
}
The process doesn’t need a GC; it will happily run to completion, and the OS reaps its memory. A few milliseconds here, a few milliseconds there. Not clever, individually. Cumulative.
Stacking the wins as a bar chart at the rough scale of real measurements:
standard kernel:
  [==orch==][vmm][======guest_first_output: ~305ms======]  ~390ms
    ~70ms    ~15                  ~305
                         this is Linux booting.
minimal kernel:
  [==orch==][vmm][=====guest_first_output: ~280ms=====]    ~365ms (-25)
                         fewer init calls.
minimal + loglevel=4:
  [==orch==][vmm][guest_first_output: ~170ms]              ~250ms (-115)
                  80% of the cost
                  was *printing*.
After all of that: cold boot in the 150–170 ms range. Roughly 45 ms behind Firecracker, down from much more. A Go-vs-Rust gap measurable in milliseconds, on a boot dominated by a foreign Linux kernel I don’t control. That is the right place to end up. If somebody tells you their microVM is a few dozen milliseconds behind Firecracker, you nod politely; if they tell you it is 2×, you have questions.
7 Once the cold boot was small, the warm path got embarrassing
Snapshot restore used to be the fast path. An 80 ms restore on top of a 400 ms cold boot is a rounding error. An 80 ms restore on top of a 170 ms cold boot is half your budget.
The old restore did the obvious thing: allocate a fresh 128 MiB anonymous mmap for guest RAM, read the entire snapshot file into a Go byte slice, memcpy the whole thing into position. Steps one through three took about 80 ms, exactly as you would expect if you have ever memcpied 128 MiB on every request.
Then the question: what if I just didn’t copy it?
Linux has a flag called MAP_PRIVATE. When you mmap a file with it, the kernel does no actual I/O up front. It sets up a page-table entry that says “if userspace touches this page, fault into the kernel, read it from the file, map it in. If userspace writes to the page, fault, copy-on-write to a private anonymous page, and redirect the mapping to the copy.” The file itself is never modified.
The Netflix analogy is the one I keep coming back to. Netflix does not first download the whole movie to your device and then start playing. It starts playing immediately and fetches each minute as you watch it. If you fast-forward past parts, those parts never get downloaded. You pay per minute watched, not per movie selected. MAP_PRIVATE is that pattern for guest RAM.
The new path mmaps the snapshot directly into the guest memory region:
mem, err := unix.Mmap(int(f.Fd()), 0, int(memSize),
	unix.PROT_READ|unix.PROT_WRITE, unix.MAP_PRIVATE)
if err != nil {
	return err // no mapping, no guest RAM: the caller falls back to cold boot
}
_ = unix.Madvise(mem, unix.MADV_HUGEPAGE) // advisory only; safe to ignore failure
Pages the guest never touches are never loaded. Pages it reads but doesn’t write stay shared with the page cache. Pages it writes go to private COW copies, and the snapshot file stays clean.
BEFORE: eager copy
+----------------+   +----------------+    +-----------------+
| mem.bin file   |-->| os.ReadFile    |--->| copy(ram, mem)  |
| 128 MiB        |   | read 128 MiB   |    | 128 MiB memcpy  |
+----------------+   +----------------+    +-----------------+
                                                    |
                                                    v
                                         ~80ms before this point
AFTER: lazy mmap (MAP_PRIVATE)
+----------------+   +----------------------------+
| mem.bin file   |<--| mmap(fd, PRIVATE)          |
| 128 MiB        |   | sets up page table only    |
+----------------+   +----------------------------+
                                   |
                                   v
                         guest touches page N
                                   |
                                   v
                      minor fault (-> page cache)
                    kernel maps the page on the fly
                                   |
                                   v
                          ~20ms to "running"
The full page-fault dance under the hood looks like this:
guest vCPU                  host kernel (KVM + mm)          snapshot
+---------+                                                 +-------+
| read P  |---(EPT miss)--->| PTE not-present, PRIVATE      |  on   |
|         |                 |  -> minor fault               | disk  |
|         |                 |  -> page cache lookup         |       |
|         |                 |     (or read from disk)  <----+       |
|         |                 |  -> install PTE readable      |       |
|         |<----(resume)----|                               |       |
+---------+                                                 +-------+
later, guest writes page P:
+---------+                                                 +-------+
| write P |---(EPT miss)--->| PTE readable only             |       |
|         |                 |  -> COW fault                 |       |
|         |                 |  -> alloc anon page           |       |
|         |                 |  -> copy from page cache      |       |
|         |                 |  -> install PTE writable      |       |
|         |                 |     (snapshot unchanged!)     |       |
|         |<----(resume)----|                               |       |
+---------+                                                 +-------+
Every step there is what Linux already does for any file-backed mmap. Not a single line of page-fault handling had to be implemented. Just stop fighting the kernel and ask it to do its job.
On a 128 MiB Alpine snapshot, restore dropped from ~80 ms to about 20 ms. Snapshot-resume was suddenly several times faster than cold boot. (One important caveat: do not delete the snapshot file while VMs are running off it. Ask me how I know.)
8 Mise en place for VMs
Walk into a decent restaurant at lunchtime. Order the steak frites. It arrives in six minutes. The steak alone is a six-minute cook, liberal estimate. The frites take twelve. Béarnaise needs fifteen. How did the kitchen do it in six minutes?
Mise en place. The potatoes are par-cooked and drained before you arrive. The béarnaise is emulsified and held. The plate came out of the warmer the moment the order hit the pass. The only thing the kitchen does after your order is the final sear.
A warm pool is mise en place for VMs. The fastest competing sandbox provider on the public benchmark sat around 100 ms — exactly what you would expect from skipping the restore entirely by having a VM already running, paused, waiting for someone to say go. If the leader wins by pre-cooking, stop optimising the stove.
The warm pool became three design decisions, each one the result of an argument with a hypothetical bad day.
First, Acquire is non-blocking. The temptation with pool APIs is to make Acquire block until a worker is available. That always-give-the-user-a-worker guarantee feels safe. It is not. If the pool is empty, something already went wrong, and making the user wait for a fresh restore is strictly worse than falling through to the cold-boot path that already works. The pool is best-effort. A miss must never make the user slower than the baseline.
Second, releasing a worker kills it. Every pooling library eventually wants to recycle a worker back into the pool. In a multi-tenant world, the worker that just handled a request has touched whatever the last tenant asked for. Handing it to the next tenant is a tenant-isolation hole, and the fact that nobody has exploited it yet is not a security argument. Every Acquire returns a process that has never served a request. Refill happens in the background, so the next caller pays nothing. The pool is always-moving. Never reused.
Third, refill is asynchronous, capped, and race-safe. A burst of refill requests for the same template should not stampede into ten parallel spawns; a refill spawn that races with shutdown should clean up after itself; and a clock must be injectable so staleness tests are deterministic. None of these are clever. They are just the invariants you regret missing the first time the pool runs in production.
The whole flow with both the cache and the pool wired in:
request arrives
|
v
warmcache.Lookup(key)
|
+-- miss --> cold boot (~250ms) <-- baseline
|
hit, snapshotDir=S
|
v
pool.Acquire(key, S)
|
+-- empty --> restore_direct (~20ms) <-- still better
|
got a warm worker
|
v
worker.Resume (~3ms) <-- fastest path
|
v
serve request
|
v
pool.Release(w) --> worker.Close()
|
v
EnsureRefill in background
|
v
spawn replacement (~20ms off the hot path)
The pool’s API surface is small on purpose:
// pkg/warmpool/pool.go
type Worker interface {
	ID() string
	Close() error
}
func (p *Pool) Acquire(key, snapshotDir string) (Worker, bool, error)
func (p *Pool) Release(w Worker)
func (p *Pool) EnsureRefill(key, snapshotDir string)
On the hot path: the warm worker already has guest RAM mapped, the vCPU state loaded, the VM paused. Acquire returns. A single resume ioctl flips it from paused to running. Three milliseconds later, the guest has already said hello. Mise en place.
9 Nine sandboxes burning a CPU for nothing
Nine sandboxes were running, paused, idle. No traffic. No exec sessions. No HTTP. Sitting on a shell prompt inside a warm pool, waiting for someone to ask them to work. top showed the host at 46% of one core.
Forty-six percent for keeping nine idle Linux guests alive. About five percent of a core per idle VM. A physical idle Linux box uses around 0.1% of a core on modern hardware. A properly virtualised idle guest should be cheaper, not fifty times more expensive.
Something was very wrong.
The clean way to see what a vCPU thread is doing is to sample it. A few seconds of perf gave back a stack trace that was unambiguous: enter KVM, exit almost immediately, sleep for a millisecond, go back in. Over and over, a thousand times per second, on every vCPU thread, in parallel. Nine threads doing this at once was exactly the 370% of a core the host was reporting.
The cause was a hedge. Deep in the vCPU loop, the HLT exit was being “handled” with a one-millisecond sleep:
case KVM_EXIT_HLT:
	// Guest is idle. Don't spin; give it a breather.
	time.Sleep(time.Millisecond)
The sleep was reasonable in a world where the VMM owns the interrupt controller in userspace. gocracker does not. gocracker uses in-kernel IRQCHIP (the right default for almost every workload), where KVM is supposed to hold the thread inside the ioctl until the next interrupt fires, with no exit at all. The sleep was dead code that had survived a design change that nobody questioned.
The fix was a deletion:
case KVM_EXIT_HLT:
// No-op. In-kernel IRQCHIP already blocks the vCPU until the
// next interrupt. There is no productive work for userspace here.
On the next loop iteration, the code calls into KVM again, and KVM — because it owns the IRQCHIP and knows there is no interrupt due — blocks the thread inside the kernel for as long as the guest stays idle.
Same nine-idle-sandbox test. top: 7%. Not seven percent per VM. Seven percent for the whole fleet. From ~370% to ~7% by deleting a line. Fifty times less.
The general pattern is worth naming. The code you added “just to be safe” is often the code most worth deleting, because nobody questions it. The parts of a system you fight with get reviewed to death. The parts that nobody complains about get to rot in peace. When the rot finally costs you, it costs you fifty times more than anything you ever thought about.
10 A fast microVM is a toolbox, not a product
Acquire to first guest instruction was down to three milliseconds, and I was proud of that for about a week. Then I tried to build something with it.
What I wanted was the thing everyone wants these days: a REST API where a customer says “give me a Python 3.12 sandbox with numpy and pandas, let me run code in it,” a sandbox appears, three seconds later they get a stdout back, and they move on with their lives. A raw gocracker run cannot do any of that. It boots a VM. That is all. If a microVM is an engine block, what I needed was the rest of the car.
The first decision was the most important one: keep gocracker exactly what it was, and build the managed layer as a separate thing. gocracker stays the low-level VMM, snapshot cache, and warm-pool-of-workers. It speaks bytes and ioctls. It has no opinions about customers or templates. sandboxd is a new daemon that sits above it and owns templates, leases, pools, and preview tokens. The SDK only ever talks to sandboxd. sandboxd only ever talks to gocracker over a unix socket. The extra round-trip is a feature, not a bug.
I learned the value of that split by first not making it cleanly, and spending three hours debugging a race condition that only existed because two layers were sharing a pointer they had no business sharing. Crossing a process boundary makes you negotiate. Sharing a pointer lets you cheat. The boundary is the seatbelt.
The split is a clean three-tier flow:
SDK (Python / Go / TS)
|
| HTTP over unix socket
v
sandboxd <-- managed-runtime daemon
| (templates, leases, pools, preview tokens)
| HTTP over unix socket
v
gocracker serve <-- low-level VMM orchestrator
| (KVM ioctls, snapshots, warm pool of workers)
| KVM ioctls + vsock
v
guest VM
|
+-- toolbox agent (listens on a vsock port inside the guest)
Three hops. Two daemons. The SDK never talks to gocracker directly; it doesn’t know gocracker exists. sandboxd is the only public API surface; everything downstream is an implementation detail. Crossing a process boundary on a unix socket is cheap (sub-millisecond for small JSON payloads), and the separability pays for itself the first time you want to restart sandboxd without killing a hundred live VMs.
Templates are the other load-bearing idea. A customer doesn’t want “a Linux VM.” They want the environment they use for their AI agent — a specific base image, some apt packages, some pip packages, a working directory, some env vars. A template captures that mix, plus the snapshot that results from booting the spec once and letting it reach a steady state. Two templates with identical specs share a snapshot. A second create with the same spec is a no-op.
type Template struct {
	ID          string
	Name        string
	SnapshotDir string
	SpecHash    string // canonical fingerprint of image, kernel, mem, env...
	ContextHash string // build-context tarball when using a Dockerfile
	WarmPolicy  WarmPolicy
}
That sounds obvious until you imagine the lifecycle of a real SaaS: most template creates are idempotent retries. A deploy re-runs. A CI job resubmits. An SDK lazily ensures-exists before a create. If every one of those cost a fresh docker build, you would be shipping a $40/month product on $400/month of infra. Content-addressable identity at every layer compounds: the warm cache was content-addressable, templates are content-addressable on top of it, and sandboxes are cheap because templates are cheap.
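One way to get a canonical fingerprint is to serialise the spec deterministically and hash the bytes. The sketch below is an assumption about how a SpecHash could be computed, not how sandboxd does it. The Spec type and its fields are hypothetical, and treating package order as irrelevant is a design choice that a real system might not want to make.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"sort"
)

// Spec is a hypothetical template spec; the fields are illustrative.
type Spec struct {
	Image    string            `json:"image"`
	Kernel   string            `json:"kernel"`
	MemMiB   int               `json:"mem_mib"`
	Env      map[string]string `json:"env"`
	Packages []string          `json:"packages"`
}

// Hash returns a hex SHA-256 over a canonical encoding of the spec, so that
// two specs meaning the same thing produce the same identity.
func (s Spec) Hash() (string, error) {
	c := s
	// Normalise: sort a copy of the packages, and make empty and nil equal.
	c.Packages = append([]string(nil), s.Packages...)
	sort.Strings(c.Packages)
	if len(c.Env) == 0 {
		c.Env = nil
	}
	// encoding/json emits struct fields in declaration order and map keys
	// in sorted order, which makes the output deterministic.
	b, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	a := Spec{Image: "python:3.12", Kernel: "minimal", MemMiB: 256,
		Env: map[string]string{"TZ": "UTC", "LANG": "C"}, Packages: []string{"numpy", "pandas"}}
	b := Spec{Image: "python:3.12", Kernel: "minimal", MemMiB: 256,
		Env: map[string]string{"LANG": "C", "TZ": "UTC"}, Packages: []string{"pandas", "numpy"}}
	c := a
	c.MemMiB = 512

	ha, _ := a.Hash()
	hb, _ := b.Hash()
	hc, _ := c.Hash()
	fmt.Println(ha == hb, ha == hc) // prints: true false
}
```

With an identity like this, a second create with the same spec can be answered by a map lookup instead of a build.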
11 Five warm ready. None of them were.
I was running a load test against a freshly-rebuilt sandboxd. Nothing fancy: create a sandbox, exec echo hi, delete the sandbox, in a tight loop. The pool was configured for three hot-ready and two paused-ready for a single template. Every create should be essentially instantaneous, because the pool should keep five warmed sandboxes alive and I only ever needed one.
It worked for about ninety seconds.
Then every create started failing. Not slowly. Not with backpressure. Every single one, with variations on “runtime returned 404: unknown vm.” The pool status endpoint reported three hot-ready, two paused-ready, zero leased. A perfectly healthy pool, according to itself. The VMs had been dead for minutes.
That was a fun Tuesday.
The first version of the reconciler trusted its own in-memory record. It counted entries marked warm_ready, compared the count to MinHot, and concluded: healthy, no action needed. Nothing in the reconciler was looking. One warm-ready VM died silently — vCPU panic, OOM-kill, guest wedge, fstab typo dropping systemd into rescue mode, whichever — and sandboxd kept counting it as alive. Subsequent leases failed at attach time with 404s, the lease handler marked the entry “broken” and fell through to cold boot, but the broken entries lingered in memory as warm_leased until a separate cleanup goroutine reaped them. Meanwhile the pool kept claiming five warm, the reconciler kept making no decisions, and every single request cold-booted.
The cascade wasn’t spectacular. No fire alarms. No pager. The system was silently degrading itself into worst-case mode, one 404 at a time, while dutifully reporting green.
sandboxd's view                      actual runtime state
+------------------------+           +----------------------+
| warm_ready (hot):    3 |           | VM #1: dead          |
| warm_ready (paused): 2 |           | VM #2: dead          |
| leased:              0 |           | VM #3: alive but oom |
| total:               5 |           | VM #4: missing       |
+------------------------+           | VM #5: missing       |
            ^                        +----------------------+
            |                                   |
            | "healthy, no action"              | lease attempt -> 404
            |                                   |
            |                                   v
reconciler tick                      lease handler marks broken,
counts in-memory state               falls through to cold boot
compares with MinHot                            |
does nothing                                    v
                                     user sees 2-second cold boot
                                     every single request
Silent degradation is the worst mode. Loud failure lets you page on it. Silent failure means the graph looks green while customers leave.
The fix was structural and small. The reconciler now does three things in order, and the order is load-bearing:
func (m *Manager) reconcileTemplate(tpl *Template) {
	m.reapDead(tpl)               // probe runtime, drop ghosts
	inv := m.inventoryFor(tpl.ID) // count from honest state
	m.pruneExcess(tpl, inv)
	m.replenishUpToMin(tpl, inv)
}
First, probe every warm sandbox the manager thinks it owns and reap anything the runtime no longer knows about. “Inconclusive” counts as dead, because a pool of maybe-alive VMs is worse than a pool with a hole in it. Second, count from the now-honest inventory. Third, trim excess and replenish up to the minimum. Before the fix: five phantom sandboxes, every request a cold boot, the pool cheerfully reporting healthy. After: instant creates again.
I have hit this exact bug before. I suspect you have too. Every time it wears a slightly different outfit — a Kubernetes controller that trusts cached pod status instead of the kubelet, a connection pool that marks a backend healthy because the last response was 200 while the socket has been FIN-ed for thirty seconds, a service registry whose heartbeat thread is unrelated to the work thread so the service can be deadlocked and still pinging, a browser that caches a DNS record past reality. The underlying mistake is the same every time: trusting an in-memory representation of the world, across a process boundary, without probing. In-memory state and out-of-process reality always drift. The question is not whether you will notice; it is when, and what user-visible damage accumulates in between.
Two more guardrails went in around the same time. A per-template backoff so a single broken template — say, one whose snapshot is subtly corrupt — cannot single-handedly keep the reconciler pegged spawning failing VMs every tick, starving healthier templates. And a global inflight budget on spawn work across the host, because ten templates each wanting to replenish three VMs at once is thirty parallel spawns, which is plenty to make every spawn slower than it needs to be, which makes the timeouts tighter, which cascades. Per-template caps are not enough. The number of things that can go wrong simultaneously across N templates grows faster than the per-template cap restrains it.
12 What the foundation taught me
Looking back at the whole arc, a few things stand out enough to be worth carrying forward.
Host kernel state outlives your process. Clean it up on startup as well as shutdown. close(fd) is a refcount operation, not a protocol message — if you need the peer to know you’re gone, you have to actually say it. Every exit path needs a terminal restore, because defer is a suggestion that signals and seccomp trips ignore. The race detector in CI is non-negotiable for any Go project that holds state across goroutines.
Your biggest cost is almost certainly not the thing you wrote. Linux booting inside the VM was three quarters of a four hundred millisecond cold start. Nothing I wrote mattered until I went and shrunk that. A virtualised UART is expensive per byte; silencing the kernel log on the console path was the single biggest performance win in the project. MAP_PRIVATE is free money for snapshot restore. The Go garbage collector is a tax you can choose not to pay in short-lived subprocesses.
Trust the kernel more than your instincts. In-kernel IRQCHIP already solves idle vCPU parking. The defensive sleep on top was negative work. Defensive code is a lie detector for assumptions that have since changed: revisit the hedges when the underlying system shifts. And sometimes the biggest win is a deletion.
Once a warm pool gets above a microVM, the rules change. The pool is best-effort — a miss must never make the user slower than the baseline. Kill workers on release; never give a tenant a process that has touched another tenant’s data. Reconciler loops must observe before they act, because the only thing more dangerous than a wrong cache is a cache the system has stopped questioning. And inconclusive is always dead in a pool — the cost of treating a maybe-alive sandbox as dead is one cold boot; the cost of treating a dead one as alive is the lease failure your customer sees.
None of these wins are individually clever. Every one is a thing somebody else figured out years ago: mmap, copy-on-write, per-tenant isolation, eventfd plus IRQFD, mise en place as a concept, trusting the in-kernel IRQCHIP. Nothing invented here. What happened was that I stopped fighting each of them, one at a time. That is how ~3 ms user-visible sandbox starts happen. You earn them layer by layer. There is no single hero change. There is a stack of small, honest ones, each of which makes the next one cheaper to write.
The interesting parts of the machine are done. What is left is keeping them honest.