
Why Fundamentals Compound: From Textbook to Production Bug

I’ve been spending my time not on Rudis lately, but on a combination of other things.

This post ties three things together: the book, the course project, and a production bug that suddenly made TCP keepalives very concrete.

The Book

I picked up TCP/IP Illustrated again because I wanted to actually understand the protocols I use every day, not just know that they work. First time around, years ago, I skipped anything that felt too academic. This time I didn’t.

I kept noticing things that connected to situations I’d seen professionally or in side projects. Why SYN and FIN consume sequence space, for instance. They need to be reliably acknowledged, and the cleanest way TCP does that is to give them sequence numbers, same as data bytes. If SYN didn’t take up sequence space, you’d need a whole separate mechanism to confirm the connection was established.
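A toy bit of bookkeeping makes this concrete. Everything below is made up for illustration (real stacks randomize the ISN), but the +1s are exactly where TCP puts them:

```python
# Sequence numbers a client consumes over one connection's life. SYN and
# FIN each occupy one number, so the peer can ACK them like data bytes.
ISN = 1000                     # initial sequence number (illustrative)

syn_seq = ISN                  # the SYN itself sits at seq = ISN
ack_for_syn = syn_seq + 1      # peer ACKs ISN + 1: "your SYN arrived"

data = b"hello"                # 5 bytes of application data
data_seq = ISN + 1             # data starts right after the SYN
ack_for_data = data_seq + len(data)   # peer ACKs 1006

fin_seq = ack_for_data         # FIN occupies the next number, 1006
ack_for_fin = fin_seq + 1      # peer ACKs 1007: the FIN was received
```

If the SYN didn’t consume seq 1000, the first data byte and the SYN would be indistinguishable in the ACK stream.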

Or TIME_WAIT. It exists to catch delayed duplicate segments from an old connection that might confuse a new one on the same port pair, and to make sure the final ACK actually reaches the other side. If it gets lost, the peer retransmits its FIN, and you need to still be around to respond. I’d seen the state diagram many times but never really thought about why that state needs to linger.

The Production Bug

While I was reading the book and tinkering with the course project in the evenings, something came up professionally that matched what I’d been studying a little too well.

I got involved in investigating an issue on a PostgreSQL instance that handled a lot of writes. The clients connecting to it used JDBC, which means they were talking to PostgreSQL over the Extended Query Protocol.

Quick aside for anyone not familiar with it: when you fire up psql and type a query, you’re using the Simple Query Protocol. You send a query and get results back. Most application frameworks (JDBC drivers, connection pools) use the Extended Query Protocol instead, which splits execution into separate steps:

- Parse: the server compiles the query into a prepared statement
- Bind: parameter values are attached, producing a portal
- Execute: the portal runs and returns rows
- Sync: the client closes out the exchange and the server returns to ReadyForQuery

Better for performance, since you can reuse prepared statements, but the connection ends up in more intermediate states.

sequenceDiagram
    participant C as Client
    participant S as PostgreSQL

    rect rgb(240, 240, 240)
    Note over C,S: Simple Query Protocol
    C->>S: Query ("SELECT ...")
    S->>C: RowDescription + DataRow + CommandComplete
    S->>C: ReadyForQuery
    end

    rect rgb(230, 245, 255)
    Note over C,S: Extended Query Protocol
    C->>S: Parse ("SELECT ... $1")
    S->>C: ParseComplete
    C->>S: Bind (params)
    S->>C: BindComplete
    C->>S: Execute
    S->>C: DataRow + CommandComplete
    C->>S: Sync
    S->>C: ReadyForQuery
    end

A batch of client connections went silent, with no graceful TCP teardown. Could’ve been a network device, the client process dying, a load balancer somewhere. The clients reconnected on fresh JDBC connections, but the old server-side sessions stuck around — nothing arrived on the wire to tell PostgreSQL those clients were gone.

Some of those orphaned sessions had been mid-protocol-exchange, somewhere between Parse and Sync, and were still holding transaction state. In pg_stat_activity, they showed up as idle in transaction with a wait event of ClientRead, meaning PostgreSQL had finished processing and was waiting for the client to send the next protocol message. The client was long gone.
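Spotting these from the server side is one query away; something along these lines works (the wait_event columns have existed since PostgreSQL 9.6):

```sql
-- Sessions stuck idle in transaction, waiting on a client that may be gone.
SELECT pid,
       now() - state_change AS stuck_for,   -- time since the last state change
       wait_event_type,
       wait_event,
       query                                -- last statement this session ran
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND wait_event = 'ClientRead'
ORDER BY stuck_for DESC;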

Those zombie sessions were still holding row-level locks. New writes to the same rows just piled up, waiting on locks that nobody was ever going to release.

sequenceDiagram
    participant C as Client (JDBC)
    participant S as PostgreSQL

    C->>S: Parse
    S->>C: ParseComplete
    C->>S: Bind
    S->>C: BindComplete
    C->>S: Execute
    S->>C: CommandComplete
    Note over S: Holding row locks, waiting for Sync

    C-xS: ☠ Client dies (no FIN sent)

    Note over S: State: idle in transaction<br/>Wait event: ClientRead<br/>Locks: still held

    participant C2 as New Client
    C2->>S: INSERT INTO same_rows
    Note over C2,S: Blocked — waiting on locks<br/>held by dead session

We went after it from a few angles — walked the timeline, checked for infra changes, looked at connection pool behavior on the client side. I also fed the pg_stat_activity snapshots and relevant logs into Claude, which correlated those idle in transaction / ClientRead sessions with the write latency spike and pointed at the keepalive settings within minutes.

The root cause: PostgreSQL’s TCP keepalive settings were at defaults. Three parameters control this: tcp_keepalives_idle, tcp_keepalives_interval, tcp_keepalives_count. All default to 0, meaning “use whatever the OS says.” On the Linux boxes involved, that’s tcp_keepalive_time = 7200, two hours of silence before the kernel sends the first keepalive probe. With the default interval (75 seconds) and count (9 retries), you’re over two hours before the OS declares a connection dead.
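The worst-case detection time is just idle + interval × count; a quick sanity check:

```python
# Worst-case time from last activity until the kernel gives up on a
# silent peer: wait `idle` seconds, then send `count` probes spaced
# `interval` seconds apart, all unanswered.
def detection_time(idle, interval, count):
    return idle + interval * count

# Linux defaults: tcp_keepalive_time=7200, tcp_keepalive_intvl=75,
# tcp_keepalive_probes=9
print(detection_time(7200, 75, 9))   # 7875 seconds, about 2h11m
print(detection_time(60, 10, 6))     # 120 seconds with tuned values
```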

We lowered all three so that a dead connection gets detected in about two minutes instead of over two hours.

sequenceDiagram
    participant K as Kernel
    participant S as PostgreSQL

    rect rgb(255, 235, 235)
    Note over K,S: Default settings: ~2 hours 11 minutes
    Note over K: tcp_keepalive_time = 7200s
    K->>K: Wait 2 hours (idle)
    Note over K: tcp_keepalive_intvl = 75s
    K->>S: Probe 1
    K->>S: Probe 2
    K->>S: ...
    K->>S: Probe 9 (no response)
    K->>S: Connection dead
    end

    rect rgb(230, 255, 230)
    Note over K,S: Tuned settings: ~2 minutes
    Note over K: tcp_keepalives_idle = 60s
    K->>K: Wait 60 seconds (idle)
    Note over K: tcp_keepalives_interval = 10s
    K->>S: Probe 1
    K->>S: Probe 2
    K->>S: ...
    K->>S: Probe 6 (no response)
    K->>S: Connection dead
    end
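
In postgresql.conf terms, the change has this shape (the values match the diagram above; treat them as a starting point, not a recommendation — tune to your network’s tolerance for false positives):

```ini
# postgresql.conf
tcp_keepalives_idle = 60        # first probe after 60s of silence
tcp_keepalives_interval = 10    # then probe every 10s
tcp_keepalives_count = 6        # give up after 6 unanswered probes
```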

We also set idle_in_transaction_session_timeout (available since PostgreSQL 9.6) as a safety net, which kills any session that’s been idle in transaction past a threshold regardless of whether the TCP layer has caught up. And client_connection_check_interval (PostgreSQL 14+, Linux only) makes the server periodically poll the socket even while processing a query, so it can catch a dropped connection without relying on keepalive probes at all.
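Both can be changed without a restart; the values below are illustrative, not the ones we shipped:

```sql
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';  -- PostgreSQL 9.6+
ALTER SYSTEM SET client_connection_check_interval = '30s';      -- PostgreSQL 14+, Linux only
SELECT pg_reload_conf();  -- both settings are reloadable
```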

Where LLMs Helped (and Where Fundamentals Helped More)

I mentioned feeding the logs into Claude. The kind of triage it did — correlating wait events with lock contention, pulling the right PostgreSQL parameter names and Linux kernel defaults — is tedious by hand. Claude had it in minutes.

What I keep coming back to, though, is how much the context from Stevens and the course project helped me use Claude’s output. When it suggested keepalives, I didn’t need to go read about what keepalive probes do because I’d been implementing retransmission timeouts two nights before. When the Extended Query Protocol came up, I already knew why its multi-step nature made this worse: more points where a connection can stall mid-transaction while holding locks.

And I knew what follow-up questions to ask. “Does idle_in_transaction_session_timeout help even without fixing keepalives?” “What’s the difference between the OS detecting a dead connection and PostgreSQL cleaning up the session?” Those came from context I had on my own, not from Claude’s suggestions.

The Course Project

If any of the above made you curious about TCP internals, CS 168’s Project 3 is a good way to get your hands dirty. You implement TCP’s reliability layer (handshake through retransmission and RTO estimation) inside a Python network simulator. I’m partway through and it’s been a great exercise.
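For a taste of the RTO-estimation piece, here’s the textbook estimator from RFC 6298 in minimal form — a sketch, not project solution code (the RFC also clamps RTO to at least one second, omitted here so the effect of each sample stays visible):

```python
# RFC 6298-style RTO: smoothed RTT plus four times the RTT variance.
ALPHA, BETA = 1 / 8, 1 / 4   # gains from the RFC

def update(srtt, rttvar, sample):
    if srtt is None:                   # first RTT measurement
        return sample, sample / 2
    # rttvar is updated using the *old* srtt, per the RFC
    rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - sample)
    srtt = (1 - ALPHA) * srtt + ALPHA * sample
    return srtt, rttvar

srtt = rttvar = None
for sample in (0.10, 0.12, 0.50):      # RTT samples in seconds
    srtt, rttvar = update(srtt, rttvar, sample)

rto = srtt + 4 * rttvar
print(round(rto, 3))                   # one delayed sample inflates the RTO
```

One slow sample is enough to push the RTO well above the smoothed RTT, which is exactly the conservatism you want before retransmitting.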

Code: github.com/aleksandar-had/cs168-transport

Wrapping Up

None of this was planned as a single arc. I picked up Stevens because I wanted to understand things I’d been taking for granted. The course project was a recommendation that happened to line up. The production bug was timing.

But looking back, what sticks with me is how they fed into each other. The book gave me vocabulary and intuition for thinking about TCP at a level I hadn’t before. Actually writing the state machine in the course project was different from reading about it, and it’s the reason I could look at the pg_stat_activity output during the bug investigation and immediately think about connection state rather than just seeing stuck queries. The bug, in turn, made me care about keepalive timers and connection lifecycle at a specificity I wouldn’t have bothered with if I’d just been reading for fun.

On the LLM angle: Claude was fast at finding the pattern in the logs and I don’t want to understate that. But the reason I could actually do something with its output, weigh whether the keepalive theory held up and figure out what else to check, is that I had context from the reading and the project work. The tool got more useful because I had the background.

TCP/IP Illustrated is worth the slow read if you work with networked systems (so, most of us).

