Introduction
INSERT INTO — two words, and your data is saved. But what actually happens between pressing Enter and getting back INSERT 0 1? Where does the row go? How does it survive a crash?
I got curious about this and decided to trace the journey of an insert in PostgreSQL — from the command to the bytes on disk. This is the first part in a series where we’ll follow that journey all the way to replication across multiple nodes. In this post, we’ll stay on a single node and look at how Postgres handles connections, buffers data in memory, writes to the WAL, and eventually flushes your data to disk.
To keep things hands-on, we’ll run everything inside a Docker container and inspect what Postgres is doing at each step.
Outline
- Prerequisites
- Connecting to Postgres and the postgres process
- Creating a table and making an insert
- Tracing that insert: page-based storage, shared buffers, WAL write at commit, and the background writer
Prerequisites
- A running Postgres 17.6 service. You can use this Dockerfile and build the container to follow along.
- Some familiarity with PostgreSQL will be helpful.
- Familiarity with terms like processes and threads will also help.
Connections in Postgres
Before we get into how Postgres "stores" our data, let's quickly look at what it means to "connect" to Postgres from the server's point of view. You have probably connected to Postgres as a client; let's see what that connection means for the Postgres server.
Begin by starting the docker container, and running docker exec -it <container> bash. Then run the command: ps aux. This will list all the processes that are running in our container.
postgres@ed47e5efc99e:/$ ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
postgres 1 0.0 0.1 217536 24448 ? Ss Jan19 0:02 postgres
postgres 14 0.0 0.0 217668 8408 ? Ss Jan19 0:00 postgres: checkpointer
postgres 15 0.0 0.0 217692 6488 ? Ss Jan19 0:00 postgres: background writer
postgres 17 0.0 0.0 217536 9560 ? Ss Jan19 0:00 postgres: walwriter
postgres 18 0.0 0.0 219112 8408 ? Ss Jan19 0:00 postgres: autovacuum launcher
postgres 19 0.0 0.0 219120 7640 ? Ss Jan19 0:00 postgres: logical replication launcher
postgres 3858 0.4 0.0 7352 3712 pts/1 Ss 15:26 0:00 bash
postgres 3864 0.0 0.0 9392 3840 pts/1 R+ 15:26 0:00 ps aux

We can see several processes listed here. These processes provide a nice abstraction that we'll build on, and it's worth knowing what each one does. For now, the important one is the postgres process with PID 1: the postmaster. Whenever a client requests a connection, the postmaster spawns a child process to handle it. That child process executes the commands the client sends and returns the responses. We'll look at the other processes, like the walwriter and the background writer, later in this post.
Open a new terminal. For convenience, I'll call this new terminal the "client terminal" and the previous one the "server terminal". From the client terminal, connect to Postgres by executing docker exec -it <container> psql -U postgres. This opens a connection to the Postgres server. In the server terminal, run ps aux again. You should see a new entry like postgres: postgres postgres [local] idle. This is the child process that the postmaster spawned to handle the connection. Let's interpret the process name:
- postgres: tells us it's a PostgreSQL backend process
- postgres is the connected username (user "postgres")
- postgres is the database name (database "postgres")
- [local] is the connection type (a Unix domain socket, not TCP/IP)
- idle is the current state (waiting for a query). If we run select pg_sleep(20); from the client terminal, we'll see the state change from idle to SELECT.
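The process-per-connection design can be sketched as a toy model. This is plain Python, not Postgres code; the Postmaster and Backend classes and their fields are invented for illustration, standing in for what fork() does in the real server.

```python
# Toy model of Postgres's process-per-connection design (illustrative only):
# the "postmaster" hands each new connection to a dedicated "backend", which
# runs the client's queries and reports a state like "idle" or "SELECT".
from dataclasses import dataclass, field


@dataclass
class Backend:
    pid: int
    user: str
    database: str
    state: str = "idle"

    def execute(self, query: str) -> str:
        # While running, the process title shows the command tag (e.g. SELECT).
        self.state = query.split()[0].upper()
        result = f"[pid {self.pid}] ran: {query}"
        self.state = "idle"  # back to waiting for the next query
        return result


@dataclass
class Postmaster:
    next_pid: int = 100
    backends: list = field(default_factory=list)

    def accept(self, user: str, database: str) -> Backend:
        # In real Postgres this is a fork(); here we just create an object.
        backend = Backend(self.next_pid, user, database)
        self.next_pid += 1
        self.backends.append(backend)
        return backend
```

Each client connection gets its own backend with its own PID, which is exactly why a new postgres: postgres postgres [local] idle line appears in ps aux for every connection.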
Summary
Let’s summarise what we have established so far:
- When we start Postgres, we see several processes: the postmaster, walwriter, checkpointer, and others. Of these, the postmaster handles incoming connections.
- The postmaster spawns a backend process for each connection. These backend processes handle the requests sent by the client.
Making an Insert
We can now begin the journey of the insert. We have the client terminal connected to the Postgres server. From this client terminal, we are going to send a request to the server, and the postgres: postgres postgres [local] idle process is going to take care of it.
Let’s create a table and make an insert. Execute this from the client terminal:
CREATE TABLE mytesttable (
id SERIAL PRIMARY KEY,
data TEXT,
created_at TIMESTAMP DEFAULT NOW()
);

This will create a table named mytesttable in the default schema in Postgres.
And now let’s make an insert:
INSERT INTO mytesttable (data) VALUES ('Tracking this specific insert!');

In the client terminal, we see a success response: the row has been inserted. But who inserted this row? Where is it stored? What side effects or background operations did it trigger? These are the details we'll trace next.
When we execute an INSERT, the data flows through three stages:
- First, in a "page" in the shared buffer cache (in memory).
- Then, in the WAL on disk, at commit.
- Finally, in the actual data file on disk, asynchronously.
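The three stages can be sketched as a simplified model. This is an assumption-laden toy (the MiniStorage class and its dict-based "files" are invented for illustration), not Postgres internals, but it captures the ordering: buffer first, WAL at commit, data file later.

```python
# Toy model of the insert's three stages (illustrative, not Postgres code):
# 1. modify the page in the shared buffer (memory), 2. flush WAL at commit
# (durable), 3. flush the dirty page to the data file later, asynchronously.

class MiniStorage:
    def __init__(self):
        self.shared_buffer = {}  # page_no -> rows held in memory
        self.dirty = set()       # pages modified since their last flush
        self.wal_buffer = []     # WAL records not yet on disk
        self.wal_file = []       # append-only WAL on "disk"
        self.data_file = {}      # pages persisted in the data file

    def insert(self, page_no: int, row: str) -> None:
        # Stage 1: change the in-memory page, mark it dirty, queue a WAL record.
        self.shared_buffer.setdefault(page_no, []).append(row)
        self.dirty.add(page_no)
        self.wal_buffer.append(("INSERT", page_no, row))

    def commit(self) -> None:
        # Stage 2: flush WAL records to the WAL file; the insert is now durable
        # even though the data file hasn't been touched yet.
        self.wal_file.extend(self.wal_buffer)
        self.wal_buffer.clear()
        self.wal_file.append(("COMMIT",))

    def background_write(self) -> None:
        # Stage 3: later, write dirty pages back to the data file.
        for page_no in list(self.dirty):
            self.data_file[page_no] = list(self.shared_buffer[page_no])
        self.dirty.clear()
```

Note that after commit() the data file is still empty; durability comes from the WAL alone, and the data file only catches up when background_write() runs.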
Page-Based Storage System
Postgres, like several other databases, uses a "page-based storage" mechanism. All data is grouped into 8KB pages, and the page is the atomic unit of I/O: even if Postgres needs a single row, it fetches the entire 8KB page containing that row. Sizes like 8KB align with how the operating system and device drivers transfer data to and from disk. Pages are organized into files with a maximum size of 1GB. Everything we store on disk (table data, metadata, indexes) follows this page-based layout. Each page consists of a header followed by its data.
Fascinating as it is, a full discussion of Postgres storage is beyond this article's scope; please refer to the sources in the references to learn more about the page layout. For us, it is enough to know that Postgres organizes storage in files composed of fixed-size 8KB units called pages.
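To make the page structure a little more concrete, here is a sketch that builds a synthetic empty page and parses its fixed-size header, following the PageHeaderData layout documented for Postgres (src/include/storage/bufpage.h). We construct the bytes ourselves rather than reading a live data file, and the field names are paraphrased.

```python
# Parse the 24-byte Postgres page header from a synthetic 8KB page.
# Layout per PageHeaderData: pd_lsn (2 x uint32), then six uint16 fields
# (checksum, flags, lower, upper, special, pagesize_version), then
# pd_prune_xid (uint32). Little-endian, no padding.
import struct

PAGE_SIZE = 8192
HEADER_FMT = "<IIHHHHHHI"
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # 24 bytes


def parse_page_header(page: bytes) -> dict:
    fields = struct.unpack(HEADER_FMT, page[:HEADER_SIZE])
    return {
        "lsn": (fields[0], fields[1]),
        "checksum": fields[2],
        "flags": fields[3],
        "lower": fields[4],   # end of the line-pointer array (grows upward)
        "upper": fields[5],   # start of row data (grows downward)
        "special": fields[6],
        "pagesize_version": fields[7],
    }


# A freshly initialized empty page: lower sits right after the header,
# upper and special sit at the end of the page; layout version is 4.
header = struct.pack(HEADER_FMT, 0, 0, 0, 0,
                     HEADER_SIZE, PAGE_SIZE, PAGE_SIZE, PAGE_SIZE | 4, 0)
page = header + b"\x00" * (PAGE_SIZE - HEADER_SIZE)
```

The gap between lower and upper is the page's free space; rows are added at upper (moving down) while their line pointers are added at lower (moving up). This is also why a table file's size is always a multiple of 8192.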
Let’s inspect the file where our data is stored.
- From the client terminal, run SELECT pg_relation_filepath('mytesttable');. This tells you the file path where the table's data is stored. I see base/5/16387. This file is a collection of 8KB pages; as a sanity check, its size is always divisible by 8192.
- From the server terminal, run cd /var/lib/postgresql/data. This is the directory where Postgres stores its data.
- In this directory, run cat base/5/16387. We'll see the output stored on disk! It isn't fully readable, but we can still see that it corresponds to our insert.
postgres@ed47e5efc99e:~/data$ cat base/5/16387
��S� ���� ?Tracking this specific insert!�� ��

WAL Write at Commit
When our backend process commits the transaction, it writes the WAL record from the WAL buffer to the WAL file on disk. Since the WAL is an append-only log, this is a sequential write — the fastest kind of disk write. This is what makes the commit durable: even if Postgres crashes right after, the WAL record on disk can be replayed to recover the insert.
So where does the walwriter process we saw earlier fit in? The walwriter is a background helper that periodically flushes WAL buffers to disk. It’s useful when Postgres is configured for asynchronous commit (where the backend doesn’t wait for the disk write). In the default synchronous mode that we’re using, the backend handles the flush itself. We’ll explore WAL in much more detail in the next post.
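The commit-time flush can be sketched in a few lines. This is a minimal sketch under stated assumptions: the file name and record format are invented, and real WAL records are binary structures, but the essential moves are the same, an append to the end of the log followed by an fsync before the client gets its success response.

```python
# Minimal sketch of a durable commit (hypothetical record format, not
# Postgres's WAL format): append records sequentially to the log file,
# then fsync so the bytes survive a crash before acknowledging the client.
import os


def wal_append_and_commit(wal_path: str, records: list) -> None:
    # Append mode: every write lands at the end of the file, so the disk
    # sees purely sequential I/O -- the fastest kind of write.
    with open(wal_path, "ab") as wal:
        for rec in records:
            wal.write(rec + b"\n")
        wal.write(b"COMMIT\n")
        wal.flush()
        # In synchronous-commit mode, the backend itself waits for this
        # fsync before reporting success to the client.
        os.fsync(wal.fileno())
```

If the process crashed immediately after this function returned, the INSERT and COMMIT records would still be on disk, which is exactly what recovery replays.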
Background Writer Process
At this point, our insert is durable — the WAL record is on disk, and the client has received a success response. But the actual data page in the shared buffer is still “dirty”: it has been modified in memory but not yet written back to the data file on disk.
This is where the background writer process comes in. It periodically scans the shared buffer for dirty pages and writes them back to their data files on disk. This write is a random I/O operation (unlike the sequential WAL write), so it’s slower — which is exactly why it happens asynchronously in the background rather than during the transaction.
Remember the file we inspected earlier with cat base/5/16387? That’s the data file that the background writer eventually flushes to. There’s also the checkpointer process (which we saw in our ps aux output) — it performs a similar role but on a larger scale, ensuring that all dirty pages are written to disk at regular intervals called checkpoints.
Once dirty pages have been flushed to disk, the shared buffer entries can be reclaimed to free up memory.
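The background writer's behavior can be sketched as a simple loop step. This is a toy model with invented names; the real process is driven by settings like bgwriter_delay and bgwriter_lru_maxpages, but the idea of writing only a bounded number of dirty pages per round, to spread random I/O over time, is the same.

```python
# Toy sketch of one background-writer round (illustrative only): scan the
# dirty pages and write back at most max_pages of them, mirroring how
# Postgres caps each round (bgwriter_lru_maxpages) to smooth out I/O.

def bgwriter_round(dirty_pages: dict, data_file: dict, max_pages: int = 2) -> int:
    written = 0
    for page_no in sorted(dirty_pages):
        if written >= max_pages:
            break  # leave the rest for the next round
        # Write the page back to the data file and clear its dirty flag.
        data_file[page_no] = dirty_pages.pop(page_no)
        written += 1
    return written
```

Calling this repeatedly drains the dirty set a few pages at a time; the checkpointer plays the complementary role of guaranteeing that, by the end of each checkpoint, everything dirty has reached disk.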
Summary
To summarise: when we make an insert, the backend process modifies the page in the shared buffer and writes the WAL record to disk at commit time. Later, the background writer process asynchronously flushes the dirty data page to disk.
What’s Next
We’ve traced our insert from the command line to the bytes on disk — through the shared buffer, the WAL, and finally the data files. But we glossed over a big question: what exactly is in the WAL, and why does it matter so much?
In the next post, we’ll crack open the WAL, inspect its contents record by record, and see how it becomes the foundation for replication across multiple nodes.