Introduction
INSERT INTO — two words, and your data is saved. But what actually happens between pressing Enter and getting back INSERT 0 1? Where does the row go? How does it survive a crash?
I got curious about this and decided to trace the journey of an insert in PostgreSQL — from the command to the bytes on disk. This is the first part in a series where we’ll follow that journey all the way to replication across multiple nodes. In this post, we’ll stay on a single node and look at how Postgres handles connections, buffers data in memory, writes to the WAL, and eventually flushes your data to disk.
To keep things hands-on, we’ll run everything inside a Docker container and inspect what Postgres is doing at each step.
Outline
- Prerequisites
- Connecting to Postgres and the postgres process
- Creating a table and making an insert
- Tracing that insert: page-based storage, shared buffers, WAL write at commit, and the background writer
Prerequisites
- A running Postgres 17.6 service. You can use this Dockerfile and build the container to follow along.
- Some familiarity with PostgreSQL will be helpful.
- Familiarity with terms like processes and threads will also help.
Connections in Postgres
Before we get into how Postgres "stores" our data, let's quickly look at what it means to "connect" to Postgres from the server's point of view. You have probably connected to Postgres as a client; let's see what that connection means for the Postgres server.
Begin by starting the docker container, and running docker exec -it <container> bash. Then run the command: ps aux. This will list all the processes that are running in our container.
postgres@ed47e5efc99e:/$ ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
postgres 1 0.0 0.1 217536 24448 ? Ss Jan19 0:02 postgres
postgres 14 0.0 0.0 217668 8408 ? Ss Jan19 0:00 postgres: checkpointer
postgres 15 0.0 0.0 217692 6488 ? Ss Jan19 0:00 postgres: background writer
postgres 17 0.0 0.0 217536 9560 ? Ss Jan19 0:00 postgres: walwriter
postgres 18 0.0 0.0 219112 8408 ? Ss Jan19 0:00 postgres: autovacuum launcher
postgres 19 0.0 0.0 219120 7640 ? Ss Jan19 0:00 postgres: logical replication launcher
postgres 3858 0.4 0.0 7352 3712 pts/1 Ss 15:26 0:00 bash
postgres 3864 0.0 0.0 9392 3840 pts/1 R+ 15:26 0:00 ps aux

We can see several processes listed here. These processes provide a nice abstraction that we'll build on, and it's worth knowing what each one does. For now, the important one is the postgres process with PID 1: the postmaster. Whenever a client requests a connection, the postmaster spawns a child process to handle it. That child process executes the commands the client sends and returns the responses. We'll look at the other processes, like the walwriter and the background writer, later in this post.
Open a new terminal. For convenience, I'll call this new terminal the "client terminal" and the previous one the "server terminal". From the client terminal, connect to Postgres by executing docker exec -it <container> psql -U postgres. This opens a connection to the Postgres server. In the server terminal, run ps aux again. You should see a new entry like postgres: postgres postgres [local] idle. This is the child process that the postmaster spawned to handle the connection. Let's interpret the process name:
- postgres: tells us it's a PostgreSQL backend process
- postgres is the connected username (user "postgres")
- postgres is the database name (database "postgres")
- [local] is the connection type (a Unix domain socket, not TCP/IP)
- idle is the current state (waiting for a query). If we run select pg_sleep(20); from the client terminal, we'll see the state change from idle to SELECT.
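The process-per-connection design can be sketched as a toy model. This is plain Python, not Postgres code; the Postmaster and Backend classes and their fields are invented for illustration, standing in for what fork() does in the real server.

```python
# Toy model of Postgres's process-per-connection design (illustrative only):
# the "postmaster" hands each new connection to a dedicated "backend", which
# runs the client's queries and reports a state like "idle" or "SELECT".
from dataclasses import dataclass, field


@dataclass
class Backend:
    pid: int
    user: str
    database: str
    state: str = "idle"

    def execute(self, query: str) -> str:
        # While running, the process title shows the command tag (e.g. SELECT).
        self.state = query.split()[0].upper()
        result = f"[pid {self.pid}] ran: {query}"
        self.state = "idle"  # back to waiting for the next query
        return result


@dataclass
class Postmaster:
    next_pid: int = 100
    backends: list = field(default_factory=list)

    def accept(self, user: str, database: str) -> Backend:
        # In real Postgres this is a fork(); here we just create an object.
        backend = Backend(self.next_pid, user, database)
        self.next_pid += 1
        self.backends.append(backend)
        return backend
```

Each client connection gets its own backend with its own PID, which is exactly why a new postgres: postgres postgres [local] idle line appears in ps aux for every connection.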
Summary
Let’s summarise what we have established so far:
- When we start Postgres, we see several processes: the postmaster, walwriter, checkpointer, and others. Of these, the postmaster handles incoming connections.
- The postmaster spawns a backend process for each connection. These backend processes handle the requests sent by the client.
Making an Insert
We can now begin the journey of the insert. We have the client terminal connected to the Postgres server. From this client terminal, we are going to send a request to the server, and the postgres: postgres postgres [local] idle process is going to take care of it.
Let’s create a table and make an insert. Execute this from the client terminal:
CREATE TABLE mytesttable (
id SERIAL PRIMARY KEY,
data TEXT,
created_at TIMESTAMP DEFAULT NOW()
);

This will create a table named mytesttable in the default schema in Postgres.
And now let’s make an insert:
INSERT INTO mytesttable (data) VALUES ('Tracking this specific insert!');

In the client terminal, we see a success response: the row has been inserted. But who inserted this row? Where is it stored? What side effects or background operations did it trigger? These are the details we'll trace next.
When we execute an INSERT, the data flows through three stages:
- First, in a "page" in the shared buffer cache (in memory).
- Then, in the WAL on disk, at commit.
- Finally, in the actual data file on disk, asynchronously.
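The three stages can be sketched as a simplified model. This is an assumption-laden toy (the MiniStorage class and its dict-based "files" are invented for illustration), not Postgres internals, but it captures the ordering: buffer first, WAL at commit, data file later.

```python
# Toy model of the insert's three stages (illustrative, not Postgres code):
# 1. modify the page in the shared buffer (memory), 2. flush WAL at commit
# (durable), 3. flush the dirty page to the data file later, asynchronously.

class MiniStorage:
    def __init__(self):
        self.shared_buffer = {}  # page_no -> rows held in memory
        self.dirty = set()       # pages modified since their last flush
        self.wal_buffer = []     # WAL records not yet on disk
        self.wal_file = []       # append-only WAL on "disk"
        self.data_file = {}      # pages persisted in the data file

    def insert(self, page_no: int, row: str) -> None:
        # Stage 1: change the in-memory page, mark it dirty, queue a WAL record.
        self.shared_buffer.setdefault(page_no, []).append(row)
        self.dirty.add(page_no)
        self.wal_buffer.append(("INSERT", page_no, row))

    def commit(self) -> None:
        # Stage 2: flush WAL records to the WAL file; the insert is now durable
        # even though the data file hasn't been touched yet.
        self.wal_file.extend(self.wal_buffer)
        self.wal_buffer.clear()
        self.wal_file.append(("COMMIT",))

    def background_write(self) -> None:
        # Stage 3: later, write dirty pages back to the data file.
        for page_no in list(self.dirty):
            self.data_file[page_no] = list(self.shared_buffer[page_no])
        self.dirty.clear()
```

Note that after commit() the data file is still empty; durability comes from the WAL alone, and the data file only catches up when background_write() runs.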
Page-Based Storage System
Postgres, like several other databases, uses a "page-based storage" mechanism. All data is grouped into 8KB pages, and the page is the atomic unit of I/O: even if Postgres needs a single row, it fetches the entire 8KB page containing that row. Sizes like 8KB align with how the operating system and device drivers transfer data to and from disk. Pages are organized into files with a maximum size of 1GB. Everything we store on disk (table data, metadata, indexes) follows this page-based layout. Each page consists of a header followed by its data.
Fascinating as it is, a full discussion of Postgres storage is beyond this article's scope; please refer to the sources in the references to learn more about the page layout. For us, it is enough to know that Postgres organizes storage in files composed of fixed-size 8KB units called pages.
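To make the page structure a little more concrete, here is a sketch that builds a synthetic empty page and parses its fixed-size header, following the PageHeaderData layout documented for Postgres (src/include/storage/bufpage.h). We construct the bytes ourselves rather than reading a live data file, and the field names are paraphrased.

```python
# Parse the 24-byte Postgres page header from a synthetic 8KB page.
# Layout per PageHeaderData: pd_lsn (2 x uint32), then six uint16 fields
# (checksum, flags, lower, upper, special, pagesize_version), then
# pd_prune_xid (uint32). Little-endian, no padding.
import struct

PAGE_SIZE = 8192
HEADER_FMT = "<IIHHHHHHI"
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # 24 bytes


def parse_page_header(page: bytes) -> dict:
    fields = struct.unpack(HEADER_FMT, page[:HEADER_SIZE])
    return {
        "lsn": (fields[0], fields[1]),
        "checksum": fields[2],
        "flags": fields[3],
        "lower": fields[4],   # end of the line-pointer array (grows upward)
        "upper": fields[5],   # start of row data (grows downward)
        "special": fields[6],
        "pagesize_version": fields[7],
    }


# A freshly initialized empty page: lower sits right after the header,
# upper and special sit at the end of the page; layout version is 4.
header = struct.pack(HEADER_FMT, 0, 0, 0, 0,
                     HEADER_SIZE, PAGE_SIZE, PAGE_SIZE, PAGE_SIZE | 4, 0)
page = header + b"\x00" * (PAGE_SIZE - HEADER_SIZE)
```

The gap between lower and upper is the page's free space; rows are added at upper (moving down) while their line pointers are added at lower (moving up). This is also why a table file's size is always a multiple of 8192.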
Let’s inspect the file where our data is stored.
- From the client terminal, run SELECT pg_relation_filepath('mytesttable');. This tells you the file path where the table's data is stored. I see base/5/16387. This file is a collection of 8KB pages; as a sanity check, its size is always divisible by 8192.
- From the server terminal, run cd /var/lib/postgresql/data. This is the directory where Postgres stores its data.
- In this directory, run cat base/5/16387. We'll see the output stored on disk! It isn't fully readable, but we can still see that it corresponds to our insert.
postgres@ed47e5efc99e:~/data$ cat base/5/16387
��S� ���� ?Tracking this specific insert!�� ��

WAL Write at Commit
When our backend process commits the transaction, it writes the WAL record from the WAL buffer to the WAL file on disk. Since the WAL is an append-only log, this is a sequential write — the fastest kind of disk write. This is what makes the commit durable: even if Postgres crashes right after, the WAL record on disk can be replayed to recover the insert.
So where does the walwriter process we saw earlier fit in? The walwriter is a background helper that periodically flushes WAL buffers to disk. It’s useful when Postgres is configured for asynchronous commit (where the backend doesn’t wait for the disk write). In the default synchronous mode that we’re using, the backend handles the flush itself. We’ll explore WAL in much more detail in the next post.
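The commit-time flush can be sketched in a few lines. This is a minimal sketch under stated assumptions: the file name and record format are invented, and real WAL records are binary structures, but the essential moves are the same, an append to the end of the log followed by an fsync before the client gets its success response.

```python
# Minimal sketch of a durable commit (hypothetical record format, not
# Postgres's WAL format): append records sequentially to the log file,
# then fsync so the bytes survive a crash before acknowledging the client.
import os


def wal_append_and_commit(wal_path: str, records: list) -> None:
    # Append mode: every write lands at the end of the file, so the disk
    # sees purely sequential I/O -- the fastest kind of write.
    with open(wal_path, "ab") as wal:
        for rec in records:
            wal.write(rec + b"\n")
        wal.write(b"COMMIT\n")
        wal.flush()
        # In synchronous-commit mode, the backend itself waits for this
        # fsync before reporting success to the client.
        os.fsync(wal.fileno())
```

If the process crashed immediately after this function returned, the INSERT and COMMIT records would still be on disk, which is exactly what recovery replays.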
Background Writer Process
At this point, our insert is durable — the WAL record is on disk, and the client has received a success response. But the actual data page in the shared buffer is still “dirty”: it has been modified in memory but not yet written back to the data file on disk.
This is where the background writer process comes in. It periodically scans the shared buffer for dirty pages and writes them back to their data files on disk. This write is a random I/O operation (unlike the sequential WAL write), so it’s slower — which is exactly why it happens asynchronously in the background rather than during the transaction.
Remember the file we inspected earlier with cat base/5/16387? That’s the data file that the background writer eventually flushes to. There’s also the checkpointer process (which we saw in our ps aux output) — it performs a similar role but on a larger scale, ensuring that all dirty pages are written to disk at regular intervals called checkpoints.
Once dirty pages have been flushed to disk, the shared buffer entries can be reclaimed to free up memory.
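The background writer's behavior can be sketched as a simple loop step. This is a toy model with invented names; the real process is driven by settings like bgwriter_delay and bgwriter_lru_maxpages, but the idea of writing only a bounded number of dirty pages per round, to spread random I/O over time, is the same.

```python
# Toy sketch of one background-writer round (illustrative only): scan the
# dirty pages and write back at most max_pages of them, mirroring how
# Postgres caps each round (bgwriter_lru_maxpages) to smooth out I/O.

def bgwriter_round(dirty_pages: dict, data_file: dict, max_pages: int = 2) -> int:
    written = 0
    for page_no in sorted(dirty_pages):
        if written >= max_pages:
            break  # leave the rest for the next round
        # Write the page back to the data file and clear its dirty flag.
        data_file[page_no] = dirty_pages.pop(page_no)
        written += 1
    return written
```

Calling this repeatedly drains the dirty set a few pages at a time; the checkpointer plays the complementary role of guaranteeing that, by the end of each checkpoint, everything dirty has reached disk.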
Summary
To summarise: when we make an insert, the backend process modifies the page in the shared buffer and writes the WAL record to disk at commit time. Later, the background writer process asynchronously flushes the dirty data page to disk.
What’s Next
We’ve traced our insert from the command line to the bytes on disk — through the shared buffer, the WAL, and finally the data files. But we glossed over a big question: what exactly is in the WAL, and why does it matter so much?
In the next post, we’ll crack open the WAL, inspect its contents record by record, and see how it becomes the foundation for replication across multiple nodes.