Rescue Data with pg_filedump?

Backups are a DBA’s lifeline — but what if your PostgreSQL database has crashed without any backups? Maybe pg_filedump can help!

Recently, I encountered a rather challenging case. Here’s the situation: A user’s PostgreSQL database was corrupted. It was a Gitlab-managed PostgreSQL instance with no replicas, no backups, and no dumps. It was running on BCACHE (using SSD as transparent cache), and after a power outage, it wouldn’t start.

But that wasn’t the end of it. After several rounds of mishandling, it was completely wrecked: First, someone forgot to mount the BCACHE disk, causing Gitlab to reinitialize a new database cluster; then, due to various reasons, isolation failed, and two database processes ran on the same cluster directory, frying the data directory; next, someone ran pg_resetwal without parameters, pushing the database back to its origin point; finally, they let the empty database run for a while and then removed the temporary backup made before the corruption.

When I saw this case, I was speechless: How do you recover from this mess? It seemed like the only option was to extract data directly from the binary files. I suggested they try a data recovery company, and I asked around, but among the many data recovery companies, almost none offered PostgreSQL data recovery services. Those that did only handled basic issues, and for this situation, they all said it was a long shot.

Data recovery quotes are typically based on the number of files, ranging from ¥1000 to ¥5000 per file. With thousands of files in the Gitlab database, roughly 1000 tables, the total recovery cost could easily reach hundreds of thousands of yuan. But after a day, no one took the job, which made me think: if no one can handle this, doesn’t that make the PG community look bad?

I thought about it and decided: This job looks painful but also quite challenging and interesting. Let’s treat it as a dead horse and try to revive it — no cure, no pay. You never know until you try, right? So I took it on myself.

The Tool

To do a good job, one must first sharpen one’s tools. For data recovery, the first step is to find the right tool: pg_filedump is a great weapon. It can extract raw binary data from PostgreSQL data pages, handling many low-level tasks.

The tool can be compiled and installed with the classic make and make install routine, but you need the corresponding major version of PostgreSQL installed first. Gitlab defaults to PG 13, so make sure that version’s pg_config is on your PATH before compiling:

git clone https://github.com/df7cb/pg_filedump
cd pg_filedump && make && sudo make install

Using pg_filedump isn’t complicated. You feed it a data file and tell it the type of each column in the table, and it’ll interpret the data for you. For example, the first step is to find out which databases exist in the cluster. This information is stored in the system catalog pg_database. It is a cluster-level (shared) catalog located in the global directory, assigned the fixed OID 1262 during cluster initialization, so the corresponding physical file is typically global/1262.

vonng=# select 'pg_database'::RegClass::OID;
 oid
------
 1262

This catalog has many columns, but we mainly care about the first two: oid and datname. datname is the database name, and oid can be used to locate the database directory. We can use pg_filedump to extract this table and examine it. The -D parameter tells pg_filedump how to interpret the binary data for each row in the table: you specify the type of each column, separated by commas, and use ~ to ignore the rest of the columns.

pg-filedump-1.png

As you can see, each row of data starts with COPY. Here we found our target database gitlabhq_production with OID 16386. Therefore, all files for this database should be located in the base/16386 subdirectory.

Recovering the Data Dictionary

Knowing the directory of files to recover, the next step is to extract the data dictionary. There are four important tables to focus on:

• pg_class: Contains important metadata for all tables
• pg_namespace: Contains schema metadata
• pg_attribute: Contains all column definitions
• pg_type: Contains type names

Among these, pg_class is the most crucial and indispensable table. The other catalogs are nice to have: they make our work easier. So, we first attempt to recover this table.

pg_class is a database-level system catalog with a default OID of 1259, so the corresponding file for pg_class in the gitlabhq_production database directory should be base/16386/1259.

pg-filedump-2.png

A side note: Those familiar with PostgreSQL internals know that while the actual underlying storage filename (RelFileNode) defaults to matching the table’s OID, some operations might change this. In such cases, you can use pg_filedump -m pg_filenode.map to parse the mapping file in the database directory and find the Filenode corresponding to OID 1259. Of course, here they match, so we’ll move on.

We parse its binary file based on the pg_class table structure definition (note: use the table structure for the corresponding PG major version):

pg_filedump -D 'oid,name,oid,oid,oid,oid,oid,oid,oid,int,real,int,oid,bool,bool,char,char,smallint,smallint,bool,bool,bool,bool,bool,bool,char,bool,oid,xid,xid,text,text,text' -i base/16386/1259

Then you can see the parsed data. Each record is a single line with fields separated by \t, in the same format as the default output of PostgreSQL’s COPY command. So you can use scripts to grep and filter the output, strip the COPY prefix from each line, and re-import the rows into a real database table for detailed examination.

pg-filedump-3.png

When recovering data, there are many details to pay attention to, and the first one is: You need to handle deleted rows. How to identify them? Use the -i parameter to print each row’s metadata. The metadata includes an XMAX field. If a row was deleted by a transaction, this record’s XMAX will be set to that transaction’s XID. So if a row’s XMAX isn’t zero, it means this is a deleted record and shouldn’t be included in the final output.

pg-filedump-4.png

Here the non-zero XMAX indicates that this is a deleted record
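To make this concrete, here is a minimal sketch of such a post-processing script in Python (an illustrative fragment, not the exact script used in this recovery; the function name is mine). It assumes pg_filedump was run with -i, and that each item’s metadata line containing XMAX: precedes its COPY: line, as in the screenshot above. It keeps only non-deleted rows and strips the COPY: prefix so the output can be fed straight into COPY ... FROM:

import re
import sys

def live_rows(lines):
    """Yield tab-separated rows whose XMAX is zero (i.e. not deleted)."""
    xmax = 0
    for line in lines:
        meta = re.search(r'XMAX:\s*(\d+)', line)
        if meta:
            xmax = int(meta.group(1))          # remember XMAX of the current item
        elif line.startswith('COPY:'):
            if xmax == 0:                      # skip rows deleted by some transaction
                yield line[len('COPY:'):].strip()
            xmax = 0                           # reset for the next item

if __name__ == '__main__':
    for row in live_rows(sys.stdin):
        print(row)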

With the pg_class data dictionary, you can clearly find the OID correspondences for other tables, including system views. You can recover pg_namespace, pg_attribute, and pg_type using the same method. What can you do with these four tables?

pg-filedump-5.png

You can use SQL to generate the input path for each table, automatically construct the type of each column as the -D parameter, and generate the schema for temporary result tables. In short, you can automate all the necessary tasks programmatically.

SELECT  id, name, nspname, relname, nspid, attrs, fields, has_tough_type,
        CASE WHEN toast_page > 0 THEN toast_name ELSE NULL END AS toast_name, relpages, reltuples, path
FROM
    (
        SELECT n.nspname || '.' || c.relname AS "name", n.nspname, c.relname, c.relnamespace AS nspid, c.oid AS id, c.reltoastrelid AS tid,
               toast.relname AS toast_name, toast.relpages AS toast_page,
               c.relpages, c.reltuples, 'data/base/16386/' || c.relfilenode::TEXT AS path
        FROM meta.pg_class c
                 LEFT JOIN meta.pg_namespace n ON c.relnamespace = n.oid
           , LATERAL (SELECT * FROM meta.pg_class t WHERE t.oid = c.reltoastrelid) toast
        WHERE c.relkind = 'r' AND c.relpages > 0
          AND c.relnamespace IN (2200, 35507, 35508)
        ORDER BY c.relnamespace, c.relpages DESC
    ) z,
    LATERAL ( SELECT string_agg(name,',') AS attrs,
                     string_agg(std_type,',') AS fields,
                     max(has_tough_type::INTEGER)::BOOLEAN AS has_tough_type
              FROM meta.pg_columns WHERE relid = z.id ) AS columns;

Note that the data type names supported by pg_filedump’s -D parameter are strictly limited to the standard names listed below, so you must convert boolean to bool, INTEGER to int, and so on. If the data type you want to parse isn’t in the list, you can first try using the TEXT type. For example, the INET type for IP addresses can be parsed using TEXT.

bigint bigserial bool char charN date float float4 float8 int json macaddr name numeric oid real serial smallint smallserial text time timestamp timestamptz timetz uuid varchar varcharN xid xml
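As an illustration, this conversion can be captured in a small lookup table. The snippet below is a sketch (the names TYPE_MAP and to_dump_types are mine, and the mapping is not exhaustive); anything unknown falls back to text as a first attempt:

# Illustrative mapping from catalog/DDL type names to names accepted by -D.
TYPE_MAP = {
    'boolean': 'bool',
    'integer': 'int',
    'bigint': 'bigint',
    'smallint': 'smallint',
    'character varying': 'varchar',
    'timestamp with time zone': 'timestamptz',
    'double precision': 'float8',
    'inet': 'text',          # not natively supported; text works in practice
}

def to_dump_types(ddl_types):
    """Translate a list of column type names into a -D argument string."""
    return ','.join(TYPE_MAP.get(t, 'text') for t in ddl_types)

For instance, to_dump_types(['bigint', 'timestamp with time zone', 'integer']) would produce 'bigint,timestamptz,int'.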

But there are indeed other special cases that require additional processing, such as PostgreSQL’s ARRAY type, which we’ll cover in detail later.

Recovering a Regular Table

Recovering a regular data table isn’t fundamentally different from recovering a system catalog table: it’s just that catalog schemas and information are publicly standardized, while the schema of the database to be recovered might not be.

Gitlab is also a well-known piece of open-source software, so finding its database schema definition isn’t difficult. If it were an ordinary business system, you would have to spend more effort reconstructing the original DDL from pg_catalog.

Once you know the DDL definition, you can use the data type of each column in the DDL to interpret the data in the binary file. Let’s use public.approval_project_rules, a regular table in Gitlab whose definition is shown below, as an example to demonstrate how to recover such a regular data table.

create table approval_project_rules
(
    id                        bigint,
    created_at                timestamp with time zone,
    updated_at                timestamp with time zone,
    project_id                integer,
    approvals_required        smallint,
    name                      varchar,
    rule_type                 smallint,
    scanners                  text[],
    vulnerabilities_allowed   smallint,
    severity_levels           text[],
    report_type               smallint,
    vulnerability_states      text[],
    orchestration_policy_idx  smallint,
    applies_to_all_protected_branches              boolean,
    security_orchestration_policy_configuration_id bigint,
    scan_result_policy_id                          bigint
);

First, we need to convert these types into types that pg_filedump can recognize. This involves type mapping: for uncertain types, like the text[] string array fields above, you can use the text type as a placeholder for now, or simply end the -D list early with ~, which ignores that column and everything after it:

bigint,timestamptz,timestamptz,int,smallint,varchar,smallint,text,smallint,text,smallint,text,smallint,bool,bigint,bigint

Of course, the first thing to know is that PostgreSQL’s tuple column layout is ordered, and this order is stored in the attnum field of the pg_attribute catalog. The type OID for each column is stored in the atttypid field, and to get the type’s name you join to the pg_type catalog through that OID (system built-in types have fixed OIDs, so you can also map the IDs directly). In summary, to work out how to interpret a table’s physical records, you need at least the four system dictionary tables mentioned above.
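Putting that together, here is a rough Python sketch of how the -D string for one table could be derived. The data structures and the function name are hypothetical, assuming pg_attribute and pg_type have already been recovered as described above and loaded into dictionaries:

def build_dump_spec(relid, pg_attribute, pg_type_names, type_map):
    """Build the -D argument for one table from recovered catalog rows.
    pg_attribute: maps attrelid -> list of (attnum, attname, atttypid)
    pg_type_names: maps type OID -> type name (e.g. 25 -> 'text')
    type_map: maps type names to pg_filedump-friendly names (see TYPE_MAP above)."""
    columns = sorted(c for c in pg_attribute[relid] if c[0] > 0)  # user columns, in attnum order
    spec = []
    for attnum, attname, atttypid in columns:
        type_name = pg_type_names[atttypid]
        spec.append(type_map.get(type_name, 'text'))              # fall back to text
    return ','.join(spec)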

With the order and types of columns in this table, and knowing the location of this table’s binary file, you can use this information to translate the binary data.

pg_filedump -i -f -D 'bigint,...,bigint' 38304

pg-filedump-6.png

For output, it’s recommended to add the -i and -f options. The former prints metadata for each row (needed to determine if a row has been deleted based on XMAX); the latter prints the original binary data context (necessary for handling complex data that pg_filedump can’t handle).

Normally, each record will start with either COPY: or Error:. The former means the tuple was extracted successfully, while the latter means partial success or failure. Failures can have various causes and need to be handled case by case. Successful rows can be taken as they are: each line is one record with fields separated by \t; replace \N with NULL where needed, process it, and save it into a temporary table for later use.

Of course, the devil is in the details. If data recovery were this easy, it wouldn’t be so challenging.

The Devil is in the Details

When handling data recovery, there are many small details to pay attention to. Here are a few important points.

First is TOAST field handling. TOAST stands for “The Oversized-Attribute Storage Technique”. If you find that a parsed field’s content is (TOASTED), it means this field was too long and was sliced and transferred to a dedicated table - the TOAST table.

If a table has columns that might be TOASTed, it will have a corresponding TOAST table, identified by reltoastrelid in pg_class. The TOAST table can be treated as a regular table, so you can use the same method to parse its data, stitch the chunks back together, and fill the values back into the original table. We won’t expand on this in depth here.

pg-filedump-7.png
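For a sense of what the stitching step looks like, here is a rough Python sketch (the function name is mine). It assumes the TOAST table’s rows of (chunk_id, chunk_seq, chunk_data) have already been extracted by some means, and that the values were stored uncompressed; compressed values would additionally need pglz decompression. Each reassembled value is keyed by its chunk_id, which is what the TOAST pointer in the original row refers to:

from collections import defaultdict

def reassemble_toast(chunks):
    """chunks: iterable of (chunk_id, chunk_seq, chunk_data) tuples recovered
    from the TOAST table. Returns a dict of chunk_id -> reassembled bytes."""
    grouped = defaultdict(list)
    for chunk_id, chunk_seq, chunk_data in chunks:
        grouped[chunk_id].append((chunk_seq, chunk_data))
    return {
        chunk_id: b''.join(data for _, data in sorted(parts))   # concatenate in chunk_seq order
        for chunk_id, parts in grouped.items()
    }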

The second issue is complex types. As mentioned in the previous section, pg_filedump’s README lists supported types, but types like arrays require additional binary parsing.

For example, when you dump array binary data, you might see a string of \0\0. This is because pg_filedump directly spits out the raw bytes of complex types it can’t handle. That brings additional problems: null bytes (\0) inside strings will make your inserts fail, so your parsing script needs to deal with them. When you encounter a complex column that can’t be parsed, mark it first and keep the raw binary value for later processing.

Let’s look at a concrete example, using the public.approval_project_rules table from above. In the dumped data’s binary and ASCII views, we can see some scattered strings: critical, unknown, etc., mixed in with runs of \0 and binary control characters. Yes, this is the binary representation of a string array. PostgreSQL arrays allow arbitrary type nesting at arbitrary depths, so the data structure here is a bit complex.

pg-filedump-8.png

For example, the highlighted area in the image corresponds to an array containing three strings: {unknown,high,critical}::TEXT[]. The leading 01 is the number of dimensions (a one-dimensional array), followed by the data-offset field, which is non-zero only when a null bitmap is present, and then the element type OID 0x00000019. 0x19 is 25 in decimal, which corresponds to the text type in pg_type, so this is a string array (0x17 would mean an integer array). Next comes the size of the first and only dimension, 0x03, since this array has three elements, followed by 1, the lower bound (starting index) of that dimension. After that come the three string values themselves: each starts with a 4-byte varlena length header (which must be right-shifted to strip the flag bits), followed by the string content, with layout alignment and padding to take into account.

In summary, you need to dig through the source code implementation, and there are endless details here: variable length, null bitmaps, field compression, out-of-line storage, and endianness. Make one wrong move, and what you extract is just a useless mess.

You can choose to directly parse the original binary from the recorded context using Python scripts, or register new types and callback handler functions in the pg_filedump source code to reuse PG’s provided C parsing functions. Neither approach is particularly easy.
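To illustrate the Python route, here is a minimal sketch for the simplest case described above: a one-dimensional text[] with no NULL elements, plain 4-byte element headers, no compression, and little-endian byte order, parsing a buffer that starts at the ndim field (i.e. after the array’s own 4-byte length header), as in the annotated dump above. The function name is mine; anything fancier (short 1-byte headers, compressed or TOASTed elements, multiple dimensions) needs more work:

import struct

def parse_text_array(buf):
    """Parse a one-dimensional TEXT[] datum, given its bytes starting at ndim."""
    ndim, dataoffset, elemtype = struct.unpack_from('<iiI', buf, 0)
    assert ndim == 1 and dataoffset == 0     # single dimension, no null bitmap
    assert elemtype == 25                    # 25 = OID of text in pg_type
    nelems, lbound = struct.unpack_from('<ii', buf, 12)
    pos, items = 20, []                      # element data starts right after dims/lbounds
    for _ in range(nelems):
        header = struct.unpack_from('<I', buf, pos)[0]
        length = header >> 2                 # strip the 2 flag bits; includes the 4-byte header
        items.append(buf[pos + 4:pos + length].decode('utf-8'))
        pos += length
        pos += (-pos) % 4                    # pad to 4-byte alignment for the next element
    return items

For the highlighted datum above, this should yield something like ['unknown', 'high', 'critical'].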

Fortunately, PostgreSQL itself provides some C language helper functions & macros to help you complete most of the work, and luckily, the arrays in Gitlab are all one-dimensional, with types limited to integer arrays and string arrays. Other data pages with complex types can also be reconstructed from other tables, so the overall workload is still manageable.

pg-filedump-9.png

Epilogue

This job took me two days to complete. I won’t go into the dirty details of the process - I doubt readers would be interested. After a series of processing, correction, and verification, the data recovery work was finally completed! Except for a few corrupted records in a few tables, all other data was successfully extracted. Wow, a full thousand tables!

I’ve done some data recovery work before, and most cases were relatively simple: data block corruption, control file/CLOG damage, or ransomware infection (writing a few garbage files to the Tablespace). But this is the first time I’ve encountered a case that was so thoroughly wrecked. The reason I dared to take this job was that I have some understanding of the PG kernel and know these tedious implementation details. As long as you know it’s an engineering problem that can be solved, you won’t worry about not being able to complete it, no matter how dirty or tiring the process is.

Despite some shortcomings, pg_filedump is still a good tool. I might improve it later to provide complete support for various data types, so we don’t have to write a pile of Python scripts to handle all the tedious details. After finishing this case, I packaged pg_filedump for PG 12-16 on EL 7-9 and placed it in Pigsty’s yum repository, included by default in Pigsty’s offline software package as of Pigsty v2.4.1. I sincerely hope you never need to use this extension, but if you ever find yourself in a situation where you do, I hope it’s right there at your fingertips, ready to use out of the box.

Finally, I want to say one thing: many applications need a database, but database installation, deployment, and maintenance have a high barrier to entry. The PostgreSQL instance that Gitlab spins up is already of decent quality, yet it was still helpless in this situation, let alone the crude hand-rolled single-instance Docker images out there. One major failure can wipe out a company’s accumulated code, data, CI/CD pipelines, and Issue/PR/MR records. I strongly suggest you carefully review your database systems and, at the very least, make regular backups!

pg-filedump-10.png

The core difference between Gitlab’s Enterprise and Community editions lies in whether the underlying PG has high availability and monitoring. And Pigsty, the out-of-the-box PostgreSQL distribution, can solve these problems for you, completely open source and free of charge: high availability, PITR, and the monitoring system are all included. The next time you run into a problem like this, you can fail over or roll back automatically with one click and handle it far more gracefully. We used to run our own Gitlab, Jira, Confluence, and other software on it. If you have similar needs, why not give it a try?
