pgpdf
PDF type with meta admin & Full-Text Search
Overview
Attributes
Yes | No | Yes | Yes | Yes | Yes | Yes | - |
Packages
EL | PIGSTY | pgpdf_$v* | 0.1.0 | - | 17 16 15 14 13 |
Debian | PIGSTY | postgresql-$v-pgpdf | 0.1.0 | - | 17 16 15 14 13 |
Availability
PG17 | PG16 | PG15 | PG14 | PG13 | |
---|---|---|---|---|---|
el8.x86_64 | 0.1.0 | 0.1.0 | 0.1.0 | 0.1.0 | 0.1.0 |
el8.aarch64 | 0.1.0 | 0.1.0 | 0.1.0 | 0.1.0 | 0.1.0 |
el9.x86_64 | 0.1.0 | 0.1.0 | 0.1.0 | 0.1.0 | 0.1.0 |
el9.aarch64 | 0.1.0 | 0.1.0 | 0.1.0 | 0.1.0 | 0.1.0 |
d12.x86_64 | 0.1.0 | 0.1.0 | 0.1.0 | 0.1.0 | 0.1.0 |
d12.aarch64 | 0.1.0 | 0.1.0 | 0.1.0 | 0.1.0 | 0.1.0 |
u22.x86_64 | 0.1.0 | 0.1.0 | 0.1.0 | 0.1.0 | 0.1.0 |
u22.aarch64 | 0.1.0 | 0.1.0 | 0.1.0 | 0.1.0 | 0.1.0 |
u24.x86_64 | 0.1.0 | 0.1.0 | 0.1.0 | 0.1.0 | 0.1.0 |
u24.aarch64 | 0.1.0 | 0.1.0 | 0.1.0 | 0.1.0 | 0.1.0 |
Download
To add the required PGDG / PIGSTY upstream repository, use:
pig repo add pgdg -u # add PGDG repo and update cache (leave existing repos)
pig repo add pigsty -u # add PIGSTY repo and update cache (leave existing repos)
pig repo add pgsql -u # add PGDG + Pigsty repo and update cache (leave existing repos)
pig repo set all -u # set repo to all = NODE + PGSQL + INFRA (remove existing repos)
./node.yml -t node_repo -e node_repo_modules=node,pgsql # -l <cluster>
Or download the latest packages directly:
Install
Install this extension with:
pig ext install pgpdf; # install by extension name, for the current active PG version
pig ext install pgpdf -v 17; # install for PG 17
pig ext install pgpdf -v 16; # install for PG 16
pig ext install pgpdf -v 15; # install for PG 15
pig ext install pgpdf -v 14; # install for PG 14
pig ext install pgpdf -v 13; # install for PG 13
dnf install pgpdf_17*;
dnf install pgpdf_16*;
dnf install pgpdf_15*;
dnf install pgpdf_14*;
dnf install pgpdf_13*;
apt install postgresql-17-pgpdf;
apt install postgresql-16-pgpdf;
apt install postgresql-15-pgpdf;
apt install postgresql-14-pgpdf;
apt install postgresql-13-pgpdf;
./pgsql.yml -t pg_ext -e '{"pg_extensions": ["pgpdf"]}' # -l <cls>
Preload this extension with:
shared_preload_libraries = 'pgpdf'; # add to pg cluster config
Create this extension with:
CREATE EXTENSION pgpdf;
Usage
The actual PDF parsing is done by poppler.
This allows you to work with PDFs in an ACID-compliant way. The usual alternative relies on external scripts or services which can easily make your data ingestion pipeline brittle and leave your raw data out-of-sync.
Download some PDFs.
wget https://wiki.postgresql.org/images/e/ea/PostgreSQL_Introduction.pdf -O /tmp/pgintro.pdf
wget https://pdfobject.com/pdf/sample.pdf -O /tmp/sample.pdf
You can create a pdf
type, by casting either a text
filepath or bytea
column.
CREATE EXTENSION pgpdf;
SELECT '/tmp/pgintro.pdf'::pdf;
pdf
----------------------------------------------------------------------------------
PostgreSQL Introduction +
Digoal.Zhou +
7/20/2011Catalog +
PostgreSQL Origin
If you don’t have the PDF file in your filesystem, but have already stored its content in a bytea
column, you can just cast it to pdf
.
SELECT pg_read_binary_file('/tmp/pgintro.pdf')::bytea::pdf;
Examples
Create a table with a pdf
column:
CREATE TABLE pdfs(name text primary key, doc pdf);
INSERT INTO pdfs VALUES ('pgintro', '/tmp/pgintro.pdf');
INSERT INTO pdfs VALUES ('pgintro', '/tmp/sample.pdf');
Parsing and validation should happen automatically. The files will be read from the disk only once!
[!NOTE] The filepath should be accessible by the
postgres
process / user! That's different than the user running psql. If you don't understand what this means, as your DBA!
String Functions and Operators
Standard Postgres String Functions and Operators should work as usual:
SELECT 'Below is the PDF we received ' || '/tmp/pgintro.pdf'::pdf;
SELECT upper('/tmp/pgintro.pdf'::pdf::text);
SELECT name
FROM pdfs
WHERE doc::text LIKE '%Postgres%';
Full-Text Search (FTS)
You can also perform full-text search (FTS), since you can work on a pdf
file like normal text.
SELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('postgres');
?column?
----------
t
(1 row)
SELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('oracle');
?column?
----------
f
(1 row)
Document similarity with pg_trgm
You can use pg_trgm to get the similarity between two documents:
CREATE EXTENSION pg_trgm;
SELECT similarity('/tmp/pgintro.pdf'::pdf::text, '/tmp/sample.pdf'::pdf::text);
Metadata
The following functions are available:
-
pdf_title(pdf) → text
-
pdf_author(pdf) → text
-
pdf_num_pages(pdf) → integer
Total number of pages in the document
-
pdf_page(pdf, integer) → text
Get the i-th page as text
-
pdf_creator(pdf) → text
-
pdf_keywords(pdf) → text
-
pdf_metadata(pdf) → text
-
pdf_version(pdf) → text
-
pdf_subject(pdf) → text
-
pdf_creation(pdf) → timestamp
-
pdf_modification(pdf) → timestamp
SELECT pdf_title('/tmp/pgintro.pdf');
pdf_title
-------------------------
PostgreSQL Introduction
(1 row)
SELECT pdf_author('/tmp/pgintro.pdf');
pdf_author
------------
周正中
(1 row)
Getting a subset of pages
SELECT pdf_num_pages('/tmp/pgintro.pdf');
pdf_num_pages
---------------
24
(1 row)
SELECT pdf_page('/tmp/pgintro.pdf', 1);
pdf_page
------------------------------
Catalog +
PostgreSQL Origin +
Layout +
Features +
Enterprise Class Attribute+
Case
(1 row)
SELECT pdf_subject('/tmp/pgintro.pdf');
pdf_subject
-------------
(1 row)
SELECT pdf_creation('/tmp/pgintro.pdf');
pdf_creation
--------------------------
Wed Jul 20 11:13:37 2011
(1 row)
SELECT pdf_modification('/tmp/pgintro.pdf');
pdf_modification
--------------------------
Wed Jul 20 11:13:37 2011
(1 row)
SELECT pdf_creator('/tmp/pgintro.pdf');
pdf_creator
------------------------------------
Microsoft® Office PowerPoint® 2007
(1 row)
SELECT pdf_metadata('/tmp/pgintro.pdf');
pdf_metadata
--------------
(1 row)
SELECT pdf_version('/tmp/pgintro.pdf');
pdf_version
-------------
PDF-1.5
(1 row)
Installation
Install poppler dependencies
Linux
sudo apt install -y libpoppler-glib-dev pkg-config
Homebrew/MacOS
brew install poppler pkgconf
cd /tmp
git clone https://github.com/Florents-Tselai/pgpdf.git
cd pgpdf
make
make install # may need sudo
After the installation, in a session:
CREATE EXTENSION pgpdf;
Docker
Get the Docker image with:
docker pull florents/pgpdf:pg17
This adds pgpdf to the Postgres image (replace 17
with your Postgres server version, and run it the same way).
Run the image in a container.
docker run --name pgpdf -p 5432:5432 -e POSTGRES_PASSWORD=pass florents/pgpdf:pg17
Through another terminal, connect to the running server (container).
PGPASSWORD=pass psql -h localhost -p 5432 -U postgres
[!WARNING] Reading arbitrary binary data (PDF) into your database can pose security risks. Only use this for files you trust.