Collation in PostgreSQL
Why does Pigsty default to locale=C
and encoding=UTF8
when initializing PostgreSQL databases?
The answer is simple: Unless you explicitly need LOCALE-specific features, you should never configure anything other than C.UTF8
for character encoding and collation settings.
I’ve previously written about character encoding, so let’s focus on LOCALE configuration today.
While there might be some justification for using non-UTF8
character encoding on the server side, using any LOCALE other than C
is unforgivable. In PostgreSQL, LOCALE isn’t just about trivial things like date and currency display formats - it affects critical functionality.
Incorrect LOCALE configuration can lead to performance degradation of several to dozens of times, and prevents LIKE
queries from using regular indexes. Meanwhile, setting LOCALE=C
doesn’t affect scenarios that genuinely need localization rules. As the official documentation states: “Use LOCALE only if you really need it.”
Unfortunately, PostgreSQL’s default locale
and encoding
settings depend on the operating system configuration, so C.UTF8
might not be the default. This leads many users to unknowingly misuse LOCALE, suffering significant performance penalties and missing out on certain database features.
TL;DR
- Always use
UTF8
character encoding andC
collation rules. - Using non-C collation rules can increase string comparison operation overhead by several to dozens of times, significantly impacting performance.
- Using non-C collation rules prevents
LIKE
queries from using regular indexes, creating potential pitfalls. - For instances using non-C collation rules, you can create indexes using
text_ops COLLATE "C"
ortext_pattern_ops
to supportLIKE
queries.
What is LOCALE?
We often see LOCALE
(locale) settings in operating systems and various software, but what exactly is it?
LOCALE support refers to applications adhering to cultural preferences, including alphabets, sorting, number formats, etc. A LOCALE consists of many rules and definitions:
LC_COLLATE |
String sorting order |
---|---|
LC_CTYPE |
Character classification (What is a character? Is its uppercase form equivalent?) |
LC_MESSAGES |
Language of messages |
LC_MONETARY |
Currency format |
LC_NUMERIC |
Number format |
LC_TIME |
Date and time format |
…… | Others…… |
A LOCALE is a set of rules, typically named using a language code + country code. For example, the LOCALE zh_CN
used in mainland China has two parts: zh
is the language code, and CN
is the country code. In the real world, one language might be used in multiple countries, and one country might have multiple languages. Taking Chinese and China as an example:
China (COUNTRY=CN
) related language LOCALE
s:
zh
: Chinese:zh_CN
bo
: Tibetan:bo_CN
ug
: Uyghur:ug_CN
Chinese-speaking (LANG=zh
) countries or regions:
CN
China:zh_CN
HK
Hong Kong:zh_HK
MO
Macau:zh_MO
TW
Taiwan:zh_TW
SG
Singapore:zh_SG
A LOCALE Example
Let’s look at a typical Locale definition file: Glibc’s zh_CN
Here’s a small excerpt that shows various format definitions - how months and weeks are named, how currency and decimal points are displayed, etc.
But there’s one crucial element here: LC_COLLATE
, the sorting method (Collation), which significantly impacts database behavior.
LC_CTYPE
copy "i18n"
translit_start
include "translit_combining";""
translit_end
class "hanzi"; /
<U4E00>..<U9FA5>;/
<UF92C>;<UF979>;<UF995>;<UF9E7>;<UF9F1>;<UFA0C>;<UFA0D>;<UFA0E>;/
<UFA0F>;<UFA11>;<UFA13>;<UFA14>;<UFA18>;<UFA1F>;<UFA20>;<UFA21>;/
<UFA23>;<UFA24>;<UFA27>;<UFA28>;<UFA29>
END LC_CTYPE
LC_COLLATE
copy "iso14651_t1_pinyin"
END LC_COLLATE
LC_TIME
% January, February, March, April, May, June, July, August, September, October, November, December
mon "<U4E00><U6708>";/
"<U4E8C><U6708>";/
"<U4E09><U6708>";/
"<U56DB><U6708>";/
...
% Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday
day "<U661F><U671F><U65E5>";/
"<U661F><U671F><U4E00>";/
"<U661F><U671F><U4E8C>";/
...
week 7;19971130;1
first_weekday 2
% %Y年%m月%d日 %A %H时%M分%S秒
d_t_fmt "%Y<U5E74>%m<U6708>%d<U65E5> %A %H<U65F6>%M<U5206>%S<U79D2>"
% %Y年%m月%d日
d_fmt "%Y<U5E74>%m<U6708>%d<U65E5>"
% %H时%M分%S秒
t_fmt "%H<U65F6>%M<U5206>%S<U79D2>"
% AM, PM
am_pm "<U4E0A><U5348>";"<U4E0B><U5348>"
% %p %I时%M分%S秒
t_fmt_ampm "%p %I<U65F6>%M<U5206>%S<U79D2>"
% %Y年 %m月 %d日 %A %H:%M:%S %Z
date_fmt "%Y<U5E74> %m<U6708> %d<U65E5> %A %H:%M:%S %Z"
END LC_TIME
LC_NUMERIC
decimal_point "."
thousands_sep ","
grouping 3
END LC_NUMERIC
LC_MONETARY
% ¥
currency_symbol "<UFFE5>"
int_curr_symbol "CNY "
For example, zh_CN
provides LC_COLLATE
using the iso14651_t1_pinyin
collation rule, which is a pinyin-based sorting rule.
Let’s demonstrate how LOCALE’s COLLATION affects PostgreSQL behavior with an example.
Collation Example
Create a table containing 7 Chinese characters and perform sorting operations.
CREATE TABLE some_chinese(
name TEXT PRIMARY KEY
);
INSERT INTO some_chinese VALUES
('阿'),('波'),('磁'),('得'),('饿'),('佛'),('割');
SELECT * FROM some_chinese ORDER BY name;
Execute the following SQL to sort the records using the default C
collation rule. Here, we can see that it’s actually sorting based on the ascii|unicode
code points.
vonng=# SELECT name, ascii(name) FROM some_chinese ORDER BY name COLLATE "C";
name | ascii
------+-------
佛 | 20315
割 | 21106
得 | 24471
波 | 27874
磁 | 30913
阿 | 38463
饿 | 39295
But this code-point-based sorting might be meaningless for Chinese users. For example, a Chinese dictionary wouldn’t use this sorting method. Instead, it would use the pinyin sorting rule used by zh_CN
, sorting by pinyin. Like this:
SELECT * FROM some_chinese ORDER BY name COLLATE "zh_CN";
name
------
阿
波
磁
得
饿
佛
割
We can see that sorting with the zh_CN
collation rule produces results in pinyin order abcdefg
, rather than the meaningless Unicode code point order.
Of course, this query result depends on the specific definition of the zh_CN
collation rule. Such collation rules aren’t defined by the database itself - the database only provides the C
collation (or its alias POSIX
). COLLATIONs typically come from either the operating system, glibc
, or third-party localization libraries (like icu
), so different actual definitions might produce different effects.
But at what cost?
The biggest negative impact of using non-C
or non-POSIX
LOCALE in PostgreSQL is:
Specific collation rules have a huge performance impact on operations involving string comparisons, and they also prevent LIKE
queries from using regular indexes.
Additionally, the C LOCALE is guaranteed by the database itself to work on any operating system and platform, while other LOCALEs aren’t, making non-C Locale less portable.
Performance Impact
Let’s consider an example using LOCALE collation rules. We have 1.5 million Apple Store app names and want to sort them according to different regional rules.
-- Create a table of app names, containing both Chinese and English
CREATE TABLE app(
name TEXT PRIMARY KEY
);
COPY app FROM '/tmp/app.csv';
-- View table statistics
SELECT
correlation, -- correlation coefficient 0.03542578, basically random distribution
avg_width, -- average length 25 bytes
n_distinct -- -1, meaning 1,508,076 records with no duplicates
FROM pg_stats WHERE tablename = 'app';
-- Run a series of experiments with different collation rules
SELECT * FROM app;
SELECT * FROM app order by name;
SELECT * FROM app order by name COLLATE "C";
SELECT * FROM app order by name COLLATE "en_US";
SELECT * FROM app order by name COLLATE "zh_CN";
The results are quite shocking - using C
and zh_CN
can differ by ten times:
# | Scenario | Time(ms) | Notes |
---|---|---|---|
1 | No sort | 180 | Uses index |
2 | order by name |
969 | Uses index |
3 | order by name COLLATE "C" |
1430 | Sequential scan, external sort |
4 | order by name COLLATE "en_US" |
10463 | Sequential scan, external sort |
5 | order by name COLLATE "zh_CN" |
14852 | Sequential scan, external sort |
Here’s the detailed execution plan for experiment 5. Even with sufficient memory configured, it still spills to disk for external sorting. However, all experiments with explicit LOCALE specification showed this behavior, allowing us to compare the performance difference between C and zh_CN
.
Another more comparative example is comparison operations.
Here, all strings in the table are compared with 'World'
, equivalent to performing 1.5 million specific rule comparisons on the table, without even involving disk I/O.
SELECT count(*) FROM app WHERE name > 'World';
SELECT count(*) FROM app WHERE name > 'World' COLLATE "C";
SELECT count(*) FROM app WHERE name > 'World' COLLATE "en_US";
SELECT count(*) FROM app WHERE name > 'World' COLLATE "zh_CN";
Even so, compared to C LOCALE
, zh_CN
still takes nearly 3 times longer.
# | Scenario | Time(ms) |
---|---|---|
1 | Default | 120 |
2 | C | 145 |
3 | en_US | 351 |
4 | zh_CN | 441 |
If sorting might be O(n²) comparisons with 10x overhead, then the 3x overhead for O(n) comparisons here roughly matches. We can draw a preliminary conclusion:
Compared to C
Locale, using zh_CN
or other Locales can cause several times additional performance overhead.
Besides performance issues, incorrect Locale can also lead to functional limitations.
Functional Limitations
Besides poor performance, another unacceptable issue is that using non-C
LOCALE prevents LIKE queries from using regular indexes.
Let’s use our previous experiment as an example. We’ll execute the following query on database instances using C
and en_US
as default LOCALE:
SELECT * FROM app WHERE name LIKE '中国%';
Find all apps starting with “中国” (China).
On a C-based database
This query can normally use the app_pkey
index, leveraging the ordered nature of the primary key B-tree to speed up the query, completing in about 2 milliseconds.
postgres@meta:5432/meta=# show lc_collate;
C
postgres@meta:5432/meta=# EXPLAIN SELECT * FROM app WHERE name LIKE '中国%';
QUERY PLAN
-----------------------------------------------------------------------------
Index Only Scan using app_pkey on app (cost=0.43..2.65 rows=1510 width=25)
Index Cond: ((name >= '中国'::text) AND (name < '中图'::text))
Filter: (name ~~ '中国%'::text)
(3 rows)
On an en_US-based database
We find that this query cannot use the index, performing a full table scan. The query degrades to 70 milliseconds, 30-40 times worse performance.
vonng=# show lc_collate;
en_US.UTF-8
vonng=# EXPLAIN SELECT * FROM app WHERE name LIKE '中国%';
QUERY PLAN
----------------------------------------------------------
Seq Scan on app (cost=0.00..29454.95 rows=151 width=25)
Filter: (name ~~ '中国%'::text)
Why?
Because index (B-tree index) construction is also based on ordering, which means equality and comparison operations.
However, LOCALE has its own set of rules for string equivalence. For example, the Unicode standard defines many counterintuitive equivalence rules (after all, it’s a universal language standard - like multiple characters combining to form a string equivalent to another single character, details in the Modern Character Encoding article).
Therefore, only the simplest C
LOCALE can perform pattern matching normally. The C LOCALE’s comparison rules are very simple - just compare character code points one by one, without any fancy tricks. So, if your database unfortunately uses a non-C LOCALE, then LIKE
queries cannot use default indexes.
Solution
For non-C LOCALE instances, only special types of indexes can support such queries:
CREATE INDEX ON app(name COLLATE "C");
CREATE INDEX ON app(name text_pattern_ops);
Here, using the text_pattern_ops
operator family to create an index can also support LIKE
queries. This is a special operator family for pattern matching that ignores LOCALE and directly performs pattern matching based on character-by-character comparison, which is the C LOCALE way.
Therefore, in this case, only indexes built on the text_pattern_ops
operator family, or indexes using COLLATE "C"'
on the default text_ops
, can support LIKE
queries.
vonng=# EXPLAIN ANALYZE SELECT * FROM app WHERE name LIKE '中国%';
Index Only Scan using app_name_idx on app (cost=0.43..1.45 rows=151 width=25) (actual time=0.053..0.731 rows=2360 loops=1)
Index Cond: ((name ~>=~ '中国'::text) AND (name ~<~ '中图'::text))
Filter: (name ~~ '中国%'::text COLLATE "en_US.UTF-8")
After creating the index, we can see that the original LIKE
query can use the index.
The issue of LIKE
not being able to use regular indexes might seem solvable by creating an additional text_pattern_ops
index. But this means that what could have been solved directly using the existing PRIMARY KEY
or UNIQUE
constraint’s built-in index now requires additional maintenance costs and storage space.
For developers unfamiliar with this issue, they might encounter performance issues in production because queries aren’t using indexes (e.g., if development uses C but production uses non-C LOCALE).
Compatibility
Suppose you’ve inherited a database already using non-C
LOCALE (this is quite common), and now that you know the dangers of using non-C LOCALE, you decide to change it back.
What should you watch out for? Specifically, Locale configuration affects the following PostgreSQL features:
- Queries using
LIKE
clauses. - Any queries relying on specific LOCALE collation rules, e.g., using pinyin sorting as result ordering.
- Queries using case conversion related functions:
upper
,lower
, andinitcap
. - The
to_char
function family, when formatting to local time. - Case-insensitive matching in regular expressions (
SIMILAR TO
,~
).
So, for any queries involving case conversion, always “explicitly specify Collation!”
If unsure, you can list all queries involving the following keywords using pg_stat_statements
for manual review:
LIKE|ILIKE -- Using pattern matching?
SIMILAR TO | ~ | regexp_xxx -- Using i option?
upper, lower, initcap -- Using for languages with case (Western characters)?
ORDER BY col -- When sorting by text columns, relying on specific collation? (e.g., pinyin)
Compatibility Modifications
Generally, C LOCALE is a superset of other LOCALE configurations in terms of functionality, and you can always switch from other LOCALEs to C. If your business doesn’t use these features, you usually don’t need to do anything. If you use localization features, you can always achieve the same effect in C LOCALE by explicitly specifying COLLATE
.
SELECT upper('a' COLLATE "zh_CN"); -- Perform case conversion based on zh_CN rules
SELECT '阿' < '波'; -- false, under default collation 阿(38463) > 波(27874)
SELECT '阿' < '波' COLLATE "zh_CN"; -- true, explicitly using Chinese pinyin collation: 阿(a) < 波(bo)
However, please note that collations provided by glibc - like “zh_CN” - aren’t necessarily stable. The PostgreSQL community’s recommended best practice is to use C
or POSIX
as the default collation.
Then use ICU as the collation provider, for example:
SELECT '阿' < '波' COLLATE "zh-x-icu"; -- true, explicitly using Chinese pinyin collation: 阿(a) < 波(bo)
- zh-x-icu: Roughly represents “Generic Chinese (unspecified simplified/traditional)”, using ICU rules.
- zh-Hans-x-icu: Represents Simplified Chinese (Hans = Han Simplified), ICU rules.
- zh-Hans-CN-x-icu: Simplified Chinese (Mainland China)
- zh-Hans-HK-x-icu: Simplified Chinese (Hong Kong)
- zh-Hant-x-icu: Represents Traditional Chinese (Hant = Han Traditional)
Overriding CTYPE
You can override CTYPE when performing case conversions:
SELECT
'é' AS original,
UPPER('é') AS upper_default, -- Using default Locale
UPPER('é' COLLATE "C") AS upper_en_c, -- C Locale doesn't handle these characters 'é'
UPPER('é' COLLATE PG_C_UTF8) AS upper_en_cutf8, -- C.UTF8 handles some case issues É
UPPER('é' COLLATE "en_US.UTF-8") AS upper_en_us; -- en_US.UTF8 converts to uppercase É
oid | collname | collnamespace | collowner | collprovider | collisdeterministic | collencoding | collcollate | collctype | colllocale | collicurules | collversion
-------+------------------+---------------+-----------+--------------+---------------------+--------------+--------------+--------------+------------+--------------+-------------
12888 | lzh_TW | 11 | 10 | c | t | 6 | lzh_TW | lzh_TW | NULL | NULL | 2.28
12889 | lzh_TW.utf8 | 11 | 10 | c | t | 6 | lzh_TW.utf8 | lzh_TW.utf8 | NULL | NULL | 2.28
13187 | zh_CN | 11 | 10 | c | t | 2 | zh_CN | zh_CN | NULL | NULL | 2.28
13188 | zh_CN.gb2312 | 11 | 10 | c | t | 2 | zh_CN.gb2312 | zh_CN.gb2312 | NULL | NULL | 2.28
13189 | zh_CN.utf8 | 11 | 10 | c | t | 6 | zh_CN.utf8 | zh_CN.utf8 | NULL | NULL | 2.28
13190 | zh_HK.utf8 | 11 | 10 | c | t | 6 | zh_HK.utf8 | zh_HK.utf8 | NULL | NULL | 2.28
13191 | zh_SG | 11 | 10 | c | t | 2 | zh_SG | zh_SG | NULL | NULL | 2.28
13192 | zh_SG.gb2312 | 11 | 10 | c | t | 2 | zh_SG.gb2312 | zh_SG.gb2312 | NULL | NULL | 2.28
13193 | zh_SG.utf8 | 11 | 10 | c | t | 6 | zh_SG.utf8 | zh_SG.utf8 | NULL | NULL | 2.28
13194 | zh_TW.euctw | 11 | 10 | c | t | 4 | zh_TW.euctw | zh_TW.euctw | NULL | NULL | 2.28
13195 | zh_TW.utf8 | 11 | 10 | c | t | 6 | zh_TW.utf8 | zh_TW.utf8 | NULL | NULL | 2.28
13349 | zh_CN | 11 | 10 | c | t | 6 | zh_CN.utf8 | zh_CN.utf8 | NULL | NULL | 2.28
13350 | zh_HK | 11 | 10 | c | t | 6 | zh_HK.utf8 | zh_HK.utf8 | NULL | NULL | 2.28
13351 | zh_SG | 11 | 10 | c | t | 6 | zh_SG.utf8 | zh_SG.utf8 | NULL | NULL | 2.28
13352 | zh_TW | 11 | 10 | c | t | 4 | zh_TW.euctw | zh_TW.euctw | NULL | NULL | 2.28
13353 | zh_TW | 11 | 10 | c | t | 6 | zh_TW.utf8 | zh_TW.utf8 | NULL | NULL | 2.28
14066 | zh-x-icu | 11 | 10 | i | t | -1 | NULL | NULL | zh | NULL | 153.80.32.1
14067 | zh-Hans-x-icu | 11 | 10 | i | t | -1 | NULL | NULL | zh-Hans | NULL | 153.80.32.1
14068 | zh-Hans-CN-x-icu | 11 | 10 | i | t | -1 | NULL | NULL | zh-Hans-CN | NULL | 153.80.32.1
14069 | zh-Hans-HK-x-icu | 11 | 10 | i | t | -1 | NULL | NULL | zh-Hans-HK | NULL | 153.80.32.1
14070 | zh-Hans-MO-x-icu | 11 | 10 | i | t | -1 | NULL | NULL | zh-Hans-MO | NULL | 153.80.32.1
14071 | zh-Hans-SG-x-icu | 11 | 10 | i | t | -1 | NULL | NULL | zh-Hans-SG | NULL | 153.80.32.1
14072 | zh-Hant-x-icu | 11 | 10 | i | t | -1 | NULL | NULL | zh-Hant | NULL | 153.80.32.1
14073 | zh-Hant-HK-x-icu | 11 | 10 | i | t | -1 | NULL | NULL | zh-Hant-HK | NULL | 153.80.32.1
14074 | zh-Hant-MO-x-icu | 11 | 10 | i | t | -1 | NULL | NULL | zh-Hant-MO | NULL | 153.80.32.1
14075 | zh-Hant-TW-x-icu | 11 | 10 | i | t | -1 | NULL | NULL | zh-Hant-TW | NULL | 153.80.32.1
The only known issue currently appears in the pg_trgm
extension.
https://www.pgevents.ca/events/pgconfdev2024/sessions/session/95/slides/26/pgcon24_collation.pdf