Collations#

What is a Collation?#

Collation decides how text is compared and sorted.

A collation defines how text is:

✅ sorted (ORDER BY)
✅ compared (=, <, >)
✅ matched (LIKE, indexes)

In short:

Collation = language + rules for text comparison

What Collation Controls#

Collation affects ONLY text (TEXT, VARCHAR, CHAR) and controls:

Sorting (ORDER BY)
Comparing (=, <, >)
Searching (LIKE)
Index behavior on text

It does NOT affect:

numbers
dates
storage size

Real-world analogy#

Thing	What it means
Dictionary	Has rules for word order
Language	Has accents, cases, symbols
Collation	Dictionary rules for the database

Simple SQL example

Table

CREATE TABLE names (name TEXT);

Data

INSERT INTO names VALUES
('apple'),
('Banana'),
('Álvaro');

🔴 Without thinking about collation

SELECT name FROM names ORDER BY name;

Result may be:

Banana
apple
Álvaro

Why?

Capital letters first
Accented letters last

That’s one collation rule.

🟢 With a different collation

SELECT name
FROM names
ORDER BY name COLLATE "en_US.UTF-8";

Result becomes more human-friendly.

Best practice

For most apps:

Keep database default collation
Use C collation for:
- emails
- usernames
- codes (fast & predictable)
Use language collation for:
- names
- descriptions

Example

SELECT 'apple' = 'Apple';

Result depends on collation:

Some collations → false
Case-insensitive collation → true

That’s all collation does.

Summary

Collation = text comparison rules
Controls sorting & comparing
Comes from language rules
You usually don’t touch it
You only customize it when needed

Pashto is special#

Pashto:

Uses Arabic script
Has extra letters (ټ ړ ږ ښ ځ etc.)
Reads right-to-left
Sorting rules are different from English

So ASCII / C collation is NOT OK

Unicode-aware collation IS required

What you MUST do

1️⃣ Use UTF-8 database

SHOW server_encoding;
-- UTF8

2️⃣ Use ICU collation

Modern PostgreSQL (v10+) supports ICU, which handles:

Arabic script
Pashto letters
Unicode ordering
Case & accent rules

Check ICU support:

SELECT DISTINCT collprovider
FROM pg_collation;

Possible results

If you see:

collprovider
-------------
c
i
d

Then:

Value	Meaning
c	libc (en_US.UTF-8)
i	ICU (und-x-icu) or (ar-x-icu)
d	default Usually points to a libc or ICU

If i exists → ICU is available on your system

Side-by-side comparison

Feature	ar-x-icu	und-x-icu
Language-specific	Arabic family	Mixed / neutral
Script awareness	Arabic script	All scripts
Pashto support	✅ Very good	✅ Good
Chinese support	❌ Not ideal	✅ Good
English support	✅	✅
Best for	Arabic-family text	Multilingual apps

Do you need ICU for Pashto?

Short answer:: Not strictly, but recommended for correctness

Scenario	Need ICU?
Store Pashto text	❌
Display Pashto text	❌
Sort Pashto alphabetically	✅
Search linguistically	✅
Case-insensitive search	✅

Best practice create ICU-based DB (recommended)

CREATE DATABASE pashto_db
WITH
ENCODING = 'UTF8'
LOCALE_PROVIDER = icu
ICU_LOCALE = 'und';

und = Unicode default (safe for Pashto)
Works perfectly for RTL languages

Best practice for Pashto tables (recommended)

Option A — Use database default (most cases)

If your database was created with UTF-8 and ICU:

CREATE TABLE pashto_words (
    word TEXT
);

✔ Works
✔ Safe
✔ No extra setup
✔ Recommended for 90% cases

Option B — Explicit Pashto-aware collation (advanced but clean)

If you want full control:

CREATE COLLATION ps_af (
    provider = icu,
    locale = 'ps-AF',
    deterministic = false
);

Then use it:

CREATE TABLE pashto_words (
    word TEXT COLLATE ps_af
);

Or per query:

SELECT word
FROM pashto_words
ORDER BY word COLLATE ps_af;

Create a pashto_words table now

CREATE TABLE pashto_words (
    word TEXT
);

Insert data:

INSERT INTO pashto_words (word) VALUES
('افغان'),
('پښتو'),
('ټولنه'),
('ښار'),
('کتاب'),
('ګل'),
('کور');

❌ WRONG: ASCII / C collation

SELECT word
FROM pashto_words
ORDER BY word COLLATE "C";

What happens (WRONG behavior)

PostgreSQL compares raw byte values
Arabic letters are ordered by Unicode codepoints
RTL is ignored
Pashto-specific order is ignored

Example result (❌ meaningless):

word
-------
افغان
ټولنه
پښتو
ښار
کتاب
کور
ګل
(7 rows)

➡️ This order does NOT follow Pashto alphabet rules
➡️ It’s just computer byte math

✅ CORRECT: ICU Pashto collation

First, create (or use existing) Pashto collation:

CREATE COLLATION ps_af (
    provider = icu,
    locale = 'ps-AF'
);

Now sort:

SELECT word
FROM pashto_words
ORDER BY word COLLATE ps_af;

Correct Pashto order (✅)

word
-------
افغان
پښتو
ټولنه
ښار
کتاب
کور
ګل
(7 rows)

✔ Respects Pashto alphabet
✔ Knows Arabic script
✔ Correct RTL handling
✔ Correct dictionary order

Super simple rule to remember

English-only → default collation OK

Pashto / Arabic → ICU collation REQUIRED

But:: ICU is already built-in → you usually don’t need to do anything extra.

1️⃣ What is UTF-8?

UTF-8 = Universal text encoding

It answers the question:

How are characters saved in bytes?

UTF-8 supports:

English (A–Z)
Pashto (پښتو)
Arabic (العربية)
Chinese (中文)
Emojis 😄🔥
Everything

Example

Hello
پښتو
سلام
🙂

All of these can exist in one column because UTF-8 supports them.

In PostgreSQL

SHOW server_encoding;

Result:

UTF8

This means database can store Pashto correctly

2️⃣ What is ICU?

ICU = International Components for Unicode

It answers a different question:: How should text be compared, sorted, and searched?

This is where language rules matter.

Why UTF-8 alone is NOT enough

UTF-8 only stores characters.

It does NOT know:

Which letter comes first
How Pashto letters should be ordered
Case rules
Accent rules
RTL languages

That’s where ICU comes in.

3️⃣ What ICU actually does

ICU understands:

Alphabet order per language
Case sensitivity
Accent sensitivity
Unicode normalization
RTL languages

Example (Pashto letters)

Pashto letters:

ا ب پ ت ټ ث ج چ

ICU knows:

پ comes after ب
ټ is different from ت
Arabic rules ≠ English rules

Without ICU → sorting breaks With ICU → sorting works

Collations

Contents

Collations#

What is a Collation?#

What Collation Controls#

Real-world analogy#

Pashto is special#