collations Collations#

What is a Collation?#

Collation decides how text is compared and sorted.

A collation defines how text is:
  • ✅ sorted (ORDER BY)

  • ✅ compared (=, <, >)

  • ✅ matched (LIKE, indexes)

In short:

Collation = language + rules for text comparison


What Collation Controls#

Collation affects ONLY text (TEXT, VARCHAR, CHAR) and controls:
  1. Sorting (ORDER BY)

  2. Comparing (=, <, >)

  3. Searching (LIKE)

  4. Index behavior on text

It does NOT affect:
  • numbers

  • dates

  • storage size


Real-world analogy#

Thing

What it means

Dictionary

Has rules for word order

Language

Has accents, cases, symbols

Collation

Dictionary rules for the database


Simple SQL example

Table

CREATE TABLE names (name TEXT);

Data

INSERT INTO names VALUES
('apple'),
('Banana'),
('Álvaro');

🔴 Without thinking about collation

SELECT name FROM names ORDER BY name;

Result may be:

Banana
apple
Álvaro
Why?
  • Capital letters first

  • Accented letters last

That’s one collation rule.


🟢 With a different collation

SELECT name
FROM names
ORDER BY name COLLATE "en_US.UTF-8";

Result becomes more human-friendly.


Best practice

For most apps:
  • Keep database default collation

  • Use C collation for:
    • emails

    • usernames

    • codes (fast & predictable)

  • Use language collation for:
    • names

    • descriptions

Example

SELECT 'apple' = 'Apple';
Result depends on collation:
  • Some collations → false

  • Case-insensitive collation → true

That’s all collation does.


Summary
  • Collation = text comparison rules

  • Controls sorting & comparing

  • Comes from language rules

  • You usually don’t touch it

  • You only customize it when needed


Pashto is special#

Pashto:
  • Uses Arabic script

  • Has extra letters (ټ ړ ږ ښ ځ etc.)

  • Reads right-to-left

  • Sorting rules are different from English

So ASCII / C collation is NOT OK

Unicode-aware collation IS required


What you MUST do

1️⃣ Use UTF-8 database

SHOW server_encoding;
-- UTF8

2️⃣ Use ICU collation

Modern PostgreSQL (v10+) supports ICU, which handles:
  • Arabic script

  • Pashto letters

  • Unicode ordering

  • Case & accent rules


Check ICU support:

SELECT DISTINCT collprovider
FROM pg_collation;

Possible results

If you see:

collprovider
-------------
c
i
d

Then:

Value

Meaning

c

libc (en_US.UTF-8)

i

ICU (und-x-icu) or (ar-x-icu)

d

default Usually points to a libc or ICU

If i exists → ICU is available on your system


Side-by-side comparison

Feature

ar-x-icu

und-x-icu

Language-specific

Arabic family

Mixed / neutral

Script awareness

Arabic script

All scripts

Pashto support

✅ Very good

✅ Good

Chinese support

❌ Not ideal

✅ Good

English support

Best for

Arabic-family text

Multilingual apps


Do you need ICU for Pashto?

Short answer:

Not strictly, but recommended for correctness

Scenario

Need ICU?

Store Pashto text

Display Pashto text

Sort Pashto alphabetically

Search linguistically

Case-insensitive search

Best practice create ICU-based DB (recommended)

CREATE DATABASE pashto_db
WITH
ENCODING = 'UTF8'
LOCALE_PROVIDER = icu
ICU_LOCALE = 'und';
  • und = Unicode default (safe for Pashto)

  • Works perfectly for RTL languages


Best practice for Pashto tables (recommended)

Option A — Use database default (most cases)

If your database was created with UTF-8 and ICU:

CREATE TABLE pashto_words (
    word TEXT
);
  • ✔ Works

  • ✔ Safe

  • ✔ No extra setup

  • ✔ Recommended for 90% cases


Option B — Explicit Pashto-aware collation (advanced but clean)

If you want full control:

CREATE COLLATION ps_af (
    provider = icu,
    locale = 'ps-AF',
    deterministic = false
);

Then use it:

CREATE TABLE pashto_words (
    word TEXT COLLATE ps_af
);

Or per query:

SELECT word
FROM pashto_words
ORDER BY word COLLATE ps_af;

Create a pashto_words table now

CREATE TABLE pashto_words (
    word TEXT
);

Insert data:

INSERT INTO pashto_words (word) VALUES
('افغان'),
('پښتو'),
('ټولنه'),
('ښار'),
('کتاب'),
('ګل'),
('کور');

❌ WRONG: ASCII / C collation

SELECT word
FROM pashto_words
ORDER BY word COLLATE "C";
What happens (WRONG behavior)
  • PostgreSQL compares raw byte values

  • Arabic letters are ordered by Unicode codepoints

  • RTL is ignored

  • Pashto-specific order is ignored

Example result (❌ meaningless):

word
-------
افغان
ټولنه
پښتو
ښار
کتاب
کور
ګل
(7 rows)
  • ➡️ This order does NOT follow Pashto alphabet rules

  • ➡️ It’s just computer byte math


✅ CORRECT: ICU Pashto collation

First, create (or use existing) Pashto collation:

CREATE COLLATION ps_af (
    provider = icu,
    locale = 'ps-AF'
);

Now sort:

SELECT word
FROM pashto_words
ORDER BY word COLLATE ps_af;

Correct Pashto order (✅)

word
-------
افغان
پښتو
ټولنه
ښار
کتاب
کور
ګل
(7 rows)
  • ✔ Respects Pashto alphabet

  • ✔ Knows Arabic script

  • ✔ Correct RTL handling

  • ✔ Correct dictionary order


Super simple rule to remember

English-only → default collation OK

Pashto / Arabic → ICU collation REQUIRED

But:

ICU is already built-in → you usually don’t need to do anything extra.


1️⃣ What is UTF-8?

UTF-8 = Universal text encoding

It answers the question:

How are characters saved in bytes?

UTF-8 supports:
  • English (A–Z)

  • Pashto (پښتو)

  • Arabic (العربية)

  • Chinese (中文)

  • Emojis 😄🔥

  • Everything

Example

Hello
پښتو
سلام
🙂

All of these can exist in one column because UTF-8 supports them.

In PostgreSQL

SHOW server_encoding;

Result:

UTF8

This means database can store Pashto correctly


2️⃣ What is ICU?

ICU = International Components for Unicode

It answers a different question:

How should text be compared, sorted, and searched?

This is where language rules matter.


Why UTF-8 alone is NOT enough

UTF-8 only stores characters.

It does NOT know:
  • Which letter comes first

  • How Pashto letters should be ordered

  • Case rules

  • Accent rules

  • RTL languages

That’s where ICU comes in.


3️⃣ What ICU actually does

ICU understands:
  • Alphabet order per language

  • Case sensitivity

  • Accent sensitivity

  • Unicode normalization

  • RTL languages

Example (Pashto letters)

Pashto letters:

ا ب پ ت ټ ث ج چ
ICU knows:
  • پ comes after ب

  • ټ is different from ت

  • Arabic rules ≠ English rules

Without ICU → sorting breaks With ICU → sorting works