Monday, 28 May 2018

Column-Level Collation and Case-Insensitive Database in Oracle Database 12c

Collation determines how strings are compared, which has a direct impact on ordering (sorting) and equality tests between strings. Oracle Database 12c Release 2 (12.2) lets you specify the collation used for columns that hold string data, allowing you to easily perform case insensitive queries, as well as control the output order of queried data.

  • Assumptions

This article assumes the following.

1. Your are using Oracle Database 12c Release 2 (12.2).
2. Your MAX_STRING_SIZE parameter is set to EXTENDED. You can see how to do this here. Without this you will see this error, "ORA-43929: Collation cannot be specified if parameter MAX_STRING_SIZE=STANDARD is set."
3. Your COMPATIBLE parameter is set to 12.2 or higher.
4. The database character set is AL32UTF8, NLS_LANGUAGE=AMERICAN and NLS_TERRITORY=AMERICA. Other setting can be used, but the output may not resemble that shown in this article.

  • Default Behaviour

To understand the impact of collation, it's important to understand the default behaviour of an Oracle database. Before we start using UTF data we need to make sure SQL*Plus will handle it properly. We can do this by setting the NLS_LANG environment variable correctly before starting SQL*Plus.

# set NLS_LANG=AMERICAN_AMERICA.UTF8
export NLS_LANG=AMERICAN_AMERICA.UTF8

sqlplus test/test@pdb1

Create the following test table and populate it with some data containing special characters. Notice the variation in the first letter of the first name.

DROP TABLE t1 PURGE;

CREATE TABLE t1 (
  id          NUMBER,
  company     VARCHAR2(15 CHAR),
  location    VARCHAR2(15 CHAR),
  CONSTRAINT t1_pk PRIMARY KEY (id)
);

INSERT INTO t1 VALUES (1, 'Löwenbrauerei', 'Bräunlingen');
INSERT INTO t1 VALUES (2, 'LÖwenbrauerei', 'BrÄunlingen');
INSERT INTO t1 VALUES (3, 'Lowenbrauerei', 'Braunlingen');
INSERT INTO t1 VALUES (4, 'LOwenbrauerei', 'BrAunlingen');
COMMIT;

The following query shows how Oracle treats special characters with regards to sort operations. We can see upper case comes before lower case in the sort order, and the regular characters come before the special characters.

COLUMN company FORMAT A15
COLUMN location FORMAT A15

SELECT *
FROM   t1
ORDER BY company;

ID COMPANY    LOCATION
---------- --------------- ---------------
4 LOwenbrauerei   BrAunlingen
3 Lowenbrauerei   Braunlingen
         2 LÖwenbrauerei   BrÄunlingen
         1 Löwenbrauerei   Bräunlingen

SQL>

The following query shows how Oracle treats special characters in comparisons. We can see an exact match is made.

SELECT *
FROM   t1
WHERE  company = 'Löwenbrauerei';

ID COMPANY    LOCATION
---------- --------------- ---------------
         1 Löwenbrauerei   Bräunlingen

SQL>


SELECT *
FROM   t1
WHERE  company LIKE '%ö%';

ID COMPANY    LOCATION
---------- --------------- ---------------
         1 Löwenbrauerei   Bräunlingen

SQL>

The following query shows how Oracle treats special characters in group operations. Once again, the variants on the first character are treated separately.

SELECT company, COUNT(*) AS amount
FROM   t1
GROUP BY company
ORDER BY company;

COMPANY         AMOUNT
---------------     ----------
LOwenbrauerei 1
Lowenbrauerei 1
LÖwenbrauerei           1
Löwenbrauerei            1

SQL>

In the following examples, think back to this default behaviour.

  • Setting Collation

Collation can be set at a number of different levels, which will be demonstrated below. Each section will not repeat all possible syntax variations or collations as that would be too repetitive.

There are two basic types of collation.

1. Binary : Ordering and comparisons of string data are based on the numeric value of the characters in the strings.
2. Linguistic : Ordering and comparisons of string data are based on the alphabetic sequence of the characters, regardless of their numeric values.

When using collations there are three suffixes that alter the behaviour of sorts and comparisons.

1. "_CI" : Case insensitive, but accent sensitive.
2. "_AI" : Both case and accent insensitive.
3. "_CS" : Both case and accent sensitive. This is default if no extension is used.

If no collation is specified, directly or via a default setting, the default USING_NLS_COMP pseudo-collation is used, which means the NLS_SORT and NLS_COMP parameters are used to determine the actual collation used.

The only supported collation for CLOB and NCLOB columns is the USING_NLS_COMP pseudo-collation.
  • Column-Level
The collation of a specific column can be defined when the table is created. In the following example we set the collation of the COMPANY column to BINARY_CI, which will make sorts and comparisons of that columns data case insensitive, but will still treat special characters as separate.

DROP TABLE t1 PURGE;

CREATE TABLE t1 (
  id          NUMBER,
  company     VARCHAR2(15 CHAR) COLLATE BINARY_CI,
  CONSTRAINT t1_pk PRIMARY KEY (id)
);

INSERT INTO t1 VALUES (1, 'Löwenbrauerei');
INSERT INTO t1 VALUES (2, 'LÖwenbrauerei');
INSERT INTO t1 VALUES (3, 'Lowenbrauerei');
INSERT INTO t1 VALUES (4, 'LOwenbrauerei');
COMMIT;

We can see the collation has made a difference if we query data in the column.

SELECT company, COUNT(*) AS amount
FROM   t1
GROUP BY company
ORDER BY company;

COMPANY     AMOUNT
--------------- ----------
Lowenbrauerei       2
Löwenbrauerei          2

SQL>

SELECT *
FROM   t1
WHERE  company LIKE '%ö%';

ID COMPANY
---------- ---------------
         1 Löwenbrauerei
         2 LÖwenbrauerei

SQL>
We can also set the collation of new columns added to an existing table. In the following example we set the collation of the LOCATION column to BINARY_AI, which will make sorts and comparisons of that columns data both case insensitive and accent insensitive.

ALTER TABLE t1 ADD (
  location  VARCHAR2(15 CHAR) COLLATE BINARY_AI
);


UPDATE t1 SET location = 'Bräunlingen' WHERE id = 1;
UPDATE t1 SET location = 'BrÄunlingen' WHERE id = 2;
UPDATE t1 SET location = 'Braunlingen' WHERE id = 3;
UPDATE t1 SET location = 'BrAunlingen' WHERE id = 4;
COMMIT;
The output below shows the impact of the collation on this new column.

SELECT location, COUNT(*) AS amount
FROM   t1
GROUP BY location
ORDER BY location;

LOCATION     AMOUNT
--------------- ----------
Bräunlingen            4

SQL>

SELECT *
FROM   t1
WHERE  location LIKE '%ä%';

ID COMPANY    LOCATION
---------- --------------- ---------------
         1 Löwenbrauerei   Bräunlingen
         2 LÖwenbrauerei   BrÄunlingen
3 Lowenbrauerei   Braunlingen
4 LOwenbrauerei   BrAunlingen

SQL>

We can see the collation of the columns using the {DBA|ALL|USER}_TAB_COLUMNS views.

COLUMN column_name FORMAT A30
COLUMN collation FORMAT A20

SELECT column_id,
       column_name,
       collation
FROM   user_tab_columns
WHERE  table_name = 'T1'
ORDER BY column_id;

 COLUMN_ID COLUMN_NAME           COLLATION
---------- --------------------------- --------------------
1   ID
2   COMPANY   BINARY_CI
3   LOCATION   BINARY_AI

SQL>
  • Table-Level
The default collation for the whole table can be defined when the table is created.

DROP TABLE t1 PURGE;

CREATE TABLE t1 (
  id          NUMBER,
  company     VARCHAR2(15 CHAR),
  CONSTRAINT t1_pk PRIMARY KEY (id)
)
DEFAULT COLLATION BINARY_CI;

The table default collation can be changed using the ALTER TABLE statement, but this only affects new columns added to table, not existing columns.

ALTER TABLE t1 DEFAULT COLLATION BINARY_AI;

ALTER TABLE t1 ADD (
  location  VARCHAR2(15 CHAR)
);

By displaying the column level collation we can see the existing column was not affected by the change in the default collation of the table.

COLUMN column_name FORMAT A30
COLUMN collation FORMAT A20

SELECT column_id,
       column_name,
       collation
FROM   user_tab_columns
WHERE  table_name = 'T1'
ORDER BY column_id;

 COLUMN_ID COLUMN_NAME  COLLATION
---------- ------------------------------ --------------------
1   ID
2   COMPANY   BINARY_CI
3   LOCATION   BINARY_AI

SQL>

The default table collation is displayed using the {DBA|ALL|USER}_TABLES views.

SELECT default_collation
FROM   user_tables
WHERE  table_name = 'T1';

DEFAULT_COLLATION
--------------------------------------------------------------------------------
BINARY_AI

SQL>
  • Schema-Level
The default collation for a schema can be defined when the user is created.

CONN / AS SYSDBA
ALTER SESSION SET CONTAINER = pdb1;

CREATE USER test2 IDENTIFIED BY test2
  DEFAULT TABLESPACE users
  QUOTA UNLIMITED ON users
  DEFAULT COLLATION BINARY_CI;

GRANT CREATE SESSION, CREATE TABLE to test2;

If we connect to the user and create a table we can see the default setting is used to define the default table collation.

CONN test2/test2@pdb1

DROP TABLE t1 PURGE;

CREATE TABLE t1 (
  id          NUMBER,
  company     VARCHAR2(15 CHAR),
  CONSTRAINT t1_pk PRIMARY KEY (id)
);


SELECT default_collation
FROM   user_tables
WHERE  table_name = 'T1';

DEFAULT_COLLATION
--------------------------------------------------------------------------------
BINARY_CI

SQL>

The default schema collation is changed using the ALTER USER command, but this doesn't affect the default collation of any existing objects.

ALTER USER test2 DEFAULT COLLATION BINARY_AI;

If we add another column to the existing table you may think it will use the new default schema collation, but you would be wrong.

ALTER TABLE t1 ADD (
  location  VARCHAR2(15 CHAR)
);


COLUMN column_name FORMAT A30
COLUMN collation FORMAT A20

SELECT column_id,
       column_name,
       collation
FROM   user_tab_columns
WHERE  table_name = 'T1'
ORDER BY column_id;

 COLUMN_ID COLUMN_NAME  COLLATION
---------- ------------------------------ --------------------
1   ID
2   COMPANY   BINARY_CI
3   LOCATION   BINARY_CI

SQL>

Remember, changes to the default schema collation do not affect existing objects, so the default table collation is unchanged.

SELECT default_collation
FROM   user_tables
WHERE  table_name = 'T1';

DEFAULT_COLLATION
--------------------------------------------------------------------------------
BINARY_CI

SQL>
  • Session-Level
There are two ways to set the default collation for a session. The DEFAULT_COLLATION parameter can be set at session level.

CONN test/test@pdb1
ALTER SESSION SET DEFAULT_COLLATION=BINARY_CI;

SELECT SYS_CONTEXT('USERENV', 'SESSION_DEFAULT_COLLATION') FROM DUAL;

SYS_CONTEXT('USERENV','SESSION_DEFAULT_COLLATION')
--------------------------------------------------------------------------------
BINARY_CI

SQL>


DROP TABLE t1 PURGE;

CREATE TABLE t1 (
  id          NUMBER,
  company     VARCHAR2(15 CHAR),
  location    VARCHAR2(15 CHAR),
  CONSTRAINT t1_pk PRIMARY KEY (id)
);


SELECT column_id, column_name, collation
FROM   user_tab_columns
WHERE  table_name = 'T1'
ORDER BY column_id;

 COLUMN_ID COLUMN_NAME   COLLATION
---------- ------------------------------ --------------------
1 ID
2 COMPANY   BINARY_CI
3 LOCATION   BINARY_CI

SQL>


SELECT default_collation
FROM   user_tables
WHERE  table_name = 'T1';

DEFAULT_COLLATION
--------------------------------------------------------------------------------
BINARY_CI

SQL>

Alternatively the NLS_SORT and NLS_COMP parameters can be set at session level, making them the defaults for the session.

CONN test/test@pdb1
ALTER SESSION SET NLS_SORT=BINARY_CI;
ALTER SESSION SET NLS_COMP=LINGUISTIC;


DROP TABLE t1 PURGE;

CREATE TABLE t1 (
  id          NUMBER,
  company     VARCHAR2(10),
  location    VARCHAR2(10),
  CONSTRAINT t1_pk PRIMARY KEY (id)
);


SELECT column_id, column_name, collation
FROM   user_tab_columns
WHERE  table_name = 'T1'
ORDER BY column_id;

 COLUMN_ID COLUMN_NAME            COLLATION
---------- -------------------------- --------------------
1 ID
2 COMPANY   USING_NLS_COMP
3 LOCATION   USING_NLS_COMP

SQL>


SELECT default_collation
FROM   user_tables
WHERE  table_name = 'T1';

DEFAULT_COLLATION
------------------------------------------------------------------
USING_NLS_COMP

SQL>

Notice the default references to the USING_NLS_COMP pseudo-collation, which means the NLS_SORT and NLS_COMP parameters are used to determine the actual collation used.
  • Database-Level
If the NLS_SORT and NLS_SORT parameters are set at the CDB level they represent the default values for the CDB and all associated PDBs.

CONN / AS SYSDBA

-- Alter the CDB defaults.
ALTER SYSTEM SET NLS_SORT=BINARY_CI SCOPE=SPFILE;
ALTER SYSTEM SET NLS_COMP=LINGUISTIC SCOPE=SPFILE;
SHUTDOWN IMMEDIATE;
STARTUP;


SHOW PARAMETER NLS_SORT

NAME                                 TYPE        VALUE
------------------------------- ----------- ------------------------------
nls_sort                             string      BINARY_CI
SQL>

SHOW PARAMETER NLS_COMP

NAME                                 TYPE        VALUE
------------------------------- ----------- ------------------------------
nls_comp                             string      LINGUISTIC
SQL>


-- Switch to the PDB.
ALTER SESSION SET CONTAINER = pdb1;

SHOW PARAMETER NLS_SORT

NAME                                 TYPE        VALUE
------------------------------- ----------- ------------------------------
nls_sort                             string      BINARY_CI
SQL>

SHOW PARAMETER NLS_COMP

NAME                                 TYPE        VALUE
-------------------------------- ----------- ------------------------------
nls_comp                             string      LINGUISTIC
SQL>

If the parameters are set at the PDB level they override the CDB settings.

CONN / AS SYSDBA
ALTER SESSION SET CONTAINER = pdb1;

-- Alter the PDB defaults.
ALTER SYSTEM SET NLS_SORT=BINARY SCOPE=SPFILE;
ALTER SYSTEM SET NLS_COMP=BINARY SCOPE=SPFILE;
SHUTDOWN IMMEDIATE;
STARTUP;


SHOW PARAMETER NLS_SORT

NAME                                 TYPE        VALUE
------------------------------ ----------- ------------------------------
nls_sort                             string      BINARY
SQL>

SHOW PARAMETER NLS_COMP

NAME                                 TYPE        VALUE
------------------------------ ----------- ------------------------------
nls_sort                             string      BINARY
SQL>

Remember to reset these values if you don't want any unexpected behaviour in your instance.

CONN / AS SYSDBA
ALTER SYSTEM RESET NLS_SORT SCOPE=SPFILE;
ALTER SYSTEM RESET NLS_COMP SCOPE=SPFILE;

ALTER SESSION SET CONTAINER = pdb1;

ALTER SYSTEM RESET NLS_SORT SCOPE=SPFILE;
ALTER SYSTEM RESET NLS_COMP SCOPE=SPFILE;

CONN / AS SYSDBA
SHUTDOWN IMMEDIATE;
STARTUP;
  • Statement-Level
There are a number of ways to influence the results of statements, regardless of the collation associated with the columns. The following examples show how you might use the COLLATE operator and NLSSORT function.

First we need a new session and a new table with no collation defined.

CONN test/test@pdb1

DROP TABLE t1 PURGE;

CREATE TABLE t1 (
  id          NUMBER,
  company     VARCHAR2(15 CHAR),
  location    VARCHAR2(15 CHAR),
  CONSTRAINT t1_pk PRIMARY KEY (id)
);

INSERT INTO t1 VALUES (1, 'Löwenbrauerei', 'Bräunlingen');
INSERT INTO t1 VALUES (2, 'LÖwenbrauerei', 'BrÄunlingen');
INSERT INTO t1 VALUES (3, 'Lowenbrauerei', 'Braunlingen');
INSERT INTO t1 VALUES (4, 'LOwenbrauerei', 'BrAunlingen');
COMMIT;

Now we can try a few examples of using the COLLATE operator and NLSSORT function.

SELECT *
FROM   t1
ORDER BY company COLLATE BINARY_CI;

ID COMPANY    LOCATION
---------- --------------- ---------------
3 Lowenbrauerei   Braunlingen
4 LOwenbrauerei   BrAunlingen
         1 Löwenbrauerei   Bräunlingen
         2 LÖwenbrauerei   BrÄunlingen

SQL>


SELECT *
FROM   t1
ORDER BY NLSSORT(company, 'NLS_SORT=BINARY_AI');

ID COMPANY    LOCATION
---------- --------------- ---------------
         1 Löwenbrauerei   Bräunlingen
4 LOwenbrauerei   BrAunlingen
3 Lowenbrauerei   Braunlingen
         2 LÖwenbrauerei   BrÄunlingen

SQL>


COLUMN sorted_name FORMAT A15

SELECT company COLLATE BINARY_AI AS sorted_name,
       COUNT(*) AS amount
FROM   t1
GROUP BY company COLLATE BINARY_AI
ORDER BY 1;

SORTED_NAME    AMOUNT
--------------- ----------
Löwenbrauerei          4

SQL>


COLUMN sorted_name FORMAT A15

SELECT NLSSORT(company, 'NLS_SORT=BINARY_AI') AS sorted_name,
       COUNT(*) AS amount
FROM   t1
GROUP BY NLSSORT(company, 'NLS_SORT=BINARY_AI')
ORDER BY 1;

SORTED_NAME    AMOUNT
---------------   ----------
6C6F77656E62726 4

SQL>


SELECT *
FROM   t1
WHERE SUBSTR(company,2,1) COLLATE BINARY_AI = 'ö' COLLATE BINARY_AI
ORDER BY company;

ID COMPANY    LOCATION
---------- --------------- ---------------
4 LOwenbrauerei   BrAunlingen
3 Lowenbrauerei   Braunlingen
         2 LÖwenbrauerei   BrÄunlingen
         1 Löwenbrauerei   Bräunlingen

SQL>


SELECT *
FROM   t1
WHERE NLSSORT(SUBSTR(company,2,1), 'NLS_SORT=BINARY_AI') = NLSSORT('ö', 'NLS_SORT=BINARY_AI')
ORDER BY company;

ID COMPANY    LOCATION
---------- --------------- ---------------
4 LOwenbrauerei   BrAunlingen
3 Lowenbrauerei   Braunlingen
         2 LÖwenbrauerei   BrÄunlingen
         1 Löwenbrauerei   Bräunlingen

SQL>

  • Linguistic Indexes

Any index created on a column using a named collation is actually a function-based index, but you can create additional indexes with alternative collations.

CREATE INDEX t1_company_1_idx ON t1(NLSSORT(company, 'NLS_SORT=GERMAN_AI'));
CREATE INDEX t1_company_2_idx ON t1(NLSSORT(company, 'NLS_SORT=FRENCH_AI'));

CREATE INDEX t1_location_idx ON t1(location COLLATE BINARY_AI);

  • Views and Materialized Views

Views and materialized views can take advantage of collation. The following example creates a clean table, a view that uses collation, then queries the view to shown the collation in action.

CONN test/test@pdb1

DROP TABLE t1 PURGE;

CREATE TABLE t1 (
  id          NUMBER,
  company     VARCHAR2(15 CHAR),
  location    VARCHAR2(15 CHAR),
  CONSTRAINT t1_pk PRIMARY KEY (id)
);

INSERT INTO t1 VALUES (1, 'Löwenbrauerei', 'Bräunlingen');
INSERT INTO t1 VALUES (2, 'LÖwenbrauerei', 'BrÄunlingen');
INSERT INTO t1 VALUES (3, 'Lowenbrauerei', 'Braunlingen');
INSERT INTO t1 VALUES (4, 'LOwenbrauerei', 'BrAunlingen');
COMMIT;


-- Create a view using collation.
CREATE OR REPLACE VIEW t1_binary_ai_v (
  id,
  company,
  location
)
AS
SELECT id,
       company COLLATE BINARY_AI,
       location COLLATE BINARY_AI
FROM   t1;


--Query using collation-based sort.
SELECT *
FROM   t1_binary_ai_v
ORDER BY company;

ID COMPANY    LOCATION
---------- --------------- ---------------
         1 Löwenbrauerei   Bräunlingen
4 LOwenbrauerei   BrAunlingen
3 Lowenbrauerei   Braunlingen
         2 LÖwenbrauerei   BrÄunlingen

SQL>

-- Query using collation-based group by.
SELECT company, COUNT(*) AS amount
FROM   t1_binary_ai_v
GROUP BY company
ORDER BY company;

COMPANY      AMOUNT
---------------  ----------
Löwenbrauerei          4

SQL>