Collation usage in SQL Server
Chip Hartney (The Datamology Company)
First written: February 10, 2023
Last updated: January 18, 2024
It is critical that we manage character data robustly and consistently in an enterprise data warehouse (EDW). This is done, in large part, through the use of an appropriate collation.
Note that many SQL Server databases utilize the SQL_Latin1_General_CP1_CI_AS collation, largely because it was the default collation for a very long time. But that collation does not support Unicode data in CHAR and VARCHAR columns (which is critical in modern applications) and is a legacy collation intended for backwards compatibility with SQL Server 2000 installations.
In this article, I explain why I recommend the Latin1_General_100_CI_AS_SC_UTF8 collation for most databases in an EDW.
Please note that I do not have long-term experience with this collation. So, my recommendation should be taken with a large grain of salt. Hopefully, though, it will help you think through the same question.
A database created without specifying a collation will inherit that of the server. If the server's collation is not what you intend, create each EDW database with an explicit specification of the pertinent collation with a statement such as:
CREATE DATABASE Solutions COLLATE Latin1_General_100_CI_AS_SC_UTF8;
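To confirm the collation of an existing database, you can query DATABASEPROPERTYEX (the database name Solutions here is carried over from the example above):

```sql
-- Returns the collation of the named database, or NULL if it does not exist.
SELECT DATABASEPROPERTYEX('Solutions', 'Collation') AS DatabaseCollation;
```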
For a list of all collations available for use with SQL Server:
SELECT * FROM sys.fn_helpcollations();
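That list is long, so it is often useful to filter it. For example, to see only the Latin1_General version-100 collations discussed below:

```sql
-- List the Latin1_General version-100 collations and their descriptions.
SELECT name, description
FROM sys.fn_helpcollations()
WHERE name LIKE 'Latin1_General_100%'
ORDER BY name;
```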
To choose a collation, a number of related choices must first be made, because that set of choices dictates the name of the collation that is to be used.
I present here those choices and my guidance for each.
Microsoft recommends using Windows collations instead of legacy collations established for early versions of SQL Server.
I agree and recommend: Windows collation
There are many to choose from, but most US-based organizations utilize Latin1_General.
I agree and recommend: Latin1_General
Assuming you use a Windows collation (per above), the code page is determined by the other choices.
Most of the newer collations were published with v10.0 (aka SQL Server 2008), including those that Microsoft frequently recommends (including the Latin1_General collations).
No Latin collations were introduced in v14.0 (140), so the most recent Latin collations are for v10.0 (100).
I recommend: 100
Most orgs I have supported have utilized case-insensitive collations successfully for years. I don't know of any reason to change.
I, therefore, usually recommend: CI
Most orgs I have supported have utilized accent-sensitive collations successfully for years. I don't know of any reason to change.
I, therefore, recommend: AS
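As a sketch of what these two choices mean in practice, the following comparisons (using an explicit COLLATE clause, so they can be run in any database) illustrate case insensitivity (CI) and accent sensitivity (AS):

```sql
-- Case-insensitive (CI): 'ABC' and 'abc' compare as equal.
SELECT CASE WHEN 'ABC' = 'abc' COLLATE Latin1_General_100_CI_AS
            THEN 'equal' ELSE 'not equal' END AS CaseTest;   -- equal

-- Accent-sensitive (AS): 'resume' and 'résumé' compare as different.
SELECT CASE WHEN N'resume' = N'résumé' COLLATE Latin1_General_100_CI_AS
            THEN 'equal' ELSE 'not equal' END AS AccentTest; -- not equal
```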
Most US-based organizations have no need to use Kana-sensitive collations.
I, therefore, usually recommend: none
Most orgs I have supported have never used width-sensitive collations. I don't know of any reason to change.
I, therefore, recommend: none
Most orgs I have supported have not used variation selector-sensitive collations. I don't know of any reason to change.
I, therefore, recommend: none
To fully support Unicode supplementary characters in built-in functions, one must specify the "SC" option.
Note: This option is inapplicable if a version 140 collation is utilized, because version 140 collations support supplementary characters automatically.
I recommend: SC
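As a sketch of what SC changes: a supplementary character (such as an emoji) is stored as a surrogate pair in UTF-16, and built-in functions like LEN only count it as one character under an SC-enabled collation:

```sql
-- N'😀' (U+1F600) is a supplementary character stored as a surrogate pair.
-- With SC, LEN counts it as one character; without SC, as two code units.
SELECT LEN(N'😀' COLLATE Latin1_General_100_CI_AS_SC) AS WithSC,   -- 1
       LEN(N'😀' COLLATE Latin1_General_100_CI_AS)    AS WithoutSC; -- 2
```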
If an org desires binary sorting (and comparisons) as opposed to the linguistic-driven rules that would otherwise be used for Latin characters, they would specify this option. I do not recall any org that I have supported ever using binary sorting of character data and generally doubt there is good reason to do so.
I recommend: none
One may choose to use a UTF-8-enabled collation by specifying this option.
In general, I recommend use of UTF-8 encoding (as opposed to UCS-2 or UTF-16) because:
• It is fully compatible with basic ASCII.
• It is more efficient (space and transmission) for English systems.
• It has no endian (byte order) problems.
• It is compatible with NULL-terminated strings.
• It is compatible with byte-oriented networks.
• Since SQL Server 2019, CHAR and VARCHAR columns with a UTF-8-enabled collation use UTF-8 encoding, eliminating the need to declare the columns as NCHAR or NVARCHAR (which use the less efficient UTF-16 or UCS-2 encodings).
Use of a UTF-8-enabled collation causes standard data types (CHAR, etc) to support Unicode content automatically, thereby removing the need to utilize nationalized data types (NCHAR, etc).
If a UTF-8-enabled collation is utilized, care must be taken when declaring the size of a character column (because UTF-8 is a variable-length encoding).
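As a sketch of that caution: under a UTF-8 collation, VARCHAR(n) still limits the column to n bytes, not n characters, and non-ASCII characters consume more than one byte each. The temp table below is purely illustrative:

```sql
-- Hypothetical demo table with a UTF-8-collated column.
CREATE TABLE #Utf8Demo
(
    Val VARCHAR(10) COLLATE Latin1_General_100_CI_AS_SC_UTF8
);
INSERT INTO #Utf8Demo (Val) VALUES (N'résumé');

-- 'résumé' is 6 characters but 8 bytes in UTF-8 ('é' takes 2 bytes each),
-- so it consumes 8 of the 10 bytes that VARCHAR(10) allows.
SELECT LEN(Val) AS Chars, DATALENGTH(Val) AS Bytes FROM #Utf8Demo; -- 6, 8

DROP TABLE #Utf8Demo;
```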
I recommend: UTF8
Given the above recommendations for each collation-related choice, I recommend the Latin1_General_100_CI_AS_SC_UTF8 collation for most databases in an EDW.