|
Management of character data in SQL
Server
Chip Hartney (The Datamology Company)
First written: February 8, 2023
Last updated: January 16, 2024
Contents
Care must be taken in any data store, including a DW, to manage character data values carefully. Because a DW may collect data from many varied source systems, it is critical that the DW be capable of handling any of the characters that may be used in those systems. Because a DW is expected to live into the future and absorb data from new source systems, it is generally critical that the DW be enabled to handle the widest possible range of character data.
Further, an appropriate and consistent framework for the processing of character data must be established. How are character values to be ordered? When should values be determined to be equal or even just similar? Etc.
In this article, I describe how the management of character data is, in RDMBSs, dictated by the choice of a particular collation and how that collation, in turn, dictates most aspects of the framework mentioned above. I describe collations and related topics (code pages, ASCII, Unicode, UTF, and more) with a specific emphasis on how this all plays out in Microsoft SQL Server.
A collation is a specification of how character data is managed including:
· Which characters are supported.
· How character values are to be represented (as bit patterns in memory and on disk).
· How character values are to be sorted.
· How character values are to be compared.
· How to handle case differences between characters that are otherwise the same.
· How to handle other differences (such as accent marks) between characters that are otherwise similar.
In SQL Server, as in most RDBMSs, you can specify a collation for each of the following:
· Each server.
· Each database.
· Each column.
Each specified collation acts as the default for objects lower in the object hierarchy. E.g., the server’s collation is the default for the databases that it contains but (usually) can be overridden by an explicit collation specification for any particular database.
SQL Server supports the following types of collations:
· SQL Server custom collations
o These have “SQL_” prefix in their name.
o These are obsolete (though not deprecated) and preserved for SQL Server 2000 compatibility.
§ Unfortunately, SQL_Latin1_General_CP1_CI_AS remains the default when installing on an OS using US English as its language and, therefore, is the default for instances of SQL Server.
o These collations should not be used for greenfield data envs.
· Windows collations
o These lack “SQL_” prefix in their name.
o These are newer and more functional, sort character values (such as CHAR) and nationalized characters values (such as NCHAR) in the same manner, and are being updated continually by Microsoft.
o These collations should be used for greenfield data envs.
For a list of all collations available for use with SQL Server:
SELECT * FROM sys.fn_helpcollations();
Regardless of the type of collation used, the selection has the following effects:
· Server-level collation dictates:
· The collation used for system databases: master, model, msdb, and tempdb.
o Due to controlling the DB-level Collation of tempdb, it is then the default Collation for string columns in temporary tables (global and local), but not table variables.
o Due to controlling the DB-level Collation of master, it is then the Collation used for Server-level data, such as database names (i.e. name column in sys.databases), login names, etc.
· The collation used for handling of parameter and variable names.
· The collation used for handling of cursor names.
· The collation used for handling of GOTO labels.
· Default collation used for newly created databases.
o Can be overridden with COLLATE clause in CREATE DATABASE statements.
· Database-level collation dictates:
· The collation used for string literals and variables.
· The collation used for database-level metadata (such as object names).
· Default collation used for newly created string columns.
o Can be overridden with COLLATE clause in CREATE TABLE/ALTER TABLE statements.
A SQL Server collation is specified as a string in the following form:
Latin1_General_100_CI_AS_SC_UTF8
See the following for details:
Windows Collation Name (Transact-SQL) - SQL Server | Microsoft Learn
This is actually a series of specifications (separated by underscores), some of which are optional, where the components are:
o The collation designator (aka culture/locale) (e.g., “Latin1_General”):
o Specifies:
§ the linguistic rules used for sorting and comparison.
§ The code page used for character data.
o Latin1_General is the commonly used for American/English use cases.
o The code page (not used in above example) which identifies the 8-bit code page used for encoding extended ASCII characters.
o SQL Server collations only.
o CP1, for example, is Microsoft’s shorthand for CP1252 (code page 1252).
§ Windows is the only platform that uses CP1252 indigenously. It is a hold-over from DOS. It is very similar to ISO 8859-1, but not same.
o The version (e.g., “100”) which identifies the SQL Server version in which the collation was introduced (v10.0, aka SQL Server 2008, in this case).
o Windows collations only.
o Windows collations with no version number are v8.0 (aka SQL Server 2000).
o No Latin collations were introduced in v14.0 (140), so the most recent Latin collations are for v10.0 (100).
o Case sensitivity (e.g., “CI”):
o Options:
§ “CS” for case-sensitive (in which case 'ABC' <> 'abc').
§ “CI” for case-insensitive (in which case 'ABC' = 'abc').
o Accent sensitivity (e.g., “AI”):
o Options:
§ “AS” for accent-sensitive (in which case 'ü' <> 'u').
§ “AI” for accent-sensitive (in which case 'ü' = 'u').
o Kana sensitivity (not specified in above example):
o Options:
§ “KS” for Kana-sensitive.
§ Otherwise (missing) Kana-insensitive.
o Width sensitivity (not specified in above example):
o Options:
§ “WS” for width-sensitive.
§ Otherwise (missing) width-insensitive.
o Variation selector sensitivity (not specified in above example):
o Options:
§ “VSS” for variation selector-sensitive.
§ Otherwise (missing) variation selector-insensitive.
o Supplementary character support (e.g., “SC”):
o Applicable only to version 90 or 100 collations.
o Version 140 collations automatically support supplementary characters.
o Options:
§ “SC” to support Unicode supplementary characters (beyond BMP) in built-in functions.
§ Otherwise (missing) built-in functions support only the Unicode BMP characters.
o Binary sorting and comparison (not specified in above example):
o Never combined with any of the 5 sensitivities or _SC.
o Provides performance boost (at the expense of linguistic rules).
o Options:
§ “BIN2” to eliminate linguistic rules for sorting and comparison and, instead, use a pure binary comparisons based on the encoded bits.
§ Otherwise (missing) use standard linguistic rules and specified sensitivities.
o UTF-8 (e.g., “UTF8”):
o Can only be used on collations that support supplementary characters (i.e. version 90 or 100 collations with _SC in their name, and version 140 collations).
o Options:
§ “UTF8” to utilize UTF-8 encoding for all standard character data types (CHAR and VARCHAR). Unicode data types (NCHAR and NVARCHAR) continue to use UCS-2 or UTF-16.
§ Otherwise (missing) use the encoding dictated by other options.
UTF-8 collations were introduced with SQL Server 2019 (v15.0) and have unique implications.
Because standard (CHAR) and Unicode (NCHAR) data types are handled differently and because there is no need to use Unicode data types, all character columns should be defined as standard data types. I.e., NCHAR (etc) data types should not be used. This ensures that all data is encoded equivalently and no conversions are required during processing.
Because UTF-8 is a variable-length encoding (1 to 4 bytes, depending on the character), performance is somewhat degraded. I.e., processing of fixed-length character strings (based on double-byte character sets) is faster.
Because UTF-8 uses only 1 byte to store the common Latin characters, use of UTF-8 collations results in potentially dramatic reduction in storage/memory usage in those systems that deal solely or primarily with Latin data (as is the case with most businesses in the USA).
Because UTF-8 is a variable-length encoding, column sizing is more difficult. Contrary to popular belief, the size spec of “10” in CHAR(5) is not the number of characters it supports but the number of bytes that can be used to store the character value. The value ‘abcde’ is supported because UTF-8 (like ASCII) uses 1 byte for each of those characters. But the value ‘abcd©’ is not supported because UTF-8 requires 2 bytes to store the copyright symbol (©). The value would be silently truncated!
See following for more information:
Collation and Unicode support - SQL Server | Microsoft Learn
SQL Server includes non-ANSI data types to support Unicode (i.e., “national”) characters:
· NCHAR
· NVARCHAR
· NVARCHAR(MAX)
· NTEXT
These data types support Unicode characters in the BMP range (see Unicode below) and are encoded using a double-byte character set (UCS-2 or, more recently, UTF-16).
Prior to the advent of the UTF-8-enabled collations, one had to use these data types to contain Unicode characters. If a UTF-8-enabled collation is used, the traditional CHAR, VARCHAR, VARCHAR(MAX), and TEXT data types can be used directly.
A code page is an ordered set of characters of a given script in which a numeric index, or code point value, is associated with each character.
A Windows code page is typically referred to as a character set or a charset. Code pages are used to provide support for the character sets and keyboard layouts that are used by different Windows system locales.
American Standard Code for Information Interchange (ASCII) is a standard (first published in 1963 by the X3 committee of the American Standards Association) that assigns characters (letters, numbers, symbols, etc) to the 256 possible values in an 8-bit (1-byte) unit of storage:
· Basic ASCII uses 7 bits to encode the 128 characters that are common to all Latin code pages.
· Extended ASCII uses the 8th bit to encode 128 additional characters that are used in a given locale/culture.
See following for the actual encodings:
ASCII Table - ASCII Character Codes, HTML, Octal, Hex, Decimal
A Windows code page defines the "extended" part of Extended ASCII, and controls which characters are used for values 128 - 255. This group varies between each culture.
Unicode is a standard for mapping code points to characters.
That Unicode Standard is a character coding system designed to support the worldwide interchange, processing, and display of the written texts of the diverse languages and technical disciplines of the modern world as well as classical and historical texts of many written languages.
The Unicode Consortium allocates to each character a unique code point, which is a value in the range 000000–10FFFF. The most frequently used characters have code point values in the range 000000–00FFFF (65,536 characters) which fit into an 8-bit or 16-bit word in memory and on disk. This range is usually designated as the Basic Multilingual Plane (BMP).
But the Unicode Consortium has established 16 additional "planes" of characters, each the same size as the BMP. This definition allows Unicode the potential to represent 1,114,112 characters (that is, 216 characters per plane * 17 planes) within the code point range 000000–10FFFF. Characters with code point values larger than 00FFFF require two to four consecutive 8-bit words (UTF-8), or two consecutive 16-bit words (UTF-16). These characters located beyond the BMP are called supplementary characters, and the additional consecutive 8-bit or 16-bit words are called surrogate pairs.
A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point (except surrogate code points) to a unique byte sequence. The ISO/IEC 10646 standard uses the term “UCS transformation format” for UTF.
In general, there are three encoding forms of UTF:
· UTF-8 is the byte-oriented encoding form of Unicode that uses:
o A single 8-bit (1-byte) code unit for ASCII chars (code range 000000–00007F).
o A single 16-bit (2-byte) code unit for code range 000080–0007FF.
o A single 24-bit (3-byte) code unit for code range 000800–00FFFF.
o A single 32-bit (4-byte) code unit for code range 010000–10FFFF.
· UTF-16 uses:
o A single 16-bit (2-byte) code unit to encode the most common 63K characters.
o A pair of 16-bit (2-byte) code units, called surrogates, to encode the 1M less common characters.
· UTF-32 uses a 32-bit (4-byte) code unit to record the Unicode scalar value directly (which is the abstract number associated with a Unicode character).
Each UTF is reversible, thus every UTF supports lossless round tripping: mapping from any Unicode coded character sequence to a sequence of bytes and back again.
UTF-8 is pervasive in web applications.
UTF-16 is utilized internally by programming languages such as Java.
Opinion: I like UTF-8 better (for data stores) than UTF-16 and UTF-32 because:
· It is fully compatible with basic ASCII.
· It is more efficient (space and transmission) for English systems.
· It has no endian (byte order) problems.
· It is compatible with NULL-terminated strings.
· It is compatible with byte-oriented networks.
· Since SQL Server 2019, CHAR and VARCHAR columns will automatically use UTF-8 encoding, eliminating the need to declare the columns a NCHAR or NVARCHAR (which use less efficient UTF-16 or UCS-2 encodings).
However, as noted elsewhere, there are ramifications for using a collation. So, it is not clearcut that a UTF-8-enabled collation is best.