MySQL Character Set and Collation

A character set is a set of rules for representing characters, and a collation defines the order of characters in a character set. Character sets are used to store data. A character set can contain multiple collations.

Our world is a diverse world, containing many languages ​​and different characters. In order to store these different characters more efficiently, various character sets have emerged.

A character set defines rules for representing a set of characters (may be not all characters). Some character sets can only represent single-byte characters, and some others can represent multi-byte characters. Some character sets always represent characters in multiple bytes, and some others use single bytes to represent single-byte characters and multiple bytes to represent multi-byte characters.

The collation is the ordering logic of the characters in the character set. For example A and a, it may be A > a in a collation, or may be A = a in another collation, or A < a may be in another collation.

The same character set supports multiple collations, but you must use one of them.

MySQL Character Set

MySQL supports various character sets, allowing you to store almost any character in a string.

You can use the SHOW CHARACTER SET statement to get the character sets supported by the current MySQL server and the default collation for each character set.

SHOW CHARACTER SET;
+----------+---------------------------------+---------------------+--------+
| Charset  | Description                     | Default collation   | Maxlen |
+----------+---------------------------------+---------------------+--------+
| armscii8 | ARMSCII-8 Armenian              | armscii8_general_ci |      1 |
| ascii    | US ASCII                        | ascii_general_ci    |      1 |
| big5     | Big5 Traditional Chinese        | big5_chinese_ci     |      2 |
| binary   | Binary pseudo charset           | binary              |      1 |
| cp1250   | Windows Central European        | cp1250_general_ci   |      1 |
| cp1251   | Windows Cyrillic                | cp1251_general_ci   |      1 |
| cp1256   | Windows Arabic                  | cp1256_general_ci   |      1 |
| cp1257   | Windows Baltic                  | cp1257_general_ci   |      1 |
| cp850    | DOS West European               | cp850_general_ci    |      1 |
| cp852    | DOS Central European            | cp852_general_ci    |      1 |
| cp866    | DOS Russian                     | cp866_general_ci    |      1 |
| cp932    | SJIS for Windows Japanese       | cp932_japanese_ci   |      2 |
| dec8     | DEC West European               | dec8_swedish_ci     |      1 |
| eucjpms  | UJIS for Windows Japanese       | eucjpms_japanese_ci |      3 |
| euckr    | EUC-KR Korean                   | euckr_korean_ci     |      2 |
| gb18030  | China National Standard GB18030 | gb18030_chinese_ci  |      4 |
| gb2312   | GB2312 Simplified Chinese       | gb2312_chinese_ci   |      2 |
| gbk      | GBK Simplified Chinese          | gbk_chinese_ci      |      2 |
| geostd8  | GEOSTD8 Georgian                | geostd8_general_ci  |      1 |
| greek    | ISO 8859-7 Greek                | greek_general_ci    |      1 |
| hebrew   | ISO 8859-8 Hebrew               | hebrew_general_ci   |      1 |
| hp8      | HP West European                | hp8_english_ci      |      1 |
| keybcs2  | DOS Kamenicky Czech-Slovak      | keybcs2_general_ci  |      1 |
| koi8r    | KOI8-R Relcom Russian           | koi8r_general_ci    |      1 |
| koi8u    | KOI8-U Ukrainian                | koi8u_general_ci    |      1 |
| latin1   | cp1252 West European            | latin1_swedish_ci   |      1 |
| latin2   | ISO 8859-2 Central European     | latin2_general_ci   |      1 |
| latin5   | ISO 8859-9 Turkish              | latin5_turkish_ci   |      1 |
| latin7   | ISO 8859-13 Baltic              | latin7_general_ci   |      1 |
| macce    | Mac Central European            | macce_general_ci    |      1 |
| macroman | Mac West European               | macroman_general_ci |      1 |
| sjis     | Shift-JIS Japanese              | sjis_japanese_ci    |      2 |
| swe7     | 7bit Swedish                    | swe7_swedish_ci     |      1 |
| tis620   | TIS620 Thai                     | tis620_thai_ci      |      1 |
| ucs2     | UCS-2 Unicode                   | ucs2_general_ci     |      2 |
| ujis     | EUC-JP Japanese                 | ujis_japanese_ci    |      3 |
| utf16    | UTF-16 Unicode                  | utf16_general_ci    |      4 |
| utf16le  | UTF-16LE Unicode                | utf16le_general_ci  |      4 |
| utf32    | UTF-32 Unicode                  | utf32_general_ci    |      4 |
| utf8     | UTF-8 Unicode                   | utf8_general_ci     |      3 |
| utf8mb4  | UTF-8 Unicode                   | utf8mb4_0900_ai_ci  |      4 |
+----------+---------------------------------+---------------------+--------+

Starting with MySQL 8, the default character set for MySQL is utf8mb4, but in the earlier versions of MySQL, the default character set is latin1. The latin1 character set can only represent single-byte characters, while utf8mb4 character sets can represent multi-byte characters including emoticons.

MySQL supports setting character sets at different levels, including: server-level, database-level, table-level, and column-level. You can set different character sets for different levels.

Server-level Character Set and Collation

The server-level character set and collation are the default character sets and collation rules for all databases. If you don’t specify a character set for a database, the database will used the server-level charset.

View MySQL Server Character Set

To view the server-level character set in MySQL, use the variables character_set_server in the following statement:

SHOW VARIABLES LIKE "character_set_server";
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| character_set_server | utf8mb4 |
+----------------------+---------+

To view the server-level collation in MySQL, use the variables collation_server in following statement:

SHOW VARIABLES LIKE "collation_server";
+------------------+--------------------+
| Variable_name    | Value              |
+------------------+--------------------+
| collation_server | utf8mb4_0900_ai_ci |
+------------------+--------------------+

Set Server Character Set and Collation

There are 3 ways to set the character set of MySQL server.

  1. Set the character set and collation of the MySQL server in the configuration file:

    [mysqld]
    character-set-server = utf8mb4
    collation-server = utf8mb4_0900_ai_ci
    

    It is recommended and it is permanent. Remember to restart the MySQL server after modifying the configuration file.

  2. Specify the character set and collation when starting the MySQL server:

    mysqld --character-set-server=utf8mb4 --collation-server=utf8mb4_0900_ai_ci
    
  3. Modify the character set and collation at the command line.

    SET character_set_server = utf8mb4;
    SET collation_server = utf8mb4_0900_ai_ci;
    

    This is only valid for the current session.

Database-level Character Set and Collation

The database-level character set and collation only take effect for the current database, which is also the default character set and collation of tables in the database.

View the character set and collation of the database

To view the character set and collation of the current database, use the following statement:

SELECT @@character_set_database, @@collation_database;
+--------------------------+----------------------+
| @@character_set_database | @@collation_database |
+--------------------------+----------------------+
| utf8mb4                  | utf8mb4_0900_ai_ci   |
+--------------------------+----------------------+

To view the character set and collation for a specified database, use the following statement:

SHOW CREATE DATABASE testdb;
+----------+----------------------------------------------------------------------------------------------------------------------------------+
| Database | Create Database                                                                                                                  |
+----------+----------------------------------------------------------------------------------------------------------------------------------+
| testdb   | CREATE DATABASE `testdb` /*!40100 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci */ /*!80016 DEFAULT ENCRYPTION='N' */ |
+----------+----------------------------------------------------------------------------------------------------------------------------------+

You can get the database character set and collation in the CREATE DATABASE statement.

Set database-level character set and collation

You can specify the character set and collation using CHARACTER SET and COLLATE clauses when creating a database, as follows:

CREATE DATABASE database_name
[CHARACTER SET charset_name]
[COLLATE collation_name];

You can also modify the character set and collation of the database using the ALTER DATABASE statement:

ALTER DATABASE database_name
[CHARACTER SET charset_name]
[COLLATE collation_name];

Table-level character sets and collations

MySQL supports setting different character sets and collations for tables. If not set, the database-level character set and collation are inherited.

Set table-level character set and collation

To set character set and collation when creating a table, use the following CREATE TABLE statement:

CREATE TABLE table_name (column_list)
    [CHARACTER SET charset_name]
    [COLLATE collation_name]

To modify the character set and collation of a table, use the following ALTER TABLE statement:

ALTER TABLE table_name
    [CHARACTER SET charset_name]
    [COLLATE collation_name]

View table-level character set and collation

There are 3 ways to view the character set and collation of a table:

  1. Use the SHOW CREATE TABLE statement view the character set and collation of a table:

    SHOW CREATE TABLE t1;
    
    *************************** 1\. row ***************************
          Table: t1
    Create Table: CREATE TABLE `t1` (
      `c1` int DEFAULT NULL,
      `c2` char(1) DEFAULT NULL,
      `c3` json DEFAULT NULL
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
  2. Use the SHOW TABLE STATUS statement view the character set and collation of a table:

    SHOW TABLE STATUS FROM testdb LIKE 't1'\G;
    
    *************************** 1\. row ***************************
              Name: t1
            Engine: InnoDB
            Version: 10
        Row_format: Dynamic
              Rows: 5
    Avg_row_length: 3276
        Data_length: 16384
    Max_data_length: 0
      Index_length: 0
          Data_free: 0
    Auto_increment: NULL
        Create_time: 2022-04-20 02:31:42
        Update_time: NULL
        Check_time: NULL
          Collation: utf8mb4_0900_ai_ci
          Checksum: NULL
    Create_options:
            Comment:

    Here, the column Collation is the collation utf8mb4_0900_ai_ci, which corresponds to the character set utf8mb4.

  3. View a table’s character set and collation from the information_schema.TABLES table:

    SELECT TABLE_COLLATION
    FROM information_schema.TABLES
    WHERE TABLE_SCHEMA = "testdb"
      AND TABLE_NAME = "t1"\G
    
    +--------------------+
    | TABLE_COLLATION    |
    +--------------------+
    | utf8mb4_0900_ai_ci |
    +--------------------+

    Here, the column TABLE_COLLATION is the collation, utf8mb4_0900_ai_ci corresponds to the character set utf8mb4.

Column-level Character Set and Collation

The default character set and collation of a column is inherited from the table, but MySQL supports setting a different character set and collation for a column.

To set independent character set and collation for a column, define a column like this:

col_name data_type
    [CHARACTER SET charset_name]
    [COLLATE collation_name]

You can use the above column definitions in CREATE TABLE and ALTER TABLE statements to set individual character sets and collations for columns, for example:

ALTER TABLE t1
MODIFY c4 VARCHAR(20)
    CHARACTER SET 'utf8'
    COLLATE 'utf8_bin';

To view the character set and collation of a column, you can get them in the information_schema.COLUMNS table:

SELECT CHARACTER_SET_NAME,
  COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = "testdb"
  AND TABLE_NAME = "t1"
  AND COLUMN_NAME = "c4";

Conclusion

MySQL provides various character sets to efficiently store various characters. You can set different character levels and collations for the server, databases, tables, and columns.

If you need to convert strings between different character sets, use two functions provided by MySQL: CONVERT and CAST.