MySQL collation
Table of Content:
In MySQL, collation refers to a set of rules that determine how strings are compared and sorted in the database. It defines how characters are treated when performing operations like ORDER BY
, GROUP BY
, or when comparing two strings.
Key Concepts:
-
Character Set:
- A character set is a set of symbols and encodings. For example, UTF-8 is a popular character set that includes most characters from all the languages in the world.
-
Collation:
- Collation is a set of rules for how to compare characters in a character set. Each character set can have one or more collations. The collation determines things like whether 'A' is considered equal to 'a', or how to sort characters in different languages.
Example:
- utf8_general_ci: This is a collation for the UTF-8 character set. The "general" part means it’s a general-purpose collation, and "ci" stands for case-insensitive, meaning 'A' is considered equal to 'a'.
- utf8_bin: This is another collation for UTF-8, but it’s case-sensitive and compares strings based on the binary value of each character.
Why Collation Matters:
-
Sorting and Searching:
- The collation determines how strings are sorted and compared. For example, in a case-insensitive collation, "apple" and "Apple" would be considered equal, while in a case-sensitive collation, they would not.
-
Multi-Language Support:
- Different languages have different rules for sorting and comparing characters. By selecting an appropriate collation, you can ensure that sorting and searching in your database respect the linguistic rules of the language you’re working with.
-
Indexing:
- Collation can affect the way indexes are built and how efficiently they work, especially in case-insensitive vs. case-sensitive contexts.
Setting Collation:
At the Database Level:
CREATE DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
At the Table Level:
CREATE TABLE mytable ( col1 VARCHAR(100) ) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
At the Column Level:
CREATE TABLE mytable ( col1 VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci );
Common Collations:
utf8_general_ci
: General-purpose, case-insensitive collation for UTF-8.utf8_bin
: Binary collation for UTF-8, case-sensitive.utf8_unicode_ci
: Case-insensitive collation that sorts according to the Unicode standard, supporting multiple languages more accurately.
Changing Collation:
You can change the collation of an existing table or column:
-
Changing Table Collation:
ALTER TABLE mytable CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE mytable MODIFY col1 VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Conclusion:
Collation is essential for ensuring that string comparisons, sorting, and searching behave as expected in your database, especially when dealing with different languages and character sets.