MySQL and the UTF-8 disharmony

0

The problem:

In case you may not know this, MySQL’s “utf8” is not the same as UTF-8 ! And, to my knowledge, this is happening in all MySQL flavors (MySQL, MariaDB, PerconaDB, etc). The “utf8” (also known as “utf8mb3”) encoding supports up to three bytes per character. The real UTF-8 encoding uses up to four bytes per character. ( I am saying “up to” because, as you may know, UTF-8 uses as less bytes as possible, based on the character that needs to be stored).

Note: The original UTF-8 specifications supported up to six characters but this changed in 2003. Now, it is four.

In most of cases, 3 bytes would be enough and you may have not noticed any problem till now. However, problems start when you try to store “special characters” (like emojis) in a CHAR, VARCHAR or TEXT column that is defined as utf8. You will get an error like that:

“Incorrect string value … for column….”

In case of JSON-encoded strings, you may have not notice this issue if you are storing them in a column with JSON data type. And that’s why MySQL 5.7 (where the JSON data type was introduced) stores JSON only in the utf8mb4 encoding….which means, it uses 4 bytes for JSON. Yes, “utf8mb4” was MySQL’s answer (in 2010) to the problem!

Since MySQL 8, utf8mb4 is the default character set. MySQL 8 has also implemented some optimization on utf8mb4, so using utf8mb4 in MySQL 8 is much faster than using utf8mb3 in MySQL 5.7 (according to Oracle).

A nice article that elaborates more on this issue can be found here.

From utf8 to utf8mb4:

You can move from “utf8” to “utf8mb4” using the following SQL commands:

You need to change the database defaults for the new tables that will be created in the future. You also need to change the character set used by the existing tables. You may also need to modify specific columns in case the conversion leads to some limit violations (see next section).

A nice article on doing such a migration can be found here .

Watch out for index length limitations:

Generally, converting from “utf8” to “utf8mb4” will not damage your data since “utf8mb4 is a superset of utf8mb3” according to MySQL documentation. But careful not to violate the index length limits for CHAR, VARCHAR columns. The prefix InnoDB limitation up to MySQL 5.6 is 767 bytes and in MySQL 5.7 is 3072 bytes. So, if the column uses utf8 (3 bytes per character) and needs to be indexed, it should not be longer than 255 characters. If the column uses utf8mb4 and needs to be indexed, it should not be longer than 191 characters. If we insist on defining the column with a greater length, then we need to index part of the string (e.g first 191 characters).

Index limits explain the potential use of the third SQL statement mentioned in the previous section.

Finally, a nice and basic introduction to character sets, Unicode and UTF-8 can be found here .