How to support full Unicode in MySQL database
Are you using MySQL's utf8 character set in your database ? In this article, I'll explain why you should switch to utf8mb4 , and how to do it.
UTF-8
The UTF-8 encoding can represent every symbol in the Unicode character set, ranging from U+000000
1 to 1. U+10FFFF
That's 1,114,112 possible symbols. (Not all of these Unicode code points have characters assigned to them, but that doesn't prevent UTF-8 from being able to encode them.)
UTF-8 is a variable-width encoding; see the Unicode character set and UTF-8, UTF-16, UTF-32 encodings article. It uses one to four 8-bit bytes to encode each symbol. Symbols with lower numeric code point values are encoded using fewer bytes. In this way, UTF-8 is optimized for the common case of using ASCII characters and other BMP symbols (whose code points range from U+000000
to U+00FFFF
) - while still allowing astral symbols (whose code points range from U+010000
to U+10FFFF
) to be stored.
utf8 in MySQL
For a long time, I have been using MySQL's utf8 character set for my databases, tables, and columns, assuming that it maps to the aforementioned UTF-8 encoding. By using utf8 , I could store any symbols I wanted in my database - or so I thought.
While writing about the internal character encoding of JavaScript, I noticed that I could not U+1D306 TETRAGRAM FOR CENTER (𝌆)
insert the ' symbol into the MySQL database behind this site. The column I was trying to update had utf8_unicode_ci
a collation of , and the connection character set was set to utf8 .
mysql> SET NAMES utf8; # just to emphasize that the connection charset is set to `utf8`
Query OK, 0 rows affected (0.00 sec)
mysql> UPDATE database_name.table_name SET column_name = 'foo𝌆bar' WHERE id = 9001;
Query OK, 1 row affected, 1 warning (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 1
mysql> SELECT column_name FROM database_name.table_name WHERE id = 9001;
+-------------+
| column_name |
+-------------+
| foo |
+-------------+
1 row in set (0.00 sec)
The content is truncated at the first astral Unicode
_ symbol, in this case _ 𝌆
— therefore, the attempt to insert foo𝌆bar
_ is actually inserted foo
, resulting in data loss (and possibly introducing security issues; see below). MySQL also returns a warning message:
mysql> SHOW WARNINGS;
+---------+------+------------------------------------------------------------------------------+
| Level | Code | Message |
+---------+------+------------------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\xF0\x9D\x8C\x86' for column 'column_name' at row 1 |
+---------+------+------------------------------------------------------------------------------+
1 row in set (0.00 sec)
It turns out that MySQL's utf8 character set only partially implements correct UTF-8 encoding. It can only store UTF-8 encoded symbols consisting of one to three bytes; encoding symbols that take up four bytes are not supported.
Because astral symbols (whose code points range from U+010000
to U+10FFFF
) each consist of four bytes in UTF-8, we cannot use MySQL's utf8 implementation to store them.
This affects not only 𝌆
characters, but also more important symbols, such as U+01F4A9 PILE OF POO (💩)
. In total there are 1,048,575 possible code points that we cannot use. In fact, MySQL's utf8 only allows us to store 5.88% of all possible Unicode code points (0x00FFFF + 1) / (0x10FFFF + 1)
. Proper UTF-8 can encode 100% of all Unicode code points .
MySQL's
utf8
encoding naming is awkward because it is different from the correctUTF-8
encoding. It does not provide full Unicode support, which may lead to data loss or security vulnerabilities.
utf8mb4 in MySQL
Fortunately, MySQL 5.5.3 (released in early 2010) introduced a new encoding called utf8mb4 that maps to proper UTF-8 and thus fully supports Unicode, including astral symbols.
We can refer to the article Switching MySQL from utf8 to utf8mb4 to switch the encoding.
Summarize
Never use utf8
in MySQL - always use utf8mb4 . It may take some time to update your database and your code, but it is definitely worth the effort. Why arbitrarily limit the set of symbols that can be used in a database? Why lose data every time a user enters an astral symbol as part of a comment or message or whatever we store in the database? There is no reason not to strive for full Unicode
support everywhere. Do the right thing and use utf8mb4 . 🍻
For reprinting, please send an email to 1244347461@qq.com for approval. After obtaining the author's consent, kindly include the source as a link.
Related Articles
Two ways to install mysql-5.5.47 on Linux system and manage mysql
Publish Date:2025/04/26 Views:140 Category:MySQL
-
We know that there are generally two ways to install software on the Linux system. One is to use rpm or yum to install, which is convenient and fast; the other is to use the compiled source package. Although this method is more troublesome
Mysql master-slave replication simple configuration
Publish Date:2025/04/26 Views:120 Category:MySQL
-
I'm learning about MySQL master-slave replication recently and want to apply it to a project at work. Since I'm new to it, I don't understand it very well, so I can only share how to make a simple configuration. At the beginning, we configu
Implementation details of MySQL master-slave replication (I) Exploration of the m
Publish Date:2025/04/26 Views:56 Category:MySQL
-
This article mainly discusses the implementation mechanism of master-slave replication, which may not be directly helpful for our actual application, but understanding its principle can achieve twice the result with half the effort for futu
Implementation details of MySQL master-slave replication (II) Exploration from th
Publish Date:2025/04/26 Views:74 Category:MySQL
-
Previously we explored the master server in master-slave replication. Now let's look at how the slave server works in the entire system. In the article Master Server Exploration, we mentioned that three threads are needed in a master-slave
Mysql master-slave replication - what to do if the slave server stops
Publish Date:2025/04/26 Views:97 Category:MySQL
-
You may find this topic a little ridiculous. What can we do if the server stops? Of course, we should restart the service. Restarting the service is no problem. The problem is that a lot of data has been written to the master database durin
MySQL stored procedure details
Publish Date:2025/04/26 Views:163 Category:MySQL
-
A stored procedure can be thought of as encapsulating a SQL statement that we need to process specially into a function. When needed, we only need to call this function to achieve the desired operation. This process can be called a stored p
How many of these MySQL statement tags have you used?
Publish Date:2025/04/26 Views:122 Category:MySQL
-
In the article "A Peek into MySQL Stored Procedure Details" , we briefly introduced the use of stored procedures. The syntax for creating stored procedures includes BEGIN...END. In addition to BEGIN...END, the following statement tags can b
MySQL CURRENT_TIMESTAMP() Function Detailed Introduction (with Examples)
Publish Date:2025/04/26 Views:132 Category:MySQL
-
This article explains how to use MySQL functions with examples CURRENT_TIMESTAMP() . By using it, we can convert or display the current date and time. The output format is "YYYY-MM-DD HH:MM:SS" or "YYYYMMDDHHMMSS", depending on the context
Back up the MySQL database to a file
Publish Date:2025/04/26 Views:166 Category:MySQL
-
Backing up your database is a very important system administration task that should usually cron be run from a job at scheduled intervals. We will use mysqldump the dump utility that comes with mysql to dump the contents of the database to