Related link: http://mysql.he.net/doc/refman/5.0/en/charset.html

I’ve just finished one of the most difficult and tedious problems I’ve ever solved, so I have to share the solution here in a little tutorial of how I fixed this, even though I’m sure there are better ways, this is what worked for me.

THE PROBLEM - PART 1:
My old CD Baby MySQL database from 1998 was filled with foreign characters and was in MySQL’s default (latin1) encoding.
For years, customers and clients had been using our web interface to give us their names, addresses, song titles, bio, and many things in all kinds of alphabets.
I wanted everything to be in UTF-8. (The database, the website, the MySQL client, everything.)

QUICK DEFINITION : "FOREIGN CHARACTERS"
When I say "foreign characters" I mean not just Greek, Icelandic, Japanese, Chinese, Korean, and others shown at Omniglot, but also the curly-quotes, ellipsis, em-dash, and things described at alistapart.

START OF THE SOLUTION (THE EASY PART):
* - Found a few hours of downtime at 2am on a Sunday night.
* - Shut down the website.
* - Did a raw data dump (mysqldump) of the data to a regular text "dump.sql" file. (85 tables, millions of rows, an 8 gig dump)
* - Completely removed MySQL 3.2 from the system
* - Installed MySQL 5.0 (FreeBSD ports), making sure to use –with-charset=utf8 while compiling (see http://dev.mysql.com/doc/refman/5.0/en/charset-server.html)
* - Did a sed replace on the dump.sql file, changing all table types to utf8.
* - (Also changed from MyISAM to InnoDB but that’s a different story, and had no problem.)
* - Changed my HTML header Content-Type to charset=utf-8 everywhere
* - Changed /etc/my.cnf to default charset utf8
* - Loaded the dump.sql file, and turned the website back on.
* - Made sure it mostly worked, and went to sleep

THE PROBLEM - PART 2:
Some foreign characters were perfect. Others were a jumble : what should have been one quotation-mark turned into a series of THREE jumbly characters. Weird. Had to be fixed. No idea where to start.

FIGURING OUT WHAT’S WRONG (THE HARD PART):
* - Unless you want to do *everything* in a web browser, you need to get a terminal that does Unicode and can display foreign characters. I used uxterm. See http://czyborra.com/unicode/terminals.html
* - I learned about using the SET NAMES utf8 query, but when I did that almost everything turned into a jumble.
* - I could send the database a set names utf8 command, and SOME would work. Or I could do set names latin1, and the rest would work. I was stumped.
* - It took about 10 hours of frowning and furiously typing, but I found out that
— #1 : The MySQL server was using UTF8 encoding.
— #2 : The MySQL client was using latin1 encoding.
— #3 : Even if I got the command-line MySQL client to use utf8, the PHP client was still using latin1 encoding.
— #4 : Most of my data must have been put into the MySQL server with latin1 encoding, which is why it worked with latin1 encoding on the client when getting it out.

Seems I had some characters in latin1, some characters in UTF-8, some in the database as HTML equivalents (分) and some characters that were just a total mystery.

A TOOLBOX FOR SLEUTHING CHARACTER ENCODING PROBLEMS:

#1 - USE MySQL CHAR_LENGTH TO FIND ROWS WITH MULTI-BYTE CHARACTERS:
SELECT name FROM clients WHERE LENGTH(name) != CHAR_LENGTH(name);

#2 - USE MySQL HEX and PHP bin2hex
SELECT name, HEX(name) FROM clients;
Get the result back into PHP, and run a bin2hex on the string, compare it to MySQL’s hex of that same string

#3 - SEE IT IN BOTH ENCODINGS
$db->query("SET NAMES latin1");
$db->query("SELECT name, HEX(name) FROM clients");
(compare the string and its hex result from MySQL with the bin2hex from PHP)
$db->query("SET NAMES utf8");
$db->query("SELECT name, HEX(name) FROM clients");
(compare the string and its hex result from MySQL with the bin2hex from PHP)

For all those strings that looked perfect in LATIN1 encoding, here’s how I would fix them in the database:
$db->query("SET NAMES latin1");
$db->query("SELECT id, name FROM clients");
$hex = bin2hex($x[’name’]);
$db->query("SET NAMES utf8");
$db->query("UPDATE clients SET name=UNHEX($hex) WHERE id=$id")

That seemed to work, for most things.
Problem is, only SOME of the database was in latin1 encoding, so I had to use a few quirky ways, but mostly my own eyes, to fix only these things, and not accidently re-encode something that was perfect.

#4 - USE A HEX/UNHEX REPLACE FOR THE UNFIXABLE CHARACTERS
Imagine, after all that fixing, you found strings like this:

Let~!@s say ^|%What a nice house you~!@ve got here, don~!@t you think?^!%.

Who knows when or how this happened, but obviously ~!@ is meant to be an apostrophe, ^|% an open-quote, and ^!% a closing-quote.

I’d use MySQL SUBSTRING to find the 3 characters that needed replacing:
SELECT SUBSTRING(quote, 353, 3) FROM table WHERE id=1;

Once narrowing it down to the exact string, add a HEX() around it:
SELECT HEX(SUBSTRING(quote, 353, 3)) FROM table WHERE id=1;
… which would give you a result like C8035EF6BB92BF2

Then use that with MySQL REGEXP to find and replace all occurences in your database!
UPDATE table SET field = REPLACE(field, UNHEX(’C8035EF6BB92BF2′), "’") WHERE field REGEXP UNHEX(’C8035EF6BB92BF2′);

I set up some PHP arrays of all my tables, and all their text fields, to run this same query on everything in my database.
Then do it again for curly-quotes and other weirnesses.

A few times, I had no idea what a character was supposed to be (like the Icelandic and Gaelic ones) - so I had to go visit the artist’s website, and find their song titles or bio information spelled correctly there.

#5 - VALIDATE UTF8
Got the is_utf8 function from PHP docs to validate all the values in the database.
Doing this found a bunch of invisible problems, which only through hours of MySQL SUBSTRING and HEX revealed that there were invisible characters with HEX values of 00-19 scattered around my text fields.
I used the same solution as above to replace them:
UPDATE table SET field = REPLACE(field, UNHEX(’05′), ‘’) WHERE field REGEXP UNHEX(’05′);
I looped this inside an array of all hex values under 20.

#6 - CONVERTING HTML ENTITIES
Find HTML entities hidden in the database:
SELECT field FROM table WHERE field REGEXP ‘&#[0-9]*;’
Use the utf8_chr function from the comments of the PHP html_entity_decode page.
Use PHP preg_match_all to find the entities inside the string, and replace them:

function myreplace($string) {
preg_match_all('/&#(\d*)/', $string, $matches);
foreach($matches[1] as $num) {
$string = str_replace("&#$num;", utf8_chr($num), $string);
}
return $string;
}

Update the database with the returned result.

After all this (about 60 hours work over the last 5 days) I think it’s all done.

Phew.

LESSON LEARNED: KEEP EVERYTHING IN UTF-8, ABSOLUTELY EVERYWHERE, FROM DAY ONE. You’ll be glad you did some day.

Go ahead. Show how smart you are. Show how my 60 hours could have been done in 5 minutes with the language of your choice instead of my mess, above.