Convert a database from latin1 to utf8 charset (aka. Thica.net redesigned)

If it’s been a while since my last update on this blog, then I’d say sorry, I was working on a re-design of my favorite thica.net site. Ok, I’ll be honest here: I didn’t invent anything, instead I only took a copy of the great Notepad Chaos theme from Smashing Magazine and spent some time tweak it to my needs. Well, it may sound simple, but it was NOT at all…

Especially when I came into this serious problem: MySQL data charset converting.

Let me say more clearly. I’m a Vietnamese, and I like Vietnamese poems (at least good ones), so Thica.net (which means Poetry.net in English) is naturally written in Vietnamese, and is nothing more than a normal WordPress installation. The thing is, when first installed Thica.net, the charset of the database (and everything inside it) was set to “latin1″ by default. When not being a big problem with English speaker, for multilingual sites it’s prone to big troubles, like:

  • Incorrect search results
  • Wrong order of sorting: in Vietnamese alphabet table, letter “Đ” comes right after “D”, but in my site it came not sooner than “Y”, when “Ý” was placed just before “A” - whoops.

So along with the redesign, I decided that Thica.net’s database needed to be revamped. Easy spoken, indeed. But not that easy to do.

First, MySQL internally supports conversion between charsets. Like this:

<code>ALTER TABLE `table_name` CONVERT TO CHARACTER SET 'utf8';</code>

You may have guessed it, this query doesn’t help much. While the table’s charset has been indeed converted into utf8, the data remains the same, means a bunch of meaningless characters like “Trần Dần”, “Phạm Tiến Duật”, “Nguyá»…n Mỹ” and so on.

Then, I read somewhere that you can dump the entire database into a text file, and use Notepad++ to convert the encoding into UTF-8, then restore the dump back. Or, you can use iconv library. Or, the multibyte functions in PHP. None worked for me.

After some researching in vain, I came to this question: why does querying the latin1 database and display the retrieved data on a page (in utf-8) always works correctly? It turned out that, like an ox, MySQL somehow decided to take the hard and heavy part, like this:

  1. First, client machine requests data using a query
  2. Server quietly converts the (latin1) data into UTF-8 and returns it
  3. Client machine displays the properly formatted data on browser

So it flashed through my mind: I’d take part in in the final step. Instead of displaying the good data however, I will save it somewhere, like in a dump, to be accessible later. And like a real dump, there should be queries to drop and create tables also. Here is the code I wrote:

# if your database is big, do some preparations
# ini_set('memory_limit', '256M');
# ini_set('max_execution_time', 120);
# some config data
define('HOST', 'localhost');
define('USER', 'root');
define('PASS', '');
define('DB', 'test');
define('FILE', 'utf8data-dump.sql');
mysql_connect(HOST, USER, PASS);
mysql_select_db(DB);
# retrieves a list of table names from the database
$rs = mysql_query('SHOW TABLES FROM ' . DB);
$content = '';
while ($row = mysql_fetch_row($rs))
{
	# for each table, get its structure
	$table_name = $row[0];
	$content .= "\r\n-- TABLE STRUCTURE OF $table_name--\r\n";
	$table_struct = mysql_query("SHOW CREATE TABLE " . $table_name);
	$table_struct = mysql_fetch_array($table_struct);
	# add a DROP IF EXISTS query
	$content .= "DROP TABLE IF EXISTS $table_name; \r\n";
	# add a CREATE TABLE query
	# remember, we must replace latin1 charset with utf8
	$content .= str_replace('latin1', 'utf8', $table_struct[1]) . "; \r\n";
	# now, the data
	$content .= "\r\n-- DATA OF $table_name--\r\n";
	$table_data = mysql_query("SELECT * FROM $table_name");
	# if the table is empty, hell with it
	if (mysql_num_rows($table_data) == 0) continue;
	$content .= "INSERT INTO $table_name VALUES ";
	# populate the data
	$str = '';
	while ($data_row = mysql_fetch_row($table_data))
	{
		$str .= '(';
		foreach ($data_row as $field)
		{
			$str .= sprintf("'%s',", addslashes($field));
		}
		$str = rtrim($str, ',');
		$str .= '),';
	}
	$str = rtrim($str, ',');
	$content .= "$str; \r\n";
}
# write the (formatted) data into the dump file.
$handle = fopen(FILE, 'wb');
fwrite($handle, $content);
fclose($handle);

This script worked for me (this post will make no sense otherwise). After running this it in my browser, I got a dump that can be use to restore a perfect utf8 database (well, not really perfect, as I’ll tell below, but acceptable).

The limitation of this script is, it doesn’t support the collates. A table created with utf8 charset will have a default collate being utf8_general_ci, which isn’t preferred over utf8_unicode_ci (like in these charts here and here. At least, my categories are not correctly sorted until I manually set the column `wp_terms`.`name` to utf8_unicode_ci.

The final line: Have you known yet? ;) My new Thica.net is here.

  • Hix, may qua co, tim mai moi co bai viet ve cai nay, tui dang chay cai Oscommerce o Localhost, nhung bi loi khi dich Tieng anh sang tieng Viet 1 table trong do, lam on chi cho tui 1 cach chi tiet ve cai code o tren voi.
    Cam on rat nhieu, tui con ga` lam.
    Hi, Support me in Vietnamese.

  • I follow your posts for quite a long time and must tell that your posts always prove to be of a high value and quality for readers.

  • Thanks for your compliment :)

You can follow any responses to this entry through the RSS 2.0 feed.