UTF-8 by default

Skype Meeting on 20.08.2010

Steffen Kamper, Ernesto Baschny and Michael Stucki

Michael looked at the task with Benny during T3DD10.

The plan is to:

1) modify the default setup of TYPO3 so that new installations run on UTF-8
2) create a "converter" to migrate old installations to UTF-8

Default to UTF-8 database and rendering

Stucki mentioned the needed changes:

forceCharset = utf-8
setDBinit = SET NAMES utf8;

The one-line setDBinit is probably enough, as SET NAMES sets the client, connection and result character sets in one statement. Steffen mentioned further statements that are sometimes used:

SET NAMES utf8;
SET CHARACTER SET utf8;
SET SESSION character_set_server=utf8;

Stucki will investigate whether anything else is needed (see the MySQL documentation).

Stucki will investigate where to integrate the "new defaults" so that they don't affect existing systems when upgrading. Installations that ran without setDBinit in the past have to keep working that way (and must not suddenly start talking UTF-8 to the database).
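Why an existing installation must not silently switch can be illustrated with a small sketch. This is not TYPO3 code: Python's latin-1 codec stands in for a database connection that is treated as latin1, and all function names are hypothetical. It assumes an installation that sends UTF-8 bytes over such a connection and shows what the application reads back before and after the connection starts talking UTF-8.

```python
def store_over_latin1_connection(utf8_bytes: bytes) -> str:
    """Server-side view: every incoming byte is taken as one latin-1 character."""
    return utf8_bytes.decode("latin-1")


def read_over_latin1_connection(stored: str) -> bytes:
    """Old setup: the stored characters go back out as the same single bytes."""
    return stored.encode("latin-1")


def read_over_utf8_connection(stored: str) -> bytes:
    """New setup: the server converts the stored characters to UTF-8."""
    return stored.encode("utf-8")


original = "Z\u00fcrich"  # "Zürich"
stored = store_over_latin1_connection(original.encode("utf-8"))

# Old setup: the bytes round-trip unchanged, so the application sees valid UTF-8.
old_view = read_over_latin1_connection(stored).decode("utf-8")

# After switching the connection to UTF-8 without converting the data,
# the application receives the two-character garbage for each umlaut.
new_view = read_over_utf8_connection(stored).decode("utf-8")

print(old_view)
print(new_view)
```

In this simplified model, the round trip in the old setup (encode as latin-1, decode as UTF-8) is also what recovers the intended text from such data.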

Stucki will also document how to make setDBinit unnecessary, i.e. the necessary settings on the MySQL server, and we'll add this to the "Installation documentation" (or the Install Tool?).

Stucki will ask Xavier how to get UTF-8 by default with DBAL, or what is needed for that, since Steffen mentioned that DBAL completely ignores setDBinit (as it is MySQL-specific).

We'll have to test what happens in the Install Tool for a new installation, if tables are properly created as "UTF-8 tables" when the new setup is configured in config_default.

Default to UTF-8 filesystem

Stucki will try to find out whether we can build a check that tells us if the OS supports UTF-8 file names. Ernesto thinks such a check is not possible, as (under Unix) you can store any bytes you want as file names. It is rather a question of whether the client the user uses to view the files (e.g. "ls -l" in a shell) supports UTF-8.
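Ernesto's point can be illustrated with a short sketch (Python, with a hypothetical helper name). It assumes, as stated above, that a Unix file name is just a byte string. So the only thing that can be tested is whether a given name happens to decode as UTF-8, not whether the filesystem "supports" UTF-8.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the raw file name bytes decode cleanly as UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same visible name, written by two differently configured clients.
utf8_name = "\u00fcbersicht.txt".encode("utf-8")      # b'\xc3\xbcbersicht.txt'
latin1_name = "\u00fcbersicht.txt".encode("latin-1")  # b'\xfcbersicht.txt'

print(is_valid_utf8(utf8_name))
print(is_valid_utf8(latin1_name))
```

Such a helper could be run over the byte names returned by os.listdir() when it is called with a bytes path. That would only report on existing names and says nothing about what the OS will accept in the future.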

Further question: how does this behave on Windows? We think it is UTF-8 by default anyway, but this needs to be checked.

Migration tool

The biggest and most complex task is to build a "migration script" that is able to convert an existing installation to UTF-8. We have to take care of:

  • Installations with UTF-8 data in latin-1 databases
  • Consider serialized PHP data which might break if we simply convert bytewise (we probably have to unserialize and re-serialize the UTF-8 data).
  • Consider other structures ...
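The serialized-data problem from the list above can be sketched as follows. PHP's serialize() stores a string as s:<length in bytes>:"<bytes>"; so a bytewise charset conversion changes the payload length without updating the prefix. The sketch is in Python, not PHP, and only handles a single top-level serialized string. All function names are hypothetical, and a real migration would need to cover arrays and nested values as well.

```python
import re

# Matches exactly one serialized PHP string: s:<byte length>:"<payload>";
_STRING = re.compile(rb'^s:(\d+):"(.*)";$', re.DOTALL)


def parse_php_string(serialized: bytes) -> bytes:
    """Return the payload of one serialized PHP string, verifying its length."""
    match = _STRING.match(serialized)
    if match is None:
        raise ValueError("not a single serialized PHP string")
    declared, payload = int(match.group(1)), match.group(2)
    if declared != len(payload):
        raise ValueError(f"declared {declared} bytes, found {len(payload)}")
    return payload


def serialize_php_string(payload: bytes) -> bytes:
    """Serialize raw bytes with a freshly computed byte length."""
    return b's:%d:"%s";' % (len(payload), payload)


def convert_serialized_string(serialized: bytes, source: str = "latin-1") -> bytes:
    """Unserialize, convert the payload to UTF-8, then re-serialize."""
    text = parse_php_string(serialized).decode(source)
    return serialize_php_string(text.encode("utf-8"))


latin1_value = serialize_php_string("Z\u00fcrich".encode("latin-1"))

# Naive bytewise conversion: the payload grows to 7 bytes, the prefix still says 6.
naive = latin1_value.decode("latin-1").encode("utf-8")

# Conversion that re-serializes and therefore fixes the length prefix.
converted = convert_serialized_string(latin1_value)

print(naive)
print(converted)
```

Feeding the naively converted value back into parse_php_string() raises a ValueError because of the stale length prefix. A real unserializer has to reject the value for the same reason, since it reads exactly the declared number of bytes.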

Future

Basic concept: drop all character sets that TYPO3 currently supports and keep only UTF-8.

Ernesto's plan is to:

  • Default new installations to UTF-8 in TYPO3 4.5
  • Have the migration tools ready for TYPO3 4.5 (maybe integrated in the Upgrade Wizard)
  • Still support all other charsets in 4.5 (LTS) to ease migration and increase adoption of the LTS release!
  • Remove all charset conversion code in 4.6 and only support UTF-8 from then on.

Near future plan:

Get "UTF-8 by default" starting at 4.5alpha1. Michael is working on that. ;)