Project

General

Profile

Actions

Feature #17503

closed

BE should check Mysql charset settings

Added by Martin Kutschker over 16 years ago. Updated over 13 years ago.

Status:
Closed
Priority:
Must have
Category:
-
Target version:
-
Start date:
2007-08-07
Due date:
% Done:

0%

Estimated time:
PHP Version:
4.3
Tags:
Complexity:
Sprint Focus:

Description

Talking to the DB server the charset settings should match the settings of TYPO3 to avoid charset problems. So (at least for an admin) a check if the Mysql variables:

SHOW VARIABLES WHERE Variable_name IN ('character_set_client', 'character_set_connection', 'character_set_results')

All three values must match the current BE charset.

Note that Mysql reports latin1 for iso-8859-1 and windows-1252.

(issue imported from #M6098)


Files

dbCharset.diff (2.7 KB) dbCharset.diff Administrator Admin, 2007-08-16 11:32
blobToText.diff (6.28 KB) blobToText.diff Administrator Admin, 2008-03-02 12:56

Related issues 5 (0 open5 closed)

Related to TYPO3 Core - Bug #17437: When accessing pages form cache "No Temlpate found!" appearsClosedRupert Germann2007-07-21

Actions
Related to TYPO3 Core - Bug #16178: UTF-8: Broken special characters on cached pagesClosed2006-05-24

Actions
Related to TYPO3 Core - Bug #16594: Installer cannot import tables when the db has a collation of utf8_unicode_ciClosedChris topher2006-09-26

Actions
Related to TYPO3 Core - Bug #16388: Installer should honour forceCharset setting when creating a DBClosedChris topher2006-07-21

Actions
Related to TYPO3 Core - Feature #18501: Enable UTF-8 by defaultClosedMichael Stucki2008-03-26

Actions
Actions #1

Updated by Michael Stucki over 16 years ago

Martin, did you already start to work on this? The reason why I'm asking is that I've hugely investigated the situation with MySQL and UTF8, and have already written some patches therefore. I use this place to write down what I've found so far:

- Detection is easy, but what to do with it? IMHO it should be displayed in the Install Tool as well as the little yellow warning box.
- Detection alone does not help anything, we will also need to provide migration solutions. I not only want to have character_set_client, character_set_connection and character_set_results in UTF8 but also the database including every single field. Admins can either do that by using the setDBinit feature, or change the database globally.
- I did not yet care about other DBMs. I guess we might need Karstens input for that...

Actions #2

Updated by Michael Stucki over 16 years ago

While the migration of this is pretty easy [1], we will have big problems with binary fields and forceCharset settings:

- Since some fields are still binary, their contents will not be converted during character set conversion (see [2]). The solution for this is that all blob / mediumblob / tinyblob fields must be investigated and - in case they contain text data - be changed to text / mediumtext / tinytext yet before the conversion is started.

- There will be more trouble in situation where someone has used UTF8 but the database was Latin1. The solution here is to convert all tables to "binary" and then do the conversion. This will prevent MySQL from changing any data. Only the charset definition is changed. After this step, we just need some functionality that convertes the fields back into non-binary fields again.

[1] Pseudo code example:
foreach $table do "ALTER $table CONVERT TO CHARACTER SET utf8"
[2] http://dev.mysql.com/doc/refman/5.1/en/alter-table.html / http://dev.mysql.com/doc/refman/5.1/en/charset.html

Actions #3

Updated by Michael Stucki over 16 years ago

The patch "blobToText.diff" solves the first problem at least in the core. Please cross-check if you find this correct, and if nothing is missing.

We will also have to notify the kickstarter devs, as well as all extension authors. Otherwise their data is simply not converted during the upgrade...

Actions #4

Updated by Michael Stucki over 16 years ago

The patch "dbCharset.diff" is just a demonstration of how I think this could be checked, filled with comments of what must be considered. Let me know what you think about this...

Actions #5

Updated by Michael Stucki over 16 years ago

One more info:

Steps needed to convert a Latin1-database filled with UTF8-data without corrupting the contents:

Problem: The database could usually converted using simple MySQL commands:
ALTER TABLE <table> CONVERT TO CHARACTER SET utf8;

However, there are situations where UTF8-data is filled into a table with Latin1 character set:
- forceCharset has been used on a Latin1 database
- setDBinit = "SET NAMES utf8" has been set
- admin has configured MySQL client, connection and/or result character set, but did not modify the database character set
- database character set is UTF8 but some tables or fields are configured with a different character set. (This will be a special case anyway, so we should think if we want to care about this situation or not...)

Scenario:
Database contains UTF8-data but is tagged as Latin1

Solution:

1. Make a backup of your database!!
2. Apply the patch "blobToText.diff"
3. Run Install Tool => DB compare to make sure there are no more differences between database and .sql-files schema.
4. Open a MySQL client (phpMyAdmin will also work) and run this query:
ALTER TABLE <table> CONVERT TO CHARACTER SET binary;
(repeat this step for every table - we might provide a tool that does this...)
5. Now convert the character set of the whole database. Since all fields are binary, no data will be changed:
ALTER DATABASE <dbname> CHARACTER SET utf8;
ALTER TABLE <table> CONVERT TO CHARACTER SET utf8;
(again, repeat for every table)

6. Go back to the Install Tool and compare again. Now all fields have been of binary types, and they need to be converted back to various text field formats.

7. Check if everything worked fine, and post your feedback here.

Good luck!

Actions #6

Updated by Michael Stucki over 16 years ago

Changing priority to "urgent" because this must be solved by TYPO3 4.2.

Actions #7

Updated by Martin Bless over 16 years ago

Using TYPO3-4.1.2, MySql-4.1.22, PHP-4.4.1, phpMyAdmin-2.6.4.-pl3:

Let's assume 'character_set_client', 'character_set_connection', 'character_set_results' are all correct and set to UTF8. Then the setting of 'setDBinit' is another crucial point. It looks like people are more or less guessing around.

I found the advice to use $TYPO3_CONF_VARS['SYS']['setDBinit'] = 'SET NAMES utf8;'.chr(10).'SET
CHARACTER SET utf8;'; This will work for tt_content, but not for sys_template.constants and sys_template.config which are of type BLOB. With non ASCII chars I would get an

Errors: 102: These fields are not properly updated 
in the database: (constants) Probably value mismatch
with fieldtype.

It works if I only use: $TYPO3_CONF_VARS['SYS']['setDBinit'] = 'SET NAMES utf8;';

Is it TYPO3 who is behaving differently on BLOBs in this case? Karsten Dambekalns pointed out:

http://dev.mysql.com/doc/refman/5.1/en/charset-connection.html

Two examples of what the two set statements do exactly, and if you look
closely, you'll get the difference.

A SET NAMES 'x' statement is equivalent to these three statements:
SET character_set_client = x;
SET character_set_results = x;
SET character_set_connection = x;

A SET CHARACTER SET x statement is equivalent to these three statements:
SET character_set_client = x;
SET character_set_results = x;
SET collation_connection = @@collation_database;
< < < < <

Well, I have to admit that I still don't completely understand. But I learned that you can easily make a "small" mistake in setting up 'setDBinit' which will lead into a big mess.

Related threads: http://lists.netfielders.de/pipermail/typo3-english/2007-August/042105.html http://lists.netfielders.de/pipermail/typo3-dev/2007-August/024585.html

Actions #8

Updated by Michael Stucki over 16 years ago

That's the reason why all fields must be converted from blob to text in the first step!

Actions #9

Updated by Ingo Renner over 16 years ago

blobToText works

Actions #10

Updated by Benni Mack about 16 years ago

I made another "blobToText" working against the latest trunk

Actions #11

Updated by Martin Kutschker about 16 years ago

Michael, the dbCheck path is nice, but it should really check if those three character sets are identical and match what we expect. But I don't see why the check is done every time in t3lib_db. IMHO it's sufficient if we do it in the installer and in the BE in the ususal warning box for admins right after the login.

Note that Mysql treats iso-8859-1 (latin1) and windows-1252 as synonyms. So if we want to compare our settings with Mysql there might be some mapping needed.

Actions #12

Updated by Martin Kutschker about 16 years ago

Michael, Benjamin, the blobToText diffs change at least one column that contains a serialized array to text (l18n_diffsource). Columns containing serialized arrays or XML data must not be of type text. The former because a charset conversion from as single byte charset to a multi-byte charset will break the array, the latter because the XML data may carry an XML prlogue with a charset (ie the data is not in the "main" charset of the DB).

Actions #13

Updated by Michael Stucki about 16 years ago

Masi, I agree that this check is only needed when changing the DB layout (Installer + EM). Can't remember the reason why I solved it different, but of course this can be changed.

About blobToText: It is exactly the meaning of the above procedure that the DB can be converted without any changes to the content, so stuff like this would not break.
It doesn't matter how many bytes a value uses in MySQL, but the result which PHP gets must always be the the same.

However, you are probably right that serialized fields should all remain blob, because this will be the most safe way. BUT we could (mis)use the serialized fields to auto-detect if a Latin1 DB really contains Latin1 data, or if it's UTF-8 already. This would make the whole conversion a lot easier for users. In fact, they would not have to think about it at all...

- michael

Actions #14

Updated by Steffen Kamper about 16 years ago

can we do the blob2text for 4.2?

I think thats the right way to do the conversion. Also a Warning has to be announced to Devs that they use only blobs for serialized data because it's impossible to check content of each field to determine if there are serialized data in it.

Then the conversion can be tested and finally integrated in 4.3

I suggest to declare utf-8 as default. This prevent most users to do a complicate conversion afterwards. Mostly users don't think of it when they start creating a site, later if they need additional language the content is already filled etc.

Actions #15

Updated by Michael Stucki about 16 years ago

Once again I have updated blobToText.diff:
All blobs are changed except all fields that contain serialized data. Dmitry: Is this ok for you?

Actions #17

Updated by Martin Kutschker about 16 years ago

Beware the fields with XML data. TYPO3/PHP allows iso-8859-1 and UTF-8. This means the contents of this columns may not be the same as the character set of the DB!

Actions #18

Updated by Michael Stucki about 16 years ago

Do you know of a place where this is used? Actually these fields just need to be changed to blob fields similar like it is done for serialized content.

Actions #20

Updated by Michael Stucki about 16 years ago

This is almost fixed, however some GUI parts are yet missing. Will create new entries for them, and add relationships to this record.

Actions

Also available in: Atom PDF