Bug #15287
closedIllegal SGML characters in output
0%
Description
Hi,
the "non SGML character number 128" is probably the most annoying validation error that TYPO3-sites hit when users from the Windows world copy&paste input some field which will go right through to the frontend.
THE PROBLEM
---------------
The origin of the problem comes from the fact that the ISO-Latin-1 character table specifies every character from the decimal range 32 up to 255, but has a gap in the range from 128 to 159 (see [1]). This range is (mis?)used by Microsoft in the so called "Windows-Latin-1" for various characters. The most frequently chars are the EURO-sign, the emdash ("langer Gedankenstrich", which MS-Word creates automatically if you type an hyphen with spaces around it) and opening-double-quotes (bottom) (also created by Word in German if you start some quotation).
So outputting these characters for the Web in "charset=iso-8859-1" mode is not "valid", because they are not part of this charset (which is also why the W3C-validator chokes on them). The very good article in [2] present some alternatives on how to output them in a generic way.
SOME TYPO3 SOLUTIONS
------------------------
Some time in the past I've written "cron_rte_cleanenc", which will remap those characters from the RTE into proper numerical entities (which is what the article [2] suggests as the most widely used method). This is nice, but later I figured out that these characters can also be pasted into fields that are not RTE-enabled (e.g. Title, Subtitle, etc), so my processing also works on some cases.
Later versions of qcom_htmlcleaner include the switch "Remap illegal chars" (clean_chars), which will translation any "high ASCII" character to a proper entity. Two problems I see with the current approach:
1. it only applies to XHTML_clean(), while the problem also exists in
HTML mode.
2. it translates all characters >127 into entities, which is not
needed. The range 128-159 is sufficient here, as Ä can be
represented by a proper ISO-Latin-1 character already.
MY GOAL/AIM
--------------
I want this translation to happen in TYPO3-core, without needing any extention. Our goal has been XHTML-validity, and this is a major issue in this commitment. This is not a "xhtml_cleaning" problem, but a generic charset problem. We have proven solutions to the problem, we just need to see if they are generic enough not to hurt and add them in a meaningful way to the core.
HOW TO PROCEED
-----------------
We need to find out in which character sets this is a problem. If I set my site to "forceCharSet=utf-8", the problem doesn't exist, because all pasted input will have corresponding UTF-8 entities which are valid. So maybe some charset expert around could tell us a bit about it, and if noone is available, I would do some research on it. I suspect every ISO-Latin-x variant has this problem.
Then we need to create some patches to correct the situation.
[1] http://www.htmlhelp.com/reference/charset/
[2] http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
(issue imported from #M2048)