Bug #15287

Illegal SGML characters in output

Added by Ernesto Baschny almost 14 years ago. Updated almost 7 years ago.

Status: Closed
Priority: Should have
Assignee: -
Category: -
Target version: -
Start date: 2005-12-15
Due date:
% Done: 0%
TYPO3 Version: 4.0
PHP Version:
Tags:
Complexity:
Is Regression:
Sprint Focus:

Description

Hi,

the "non SGML character number 128" error is probably the most annoying validation error that TYPO3 sites hit when users from the Windows world copy and paste text into a field whose content goes straight through to the frontend.

THE PROBLEM
---------------

The problem originates from the fact that the ISO-Latin-1 character table specifies every character in the decimal range 32 to 255, except for a gap from 128 to 159 (see [1]). Microsoft (mis?)uses this range in the so-called "Windows-Latin-1" for various characters. The most frequent ones are the EURO sign, the emdash ("langer Gedankenstrich", which MS Word creates automatically if you type a hyphen with spaces around it) and the opening double quotes (bottom), which Word also creates in German when you start a quotation.

So outputting these characters for the Web in "charset=iso-8859-1" mode is not "valid", because they are not part of this charset (which is also why the W3C validator chokes on them). The very good article in [2] presents some alternatives for outputting them in a generic way.
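A minimal Python sketch of the mismatch described above. Python is used purely for illustration, and the helper name code_points is hypothetical (nothing like it exists in TYPO3). It decodes the same three bytes once as windows-1252 and once as iso-8859-1:

```python
def code_points(raw: bytes, charset: str) -> list[str]:
    """Decode raw bytes with the given charset and list the resulting code points."""
    return [f"U+{ord(ch):04X}" for ch in raw.decode(charset)]


# Bytes as a Windows client might submit them:
# 0x80 = EURO sign, 0x84 = low double quote, 0x97 = emdash (in windows-1252)
pasted = bytes([0x80, 0x84, 0x97])

print(code_points(pasted, "cp1252"))   # ['U+20AC', 'U+201E', 'U+2014']
print(code_points(pasted, "latin-1"))  # ['U+0080', 'U+0084', 'U+0097']
```

Read as windows-1252, the bytes are printable characters. Read as iso-8859-1, they land in the 128-159 range that the validator rejects.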

SOME TYPO3 SOLUTIONS
------------------------

Some time ago I wrote "cron_rte_cleanenc", which remaps those characters from the RTE into proper numerical entities (which is what the article [2] suggests as the most widely used method). This is nice, but later I figured out that these characters can also be pasted into fields that are not RTE-enabled (e.g. Title, Subtitle, etc.), so my processing only works in some cases.

Later versions of qcom_htmlcleaner include the switch "Remap illegal chars" (clean_chars), which will translate any "high ASCII" character to a proper entity. I see two problems with the current approach:

1. it only applies to XHTML_clean(), while the problem also exists in
HTML mode.
2. it translates all characters >127 into entities, which is not
needed. The range 128-159 is sufficient here, as Ä can be
represented by a proper ISO-Latin-1 character already.
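The narrower remapping argued for in point 2 could look roughly like the following Python sketch. The function name remap_c1_to_entities is hypothetical, and silently dropping the five byte values that windows-1252 leaves unassigned is an assumption made here, not something the thread decides:

```python
def remap_c1_to_entities(data: bytes) -> bytes:
    """Replace bytes 128-159 with numeric entities; leave every other byte untouched."""
    out = bytearray()
    for byte in data:
        if 0x80 <= byte <= 0x9F:
            try:
                char = bytes([byte]).decode("cp1252")
            except UnicodeDecodeError:
                # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned in windows-1252.
                # Assumption for this sketch: drop them.
                continue
            out += f"&#{ord(char)};".encode("ascii")
        else:
            out.append(byte)
    return bytes(out)


# iso-8859-1 bytes such as 0xFC (ü) and 0xC4 (Ä) pass through unchanged.
sample = b"5 \x80 f\xfcr \x84\xc4pfel\x93"
print(remap_c1_to_entities(sample))
# b'5 &#8364; f\xfcr &#8222;\xc4pfel&#8220;'
```

Because it works on bytes rather than on decoded text, the same routine applies to HTML and XHTML output alike.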

MY GOAL/AIM
--------------

I want this translation to happen in the TYPO3 core, without needing any extension. Our goal has been XHTML validity, and this is a major issue for that commitment. This is not an "xhtml_cleaning" problem, but a generic charset problem. We have proven solutions to the problem; we just need to see if they are generic enough not to hurt and add them to the core in a meaningful way.

HOW TO PROCEED
-----------------

We need to find out in which character sets this is a problem. If I set my site to "forceCharSet=utf-8", the problem doesn't exist, because all pasted input will have corresponding UTF-8 representations, which are valid. So maybe some charset expert around could tell us a bit about it, and if no one is available, I would do some research on it. I suspect every ISO-Latin-x variant has this problem.
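A quick way to see which charsets can represent a pasted character at all, sketched in Python. The helper can_encode is hypothetical and only wraps the standard str.encode call:

```python
def can_encode(text: str, charset: str) -> bool:
    """Return True if every character of text exists in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


for charset in ("utf-8", "cp1252", "iso-8859-1"):
    print(charset, can_encode("\u20ac", charset))  # U+20AC is the EURO sign
# utf-8 True
# cp1252 True
# iso-8859-1 False
```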

Then we need to create some patches to correct the situation.

[1] http://www.htmlhelp.com/reference/charset/
[2] http://www.cs.tut.fi/~jkorpela/www/windows-chars.html

(issue imported from #M2048)

History

#1 Updated by Ernesto Baschny almost 14 years ago

FYI

#2 Updated by Michael Stucki almost 14 years ago

Will someone fix this before 4.0?

#3 Updated by Sebastian Kurfuerst almost 14 years ago

I don't think so...

#4 Updated by Michael Stucki almost 14 years ago

Hi Ernesto! What do you suggest to solve the problem?
Is the character only generated in the RTE or also in plain textareas?

#5 Updated by Ernesto Baschny almost 14 years ago

@Stucki,

The problem is in all text fields. Masi made some comments about it in t.p.content-rendering, but it looks like the solution is somewhat more complex.

In the meantime I've made http://typo3.org/extensions/repository/search/cron_latin1cleaner/, which fixes that specific problem by extending t3lib_parsehtml::HTMLcleaner to clean those specific characters if metaCharset is "iso-8859-1". But that might not be the best place to do that, because we might want to have the entities stored in the database already (or not?), so a better place needs to be found.

#6 Updated by Fronzes Philippe over 12 years ago

I have just tried to validate a page where some texts were pasted from MS Word.
I got this message from the validator: "non SGML character number 146".
This character corresponds to the HTML entity "&rsquo;" (’).
There should be a way in the core to replace characters which are not valid in HTML (0 to 31 inclusive and 127 to 159 inclusive) with their corresponding HTML entities.
I looked around the code but I really don't know where to patch for this.
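A rough Python sketch of such a check, locating the offending bytes before deciding how to replace them. The function name find_non_sgml is hypothetical. Exempting tab, line feed and carriage return is an assumption based on the HTML 4 SGML declaration, which permits those three control characters:

```python
ALLOWED_CONTROLS = {9, 10, 13}  # tab, LF, CR (assumed valid, see note above)


def find_non_sgml(data: bytes) -> list[tuple[int, int]]:
    """Return (offset, byte value) pairs for bytes in 0-31 or 127-159."""
    return [
        (offset, byte)
        for offset, byte in enumerate(data)
        if (byte < 32 and byte not in ALLOWED_CONTROLS) or 127 <= byte <= 159
    ]


print(find_non_sgml(b"it\x92s\n"))  # [(2, 146)]
```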

I think the priority should be upgraded because pasting from msword is really a common way of working for editors, and conformity is becoming the rule.

#7 Updated by Ernesto Baschny over 12 years ago

@Fronzes:

you might as well change the character encoding of your website to "windows-1252" and all will work without any further change. Your site will validate even with those SGML-characters, because they are supported in this charset. All clients that I know will be able to handle that. And your clients will still be able to paste their Word-Text into TYPO3, because their original encoding is also "windows-1252".

windows-1252 = latin-1 + chars #128-159

See: http://en.wikipedia.org/wiki/Windows-1252

Cheers,
Ernesto

#8 Updated by Ernesto Baschny over 12 years ago

or better yet: change the charset of your website to utf-8. :)

#9 Updated by Martin Kutschker over 12 years ago

IE will happily paste the range 128-159 plainly and everything else as HTML entities (whether you like it or not). So if you want strict content you must get rid of the fake ASCII range in TCEmain. For plain text fields you must remove or transliterate the characters, for HTML fields you may use entities.

To do this t3lib_cs could be extended to hold (in a file and loaded on demand?) the necessary mappings for the iso-8859-* charsets and the corresponding windows charsets.

#10 Updated by Ernesto Baschny over 12 years ago

@Masi,

If my Windows Charset is Windows-1250 (Eastern European) and my HTML-Page is ISO-8859-1 (Western European), what will IE paste into text/rte fields? I thought IE will already "convert" stuff to the charset of the current Webpage on FORMs submit, and the only problem is the 128-159 range when Windows-1252 IE submits to ISO-8859-1 page, because those characters will be submitted as BYTES (instead of entities) and TYPO3 won't do anything about it.

I think TYPO3 has no way of knowing the original OS charset used to paste content into text fields, so I guess we have no use for a mapping between Windows charsets and ISO charsets. On an ISO-8859-1 site, we just have to make sure the 128-159 chars aren't stored, as they are illegal in this charset, which means converting them to entities (in HTML fields) or "transliterating" them in text fields.

Do you know any other constellation where such illegal bytes in an ISO charset can be pasted by a Windows OS? I haven't heard of any, but maybe there are.

#11 Updated by Martin Kutschker over 12 years ago

AFAIK, IE honours the charset of the HTML page. That is, it translates all characters in the paste buffer (in whatever charset, as this is handled by the OS) that are not in the charset of the page to entities. The one exception is that it treats iso-8859-1 as windows-1252, with the mentioned result of byte values in the illegal range.

I don't know if IE treats iso-8859-2 as windows-1252 in that regard.

Anyway, we don't need to know the original charset of the paste buffer. If our charset is iso-8859-1, then the illegal byte range will contain windows-1252 characters. We have to test if this is true for other iso/windows charset pairs (eg iso-8859-2/windows-1252).

#12 Updated by Ernesto Baschny over 12 years ago

@Masi, exactly.

we already have a "mapping" table in TYPO3, using the "script_to_charset_unix/windows" and matching families:

west_european : iso-8859-1 : windows-1252
estonian : iso-8859-1 : windows-1257
east_european : iso-8859-2 : windows-1250
baltic : iso-8859-4 : windows-1257
cyrillic : iso-8859-5 : windows-1251
arabic : iso-8859-6 : windows-1256
greek : iso-8859-7 : windows-1253
hebrew : iso-8859-8 : windows-1255
turkish : iso-8859-9 : windows-1254
thai : iso-8859-11 : cp874
lithuanian : iso-8859-13 : windows-1257
chinese : gb2312 : gb2312
japanese : euc-jp : shift_jis
korean : euc-kr : cp949
simpl_chinese : gb2312 : gb2312
trad_chinese : big5 : big5
vietnamese : - : windows-1258
unicode : utf-8

In my limited research, we only have this "problem" with Latin-1, which is a "pure" subset of W-1252. None of the other ISO vs. Windows combinations form such a subset/superset pair, so we cannot simply "fill the gaps" of illegal ISO characters with entities or transliterations from the Windows character sets.

E.g. Latin-2 is not a subset of W-1250, so a W-1250 browser should do conversions to Latin-2 anyway, else many characters will be simply wrong. I cannot imagine, that this is the case.
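The subset claim can be checked mechanically. The following Python sketch compares how two charsets decode each byte from 160 upwards. The helper names decode_table and differing_bytes are hypothetical, and the codec names are those shipped with Python, not TYPO3 identifiers:

```python
def decode_table(charset: str) -> dict[int, str]:
    """Map each byte value to its character, skipping bytes the charset leaves undefined."""
    table = {}
    for byte in range(256):
        try:
            table[byte] = bytes([byte]).decode(charset)
        except UnicodeDecodeError:
            pass
    return table


def differing_bytes(iso: str, win: str, start: int = 0xA0) -> list[int]:
    """List byte values >= start that the two charsets decode differently."""
    iso_table, win_table = decode_table(iso), decode_table(win)
    return [b for b in range(start, 256) if iso_table.get(b) != win_table.get(b)]


print(differing_bytes("iso8859_1", "cp1252"))            # [] (identical above 159)
print(0xA1 in differing_bytes("iso8859_2", "cp1250"))    # True
```

An empty list means the Windows charset only adds characters in the 128-159 gap. Any non-empty list means the same byte stands for different characters, so gap-filling alone cannot work for that pair.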

So we should decide:

1) what kind of "fix" do we need, when trying to store illegal ISO-8859-1 chars?

a) should we convert those characters to entities?
b) should we transliterate those characters?

2) when will we apply the "fix"?

a) At the point TYPO3 will save the data into the database (tcemain)?
b) Or just when getting backend form submission with inappropriate data (alt_doc.php)?

This is the "translit" done by iconv (from windows-1252 -> iso-8859-1):

128: Euro sign: EUR
130: baseline single quote: ,
131: florin: ?
132: baseline double quote: ,,
133: ellipsis: ...
134: dagger: +
135: double dagger: ?
136: circumflex accent: ^
137: permile: ?
138: S Hacek: S
139: left single guillemet: <
140: OE ligature: OE
142: Z Hacek: Z
145: left single quote: '
146: right single quote: '
147: left double quote: "
148: right double quote: "
149: bullet: o
150: endash:
151: emdash: -
152: tilde accent: ~
153: trademark ligature: (TM)
154: s Hacek: s
155: right single guillemet: >
156: oe ligature: oe
158: z Hacek: z
159: Y Dieresis: Y

I doubt that this is what the users will want to see in their text-fields.

As for translating the illegal characters to entities, this is probably not what is expected in pure text fields, which might go through htmlspecialchars before rendering (e.g. page titles in menus), and which would then produce literal &#xxx; strings in browser output.
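The double-escaping effect is easy to reproduce. In this Python sketch, html.escape stands in for PHP's htmlspecialchars (an assumption for illustration), and render_title is a hypothetical helper:

```python
import html


def render_title(stored_title: str) -> str:
    """Escape a plain-text field for output, as a menu or <title> renderer would."""
    return html.escape(stored_title)


# A title whose EURO sign was already converted to an entity when it was saved:
print(render_title("Only &#8364; 5"))  # Only &amp;#8364; 5
```

The browser then displays the literal text "&#8364;" instead of the EURO sign, which is why storing entities is only safe for fields that are output as HTML.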

I guess we have yet another "simple" alternative: change the default charset from iso-8859-1 to windows-1252 in tslib_fe. Might freak out some Unix-guys, but I am sure any Unix-Client can handle that correctly by now and at least we will have valid pages with the characters that the user entered (€ = EURO).

BTW even mantis has windows-1252 as charset, which is probably why the € sign shows correctly. :)

#13 Updated by Ernesto Baschny over 12 years ago

Another "simple" alternative would be to translate illegal iso-8859-1 to corresponding entities only at render time, and keeping the iso-8859-1 charset. Which is more or less what my extension cron_latin1cleaner does.

#14 Updated by Martin Kutschker over 12 years ago

@ernesto, I'm aware that only windows-1252 is a real superset of an ISO code page. Still, we should test what IE really does in such situations.

As for transliteration: of course this is not what the user expects, but if you want to stay clean, you have to make a sacrifice. Or simply set forceCharset to windows-1252.

If you want to do the entity conversion during rendering, the question is when to do it (TS rendering, etc.). But the easiest and IMHO only practicable solution is to use the page postprocessing (which you already do).

But I fail to see the real problem. Who really cares that much about those characters? And why don't those people use one of the practical solutions available? 99.99% of the users don't care, so why should TYPO3 be burdened with yet another conversion?

#15 Updated by Andreas Wolf over 8 years ago

  • Category deleted (Communication)
  • Status changed from New to Needs Feedback
  • Target version deleted (0)
  • PHP Version deleted (4)

Is this still valid?

#16 Updated by Jigal van Hemert almost 7 years ago

  • Status changed from Needs Feedback to Closed

The whole core and the backend are now using UTF-8, so copy-paste conversion is done inside the browser and doesn't need any handling in the core itself anymore. Closed.
