Project

General

Profile

Actions

Bug #15287

closed

Illegal SGML characters in output

Added by Ernesto Baschny almost 19 years ago. Updated almost 12 years ago.

Status:
Closed
Priority:
Should have
Assignee:
-
Category:
-
Target version:
-
Start date:
2005-12-15
Due date:
% Done:

0%

Estimated time:
TYPO3 Version:
4.0
PHP Version:
Tags:
Complexity:
Is Regression:
Sprint Focus:

Description

Hi,

the "non SGML character number 128" is probably the most annoying validation error that TYPO3-sites hit when users from the Windows world copy&paste input some field which will go right through to the frontend.

THE PROBLEM
---------------

The origin of the problem comes from the fact that the ISO-Latin-1 character table specifies every character from the decimal range 32 up to 255, but has a gap in the range from 128 to 159 (see [1]). This range is (mis?)used by Microsoft in the so called "Windows-Latin-1" for various characters. The most frequently chars are the EURO-sign, the emdash ("langer Gedankenstrich", which MS-Word creates automatically if you type an hyphen with spaces around it) and opening-double-quotes (bottom) (also created by Word in German if you start some quotation).

So outputting these characters for the Web in "charset=iso-8859-1" mode is not "valid", because they are not part of this charset (which is also why the W3C-validator chokes on them). The very good article in [2] present some alternatives on how to output them in a generic way.

SOME TYPO3 SOLUTIONS
------------------------

Some time in the past I've written "cron_rte_cleanenc", which will remap those characters from the RTE into proper numerical entities (which is what the article [2] suggests as the most widely used method). This is nice, but later I figured out that these characters can also be pasted into fields that are not RTE-enabled (e.g. Title, Subtitle, etc), so my processing also works on some cases.

Later versions of qcom_htmlcleaner include the switch "Remap illegal chars" (clean_chars), which will translation any "high ASCII" character to a proper entity. Two problems I see with the current approach:

1. it only applies to XHTML_clean(), while the problem also exists in
HTML mode.
2. it translates all characters >127 into entities, which is not
needed. The range 128-159 is sufficient here, as Ä can be
represented by a proper ISO-Latin-1 character already.

MY GOAL/AIM
--------------

I want this translation to happen in TYPO3-core, without needing any extention. Our goal has been XHTML-validity, and this is a major issue in this commitment. This is not a "xhtml_cleaning" problem, but a generic charset problem. We have proven solutions to the problem, we just need to see if they are generic enough not to hurt and add them in a meaningful way to the core.

HOW TO PROCEED
-----------------

We need to find out in which character sets this is a problem. If I set my site to "forceCharSet=utf-8", the problem doesn't exist, because all pasted input will have corresponding UTF-8 entities which are valid. So maybe some charset expert around could tell us a bit about it, and if noone is available, I would do some research on it. I suspect every ISO-Latin-x variant has this problem.

Then we need to create some patches to correct the situation.

[1] http://www.htmlhelp.com/reference/charset/
[2] http://www.cs.tut.fi/~jkorpela/www/windows-chars.html

(issue imported from #M2048)

Actions

Also available in: Atom PDF