Bug #15287: Illegal SGML characters in output - TYPO3 Core - TYPO3 Forge

Added by Ernesto Baschny almost 19 years ago. Updated almost 12 years ago.

Status: Closed
Priority: Should have
Start date: 2005-12-15
TYPO3 Version: 4.0

Description

Hi,

the "non SGML character number 128" is probably the most annoying validation error that TYPO3 sites hit when users from the Windows world copy and paste input into a field whose content goes straight through to the frontend.

THE PROBLEM
---------------

The problem stems from the fact that the ISO-Latin-1 character table specifies characters for the decimal range 32 up to 255, but has a gap in the range from 128 to 159 (see [1]). Microsoft (mis?)uses this range in the so-called "Windows-Latin-1" for various characters. The most frequent ones are the Euro sign, the em dash ("langer Gedankenstrich" in German, which MS Word creates automatically if you type a hyphen with spaces around it) and the low (bottom) opening double quote (also created by Word in German when you start a quotation).

So outputting these characters for the Web in "charset=iso-8859-1" mode is not valid, because they are not part of this charset (which is also why the W3C validator chokes on them). The very good article in [2] presents some alternatives for outputting them in a generic way.
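To make the gap concrete, here is a minimal Python sketch (not TYPO3 code; the helper name `describe_high_bytes` is made up for illustration). It decodes the same byte once as ISO-8859-1 and once as Windows-1252 and reports what each charset thinks it is.

```python
import unicodedata

def describe_high_bytes(byte_values):
    """For each byte, return (value, Unicode category when read as
    ISO-8859-1, Unicode name when read as Windows-1252)."""
    rows = []
    for value in byte_values:
        raw = bytes([value])
        as_latin1 = raw.decode("iso-8859-1")   # 128-159 become C1 control codes
        as_cp1252 = raw.decode("cp1252")       # same bytes are printable here
        rows.append((value,
                     unicodedata.category(as_latin1),
                     unicodedata.name(as_cp1252)))
    return rows

# 0x80, 0x84 and 0x97 are the three characters mentioned above.
rows = describe_high_bytes([0x80, 0x84, 0x97])
for row in rows:
    print(row)
```

Category "Cc" means "control character", which is why a validator rejects these bytes in a page declared as iso-8859-1.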

SOME TYPO3 SOLUTIONS
------------------------

Some time ago I wrote "cron_rte_cleanenc", which remaps those characters from the RTE into proper numerical entities (which is what the article [2] suggests as the most widely used method). This is nice, but later I figured out that these characters can also be pasted into fields that are not RTE-enabled (e.g. Title, Subtitle, etc.), so my processing only works in some cases.

Later versions of qcom_htmlcleaner include the switch "Remap illegal chars" (clean_chars), which translates any "high ASCII" character to a proper entity. I see two problems with the current approach:

1. it only applies to XHTML_clean(), while the problem also exists in
HTML mode.
2. it translates all characters >127 into entities, which is not
needed. The range 128-159 is sufficient here, as Ä can be
represented by a proper ISO-Latin-1 character already.
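A narrower remapping that touches only the range 128-159 could look roughly like the following Python sketch. It is not the code of qcom_htmlcleaner or cron_rte_cleanenc; the function name `remap_windows_chars` is hypothetical, and dropping the five byte values that Windows-1252 leaves undefined is an assumption of this sketch, not something decided in this report.

```python
def remap_windows_chars(data: bytes) -> bytes:
    """Replace bytes 128-159 with numeric character references,
    leaving everything else (including 160-255) untouched."""
    out = bytearray()
    for value in data:
        if 0x80 <= value <= 0x9F:
            try:
                char = bytes([value]).decode("cp1252")
            except UnicodeDecodeError:
                # 0x81, 0x8D, 0x8F, 0x90 and 0x9D have no Windows-1252
                # meaning; this sketch simply drops them (assumption).
                continue
            out += b"&#%d;" % ord(char)
        else:
            out.append(value)
    return bytes(out)

# Euro sign and a dash get remapped, the A-umlaut byte 0xC4 stays as it is.
result = remap_windows_chars(b"5 \x80 \x96 \xc4")
print(result)
```

Because the output contains only ASCII for the remapped characters, the same function would serve both XHTML and HTML output.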

MY GOAL/AIM
--------------

I want this translation to happen in the TYPO3 core, without needing any extension. Our goal has been XHTML validity, and this is a major issue for that commitment. This is not an "xhtml_cleaning" problem, but a generic charset problem. We have proven solutions to the problem; we just need to see if they are generic enough not to hurt and add them to the core in a meaningful way.

HOW TO PROCEED
-----------------

We need to find out in which character sets this is a problem. If I set my site to "forceCharSet=utf-8", the problem doesn't exist, because all pasted input is stored as the corresponding UTF-8 sequences, which are valid. So maybe some charset expert around could tell us a bit about it, and if no one is available, I would do some research on it. I suspect every ISO-Latin-x variant has this problem.
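As a starting point for that research, a small Python sketch can probe which single-byte charsets treat 128-159 purely as control codes. It relies on Python's bundled codec tables as a stand-in for the real charset definitions (an assumption), and `has_c1_gap` is a made-up helper name.

```python
import unicodedata

# ISO-8859 parts as named by Python's codecs (there is no part 12).
CANDIDATES = ["iso8859_%d" % n
              for n in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16)]

def has_c1_gap(codec: str) -> bool:
    """True if every byte in 128-159 decodes to a control character."""
    for value in range(0x80, 0xA0):
        # Undefined bytes become U+FFFD, which is not a control character.
        char = bytes([value]).decode(codec, errors="replace")
        if unicodedata.category(char) != "Cc":
            return False
    return True

report = {codec: has_c1_gap(codec) for codec in CANDIDATES + ["cp1252"]}
for codec, gap in report.items():
    print(codec, gap)

# In UTF-8 the Euro sign is an ordinary multi-byte sequence, so no gap applies.
euro_utf8 = "\u20ac".encode("utf-8")
```

Charsets reported as True would need the same remapping as ISO-8859-1; the result is only as good as the codec tables it is based on.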

Then we need to create some patches to correct the situation.

[1] http://www.htmlhelp.com/reference/charset/
[2] http://www.cs.tut.fi/~jkorpela/www/windows-chars.html

(issue imported from #M2048)

Updated by Ernesto Baschny almost 19 years ago

Updated by Michael Stucki almost 19 years ago

Updated by Sebastian Kurfuerst almost 19 years ago

Updated by Michael Stucki almost 19 years ago

Updated by Ernesto Baschny almost 19 years ago

Updated by Fronzes Philippe over 17 years ago

Updated by Ernesto Baschny over 17 years ago

Updated by Ernesto Baschny over 17 years ago

Updated by Martin Kutschker over 17 years ago

Updated by Ernesto Baschny over 17 years ago

Updated by Martin Kutschker over 17 years ago

Updated by Ernesto Baschny over 17 years ago

Updated by Ernesto Baschny over 17 years ago

Updated by Martin Kutschker over 17 years ago

Updated by Andreas Wolf about 13 years ago

Updated by Jigal van Hemert almost 12 years ago