Bug #34774: RemoveXSS doesn't work correctly on 16-bit HTML-entities - TYPO3 Core - TYPO3 Forge

Actions

Copy link

Bug #34774

closed

RemoveXSS doesn't work correctly on 16-bit HTML-entities

Added by Jan Bartels over 12 years ago. Updated over 8 years ago.

Status:

Rejected

Priority:

Should have

Assignee:

Category:

Miscellaneous

Target version:

Start date:

2012-03-12

Due date:

% Done:

Estimated time:

TYPO3 Version:

4.5

PHP Version:

5.3

Tags:

Complexity:

easy

Is Regression:

Sprint Focus:

Description

With the 16-bit utf-8 encoding of entities like [“”…–→]

(codes &#8220; &#8221; &#8211; &#8230; &#8594;)

and probably tons of other entities, the output of t3lib_div::removeXSS() becomes garbled: R20;, R21;, R11;, R30;, U94; and so on. I get those entities from importing RSS-feeds e. g. of Wordpress.

A possible solution would be to replace the hex-values with their according HTML-entities before testing and replacing the numeric values in RemoveXSS::process(). There may be better and faster solutions.

final class RemoveXSS {
...
public static function process($val, $replaceString = '<x>') {
...
// remove all non-printable characters. CR(0a) and LF(0b) and TAB(9) are allowed
// this prevents some character re-spacing such as <java\0script>
// note that you have to handle splits with \n, \r, and \t later since they *are* allowed in some inputs
$val = preg_replace('/([\x00-\x08][\x0b-\x0c][\x0e-\x19])/', '', $val);

//------patch begin-------
        $search = array();       $replace = array();
        $search[] = '/&#8211;/'; $replace[] = '&ndash;';
        $search[] = '/&#8220;/'; $replace[] = '&ldquo;';
        $search[] = '/&#8221;/'; $replace[] = '&rdquo;';
        $search[] = '/&#8230;/'; $replace[] = '&hellip;';
        $search[] = '/&#8594;/'; $replace[] = '&rarr;';
//              ... plus several other entities 
        $val = preg_replace( $search, $replace, $val );
//------patch end-------

        // straight replacements, the user should never need these since they're normal characters
        // this prevents like <IMG SRC=&#X40&#X61&#X76&#X61&#X73&#X63&#X72&#X69&#X70&#X74&#X3A&#X61&#X6C&#X65&#X72&#X74&#X28&#X27&#X58&#X53&#X53&#X27&#X29>

I'm not sure if 16-bit values have to be replaced in RemoveXSS::process() at all. If they don't have to be, they should be skipped completely when analyzing the numeric entities. In this case the questionmark in the search expressions can be deleted (...7c|7d|7e);?/ie';) so that only values less equal to 0x7E are matched:

        $searchHexEncodings = '/&#[xX]0{0,8}(21|22|23|...|7d|7e);/ie';
        $searchUnicodeEncodings = '/&#0{0,8}(33|34|35|...|125|126);/ie';

This suggestion should be checked by some experts on XSS.

Actions

Copy link

Updated by Jigal van Hemert over 12 years ago

Category set to Miscellaneous
Status changed from New to Accepted
Complexity set to easy

Because the trailing semi-colon has to be optional, the first part of the unicode entity is seen as lower ascii character. I think it's enough to add an assertion to make sure no valid (hexa)decimal characters follow the entity number.

Actions

Copy link