Bug #20866

closed

removeBadHTML is returning an empty string

Added by Thomas Gabler over 14 years ago. Updated over 8 years ago.

Status:
Closed
Priority:
Should have
Assignee:
-
Category:
-
Target version:
Start date:
2009-08-12
Due date:
% Done:

0%

Estimated time:
TYPO3 Version:
4.2
PHP Version:
5.3
Tags:
Complexity:
Is Regression:
No
Sprint Focus:

Description

The function removeBadHTML returns an empty string if the text to check contains special UTF-8-encoded characters.

We have to check whether preg_replace returns NULL (not an empty string). I assume this has to do with the "\w" pattern depending on locale settings.

This small PHP script shows the problem: preg_replace returns the error value NULL.

<?php
$text = '<br />
Ort: Neumarkt, Bildungshaus Kloster St. Josef (Wildbad 1)<br />
Referentin: Prof. Dr. Irmgard Schroll-Decker (Dr. phil., Dipl. Päd.), Lehrstuhl für Sozialmanagement und Bildungsarbeit an der FH Regensburg).<br />
Veranstalter: Sachausschuss kirchliche Seniorenarbeit Bistum Eichstätt im Diözesanrat der Katholiken im Bistum Eichstätt in Zusammenarbeit mit dem KEB-DiBW, Bereich Altenbildung.<br />
Kosten: 5 € (inkl. Mittagessen).<br />
<br />
Als Sachausschuss Kirchliche Seniorenarbeit des Diözesanrats sehen wir unsere Aufgabe vorrangig darin, Sie in Ihrem Einsatz für die älteren Menschen und für das Miteinander der Generationen in den verschiedensten Bereichen kirchlicher Seniorenarbeit, in Pfarrgemeinden und Verbänden zu unterstützen und Sie darin zu bestärken.<br />
<br />
Deshalb laden wir Sie herzlich ein zum vierten Diözesantag für ehrenamtliche Mitarbeiterinnen und Mitarbeiter in der Seniorenarbeit.<br />
Unter dem Motto "Alt und Jung auf der Spur. Leben im Miteinander der Generationen!" erhalten Sie an diesem Tag die Möglichkeit zum Auftanken und zum Erfahrungsaustausch. Sie bekommen Impulse für Ihre eigene Lebensgestaltung und natürlich für Ihre alltägliche Arbeit mit Älteren:<br />

In einem der 8 Schnupperangebote können Sie praktische Möglichkeiten kennenlernen, wie als und mit älteren Menschen das "Miteinander der Generationen" gestaltet werden kann.<br />
Im Gottesdienst feiern wir schließlich, dass von Gott her das "Miteinander der Generationen" bereits geschenkt ist.<br />
<br />
Wir würden uns freuen, wenn Sie / Ihr Team sich an diesem Tag auf "Spurensuche" begeben und am Diözesantag teilnehmen!';

$pattern = "'<\w+.*?(onabort|onbeforeunload|onblur|onchange|onclick|ondblclick|ondragdrop|onerror|onfilterchange|onfocus|onhelp|onkeydown|onkeypress|onkeyup|onload|onmousedown|onmousemove|onmouseout|onmouseover|onmouseup|onmove|onreadystatechange|onreset|onresize|onscroll|onselect|onselectstart|onsubmit|onunload).*?>'si";

$result = preg_replace($pattern,'',$text);
var_dump($result); // NULL
?>
(issue imported from #M11693)
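
The NULL check asked for in the description could look roughly like the following. This is only a minimal sketch, assuming PHP 7 or later; the helper name replaceOrFail is hypothetical and not part of TYPO3. It throws instead of returning the input unchanged, because handing back unfiltered HTML would defeat the purpose of removeBadHTML.

```php
<?php
// Hypothetical helper (not part of TYPO3): wraps preg_replace() and
// turns a NULL result into an exception instead of passing it on.
function replaceOrFail(string $pattern, string $replacement, string $text): string
{
    $result = preg_replace($pattern, $replacement, $text);
    if ($result === null) {
        // preg_last_error() returns the PCRE error code of the last call,
        // e.g. PREG_BACKTRACK_LIMIT_ERROR or PREG_BAD_UTF8_ERROR.
        throw new RuntimeException('preg_replace failed, PCRE error code ' . preg_last_error());
    }
    return $result;
}

// Usage with a shortened event list (illustrative only).
$clean = replaceOrFail('~<\w+.*?(onclick|onload).*?>~si', '', '<a onclick="x()">hi</a>');
```

The caller can then decide whether to log the failure, show an error, or retry with smaller patterns.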


Files

patch.patch (608 Bytes), Administrator Admin, 2009-08-12 09:45
replacetest.php (18.2 KB), php-file to verify PHP error, Bernd Wilke, 2012-05-09 17:42
class.tslib_content.php.patch (7.79 KB), Same changes as patch file, Christoph Holtermann, 2014-02-16 12:06
Actions #1

Updated by Bernd Wilke almost 12 years ago

The bug still exists in 4.5.15.

It seems to be a bug in PHP (tested version: 5.2.13), but a workaround is possible: split the last regexp pattern, which has many alternatives, into multiple patterns with fewer alternatives each.

Playing around with the data, the error seems to depend on the length of the input: the longer the input, the fewer alternatives are possible in one replacement pattern.

The originally attached patch file is the wrong one.

Attached: PHP file to verify the PHP error.
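
The splitting workaround can be sketched as follows. This is a hedged illustration only, assuming PHP 7 or later: the function name buildEventPatterns, the chunk size, and the shortened event list are all made up for the example and are not taken from the TYPO3 core.

```php
<?php
// Hypothetical helper: builds several small patterns from the event list,
// each with at most $chunkSize alternatives, instead of one large pattern.
function buildEventPatterns(array $events, int $chunkSize): array
{
    $patterns = [];
    foreach (array_chunk($events, $chunkSize) as $chunk) {
        // Quote each event name so it is always treated as literal text.
        $quoted = array_map(function ($event) {
            return preg_quote($event, '~');
        }, $chunk);
        $patterns[] = '~<\w+.*?(' . implode('|', $quoted) . ').*?>~si';
    }
    return $patterns;
}

// Shortened event list, for illustration only.
$events = ['onabort', 'onblur', 'onchange', 'onclick', 'onload'];
$patterns = buildEventPatterns($events, 2);

// preg_replace() applies an array of patterns one after another.
$clean = preg_replace($patterns, '', '<p onclick="x()">text</p><br />');
```

A smaller chunk size means more passes over the text, but each pass has fewer alternatives to try.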

Actions #2

Updated by Bernd Wilke almost 12 years ago

The PHP online manual has hints about preg_replace failures:
http://php.net/manual/en/function.preg-replace.php

Search the page for 'backtrack_limit' to find:
----
timitheenchanter 28-Jul-2011 01:34
If you have issues where preg_replace returns an empty string, please take a look at these two ini parameters:

pcre.backtrack_limit
pcre.recursion_limit

The default is set to 100K. If your buffer is larger than this, look to increase these two values.
----
AmigoJack 10-Jul-2010 06:13
If you're using preg_replace() on huge strings you have to be aware of PREG's limitations. In fact, after each preg_xxx() function you should check if PREG internally failed (and by "failure" I don't mean regexp syntax errors).

On default PHP installations you will run into problems when using preg_xxx() functions on strings with a length of more than 100'000 characters. To workaround rare occasions you can use this:

<?php
    $iSet= 0;  // Count how many times we increase the limit
    while( $iSet< 10 ) {  // If the default limit is 100'000 characters the highest new limit will be 250'000 characters
        $sNewText= preg_replace( $sRegExpPattern, $sRegExpReplacement, $sVeryLongText );  // Try to use PREG

        if( preg_last_error()== PREG_BACKTRACK_LIMIT_ERROR ) {  // Only check on backtrack limit failure
            ini_set( 'pcre.backtrack_limit', (int)ini_get( 'pcre.backtrack_limit' )+ 15000 );  // Get current limit and increase
            $iSet++;  // Do not overkill the server
        } else {  // No fail
            $sVeryLongText= $sNewText;  // On failure $sNewText would be NULL
            break;  // Exit loop
        }
    }
?>

However, be careful: 1.) ini_set() may be forbidden on your server; 2.) preg_last_error() does not exist prior to PHP 5.2.0; 3.) setting a backtrack limit too high may crash PHP (not only the script currently executed). So if you work a lot with long strings you definitely have to look out for a real solution!
----

You may reach the limits if you
a) use long strings
b) use very complex regular expressions (as in the core function removeBadHTML())
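
If raising pcre.backtrack_limit is considered at all, it may be safer to do so only for one call and restore the old value afterwards. The following is a rough sketch under that assumption, for PHP 7 or later; the helper name withBacktrackLimit and the value 250000 are hypothetical, and ini_set() may be disabled on some servers.

```php
<?php
// Hypothetical helper: runs $callback with a temporary pcre.backtrack_limit
// and restores the previous value afterwards, even if $callback throws.
function withBacktrackLimit(int $limit, callable $callback)
{
    $previous = ini_get('pcre.backtrack_limit');
    // ini_set() returns false when the setting could not be changed.
    $changed = ini_set('pcre.backtrack_limit', (string) $limit) !== false;
    try {
        return $callback();
    } finally {
        if ($changed) {
            ini_set('pcre.backtrack_limit', $previous);
        }
    }
}

// Usage: the limit 250000 is an arbitrary example value.
$before = ini_get('pcre.backtrack_limit');
$result = withBacktrackLimit(250000, function () {
    return preg_replace('~<script[^>]*?>~si', '', '<script>alert(1)');
});
```

The result still has to be checked for NULL, since a higher limit only makes the failure less likely.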

Actions #3

Updated by Christoph Holtermann almost 11 years ago

This is my current working version. I split the regex apart as suggested and added some error reporting. It's more of a debug version.
I thought it would be useful if the returned string were not just empty but contained the error message.
The message, with some details, is also forwarded to devLog.

a) I didn't take TYPO3's debug and error reporting settings into account in this solution; this should still be done
b) I didn't really look at the regexes, I just split them; this may be optimizable
c) I used my own loop instead of the existing one to be able to report which regex failed

        /**
         * Function for removing malicious HTML code when you want to provide some HTML code user-editable.
         * The purpose is to avoid XSS attacks and the code will be continuously modified to remove such code.
         * For a complete reference with javascript-on-events, see http://www.wdvl.com/Authoring/JavaScript/Events/events_target.html
         *
         * @param       string          Input string to be cleaned.
         * @param       array           TypoScript configuration.
         * @return      string          Return string
         * @author      Thomas Bley (all from moregroupware cvs code / readmessage.inc.php, published under gpl by Thomas)
         * @author      Kasper Skårhøj
         * @author      Christoph Holtermann
         */
        function removeBadHTML($text, $conf) {

                // Copyright 2002-2003 Thomas Bley

                $preg_replace_array = array(
                                "'<script[^>]*?>.*?</script[^>]*?>'si", 
                                "'<applet[^>]*?>.*?</applet[^>]*?>'si", 
                                "'<object[^>]*?>.*?</object[^>]*?>'si", 
                                "'<iframe[^>]*?>.*?</iframe[^>]*?>'si", 
                                "'<frameset[^>]*?>.*?</frameset[^>]*?>'si", 
                                "'<style[^>]*?>.*?</style[^>]*?>'si", 
                                "'<marquee[^>]*?>.*?</marquee[^>]*?>'si", 
                                "'<script[^>]*?>'si", 
                                "'<meta[^>]*?>'si", 
                                "'<base[^>]*?>'si", 
                                "'<applet[^>]*?>'si", 
                                "'<object[^>]*?>'si", 
                                "'<link[^>]*?>'si", 
                                "'<iframe[^>]*?>'si", 
                                "'<frame[^>]*?>'si", 
                                "'<frameset[^>]*?>'si", 
                                "'<input[^>]*?>'si", 
                                "'<form[^>]*?>'si", 
                                "'<embed[^>]*?>'si", 
                                "'background-image:url'si", 
                                "'<\w+.*?(onabort).*?>'si",
                                "'<\w+.*?(onbeforeunload).*?>'si",
                                "'<\w+.*?(onblur).*?>'si",
                                "'<\w+.*?(onchange).*?>'si",
                                "'<\w+.*?(onclick).*?>'si",
                                "'<\w+.*?(ondblclick).*?>'si",
                                "'<\w+.*?(ondragdrop).*?>'si",
                                "'<\w+.*?(onerror).*?>'si",
                                "'<\w+.*?(onfilterchange).*?>'si",
                                "'<\w+.*?(onfocus).*?>'si",
                                "'<\w+.*?(onhelp).*?>'si",
                                "'<\w+.*?(onkeydown).*?>'si",
                                "'<\w+.*?(onkeypress).*?>'si",
                                "'<\w+.*?(onkeyup).*?>'si",
                                "'<\w+.*?(onload).*?>'si",
                                "'<\w+.*?(onmousedown).*?>'si",
                                "'<\w+.*?(onmousemove).*?>'si",
                                "'<\w+.*?(onmouseout).*?>'si",
                                "'<\w+.*?(onmouseover).*?>'si",
                                "'<\w+.*?(onmouseup).*?>'si",
                                "'<\w+.*?(onmove).*?>'si",
                                "'<\w+.*?(onreadystatechange).*?>'si",
                                "'<\w+.*?(onreset).*?>'si",
                                "'<\w+.*?(onresize).*?>'si",
                                "'<\w+.*?(onscroll).*?>'si",
                                "'<\w+.*?(onselect).*?>'si",
                                "'<\w+.*?(onselectstart).*?>'si",
                                "'<\w+.*?(onsubmit).*?>'si",
                                "'<\w+.*?(onunload).*?>'si" );

                foreach ($preg_replace_array as $preg_expression)
                {
                        // Keep the input in $text so it can still be logged if the replacement fails
                        $result = preg_replace($preg_expression, "", $text);
                        if( preg_last_error()== PREG_BACKTRACK_LIMIT_ERROR ) {
                                t3lib_div::devLog('removeBadHTML: Memory exceeded. ', $this->extKey, 2, array($preg_expression, $text));
                                return "removeBadHTML memory exceeded.";
                        }
                        $text = $result;
                }

                $text = preg_replace(
                        '/<a[^>]*href[[:space:]]*=[[:space:]]*["\']?[[:space:]]*javascript[^>]*/i',
                        '',
                        $text
                );

                if( preg_last_error()== PREG_BACKTRACK_LIMIT_ERROR ) {
                        t3lib_div::devLog('removeBadHTML: Memory exceeded. ', $this->extKey, 2);
                        return "removeBadHTML memory exceeded.";
                }

                // Return clean content
                return $text;
        }

Actions #5

Updated by Mathias Schreiber about 9 years ago

  • Description updated (diff)
  • Target version changed from 4.5.16 to 7.2 (Frontend)
  • Is Regression set to No
  • Sprint Focus set to On Location Sprint
Actions #6

Updated by Jan Helke about 9 years ago

Is this still an issue? I tested the snippet given above as well as the replacetest.php script. Both produced correct output.

My environment is PHP 5.5.9. As TYPO3 requires at least PHP 5.5.0, I think we can close this.

Actions #7

Updated by Jan Helke about 9 years ago

  • Status changed from New to Closed

As this no longer seems to be an issue, I am closing it. If you have objections, please provide more information on how to reproduce it in a current environment.

Actions #8

Updated by Anja Leichsenring over 8 years ago

  • Sprint Focus deleted (On Location Sprint)