Bug #20866
Closed
removeBadHTML is returning an empty string
Description
The function removeBadHTML returns an empty string if the text to check contains special UTF-8-encoded characters.
We have to check whether preg_replace returns NULL, which signals an error and is not the same as an empty string. I assume the failure has to do with the "\w" pattern, whose meaning depends on locale settings.
This little PHP script shows the problem. The preg_replace function returns the error value NULL.
<?php
$text = '<br />
Ort: Neumarkt, Bildungshaus Kloster St. Josef (Wildbad 1)<br />
Referentin: Prof. Dr. Irmgard Schroll-Decker (Dr. phil., Dipl. Päd.), Lehrstuhl für Sozialmanagement und Bildungsarbeit an der FH Regensburg).<br />
Veranstalter: Sachausschuss kirchliche Seniorenarbeit Bistum Eichstätt im Diözesanrat der Katholiken im Bistum Eichstätt in Zusammenarbeit mit dem KEB-DiBW, Bereich Altenbildung.<br />
Kosten: 5 € (inkl. Mittagessen).<br />
<br />
Als Sachausschuss Kirchliche Seniorenarbeit des Diözesanrats sehen wir unsere Aufgabe vorrangig darin, Sie in Ihrem Einsatz für die älteren Menschen und für das Miteinander der Generationen in den verschiedensten Bereichen kirchlicher Seniorenarbeit, in Pfarrgemeinden und Verbänden zu unterstützen und Sie darin zu bestärken.<br />
<br />
Deshalb laden wir Sie herzlich ein zum vierten Diözesantag für ehrenamtliche Mitarbeiterinnen und Mitarbeiter in der Seniorenarbeit.<br />
Unter dem Motto "Alt und Jung auf der Spur. Leben im Miteinander der Generationen!" erhalten Sie an diesem Tag die Möglichkeit zum Auftanken und zum Erfahrungsaustausch. Sie bekommen Impulse für Ihre eigene Lebensgestaltung und natürlich für Ihre alltägliche Arbeit mit Älteren:<br />
In einem der 8 Schnupperangebote können Sie praktische Möglichkeiten kennenlernen, wie als und mit älteren Menschen das "Miteinander der Generationen" gestaltet werden kann.<br />
Im Gottesdienst feiern wir schließlich, dass von Gott her das "Miteinander der Generationen" bereits geschenkt ist.<br />
<br />
Wir würden uns freuen, wenn Sie / Ihr Team sich an diesem Tag auf "Spurensuche" begeben und am Diözesantag teilnehmen!';
$pattern = "'<\w+.*?(onabort|onbeforeunload|onblur|onchange|onclick|ondblclick|ondragdrop|onerror|onfilterchange|onfocus|onhelp|onkeydown|onkeypress|onkeyup|onload|onmousedown|onmousemove|onmouseout|onmouseover|onmouseup|onmove|onreadystatechange|onreset|onresize|onscroll|onselect|onselectstart|onsubmit|onunload).*?>'si";
$result = preg_replace($pattern,'',$text);
var_dump($result); // NULL
?>
(issue imported from #M11693)
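The NULL check called for in the description could look roughly like the sketch below. The helper names describePregError() and checkedReplace() are hypothetical and not part of TYPO3. The invalid UTF-8 input combined with the /u modifier is used only because it provokes a NULL result reliably; it is not claimed to be the cause of this bug, since the reported pattern does not use /u.

```php
<?php
// Hypothetical helper (not TYPO3 API): map a preg_last_error() code to its constant name.
function describePregError($code) {
    $names = array(
        PREG_NO_ERROR              => 'PREG_NO_ERROR',
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
    );
    return isset($names[$code]) ? $names[$code] : 'unknown PCRE error ' . $code;
}

// Hypothetical wrapper: returns array(result, errorName).
// On success errorName is null; on failure result is null.
function checkedReplace($pattern, $replacement, $subject) {
    $result = preg_replace($pattern, $replacement, $subject);
    if ($result === null) {
        // NULL means PCRE failed; an empty string would be a valid result.
        return array(null, describePregError(preg_last_error()));
    }
    return array($result, null);
}

// A legitimate empty result: $clean === '' and $error === null.
list($clean, $error) = checkedReplace('/.+/s', '', 'abc');

// A forced failure: $clean === null and $error === 'PREG_BAD_UTF8_ERROR'.
list($clean, $error) = checkedReplace('/a/u', '', "\xff");
```

Comparing with === null is the important part: a loose comparison would treat the error value and a legitimately empty result as the same thing.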
Updated by Bernd Wilke over 12 years ago
- File replacetest.php added
- Target version changed from 0 to 4.5.16
The bug still exists in 4.5.15.
It seems to be a bug in PHP (tested version: 5.2.13), but a workaround is possible: split the last regexp pattern, which has many alternatives, into multiple patterns with fewer alternatives each.
Playing around with the data, it seems that the error depends on the length of the input: the longer the input, the fewer alternatives are possible in one replacement pattern.
The originally attached patch file was the wrong one.
Attached: a PHP file to verify the PHP error.
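The workaround described in this comment, one pattern per event name instead of one large alternation, can be generated from a list rather than written out by hand. This is a minimal sketch with a shortened event list; $events, $patterns and stripEventTags() are illustrative names, not TYPO3 API.

```php
<?php
// Shortened, illustrative list; the reported pattern contains more event names.
$events = array('onabort', 'onblur', 'onclick', 'onload', 'onmouseover');

// Build one pattern per event name, keeping the shape of the reported pattern.
$patterns = array();
foreach ($events as $event) {
    $patterns[] = "'<\\w+.*?(" . preg_quote($event, "'") . ").*?>'si";
}

// Apply the patterns one by one. Returns null as soon as PCRE fails,
// so the caller can tell an error apart from an empty result.
function stripEventTags($text, array $patterns) {
    foreach ($patterns as $pattern) {
        $result = preg_replace($pattern, '', $text);
        if ($result === null) {
            return null;
        }
        $text = $result;
    }
    return $text;
}

// The tag carrying the onclick handler is removed: $clean === 'go</a>'.
$clean = stripEventTags('<a onclick="x()">go</a>', $patterns);
```

Running the patterns in a loop also makes it possible to tell which pattern failed, at the cost of one pass over the text per event name.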
Updated by Bernd Wilke over 12 years ago
In the PHP online manual you can find hints about preg_replace failures:
http://php.net/manual/en/function.preg-replace.php
Search for 'backtrack_limit' on that page to find:
----
timitheenchanter 28-Jul-2011 01:34
If you have issues where preg_replace returns an empty string, please take a look at these two ini parameters:
pcre.backtrack_limit
pcre.recursion_limit
The default is set to 100K. If your buffer is larger than this, look to increase these two values.
----
AmigoJack 10-Jul-2010 06:13
If you're using preg_replace() on huge strings you have to be aware of PREG's limitations. In fact, after each preg_xxx() function you should check if PREG internally failed (and by "failure" I don't mean regexp syntax errors).
On default PHP installations you will run into problems when using preg_xxx() functions on strings with a length of more than 100'000 characters. To workaround rare occasions you can use this:
<?php
$iSet = 0; // Count how many times we increase the limit
while ($iSet < 10) { // If the default limit is 100'000 characters the highest new limit will be 250'000 characters
    $sNewText = preg_replace($sRegExpPattern, $sRegExpReplacement, $sVeryLongText); // Try to use PREG
    if (preg_last_error() == PREG_BACKTRACK_LIMIT_ERROR) { // Only check on backtrack limit failure
        ini_set('pcre.backtrack_limit', (int)ini_get('pcre.backtrack_limit') + 15000); // Get current limit and increase
        $iSet++; // Do not overkill the server
    } else { // No fail
        $sVeryLongText = $sNewText; // On failure $sNewText would be NULL
        break; // Exit loop
    }
}
?>
However, be careful: 1.) ini_set() may be forbidden on your server; 2.) preg_last_error() does not exist prior to PHP 5.2.0; 3.) setting a backtrack limit too high may crash PHP (not only the script currently executed). So if you work a lot with long strings you definitely have to look out for a real solution!
----
You may reach these limits if you
a) use long strings
b) use very complex regular expressions (as in the core function removeBadHTML())
Updated by Christoph Holtermann over 11 years ago
My current working version follows. I split the regex apart as suggested and added some error reporting. It's more of a debug version.
I thought it would be useful if the returned string were not just empty but contained the error message.
The message, with some details, is also forwarded to devLog.
a) I didn't take TYPO3's debug and error reporting settings into account in this solution; that still needs to be done
b) I didn't really look at the regexes, I just split them; they may be optimizable
c) I used my own loop instead of the existing one to be able to report which regex failed
/**
* Function for removing malicious HTML code when you want to provide some HTML code user-editable.
* The purpose is to avoid XSS attacks and the code will be continuously modified to remove such code.
* For a complete reference with javascript-on-events, see http://www.wdvl.com/Authoring/JavaScript/Events/events_target.html
*
* @param string Input string to be cleaned.
* @param array TypoScript configuration.
* @return string Return string
* @author Thomas Bley (all from moregroupware cvs code / readmessage.inc.php, published under gpl by Thomas)
* @author Kasper Skårhøj
* @author Christoph Holtermann
*/
function removeBadHTML($text, $conf) {
// Copyright 2002-2003 Thomas Bley
$preg_replace_array = array(
"'<script[^>]*?>.*?</script[^>]*?>'si",
"'<applet[^>]*?>.*?</applet[^>]*?>'si",
"'<object[^>]*?>.*?</object[^>]*?>'si",
"'<iframe[^>]*?>.*?</iframe[^>]*?>'si",
"'<frameset[^>]*?>.*?</frameset[^>]*?>'si",
"'<style[^>]*?>.*?</style[^>]*?>'si",
"'<marquee[^>]*?>.*?</marquee[^>]*?>'si",
"'<script[^>]*?>'si",
"'<meta[^>]*?>'si",
"'<base[^>]*?>'si",
"'<applet[^>]*?>'si",
"'<object[^>]*?>'si",
"'<link[^>]*?>'si",
"'<iframe[^>]*?>'si",
"'<frame[^>]*?>'si",
"'<frameset[^>]*?>'si",
"'<input[^>]*?>'si",
"'<form[^>]*?>'si",
"'<embed[^>]*?>'si",
"'background-image:url'si",
"'<\w+.*?(onabort).*?>'si",
"'<\w+.*?(onbeforeunload).*?>'si",
"'<\w+.*?(onblur).*?>'si",
"'<\w+.*?(onchange).*?>'si",
"'<\w+.*?(onclick).*?>'si",
"'<\w+.*?(ondblclick).*?>'si",
"'<\w+.*?(ondragdrop).*?>'si",
"'<\w+.*?(onerror).*?>'si",
"'<\w+.*?(onfilterchange).*?>'si",
"'<\w+.*?(onfocus).*?>'si",
"'<\w+.*?(onhelp).*?>'si",
"'<\w+.*?(onkeydown).*?>'si",
"'<\w+.*?(onkeypress).*?>'si",
"'<\w+.*?(onkeyup).*?>'si",
"'<\w+.*?(onload).*?>'si",
"'<\w+.*?(onmousedown).*?>'si",
"'<\w+.*?(onmousemove).*?>'si",
"'<\w+.*?(onmouseout).*?>'si",
"'<\w+.*?(onmouseover).*?>'si",
"'<\w+.*?(onmouseup).*?>'si",
"'<\w+.*?(onmove).*?>'si",
"'<\w+.*?(onreadystatechange).*?>'si",
"'<\w+.*?(onreset).*?>'si",
"'<\w+.*?(onresize).*?>'si",
"'<\w+.*?(onscroll).*?>'si",
"'<\w+.*?(onselect).*?>'si",
"'<\w+.*?(onselectstart).*?>'si",
"'<\w+.*?(onsubmit).*?>'si",
"'<\w+.*?(onunload).*?>'si" );
foreach ($preg_replace_array as $preg_expression)
{
$result = preg_replace($preg_expression, "", $text);
if( preg_last_error()== PREG_BACKTRACK_LIMIT_ERROR ) {
// preg_replace() returned NULL here, so log the still-intact input from before the failed call
t3lib_div::devLog('removeBadHTML: Memory exceeded. ', $this->extKey, 2, array($preg_expression, $text));
return "removeBadHTML memory exceeded.";
}
$text = $result;
}
$text = preg_replace(
'/<a[^>]*href[[:space:]]*=[[:space:]]*["\']?[[:space:]]*javascript[^>]*/i',
'',
$text
);
if( preg_last_error()== PREG_BACKTRACK_LIMIT_ERROR ) {
t3lib_div::devLog('removeBadHTML: Memory exceeded. ', $this->extKey, 2);
return "removeBadHTML memory exceeded.";
}
// Return clean content
return $text;
}
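On the remark that the regexes may be optimizable: one possible direction, which is an assumption and not something proposed in this thread, is to replace the '.*?' parts with '[^>]*'. With the 's' modifier, '.*?' can scan to the end of the input for every '<', which is where the backtracking comes from; '[^>]*' stops at the end of the current tag. The sketch below compares both forms on a small input, using a shortened event list and illustrative variable names.

```php
<?php
// Shortened, illustrative event list; the full list is in the function above.
$events = array('onclick', 'onload', 'onmouseover');
$alternation = implode('|', $events);

// Shape used in this thread: '.*?' may cross tag boundaries.
$originalPattern = "'<\\w+.*?(" . $alternation . ").*?>'si";

// Bounded variant (assumption): '[^>]*' cannot run past the closing '>'.
$boundedPattern = "'<\\w+[^>]*?(" . $alternation . ")[^>]*>'si";

$html = '<p>ok</p><a onclick="x()">go</a>';

$original = preg_replace($originalPattern, '', $html);
$bounded  = preg_replace($boundedPattern, '', $html);

// $original === 'go</a>'          (the match starts at '<p' and swallows the paragraph)
// $bounded  === '<p>ok</p>go</a>' (only the tag carrying the handler is removed)
```

The bounded variant is not a drop-in replacement: a '>' inside a quoted attribute value that precedes the event handler ends the match early, so such a tag would slip through. It would need a security review before being used as an XSS filter.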
Updated by Christoph Holtermann almost 11 years ago
Updated by Mathias Schreiber almost 10 years ago
- Description updated (diff)
- Target version changed from 4.5.16 to 7.2 (Frontend)
- Is Regression set to No
- Sprint Focus set to On Location Sprint
Updated by Jan Helke almost 10 years ago
Is this still an issue? I tested the snippet given above as well as the replacetest.php script. Both produced fine output.
My environment is PHP 5.5.9. As TYPO3 requires at least PHP 5.5.0, I think we could close this.
Updated by Jan Helke almost 10 years ago
- Status changed from New to Closed
As it seems that this is not an issue anymore, I am closing it. If you have objections, please provide more information on how to reproduce it in a current environment.
Updated by Anja Leichsenring almost 9 years ago
- Sprint Focus deleted (On Location Sprint)