Bug #28567

Ugly replacement character when removing whitespaces

Added by Dimitri Koenig almost 2 years ago. Updated about 1 month ago.

Status: New
Start date: 2011-07-29
Priority: Should have
Due date:
Assignee: Dimitri Koenig
% Done: 0%
Category: indexed search
Target version: -
TYPO3 Version:
Complexity:
PHP Version:
Votes: 0

Description

tx_indexedsearch.php->markupSWpartsOfString($str) collapses runs of whitespace near the start of the method:

$str = preg_replace('/\s\s+/',' ',$str);

But sometimes this produces ugly replacement characters (U+FFFD/65533).

Any solution?

History

Updated by Tim Schenk about 1 year ago

  • Assignee set to Dimitri Koenig

Dimitri Koenig wrote:

tx_indexedsearch.php->markupSWpartsOfString($str) removes unnecessary whitespaces at the beginning: [...]

But sometimes this produces ugly replacement characters (U+FFFD/65533).

Any solution?

I changed this to be UTF-8 aware within an XCLASS extension:


$str = str_replace('&nbsp;',' ',t3lib_parsehtml::bidir_htmlspecialchars($str,-1));
$str = preg_replace('/[\s][\s]+/u',' ',$str);

and ...

$parts = preg_split('/'.$regExString.'/iu', ' '.$str.' ', 20000, PREG_SPLIT_DELIM_CAPTURE);

which works for me. It might be a good idea to make this charset-specific...
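A minimal sketch of what the /u modifier changes, assuming the input is valid UTF-8. The helper name collapseWhitespace is hypothetical and not part of TYPO3; the sample text is written with byte escapes so it does not depend on the file encoding.

```php
<?php
// Hypothetical helper (not a TYPO3 API): collapse whitespace runs in a UTF-8 string.
function collapseWhitespace($str) {
    // With /u the subject is decoded as UTF-8, so \s is tested against whole
    // characters and can never match a single byte inside "à" (0xC3 0xA0).
    $result = preg_replace('/\s\s+/u', ' ', $str);
    // preg_replace() returns NULL on error, e.g. when $str is not valid UTF-8;
    // fall back to the unmodified input in that case.
    return $result === NULL ? $str : $result;
}

// "demandé à", then two spaces, a newline and a space, then "être"
$out = collapseWhitespace("demand\xC3\xA9 \xC3\xA0  \n \xC3\xAAtre");
```

The run of whitespace after "à" is reduced to one space and the letter itself is left intact.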

But there were a couple of other issues with indexing or output of UTF-8, especially in Chinese, and with '&nbsp;' in general. For example, cropping Chinese search results produced strange (question mark) characters. The solution was to allow cropping only at whitespace, not within words...

In t3lib_cs::entities_to_utf8 I changed some lines to use UTF-8:


// Use the HTML5 entity table when this PHP version defines it (PHP >= 5.4).
$param2 = defined('ENT_HTML5') ? ENT_HTML5 : ENT_QUOTES;
$entities = get_html_translation_table(HTML_ENTITIES, $param2, 'UTF-8');
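As a rough illustration of why the 'UTF-8' argument matters here: the table returned by get_html_translation_table() then has UTF-8 characters as keys, so flipping it yields an entity-to-UTF-8 map. This is only a sketch of the idea and not the actual body of entities_to_utf8; the variable names are made up for the example.

```php
<?php
// Pick the HTML5 table if available, otherwise the default HTML 4.01 table.
$flags = defined('ENT_HTML5') ? ENT_HTML5 : ENT_QUOTES;

// Keys are UTF-8 characters, values are named entities such as "&agrave;".
$table = get_html_translation_table(HTML_ENTITIES, $flags, 'UTF-8');

// Flip to get entity => UTF-8 character, then replace in one pass.
$entityToChar = array_flip($table);
$decoded = strtr('demand&eacute; &agrave; &ecirc;tre', $entityToChar);
```

Without the third argument, older PHP versions build the table in ISO-8859-1, which puts single bytes such as 0xE0 into an otherwise UTF-8 string.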

and in function t3lib_cs::crop:

if ($i === FALSE) { // $len outside actual string length
    return $string;
} else {
    if ($len > 0) {
        if (strlen($string[$i])) {
            $string = substr($string, 0, $i);
            // Back up to the last space so the cut never falls inside a word.
            // A space is a single ASCII byte, so byte-based strrpos() is safe in UTF-8.
            $lastWhiteSpace = strrpos($string, ' ');
            if ($lastWhiteSpace) {
                $string = substr($string, 0, $lastWhiteSpace);
            }
            return $string . $crop;
        }
    } else {
        if (strlen($string[$i - 1])) {
            $string = substr($string, $i);
            // Move forward to the first space when cropping from the end.
            $firstWhiteSpace = strpos($string, ' ');
            if ($firstWhiteSpace) {
                $string = substr($string, $firstWhiteSpace);
            }
            return $crop . $string;
        }
    }
}
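The same idea as a self-contained sketch, for readers without the t3lib_cs context. The function cropAtWhitespace is hypothetical and is not the t3lib_cs::crop API; it assumes valid UTF-8 input, a non-negative $len below 65536 (the PCRE quantifier limit), and handles only cropping from the start.

```php
<?php
// Hypothetical helper: keep at most $len characters from the start of a
// UTF-8 string, then back up to the last space before appending $crop.
function cropAtWhitespace($string, $len, $crop = '...') {
    $len = max(0, (int)$len);
    // /u counts characters instead of bytes; /s lets "." match newlines too.
    if (!preg_match('/^.{0,' . $len . '}/us', $string, $m) || $m[0] === $string) {
        return $string; // invalid UTF-8, or the string is short enough already
    }
    $head = $m[0];
    // A space is one ASCII byte, so byte-based strrpos()/substr() cannot
    // split a multibyte character here.
    $lastSpace = strrpos($head, ' ');
    if ($lastSpace) {
        $head = substr($head, 0, $lastSpace);
    }
    return $head . $crop;
}

// "demandé à être représentée", cropped to 12 characters
$out = cropAtWhitespace("demand\xC3\xA9 \xC3\xA0 \xC3\xAAtre repr\xC3\xA9sent\xC3\xA9e", 12);
```

The first 12 characters end in the middle of "être", so the result is cut back to "demandé à" followed by the crop marker.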

Updated by Marc Véron 4 months ago

I can confirm problems with the line
$str = preg_replace('/\s\s+/',' ',$str);

Searching for text that contains the letter à between white space displays nasty results as follows:

demandé � être représentée directement
instead of
demandé à être représentée directement

It only happens with white space around (or after) à. If the letter à is embedded in a word, it always displays correctly.

I found out that the problem occurs in line 2022 of class.tx_indexedsearch.php
$str = preg_replace('/\s\s+/',' ',$str);

There is a hint on http://stackoverflow.com/questions/2050723 that PHP's regular expressions are not Unicode-aware unless the /u modifier is used.
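A possible byte-level explanation, as a sketch: in UTF-8, à is the two bytes 0xC3 0xA0, and 0xA0 on its own is the non-breaking space in ISO-8859-1. The assumption here is that on the affected systems the byte-mode \s also matches 0xA0 (this appears to depend on platform and locale), so the example lists \xA0 explicitly to reproduce the effect everywhere.

```php
<?php
// "demandé à être" as UTF-8 bytes; "à " is C3 A0 20.
$input = "demand\xC3\xA9 \xC3\xA0 \xC3\xAAtre";

// Byte mode. Assumption: \xA0 is added to the class to emulate a \s that
// treats 0xA0 as whitespace. The pair A0 20 is then seen as two "spaces".
$broken = preg_replace('/[\s\xA0][\s\xA0]+/', ' ', $input);

// UTF-8 mode: the pattern sees the character à, never its second byte alone.
$fixed = preg_replace('/\s\s+/u', ' ', $input);

// An empty pattern with /u only matches when the subject is valid UTF-8.
$brokenIsValidUtf8 = preg_match('//u', $broken) === 1;
$fixedIsValidUtf8 = preg_match('//u', $fixed) === 1;
```

In byte mode the 0xA0 is swallowed and a lone 0xC3 remains, which browsers render as U+FFFD. That would also fit the observation that the problem needs white space after à: inside a word, 0xA0 is followed by a letter, so no run of two "whitespace" bytes exists.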

For a quick hack, I added two lines to separate à from white space:
$str = str_replace('à','|à|',$str); //Hack MV
$str = preg_replace('/\s\s+/',' ',$str);
$str = str_replace('|à|','à',$str); //Hack MV

Results display now as expected, but it would be nice to have it fixed without this hack.

Updated by Oliver Hader about 1 month ago

  • Target version set to (temporary)

Updated by Oliver Hader about 1 month ago

  • Project changed from Indexed Search to Core

Updated by Oliver Hader about 1 month ago

  • Category set to indexed search

Updated by Oliver Hader about 1 month ago

  • Target version deleted ((temporary))
