Bug #28567
Ugly replacement character when removing whitespaces
| Status: | New | Start date: | 2011-07-29 |
|---|---|---|---|
| Priority: | Should have | Due date: | |
| Assignee: | Dimitri Koenig | % Done: | 0% |
| Category: | indexed search | | |
| Target version: | - | | |
| TYPO3 Version: | | Complexity: | |
| PHP Version: | | | |
| Votes: | 0 | | |
Description
tx_indexedsearch.php->markupSWpartsOfString($str) removes unnecessary whitespace at the beginning:
$str = preg_replace('/\s\s+/',' ',$str);
But sometimes this produces ugly replacement characters (U+FFFD/65533).
Any solution?
History
Updated by Tim Schenk about 1 year ago
- Assignee set to Dimitri Koenig
Dimitri Koenig wrote:
tx_indexedsearch.php->markupSWpartsOfString($str) removes unnecessary whitespaces at the beginning: [...]
But sometimes this produces ugly replacement characters (U+FFFD/65533).
Any solution?
I changed this to UTF-8 within an XClass extension:
$str = str_replace('&nbsp;',' ',t3lib_parsehtml::bidir_htmlspecialchars($str,-1));
$str = preg_replace('/[\s][\s]+/u',' ',$str);
and ...
$parts = preg_split('/'.$regExString.'/iu', ' '.$str.' ', 20000, PREG_SPLIT_DELIM_CAPTURE);
which works for me. It might be a good idea to make this charset-specific...
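A minimal sketch of the two changes above in isolation, assuming the input is valid UTF-8. The helper name highlightWords and the <strong> markup are hypothetical and are not part of tx_indexedsearch; it only illustrates how the /u modifier keeps multi-byte characters intact during collapsing and splitting.

```php
<?php
// Hypothetical helper (not the TYPO3 method): collapse whitespace runs and
// wrap every occurrence of the given search words in <strong> tags.
function highlightWords($str, array $words) {
    // With /u the pattern works on code points, so a byte that belongs to
    // a multi-byte character can never be matched on its own.
    $str = preg_replace('/\s\s+/u', ' ', $str);

    // Quote each word so regex metacharacters are matched literally.
    $quoted = array_map(function ($w) {
        return preg_quote($w, '/');
    }, $words);

    // PREG_SPLIT_DELIM_CAPTURE keeps the matched words in the result;
    // with a single capture group they land at the odd indexes.
    $parts = preg_split(
        '/(' . implode('|', $quoted) . ')/iu',
        $str,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );

    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            $parts[$i] = '<strong>' . $part . '</strong>';
        }
    }
    return implode('', $parts);
}

// "demandé  à être" with a double space after "demandé", written as bytes.
$marked = highlightWords("demand\xC3\xA9  \xC3\xA0 \xC3\xAAtre", array("\xC3\xAAtre"));
```

Here $marked should contain the collapsed text with only the search word wrapped, and the "à" left untouched.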
But there were a couple of other issues with indexing or output of UTF-8, especially in Chinese and with '&nbsp;' in general; e.g. cropping Chinese search results output strange (question mark) characters. The solution was to allow cropping only at whitespace, not within words...
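The idea of cropping only at whitespace can be sketched independently of t3lib_cs. This is an illustration under assumptions, not the patch itself: the function name cropAtWhitespace and its parameters are hypothetical, the input is assumed to be UTF-8, and only ASCII spaces are treated as break points.

```php
<?php
// Hypothetical helper: keep at most $len characters (code points, not
// bytes), then back up to the last space so no word is cut in the middle.
function cropAtWhitespace($string, $len, $suffix = '...') {
    // /u counts code points; /s lets "." match newlines as well.
    if (!preg_match('/^.{0,' . (int)$len . '}/us', $string, $m)) {
        return $string; // invalid UTF-8: leave the string untouched
    }
    $head = $m[0];
    if ($head === $string) {
        return $string; // already short enough, nothing to crop
    }
    // An ASCII space is never part of a multi-byte UTF-8 sequence, so the
    // byte-based strrpos/substr calls are safe here.
    $lastSpace = strrpos($head, ' ');
    if ($lastSpace !== false && $lastSpace > 0) {
        $head = substr($head, 0, $lastSpace);
    }
    return $head . $suffix;
}

// "demandé à être représentée" cropped to 12 characters.
$cropped = cropAtWhitespace(
    "demand\xC3\xA9 \xC3\xA0 \xC3\xAAtre repr\xC3\xA9sent\xC3\xA9e",
    12
);
```

Text without any spaces (as is common in Chinese) falls back to the code-point boundary, which still yields valid UTF-8.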
In t3lib_cs::entities_to_utf8 I changed some lines to UTF-8:
$param2 = (is_numeric(ENT_HTML5))?ENT_HTML5:ENT_QUOTES;
$entities = get_html_translation_table(HTML_ENTITIES,$param2,'UTF-8');
and in function t3lib_cs::crop:
    if ($i === FALSE) { // $len outside actual string length
        return $string;
    } else {
        if ($len > 0) {
            if (strlen($string[$i])) {
                $string = substr($string, 0, $i);
                // Crop at the last space so that no word is cut in the middle
                $lastWhiteSpace = strrpos($string, " ");
                //t3lib_utility_Debug::debugInPopUpWindow(array("last"=>$lastWhiteSpace,"length"=>strlen($string),"string"=>$string));
                if ($lastWhiteSpace) {
                    $string = substr($string, 0, $lastWhiteSpace);
                }
                return $string . $crop;
            }
        } else {
            if (strlen($string[$i - 1])) {
                $string = substr($string, $i);
                // Start at the first space so that no word is cut in the middle
                $firstWhiteSpace = strpos($string, " ");
                //t3lib_utility_Debug::debugInPopUpWindow(array("first"=>$firstWhiteSpace,"length"=>strlen($string),"string"=>$string));
                if ($firstWhiteSpace) {
                    $string = substr($string, $firstWhiteSpace);
                }
                return $crop . $string;
            }
        }
    }
Updated by Marc Véron 4 months ago
I can confirm problems with the line
$str = preg_replace('/\s\s+/',' ',$str);
Searching for text that contains the letter à between white space displays nasty results as follows:
demandé � être représentée directement
instead of
demandé à être représentée directement
It only happens with white space around (or after) à. If the letter à is embedded in a word, it always displays correctly.
I found out that the problem occurs in line 2022 of class.tx_indexedsearch.php
$str = preg_replace('/\s\s+/',' ',$str);
There is a hint on http://stackoverflow.com/questions/2050723 that PHP's regular expressions are not Unicode-aware.
For a quick hack, I added two lines to separate à from white space:
$str = str_replace('à','|à|',$str); //Hack MV
$str = preg_replace('/\s\s+/',' ',$str);
$str = str_replace('|à|','à',$str); //Hack MV
Results display now as expected, but it would be nice to have it fixed without this hack.
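One plausible explanation for this symptom, offered as an assumption rather than a confirmed diagnosis: in UTF-8, à is the byte pair 0xC3 0xA0, and 0xA0 on its own is the non-breaking space in Latin-1. Without the /u modifier the pattern works on bytes, and depending on locale and platform \s may match 0xA0, so "à " loses its second byte. The sketch below imitates that by adding \xA0 to the character class explicitly; the function names are hypothetical.

```php
<?php
// Hypothetical helper: byte-wise collapse that treats 0xA0 as whitespace,
// imitating a Latin-1 style character table (an assumption, see above).
function collapseBytewise($str) {
    return preg_replace('/[\s\xA0][\s\xA0]+/', ' ', $str);
}

// Same collapse in UTF-8 mode: the pattern sees whole code points.
function collapseUtf8($str) {
    return preg_replace('/\s\s+/u', ' ', $str);
}

// "demandé à être" written as explicit UTF-8 bytes.
$input = "demand\xC3\xA9 \xC3\xA0 \xC3\xAAtre";

// The bytes 0xA0 0x20 ("second half of à" plus space) are replaced by a
// single space, leaving a lone 0xC3 that is no longer valid UTF-8.
$broken = collapseBytewise($input);

// In UTF-8 mode there is no run of two whitespace characters to collapse.
$intact = collapseUtf8($input);

// preg_match with an empty /u pattern is a cheap UTF-8 validity check.
$brokenIsValid = (preg_match('//u', $broken) === 1);
$intactIsValid = (preg_match('//u', $intact) === 1);
```

A browser renders the lone 0xC3 byte as U+FFFD followed by the space, which matches the "demandé � être" output reported above. If this is indeed the cause, the /u modifier would make the |à| hack unnecessary.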
Updated by Oliver Hader about 1 month ago
- Target version set to (temporary)
Updated by Oliver Hader about 1 month ago
- Project changed from Indexed Search to Core
Updated by Oliver Hader about 1 month ago
- Category set to indexed search
Updated by Oliver Hader about 1 month ago
- Target version deleted ((temporary))