Bug #28567

closed

Epic #65814: Make Indexed search extbase plugin shine

Ugly replacement character when removing whitespaces

Added by Dimitri Koenig over 12 years ago. Updated over 8 years ago.

Status:
Closed
Priority:
Should have
Category:
Indexed Search
Target version:
Start date:
2011-07-29
Due date:
% Done:

0%

Estimated time:
TYPO3 Version:
4.5
PHP Version:
Tags:
Complexity:
Is Regression:
No
Sprint Focus:

Description

tx_indexedsearch.php->markupSWpartsOfString($str) collapses unnecessary whitespace at the beginning of the function:

$str = preg_replace('/\s\s+/',' ',$str);

But sometimes this produces ugly replacement characters (U+FFFD/65533).

Any solution?

Actions #1

Updated by Tim Schenk almost 12 years ago

  • Assignee set to Dimitri Koenig

Dimitri Koenig wrote:

tx_indexedsearch.php->markupSWpartsOfString($str) removes unnecessary whitespaces at the beginning:
[...]

But sometimes this produces ugly replacement characters (U+FFFD/65533).

Any solution?

I changed this to be UTF-8 aware within an XClass extension:


$str = str_replace('&nbsp;', ' ', t3lib_parsehtml::bidir_htmlspecialchars($str, -1));
$str = preg_replace('/[\s][\s]+/u',' ',$str);

and ...

$parts = preg_split('/'.$regExString.'/iu', ' '.$str.' ', 20000, PREG_SPLIT_DELIM_CAPTURE);

which works for me. It might be a good idea to make this charset-specific...
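
As a minimal sketch of the /u approach described above (the helper name collapseWhitespace is hypothetical and not part of TYPO3): with the /u modifier, preg_replace() returns NULL when the subject is not valid UTF-8, so the sketch falls back to an ASCII-only character class in that case.

```php
<?php
/**
 * Collapse runs of two or more whitespace characters into a single space.
 * Hypothetical helper for illustration only; not part of TYPO3.
 */
function collapseWhitespace($str)
{
    // With /u, PCRE treats $str as UTF-8 and never splits a multi-byte character.
    $result = preg_replace('/\s\s+/u', ' ', $str);

    // preg_replace() returns NULL on error, e.g. when $str is not valid UTF-8.
    // Fall back to a class that only contains ASCII whitespace bytes.
    if ($result === null) {
        $result = preg_replace('/[ \t\r\n\f\x0B]{2,}/', ' ', $str);
    }
    return $result;
}

// "demandé  à \n\t être" with mixed whitespace runs, written as UTF-8 bytes.
echo collapseWhitespace("demand\xC3\xA9  \xC3\xA0 \n\t \xC3\xAAtre"), "\n";
```

The fallback deliberately avoids \s, so a byte such as 0xA0 inside a multi-byte character cannot be mistaken for whitespace.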

But there were a couple of other issues with indexing or output of UTF-8, especially in Chinese, and with '&nbsp;' in general. For example, cropping Chinese search results output strange (question mark) characters. The solution was to allow cropping only at whitespace, not within words...

In t3lib_cs::entities_to_utf8 I changed some lines to use UTF-8:


// ENT_HTML5 only exists since PHP 5.4, so check for the constant before using it
$param2 = defined('ENT_HTML5') ? ENT_HTML5 : ENT_QUOTES;
$entities = get_html_translation_table(HTML_ENTITIES, $param2, 'UTF-8');

and in function t3lib_cs::crop:

if ($i === FALSE) { // $len outside actual string length
    return $string;
} else {
    if ($len > 0) {
        if (strlen($string[$i])) {
            $string = substr($string, 0, $i);
            // Crop at the last whitespace so that no word is cut in half
            $lastWhiteSpace = strrpos($string, " ");
            //t3lib_utility_Debug::debugInPopUpWindow(array("last"=>$lastWhiteSpace,"length"=>strlen($string),"string"=>$string));
            if ($lastWhiteSpace) {
                $string = substr($string, 0, $lastWhiteSpace);
            }
            return $string . $crop;
        }
    } else {
        if (strlen($string[$i - 1])) {
            $string = substr($string, $i);
            // Crop at the first whitespace so that no word is cut in half
            $firstWhiteSpace = strpos($string, " ");
            //t3lib_utility_Debug::debugInPopUpWindow(array("first"=>$firstWhiteSpace,"length"=>strlen($string),"string"=>$string));
            if ($firstWhiteSpace) {
                $string = substr($string, $firstWhiteSpace);
            }
            return $crop . $string;
        }
    }
}
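
The idea of cropping only at whitespace can also be sketched without touching t3lib_cs. The function below is a hypothetical stand-alone helper (its name, parameters and default suffix are assumptions, not TYPO3 API). It counts whole UTF-8 characters via the /u modifier and then backs up to the last space.

```php
<?php
/**
 * Crop $text to at most $maxChars characters, preferring to cut at the last
 * space. Hypothetical helper for illustration only; not TYPO3 API.
 */
function cropAtWhitespace($text, $maxChars, $suffix = '...')
{
    // Take up to $maxChars whole characters (not bytes) from the start.
    if (!preg_match('/^.{0,' . (int)$maxChars . '}/us', $text, $m)) {
        return $text; // invalid UTF-8 or regex error: leave the input untouched
    }
    $head = $m[0];
    if ($head === $text) {
        return $text; // already short enough
    }
    // Back up to the last space so that no word is cut in half. The byte 0x20
    // never occurs inside a multi-byte UTF-8 sequence, so strrpos() is safe.
    $pos = strrpos($head, ' ');
    if ($pos !== false && $pos > 0) {
        $head = substr($head, 0, $pos);
    }
    return $head . $suffix;
}

echo cropAtWhitespace("demand\xC3\xA9 \xC3\xA0 \xC3\xAAtre repr\xC3\xA9sent\xC3\xA9e", 12), "\n";
```

For text without spaces, such as Chinese, this sketch falls back to cutting at a character boundary, which still avoids splitting a multi-byte character.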
Actions #2

Updated by Marc Véron about 11 years ago

I can confirm problems with the line
$str = preg_replace('/\s\s+/',' ',$str);

Searching for text that contains the letter à between white space displays nasty results as follows:

demandé � être représentée directement
instead of
demandé à être représentée directement

It only happens with white space around (or after) à. If the letter à is embedded in a word, it always displays correctly.

I found out that the problem occurs in line 2022 of class.tx_indexedsearch.php
$str = preg_replace('/\s\s+/',' ',$str);

There is a hint on http://stackoverflow.com/questions/2050723 that PHP's regular expressions are not Unicode-aware unless the /u modifier is used.

For a quick hack, I added two lines to separate à from white space:
$str = str_replace('à','|à|',$str); //Hack MV
$str = preg_replace('/\s\s+/',' ',$str);
$str = str_replace('|à|','à',$str); //Hack MV

Results display now as expected, but it would be nice to have it fixed without this hack.
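
A likely explanation, sketched under an assumption: in UTF-8, à is the two bytes 0xC3 0xA0, and 0xA0 on its own is the Latin-1 non-breaking space. If the active locale makes \s match 0xA0 in byte mode, the sequence "à " contains two "whitespace" bytes and the pattern removes half of the character. The snippet below simulates that locale by adding \xA0 to the character class explicitly, then shows that the /u modifier leaves the text intact.

```php
<?php
// "demandé à être" as UTF-8 bytes; "à" is 0xC3 0xA0.
$str = "demand\xC3\xA9 \xC3\xA0 \xC3\xAAtre";

// Byte-oriented matching, with 0xA0 explicitly treated as whitespace to
// simulate a locale in which \s covers the Latin-1 non-breaking space.
$broken = preg_replace('/[\s\xA0][\s\xA0]+/', ' ', $str);

// UTF-8 mode: the pattern sees whole characters, so "à" is left alone.
$fixed = preg_replace('/\s\s+/u', ' ', $str);

// An empty pattern with /u is a cheap UTF-8 validity check:
// preg_match() returns 1 for valid input and false for malformed input.
var_dump(preg_match('//u', $broken) === 1); // the orphaned 0xC3 byte makes this invalid
var_dump(preg_match('//u', $fixed) === 1);
```

The orphaned 0xC3 byte left in $broken is what a browser renders as the replacement character.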

Actions #3

Updated by Oliver Hader about 11 years ago

  • Target version set to 2222
Actions #4

Updated by Oliver Hader about 11 years ago

  • Project changed from 1382 to TYPO3 Core
Actions #5

Updated by Oliver Hader about 11 years ago

  • Category set to Indexed Search
Actions #6

Updated by Oliver Hader about 11 years ago

  • Target version deleted (2222)
Actions #7

Updated by Mathias Schreiber about 9 years ago

  • Target version set to 7.5
  • TYPO3 Version set to 4.5
  • Is Regression set to No
Actions #8

Updated by Tymoteusz Motylewski almost 9 years ago

  • Parent task set to #65814
Actions #9

Updated by Tymoteusz Motylewski over 8 years ago

  • Assignee changed from Dimitri Koenig to Tymoteusz Motylewski
Actions #10

Updated by Tymoteusz Motylewski over 8 years ago

  • Status changed from New to Closed

I have tested the issue with all kinds of UTF-8 characters (from http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt as well as the one mentioned in the ticket) and couldn't reproduce it.

It might be the case that this bug was fixed by the PHP upgrade to 5.4, which changed the default charset of the entity functions to UTF-8.
Functions like htmlentities, html_entity_decode, htmlspecialchars etc. are now UTF-8 aware by default.
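
To illustrate why that default matters, here is a small sketch that decodes the same entity with an explicit charset argument. Before PHP 5.4 these functions defaulted to ISO-8859-1, so code that omitted the argument could produce single Latin-1 bytes inside an otherwise UTF-8 page.

```php
<?php
// Decode the same entity with an explicit target charset.
$utf8   = html_entity_decode('&agrave;', ENT_QUOTES, 'UTF-8');      // two bytes
$latin1 = html_entity_decode('&agrave;', ENT_QUOTES, 'ISO-8859-1'); // one byte

// Show the raw bytes of each result as hex.
echo bin2hex($utf8), ' ', bin2hex($latin1), "\n"; // prints "c3a0 e0"
```

Passing the charset explicitly keeps the result independent of the PHP version in use.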

5.4 has just reached end of life, so we can safely close this ticket.

Please let me know if the issue is not solved for you with a modern PHP version.
