Feature #16947

closed

Use extended Unicode features of PHP 5.1 for word splitting

Added by Martin Kutschker almost 18 years ago. Updated over 11 years ago.

Status: Closed
Priority: Should have
Assignee: -
Category: Indexed Search
Target version: -
Start date: 2007-02-07
Due date:
% Done: 0%
Estimated time:
PHP Version:
Tags:
Complexity:
Sprint Focus:

Description

The word splitter currently uses homegrown PHP code to split text into words. It could perhaps be sped up by using PCRE's \p{xx} Unicode property escapes (through the preg_* functions) to match characters with certain properties.
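To illustrate the idea, here is a minimal sketch of property-based word splitting. The function name splitWordsUnicode is hypothetical and not part of Indexed Search, and the choice of \p{L} plus \p{N} as "word characters" is an assumption; it also assumes a current PHP version (type declarations) and a PCRE build with Unicode property support.

```php
<?php
// Hypothetical helper, not the Indexed Search implementation:
// collect runs of Unicode letters (\p{L}) and numbers (\p{N}).
function splitWordsUnicode(string $text): array
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    // preg_match_all() returns false on error (e.g. invalid UTF-8 input).
    if (preg_match_all('/[\p{L}\p{N}]+/u', $text, $matches) === false) {
        return [];
    }
    // $matches[0] holds the full matches, i.e. the words.
    return $matches[0];
}

$words = splitWordsUnicode('Größe: 42 km');
```

With precomposed characters in the input, $words should contain 'Größe', '42' and 'km'. Combining marks (\p{M}) are deliberately left out of this sketch.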

(issue imported from #M4934)

Actions #1

Updated by Martin Kutschker almost 18 years ago

It's not that easy, because CJK characters are also reported as letters. But this could be overcome with a conditional subpattern:

/(?(?=[\x41-\x{658}])\pL)/u

This checks whether the next character lies between U+0041 and U+0658 (\x41 and \x{658}). If so, it is matched against the Unicode property "letter" (\pL).
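A rough sketch of how this conditional subpattern could be used to extract words follows. The function name splitLowRangeWords is hypothetical. One adjustment is an assumption on my part: as far as I understand PCRE, a conditional group without a "no" branch matches the empty string when the condition is false, so the sketch adds an always-failing branch, (?!), and repeats the group with +.

```php
<?php
// Sketch only: extract runs of letters restricted to U+0041..U+0658,
// so that CJK ideographs (also \pL) can be handled by separate logic.
function splitLowRangeWords(string $text): array
{
    // If the next character lies in the range, require a letter (\pL);
    // otherwise fail through the empty negative lookahead (?!).
    $pattern = '/(?:(?(?=[\x41-\x{658}])\pL|(?!)))+/u';
    if (preg_match_all($pattern, $text, $matches) === false) {
        return [];
    }
    return $matches[0];
}

$words = splitLowRangeWords('Hello 世界 мир');
```

Here the Latin and Cyrillic words are returned ('Hello', 'мир'), while the CJK characters fall outside the range and are skipped.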

Unfortunately, not all PCRE builds are compiled with Unicode property support. In that case you get this message:

PHP Warning: preg_match(): Compilation failed: support for \P, \p, and \X has not been compiled
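One way to cope with such builds might be a runtime check with a fallback. The sketch below is only an assumption about how that could look; both function names are hypothetical, and the ASCII-only fallback class is a placeholder, not what Indexed Search does.

```php
<?php
// Hypothetical check: does this PCRE build support Unicode properties?
function pcreHasUnicodeProperties(): bool
{
    // @ suppresses the compile warning; preg_match() returns false
    // when the pattern cannot be compiled, and 1 when 'a' matches \pL.
    return @preg_match('/\pL/u', 'a') === 1;
}

// Hypothetical pattern selection: fall back to an ASCII-only
// character class when \p is unavailable.
function wordPattern(): string
{
    return pcreHasUnicodeProperties()
        ? '/[\p{L}\p{N}]+/u'
        : '/[A-Za-z0-9]+/';
}
```

The check could be run once and its result cached, since the PCRE build does not change during a request.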

Actions #2

Updated by Alexander Opitz over 11 years ago

  • Status changed from New to Needs Feedback
  • Target version deleted (0)
  • TYPO3 Version set to 4.1

As this report is very old: does the handling in newer TYPO3 CMS versions (such as 6.0/6.1) now match what you expect?

Actions #3

Updated by Martin Kutschker over 11 years ago

I haven't used Indexed Search for years (I first switched to mnogoSearch and have now settled on Solr). Personally I still advocate the use of preg_match and its Unicode features, but any change to the code would only please my tastes.

Actions #4

Updated by Alexander Opitz over 11 years ago

  • Status changed from Needs Feedback to Closed

Ok, so closing this issue.
