Feature #16947: Use extended Unicode features of PHP 5.1 for word splitting - TYPO3 Core - TYPO3 Forge

Feature #16947

closed

Use extended Unicode features of PHP 5.1 for word splitting

Added by Martin Kutschker almost 18 years ago. Updated over 11 years ago.

Status: Closed
Priority: Should have
Category: Indexed Search
Start date: 2007-02-07

Description

The word splitter currently uses homegrown PHP code to split text into words. It might be sped up by using PCRE's \p{xx} Unicode property escapes to search for characters with certain properties.

(issue imported from #M4934)
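A minimal sketch of what property-based splitting could look like, assuming a PCRE build with Unicode property support. The function name `splitIntoWords` is hypothetical and not part of indexed_search, and the choice to treat letters and numbers as word characters is an assumption made for illustration:

```php
<?php
// Hypothetical helper for illustration; not the indexed_search implementation.
function splitIntoWords(string $text): array
{
    // Split on every run of characters that are neither letters (\p{L})
    // nor numbers (\p{N}). The /u modifier makes PCRE treat the input as UTF-8.
    $words = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure (for example invalid UTF-8 input).
    return $words === false ? [] : $words;
}

print_r(splitIntoWords('Größe: 42 Äpfel, fertig'));
```

Whether this is actually faster than the existing splitter would have to be measured; the sketch only shows the mechanism.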

Updated by Martin Kutschker almost 18 years ago

It's not that easy, because CJK glyphs are also reported as letters. But this could be overcome with conditional subpatterns:

/(?(?=[\x41-\x{658}])\pL)/u

This checks whether the character lies between \x41 and \x{658}. If so, it is matched against the Unicode property "letter" (\pL).
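The range check can be tried out with a small sketch. Note that the conditional as written has no "no" branch, so it matches the empty string whenever the lookahead fails. For extracting whole words, the sketch below therefore uses a plain lookahead, which only lets a character match when it is both inside the range and a letter. The function name `extractLetterWords` is hypothetical:

```php
<?php
// Hypothetical helper for illustration; not part of indexed_search.
function extractLetterWords(string $text): array
{
    // (?=[\x{41}-\x{658}]) : the next character lies between U+0041 and U+0658
    // \pL                  : and it has the Unicode property "letter"
    // CJK characters are letters too, but fall outside the range and are skipped.
    $pattern = '/(?:(?=[\x{41}-\x{658}])\pL)+/u';

    // preg_match_all() returns false if the pattern fails or the input is invalid.
    if (preg_match_all($pattern, $text, $matches) === false) {
        return [];
    }
    return $matches[0];
}

print_r(extractLetterWords('Hello мир 日本語 world'));
```

In this example the Latin and Cyrillic words are returned, while the CJK run is left out for separate handling.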

Unfortunately, not all PCRE builds are compiled with Unicode property support. In that case you get this warning:

PHP Warning: preg_match(): Compilation failed: support for \P, \p, and \X has not been compiled
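One way to cope with such builds is to probe for the feature at runtime and fall back to a simpler pattern. This is only a sketch: the function name `pcreSupportsUnicodeProperties` is hypothetical, and the ASCII-only fallback pattern is an assumption chosen for illustration, not what indexed_search does:

```php
<?php
// Hypothetical capability check for illustration.
function pcreSupportsUnicodeProperties(): bool
{
    // If \p is unsupported, the pattern fails to compile: preg_match()
    // raises a warning (suppressed here with @) and returns false.
    return @preg_match('/\pL/u', 'a') === 1;
}

$wordPattern = pcreSupportsUnicodeProperties()
    ? '/\pL+/u'       // Unicode-aware: any run of letters
    : '/[A-Za-z]+/';  // assumed ASCII-only fallback, for illustration

echo preg_match_all($wordPattern, 'Hello мир'), "\n";
```

The probe could be run once and its result cached, so the cost of the check is not paid on every split.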

Updated by Alexander Opitz over 11 years ago

Status changed from New to Needs Feedback
Target version deleted (0)
TYPO3 Version set to 4.1

As this report is very old: does the handling in newer TYPO3 CMS versions (such as 6.0/6.1) come closer to what you expect?

Updated by Martin Kutschker over 11 years ago

I haven't used indexed search for years (I first switched to mnogoSearch and have now settled on Solr). Personally I still advocate the use of preg_match and its Unicode features, but any change to the code would only please my own tastes.

Updated by Alexander Opitz over 11 years ago

Status changed from Needs Feedback to Closed

OK, closing this issue then.
