Feature #16947
closed
Use extended Unicode features of PHP 5.1 for word splitting
0%
Description
The word splitter currently uses homegrown PHP code to split text into words. It could perhaps be sped up by using the \p{xx} feature of the preg functions (PCRE) to search for glyphs with certain Unicode properties.
(issue imported from #M4934)
Updated by Martin Kutschker almost 18 years ago
It's not that easy, because CJK glyphs are also reported as letters. But this can be overcome with conditional subpatterns:
/(?(?=[\x41-\x{658}])\pL)/u
Check whether the character lies between \x41 and \x{658} (U+0041 to U+0658). If so, match it against the Unicode property "letter".
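To illustrate how such a range-restricted letter match could be used for word splitting, here is a minimal sketch. It is not the indexed search code; splitWords() is a hypothetical helper, and the sketch assumes PHP 7 or later with a PCRE build that supports Unicode properties. It uses a plain lookahead rather than the conditional subpattern, because a conditional without a "no" branch matches the empty string whenever the condition fails.

```php
<?php
// Hypothetical helper (not part of TYPO3): returns runs of characters that
// are Unicode letters AND lie within U+0041..U+0658.
function splitWords(string $text): array
{
    // (?=[\x{41}-\x{658}]) limits the code point range, which excludes CJK;
    // \pL then requires the character to have the "letter" property.
    preg_match_all('/(?:(?=[\x{41}-\x{658}])\pL)+/u', $text, $matches);
    return $matches[0];
}

// CJK glyphs are letters too, but fall outside the range and are skipped.
$words = splitWords('Grüße, мир! 日本語 test123');
// $words is ['Grüße', 'мир', 'test']
```

Text outside the range (CJK, for example) would still need a separate splitting strategy.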
Unfortunately, not all PCRE builds are compiled with Unicode property support. In that case you get this message:
PHP Warning: preg_match(): Compilation failed: support for \P, \p, and \X has not been compiled
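One way to cope with builds that lack this support is to probe for it once and fall back to a simpler pattern. The following is only a sketch: pcreHasUnicodeProperties() is a hypothetical helper, and the ASCII-only fallback pattern is an assumption chosen for illustration, not what indexed search does.

```php
<?php
// Hypothetical helper: reports whether this PCRE build accepts \p{..}.
function pcreHasUnicodeProperties(): bool
{
    // preg_match() returns false when the pattern fails to compile;
    // @ suppresses the "has not been compiled" warning during the probe.
    return @preg_match('/\pL/u', 'a') === 1;
}

// Pick a word pattern depending on what the PCRE build supports.
// The ASCII fallback is an illustrative assumption only.
$wordPattern = pcreHasUnicodeProperties()
    ? '/(?:(?=[\x{41}-\x{658}])\pL)+/u'
    : '/[A-Za-z]+/';
```

The probe result could be cached, since it cannot change during a request.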
Updated by Alexander Opitz over 11 years ago
- Status changed from New to Needs Feedback
- Target version deleted (0)
- TYPO3 Version set to 4.1
As this report is very old: does the handling in newer TYPO3 CMS versions (such as 6.0/6.1) work more the way you expect?
Updated by Martin Kutschker over 11 years ago
I haven't used indexed search for years (I first switched to mnogoSearch and have now settled on Solr). Personally I still advocate the use of preg_match and its Unicode features, but any change to the code will only please my tastes.
Updated by Alexander Opitz over 11 years ago
- Status changed from Needs Feedback to Closed
Ok, so closing this issue.