Feature #16947
closed
Use extended Unicode features of PHP 5.1 for word splitting
Added by Martin Kutschker almost 18 years ago.
Updated over 11 years ago.
Description
The word splitter currently uses homegrown PHP code to split text into words. Maybe it could be sped up by using the \p{xx} feature of the preg functions to search for characters with certain Unicode properties.
(issue imported from #M4934)
It's not that easy, because CJK glyphs are also reported as letters. But this could be overcome with conditional subpatterns:
/(?(?=[\x41-\x{658}])\pL)/u
This checks whether the character lies between \x41 and \x{658}. If so, it is matched against the Unicode property "letter".
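A minimal sketch of how such a range-restricted letter match might be used to pull words out of a string. The function name `splitWordsSketch` is hypothetical and not part of TYPO3. It also swaps the conditional subpattern for a plain lookahead repeated with `+`, because a conditional without an "else" branch matches the empty string whenever the condition fails. It assumes UTF-8 input and a PCRE build with Unicode property support.

```php
<?php
// Hypothetical helper for illustration; not the indexed search word splitter.
function splitWordsSketch(string $text): array
{
    // (?=[\x41-\x{658}]) only lets characters from U+0041 to U+0658 through;
    // \pL then requires that character to be a letter. CJK letters lie
    // outside the range and are therefore never part of a match.
    $pattern = '/(?:(?=[\x41-\x{658}])\pL)+/u';

    // preg_match_all() returns false if the pattern cannot be compiled or run.
    if (preg_match_all($pattern, $text, $matches) === false) {
        return [];
    }
    return $matches[0];
}

// Latin and Cyrillic words are returned, the CJK characters are skipped.
print_r(splitWordsSketch('Grüße aus 東京, привет!'));
```

Whether this is actually faster than the existing PHP code would still have to be measured.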
Unfortunately, not all PCRE builds are compiled with Unicode property support. Without it, you get this warning:
PHP Warning: preg_match(): Compilation failed: support for \P, \p, and \X has not been compiled
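One way to cope with this might be to probe for the feature once and fall back to a simpler pattern. The sketch below assumes that a failed compilation makes `preg_match()` return `false`. The helper name `pcreHasUnicodeProperties` and the ASCII-only fallback pattern are made up for illustration.

```php
<?php
// Hypothetical helper: checks whether the linked PCRE library understands \p{..}.
function pcreHasUnicodeProperties(): bool
{
    // @ suppresses the "Compilation failed" warning. preg_match() returns
    // false when the pattern cannot be compiled and 1 when the probe matches.
    return @preg_match('/\pL/u', 'a') === 1;
}

// Pick a letter pattern depending on what the PCRE build supports.
$letterPattern = pcreHasUnicodeProperties()
    ? '/\pL+/u'        // Unicode-aware letters
    : '/[A-Za-z]+/';   // assumed ASCII-only fallback, for illustration
```

The fallback loses all non-ASCII letters, so in practice the existing PHP splitter would probably remain the fallback path.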
- Status changed from New to Needs Feedback
- Target version deleted (0)
- TYPO3 Version set to 4.1
As this report is very old: does the handling in newer TYPO3 CMS versions (such as 6.0/6.1) behave more like you expect?
I haven't used indexed search for years (earlier I switched to mnogoSearch, but I have now settled on Solr). Personally I still advocate the use of preg_match and its Unicode features, but any change to the code will only please my tastes.
- Status changed from Needs Feedback to Closed
Ok, so closing this issue.