Bug #19400
Closed
Indexed search doesn't find letters in Thai language.
Description
I have a multilanguage site with 35 languages. Some of the languages, such as Russian, Chinese, Arabic and Thai, use non-Latin characters. The site is UTF-8 encoded. Indexed search has the same configuration for each language; only the L parameter differs. The search works for all languages except Thai.
I have checked the database: the Thai page is indexed, but the search finds 0 records when I search for a word.
Thai uses letters like these:
[Thai sample text not preserved; the characters were stored as question marks]
Does anyone know why the search doesn't work? As I wrote above, the search works fine for Chinese, Russian and other languages with special characters; only Thai doesn't work.
(issue imported from #M9461)
Updated by Juraj Sulek about 16 years ago
As I see, it is not possible to copy the letters from the page directly here. Here is a page with Thai letters: http://www.bmeia.gv.at/th/botschaft/bangkok/impressum.html
Updated by Peter Niederlag over 15 years ago
Has this been solved in the meantime?
Updated by Dmitry Dulepov about 14 years ago
This cannot really be solved. We spent a lot of time investigating this and found that every Asian language has its own specifics, and it is not possible to build a common algorithm to recognize characters there. For example, Japanese has three different writing systems, and some of them can be mixed. It is beyond our "open source" powers to implement that algorithm.
Currently indexed search implements a very simple approach. Each Asian character consists of 2-3 bytes. Suppose you have "~ab~cd~efh~ijk" as the text on the page (~ separates characters, each letter stands for one byte) and need to search for "xa~bc". Indexed search will try this:
abcdefhijk matches xabc?
bcdefhijk matches xabc?
cdefhijk matches xabc?
and so on. This is a standard, simple way to do it, and it often gives good results, but not always (it depends a lot on the language and the text). I do not think we can implement anything more sophisticated than that.
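The offset-by-offset comparison described above can be sketched roughly like this in Python. It is only an illustration of the idea, not the actual indexed_search code; the function name `byte_offsets` and the choice of UTF-8 are assumptions made for the example.

```python
def byte_offsets(haystack: str, needle: str, encoding: str = "utf-8") -> list[int]:
    """Return every byte offset at which the needle's bytes occur in the haystack's bytes.

    Hypothetical helper for illustration; it does no word segmentation at all.
    """
    hay = haystack.encode(encoding)
    pat = needle.encode(encoding)
    if not pat:
        return []
    hits = []
    # Try each starting byte in turn, like the
    # "abcdefhijk", "bcdefhijk", "cdefhijk", ... steps above.
    for start in range(len(hay) - len(pat) + 1):
        if hay[start:start + len(pat)] == pat:
            hits.append(start)
    return hits
```

For example, `byte_offsets("ภาษาไทย", "ไทย")` returns `[12]`, because each of the four preceding Thai characters takes 3 bytes in UTF-8. Note that this only finds raw byte sequences; it knows nothing about where Thai words begin or end, which is the part that cannot be solved with a generic algorithm.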
Updated by Alexander Opitz about 11 years ago
- Status changed from Accepted to Closed
- Target version deleted (0)
- TYPO3 Version set to 4.3
- Is Regression set to No
As this can't be resolved in indexed_search, you may use Solr search for such languages.
If you think that this is the wrong decision, then please write to the mailing list typo3.teams.bugs with the issue number and an explanation.