Use of mapping-file for ISO/Latin1 causes zero results to be returned for queries with unicode chars
Priority: Must have
Assignee: Ingo Renner
Category: Solr XML Configuration Files
This one cost me a lot of time and grief and constitutes a serious problem for language-localized UTF8-only TYPO3 installations using SOLR.
I had a 100% utf8 solution in TYPO3 and implemented SOLR. Everything was being indexed correctly (I even checked the Lucene index directly in Luke) and unicode chars were recorded with no problem. However, performing a search query with unicode chars consistently returned zero results both in the SOLR FE extension and using the tomcat SOLR admin query.
After debugging this for more than a day I decided to remove the <charFilter>-definitions from schema.xml, cleared cache and rebuilt the index - and the expected results were now immediately returned when querying with unicode characters.
There seems to be some problem in character translation when the ISO/Latin1 charmap file is referenced in the schema, at least when the site is in UTF8. Unfortunately I could not tell if the bug presents during ADDs to SOLR (did not seem so, but unsure since I also had to rebuild the index for results to show) or during SELECTs from SOLR (seems to happen here, since adding a facet which used unicode chars also returns zero results even when the original query had plenty of results).
There was no error given in the tomcat logs and/or devlog in TYPO3.
I did not check the charmap file itself for problems.
I have not tested this on Latin1-encoded sites or Latin1-encoded Lucene index contents.
Quick fix:
- remove the <charFilter> definitions from solr/conf/schema.xml and rebuild index.
- perhaps adding a condition to the SOLR query request to disallow character translation if renderCharset is 'utf-8' (not sure if SOLR can support this...)
- perhaps detecting if site has renderCharset 'utf-8' when running install.sh or adding an option parameter or simple dialog to the install.sh script, prompting the user to select site encoding
Updated by Ingo Renner over 2 years ago
Updated by Claus Due over 2 years ago
I did try that very early on since it seems to be the general thing that's going wrong when it concerns tomcat and unicode URIs. This parameter had no effect, neither did the useBodyEncodingForURI="false" parameter.
EDIT: actually, before I added the URIEncoding="UTF-8" parameter, the body XML echo of the $query parameter when searching with unicode chars displayed two non-unicode characters per unicode character. After I added it, the query was echoed in proper UTF-8. So it does have an effect, it is just not the cause of this particular problem of zero results when querying with unicode chars.
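For reference, the setting discussed here is an attribute on the HTTP connector in Tomcat's conf/server.xml. A minimal sketch, assuming a stock connector definition (port, protocol, timeout and redirect values are placeholder defaults, not taken from this installation):

```xml
<!-- conf/server.xml: HTTP connector. Everything except URIEncoding is an assumed default. -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8" />
```

URIEncoding tells Tomcat how to decode the percent-encoded bytes of the request URI, including the query string. That matches the observation above: it corrects how the query is echoed, but it does not change how Solr analyzes the query terms.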
Updated by georg kuehnberger over 2 years ago
I experienced exactly the same issues as described by Claus.
As far as I understood (during debugging), the problem is not caused by a broken mapping-ISOLatin1Accent.txt file, but rather by its improper usage.
The provided config does the following:
- DO use the ISO-charFilter at Index-Time
- DON'T use it at Query-Time
It does so for fieldType text and textSpell.
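In schema.xml terms, the setup described above would look roughly like the following. This is only a sketch, not the shipped file: the tokenizer and filter chain are assumptions chosen for illustration, and only the position of the charFilter matters here.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- characters are mapped before tokenizing, so folded forms end up in the index -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>   <!-- assumed -->
    <filter class="solr.LowerCaseFilterFactory"/>          <!-- assumed -->
  </analyzer>
  <analyzer type="query">
    <!-- no charFilter here: query terms keep their original characters -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>   <!-- assumed -->
    <filter class="solr.LowerCaseFilterFactory"/>          <!-- assumed -->
  </analyzer>
</fieldType>
```

With a chain like this, a word containing an umlaut is indexed in its folded form (whatever the mapping file prescribes), while the query still searches for the unfolded form, so the two never match.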
This might work dandy for English, however it breaks when using German with Umlauts.
Solution-1: As described: remove the ISO-charFilter altogether
Solution-2: Add the ISO-charFilter to the analyzer type "query" for text and textSpell
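Solution-2 could be sketched like this for the query analyzer of a field type such as text or textSpell. The tokenizer and filter shown are assumptions for illustration; the actual chain should mirror whatever the index analyzer of the same field type uses.

```xml
<analyzer type="query">
  <!-- same mapping as on the index side, so both sides fold characters identically -->
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>   <!-- assumed -->
  <filter class="solr.LowerCaseFilterFactory"/>          <!-- assumed -->
</analyzer>
```

Since only the query-side analysis changes, reloading the Solr core (or restarting Tomcat) should be enough for this to take effect. A reindex is needed only if the index analyzer is changed as well.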
hth regards georg
Updated by Ingo Renner over 2 years ago
- Assignee set to Markus Goldbach
Markus, could you take a look at this please?
Updated by Olivier Dobberkau over 1 year ago
- Status changed from New to Needs Feedback
- Assignee changed from Markus Goldbach to Ingo Renner
- Target version set to 2.5-dkd
Haven't we removed this in some release this year?
Updated by Ingo Renner over 1 year ago
- Status changed from Needs Feedback to Resolved
- Target version changed from 2.5-dkd to 2.0
- % Done changed from 0 to 100
Has been resolved "somewhen" by removing the mapping configuration.