textExtract for pdf returns utf8, leading to further problems
Priority: Should have
Due date:
Assignee: Ingo Renner
% Done:
TYPO3 Version:
Has patch:
When trying to use the textExtract service for a PDF, you'll usually end up with cc_txtextexec, which executes "pdftotext". However, if process() is called on the document without further arguments, the UTF-8 content that is returned gets transformed into ISO-8859-1. To avoid that, a configuration needs to be passed. Since the Solr interface uses UTF-8 anyway, we can hardcode "utf-8" there and thereby avoid the unnecessary charset conversion.
PS: DAM does it this way as well in getFileMetaInfo().
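A minimal sketch of passing the target charset to the service, following the old t3lib service API; the configuration key 'wantedCharset' and the exact call sequence are assumptions based on how DAM's getFileMetaInfo() hands the charset to the textExtract service:

```php
<?php
// Sketch (assumptions): the service is obtained via the standard t3lib
// service API, and 'wantedCharset' is the configuration key the
// textExtract service honors (mirroring DAM's getFileMetaInfo()).
$service = t3lib_div::makeInstanceService('textExtract', 'pdf');
if (is_object($service)) {
    $service->setInputFile('/path/to/document.pdf', 'pdf');

    // Hardcoding utf-8 here skips the ISO-8859-1 conversion entirely;
    // Solr expects UTF-8 anyway, so no conversion is needed.
    $service->process('', '', array('wantedCharset' => 'utf-8'));

    $content = $service->getOutput();
}
```

With the charset fixed to UTF-8, the extracted text can be handed to the Solr document as-is.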
If the content is not processed as UTF-8, it may later lead to problems when htmlspecialchars() is called with 'utf-8' explicitly, which can result in a completely empty content field (and thus almost no content being sent to the Solr server).
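To illustrate the failure mode: on the PHP versions contemporary with this issue, htmlspecialchars() with an explicit UTF-8 charset returns an empty string as soon as the input contains byte sequences that are invalid UTF-8, which is exactly what ISO-8859-1 umlauts look like to a UTF-8 decoder. The snippet below is an illustration only (note that PHP 8.1 changed the default flags to substitute invalid bytes instead):

```php
<?php
// "Änderung" re-encoded as ISO-8859-1: the umlaut becomes a single
// 0xC4 byte, which is an invalid UTF-8 sequence on its own.
$iso = utf8_decode("Änderung");

// On the PHP versions relevant here, this yields "" (empty string),
// silently wiping the whole content field before it reaches Solr.
$escaped = htmlspecialchars($iso, ENT_COMPAT, 'utf-8');
```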
Updated by Olivier Dobberkau almost 2 years ago
- Status changed from New to Needs Feedback
Can you try Tika instead?
Updated by Ingo Renner over 1 year ago
- Status changed from Accepted to Resolved
- Assignee set to Ingo Renner
- % Done changed from 0 to 100
Applied in dkd-EAP r 89747