Bug #28072
textExtract for pdf returns utf8, leading to further problems
| Status: | Closed | Start date: | 2011-07-10 |
|---|---|---|---|
| Priority: | Should have | Due date: | |
| Assignee: | Ingo Renner | % Done: | 100% |
| Category: | File Indexing | | |
| Target version: | 2.5-dkd | | |
| TYPO3 Version: | | Has patch: | |
| PHP Version: | | Tags: | |
| Votes: | 0 | | |
Description
When you use the textExtract service for a PDF, you usually end up at cc_txtextexec, which executes "pdftotext". However, if process() is simply called on the document without any configuration, the UTF-8 content that comes back is converted to ISO-8859-1. To avoid that, a configuration needs to be passed. Since the Solr interface uses UTF-8, we can hardcode "utf-8" there and thereby skip the unnecessary charset conversion.
PS: DAM does it this way as well, in getFileMetaInfo().
If the content is not kept as UTF-8, it can cause problems later when htmlspecialchars() is called with UTF-8 given explicitly: the call can return a completely empty string, which leaves the content field empty (and thus almost no content to send to the Solr server).
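The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the patch that was applied: the sample string is made up, and `mb_convert_encoding()` (which requires the mbstring extension) merely stands in for the charset conversion that happens inside the service. The flags are passed explicitly as `ENT_COMPAT` so that neither `ENT_SUBSTITUTE` nor `ENT_IGNORE` masks the problem.

```php
<?php
// Hypothetical stand-in for pdftotext output: UTF-8 text with non-ASCII
// characters ("Grüße & more", written as explicit bytes).
$utf8 = "Gr\xC3\xBC\xC3\x9Fe & more";

// Stand-in for the unwanted conversion step (UTF-8 -> ISO-8859-1).
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// The ISO-8859-1 bytes are not a valid UTF-8 sequence, so
// htmlspecialchars() returns an empty string for them.
$escapedBroken = htmlspecialchars($latin1, ENT_COMPAT, 'UTF-8');

// The untouched UTF-8 string is escaped as expected.
$escapedOk = htmlspecialchars($utf8, ENT_COMPAT, 'UTF-8');

var_dump($escapedBroken === ''); // bool(true)
var_dump($escapedOk === "Gr\xC3\xBC\xC3\x9Fe &amp; more"); // bool(true)
```

Keeping the extracted text in UTF-8 from end to end avoids the empty result without having to change the `htmlspecialchars()` call.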
History
Updated by Olivier Dobberkau almost 2 years ago
- Status changed from New to Needs Feedback
Can you try tika instead?
Updated by Ingo Renner almost 2 years ago
- Target version changed from 2.0 to 2.5-dkd
Updated by Ingo Renner over 1 year ago
- Status changed from Needs Feedback to Accepted
Updated by Ingo Renner over 1 year ago
- Status changed from Accepted to Resolved
- Assignee set to Ingo Renner
- % Done changed from 0 to 100
Applied in dkd-EAP r 89747
Updated by Ingo Renner about 1 year ago
- Status changed from Resolved to Closed