
Bug #28072

textExtract for pdf returns utf8, leading to further problems

Added by Stefan Neufeind almost 2 years ago. Updated about 1 year ago.

Status: Closed
Start date: 2011-07-10
Priority: Should have
Due date:
Assignee: Ingo Renner
% Done: 100%
Category: File Indexing
Target version: 2.5-dkd
TYPO3 Version:
Has patch:
PHP Version:
Tags:
Votes: 0

Description

When using the textExtract service for a PDF, you will usually end up in cc_txtextexec, which executes "pdftotext". However, if only process() is called on the document, the UTF-8 content that is returned gets converted to ISO-8859-1. To avoid that, a configuration needs to be passed. Since the Solr interface uses UTF-8, we can hardcode "utf-8" there and thereby avoid the unnecessary charset conversion.

PS: DAM does it the same way in getFileMetaInfo().

If the content is not handled as UTF-8, it can later cause problems when htmlspecialchars() is called with UTF-8 given explicitly: the result can be a completely empty content field (and thus almost no content to send to the Solr server).
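To illustrate the failure mode, here is a minimal sketch in Python (not the PHP code of the extension). The helper names to_latin1 and escape_as_utf8 are hypothetical; escape_as_utf8 only approximates how htmlspecialchars() with an explicit UTF-8 charset returns an empty string when its input is not valid UTF-8.

```python
import html


def to_latin1(utf8_bytes: bytes) -> bytes:
    """Hypothetical stand-in for the unwanted UTF-8 -> ISO-8859-1 conversion."""
    return utf8_bytes.decode("utf-8").encode("iso-8859-1")


def escape_as_utf8(raw: bytes) -> str:
    """Rough analogue of htmlspecialchars() called with charset UTF-8:
    input that is not valid UTF-8 yields an empty string."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return ""
    return html.escape(text, quote=True)


# Sample text standing in for what pdftotext returns as UTF-8 bytes.
extracted = "Größe <b>".encode("utf-8")

kept_utf8 = escape_as_utf8(extracted)             # content survives
converted = escape_as_utf8(to_latin1(extracted))  # content is lost
```

With the content left as UTF-8, the escaped text is preserved. After the detour through ISO-8859-1, the bytes for "ö" and "ß" no longer form valid UTF-8, so the escaping step returns an empty string, which matches the empty content field described above.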

textextract-utf8.patch - small patch (647 Bytes) Stefan Neufeind, 2011-07-10 22:02

History

Updated by Olivier Dobberkau almost 2 years ago

  • Status changed from New to Needs Feedback

Can you try Tika instead?

Updated by Ingo Renner almost 2 years ago

  • Target version changed from 2.0 to 2.5-dkd

Updated by Ingo Renner over 1 year ago

  • Status changed from Needs Feedback to Accepted

Updated by Ingo Renner over 1 year ago

  • Status changed from Accepted to Resolved
  • Assignee set to Ingo Renner
  • % Done changed from 0 to 100

Applied in dkd-EAP r 89747

Updated by Ingo Renner about 1 year ago

  • Status changed from Resolved to Closed
