Bug #99352
closed
PDF Metadata double-encoded by index-search indexer with poppler-utils pdfinfo
100%
Description
pdfinfo version 21.08.0
Copyright 2005-2021 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
There are different versions of pdfinfo available in the wild.
Debian/Fedora use pdfinfo (>v20) from the poppler-utils package.
Well-regarded hosting providers such as Hetzner also use this version.
This tool defaults to UTF-8 output for metadata:
pdfinfo umlauts-metadata.pdf | grep Title
Title: Test æ ø å ü ö ä
By contrast, hosting providers such as Mittwald and Domainfactory use the older pdfinfo v3, which defaults to Latin1 output.
pdfinfo -v
pdfinfo version 3.02
Copyright 1996-2007 Glyph & Cog, LLC
This tool produces Latin1 output by default:
pdfinfo umlauts-metadata.pdf | grep Title
Title: Test � � � � � �
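The replacement characters in that output can be reproduced without pdfinfo. The following is a minimal Python sketch, assuming the title is written as Latin1 (ISO-8859-1) bytes and then read by a UTF-8 terminal; the variable names are illustrative only and not taken from TYPO3 or pdfinfo.

```python
# Title stored in the sample PDF from this report.
title = "Test æ ø å ü ö ä"

# Assumption: pdfinfo 3.02 writes the title as Latin1 bytes,
# one byte per accented character.
latin1_bytes = title.encode("iso-8859-1")

# A UTF-8 terminal cannot decode those single high bytes and shows
# U+FFFD (the replacement character) for each of them.
as_seen_in_utf8_terminal = latin1_bytes.decode("utf-8", errors="replace")

# Decoding with the right charset recovers the original title.
recovered = latin1_bytes.decode("iso-8859-1")
```

Here `as_seen_in_utf8_terminal` contains six replacement characters, one per accented letter, while `recovered` equals the original title.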
Both versions support an -enc UTF-8 option. TYPO3 should pass it to circumvent the differences between these tools, instead of always assuming that v3 is in use and forcibly converting from ISO-8859-1 to UTF-8 (as added in https://review.typo3.org/c/Packages/TYPO3.CMS/+/76861), which leads to double-encoding with the poppler-utils pdfinfo variant.
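To illustrate the double-encoding and the proposed fix, here is a minimal Python sketch. It is not the TYPO3 implementation: `force_latin1_to_utf8`, `build_pdfinfo_command` and `read_pdf_metadata` are hypothetical helpers, and only the `-enc UTF-8` option itself comes from this report.

```python
import subprocess

title = "Test æ ø å ü ö ä"


def force_latin1_to_utf8(raw: bytes) -> bytes:
    """Hypothetical stand-in for an unconditional ISO-8859-1 -> UTF-8 conversion."""
    return raw.decode("iso-8859-1").encode("utf-8")


# Correct when the tool emitted Latin1 (pdfinfo 3.02 default).
from_v3 = force_latin1_to_utf8(title.encode("iso-8859-1")).decode("utf-8")

# Wrong when the tool already emitted UTF-8 (poppler-utils default):
# every two-byte sequence is treated as two Latin1 characters.
from_poppler = force_latin1_to_utf8(title.encode("utf-8")).decode("utf-8")


def build_pdfinfo_command(pdf_path: str, binary: str = "pdfinfo") -> list:
    """Request UTF-8 explicitly so both pdfinfo variants behave the same."""
    return [binary, "-enc", "UTF-8", pdf_path]


def read_pdf_metadata(pdf_path: str) -> dict:
    """Run pdfinfo and parse its 'Key: value' lines; no charset conversion needed."""
    result = subprocess.run(
        build_pdfinfo_command(pdf_path), capture_output=True, check=True
    )
    metadata = {}
    for line in result.stdout.decode("utf-8", errors="replace").splitlines():
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()
    return metadata
```

With the forced conversion, `from_v3` equals the original title, but `from_poppler` becomes "Test Ã¦ Ã¸ Ã¥ Ã¼ Ã¶ Ã¤". Passing `-enc UTF-8` removes the need to guess which variant is installed.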
Files