Bug #99352

closed

PDF Metadata double-encoded by index-search indexer with poppler-utils pdfinfo

Added by Benjamin Franzke almost 2 years ago. Updated almost 2 years ago.

Status: Closed
Priority: Should have
Assignee: -
Category: Indexed Search
Target version:
Start date: 2022-12-13
Due date:
% Done: 100%
Estimated time:
TYPO3 Version: 12
PHP Version:
Tags:
Complexity:
Is Regression:
Sprint Focus:

Description

pdfinfo version 21.08.0
Copyright 2005-2021 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC

Different versions of pdfinfo are in use in the wild.
Debian and Fedora ship pdfinfo (> v20) from the poppler-utils package.
Good hosting providers such as Hetzner also use this version.
This tool defaults to UTF-8 output for metadata:

pdfinfo umlauts-metadata.pdf  | grep Title
Title:          Test æ ø å ü ö ä

On the other hand, hosting providers such as Mittwald and Domainfactory use the older v3 of pdfinfo, which defaults to Latin1 output.

pdfinfo -v
pdfinfo version 3.02
Copyright 1996-2007 Glyph & Cog, LLC

This tool produces Latin1 output by default:

pdfinfo umlauts-metadata.pdf | grep Title
Title:          Test � � � � � �
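
Why the Latin1 output shows up as replacement characters can be sketched in a few lines of Python. This is an illustration only, assuming a terminal (or consumer) that decodes the bytes as UTF-8; the title string is taken from the sample above and the variable names are made up for this sketch.

```python
# Title as stored in the sample PDF metadata.
title = "Test æ ø å ü ö ä"

# What a Latin1-defaulting pdfinfo would emit: one byte per special character.
latin1_bytes = title.encode("iso-8859-1")

# A UTF-8 consumer cannot decode these bytes; each one becomes U+FFFD.
shown_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# Decoding with the encoding that was actually used recovers the title.
recovered = latin1_bytes.decode("iso-8859-1")

print(shown_as_utf8)
print(recovered)
```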

Both versions support an -enc UTF-8 option. TYPO3 should use it to sidestep the differences between these tools, instead of always assuming that v3 is in use and forcibly converting from ISO-8859-1 to UTF-8 (as added in https://review.typo3.org/c/Packages/TYPO3.CMS/+/76861), which leads to double-encoding with the poppler-utils pdfinfo variant.
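
The double-encoding can be reproduced outside TYPO3. The following is a minimal Python sketch, not the code from the linked patches: it simulates an unconditional ISO-8859-1 to UTF-8 conversion applied to output that is already UTF-8, and then shows a hypothetical helper that requests UTF-8 explicitly with the `-enc UTF-8` option named in this report. The function names (`force_latin1_to_utf8`, `parse_title`, `read_pdf_title`) are invented for illustration.

```python
import subprocess
from typing import Optional


def force_latin1_to_utf8(raw: bytes) -> bytes:
    """Simulate a conversion that always assumes ISO-8859-1 input."""
    return raw.decode("iso-8859-1").encode("utf-8")


def parse_title(pdfinfo_output: str) -> Optional[str]:
    """Return the value of the 'Title:' line, or None if it is absent."""
    for line in pdfinfo_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Title":
            return value.strip()
    return None


def read_pdf_title(path: str) -> Optional[str]:
    """Hypothetical helper: request UTF-8 explicitly, so no guessing is needed."""
    result = subprocess.run(
        ["pdfinfo", "-enc", "UTF-8", path],
        capture_output=True,
        check=True,
    )
    return parse_title(result.stdout.decode("utf-8"))


# Output of a UTF-8-defaulting pdfinfo, as in the poppler-utils sample.
utf8_output = "Title:          Test æ ø å ü ö ä\n".encode("utf-8")

# Forcing a Latin1 -> UTF-8 conversion on it encodes every byte a second time:
# "ü" (bytes C3 BC) is read as the two characters "Ã¼".
double_encoded = force_latin1_to_utf8(utf8_output).decode("utf-8")

print(parse_title(utf8_output.decode("utf-8")))
print(parse_title(double_encoded))
```

With `-enc UTF-8` passed on every call, the output encoding no longer depends on which pdfinfo variant is installed, so no follow-up conversion step is required.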


Files

umlauts-metadata.pdf (7.47 KB), Benjamin Franzke, 2022-12-13 08:06
umlauts-double-encoding.png (76 KB), Benjamin Franzke, 2022-12-13 08:11

Related issues 1 (0 open, 1 closed)

Related to TYPO3 Core - Bug #80085: Extraction of metadata in PDF-documents does not recognize unicode characters - Closed - 2017-03-01

#1

Updated by Benjamin Franzke almost 2 years ago

  • Related to Bug #80085: Extraction of metadata in PDF-documents does not recognize unicode characters added
#2

Updated by Gerrit Code Review almost 2 years ago

  • Status changed from New to Under Review

Patch set 1 for branch main of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/77074

#3

Updated by Gerrit Code Review almost 2 years ago

Patch set 2 for branch main of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/77074

#4

Updated by Gerrit Code Review almost 2 years ago

Patch set 1 for branch 11.5 of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/77081

#5

Updated by Gerrit Code Review almost 2 years ago

Patch set 1 for branch 10.4 of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/77082

#6

Updated by Benjamin Franzke almost 2 years ago

  • Status changed from Under Review to Resolved
  • % Done changed from 0 to 100
#7

Updated by Gerrit Code Review almost 2 years ago

  • Status changed from Resolved to Under Review

Patch set 1 for branch 12.1 of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/77113

#8

Updated by Gerrit Code Review almost 2 years ago

Patch set 2 for branch 12.1 of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/77113

#9

Updated by Gerrit Code Review almost 2 years ago

Patch set 3 for branch 12.1 of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/77113

#10

Updated by Benjamin Franzke almost 2 years ago

  • Status changed from Under Review to Resolved
#11

Updated by Benni Mack almost 2 years ago

  • Status changed from Resolved to Closed