Bug #104277: Indexed search do not index PDF files - TYPO3 Core - TYPO3 Forge

Added by Alex Tuveri 4 months ago.

Status: New
Priority: Should have
Category: Indexed Search
Start date: 2024-07-01
PHP Version: 7.4
Tags: indexed-search typo3 pdf

Description

I'm hoping to get help with a problem I can't solve. The working environment is as follows:

SYSTEM
Debian 12 bookworm
PHP 7.4 with FPM/FastCGI (8.2 and 8.3 were also tried, but the crawler failed on them)
/usr/bin/pdftotext and /usr/bin/pdfinfo installed
open_basedir: directive added (to be sure) to allow /usr/bin
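
As a sanity check for the setup above, the following is a minimal sketch (not part of TYPO3 or Indexed Search) that confirms both parsers are executable in the configured directory and that pdftotext returns text for a sample file. The function names `find_pdf_tools` and `extract_text` are hypothetical, and the script runs as the shell user, so it says nothing about what PHP-FPM is allowed to do under open_basedir.

```python
import shutil
import subprocess


def find_pdf_tools(parser_dir="/usr/bin", tools=("pdftotext", "pdfinfo")):
    """Map each tool name to its executable path inside parser_dir, or None."""
    # shutil.which only returns a path if the file exists and is executable.
    return {tool: shutil.which(tool, path=parser_dir) for tool in tools}


def extract_text(pdf_path, parser_dir="/usr/bin"):
    """Run pdftotext on pdf_path and return the extracted text."""
    exe = shutil.which("pdftotext", path=parser_dir)
    if exe is None:
        raise FileNotFoundError(f"pdftotext not found in {parser_dir}")
    # "-" as the output file makes pdftotext write to stdout; -q hides warnings.
    result = subprocess.run(
        [exe, "-q", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    for name, path in find_pdf_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

If both tools are found and `extract_text` returns readable text for a file under fileadmin, the parsers themselves are probably fine and the problem lies elsewhere in the indexing chain.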

TYPO3 v.11.5
Indexed search engine 11.5.38
site crawler 11.0.7
Bootstrap Package 12.0.10
(all extensions are installed correctly)

TYPO3 settings (setup)
page.config.index_enable = 1
page.config.index_externals = 1

Admin Tools > Settings > Extension Configuration

INDEXED SEARCH
Path to PDF parsers: /usr/bin
PDF parsing mode: 0
Full Text Data Length: 0
Enable metaphone search (sounds like): 1
Ignore Extensions: jpg,gif.jpeg,html
Debug mode: checked
Max External files: 99
Use "crawler" extension to index external files: unchecked

SITE CRAWLER configuration
base site: https://domain.it/
php path: /usr/bin/php

ROOT PAGE OF THE SITE
I have tried numerous configurations to try to index pdfs, such as:

FilePath
with path configured: fileadmin
and depth: 4 levels below
Limit to extensions (commalist) pdf (To avoid other files like doc, rtf, etc.)

Database records
with table: file

THE PROBLEM
The instructions on the TYPO3 website have outdated screenshots and are in some respects insufficient. As I understand it, even if the "crawler" extension is not installed, each time a page is requested by a visitor who is NOT logged in to the backend (I assume this means using another browser), TYPO3 Indexed Search parses the page, extracts the words it deems useful for indexing, then reads the links and indexes them too. For PDFs, it should open them one by one, inspect them with pdftotext/pdfinfo, and derive further keywords from the resulting text to index the document.

So I tried every approach I could think of, including running the crawler manually: first filling the queue, then defining the process, and then launching it while watching the progress percentage.

Indexing occurs ONLY for pages, via the crawler (launched by hand) or by visiting the site.

From Web > Indexing > (root page):

List: Pages - indexed statistics show only indexed contents (type, date, etc.) but no pdf files
List: External documents - list always empty
Detailed statistics -> overview - all pages listed but no related or listed pdfs

Test

Indexed Search configuration: whether Use "crawler" extension to index external files is set to 1 or not, nothing changes.
By chance I discovered that typing the keyword "pdf" in the search box returns 6 indexed files, and no new ones are added despite the repeated tests described above. Even worse, these files appear as duplicates, i.e. the output is: document1-1-12.pdf (referring to pages 1-12), document1-13-25.pdf (pages 13-25), etc.
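
The ranged entries above may be page-range parts of one file rather than true duplicates. The sketch below is only an illustration, under the assumption (not confirmed in this report) that the indexer splits a PDF into page ranges according to the PDF parsing mode: a positive value meaning pages per part, and zero or a negative value meaning a number of parts. The function `page_ranges` is hypothetical and is not the extension's code.

```python
import math


def page_ranges(pages, mode):
    """Split a PDF of `pages` pages into "low-high" ranges.

    Assumed semantics: mode > 0 is pages per part; mode <= 0 is the
    number of parts (absolute value), clamped to 1..pages.
    """
    if pages < 1:
        return []
    if mode > 0:
        parts = math.ceil(pages / mode)
    else:
        parts = min(max(abs(mode), 1), pages)
    ranges = []
    for i in range(parts):
        # Spread the pages as evenly as possible across the parts.
        low = math.floor(i * (pages / parts)) + 1
        high = math.floor((i + 1) * (pages / parts))
        ranges.append(f"{low}-{high}")
    return ranges


print(page_ranges(25, 20))  # ['1-12', '13-25']
print(page_ranges(25, 0))   # ['1-25']
```

Under these assumptions, a 25-page document with a mode of 20 gives exactly the ranges 1-12 and 13-25, while a mode of 0 gives a single part. If that holds, the 6 existing entries could date from before the parsing mode was set to 0.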

I hope to find some help, thank you very much
