Bug #104277
open
Indexed search does not index PDF files
Description
I'm hoping to get help with a problem I can't solve. The working environment is as follows:
SYSTEM
Debian 12 bookworm
PHP 7.4 with FPM/FastCGI (8.2 and 8.3 were also tried, but the crawler failed)
/usr/bin/pdftotext and /usr/bin/pdfinfo installed
open_basedir: directive added (to be sure) to allow /usr/bin
TYPO3 v.11.5
Indexed Search engine 11.5.38
Site Crawler 11.0.7
Bootstrap Package 12.0.10
(all extensions are installed correctly)
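A quick way to confirm that the two PDF tools are present and executable is sketched below. It is only a minimal check, assuming Python 3 is available on the server. The function name `check_pdf_tools` is made up for this example, and the check does not account for PHP's open_basedir, which only applies inside PHP.

```python
import os


def check_pdf_tools(directory, tools=("pdftotext", "pdfinfo")):
    """Return {tool: True/False} depending on whether an executable
    file with that name exists in `directory`."""
    result = {}
    for tool in tools:
        path = os.path.join(directory, tool)
        result[tool] = os.path.isfile(path) and os.access(path, os.X_OK)
    return result


if __name__ == "__main__":
    # /usr/bin is where pdftotext and pdfinfo are installed in this report.
    for tool, ok in check_pdf_tools("/usr/bin").items():
        print(f"{tool}: {'found' if ok else 'MISSING or not executable'}")
```

Running it as the same user that runs PHP-FPM would be the most meaningful, since file permissions can differ between users.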
TYPO3 settings (setup)
page.config.index_enable = 1
page.config.index_externals = 1
Admin Tools > Settings > Extension Configuration
INDEXED SEARCH
Path to PDF parsers /usr/bin
PDF parsing mode: 0
Full Text Data Length 0
Enable metaphone search (sounds like) 1
Ignore Extensions jpg,gif.jpeg,html
Debug mode: checked
Max External files: 99
Use "crawler" extension to index external files unchecked
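For reference, this is how the "Ignore Extensions" value above breaks down if it is split on commas. The sketch assumes plain comma splitting, which is an assumption about how the setting is read, and `parse_extension_list` is a hypothetical helper. It suggests that pdf is not in the ignore list, and that `gif.jpeg` (written with a dot) would be read as a single entry.

```python
def parse_extension_list(value):
    """Split a comma-separated setting into trimmed, non-empty entries."""
    return [item.strip() for item in value.split(",") if item.strip()]


ignored = parse_extension_list("jpg,gif.jpeg,html")
print(ignored)             # ['jpg', 'gif.jpeg', 'html']
print("pdf" in ignored)    # False
print("jpeg" in ignored)   # False
```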
SITE CRAWLER configuration
base site: https://domain.it/
php path: /usr/bin/php
ROOT PAGE OF THE SITE
I have tried numerous configurations to try to index pdfs, such as:
FilePath
with path configured: fileadmin
and depth: 4 levels below
Limit to extensions (commalist) pdf (To avoid other files like doc, rtf, etc.)
Database records
with table: file
THE PROBLEM
I see that the instructions on the TYPO3 website have outdated screenshots and are in some respects insufficient.
As I understand it, even if the "crawler" extension is not installed, each time a page is loaded from outside the active backend session (I assume this means with another browser), TYPO3 Indexed Search parses the page, extracts the words it deems useful for indexing, then reads the links and indexes them too. For PDFs, it should open them one by one, inspect them with pdftotext/pdfinfo, and obtain further keywords from the resulting text to index the document.
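The pdfinfo/pdftotext step described above can be reproduced by hand, outside TYPO3, to see whether the tools themselves return usable text. The following is a minimal sketch, not the way Indexed Search calls the tools internally: the function names are hypothetical, and the only tool behaviour relied on is that pdfinfo prints a "Pages:" line and that pdftotext writes to stdout when the output file is given as "-".

```python
import re
import subprocess


def parse_page_count(pdfinfo_output):
    """Extract the page count from pdfinfo's "Pages:" line, or None."""
    match = re.search(r"^Pages:\s+(\d+)\s*$", pdfinfo_output, re.MULTILINE)
    return int(match.group(1)) if match else None


def extract_pdf_text(pdf_path, tool_dir="/usr/bin"):
    """Run pdfinfo and pdftotext on one file; return (pages, text)."""
    info = subprocess.run(
        [f"{tool_dir}/pdfinfo", pdf_path],
        capture_output=True, text=True, check=True,
    ).stdout
    # "-" as the output file makes pdftotext write to stdout.
    text = subprocess.run(
        [f"{tool_dir}/pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_page_count(info), text
```

Calling `extract_pdf_text` on one of the PDFs under fileadmin should return a page count and some text if the tools work. An exception or empty text would point to a problem with the tools or the file rather than with the indexer.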
I therefore tried every approach, including launching the crawler manually: first filling the queue, then defining the process, then launching it and watching the progress percentage.
Indexing occurs ONLY for pages, via the crawler (launched by hand) or by visiting the site.
From Web > Indexing > (root page):
- List: Pages - the indexed statistics show only indexed content (type, date, etc.) but no PDF files
- List: External documents - the list is always empty
- Detailed statistics > Overview - all pages are listed, but no PDFs are listed or related to them
Test
- Indexed Search configuration: whether Use "crawler" extension to index external files is set to 1 or not, nothing changes
- By chance I discovered that typing the keyword "pdf" in the search box returns 6 indexed files, and no new ones are added despite the repeated tests described above. Even worse, these files are duplicates, i.e. the output is: document1-1-12.pdf (referring to pages 1-12), document1-13-25.pdf (pages 13-25), etc.
I hope to find some help, thank you very much