Bug #50095

Epic #65815: Improve Indexed search indexer

Indexing of external files and absRefPrefix

Added by Harald no-lastname-given over 7 years ago. Updated about 4 years ago.

Status:
Closed
Priority:
Must have
Assignee:
-
Category:
Indexed Search
Target version:
Start date:
2013-07-17
Due date:
% Done:

0%

TYPO3 Version:
6.1
PHP Version:
Tags:
Complexity:
hard
Is Regression:
No
Sprint Focus:

Description

Hello!

If I use, for example, the following configuration

config.index_enable = 1
config.index_externals = 1

together with

config.absRefPrefix = /

the indexed files are displayed in the backend, but the frontend search does not find them. It only finds the pages on which the files are displayed.

When I use

config.absRefPrefix =

and then re-index, it works!

Thank you very much

Harald

50095-patch1.diff (818 Bytes) Chris Müller, 2013-10-21 15:10


Related issues

Related to TYPO3 Core - Bug #64315: Typo in function name checkExistance Closed 2015-01-16
Related to TYPO3 Core - Bug #44381: indexed_search FE Plugin doesn't show external urls in TYPO3 4.7.7 Closed 2013-01-08

History

#1 Updated by Philipp Gampe about 7 years ago

  • Category set to Indexed Search
  • Status changed from New to Needs Feedback

As said in the other bug report, this is not a question-and-answer site. This tracker is for actual bugs, feature requests and tasks that need to be done.

I am not familiar with indexed search; you had better ask in the forum/newsgroup.

Did you try this while setting config.absRefPrefix to the full domain?

#2 Updated by Markus Blaschke about 7 years ago

I can confirm this issue with TYPO3 6.1.3.

There is also no output when using the full domain in config.absRefPrefix.

All files are indexed with /fileadmin/... when using config.absRefPrefix =

If we use the full domain, e.g. config.absRefPrefix = http://www.example.com/, the files are indexed as http://www.example.com/fileadmin/....

#3 Updated by Philipp Gampe about 7 years ago

This might be an issue inside indexed search then. Do you know if it worked before, e.g. in some 4.x version?

#4 Updated by Harald no-lastname-given about 7 years ago

Hello Philipp!

Thanks for the reply!
Unfortunately, I cannot say whether the problem already existed in TYPO3 4.x.
At least a corresponding file is found in the search in TYPO3 4.x. Whether the contents of this file were indexed, I no longer know!

Many Thanks

Harald

#5 Updated by Philipp Gampe about 7 years ago

OK, leaving this open, but someone will need to dig into indexed_search and find the root cause of this behavior.

#6 Updated by Chris Müller about 7 years ago

Today I had the same problem in TYPO3 6.1.5. We are using config.absRefPrefix = / and found that PDF files are not shown in the result list. So I dug into the code: the PDF files are found, but the method "checkExistance()" throws them out of the results.

Attached is the patch 50095-patch1.diff, which fixes this issue. I tested it with absRefPrefix = / and baseUrl = http://www.example.org/

#7 Updated by Thirot no-lastname-given over 6 years ago

> Unfortunately, I cannot say whether the problem already existed in TYPO3 4.x.
> At least a corresponding file is found in the search in TYPO3 4.x. Whether the contents of this file were indexed, I no longer know!

I can see this issue in TYPO3 4.7.17.
The path is saved in the database with absRefPrefix.
The file is indexed but not usable in the search form.
It is also impossible to re-index the file in the backend.
indexed_search seems to use a relative path without the leading slash, but parse_url() returns [path] => /path.
So what is the correct path to use?
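The leading-slash difference described above can be reproduced in isolation. The following is only a minimal sketch: the file paths are made up, and toSiteRelativePath() is a hypothetical helper, not a function of indexed_search.

```php
<?php
declare(strict_types=1);

// Hypothetical helper (not part of indexed_search): reduce a link to a
// site-relative path without a leading slash.
function toSiteRelativePath(string $link): string
{
    // parse_url() keeps the leading slash for absolute paths and full URLs
    // and returns a relative path unchanged.
    $path = parse_url($link, PHP_URL_PATH);
    return ltrim((string) $path, '/');
}

// Made-up example links, as they might look with different absRefPrefix values.
$links = [
    'fileadmin/docs/manual.pdf',                        // absRefPrefix empty
    '/fileadmin/docs/manual.pdf',                       // absRefPrefix = /
    'http://www.example.com/fileadmin/docs/manual.pdf', // absRefPrefix = full domain
];

foreach ($links as $link) {
    printf(
        "%s\n  parse_url path: %s\n  normalized:     %s\n",
        $link,
        (string) parse_url($link, PHP_URL_PATH),
        toSiteRelativePath($link)
    );
}
```

All three links normalize to the same site-relative path, which is the kind of value a path-based hash would need in order to stay stable across absRefPrefix settings.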

Update 2014.01.28
- An absolute absRefPrefix (http://site.com) does not work. Any http:// URL is saved as an external URL.
- indexed_search uses both the URL and the localPath.
- The first phash calculated by the crawler is based on the localPath and not on the URL; for this reason, phashes based on the URL are invalid.
- The base tag of the HTML page is not used in extractHyperLinks() to build absolute URLs?
- But absolute URLs can generate duplicate content across multiple domains.
- I can't figure out how indexed_search is supposed to work.

I made a patch for myself. This patch saves the path instead of the URL in order to be phash compatible.
I did not test external documents.

 typo3/sysext/indexed_search/class.indexer.php | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/typo3/sysext/indexed_search/class.indexer.php b/typo3/sysext/indexed_search/class.indexer.php
index e3beed0..762754c 100755
--- a/typo3/sysext/indexed_search/class.indexer.php
+++ b/typo3/sysext/indexed_search/class.indexer.php
@@ -731,7 +731,7 @@ class tx_indexedsearch_indexer {
                         if (is_object($crawler))    {
                             $params = array(
                                 'document' => $linkSource,
-                                'alturl' => $linkInfo['href'],
+                                'alturl' => $linkInfo['localPath'],
                                 'conf' => $this->conf
                             );
                             unset($params['conf']['content']);

#8 Updated by Alexander Opitz over 6 years ago

  • Status changed from Needs Feedback to New
  • Is Regression set to No

#9 Updated by Ben Robinson about 6 years ago

I can confirm that this issue still exists in 6.2.4

#10 Updated by Markus Klein almost 6 years ago

  • Status changed from New to Accepted
  • Priority changed from Should have to Must have
  • Target version set to next-patchlevel
  • Complexity set to hard

Ran into this issue as well and debugged it.

The problem is that the DB stores the href value in the data_filename field.
When the search result is rendered, the SearchFormController::checkExistance() method is run, which checks with !is_file($row['data_filename']).
This is wrong!

The indexer uses Indexer::createLocalPath and 5 submethods to identify the correct local path for the given href value.
This functionality is needed for the check in SearchFormController::checkExistance() as well.

Solutions:
  • copy createLocalPath() code to SearchFormController
  • make it public in the Indexer
  • create new class to hold these path manipulation methods
  • create a new db field to hold the real local path
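To illustrate why a plain is_file() on the stored href fails and what a shared path helper might look like, here is a minimal sketch. It is not the code of Indexer::createLocalPath(). The function names resolveLocalPath() and fileExistsForHref(), their parameters, and the simplified prefix handling are all assumptions made for this example.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: map a stored href back to a local file path.
// $absRefPrefix and $siteRoot are assumed to be passed in by the caller.
function resolveLocalPath(string $href, string $absRefPrefix, string $siteRoot): ?string
{
    // Strip the configured prefix ("/" or a full domain) if the href starts with it.
    if ($absRefPrefix !== '' && strpos($href, $absRefPrefix) === 0) {
        $href = (string) substr($href, strlen($absRefPrefix));
    }
    // Anything that still carries a scheme is treated as a truly external URL.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $href) === 1) {
        return null;
    }
    $relative = ltrim($href, '/');
    // Reject empty paths and directory traversal.
    if ($relative === '' || in_array('..', explode('/', $relative), true)) {
        return null;
    }
    return rtrim($siteRoot, '/') . '/' . $relative;
}

// Hypothetical replacement for a bare is_file($row['data_filename']) check.
function fileExistsForHref(string $href, string $absRefPrefix, string $siteRoot): bool
{
    $localPath = resolveLocalPath($href, $absRefPrefix, $siteRoot);
    return $localPath !== null && is_file($localPath);
}
```

With absRefPrefix = /, is_file('/fileadmin/foo.pdf') looks in the filesystem root rather than the site root, so the check fails even though the file exists. Resolving the href against the site root first avoids that.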

#11 Updated by Mathias Schreiber almost 6 years ago

  • Target version changed from next-patchlevel to 7.1 (Cleanup)

#12 Updated by Tymoteusz Motylewski over 5 years ago

  • Parent task set to #65815

#13 Updated by Benni Mack over 5 years ago

  • Target version changed from 7.1 (Cleanup) to 7.4 (Backend)

#14 Updated by Susanne Moog about 5 years ago

  • Target version changed from 7.4 (Backend) to 7.5

#15 Updated by Benni Mack about 5 years ago

  • Target version changed from 7.5 to 8 LTS

#16 Updated by Tymoteusz Motylewski almost 5 years ago

FYI, we're not checking for file existence on rendering (so in searchController) any more.

#17 Updated by Markus Klein almost 5 years ago

This is still an issue in the upcoming 6.2.16.

#18 Updated by Chris W over 4 years ago

TYPO3 6.2.21
As long as I am logged in as a frontend user (feuser), I can find every PDF that is indexed on public pages. PDF files indexed on secured pages can't be found... Disabling absRefPrefix fixes this for me.

#19 Updated by Jan Kiesewetter about 4 years ago

As the whole function was removed with #44381, the problem only occurs in 6.2.
For 6.2 I created a small extension that XCLASSes the SearchFormController and simply returns true, like TYPO3 7.6 and 8, which no longer perform this check.
https://bitbucket.org/t3easy_de/indexed_search_absrefprefix

This issue can be closed.

#20 Updated by Riccardo De Contardi about 4 years ago

  • Status changed from Accepted to Closed

Thank you for your answer and findings, I'll close it.

Regards.

If you think that this is the wrong decision, please reopen it or open a new issue and add a reference to this one. Thank you.
