Bug #22229

closed

External URL only indexes first page

Added by Xavier Perseguers about 14 years ago. Updated over 5 years ago.

Status: Closed
Priority: Should have
Category: -
Target version: -
Start date: 2010-03-03
% Done: 0%

Description

When indexing an external URL/website, only the first page is indexed; none of the external website's subpages are.

The problem comes from relative versus absolute (with scheme) links in hyperlinks. Today's websites often use relative links:

<a href="some/relative/page.html">....

instead of

<a href="http://www.domain.tld/subsite/some/relative/page.html">

The method indexExtUrl() in EXT:indexed_search/class.crawler.php cannot properly convert a relative link to an absolute one when dealing with external websites. It only supports that conversion for the TYPO3 website itself. In such cases, the relative link above is converted to

http://typo3-website.tld/some/relative/page.html

This page 1) does not exist and 2) is not within the authorized target website, so it cannot and would not be indexed even if the document existed.

(issue imported from #M13732)


Files

13732.diff (2.08 KB), Administrator Admin, 2010-03-08 15:32
13732_v2.diff (2.34 KB), Administrator Admin, 2010-03-08 16:43

Related issues: 2 (0 open, 2 closed)

Related to TYPO3 Core - Bug #22296: IS cannot not index files if absRefPrefix is set and indexExternalURLs is not (Closed, Dmitry Dulepov, 2010-03-18)

Related to TYPO3 Core - Bug #20035: Crawler does not crawl though relative links in an external page (Closed, Jeff Segars, 2009-02-17)

Actions #1

Updated by Xavier Perseguers about 14 years ago

After investigating a bit more: for a relative URL, the base URL is not computed from the TYPO3 website itself (that was just an assumption based on my own setup when testing this bug). However, it is still not computed correctly:

When told to index the URL http://www.domain.tld/page1/subpage/foo.html (with or without foo.html in the configuration), any relative URL is prefixed with http://www.domain.tld/ instead of http://www.domain.tld/page1/subpage/. This is wrong and causes the symptom described.
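
To make the difference concrete, here is a minimal sketch in Python (not the PHP code from the patch). It uses the standard library's `urljoin` as a stand-in for correct resolution; `host_root` is a hypothetical helper that only mimics the faulty "prefix with the hostname" behaviour described above.

```python
from urllib.parse import urljoin, urlsplit


def host_root(url: str) -> str:
    """Hypothetical helper: keep only scheme://host/, as the faulty code effectively did."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"


configured_url = "http://www.domain.tld/page1/subpage/foo.html"
link = "some/relative/page.html"

# Faulty behaviour: the relative link is glued onto the host root.
wrong = urljoin(host_root(configured_url), link)

# Expected behaviour: the link is resolved against the enclosing path.
right = urljoin(configured_url, link)

# Same result when the configured URL omits foo.html but keeps the trailing slash.
right_without_file = urljoin("http://www.domain.tld/page1/subpage/", link)

print(wrong)   # http://www.domain.tld/some/relative/page.html
print(right)   # http://www.domain.tld/page1/subpage/some/relative/page.html
```

Note that the trailing slash matters: a configured URL of `http://www.domain.tld/page1/subpage` (no slash) would be treated as a file named `subpage` inside `/page1/`.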

Moreover, according to http://www.w3.org/TR/html401/struct/links.html#h-12.4, converting a relative URL to a full URL should first use a "base href" tag if present, and otherwise fall back to the implicit base given by the enclosing path of the document.
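
The lookup order can be sketched as follows. This is an illustrative Python example under stated assumptions, not the patch itself: `BaseHrefFinder` and `effective_base` are hypothetical names, and only the first `<base href>` in the document is considered.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseHrefFinder(HTMLParser):
    """Records the href of the first <base> tag encountered, if any."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "base" and self.base_href is None:
            href = dict(attrs).get("href")
            if href:
                self.base_href = href


def effective_base(page_url: str, html: str) -> str:
    """Return the base URL to resolve links against: <base href> first, page URL otherwise."""
    finder = BaseHrefFinder()
    finder.feed(html)
    if finder.base_href:
        # A base href may itself be relative, so resolve it against the page URL.
        return urljoin(page_url, finder.base_href)
    return page_url


page_url = "http://www.domain.tld/page1/subpage/foo.html"
with_base = '<html><head><base href="http://www.domain.tld/other/"></head></html>'
without_base = "<html><head></head></html>"

print(urljoin(effective_base(page_url, with_base), "x.html"))
print(urljoin(effective_base(page_url, without_base), "x.html"))
```

With the sample documents above, the first link resolves under `/other/` and the second under `/page1/subpage/`.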

Actions #2

Updated by Xavier Perseguers about 14 years ago

Sorry :-/ after testing with other websites, I found out that my patch did not handle "relative" links which are absolute:

<a href="/somepage.html">

That is, links starting with a slash '/' and therefore relative to the hostname. v2 takes care of this, meaning all links are now handled properly:

Full link:
<a href="http://www.domain.tld/subsite/some-page.html">

Relative link (takes base href or computed base href into account):
<a href="some-other-page.html">

Absolute link:
<a href="/subsite/some-other-page.html">
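
As a quick check of the three cases, the sketch below resolves each link form with Python's `urljoin`. It only illustrates the expected results and is not the code from 13732_v2.diff; the base URL `http://www.domain.tld/subsite/index.html` is an assumed example page.

```python
from urllib.parse import urljoin

# Assumed URL of the page that contains the links.
base = "http://www.domain.tld/subsite/index.html"

links = {
    "full": "http://www.domain.tld/subsite/some-page.html",
    "relative": "some-other-page.html",
    "absolute": "/subsite/some-other-page.html",
}

# Resolve every link form against the same base.
resolved = {kind: urljoin(base, href) for kind, href in links.items()}

for kind, url in resolved.items():
    print(f"{kind}: {url}")
```

The full link is returned unchanged, while the relative and absolute forms both end up at `http://www.domain.tld/subsite/some-other-page.html`, one via the enclosing path and the other via the hostname.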

Actions #3

Updated by Xavier Perseguers about 14 years ago

Committed to:

- trunk (rev. 7356)
- 4-3 (rev. 7357)

Actions #4

Updated by Benni Mack over 5 years ago

  • Status changed from Resolved to Closed