Bug #22229
Closed
External URL only indexes first page
Description
When indexing an external URL/website, only the first page is indexed; none of the subpages of the external website are.
The problem is related to relative links versus absolute links (with scheme) in hyperlinks. Today's websites often use relative links:
<a href="some/relative/page.html">....
instead of
<a href="http://www.domain.tld/subsite/some/relative/page.html">
The problem is that the method indexExtUrl() in EXT:indexed_search/class.crawler.php cannot properly convert a relative link to an absolute one when dealing with external websites. It only supports this conversion for the TYPO3 website itself. In such cases, the relative URL above is converted to
http://typo3-website.tld/some/relative/page.html
This page 1) does not exist and 2) is not within the authorized target website, so it cannot and would not be indexed even if the document existed.
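To illustrate point 2, here is a minimal Python sketch of a scope check that a crawler might apply before following a link. It is not the code of indexed_search; the function name is_within_target and the rule "same scheme and host, path below the configured directory" are assumptions made for illustration only.

```python
from urllib.parse import urlsplit


def is_within_target(candidate_url, target_url):
    """Return True if candidate_url lies below the directory of target_url.

    Hypothetical rule: same scheme, same host, and the candidate path
    starts with the directory part of the configured target URL.
    """
    candidate = urlsplit(candidate_url)
    target = urlsplit(target_url)
    if (candidate.scheme, candidate.netloc.lower()) != (target.scheme, target.netloc.lower()):
        return False
    # Directory of the target: everything up to and including the last slash.
    target_dir = target.path.rsplit("/", 1)[0] + "/"
    return (candidate.path or "/").startswith(target_dir)


target = "http://www.domain.tld/subsite/"
# The wrongly converted URL points at another host and is rejected.
print(is_within_target("http://typo3-website.tld/some/relative/page.html", target))
# The correctly converted URL stays inside the target website.
print(is_within_target("http://www.domain.tld/subsite/some/relative/page.html", target))
```

Under these assumptions the first call prints False and the second prints True, which matches the symptom: a wrongly converted link is discarded before it can be indexed.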
(issue imported from #M13732)
Updated by Xavier Perseguers over 14 years ago
After investigating a bit more: for a relative URL, the base URL is not computed from the TYPO3 website itself (that was just a guess, caused by my own setup when testing this bug). However, it is not computed correctly either:
When the crawler is told to index the URL http://www.domain.tld/page1/subpage/foo.html (with or without foo.html in the configuration),
any relative URL is prefixed with http://www.domain.tld/ instead of http://www.domain.tld/page1/subpage/. This is wrong and causes the symptom described.
Moreover, according to http://www.w3.org/TR/html401/struct/links.html#h-12.4, converting a relative URL to a full URL should first use a "base href" tag if one is present, and only then fall back to the implicit base given by the enclosing path of the document.
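The resolution order described above can be sketched in Python as follows. This is an illustration and not the patch attached to this issue; the names LinkCollector and resolve_links are hypothetical, and the sketch relies on urljoin from the standard library for the actual path arithmetic.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the first <base href> and every <a href> of a document."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if not href:
            return
        if tag == "base" and self.base_href is None:
            self.base_href = href
        elif tag == "a":
            self.hrefs.append(href)


def resolve_links(page_url, html):
    """Return absolute URLs for all links found in html.

    An explicit <base href> takes precedence; otherwise the URL of the
    page itself is the base, so links resolve against its enclosing path.
    """
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    if collector.base_href:
        base = urljoin(page_url, collector.base_href)
    else:
        base = page_url
    return [urljoin(base, href) for href in collector.hrefs]


page = "http://www.domain.tld/page1/subpage/foo.html"
print(resolve_links(page, '<a href="bar.html">bar</a>'))
print(resolve_links(page, '<base href="http://www.domain.tld/other/"><a href="bar.html">bar</a>'))
```

Without a base tag the link resolves to http://www.domain.tld/page1/subpage/bar.html; with the base tag it resolves to http://www.domain.tld/other/bar.html. Note that urljoin treats the last path segment as a file name, so a directory URL given without foo.html needs its trailing slash to be kept as the enclosing path.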
Updated by Xavier Perseguers over 14 years ago
Sorry :-/ After testing with other websites, I found that my patch did not handle "relative" links that are in fact absolute:
<a href="/somepage.html">
That is, links starting with a slash '/' and therefore relative to the hostname. v2 takes care of this, meaning all links are now handled properly:
Full link:
<a href="http://www.domain.tld/subsite/some-page.html">
Relative link (takes base href or computed base href into account):
<a href="some-other-page.html">
Absolute link:
<a href="/subsite/some-other-page.html">
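As a quick check of the three cases listed above, the following Python sketch classifies a link and resolves it against the page it was found on. The helper name resolve and the page URL http://www.domain.tld/subsite/index.html are assumptions chosen for the example; the committed patch is PHP and may be structured differently.

```python
from urllib.parse import urljoin, urlsplit


def resolve(page_url, href):
    """Return (kind, absolute_url) for a link found on page_url."""
    parts = urlsplit(href)
    if parts.scheme or parts.netloc:
        kind = "full"        # carries its own scheme and/or host
    elif href.startswith("/"):
        kind = "absolute"    # relative to the hostname only
    else:
        kind = "relative"    # relative to the enclosing path
    return kind, urljoin(page_url, href)


page_url = "http://www.domain.tld/subsite/index.html"
for href in (
    "http://www.domain.tld/subsite/some-page.html",
    "some-other-page.html",
    "/subsite/some-other-page.html",
):
    print(href, "->", resolve(page_url, href))
```

With this page URL, the relative and the absolute link both end up at http://www.domain.tld/subsite/some-other-page.html, while the full link is returned unchanged.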
Updated by Xavier Perseguers over 14 years ago
Committed to:
- trunk (rev. 7356)
- 4-3 (rev. 7357)