Bug #16534
Links on external pages don't get indexed
| Status: | Under Review | Start date: | 2006-09-06 |
|---|---|---|---|
| Priority: | Should have | Due date: | |
| Assignee: | Dmitry Dulepov | % Done: | 0% |
| Category: | indexed search | | |
| Target version: | - | | |
| TYPO3 Version: | | Complexity: | |
| PHP Version: | | | |
| Votes: | 0 | | |
Description
I noticed some strange behaviour when working with indexed_search and the crawler extension: some websites (like http://typo3.org/) get indexed including their subpages.
On other domains, only the first page is indexed and the links on that page are not followed (even if I configure the crawler to dig 3 levels deep).
All of the affected pages are valid HTML or valid XHTML. I tried several scenarios (such as absolute and relative paths in links) without success.
TYPO3 4.0
Indexed search 2.9.0
Crawler 1.1.0
(issue imported from #M4167)
History
Updated by Mario Rimann over 6 years ago
Some additional information:
I also tested this on another server running Debian Linux, Apache 1.3.something and TYPO3 v4.0.1. This test ended with the same result.
I tried to crawl several domains on the same server: some went well, others showed the same issue as described in the bug report (the first page was fetched, but the links on it were not followed).
I also noticed that it doesn't depend on whether HTML or XHTML is used. It also seems to be independent of the charset.
Updated by Mario Rimann over 6 years ago
More information:
I tracked this issue down to the function checkUrl() in the crawler class of indexed search.
If you start with a URL like "http://www.domain.tld/" (the root page):
- Links that point inside this domain work
- Links that point outside this domain don't work
If you start with a URL like "http://www.domain.tld/fileadmin/linklist.htm" (a file / subfolder):
- No links work at all, neither absolute nor relative. Each link is compared against the base URL, and since that URL is a file, the comparison never matches (the check fails and the URL is not added to the queue).
I think this should be enhanced by a configuration option to
a) ignore those checks and index "blind"
or
b) have a "whitelist" of domains (next to the BaseURL) and always allow indexing of URLs that start with a whitelisted domain.
I'd also appreciate it if a single file were supported as the base URL.
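To illustrate the behaviour described above, here is a minimal Python sketch. It is not the PHP code of checkUrl(); it only assumes that the check is a plain prefix comparison against the base URL. The names naive_check, relaxed_check and whitelist are hypothetical and exist only for this example.

```python
from urllib.parse import urljoin, urlsplit


def naive_check(base_url: str, link: str) -> bool:
    """Assumed current behaviour: accept a link only if it starts with the base URL."""
    absolute = urljoin(base_url, link)  # resolve relative links against the base
    return absolute.startswith(base_url)


def relaxed_check(base_url: str, link: str, whitelist: tuple = ()) -> bool:
    """Hypothetical variant: compare against the site root of the base URL,
    plus an optional whitelist of extra URL prefixes (option b above)."""
    absolute = urljoin(base_url, link)
    parts = urlsplit(base_url)
    site_root = f"{parts.scheme}://{parts.netloc}/"
    # str.startswith accepts a tuple of allowed prefixes
    return absolute.startswith((site_root, *whitelist))


file_base = "http://www.domain.tld/fileadmin/linklist.htm"

# With a file as base URL, even a sibling page fails the prefix comparison.
print(naive_check(file_base, "page2.htm"))
# Comparing against the site root lets links on the same domain through.
print(relaxed_check(file_base, "page2.htm"))
# Other domains are accepted only when whitelisted.
print(relaxed_check(file_base, "http://other.tld/x", whitelist=("http://other.tld/",)))
```

Under these assumptions the first call rejects the link and the other two accept it, which matches the two cases listed above: a root URL as prefix covers every page of its domain, while a file URL as prefix covers nothing but itself.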
Updated by Mario Rimann over 6 years ago
I've attached an initial patch. It solves the problem when the baseURL (the starting URL for crawling/indexing) points to a file instead of a domain root.
The patch doesn't add any configuration options. Question to the CoreDevs: Which configuration options should be implemented? Any feedback is welcome!
Updated by Mario Rimann over 3 years ago
The second patch was adapted to the current trunk (rev. 5837)
Updated by Ferdinand Kuhl over 3 years ago
I read over the patch, and it looks very clean to me. It just allows the crawler to start with a file. All links inside the same domain as the file will be followed.
I tested it in a small test environment and it works for me.
Updated by Dmitry Dulepov over 2 years ago
I agree, the patch is good. We should get that into indexed search.
Updated by Mr. Jenkins over 1 year ago
- Status changed from Accepted to Under Review
Patch set 1 of change I2727a9a447754b88d2c279c24b32b5c3a2df26c0 has been pushed to the review server.
It is available at http://review.typo3.org/6990
Updated by Mr. Jenkins over 1 year ago
Patch set 2 of change I2727a9a447754b88d2c279c24b32b5c3a2df26c0 has been pushed to the review server.
It is available at http://review.typo3.org/6990
Updated by Gerrit Code Review 11 months ago
Patch set 3 for branch master has been pushed to the review server.
It is available at http://review.typo3.org/6990
Updated by Gerrit Code Review 11 months ago
Patch set 4 for branch master has been pushed to the review server.
It is available at http://review.typo3.org/6990
Updated by Gerrit Code Review about 1 month ago
Patch set 5 for branch master has been pushed to the review server.
It is available at https://review.typo3.org/6990