Add possibility to start indexing an external site at a specific page
Current behaviour is that the starting URL is used for two purposes:
- determine where crawling starts
- check if the indexed pages are "inside" this URL
If you need to start the crawler at a specific page that is not a directory, an extra setting is needed.
I noticed some strange behaviour when working with the indexed_search and the crawler extension: Some websites (like http://typo3.org/) are getting indexed including the subpages.
On other domains, however, just the first page is indexed and the links on that page are not followed (even if I configure it to dig 3 levels deep).
All the pages that aren't working are valid HTML or valid XHTML. I tried different scenarios (such as absolute/relative paths as links), without success.
Indexed search 2.9.0
(issue imported from #M4167)
[BUGFIX] Links on external pages don't get indexed
This is just about the comparison against the base URL and
enables the Crawler to start crawling at e.g. a file that contains
a manually generated list of links to follow. Before that change,
even links to targets on the same domain were rejected by
the checkUrl() method in case the base URL was pointing to some
file instead of "/". This was because the base URL was then not
part of the target URL.
After stripping off any path from the base URL for this comparison,
the crawler can now also be started from a file.
Releases: 6.2, 6.1, 6.0, 4.7, 4.5
Reviewed-by: Michael Stucki
Tested-by: Michael Stucki
Reviewed-by: Georg Ringer
Tested-by: Georg Ringer
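The comparison after the fix can be sketched as follows. This is a Python sketch with a hypothetical helper name (the actual extension is PHP and the method is checkUrl()): the path is stripped from the base URL before the prefix test, so a base URL pointing to a file no longer rejects links on the same domain.

```python
from urllib.parse import urlparse

def url_is_inside(base_url: str, target_url: str) -> bool:
    """Hypothetical sketch of the fixed checkUrl() comparison:
    strip any path from the base URL and compare only the
    scheme/host prefix against the target URL."""
    parsed = urlparse(base_url)
    base_prefix = f"{parsed.scheme}://{parsed.netloc}/"
    return target_url.startswith(base_prefix)

# Base URL points to a file; same-domain links are now accepted.
print(url_is_inside("http://www.domain.tld/fileadmin/linklist.htm",
                    "http://www.domain.tld/page1.html"))  # True
```

Links on other domains are still rejected, since only the path part of the base URL is ignored, not the host.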
#1 Updated by Mario Rimann over 12 years ago
Some additional information:
I also tested this on another server running Debian Linux, Apache 1.3.something and TYPO3 v4.0.1. This test ended with the same result.
I tried to crawl several domains on the same server - some went well, others showed the same issue as described in the bug report (the first page was fetched, but the links on it were not followed).
I also noticed that it doesn't depend on whether HTML or XHTML is used. It also seems to be charset independent.
#2 Updated by Mario Rimann over 12 years ago
I tracked this issue down to the function checkUrl() in the crawler class of indexed search.
If you start with a URL like "http://www.domain.tld/" (the root page):
- Links to inside of this domain will work
- Links to outside of that domain don't work
If you start with a URL like "http://www.domain.tld/fileadmin/linklist.htm" (a file / subfolder):
- No links will work, neither absolute nor relative. Since the full base URL (including the file path) is used in the comparison, the check always fails and the URL never gets added to the queue.
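To illustrate why the second case fails, here is a minimal sketch (in Python; the actual code is PHP) of the pre-fix behaviour, where the full base URL including its path is used as the required prefix:

```python
# Sketch of the pre-fix prefix check in checkUrl():
# the target URL must start with the complete base URL.
base = "http://www.domain.tld/fileadmin/linklist.htm"
target = "http://www.domain.tld/page1.html"

# The target never starts with a file-based base URL,
# so every link on the page is rejected.
print(target.startswith(base))  # False
```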
I think this should be enhanced by a configuration option to
a) ignore those checks and index "blind"
b) have a "whitelist" of domains (next to the baseURL) and always allow indexing for URLs that start with an entry of that whitelist.
I'd also appreciate it if single files were supported as base URL.
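Option b) could look roughly like this - a hypothetical sketch in Python (the whitelist parameter is invented for illustration and is not an actual setting of the extension):

```python
def url_allowed(target_url: str, base_url: str,
                whitelist: list[str]) -> bool:
    """Hypothetical sketch of option b): accept a URL if it starts
    with the base URL or with any whitelisted domain prefix."""
    return any(target_url.startswith(prefix)
               for prefix in [base_url, *whitelist])

whitelist = ["http://blog.domain.tld/"]
print(url_allowed("http://blog.domain.tld/post.html",
                  "http://www.domain.tld/", whitelist))  # True
```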
#3 Updated by Mario Rimann over 12 years ago
I've attached an initial patch. It solves the problem if the baseURL (the starting URL for crawling/indexing) points to a file instead of a domain root.
This doesn't add any additional configuration options. Question to the CoreDevs: Which configuration options should be implemented? Any feedback is welcome!
#17 Updated by Jigal van Hemert almost 6 years ago
- Tracker changed from Bug to Feature
- Subject changed from Links on external pages don't get indexed to Add possibility to start indexing an external site at a specific page
- Status changed from Resolved to New
- Assignee deleted
- Target version deleted
- Complexity set to easy
- TYPO3 Version set to 4.0
#19 Updated by Mario Rimann almost 6 years ago
Jigal, Stefan Neufeind and I discussed this issue during the last week and came to the conclusion that my proposed patch would (in rare cases) break existing functionality. We then discussed several ways forward:
a) revert my change + just leave it as it was so far (= wait until someone really requires this feature)
b) revert my change + come up with a new proposal (then marked as a feature, as it would need to extend the crawler/indexed_search extensions in a way that won't fit as a bugfix)
c) revert my change + modify the patch so it would go through as a bugfix (would lead to a "known unstable" solution, which would probably fix 99.5% of all cases)
We discarded c) as it would not be "clean" at all. After some discussion, we decided to go for a) and just revert + wait, and Jigal has since reverted the change + updated this issue.
I'd even go for closing this issue as "won't fix" for the moment, so that the bug tracker gets cleaned up right away. If someone really needs this change, they can open a new report referring to this one, and we can then work on a proper solution.