Feature #16534: Add possibility to start indexing an external site at a specific page - TYPO3 Core - TYPO3 Forge

Feature #16534

closed

Add possibility to start indexing an external site at a specific page

Added by Mario Rimann about 18 years ago. Updated about 11 years ago.

Status: Closed
Priority: Should have
Category: Indexed Search
Start date: 2006-09-06
% Done: 100%
Complexity: easy

Description

Current behaviour is that the starting URL is used for two purposes:
- determine where crawling starts
- check if the indexed pages are "inside" this URL

If you need to start the crawler at a specific page that is not a directory, an extra setting is needed.

Old description:
I noticed some strange behaviour when working with the indexed_search and crawler extensions: some websites (like http://typo3.org/) get indexed including their subpages.

On other domains, however, only the first page is indexed and the links on that page are not followed (even if I configure it to dig 3 levels deep).

All the pages that aren't working are valid HTML or valid XHTML. I tried several different scenarios (such as absolute and relative paths as links), without success.

TYPO3 4.0
Indexed search 2.9.0
Crawler 1.1.0
(issue imported from #M4167)

Files

indexed_search_4167_v1.diff (1.5 KB), Administrator Admin, 2006-09-11 20:27
indexed_search_4167_v2.diff (1.4 KB), Administrator Admin, 2009-08-28 10:57

History

Updated by Mario Rimann about 18 years ago

Some additional information:
I also tested this on another server running Debian Linux, Apache 1.3.something and TYPO3 v4.0.1. This test ended with the same result.

I tried to crawl some domains on the same server: some went well, others showed the issue described in this bug report (the first page is fetched, but the links on it are not followed).

I also noticed that it doesn't depend on whether HTML or XHTML is used. It also seems to be charset independent.

Updated by Mario Rimann about 18 years ago

More information:

I tracked this issue down to the function checkUrl() in the crawler class of indexed search.

If you start with a URL like "http://www.domain.tld/" (the root page):
- Links to inside of this domain will work
- Links to outside of that domain don't work

If you start with a URL like "http://www.domain.tld/fileadmin/linklist.htm" (a file / subfolder):
- No links will work, neither absolute nor relative. Each link is compared against the starting URL, so the check always fails and the URL is never added to the queue.
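
To illustrate the two cases above, here is a minimal sketch in Python. It assumes the check is a plain prefix comparison against the starting URL; the actual checkUrl() is PHP code and its exact logic is not quoted in this report, so the function name `is_inside_base` and its behaviour are assumptions made for illustration only.

```python
def is_inside_base(url: str, base_url: str) -> bool:
    """Hypothetical stand-in for the check: accept a link only if it
    starts with the starting URL (an assumed prefix comparison)."""
    return url.startswith(base_url)


root_base = "http://www.domain.tld/"
file_base = "http://www.domain.tld/fileadmin/linklist.htm"
link = "http://www.domain.tld/about.html"

# Starting at the root page: a link inside the domain passes the check.
print(is_inside_base(link, root_base))

# Starting at a file: the same link no longer begins with the starting
# URL, so it is rejected and would never be queued.
print(is_inside_base(link, file_base))
```

Under that assumption, the first call accepts the link and the second rejects it, which matches the behaviour observed with a file as the starting URL.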

I think this should be enhanced by a configuration option to
a) ignore those checks and index "blind"
or
b) have a "whitelist" of domains (in addition to the base URL) and always allow indexing of URLs that start with a whitelisted entry.
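
The two options could look roughly like the following Python sketch. Everything in it is hypothetical: the function `is_allowed`, the `whitelist` parameter and the `check_scope` switch are names invented for this illustration and do not exist in indexed_search or the crawler extension.

```python
def is_allowed(url, base_url, whitelist=(), check_scope=True):
    """Decide whether a discovered link may be queued for indexing.

    check_scope=False models option a): skip the check and index "blind".
    whitelist models option b): extra URL prefixes accepted next to base_url.
    All names here are assumptions made for this sketch.
    """
    if not check_scope:
        return True
    # str.startswith accepts a tuple and matches any of its prefixes.
    return url.startswith((base_url, *whitelist))


base = "http://www.domain.tld/fileadmin/linklist.htm"
link = "http://www.domain.tld/about.html"

print(is_allowed(link, base))                                        # rejected
print(is_allowed(link, base, whitelist=("http://www.domain.tld/",)))  # accepted
print(is_allowed("http://other.tld/", base, check_scope=False))       # accepted
```

With a whitelist entry for the domain root, links from a file-based starting URL are accepted again, while links to other domains are still rejected unless the check is switched off entirely.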

I'd also appreciate it if a single file were supported as the base URL.

Updated by Mario Rimann about 18 years ago

I've attached an initial patch. This solves the problem if the baseURL (the starting URL for crawling/indexing) is pointing to a file instead of a domain-root.

This doesn't add any additional configuration options. Question to the CoreDevs: Which configuration options should be implemented? Any feedback is welcome!

Updated by Mario Rimann about 15 years ago

The second patch was adapted to the current trunk (rev. 5837)

Updated by Ferdinand Kuhl about 15 years ago

I read over the patch, and it looks very clean to me. It just allows the crawler to start with a file. All links inside the same domain as the file will be followed.

I tested it in a small test environment and it works for me.

Updated by Dmitry Dulepov about 14 years ago

I agree, the patch is good. We should get that into indexed search.

Updated by Mr. Jenkins almost 13 years ago

Status changed from Accepted to Under Review

Patch set 1 of change I2727a9a447754b88d2c279c24b32b5c3a2df26c0 has been pushed to the review server.
It is available at http://review.typo3.org/6990

Updated by Mr. Jenkins almost 13 years ago

Patch set 2 of change I2727a9a447754b88d2c279c24b32b5c3a2df26c0 has been pushed to the review server.
It is available at http://review.typo3.org/6990

Updated by Gerrit Code Review over 12 years ago

Patch set 3 for branch master has been pushed to the review server.
It is available at http://review.typo3.org/6990

#10

Updated by Gerrit Code Review over 12 years ago

Patch set 4 for branch master has been pushed to the review server.
It is available at http://review.typo3.org/6990

#11

Updated by Gerrit Code Review over 11 years ago

Patch set 5 for branch master has been pushed to the review server.
It is available at https://review.typo3.org/6990

#12

Updated by Gerrit Code Review over 11 years ago

Patch set 6 for branch master has been pushed to the review server.
It is available at https://review.typo3.org/6990

#13

Updated by Gerrit Code Review over 11 years ago

Patch set 7 for branch master has been pushed to the review server.
It is available at https://review.typo3.org/6990

#14

Updated by Gerrit Code Review over 11 years ago

Patch set 8 for branch master has been pushed to the review server.
It is available at https://review.typo3.org/6990

#15

Updated by Gerrit Code Review over 11 years ago

Patch set 9 for branch master has been pushed to the review server.
It is available at https://review.typo3.org/6990

#16

Updated by Anonymous over 11 years ago

Status changed from Under Review to Resolved
% Done changed from 0 to 100

Applied in changeset 819b5be0ac81004371fee2f0e6386cc32233a59b.

#17

Updated by Jigal van Hemert over 11 years ago

Tracker changed from Bug to Feature
Subject changed from Links on external pages don't get indexed to Add possibility to start indexing an external site at a specific page
Status changed from Resolved to New
Assignee deleted (~~Dmitry Dulepov~~)
Target version deleted (0)
Complexity set to easy
TYPO3 Version set to 4.0

#18

Updated by Michael Stucki over 11 years ago

Hey Jigal,

why is this changed to new again? Did you revert the patch? Please explain...

Greetings, Michael

#19

Updated by Mario Rimann over 11 years ago

Hi Michael

Jigal, Stefan Neufeind and I discussed this issue during the last week and came to the conclusion that my proposed patch would (in rare cases) break the existing functionality. We then discussed several ways of going forward:
a) revert my change + just leave it as it was so far (= wait until someone really requires it)
b) revert my change + come up with a new proposal (then marked as feature as it would need to extend the crawler/indexed_search extensions in a way that won't fit as a bugfix)
c) revert my change + modify the patch so it would go through as bugfix (would lead to a "known unstable"-solution, which would probably fix 99.5% of all cases)

We threw away c) as it would not be "clean" at all. After some discussion, we decided to go for a) and just revoke + wait. So Jigal reverted the change and updated this issue.

I'd even go for closing this issue as "won't fix" for the moment, so that the bug tracker gets cleaned up right away. If someone really needs this change, they should open a new report referring to this one, and we can start working on a proper solution.

Cheers,
Mario

#20

Updated by Michael Stucki over 11 years ago

Thanks Mario! Now I see this was reverted in 559eb0091a5cf093515ad43d1b6b7dc7575bf8aa

Thanks for summarizing what happened with this issue. It's amazing to see how much work can go into such a tiny change... :-)

#21

Updated by Michael Stucki about 11 years ago

Status changed from New to Closed

Closed at request of Mario.
