Task #89287

open

Make linkvalidator crawling polite

Added by Sybille Peters almost 5 years ago. Updated 15 days ago.

Status: New
Priority: Should have
Assignee: -
Category: Linkvalidator
Target version: -
Start date: 2019-09-26
Due date:
% Done: 0%
Estimated time:
TYPO3 Version: 10
PHP Version:
Tags: throttle, outgoing HTTP requests, resources, large-site
Complexity:
Sprint Focus:

Description

Currently, linkvalidator does not follow common practices for being "nice" / "polite" when crawling other websites:

  • It should be possible to see what is crawling your site. It is usually standard to add a URL and contact information to the User-Agent or referrer, e.g.
"Mozilla/5.0 (compatible; MetaJobBot; http://www.metajob.de/crawler)" 

"https://www.google.de/" "Mozilla/5.0 (iPhone; CPU iPhone OS 12_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.2 Mobile/15E148 Safari/604.1" 
  • The crawler should respect robots.txt.
  • The crawler should wait between consecutive requests (Crawl-Delay). A good minimum value is e.g. 1-5 seconds between requests to the same domain.
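
To illustrate the first two points, here is a minimal sketch in Python (used only for illustration, this is not linkvalidator code). The bot name `ExampleLinkChecker`, its URL and the helper names are hypothetical placeholders. The robots.txt content is assumed to have been fetched elsewhere; the sketch only parses it and builds a HEAD request that identifies the crawler.

```python
from typing import Optional
from urllib.request import Request
from urllib.robotparser import RobotFileParser

# Hypothetical identification: product token plus a URL with contact information.
ROBOTS_TOKEN = "ExampleLinkChecker"
USER_AGENT = f"Mozilla/5.0 (compatible; {ROBOTS_TOKEN}; https://example.org/linkchecker)"


def load_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that was downloaded elsewhere."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def build_head_request(url: str, robots: RobotFileParser) -> Optional[Request]:
    """Return an identifying HEAD request, or None if robots.txt disallows the URL."""
    if not robots.can_fetch(ROBOTS_TOKEN, url):
        return None
    return Request(url, method="HEAD", headers={"User-Agent": USER_AGENT})


def crawl_delay(robots: RobotFileParser, default: float = 1.0) -> float:
    """Use the site's Crawl-delay if it declares one, else an assumed default."""
    delay = robots.crawl_delay(ROBOTS_TOKEN)
    return float(delay) if delay is not None else default
```

The returned request object is not sent here; how it is dispatched, and how long the default delay should be, would be up to the actual implementation.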

Linkvalidator is not a full web crawler that downloads every page of a site, and it currently uses HEAD by default (not GET), which does not download the page body. So the impact is less dramatic than with a full crawler.

But preferably, I think we should follow these recommendations as well.

Possible solutions

To reduce load not only on other sites but also on the TYPO3 site itself, the crawling process could be changed:

  • when crawling external URLs, do not check right away but defer to a separate task which will handle the crawling of external URLs (with reasonable delays in between)
  • do not recrawl everything over and over; on regular runs, only recrawl content which was recently modified
  • optional: delegate the checking of external URLs, e.g. use a URL checking service

This may make it necessary to make more things asynchronous and to confine link checking to the scheduler.
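
A rough sketch of the "defer to a separate task" idea, again in Python for illustration only. The class `ExternalLinkQueue`, its fields and the `recheck_after` default are assumptions, not existing linkvalidator APIs. Content processing only enqueues external URLs; a separate task later works through the queue in small batches and skips URLs that were checked recently.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, Set, Tuple


@dataclass
class ExternalLinkQueue:
    """Hypothetical queue that decouples collecting URLs from checking them."""
    recheck_after: float = 7 * 24 * 3600  # assumed: results stay fresh for a week
    pending: Deque[str] = field(default_factory=deque)
    queued: Set[str] = field(default_factory=set)
    last_checked: Dict[str, Tuple[float, bool]] = field(default_factory=dict)

    def enqueue(self, url: str, now: float) -> bool:
        """Queue a URL unless it is already pending or has a fresh result."""
        cached = self.last_checked.get(url)
        if url in self.queued:
            return False
        if cached is not None and now - cached[0] < self.recheck_after:
            return False
        self.pending.append(url)
        self.queued.add(url)
        return True

    def process(self, check: Callable[[str], bool], now: float, limit: int) -> int:
        """Called by the separate task: check at most `limit` queued URLs."""
        done = 0
        while self.pending and done < limit:
            url = self.pending.popleft()
            self.queued.discard(url)
            self.last_checked[url] = (now, check(url))
            done += 1
        return done
```

In a real setup the queue and results would have to be persisted (e.g. in a database table) so that they survive between scheduler runs; the in-memory containers here only keep the sketch short.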

Resources

About politeness of web crawlers:

URL checking site:


Related issues 5 (3 open, 2 closed)

Related to TYPO3 Core - Epic #93547: Collection of problems with large sites (Accepted, 2021-02-19)
Related to TYPO3 Core - Epic #85006: Reduce falsely reported broken links (New, 2018-02-11)
Related to TYPO3 Core - Bug #101670: Linkvalidator reports some external URLs as "false positives" (New, 2023-08-13)
Related to TYPO3 Core - Task #101671: Disable external linktypes by default in linkvalidator (Closed, Sybille Peters, 2023-08-13)
Related to TYPO3 Core - Bug #86918: Linkvalidator stops working on specific links (external URLs) (Closed, 2018-11-13)

Actions #1

Updated by Sybille Peters almost 3 years ago

  • Tracker changed from Bug to Task
Actions #2

Updated by Sybille Peters over 2 years ago

  • Tags set to throttle, outgoing HTTP requests, resources, large-site
Actions #3

Updated by Sybille Peters over 2 years ago

  • Related to Epic #93547: Collection of problems with large sites added
Actions #4

Updated by Sybille Peters over 2 years ago

The lack of throttling for outgoing HTTP requests, combined with caching the results of outgoing requests only within a single check cycle, is one of the main reasons I do not want to use Linkvalidator on my site. I find it problematic if a TYPO3 extension bombards external sites with requests.

In "brofix" (a fork of linkvalidator with some changes), I changed the external link checking behaviour by adding a minimum delay per domain: if the last request to a URL of this domain was less than X seconds ago, we wait. This delays the checking, but since I also use a link target cache and don't mind if the initial check takes several hours, I see this as acceptable and an improvement in any case.
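
The per-domain delay described above could look roughly like this. This is a Python sketch for illustration, not the actual brofix code; the class name `DomainThrottle`, the 2-second default and the injectable clock/sleep functions are assumptions.

```python
import time
from typing import Callable, Dict
from urllib.parse import urlsplit


class DomainThrottle:
    """Hypothetical helper enforcing a minimum delay between requests per host."""

    def __init__(
        self,
        min_delay: float = 2.0,  # assumed value for "X seconds"
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last_request: Dict[str, float] = {}

    def wait(self, url: str) -> float:
        """Sleep until `min_delay` has passed for this URL's host; return seconds waited."""
        host = urlsplit(url).hostname or ""
        waited = 0.0
        last = self._last_request.get(host)
        if last is not None:
            remaining = self.min_delay - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        # Record the time of the request that is about to be sent.
        self._last_request[host] = self._clock()
        return waited
```

Calling `throttle.wait(url)` directly before each external request is enough; requests to different domains do not delay each other.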

The WordPress plugin "Link checker" also has a throttle, see

Actions #5

Updated by Rémy DANIEL 12 months ago

  • Related to Epic #85006: Reduce falsely reported broken links added
Actions #6

Updated by Sybille Peters 12 months ago

  • Related to Bug #101670: Linkvalidator reports some external URLs as "false positives" added
Actions #7

Updated by Sybille Peters 12 months ago

  • Related to Task #101671: Disable external linktypes by default in linkvalidator added
Actions #8

Updated by Garvin Hicking 15 days ago

  • Related to Bug #86918: Linkvalidator stops working on specific links (external URLs) added
Actions #9

Updated by Garvin Hicking 15 days ago

(Some parts of this were addressed with https://review.typo3.org/c/Packages/TYPO3.CMS/+/61801)
