Task #89287: Make linkvalidator crawling polite - TYPO3 Core - TYPO3 Forge

Added by Sybille Peters almost 5 years ago. Updated over 2 years ago.

Status: New
Priority: Should have
Category: Linkvalidator
Start date: 2019-09-26
Tags: throttle, outgoing HTTP requests, resources, large-site

Description

Currently, linkvalidator does not follow common practices for being "nice" / "polite" when crawling other websites:

Site owners should be able to see what is crawling their site. It is standard practice to include a URL and contact information in the User-Agent or referrer, e.g.

"Mozilla/5.0 (compatible; MetaJobBot; http://www.metajob.de/crawler)"

"https://www.google.de/" "Mozilla/5.0 (iPhone; CPU iPhone OS 12_4_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.2 Mobile/15E148 Safari/604.1"

The crawler should respect robots.txt.
The crawler should wait between consecutive requests (Crawl-delay). A reasonable minimum is 1-5 seconds between requests to the same domain.
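
The three points above can be sketched roughly as follows. This is Python for illustration only (linkvalidator itself is PHP, and none of this is its actual API). The `USER_AGENT` string, the `HostPolicy` class and the 1-second default are assumptions made up for this example; only `urllib.robotparser` is a real standard-library module.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

# Hypothetical identity: product name plus a URL where site owners find contact info.
USER_AGENT = "ExampleLinkChecker/1.0 (+https://example.org/linkchecker-info)"
DEFAULT_DELAY = 1.0  # seconds; lower end of the 1-5 s range suggested above


class HostPolicy:
    """Politeness rules for a single host, built from that host's robots.txt."""

    def __init__(self, robots_txt: str, agent: str = USER_AGENT):
        self.agent = agent
        self.robots = RobotFileParser()
        # parse() consumes the robots.txt lines directly, so nothing is fetched here.
        self.robots.parse(robots_txt.splitlines())
        self.last_request: Optional[float] = None

    def allowed(self, url: str) -> bool:
        """True if robots.txt permits this agent to request the URL."""
        return self.robots.can_fetch(self.agent, url)

    def delay(self) -> float:
        """Crawl-delay declared by the host, else the default."""
        declared = self.robots.crawl_delay(self.agent)
        return float(declared) if declared is not None else DEFAULT_DELAY

    def wait_time(self, now: float) -> float:
        """Seconds still to wait before the next request to this host."""
        if self.last_request is None:
            return 0.0
        return max(0.0, self.last_request + self.delay() - now)

    def record(self, now: float) -> None:
        """Remember when the most recent request to this host was sent."""
        self.last_request = now
```

A caller would keep one `HostPolicy` per external host, skip URLs for which `allowed()` is false, and sleep for `wait_time()` before each request. Note that Crawl-delay is a non-standard robots.txt directive, so many hosts will not declare it and the default applies.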

Linkvalidator is not a full web crawler that downloads every page of a site, and by default it uses HEAD rather than GET, which does not download the page body. So the impact is less dramatic than with a real crawler.

Still, I think we should follow these recommendations as well.
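
For illustration, a HEAD check that also identifies itself could look like the sketch below. It is Python, not linkvalidator code; `build_check_request` and the `USER_AGENT` value are hypothetical names chosen for this example, and the request is only constructed, not sent.

```python
import urllib.request

# Hypothetical identity string with a URL where site owners can find contact details.
USER_AGENT = "ExampleLinkChecker/1.0 (+https://example.org/linkchecker-info)"


def build_check_request(url: str, method: str = "HEAD") -> urllib.request.Request:
    """Build a link-check request; HEAD asks for headers only, not the body."""
    return urllib.request.Request(
        url,
        method=method,
        headers={"User-Agent": USER_AGENT},
    )
```

Passing the result to `urllib.request.urlopen()` would send it; the status code alone is enough to decide whether the link is broken.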

Possible solutions

To reduce load not only on other sites but also on the TYPO3 site itself, the crawling process could be changed:

When crawling external URLs, do not check them right away but defer them to a separate task that handles external URLs with reasonable delays in between.
Do not keep crawling everything over and over; on regular runs, only recrawl content that was recently modified.
Optional: delegate the checking of external URLs, e.g. to a URL checking service.

This may make it necessary to make more things asynchronous and confine the link checking only to the scheduler.
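
A rough sketch of the first two ideas, again in Python purely for illustration. `DeferredLinkQueue`, `needs_recheck` and the timestamps are hypothetical; the real thing would presumably be a scheduler task with a database table behind it.

```python
from collections import defaultdict
from typing import List, Optional, Tuple
from urllib.parse import urlsplit


class DeferredLinkQueue:
    """Collects external URLs during parsing and plans their checks for later."""

    def __init__(self, min_delay: float = 1.0):
        self.min_delay = min_delay          # seconds between requests to one host
        self.pending = defaultdict(list)    # host -> URLs waiting to be checked
        self.seen = set()                   # avoids checking the same URL twice

    def add(self, url: str) -> None:
        if url not in self.seen:
            self.seen.add(url)
            self.pending[urlsplit(url).netloc].append(url)

    def schedule(self, start: float = 0.0) -> List[Tuple[float, str]]:
        """Return (time, url) pairs; URLs on the same host are min_delay apart."""
        plan = []
        for urls in self.pending.values():
            for position, url in enumerate(urls):
                plan.append((start + position * self.min_delay, url))
        return sorted(plan)


def needs_recheck(content_modified: float, last_checked: Optional[float]) -> bool:
    """Recheck only content that was never checked or changed since the last check."""
    return last_checked is None or content_modified > last_checked
```

Different hosts can be interleaved freely, so the delay slows down requests per host without stretching the total run time much.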

Resources

About politeness of web crawlers:

https://blog.scrapinghub.com/2016/08/25/how-to-crawl-the-web-politely-with-scrapy

URL checking site:

https://httpstatus.io/

Related issues 4 (3 open — 1 closed)


Updated by Sybille Peters over 2 years ago

Updated by Rémy DANIEL 11 months ago

Updated by Sybille Peters 11 months ago