Bug #86918
Closed
Epic #85006: Reduce falsely reported broken links
Linkvalidator stops working on specific links (external URLs)
100%
Description
- Some sites require specific HTTP headers, which are normally set in browsers.
E.g. an external link validation for the URL "https://www.dpdhl.com/en.html" never ends and finally breaks the scheduler task.
After some debugging, we arrived at the following default header set in \TYPO3\CMS\Linkvalidator\Linktype\ExternalLinktype::checkLink():

    $options = [
        'cookies' => GeneralUtility::makeInstance(CookieJar::class),
        'allow_redirects' => ['strict' => true],
        'headers' => [
            'User-Agent' => 'TYPO3 linkvalidator',
            'Accept' => '*/*',
            'Accept-Language' => '*',
            'Accept-Encoding' => '*',
            'Connection' => 'keep-alive',
        ],
    ];

- Also, some sites don't allow HEAD requests, and in those cases the fallback GET request defined in the method mentioned above is never used.
So, it would be great if you could decide, for example by a configuration, if you just want to use a simple GET request.
- Another point: sending the HTTP header "Range: bytes = 0 - 4048" leads to strange responses for some links. It would be better if this header were configurable, so that it is not always sent.
Updated by Sybille Peters about 6 years ago
- Description updated (diff)
Thank you for your report. I reformatted your post a little. Hope that's ok.
Number 2 is a little odd, because according to the code it should fall back to GET if HEAD fails, see 8.7 code
Does an exception get thrown or do you have a sample URL? Do you have a sample URL for the Range problem as well? Thanks.
In general:
My wish would be that linkvalidator works well out of the box, so that people who are not experts don't have to dive deep into the configuration. Configuration options can still be available for experts, though. I am afraid you won't find a configuration which will "just work" for all sites.
If you do have suggestions, what would work, that would be very helpful.
Another idea would be to make it possible (e.g. by regex) to have linkvalidator ignore links. Getting the link check to work is the preferred solution, but this way you could at least prevent links from cluttering up the reports if new issues with external link checking crop up. (This would be a feature, however, I believe, and due to the TYPO3 roadmap it can't make it into TYPO3 9 or 8, see https://forge.typo3.org/issues/85127)
Next steps:
If you have a good idea, how this issue can be solved, you may want to consider submitting a patch for your solution, see https://docs.typo3.org/typo3cms/ContributionWorkflowGuide/
I would definitely like to look into it as well, but it will take a couple of days to find some time.
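The request order discussed above (HEAD first, GET only as a fallback) can be sketched roughly as follows. This is a minimal illustration, not the linkvalidator implementation: `checkWithFallback()` and the `$send` callable are hypothetical names, the transport is injected so the sketch runs without an HTTP client, and treating only 2xx status codes as success is a simplifying assumption.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of TYPO3): checks a URL with HEAD first and
 * retries with GET only if the HEAD request did not succeed.
 *
 * @param string   $url     The URL to check
 * @param callable $send    Assumed transport: fn(string $method, string $url, array $headers): int (HTTP status)
 * @param array    $headers Request headers sent with both attempts
 */
function checkWithFallback(string $url, callable $send, array $headers = []): bool
{
    // Simplification: only 2xx counts as success (206 is the usual answer to a Range request).
    $isSuccess = static function (int $status): bool {
        return $status >= 200 && $status < 300;
    };

    if ($isSuccess($send('HEAD', $url, $headers))) {
        return true;
    }

    // HEAD was rejected or failed: retry with GET and ask only for the first bytes.
    // Standard Range syntax has no spaces around "=" and "-".
    $headers['Range'] = 'bytes=0-4048';

    return $isSuccess($send('GET', $url, $headers));
}

// Usage with a fake transport that rejects HEAD (405) but answers GET.
$fakeSend = static function (string $method, string $url, array $headers): int {
    return $method === 'HEAD' ? 405 : 206;
};
var_dump(checkWithFallback('https://example.org/', $fakeSend));
```

Note that this sketch does not cover the case from the report where the HEAD request never returns; a real client would additionally need a timeout for the fallback to be reached.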
Updated by Stefan Berger about 6 years ago
We solve our problems by extending TYPO3\CMS\Linkvalidator\Linktype\ExternalLinktype:

    /**
     * Checks a given URL for validity
     *
     * @param string $origUrl The URL to check
     * @param array $softRefEntry The soft reference entry which builds the context of that URL
     * @param \TYPO3\CMS\Linkvalidator\LinkAnalyzer $reference Parent instance
     * @return bool TRUE on success or FALSE on error
     * @throws \InvalidArgumentException
     */
    public function checkLink($origUrl, $softRefEntry, $reference)
    {
        // default to "invalid" so the result is defined even if the preprocessed URL is empty
        $isValidUrl = false;

        // use URL from cache, if available
        if (isset($this->urlReports[$origUrl])) {
            $this->setErrorParams($this->urlErrorParams[$origUrl]);
            return $this->urlReports[$origUrl];
        }

        $options = [
            'cookies' => GeneralUtility::makeInstance(CookieJar::class),
            'allow_redirects' => ['strict' => true],
            // -> always set these default headers, because some sites require them
            'headers' => [
                'User-Agent' => 'TYPO3 linkvalidator',
                'Accept' => '*/*',
                'Accept-Language' => '*',
                'Accept-Encoding' => '*',
                'Connection' => 'keep-alive',
            ],
        ];

        $url = $this->preprocessUrl($origUrl);
        if (!empty($url)) {
            // -> never use HEAD, because some sites hang when no GET is used.
            //    Besides, HEAD is not used in a normal FE case.
            // $isValidUrl = $this->requestUrl($url, 'HEAD', $options);
            // if (!$isValidUrl) {
            // HEAD was not allowed or threw an error, now trying GET
            // -> don't request a Range: many responses lead to strange returns, e.g. 401 responses
            //$options['headers']['Range'] = 'bytes = 0 - 4048';
            $isValidUrl = $this->requestUrl($url, 'GET', $options);
            // }
        }

        $this->urlReports[$origUrl] = $isValidUrl;
        $this->urlErrorParams[$origUrl] = $this->errorParams;
        return $isValidUrl;
    }
Updated by Sybille Peters about 5 years ago
- Related to Epic #85006: Reduce falsely reported broken links added
Updated by Sybille Peters about 5 years ago
- Subject changed from Linkvalidator stops working on specific links to Linkvalidator stops working on specific links (external URLs)
Updated by Gerrit Code Review about 5 years ago
- Status changed from New to Under Review
Patch set 1 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61801
Updated by Sybille Peters about 5 years ago
@Stefan Berger
I created a patch: https://review.typo3.org/c/Packages/TYPO3.CMS/+/61801
It would be helpful, if you could review it.
If you want to recreate your settings, you can change page TSconfig:

    mod.linkvalidator {
        linktypes_config {
            external {
                # some sites require these headers
                headers {
                    Connection = keep-alive
                    Accept = */*
                    Accept-Language = *
                    Accept-Encoding = *
                }
                # use HEAD or GET
                # - HEAD will always fallback to GET if HEAD fails
                # - some sites cause problems with HEAD, in that case use GET
                method = GET
                # only for GET method: this limits the size of the document returned but may fail on some sites
                range =
            }
        }
    }
We should probably also add this to the documentation.
Also, it would be helpful if you could supply URLs that do not work. https://www.dpdhl.com/en.html currently causes no problem.
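To illustrate how the three settings in the page TSconfig above (headers, method, range) could translate into request parameters, here is a small sketch. It is not taken from the patch: `buildRequestPlan()` is a hypothetical helper, it expects the `external` settings as an already normalized plain PHP array (without the trailing-dot keys TYPO3 uses internally), and the HEAD default for a missing `method` is an assumption.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of TYPO3): derives the request method and the
 * headers for the first request and for a GET request from the "external" settings.
 *
 * @param array $config e.g. ['method' => 'GET', 'headers' => [...], 'range' => '0-4048']
 * @return array{method: string, headers: array, getHeaders: array}
 */
function buildRequestPlan(array $config): array
{
    // Assumption: anything other than GET means "HEAD first, GET as fallback".
    $method = strtoupper(trim((string)($config['method'] ?? 'HEAD')));
    if ($method !== 'GET') {
        $method = 'HEAD';
    }

    // The User-Agent is the one shown in this issue; configured headers are added on top.
    $headers = ['User-Agent' => 'TYPO3 linkvalidator'];
    foreach (($config['headers'] ?? []) as $name => $value) {
        $headers[(string)$name] = (string)$value;
    }

    // The Range header only applies to GET; an empty "range" disables it.
    $getHeaders = $headers;
    $range = trim((string)($config['range'] ?? ''));
    if ($range !== '') {
        $getHeaders['Range'] = 'bytes=' . $range;
    }

    return ['method' => $method, 'headers' => $headers, 'getHeaders' => $getHeaders];
}

// Usage: the settings from the TSconfig above (GET only, no Range header).
$plan = buildRequestPlan([
    'method' => 'GET',
    'headers' => ['Connection' => 'keep-alive', 'Accept' => '*/*'],
    'range' => '',
]);
print_r($plan);
```

Keeping the Range header in a separate `getHeaders` list reflects the comment in the configuration that `range` is only meaningful for GET requests.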
Updated by Sybille Peters about 5 years ago
@Stefan Berger
Also, thanks for reporting this and supplying a working ExternalLinktype class.
Updated by Gerrit Code Review about 5 years ago
Patch set 2 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61801
Updated by Gerrit Code Review about 5 years ago
Patch set 3 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61801
Updated by Gerrit Code Review about 5 years ago
Patch set 1 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61811
Updated by Gerrit Code Review about 5 years ago
Patch set 4 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61801
Updated by Gerrit Code Review about 5 years ago
Patch set 2 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61811
Updated by Gerrit Code Review about 5 years ago
Patch set 3 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61811
Updated by Sybille Peters about 5 years ago
- Related to Feature #85127: linkvalidator: Add possibility to exclude specific external URLs / domains or patterns added
Updated by Sybille Peters about 5 years ago
Please be aware that using GET instead of HEAD, or removing the Range for GET, will result in more (network) traffic and put more load not only on your server but also on the remote server that is being crawled.
Right now, I don't see an easy way to get reliable link checking without increasing load / network traffic. I would really like to have reliable link checking but adding extra load for all when only a few URLs are affected seems wrong.
There is also an alternative proposal to make it possible to exclude specific URLs from link checking: https://forge.typo3.org/issues/85127 .
Updated by Gerrit Code Review about 5 years ago
Patch set 4 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61811
Updated by Gerrit Code Review about 5 years ago
Patch set 5 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61811
Updated by Gerrit Code Review about 5 years ago
Patch set 6 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61811
Updated by Gerrit Code Review about 5 years ago
Patch set 7 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61811
Updated by Gerrit Code Review about 5 years ago
Patch set 8 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61811
Updated by Gerrit Code Review about 5 years ago
Patch set 9 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61811
Updated by Gerrit Code Review about 5 years ago
Patch set 1 for branch 9.5 of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61829
Updated by Sybille Peters about 5 years ago
- Status changed from Under Review to Resolved
- % Done changed from 0 to 100
Applied in changeset 70045e1b3c3640198e935fa3fe173b385302d8b9.
Updated by Gerrit Code Review about 5 years ago
- Status changed from Resolved to Under Review
Patch set 5 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61801
Updated by Gerrit Code Review about 5 years ago
Patch set 6 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61801
Updated by Gerrit Code Review about 5 years ago
Patch set 7 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61801
Updated by Gerrit Code Review about 5 years ago
Patch set 8 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61801
Updated by Gerrit Code Review about 5 years ago
Patch set 9 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61801
Updated by Sybille Peters about 5 years ago
- Status changed from Under Review to Resolved
Applied in changeset 2507a32f4d23af2a54bcdb06012673908f30fe70.
Updated by Benni Mack almost 5 years ago
- Status changed from Resolved to Closed
Updated by Garvin Hicking 4 months ago
- Related to Task #89287: Make linkvalidator crawling polite added