Bug #86918

closed

Epic #85006: Reduce falsely reported broken links

Linkvalidator stops working on specific links (external URLs)

Added by Stefan Berger about 6 years ago. Updated almost 5 years ago.

Status: Closed
Priority: Should have
Assignee: -
Category: Linkvalidator
Target version: -
Start date: 2018-11-13
Due date:
% Done: 100%
Estimated time:
TYPO3 Version: 8
PHP Version:
Tags:
Complexity:
Is Regression:
Sprint Focus:

Description

  • Some sites require specific HTTP headers that browsers normally send.
    E.g. external link validation for the URL "https://www.dpdhl.com/en.html" never finishes and eventually breaks the scheduler task.
    After some debugging, we arrived at the following default headers, set in \TYPO3\CMS\Linkvalidator\Linktype\ExternalLinktype::checkLink():
    $options = [
        'cookies' => GeneralUtility::makeInstance(CookieJar::class),
        'allow_redirects' => ['strict' => true],
        'headers' => [
            'User-Agent'        => 'TYPO3 linkvalidator',
            'Accept'            => '*/*',
            'Accept-Language'   => '*',
            'Accept-Encoding'   => '*',
            'Connection'        => 'keep-alive',
        ],
    ];
    
  • Also, some sites don't allow HEAD requests, and in those cases the fallback GET request defined in the method mentioned above is never used.
    So it would be great if you could choose, for example via configuration, to use a plain GET request only.
  • Another point: sending the HTTP header "Range: bytes = 0 - 4048" leads to strange responses for some links. It would be better if this header were configurable, so that it is not always sent.
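
The three points above can be read as one configurable request strategy. The following is only a minimal sketch of that idea, not the linkvalidator implementation: `checkWithFallback()` and the `$request` callable are hypothetical names, and the callable stands in for the real HTTP client so the control flow can be followed without network access. The Range value is written in the standard `bytes=first-last` form.

```php
<?php
/**
 * Hypothetical helper: try the configured method first and fall back to GET
 * only when HEAD was requested and failed.
 *
 * @param callable $request fn(string $url, string $method, array $options): bool
 * @param string   $method  'HEAD' or 'GET' (assumed configuration value)
 * @param array    $headers default headers sent with every attempt
 * @param string   $range   e.g. '0-4048'; an empty string disables the Range header
 */
function checkWithFallback(callable $request, string $url, string $method, array $headers, string $range): bool
{
    $options = ['headers' => $headers];

    if ($method === 'HEAD') {
        if ($request($url, 'HEAD', $options)) {
            return true;
        }
        // HEAD was not allowed or failed: continue with GET below.
    }

    if ($range !== '') {
        // Limits the size of the response body; some servers react badly to it.
        $options['headers']['Range'] = 'bytes=' . $range;
    }

    return $request($url, 'GET', $options);
}

// Usage with a fake transport that rejects HEAD and records each attempt.
$log = [];
$fake = function (string $url, string $method, array $options) use (&$log): bool {
    $log[] = $method . (isset($options['headers']['Range']) ? ' +Range' : '');
    return $method === 'GET';
};

$ok = checkWithFallback($fake, 'https://example.org/', 'HEAD', ['Accept' => '*/*'], '0-4048');
// $ok is true; $log is ['HEAD', 'GET +Range']
```

Calling it with the method set to 'GET' and an empty range skips HEAD entirely and sends no Range header, which corresponds to the behaviour requested in the second and third points.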

Related issues 2 (2 open, 0 closed)

Related to TYPO3 Core - Feature #85127: linkvalidator: Add possibility to exclude specific external URLs / domains or patterns (New, 2018-05-31)
Related to TYPO3 Core - Task #89287: Make linkvalidator crawling polite (New, 2019-09-26)

Actions #1

Updated by Sybille Peters about 6 years ago

  • Description updated (diff)

Thank you for your report. I reformatted your post a little. Hope that's ok.

Number 2 is a little odd because, judging from the code, it should fall back to GET if HEAD fails (see the 8.7 code).

Does an exception get thrown or do you have a sample URL? Do you have a sample URL for the Range problem as well? Thanks.

In general:

My wish would be that linkvalidator works well out of the box, so that people who are not experts don't have to dive deep into the configuration. Configuration options can still be available for experts, though. I am afraid you won't find a configuration that will "just work" for all sites.

If you have suggestions for what would work, that would be very helpful.

Another idea would be to let linkvalidator ignore certain links (e.g. by regex). Getting the link check to work is the preferred solution, but this way you could at least prevent links from cluttering up the reports if new issues with external link checking crop up. (I believe this would be a feature, however, and due to the TYPO3 roadmap it can't make it into TYPO3 9 or 8, see https://forge.typo3.org/issues/85127)
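
To make the regex idea concrete, here is a small sketch of how such an exclusion check might look. It is only an illustration: `isExcludedUrl()` and the example patterns are hypothetical and not part of linkvalidator.

```php
<?php
/**
 * Hypothetical helper: returns true if $url matches any exclusion pattern.
 * Patterns are complete PCRE expressions including delimiters.
 */
function isExcludedUrl(string $url, array $patterns): bool
{
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $url) === 1) {
            return true;
        }
    }
    return false;
}

// Example patterns (assumptions for this sketch).
$patterns = [
    '#^https?://(www\.)?example\.com/#i', // everything on one domain
    '#\.pdf(\?.*)?$#i',                   // any URL ending in .pdf
];

$skip = isExcludedUrl('https://www.example.com/en.html', $patterns); // true
$check = isExcludedUrl('https://example.org/', $patterns);           // false
```

URLs for which the helper returns true would simply be skipped before any request is sent, so they would not show up in the report.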

Next steps:

If you have a good idea how this issue can be solved, you may want to consider submitting a patch for your solution, see https://docs.typo3.org/typo3cms/ContributionWorkflowGuide/

I would definitely like to look into it as well, but it will take me a couple of days to find the time.

Actions #2

Updated by Stefan Berger about 6 years ago

We solved our problems by extending TYPO3\CMS\Linkvalidator\Linktype\ExternalLinktype:

    /**
     * Checks a given URL for validity
     *
     * @param string $origUrl The URL to check
     * @param array $softRefEntry The soft reference entry which builds the context of that URL
     * @param \TYPO3\CMS\Linkvalidator\LinkAnalyzer $reference Parent instance
     * @return bool TRUE on success or FALSE on error
     * @throws \InvalidArgumentException
     */
    public function checkLink($origUrl, $softRefEntry, $reference)
    {
        // use URL from cache, if available
        if (isset($this->urlReports[$origUrl])) {
            $this->setErrorParams($this->urlErrorParams[$origUrl]);
            return $this->urlReports[$origUrl];
        }

        $options = [
            'cookies' => GeneralUtility::makeInstance(CookieJar::class),
            'allow_redirects' => ['strict' => true],
            // -> always set these default headers, because some sites require them
            'headers' => [
                'User-Agent'        => 'TYPO3 linkvalidator',
                'Accept'            => '*/*',
                'Accept-Language'   => '*',
                'Accept-Encoding'   => '*',
                'Connection'        => 'keep-alive',
            ],
        ];
        $url = $this->preprocessUrl($origUrl);

        // treat an empty URL as invalid instead of leaving the result undefined
        $isValidUrl = false;
        if (!empty($url)) {
            // -> never use HEAD, because some sites hang when no GET is used. Besides, HEAD is not used in a normal FE case.
            // $isValidUrl = $this->requestUrl($url, 'HEAD', $options);
            // if (!$isValidUrl) {
                // HEAD was not allowed or threw an error, now trying GET
                // -> don't limit the response with a Range header: it leads to strange results in many cases, e.g. 401 responses
                //$options['headers']['Range'] = 'bytes = 0 - 4048';
                $isValidUrl = $this->requestUrl($url, 'GET', $options);
            // }
        }
        $this->urlReports[$origUrl] = $isValidUrl;
        $this->urlErrorParams[$origUrl] = $this->errorParams;

        return $isValidUrl;
    }
Actions #3

Updated by Sybille Peters about 5 years ago

  • Related to Epic #85006: Reduce falsely reported broken links added
Actions #4

Updated by Sybille Peters about 5 years ago

  • Subject changed from Linkvalidator stops working on specific links to Linkvalidator stops working on specific links (external URLs)
Actions #5

Updated by Gerrit Code Review about 5 years ago

  • Status changed from New to Under Review

Patch set 1 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61801

Actions #6

Updated by Sybille Peters about 5 years ago

@Stefan Berger

I created a patch: https://review.typo3.org/c/Packages/TYPO3.CMS/+/61801

It would be helpful, if you could review it.

If you want to recreate your settings, you can change the page TSconfig:

mod.linkvalidator {
    linktypes_config {
       external {
          # some sites require these headers
          headers {
             Connection = keep-alive
             Accept = */*
             Accept-Language = *
             Accept-Encoding = *
          }
          # use HEAD or GET
          # - HEAD will always fall back to GET if HEAD fails
          # - some sites cause problems with HEAD; in that case use GET
          method = GET
          # only for GET method: this limits the size of the document returned but may fail on some sites
          range =
       }
    }
}

We should probably also add this to the documentation.

Also, it would be helpful if you could supply URLs that do not work. https://www.dpdhl.com/en.html currently causes no problem.
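
To illustrate how settings like the page TSconfig above could translate into request settings, here is a rough sketch. It is not the code from the patch: `buildRequestSettings()` is a hypothetical helper, the configuration is assumed to arrive as a plain PHP array, and the defaults (HEAD, range 0-4048, the User-Agent string) are assumptions taken from the values mentioned in this issue.

```php
<?php
/**
 * Hypothetical mapping from an "external" link type configuration array
 * to a request method, headers and range. Keys mirror the TSconfig above.
 */
function buildRequestSettings(array $config): array
{
    // Assumed default header for this sketch; configured headers override it.
    $headers = array_merge(
        ['User-Agent' => 'TYPO3 linkvalidator'],
        $config['headers'] ?? []
    );

    // Only HEAD and GET are meaningful here; anything else falls back to HEAD.
    $method = strtoupper($config['method'] ?? 'HEAD');
    if (!in_array($method, ['HEAD', 'GET'], true)) {
        $method = 'HEAD';
    }

    // An empty "range" setting means: send no Range header at all.
    $range = trim((string)($config['range'] ?? '0-4048'));

    return ['method' => $method, 'headers' => $headers, 'range' => $range];
}

$settings = buildRequestSettings([
    'headers' => ['Connection' => 'keep-alive', 'Accept' => '*/*'],
    'method' => 'get',
    'range' => '',
]);
// $settings['method'] is 'GET' and $settings['range'] is '' (no Range header)
```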

Actions #7

Updated by Sybille Peters about 5 years ago

@Stefan Berger

Also, thanks for reporting this and supplying a working ExternalLinktype class.

Actions #8

Updated by Gerrit Code Review about 5 years ago

Patch set 2 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61801

Actions #9

Updated by Gerrit Code Review about 5 years ago

Patch set 3 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61801

Actions #10

Updated by Gerrit Code Review about 5 years ago

Patch set 1 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61811

Actions #11

Updated by Gerrit Code Review about 5 years ago

Patch set 4 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61801

Actions #12

Updated by Gerrit Code Review about 5 years ago

Patch set 2 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61811

Actions #13

Updated by Gerrit Code Review about 5 years ago

Patch set 3 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61811

Actions #14

Updated by Sybille Peters about 5 years ago

  • Related to Feature #85127: linkvalidator: Add possibility to exclude specific external URLs / domains or patterns added
Actions #15

Updated by Sybille Peters about 5 years ago

Please be aware that using GET instead of HEAD, or removing the Range header for GET, will result in more (network) traffic and put more load not only on your server but also on the remote server that is being crawled.

Right now, I don't see an easy way to get reliable link checking without increasing load / network traffic. I would really like to have reliable link checking but adding extra load for all when only a few URLs are affected seems wrong.

There is also an alternative proposal to make it possible to exclude specific URLs from link checking: https://forge.typo3.org/issues/85127 .

Actions #16

Updated by Gerrit Code Review about 5 years ago

Patch set 4 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61811

Actions #17

Updated by Gerrit Code Review about 5 years ago

Patch set 5 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61811

Actions #18

Updated by Gerrit Code Review about 5 years ago

Patch set 6 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61811

Actions #19

Updated by Gerrit Code Review about 5 years ago

Patch set 7 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61811

Actions #20

Updated by Gerrit Code Review about 5 years ago

Patch set 8 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61811

Actions #21

Updated by Gerrit Code Review about 5 years ago

Patch set 9 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61811

Actions #22

Updated by Gerrit Code Review about 5 years ago

Patch set 1 for branch 9.5 of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61829

Actions #23

Updated by Sybille Peters about 5 years ago

  • Status changed from Under Review to Resolved
  • % Done changed from 0 to 100
Actions #24

Updated by Gerrit Code Review about 5 years ago

  • Status changed from Resolved to Under Review

Patch set 5 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61801

Actions #25

Updated by Gerrit Code Review about 5 years ago

Patch set 6 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61801

Actions #26

Updated by Gerrit Code Review about 5 years ago

Patch set 7 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61801

Actions #27

Updated by Sybille Peters about 5 years ago

  • Parent task set to #85006
Actions #28

Updated by Gerrit Code Review about 5 years ago

Patch set 8 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61801

Actions #29

Updated by Gerrit Code Review about 5 years ago

Patch set 9 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/61801

Actions #30

Updated by Sybille Peters about 5 years ago

  • Status changed from Under Review to Resolved
Actions #31

Updated by Benni Mack almost 5 years ago

  • Status changed from Resolved to Closed
Actions #32

Updated by Garvin Hicking 4 months ago

  • Related to Task #89287: Make linkvalidator crawling polite added