Bug #105753

closed

Linkvalidator does not parse domains with Umlauts correctly

Added by Roman Schilter about 1 month ago. Updated 11 days ago.

Status: Resolved
Priority: Should have
Category: Linkvalidator
Target version: -
Start date: 2024-12-03
Due date:
% Done: 100%
Estimated time:
TYPO3 Version: 12
PHP Version: 8.1
Tags:
Complexity:
Is Regression:
Sprint Focus:

Description

Steps to reproduce:

  • Create a page of type "Link to External URL"
  • Enter a URL containing umlauts ("https://gebäudehülle.swiss")
  • Run the Linkvalidator task for the created page

Linkvalidator extracts only "https://geb" (the URL is cut off at the first umlaut) and reports it as a broken link.

The cause is that the regular expression in UrlSoftReferenceParser does not match umlauts.
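
To illustrate the effect, here is a minimal sketch. The pattern is a hypothetical ASCII-only expression made up for this example, not the actual expression used in UrlSoftReferenceParser; it only shows how such a pattern stops at the first non-ASCII byte.

```php
<?php
// Hypothetical ASCII-only URL pattern, NOT the expression from UrlSoftReferenceParser.
$asciiOnlyPattern = '~https?://[A-Za-z0-9.\-/_]+~';

$url = 'https://gebäudehülle.swiss';

preg_match($asciiOnlyPattern, $url, $matches);

// "ä" is encoded as two bytes in UTF-8, neither of which is in the character class,
// so the match ends right before it.
echo $matches[0], PHP_EOL; // https://geb
```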


Files

linkvalidator_result.png (24.4 KB), Sybille Peters, 2024-12-30 10:32
Actions #1

Updated by Sybille Peters 14 days ago

I can reproduce this in v14 (main) and v12 (latest 12.4 branch) using page type "link to external page" with the link stored in pages.url (as described).

I cannot reproduce it with content element CType="textmedia" and links in tt_content.bodytext (RTE). There, the links are parsed correctly.

Actions #2

Updated by Sybille Peters 14 days ago

The following works correctly in RTE:

select bodytext from tt_content where pid=913;
<p>mit Umlauten <a href="https://xn--gebudehlle-s5a60a.swiss/">link (umlauts)</a></p>
<p><a href="https://xn--gebudehlle-s5a60a.swiss/sdfsddfs">broken link (umlauts)</a></p>
<p>Encoded: <a href="https://gebäudehülle.swiss">link (encoded)</a></p>
<p><a href="https://gebäudehülle.swiss/sdfsddfs">broken link (encoded)</a></p>
Actions #3

Updated by Sybille Peters 14 days ago · Edited

For pages.url, the link is handled correctly if the already encoded (Punycode) value is stored. It is not handled correctly if the unencoded value containing umlauts is stored.

As a workaround, you can store the encoded URL. For me, for example, the URL is already converted when I copy it from the browser address bar.

As mentioned above, the problem is in UrlSoftReferenceParser. Possible solutions:

  • the value is converted before it is saved in pages.url (and other fields)
  • Linkvalidator does not use the softref parser for fields whose content can be used as is (e.g. pages.url, which is of type "input" with "softref" = "url")
  • Linkvalidator converts the URL before passing it to the softref parser (not possible for fields where parsing must be performed first)
  • UrlSoftReferenceParser converts the URL first. We already have code for this in ExternalLinktype::preprocessUrl:
    protected function preprocessUrl(string $url): string
    {
        $url = html_entity_decode($url);
        $parts = parse_url($url);
        if ($parts['host'] ?? false) {
            try {
                $newDomain = (string)idn_to_ascii($parts['host']);
                if (strcmp($parts['host'], $newDomain) !== 0) {
                    $parts['host'] = $newDomain;
                    $url = HttpUtility::buildUrl($parts);
                }
            } catch (\Exception | \Throwable $e) {
                // ignore error and proceed with link checking
            }
        }
        return $url;
    }
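
As a standalone illustration of the last option, here is a rough sketch of converting the host before any ASCII-only matching takes place. The function name convertHostToAscii is hypothetical and not part of TYPO3; the sketch assumes the PHP intl extension is available (it provides idn_to_ascii) and replaces the host by simple string substitution instead of HttpUtility::buildUrl.

```php
<?php
// Hypothetical helper for illustration only; not part of TYPO3.
function convertHostToAscii(string $url): string
{
    $parts = parse_url($url);
    // Leave the URL untouched if it has no host or the intl extension is missing.
    if (!is_array($parts) || empty($parts['host']) || !function_exists('idn_to_ascii')) {
        return $url;
    }
    $asciiHost = idn_to_ascii($parts['host']);
    if ($asciiHost === false || $asciiHost === $parts['host']) {
        return $url;
    }
    // Simplification: replace the first occurrence of the host in the URL.
    $position = strpos($url, $parts['host']);
    if ($position === false) {
        return $url;
    }
    return substr_replace($url, $asciiHost, $position, strlen($parts['host']));
}

$converted = convertHostToAscii('https://gebäudehülle.swiss/some/path');
echo $converted, PHP_EOL;
```

After the conversion the host contains only ASCII characters (the "xn--" form quoted in comment #2), so a pattern that does not match umlauts would no longer cut the URL short.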

Actions #4

Updated by Sybille Peters 14 days ago

Also, there is no problem with parsing sys_file_reference.link because it uses the TypolinkSoftReferenceParser.

I wonder what the point of UrlSoftReferenceParser is, since TypolinkSoftReferenceParser can also parse URLs.

Alternatively, pages.url could use type=link with allowedTypes='url' (though that would not solve the problem for other fields that may be affected).

pages.url:
- type=input
- softref=url

sys_file_reference.link:
- type=link
- softref=typolink
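
Expressed as TCA, the difference might look roughly like the following. These are abbreviated, illustrative fragments based only on the values listed above, not copies of the core definitions; the variable names are made up, and the array form of allowedTypes in the last fragment is an assumption.

```php
<?php
// Abbreviated, illustrative TCA fragments; not copied from TYPO3 core.

// pages.url as listed above: parsed via the "url" softref.
$pagesUrl = [
    'config' => [
        'type' => 'input',
        'softref' => 'url',
    ],
];

// sys_file_reference.link as listed above: parsed via the "typolink" softref.
$fileReferenceLink = [
    'config' => [
        'type' => 'link',
        'softref' => 'typolink',
    ],
];

// Suggested alternative for pages.url (assumed syntax for allowedTypes).
$suggestedPagesUrl = [
    'config' => [
        'type' => 'link',
        'allowedTypes' => ['url'],
    ],
];
```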

Actions #5

Updated by Gerrit Code Review 12 days ago

  • Status changed from Accepted to Under Review

Patch set 1 for branch main of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/87609

Actions #6

Updated by Sybille Peters 12 days ago

  • Assignee set to Sybille Peters
Actions #7

Updated by Gerrit Code Review 11 days ago

Patch set 2 for branch main of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/87609

Actions #8

Updated by Gerrit Code Review 11 days ago

Patch set 3 for branch main of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/87609

Actions #9

Updated by Gerrit Code Review 11 days ago

Patch set 1 for branch 13.4 of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/87618

Actions #10

Updated by Gerrit Code Review 11 days ago

Patch set 1 for branch 12.4 of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/87619

Actions #11

Updated by Sybille Peters 11 days ago

  • Status changed from Under Review to Resolved
  • % Done changed from 0 to 100