Bug #13756 (Closed)
External links to big files crash TYPO3
Added by Daniel Minder over 13 years ago. Updated almost 11 years ago.
100%
Description
External links are fetched completely using t3lib_div::getURL(). For big files, this results in a crash of TYPO3 in line 3066 of class t3lib_div (TYPO3 4.5.2), which is this line:
$content .= stream_get_contents($fp);
Here, TYPO3 appends the whole content of the file to a variable until the memory limit is exhausted. Crash.
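For illustration, a minimal sketch of the pattern (simplified and hypothetical, not the actual core code; $host and $path are placeholders):

// Simplified, hypothetical sketch of the pattern, not the actual core code:
$host = 'www.example.org';
$path = '/some/big/file.iso';
$fp = fsockopen($host, 80, $errno, $errstr, 30);
fputs($fp, 'GET ' . $path . " HTTP/1.0\r\nHost: " . $host . "\r\n\r\n");
$content = '';
// The entire response body is buffered in memory, so a sufficiently
// large file exhausts PHP's memory_limit and the request dies:
$content .= stream_get_contents($fp);
fclose($fp);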
The only solution would be to fetch only the header (which would also speed up the link checking). But in class.tx_linkvalidator_linktype_external.php I find the comment
"try fetching the content of the URL (just fetching the headers does not work correctly)"
What is the problem with fetching only the headers? Is there any solution to this at all?
Updated by Philipp Gampe over 13 years ago
The problem is that getUrl() fails if the user forgets to append a "/" at the end of the domain name.
See http://bugs.typo3.org/view.php?id=16925
If you could test this behavior and see if it works again for all kinds of links (with or without redirects), please report your results. (You need to turn curl off.)
Some links to check:
http://google.com --> ok
http://www.google.com/ --> ok
http://loop.philippgampe.info/ --> loop
http://error.philippgampe.info/ --> 404
http://google.com/test --> 404
http://www.google.com/test --> 404
Updated by Daniel Minder over 13 years ago
I added a note to the getURL() bug. IMO getURL() must not send a malformed HTTP request.
Some more technical background:
If I send the request "GET HTTP/1.1" to a web server, it interprets this as an HTTP/0.9 request with "HTTP/1.1" as the path. Of course, this results in a 404 error, since that path exists on hardly any server.
A request "HEAD HTTP/1.1" is also interpreted as an HTTP/0.9 request, but the HEAD command does not exist in 0.9, where only GET is allowed. Therefore, the server correctly responds with a 400 Bad Request error.
I checked the links above and some more manually by sending HTTP GET and HEAD requests. These are the results:
URL                             | GET | HEAD | Location for 3*
--------------------------------+-----+------+-------------------------------
http://google.com               | 404 | 400  |
http://google.com/              | 301 | 301  | http://www.google.com/
http://www.google.com/          | 302 | 302  | http://www.google.de/
http://www.google.de/           | 200 | 200  |
http://loop.philippgampe.info/  | 302 | 302  | http://loop.philippgampe.info/
http://error.philippgampe.info/ | 404 | 404  |
http://google.com/test          | 301 | 301  | http://www.google.com/test
http://www.google.com/test      | 404 | 404  |
So, as we see, GET and HEAD result in the same codes, as required by the RFC, if and only if a correct request is sent (which is not the case in the first test only). Following all the redirects, we get the desired results as indicated by Philipp.
So, currently, with the wrong implementation of getURL() we can choose whether to present the user a 404 or a 400 error. Both are correct from the HTTP protocol's point of view but wrong from the user's perspective, since typing the same URL into a browser works. Let's hope that the getURL() bug gets fixed soon. If not, we have to deal with it in linkvalidator.
Nevertheless, if there is no other reason for using GET instead of HEAD, I would strongly recommend using HEAD, since it at least solves the problem of large files and reduces network load.
Updated by Philipp Gampe over 13 years ago
We have more problems here. Some servers do not respond to a HEAD request:
http://www.amazon.com/TYPO3-Enterprise-Management-Official-endorsed/dp/1904811418/ref=sr_1_1?ie=UTF8&tag=typentconmans-20
This returns 405:
HTTP/1.1 405 MethodNotAllowed
Date: Sat, 12 Mar 2011 13:02:03 GMT
Server: Server
x-amz-id-1: 1AWSWM8KAAS2ZFKP2950
allow: POST, GET
x-amz-id-2: zFaB5UjNZ+aZKn9EzfVjnIEQ3/sLuxOvCUDFoSzbgoeMD4Yyg83l6yhzLKUpeAmb
Vary: Accept-Encoding,User-Agent
nnCoection: close
Content-Type: text/html; charset=ISO-8859-1
Connection: close
But anyway, thank you very much for your thorough analysis.
Updated by Philipp Gampe over 13 years ago
- Status changed from New to Accepted
- Assignee set to Philipp Gampe
Updated by Philipp Gampe over 13 years ago
- % Done changed from 0 to 50
Updated by Daniel Minder over 13 years ago
I also found a server which reacts differently to a HEAD request. Actually, the RFC only specifies that a server SHOULD react in the same way, not MUST... We cannot be sure how a non-compliant server reacts: some respond with 405, some with 404.
I see three different solutions:
1. We could retry with a GET if a HEAD does not result in a 200 OK, but this is not a failure-proof solution.
2. We could add a configurable list that defines which servers should be queried with GET instead of HEAD. But this does not solve the problem of big files linked on those servers...
3. We could copy the code from t3lib_div::getUrl() (at least the parts that are relevant for us), hardcode a GET request, but stop reading the input after the header (see the sketch after this list). This is at least possible when we do not use curl. I have not tried whether this is possible with curl (does curl return the header although it exits with code 65 when --max-filesize is exceeded?).
As far as I can see, option 3 is bad from a software engineering point of view, but IMO it is the only possible way...
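A minimal sketch of what option 3 could look like, assuming a plain (non-cURL) socket; the names and details are assumptions, not core code:

// Sketch of solution 3: send a GET, but stop reading as soon as
// the header block is complete ($host and $path are placeholders).
$host = 'www.example.org';
$path = '/some/big/file.iso';
$fp = fsockopen($host, 80, $errno, $errstr, 30);
fputs($fp, 'GET ' . $path . " HTTP/1.1\r\nHost: " . $host . "\r\nConnection: close\r\n\r\n");
$header = '';
while (!feof($fp)) {
    $line = fgets($fp, 2048);
    if (rtrim($line) === '') {
        break; // an empty line terminates the header block
    }
    $header .= $line;
}
fclose($fp); // close the socket without ever reading the body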
Updated by Philipp Gampe over 13 years ago
I would go for 1, because it has the least overhead.
I do not like the idea of 3. It sounds like a nightmare.
We need more tests with curl. Or we use http://php.net/manual/en/book.http.php. But that would mean more requirements on the server side - also bad.
2 could be an option... it should be about the same amount of work as 1. Still, I prefer a silent fallback (so 1).
Updated by Daniel Minder over 13 years ago
Solution 1 has the same problem as solution 2: if a server reacts wrongly to a HEAD request, we retry with a GET, and a big file is linked, TYPO3 will crash.
Additional requirements are also bad, that's true. But if you are thinking about reimplementing it using the HTTP class, then you could also go for solution 3 :)
Updated by Michael Miousse over 13 years ago
Daniel Minder wrote:
Solution 1 has the same problem as solution 2: if a server reacts wrongly to a HEAD request, we retry with a GET, and a big file is linked, TYPO3 will crash.
Additional requirements are also bad, that's true. But if you are thinking about reimplementing it using the HTTP class, then you could also go for solution 3 :)
I think it would be best if we provided a patch to getUrl in order to prevent TYPO3 from crashing.
I mean, you are right that a big file will crash TYPO3, but isn't that also true for any call made to getUrl?
If so, I think there should be a limit to the content fetched by getUrl in the core directly.
It would not only solve our problem but should also prevent the core from crashing TYPO3.
What do you think?
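A rough sketch of such a cap inside getUrl; the limit value and the variable names are hypothetical:

// Hypothetical cap on the amount of content getUrl() buffers
// ($fp is the socket getUrl() has already opened):
$maxBytes = 1048576; // e.g. a 1 MB limit, ideally configurable
$content = '';
while (!feof($fp) && strlen($content) < $maxBytes) {
    // read in 8 KB chunks, never past the cap
    $content .= fread($fp, min(8192, $maxBytes - strlen($content)));
}
fclose($fp);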
Updated by Philipp Gampe over 13 years ago
@Michael
Either that, or we introduce a mode 3 for getUrl, which does the same as mode 2 but using GET. This is rather easy and might even be added to 4.5.
We only need the head of the HTTP message.
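In terms of the existing $includeHeader parameter, the modes would look like this (mode 3 is the proposal, not an existing API):

// Existing modes of t3lib_div::getUrl() plus the proposed mode 3:
$body       = t3lib_div::getUrl($url);    // 0: body only
$headerBody = t3lib_div::getUrl($url, 1); // 1: header + body
$headerHead = t3lib_div::getUrl($url, 2); // 2: header only, sent as HEAD
$headerGet  = t3lib_div::getUrl($url, 3); // 3 (proposed): header only, sent as GET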
Updated by Daniel Minder over 13 years ago
I have never used cURL, but looking at http://de3.php.net/manual/en/function.curl-setopt.php it seems that we cannot tell cURL to stop downloading after the header... Or am I missing something?
Updated by Philipp Gampe over 13 years ago
Daniel Minder wrote:
I have never used cURL, but looking at http://de3.php.net/manual/en/function.curl-setopt.php it seems that we cannot tell cURL to stop downloading after the header... Or am I missing something?
Yes ... there is an option curl_setopt($ch, CURLOPT_NOBODY, $includeHeader == 2 ? 1 : 0);
I already have a working patch for getUrl($url, 3) and changes for the cleanup patch.
Updated by Daniel Minder over 13 years ago
Yes ... there is an option curl_setopt($ch, CURLOPT_NOBODY, $includeHeader == 2 ? 1 : 0);
No, that's wrong! The man page says: "Request method is then set to HEAD." But that is exactly NOT what we want!
Updated by Philipp Gampe over 13 years ago
I played around with the settings and there is no way to combine CURLOPT_NOBODY with CURLOPT_HTTPGET. The latter is even used incorrectly in the core. CURLOPT_HTTPGET = 1 retrieves the body even if CURLOPT_NOBODY is set.
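A sketch of the failing combination (the URL and option order here are illustrative):

// Sketch of the failing combination (URL is a placeholder):
$ch = curl_init('http://www.example.org/somefile');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);  // include response headers in the output
curl_setopt($ch, CURLOPT_NOBODY, 1);  // switches the request method to HEAD...
curl_setopt($ch, CURLOPT_HTTPGET, 1); // ...while this switches it back to GET,
                                      // so the body is downloaded after all
$result = curl_exec($ch);
curl_close($ch);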
The whole curlUse option seems not to work correctly, as additionalHeaders are not set either (so no user agent --> bug with MediaWiki).
I think we should solve this with the normal getUrl and then think about how to handle curlUse.
You can test with http://bugs.typo3.org/view.php?id=17943 and v2 from #13680
Updated by Philipp Gampe over 13 years ago
BTW, thanks for your continuous efforts.
Updated by Philipp Gampe over 13 years ago
- Target version set to 4.6.0
- Estimated time set to 2.00 h
- use t3lib_http_Request
- use method HEAD
- use automatic redirect following, with strict redirects
- use a cookie jar to save cookies during requests
- if HEAD fails (400), retry with GET
- but set the range header to 0-4K
- skip the body
- if it still fails, mark the link as bad
- if an exception occurs, mark the link as bad and use the exception message
- use the return code and message from t3lib_http_Request
- handle all HTTP return codes (there are not that many)
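A sketch of how this plan could map onto t3lib_http_Request (the API follows PEAR's HTTP_Request2, which it extends; the details here are assumptions, not the final patch):

// Sketch only, not the final patch; $url is assumed to be set.
$options = array(
    'follow_redirects' => TRUE,
    'strict_redirects' => TRUE, // follow redirects for HEAD as well
);
$request = t3lib_div::makeInstance('t3lib_http_Request', $url, 'HEAD', $options);
try {
    $response = $request->send();
    if ($response->getStatus() === 400) {
        // The server mishandles HEAD: retry with GET, but only ask
        // for the first 4 KB and ignore the body.
        $request->setMethod('GET');
        $request->setHeader('Range', 'bytes=0-4048');
        $response = $request->send();
    }
    $isValidUrl = ($response->getStatus() < 300);
} catch (Exception $e) {
    // Mark the link as broken and reuse the exception message.
    $isValidUrl = FALSE;
    $errorMessage = $e->getMessage();
}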
Updated by Mr. Hudson over 13 years ago
Patch set 1 of change Ib5d70f4c18fce2617dd9b4d0c675468c5c9558ed has been pushed to the review server.
It is available at http://review.typo3.org/4261
Updated by Mr. Hudson over 13 years ago
Patch set 2 of change Ib5d70f4c18fce2617dd9b4d0c675468c5c9558ed has been pushed to the review server.
It is available at http://review.typo3.org/4261
Updated by Philipp Gampe over 13 years ago
- Status changed from Accepted to Resolved
- % Done changed from 50 to 100
Applied in changeset commit:5096f769049686c74e5c22a42a8a303cbbdc2ff2.
Updated by Chris topher over 12 years ago
- Status changed from Resolved to Closed
Updated by Michael Stucki almost 11 years ago
- Project changed from 1510 to TYPO3 Core
- Category changed from Linkvalidator to Linkvalidator