Bug #13756 (Closed)
External links to big files crash TYPO3
Added by Daniel Minder over 13 years ago. Updated almost 11 years ago.
100%
Description
External links are fetched completely using t3lib_div::getURL(). For big files, this results in a crash of TYPO3 in line 3066 of class t3lib_div (TYPO3 4.5.2), which is this line:
$content .= stream_get_contents($fp);
Here, TYPO3 appends the whole content of the file to a variable until the memory limit is exhausted. Crash.
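For illustration, a minimal sketch of the pattern (simplified and hypothetical, not the actual core code; $host and $path are placeholders):

// Simplified, hypothetical sketch of the pattern, not the actual core code:
$host = 'www.example.org';
$path = '/some/big/file.iso';
$fp = fsockopen($host, 80, $errno, $errstr, 30);
fputs($fp, 'GET ' . $path . " HTTP/1.0\r\nHost: " . $host . "\r\n\r\n");
$content = '';
// The entire response body is buffered in memory, so a sufficiently
// large file exhausts PHP's memory_limit and the request dies:
$content .= stream_get_contents($fp);
fclose($fp);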
The only solution would be to fetch only the header (which would also speed up the link checking). But in class.tx_linkvalidator_linktype_external.php I find the comment
"try fetching the content of the URL (just fetching the headers does not work correctly)"
What is the problem with fetching only the headers? Is there any solution to this at all?
Updated by Philipp Gampe over 13 years ago
The problem is that getUrl() fails if the user forgets to append a "/" at the end of the domain name.
See http://bugs.typo3.org/view.php?id=16925
If you could test this behavior and see if it works again for all kinds of links (with or without redirects), please report your results. (You need to turn curl off.)
Some links to check:
http://google.com --> ok
http://www.google.com/ --> ok
http://loop.philippgampe.info/ --> loop
http://error.philippgampe.info/ --> 404
http://google.com/test --> 404
http://www.google.com/test --> 404
Updated by Daniel Minder over 13 years ago
I added a note to the getURL() bug. IMO getURL() must not send a malformed HTTP request.
Some more technical background:
If I send the request "GET HTTP/1.1" to a web server, it interprets this as an HTTP/0.9 request with "HTTP/1.1" as the path. Of course, this results in a 404 error, since that path exists on hardly any server.
A request "HEAD HTTP/1.1" is also interpreted as an HTTP/0.9 request, but the HEAD command does not exist in 0.9, where only GET is allowed. Therefore, the server correctly responds with a 400 Bad Request error.
I checked the links above and some more manually by sending HTTP GET and HEAD requests. These are the results:
URL                             | GET | HEAD | Location for 3*
--------------------------------+-----+------+-------------------------------
http://google.com               | 404 | 400  |
http://google.com/              | 301 | 301  | http://www.google.com/
http://www.google.com/          | 302 | 302  | http://www.google.de/
http://www.google.de/           | 200 | 200  |
http://loop.philippgampe.info/  | 302 | 302  | http://loop.philippgampe.info/
http://error.philippgampe.info/ | 404 | 404  |
http://google.com/test          | 301 | 301  | http://www.google.com/test
http://www.google.com/test      | 404 | 404  |
So, as we see, GET and HEAD result in the same codes, as required by the RFC, if and only if a correct request is sent (which is not the case in the first test only). Following all the redirects, we get the desired results as indicated by Philipp.
So, currently, with the wrong implementation of getURL() we can choose whether to present the user a 404 or a 400 error. Both are correct from the HTTP protocol's point of view but wrong from the user's perspective, since typing the same URL into a browser works. Let's hope that the getURL() bug gets fixed soon. If not, we have to deal with it in linkvalidator.
Nevertheless, if there is no other reason for using GET instead of HEAD, I would strongly recommend using HEAD, since it at least solves the problem of large files and reduces network load.
Updated by Philipp Gampe over 13 years ago
We have more problems here. Some servers do not respond to a HEAD request:
http://www.amazon.com/TYPO3-Enterprise-Management-Official-endorsed/dp/1904811418/ref=sr_1_1?ie=UTF8&tag=typentconmans-20
This returns 405:
HTTP/1.1 405 MethodNotAllowed
Date: Sat, 12 Mar 2011 13:02:03 GMT
Server: Server
x-amz-id-1: 1AWSWM8KAAS2ZFKP2950
allow: POST, GET
x-amz-id-2: zFaB5UjNZ+aZKn9EzfVjnIEQ3/sLuxOvCUDFoSzbgoeMD4Yyg83l6yhzLKUpeAmb
Vary: Accept-Encoding,User-Agent
nnCoection: close
Content-Type: text/html; charset=ISO-8859-1
Connection: close
But anyway, thank you very much for your thorough analysis.
Updated by Philipp Gampe over 13 years ago
- Status changed from New to Accepted
- Assignee set to Philipp Gampe
Updated by Philipp Gampe over 13 years ago
- % Done changed from 0 to 50
Updated by Daniel Minder over 13 years ago
I also found a server which reacts differently to a HEAD request. Actually, the RFC only specifies that a server SHOULD react in the same way, not MUST... We cannot be sure how a non-compliant server reacts: some respond with 405, some with 404.
I see three different solutions:
1. We could retry with a GET if a HEAD does not result in a 200 OK, but this is not a failure-proof solution.
2. We could add a configurable list that defines which servers should be queried with GET instead of HEAD. But this does not solve the problem of big files linked on those servers...
3. We could copy the code from t3lib_div::getUrl() (at least the parts that are relevant for us), hardcode a GET request, but stop reading the input after the header (see the sketch after this list). This is at least possible when we do not use curl. I have not tried whether this is possible with curl (does curl return the header although it exits with code 65 when --max-filesize is exceeded?).
As far as I can see, option 3 is bad from a software engineering point of view, but IMO it is the only possible way...
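A minimal sketch of what option 3 could look like, assuming a plain (non-cURL) socket; the names and details are assumptions, not core code:

// Sketch of solution 3: send a GET, but stop reading as soon as
// the header block is complete ($host and $path are placeholders).
$host = 'www.example.org';
$path = '/some/big/file.iso';
$fp = fsockopen($host, 80, $errno, $errstr, 30);
fputs($fp, 'GET ' . $path . " HTTP/1.1\r\nHost: " . $host . "\r\nConnection: close\r\n\r\n");
$header = '';
while (!feof($fp)) {
    $line = fgets($fp, 2048);
    if (rtrim($line) === '') {
        break; // an empty line terminates the header block
    }
    $header .= $line;
}
fclose($fp); // close the socket without ever reading the body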
Updated by Philipp Gampe over 13 years ago
I would go for 1, because it has the least overhead.
I do not like the idea of 3. It sounds like a nightmare.
We need more tests with curl. Or we use http://php.net/manual/en/book.http.php. But that would mean more requirements on the server side - also bad.
2 could be an option... it should be about the same amount of work as 1. Still, I prefer a silent fallback (so 1).
Updated by Daniel Minder over 13 years ago
Solution 1 has the same problem as solution 2: if a server reacts wrongly to a HEAD request, we retry with a GET, and a big file is linked, TYPO3 will crash.
Additional requirements are also bad, that's true. But if you are thinking about reimplementing it using the HTTP class, then you could also go for solution 3 :)
Updated by Michael Miousse over 13 years ago
Daniel Minder wrote:
Solution 1 has the same problem as solution 2: if a server reacts wrongly to a HEAD request, we retry with a GET, and a big file is linked, TYPO3 will crash.
Additional requirements are also bad, that's true. But if you are thinking about reimplementing it using the HTTP class, then you could also go for solution 3 :)
I think it would be best if we provided a patch to getUrl in order to prevent TYPO3 from crashing.
I mean, you are right that a big file will crash TYPO3, but isn't that also true for any call made to getUrl?
If so, I think there should be a limit to the content fetched by getUrl in the core directly.
It would not only solve our problem but should also prevent the core from crashing TYPO3.
What do you think?
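A rough sketch of such a cap inside getUrl; the limit value and the variable names are hypothetical:

// Hypothetical cap on the amount of content getUrl() buffers
// ($fp is the socket getUrl() has already opened):
$maxBytes = 1048576; // e.g. a 1 MB limit, ideally configurable
$content = '';
while (!feof($fp) && strlen($content) < $maxBytes) {
    // read in 8 KB chunks, never past the cap
    $content .= fread($fp, min(8192, $maxBytes - strlen($content)));
}
fclose($fp);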
Updated by Philipp Gampe over 13 years ago
@Michael
Either that, or we introduce a mode 3 for getUrl, which does the same as mode 2 but using GET. This is rather easy and might even be added to 4.5.
We only need the head of the HTTP message.
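In terms of the existing $includeHeader parameter, the modes would look like this (mode 3 is the proposal, not an existing API):

// Existing modes of t3lib_div::getUrl() plus the proposed mode 3:
$body       = t3lib_div::getUrl($url);    // 0: body only
$headerBody = t3lib_div::getUrl($url, 1); // 1: header + body
$headerHead = t3lib_div::getUrl($url, 2); // 2: header only, sent as HEAD
$headerGet  = t3lib_div::getUrl($url, 3); // 3 (proposed): header only, sent as GET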
Updated by Daniel Minder over 13 years ago
I have never used cURL, but looking at http://de3.php.net/manual/en/function.curl-setopt.php it seems that we cannot tell cURL to stop downloading after the header... Or am I missing something?
Updated by Philipp Gampe over 13 years ago
Daniel Minder wrote:
I have never used cURL, but looking at http://de3.php.net/manual/en/function.curl-setopt.php it seems that we cannot tell cURL to stop downloading after the header... Or am I missing something?
Yes ... there is an option curl_setopt($ch, CURLOPT_NOBODY, $includeHeader == 2 ? 1 : 0);
I already have a working patch for getUrl($url, 3) and changes for the cleanup patch.
Updated by Daniel Minder over 13 years ago
Yes ... there is an option curl_setopt($ch, CURLOPT_NOBODY, $includeHeader == 2 ? 1 : 0);
No, that's wrong! The man page says: "Request method is then set to HEAD." But that is exactly NOT what we want!
Updated by Philipp Gampe over 13 years ago
I played around with the settings and there is no way to combine CURLOPT_NOBODY with CURLOPT_HTTPGET. The latter is even used incorrectly in the core. CURLOPT_HTTPGET = 1 retrieves the body even if CURLOPT_NOBODY is set.
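A sketch of the failing combination (the URL and option order here are illustrative):

// Sketch of the failing combination (URL is a placeholder):
$ch = curl_init('http://www.example.org/somefile');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);  // include response headers in the output
curl_setopt($ch, CURLOPT_NOBODY, 1);  // switches the request method to HEAD...
curl_setopt($ch, CURLOPT_HTTPGET, 1); // ...while this switches it back to GET,
                                      // so the body is downloaded after all
$result = curl_exec($ch);
curl_close($ch);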
The whole curlUse option seems not to work correctly, as additionalHeaders are not set either (so no user agent --> bug with MediaWiki).
I think we should solve this with the normal getUrl and then think about how to handle curlUse.
You can test with http://bugs.typo3.org/view.php?id=17943 and v2 from #13680
Updated by Philipp Gampe over 13 years ago
BTW, thanks for your continuous efforts.
Updated by Philipp Gampe over 13 years ago
- Target version set to 4.6.0
- Estimated time set to 2.00 h
- use t3lib_http_Request
- use method HEAD
- use automatic redirect following, with strict redirects
- use a cookie jar to save cookies during requests
- if HEAD fails (400), retry with GET
- but set the range header to 0-4K
- skip the body
- if it still fails, mark the link as bad
- if an exception occurs, mark the link as bad and use the exception message
- use the return code and message from t3lib_http_Request
- handle all HTTP return codes (there are not that many)
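A sketch of how this plan could map onto t3lib_http_Request (the API follows PEAR's HTTP_Request2, which it extends; the details here are assumptions, not the final patch):

// Sketch only, not the final patch; $url is assumed to be set.
$options = array(
    'follow_redirects' => TRUE,
    'strict_redirects' => TRUE, // follow redirects for HEAD as well
);
$request = t3lib_div::makeInstance('t3lib_http_Request', $url, 'HEAD', $options);
try {
    $response = $request->send();
    if ($response->getStatus() === 400) {
        // The server mishandles HEAD: retry with GET, but only ask
        // for the first 4 KB and ignore the body.
        $request->setMethod('GET');
        $request->setHeader('Range', 'bytes=0-4048');
        $response = $request->send();
    }
    $isValidUrl = ($response->getStatus() < 300);
} catch (Exception $e) {
    // Mark the link as broken and reuse the exception message.
    $isValidUrl = FALSE;
    $errorMessage = $e->getMessage();
}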
Updated by Mr. Hudson over 13 years ago
Patch set 1 of change Ib5d70f4c18fce2617dd9b4d0c675468c5c9558ed has been pushed to the review server.
It is available at http://review.typo3.org/4261
Updated by Mr. Hudson over 13 years ago
Patch set 2 of change Ib5d70f4c18fce2617dd9b4d0c675468c5c9558ed has been pushed to the review server.
It is available at http://review.typo3.org/4261
Updated by Philipp Gampe over 13 years ago
- Status changed from Accepted to Resolved
- % Done changed from 50 to 100
Applied in changeset commit:5096f769049686c74e5c22a42a8a303cbbdc2ff2.
Updated by Chris topher over 12 years ago
- Status changed from Resolved to Closed
Updated by Michael Stucki almost 11 years ago
- Project changed from 1510 to TYPO3 Core
- Category changed from Linkvalidator to Linkvalidator