Bug #80804
closedApache precesses overload because of typo3temp/locks/flock_
0%
Description
Hi there,
We have a really strange problem with locks : sometimes, Apache is completely saturated, too much processes, looks like, because of TYPO3 locking system.
The bug appear really sometimes, but now more and more regularly, 1 time / day. Really annoying bug, it puts all sites on the server down.
Actual solutions : or to remove all lock files from typo3temp, or to restart Apache.
Here are some strace logs :
- strace -p 5886
Process 5886 attached
restart_syscall(<... resuming interrupted call ...>) = 0
open("/home/www/mysite.com/www/typo3temp/locks/flock_b3b32a2d422265cd25c3323ed0157f81", O_WRONLY|O_CREAT, 0666) = 5
fstat(5, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
lseek(5, 0, SEEK_CUR) = 0
stat("/home/www/mysite.com/www/typo3temp/locks/flock_b3b32a2d422265cd25c3323ed0157f81", {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
chmod("/home/www/mysite.com/www/typo3temp/locks/flock_b3b32a2d422265cd25c3323ed0157f81", 0664) = -1 EPERM (Operation not permitted)
flock(5, LOCK_EX) = 0
open("/home/www/mysite.com/www/typo3temp/locks/flock_7bfc94f72787757bc18f8d24791673b2", O_WRONLY|O_CREAT, 0666) = 134
fstat(134, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
lseek(134, 0, SEEK_CUR) = 0
close(135) = 0
stat("/home/www/mysite.com/www/typo3temp/locks/flock_7bfc94f72787757bc18f8d24791673b2", {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
chmod("/home/www/mysite.com/www/typo3temp/locks/flock_7bfc94f72787757bc18f8d24791673b2", 0664) = 0
flock(134, LOCK_EX|LOCK_NB) = -1 EAGAIN (Resource temporarily unavailable)
flock(5, LOCK_UN) = 0
close(5) = 0
nanosleep({0, 100000000}, NULL) = 0
open("/home/www/mysite.com/www/typo3temp/locks/flock_b3b32a2d422265cd25c3323ed0157f81", O_WRONLY|O_CREAT, 0666) = 5
fstat(5, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
lseek(5, 0, SEEK_CUR) = 0
stat("/home/www/mysite.com/www/typo3temp/locks/flock_b3b32a2d422265cd25c3323ed0157f81", {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
chmod("/home/www/mysite.com/www/typo3temp/locks/flock_b3b32a2d422265cd25c3323ed0157f81", 0664) = -1 EPERM (Operation not permitted)
flock(5, LOCK_EX) = 0
open("/home/www/mysite.com/www/typo3temp/locks/flock_7bfc94f72787757bc18f8d24791673b2", O_WRONLY|O_CREAT, 0666) = 135
fstat(135, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
lseek(135, 0, SEEK_CUR) = 0
close(134) = 0
stat("/home/www/mysite.com/www/typo3temp/locks/flock_7bfc94f72787757bc18f8d24791673b2", {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
chmod("/home/www/mysite.com/www/typo3temp/locks/flock_7bfc94f72787757bc18f8d24791673b2", 0664) = 0
flock(135, LOCK_EX|LOCK_NB) = -1 EAGAIN (Resource temporarily unavailable)
flock(5, LOCK_UN) = 0
close(5) = 0
nanosleep({0, 100000000}, NULL) = 0
...
Best regards,
Fedir
Updated by Markus Klein over 7 years ago
Hi Fedir!
Thanks for your report. A few things I can read from your posting:
/home/www/mysite.com/www/typo3temp/locks/flock_*
seems not to have correct permissions as thechmod
fails with EPERM- The core tries to get an exclusive and non-blocking lock on
flock_7bfc94f72787757bc18f8d24791673b2
, which is intended. The lock is locked and hence EAGAIN is returned
I conclude that you have an ongoing page generation running that takes a serious long time or maybe never ends.
This causes all apache requests to pile up in the queue as they all wait for this page to finish generation, so they can deliver it.
What is your server setup?
Are you using php-fpm or mod_php?
Which mpm are you using?
We successfully run our servers with php-fpm and mpm_worker/mpm_event.
Updated by Fedir RYKHTIK over 7 years ago
Hey Markus !
Thank You very much for Your answer.
Here is information about my setup :
Are you using php-fpm or mod_php?
Apache2 mod_php5
What is your server setup?
Debian Jessie 8.7
PHP 5.6.30-0+deb8u1 (cli) (built: Feb 8 2017 08:50:21)
Server version: Apache/2.4.10 (Debian)
Which mpm are you using?
apache2-mpm-itk 2.4.10
One important stuff : web user uses permissions different from CLI user (for security reasons).
I conclude that you have an ongoing page generation running that takes a serious long time or maybe never ends.
How could I verify which pages are being generated ?
Best regards,
Fedir
Updated by Markus Klein over 7 years ago
- Category changed from System/Bootstrap/Configuration to Content Rendering
- Status changed from New to Needs Feedback
apache2-mpm-itk 2.4.10
I have never used this and it is based on mpm-prefork, which is known to be the worst choice nowadays.
I can only recommend to get rid of any third party code to be executed in the context of apache, that's why there is fcgi.
One important stuff : web user uses permissions different from CLI user (for security reasons).
Fair enough, still you need to make sure that both can properly read/write the locks directory if functionality is executed that requires locks.
How could I verify which pages are being generated ?
There are actually 2 locks involved in the page generation. One for the so called "pagesection" and one for the "page".
The key for the pagesection lock is based on the requested page UID and optionally a mountpoint-info.
The key for the page lock is created by \TYPO3\CMS\Frontend\Controller\TypoScriptFrontendController::getLockHash
and includes the page uid as well.
All locks in FE rendering process are managed by \TYPO3\CMS\Frontend\Controller\TypoScriptFrontendController::acquireLock
Keep in mind that actually there should never be a large queue of waiting clients, since there is one additional thing involved:
Whenever one process is generating a page, it first fills the well-known "page is being generated... wait 30 seconds..." content into the cache and releases the pagesection lock. This way all waiting processes will display this content to the client and will therefore go away from the queue.
I have only one idea, which I would classify as a possible reason for what you see: Your apache process crashes in the middle of the page generation, hence your "pages" lock file remains on the disk in a locked state.
Updated by Fedir RYKHTIK over 7 years ago
Hey Markus !
apache2-mpm-itk 2.4.10
I have never used this and it is based on mpm-prefork, which is known to be the worst choice nowadays.
I can only recommend to get rid of any third party code to be executed in the context of apache, that's why there is fcgi.
It's quite stable module, and quite handy in our case. I think we will continue to use it on the server where the bug appears.
One important stuff : web user uses permissions different from CLI user (for security reasons).
Fair enough, still you need to make sure that both can properly read/write the locks directory if functionality is executed that requires locks.
I checked, permissions on place were good. But problem still existed.
How could I verify which pages are being generated ?
There are actually 2 locks involved in the page generation. One for the so called "pagesection" and one for the "page".
The key for the pagesection lock is based on the requested page UID and optionally a mountpoint-info.
The key for the page lock is created by\TYPO3\CMS\Frontend\Controller\TypoScriptFrontendController::getLockHash
and includes the page uid as well.All locks in FE rendering process are managed by
\TYPO3\CMS\Frontend\Controller\TypoScriptFrontendController::acquireLock
Not so easy to debug the hash on production with thousand of requests per minute.
Keep in mind that actually there should never be a large queue of waiting clients, since there is one additional thing involved:
Whenever one process is generating a page, it first fills the well-known "page is being generated... wait 30 seconds..." content into the cache and releases the pagesection lock. This way all waiting processes will display this content to the client and will therefore go away from the queue.
OK, thanks for the hint.
I have only one idea, which I would classify as a possible reason for what you see: Your apache process crashes in the middle of the page generation, hence your "pages" lock file remains on the disk in a locked state.
Probably Apache, probably PHP, probably MySQL, could be different reasons.
The problem, what dead locks are not "burried". I think should be in place some locks TTL : if the TTL of the lock is too big, it should be released, to avoid infinitive looping.
Anyway, I made following workaround at the moment, we could find all old locks :
find /var/www/html/typo3temp/locks/flock_* not -newermt '-600 seconds' -printf "%p %TY%Tm-%Td %TH:%TM:%TS %Tz\n"
And even we could remove old locks by an external script, each 10 minutes :
/10 * * * * find /var/www/html/typo3temp/locks/flock_ -not -newermt '-600 seconds' -delete
Further proposition could be : to integrate flocks locks TTL into the Core, with default value 3600 seconds.
If lock exists and it's really really old, it should be removed, as it's not possible / normal, what locks exists after long time.
What do You think about such adjustment of the locks logic ?
Best regards,
Fedir
Updated by Markus Klein over 7 years ago
I have to think about this.. a lot.
The creation time of a lock file says usually nothing about its lock state. The resource is created once and used over and over again. To give you an idea of a website of ours, which is also quite busy:
-rw-r--r-- 1 user www-data 0 Nov 22 21:04 flock_1f2d180d4f0e10e95f64f6dd50c1d4cf -rw-r--r-- 1 user www-data 0 Dec 26 09:32 flock_3bce63d33f7c0032aa0aae07e78d7acc -rw-r--r-- 1 user www-data 0 Dec 7 14:31 flock_9c9007f651dd683b88f9913f91ccc017 -rw-r--r-- 1 user www-data 0 Nov 22 21:07 flock_b3b32a2d422265cd25c3323ed0157f81 -rw-r--r-- 1 user www-data 0 Nov 22 21:08 flock_c0a28f04f029116df131f61bb0c194b0 -rw-r--r-- 1 user www-data 0 Feb 7 16:01 flock_d0f5321c3c5b903d5ddf6d588d429d6b -rw-r--r-- 1 user www-data 0 Mar 13 10:19 flock_f5411b68e96cb15c2dc9772543729e6d
As you can see the lock files are all rather old.
$ stat flock_c0a28f04f029116df131f61bb0c194b0 File: ‘flock_c0a28f04f029116df131f61bb0c194b0’ Size: 0 Blocks: 0 IO Block: 4096 regular empty file Device: 820h/2080d Inode: 3850487 Links: 1 Access: (0644/-rw-r--r--) Uid: ( 989/user) Gid: ( 33/www-data) Access: 2016-11-22 21:08:02.004170617 +0100 Modify: 2016-11-22 21:08:02.004170617 +0100 Change: 2017-04-24 15:25:02.045152537 +0200 Birth: -
So you see that the file is never modified but only "changed".
I'm not sure about the TTL feature therefore.
Updated by Fedir RYKHTIK over 7 years ago
Markus Klein wrote:
As you can see the lock files are all rather old.
So you see that the file is never modified but only "changed".
What I couldn't understand : how lock is working ?
If lock flock_* file exists = Lock. And after the end of the process execution, the file(flock_*) is removed. Or it runs in different way ?
As I could see in typo3/sysext/core/Classes/Locking/FileLockStrategy.php :
/**
* Release the lock
*
* @return bool Returns TRUE on success or FALSE on failure
*/
public function release()
{
if (!$this->isAcquired) {
return true;
}
$success = true;
if (is_resource($this->filePointer)) {
if (flock($this->filePointer, LOCK_UN) === false) {
$success = false;
}
fclose($this->filePointer);
}
$this->isAcquired = false;
return $success;
}
There is no remove of flock_* file in the end. Is it normal ? Why not to remove the file after lock release ? Potentially it creates conflicts, and there are no visibility of current locks usage.
Updated by Markus Klein over 7 years ago
Nope. This works differently. The lock is set on a resource (the file in this case), which is shared across all processes.
So the lock is a property of the file. The file itself is only the "carrier" of the property.
It is important to NOT remove the file on unlock, since there might be other processes hanging in a LOCK command (process is blocked by OS) on this resource/file. If you unlock the next process will continue and will get the lock.
If you remove the resource all processes will DIE as the resource they are waiting on is wiped away.
You may only remove the resource for a lock, if you KNOW that you (the current process) are the only one on that very resource at the moment, which is not decidable for PHP processes in general!
Usually, in an IPC program, the master process (where children are forked off from), would take care about resource management, hence creating the lock resource and free it once all child processes have terminated. We do not have such a master process in the PHP world (each request is handled by starting a new process), so they do not know about each other.
Therefore we must never delete a lock resource until the webserver is stopped.
Updated by Alexander Opitz over 7 years ago
On which filesystem does the lock directory run?
Updated by Fedir RYKHTIK over 7 years ago
Hi guys,
@Markus
Nope. This works differently. The lock is set on a resource (the file in this case), which is shared across all processes.
So the lock is a property of the file. The file itself is only the "carrier" of the property.
Hmm. If it's a property of the file, please could You tell me, how I could list all statuses of all files with an external PHP or Bash script ? locked / unlocked ?
@Alexander
On which filesystem does the lock directory run?
ext4
Best regards,
Fedir
Updated by Alexander Opitz over 7 years ago
Ok, ext4 shouldn't be a problem.
For Monitoring flock, you may get answer here: https://unix.stackexchange.com/questions/90506/monitoring-file-locks-locked-using-flock
The question would be, does a PHP process die before it unlocks?
Updated by Jelle Groenendal over 7 years ago
I'm running in to the exact same problem. We get it on 2 Apache servers with a load balancer.
Did you find a solution?
Updated by Sigfried Arnold about 7 years ago
- TYPO3 Version changed from 7 to 8
- PHP Version changed from 5.6 to 7.0
- Is Regression set to No
We are running into the exact same Problem. Apache dies and does not handle requests anymore.
CentOS Linux release 7.3.1611 (Core) running with PHP 7
Only solution for the moment is to restart Apache.
Is there a solution yet?
Updated by Markus Klein about 7 years ago
@Sigfried: Please add more information.
mod_php or php-fpm?
what does the php log say? Does php process die for some reason?
Usually this problem only occurs with stale locks caused by dying php processes.
Updated by Sigfried Arnold about 7 years ago
we use mod_php7 and according to dmesg there are no php proccesses that died recently
as for the apache access and error logs: there are just gaps, but nothing useful so far
Updated by Markus Klein about 7 years ago
I suppose your are running on a normal file system and not anything shared via NFS or the like?
Updated by Sigfried Arnold about 7 years ago
out of the box CentOS running on ext4
Updated by Markus Klein about 7 years ago
Apache dies and does not handle requests anymore.
What does "dies" mean exactly? Is the process killed? By whom?
Before this happens, is the CPU load 100% by apache?
Updated by Sigfried Arnold about 7 years ago
It looks like this:
top - 08:49:57 up 48 days, 18:00, 2 users, load average: 0,00, 0,01, 0,05 Tasks: 435 total, 1 running, 434 sleeping, 0 stopped, 0 zombie %Cpu(s): 0,0 us, 0,0 sy, 0,0 ni, 99,9 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st KiB Mem : 13174940+total, 816496 free, 10368488 used, 12056442+buff/cache KiB Swap: 4194300 total, 3614520 free, 579780 used. 11938673+avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 589 apache 20 0 586080 24392 11792 S 0,0 0,0 0:00.01 /usr/sbin/httpd 871 apache 20 0 817976 102568 61780 S 0,0 0,1 0:25.16 /usr/sbin/httpd 875 apache 20 0 793272 60592 43192 S 0,0 0,0 0:02.14 /usr/sbin/httpd 934 apache 20 0 796704 67000 44100 S 0,0 0,1 0:03.21 /usr/sbin/httpd 1423 apache 20 0 700300 30336 17388 S 0,0 0,0 0:00.27 /usr/sbin/httpd 1424 apache 20 0 797884 58056 34132 S 0,0 0,0 0:02.37 /usr/sbin/httpd 1562 apache 20 0 586276 29192 16464 S 0,0 0,0 0:00.05 /usr/sbin/httpd ...
Apache has lots of open processes (this list goes on) but non of them use any signficant cpu time or memory - also the system itself almost at zero CPU time
Updated by Markus Klein about 7 years ago
So this means all those processes are waiting on a lock.
Do you have a bit of knowledge on linux commandline?
If so you can try to follow https://unix.stackexchange.com/questions/90506/monitoring-file-locks-locked-using-flock and see if you can find out which apache process exactly has the lock. (typo3temp/var/locks/*)
You can try to kill that very process and see if the list of processes reduces vastly this way.
Updated by Susanne Moog about 7 years ago
- Category changed from Content Rendering to Locking / Session Handling
Updated by Sigfried Arnold about 7 years ago
- Category changed from Locking / Session Handling to Content Rendering
I found the issue - but not the source of the problem
There are pages, which got loads of entries in sys_file_references for a specific field:
SELECT pid, fieldname, COUNT(*) as count FROM grandhotel.sys_file_reference GROUP BY pid,fieldname order by count DESC;
In my example i get something like this
50 image 214572 232 assets 352 9 assets 98 9 image 96 103 image 75 138 image 72 138 assets 70 94 assets 42 109 assets 41 235 assets 40
The page with id 50 does not have 200k images - it is just a hand full and one images is copied over and over
When TYPO3 renders this page, it opens/accesses the image (can be seen in strace) over and over again an takes ages to finish on those pages (found this extreme behaviour on 3 independet instances so far.
After quite some time the process dies due to max_execution_time and sometimes does not unlock the files properly.
This accumulates for days - or if unlucky within a few hours, if the right pages were accessed - and viola, the lock files need to be removed or apache has to be restarted.
Updated by Markus Klein about 7 years ago
- Category changed from Content Rendering to Locking / Session Handling
Updated by Markus Klein about 7 years ago
@Arnold: Great you found an issue. Please open a dedicated report for this!
Updated by Alexander Opitz over 6 years ago
Is there something we can do at the moment here?
Updated by Sigfried Arnold over 6 years ago
For the past few months sometimes this occured on differend instances, servers and enviroments - RHEL, Ubuntu, CentOS - does not matter.
But they all have the behaviour explained in #82929 in common - if there are too many TYPO3 instances on a server with all having the same issue, the apache overloads in long running processes with endless image-reading.
So i guess this issue can be closed, since it is only a side effect of another issue.
Updated by Riccardo De Contardi over 6 years ago
- Related to Bug #82929: sys_file_references filled with duplicate/multiple entries added
Updated by Fedir RYKHTIK over 5 years ago
Here is an another version of cleaning script (releases all flocks older than 1 hour):
/usr/bin/find -O3 "/path/var/lock/" -ignore_readdir_race -depth -mindepth 1 -name 'flock_*' -type f -cmin +3600 -delete
Updated by Markus Klein over 5 years ago
- Status changed from Needs Feedback to Closed
Updated by Sybille Peters over 2 years ago
Current information about possible problem with TYPO3 locking based on flock and NFS:
- there is a known problem with NFS and flock (unrelated to TYPO3): https://docs.typo3.org/m/typo3/reference-coreapi/main/en-us/ApiOverview/LockingApi/Index.html#filelockstrategy-nfs
- TYPO3 uses flock by default for locking (FileLockStrategy)
- other alternatives can be used, for example provided by the extension https://github.com/b13/distributed-locks which uses Redis to store the locks
- TYPO3 cache can also be stored in Redis: https://docs.typo3.org/m/typo3/reference-coreapi/main/en-us/ApiOverview/CachingFramework/FrontendsBackends/Index.html#redis-backend
(added information because other people who have the problem or similar problem refer to this issue and removing lock files probably not the best solution)