Project

General

Profile

Actions

Bug #80804

closed

Apache precesses overload because of typo3temp/locks/flock_

Added by Fedir RYKHTIK about 7 years ago. Updated over 1 year ago.

Status:
Closed
Priority:
Should have
Assignee:
-
Category:
Locking / Session Handling
Target version:
-
Start date:
2017-04-11
Due date:
% Done:

0%

Estimated time:
TYPO3 Version:
8
PHP Version:
7.0
Tags:
Complexity:
hard
Is Regression:
No
Sprint Focus:

Description

Hi there,

We have a really strange problem with locks : sometimes, Apache is completely saturated, too much processes, looks like, because of TYPO3 locking system.

The bug appear really sometimes, but now more and more regularly, 1 time / day. Really annoying bug, it puts all sites on the server down.

Actual solutions : or to remove all lock files from typo3temp, or to restart Apache.

Here are some strace logs :

  1. strace -p 5886

Process 5886 attached

restart_syscall(<... resuming interrupted call ...>) = 0

open("/home/www/mysite.com/www/typo3temp/locks/flock_b3b32a2d422265cd25c3323ed0157f81", O_WRONLY|O_CREAT, 0666) = 5
fstat(5, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
lseek(5, 0, SEEK_CUR) = 0
stat("/home/www/mysite.com/www/typo3temp/locks/flock_b3b32a2d422265cd25c3323ed0157f81", {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
chmod("/home/www/mysite.com/www/typo3temp/locks/flock_b3b32a2d422265cd25c3323ed0157f81", 0664) = -1 EPERM (Operation not permitted)
flock(5, LOCK_EX) = 0

open("/home/www/mysite.com/www/typo3temp/locks/flock_7bfc94f72787757bc18f8d24791673b2", O_WRONLY|O_CREAT, 0666) = 134
fstat(134, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
lseek(134, 0, SEEK_CUR) = 0
close(135) = 0
stat("/home/www/mysite.com/www/typo3temp/locks/flock_7bfc94f72787757bc18f8d24791673b2", {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
chmod("/home/www/mysite.com/www/typo3temp/locks/flock_7bfc94f72787757bc18f8d24791673b2", 0664) = 0
flock(134, LOCK_EX|LOCK_NB) = -1 EAGAIN (Resource temporarily unavailable)
flock(5, LOCK_UN) = 0
close(5) = 0
nanosleep({0, 100000000}, NULL) = 0

open("/home/www/mysite.com/www/typo3temp/locks/flock_b3b32a2d422265cd25c3323ed0157f81", O_WRONLY|O_CREAT, 0666) = 5
fstat(5, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
lseek(5, 0, SEEK_CUR) = 0
stat("/home/www/mysite.com/www/typo3temp/locks/flock_b3b32a2d422265cd25c3323ed0157f81", {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
chmod("/home/www/mysite.com/www/typo3temp/locks/flock_b3b32a2d422265cd25c3323ed0157f81", 0664) = -1 EPERM (Operation not permitted)
flock(5, LOCK_EX) = 0

open("/home/www/mysite.com/www/typo3temp/locks/flock_7bfc94f72787757bc18f8d24791673b2", O_WRONLY|O_CREAT, 0666) = 135
fstat(135, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
lseek(135, 0, SEEK_CUR) = 0
close(134) = 0
stat("/home/www/mysite.com/www/typo3temp/locks/flock_7bfc94f72787757bc18f8d24791673b2", {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
chmod("/home/www/mysite.com/www/typo3temp/locks/flock_7bfc94f72787757bc18f8d24791673b2", 0664) = 0
flock(135, LOCK_EX|LOCK_NB) = -1 EAGAIN (Resource temporarily unavailable)
flock(5, LOCK_UN) = 0
close(5) = 0
nanosleep({0, 100000000}, NULL) = 0

...

Best regards,
Fedir


Related issues 1 (0 open1 closed)

Related to TYPO3 Core - Bug #82929: sys_file_references filled with duplicate/multiple entriesClosed2017-11-06

Actions
Actions #1

Updated by Fedir RYKHTIK about 7 years ago

  • Description updated (diff)
Actions #2

Updated by Markus Klein about 7 years ago

Hi Fedir!

Thanks for your report. A few things I can read from your posting:

  • /home/www/mysite.com/www/typo3temp/locks/flock_* seems not to have correct permissions as the chmod fails with EPERM
  • The core tries to get an exclusive and non-blocking lock on flock_7bfc94f72787757bc18f8d24791673b2, which is intended. The lock is locked and hence EAGAIN is returned

I conclude that you have an ongoing page generation running that takes a serious long time or maybe never ends.
This causes all apache requests to pile up in the queue as they all wait for this page to finish generation, so they can deliver it.

What is your server setup?
Are you using php-fpm or mod_php?
Which mpm are you using?

We successfully run our servers with php-fpm and mpm_worker/mpm_event.

Actions #3

Updated by Fedir RYKHTIK about 7 years ago

Hey Markus !

Thank You very much for Your answer.

Here is information about my setup :

Are you using php-fpm or mod_php?

Apache2 mod_php5

What is your server setup?

Debian Jessie 8.7
PHP 5.6.30-0+deb8u1 (cli) (built: Feb 8 2017 08:50:21)
Server version: Apache/2.4.10 (Debian)

Which mpm are you using?

apache2-mpm-itk 2.4.10

One important stuff : web user uses permissions different from CLI user (for security reasons).

I conclude that you have an ongoing page generation running that takes a serious long time or maybe never ends.

How could I verify which pages are being generated ?

Best regards,
Fedir

Actions #4

Updated by Markus Klein about 7 years ago

  • Category changed from System/Bootstrap/Configuration to Content Rendering
  • Status changed from New to Needs Feedback

apache2-mpm-itk 2.4.10

I have never used this and it is based on mpm-prefork, which is known to be the worst choice nowadays.
I can only recommend to get rid of any third party code to be executed in the context of apache, that's why there is fcgi.

One important stuff : web user uses permissions different from CLI user (for security reasons).

Fair enough, still you need to make sure that both can properly read/write the locks directory if functionality is executed that requires locks.

How could I verify which pages are being generated ?

There are actually 2 locks involved in the page generation. One for the so called "pagesection" and one for the "page".
The key for the pagesection lock is based on the requested page UID and optionally a mountpoint-info.
The key for the page lock is created by \TYPO3\CMS\Frontend\Controller\TypoScriptFrontendController::getLockHash and includes the page uid as well.

All locks in FE rendering process are managed by \TYPO3\CMS\Frontend\Controller\TypoScriptFrontendController::acquireLock

Keep in mind that actually there should never be a large queue of waiting clients, since there is one additional thing involved:
Whenever one process is generating a page, it first fills the well-known "page is being generated... wait 30 seconds..." content into the cache and releases the pagesection lock. This way all waiting processes will display this content to the client and will therefore go away from the queue.

I have only one idea, which I would classify as a possible reason for what you see: Your apache process crashes in the middle of the page generation, hence your "pages" lock file remains on the disk in a locked state.

Actions #5

Updated by Fedir RYKHTIK about 7 years ago

Hey Markus !

apache2-mpm-itk 2.4.10

I have never used this and it is based on mpm-prefork, which is known to be the worst choice nowadays.
I can only recommend to get rid of any third party code to be executed in the context of apache, that's why there is fcgi.

It's quite stable module, and quite handy in our case. I think we will continue to use it on the server where the bug appears.

One important stuff : web user uses permissions different from CLI user (for security reasons).

Fair enough, still you need to make sure that both can properly read/write the locks directory if functionality is executed that requires locks.

I checked, permissions on place were good. But problem still existed.

How could I verify which pages are being generated ?

There are actually 2 locks involved in the page generation. One for the so called "pagesection" and one for the "page".
The key for the pagesection lock is based on the requested page UID and optionally a mountpoint-info.
The key for the page lock is created by \TYPO3\CMS\Frontend\Controller\TypoScriptFrontendController::getLockHash and includes the page uid as well.

All locks in FE rendering process are managed by \TYPO3\CMS\Frontend\Controller\TypoScriptFrontendController::acquireLock

Not so easy to debug the hash on production with thousand of requests per minute.

Keep in mind that actually there should never be a large queue of waiting clients, since there is one additional thing involved:
Whenever one process is generating a page, it first fills the well-known "page is being generated... wait 30 seconds..." content into the cache and releases the pagesection lock. This way all waiting processes will display this content to the client and will therefore go away from the queue.

OK, thanks for the hint.

I have only one idea, which I would classify as a possible reason for what you see: Your apache process crashes in the middle of the page generation, hence your "pages" lock file remains on the disk in a locked state.

Probably Apache, probably PHP, probably MySQL, could be different reasons.

The problem, what dead locks are not "burried". I think should be in place some locks TTL : if the TTL of the lock is too big, it should be released, to avoid infinitive looping.

Anyway, I made following workaround at the moment, we could find all old locks :

find /var/www/html/typo3temp/locks/flock_* not -newermt '-600 seconds' -printf "%p %TY%Tm-%Td %TH:%TM:%TS %Tz\n"

And even we could remove old locks by an external script, each 10 minutes :

/10 * * * * find /var/www/html/typo3temp/locks/flock_ -not -newermt '-600 seconds' -delete

Further proposition could be : to integrate flocks locks TTL into the Core, with default value 3600 seconds.

If lock exists and it's really really old, it should be removed, as it's not possible / normal, what locks exists after long time.

What do You think about such adjustment of the locks logic ?

Best regards,
Fedir

Actions #6

Updated by Markus Klein about 7 years ago

I have to think about this.. a lot.

The creation time of a lock file says usually nothing about its lock state. The resource is created once and used over and over again. To give you an idea of a website of ours, which is also quite busy:

-rw-r--r--  1 user www-data    0 Nov 22 21:04 flock_1f2d180d4f0e10e95f64f6dd50c1d4cf
-rw-r--r--  1 user www-data    0 Dec 26 09:32 flock_3bce63d33f7c0032aa0aae07e78d7acc
-rw-r--r--  1 user www-data    0 Dec  7 14:31 flock_9c9007f651dd683b88f9913f91ccc017
-rw-r--r--  1 user www-data    0 Nov 22 21:07 flock_b3b32a2d422265cd25c3323ed0157f81
-rw-r--r--  1 user www-data    0 Nov 22 21:08 flock_c0a28f04f029116df131f61bb0c194b0
-rw-r--r--  1 user www-data    0 Feb  7 16:01 flock_d0f5321c3c5b903d5ddf6d588d429d6b
-rw-r--r--  1 user www-data    0 Mar 13 10:19 flock_f5411b68e96cb15c2dc9772543729e6d

As you can see the lock files are all rather old.

$ stat flock_c0a28f04f029116df131f61bb0c194b0
  File: ‘flock_c0a28f04f029116df131f61bb0c194b0’
  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
Device: 820h/2080d      Inode: 3850487     Links: 1
Access: (0644/-rw-r--r--)  Uid: (  989/user)   Gid: (   33/www-data)
Access: 2016-11-22 21:08:02.004170617 +0100
Modify: 2016-11-22 21:08:02.004170617 +0100
Change: 2017-04-24 15:25:02.045152537 +0200
 Birth: -

So you see that the file is never modified but only "changed".

I'm not sure about the TTL feature therefore.

Actions #7

Updated by Fedir RYKHTIK about 7 years ago

Markus Klein wrote:

As you can see the lock files are all rather old.
So you see that the file is never modified but only "changed".

What I couldn't understand : how lock is working ?

If lock flock_* file exists = Lock. And after the end of the process execution, the file(flock_*) is removed. Or it runs in different way ?

As I could see in typo3/sysext/core/Classes/Locking/FileLockStrategy.php :

/**
 * Release the lock
 *
 * @return bool Returns TRUE on success or FALSE on failure
*/
public function release() {
if (!$this->isAcquired) {
return true;
}
$success = true;
if (is_resource($this->filePointer)) {
if (flock($this->filePointer, LOCK_UN) === false) {
$success = false;
}
fclose($this->filePointer);
}
$this->isAcquired = false;
return $success;
}

There is no remove of flock_* file in the end. Is it normal ? Why not to remove the file after lock release ? Potentially it creates conflicts, and there are no visibility of current locks usage.

Actions #8

Updated by Markus Klein about 7 years ago

Nope. This works differently. The lock is set on a resource (the file in this case), which is shared across all processes.
So the lock is a property of the file. The file itself is only the "carrier" of the property.

It is important to NOT remove the file on unlock, since there might be other processes hanging in a LOCK command (process is blocked by OS) on this resource/file. If you unlock the next process will continue and will get the lock.
If you remove the resource all processes will DIE as the resource they are waiting on is wiped away.

You may only remove the resource for a lock, if you KNOW that you (the current process) are the only one on that very resource at the moment, which is not decidable for PHP processes in general!

Usually, in an IPC program, the master process (where children are forked off from), would take care about resource management, hence creating the lock resource and free it once all child processes have terminated. We do not have such a master process in the PHP world (each request is handled by starting a new process), so they do not know about each other.
Therefore we must never delete a lock resource until the webserver is stopped.

Actions #9

Updated by Alexander Opitz about 7 years ago

On which filesystem does the lock directory run?

Actions #10

Updated by Fedir RYKHTIK almost 7 years ago

Hi guys,

@Markus

Nope. This works differently. The lock is set on a resource (the file in this case), which is shared across all processes.
So the lock is a property of the file. The file itself is only the "carrier" of the property.

Hmm. If it's a property of the file, please could You tell me, how I could list all statuses of all files with an external PHP or Bash script ? locked / unlocked ?

@Alexander

On which filesystem does the lock directory run?

ext4

Best regards,
Fedir

Actions #11

Updated by Alexander Opitz almost 7 years ago

@Fedir RYKHTIK

Ok, ext4 shouldn't be a problem.
For Monitoring flock, you may get answer here: https://unix.stackexchange.com/questions/90506/monitoring-file-locks-locked-using-flock

The question would be, does a PHP process die before it unlocks?

Actions #12

Updated by Jelle Groenendal almost 7 years ago

I'm running in to the exact same problem. We get it on 2 Apache servers with a load balancer.
Did you find a solution?

Actions #13

Updated by Sigfried Arnold over 6 years ago

  • TYPO3 Version changed from 7 to 8
  • PHP Version changed from 5.6 to 7.0
  • Is Regression set to No

We are running into the exact same Problem. Apache dies and does not handle requests anymore.

CentOS Linux release 7.3.1611 (Core) running with PHP 7

Only solution for the moment is to restart Apache.

Is there a solution yet?

Actions #14

Updated by Markus Klein over 6 years ago

@Sigfried: Please add more information.

mod_php or php-fpm?
what does the php log say? Does php process die for some reason?

Usually this problem only occurs with stale locks caused by dying php processes.

Actions #15

Updated by Sigfried Arnold over 6 years ago

we use mod_php7 and according to dmesg there are no php proccesses that died recently

as for the apache access and error logs: there are just gaps, but nothing useful so far

Actions #16

Updated by Markus Klein over 6 years ago

I suppose your are running on a normal file system and not anything shared via NFS or the like?

Actions #17

Updated by Sigfried Arnold over 6 years ago

out of the box CentOS running on ext4

Actions #18

Updated by Markus Klein over 6 years ago

Apache dies and does not handle requests anymore.

What does "dies" mean exactly? Is the process killed? By whom?
Before this happens, is the CPU load 100% by apache?

Actions #19

Updated by Sigfried Arnold over 6 years ago

It looks like this:

top - 08:49:57 up 48 days, 18:00,  2 users,  load average: 0,00, 0,01, 0,05
Tasks: 435 total,   1 running, 434 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0,0 us,  0,0 sy,  0,0 ni, 99,9 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
KiB Mem : 13174940+total,   816496 free, 10368488 used, 12056442+buff/cache
KiB Swap:  4194300 total,  3614520 free,   579780 used. 11938673+avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  589 apache    20   0  586080  24392  11792 S   0,0  0,0   0:00.01 /usr/sbin/httpd
  871 apache    20   0  817976 102568  61780 S   0,0  0,1   0:25.16 /usr/sbin/httpd
  875 apache    20   0  793272  60592  43192 S   0,0  0,0   0:02.14 /usr/sbin/httpd
  934 apache    20   0  796704  67000  44100 S   0,0  0,1   0:03.21 /usr/sbin/httpd
 1423 apache    20   0  700300  30336  17388 S   0,0  0,0   0:00.27 /usr/sbin/httpd
 1424 apache    20   0  797884  58056  34132 S   0,0  0,0   0:02.37 /usr/sbin/httpd
 1562 apache    20   0  586276  29192  16464 S   0,0  0,0   0:00.05 /usr/sbin/httpd
...

Apache has lots of open processes (this list goes on) but non of them use any signficant cpu time or memory - also the system itself almost at zero CPU time

Actions #20

Updated by Markus Klein over 6 years ago

So this means all those processes are waiting on a lock.

Do you have a bit of knowledge on linux commandline?
If so you can try to follow https://unix.stackexchange.com/questions/90506/monitoring-file-locks-locked-using-flock and see if you can find out which apache process exactly has the lock. (typo3temp/var/locks/*)
You can try to kill that very process and see if the list of processes reduces vastly this way.

Actions #21

Updated by Susanne Moog over 6 years ago

  • Category changed from Content Rendering to Locking / Session Handling
Actions #22

Updated by Sigfried Arnold over 6 years ago

  • Category changed from Locking / Session Handling to Content Rendering

I found the issue - but not the source of the problem

There are pages, which got loads of entries in sys_file_references for a specific field:

SELECT pid, fieldname, COUNT(*) as count FROM grandhotel.sys_file_reference GROUP BY pid,fieldname order by count DESC;

In my example i get something like this

50    image    214572
232    assets    352
9    assets    98
9    image    96
103    image    75
138    image    72
138    assets    70
94    assets    42
109    assets    41
235    assets    40

The page with id 50 does not have 200k images - it is just a hand full and one images is copied over and over

When TYPO3 renders this page, it opens/accesses the image (can be seen in strace) over and over again an takes ages to finish on those pages (found this extreme behaviour on 3 independet instances so far.

After quite some time the process dies due to max_execution_time and sometimes does not unlock the files properly.

This accumulates for days - or if unlucky within a few hours, if the right pages were accessed - and viola, the lock files need to be removed or apache has to be restarted.

Actions #23

Updated by Markus Klein over 6 years ago

  • Category changed from Content Rendering to Locking / Session Handling
Actions #24

Updated by Markus Klein over 6 years ago

@Arnold: Great you found an issue. Please open a dedicated report for this!

Actions #25

Updated by Sigfried Arnold over 6 years ago

Opened a new ticket #82929

Actions #26

Updated by Alexander Opitz about 6 years ago

Is there something we can do at the moment here?

Actions #27

Updated by Sigfried Arnold almost 6 years ago

For the past few months sometimes this occured on differend instances, servers and enviroments - RHEL, Ubuntu, CentOS - does not matter.

But they all have the behaviour explained in #82929 in common - if there are too many TYPO3 instances on a server with all having the same issue, the apache overloads in long running processes with endless image-reading.

So i guess this issue can be closed, since it is only a side effect of another issue.

Actions #28

Updated by Riccardo De Contardi almost 6 years ago

  • Related to Bug #82929: sys_file_references filled with duplicate/multiple entries added
Actions #29

Updated by Fedir RYKHTIK about 5 years ago

Here is an another version of cleaning script (releases all flocks older than 1 hour):

/usr/bin/find -O3 "/path/var/lock/" -ignore_readdir_race -depth -mindepth 1 -name 'flock_*' -type f -cmin +3600 -delete
Actions #30

Updated by Markus Klein about 5 years ago

  • Status changed from Needs Feedback to Closed
Actions #31

Updated by Sybille Peters over 1 year ago

Current information about possible problem with TYPO3 locking based on flock and NFS:

(added information because other people who have the problem or similar problem refer to this issue and removing lock files probably not the best solution)

Actions

Also available in: Atom PDF