This is still an ongoing investigation, but I will summarize what I found:
Initial problem:¶
Error due to exception "Lock wait timeout" in Doctrine shown both in frontend (as "Oops" and backend) and visible in logs. All occurences were on rootline cache.
System¶
TYPO3: 9.5.13
DB-Server: MariaDB
pages: 38791 (not deleted, sys_language_uid=0)
typical number of entries (these vary, of course)
- cf_cache_rootline: 15347
- cf_cache_rootline_tags: 112814
Findings¶
- Some occurrences in slow query log. All are similar database statement that take several seconds up to 30 minutes (!!!) in one extreme case:
DELETE tags2, cache1 FROM cf_cache_rootline_tags AS tags1
JOIN cf_cache_rootline_tags AS tags2 ON tags1.identifier = tags2.identifier
JOIN cf_cache_rootline AS cache1 ON tags1.identifier = cache1.identifier
WHERE tags1.tag IN ('pageId_3');
The tag is not always the same. Here, it is the start page (page 3)
- the occurences in the slow query log more or less correlate with the errors
- there are almost 40000 pages in the system with a deep page depth of more than 8 in some cases
- some of the entries in cf_cache_rootline.content are quite large (up to 40KB). Not only is rootline information for all parent pages written into this, including Page TSconfig
- The tags in cf_rootline_cache_tags contain tags for page Ids, e.g. "pageId_3" for page id 3. All child pages contain the tag for their parents. So if the above DELETE + double-JOIN is executed on tag 'pageId_3' this means that practically all rootline cache and rootline cache tag entries will be affected in one big scoop.
- On trying to track down error: after saving the start page, some more errors occurred. I looked at show processlist on DB server and saw both a DELETE / JOIN statement on rootline cache and several INSERTs simultaneously
Explanation¶
This is a preliminary explanation as I am not sure of all facts yet.
The long running DELETE cf_cache_rootline, cf_cache_rootline_tags affects a lot of (or almost all rows) and would lock out simulaneous SQL statements (on rootline cache tables). On a lot of occasions a rootline cache entry is filled (or queried) so this is pretty likely (simultaneous edit and page load operations).
Open questions¶
So what is the reason the DELETE takes so long?
Is it the problem that such a DELETE statement gets created (affecting so many rows at once) or should the DB server be able to handle this without problems?
Is it a combination of a lot of pages, a lot of rootline and tags entries being deleted simultaneously? Not enough memory available on DB server?
Where should I look next:
- DB server?
- TYPO3 configuration?
- TYPO3 itself?
Generally, deleting practically all
Changes¶
The problem no longer occurs now. I made these changes. This has (hopefully) defused the situation:
- Assigned more RAM to Innodb buffer cache
- Update MariaDB server to latest version
- Enabled 'compression' for rootline cache
- Set lifetime on rootline cache to 24h (instead of 30 days which is the default).
- removed some clearCacheCmd in page TSconfig that would result in clearing the start page with a DELETE of practically all rootline entries (effectively triggering the situation described above)
- removed page TSconfig from start page (that was only intended temporarily and was moved to sitepackage) to prevent it from being written to cf_cache_rootline.content of all subpages.
I consider this currently a workaround until I am more sure what to do long term.