Bug #84541

Uncaught TYPO3 Exception in indexed_search: duplicate key (hash) error

Added by Sybille Peters about 3 years ago. Updated 8 months ago.

Status:
New
Priority:
Should have
Assignee:
-
Category:
Indexed Search
Target version:
-
Start date:
2018-03-27
Due date:
% Done:

0%

Estimated time:
TYPO3 Version:
8
PHP Version:
Tags:
Complexity:
Is Regression:
Sprint Focus:

Description

I get an Exception with HTTP status code 503 on loading a page with the Exception message:

Oops, an error occurred!
An exception occurred while executing 'INSERT INTO `index_words` (`wid`, `baseword`, `metaphone`) VALUES (?, ?, ?)' with params [55602717, "images\/botschaften", "66411504"]: Duplicate entry '55602717' for key 'PRIMARY'

System

  • TYPO3 version 8.7.17
  • current master

Severity of error

Even if Configuration preset "Live" is selected, an error is still displayed in the frontend on first load of page. The page is not rendered. Doesn't happen for all pages, only for pages where the hash collision occurs. Still, it's pretty ugly.

Reproduce

  1. Activate the extension indexed_search
  2. Add static includes for indexed_search
  3. Add the following words to a new or existing content element: graf gettogethers abfluss erworbener
  4. Load the page in the frontend (you may have to do that twice)

Expected result

Page is displayed

Actual result

Exception message:

Uncaught TYPO3 Exception
An exception occurred while executing 'INSERT INTO `index_words` (`wid`, `baseword`, `metaphone`) VALUES (?, ?, ?)' with params [186135449, "gettogethers", "199699927"]: Duplicate entry '186135449' for key 'PRIMARY' 

Reproduce with test code

You can also reproduce the problem with this simple test script, which includes the original hash function from IndexedSearchUtility and some test strings

<?php

$str = ['graf', 'gettogethers', 'abfluss', 'erworbener'];

function md5inthash($stringToHash)
{
    return hexdec(substr(md5($stringToHash), 0, 7));
}

foreach ($str as $s) {
    print("string=$s hash=" . md5inthash($s) . "\n");
}

The first 2 strings and the third and forth string produce the same hash (which causes the collisions):
string=graf hash=186135449
string=gettogethers hash=186135449
string=abfluss hash=211412923
string=erworbener hash=211412923

Additional information

If you need to reproduce this several times, you can delete the index for the page by selecting the "Indexing" module, select the page with the content you inserted earlier, then select "Detailed statistics" and press the "Delete" garbage can.

You can also check the sys_log for similar errors:

 SELECT uid,FROM_UNIXTIME(tstamp),details FROM sys_log WHERE details LIKE '%index_word%' and error = 2 ORDER BY uid DESC LIMIT 10;

Cause of error

possibly: https://github.com/TYPO3/TYPO3.CMS/blob/8124407655ae73656bf6c21f6bc8841b8e1d2023/typo3/sysext/indexed_search/Classes/Indexer.php#L2134

(the link goes to a specific commit which may no longer reflect the current codebase)

hash collision?

Affected fields

These are the fields in the DB, that get filled using the current hash algorithm:

(all affected fields by looking at DB schema and content and guessing):

  • index_debug.phash
  • index_fulltext.phash
  • index_grlist.phash
  • index_grlist.phash_x
  • index_grlist.hash_gr_list
  • index_phash.phash
  • index_phash.phash_grouping
  • index_phash.contentHash
  • index_rel.phash
  • index_rel.wid
  • index_section.phash (see https://forge.typo3.org/issues/79802)
  • index_section.phash_t3
  • index_words.wid (see this issue)
  • index_words.metaphone

Currently however, only index_words.wid seems to be affected by the collisions.


Files

exception.png (86.5 KB) exception.png Sybille Peters, 2018-08-10 17:49
exception2.png (47.1 KB) exception2.png Sybille Peters, 2018-08-10 17:52
disable_fe.png (18.6 KB) disable_fe.png Sybille Peters, 2018-11-02 14:30

Related issues

Related to TYPO3 Core - Bug #79802: phash not uniqueNew2017-02-14

Actions
Related to TYPO3 Core - Bug #87138: indexed_search: Duplicate entry for key 'Primary' in index_relNew2018-12-12

Actions
Has duplicate TYPO3 Core - Bug #88557: Indexed Search: generates identical word ids for different wordsClosed2019-06-13

Actions
Is duplicate of TYPO3 Core - Bug #17619: INSERT-Error on table "index_words" on ORACLE and PostgreSQLClosed2007-09-19

Actions
#1

Updated by Sybille Peters about 3 years ago

The hash function IndexedSearchUtility::md5inthash() does indeed have collisions for e.g. the following words:

  • graf + gettogethers (md5inthash: 186135449)
  • erworbener + ablfluss (md5inthash: 211412923)

because only a substring of md5 is used. The md5 hash ist 32 chars long and only the first 7 chars are used (and then converted to int).

#2

Updated by Sybille Peters about 3 years ago

  • Description updated (diff)
#3

Updated by Gerrit Code Review about 3 years ago

  • Status changed from New to Under Review

Patch set 1 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/56470

#4

Updated by Gerrit Code Review about 3 years ago

Patch set 2 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/56470

#5

Updated by Gerrit Code Review about 3 years ago

Patch set 3 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/56470

#6

Updated by Gerrit Code Review about 3 years ago

Patch set 4 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/56470

#7

Updated by Gerrit Code Review about 3 years ago

Patch set 5 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/56470

#8

Updated by Gerrit Code Review about 3 years ago

Patch set 6 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/56470

#9

Updated by Gerrit Code Review about 3 years ago

Patch set 7 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/56470

#10

Updated by Gerrit Code Review about 3 years ago

Patch set 8 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/56470

#11

Updated by Sybille Peters about 3 years ago

  • Subject changed from Exception with duplicate key error in database for indexed_search to Exception with duplicate key (hash) error in database for indexed_search
#12

Updated by Sybille Peters about 3 years ago

#13

Updated by Sybille Peters almost 3 years ago

  • Assignee set to Sybille Peters
#14

Updated by Gerrit Code Review almost 3 years ago

Patch set 9 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/56470

#15

Updated by Gerrit Code Review almost 3 years ago

Patch set 10 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/56470

#16

Updated by Gerrit Code Review almost 3 years ago

Patch set 11 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/56470

#17

Updated by Sybille Peters almost 3 years ago

  • Subject changed from Exception with duplicate key (hash) error in database for indexed_search to Uncaught TYPO3 Exception in indexed_search: duplicate key (hash) error
#18

Updated by Sybille Peters almost 3 years ago

  • Tags set to exception_handling
#19

Updated by Sybille Peters over 2 years ago

#20

Updated by Sybille Peters over 2 years ago

Still reproducible with current 8.7.19-dev and master (9.4.0-dev).

#21

Updated by Sybille Peters over 2 years ago

  • Description updated (diff)
  • Tags deleted (exception_handling)
#22

Updated by Sybille Peters over 2 years ago

  • Description updated (diff)
#23

Updated by Sybille Peters over 2 years ago

Added more information to description:

  • Affected fields
  • PHP script to reproduce
#24

Updated by Sybille Peters over 2 years ago

  • Assignee deleted (Sybille Peters)
#25

Updated by Gerrit Code Review over 2 years ago

Patch set 12 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/56470

#26

Updated by Gerrit Code Review over 2 years ago

Patch set 13 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/56470

#27

Updated by Gerrit Code Review over 2 years ago

Patch set 14 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/56470

#28

Updated by Gerrit Code Review over 2 years ago

Patch set 1 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/58704

#29

Updated by Gerrit Code Review over 2 years ago

Patch set 2 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/58704

#30

Updated by Gerrit Code Review over 2 years ago

Patch set 1 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/58717

#31

Updated by Gerrit Code Review over 2 years ago

Patch set 2 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/58717

#32

Updated by Gerrit Code Review over 2 years ago

Patch set 3 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/58704

#33

Updated by Gerrit Code Review over 2 years ago

Patch set 3 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/58717

#34

Updated by Gerrit Code Review over 2 years ago

Patch set 4 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/58717

#35

Updated by Sybille Peters over 2 years ago

  • Sprint Focus set to On Location Sprint
#36

Updated by Gerrit Code Review over 2 years ago

Patch set 5 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/58717

#37

Updated by Gerrit Code Review over 2 years ago

Patch set 6 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/58717

#38

Updated by Gerrit Code Review over 2 years ago

Patch set 7 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/58717

#39

Updated by Gerrit Code Review over 2 years ago

Patch set 8 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/58717

#40

Updated by Gerrit Code Review over 2 years ago

Patch set 9 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/58717

#41

Updated by Sybille Peters over 2 years ago

Workaround:

If you are affected by this bug and waiting for a patch, there is a workaround: You can disable the Indexing in the Frontend:

To do this, change the extension configuration for indexed_search:"basic.disableFrontendIndexing".

Impact

Note: This does not disable the bug, but the effects are not as severe, because the Exceptions do not get thrown on FE rendering when the page is first loaded. If you do this however, you must activate some other method for indexing the pages, typically using the crawler via scheduler mechanism, see https://docs.typo3.org/typo3cms/extensions/indexed_search/IndexingConfigurations/CrawlerSetup/Index.html

#42

Updated by Gerrit Code Review over 2 years ago

Patch set 10 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/58717

#43

Updated by Gerrit Code Review over 2 years ago

Patch set 11 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/58717

#44

Updated by Gerrit Code Review over 2 years ago

Patch set 12 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/58717

#45

Updated by Gerrit Code Review over 2 years ago

Patch set 13 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/58717

#46

Updated by Sybille Peters over 2 years ago

  • Related to Bug #86491: Duplicate entry for PRIMARY key in cache_treelist added
#47

Updated by Gerrit Code Review over 2 years ago

Patch set 4 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/58704

#48

Updated by Gerrit Code Review over 2 years ago

Patch set 1 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/58853

#49

Updated by Gerrit Code Review over 2 years ago

Patch set 14 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/58717

#50

Updated by Gerrit Code Review over 2 years ago

Patch set 15 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/58717

#51

Updated by Gerrit Code Review over 2 years ago

Patch set 16 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/58717

#52

Updated by Alexander Schnitzler over 2 years ago

  • Related to deleted (Bug #86491: Duplicate entry for PRIMARY key in cache_treelist)
#53

Updated by Reinhard Hiebl over 2 years ago

  • Related to Bug #87138: indexed_search: Duplicate entry for key 'Primary' in index_rel added
#54

Updated by Reinhard Hiebl over 2 years ago

  • Related to Bug #87138: indexed_search: Duplicate entry for key 'Primary' in index_rel added
#55

Updated by Reinhard Hiebl over 2 years ago

  • Related to deleted (Bug #87138: indexed_search: Duplicate entry for key 'Primary' in index_rel)
#56

Updated by Gerrit Code Review about 2 years ago

Patch set 17 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/58717

#57

Updated by Gerrit Code Review about 2 years ago

Patch set 18 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at https://review.typo3.org/c/Packages/TYPO3.CMS/+/58717

#58

Updated by Jonas Eberle almost 2 years ago

  • Related to Bug #88557: Indexed Search: generates identical word ids for different words added
#59

Updated by Jonas Eberle almost 2 years ago

  • Related to deleted (Bug #88557: Indexed Search: generates identical word ids for different words)
#60

Updated by Jonas Eberle almost 2 years ago

  • Has duplicate Bug #88557: Indexed Search: generates identical word ids for different words added
#61

Updated by Sybille Peters over 1 year ago

  • Status changed from Under Review to New
#62

Updated by Sybille Peters over 1 year ago

Patch was abandoned.

#63

Updated by Wolfgang Wagner over 1 year ago

Problem still exists in 9.5.11 :-/

#64

Updated by Florian Rival over 1 year ago

I confirm, the problem is still present with Typo3 9.5.11

An exception occurred while executing 'INSERT INTO `index_words` (`wid`, `baseword`, `metaphone`) VALUES (?, ?, ?)' 
with params [108983540, "actors", "116730212"]: Duplicate entry '108983540' for key 'PRIMARY'
#65

Updated by Arne Bracht Bracht over 1 year ago

Bug is also in 9.5.13 and the workarround with crawler ist not useable annymore. The crawler ist for Version TYPO3 8 a the moment

#66

Updated by Susanne Moog over 1 year ago

  • Sprint Focus deleted (On Location Sprint)
#67

Updated by Sven Burkert over 1 year ago

  • Is duplicate of Bug #17619: INSERT-Error on table "index_words" on ORACLE and PostgreSQL added
#68

Updated by Peter Linzenkirchner about 1 year ago

Bug is also in TYPO3 9.7.13. Flooding the log - only possibilty seems to deactivate indexed_search.

#69

Updated by Michael Sollmann about 1 year ago

Can confirm this for 9.5.13 too. In my opinion it needs to be fixed urgently cause it regularly crashes the frontend in production environments.

#70

Updated by Sybille Peters about 1 year ago

@Michael can you check the workaround in https://forge.typo3.org/issues/84541#note-41 and see if this helps for you (not as a solution, as a temporary workaround)

#71

Updated by Riccardo De Contardi about 1 year ago

I add here the description of #88557 to keep track of it

Indexed Search: generates identical word ids for different words

Description

Because of the method used for hashing it, there are cases when a word id (wid) is a duplicate of the hash for a different word.
And the following error occurs because wid is primary key in the index_words table:

Core: Exception handler (WEB): Uncaught TYPO3 Exception: An exception occurred while executing ‘INSERT INTO `index_words` (`wid`, `baseword`, `metaphone`) VALUES (?, ?, ?)’ with params [36468906, “nussholz”, “76382384"]: Duplicate entry ‘36468906’ for key ‘PRIMARY’ | Doctrine\DBAL\Exception\UniqueConstraintViolationException thrown in file deploy/releases/177/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/AbstractMySQLDriver.php in line 66.

Test case:

\TYPO3\CMS\IndexedSearch\Utility\IndexedSearchUtility::md5inthash('nam')
and
\TYPO3\CMS\IndexedSearch\Utility\IndexedSearchUtility::md5inthash('nussholz')

will both return 36468906

#72

Updated by a d 10 months ago

Why not simply change the IndexedSearchUtility::md5inthash() function to something which doesn't produce collisions?

This works perfectly - always produces an int32 (signed long) on 32 and 64 bit platforms and works with the existing database int columns:

    public static function md5inthash($stringToHash)
    {
        return current(unpack('l', pack('l', crc32($stringToHash))));
    }
#73

Updated by Sybille Peters 10 months ago

If only it were that simple ...

I think this boils down to 2 questions:

1. Is there a collision-free hash function with fixed output size (32 bit) which produces no collisions for variable arbitrary input?
2. Is crc32() a good solution for a hash function for result in fixed length (32 bit) to produce hashes (to be used for key /value storage or checking duplicates, not for cryptography) with low probability of collision?

about 1: I think not. It does not even matter that much if the result is stored in a 32 bit field, 64bit or whatever. Using a better hash function or different handling of the result hash or a larger results datatype may make the collision less likely but will not completely eliminate the possibilities of collisions.

With collisions, we mean producing the same output hash for 2 different strings as input.

Correct me, if I am wrong.

See

About question 2: I really don't know, maybe, see e.g.

Still, your approach is good if it decreases the possibilities of collisions and what I've seen, the original function chops off part of the hash and yours does not so that should be an improvement.

Additionally, it might be a good solution to handle the possibility of a collision occurring gracefully and not throwing an exception.

It think this is possible, you (or anyone else for that matter) are welcome to try.

More information:

  • Create a patch (in Contribution Guide)
  • you can ask for input about the topic or get help with contribution in the #typo3-cms-coredev channel on Slack
  • I think Oliver Hader had some good ideas the last time I raised the question but I can't really remember what exactly he said. It has been a while.

I will not work on this any further, I abandoned my patch, the solution was flawed.

#74

Updated by Tomas Norre Mikkelsen 9 months ago

What about adding a timestamp to the hash-generation after the


\TYPO3\CMS\IndexedSearch\Utility\IndexedSearchUtility::md5inthash('nam' . time())
\TYPO3\CMS\IndexedSearch\Utility\IndexedSearchUtility::md5inthash('nussholz' . time())

Then the likelihood of a collision would be smaller, at the same time then improve the hash itself, then we have to optimizations that will decrease the possibility for a collision.

#75

Updated by Sybille Peters 9 months ago

As I understand it, the hash should be reproducible - it should give consistent results. If you add a timestamp, you get different hashes for the same input. This bloats up the number of entries and is - in this implementation - also a problem because the hash is used to reference between tables. (e.g. index_words.wid < => index_rel.wid).

The hash in indexed_search is used as identifier and is created to have a "short" version of one or more (usually) longer data entries.

Search for md5inthash in indexed_search and look at relation of database tables including wid, phash etc.

#76

Updated by Tomas Norre Mikkelsen 9 months ago

I see.. That I didn't think about.

#77

Updated by David Bruchmann 8 months ago

I had the same issue but with another table `sys_file_processedfiles`.
The content of that table I had to delete and afterwards I created an autoindex for the field `uid`.
Afterwards the problem was gone.

As the content of indexed_search can be written new by indexing it won't be a problem to delete the content here too.
Else the uid and relations of data-row 0 or 1 had to be changed manually, it's not worth it if the content can be created easily new like in these cases.
I never thought about which field to change for indexed_search, and how in detail though.

Also available in: Atom PDF