Task #72037: Truncate indexed_search tables before indexing with crawler - TYPO3 Core - TYPO3 Forge

Actions

Copy link

Task #72037

closed

Truncate indexed_search tables before indexing with crawler

Added by Sven Burkert almost 9 years ago. Updated 5 months ago.

Status:

Closed

Priority:

Should have

Assignee:

Category:

Indexed Search

Target version:

Candidate for Major Version

Start date:

2015-12-03

Due date:

% Done:

Estimated time:

TYPO3 Version:

PHP Version:

Tags:

Complexity:

Sprint Focus:

Description

Ext. indexed_search should provide a scheduler job for truncating all the indexed_search tables:

TRUNCATE TABLE index_debug;
TRUNCATE TABLE index_fulltext;
TRUNCATE TABLE index_grlist;
TRUNCATE TABLE index_phash;
TRUNCATE TABLE index_rel;
TRUNCATE TABLE index_section;
TRUNCATE TABLE index_stat_search;
TRUNCATE TABLE index_stat_word;
TRUNCATE TABLE index_words;
UPDATE index_config SET timer_next_indexing = 0;

As next step, the whole page is indexed again with ext. crawler. But before, the indexed contents has to be deleted because of deleted and hidden records, which should vanish out of the search results.

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by Benni Mack almost 9 years ago

Status changed from New to Needs Feedback

Can you elobarate the benefits here? A recurring task that clears the whole index? I'd rather go with a button in some indexed search module so an admin can clear them.

Actions

Copy link

Updated by Sven Burkert almost 9 years ago

Truncating the indexing tables also removes the deleted contents.
If you do not truncate the tables before running over the site with the crawler (ext. crawler), then you index all contents, but the deleted or hidden contents remain in the indexing tables and the links to these contents are shown in indexed_search search results.

That means, every time before indexing the site with ext. crawler, the indexing tables has to be truncated.

Actions

Copy link

Updated by Alexander Opitz over 8 years ago

Hi Sven, I can't confirm this behavior if using the crawler via the scheduler.

How do you run indexed search?

Actions

Copy link

Updated by Sven Burkert over 8 years ago

Ok, let's expect you use ext. "news" and want to index the news detail pages.
So I create a Crawler Configuration record on the news detail page.
Furthermore, I create two scheduler jobs:
1) Crawler queue
2) Crawler Run

These two jobs are indexing the news. Without the job which truncates the indexed contents, meanwhile deleted or hidden news records are still in the index.

Do you have a TYPO3 setup where older indexed contents are deleted or somehow invalidated?

Actions

Copy link

Updated by Alexander Opitz about 8 years ago

Does http://xavier.perseguers.ch/tutoriels/typo3/articles/indexed-search-crawler.html can help you?

Actions

Copy link

Updated by Sven Burkert about 8 years ago

This link doesn't help, but it also recommends to truncate the tables:

I suggest to empty indexing and crawler tables before each task, this prevents many side effects[...]

Actions

Copy link

Updated by Alexander Opitz about 8 years ago

Status changed from Needs Feedback to New
Target version set to Candidate for Major Version

Ok, found more blogs and forums which all provide this as solution.

In my installations I do not have this issue, but there don't change much. All others use solr.

So there would be 2 possible ways to solve this.

1) Add a way to truncate tables. Fast to implement but the index isn't fully available till next complete crawling (for each config).
- Fast to implement
- Most hassle if something do not work after truncating
- Not complete everytime
2.) Task which can remove entries for deleted pages/contents for hit (best not as task rather on editor action).
- Hard to implement
- Should be cleanest solution

Actions

Copy link

Updated by Alexander Opitz about 8 years ago

Any comments or other solutions?

Actions

Copy link

Updated by Benni Mack about 8 years ago

well, I would consider this a feature to have a scheduler task to truncate the index for a certain config, this way an admin can choose to do so.

Alternatively, the indexing config has an additional option to flush the index before crawling.

Both options are related to the quick fix, the clean way really seems hard (you also have deleted/hidden/starttime/endtime stuff to consider).

Actions

Copy link

#10

Updated by Tymoteusz Motylewski about 8 years ago

Possible solutions:
1. truncate table
- it leaves index empty for some time

2. trigger removal of the items from index on data change (e.g. record was updated) or something
- it's tricky to implement to cover all cases, but maybe we can do a simple hook to remove stuff from index on most common cases like hiding/removing/restricting... a page.

3. Have a "garbage collector" task which runs query like "delete all records in the index which were last indexed 2 days ago"
(the 2days time being configurable value). Then you can schedule this as a scheduler task running after you recrawl the whole page and removes data which were not seen any more.
- with this solution, there still be a period of time where wrong records are available

Gentlemen, which solution do you think will suite your needs?

Actions

Copy link

#11

Updated by Sven Burkert about 8 years ago

Perhaps this is also a solution:

4.) Check for every indexed page/record, if this record is still visible (that means, not deleted, not hidden, no starttime in future, no endtime reached). If not, delete the entries for this one only.

But I am unsure, what happens, if this record becomes visible again. Is the url for this record put in queue again and is it indexed on the next run?

Actions

Copy link

#12

Updated by Sybille Peters over 6 years ago

I don't think truncating the index is a good solution because it will leave the index empty / incomplete until everything is reindexed. The other solutions as proposed by Tymoteusz sound like a better idea.

Actions

Copy link

#13

Updated by David Henninger over 6 years ago

Yes, truncating is a bad hack. I am all for a TTL for indexes solution 3: If a page doesn't get indexed regularly, it gets thrown away.

For now I have to remove entries manually or start a full reindex after truncating...

Actions

Copy link

#14

Updated by Uwe Wiebach about 6 years ago

+1 for proposed solution 3.

Actions

Copy link

#15

Updated by Benjamin Robinson almost 4 years ago

Related to Bug #67249: Indexed search do not delete hidden records added

Actions

Copy link

#16

Updated by Benjamin Robinson almost 4 years ago

Also +1 for proposal 3

Actions

Copy link

#17

Updated by Sven Teuber over 3 years ago

We, too, struggle with indexed search displaying old content that has long been edited out or deleted (indexed_search 9.5.28), and Google/Stackoverflow seem to agree that this is (still) quite common.

The best solution in a frontend indexing scenario (which we used until now) would be if indexed search discarded any old stuff whenever a page gets re-indexed and just indexes the current version. Apparently, it doesn't, but instead adds new content without removing the old content, leading to discrepancies between the search results/result descriptions and the actual pages.

The second best solution may be proposal 3 from above, but it could get complicated correctly identifying the content that's older than x days and removing only that content.

A third solution that may fit quite some use cases may be to throw away any old stuff and rebuilding the index at a time when there's no traffic on the website anyway. Who cares if there are no search results between 04:20 and 04:25 a.m. on a smaller local page? And that's probably the main use case for indexed search anyway. A large multinational site will propably use SOLR or something similar instead.

Which leads us back to square 1: TRUNCATE the tables automatically just before indexing them anew with the crawler would be very useful.

It's not the best solution, but a working, pragmatic solution. Why not make it available to those who would like to use it? It's not like we can't add more sophisticated, better solutions once we provided a quick fix, is it?

So, without further ado, for those in search for a quick, pragmatic solution, just add this to your sitepackage:

[sitepackage]/Configuration/Commands.php:

<?php 

return [
    'mysite:truncateindexedsearch' => [
        'class' => \Vendor\Sitepackage\Command\TruncateIndexedSearchTablesCommand::class
    ]
];

[sitepackage]/Classes/Command/TruncateIndexedSearchTablesCommand.php:

<?php
namespace Vendor\Sitepackage\Command;

use TYPO3\CMS\Core\Utility\GeneralUtility;
use TYPO3\CMS\Core\Database\ConnectionPool;
use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;
use Symfony\Component\Console\Style\SymfonyStyle;

class TruncateIndexedSearchTablesCommand extends Command
{
    /**
     * Configure the command by defining the name, options and arguments
     */
    protected function configure()
    {
        $this->setDescription('Truncate indexed search tables.');
    }

    /**
     * Truncate indexed search tables to force removal of hidden, deleted or changed content.
     * Don't forget to rebuild the index right after clearing it using the crawler extension!
     *
     * @param InputInterface $input
     * @param OutputInterface $output
     */
    protected function execute(InputInterface $input, OutputInterface $output)
    {
        $io = new SymfonyStyle($input, $output);
        $io->title($this->getDescription());

        $tables = [
            'index_debug', 'index_fulltext', 'index_grlist', 'index_phash', 'index_rel', 'index_section', 'index_stat_search', 'index_stat_word', 'index_words'
        ];

        foreach ($tables as $tableName) {
            $io->writeln('<info>Truncating '.$tableName.'</info>');
            GeneralUtility::makeInstance(ConnectionPool::class)
                ->getConnectionForTable($tableName)
                ->truncate($tableName);
        }

        $io->writeln('<info>Setting timer for next indexing to 0</info>');
        $queryBuilder = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable('index_config');
        $queryBuilder
            ->update('index_config')
            ->where(
                $queryBuilder->expr()->neq('timer_next_indexing', $queryBuilder->createNamedParameter(0))
            )
            ->set('timer_next_indexing', 0)
            ->execute();

        $io->writeln('<info>...done</info>');

        return 0;
    }
}

You can call the command either directly via cronjob or with the scheduler, right before re-indexing the site with the crawler extension. You're welcome. ;)

Actions

Copy link

#18

Updated by Florian Schöppe over 2 years ago

Thank you @Sven Teuber for the explanation of your workaround and the Command-code.

For my use case I removed the tables index_stat_search and index_stat_word from the table list to preserve the search statistics.

Actions

Copy link

#19

Updated by Sybille Peters over 2 years ago

About the solution: "Which leads us back to square 1: TRUNCATE the tables automatically just before indexing them anew with the crawler would be very useful."

I assume this refers to a complete reindex.

I find that unfortunate because with truncate you have a gap where no search results are available until the table is filled again. I think it should be possible to solve this differently.

e.g. by adding a timestamp to the existing records and removing the records which were last updated before the beginning of the reindexing.

This is the approach I use in "brofix" (fork from linkvalidator):

In linkvalidator (tool for gathering broken links), before checking for new broken links all the broken link records for the pages to be rechecked were deleted at the beginning of the check. This has the disadvantage, that you have a time where the data is gone (until it is created again).
I changed this (in brofix): At the beginning of the check, a timestamp is saved. All existing broken link records are not removed, they are updated if the broken link still exists. At the end of the check, all records that were last updated before the check (based on the timestamp and comparing with e.g. the tstamp field) were deleted. This works quite well.

Actions

Copy link

#20

Updated by Georg Ringer 5 months ago

Status changed from New to Closed

closing issue as with #94577 there is at least the option to clear the index if content is edited in the backend.

feel free to create an extension with more rules to clear the tables, e.g. after a timeframe or similiar.

currently nothing will change in that regard in core itself

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

TYPO3 Core

Custom queries

Watchers (14)

Task #72037

Truncate indexed_search tables before indexing with crawler

Updated by Benni Mack almost 9 years ago

Updated by Sven Burkert almost 9 years ago

Updated by Alexander Opitz over 8 years ago

Updated by Sven Burkert over 8 years ago

Updated by Alexander Opitz about 8 years ago

Updated by Sven Burkert about 8 years ago

Updated by Alexander Opitz about 8 years ago

Updated by Alexander Opitz about 8 years ago

Updated by Benni Mack about 8 years ago

Updated by Tymoteusz Motylewski about 8 years ago

Updated by Sven Burkert about 8 years ago

Updated by Sybille Peters over 6 years ago

Updated by David Henninger over 6 years ago

Updated by Uwe Wiebach about 6 years ago

Updated by Benjamin Robinson almost 4 years ago

Updated by Benjamin Robinson almost 4 years ago

Updated by Sven Teuber over 3 years ago

Updated by Florian Schöppe over 2 years ago

Updated by Sybille Peters over 2 years ago

Updated by Georg Ringer 5 months ago