Project

General

Profile

Actions

Task #72037

open

Truncate indexed_search tables before indexing with crawler

Added by Sven Burkert over 8 years ago. Updated about 2 years ago.

Status:
New
Priority:
Should have
Assignee:
-
Category:
Indexed Search
Start date:
2015-12-03
Due date:
% Done:

0%

Estimated time:
TYPO3 Version:
7
PHP Version:
Tags:
Complexity:
Sprint Focus:

Description

Ext. indexed_search should provide a scheduler job for truncating all the indexed_search tables:

TRUNCATE TABLE index_debug;
TRUNCATE TABLE index_fulltext;
TRUNCATE TABLE index_grlist;
TRUNCATE TABLE index_phash;
TRUNCATE TABLE index_rel;
TRUNCATE TABLE index_section;
TRUNCATE TABLE index_stat_search;
TRUNCATE TABLE index_stat_word;
TRUNCATE TABLE index_words;
UPDATE index_config SET timer_next_indexing = 0;

As next step, the whole page is indexed again with ext. crawler. But before, the indexed contents has to be deleted because of deleted and hidden records, which should vanish out of the search results.


Related issues 1 (0 open1 closed)

Related to TYPO3 Core - Bug #67249: Indexed search do not delete hidden recordsRejected2015-06-02

Actions
Actions #1

Updated by Benni Mack over 8 years ago

  • Status changed from New to Needs Feedback

Can you elobarate the benefits here? A recurring task that clears the whole index? I'd rather go with a button in some indexed search module so an admin can clear them.

Actions #2

Updated by Sven Burkert over 8 years ago

Truncating the indexing tables also removes the deleted contents.
If you do not truncate the tables before running over the site with the crawler (ext. crawler), then you index all contents, but the deleted or hidden contents remain in the indexing tables and the links to these contents are shown in indexed_search search results.

That means, every time before indexing the site with ext. crawler, the indexing tables has to be truncated.

Actions #3

Updated by Alexander Opitz almost 8 years ago

Hi Sven, I can't confirm this behavior if using the crawler via the scheduler.

How do you run indexed search?

Actions #4

Updated by Sven Burkert almost 8 years ago

Ok, let's expect you use ext. "news" and want to index the news detail pages.
So I create a Crawler Configuration record on the news detail page.
Furthermore, I create two scheduler jobs:
1) Crawler queue
2) Crawler Run

These two jobs are indexing the news. Without the job which truncates the indexed contents, meanwhile deleted or hidden news records are still in the index.

Do you have a TYPO3 setup where older indexed contents are deleted or somehow invalidated?

Actions #6

Updated by Sven Burkert over 7 years ago

This link doesn't help, but it also recommends to truncate the tables:

I suggest to empty indexing and crawler tables before each task, this prevents many side effects[...]

Actions #7

Updated by Alexander Opitz over 7 years ago

  • Status changed from Needs Feedback to New
  • Target version set to Candidate for Major Version

Ok, found more blogs and forums which all provide this as solution.

In my installations I do not have this issue, but there don't change much. All others use solr.

So there would be 2 possible ways to solve this.

  • 1) Add a way to truncate tables. Fast to implement but the index isn't fully available till next complete crawling (for each config).
    • Fast to implement
    • Most hassle if something do not work after truncating
    • Not complete everytime
  • 2.) Task which can remove entries for deleted pages/contents for hit (best not as task rather on editor action).
    • Hard to implement
    • Should be cleanest solution
Actions #8

Updated by Alexander Opitz over 7 years ago

Any comments or other solutions?

Actions #9

Updated by Benni Mack over 7 years ago

well, I would consider this a feature to have a scheduler task to truncate the index for a certain config, this way an admin can choose to do so.

Alternatively, the indexing config has an additional option to flush the index before crawling.

Both options are related to the quick fix, the clean way really seems hard (you also have deleted/hidden/starttime/endtime stuff to consider).

Actions #10

Updated by Tymoteusz Motylewski over 7 years ago

Possible solutions:
1. truncate table
- it leaves index empty for some time

2. trigger removal of the items from index on data change (e.g. record was updated) or something
- it's tricky to implement to cover all cases, but maybe we can do a simple hook to remove stuff from index on most common cases like hiding/removing/restricting... a page.

3. Have a "garbage collector" task which runs query like "delete all records in the index which were last indexed 2 days ago"
(the 2days time being configurable value). Then you can schedule this as a scheduler task running after you recrawl the whole page and removes data which were not seen any more.
- with this solution, there still be a period of time where wrong records are available

Gentlemen, which solution do you think will suite your needs?

Actions #11

Updated by Sven Burkert over 7 years ago

Perhaps this is also a solution:

4.) Check for every indexed page/record, if this record is still visible (that means, not deleted, not hidden, no starttime in future, no endtime reached). If not, delete the entries for this one only.

But I am unsure, what happens, if this record becomes visible again. Is the url for this record put in queue again and is it indexed on the next run?

Actions #12

Updated by Sybille Peters about 6 years ago

I don't think truncating the index is a good solution because it will leave the index empty / incomplete until everything is reindexed. The other solutions as proposed by Tymoteusz sound like a better idea.

Actions #13

Updated by David Henninger almost 6 years ago

Yes, truncating is a bad hack. I am all for a TTL for indexes solution 3: If a page doesn't get indexed regularly, it gets thrown away.

For now I have to remove entries manually or start a full reindex after truncating...

Actions #14

Updated by Uwe Wiebach over 5 years ago

+1 for proposed solution 3.

Actions #15

Updated by Benjamin Robinson about 3 years ago

  • Related to Bug #67249: Indexed search do not delete hidden records added
Actions #16

Updated by Benjamin Robinson about 3 years ago

Also +1 for proposal 3

Actions #17

Updated by Sven Teuber almost 3 years ago

We, too, struggle with indexed search displaying old content that has long been edited out or deleted (indexed_search 9.5.28), and Google/Stackoverflow seem to agree that this is (still) quite common.

The best solution in a frontend indexing scenario (which we used until now) would be if indexed search discarded any old stuff whenever a page gets re-indexed and just indexes the current version. Apparently, it doesn't, but instead adds new content without removing the old content, leading to discrepancies between the search results/result descriptions and the actual pages.

The second best solution may be proposal 3 from above, but it could get complicated correctly identifying the content that's older than x days and removing only that content.

A third solution that may fit quite some use cases may be to throw away any old stuff and rebuilding the index at a time when there's no traffic on the website anyway. Who cares if there are no search results between 04:20 and 04:25 a.m. on a smaller local page? And that's probably the main use case for indexed search anyway. A large multinational site will propably use SOLR or something similar instead.

Which leads us back to square 1: TRUNCATE the tables automatically just before indexing them anew with the crawler would be very useful.

It's not the best solution, but a working, pragmatic solution. Why not make it available to those who would like to use it? It's not like we can't add more sophisticated, better solutions once we provided a quick fix, is it?

So, without further ado, for those in search for a quick, pragmatic solution, just add this to your sitepackage:

[sitepackage]/Configuration/Commands.php:

<?php 

return [
    'mysite:truncateindexedsearch' => [
        'class' => \Vendor\Sitepackage\Command\TruncateIndexedSearchTablesCommand::class
    ]
];

[sitepackage]/Classes/Command/TruncateIndexedSearchTablesCommand.php:

<?php
namespace Vendor\Sitepackage\Command;

use TYPO3\CMS\Core\Utility\GeneralUtility;
use TYPO3\CMS\Core\Database\ConnectionPool;
use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;
use Symfony\Component\Console\Style\SymfonyStyle;

class TruncateIndexedSearchTablesCommand extends Command
{
    /**
     * Configure the command by defining the name, options and arguments
     */
    protected function configure()
    {
        $this->setDescription('Truncate indexed search tables.');
    }

    /**
     * Truncate indexed search tables to force removal of hidden, deleted or changed content.
     * Don't forget to rebuild the index right after clearing it using the crawler extension!
     *
     * @param InputInterface $input
     * @param OutputInterface $output
     */
    protected function execute(InputInterface $input, OutputInterface $output)
    {
        $io = new SymfonyStyle($input, $output);
        $io->title($this->getDescription());

        $tables = [
            'index_debug', 'index_fulltext', 'index_grlist', 'index_phash', 'index_rel', 'index_section', 'index_stat_search', 'index_stat_word', 'index_words'
        ];

        foreach ($tables as $tableName) {
            $io->writeln('<info>Truncating '.$tableName.'</info>');
            GeneralUtility::makeInstance(ConnectionPool::class)
                ->getConnectionForTable($tableName)
                ->truncate($tableName);
        }

        $io->writeln('<info>Setting timer for next indexing to 0</info>');
        $queryBuilder = GeneralUtility::makeInstance(ConnectionPool::class)->getQueryBuilderForTable('index_config');
        $queryBuilder
            ->update('index_config')
            ->where(
                $queryBuilder->expr()->neq('timer_next_indexing', $queryBuilder->createNamedParameter(0))
            )
            ->set('timer_next_indexing', 0)
            ->execute();

        $io->writeln('<info>...done</info>');

        return 0;
    }
}

You can call the command either directly via cronjob or with the scheduler, right before re-indexing the site with the crawler extension. You're welcome. ;)

Actions #18

Updated by Florian Schöppe about 2 years ago

Thank you @Sven Teuber for the explanation of your workaround and the Command-code.

For my use case I removed the tables index_stat_search and index_stat_word from the table list to preserve the search statistics.

Actions #19

Updated by Sybille Peters about 2 years ago

About the solution: "Which leads us back to square 1: TRUNCATE the tables automatically just before indexing them anew with the crawler would be very useful."

I assume this refers to a complete reindex.

I find that unfortunate because with truncate you have a gap where no search results are available until the table is filled again. I think it should be possible to solve this differently.

e.g. by adding a timestamp to the existing records and removing the records which were last updated before the beginning of the reindexing.

This is the approach I use in "brofix" (fork from linkvalidator):

  • In linkvalidator (tool for gathering broken links), before checking for new broken links all the broken link records for the pages to be rechecked were deleted at the beginning of the check. This has the disadvantage, that you have a time where the data is gone (until it is created again).
  • I changed this (in brofix): At the beginning of the check, a timestamp is saved. All existing broken link records are not removed, they are updated if the broken link still exists. At the end of the check, all records that were last updated before the check (based on the timestamp and comparing with e.g. the tstamp field) were deleted. This works quite well.
Actions

Also available in: Atom PDF