<p>TYPO3 Forge: TYPO3 Core - Task #72037: Truncate indexed_search tables before indexing with crawler</p>
<p><strong>Benni Mack</strong>, 2015-12-05T15:23:04Z (http://forge.typo3.org/issues/72037?journal_id=289114):</p>
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Needs Feedback</i></li></ul><p>Can you elaborate on the benefits here? A recurring task that clears the whole index? I'd rather go with a button in some indexed search module so an admin can clear it.</p>
<p><strong>Sven Burkert</strong>, 2015-12-17T13:16:40Z (http://forge.typo3.org/issues/72037?journal_id=290198):</p>
<p>Truncating the indexing tables also removes deleted content.<br />If you do not truncate the tables before crawling the site with the crawler (ext. crawler), all content gets indexed, but deleted or hidden content remains in the indexing tables, and links to it still show up in the indexed_search results.</p>
<p>That means the indexing tables have to be truncated every time before the site is indexed with ext. crawler.</p>
<p><strong>Alexander Opitz</strong>, 2016-05-02T11:18:45Z (http://forge.typo3.org/issues/72037?journal_id=303420):</p>
<p>Hi Sven, I can't confirm this behavior when using the crawler via the scheduler.</p>
<p>How do you run indexed search?</p>
<p><strong>Sven Burkert</strong>, 2016-05-04T09:41:01Z (http://forge.typo3.org/issues/72037?journal_id=303626):</p>
<p>Ok, let's assume you use ext. "news" and want to index the news detail pages.<br />So I create a Crawler Configuration record on the news detail page.<br />Furthermore, I create two scheduler jobs:<br />1) Crawler queue<br />2) Crawler Run</p>
<p>These two jobs index the news. Without a job that truncates the indexed content, news records that have been deleted or hidden in the meantime remain in the index.</p>
<p>Do you have a TYPO3 setup where older indexed content is deleted or somehow invalidated?</p>
<p><strong>Alexander Opitz</strong>, 2016-11-07T10:37:17Z (http://forge.typo3.org/issues/72037?journal_id=317161):</p>
<p>Does <a class="external" href="http://xavier.perseguers.ch/tutoriels/typo3/articles/indexed-search-crawler.html">http://xavier.perseguers.ch/tutoriels/typo3/articles/indexed-search-crawler.html</a> help you?</p>
<p><strong>Sven Burkert</strong>, 2016-11-08T08:39:04Z (http://forge.typo3.org/issues/72037?journal_id=317295):</p>
<p>This link doesn't help, but it also recommends truncating the tables:</p>
<blockquote>
<p>I suggest to empty indexing and crawler tables before each task, this prevents many side effects[...]</p>
</blockquote>
<p><strong>Alexander Opitz</strong>, 2016-11-08T09:12:53Z (http://forge.typo3.org/issues/72037?journal_id=317297):</p>
<ul><li><strong>Status</strong> changed from <i>Needs Feedback</i> to <i>New</i></li><li><strong>Target version</strong> set to <i>Candidate for Major Version</i></li></ul><p>Ok, I found more blogs and forums which all offer this as the solution.</p>
<p>I do not have this issue in my installations, but not much changes there. All the others use Solr.</p>
<p>So there would be two possible ways to solve this.</p>
<ul>
<li>1) Add a way to truncate the tables. The index isn't fully available until the next complete crawl (for each config).
<ul>
<li>Fast to implement</li>
<li>Most hassle if something does not work after truncating</li>
<li>Index is not complete at all times</li>
</ul>
</li>
<li>2) A task that removes the entries of deleted pages/content one by one (ideally not as a task but triggered by the editor's action).
<ul>
<li>Hard to implement</li>
<li>Should be the cleanest solution</li>
</ul></li>
</ul>
<p><strong>Alexander Opitz</strong>, 2016-11-08T09:13:33Z (http://forge.typo3.org/issues/72037?journal_id=317298):</p>
<p>Any comments or other solutions?</p>
<p><strong>Benni Mack</strong>, 2016-11-10T13:13:19Z (http://forge.typo3.org/issues/72037?journal_id=317455):</p>
<p>Well, I would consider a scheduler task that truncates the index for a certain config a feature; this way an admin can choose to do so.</p>
<p>Alternatively, the indexing config could get an additional option to flush the index before crawling.</p>
<p>Both options belong to the quick fix; the clean way really seems hard (you also have to consider deleted/hidden/starttime/endtime).</p>
<p><strong>Tymoteusz Motylewski</strong>, 2016-11-10T13:43:59Z (http://forge.typo3.org/issues/72037?journal_id=317471):</p>
<p>Possible solutions:<br />1. Truncate the tables<br />- this leaves the index empty for some time</p>
<p>2. Trigger removal of items from the index on data change (e.g. when a record is updated)<br />- it's tricky to implement in a way that covers all cases, but maybe we can add a simple hook that removes entries from the index in the most common cases, like hiding/removing/restricting a page.</p>
<p>3. Have a "garbage collector" task which runs a query like "delete all records in the index which were last indexed more than 2 days ago" <br />(with the 2 days being a configurable value). You can then schedule this as a scheduler task that runs after you have recrawled the whole site and removes data which was not seen any more.<br />- with this solution, there will still be a period of time in which wrong records are available</p>
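<p>A minimal sketch of the "garbage collector" idea from option 3, not a tested TYPO3 integration. It runs against an in-memory SQLite database via PDO (assuming the pdo_sqlite extension is available) so it can be tried without a TYPO3 installation. The two tables are reduced to the columns needed here; the function name <code>purgeStaleIndexEntries</code> and the reduced schema are assumptions made for illustration, and a real task would also have to clean the other phash-related index tables.</p>

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: delete index entries that were not re-indexed
 * within $maxAgeSeconds. Returns the number of removed index_phash rows.
 */
function purgeStaleIndexEntries(PDO $db, int $now, int $maxAgeSeconds): int
{
    $cutoff = $now - $maxAgeSeconds;

    $db->beginTransaction();
    // Remove dependent rows first; they are keyed by the page hash.
    $dependent = $db->prepare(
        'DELETE FROM index_fulltext
         WHERE phash IN (SELECT phash FROM index_phash WHERE tstamp < :cutoff)'
    );
    $dependent->execute(['cutoff' => $cutoff]);

    $stale = $db->prepare('DELETE FROM index_phash WHERE tstamp < :cutoff');
    $stale->execute(['cutoff' => $cutoff]);
    $removed = $stale->rowCount();
    $db->commit();

    return $removed;
}

// Demo setup: simplified stand-ins for the real tables (assumed schema).
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE index_phash (phash INTEGER PRIMARY KEY, tstamp INTEGER NOT NULL)');
$db->exec('CREATE TABLE index_fulltext (phash INTEGER PRIMARY KEY, fulltextdata TEXT)');

$now = 1700000000;
$day = 86400;
// Page 1 was re-indexed one hour ago, page 2 was last seen three days ago.
$db->exec('INSERT INTO index_phash VALUES (1, ' . ($now - 3600) . '), (2, ' . ($now - 3 * $day) . ')');
$db->exec("INSERT INTO index_fulltext VALUES (1, 'current news'), (2, 'deleted news')");

// Configurable time to live: two days, as in the proposal.
$removed = purgeStaleIndexEntries($db, $now, 2 * $day);
$remaining = $db->query('SELECT phash FROM index_phash ORDER BY phash')->fetchAll(PDO::FETCH_COLUMN);
```

<p>The cutoff only works if the crawler really revisits every live page within the configured period; otherwise valid pages would be purged as well.</p>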
<p>Gentlemen, which solution do you think will suit your needs?</p>
<p><strong>Sven Burkert</strong>, 2016-11-12T17:40:52Z (http://forge.typo3.org/issues/72037?journal_id=317683):</p>
<p>Perhaps this is also a solution:</p>
<p>4.) For every indexed page/record, check whether it is still visible (that is: not deleted, not hidden, no starttime in the future, no endtime reached). If not, delete the index entries for this one only.</p>
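<p>The visibility check in option 4 could look roughly like the following sketch. It is plain PHP without any TYPO3 API; the helper names <code>isRecordVisible</code> and <code>findUidsToPurge</code> are hypothetical, the field names are assumed to be the usual deleted/hidden/starttime/endtime columns, and access restrictions such as frontend user groups are deliberately left out.</p>

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: decide whether a record row should still be searchable.
 * A value of 0 for starttime/endtime is treated as "not set".
 */
function isRecordVisible(array $row, int $now): bool
{
    if (!empty($row['deleted']) || !empty($row['hidden'])) {
        return false;
    }
    $start = (int)($row['starttime'] ?? 0);
    $end = (int)($row['endtime'] ?? 0);
    if ($start > $now) {
        return false; // not published yet
    }
    if ($end !== 0 && $end <= $now) {
        return false; // already expired
    }
    return true;
}

/**
 * Hypothetical helper: collect the uids whose index entries should be removed.
 */
function findUidsToPurge(array $rows, int $now): array
{
    $purge = [];
    foreach ($rows as $row) {
        if (!isRecordVisible($row, $now)) {
            $purge[] = (int)$row['uid'];
        }
    }
    return $purge;
}

$now = 1700000000;
$records = [
    ['uid' => 1, 'deleted' => 0, 'hidden' => 0, 'starttime' => 0, 'endtime' => 0],
    ['uid' => 2, 'deleted' => 1, 'hidden' => 0, 'starttime' => 0, 'endtime' => 0],
    ['uid' => 3, 'deleted' => 0, 'hidden' => 1, 'starttime' => 0, 'endtime' => 0],
    ['uid' => 4, 'deleted' => 0, 'hidden' => 0, 'starttime' => $now + 60, 'endtime' => 0],
    ['uid' => 5, 'deleted' => 0, 'hidden' => 0, 'starttime' => 0, 'endtime' => $now - 60],
];

// Only record 1 is visible; the others would be removed from the index.
$toPurge = findUidsToPurge($records, $now);
```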
<p>But I am unsure what happens if this record becomes visible again. Is the URL for this record put in the queue again, and is it indexed on the next run?</p>
<p><strong>Sybille Peters</strong>, 2018-03-27T11:55:50Z (http://forge.typo3.org/issues/72037?journal_id=362273):</p>
<p>I don't think truncating the index is a good solution because it will leave the index empty / incomplete until everything is reindexed. The other solutions as proposed by Tymoteusz sound like a better idea.</p>
<p><strong>David Henninger</strong>, 2018-07-10T12:34:29Z (http://forge.typo3.org/issues/72037?journal_id=368614):</p>
<p>Yes, truncating is a bad hack. I am all for solution 3, a TTL for index entries: if a page doesn't get indexed regularly, it gets thrown away.</p>
<p>For now I have to remove entries manually or start a full reindex after truncating...</p>
<p><strong>Uwe Wiebach</strong>, 2018-09-21T13:00:50Z (http://forge.typo3.org/issues/72037?journal_id=374077):</p>
<p>+1 for proposed solution 3.</p>
<p><strong>Benjamin Robinson</strong>, 2021-02-18T10:59:22Z (http://forge.typo3.org/issues/72037?journal_id=439937):</p>
<ul><li><strong>Related to</strong> <i><a class="issue tracker-1 status-6 priority-3 priority-lowest closed" href="/issues/67249">Bug #67249</a>: Indexed search do not delete hidden records</i> added</li></ul>
<p><strong>Benjamin Robinson</strong>, 2021-02-18T11:03:09Z (http://forge.typo3.org/issues/72037?journal_id=439938):</p>
<p>Also +1 for proposal 3</p>
<p><strong>Sven Teuber</strong>, 2021-07-21T13:44:00Z (http://forge.typo3.org/issues/72037?journal_id=447385):</p>
<p>We, too, struggle with indexed search displaying old content that has long been edited out or deleted (indexed_search 9.5.28), and Google/Stack Overflow seem to agree that this is (still) quite common.</p>
<p>The best solution in a frontend indexing scenario (which we used until now) would be for indexed search to discard any old content whenever a page gets re-indexed and index only the current version. Apparently it doesn't; instead it adds new content without removing the old content, leading to discrepancies between the search results/result descriptions and the actual pages.</p>
<p>The second best solution may be proposal 3 from above, but correctly identifying the content that's older than x days and removing only that content could get complicated.</p>
<p>A third solution that may fit quite a few use cases is to throw away all old content and rebuild the index at a time when there's no traffic on the website anyway. Who cares if there are no search results between 04:20 and 04:25 a.m. on a smaller local page? And that's probably the main use case for indexed search anyway. A large multinational site will probably use Solr or something similar instead.</p>
<p>Which leads us back to square 1: TRUNCATE the tables automatically just before indexing them anew with the crawler would be very useful.</p>
<p>It's not the best solution, but a working, pragmatic solution. Why not make it available to those who would like to use it? It's not like we can't add more sophisticated, better solutions once we provided a quick fix, is it?</p>
<p>So, without further ado, for those in search of a quick, pragmatic solution, just add this to your sitepackage:</p>
<p>[sitepackage]/Configuration/Commands.php:<br /><pre><code class="php syntaxhl" data-language="php"><span class="cp"><?php</span>
<span class="k">return</span> <span class="p">[</span>
    <span class="s1">'mysite:truncateindexedsearch'</span> <span class="o">=></span> <span class="p">[</span>
        <span class="s1">'class'</span> <span class="o">=></span> <span class="nc">\Vendor\Sitepackage\Command\TruncateIndexedSearchTablesCommand</span><span class="o">::</span><span class="n">class</span>
    <span class="p">]</span>
<span class="p">];</span>
</code></pre></p>
<p>[sitepackage]/Classes/Command/TruncateIndexedSearchTablesCommand.php:<br /><pre><code class="php syntaxhl" data-language="php"><span class="cp"><?php</span>
<span class="kn">namespace</span> <span class="nn">Vendor\Sitepackage\Command</span><span class="p">;</span>

<span class="kn">use</span> <span class="nc">TYPO3\CMS\Core\Utility\GeneralUtility</span><span class="p">;</span>
<span class="kn">use</span> <span class="nc">TYPO3\CMS\Core\Database\ConnectionPool</span><span class="p">;</span>
<span class="kn">use</span> <span class="nc">Symfony\Component\Console\Command\Command</span><span class="p">;</span>
<span class="kn">use</span> <span class="nc">Symfony\Component\Console\Input\InputInterface</span><span class="p">;</span>
<span class="kn">use</span> <span class="nc">Symfony\Component\Console\Output\OutputInterface</span><span class="p">;</span>
<span class="kn">use</span> <span class="nc">Symfony\Component\Console\Style\SymfonyStyle</span><span class="p">;</span>

<span class="kd">class</span> <span class="nc">TruncateIndexedSearchTablesCommand</span> <span class="kd">extends</span> <span class="nc">Command</span>
<span class="p">{</span>
    <span class="cd">/**
     * Configure the command by defining the name, options and arguments
     */</span>
    <span class="k">protected</span> <span class="k">function</span> <span class="n">configure</span><span class="p">()</span>
    <span class="p">{</span>
        <span class="nv">$this</span><span class="o">-></span><span class="nf">setDescription</span><span class="p">(</span><span class="s1">'Truncate indexed search tables.'</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="cd">/**
     * Truncate indexed search tables to force removal of hidden, deleted or changed content.
     * Don't forget to rebuild the index right after clearing it using the crawler extension!
     *
     * @param InputInterface $input
     * @param OutputInterface $output
     */</span>
    <span class="k">protected</span> <span class="k">function</span> <span class="n">execute</span><span class="p">(</span><span class="kt">InputInterface</span> <span class="nv">$input</span><span class="p">,</span> <span class="kt">OutputInterface</span> <span class="nv">$output</span><span class="p">)</span>
    <span class="p">{</span>
        <span class="nv">$io</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">SymfonyStyle</span><span class="p">(</span><span class="nv">$input</span><span class="p">,</span> <span class="nv">$output</span><span class="p">);</span>
        <span class="nv">$io</span><span class="o">-></span><span class="nf">title</span><span class="p">(</span><span class="nv">$this</span><span class="o">-></span><span class="nf">getDescription</span><span class="p">());</span>
        <span class="nv">$tables</span> <span class="o">=</span> <span class="p">[</span>
            <span class="s1">'index_debug'</span><span class="p">,</span> <span class="s1">'index_fulltext'</span><span class="p">,</span> <span class="s1">'index_grlist'</span><span class="p">,</span> <span class="s1">'index_phash'</span><span class="p">,</span> <span class="s1">'index_rel'</span><span class="p">,</span> <span class="s1">'index_section'</span><span class="p">,</span> <span class="s1">'index_stat_search'</span><span class="p">,</span> <span class="s1">'index_stat_word'</span><span class="p">,</span> <span class="s1">'index_words'</span>
        <span class="p">];</span>
        <span class="k">foreach</span> <span class="p">(</span><span class="nv">$tables</span> <span class="k">as</span> <span class="nv">$tableName</span><span class="p">)</span> <span class="p">{</span>
            <span class="nv">$io</span><span class="o">-></span><span class="nf">writeln</span><span class="p">(</span><span class="s1">'<info>Truncating '</span><span class="mf">.</span><span class="nv">$tableName</span><span class="mf">.</span><span class="s1">'</info>'</span><span class="p">);</span>
            <span class="nc">GeneralUtility</span><span class="o">::</span><span class="nf">makeInstance</span><span class="p">(</span><span class="nc">ConnectionPool</span><span class="o">::</span><span class="n">class</span><span class="p">)</span>
                <span class="o">-></span><span class="nf">getConnectionForTable</span><span class="p">(</span><span class="nv">$tableName</span><span class="p">)</span>
                <span class="o">-></span><span class="nf">truncate</span><span class="p">(</span><span class="nv">$tableName</span><span class="p">);</span>
        <span class="p">}</span>
        <span class="nv">$io</span><span class="o">-></span><span class="nf">writeln</span><span class="p">(</span><span class="s1">'<info>Setting timer for next indexing to 0</info>'</span><span class="p">);</span>
        <span class="nv">$queryBuilder</span> <span class="o">=</span> <span class="nc">GeneralUtility</span><span class="o">::</span><span class="nf">makeInstance</span><span class="p">(</span><span class="nc">ConnectionPool</span><span class="o">::</span><span class="n">class</span><span class="p">)</span><span class="o">-></span><span class="nf">getQueryBuilderForTable</span><span class="p">(</span><span class="s1">'index_config'</span><span class="p">);</span>
        <span class="nv">$queryBuilder</span>
            <span class="o">-></span><span class="nf">update</span><span class="p">(</span><span class="s1">'index_config'</span><span class="p">)</span>
            <span class="o">-></span><span class="nf">where</span><span class="p">(</span>
                <span class="nv">$queryBuilder</span><span class="o">-></span><span class="nf">expr</span><span class="p">()</span><span class="o">-></span><span class="nf">neq</span><span class="p">(</span><span class="s1">'timer_next_indexing'</span><span class="p">,</span> <span class="nv">$queryBuilder</span><span class="o">-></span><span class="nf">createNamedParameter</span><span class="p">(</span><span class="mi">0</span><span class="p">))</span>
            <span class="p">)</span>
            <span class="o">-></span><span class="nf">set</span><span class="p">(</span><span class="s1">'timer_next_indexing'</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
            <span class="o">-></span><span class="nf">execute</span><span class="p">();</span>
        <span class="nv">$io</span><span class="o">-></span><span class="nf">writeln</span><span class="p">(</span><span class="s1">'<info>...done</info>'</span><span class="p">);</span>
        <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></p>
<p>You can call the command either directly via cronjob or with the scheduler, right before re-indexing the site with the crawler extension. You're welcome. ;)</p>
<p><strong>Florian Schöppe</strong>, 2022-04-22T09:11:26Z (http://forge.typo3.org/issues/72037?journal_id=467763):</p>
<p>Thank you @Sven Teuber for the explanation of your workaround and the command code.</p>
<p>For my use case I removed the tables index_stat_search and index_stat_word from the table list to preserve the search statistics.</p>
<p><strong>Sybille Peters</strong>, 2022-04-22T09:38:49Z (http://forge.typo3.org/issues/72037?journal_id=467765):</p>
<p>About the solution: "Which leads us back to square 1: TRUNCATE the tables automatically just before indexing them anew with the crawler would be very useful."</p>
<p>I assume this refers to a complete reindex.</p>
<p>I find that unfortunate because with truncate you have a gap where no search results are available until the table is filled again. I think it should be possible to solve this differently.</p>
<p>For example, by adding a timestamp to the existing records and removing the records which were last updated before the reindexing began.</p>
<p>This is the approach I use in "brofix" (a fork of linkvalidator):</p>
<ul>
<li>In linkvalidator (a tool for gathering broken links), all the broken link records for the pages to be rechecked were deleted at the beginning of the check, before looking for new broken links. This has the disadvantage that there is a period during which the data is gone (until it is created again).</li>
<li>I changed this in brofix: At the beginning of the check, a timestamp is saved. Existing broken link records are not removed; they are updated if the broken link still exists. At the end of the check, all records that were last updated <strong>before</strong> the check (based on the saved timestamp compared with e.g. the tstamp field) are deleted. This works quite well.</li>
</ul>
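<p>Applied to a search index, the timestamp approach described above might be sketched as follows. This is a simplified illustration, not brofix or indexed_search code: the table <code>search_index</code>, its columns, and the functions <code>indexPage</code> and <code>sweep</code> are hypothetical, and the example uses an in-memory SQLite database via PDO (assuming pdo_sqlite is available) so it runs without TYPO3.</p>

```php
<?php
declare(strict_types=1);

// Hypothetical, simplified index table; not the real indexed_search schema.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE search_index (url TEXT PRIMARY KEY, content TEXT NOT NULL, tstamp INTEGER NOT NULL)');

/** Insert a new entry or refresh an existing one, stamping it with $now. */
function indexPage(PDO $db, string $url, string $content, int $now): void
{
    $update = $db->prepare('UPDATE search_index SET content = :content, tstamp = :now WHERE url = :url');
    $update->execute(['content' => $content, 'now' => $now, 'url' => $url]);
    if ($update->rowCount() === 0) {
        $insert = $db->prepare('INSERT INTO search_index (url, content, tstamp) VALUES (:url, :content, :now)');
        $insert->execute(['url' => $url, 'content' => $content, 'now' => $now]);
    }
}

/** Remove every entry that the run started at $runStart did not touch. */
function sweep(PDO $db, int $runStart): int
{
    $delete = $db->prepare('DELETE FROM search_index WHERE tstamp < :runStart');
    $delete->execute(['runStart' => $runStart]);
    return $delete->rowCount();
}

// A previous run indexed two pages.
indexPage($db, '/news/a', 'old text of A', 1000);
indexPage($db, '/news/b', 'text of B', 1000);

// New run: page B has been deleted in the meantime, so the crawler only sees A.
$runStart = 2000;
indexPage($db, '/news/a', 'new text of A', 2005);

// The old entries stay searchable during the run and are swept only at the end.
$swept = sweep($db, $runStart);
$rows = $db->query('SELECT url, content FROM search_index')->fetchAll(PDO::FETCH_KEY_PAIR);
```

<p>Unlike truncating, the index is never empty: stale entries stay visible until the run finishes, and only then are they removed.</p>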