Project

General

Profile

Actions

Bug #91168

open

Design flaw in FAL with indexing of files after a move

Added by Xavier Perseguers about 4 years ago. Updated 10 months ago.

Status:
New
Priority:
Must have
Assignee:
-
Category:
File Abstraction Layer (FAL)
Target version:
Start date:
2020-04-22
Due date:
% Done:

0%

Estimated time:
TYPO3 Version:
6.2
PHP Version:
Tags:
Complexity:
Is Regression:
Sprint Focus:

Description

This problem was reported by myself in the context of the TYPO3 extension extractor: https://github.com/xperseguers/t3ext-extractor/issues/25

Context

When moving a file in fileadmin, FAL is invoking extractors to extract metadata again which is totally wrong, it should be done during upload (if configured so for the underlying storage) but never afterwards.

Problem is that if a user changes the metadata by editing them, they will get overridden whenever the file is moved around!

\TYPO3\CMS\Core\Resource\ResourceStorage::moveFile() line 2020 is doing that:

$this->getIndexer()->updateIndexEntry($file);

which in turn does that:

$metaData = $this->extractRequiredMetaData($fileObject);

if ($this->storage->autoExtractMetadataEnabled()) {
    $metaData = array_merge($metaData, $this->getExtractorService()->extractMetaData($fileObject));
}
$fileObject->getMetaData()->add($metaData)->save();

This has the effect of possibly overriding changes made by the user and leads to loss of information!

How to reproduce: Start

  • Take a vanilla v6.2, v7, v8, v9 or v10 TYPO3 installation
  • Download and install EXT:extractor (https://extensions.typo3.org/extension/extractor or packagist causal/extractor)
  • Configure the extension by just enabling native PHP extraction (default settings, nothing to change, no need to configure additional external extraction services)
  • Upload a file containing metadata. As an example I attach a sample PDF (sample.pdf) coming from EXT:extractor itself to this ticket
  • Click the file name to edit the metadata, you get something like that:

  • Now change some metadata manually, like the title or the description. Save
  • You can edit again to check, your changes have naturally been persisted.

How to reproduce: DESIGN FLAW!

  • Move the file to another directory
  • Edit the metadata... BANG! your metadata have been reset to the content of the 3rd-party extractor services

Solution

  • Check if reindexing the file after moving it is actually required by FAL
  • In any case, DO NOT INVOKE metadata extraction again!

Files

sample.pdf (23.5 KB) sample.pdf Xavier Perseguers, 2020-04-22 10:59
metadata.png (114 KB) metadata.png Xavier Perseguers, 2020-04-22 11:02
Actions #1

Updated by Xavier Perseguers about 4 years ago

This is just a (quick) suggestion:

diff --git a/typo3/sysext/core/Classes/Resource/Service/ExtractorService.php b/typo3/sysext/core/Classes/Resource/Service/ExtractorService.php
index 92229d3581..a4a05a75dc 100644
--- a/typo3/sysext/core/Classes/Resource/Service/ExtractorService.php
+++ b/typo3/sysext/core/Classes/Resource/Service/ExtractorService.php
@@ -80,6 +80,16 @@ class ExtractorService
      */
     private function isFileTypeSupportedByExtractor(File $file, ExtractorInterface $extractor): bool
     {
+        if ($file->getMetaData()->count() > 0) {
+            // There's a design flaw in FAL, moving a file should not reindex it
+            // because it will effectively lead to overriding any user-defined
+            // value by metadata extracted by any extractor again.
+            // We will do our small job to refuse to extract metadata if metadata
+            // are already present.
+            // See: https://forge.typo3.org/issues/91168
+            return false;
+        }
+
         $isSupported = true;
         $fileTypeRestrictions = $extractor->getFileTypeRestrictions();
         if (!empty($fileTypeRestrictions) && !in_array($file->getType(), $fileTypeRestrictions, true)) {

Something that could logically be moved even more near the actual call (currently trying to find a way to prevent this flaw in my ext until TYPO3 is fixed).

Actions #2

Updated by Sybille Peters 11 months ago

Hi Xavier, I am testing your extension and have noticed this issue (again).

Would you consider creating a patch for your suggested change?

I think metadata extraction on moving files does not make sense at all.

(I have not checked in later versions if this will still happen, maybe we should check that first).

Actions #3

Updated by Sybille Peters 11 months ago

In the extension, this is now implemented like this, comparing tstamp and crdate in sys_file_metadata:

public function canProcess(File $file)
    {
        $typo3Branch = class_exists(\TYPO3\CMS\Core\Information\Typo3Version::class)
            ? (new \TYPO3\CMS\Core\Information\Typo3Version())->getBranch()
            : TYPO3_branch;
        if (version_compare($typo3Branch, '10.0', '<')) {
            $metadata = $file->_getMetaData();
        } else {
            $metadata = $file->getMetaData();
        }

        if (!empty($metadata) && $metadata['crdate'] !== $metadata['tstamp']) {
            // There's a design flaw in FAL, moving a file should not reindex it
            // because it will effectively lead to overriding any user-defined
            // value by metadata extracted by any extractor again.
            // We will do our small job to refuse to extract metadata if metadata
            // modification timestamp is different from creation timestamp
            // See: https://forge.typo3.org/issues/91168
            return false;
        }

link to latest commit in master branch

Actions #4

Updated by Sybille Peters 10 months ago

I am not sure this is a design flaw. I think what happened is that probably at the beginning the metadata was in sys_file too and then this was separated and the extractors added and the file index (sys_file) and metadata (sys_file_metadata) has not been entirely and cleanly separated.

This is visible in the core and documentation.

I think to fix is good not to fix or workaround this one case but decide when sys_file should be updated and metadata extracted and separate these 2 from each other where it makes sense. Also, it is not so easy to fix and I suggest to write tests first to make sure not to break anything.

...

Clearly separated:

  • 2 separate scheduler tasks
  • automatic metadata extraction can be turned off in storage

Not clearly separated:

  • in core there is a function Indexer::updateIndexEntry, this updates the sys_file index and starts metdata extraction. In some cases it is good to do both (e.g. when a new file is uploaded or a file copied), in some cases it does not make sense to do metadata extraction (again), e.g. on file move, but you do need the sys_file index updating.
public function updateIndexEntry(File $fileObject): File
    {
        $updatedInformation = $this->gatherFileInformationArray($fileObject->getIdentifier());
        $fileObject->updateProperties($updatedInformation);

        $this->getFileIndexRepository()->update($fileObject);
        $metaData = $this->extractRequiredMetaData($fileObject);

        if ($this->storage->autoExtractMetadataEnabled()) {
            $metaData = array_merge($metaData, $this->getExtractorService()->extractMetaData($fileObject));
        }
        $fileObject->getMetaData()->add($metaData)->save();
        return $fileObject;
    }

documentation:

  • in the documentation, in some places the distinction between the file index (sys_file) and metadata extraction is not clear

e.g.

Indexing a file creates a database record for the file, containing meta-information both about the file (file-system properties) and from the file (e.g. EXIF information for images).

https://docs.typo3.org/m/typo3/reference-coreapi/12.4/en-us/ApiOverview/Fal/Architecture/Components.html#the-file-index

That is kind of correct if with "indexing" you mean updating sys_file and extracting metadata and with "database record" you mean sys_file and sys_file_metadata. In any case, I find it misleading.

Otherwise, it is not correct, because the exif metadata is written to sys_file_metdata

  • also, the metadata extraction is not documented at all (e.g. how to add your own extractors)
Actions

Also available in: Atom PDF