Feature #36743

Use text extraction services to get file content

Added by Ingo Renner over 7 years ago. Updated about 2 years ago.

Status: Closed
Priority: Must have
Assignee:
Category: File Abstraction Layer (FAL)
Target version:
Start date: 2012-05-01
Due date:
% Done: 100%
PHP Version:
Tags:
Complexity:
Sprint Focus:

Description

Currently FAL simply uses file_get_contents() in its local driver to extract a file's content. This is fine for simple text files, but won't work for file types like Office and PDF files.

TYPO3 already offers the services infrastructure to allow having different text extractors. Use the textExtract service to read file contents.

Associated revisions

Revision ca87592e (diff)
Added by Ingo Renner over 4 years ago

[FEATURE] Get a file's text content if possible

Currently FAL only allows to extract meta data from files. This
patch allows to also extract text content from files. This can
be useful for search engines or providing snippets/teasers
in document archives.

Multiple text extractors can be registered to allow dealing with
different file types. A plain text extractor is provided by the core.

This is also a successor to the former textExtract service interface
implemented by several extensions: http://bit.ly/1D0x92M

Fixes: #36743
Releases: master
Change-Id: I1ce414c99fb26413eedd32422821e1a8802010de
Reviewed-on: http://review.typo3.org/36556
Reviewed-by: Frans Saris
Tested-by: Frans Saris
Reviewed-by: Frank Nägler
Tested-by: Frank Nägler
Reviewed-by: Stefan Froemken
Tested-by: Stefan Froemken

History

#1 Updated by Gerrit Code Review over 7 years ago

  • Status changed from New to Under Review

Patch set 1 for branch master has been pushed to the review server.
It is available at http://review.typo3.org/10916

#2 Updated by Gerrit Code Review over 7 years ago

Patch set 2 for branch master has been pushed to the review server.
It is available at http://review.typo3.org/10916

#3 Updated by Alexander Opitz over 6 years ago

  • Assignee changed from Ingo Renner to Andreas Wolf
  • Target version deleted (6.0.0)

What is the state of text extraction services? You mentioned in Gerrit that there are other plans to implement this.

#4 Updated by Alexander Opitz almost 5 years ago

  • Target version set to 7.1 (Cleanup)
  • Sprint Focus set to On Location Sprint

#5 Updated by Alexander Opitz almost 5 years ago

  • Category set to File Abstraction Layer (FAL)

#6 Updated by Frans Saris almost 5 years ago

  • Status changed from Under Review to Needs Feedback

You can create your own extractor service that processes a file and returns its readable content, just as is already possible for metadata.

In your extractor, call $file->getForLocalProcessing(); to get the path to the real file (or a temporary local copy of it) and do your magic to fetch the text.
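
A rough sketch of that idea, runnable without TYPO3: the classes LocalFile, TextExtractor, PlainTextExtractor and ExtractorRegistry below are hypothetical stand-ins defined only for this example, not the core classes. Only getForLocalProcessing() is taken from the comment above, and its signature here is simplified. The registry asks each registered extractor whether it can handle the file and uses the first one that accepts it.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for a FAL file object; only the one method
// mentioned above is modeled, with a simplified signature.
final class LocalFile
{
    /** @var string */
    private $path;

    public function __construct(string $path)
    {
        $this->path = $path;
    }

    // Returns a local path to the file (in FAL this may be a temporary copy).
    public function getForLocalProcessing(): string
    {
        return $this->path;
    }

    public function getExtension(): string
    {
        return strtolower(pathinfo($this->path, PATHINFO_EXTENSION));
    }
}

// Hypothetical extractor contract: "can you handle this file?" plus "give me its text".
interface TextExtractor
{
    public function canExtractText(LocalFile $file): bool;
    public function extractText(LocalFile $file): string;
}

// Plain text needs no conversion, so file_get_contents() is enough here.
final class PlainTextExtractor implements TextExtractor
{
    public function canExtractText(LocalFile $file): bool
    {
        return in_array($file->getExtension(), ['txt', 'csv'], true);
    }

    public function extractText(LocalFile $file): string
    {
        $content = file_get_contents($file->getForLocalProcessing());
        return $content === false ? '' : $content;
    }
}

// Hypothetical registry: holds extractors and delegates to the first match.
final class ExtractorRegistry
{
    /** @var TextExtractor[] */
    private $extractors = [];

    public function register(TextExtractor $extractor): void
    {
        $this->extractors[] = $extractor;
    }

    // Returns the extracted text, or null if no registered extractor accepts the file.
    public function extract(LocalFile $file): ?string
    {
        foreach ($this->extractors as $extractor) {
            if ($extractor->canExtractText($file)) {
                return $extractor->extractText($file);
            }
        }
        return null;
    }
}

// Usage: write a small text file, then run it and a PDF path through the registry.
$path = sys_get_temp_dir() . '/fal_demo_' . uniqid() . '.txt';
file_put_contents($path, 'Hello FAL');

$registry = new ExtractorRegistry();
$registry->register(new PlainTextExtractor());

$text = $registry->extract(new LocalFile($path));             // handled by PlainTextExtractor
$none = $registry->extract(new LocalFile('/tmp/report.pdf')); // no extractor registered for PDF

unlink($path);
```

An extractor for PDF or Office files would implement the same two methods and hand the local path to whatever conversion tool it wraps; nothing else in the registry would need to change.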

#7 Updated by Fabien Udriot almost 5 years ago

A source of inspiration could be EXT:metadata, where we retrieve custom metadata for images and PDFs.

Can we close the ticket?

#8 Updated by Mathias Schreiber almost 5 years ago

  • Status changed from Needs Feedback to Accepted

#9 Updated by Gerrit Code Review almost 5 years ago

  • Status changed from Accepted to Under Review

Patch set 1 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at http://review.typo3.org/36556

#10 Updated by Gerrit Code Review almost 5 years ago

Patch set 2 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at http://review.typo3.org/36556

#11 Updated by Gerrit Code Review almost 5 years ago

Patch set 3 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at http://review.typo3.org/36556

#12 Updated by Gerrit Code Review almost 5 years ago

Patch set 4 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at http://review.typo3.org/36556

#13 Updated by Gerrit Code Review almost 5 years ago

Patch set 5 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at http://review.typo3.org/36556

#14 Updated by Gerrit Code Review over 4 years ago

Patch set 6 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at http://review.typo3.org/36556

#15 Updated by Gerrit Code Review over 4 years ago

Patch set 7 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at http://review.typo3.org/36556

#16 Updated by Gerrit Code Review over 4 years ago

Patch set 8 for branch master of project Packages/TYPO3.CMS has been pushed to the review server.
It is available at http://review.typo3.org/36556

#17 Updated by Ingo Renner over 4 years ago

  • Status changed from Under Review to Resolved
  • % Done changed from 0 to 100

#18 Updated by Anja Leichsenring almost 4 years ago

  • Sprint Focus deleted (On Location Sprint)

#19 Updated by Riccardo De Contardi about 2 years ago

  • Status changed from Resolved to Closed
