Overview

“Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.” ~source: http://tika.apache.org

In total, Tika recognizes about 1,200 file formats and can read about half of them.

Apache Tika for TYPO3 offers several services to extract metadata and content from files. The extension also comes with a service to detect the language of a text (requires Tika 0.8+).

EXT:tika can use either a locally available Tika CLI app or a remote Apache Solr server.
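As a rough illustration of the two backends, the sketch below builds the command line for a local Tika CLI call and the request URL for a remote Solr extraction call. It is not how EXT:tika itself is implemented. The jar path, the Solr base URL, and the function names are assumptions made for this example. The switches `--metadata`, `--text`, and `--language` belong to the Tika CLI app, and `extractOnly` is a parameter of Solr's extracting request handler.

```python
from urllib.parse import urlencode

# Hypothetical locations; adjust both to the local setup.
TIKA_JAR = "/opt/tika/tika-app.jar"
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"

# Tika CLI switches for the three kinds of output described above.
TIKA_MODES = {
    "metadata": "--metadata",
    "text": "--text",
    "language": "--language",
}


def tika_cli_command(mode, path, jar=TIKA_JAR):
    """Build the argument list for a local Tika CLI call (nothing is executed)."""
    return ["java", "-jar", jar, TIKA_MODES[mode], path]


def solr_extract_url(base=SOLR_EXTRACT_URL):
    """Build a Solr URL that extracts content from an uploaded file without indexing it."""
    return base + "?" + urlencode({"extractOnly": "true", "wt": "json"})


if __name__ == "__main__":
    print(" ".join(tika_cli_command("text", "report.pdf")))
    print(solr_extract_url())
```

The argument list could be passed to `subprocess.run`, and the URL could be used as the target of a file upload via HTTP POST. Both functions only assemble strings, so they can be tried without Java or Solr installed.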

Other extensions, such as EXT:dam or EXT:solr, can then use the provided services.

File types currently supported by the extension's services are:

Metadata extraction

  • au
  • bmp
  • doc
  • docx
  • epub
  • flv
  • gif
  • htm
  • html
  • image:exif
  • jpg
  • jpeg
  • mid
  • mp3
  • msg
  • odf
  • odt
  • pdf
  • png
  • ppt
  • pptx
  • rtf
  • svg
  • sxw
  • tiff
  • txt
  • wav
  • xls
  • xlsx
  • xml

Text extraction

  • doc
  • docx
  • epub
  • htm
  • html
  • msg
  • odf
  • odt
  • pdf
  • ppt
  • pptx
  • rtf
  • sxw
  • txt
  • xls
  • xlsx
  • xml
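The two lists above can be encoded as a simple lookup that reports which services apply to a given file. This is a minimal sketch, not code from the extension. The names `METADATA_TYPES`, `TEXT_TYPES`, and `supported_services` are hypothetical. The entry `image:exif` is left out because it does not look like a file extension.

```python
import os

# File types from the "Metadata extraction" list (image:exif omitted, see above).
METADATA_TYPES = {
    "au", "bmp", "doc", "docx", "epub", "flv", "gif", "htm", "html",
    "jpg", "jpeg", "mid", "mp3", "msg", "odf", "odt", "pdf", "png",
    "ppt", "pptx", "rtf", "svg", "sxw", "tiff", "txt", "wav", "xls",
    "xlsx", "xml",
}

# File types from the "Text extraction" list.
TEXT_TYPES = {
    "doc", "docx", "epub", "htm", "html", "msg", "odf", "odt", "pdf",
    "ppt", "pptx", "rtf", "sxw", "txt", "xls", "xlsx", "xml",
}


def supported_services(filename):
    """Return the services ("metadata", "text") that list the file's extension."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    services = []
    if ext in METADATA_TYPES:
        services.append("metadata")
    if ext in TEXT_TYPES:
        services.append("text")
    return services


if __name__ == "__main__":
    for name in ("Report.PDF", "photo.jpg", "archive.zip"):
        print(name, supported_services(name))
```

Every type in the text extraction list also appears in the metadata list, so a file that supports text extraction always supports metadata extraction as well.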

More information about Apache Tika: http://tika.apache.org

Members

Leader

Ingo Renner (flyguide)

Member

georg kuehnberger (gkuehnberger)
Ingo Schmitt (ischmittis)
Markus Goldbach (goldi_42)