Overview
“Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.” (Source: http://tika.apache.org)
All in all, Tika recognizes about 1200 file formats and can read about half of them.
Apache Tika for TYPO3 offers several services to extract metadata and content from files. The extension also comes with a service to detect the language of a text (requires Tika 0.8+).
EXT:tika can use either a locally available Tika CLI app or a remote Apache Solr server.
Other extensions, for example EXT:dam or EXT:solr, can then use the provided services.
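To make the two backends more concrete, the following is a minimal sketch in Python, not the extension's actual implementation. It assumes the Tika CLI app is started with `java -jar` and its `--metadata`, `--text` and `--language` options, and that the remote Solr server has the extracting request handler available at `/update/extract`. The function names, the jar path and the Solr URL are hypothetical and only serve as illustration.

```python
from urllib.parse import urlencode


def tika_cli_command(jar_path, file_path, mode="text"):
    """Build the argument list for a call to a local Tika CLI app.

    jar_path is a hypothetical location of the tika-app jar.
    """
    # Tika CLI options for the three kinds of service described above.
    flags = {
        "metadata": "--metadata",
        "text": "--text",
        "language": "--language",
    }
    if mode not in flags:
        raise ValueError(f"unsupported mode: {mode}")
    return ["java", "-jar", jar_path, flags[mode], file_path]


def solr_extract_url(solr_core_url):
    """Build a Solr URL that extracts content without indexing it.

    Assumes the extracting request handler is registered at /update/extract.
    """
    query = urlencode({"extractOnly": "true", "wt": "json"})
    return f"{solr_core_url.rstrip('/')}/update/extract?{query}"
```

The list returned by `tika_cli_command` could be passed to `subprocess.run`, while the URL from `solr_extract_url` would be the target of an HTTP POST that carries the file.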
File types currently supported by the extension's services are:
Metadata extraction
- au
- bmp
- doc
- docx
- epub
- flv
- gif
- htm
- html
- image:exif
- jpg
- jpeg
- mid
- mp3
- msg
- odf
- odt
- png
- ppt
- pptx
- rtf
- svg
- sxw
- tiff
- txt
- wav
- xls
- xlsx
- xml
Text extraction
- doc
- docx
- epub
- htm
- html
- msg
- odf
- odt
- ppt
- pptx
- rtf
- sxw
- txt
- xls
- xlsx
- xml
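The two lists above can be read as a simple lookup from file extension to available service. The following Python sketch illustrates that idea only; the extension itself registers its services with TYPO3 differently, and the names `METADATA_TYPES`, `TEXT_TYPES` and `supported_services` are hypothetical.

```python
import os

# File types copied from the "Metadata extraction" list above.
# "image:exif" is kept verbatim; it is not a file extension and never matches.
METADATA_TYPES = {
    "au", "bmp", "doc", "docx", "epub", "flv", "gif", "htm", "html",
    "image:exif", "jpg", "jpeg", "mid", "mp3", "msg", "odf", "odt", "png",
    "ppt", "pptx", "rtf", "svg", "sxw", "tiff", "txt", "wav", "xls",
    "xlsx", "xml",
}

# File types copied from the "Text extraction" list above.
TEXT_TYPES = {
    "doc", "docx", "epub", "htm", "html", "msg", "odf", "odt", "ppt",
    "pptx", "rtf", "sxw", "txt", "xls", "xlsx", "xml",
}


def supported_services(filename):
    """Return the services listed above that cover the file's extension."""
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    services = []
    if extension in METADATA_TYPES:
        services.append("metadata")
    if extension in TEXT_TYPES:
        services.append("text")
    return services
```

For a file such as `photo.jpg` only metadata extraction is listed, while office documents such as `.docx` appear in both lists.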
More information about Apache Tika: http://tika.apache.org
Members
Leader
Ingo Renner (flyguide)
Member
georg kuehnberger (gkuehnberger)
Ingo Schmitt (ischmittis)
Markus Goldbach (goldi_42)
Olivier Dobberkau (oli4)