What does it do
“Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.” ~source: http://tika.apache.org
All in all, Tika knows about 1200 file formats and can read about half of them. These include the most common ones:
HTML, XML (including RSS and Atom feeds), Microsoft Office (binary formats and OOXML), OpenDocument Format / ODF (OpenOffice.org), Apple iWork, PDF, EPUB, RTF, compressed formats such as ZIP, audio formats including MP3, Flash Video (FLV), image formats including JPEG and TIFF, the mbox mailbox format, and many more.
Apache Tika for TYPO3 provides three services to retrieve information from files:
- Text extraction
- Language detection of file contents (requires Apache Tika version 0.8 or higher)
- Metadata extraction
All three services can be used with DAM.
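To make the three services more concrete, here is a minimal sketch of how each one could map onto a call to the standalone Tika application JAR. It assumes the `--text`, `--language` and `--metadata` switches of the tika-app command line; the JAR path, the service keys and the function name `build_tika_command` are hypothetical and chosen only for illustration. This is not the extension's actual implementation. The sketch only builds the command; it does not run it.

```python
from typing import List

# Assumed tika-app command-line switches, one per service listed above.
# The dictionary keys are illustrative names, not identifiers from the extension.
TIKA_SWITCHES = {
    "text": "--text",          # text extraction
    "language": "--language",  # language detection (Tika 0.8 or higher)
    "metadata": "--metadata",  # metadata extraction
}


def build_tika_command(service: str, file_path: str,
                       tika_jar: str = "/opt/tika/tika-app.jar") -> List[str]:
    """Return the argument list for running one Tika service on a file.

    The default JAR location is a placeholder; adjust it to your setup.
    """
    if service not in TIKA_SWITCHES:
        raise ValueError(f"unknown service: {service!r}")
    return ["java", "-jar", tika_jar, TIKA_SWITCHES[service], file_path]


if __name__ == "__main__":
    # Show the command line for each of the three services.
    for name in TIKA_SWITCHES:
        print(" ".join(build_tika_command(name, "report.pdf")))
```

The returned list can be passed to `subprocess.run` as is, which avoids shell quoting problems with file names that contain spaces.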
Have a look at the installed services report in the reports module to see which formats the extension itself currently supports. The list of supported file formats can easily be extended by adding file extensions to the comma-separated list in the service registration call. Currently, the list is limited to those formats we have written unit tests for.
It is recommended to use Tika version 0.7 or higher. To use the language detection service, you'll need version 0.8 or higher.
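The version requirements can be expressed as a small check. The following is a sketch only: the function names are hypothetical, and it assumes purely numeric, dotted version strings such as `0.8`. Versions are compared as integer tuples because a plain string comparison would wrongly rank `0.10` below `0.8`.

```python
from typing import List, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '0.8' into (0, 8)."""
    return tuple(int(part) for part in version.split("."))


def is_recommended(tika_version: str) -> bool:
    """True if the version meets the recommended minimum of 0.7."""
    return parse_version(tika_version) >= (0, 7)


def usable_services(tika_version: str) -> List[str]:
    """List the services usable with the given Tika version."""
    services = ["text extraction", "metadata extraction"]
    # Language detection requires Tika 0.8 or higher.
    if parse_version(tika_version) >= (0, 8):
        services.insert(1, "language detection")
    return services


if __name__ == "__main__":
    for version in ("0.7", "0.8", "0.10"):
        print(version, is_recommended(version), usable_services(version))
```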