2 results tagged documentos
A container that takes a PDF or DOCX file as input and outputs a TXT file containing the extracted metadata and content.
The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more. You can find the latest release on the download page. Please see the Getting Started page for more information on how to start using Tika.
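The "single interface" idea can be illustrated with a minimal sketch that sends a file to a running Tika Server over HTTP and combines the metadata and text into one string, much as the container above is described as doing. This is not taken from either bookmarked project. It assumes a Tika Server is reachable at `http://localhost:9998` (its default port) and uses the server's `/meta` and `/tika` endpoints; the helper names `build_tika_request` and `extract` and the output layout are hypothetical.

```python
import urllib.request

# Assumption: a Tika Server instance is listening here (9998 is its default port).
TIKA_URL = "http://localhost:9998"


def build_tika_request(data: bytes, endpoint: str = "tika",
                       accept: str = "text/plain") -> urllib.request.Request:
    """Build (but do not send) a PUT request for a Tika Server endpoint."""
    return urllib.request.Request(
        f"{TIKA_URL}/{endpoint}",
        data=data,
        method="PUT",
        headers={
            "Accept": accept,
            # Generic type, on the assumption that Tika then detects the real format itself.
            "Content-Type": "application/octet-stream",
        },
    )


def extract(path: str) -> str:
    """Return metadata (as JSON) followed by plain-text content for one file."""
    with open(path, "rb") as f:
        data = f.read()
    # The same bytes go to both endpoints, whatever the file type is.
    with urllib.request.urlopen(build_tika_request(data, "meta", "application/json")) as resp:
        metadata = resp.read().decode("utf-8")
    with urllib.request.urlopen(build_tika_request(data, "tika", "text/plain")) as resp:
        content = resp.read().decode("utf-8")
    return metadata + "\n\n" + content
```

Calling `extract("report.pdf")` or `extract("report.docx")` (hypothetical file names) would go through the same two requests; writing the returned string to a `.txt` file would approximate the container's output.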