Lessons Learned from rtika, a Digital Babel Fish

Author: Sasha Goodman

The Apache Tika parser is like the Babel fish in Douglas Adams’s book, “The Hitchhiker’s Guide to the Galaxy”. The Babel fish translates any natural language to any other. Although Tika does not yet translate natural language, it starts to tame the Tower of Babel of digital document formats. As the Babel fish allowed a person to understand Vogon poetry, Tika allows an analyst to extract text and objects from Microsoft Word.

Parsing files is a common concern for many communities, including journalism, government, business, and academia. The complexity of parsing varies widely. Apache Tika is a common library for lowering that complexity. The Tika auto-detect parser finds the content type of a file and processes it with an appropriate parser. It currently handles text or metadata extraction from over one thousand digital formats.
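
The detect-then-dispatch idea behind an auto-detect parser can be illustrated with a toy Python sketch. This is not Tika's implementation or API: the signature table, the function names (`detect_type`, `auto_parse`), and the single plain-text parser are all hypothetical and exist only to show the shape of the approach.

```python
from typing import Callable, Dict, Tuple

# Hypothetical magic-byte table for illustration only.
# A real detector uses a far larger set of rules.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
    (b"{\\rtf", "application/rtf"),
]


def detect_type(data: bytes) -> str:
    """Guess a content type from the leading bytes of a file."""
    for magic, mime in SIGNATURES:
        if data.startswith(magic):
            return mime
    # Fall back to plain text if the bytes decode cleanly as UTF-8.
    try:
        data.decode("utf-8")
        return "text/plain"
    except UnicodeDecodeError:
        return "application/octet-stream"


def parse_plain_text(data: bytes) -> str:
    """Toy parser: plain text needs only decoding."""
    return data.decode("utf-8")


def parse_unsupported(data: bytes) -> str:
    """Placeholder for types this sketch has no parser for."""
    return ""


# Registry mapping a detected content type to a parser function.
PARSERS: Dict[str, Callable[[bytes], str]] = {
    "text/plain": parse_plain_text,
}


def auto_parse(data: bytes) -> Tuple[str, str]:
    """Detect the content type, then hand the bytes to a matching parser."""
    mime = detect_type(data)
    parser = PARSERS.get(mime, parse_unsupported)
    return mime, parser(data)
```

Calling `auto_parse(b"hello")` in this sketch detects `text/plain` and returns the decoded text, while bytes starting with `%PDF-` are detected as PDF but yield an empty string because no PDF parser is registered. Supporting a new format would mean adding a signature and registering a parser.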

Read the full post here: https://ropensci.org/blog/2018/04/25/rtika-introduction/