Behemoth - a Hadoop-based platform for large-scale document processing

Tue, 2010-06-08 14:10 - 14:40
Speaker: Julien Nioche

Behemoth is an open source platform for large-scale document analysis based on Apache Hadoop. It allows GATE or UIMA applications to be deployed and uses a simple representation format that serves as common ground between UIMA- and GATE-generated annotations, making the two systems compatible. Because it is Hadoop-based, it benefits from Hadoop's features, namely scalability, fault tolerance and, most notably, the backing of a thriving open source community. Behemoth already interacts, or will interact, with quite a few open source projects such as Nutch, Tika, Mahout and HBase.

One of the main aims of Behemoth is to simplify the deployment of document analysers on a large scale, but it also provides converters from common data formats (WARC, Nutch, etc.) and a sandbox where users can share applications that work on the annotations using Hadoop MapReduce.

This talk will give an overview of Behemoth and concrete examples of its use. I will also describe future developments and short-term plans.