Behemoth - a Hadoop based platform for large scale document processing

Fri, 2010-05-07 15:23 — isabel

search

Tue, 2010-06-08 14:10 - 14:40

Speaker:

Julien Nioche

Behemoth is an open source platform for large scale document analysis based on Apache Hadoop. It allows GATE or UIMA applications to be deployed and uses a simple representation format that serves as common ground between UIMA- and GATE-generated annotations, making the two systems compatible. Because it is Hadoop-based, it benefits from Hadoop's features, namely scalability, fault tolerance and, most notably, the backing of a thriving open source community. Behemoth already interacts, or will interact, with several open source projects such as Nutch, Tika, Mahout and HBase.

One of Behemoth's main aims is to simplify the deployment of document analysers on a large scale. It also provides converters from common data formats (WARC, Nutch, etc.) and a sandbox where users can share Hadoop MapReduce applications that make use of the annotations.

This talk will give an overview of Behemoth along with concrete examples of its use. I will also describe short-term plans and future developments.

Berlin Buzzwords 2010 is a conference for developers and users of open source software projects, focussing on scalable search, data analysis in the cloud and NoSQL databases. Berlin Buzzwords presents more than 30 talks and presentations by international speakers, organised around the three tags "search", "store" and "scale".
