Behemoth is an open source platform for large-scale document analysis based on Apache Hadoop. It lets users deploy GATE or UIMA applications and uses a simple representation format that serves as common ground between UIMA- and GATE-generated annotations, making the two systems compatible. Because it is Hadoop-based, it inherits Hadoop's scalability and fault tolerance and, most notably, the backing of a thriving open source community. Behemoth already interacts, or will interact, with several open source projects such as Nutch, Tika, Mahout and HBase.
Behemoth's main aims are to simplify the deployment of document analysers on a large scale, to provide converters from common data formats (WARC, Nutch, etc.), and to offer a sandbox where users can share Hadoop MapReduce applications that use the annotations.
This talk will give an overview of Behemoth along with concrete examples of its use. I will also describe future developments and short-term plans.