

Talks in the search track are mostly concerned with searching for information, full-text search, Lucene, etc.

Nutch as a web mining platform - the present and the future

Mon, 2010-06-07 16:30 - 17:15
Andrzej Bialecki

The Nutch platform for building large-scale search engines continues to serve as a flagship example of Hadoop-based applications. This talk will start with an overview of the Nutch architecture, present some less typical uses of Nutch as a web mining platform (based on real use cases), and outline a new range of applications expected as a result of the ongoing redesign of the platform.

MetaCarta GeoSearch Toolkit for Solr

Mon, 2010-06-07 13:55 - 14:25
James Goodwin

This presentation is an introduction to the capabilities, design and architecture of the MetaCarta GeoSearch Toolkit for Solr. The toolkit enables GeoTagging of fields using MetaCarta's GeoTagger as they are indexed by Solr, and the indexing of the resulting geographic information. At query time, it allows a geographic query to be combined with any other full-text query and applies a GeoRelevance ranking to boost results where the query terms are more geographically relevant. It also enables filtering on other geographic metadata.

Real Time Search with Lucene

Mon, 2010-06-07 14:35 - 15:05
Michael Busch

Lucene has had, for a while already, a nice feature that we call "Near-realtime search" (NRT). The approach works well for a lot of applications, but we're currently working on an even better real-time solution in Lucene: directly searching IndexWriter's RAM buffer while documents are being added! This will dramatically improve indexing performance compared to NRT, and the search latency (the time it takes for a document to become searchable) will shrink to a minimum. Hence we will scratch the N in NRT!

Introduction to Collaborative Filtering using Mahout

Tue, 2010-06-08 15:10 - 15:40
Frank Scholten

Collaborative filtering is a popular research area that is widely applied by many websites today. Amazon, Netflix and Google all use techniques from this field to recommend movies, books, music and so on to their visitors. This talk introduces you to collaborative filtering and shows how to create a recommendation engine using Apache Mahout / Taste. We start by covering basic concepts and discuss algorithms, Taste's architecture and extension points. We explain Taste along with a web application, built with Apache Wicket and Taste, that demonstrates several features.

Simple co-occurrence-based recommendation on Hadoop

Tue, 2010-06-08 15:50 - 16:35
Sean Owen

Recommender engines thrive on data -- lots of data. As such, scale inevitably becomes a challenge for recommenders. Distributed computing frameworks like Hadoop offer the infrastructure for applying many machines to such problems, and Apache Mahout has recently provided some first truly distributed recommender algorithms based on Hadoop. This talk explores the first such implementation, a simple algorithm based on item co-occurrence. We focus on how the algorithm is fit into a map-reduce paradigm, and how issues of scale inform the implementation.
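
As a rough illustration of the idea behind item co-occurrence, here is a minimal single-machine sketch in Python. It is not Mahout's implementation and does not use Hadoop: the function names (`cooccurrence_counts`, `recommend`) and the toy data are assumptions made up for this example. In a map-reduce setting, the pair counting would presumably be spread across many machines rather than done in one loop.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence_counts(user_items):
    """Count, for each ordered pair of distinct items, how many users have both."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in user_items.values():
        # set() guards against the same item appearing twice for one user
        for a, b in permutations(set(items), 2):
            counts[a][b] += 1
    return counts


def recommend(user, user_items, counts, top_n=3):
    """Score unseen items by summing their co-occurrence with the user's items."""
    seen = set(user_items[user])
    scores = defaultdict(int)
    for item in seen:
        for other, count in counts[item].items():
            if other not in seen:
                scores[other] += count
    # Highest score first; ties broken alphabetically for a stable result
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


# Hypothetical toy data: which user has interacted with which items
user_items = {
    "alice": ["book_a", "book_b", "book_c"],
    "bob": ["book_a", "book_b"],
    "carol": ["book_b", "book_c", "book_d"],
    "dave": ["book_a"],
}

counts = cooccurrence_counts(user_items)
print(recommend("dave", user_items, counts))  # ['book_b', 'book_c']
```

In this toy data, book_a co-occurs with book_b for two users and with book_c for one, so a user who has only seen book_a is offered book_b first.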

Behemoth - a Hadoop based platform for large scale document processing

Tue, 2010-06-08 14:10 - 14:40
Julien Nioche

Behemoth is an open source platform for large scale document analysis based on Apache Hadoop. It allows GATE or UIMA applications to be deployed and uses a simple representation format that serves as common ground between UIMA- and GATE-generated annotations, hence achieving compatibility between both systems. Since it is Hadoop-based, it benefits from all of Hadoop's features, namely scalability, fault-tolerance and, most notably, the backing of a thriving open source community. Behemoth already interacts, or will interact, with quite a few open source projects such as Nutch, Tika, Mahout and HBase.

ElasticSearch - You Know, for Search

Mon, 2010-06-07 15:30 - 16:15
Shay Banon

This presentation will cover the basics of ElasticSearch, an open source, distributed, RESTful search engine. We will go over what it means to develop a distributed search engine with "data grid level" features, as well as cover basic ElasticSearch functionality. We will also cover how search and NoSQL solutions integrate and cooperate.

Text and metadata extraction with Apache Tika

Mon, 2010-06-07 13:15 - 13:45
Jukka Zitting

Apache Tika is a toolkit for extracting text and metadata from digital documents. It's the perfect companion to search engines and any other applications where it's useful to know more than just the name and size of a file. Powered by parser libraries like Apache POI and PDFBox, Tika offers a simple and unified way to access content in dozens of document formats.

Finite-State Queries in Lucene

Mon, 2010-06-07 11:30 - 12:15
Robert Muir

This talk focuses on how, in an upcoming version of Lucene, you will be able to run scalable 'inexact' queries such as pattern-matching, fuzzy, etc.

In current versions of Lucene these queries are not very scalable.

Lucene Forecast - Version, Unicode, Flex and Modules

Mon, 2010-06-07 10:35 - 11:20
Simon Willnauer
Uwe Schindler

Since Apache Lucene moved to Java 5 in November 2009, several new features and concepts have been introduced. From maintaining version-by-version backwards compatibility to fully enabled Unicode 4.0 support and the recently merged "Flexible-Indexing" branch, future versions are ushering in a new era of open source full-text search. During spring 2010, Lucene and Solr development merged, leading to even closer collaboration and more flexible modularization.
