Real-time Indexing with Terrier

Introduction

In addition to traditional on-disk indices, Terrier provides memory-only and hybrid memory+disk index structures that can be updated dynamically as new documents arrive. Since Terrier 4.0, the top-level Index class has been abstract so that different types of index can be supported. The pre-4.0 index functionality is contained within the IndexOnDisk class, while the new index types enable search systems that can be updated in real time without a lengthy batch indexing process.

Index Interfaces

To support real-time indexing, two new interfaces have been defined: UpdatableIndex and WritableIndex. An index class that implements UpdatableIndex supports the dynamic addition of new documents via an indexDocument() method. When indexDocument() is called, the document is added to the index immediately and is searchable as soon as indexDocument() returns. The WritableIndex interface represents an index that can be written to disk. A class that implements WritableIndex provides a write() method that converts each of the index structures into its equivalent on-disk structure and writes it to a specified path under a named prefix. An index written in this way can later be loaded as an IndexOnDisk index.
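The two roles described above can be illustrated with a small, self-contained sketch. Everything in it is hypothetical: ToyUpdatable, ToyWritable, ToyMemoryIndex, the search() helper and the one-file output format are stand-ins invented for illustration, not Terrier's interfaces or on-disk layout. It only aims to show the contract: a document is searchable as soon as indexDocument() returns, and write() persists the in-memory structures under a path and prefix.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-ins for the two roles; NOT Terrier's actual interfaces.
interface ToyUpdatable {
    int indexDocument(String text); // returns the id assigned to the new document
}

interface ToyWritable {
    void write(Path dir, String prefix) throws IOException;
}

// Toy in-memory inverted index: term -> ids of the documents containing it.
final class ToyMemoryIndex implements ToyUpdatable, ToyWritable {
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    @Override
    public int indexDocument(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue;
            }
            List<Integer> docs = postings.computeIfAbsent(token, t -> new ArrayList<>());
            // record each document at most once per term
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
        return docId; // the postings are already updated, so the document is searchable
    }

    // Illustrative lookup: ids of the documents that contain the given term.
    public List<Integer> search(String term) {
        List<Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? Collections.<Integer>emptyList() : docs;
    }

    @Override
    public void write(Path dir, String prefix) throws IOException {
        Files.createDirectories(dir);
        Path file = dir.resolve(prefix + ".postings");
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(file))) {
            for (Map.Entry<String, List<Integer>> e : postings.entrySet()) {
                out.println(e.getKey() + "\t" + e.getValue());
            }
        }
    }
}
```

In this toy, calling indexDocument() and then search() on the same object returns the new document straight away, with no separate batch step in between. A real implementation would write several structures (lexicon, postings, document index) rather than a single text file.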

Real-time Index Types

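There are two real-time index structures supported in Terrier 4.0: a memory-only index (the MemoryIndex class used in the example below) and a hybrid memory+disk index.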

Usage

Below we give some examples for using the real-time Terrier index structures.


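    // imports needed by this example
    import java.io.Reader;
    import java.io.StringReader;
    import java.util.HashMap;
    import java.util.Map;
    import org.terrier.indexing.FileDocument;
    import org.terrier.indexing.tokenisation.Tokeniser;
    import org.terrier.matching.ResultSet;
    import org.terrier.querying.Manager;
    import org.terrier.querying.SearchRequest;
    import org.terrier.realtime.memory.MemoryIndex;

    // define an example document and query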
    String docContent = "Real-time indexing and retrieval is easy to use in Terrier";
    String query = "Indexing";

    // create a new index
    MemoryIndex memIndex = new MemoryIndex();

    // get the default tokeniser to break the document down into words
    Tokeniser tokeniser = Tokeniser.getTokeniser();

    // create a Terrier document from the content string
    Reader contentReader = new StringReader(docContent);
    Map<String,String> documentProperties = new HashMap<String,String>();
    FileDocument document = new FileDocument(contentReader, documentProperties, tokeniser);

    // index the document
    memIndex.indexDocument(document);

    // the document is now available for searching

    // create a search manager (runs the search process over an index)
    Manager queryingManager = new Manager(memIndex);

    // a search request represents the search to be carried out
    SearchRequest srq = queryingManager.newSearchRequest("query", query);
    srq.setOriginalQuery(query);

    // define a matching model, in this case use the classical BM25 retrieval model
    srq.addMatchingModel("Matching","BM25");

    // run a Terrier search
    queryingManager.runSearchRequest(srq);

    // retrieve the ranked results of the search
    ResultSet results = srq.getResultSet();
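The example above selects BM25 as the matching model. As a rough guide to what that model computes, here is a minimal sketch of the textbook BM25 score for a single query term. The class name Bm25Sketch, the method signature and the parameter values (k1 = 1.2, b = 0.75) are assumptions chosen for illustration; Terrier's own BM25 class may use a different variant of the formula and different defaults.

```java
// Textbook BM25 for one query term in one document.
// Hypothetical helper for illustration; not Terrier's BM25 implementation.
final class Bm25Sketch {
    static final double K1 = 1.2;  // term-frequency saturation (assumed default)
    static final double B = 0.75;  // strength of document-length normalisation (assumed default)

    /**
     * @param tf           occurrences of the term in the document
     * @param docLength    length of the document in tokens
     * @param avgDocLength average document length in the collection
     * @param numDocs      number of documents in the collection
     * @param docFreq      number of documents containing the term
     */
    static double score(double tf, double docLength, double avgDocLength,
                        long numDocs, long docFreq) {
        // smoothed inverse document frequency; the "1 +" keeps it non-negative
        double idf = Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
        // term frequency is dampened by K1 and normalised by relative document length
        double norm = tf + K1 * (1.0 - B + B * docLength / avgDocLength);
        return idf * tf * (K1 + 1.0) / norm;
    }
}
```

A document's score for a multi-term query is the sum of this value over the query terms. The score grows with term frequency but saturates, and longer-than-average documents are penalised.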

Webpage: http://terrier.org
Contact: School of Computing Science
Copyright (C) 2004-2016 University of Glasgow. All Rights Reserved.