org.terrier.indexing (Terrier 4.0 API)

Interface Summary
Interface	Description
Collection	This interface encapsulates the most fundamental concept to indexing with Terrier - a Collection.
Document	This interface encapsulates the concept of a document during indexing.
DocumentExtractor	Deprecated
Tokenizer	The specification of the interface implemented by tokeniser classes.

Class Summary
Class	Description
CollectionFactory	Implements a factory for Collection objects.
FileDocument	Models a document which corresponds to one file.
MSExcelDocument	Deprecated
MSPowerPointDocument	Deprecated
MSWordDocument	Deprecated
PDFDocument	Implements a Document object for reading PDF documents, using Apache PDFBox.
POIDocument	Represents Microsoft Office documents, which are parsed by the Apache POI library.
SimpleFileCollection	Implements a collection that can read arbitrary files on disk.
SimpleMedlineXMLCollection	Initial implementation of a class that generates a Collection with Documents from a series of XML files in the Medline format.
SimpleXMLCollection	Initial implementation of a class that generates a Collection with Documents from a series of XML files.
TaggedDocument	Models a tagged document (e.g., an HTML or TREC document).
TRECCollection	Models a TREC test collection by implementing the interfaces Collection and DocumentExtractor.
TRECFullTokenizer	This class is the tokenizer used for indexing TREC topic files.
TRECUTFCollection	Deprecated
TRECWebCollection	Version of TRECCollection which can parse standard form DOCHDR tags in TREC Web corpora.
TwitterJSONCollection	This class represents a collection of tweets stored in JSON format.
TwitterJSONDocument	This is a Terrier Document implementation of a Tweet stored in JSON format.
WARC018Collection	This object is used to parse WARC format web crawls, version 0.18.
WARC09Collection	This object is used to parse WARC format web crawls, version 0.9.
WARC10Collection	This object is used to parse WARC format web crawls, version 0.10.

Package org.terrier.indexing Description

Provides classes and interfaces related to the indexing of documents. There are three main abstract concepts that are related to the code of this package.

The first is the concept of a Collection of documents. This can be a standard TREC test collection, or a connection to a database from where the documents are extracted.

The second abstraction is the concept of a Document. An implementation of a collection should iterate through the documents in the collection and return them one at a time. The document encapsulates the parser required to extract the information to index. Implementations of documents are provided for TREC documents, PDF documents and standard Microsoft Office formats, such as MS Word, MS PowerPoint and MS Excel.
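
The relationship between a collection and its documents can be illustrated with a minimal, self-contained sketch. The types below (ToyCollection, ToyDocument, PlainTextDocument, InMemoryCollection) are hypothetical stand-ins invented for this illustration. They are not the real org.terrier.indexing interfaces, whose method names and signatures may differ. The sketch only shows the division of labour: the collection hands out documents one at a time, and each document hides its own parsing behind a term-by-term interface.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

// Toy illustration only: these types are hypothetical stand-ins and are
// NOT the real org.terrier.indexing.Collection / Document interfaces.
class CollectionSketch {

    /** A document hands out its terms one at a time, hiding the parsing. */
    interface ToyDocument {
        boolean hasMoreTerms();
        String nextTerm();
    }

    /** A collection hands out its documents one at a time. */
    interface ToyCollection {
        boolean hasMoreDocuments();
        ToyDocument nextDocument();
    }

    /** Plain-text document: lowercases, then splits on non-alphanumeric characters. */
    static final class PlainTextDocument implements ToyDocument {
        private final Iterator<String> terms;

        PlainTextDocument(String text) {
            List<String> parsed = new ArrayList<>();
            for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!token.isEmpty()) {   // a leading delimiter yields an empty token
                    parsed.add(token);
                }
            }
            this.terms = parsed.iterator();
        }

        @Override public boolean hasMoreTerms() { return terms.hasNext(); }
        @Override public String nextTerm() { return terms.next(); }
    }

    /** In-memory collection backed by a list of raw strings. */
    static final class InMemoryCollection implements ToyCollection {
        private final Iterator<String> rawDocuments;

        InMemoryCollection(List<String> rawDocuments) {
            this.rawDocuments = rawDocuments.iterator();
        }

        @Override public boolean hasMoreDocuments() { return rawDocuments.hasNext(); }

        @Override public ToyDocument nextDocument() {
            // The collection picks the document type; the document owns the parser.
            return new PlainTextDocument(rawDocuments.next());
        }
    }

    /** Drains a collection and returns the terms of each document, in order. */
    static List<List<String>> readAll(ToyCollection collection) {
        List<List<String>> result = new ArrayList<>();
        while (collection.hasMoreDocuments()) {
            ToyDocument document = collection.nextDocument();
            List<String> terms = new ArrayList<>();
            while (document.hasMoreTerms()) {
                terms.add(document.nextTerm());
            }
            result.add(terms);
        }
        return result;
    }

    public static void main(String[] args) {
        ToyCollection collection = new InMemoryCollection(
                Arrays.asList("Terrier indexes documents.", "A second, short document"));
        System.out.println(readAll(collection));
    }
}
```

In this sketch, supporting a new file format would only require another ToyDocument implementation. The loop in readAll, which plays the role of an indexer, would stay unchanged.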

The third abstraction is related to the Indexer, the process that iterates through the documents of a collection and creates the necessary data structures. There are several implemented indexers:

BasicIndexer - indexes a Collection without recording position information. A DirectIndex is also built.
BlockIndexer - as BasicIndexer, but also records position information.
BasicSinglePassIndexer - creates an index without building a DirectIndex. This approach is inherently more scalable than BasicIndexer.
BlockSinglePassIndexer - as BasicSinglePassIndexer, but also records position information.
Hadoop_BasicSinglePassIndexer - a distributed single-pass indexer that makes use of a Hadoop MapReduce cluster.
Hadoop_BlockSinglePassIndexer - as Hadoop_BasicSinglePassIndexer, but also records position information.
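
The difference between indexers that record position information and those that do not can be sketched with a toy inverted index. This is an illustration under stated assumptions, not Terrier's indexers or on-disk data structures: the class PositionIndexSketch and both of its methods are hypothetical, documents are assumed to be already tokenised into lists of terms, and docids are simply list offsets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy illustration only: not Terrier's indexers or index data structures.
class PositionIndexSketch {

    /** term -> (docid -> term frequency): no position information is kept. */
    static Map<String, Map<Integer, Integer>> indexFrequencies(List<List<String>> documents) {
        Map<String, Map<Integer, Integer>> inverted = new TreeMap<>();
        for (int docid = 0; docid < documents.size(); docid++) {
            for (String term : documents.get(docid)) {
                inverted.computeIfAbsent(term, t -> new TreeMap<>())
                        .merge(docid, 1, Integer::sum);   // count occurrences per document
            }
        }
        return inverted;
    }

    /** term -> (docid -> positions): each occurrence's offset is also recorded. */
    static Map<String, Map<Integer, List<Integer>>> indexPositions(List<List<String>> documents) {
        Map<String, Map<Integer, List<Integer>>> inverted = new TreeMap<>();
        for (int docid = 0; docid < documents.size(); docid++) {
            List<String> terms = documents.get(docid);
            for (int position = 0; position < terms.size(); position++) {
                inverted.computeIfAbsent(terms.get(position), t -> new TreeMap<>())
                        .computeIfAbsent(docid, d -> new ArrayList<>())
                        .add(position);
            }
        }
        return inverted;
    }

    public static void main(String[] args) {
        List<List<String>> documents = Arrays.asList(
                Arrays.asList("to", "be", "or", "not", "to", "be"),
                Arrays.asList("to", "index"));
        System.out.println(indexFrequencies(documents));
        System.out.println(indexPositions(documents));
    }
}
```

For the sample input, the term "to" has frequency 2 in document 0 under the first method, and positions 0 and 4 in document 0 under the second. The positional variant stores one entry per occurrence rather than one per document, which is the extra cost paid for being able to answer phrase or proximity queries.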