uk.ac.gla.terrier.indexing (Terrier Information Retrieval Platform version 1.1.1 API Specification)

Package uk.ac.gla.terrier.indexing

Provides classes and interfaces related to the indexing of documents.

Interface Summary
Collection	This interface encapsulates the most fundamental concept in indexing with Terrier: a Collection.
Document	This interface encapsulates the concept of a document during indexing.
DocumentExtractor	The interface for collection objects that give access to the text (string) of the documents in the collection.
Tokenizer	The specification of the interface implemented by tokeniser classes.

Class Summary
BasicIndexer	BasicIndexer is the default indexer for Terrier.
BlockIndexer	An indexer that saves block information for the indexed terms.
CollectionFactory	Implements a factory for Collection objects.
CreateDocumentInitialWeightIndex	This class creates the initial weight index of all documents in the collection.
CreateTermEstimateIndex	This class creates the term estimate index of all terms in the vocabulary.
FileDocument	Models a document which corresponds to one file.
HTMLDocument	Models an HTML document.
Indexer	Properties: `indexing.max.tokens` - The maximum number of tokens the indexer will attempt to index in a document.
MSExcelDocument	Implements a Document object for a Microsoft Excel spreadsheet.
MSPowerpointDocument	Implements a Document object for reading Microsoft Powerpoint files.
MSWordDocument	This class is used for indexing MS Word document files (i.e. files ending in .doc).
PDFDocument	Implements a Document object for reading PDF documents.
SimpleFileCollection	Implements a collection that can read arbitrary files on disk.
SimpleXMLCollection	Initial implementation of a class that generates a Collection with Documents from a series of XML files.
TRECCollection	Models a TREC test collection by implementing the interfaces Collection and DocumentExtractor.
TRECDocument	Models a document in a TREC collection.
TRECFullTokenizer	This class is the tokenizer used for indexing TREC collections.
TRECUTFCollection	Extends TRECCollection to provide support for indexing TREC collections in non-ASCII character sets.

Package uk.ac.gla.terrier.indexing Description

Provides classes and interfaces related to the indexing of documents. Three main abstractions underlie the code in this package.

The first is the concept of a collection of documents. This can be a standard TREC test collection, or a connection to a database from which the documents are extracted.

The second abstraction is the concept of a document. An implementation of a collection should iterate through the documents in the collection and return them one at a time. Each document encapsulates the parser required to extract the information to index. Implementations of documents are provided for TREC documents, PDF documents and standard Microsoft Office formats, such as MS Word, MS Powerpoint and MS Excel.
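
The relationship between the first two abstractions can be sketched as follows. This is a minimal illustration only: SketchCollection, SketchDocument and their methods are hypothetical stand-ins invented for this example, not the signatures of the Collection and Document interfaces in this package, and the whitespace split stands in for a real format-specific parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustrative stand-ins only: these are NOT the Terrier Collection and
// Document interfaces, just a simplified model of the two abstractions.
class CollectionSketch {

    /** A document wraps its own parser and yields the terms to index. */
    interface SketchDocument {
        boolean hasMoreTerms();
        String nextTerm();
    }

    /** A collection hands out its documents one at a time. */
    interface SketchCollection {
        boolean hasMoreDocuments();
        SketchDocument nextDocument();
    }

    /** Plain-text document: the "parser" is a lowercasing whitespace split. */
    static final class PlainTextDocument implements SketchDocument {
        private final Iterator<String> terms;

        PlainTextDocument(String text) {
            List<String> parsed = new ArrayList<>();
            for (String token : text.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    parsed.add(token.toLowerCase());
                }
            }
            this.terms = parsed.iterator();
        }

        public boolean hasMoreTerms() { return terms.hasNext(); }
        public String nextTerm() { return terms.next(); }
    }

    /** In-memory collection over a fixed list of texts. */
    static final class InMemoryCollection implements SketchCollection {
        private final Iterator<String> texts;

        InMemoryCollection(String... texts) {
            this.texts = Arrays.asList(texts).iterator();
        }

        public boolean hasMoreDocuments() { return texts.hasNext(); }
        public SketchDocument nextDocument() {
            return new PlainTextDocument(texts.next());
        }
    }

    /** Walks the collection and gathers each document's terms in order. */
    static List<List<String>> readAll(SketchCollection collection) {
        List<List<String>> result = new ArrayList<>();
        while (collection.hasMoreDocuments()) {
            SketchDocument doc = collection.nextDocument();
            List<String> terms = new ArrayList<>();
            while (doc.hasMoreTerms()) {
                terms.add(doc.nextTerm());
            }
            result.add(terms);
        }
        return result;
    }
}
```

In this sketch the caller never sees the document format: a PDF or MS Word variant would only need a different SketchDocument implementation, which mirrors why the parser lives inside the document rather than inside the collection.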

The third abstraction is the indexer, the process that iterates through the documents of a collection and creates the necessary data structures. Two indexers are implemented. The first, BasicIndexer, saves field information, if specified; the second, BlockIndexer, saves position (block) information as well.
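
To make the role of the indexer concrete, the following sketch builds a small in-memory inverted index that records term positions. It is an assumption-laden illustration, not the implementation of BasicIndexer or BlockIndexer: the class IndexerSketch, its methods, and the use of nested maps are all invented for this example, and documents are assumed to arrive as already-tokenised lists of terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: not the BasicIndexer or BlockIndexer implementation.
class IndexerSketch {

    /**
     * Builds term -> (document number -> positions of the term in that document).
     * Document numbers are the indices of the input list, starting at 0.
     */
    static Map<String, Map<Integer, List<Integer>>> index(List<List<String>> documents) {
        Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();
        for (int docId = 0; docId < documents.size(); docId++) {
            List<String> terms = documents.get(docId);
            for (int position = 0; position < terms.size(); position++) {
                postings
                    .computeIfAbsent(terms.get(position), t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position);
            }
        }
        return postings;
    }

    /** Frequency-only view: how often a term occurs in one document. */
    static int termFrequency(Map<String, Map<Integer, List<Integer>>> postings,
                             String term, int docId) {
        Map<Integer, List<Integer>> byDoc = postings.get(term);
        if (byDoc == null || !byDoc.containsKey(docId)) {
            return 0; // term absent from the index or from this document
        }
        return byDoc.get(docId).size();
    }
}
```

An indexer that does not need positions could store only the counts returned by termFrequency; keeping the position lists is the extra cost paid for position-aware features.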