Interface Summary | |
---|---|
Collection | This interface encapsulates the most fundamental concept in indexing with Terrier - a Collection. |
Document | This interface encapsulates the concept of a document during indexing. |
DocumentExtractor | Deprecated. |
SinglePassIndexerFlushDelegate | Used by ExtensibleSinglePassIndexer for delegating the flushing of memory. |
Tokenizer | The specification of the interface implemented by tokeniser classes. |
Class Summary | |
---|---|
BasicIndexer | BasicIndexer is the default indexer for Terrier. |
BasicSinglePassIndexer | This class indexes a document collection (skipping the direct file construction). |
BlockIndexer | An indexer that saves block information for the indexed terms. |
BlockSinglePassIndexer | Indexes a document collection saving block information for the indexed terms. |
CollectionFactory | Implements a factory for Collection objects. |
ExtensibleSinglePassIndexer | Directly based on BasicSinglePassIndexer, with just a few modifications to enable some extra hooks. |
FileDocument | Models a document which corresponds to one file. |
HTMLDocument | Deprecated. |
Indexer | Properties: termpipelines - the sequence of TermPipeline stages (e.g. |
MSExcelDocument | Implements a Document object for a Microsoft Excel spreadsheet. |
MSPowerpointDocument | Implements a Document object for reading Microsoft Powerpoint files. |
MSWordDocument | This class is used for indexing MS Word document files (i.e., files ending in .doc). |
PDFDocument | Implements a Document object for reading PDF documents. |
SimpleFileCollection | Implements a collection that can read arbitrary files on disk. |
SimpleMedlineXMLCollection | Initial implementation of a class that generates a Collection with Documents from a series of XML files in the Medline format. |
SimpleXMLCollection | Initial implementation of a class that generates a Collection with Documents from a series of XML files. |
TaggedDocument | Models a tagged document (e.g., an HTML or TREC document). |
TRECCollection | Models a TREC test collection by implementing the interfaces Collection and DocumentExtractor. |
TRECDocument | Deprecated. |
TRECFullTokenizer | This class is the tokenizer used for indexing TREC topic files. |
TRECFullUTFTokenizer | Deprecated. From 3.5, TRECFullTokenizer should be used instead, with trec.encoding set to utf8. |
TRECUTFCollection | Deprecated. |
TRECWebCollection | Version of TRECCollection which can parse standard-form DOCHDR tags in TREC Web corpora. |
WARC018Collection | This object is used to parse WARC format web crawls, version 0.18. |
WARC09Collection | This object is used to parse WARC format web crawls, version 0.9. |
Provides classes and interfaces related to the indexing of documents. The code in this package is organised around three main abstractions.
The first is the concept of a Collection of documents. This can be a standard TREC test collection, or a connection to a database from which the documents are extracted.
The second abstraction is the concept of a Document. An implementation of a collection should iterate through the documents in the collection and return them one at a time. Each document encapsulates the parser required to extract the information to index. Implementations of documents are provided for TREC documents, PDF documents and standard Microsoft Office formats, such as MS Word, MS Powerpoint and MS Excel.
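The relationship between a collection and its documents can be illustrated with a small, self-contained sketch. The interfaces `SketchCollection` and `SketchDocument`, and the classes `PlainTextDocument`, `InMemoryCollection` and `CollectionSketch`, are hypothetical simplifications written for this example; they are not Terrier's actual `Collection` and `Document` interfaces, which may declare different methods.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical, simplified stand-in for a document: yields terms one at a time. */
interface SketchDocument {
    String getNextTerm();      // next term, or null when none is left
    boolean endOfDocument();   // true once every term has been returned
}

/** Hypothetical, simplified stand-in for a collection: yields documents one at a time. */
interface SketchCollection {
    boolean nextDocument();    // advance to the next document; false when exhausted
    SketchDocument getDocument();
}

/** A document whose "parser" simply splits plain text on whitespace. */
class PlainTextDocument implements SketchDocument {
    private final Iterator<String> terms;

    PlainTextDocument(String text) {
        String trimmed = text.trim();
        List<String> split = trimmed.isEmpty()
                ? new ArrayList<String>()
                : Arrays.asList(trimmed.split("\\s+"));
        this.terms = split.iterator();
    }

    public String getNextTerm() { return terms.hasNext() ? terms.next() : null; }
    public boolean endOfDocument() { return !terms.hasNext(); }
}

/** A collection backed by an in-memory list of texts (assumed for illustration). */
class InMemoryCollection implements SketchCollection {
    private final Iterator<String> texts;
    private SketchDocument current;

    InMemoryCollection(List<String> texts) { this.texts = texts.iterator(); }

    public boolean nextDocument() {
        if (!texts.hasNext()) { current = null; return false; }
        current = new PlainTextDocument(texts.next());
        return true;
    }

    public SketchDocument getDocument() { return current; }
}

class CollectionSketch {
    /** Walks the collection and returns the terms of each document, in order. */
    static List<List<String>> readAll(SketchCollection collection) {
        List<List<String>> out = new ArrayList<>();
        while (collection.nextDocument()) {
            SketchDocument doc = collection.getDocument();
            List<String> terms = new ArrayList<>();
            while (!doc.endOfDocument()) {
                terms.add(doc.getNextTerm());
            }
            out.add(terms);
        }
        return out;
    }

    public static void main(String[] args) {
        SketchCollection c = new InMemoryCollection(
                Arrays.asList("terrier indexes documents", "one document per call"));
        System.out.println(readAll(c));
    }
}
```

In this sketch the parsing logic lives entirely inside the document class, so supporting another format would only require another `SketchDocument` implementation; the loop in `readAll` would stay unchanged.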
The third abstraction is the Indexer, the process that iterates through the documents of a collection and creates the necessary data structures. Several indexers are implemented: BasicIndexer, BlockIndexer, BasicSinglePassIndexer, BlockSinglePassIndexer and ExtensibleSinglePassIndexer.
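As a rough illustration of what "iterating through the documents and creating the necessary data structures" can mean, the sketch below builds a toy in-memory inverted index. The class `ToyIndexer` and its fields are hypothetical and are not how Terrier's indexers are implemented. Documents are represented simply as lists of terms, standing in for whatever a Document would yield.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy indexer: an illustration of the indexing loop only, not Terrier's implementation. */
class ToyIndexer {
    /** term -> (document id -> term frequency); TreeMap keeps iteration order stable. */
    final Map<String, Map<Integer, Integer>> inverted = new TreeMap<>();
    /** Position i holds the number of terms indexed for document i. */
    final List<Integer> docLengths = new ArrayList<>();

    /** Visits each document once, assigning ids in arrival order and counting terms. */
    void index(List<List<String>> documents) {
        for (List<String> terms : documents) {
            int docId = docLengths.size();
            for (String term : terms) {
                inverted.computeIfAbsent(term, t -> new TreeMap<>())
                        .merge(docId, 1, Integer::sum);
            }
            docLengths.add(terms.size());
        }
    }

    public static void main(String[] args) {
        ToyIndexer indexer = new ToyIndexer();
        indexer.index(Arrays.asList(
                Arrays.asList("terrier", "indexes", "documents"),
                Arrays.asList("documents", "contain", "terms", "terms")));
        System.out.println(indexer.inverted);
        System.out.println(indexer.docLengths);
    }
}
```

The sketch keeps everything in memory for brevity; a real indexer would also need to write its structures to disk and handle collections larger than memory.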