Package | Description |
---|---|
org.terrier.indexing | Provides classes and interfaces related to the indexing of documents. |
org.terrier.realtime | Provides index structures that support updating and real-time retrieval. |
org.terrier.realtime.incremental | Provides incremental indexing functionality. |
org.terrier.realtime.memory | Provides MemoryIndex structures. |
org.terrier.realtime.memory.fields | Provides MemoryIndex structures that support field search. |
org.terrier.structures.indexing.singlepass | Provides implementations of the structures needed for single-pass indexing. |
org.terrier.structures.indexing.singlepass.hadoop | Provides classes implementing Hadoop MapReduce indexing in Terrier. |
Modifier and Type | Class and Description |
---|---|
class | FileDocument: Models a document which corresponds to one file. |
class | MSExcelDocument: Deprecated. |
class | MSPowerPointDocument: Deprecated. |
class | MSWordDocument: Deprecated. |
class | PDFDocument: Implements a Document object for reading PDF documents, using Apache PDFBox. |
class | POIDocument: Represents Microsoft Office documents, which are parsed by the Apache POI library. |
class | TaggedDocument: Models a tagged document (e.g., an HTML or TREC document). |
class | TwitterJSONDocument: A Terrier Document implementation of a Tweet stored in JSON format. |
Modifier and Type | Field and Description |
---|---|
protected Document | TwitterJSONCollection.currentDocument: The current document. |
Modifier and Type | Field and Description |
---|---|
protected Class<? extends Document> | MultiDocumentFileCollection.documentClass: Class to use for all documents parsed by this class. |
protected Map<String,Class<? extends Document>> | SimpleFileCollection.extension_DocumentClass: Maps filename extensions to Document classes. |
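The extension-to-class mapping described for `SimpleFileCollection.extension_DocumentClass` can be illustrated with a small, self-contained sketch. It does not use Terrier: `Doc`, `PlainDoc`, `MarkupDoc`, `EXTENSION_TO_CLASS` and `makeDoc` are hypothetical stand-ins, and the lookup-then-reflect strategy is an assumption about how such a map might be used, not a description of Terrier's implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ExtensionDispatchSketch {
    /** Hypothetical stand-in for a document type; not Terrier's Document interface. */
    interface Doc { String kind(); }

    /** Hypothetical parsers, each exposing an (InputStream) constructor. */
    static class PlainDoc implements Doc {
        PlainDoc(InputStream in) { }
        public String kind() { return "plain"; }
    }
    static class MarkupDoc implements Doc {
        MarkupDoc(InputStream in) { }
        public String kind() { return "markup"; }
    }

    // Maps lower-case filename extensions to document classes (assumed contents).
    static final Map<String, Class<? extends Doc>> EXTENSION_TO_CLASS = new HashMap<>();
    static {
        EXTENSION_TO_CLASS.put("txt", PlainDoc.class);
        EXTENSION_TO_CLASS.put("html", MarkupDoc.class);
    }

    /** Returns a document for the file, or null when no parser is registered. */
    static Doc makeDoc(String filename, InputStream in) throws ReflectiveOperationException {
        int dot = filename.lastIndexOf('.');
        if (dot < 0) return null; // no extension, so no parser can be chosen
        String ext = filename.substring(dot + 1).toLowerCase(Locale.ROOT);
        Class<? extends Doc> cls = EXTENSION_TO_CLASS.get(ext);
        if (cls == null) return null; // unknown extension
        // Instantiate the chosen parser through its (InputStream) constructor.
        return cls.getDeclaredConstructor(InputStream.class).newInstance(in);
    }

    public static void main(String[] args) throws Exception {
        InputStream in = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        Doc d = makeDoc("notes.TXT", in);
        System.out.println(d == null ? "no parser" : d.kind()); // prints "plain"
    }
}
```

Keeping the mapping in data rather than in a chain of conditionals means a new file type only needs one more map entry.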
Modifier and Type | Method and Description |
---|---|
static Document | TaggedDocument.generateDocumentFromFile(String filename): Instantiates a TREC document from a file. |
Document | TwitterJSONCollection.getDocument() |
Document | Collection.getDocument(): Get the document object representing the current document. |
Document | SimpleXMLCollection.getDocument(): Get the document object representing the current document. |
Document | SimpleFileCollection.getDocument(): Return the current document in the collection. |
abstract Document | MultiDocumentFileCollection.getDocument() |
Document | WARC09Collection.getDocument(): Get the document object representing the current document. |
Document | WARC018Collection.getDocument(): Get the document object representing the current document. |
Document | TRECCollection.getDocument(): Returns the current document to process. |
protected Document | SimpleFileCollection.makeDocument(String Filename, InputStream in): Given the opened stream in for the file named Filename, work out which parser to try, and instantiate it. |
Document | SimpleXMLCollection.next(): Get the next document. |
Document | SimpleFileCollection.next(): Move on to the next document in the collection to be processed. |
Document | MultiDocumentFileCollection.next(): Return the next document. |
Document | TRECCollection.next(): Return the next document. |
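The `getDocument()` methods listed above return the document a collection is currently positioned on. A minimal sketch of the resulting cursor-style loop follows. It uses stand-in types rather than Terrier's own: `Doc`, `DocCollection`, `ListDoc` and `ListCollection` are hypothetical, and the method names `nextDocument()`, `getNextTerm()` and `endOfDocument()` are assumptions of this sketch, since only `getDocument()` and `next()` appear in the table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class CollectionLoopSketch {
    /** Stand-in document: a stream of terms (assumed shape). */
    interface Doc {
        String getNextTerm();
        boolean endOfDocument();
    }

    /** Stand-in collection: a cursor over documents (assumed shape). */
    interface DocCollection {
        boolean nextDocument(); // advance; false when no documents remain
        Doc getDocument();      // the document the cursor is currently on
    }

    /** Splits non-empty text on whitespace and serves the terms one by one. */
    static class ListDoc implements Doc {
        private final Iterator<String> terms;
        ListDoc(String text) { terms = Arrays.asList(text.trim().split("\\s+")).iterator(); }
        public String getNextTerm() { return terms.next(); }
        public boolean endOfDocument() { return !terms.hasNext(); }
    }

    /** Serves one ListDoc per input string. */
    static class ListCollection implements DocCollection {
        private final Iterator<String> texts;
        private Doc current;
        ListCollection(List<String> texts) { this.texts = texts.iterator(); }
        public boolean nextDocument() {
            if (!texts.hasNext()) return false;
            current = new ListDoc(texts.next());
            return true;
        }
        public Doc getDocument() { return current; }
    }

    /** Walks every document in the collection and gathers its terms in order. */
    static List<String> allTerms(DocCollection c) {
        List<String> out = new ArrayList<>();
        while (c.nextDocument()) {
            Doc d = c.getDocument();
            while (!d.endOfDocument()) out.add(d.getNextTerm());
        }
        return out;
    }

    public static void main(String[] args) {
        DocCollection c = new ListCollection(Arrays.asList("terrier indexes documents", "real time"));
        System.out.println(allTerms(c)); // prints [terrier, indexes, documents, real, time]
    }
}
```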
Modifier and Type | Method and Description |
---|---|
static void | TaggedDocument.dumpDocument(Document d): Dumps a document to stdout. |
Modifier and Type | Method and Description |
---|---|
boolean | UpdatableIndex.addToDocument(int docid, Document doc): Adds the specified content to the document with the given docid. |
void | UpdatableIndex.indexDocument(Document doc): Add a new document to the index. |
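The difference between the two `UpdatableIndex` operations (adding a new document versus adding content to an existing one) can be shown with a toy in-memory inverted index. This is a sketch under stated assumptions and not Terrier code: documents are plain term lists, docids are assigned sequentially from 0, and returning `false` for an unknown docid is a guess, since the table does not say what the boolean result means. The helpers `frequency` and `length` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UpdatableIndexSketch {
    // term -> (docid -> term frequency)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();
    // docid -> document length in terms; the list index is the docid
    private final List<Integer> docLengths = new ArrayList<>();

    /** Adds a new document; it receives the next free docid (assumed policy). */
    void indexDocument(List<String> terms) {
        int docid = docLengths.size();
        docLengths.add(0);
        addToDocument(docid, terms);
    }

    /** Adds more content to an existing document; false if the docid is unknown (assumed). */
    boolean addToDocument(int docid, List<String> terms) {
        if (docid < 0 || docid >= docLengths.size()) return false;
        for (String t : terms) {
            // Increment the frequency of t in docid, creating entries as needed.
            postings.computeIfAbsent(t, k -> new HashMap<>()).merge(docid, 1, Integer::sum);
        }
        docLengths.set(docid, docLengths.get(docid) + terms.size());
        return true;
    }

    /** Hypothetical helper: frequency of a term in a document, 0 if absent. */
    int frequency(String term, int docid) {
        Map<Integer, Integer> p = postings.get(term);
        return p == null ? 0 : p.getOrDefault(docid, 0);
    }

    /** Hypothetical helper: length of a known document in terms. */
    int length(int docid) { return docLengths.get(docid); }

    public static void main(String[] args) {
        UpdatableIndexSketch idx = new UpdatableIndexSketch();
        idx.indexDocument(Arrays.asList("real", "time", "retrieval")); // becomes docid 0
        idx.addToDocument(0, Arrays.asList("real", "indexing"));
        System.out.println(idx.frequency("real", 0));                  // prints 2
        System.out.println(idx.length(0));                             // prints 5
        System.out.println(idx.addToDocument(7, Arrays.asList("x")));  // prints false
    }
}
```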
Modifier and Type | Method and Description |
---|---|
boolean | IncrementalIndex.addToDocument(int docid, Document doc) |
void | IncrementalIndex.indexDocument(Document doc): Update the index with a new document. |
Modifier and Type | Method and Description |
---|---|
boolean | MemoryIndex.addToDocument(int docid, Document doc): Adds the specified content to the document with the given docid. |
void | MemoryIndex.indexDocument(Document doc): Index a new document. |
Modifier and Type | Method and Description |
---|---|
void | MemoryFieldsIndex.indexDocument(Document doc): Index a new document. |
Modifier and Type | Method and Description |
---|---|
protected abstract void | ExtensibleSinglePassIndexer.preProcess(Document doc, String term): Perform an operation before the term pipeline is initiated. |
Modifier and Type | Method and Description |
---|---|
SplitAwareWrapper<Document> | CollectionRecordReader.createValue(): Creates a new value; each value is a document. |
org.apache.hadoop.mapred.RecordReader<org.apache.hadoop.io.Text,SplitAwareWrapper<Document>> | MultiFileCollectionInputFormat.getRecordReader(org.apache.hadoop.mapred.InputSplit genericSplit, org.apache.hadoop.mapred.JobConf job, org.apache.hadoop.mapred.Reporter reporter) |
Modifier and Type | Method and Description |
---|---|
void | Hadoop_BasicSinglePassIndexer.map(org.apache.hadoop.io.Text key, SplitAwareWrapper<Document> value, org.apache.hadoop.mapred.OutputCollector<SplitEmittedTerm,MapEmittedPostingList> _outputPostingListCollector, org.apache.hadoop.mapred.Reporter reporter): Map processes a single document. |
boolean | CollectionRecordReader.next(org.apache.hadoop.io.Text DocID, SplitAwareWrapper<Document> document): Moves to the next Document in the Collections accessing this InputSplit, if one exists, setting DocID to the "DOCID" property and document to the text within the document. |
Terrier Information Retrieval Platform 4.1. Copyright © 2004-2015, University of Glasgow