Package | Description |
---|---|
org.terrier.indexing | Provides classes and interfaces related to the indexing of documents. |
org.terrier.indexing.tokenisation | Provides classes related to the tokenisation of documents. |

Modifier and Type | Field | Description |
---|---|---|
protected Tokeniser | WARC09Collection.tokeniser | Tokeniser to use for all documents parsed by this class. |
protected Tokeniser | WARC018Collection.tokeniser | Tokeniser to use for all documents parsed by this class. |
protected Tokeniser | TRECCollection.tokeniser | |
protected Tokeniser | TaggedDocument.tokeniser | |
protected Tokeniser | SimpleFileCollection.tokeniser | |

Constructor | Description |
---|---|
FileDocument(InputStream docStream, Map<String,String> docProperties, Tokeniser tok) | Constructs an instance of the FileDocument from the given input stream. |
FileDocument(Reader docReader, Map<String,String> docProperties, Tokeniser tok) | Creates a document for a file. |
FileDocument(String _filename, InputStream docStream, Tokeniser tok) | Creates a document for a file. |
FileDocument(String _filename, Reader docReader, Tokeniser tok) | Creates a document for a file. |
MSExcelDocument(InputStream docStream, Map<String,String> docProperties, Tokeniser tok) | Deprecated. |
MSExcelDocument(String filename, InputStream docStream, Tokeniser tokeniser) | Deprecated. |
MSPowerPointDocument(InputStream docStream, Map<String,String> docProperties, Tokeniser tok) | Deprecated. |
MSPowerPointDocument(String filename, InputStream docStream, Tokeniser tokeniser) | Deprecated. |
MSWordDocument(InputStream docStream, Map<String,String> docProperties, Tokeniser tok) | Deprecated. |
MSWordDocument(String filename, InputStream docStream, Tokeniser tokeniser) | Deprecated. |
PDFDocument(InputStream docStream, Map<String,String> docProperties, Tokeniser tok) | Constructs a new PDFDocument. |
PDFDocument(Reader docReader, Map<String,String> docProperties, Tokeniser tok) | Constructs a new PDFDocument. |
PDFDocument(String filename, InputStream docStream, Tokeniser tokeniser) | Constructs a new PDFDocument that converts docStream, which represents the file, into a Document object from which an Indexer can retrieve a stream of terms. |
PDFDocument(String filename, Reader docReader, Tokeniser tok) | Constructs a new PDFDocument. |
POIDocument(InputStream docStream, Map<String,String> docProperties, Tokeniser tok) | Constructs a new POIDocument object for the file represented by docStream. |
POIDocument(String filename, InputStream docStream, Tokeniser tokeniser) | Constructs a new POIDocument object for the file represented by docStream. |
TaggedDocument(InputStream docStream, Map<String,String> docProperties, Tokeniser _tokeniser) | Constructs an instance of the class from the given input stream. |
TaggedDocument(InputStream docStream, Map<String,String> docProperties, Tokeniser _tokeniser, String doctags, String exactdoctags, String fieldtags) | Constructs an instance of the class from the given input stream. |
TaggedDocument(Reader docReader, Map<String,String> docProperties, Tokeniser _tokeniser) | Constructs an instance of the class from the given reader object. |
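
Every constructor listed here takes the tokeniser as an argument instead of creating one internally, so the caller decides how a document is split into terms. The following is a minimal, self-contained sketch of that pattern only. `SketchTokeniser`, `SketchDocument`, `whitespaceTokens` and `build` are hypothetical names introduced for illustration; they are not Terrier classes and make no claim about how Terrier's documents work internally.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class ConstructorInjectionSketch {

    /** Hypothetical stand-in for a tokeniser; NOT Terrier's Tokeniser class. */
    interface SketchTokeniser {
        List<String> tokens(Reader reader);
    }

    /** Hypothetical stand-in for a document that receives its tokeniser as a constructor argument. */
    static final class SketchDocument {
        private final Map<String, String> properties;
        private final List<String> terms;

        SketchDocument(Reader docReader, Map<String, String> docProperties, SketchTokeniser tok) {
            this.properties = docProperties;
            this.terms = tok.tokens(docReader); // the injected tokeniser decides what a term is
        }

        List<String> terms() {
            return terms;
        }

        String property(String key) {
            return properties.get(key);
        }
    }

    /** Illustrative tokeniser: reads the whole stream and splits it on whitespace. */
    static List<String> whitespaceTokens(Reader reader) {
        StringBuilder text = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        List<String> out = new ArrayList<>();
        for (String token : text.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    /** Builds a document from in-memory text, mirroring the (Reader, Map, tokeniser) argument order. */
    static SketchDocument build(String text, String filename) {
        return new SketchDocument(
                new StringReader(text),
                Collections.singletonMap("filename", filename),
                ConstructorInjectionSketch::whitespaceTokens);
    }

    public static void main(String[] args) {
        SketchDocument doc = build("hello tokenised world", "example.txt");
        System.out.println(doc.property("filename") + " -> " + doc.terms());
    }
}
```

Swapping the third constructor argument for a different tokeniser changes the terms produced without touching the document class, which is presumably why the constructors above expose the parameter.
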

Modifier and Type | Class | Description |
---|---|---|
class | EnglishTokeniser | Tokenises text obtained from a text stream assuming English language. |
class | IdentityTokeniser | A Tokeniser implementation that returns the input as is. |
class | UTFTokeniser | Tokenises text obtained from a text stream. |
class | UTFTwitterTokeniser | A tokeniser designed for use on tweets. |
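
To make the difference between these strategies concrete, here is a rough sketch of three tokenising functions: one that returns the input untouched, one that keeps only ASCII letters and digits, and one that accepts any Unicode letter or digit. These are simplified, hypothetical approximations written for illustration. The lowercasing and the exact character rules are assumptions, and the real classes may apply further rules that are not shown on this page.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.IntPredicate;

class TokeniserContrastSketch {

    /** Identity-style: the whole input comes back unchanged as a single token. */
    static List<String> identityTokens(String text) {
        return Collections.singletonList(text);
    }

    /** English-style simplification (assumption): only ASCII letters and digits form tokens. */
    static List<String> asciiTokens(String text) {
        return split(text, c -> (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9'));
    }

    /** Unicode-style simplification (assumption): any Unicode letter or digit forms tokens. */
    static List<String> unicodeTokens(String text) {
        return split(text, Character::isLetterOrDigit);
    }

    /** Collects lowercased runs of accepted characters; every other character acts as a separator. */
    private static List<String> split(String text, IntPredicate isTokenChar) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar.test(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                out.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Caf\u00e9 au lait, 2 cups!";
        System.out.println(identityTokens(text));
        System.out.println(asciiTokens(text));
        System.out.println(unicodeTokens(text));
    }
}
```

In this sketch the ASCII variant cuts a word at any accented character, while the Unicode variant keeps it whole, so the choice of tokeniser matters as soon as a collection contains non-English text.
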

Modifier and Type | Method | Description |
---|---|---|
static Tokeniser | Tokeniser.getTokeniser() | Instantiates Tokeniser class named in the tokeniser property. |

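
Instantiating a class named in a property is typically done with reflection. The sketch below shows one way such a factory could look in plain Java. Only the property name `tokeniser` comes from the description above; `PropertyFactorySketch`, `SketchTokeniser`, the two nested implementations, the fallback default and the use of `java.util.Properties` are all assumptions made for illustration, and this is not Terrier's actual implementation.

```java
import java.util.Arrays;
import java.util.Properties;

class PropertyFactorySketch {

    /** Hypothetical stand-in for a tokeniser; NOT Terrier's Tokeniser class. */
    interface SketchTokeniser {
        String[] tokens(String text);
    }

    /** Hypothetical implementation: splits the input on whitespace. */
    static class WhitespaceSketchTokeniser implements SketchTokeniser {
        public String[] tokens(String text) {
            return text.trim().split("\\s+");
        }
    }

    /** Hypothetical implementation: returns the input as a single token. */
    static class WholeInputSketchTokeniser implements SketchTokeniser {
        public String[] tokens(String text) {
            return new String[] { text };
        }
    }

    /** Reads a class name from the "tokeniser" property and instantiates it reflectively. */
    static SketchTokeniser fromProperties(Properties props) {
        // The fallback default is an assumption of this sketch.
        String name = props.getProperty("tokeniser", WhitespaceSketchTokeniser.class.getName());
        try {
            return Class.forName(name)
                    .asSubclass(SketchTokeniser.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot instantiate tokeniser: " + name, e);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("tokeniser", WholeInputSketchTokeniser.class.getName());
        System.out.println(Arrays.toString(fromProperties(props).tokens("a b c")));
    }
}
```

A factory of this kind lets the tokeniser be switched through configuration alone, at the cost of deferring class-name errors to run time, which is why the sketch wraps reflective failures in a single descriptive exception.
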
Terrier 4.0. Copyright © 2004-2014 University of Glasgow