| Package | Description | 
|---|---|
| org.terrier.indexing | Provides classes and interfaces related to the indexing of documents. |
| org.terrier.indexing.tokenisation | Provides classes related to the tokenisation of documents. |
| Modifier and Type | Field and Description | 
|---|---|
| protected Tokeniser | TaggedDocument.tokeniser |
| protected Tokeniser | SimpleFileCollection.tokeniser |
| protected Tokeniser | MultiDocumentFileCollection.tokeniser: Tokeniser to use for all documents parsed by this class. |
| protected Tokeniser | FlatJSONDocument.tokenizer |
| Modifier and Type | Method and Description | 
|---|---|
| static Document | IndexTestUtils.makeDocumentFromText(String contents, Map<String,String> docProperties, Tokeniser t) |
| Constructor and Description | 
|---|
| FileDocument(InputStream docStream, Map<String,String> docProperties, Tokeniser tok): Constructs an instance of the FileDocument from the given input stream. |
| FileDocument(Reader docReader, Map<String,String> docProperties, Tokeniser tok): Creates a document for a file. |
| FileDocument(String _filename, InputStream docStream, Tokeniser tok): Creates a document for a file. |
| FileDocument(String _filename, Reader docReader, Tokeniser tok): Creates a document for a file. |
| MSExcelDocument(InputStream docStream, Map<String,String> docProperties, Tokeniser tok): Deprecated. |
| MSExcelDocument(String filename, InputStream docStream, Tokeniser tokeniser): Deprecated. |
| MSPowerPointDocument(InputStream docStream, Map<String,String> docProperties, Tokeniser tok): Deprecated. |
| MSPowerPointDocument(String filename, InputStream docStream, Tokeniser tokeniser): Deprecated. |
| MSWordDocument(InputStream docStream, Map<String,String> docProperties, Tokeniser tok): Deprecated. |
| MSWordDocument(String filename, InputStream docStream, Tokeniser tokeniser): Deprecated. |
| PDFDocument(InputStream docStream, Map<String,String> docProperties, Tokeniser tok): Constructs a new PDFDocument. |
| PDFDocument(Reader docReader, Map<String,String> docProperties, Tokeniser tok): Constructs a new PDFDocument. |
| PDFDocument(String filename, InputStream docStream, Tokeniser tokeniser): Constructs a new PDFDocument, which converts the docStream representing the file into a Document object from which an Indexer can retrieve a stream of terms. |
| PDFDocument(String filename, Reader docReader, Tokeniser tok): Constructs a new PDFDocument. |
| POIDocument(InputStream docStream, Map<String,String> docProperties, Tokeniser tok): Constructs a new POIDocument object for the file represented by docStream. |
| POIDocument(String filename, InputStream docStream, Tokeniser tokeniser): Constructs a new POIDocument object for the file represented by docStream. |
| TaggedDocument(InputStream docStream, Map<String,String> docProperties, Tokeniser _tokeniser): Constructs an instance of the class from the given input stream. |
| TaggedDocument(InputStream docStream, Map<String,String> docProperties, Tokeniser _tokeniser, String doctags, String exactdoctags, String fieldtags): Constructs an instance of the class from the given input stream. |
| TaggedDocument(Reader docReader, Map<String,String> docProperties, Tokeniser _tokeniser): Constructs an instance of the class from the given reader object. |
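
The constructors above share one pattern: each document receives its input, a map of document properties, and the Tokeniser that will split the text into terms. The following is a minimal sketch of that pattern only. `SketchTokeniser`, `SketchDocument`, and `drain` are hypothetical stand-ins written for this illustration, not Terrier classes, and the method names on `SketchDocument` are assumptions chosen for readability.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

class DocumentSketchDemo {
    // Hypothetical stand-in; NOT Terrier's Tokeniser class.
    interface SketchTokeniser {
        List<String> tokens(String text);
    }

    // Hypothetical stand-in for a document that receives its tokeniser at construction time.
    static class SketchDocument {
        private final Map<String, String> properties;
        private final Iterator<String> terms;

        SketchDocument(Reader docReader, Map<String, String> docProperties, SketchTokeniser tok) {
            this.properties = new HashMap<>(docProperties);
            // Read everything up front, then let the injected tokeniser split it.
            this.terms = tok.tokens(readAll(docReader)).iterator();
        }

        boolean endOfDocument() { return !terms.hasNext(); }

        String getNextTerm() { return terms.next(); }

        String getProperty(String name) { return properties.get(name); }

        private static String readAll(Reader reader) {
            StringBuilder sb = new StringBuilder();
            char[] buffer = new char[4096];
            try {
                int n;
                while ((n = reader.read(buffer)) != -1) {
                    sb.append(buffer, 0, n);
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return sb.toString();
        }
    }

    // Pulls terms until the document is exhausted, as an indexer-like consumer might.
    static List<String> drain(SketchDocument doc) {
        List<String> out = new ArrayList<>();
        while (!doc.endOfDocument()) {
            out.add(doc.getNextTerm());
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumption: a whitespace splitter is enough for this demonstration.
        SketchTokeniser whitespace = text -> Arrays.asList(text.trim().split("\\s+"));
        SketchDocument doc = new SketchDocument(
                new StringReader("to be or not"),
                Map.of("filename", "hamlet.txt"),
                whitespace);
        System.out.println(doc.getProperty("filename") + " " + drain(doc));
    }
}
```

Passing the tokeniser in through the constructor keeps the document class independent of any one tokenisation strategy, which is presumably why every constructor listed above takes a Tokeniser argument.
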
| Modifier and Type | Class and Description | 
|---|---|
| class | EnglishTokeniser: Tokenises text obtained from a text stream, assuming English language. |
| class | IdentityTokeniser: A Tokeniser implementation that returns the input as is. |
| class | UTFTokeniser: Tokenises text obtained from a text stream. |
| class | UTFTwitterTokeniser: A tokeniser designed for use on tweets. |
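
To make the difference between an identity tokeniser and a language-aware one concrete, here is a small self-contained sketch. The types below are hypothetical stand-ins, not the Terrier classes listed above. The rule used by `EnglishSketch` (keep runs of ASCII letters and digits) is an assumption made for illustration and may not match what EnglishTokeniser actually does.

```java
import java.util.ArrayList;
import java.util.List;

class TokeniserVariantsDemo {
    // Hypothetical stand-in; NOT Terrier's Tokeniser class.
    interface SketchTokeniser {
        List<String> tokens(String text);
    }

    // Mirrors the idea of an identity tokeniser: the input comes back unchanged, as one token.
    static class IdentitySketch implements SketchTokeniser {
        public List<String> tokens(String text) {
            return List.of(text);
        }
    }

    // Assumed rule for illustration: a token is a maximal run of ASCII letters or digits.
    static class EnglishSketch implements SketchTokeniser {
        public List<String> tokens(String text) {
            List<String> out = new ArrayList<>();
            StringBuilder current = new StringBuilder();
            for (char c : text.toCharArray()) {
                boolean keep = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
                if (keep) {
                    current.append(c);
                } else if (current.length() > 0) {
                    // A non-token character ends the current token.
                    out.add(current.toString());
                    current.setLength(0);
                }
            }
            if (current.length() > 0) {
                out.add(current.toString());
            }
            return out;
        }
    }

    public static void main(String[] args) {
        String text = "Hello, Terrier World!";
        System.out.println(new IdentitySketch().tokens(text));
        System.out.println(new EnglishSketch().tokens(text));
    }
}
```
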
| Modifier and Type | Method and Description | 
|---|---|
| static Tokeniser | Tokeniser.getTokeniser(): Instantiates the Tokeniser class named in the tokeniser property. |
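
A factory that instantiates a class named in a property is commonly built on reflection. The sketch below shows that general technique with hypothetical stand-in types. It is not Terrier's implementation of `getTokeniser()`. The use of `java.util.Properties`, the fallback to `WhitespaceSketch`, and the requirement of a public no-argument constructor are all assumptions made for this example. Only the property name `tokeniser` comes from the description above.

```java
import java.util.Properties;

class TokeniserFactoryDemo {
    // Hypothetical stand-in; NOT Terrier's Tokeniser class.
    interface SketchTokeniser {
        String[] tokens(String text);
    }

    public static class WhitespaceSketch implements SketchTokeniser {
        public String[] tokens(String text) {
            return text.trim().split("\\s+");
        }
    }

    public static class IdentitySketch implements SketchTokeniser {
        public String[] tokens(String text) {
            return new String[] { text };
        }
    }

    // Looks up the class name under the "tokeniser" key and instantiates it reflectively.
    // Assumption: the named class has a public no-argument constructor.
    static SketchTokeniser fromProperties(Properties props) {
        String className = props.getProperty("tokeniser", WhitespaceSketch.class.getName());
        try {
            return Class.forName(className)
                    .asSubclass(SketchTokeniser.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalStateException("Cannot instantiate tokeniser: " + className, e);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("tokeniser", IdentitySketch.class.getName());
        SketchTokeniser t = fromProperties(props);
        System.out.println(t.getClass().getSimpleName() + " -> " + t.tokens("a b c").length + " token(s)");
    }
}
```

Selecting the implementation by name lets the tokeniser be swapped through configuration alone, without changing the code that constructs documents.
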
Terrier Information Retrieval Platform 5.1. Copyright © 2004-2019, University of Glasgow