TRECFullTokenizer (Terrier Information Retrieval Platform 4.1 API)

java.lang.Object
- org.terrier.indexing.TRECFullTokenizer

All Implemented Interfaces:

Tokenizer
```
public class TRECFullTokenizer
extends Object
implements Tokenizer
```
This class is the tokenizer used for indexing TREC topic files. It can be used for tokenizing other topic file formats, provided that the tags to skip and to process are specified accordingly.
NB: This class accepts only A-Z, a-z and 0-9 as valid characters for query terms. If this restriction is too tight, use TRECFullUTFTokenizer instead.

Author:

Gianni Amati, Vassilis Plachouras

See Also:
TagSet

Field Summary

Fields
Modifier and Type	Field and Description
`BufferedReader`	`br` The input reader.
`long`	`counter` The number of bytes read from the input.
`boolean`	`EOD` The end of document.
`boolean`	`EOF` The end of file from the buffered reader.
`boolean`	`error` A flag which is set when errors are encountered.
`protected TagSet`	`exactTagSet` The set of exact tags.
`protected boolean`	`ignoreMissingClosingTags` An option to ignore missing closing tags.
`boolean`	`inDocnoTag` Is in docno tag?
`boolean`	`inTagToProcess` Is in tag to process?
`boolean`	`inTagToSkip` Is in tag to skip?
`static int`	`lastChar` last character read
`protected static org.slf4j.Logger`	`logger`
`protected static boolean`	`lowercase` Whether to transform tokens to lowercase.
`int`	`number_of_terms` A counter for the number of terms.
`protected static Stack<String>`	`stk` The stack where the tags are pushed and popped accordingly.
`protected StringBuilder`	`sw`
`protected StringBuilder`	`tagNameSB`
`protected TagSet`	`tagSet` The tag set to use.
`protected static int`	`tokenMaximumLength` The maximum length of a token in the check method.

Constructor Summary

Constructors
Constructor and Description
`TRECFullTokenizer()` Constructs an instance of the TRECFullTokenizer.
`TRECFullTokenizer(BufferedReader _br)` Constructs an instance of the TRECFullTokenizer, given the buffered reader.
`TRECFullTokenizer(TagSet _tagSet, TagSet _exactSet)` Constructs an instance of the TRECFullTokenizer with non-default tags.
`TRECFullTokenizer(TagSet _ts, TagSet _exactSet, BufferedReader _br)` Constructs an instance of the TRECFullTokenizer with non-default tags and a given buffered reader.

Method Summary

Methods
Modifier and Type	Method and Description
`protected String`	`check(String s)` A restricted check function for discarding uncommon, or 'strange' terms.
`void`	`close()` Closes the buffered reader associated with the tokenizer.
`void`	`closeBufferedReader()` Closes the buffered reader associated with the tokenizer.
`String`	`currentTag()` Returns the name of the tag the tokenizer is currently in.
`long`	`getByteOffset()` Returns the number of bytes read from the current file.
`boolean`	`inDocnoTag()` Indicates whether the tokenizer is in the special document number tag.
`boolean`	`inTagToProcess()` Returns true if the tokenizer is currently in a tag to be processed.
`boolean`	`inTagToSkip()` Returns true if the tokenizer is currently in a tag to be skipped.
`boolean`	`isEndOfDocument()` Returns true if the end of document is encountered.
`boolean`	`isEndOfFile()` Returns true if the end of file is encountered.
`void`	`nextDocument()` Proceed to the next document.
`String`	`nextToken()` Returns the next token from the current chunk of text, extracted from the document into a TokenStream.
`protected void`	`processEndOfTag(String tag)` The encountered tag, which must be a closing tag, is matched against the tag on the stack.
`void`	`setIgnoreMissingClosingTags(boolean toIgnore)` Sets the value of the ignoreMissingClosingTags.
`void`	`setInput(BufferedReader _br)` Sets the input of the tokenizer.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Field Detail
  - logger
```
protected static final org.slf4j.Logger logger
```
  - ignoreMissingClosingTags
```
protected boolean ignoreMissingClosingTags
```
    An option to ignore missing closing tags. Used for the query files.
  - lastChar
```
public static int lastChar
```
    last character read
  - number_of_terms
```
public int number_of_terms
```
    A counter for the number of terms.
  - EOF
```
public boolean EOF
```
    The end of file from the buffered reader.
  - EOD
```
public boolean EOD
```
    The end of document.
  - error
```
public boolean error
```
    A flag which is set when errors are encountered.
  - br
```
public BufferedReader br
```
    The input reader.
  - counter
```
public long counter
```
    The number of bytes read from the input.
  - stk
```
protected static Stack<String> stk
```
    The stack where the tags are pushed and popped accordingly.
  - tagSet
```
protected TagSet tagSet
```
    The tag set to use.
  - exactTagSet
```
protected TagSet exactTagSet
```
    The set of exact tags.
  - tokenMaximumLength
```
protected static final int tokenMaximumLength
```
    The maximum length of a token in the check method.
  - lowercase
```
protected static final boolean lowercase
```
    Whether to transform tokens to lowercase.
  - inTagToProcess
```
public boolean inTagToProcess
```
    Is in tag to process?
  - inTagToSkip
```
public boolean inTagToSkip
```
    Is in tag to skip?
  - inDocnoTag
```
public boolean inDocnoTag
```
    Is in docno tag?
  - sw
```
protected final StringBuilder sw
```
  - tagNameSB
```
protected final StringBuilder tagNameSB
```
- Constructor Detail
  - TRECFullTokenizer
```
public TRECFullTokenizer()
```
    Constructs an instance of the TRECFullTokenizer. The tags used are TagSet.TREC_DOC_TAGS and TagSet.TREC_EXACT_DOC_TAGS.
  - TRECFullTokenizer
```
public TRECFullTokenizer(BufferedReader _br)
```
    Constructs an instance of the TRECFullTokenizer, given the buffered reader. The tags used are TagSet.TREC_DOC_TAGS and TagSet.TREC_EXACT_DOC_TAGS.
    
    Parameters:
    _br - java.io.BufferedReader the input stream to tokenize
  - TRECFullTokenizer
```
public TRECFullTokenizer(TagSet _tagSet,
                 TagSet _exactSet)
```
    Constructs an instance of the TRECFullTokenizer with non-default tags.
    
    Parameters:
    _tagSet - TagSet the document tags to process.
    _exactSet - TagSet the document tags to process exactly, without applying strict checks.
  - TRECFullTokenizer
```
public TRECFullTokenizer(TagSet _ts,
                 TagSet _exactSet,
                 BufferedReader _br)
```
    Constructs an instance of the TRECFullTokenizer with non-default tags and a given buffered reader.
    
    Parameters:
    _ts - TagSet the document tags to process.
    _exactSet - TagSet the document tags to process exactly, without applying strict checks.
    _br - java.io.BufferedReader the input to tokenize.
- Method Detail
  - check
```
protected String check(String s)
```
    A restricted check function for discarding uncommon, or 'strange' terms.
    
    Parameters:
    s - The term to check.
    
    Returns:
    the term if it passed the check, otherwise null.
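    The exact rules applied by check are not listed here. The following is only a minimal sketch of the kind of filter described, assuming that a term is rejected when it is empty, longer than a maximum length, or contains a character outside A-Z, a-z and 0-9. The class name TermCheckSketch and the values of MAX_LENGTH and LOWERCASE are hypothetical and are not taken from Terrier.

```java
import java.util.Locale;

/** Illustrative stand-in for a restricted term check; not Terrier's implementation. */
final class TermCheckSketch {
    // Assumed values: the real tokenMaximumLength and lowercase settings are not stated in this page.
    static final int MAX_LENGTH = 20;
    static final boolean LOWERCASE = true;

    /** Returns the (optionally lowercased) term if it passes the check, otherwise null. */
    static String check(String s) {
        if (s == null || s.isEmpty() || s.length() > MAX_LENGTH) {
            return null;
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean alphanumeric = (c >= 'A' && c <= 'Z')
                    || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9');
            if (!alphanumeric) {
                return null; // any other character discards the whole term
            }
        }
        return LOWERCASE ? s.toLowerCase(Locale.ROOT) : s;
    }

    public static void main(String[] args) {
        System.out.println(check("Terrier")); // prints "terrier"
        System.out.println(check("x-ray"));   // prints "null" (hyphen is not accepted)
    }
}
```

    Returning null rather than throwing matches the documented contract above, so callers can simply skip discarded terms.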
  - close
```
public void close()
```
    Closes the buffered reader associated with the tokenizer.
  - closeBufferedReader
```
public void closeBufferedReader()
```
    Closes the buffered reader associated with the tokenizer.
  - currentTag
```
public String currentTag()
```
    Returns the name of the tag the tokenizer is currently in.
    
    Specified by:
    
    currentTag in interface Tokenizer
    
    Returns:
    the name of the tag the tokenizer is currently in
  - inDocnoTag
```
public boolean inDocnoTag()
```
    Indicates whether the tokenizer is in the special document number tag.
    
    Specified by:
    
    inDocnoTag in interface Tokenizer
    
    Returns:
    true if the tokenizer is in the document number tag.
  - inTagToProcess
```
public boolean inTagToProcess()
```
    Returns true if the tokenizer is currently in a tag to be processed.
    
    Specified by:
    
    inTagToProcess in interface Tokenizer
    
    Returns:
    true if the tag is to be processed, otherwise false.
  - inTagToSkip
```
public boolean inTagToSkip()
```
    Returns true if the tokenizer is currently in a tag to be skipped.
    
    Specified by:
    
    inTagToSkip in interface Tokenizer
    
    Returns:
    true if the tag is to be skipped, otherwise false.
  - isEndOfDocument
```
public boolean isEndOfDocument()
```
    Returns true if the end of document is encountered.
    
    Specified by:
    
    isEndOfDocument in interface Tokenizer
    
    Returns:
    true if the end of document is encountered.
  - isEndOfFile
```
public boolean isEndOfFile()
```
    Returns true if the end of file is encountered.
    
    Specified by:
    
    isEndOfFile in interface Tokenizer
    
    Returns:
    true if the end of file is encountered.
  - nextDocument
```
public void nextDocument()
```
    Proceed to the next document.
    
    Specified by:
    
    nextDocument in interface Tokenizer
  - nextToken
```
public String nextToken()
```
    Returns the next token from the current chunk of text, extracted from the document into a TokenStream.
    
    Specified by:
    
    nextToken in interface Tokenizer
    
    Returns:
    String the next token of the document, or null if the token was discarded during tokenisation.
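    Because nextToken may return null for a discarded token, callers typically drive the tokenizer with two nested loops: the outer one over documents, the inner one over tokens. The sketch below shows that pattern without requiring Terrier on the classpath. TokenSource and ListTokenSource are hypothetical stand-ins that mirror only the four method names documented on this page; they are not Terrier's Tokenizer interface, and the end-of-document behaviour of the stand-in is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in mirroring four method names documented above. */
interface TokenSource {
    boolean isEndOfFile();
    boolean isEndOfDocument();
    String nextToken();
    void nextDocument();
}

/** Hypothetical in-memory source: each inner list is one document; null models a discarded token. */
final class ListTokenSource implements TokenSource {
    private final List<List<String>> docs;
    private int doc = 0;
    private int pos = 0;

    ListTokenSource(List<List<String>> docs) { this.docs = docs; }

    public boolean isEndOfFile() { return doc >= docs.size(); }
    public boolean isEndOfDocument() { return isEndOfFile() || pos >= docs.get(doc).size(); }
    public String nextToken() { return docs.get(doc).get(pos++); }
    public void nextDocument() { doc++; pos = 0; }
}

final class TokenLoopSketch {
    /** Collects the non-null tokens of every document, one list per document. */
    static List<List<String>> readAll(TokenSource tokenizer) {
        List<List<String>> result = new ArrayList<>();
        while (!tokenizer.isEndOfFile()) {
            List<String> terms = new ArrayList<>();
            while (!tokenizer.isEndOfDocument()) {
                String token = tokenizer.nextToken();
                if (token == null) {
                    continue; // token was discarded during tokenisation
                }
                terms.add(token);
            }
            result.add(terms);
            tokenizer.nextDocument(); // move on before testing end of file again
        }
        return result;
    }

    public static void main(String[] args) {
        TokenSource source = new ListTokenSource(Arrays.asList(
                Arrays.asList("trec", null, "topic"),
                Arrays.asList("terrier")));
        System.out.println(readAll(source)); // prints "[[trec, topic], [terrier]]"
    }
}
```

    With a real TRECFullTokenizer, the same sequence of calls would apply, with inDocnoTag, inTagToProcess and inTagToSkip available inside the inner loop to decide what to do with each token.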
  - processEndOfTag
```
protected void processEndOfTag(String tag)
```
    The encountered tag, which must be a closing tag, is matched against the tag on top of the stack. If they differ, consistency is restored by popping tags from the stack, up to and including the observed tag. If the stack is empty afterwards, the end-of-document flag EOD is set to true.
    
    Parameters:
    tag - The closing tag to be tested against the content of the stack.
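    The stack repair described above can be sketched as follows. This is an illustration under stated assumptions, not the class's source: it uses a java.util.ArrayDeque instead of the static Stack<String> field stk, and it assumes that a closing tag which does not appear on the stack at all is ignored. The class name EndOfTagSketch and the open method are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative model of closing-tag handling with a tag stack; not Terrier's implementation. */
final class EndOfTagSketch {
    final Deque<String> stack = new ArrayDeque<>();
    boolean endOfDocument = false;

    /** Hypothetical helper: records an opening tag. */
    void open(String tag) {
        stack.push(tag);
    }

    void processEndOfTag(String tag) {
        if (stack.isEmpty()) {
            return; // assumption: a stray closing tag with nothing open is ignored
        }
        if (stack.peek().equals(tag)) {
            stack.pop(); // the well-formed case
        } else if (stack.contains(tag)) {
            // Missing closing tags: pop everything above the observed tag, and the tag itself.
            while (!stack.isEmpty()) {
                if (stack.pop().equals(tag)) {
                    break;
                }
            }
        }
        // assumption: a closing tag that was never opened leaves the stack unchanged
        if (stack.isEmpty()) {
            endOfDocument = true;
        }
    }

    public static void main(String[] args) {
        EndOfTagSketch s = new EndOfTagSketch();
        s.open("DOC");
        s.open("TEXT");
        s.open("P");               // <P> is never closed
        s.processEndOfTag("TEXT"); // pops P and TEXT
        System.out.println(s.stack + " " + s.endOfDocument); // prints "[DOC] false"
        s.processEndOfTag("DOC");
        System.out.println(s.stack + " " + s.endOfDocument); // prints "[] true"
    }
}
```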
  - setIgnoreMissingClosingTags
```
public void setIgnoreMissingClosingTags(boolean toIgnore)
```
    Sets the value of the ignoreMissingClosingTags option.
    
    Parameters:
    toIgnore - boolean to ignore or not the missing closing tags
  - getByteOffset
```
public long getByteOffset()
```
    Returns the number of bytes read from the current file.
    
    Specified by:
    
    getByteOffset in interface Tokenizer
    
    Returns:
    long the byte offset
  - setInput
```
public void setInput(BufferedReader _br)
```
    Sets the input of the tokenizer.
    
    Specified by:
    
    setInput in interface Tokenizer
    
    Parameters:
    _br - BufferedReader the input stream
