TRECFullTokenizer (Terrier Information Retrieval Platform version 2.2.1 API Specification)

uk.ac.gla.terrier.indexing
Class TRECFullTokenizer

java.lang.Object
  uk.ac.gla.terrier.indexing.TRECFullTokenizer

All Implemented Interfaces:: Tokenizer

Direct Known Subclasses:: TRECFullUTFTokenizer

public class TRECFullTokenizer
extends java.lang.Object
implements Tokenizer

This class is the tokenizer used for indexing TREC topic files. It can be used for tokenizing other topic file formats, provided that the tags to skip and to process are specified accordingly.

NB: This class accepts only A-Z, a-z and 0-9 as valid characters for query terms. If this restriction is too tight, use TRECFullUTFTokenizer instead.
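To make the character restriction concrete, the following is a minimal, self-contained sketch and not Terrier's own code. The class name `AsciiAlnumSketch` and its methods are hypothetical, and it assumes that any character outside A-Z, a-z and 0-9 simply ends the current term; how TRECFullTokenizer actually treats such characters is not stated on this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the Terrier API.
final class AsciiAlnumSketch {

    // True only for the characters the note above lists: A-Z, a-z, 0-9.
    static boolean isAccepted(char c) {
        return (c >= 'A' && c <= 'Z')
            || (c >= 'a' && c <= 'z')
            || (c >= '0' && c <= '9');
    }

    // Assumption: every other character acts as a term separator.
    static List<String> terms(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isAccepted(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                out.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }
}
```

Under that assumption, an accented word such as "na\u00efve" would be broken into two terms, which is the kind of case where TRECFullUTFTokenizer may be the better fit.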

Version:: $Revision: 1.33 $
Author:: Gianni Amati, Vassilis Plachouras
See Also:: TagSet

Field Summary
`java.io.BufferedReader`	`br` The input reader.
`long`	`counter` The number of bytes read from the input.
`boolean`	`EOD` The end of document.
`boolean`	`EOF` The end of file from the buffered reader.
`boolean`	`error` A flag which is set when errors are encountered.
`boolean`	`inDocnoTag` Is in docno tag?
`boolean`	`inTagToProcess` Is in tag to process?
`boolean`	`inTagToSkip` Is in tag to skip?
`static int`	`lastChar` last character read
`int`	`number_of_terms` A counter for the number of terms.

Constructor Summary
`TRECFullTokenizer()` Constructs an instance of the TRECFullTokenizer.
`TRECFullTokenizer(java.io.BufferedReader br)` Constructs an instance of the TRECFullTokenizer, given the buffered reader.
`TRECFullTokenizer(TagSet _tagSet, TagSet _exactSet)` Constructs an instance of the TRECFullTokenizer with non-default tags.
`TRECFullTokenizer(TagSet _ts, TagSet _exactSet, java.io.BufferedReader br)` Constructs an instance of the TRECFullTokenizer with non-default tags and a given buffered reader.

Method Summary
`void`	`close()` Closes the buffered reader associated with the tokenizer.
`void`	`closeBufferedReader()` Closes the buffered reader associated with the tokenizer.
`java.lang.String`	`currentTag()` Returns the name of the tag the tokenizer is currently in.
`long`	`getByteOffset()` Returns the number of bytes read from the current file.
`boolean`	`inDocnoTag()` Indicates whether the tokenizer is in the special document number tag.
`boolean`	`inTagToProcess()` Returns true if the tokenizer is currently in a tag to be processed.
`boolean`	`inTagToSkip()` Returns true if the tokenizer is currently in a tag to be skipped.
`boolean`	`isEndOfDocument()` Returns true if the end of document is encountered.
`boolean`	`isEndOfFile()` Returns true if the end of file is encountered.
`void`	`nextDocument()` Proceed to the next document.
`java.lang.String`	`nextToken()` Returns the next string that is not a tag.
`void`	`setIgnoreMissingClosingTags(boolean toIgnore)` Sets the value of the ignoreMissingClosingTags flag.
`void`	`setInput(java.io.BufferedReader _br)` Sets the input of the tokenizer.

Methods inherited from class java.lang.Object
`equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

lastChar

public static int lastChar

last character read

number_of_terms

public int number_of_terms

A counter for the number of terms.

EOF

public boolean EOF

The end of file from the buffered reader.

EOD

public boolean EOD

The end of document.

error

public boolean error

A flag which is set when errors are encountered.

br

public java.io.BufferedReader br

The input reader.

counter

public long counter

The number of bytes read from the input.

inTagToProcess

public boolean inTagToProcess

Is in tag to process?

inTagToSkip

public boolean inTagToSkip

Is in tag to skip?

inDocnoTag

public boolean inDocnoTag

Is in docno tag?

Constructor Detail

TRECFullTokenizer

public TRECFullTokenizer()

Constructs an instance of the TRECFullTokenizer. The tags used are TagSet.TREC_DOC_TAGS and TagSet.TREC_EXACT_DOC_TAGS.

TRECFullTokenizer

public TRECFullTokenizer(java.io.BufferedReader br)

Constructs an instance of the TRECFullTokenizer, given the buffered reader. The tags used are TagSet.TREC_DOC_TAGS and TagSet.TREC_EXACT_DOC_TAGS.

Parameters:: br - java.io.BufferedReader the input stream to tokenize

TRECFullTokenizer

public TRECFullTokenizer(TagSet _tagSet,
                         TagSet _exactSet)

Constructs an instance of the TRECFullTokenizer with non-default tags.

Parameters:: _tagSet - TagSet the document tags to process.; _exactSet - TagSet the document tags to process exactly, without applying strict checks.

TRECFullTokenizer

public TRECFullTokenizer(TagSet _ts,
                         TagSet _exactSet,
                         java.io.BufferedReader br)

Constructs an instance of the TRECFullTokenizer with non-default tags and a given buffered reader.

Parameters:: _ts - TagSet the document tags to process.; _exactSet - TagSet the document tags to process exactly, without applying strict checks.; br - java.io.BufferedReader the input to tokenize.

Method Detail

close

public void close()

Closes the buffered reader associated with the tokenizer.

closeBufferedReader

public void closeBufferedReader()

Closes the buffered reader associated with the tokenizer.

currentTag

public java.lang.String currentTag()

Returns the name of the tag the tokenizer is currently in.

Specified by:: currentTag in interface Tokenizer

Returns:: the name of the tag the tokenizer is currently in

inDocnoTag

public boolean inDocnoTag()

Indicates whether the tokenizer is in the special document number tag.

Specified by:: inDocnoTag in interface Tokenizer

Returns:: true if the tokenizer is in the document number tag.

inTagToProcess

public boolean inTagToProcess()

Returns true if the tokenizer is currently in a tag to be processed.

Specified by:: inTagToProcess in interface Tokenizer

Returns:: true if the tag is to be processed, otherwise false.

inTagToSkip

public boolean inTagToSkip()

Returns true if the tokenizer is currently in a tag to be skipped.

Specified by:: inTagToSkip in interface Tokenizer

Returns:: true if the tag is to be skipped, otherwise false.

isEndOfDocument

public boolean isEndOfDocument()

Returns true if the end of document is encountered.

Specified by:: isEndOfDocument in interface Tokenizer

Returns:: true if the end of document is encountered.

isEndOfFile

public boolean isEndOfFile()

Returns true if the end of file is encountered.

Specified by:: isEndOfFile in interface Tokenizer

Returns:: true if the end of file is encountered.

nextDocument

public void nextDocument()

Proceed to the next document.

Specified by:: nextDocument in interface Tokenizer
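The methods isEndOfFile, isEndOfDocument, nextToken and nextDocument are naturally combined in a nested loop: an outer loop over documents and an inner loop over tokens. The sketch below shows one plausible shape for that loop. It is self-contained so that it can run without Terrier: `TokenSource` is a hypothetical stand-in interface that only mirrors the four method names documented here, and `ArrayTokenSource` is a toy in-memory implementation. The real tokenizer may behave differently in details, for example in when nextToken returns null.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in; mirrors four method names of the Tokenizer interface.
interface TokenSource {
    String nextToken();
    boolean isEndOfDocument();
    boolean isEndOfFile();
    void nextDocument();
}

// Toy in-memory source: each inner array plays the role of one document.
final class ArrayTokenSource implements TokenSource {
    private final String[][] docs;
    private int doc = 0;
    private int pos = 0;

    ArrayTokenSource(String[][] docs) {
        this.docs = docs;
    }

    public boolean isEndOfFile() {
        return doc >= docs.length;
    }

    public boolean isEndOfDocument() {
        return isEndOfFile() || pos >= docs[doc].length;
    }

    public String nextToken() {
        return isEndOfDocument() ? null : docs[doc][pos++];
    }

    public void nextDocument() {
        if (!isEndOfFile()) {
            doc++;
            pos = 0;
        }
    }
}

final class ReadLoopSketch {
    // Collects the terms of every document, one list per document.
    static List<List<String>> readAll(TokenSource t) {
        List<List<String>> out = new ArrayList<>();
        while (!t.isEndOfFile()) {
            List<String> terms = new ArrayList<>();
            while (!t.isEndOfDocument()) {
                String term = t.nextToken();
                // Guard against null, which the documentation allows at end of file.
                if (term != null) {
                    terms.add(term);
                }
            }
            out.add(terms);
            t.nextDocument();
        }
        return out;
    }
}
```

The null check inside the inner loop is a defensive choice: it keeps the loop correct even if a token source returns null before reporting the end of a document.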

nextToken

public java.lang.String nextToken()

Returns the next string that is not a tag. All encountered tags are pushed or popped according to whether they are opening or closing tags.

Specified by:: nextToken in interface Tokenizer

Returns:: the next token, or null if the end of file is encountered.
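The push-and-pop handling of tags mentioned for nextToken can be pictured as a stack of open tag names, with the innermost open tag on top. The sketch below is an illustration of that stack discipline only, not the class's actual parsing code. `TagStackSketch` and `onTag` are hypothetical names, and the choice to ignore a closing tag that does not match the innermost open tag is an assumption made for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration of tag push/pop; not part of the Terrier API.
final class TagStackSketch {
    private final Deque<String> open = new ArrayDeque<>();

    // Feed one complete tag as it appears in the input, e.g. "<TITLE>" or "</TITLE>".
    void onTag(String tag) {
        if (tag.startsWith("</")) {
            String name = tag.substring(2, tag.length() - 1);
            // Assumption: pop only when the closing tag matches the innermost open tag.
            if (!open.isEmpty() && open.peek().equals(name)) {
                open.pop();
            }
        } else {
            // Opening tag: strip the angle brackets and push the name.
            open.push(tag.substring(1, tag.length() - 1));
        }
    }

    // Name of the innermost open tag, or null when outside all tags.
    String currentTag() {
        return open.peek();
    }
}
```

With this model, the tag reported as current after reading `<TOP>` and then `<TITLE>` is `TITLE`, and it reverts to `TOP` once `</TITLE>` has been read.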

setIgnoreMissingClosingTags

public void setIgnoreMissingClosingTags(boolean toIgnore)

Sets the value of the ignoreMissingClosingTags flag.

Parameters:: toIgnore - boolean whether to ignore missing closing tags

getByteOffset

public long getByteOffset()

Returns the number of bytes read from the current file.

Specified by:: getByteOffset in interface Tokenizer

Returns:: long the byte offset

setInput

public void setInput(java.io.BufferedReader _br)

Sets the input of the tokenizer.

Specified by:: setInput in interface Tokenizer

Parameters:: _br - BufferedReader the input stream
