org.terrier.indexing
Class TRECFullUTFTokenizer

java.lang.Object
  extended by org.terrier.indexing.TRECFullTokenizer
      extended by org.terrier.indexing.TRECFullUTFTokenizer
All Implemented Interfaces:
Tokenizer

Deprecated. From 3.5, TRECFullTokenizer should be used instead, with trec.encoding set to utf8.

public class TRECFullUTFTokenizer
extends TRECFullTokenizer

This is a subclass of TRECFullTokenizer, which is less restrictive than its parent. In this class any character passing Character.isLetterOrDigit() is accepted as a valid query term.

Since:
2.1
Author:
Craig Macdonald

Field Summary
 
Fields inherited from class org.terrier.indexing.TRECFullTokenizer
br, counter, EOD, EOF, error, exactTagSet, ignoreMissingClosingTags, inDocnoTag, inTagToProcess, inTagToSkip, lastChar, logger, lowercase, number_of_terms, stk, sw, tagNameSB, tagSet, tokenMaximumLength
 
Constructor Summary
TRECFullUTFTokenizer()
          Deprecated. Constructs an instance of the TRECFullUTFTokenizer.
TRECFullUTFTokenizer(java.io.BufferedReader br)
          Deprecated. Constructs an instance of the TRECFullUTFTokenizer, given a BufferedReader.
TRECFullUTFTokenizer(TagSet _tagSet, TagSet _exactSet)
          Deprecated. Constructs an instance of the TRECFullUTFTokenizer.
TRECFullUTFTokenizer(TagSet _ts, TagSet _exactSet, java.io.BufferedReader br)
          Deprecated. Constructs an instance of the TRECFullUTFTokenizer.
 
Method Summary
protected  java.lang.String check(java.lang.String s)
          Deprecated. A restricted check function for discarding uncommon, or 'strange' terms.
 java.lang.String nextToken()
          Deprecated. nextTermWithNumbers gives the next string which is not a tag.
 
Methods inherited from class org.terrier.indexing.TRECFullTokenizer
close, closeBufferedReader, currentTag, getByteOffset, inDocnoTag, inTagToProcess, inTagToSkip, isEndOfDocument, isEndOfFile, nextDocument, processEndOfTag, setIgnoreMissingClosingTags, setInput
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TRECFullUTFTokenizer

public TRECFullUTFTokenizer()
Deprecated. 
Constructs an instance of the TRECFullUTFTokenizer.


TRECFullUTFTokenizer

public TRECFullUTFTokenizer(java.io.BufferedReader br)
Deprecated. 
Constructs an instance of the TRECFullUTFTokenizer, given a BufferedReader.

Parameters:
br - the BufferedReader to read from.
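
Because this constructor takes a BufferedReader, the caller decides how the underlying bytes are decoded. The following is a minimal sketch, using only the Java standard library, of building a reader that decodes UTF-8 input. The class name Utf8ReaderSketch and its helper methods are hypothetical and are not part of Terrier; passing the resulting reader to the tokenizer is mentioned only in a comment and is not exercised here.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class Utf8ReaderSketch {

    /** Wraps raw bytes in a BufferedReader that decodes them as UTF-8. */
    static BufferedReader openUtf8(byte[] data) {
        return new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8));
    }

    /** Reads the first line through the UTF-8 reader (helper for illustration only). */
    static String firstLine(byte[] data) {
        try (BufferedReader br = openUtf8(data)) {
            // Assumption: with Terrier on the classpath, br could instead be
            // handed to new TRECFullUTFTokenizer(br).
            return br.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "<DOC><DOCNO>d1</DOCNO>caf\u00e9</DOC>".getBytes(StandardCharsets.UTF_8);
        System.out.println(firstLine(data));
    }
}
```

Specifying the charset explicitly avoids depending on the platform default encoding, which may not be UTF-8.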

TRECFullUTFTokenizer

public TRECFullUTFTokenizer(TagSet _tagSet,
                            TagSet _exactSet)
Deprecated. 
Constructs an instance of the TRECFullUTFTokenizer.

Parameters:
_tagSet -
_exactSet -

TRECFullUTFTokenizer

public TRECFullUTFTokenizer(TagSet _ts,
                            TagSet _exactSet,
                            java.io.BufferedReader br)
Deprecated. 
Constructs an instance of the TRECFullUTFTokenizer.

Parameters:
_ts -
_exactSet -
br - the BufferedReader to read from.
Method Detail

check

protected java.lang.String check(java.lang.String s)
Deprecated. 
A restricted check function for discarding uncommon, or 'strange' terms.

Overrides:
check in class TRECFullTokenizer
Parameters:
s - The term to check.
Returns:
the term if it passed the check, otherwise null.
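
The contract above (return the term, or null if it is rejected) combined with the class description (characters passing Character.isLetterOrDigit() are accepted) can be illustrated with a small standalone sketch. This is not Terrier's implementation, which may apply further rules; the class LetterOrDigitCheckSketch and its check method are hypothetical stand-ins written only against the Java standard library.

```java
public class LetterOrDigitCheckSketch {

    /**
     * Hypothetical stand-in for a check(String) method: returns the term
     * if every character passes Character.isLetterOrDigit(), otherwise null.
     */
    static String check(String s) {
        if (s == null || s.isEmpty()) {
            return null;
        }
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isLetterOrDigit(s.charAt(i))) {
                return null; // punctuation, whitespace, symbols are rejected
            }
        }
        return s;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file ASCII-only:
        // "caf\u00e9" is Latin text with an accented letter,
        // "\u691c\u7d22" is a two-character CJK word.
        String[] candidates = {"terrier", "caf\u00e9", "\u691c\u7d22", "trec2011", "e-mail", "a b"};
        for (String c : candidates) {
            System.out.println(c + " -> " + check(c));
        }
    }
}
```

Accented Latin letters and CJK ideographs pass this check because Character.isLetterOrDigit() is defined over Unicode character categories rather than ASCII ranges.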

nextToken

public java.lang.String nextToken()
Deprecated. 
nextTermWithNumbers gives the next string which is not a tag. All encountered tags are pushed or popped according to whether they are initial (opening) or final (closing) tags.

Specified by:
nextToken in interface Tokenizer
Overrides:
nextToken in class TRECFullTokenizer
Returns:
String the next token of the document, or null if the token was discarded during tokenisation.
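
The push/pop handling of tags described for nextToken can be sketched in isolation. The example below is a simplified illustration under stated assumptions, not Terrier's code: the class TagStackSketch, its tokens method, and the "TAG:term" output format are all hypothetical, and the sketch returns every token at once instead of one token per call.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class TagStackSketch {

    /** Emits the pending term, labelled with the innermost open tag, then clears it. */
    private static void flush(StringBuilder term, Deque<String> openTags, List<String> out) {
        if (term.length() > 0) {
            String tag = openTags.isEmpty() ? "" : openTags.peek();
            out.add(tag + ":" + term);
            term.setLength(0);
        }
    }

    /** Returns the non-tag tokens of doc; opening tags are pushed, closing tags popped. */
    static List<String> tokens(String doc) {
        Deque<String> openTags = new ArrayDeque<>();
        List<String> out = new ArrayList<>();
        StringBuilder term = new StringBuilder();
        int i = 0;
        while (i < doc.length()) {
            char c = doc.charAt(i);
            if (c == '<') {
                flush(term, openTags, out);
                int end = doc.indexOf('>', i);
                if (end < 0) {
                    break; // unterminated tag: stop reading
                }
                String tag = doc.substring(i + 1, end);
                if (tag.startsWith("/")) {
                    // Closing tag: pop only if it matches the innermost open tag.
                    if (!openTags.isEmpty() && openTags.peek().equals(tag.substring(1))) {
                        openTags.pop();
                    }
                } else {
                    openTags.push(tag); // opening tag
                }
                i = end + 1;
            } else {
                if (Character.isLetterOrDigit(c)) {
                    term.append(c);
                } else {
                    flush(term, openTags, out); // any other character ends the term
                }
                i++;
            }
        }
        flush(term, openTags, out);
        return out;
    }

    public static void main(String[] args) {
        String doc = "<DOC><DOCNO>d1</DOCNO><TEXT>caf\u00e9 au lait</TEXT></DOC>";
        System.out.println(tokens(doc));
    }
}
```

Keeping the open tags on a stack means the innermost tag is always available at the top, which is what lets a tokenizer decide whether the current text sits inside a tag to process or a tag to skip.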


Terrier 3.5. Copyright © 2004-2011 University of Glasgow