Terrier IR Platform
1.1.1

uk.ac.gla.terrier.structures
Class UTFLexicon

java.lang.Object
  extended by uk.ac.gla.terrier.structures.Lexicon
      extended by uk.ac.gla.terrier.structures.UTFLexicon
All Implemented Interfaces:
java.lang.Iterable<java.lang.String>, Closeable

public class UTFLexicon
extends Lexicon

Implements the lexicon structure. In addition to the lexicon file, which holds the actual term data and takes its name from ApplicationSetup.LEXICON_FILENAME, a second file is created and used. It contains a mapping from each term's code to the offset of that term's entry in the lexicon. The name of this second file is given by ApplicationSetup.LEXICON_INDEX_FILENAME.
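The term-code-to-offset mapping described above can be illustrated with a minimal sketch. It is not Terrier's implementation: the class name, the method name, and the assumption that term codes are dense (0 to n-1) and that entries have a fixed length are all hypothetical.

```java
public class LexiconIndexSketch {
    /**
     * Builds a hypothetical lexicon index: offsets[termId] is the byte offset
     * of that term's entry in a lexicon made of fixed-length entries.
     *
     * @param termIdsInLexiconOrder term ids in the order their entries appear
     *                              in the lexicon (assumed dense: 0..n-1)
     * @param entryLength           assumed fixed entry size in bytes
     */
    static long[] buildIndex(int[] termIdsInLexiconOrder, int entryLength) {
        long[] offsets = new long[termIdsInLexiconOrder.length];
        for (int pos = 0; pos < termIdsInLexiconOrder.length; pos++) {
            // The entry at position pos starts pos * entryLength bytes in.
            offsets[termIdsInLexiconOrder[pos]] = (long) pos * entryLength;
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Lexicon order (alphabetical) need not match term id order.
        long[] offsets = buildIndex(new int[] {2, 0, 1}, 41);
        for (int id = 0; id < offsets.length; id++) {
            System.out.println("termId " + id + " -> offset " + offsets[id]);
        }
    }
}
```

With such an index, a lookup by term code becomes one array access followed by one seek, rather than a scan of the lexicon.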

Version:
$Revision: 1.11 $
Author:
Gianni Amati, Vassilis Plachouras, Craig Macdonald
See Also:
ApplicationSetup.LEXICON_FILENAME, ApplicationSetup.LEXICON_INDEX_FILENAME

Field Summary
static int lexiconEntryLength
          The size in bytes of an entry in the lexicon file.
 
Constructor Summary
UTFLexicon()
          A default constructor.
UTFLexicon(java.lang.String lexiconName)
          Constructs an instance of Lexicon and opens the corresponding file.
UTFLexicon(java.lang.String path, java.lang.String prefix)
           
 
Method Summary
 boolean findTerm(int _termId)
          Finds the term given its term code.
 boolean findTerm(java.lang.String _term)
          Performs a binary search in the lexicon in order to locate the given term.
 LexiconEntry getLexiconEntry(int termid)
          Returns a LexiconEntry describing all the information in the lexicon about the term denoted by termid
 LexiconEntry getLexiconEntry(java.lang.String _term)
          Returns a LexiconEntry describing all the information in the lexicon about the term denoted by _term
static int numberOfEntries(java.io.File f)
           
static int numberOfEntries(java.lang.String filename)
           
 void print()
          Deprecated. Please use the Lexicon Input Streams for displaying lexicons
 boolean seekEntry(int i)
          Seeks the i-th entry of the lexicon.
 boolean updateEntry(int i, int frequency, long endOffset, byte endBitOffset)
          Deprecated. The Lexicon class is only used for reading the lexicon file, and not for writing any information.
 
Methods inherited from class uk.ac.gla.terrier.structures.Lexicon
close, getEndBitOffset, getEndOffset, getNt, getNumberOfLexiconEntries, getStartBitOffset, getStartOffset, getTerm, getTermId, getTF, iterator
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

lexiconEntryLength

public static final int lexiconEntryLength
The size in bytes of an entry in the lexicon file. An entry consists of a string, an int (termCode), an int (docf), an int (tf), a long (the offset in bytes of the end of the term's entry in the inverted file) and a byte (the offset in bits within the last byte of the term's entry in the inverted file).
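As a rough illustration of how such an entry length adds up, the sketch below sums the field sizes listed above. The width reserved for the string is a hypothetical value (ASSUMED_TERM_BYTES); the actual value used by UTFLexicon is not stated here.

```java
public class EntryLengthSketch {
    /** Hypothetical number of bytes reserved for the term string. */
    static final int ASSUMED_TERM_BYTES = 20;

    /** Sums the sizes of the fields that make up one lexicon entry. */
    static int entryLength(int termBytes) {
        return termBytes
                + 3 * Integer.BYTES // termCode, docf, tf
                + Long.BYTES        // end offset in bytes in the inverted file
                + Byte.BYTES;       // end offset in bits within the last byte
    }

    public static void main(String[] args) {
        System.out.println(entryLength(ASSUMED_TERM_BYTES)); // prints 41
    }
}
```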

Constructor Detail

UTFLexicon

public UTFLexicon()
A default constructor.


UTFLexicon

public UTFLexicon(java.lang.String path,
                  java.lang.String prefix)

UTFLexicon

public UTFLexicon(java.lang.String lexiconName)
Constructs an instance of Lexicon and opens the corresponding file.

Parameters:
lexiconName - the name of the lexicon file.
Method Detail

print

public void print()
Deprecated. Please use the Lexicon Input Streams for displaying lexicons

Prints out the contents of the lexicon file. Streams are used to read the lexicon file.

Overrides:
print in class Lexicon

findTerm

public boolean findTerm(int _termId)
Finds the term given its term code.

Overrides:
findTerm in class Lexicon
Parameters:
_termId - the term's identifier
Returns:
true if the term is found, false otherwise.

findTerm

public boolean findTerm(java.lang.String _term)
Performs a binary search in the lexicon in order to locate the given term. If the term is located, the properties termCharacters, documentFrequency, termFrequency, startOffset, startBitOffset, endOffset and endBitOffset contain the values related to the term.

Overrides:
findTerm in class Lexicon
Parameters:
_term - The term to search for.
Returns:
true if the term is found, and false otherwise.
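A binary search over fixed-length, alphabetically sorted entries can be sketched as follows. This is an illustration under stated assumptions, not the code of findTerm: the in-memory byte array stands in for the lexicon file, and TERM_BYTES, ENTRY_LENGTH, build, termAt and find are hypothetical names. For brevity each entry holds only the term and its id.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LexiconSearchSketch {
    static final int TERM_BYTES = 8;                // hypothetical term width
    static final int ENTRY_LENGTH = TERM_BYTES + 4; // term + termId only

    /** Packs terms (already sorted by String.compareTo) into fixed-length entries. */
    static byte[] build(String... sortedTerms) {
        ByteBuffer buf = ByteBuffer.allocate(sortedTerms.length * ENTRY_LENGTH);
        for (int id = 0; id < sortedTerms.length; id++) {
            byte[] raw = sortedTerms[id].getBytes(StandardCharsets.UTF_8);
            buf.put(Arrays.copyOf(raw, TERM_BYTES)); // zero-padded, truncated if longer
            buf.putInt(id);
        }
        return buf.array();
    }

    /** Decodes the term stored in the i-th entry. */
    static String termAt(byte[] lexicon, int i) {
        int start = i * ENTRY_LENGTH;
        int len = 0;
        while (len < TERM_BYTES && lexicon[start + len] != 0) {
            len++;
        }
        return new String(lexicon, start, len, StandardCharsets.UTF_8);
    }

    /** Returns the index of the entry holding term, or -1 if it is absent. */
    static int find(byte[] lexicon, String term) {
        int low = 0;
        int high = lexicon.length / ENTRY_LENGTH - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            int cmp = term.compareTo(termAt(lexicon, mid));
            if (cmp == 0) {
                return mid;
            } else if (cmp < 0) {
                high = mid - 1;
            } else {
                low = mid + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] lexicon = build("apple", "banana", "cherry");
        System.out.println(find(lexicon, "banana")); // prints 1
        System.out.println(find(lexicon, "durian")); // prints -1
    }
}
```

Because every entry has the same length, the i-th entry starts at byte i * ENTRY_LENGTH, which is what makes the search possible without scanning.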

seekEntry

public boolean seekEntry(int i)
Seeks the i-th entry of the lexicon. TODO: read a byte array from the file and decode it, instead of reading the different pieces of information separately.

Overrides:
seekEntry in class Lexicon
Parameters:
i - The index of the entry we are looking for.
Returns:
true if the entry was found, false otherwise.
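The approach named in the TODO, reading a whole entry as one byte array and then decoding it, might look like the sketch below. It assumes a fixed-width, zero-padded UTF-8 term followed by the fields listed for lexiconEntryLength; the class, the TERM_BYTES value, and the method names are hypothetical, and the real on-disk encoding of the string may differ.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EntryDecodeSketch {
    static final int TERM_BYTES = 20; // hypothetical width of the term string
    // string + termCode + docf + tf + endOffset + endBitOffset
    static final int ENTRY_LENGTH = TERM_BYTES + 4 + 4 + 4 + 8 + 1;

    static final class Entry {
        final String term;
        final int termId;
        final int docFreq;
        final int termFreq;
        final long endOffset;
        final byte endBitOffset;

        Entry(String term, int termId, int docFreq, int termFreq,
              long endOffset, byte endBitOffset) {
            this.term = term;
            this.termId = termId;
            this.docFreq = docFreq;
            this.termFreq = termFreq;
            this.endOffset = endOffset;
            this.endBitOffset = endBitOffset;
        }
    }

    /** Packs one entry into a fixed-length record (long terms are truncated). */
    static byte[] encode(Entry e) {
        ByteBuffer buf = ByteBuffer.allocate(ENTRY_LENGTH);
        buf.put(Arrays.copyOf(e.term.getBytes(StandardCharsets.UTF_8), TERM_BYTES));
        buf.putInt(e.termId).putInt(e.docFreq).putInt(e.termFreq);
        buf.putLong(e.endOffset).put(e.endBitOffset);
        return buf.array();
    }

    /** Decodes all fields of one record from a single byte array. */
    static Entry decode(byte[] record) {
        ByteBuffer buf = ByteBuffer.wrap(record);
        byte[] raw = new byte[TERM_BYTES];
        buf.get(raw);
        int len = 0;
        while (len < TERM_BYTES && raw[len] != 0) {
            len++;
        }
        String term = new String(raw, 0, len, StandardCharsets.UTF_8);
        // Arguments are evaluated left to right, matching the field order.
        return new Entry(term, buf.getInt(), buf.getInt(), buf.getInt(),
                buf.getLong(), buf.get());
    }

    /** Reads the i-th record with one seek and one read; null if out of range. */
    static Entry readEntry(RandomAccessFile file, int i) throws IOException {
        long offset = (long) i * ENTRY_LENGTH;
        if (i < 0 || offset + ENTRY_LENGTH > file.length()) {
            return null;
        }
        byte[] record = new byte[ENTRY_LENGTH];
        file.seek(offset);
        file.readFully(record);
        return decode(record);
    }
}
```

Reading the record in one call replaces several small reads with a single I/O operation; the decoding then happens entirely in memory.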

getLexiconEntry

public LexiconEntry getLexiconEntry(int termid)
Returns a LexiconEntry describing all the information in the lexicon about the term denoted by termid

Overrides:
getLexiconEntry in class Lexicon
Parameters:
termid - the termid of the term of interest
Returns:
A LexiconEntry holding all information about the term's entry in the lexicon, or null if the termid is not found.

getLexiconEntry

public LexiconEntry getLexiconEntry(java.lang.String _term)
Returns a LexiconEntry describing all the information in the lexicon about the term denoted by _term

Overrides:
getLexiconEntry in class Lexicon
Parameters:
_term - the String term that is of interest
Returns:
A LexiconEntry holding all information about the term's entry in the lexicon, or null if the term is not found.

updateEntry

public boolean updateEntry(int i,
                           int frequency,
                           long endOffset,
                           byte endBitOffset)
Deprecated. The Lexicon class is only used for reading the lexicon file, and not for writing any information.

Updates an entry already stored in the lexicon file: its term frequency, its endOffset in bytes, and its endBitOffset within the last byte. The entry is specified by its index.

Overrides:
updateEntry in class Lexicon
Parameters:
i - the index of the entry to update
frequency - the term's frequency
endOffset - the offset of the ending byte in the inverted file
endBitOffset - the offset in bits within the ending byte of the term's entry in the inverted file
Returns:
true if the information is updated properly, false otherwise.

numberOfEntries

public static int numberOfEntries(java.io.File f)

numberOfEntries

public static int numberOfEntries(java.lang.String filename)

Terrier Information Retrieval Platform 1.1.1. Copyright 2004-2007 University of Glasgow