Terrier IR Platform
1.1.1

uk.ac.gla.terrier.structures
Class Lexicon

java.lang.Object
  extended by uk.ac.gla.terrier.structures.Lexicon
All Implemented Interfaces:
java.lang.Iterable<java.lang.String>, Closeable
Direct Known Subclasses:
BlockLexicon, UTFLexicon

public class Lexicon
extends java.lang.Object
implements java.lang.Iterable<java.lang.String>, Closeable

Implements the lexicon structure. The lexicon file holds the actual data about the terms and takes its name from ApplicationSetup.LEXICON_FILENAME. A second file is also created and used; it maps each term's code to the offset of that term's entry in the lexicon. The name of this second file is given by ApplicationSetup.LEXICON_INDEX_FILENAME.
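The relationship between the two files can be pictured with a small in-memory sketch. It is not Terrier's code: the class name LexiconIndexSketch, the entry size of 41 bytes, and the assumption that term codes are dense (0 to n-1) are all hypothetical and chosen only for illustration.

```java
/** Illustrative only: maps a term code to the byte offset of its entry in a lexicon. */
final class LexiconIndexSketch {
    // Hypothetical entry size in bytes; the real value is Lexicon.lexiconEntryLength.
    static final int ENTRY_LENGTH = 41;

    /**
     * Builds the index. termIdsInFileOrder[pos] is the term code of the entry stored
     * at position pos of the lexicon (entries are assumed to be sorted by term, so
     * the codes appear in arbitrary order). Codes are assumed to be 0..n-1.
     */
    static long[] buildIndex(int[] termIdsInFileOrder) {
        long[] offsets = new long[termIdsInFileOrder.length];
        for (int pos = 0; pos < termIdsInFileOrder.length; pos++) {
            offsets[termIdsInFileOrder[pos]] = (long) pos * ENTRY_LENGTH;
        }
        return offsets;
    }

    /** Returns the byte offset of the entry for termId, or -1 if the code is unknown. */
    static long offsetOf(long[] index, int termId) {
        if (termId < 0 || termId >= index.length) {
            return -1L;
        }
        return index[termId];
    }
}
```

With such a mapping, a lookup by term code needs one read from the index followed by one positioned read in the lexicon, whereas a lookup by term string needs a search over the sorted entries.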

Version:
$Revision: 1.41 $
Author:
Gianni Amati, Vassilis Plachouras
See Also:
ApplicationSetup.LEXICON_FILENAME, ApplicationSetup.LEXICON_INDEX_FILENAME

Field Summary
static int lexiconEntryLength
          The size in bytes of an entry in the lexicon file.
 
Constructor Summary
Lexicon()
          A default constructor.
Lexicon(java.lang.String lexiconName)
          Constructs an instance of Lexicon and opens the corresponding file.
Lexicon(java.lang.String path, java.lang.String prefix)
           
 
Method Summary
 void close()
          Closes the lexicon and lexicon index files.
 boolean findTerm(int _termId)
          Finds the term given its term code.
 boolean findTerm(java.lang.String _term)
          Performs a binary search in the lexicon in order to locate the given term.
 byte getEndBitOffset()
          Returns the bit offset in the last byte of the term's entry in the inverted file.
 long getEndOffset()
          Returns the ending offset of the term's entry in the inverted file.
 LexiconEntry getLexiconEntry(int termid)
          Returns a LexiconEntry describing all the information in the lexicon about the term denoted by termid
 LexiconEntry getLexiconEntry(java.lang.String _term)
          Returns a LexiconEntry describing all the information in the lexicon about the term denoted by _term
 int getNt()
          Returns the document frequency for the current term.
 long getNumberOfLexiconEntries()
          Returns the number of entries in the lexicon.
 byte getStartBitOffset()
          The bit offset in the starting byte of the entry in the inverted file.
 long getStartOffset()
          Returns the beginning of the term's entry in the inverted file.
 java.lang.String getTerm()
          Returns the string representation of the term that was last sought.
 int getTermId()
          Returns the term's id.
 int getTF()
          Returns the term frequency for the term that was last sought.
 java.util.Iterator<java.lang.String> iterator()
          Returns an iterator that gives every item in the lexicon, in lexical order.
static int numberOfEntries(java.io.File f)
          Returns the number of entries in the lexicon file specified by f.
static int numberOfEntries(java.lang.String filename)
          Returns the number of entries in the lexicon file specified by filename.
 void print()
          Prints out the contents of the lexicon file.
 boolean seekEntry(int i)
          Seeks the i-th entry of the lexicon.
 boolean updateEntry(int i, int frequency, long endOffset, byte endBitOffset)
          Deprecated. The Lexicon class is only used for reading the lexicon file, and not for writing any information.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

lexiconEntryLength

public static final int lexiconEntryLength
The size in bytes of an entry in the lexicon file. An entry consists of a string (the term), an int (termCode), an int (docf), an int (tf), a long (the offset in bytes of the end of the term's entry in the inverted file) and a byte (the offset in bits within the last byte of the term's entry in the inverted file).
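As a worked illustration of this layout, the sketch below adds up the field sizes and packs one entry into a byte array. It rests on assumptions this page does not confirm: a fixed-width term field of 20 bytes (TERM_BYTES is a hypothetical name), ASCII terms, zero padding, and big-endian encoding as produced by java.nio.ByteBuffer. The authoritative size is whatever lexiconEntryLength holds.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Illustrative only: one possible fixed-length encoding of a lexicon entry. */
final class LexiconEntryLayoutSketch {
    // Assumption: the term is stored in a fixed-width field of 20 bytes.
    static final int TERM_BYTES = 20;

    // string + three ints + one long + one byte = 20 + 12 + 8 + 1 = 41 bytes.
    static final int ENTRY_LENGTH =
            TERM_BYTES + 3 * Integer.BYTES + Long.BYTES + Byte.BYTES;

    /** Packs the fields in the order listed in the field description. */
    static byte[] encode(String term, int termCode, int docf, int tf,
                         long endOffset, byte endBitOffset) {
        byte[] raw = term.getBytes(StandardCharsets.US_ASCII);
        if (raw.length > TERM_BYTES) {
            throw new IllegalArgumentException("term longer than " + TERM_BYTES + " bytes");
        }
        ByteBuffer buf = ByteBuffer.allocate(ENTRY_LENGTH);
        buf.put(raw);
        buf.position(TERM_BYTES); // unused term bytes stay zero
        buf.putInt(termCode).putInt(docf).putInt(tf).putLong(endOffset).put(endBitOffset);
        return buf.array();
    }
}
```

Because every entry has the same length, the i-th entry always starts at byte i * ENTRY_LENGTH, which is what makes positioned reads and binary search over the file possible.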

Constructor Detail

Lexicon

public Lexicon()
A default constructor.


Lexicon

public Lexicon(java.lang.String path,
               java.lang.String prefix)

Lexicon

public Lexicon(java.lang.String lexiconName)
Constructs an instance of Lexicon and opens the corresponding file.

Parameters:
lexiconName - the name of the lexicon file.
Method Detail

close

public void close()
Closes the lexicon and lexicon index files.

Specified by:
close in interface Closeable

print

public void print()
Prints out the contents of the lexicon file. Streams are used to read the lexicon file.


findTerm

public boolean findTerm(int _termId)
Finds the term given its term code.

Parameters:
_termId - the term's identifier
Returns:
true if the term is found, and false otherwise.

findTerm

public boolean findTerm(java.lang.String _term)
Performs a binary search in the lexicon in order to locate the given term. If the term is located, the properties termCharacters, documentFrequency, termFrequency, startOffset, startBitOffset, endOffset and endBitOffset contain the values related to the term.

Parameters:
_term - The term to search for.
Returns:
true if the term is found, and false otherwise.
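The binary search described for findTerm(String) can be sketched over an in-memory array of fixed-length entries. This is not Terrier's implementation: the class name, the 20-byte term field, the 41-byte entry size, ASCII terms and zero padding are all assumptions made for the example, and the real class reads from the lexicon file rather than from a byte array.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: binary search over fixed-length entries sorted by term. */
final class LexiconSearchSketch {
    static final int TERM_BYTES = 20;   // assumed width of the term field
    static final int ENTRY_LENGTH = 41; // assumed entry size in bytes

    /** Reads the term stored in the i-th entry, dropping the zero padding. */
    static String termAt(byte[] lexicon, int i) {
        return new String(lexicon, i * ENTRY_LENGTH, TERM_BYTES,
                StandardCharsets.US_ASCII).trim();
    }

    /** Returns the index of the entry holding term, or -1 if it is absent. */
    static int find(byte[] lexicon, String term) {
        int low = 0;
        int high = lexicon.length / ENTRY_LENGTH - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;
            int cmp = term.compareTo(termAt(lexicon, mid));
            if (cmp == 0) {
                return mid;
            } else if (cmp < 0) {
                high = mid - 1;
            } else {
                low = mid + 1;
            }
        }
        return -1;
    }

    /** Test helper: packs already-sorted terms into entries, other fields zero. */
    static byte[] build(String... sortedTerms) {
        byte[] data = new byte[sortedTerms.length * ENTRY_LENGTH];
        for (int i = 0; i < sortedTerms.length; i++) {
            byte[] raw = sortedTerms[i].getBytes(StandardCharsets.US_ASCII);
            System.arraycopy(raw, 0, data, i * ENTRY_LENGTH, Math.min(raw.length, TERM_BYTES));
        }
        return data;
    }
}
```

The search only works if the entries are stored in the same order that compareTo uses, which is why the sketch requires its input terms to be sorted.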

getEndBitOffset

public byte getEndBitOffset()
Returns the bit offset in the last byte of the term's entry in the inverted file.

Returns:
byte the bit offset in the last byte of the term's entry in the inverted file

getEndOffset

public long getEndOffset()
Returns the ending offset of the term's entry in the inverted file.

Returns:
long The ending byte of the term's entry in the inverted file.

getNt

public int getNt()
Returns the document frequency for the current term.

Returns:
int The document frequency for the given term

getNumberOfLexiconEntries

public long getNumberOfLexiconEntries()
Returns the number of entries in the lexicon.

Returns:
the number of entries in the lexicon.

getStartBitOffset

public byte getStartBitOffset()
The bit offset in the starting byte of the entry in the inverted file.

Returns:
byte The bit offset in the first byte of the entry in the inverted file

getStartOffset

public long getStartOffset()
Returns the beginning of the term's entry in the inverted file.

Returns:
long the start offset (in bytes) in the inverted file
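The entry layout stores only an end offset and an end bit offset, and this page does not say how the start position is obtained. One plausible scheme, assumed here and not confirmed by this documentation, is that a term's postings begin at the bit immediately following the end of the previous term's postings. The class and method names below are hypothetical.

```java
/** Illustrative only: derives a start position from the previous entry's end position. */
final class StartOffsetSketch {
    /**
     * Returns {startByte, startBit} for an entry that begins one bit after
     * the previous entry ends. Bits within a byte are numbered 0..7.
     */
    static long[] startAfter(long prevEndOffset, byte prevEndBitOffset) {
        long startByte = prevEndOffset;
        int startBit = prevEndBitOffset + 1;
        if (startBit == 8) { // previous entry used the whole byte: move to the next one
            startByte++;
            startBit = 0;
        }
        return new long[] {startByte, startBit};
    }

    /** Number of bits covered by an entry, with both end points inclusive. */
    static long bitLength(long startByte, int startBit, long endByte, int endBit) {
        return (endByte - startByte) * 8 + (endBit - startBit) + 1;
    }
}
```

For example, under this assumption an entry following one that ended at byte 10, bit 7 would start at byte 11, bit 0.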

getTerm

public java.lang.String getTerm()
Returns the string representation of the term that was last sought.

Returns:
java.lang.String The string representation of the term that was last sought.

getTermId

public int getTermId()
Returns the term's id.

Returns:
int the term's id.

getTF

public int getTF()
Returns the term frequency for the term that was last sought.

Returns:
int The term frequency in the collection.
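The difference between the two statistics returned by getNt() and getTF() can be shown on a toy collection: the document frequency counts the documents that contain the term, while the term frequency in the collection counts every occurrence. The sketch below is a self-contained illustration with hypothetical names; it does not use Terrier.

```java
import java.util.List;

/** Illustrative only: document frequency versus collection term frequency. */
final class FrequencySketch {
    /** Number of documents that contain term at least once. */
    static int documentFrequency(List<String[]> docs, String term) {
        int nt = 0;
        for (String[] doc : docs) {
            for (String token : doc) {
                if (token.equals(term)) {
                    nt++;
                    break; // count each document once
                }
            }
        }
        return nt;
    }

    /** Total number of occurrences of term across all documents. */
    static int termFrequency(List<String[]> docs, String term) {
        int tf = 0;
        for (String[] doc : docs) {
            for (String token : doc) {
                if (token.equals(term)) {
                    tf++;
                }
            }
        }
        return tf;
    }
}
```

The term frequency is therefore never smaller than the document frequency for the same term.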

seekEntry

public boolean seekEntry(int i)
Seeks the i-th entry of the lexicon. TODO: read a byte array from the file and decode it, instead of reading the different pieces of information separately.

Parameters:
i - The index of the entry we are looking for.
Returns:
true if the entry was found, false otherwise.
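The approach named in the TODO, reading a whole entry as a byte array and decoding it in one pass, might look roughly like the sketch below. It is not Terrier's code. The 20-byte term field, ASCII terms, zero padding, big-endian field encoding, and the Entry holder class are assumptions made for the example; the field order follows the description of lexiconEntryLength.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Illustrative only: decodes one fixed-length entry from a byte array. */
final class EntryDecodeSketch {
    static final int TERM_BYTES = 20; // assumed width of the term field
    static final int ENTRY_LENGTH = TERM_BYTES + 4 + 4 + 4 + 8 + 1;

    /** Plain holder for the decoded fields (hypothetical, not Terrier's LexiconEntry). */
    static final class Entry {
        final String term;
        final int termId;
        final int docFreq;
        final int termFreq;
        final long endOffset;
        final byte endBitOffset;

        Entry(String term, int termId, int docFreq, int termFreq,
              long endOffset, byte endBitOffset) {
            this.term = term;
            this.termId = termId;
            this.docFreq = docFreq;
            this.termFreq = termFreq;
            this.endOffset = endOffset;
            this.endBitOffset = endBitOffset;
        }
    }

    /** Decodes the i-th entry, or returns null if i is out of range. */
    static Entry decode(byte[] lexicon, int i) {
        if (i < 0 || ((long) i + 1) * ENTRY_LENGTH > lexicon.length) {
            return null;
        }
        ByteBuffer buf = ByteBuffer.wrap(lexicon, i * ENTRY_LENGTH, ENTRY_LENGTH);
        byte[] termBytes = new byte[TERM_BYTES];
        buf.get(termBytes);
        String term = new String(termBytes, StandardCharsets.US_ASCII).trim();
        int termId = buf.getInt();
        int docFreq = buf.getInt();
        int termFreq = buf.getInt();
        long endOffset = buf.getLong();
        byte endBitOffset = buf.get();
        return new Entry(term, termId, docFreq, termFreq, endOffset, endBitOffset);
    }
}
```

Reading the entry in one call and decoding it from memory replaces several small reads with a single larger one, which is the saving the TODO appears to be after.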

updateEntry

public boolean updateEntry(int i,
                           int frequency,
                           long endOffset,
                           byte endBitOffset)
Deprecated. The Lexicon class is only used for reading the lexicon file, and not for writing any information.

Updates the term frequency, the endOffset in bytes, and the endBitOffset in the last byte of an entry that is already stored in the lexicon file. The entry is specified by its index.

Parameters:
i - the i-th entry
frequency - the term's frequency
endOffset - the offset of the ending byte in the inverted file
endBitOffset - the offset in bits in the ending byte in the term's entry in inverted file
Returns:
true if the information is updated properly, and false otherwise.

numberOfEntries

public static int numberOfEntries(java.io.File f)
Returns the number of entries in the lexicon file specified by f.

Parameters:
f - The file to find the number of entries in

numberOfEntries

public static int numberOfEntries(java.lang.String filename)
Returns the number of entries in the lexicon file specified by filename.

Parameters:
filename - The name of the lexicon file to find the number of entries in
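Since every entry has the same size, the number of entries can be derived from the file length alone. The sketch below shows that arithmetic; it is an assumption about how such a count could be computed, not a description of Terrier's code, and the entry size of 41 bytes is a hypothetical stand-in for lexiconEntryLength.

```java
import java.io.File;

/** Illustrative only: counts fixed-length entries from a file's length. */
final class EntryCountSketch {
    static final int ENTRY_LENGTH = 41; // assumed entry size in bytes

    /** Number of complete entries in a file of the given length. */
    static int entriesFor(long fileLengthBytes) {
        return (int) (fileLengthBytes / ENTRY_LENGTH);
    }

    /** File.length() returns 0 for a missing file, so the count is 0 in that case. */
    static int numberOfEntries(File f) {
        return entriesFor(f.length());
    }

    static int numberOfEntries(String filename) {
        return numberOfEntries(new File(filename));
    }
}
```

Integer division discards any trailing partial entry, so a truncated file yields the number of complete entries only.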

getLexiconEntry

public LexiconEntry getLexiconEntry(int termid)
Returns a LexiconEntry describing all the information in the lexicon about the term denoted by termid

Parameters:
termid - the termid of the term of interest
Returns:
LexiconEntry all information about the term's entry in the lexicon. null if termid not found

getLexiconEntry

public LexiconEntry getLexiconEntry(java.lang.String _term)
Returns a LexiconEntry describing all the information in the lexicon about the term denoted by _term

Parameters:
_term - the String term that is of interest
Returns:
LexiconEntry all information about the term's entry in the lexicon. null if the term is not found

iterator

public java.util.Iterator<java.lang.String> iterator()
Returns an iterator that gives every item in the lexicon, in lexical order. The underlying implementation uses a lexicon input stream.

Specified by:
iterator in interface java.lang.Iterable<java.lang.String>
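Because the class implements java.lang.Iterable<java.lang.String>, it can be used directly in a for-each loop. The sketch below shows the pattern with a self-contained stand-in that walks fixed-length entries held in memory. The stand-in class, the 20-byte term field and the 41-byte entry size are assumptions for illustration; the real iterator is described above as reading from a lexicon input stream.

```java
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Illustrative stand-in, not Terrier's Lexicon: iterates terms in stored order. */
final class TermIterableSketch implements Iterable<String> {
    static final int TERM_BYTES = 20;   // assumed width of the term field
    static final int ENTRY_LENGTH = 41; // assumed entry size in bytes

    private final byte[] records;

    TermIterableSketch(byte[] records) {
        this.records = records;
    }

    /** Packs already-sorted terms into zero-padded entries, other fields zero. */
    static TermIterableSketch of(String... sortedTerms) {
        byte[] data = new byte[sortedTerms.length * ENTRY_LENGTH];
        for (int i = 0; i < sortedTerms.length; i++) {
            byte[] raw = sortedTerms[i].getBytes(StandardCharsets.US_ASCII);
            System.arraycopy(raw, 0, data, i * ENTRY_LENGTH, Math.min(raw.length, TERM_BYTES));
        }
        return new TermIterableSketch(data);
    }

    @Override
    public Iterator<String> iterator() {
        return new Iterator<String>() {
            private int index = 0;

            @Override
            public boolean hasNext() {
                return (index + 1) * ENTRY_LENGTH <= records.length;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                String term = new String(records, index * ENTRY_LENGTH, TERM_BYTES,
                        StandardCharsets.US_ASCII).trim();
                index++;
                return term;
            }
        };
    }
}
```

A caller would then write for (String term : lexicon) { ... } and receive the terms in the order in which the entries are stored.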

Terrier Information Retrieval Platform 1.1.1. Copyright 2004-2007 University of Glasgow