org.terrier.structures.indexing.singlepass
Class RunsMerger

java.lang.Object
  extended by org.terrier.structures.indexing.singlepass.RunsMerger
Direct Known Subclasses:
HadoopRunsMerger

public class RunsMerger
extends java.lang.Object

Merges a set of N runs using a priority queue. Each element of the queue is a RunIterator pointing at a different run on disk. Because each run is sorted, each merging step only needs to compare the heads of the elements in the queue. As the runs are merged, the result is written to disk using a BitOut.
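
The merging strategy described above can be illustrated with a minimal, self-contained sketch. It is not the Terrier implementation: KWayMergeSketch and RunCursor are hypothetical names, each run is modelled as an in-memory sorted list of terms rather than a RunIterator over a file, and ties between runs are broken by run number as an assumption. Unlike the real merger, it does not combine the postings of a term that appears in several runs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

final class KWayMergeSketch {

    /** Hypothetical stand-in for a run iterator: tracks the current head of one sorted run. */
    static final class RunCursor {
        final int runNo;
        final Iterator<String> rest;
        String head;

        RunCursor(int runNo, List<String> sortedRun) {
            this.runNo = runNo;
            this.rest = sortedRun.iterator();
            this.head = rest.hasNext() ? rest.next() : null;
        }

        /** Moves to the next term; returns false once the run is exhausted. */
        boolean advance() {
            head = rest.hasNext() ? rest.next() : null;
            return head != null;
        }
    }

    /** Merges sorted runs into one sorted list by comparing only the heads of the runs. */
    public static List<String> merge(List<List<String>> runs) {
        // Order cursors by their current term; the run-number tie-break is an assumption.
        PriorityQueue<RunCursor> queue = new PriorityQueue<>(
            Math.max(1, runs.size()),
            Comparator.comparing((RunCursor c) -> c.head).thenComparingInt(c -> c.runNo));
        for (int i = 0; i < runs.size(); i++) {
            RunCursor cursor = new RunCursor(i, runs.get(i));
            if (cursor.head != null) {
                queue.add(cursor); // empty runs never enter the queue
            }
        }
        List<String> merged = new ArrayList<>();
        while (!queue.isEmpty()) {
            RunCursor smallest = queue.poll();
            merged.add(smallest.head);
            if (smallest.advance()) {
                queue.add(smallest); // re-insert with its new head
            }
            // An exhausted run is simply not re-inserted.
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> runs = Arrays.asList(
            Arrays.asList("apple", "cat"),
            Arrays.asList("bird", "cat", "dog"));
        System.out.println(merge(runs)); // prints [apple, bird, cat, cat, dog]
    }
}
```

With N runs, each step costs O(log N) queue operations, which is why only the heads of the runs need to be compared.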

Since:
2.0
Author:
Roi Blanco and Craig Macdonald

Nested Class Summary
static class RunsMerger.PostingComparator
          Implements a comparator for RunIterators (so it can be used by the queue).
 
Field Summary
protected  BitOut bos
          BitOut used to write the merged postings to disk
protected  int currentTerm
          Number of terms written
protected  int lastDocFreq
          Last document's frequency
protected  int lastDocument
          Last document written in the stream
protected  int lastFreq
          Frequency in the run of the last term merged
protected  java.lang.String lastTermWritten
          Last term written to disk (useful for terms appearing in multiple runs)
protected  RunIterator myRun
          RunReader reference for merging
protected  int numberOfPointers
          Number of pointers written
protected  java.util.Queue<RunIterator> queue
          Heap for the postings coming from different runs.
protected  RunIteratorFactory runsSource
           
protected  BitFilePosition startOffset
           
protected  LexiconEntry termStatistics
           
 
Constructor Summary
RunsMerger(RunIteratorFactory _runsSource)
          Constructs a merger that obtains its runs from the given RunIteratorFactory.
 
Method Summary
 void beginMerge(int size, java.lang.String fileName)
          Begins the multiway merging phase.
 void endMerge(LexiconOutputStream<java.lang.String> lexStream)
          Ends the merging phase, writes the last entry and closes the streams.
 byte getBitOffset()
           
 BitOut getBos()
          Returns the BitOut used to write the merged postings.
 long getByteOffset()
           
 int getLastDocFreq()
           
 int getLastFreq()
           
 java.lang.String getLastTermWritten()
           
 int getNumberOfPointers()
           
 int getNumberOfTerms()
           
protected  void init(int size, BitOut invertedFile)
           
protected  void init(int size, java.lang.String fileName)
          Begins the merge, initialising the structures.
 boolean isDone()
          Indicates whether the merging is done or not
 void mergeOne(LexiconOutputStream<java.lang.String> lexStream)
          Merges one term in the runs.
 void setBos(BitOut _bos)
          Sets the BitOut used to write the merged postings.
 void setLastTermWritten(java.lang.String _lastTermWritten)
          Setter for the last term written.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

queue

protected java.util.Queue<RunIterator> queue
Heap for the postings coming from different runs, ordered alphabetically by term.


bos

protected BitOut bos
BitOut used to write the merged postings to disk


lastTermWritten

protected java.lang.String lastTermWritten
Last term written to disk (useful for terms appearing in multiple runs)


termStatistics

protected LexiconEntry termStatistics

lastFreq

protected int lastFreq
Frequency in the run of the last term merged


lastDocument

protected int lastDocument
Last document written in the stream


lastDocFreq

protected int lastDocFreq
Last document's frequency


myRun

protected RunIterator myRun
RunReader reference for merging


currentTerm

protected int currentTerm
Number of terms written


numberOfPointers

protected int numberOfPointers
Number of pointers written


startOffset

protected BitFilePosition startOffset

runsSource

protected RunIteratorFactory runsSource
Constructor Detail

RunsMerger

public RunsMerger(RunIteratorFactory _runsSource)
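Constructs a merger that obtains its runs from the given RunIteratorFactory.

Parameters:
_runsSource - factory that supplies the RunIterators for the runs to be merged.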
Method Detail

getLastFreq

public int getLastFreq()
Returns:
the last frequency written.

getLastDocFreq

public int getLastDocFreq()
Returns:
the last document frequency written.

getNumberOfTerms

public int getNumberOfTerms()
Returns:
the number of terms written.

getNumberOfPointers

public int getNumberOfPointers()
Returns:
the number of pointers written.

isDone

public boolean isDone()
Indicates whether the merging is done or not

Returns:
true if there are no more elements to merge

getByteOffset

public long getByteOffset()
Returns:
the byte offset in the BitOut (used for lexicon writing)

getBitOffset

public byte getBitOffset()
Returns:
the bit offset in the BitOut (used for lexicon writing)
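
As an illustration of how a byte offset and a bit offset together identify a position in a bit-level stream, the following sketch converts between the pair and an absolute bit count. It assumes that the bit offset counts bits within the current byte (0 to 7) and that bytes are 8 bits wide; BitPositionSketch and its methods are hypothetical names and not part of Terrier.

```java
final class BitPositionSketch {

    /** Absolute position in bits, assuming bitOffset lies in the range 0..7. */
    public static long toBits(long byteOffset, byte bitOffset) {
        return byteOffset * 8L + bitOffset;
    }

    /** Whole bytes that precede the given absolute bit position. */
    public static long byteOffsetOf(long bits) {
        return bits / 8L;
    }

    /** Bits consumed within the current byte at the given absolute bit position. */
    public static byte bitOffsetOf(long bits) {
        return (byte) (bits % 8L);
    }

    public static void main(String[] args) {
        long bits = toBits(3L, (byte) 5);
        System.out.println(bits);               // prints 29
        System.out.println(byteOffsetOf(bits)); // prints 3
        System.out.println(bitOffsetOf(bits));  // prints 5
    }
}
```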

getLastTermWritten

public java.lang.String getLastTermWritten()
Returns:
the String with the last term written to disk.

setLastTermWritten

public void setLastTermWritten(java.lang.String _lastTermWritten)
Setter for the last term written.

Parameters:
_lastTermWritten - String with the last term written.

init

protected void init(int size,
                    java.lang.String fileName)
             throws java.lang.Exception
Begins the merge, initialising the structures. Note that the file names must be in run-id order.

Parameters:
size - number of runs on disk.
fileName - String with the file name of the final inverted file.
Throws:
java.io.IOException - if an I/O error occurs.
java.lang.Exception

init

protected void init(int size,
                    BitOut invertedFile)
             throws java.lang.Exception
Throws:
java.lang.Exception

beginMerge

public void beginMerge(int size,
                       java.lang.String fileName)
                throws java.lang.Exception
Begins the multiway merging phase.

Parameters:
size - number of runs to be merged.
fileName - output filename.
Throws:
java.lang.Exception - if an I/O error occurs.

mergeOne

public void mergeOne(LexiconOutputStream<java.lang.String> lexStream)
              throws java.lang.Exception
Merges one term in the runs. If a run is exhausted, it is closed and removed from the queue.

Parameters:
lexStream - LexiconOutputStream used to write the lexicon.
Throws:
java.lang.Exception - if an I/O error occurs.

endMerge

public void endMerge(LexiconOutputStream<java.lang.String> lexStream)
              throws java.io.IOException
Ends the merging phase, writes the last entry and closes the streams.

Parameters:
lexStream - LexiconOutputStream used to write the lexicon.
Throws:
java.io.IOException - if an I/O error occurs.
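
The call sequence implied by beginMerge, isDone, mergeOne and endMerge can be sketched with a toy stand-in. Everything below is hypothetical: TermMerger mirrors only the method names documented here, a List<String> replaces the LexiconOutputStream, the checked exceptions declared by the real methods are omitted, and the file name is a placeholder. The toy also assumes that an entry for a term is emitted only once a different term is seen, which would be one reason for endMerge having to write the last entry.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

final class MergeLoopSketch {

    /** Hypothetical stand-in with the same call sequence as the merger documented above. */
    interface TermMerger {
        void beginMerge(int size, String fileName);
        boolean isDone();
        void mergeOne(List<String> lexicon);
        void endMerge(List<String> lexicon);
    }

    /** Toy merger over an already sorted stream of terms; counts repeats of each term. */
    static final class ToyMerger implements TermMerger {
        private final Deque<String> pending = new ArrayDeque<>();
        private String lastTermWritten;
        private int lastFreq;

        ToyMerger(List<String> sortedTerms) {
            pending.addAll(sortedTerms);
        }

        public void beginMerge(int size, String fileName) {
            lastTermWritten = null;
            lastFreq = 0;
        }

        public boolean isDone() {
            return pending.isEmpty();
        }

        public void mergeOne(List<String> lexicon) {
            String term = pending.poll();
            if (term.equals(lastTermWritten)) {
                lastFreq++; // same term coming from another run
                return;
            }
            if (lastTermWritten != null) {
                lexicon.add(lastTermWritten + "=" + lastFreq); // flush the previous term
            }
            lastTermWritten = term;
            lastFreq = 1;
        }

        public void endMerge(List<String> lexicon) {
            if (lastTermWritten != null) {
                lexicon.add(lastTermWritten + "=" + lastFreq); // the last entry is still pending
            }
        }
    }

    /** Drives a merger from start to finish. */
    public static void drive(TermMerger merger, int runCount, String fileName, List<String> lexicon) {
        merger.beginMerge(runCount, fileName);
        while (!merger.isDone()) {
            merger.mergeOne(lexicon);
        }
        merger.endMerge(lexicon);
    }

    public static void main(String[] args) {
        List<String> lexicon = new ArrayList<>();
        ToyMerger merger = new ToyMerger(Arrays.asList("apple", "bird", "cat", "cat", "dog"));
        drive(merger, 2, "placeholder-output", lexicon); // file name is a placeholder
        System.out.println(lexicon); // prints [apple=1, bird=1, cat=2, dog=1]
    }
}
```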

getBos

public BitOut getBos()
Returns the BitOut used to write the merged postings.

Returns:
the BitOut used to write the merged postings.

setBos

public void setBos(BitOut _bos)
Sets the BitOut used to write the merged postings.

Parameters:
_bos - the BitOut to write the merged postings to.


Terrier 3.5. Copyright © 2004-2011 University of Glasgow