Hadoop_BasicSinglePassIndexer (Terrier Information Retrieval Platform version 2.2.1 API Specification)

uk.ac.gla.terrier.indexing.hadoop
Class Hadoop_BasicSinglePassIndexer

java.lang.Object
  uk.ac.gla.terrier.indexing.Indexer
      uk.ac.gla.terrier.indexing.BasicIndexer
          uk.ac.gla.terrier.indexing.BasicSinglePassIndexer
              uk.ac.gla.terrier.indexing.hadoop.Hadoop_BasicSinglePassIndexer

All Implemented Interfaces:: java.io.Closeable, org.apache.hadoop.mapred.JobConfigurable, org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,Wrapper<Document>,MapEmittedTerm,MapEmittedPostingList>, org.apache.hadoop.mapred.Reducer<MapEmittedTerm,MapEmittedPostingList,java.lang.Object,java.lang.Object>

Direct Known Subclasses:: Hadoop_BlockSinglePassIndexer

public class Hadoop_BasicSinglePassIndexer
extends BasicSinglePassIndexer
implements org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,Wrapper<Document>,MapEmittedTerm,MapEmittedPostingList>, org.apache.hadoop.mapred.Reducer<MapEmittedTerm,MapEmittedPostingList,java.lang.Object,java.lang.Object>

Single Pass Map-Reduce indexer.

Map phase processing

Indexes as a Map task, taking in a series of documents and emitting posting lists for terms whenever memory becomes exhausted. Two side-files are created for each map task: the first (the run file) records how many documents were indexed in each flush of each map; the second contains the statistics for each document in a miniature document index.
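
The flush-on-exhaustion behaviour described above can be illustrated with a minimal, self-contained sketch. It is not Terrier's implementation: the class `FlushingMapSketch`, its fields, and the fixed posting budget that stands in for the real memory check are all hypothetical, and the sketch assumes document ids are local to each flush.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: buffer postings per term, flush when a budget is reached. */
class FlushingMapSketch {
    /** Assumption: a posting count stands in for the real memory check. */
    private final int maxBufferedPostings;
    /** term -> ids of documents (local to the current flush) containing it. */
    private final Map<String, List<Integer>> buffer = new TreeMap<>();
    private int bufferedPostings = 0;
    private int docsSinceFlush = 0;

    /** Stand-in for the run side-file: documents covered by each flush. */
    final List<Integer> docsPerFlush = new ArrayList<>();
    /** Stand-in for the output collector: one "term@flushNo=postings" record per emit. */
    final List<String> emitted = new ArrayList<>();

    FlushingMapSketch(int maxBufferedPostings) {
        this.maxBufferedPostings = maxBufferedPostings;
    }

    /** Processes one document, given as its sequence of terms. */
    void map(String[] terms) {
        int localDocId = docsSinceFlush;
        for (String term : terms) {
            List<Integer> postings = buffer.computeIfAbsent(term, k -> new ArrayList<>());
            // Record each document at most once per term.
            if (postings.isEmpty() || postings.get(postings.size() - 1) != localDocId) {
                postings.add(localDocId);
                bufferedPostings++;
            }
        }
        docsSinceFlush++;
        if (bufferedPostings >= maxBufferedPostings) {
            flush();
        }
    }

    /** Emits every buffered posting list and notes how many documents the flush covered. */
    void flush() {
        if (docsSinceFlush == 0) {
            return;
        }
        int flushNo = docsPerFlush.size();
        for (Map.Entry<String, List<Integer>> e : buffer.entrySet()) {
            emitted.add(e.getKey() + "@" + flushNo + "=" + e.getValue());
        }
        docsPerFlush.add(docsSinceFlush);
        buffer.clear();
        bufferedPostings = 0;
        docsSinceFlush = 0;
    }

    /** Called at the end of the task so that the last partial buffer is not lost. */
    void close() {
        flush();
    }
}
```

The per-flush document counts are what a later stage would need in order to turn flush-local ids into ids that are valid across the whole collection.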

Reduce phase processing

All posting lists for each term are read in, one term at a time. Using the run files, the posting lists are written to the final inverted file, with all document ids corrected. Lastly, when all terms have been processed, the document indexes are merged into the final document index, and the lexicon hash and lexid are created.
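
As a rough illustration of how document ids might be corrected from the run-file counts, the sketch below computes a starting offset for every flush of every map and adds it to flush-local ids. The class `DocIdCorrectionSketch` and its methods are hypothetical, and the sketch assumes that maps are ordered by map number and that final ids are assigned consecutively in that order; the source does not state the exact scheme.

```java
/** Hypothetical sketch: turn flush-local document ids into collection-wide ids. */
class DocIdCorrectionSketch {

    /**
     * docsPerFlush[m][f] is the number of documents map m indexed in flush f.
     * Returns offsets[m][f], the collection-wide id of the first document of that flush,
     * assuming ids are assigned consecutively in (map, flush) order.
     */
    static int[][] flushOffsets(int[][] docsPerFlush) {
        int[][] offsets = new int[docsPerFlush.length][];
        int next = 0;
        for (int m = 0; m < docsPerFlush.length; m++) {
            offsets[m] = new int[docsPerFlush[m].length];
            for (int f = 0; f < docsPerFlush[m].length; f++) {
                offsets[m][f] = next;
                next += docsPerFlush[m][f];
            }
        }
        return offsets;
    }

    /** Shifts every flush-local id in a posting list by the offset of its flush. */
    static int[] correct(int[] localIds, int offset) {
        int[] corrected = new int[localIds.length];
        for (int i = 0; i < localIds.length; i++) {
            corrected[i] = localIds[i] + offset;
        }
        return corrected;
    }
}
```

Because the offsets increase in (map, flush) order, corrected posting lists for one term can be appended in that same order and remain sorted by document id.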

Partitioned Reduce processing

Normally, the map reduce indexer is used with a single reducer. However, if the partitioner is used, multiple reduces can run concurrently, building several final indices. In doing so, a large collection can be indexed into several output indices, which may be useful for distributed retrieval.
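
One way such a partitioning could work is to send all postings from a contiguous group of map tasks to the same reducer, so that each reducer builds an index over its own slice of the collection. The sketch below shows only that arithmetic. It is an assumption made for illustration, not a description of Terrier's partitioner, and `SplitPartitionSketch` and `partitionFor` are hypothetical names.

```java
/** Hypothetical sketch: assign contiguous groups of map tasks to reducers. */
class SplitPartitionSketch {

    /**
     * Returns the reducer (0-based) that should receive postings emitted by map mapNo,
     * assuming numMaps map tasks are divided into numReducers contiguous groups.
     */
    static int partitionFor(int mapNo, int numMaps, int numReducers) {
        if (numMaps <= 0 || numReducers <= 0 || mapNo < 0 || mapNo >= numMaps) {
            throw new IllegalArgumentException("invalid map or reducer count");
        }
        // Ceiling division: the size of each group of maps.
        int mapsPerReducer = (numMaps + numReducers - 1) / numReducers;
        return Math.min(mapNo / mapsPerReducer, numReducers - 1);
    }
}
```

With five maps and two reducers, for example, maps 0 to 2 go to reducer 0 and maps 3 and 4 go to reducer 1.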

Since:: 2.2
Version:: $Revision: 1.4 $
Author:: Richard McCreadie and Craig Macdonald

Constructor Summary
`Hadoop_BasicSinglePassIndexer()` Empty constructor.

Method Summary
`void`	`close()` Called when the Map or Reduce task ends, to finish up the indexer.
`void`	`configure(org.apache.hadoop.mapred.JobConf jc)` Configure this indexer.
`void`	`map(org.apache.hadoop.io.Text key, Wrapper<Document> value, org.apache.hadoop.mapred.OutputCollector<MapEmittedTerm,MapEmittedPostingList> _outputPostingListCollector, org.apache.hadoop.mapred.Reporter reporter)` Map processes a single document.
`void`	`reduce(MapEmittedTerm Term, java.util.Iterator<MapEmittedPostingList> postingIterator, org.apache.hadoop.mapred.OutputCollector<java.lang.Object,java.lang.Object> output, org.apache.hadoop.mapred.Reporter reporter)` Main reduce algorithm step.
`void`	`startReduce(java.util.LinkedList<MapData> mapData)` Merges the postings for the current term and converts the document IDs in the postings to be relative to one another, using the run number, the number of documents covered in each run, the flush number for that run and the number of documents flushed.

Methods inherited from class uk.ac.gla.terrier.indexing.BasicSinglePassIndexer
`createDirectIndex, createInvertedIndex, createInvertedIndex, performMultiWayMerge`

Methods inherited from class uk.ac.gla.terrier.indexing.Indexer
`index, isUTFIndexing, main, merge, merge, useFieldInformation`

Methods inherited from class java.lang.Object
`equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

Hadoop_BasicSinglePassIndexer

public Hadoop_BasicSinglePassIndexer()

Empty constructor.

Method Detail

configure

public void configure(org.apache.hadoop.mapred.JobConf jc)

Configure this indexer. Firstly, loads ApplicationSetup appropriately. Actual configuration of indexer is then handled by configureMap() or configureReduce() depending on whether a Map or Reduce task is being configured.

Specified by:: configure in interface org.apache.hadoop.mapred.JobConfigurable

Parameters:: jc - The configuration for the job
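
The dispatch between map-side and reduce-side set-up and clean-up can be pictured with the following self-contained sketch. It uses `java.util.Properties` as a stand-in for `JobConf`, and it assumes the task type is read from the `mapred.task.is.map` property; how this class actually detects the task type is not stated here. `DispatchSketch` and its `log` field are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical sketch: one class serving as both Mapper and Reducer. */
class DispatchSketch {
    private boolean isMapTask;
    /** Records which hooks ran, purely so the sketch can be inspected. */
    final List<String> log = new ArrayList<>();

    /** Stand-in for configure(JobConf): choose the map-side or reduce-side set-up. */
    void configure(Properties conf) {
        // Assumption: the framework marks map tasks with this property.
        isMapTask = Boolean.parseBoolean(conf.getProperty("mapred.task.is.map", "false"));
        if (isMapTask) {
            configureMap();
        } else {
            configureReduce();
        }
    }

    /** Stand-in for close(): choose the matching clean-up. */
    void close() {
        if (isMapTask) {
            closeMap();
        } else {
            closeReduce();
        }
    }

    private void configureMap() { log.add("configureMap"); }
    private void configureReduce() { log.add("configureReduce"); }
    private void closeMap() { log.add("closeMap"); }
    private void closeReduce() { log.add("closeReduce"); }
}
```

Remembering the task type at configuration time lets the clean-up step pick the matching branch without consulting the configuration again.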

close

public void close()
           throws java.io.IOException

Called when the Map or Reduce task ends, to finish up the indexer. Actual cleanup is handled by closeMap() or closeReduce() depending on whether this is a Map or Reduce task.

Specified by:: close in interface java.io.Closeable

Throws:: java.io.IOException

map

public void map(org.apache.hadoop.io.Text key,
                Wrapper<Document> value,
                org.apache.hadoop.mapred.OutputCollector<MapEmittedTerm,MapEmittedPostingList> _outputPostingListCollector,
                org.apache.hadoop.mapred.Reporter reporter)
         throws java.io.IOException

Map processes a single document. Stores the terms in the document along with their posting lists until memory is full or all documents in this map have been processed, then writes them to the output collector.

Specified by:: map in interface org.apache.hadoop.mapred.Mapper<org.apache.hadoop.io.Text,Wrapper<Document>,MapEmittedTerm,MapEmittedPostingList>

Parameters:: key - Wrapper for Document Number; value - Wrapper for Document Object; _outputPostingListCollector - Collector for emitting terms and posting lists
Throws:: java.io.IOException

startReduce

public void startReduce(java.util.LinkedList<MapData> mapData)

Merges the postings for the current term and converts the document IDs in the postings to be relative to one another, using the run number, the number of documents covered in each run, the flush number for that run and the number of documents flushed.

Parameters:: mapData - info about the runs (maps) and the flushes

reduce

public void reduce(MapEmittedTerm Term,
                   java.util.Iterator<MapEmittedPostingList> postingIterator,
                   org.apache.hadoop.mapred.OutputCollector<java.lang.Object,java.lang.Object> output,
                   org.apache.hadoop.mapred.Reporter reporter)
            throws java.io.IOException

Main reduce algorithm step. Called for every term in the merged index, together with accessors to the posting list information that has been written. This reduce has no output.

Specified by:: reduce in interface org.apache.hadoop.mapred.Reducer<MapEmittedTerm,MapEmittedPostingList,java.lang.Object,java.lang.Object>

Parameters:: Term - indexing term which we are reducing the posting lists into; postingIterator - Iterator over the temporary posting lists we have for this term; output - Unused output collector; reporter - Used to report progress
Throws:: java.io.IOException