org.terrier.indexing
Class WARC09Collection

java.lang.Object
  extended by org.terrier.indexing.WARC09Collection
All Implemented Interfaces:
java.io.Closeable, Collection

public class WARC09Collection
extends java.lang.Object
implements Collection

This class parses web crawls stored in WARC format, version 0.9. The Document class used for the parsed documents can be specified with the trec.document.class property. The format handled by this class was derived from the following pages:
http://www.yr-bcn.es/webspam/datasets/uk2006-pages/excerpt.txt
http://archive-access.sourceforge.net/warc/warc_file_format.html
http://crawler.archive.org/apidocs/index.html?org/archive/io/arc/ARCWriter.html
http://crawler.archive.org/apidocs/org/archive/io/GzippedInputStream.html

Properties
trec.document.class - the Document class to use for all documents parsed by this class

Author:
Craig Macdonald

Field Summary
protected  java.lang.String currentDocno
          the document number of the current document
protected  long currentDocumentBlobLength
          the length of the blob containing the document data
protected  java.util.Map<java.lang.String,java.lang.String> DocProperties
          properties for the current document
protected  java.lang.Class<? extends Document> documentClass
          Class to use for all documents parsed by this class
protected  int documentsInThisFile
          Counts the number of documents that have been found in this file.
protected  boolean eoc
          are we at the end of the collection?
protected  boolean eof
          has the end of the current input file been reached?
protected  int FileNumber
          The index in FilesToProcess of the file currently being processed.
protected  java.util.ArrayList<java.lang.String> FilesToProcess
          The list of files to process.
protected  java.io.InputStream is
          the input stream of the current input file
protected static org.apache.log4j.Logger logger
          logger for this class
protected  Tokeniser tokeniser
          Tokeniser to use for all documents parsed by this class
 
Constructor Summary
WARC09Collection()
          default constructor for this collection object.
WARC09Collection(java.io.InputStream input)
          A constructor that reads only the specified InputStream.
WARC09Collection(java.lang.String CollectionSpecFilename)
          construct a collection from the denoted collection.spec file
 
Method Summary
 void close()
          Closes the collection and any files that may be open.
 boolean endOfCollection()
          Returns true if the end of the collection has been reached
 java.lang.String getDocid()
          Get the String document identifier of the current document.
 Document getDocument()
          Get the document object representing the current document.
 boolean hasNext()
          Check whether it is the last document in the collection
protected  void loadDocumentClass()
          Loads the class that will supply all documents for this Collection.
 Document next()
          Return the next document
 boolean nextDocument()
          Move the collection to the start of the next document.
protected  boolean openNextFile()
          Opens the next file from the collection specification.
protected  void readCollectionSpec(java.lang.String CollectionSpecFilename)
          read in the collection.spec
protected  java.lang.String readLine()
          read a line from the currently open InputStream is
 void reset()
          Resets the Collection iterator to the start of the collection.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

logger

protected static final org.apache.log4j.Logger logger
logger for this class


documentsInThisFile

protected int documentsInThisFile
Counts the number of documents that have been found in this file.


eoc

protected boolean eoc
are we at the end of the collection?


eof

protected boolean eof
has the end of the current input file been reached?


currentDocno

protected java.lang.String currentDocno
the document number of the current document


is

protected java.io.InputStream is
the input stream of the current input file


currentDocumentBlobLength

protected long currentDocumentBlobLength
the length of the blob containing the document data


DocProperties

protected java.util.Map<java.lang.String,java.lang.String> DocProperties
properties for the current document


FilesToProcess

protected java.util.ArrayList<java.lang.String> FilesToProcess
The list of files to process.


FileNumber

protected int FileNumber
The index in FilesToProcess of the file currently being processed.


documentClass

protected java.lang.Class<? extends Document> documentClass
Class to use for all documents parsed by this class


tokeniser

protected Tokeniser tokeniser
Tokeniser to use for all documents parsed by this class

Constructor Detail

WARC09Collection

public WARC09Collection()
default constructor for this collection object. Reads files from the system default collection.spec file


WARC09Collection

public WARC09Collection(java.io.InputStream input)
A constructor that reads only the specified InputStream.


WARC09Collection

public WARC09Collection(java.lang.String CollectionSpecFilename)
construct a collection from the denoted collection.spec file

Method Detail

loadDocumentClass

protected void loadDocumentClass()
Loads the class that will supply all documents for this Collection. Set by property trec.document.class
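
Loading a class named by a property is typically done with reflection. The following is a minimal sketch of that idea, not Terrier's implementation: Doc, PlainDoc, TaggedDoc and DocumentClassLoaderSketch are hypothetical stand-ins, java.util.Properties stands in for Terrier's own property handling, and the fallback to a default class name is an assumption.

```java
import java.util.Properties;

// Hypothetical stand-in for Terrier's Document type.
interface Doc { }

// Two hypothetical document implementations to choose between.
class PlainDoc implements Doc { }
class TaggedDoc implements Doc { }

class DocumentClassLoaderSketch {
    // Resolves the class named by trec.document.class.
    // Falling back to defaultName is an assumption of this sketch.
    static Class<? extends Doc> load(Properties props, String defaultName) {
        String name = props.getProperty("trec.document.class", defaultName);
        try {
            // asSubclass throws ClassCastException if the class is not a Doc.
            return Class.forName(name).asSubclass(Doc.class);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("Unknown document class: " + name, e);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("trec.document.class", TaggedDoc.class.getName());
        System.out.println(load(props, PlainDoc.class.getName()).getName());
    }
}
```

Resolving the class once and storing it in a field such as documentClass avoids repeating the lookup for every document.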


hasNext

public boolean hasNext()
Check whether it is the last document in the collection

Returns:
boolean

next

public Document next()
Return the next document

Returns:
next document

close

public void close()
Closes the collection and any files that may be open.

Specified by:
close in interface java.io.Closeable

endOfCollection

public boolean endOfCollection()
Returns true if the end of the collection has been reached

Specified by:
endOfCollection in interface Collection
Returns:
boolean true if the end of collection has been reached, otherwise it returns false.

getDocid

public java.lang.String getDocid()
Get the String document identifier of the current document.


getDocument

public Document getDocument()
Get the document object representing the current document.

Specified by:
getDocument in interface Collection
Returns:
Document the current document.

nextDocument

public boolean nextDocument()
Move the collection to the start of the next document.

Specified by:
nextDocument in interface Collection
Returns:
boolean true if there exists another document in the collection, otherwise it returns false.
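
The contracts of nextDocument(), getDocument() and endOfCollection() suggest the usual consumption loop: advance, then fetch. The sketch below shows that loop against a hypothetical in-memory stand-in (DocSource and ListDocSource are not Terrier types, and documents are plain Strings for brevity), so it runs without Terrier on the classpath.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in mirroring the three method names documented here.
interface DocSource {
    boolean nextDocument();    // advance; false when nothing is left
    String getDocument();      // the current document
    boolean endOfCollection(); // true once an advance has failed
}

// In-memory implementation so the loop can be run without Terrier.
class ListDocSource implements DocSource {
    private final List<String> docs;
    private int position = -1; // positioned before the first document

    ListDocSource(List<String> docs) { this.docs = docs; }

    public boolean nextDocument() {
        if (position + 1 >= docs.size()) {
            position = docs.size(); // mark the end of the collection
            return false;
        }
        position++;
        return true;
    }

    public String getDocument() { return docs.get(position); }

    public boolean endOfCollection() { return position >= docs.size(); }
}

class IterationSketch {
    // Advance first, then read the current document, until no more remain.
    static int countDocuments(DocSource source) {
        int count = 0;
        while (source.nextDocument()) {
            String doc = source.getDocument();
            if (doc != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        DocSource source = new ListDocSource(Arrays.asList("a", "b", "c"));
        System.out.println(countDocuments(source));
        System.out.println(source.endOfCollection());
    }
}
```

With a real WARC09Collection, the same loop shape would apply, followed by a call to close() once the loop finishes.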

readLine

protected java.lang.String readLine()
                             throws java.io.IOException
read a line from the currently open InputStream is

Throws:
java.io.IOException
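
Reading a line directly from an InputStream, rather than through a buffered reader, leaves the stream positioned exactly after the line terminator, which matters when a header line is followed by a binary blob of known length. The following is a sketch of that technique under assumptions: LineReaderSketch is a hypothetical name, and the single-byte ISO-8859-1 decoding and the CR stripping are choices of this example, not documented behaviour of this class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class LineReaderSketch {
    // Reads bytes up to and including '\n' and returns the line without its
    // terminator. Returns null if the stream is already at end of file.
    static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        boolean sawAnyByte = false;
        int b;
        while ((b = in.read()) != -1) {
            sawAnyByte = true;
            if (b == '\n') {
                break;
            }
            buffer.write(b);
        }
        if (!sawAnyByte) {
            return null;
        }
        // Assumption: decode byte-for-byte; the real encoding may differ.
        String line = new String(buffer.toByteArray(), StandardCharsets.ISO_8859_1);
        if (line.endsWith("\r")) {
            line = line.substring(0, line.length() - 1); // tolerate CRLF
        }
        return line;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "first\r\nsecond\n".getBytes(StandardCharsets.ISO_8859_1);
        InputStream in = new ByteArrayInputStream(data);
        String line;
        while ((line = readLine(in)) != null) {
            System.out.println(line);
        }
    }
}
```

Reading one byte at a time is slow on an unbuffered file stream; wrapping the underlying stream in a BufferedInputStream once, and reading both lines and blobs through that same wrapper, keeps the position consistent.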

openNextFile

protected boolean openNextFile()
                        throws java.io.IOException
Opens the next file from the collection specification.

Returns:
boolean true if the file was opened successfully. If there are no more files to open, it returns false.
Throws:
java.io.IOException - if there is an exception while opening the collection files.
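
The documented contract (true when a file was opened, false when none remain) can be sketched as a cursor over a list of file names. This is an illustration only: FileCursorSketch and current() are hypothetical, and treating names ending in .gz as gzip-compressed is an assumption of the example rather than documented behaviour.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.zip.GZIPInputStream;

class FileCursorSketch {
    private final List<String> filesToProcess;
    private int fileNumber = -1; // index of the currently processed file
    private InputStream is;      // currently open stream, or null

    FileCursorSketch(List<String> filesToProcess) {
        this.filesToProcess = filesToProcess;
    }

    // Closes the current file, then opens the next one if any remain.
    boolean openNextFile() throws IOException {
        if (is != null) {
            is.close();
            is = null;
        }
        if (fileNumber + 1 >= filesToProcess.size()) {
            return false; // no more files to open
        }
        fileNumber++;
        String name = filesToProcess.get(fileNumber);
        InputStream raw = Files.newInputStream(Paths.get(name));
        try {
            // Assumption: a .gz suffix marks gzip-compressed input.
            is = name.endsWith(".gz") ? new GZIPInputStream(raw) : raw;
        } catch (IOException e) {
            raw.close(); // do not leak the file handle on a bad gzip header
            throw e;
        }
        return true;
    }

    InputStream current() {
        return is;
    }

    public static void main(String[] args) throws IOException {
        FileCursorSketch cursor = new FileCursorSketch(Collections.<String>emptyList());
        System.out.println(cursor.openNextFile());
    }
}
```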

readCollectionSpec

protected void readCollectionSpec(java.lang.String CollectionSpecFilename)
read in the collection.spec
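
As an illustration of what reading a collection.spec might involve, the sketch below parses a list of file names into a list like FilesToProcess. The file format assumed here (one path per line, blank lines and lines starting with # ignored) is an assumption of this example, and CollectionSpecSketch is a hypothetical name.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class CollectionSpecSketch {
    // Assumed format: one path per line; blank lines and '#' comments skipped.
    static List<String> parse(BufferedReader reader) {
        List<String> files = new ArrayList<>();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#")) {
                    continue;
                }
                files.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }

    public static void main(String[] args) {
        // Hypothetical paths, for illustration only.
        String spec = "# crawl files\n/data/part-00.warc.gz\n\n/data/part-01.warc.gz\n";
        System.out.println(parse(new BufferedReader(new StringReader(spec))));
    }
}
```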


reset

public void reset()
Resets the Collection iterator to the start of the collection.

Specified by:
reset in interface Collection


Terrier 3.5. Copyright © 2004-2011 University of Glasgow