org.terrier.indexing
Class WARC018Collection

java.lang.Object
  extended by org.terrier.indexing.WARC018Collection
All Implemented Interfaces:
java.io.Closeable, Collection

public class WARC018Collection
extends java.lang.Object
implements Collection

Parses web crawls stored in WARC format, version 0.18. The Document class used for the parsed documents can be specified with the trec.document.class property.

Properties

Author:
Craig Macdonald

Field Summary
protected  long currentDocumentBlobLength
          the length of the blob containing the document data
protected  java.lang.String desiredEncoding
          Encoding to be used to open all files.
protected  java.util.Map<java.lang.String,java.lang.String> DocProperties
          properties for the current document
protected  java.lang.Class<? extends Document> documentClass
          Class to use for all documents parsed by this class
protected  int documentsInThisFile
          Counts the number of documents that have been found in this file.
protected  boolean eoc
          are we at the end of the collection?
protected  boolean eof
          has the end of the current input file been reached?
protected  int FileNumber
          The index in the FilesToProcess of the currently processed file.
protected  java.util.ArrayList<java.lang.String> FilesToProcess
          The list of files to process.
protected  boolean forceUTF8
          should UTF8 encoding be assumed?
protected  java.io.InputStream is
          the input stream of the current input file
protected static org.apache.log4j.Logger logger
          logger for this class
protected  Tokeniser tokeniser
          Tokeniser to use for all documents parsed by this class
protected  java.lang.String warc_crawldate_header
          what header for the crawldate document metadata
protected  java.lang.String warc_docno_header
          what header for the docno document metadata
protected  java.lang.String warc_url_header
          what header for the url document metadata
 
Constructor Summary
WARC018Collection()
          default constructor for this collection object.
WARC018Collection(java.io.InputStream input)
          A constructor that reads only the specified InputStream.
WARC018Collection(java.lang.String CollectionSpecFilename)
          construct a collection from the denoted collection.spec file
 
Method Summary
 void close()
          Closes the collection and any files that may be open.
 boolean endOfCollection()
          Returns true if the end of the collection has been reached
 java.lang.String getDocid()
          Get the String document identifier of the current document.
 Document getDocument()
          Get the document object representing the current document.
 boolean hasNext()
          Checks whether more documents remain in the collection.
protected  void loadDocumentClass()
          Loads the class that will supply all documents for this Collection.
 Document next()
          Return the next document
 boolean nextDocument()
          Move the collection to the start of the next document.
protected  boolean openNextFile()
          Opens the next file from the collection specification.
protected  int parseHeaders(boolean requireContentLength)
           
protected  void readCollectionSpec(java.lang.String CollectionSpecFilename)
          read in the collection.spec
protected  java.lang.String readLine()
          read a line from the currently open InputStream is
 void reset()
          Resets the Collection iterator to the start of the collection.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

logger

protected static final org.apache.log4j.Logger logger
logger for this class


documentsInThisFile

protected int documentsInThisFile
Counts the number of documents that have been found in this file.


eoc

protected boolean eoc
are we at the end of the collection?


eof

protected boolean eof
has the end of the current input file been reached?


is

protected java.io.InputStream is
the input stream of the current input file


currentDocumentBlobLength

protected long currentDocumentBlobLength
the length of the blob containing the document data


DocProperties

protected java.util.Map<java.lang.String,java.lang.String> DocProperties
properties for the current document


FilesToProcess

protected java.util.ArrayList<java.lang.String> FilesToProcess
The list of files to process.


FileNumber

protected int FileNumber
The index in the FilesToProcess of the currently processed file.


forceUTF8

protected final boolean forceUTF8
should UTF8 encoding be assumed?


warc_docno_header

protected final java.lang.String warc_docno_header
what header for the docno document metadata


warc_url_header

protected final java.lang.String warc_url_header
what header for the url document metadata


warc_crawldate_header

protected final java.lang.String warc_crawldate_header
what header for the crawldate document metadata


desiredEncoding

protected java.lang.String desiredEncoding
Encoding to be used to open all files.


documentClass

protected java.lang.Class<? extends Document> documentClass
Class to use for all documents parsed by this class


tokeniser

protected Tokeniser tokeniser
Tokeniser to use for all documents parsed by this class

Constructor Detail

WARC018Collection

public WARC018Collection()
Default constructor for this collection object. Reads files from the system default collection.spec file.


WARC018Collection

public WARC018Collection(java.lang.String CollectionSpecFilename)
construct a collection from the denoted collection.spec file


WARC018Collection

public WARC018Collection(java.io.InputStream input)
A constructor that reads only the specified InputStream.

Method Detail

hasNext

public boolean hasNext()
Checks whether more documents remain in the collection.

Returns:
boolean

next

public Document next()
Return the next document

Returns:
next document

close

public void close()
Closes the collection and any files that may be open.

Specified by:
close in interface java.io.Closeable

endOfCollection

public boolean endOfCollection()
Returns true if the end of the collection has been reached

Specified by:
endOfCollection in interface Collection
Returns:
boolean true if the end of collection has been reached, otherwise it returns false.

getDocid

public java.lang.String getDocid()
Get the String document identifier of the current document.


loadDocumentClass

protected void loadDocumentClass()
Loads the class that will supply all documents for this Collection. Set by property trec.document.class
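
A minimal sketch of how a document class named by a property could be resolved through reflection. It is not taken from the Terrier source: Doc and PlainDoc are hypothetical stand-ins for the Document interface and one of its implementations, and java.util.Properties stands in for whatever mechanism Terrier uses to read trec.document.class.

```java
import java.util.Properties;

/** Hypothetical stand-in for the Document interface. */
interface Doc {}

/** Hypothetical document implementation used as a fallback. */
final class PlainDoc implements Doc {}

final class DocumentClassLoaderSketch {
    /**
     * Resolves the class named by the trec.document.class property.
     * Returns the fallback when the property is not set (an assumption of this sketch).
     */
    static Class<? extends Doc> load(Properties props, Class<? extends Doc> fallback) {
        String name = props.getProperty("trec.document.class");
        if (name == null) {
            return fallback;
        }
        try {
            // asSubclass throws ClassCastException if the class does not implement Doc
            return Class.forName(name).asSubclass(Doc.class);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("Unknown document class: " + name, e);
        }
    }
}
```

Resolving the class once and storing it in a field such as documentClass avoids repeating the lookup for every parsed document.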


getDocument

public Document getDocument()
Get the document object representing the current document.

Specified by:
getDocument in interface Collection
Returns:
Document the current document.

parseHeaders

protected int parseHeaders(boolean requireContentLength)
                    throws java.io.IOException
Throws:
java.io.IOException
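
This page does not describe what parseHeaders does or returns. As an illustration only, the sketch below shows one way a block of "Name: value" record header lines could be collected into a map like DocProperties, with Content-Length giving the length of the document blob. The class and method names are hypothetical, keys are lower-cased as a choice of this sketch, and the header values in the usage note are made up.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

final class HeaderBlockSketch {
    /** Consumes lines up to the first blank line and stores "Name: value" pairs. */
    static Map<String, String> readHeaderBlock(Iterator<String> lines) {
        Map<String, String> headers = new LinkedHashMap<>();
        while (lines.hasNext()) {
            String line = lines.next();
            if (line.isEmpty()) {
                break; // a blank line ends the header block
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // skip lines without a colon, e.g. the "WARC/0.18" version line
            }
            String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            headers.put(name, value);
        }
        return headers;
    }

    /** Returns the declared Content-Length, or -1 if the header is absent. */
    static long contentLength(Map<String, String> headers) {
        String value = headers.get("content-length");
        return value == null ? -1L : Long.parseLong(value);
    }
}
```

For example, feeding the lines "WARC/0.18", "WARC-Type: response", "Content-Length: 42" and an empty line to readHeaderBlock yields a two-entry map, and contentLength on that map gives 42.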

nextDocument

public boolean nextDocument()
Move the collection to the start of the next document.

Specified by:
nextDocument in interface Collection
Returns:
boolean true if there exists another document in the collection, otherwise it returns false.
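
The nextDocument() and getDocument() methods suggest a cursor-style loop: advance, then fetch the current document. The sketch below shows that pattern under stated assumptions. CursorCollection and InMemoryCollection are hypothetical stand-ins that make the example runnable without Terrier, and getDocument() returns a String here rather than a Document.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in exposing the cursor-style methods listed on this page. */
interface CursorCollection {
    boolean nextDocument(); // moves to the next document; false when none remain
    String getDocument();   // the current document (a String in this sketch)
    void close();
}

/** Hypothetical in-memory collection, present only to make the sketch runnable. */
final class InMemoryCollection implements CursorCollection {
    private final List<String> docs;
    private int position = -1; // -1 means "before the first document"

    InMemoryCollection(List<String> docs) {
        this.docs = docs;
    }

    public boolean nextDocument() {
        if (position + 1 >= docs.size()) {
            return false;
        }
        position++;
        return true;
    }

    public String getDocument() {
        return docs.get(position);
    }

    public void close() {
        // nothing to release for an in-memory list
    }
}

final class CursorLoopSketch {
    /** Visits every document in order and closes the collection afterwards. */
    static List<String> drain(CursorCollection collection) {
        List<String> seen = new ArrayList<>();
        try {
            while (collection.nextDocument()) {
                seen.add(collection.getDocument());
            }
        } finally {
            collection.close();
        }
        return seen;
    }
}
```

Closing in a finally block releases any open input file even if processing a document throws.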

readLine

protected java.lang.String readLine()
                             throws java.io.IOException
read a line from the currently open InputStream is

Throws:
java.io.IOException
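
A WARC file interleaves textual headers with raw document bytes, so a line reader that works directly on the InputStream, without buffering ahead, keeps the stream positioned at the start of the blob. The following is a sketch of that idea and not the class's actual implementation: RawLineReader and linesOf are hypothetical names, and stripping a trailing carriage return is an assumption of this sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

final class RawLineReader {
    /** Reads bytes up to the next '\n' and decodes them; returns null at end of stream. */
    static String readLine(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        boolean readAny = false;
        int b;
        while ((b = in.read()) != -1) {
            readAny = true;
            if (b == '\n') {
                break; // the newline itself is consumed but not stored
            }
            buffer.write(b);
        }
        if (!readAny) {
            return null; // nothing left in the stream
        }
        String line = new String(buffer.toByteArray(), charset);
        // Drop a trailing carriage return so "\r\n" and "\n" endings behave alike
        return line.endsWith("\r") ? line.substring(0, line.length() - 1) : line;
    }

    /** Convenience helper for in-memory data: splits a byte array into lines. */
    static List<String> linesOf(byte[] data, Charset charset) {
        List<String> lines = new ArrayList<>();
        try (InputStream in = new ByteArrayInputStream(data)) {
            String line;
            while ((line = readLine(in, charset)) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

Reading one byte at a time is slow on an unbuffered file stream; wrapping the file in a BufferedInputStream before passing it in keeps the same positioning guarantee while reducing system calls.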

openNextFile

protected boolean openNextFile()
                        throws java.io.IOException
Opens the next file from the collection specification.

Returns:
boolean true if the file was opened successfully. If there are no more files to open, it returns false.
Throws:
java.io.IOException - if there is an exception while opening the collection files.

readCollectionSpec

protected void readCollectionSpec(java.lang.String CollectionSpecFilename)
read in the collection.spec
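
A sketch of reading a collection.spec into a list such as FilesToProcess, assuming the file lists one path per line. Skipping blank lines and lines that start with '#' is an assumption of this sketch, as are the names CollectionSpecSketch, parse, and parseText; the paths in the usage note are made up.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;

final class CollectionSpecSketch {
    /** Collects one file path per line, skipping blank lines and '#' comments. */
    static ArrayList<String> parse(BufferedReader spec) throws IOException {
        ArrayList<String> files = new ArrayList<>();
        String line;
        while ((line = spec.readLine()) != null) {
            String path = line.trim();
            if (path.isEmpty() || path.startsWith("#")) {
                continue;
            }
            files.add(path);
        }
        return files;
    }

    /** Convenience helper that parses spec content already held in memory. */
    static ArrayList<String> parseText(String text) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            return parse(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For instance, parseText on a comment line followed by /data/a.warc.gz and /data/b.warc.gz produces a two-element list in file order, which an index such as FileNumber can then step through.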


reset

public void reset()
Resets the Collection iterator to the start of the collection.

Specified by:
reset in interface Collection


Terrier 3.5. Copyright © 2004-2011 University of Glasgow