Package org.terrier.indexing
Class WARC09Collection
- java.lang.Object
  - org.terrier.indexing.MultiDocumentFileCollection
    - org.terrier.indexing.WARC09Collection
- All Implemented Interfaces:
java.io.Closeable, java.lang.AutoCloseable, Collection
public class WARC09Collection extends MultiDocumentFileCollection
This class parses web crawls stored in the WARC format, version 0.9. The precise Document class to be used can be specified with the trec.document.class property. The format handled by this class was derived from the following pages:
- http://www.yr-bcn.es/webspam/datasets/uk2006-pages/excerpt.txt
- http://archive-access.sourceforge.net/warc/warc_file_format.html
- http://crawler.archive.org/apidocs/index.html?org/archive/io/arc/ARCWriter.html
- http://crawler.archive.org/apidocs/org/archive/io/GzippedInputStream.html
Properties
- trec.document.class: the Document class used to parse individual documents (defaults to TaggedDocument).
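The default behaviour of trec.document.class can be illustrated with a small, self-contained sketch. It uses java.util.Properties as a stand-in for Terrier's own property handling, so the class DocumentClassPropertySketch, its method documentClassName, and the value MyDocument are all hypothetical; only the property name and its default come from the description above.

```java
import java.util.Properties;

public class DocumentClassPropertySketch {

    // Property name and default value as stated in the class description.
    static final String PROPERTY = "trec.document.class";
    static final String DEFAULT_CLASS = "TaggedDocument";

    /**
     * Returns the configured Document class name, or the documented
     * default when the property is not set. Assumption: a plain
     * java.util.Properties object stands in for Terrier's configuration.
     */
    static String documentClassName(Properties props) {
        return props.getProperty(PROPERTY, DEFAULT_CLASS);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Property unset: the default applies.
        System.out.println(documentClassName(props));

        // Property set: the configured name wins ("MyDocument" is hypothetical).
        props.setProperty(PROPERTY, "MyDocument");
        System.out.println(documentClassName(props));
    }
}
```

How the resulting name is resolved to a class and instantiated is not covered here; that is the job of the inherited loadDocumentClass method.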
- Author:
- Craig Macdonald
-
-
Field Summary
Fields (modifier and type, field, description)
- protected java.lang.String currentDocno: properties for the current document
- protected long currentDocumentBlobLength
Fields inherited from class org.terrier.indexing.MultiDocumentFileCollection
currentFilename, desiredEncoding, DocProperties, documentClass, documentsInThisFile, eoc, eof, FileNumber, FilesToProcess, forceUTF8, is, logger, SkipFile, tokeniser
-
-
Constructor Summary
Constructors
- WARC09Collection()
- WARC09Collection(java.io.InputStream input)
- WARC09Collection(java.lang.String CollectionSpecFilename)
- WARC09Collection(java.lang.String CollectionSpecFilename, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored)
- WARC09Collection(java.util.List<java.lang.String> files)
- WARC09Collection(java.util.List<java.lang.String> files, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored)
-
Method Summary
Methods (modifier and type, method, description)
- Document getDocument(): Get the document object representing the current document.
- boolean nextDocument(): Move the collection to the start of the next document.
- protected java.lang.String readLine(): Read a line from the currently open InputStream (the inherited field is).
- void reset(): Resets the Collection iterator to the start of the collection.
Methods inherited from class org.terrier.indexing.MultiDocumentFileCollection
checkEncoding, close, endOfCollection, extractCharset, hasNext, loadDocumentClass, next, openNewFile, openNextFile
-
-
-
-
Constructor Detail
-
WARC09Collection
public WARC09Collection()
-
WARC09Collection
public WARC09Collection(java.io.InputStream input)
-
WARC09Collection
public WARC09Collection(java.lang.String CollectionSpecFilename)
-
WARC09Collection
public WARC09Collection(java.util.List<java.lang.String> files, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored)
-
WARC09Collection
public WARC09Collection(java.util.List<java.lang.String> files)
-
WARC09Collection
public WARC09Collection(java.lang.String CollectionSpecFilename, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored)
-
-
Method Detail
-
getDocument
public Document getDocument()
Get the document object representing the current document.
- Specified by: getDocument in interface Collection
- Specified by: getDocument in class MultiDocumentFileCollection
- Returns: the current document.
-
nextDocument
public boolean nextDocument()
Move the collection to the start of the next document.
- Specified by: nextDocument in interface Collection
- Specified by: nextDocument in class MultiDocumentFileCollection
- Returns: true if there is another document in the collection, false otherwise.
-
readLine
protected java.lang.String readLine() throws java.io.IOException
Read a line from the currently open InputStream (the inherited field is).
- Throws: java.io.IOException
-
reset
public void reset()
Resets the Collection iterator to the start of the collection.
- Specified by: reset in interface Collection
- Overrides: reset in class MultiDocumentFileCollection
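The contract of nextDocument(), getDocument() and reset() suggests a simple iteration loop. The sketch below shows that loop against a hypothetical stand-in interface (DocSource) backed by an in-memory list, so that it runs without Terrier on the classpath. The names DocSource, ListSource and drain are illustrative and not part of this API, and documents are represented as plain strings rather than Document objects.

```java
import java.util.ArrayList;
import java.util.List;

public class CollectionLoopSketch {

    /** Hypothetical stand-in mirroring the three methods documented above. */
    interface DocSource {
        boolean nextDocument();
        String getDocument();
        void reset();
    }

    /** Toy in-memory implementation, present only to make the sketch runnable. */
    static class ListSource implements DocSource {
        private final List<String> docs;
        private int position = -1; // positioned before the first document

        ListSource(List<String> docs) {
            this.docs = docs;
        }

        public boolean nextDocument() {
            if (position + 1 >= docs.size()) {
                return false; // no further documents
            }
            position++;
            return true;
        }

        public String getDocument() {
            return docs.get(position);
        }

        public void reset() {
            position = -1;
        }
    }

    /** Advance first, then fetch: getDocument() is only called after nextDocument() returned true. */
    static List<String> drain(DocSource source) {
        List<String> seen = new ArrayList<>();
        while (source.nextDocument()) {
            seen.add(source.getDocument());
        }
        return seen;
    }

    public static void main(String[] args) {
        DocSource source = new ListSource(List.of("doc-1", "doc-2", "doc-3"));
        System.out.println(drain(source)); // first pass over the documents

        source.reset();                    // rewind to the start
        System.out.println(drain(source)); // second pass sees the same documents
    }
}
```

With a real WARC09Collection the loop body would receive Document objects, and the collection should be closed when finished, since the class implements java.io.Closeable.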
-
-