Package org.terrier.indexing
Class WARC09Collection
- java.lang.Object
  - org.terrier.indexing.MultiDocumentFileCollection
    - org.terrier.indexing.WARC09Collection
- All Implemented Interfaces:
java.io.Closeable, java.lang.AutoCloseable, Collection
public class WARC09Collection extends MultiDocumentFileCollection
This class parses web crawls stored in WARC format, version 0.9. The precise Document class used to parse individual documents can be specified with the trec.document.class property. The following pages were used as references for the format handled by this class:
- http://www.yr-bcn.es/webspam/datasets/uk2006-pages/excerpt.txt
- http://archive-access.sourceforge.net/warc/warc_file_format.html
- http://crawler.archive.org/apidocs/index.html?org/archive/io/arc/ARCWriter.html
- http://crawler.archive.org/apidocs/org/archive/io/GzippedInputStream.html
Properties
- trec.document.class: the Document class used to parse individual documents (defaults to TaggedDocument).
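The "defaults to" behaviour of trec.document.class can be illustrated with a small sketch. It uses plain java.util.Properties as a stand-in and is not Terrier's own configuration code. The class name DocumentClassPropertySketch, the method resolveDocumentClass, and the value com.example.MyDocument are hypothetical names introduced only for illustration.

```java
import java.util.Properties;

/**
 * Sketch of a property lookup with a default value.
 * This is an illustration only; it is NOT how Terrier loads its configuration.
 */
class DocumentClassPropertySketch {
    // Property name and default value as stated in the class description.
    static final String KEY = "trec.document.class";
    static final String DEFAULT_DOCUMENT_CLASS = "TaggedDocument";

    /** Returns the configured Document class name, or the default if unset. */
    static String resolveDocumentClass(Properties props) {
        return props.getProperty(KEY, DEFAULT_DOCUMENT_CLASS);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Nothing configured yet, so the default is returned.
        System.out.println(resolveDocumentClass(props));

        // com.example.MyDocument is a hypothetical Document implementation.
        props.setProperty(KEY, "com.example.MyDocument");
        System.out.println(resolveDocumentClass(props));
    }
}
```

In a real deployment the property would be set through Terrier's own configuration rather than a local Properties object.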
- Author:
- Craig Macdonald
-
-
Field Summary
Fields (modifier and type, field, description):
- protected java.lang.String currentDocno: properties for the current document
- protected long currentDocumentBlobLength
-
Fields inherited from class org.terrier.indexing.MultiDocumentFileCollection
currentFilename, desiredEncoding, DocProperties, documentClass, documentsInThisFile, eoc, eof, FileNumber, FilesToProcess, forceUTF8, is, logger, SkipFile, tokeniser
-
-
Constructor Summary
Constructors:
- WARC09Collection()
- WARC09Collection(java.io.InputStream input)
- WARC09Collection(java.lang.String CollectionSpecFilename)
- WARC09Collection(java.lang.String CollectionSpecFilename, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored)
- WARC09Collection(java.util.List<java.lang.String> files)
- WARC09Collection(java.util.List<java.lang.String> files, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored)
-
Method Summary
Methods (all are concrete instance methods; modifier and type, method, description):
- Document getDocument(): Get the document object representing the current document.
- boolean nextDocument(): Move the collection to the start of the next document.
- protected java.lang.String readLine(): Read a line from the currently open InputStream (the inherited field is).
- void reset(): Resets the Collection iterator to the start of the collection.
Methods inherited from class org.terrier.indexing.MultiDocumentFileCollection
checkEncoding, close, endOfCollection, extractCharset, hasNext, loadDocumentClass, next, openNewFile, openNextFile
-
-
-
-
Constructor Detail
-
WARC09Collection
public WARC09Collection()
-
WARC09Collection
public WARC09Collection(java.io.InputStream input)
-
WARC09Collection
public WARC09Collection(java.lang.String CollectionSpecFilename)
-
WARC09Collection
public WARC09Collection(java.util.List<java.lang.String> files, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored)
-
WARC09Collection
public WARC09Collection(java.util.List<java.lang.String> files)
-
WARC09Collection
public WARC09Collection(java.lang.String CollectionSpecFilename, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored)
-
-
Method Detail
-
getDocument
public Document getDocument()
Get the document object representing the current document.
- Specified by:
getDocument in interface Collection
- Specified by:
getDocument in class MultiDocumentFileCollection
- Returns:
- the current Document.
-
nextDocument
public boolean nextDocument()
Move the collection to the start of the next document.
- Specified by:
nextDocument in interface Collection
- Specified by:
nextDocument in class MultiDocumentFileCollection
- Returns:
- true if there is another document in the collection, false otherwise.
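The nextDocument(), getDocument() and reset() methods together form a simple cursor-style iteration protocol. The sketch below shows that calling pattern without requiring Terrier or any WARC files. SimpleCollection and ListCollection are hypothetical stand-ins that only mirror the three method signatures documented on this page; they are not Terrier classes, and the real WARC09Collection returns Document objects rather than strings.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of the nextDocument()/getDocument()/reset() calling pattern.
 * SimpleCollection and ListCollection are hypothetical stand-ins, not Terrier classes.
 */
class CollectionIterationSketch {

    /** Hypothetical interface mirroring the three public methods documented here. */
    interface SimpleCollection<D> {
        boolean nextDocument(); // advance; true if a document is now available
        D getDocument();        // the document at the current position
        void reset();           // return to the start of the collection
    }

    /** Toy in-memory collection so the loop can run without WARC files. */
    static final class ListCollection implements SimpleCollection<String> {
        private final List<String> docs;
        private int position = -1; // -1 means "before the first document"

        ListCollection(List<String> docs) {
            this.docs = new ArrayList<>(docs);
        }

        @Override
        public boolean nextDocument() {
            if (position + 1 >= docs.size()) {
                return false; // no further documents
            }
            position++;
            return true;
        }

        @Override
        public String getDocument() {
            return position >= 0 ? docs.get(position) : null;
        }

        @Override
        public void reset() {
            position = -1;
        }
    }

    /** Advances with nextDocument() and fetches each document with getDocument(). */
    static <D> List<D> readAll(SimpleCollection<D> collection) {
        List<D> seen = new ArrayList<>();
        while (collection.nextDocument()) {
            seen.add(collection.getDocument());
        }
        return seen;
    }

    public static void main(String[] args) {
        ListCollection collection = new ListCollection(List.of("doc-1", "doc-2"));
        System.out.println(readAll(collection));
        collection.reset(); // start again from the first document
        System.out.println(readAll(collection).size());
    }
}
```

With Terrier on the classpath, the same while loop would presumably be written against a WARC09Collection instance, with getDocument() returning a Document.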
-
readLine
protected java.lang.String readLine() throws java.io.IOException
Read a line from the currently open InputStream (the inherited field is).
- Throws:
java.io.IOException
-
reset
public void reset()
Resets the Collection iterator to the start of the collection.
- Specified by:
reset in interface Collection
- Overrides:
reset in class MultiDocumentFileCollection
-
-