Package org.terrier.indexing
Class WARC10Collection
- java.lang.Object
  - org.terrier.indexing.MultiDocumentFileCollection
    - org.terrier.indexing.WARC018Collection
      - org.terrier.indexing.WARC10Collection
- All Implemented Interfaces:
java.io.Closeable, java.lang.AutoCloseable, Collection
public class WARC10Collection extends WARC018Collection
Parses web crawls in WARC format, version 0.10. Uses properties from WARC018Collection.
- Author:
- Craig Macdonald
-
-
Field Summary
-
Fields inherited from class org.terrier.indexing.WARC018Collection
currentDocumentBlobLength, readLineByteCount, warc_crawldate_header, warc_docno_header, warc_url_header
-
Fields inherited from class org.terrier.indexing.MultiDocumentFileCollection
currentFilename, desiredEncoding, DocProperties, documentClass, documentsInThisFile, eoc, eof, FileNumber, FilesToProcess, forceUTF8, is, logger, SkipFile, tokeniser
-
-
Constructor Summary
WARC10Collection()
WARC10Collection(java.io.InputStream input)
WARC10Collection(java.lang.String CollectionSpecFilename)
WARC10Collection(java.lang.String CollectionSpecFilename, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored)
WARC10Collection(java.util.List<java.lang.String> files)
WARC10Collection(java.util.List<java.lang.String> files, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored)
-
Method Summary
boolean nextDocument()
Move the collection to the start of the next document.
protected void processRedirect(java.lang.String source, java.lang.String target)
-
Methods inherited from class org.terrier.indexing.WARC018Collection
getDocid, getDocument, parseDate, parseHeaders, readLine
-
Methods inherited from class org.terrier.indexing.MultiDocumentFileCollection
checkEncoding, close, endOfCollection, extractCharset, hasNext, loadDocumentClass, next, openNewFile, openNextFile, reset
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.terrier.indexing.Collection
endOfCollection, reset
-
-
-
-
Constructor Detail
-
WARC10Collection
public WARC10Collection()
-
WARC10Collection
public WARC10Collection(java.io.InputStream input)
-
WARC10Collection
public WARC10Collection(java.lang.String CollectionSpecFilename)
-
WARC10Collection
public WARC10Collection(java.util.List<java.lang.String> files, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored)
-
WARC10Collection
public WARC10Collection(java.util.List<java.lang.String> files)
-
WARC10Collection
public WARC10Collection(java.lang.String CollectionSpecFilename, java.lang.String TagSet, java.lang.String BlacklistSpecFilename, java.lang.String ignored)
-
-
Method Detail
-
processRedirect
protected void processRedirect(java.lang.String source, java.lang.String target)
-
nextDocument
public boolean nextDocument()
Move the collection to the start of the next document.
- Specified by:
nextDocument in interface Collection
- Overrides:
nextDocument in class WARC018Collection
- Returns:
- true if another document exists in the collection, false otherwise.
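The return contract of nextDocument() suggests a simple consumption loop: advance, fetch the current document, and stop at the first false. The sketch below illustrates that pattern only. DocumentSource and ListSource are hypothetical stand-ins written so the example runs without Terrier on the classpath; they are not part of the library, and getDocument() returning a String is a simplification for illustration.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class CollectionLoopSketch {

    /** Hypothetical stand-in for the two collection methods used in the loop. */
    interface DocumentSource {
        boolean nextDocument();   // true if another document exists
        String getDocument();     // simplified: a String stands in for the document
    }

    /** Toy in-memory source (assumption), only so the sketch is runnable. */
    static final class ListSource implements DocumentSource {
        private final Iterator<String> it;
        private String current;

        ListSource(List<String> docs) {
            this.it = docs.iterator();
        }

        @Override
        public boolean nextDocument() {
            if (!it.hasNext()) {
                current = null;
                return false;
            }
            current = it.next();
            return true;
        }

        @Override
        public String getDocument() {
            return current;
        }
    }

    /** Advances until nextDocument() returns false and counts non-null documents. */
    static int countDocuments(DocumentSource source) {
        int count = 0;
        while (source.nextDocument()) {
            String doc = source.getDocument();
            if (doc != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        DocumentSource source = new ListSource(Arrays.asList("a", "b", "c"));
        System.out.println(countDocuments(source));
    }
}
```

With a real WARC10Collection the loop would have the same shape, and because the class implements java.io.Closeable, wrapping it in try-with-resources is a natural way to release the underlying files.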
-
-