java.lang.Object
  org.terrier.indexing.WARC018Collection
public class WARC018Collection
This class parses web crawls stored in WARC format, version 0.18.
The precise Document class to be used can be specified with the trec.document.class property.
Properties
trec.document.class: the Document class used to parse individual documents (defaults to TaggedDocument).
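As an illustration of how a property such as trec.document.class can select the document class at run time, here is a minimal sketch. It is not Terrier's implementation: Doc, TaggedDoc, PlainDoc and DocumentClassLoaderSketch are hypothetical stand-ins, java.util.Properties replaces Terrier's own configuration mechanism, and the no-argument constructor is an assumption made to keep the example short.

```java
import java.util.Properties;

// Hypothetical stand-in for the Document type; not Terrier's interface.
interface Doc {
    String describe();
}

// Hypothetical stand-in for the default document class.
class TaggedDoc implements Doc {
    public String describe() { return "tagged"; }
}

// Hypothetical alternative document class selected through the property.
class PlainDoc implements Doc {
    public String describe() { return "plain"; }
}

class DocumentClassLoaderSketch {
    // Property name taken from the class description above.
    static final String PROPERTY = "trec.document.class";

    /** Resolves the configured class, falling back to the default when the property is unset. */
    static Class<? extends Doc> loadDocumentClass(Properties props) throws ClassNotFoundException {
        String name = props.getProperty(PROPERTY, TaggedDoc.class.getName());
        // asSubclass throws ClassCastException if the configured class is not a Doc.
        return Class.forName(name).asSubclass(Doc.class);
    }

    /** Creates one instance through the (assumed) no-argument constructor. */
    static Doc newDocument(Class<? extends Doc> cls) throws ReflectiveOperationException {
        return cls.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Property unset: the default class is used.
        System.out.println(newDocument(loadDocumentClass(props)).describe());
        // Property set: the named class is used instead.
        props.setProperty(PROPERTY, PlainDoc.class.getName());
        System.out.println(newDocument(loadDocumentClass(props)).describe());
    }
}
```

Resolving the class once and reusing it for every document avoids repeating the name lookup for each record, which is presumably why the class keeps a documentClass field.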
Field Summary | |
---|---|
protected long | currentDocumentBlobLength: The length of the blob containing the document data. |
protected java.lang.String | desiredEncoding: Encoding to be used to open all files. |
protected java.util.Map<java.lang.String,java.lang.String> | DocProperties: Properties for the current document. |
protected java.lang.Class<? extends Document> | documentClass: Class to use for all documents parsed by this class. |
protected int | documentsInThisFile: Counts the number of documents that have been found in this file. |
protected boolean | eoc: Whether the end of the collection has been reached. |
protected boolean | eof: Whether the end of the current input file has been reached. |
protected int | FileNumber: The index in FilesToProcess of the file currently being processed. |
protected java.util.ArrayList<java.lang.String> | FilesToProcess: The list of files to process. |
protected boolean | forceUTF8: Whether UTF-8 encoding should be assumed. |
protected java.io.InputStream | is: The input stream of the current input file. |
protected static org.apache.log4j.Logger | logger: Logger for this class. |
protected Tokeniser | tokeniser: Tokeniser to use for all documents parsed by this class. |
protected java.lang.String | warc_crawldate_header: Name of the header that supplies the crawldate document metadata. |
protected java.lang.String | warc_docno_header: Name of the header that supplies the docno document metadata. |
protected java.lang.String | warc_url_header: Name of the header that supplies the url document metadata. |
Constructor Summary | |
---|---|
WARC018Collection() | Default constructor for this collection object. |
WARC018Collection(java.io.InputStream input) | A constructor that reads only the specified InputStream. |
WARC018Collection(java.lang.String CollectionSpecFilename) | Constructs a collection from the named collection.spec file. |
Method Summary | |
---|---|
void | close(): Closes the collection and any files that may be open. |
boolean | endOfCollection(): Returns true if the end of the collection has been reached. |
java.lang.String | getDocid(): Gets the String document identifier of the current document. |
Document | getDocument(): Gets the document object representing the current document. |
boolean | hasNext(): Checks whether there are more documents in the collection. |
protected void | loadDocumentClass(): Loads the class that will supply all documents for this Collection. |
Document | next(): Returns the next document. |
boolean | nextDocument(): Moves the collection to the start of the next document. |
protected boolean | openNextFile(): Opens the next file from the collection specification. |
protected int | parseHeaders(boolean requireContentLength) |
protected void | readCollectionSpec(java.lang.String CollectionSpecFilename): Reads in the collection.spec file. |
protected java.lang.String | readLine(): Reads a line from the currently open InputStream, is. |
void | reset(): Resets the Collection iterator to the start of the collection. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
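The method names in the summary suggest a cursor-style traversal: advance with nextDocument(), then read the current document. The sketch below shows that pattern against a hypothetical in-memory stand-in (CollectionSketch, InMemoryCollection and IterationSketch are illustrative names, not part of Terrier). It assumes nextDocument() returns false once no documents remain, which this page does not state explicitly.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in that mirrors a few method names from the summary above.
interface CollectionSketch extends Closeable {
    boolean nextDocument();
    String getDocid();
    boolean endOfCollection();
    void reset();
    @Override
    void close(); // narrowed: this sketch never throws on close
}

// In-memory implementation so the traversal can run without any crawl files.
class InMemoryCollection implements CollectionSketch {
    private final List<String> docids;
    private int position = -1; // -1 means "before the first document"

    InMemoryCollection(List<String> docids) {
        this.docids = new ArrayList<>(docids);
    }

    public boolean nextDocument() {
        if (position + 1 >= docids.size()) {
            position = docids.size(); // park the cursor past the last document
            return false;
        }
        position++;
        return true;
    }

    public String getDocid() { return docids.get(position); }

    public boolean endOfCollection() { return position >= docids.size(); }

    public void reset() { position = -1; }

    public void close() { /* nothing to release in this sketch */ }
}

class IterationSketch {
    /** Walks the collection from its current position and gathers every docid. */
    static List<String> collectDocids(CollectionSketch collection) {
        List<String> seen = new ArrayList<>();
        while (collection.nextDocument()) {
            seen.add(collection.getDocid());
        }
        return seen;
    }

    public static void main(String[] args) {
        // try-with-resources closes the collection even if the loop fails.
        try (CollectionSketch c = new InMemoryCollection(Arrays.asList("doc-1", "doc-2"))) {
            System.out.println(collectDocids(c));
            c.reset(); // start again from the beginning
            System.out.println(collectDocids(c));
        }
    }
}
```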
Field Detail |
---|
protected static final org.apache.log4j.Logger logger
protected int documentsInThisFile
protected boolean eoc
protected boolean eof
protected java.io.InputStream is
protected long currentDocumentBlobLength
protected java.util.Map<java.lang.String,java.lang.String> DocProperties
protected java.util.ArrayList<java.lang.String> FilesToProcess
protected int FileNumber
protected final boolean forceUTF8
protected final java.lang.String warc_docno_header
protected final java.lang.String warc_url_header
protected final java.lang.String warc_crawldate_header
protected java.lang.String desiredEncoding
protected java.lang.Class<? extends Document> documentClass
protected Tokeniser tokeniser
Constructor Detail |
---|
public WARC018Collection()
public WARC018Collection(java.lang.String CollectionSpecFilename)
public WARC018Collection(java.io.InputStream input)
Method Detail |
---|
public boolean hasNext()
public Document next()
public void close()
Specified by: close in interface java.io.Closeable
public boolean endOfCollection()
Specified by: endOfCollection in interface Collection
public java.lang.String getDocid()
protected void loadDocumentClass()
public Document getDocument()
Specified by: getDocument in interface Collection
protected int parseHeaders(boolean requireContentLength) throws java.io.IOException
Throws: java.io.IOException
public boolean nextDocument()
Specified by: nextDocument in interface Collection
protected java.lang.String readLine() throws java.io.IOException
Throws: java.io.IOException
protected boolean openNextFile() throws java.io.IOException
Throws: java.io.IOException - if there is an exception while opening the collection files.
protected void readCollectionSpec(java.lang.String CollectionSpecFilename)
public void reset()
Specified by: reset in interface Collection
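The behaviour of parseHeaders(boolean requireContentLength) is not described on this page. As a rough illustration of what reading WARC record headers can involve, the sketch below collects "Name: value" lines up to the first blank line and optionally insists on a Content-Length header. WarcHeaderSketch, its signature, its Map return type and the sample record are all assumptions for illustration and should not be read as Terrier's implementation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class WarcHeaderSketch {
    /**
     * Reads "Name: value" lines until the first blank line or end of input.
     * Header names are lower-cased so that lookups are case-insensitive.
     */
    static Map<String, String> parseHeaders(BufferedReader in, boolean requireContentLength)
            throws IOException {
        Map<String, String> headers = new LinkedHashMap<>();
        String line;
        while ((line = in.readLine()) != null && !line.trim().isEmpty()) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                // A line without a colon, such as the version line "WARC/0.18", is skipped.
                continue;
            }
            String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            headers.put(name, value);
        }
        if (requireContentLength && !headers.containsKey("content-length")) {
            throw new IOException("record has no Content-Length header");
        }
        return headers;
    }

    public static void main(String[] args) throws IOException {
        // Illustrative record; the header names follow the general WARC layout.
        String record = "WARC/0.18\n"
                + "WARC-Type: response\n"
                + "WARC-Target-URI: http://example.com/\n"
                + "Content-Length: 11\n"
                + "\n"
                + "hello world";
        Map<String, String> headers =
                parseHeaders(new BufferedReader(new StringReader(record)), true);
        System.out.println(headers);
    }
}
```

Splitting on the first colon only keeps values such as URLs intact, since they contain further colons of their own.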