I am indexing TREC disks 4 & 5 and need to store the content/text of each document on the disks. To do so, I followed the instructions here: [terrier.org
and added the following parameters to the terrier.properties file:
I have the following issues:
- Is what I am doing correct?
- Will the stored content be parsed (e.g., tags removed) or exactly as it is in the XML file being indexed?
- I keep getting the warning "WARN o.t.structures.indexing.Indexer - Adding empty document docID", although I did not get it when I indexed the disks without storing the document content. Is it possible that all the documents named in these warnings are actually empty? If so, why wasn't I getting this warning earlier?
- The indexing is extremely slow: it has been running for 2 days now on a dedicated server, with more than 10 GB of memory assigned to the indexing process.
- The data_1.meta.zdata file is huge: it has reached 20 GB, which is 10 times larger than the actual file being indexed. I thought the "keylens" parameter set the maximum size of the content stored per document, but it seems that every document is being allocated that full size; otherwise the metadata file would not have grown to 10 times the size of the input. Can someone explain what's going on here?
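To make the suspicion in the last point concrete, here is a rough back-of-the-envelope sketch. It assumes the metadata file stores fixed-width records, which is only my guess and not confirmed Terrier behaviour. The document count and key lengths are illustrative placeholders, not the values from my actual configuration.

```python
from typing import Sequence

# ASSUMPTION: every document reserves the full configured length of every
# metadata key (fixed-width records). This is a guess, not confirmed behaviour.


def padded_meta_size(num_docs: int, key_lengths: Sequence[int]) -> int:
    """Bytes needed if each document reserved the full length of every key."""
    return num_docs * sum(key_lengths)


def implied_bytes_per_doc(file_size_bytes: int, num_docs: int) -> float:
    """Average bytes per document implied by an observed metadata file size."""
    return file_size_bytes / num_docs


# Hypothetical placeholder values, chosen only for illustration.
NUM_DOCS = 500_000          # rough order of magnitude for the collection
KEY_LENGTHS = [20, 40_000]  # e.g. a short docno key and a long text key
OBSERVED_SIZE = 20 * 10**9  # the ~20 GB metadata file mentioned above

if __name__ == "__main__":
    estimate = padded_meta_size(NUM_DOCS, KEY_LENGTHS)
    per_doc = implied_bytes_per_doc(OBSERVED_SIZE, NUM_DOCS)
    print(f"Size if every record were padded: {estimate / 10**9:.2f} GB")
    print(f"Average bytes per document implied by 20 GB: {per_doc:.0f}")
```

If the padded estimate for the real document count and key lengths comes out close to the observed file size, that would be consistent with fixed-width storage; if it is far off, something else must explain the growth.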
I really appreciate your help. This is my first time using Terrier and I really hope to get it working correctly.
Thanks in advance!
Edited 1 time(s). Last edit at 09/10/2017 08:09AM by Maram.