Terrier Users :  Terrier Forum terrier.org
General discussion about using/developing applications using Terrier 
Storing content of docs in TREC Disks 4 & 5
Posted by: Maram ()
Date: August 29, 2017 10:34AM

Hi,

I am indexing TREC Disks 4 & 5 and need to store the text of each document in the index. To do so, I followed the instructions here: [terrier.org]
and added the following properties to the terrier.properties file:
indexer.meta.forward.keys=DOCNO,TEXT
indexer.meta.forward.keylens=500,214748
indexer.meta.reverse.keys=

I have the following issues:
- Is what I am doing correct?
- Will the stored content be parsed (e.g., tags removed) or exactly as it is in the XML file being indexed?
- I keep getting the warning "WARN o.t.structures.indexing.Indexer - Adding empty document docID". When I indexed the disks without storing the document content, I did not get this warning. Is it possible that all the documents named in these warnings are actually empty? If so, why did the warning not appear earlier?
- Indexing is extremely slow: it has been running for 2 days on a dedicated server, with more than 10 GB of memory assigned to the indexing process.
- The data_1.meta.zdata file is huge: it has reached 20 GB, about 10 times the size of the collection being indexed. I thought the "keylens" parameter set the maximum size of the content stored per document, but it looks as if every document is allocated that full size; otherwise the metadata file would not have grown to 10 times the size of the input. Can someone explain what is going on here?
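
To put a number on that last suspicion, here is a rough back-of-the-envelope sketch. It assumes each "keylens" unit costs one byte and that every record is stored at its full declared width; both are my assumptions, not something I have confirmed in the Terrier documentation. The document count of 500,000 is a hypothetical round figure, and the class and method names are made up for illustration.

```java
public class MetaSizeEstimate {

    /**
     * Worst-case size in bytes if every document's record occupied the
     * full declared width of all keys (assumption: one byte per keylens unit).
     */
    static long worstCaseBytes(long numDocs, int[] keyLens) {
        long perDoc = 0;
        for (int len : keyLens) {
            perDoc += len;
        }
        return numDocs * perDoc;
    }

    public static void main(String[] args) {
        int[] keyLens = {500, 214748};   // values from indexer.meta.forward.keylens above
        long numDocs = 500_000L;         // hypothetical document count
        long observed = 20_000_000_000L; // roughly the 20 GB file reported above

        long worst = worstCaseBytes(numDocs, keyLens);
        System.out.println("worst case bytes:       " + worst);
        System.out.println("observed bytes per doc: " + (observed / numDocs));
    }
}
```

Under these assumptions the fully padded size would be far larger than 20 GB, while the observed file works out to about 40 KB per document, so the arithmetic alone does not settle whether each record is padded before being written.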

I really appreciate your help. This is my first time using Terrier and I really hope to get it working correctly.

Thanks in advance!
Maram



Edited 1 time(s). Last edit at 09/10/2017 08:09AM by Maram.

Re: Storing content of docs in TREC Disks 4 & 5
Posted by: Maram ()
Date: September 06, 2017 10:32PM

Hi,

Can anyone please help? No matter what I try, docno is the only field that gets added to the meta index. How can I add the document content?

Maram

Re: Storing content of docs in TREC Disks 4 & 5
Posted by: Maram ()
Date: September 06, 2017 11:13PM

I tried to index the date only. Here's the configuration I used:

indexer.meta.forward.keys=docno,DATE
indexer.meta.forward.keylens=200,2147
indexer.meta.reverse.keys=docno

I tried to look up the date in my Java app. Using the following code snippet, I can see that "DATE" has been added as a key to the meta index, but its value is empty.

MetaIndex meta = index.getMetaIndex();
for (String k : meta.getKeys()) {
    System.out.println(k);
}

However, printing the value returned by this lookup shows a blank line:
String date = meta.getItem("DATE", docid);
System.out.println(date);
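
To check every key at once rather than one at a time, the lookup above can be wrapped in a small loop. This is only a sketch: MetaLookup is a hypothetical stand-in interface that mirrors the two calls used above (getKeys and getItem), so the example runs without Terrier on the classpath, and the toy data simply reproduces the symptom described here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetaDump {

    /** Hypothetical stand-in for the two meta index calls used above. */
    interface MetaLookup {
        String[] getKeys();
        String getItem(String key, int docid);
    }

    /** Returns key -> value for one document, marking missing or blank values. */
    static Map<String, String> dump(MetaLookup meta, int docid) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String key : meta.getKeys()) {
            String value = meta.getItem(key, docid);
            boolean blank = (value == null || value.trim().isEmpty());
            out.put(key, blank ? "<empty>" : value);
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy data: DATE is registered as a key but has no stored value.
        MetaLookup toy = new MetaLookup() {
            public String[] getKeys() {
                return new String[] {"docno", "DATE"};
            }
            public String getItem(String key, int docid) {
                return key.equals("docno") ? "DOC-" + docid : "";
            }
        };
        System.out.println(dump(toy, 0)); // prints {docno=DOC-0, DATE=<empty>}
    }
}
```

Running the same loop over a handful of docids would show whether the value is blank for every document or only for some of them.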

Can anyone help, please?

Maram
