[TR-137] TRECCollection cannot add properties from the document tags to the meta index at indexing time Created: 09/Jul/10  Updated: 05/Apr/11  Resolved: 03/Mar/11

Status: Resolved
Project: Terrier Core
Component/s: .indexing
Affects Version/s: 3.0
Fix Version/s: 3.5

Type: New Feature Priority: Minor
Reporter: Richard McCreadie Assignee: Richard McCreadie
Resolution: Fixed  
Labels: None

Attachments: File TREC-178 v2.patch     File TREC-178v3.patch     File TRECMetaProperties.patch    

 Description   
Modified the TRECCollection class such that additional document tags like the URL can be saved automatically into the meta index. Initial patch provided.

Example: I want to index the title field and add the url to the meta index for the collection below:
<doc>
<docno>NYTMay2006-0</docno>
<url>http://events.nytimes.com/2006/05/03/dining/reviews/03rest.html&lt;/url>
<title>Forum: Dine Out</title>
</doc>

Set trec.properties:
# Which fields do we want to pass to the indexer as properties (in addition to the docno)?
TrecDocTags.propertytags=url
# At the indexer which keys to we add to the Meta index?
indexer.meta.forward.keys=docno,url
# What are the lengths of those keys (e.g. the url will be much longer)?
indexer.meta.forward.keylens=20,100


 Comments   
Comment by Craig Macdonald [ 17/Feb/11 ]

Tagged for 3.1

Comment by Craig Macdonald [ 17/Feb/11 ]

I reviewed this. My comments are as follows:

1. Much of the property tag calculations occurs for each document parsed. This is very expensive. Instead, some arrays could be populated in setTags() :

protected int[] propertyTagLengths;
protected char[][] startPropertyTags;
protected char[][] endPropertyTags;	

2. Additionally, could a method be used that did the same for the DOCNO and each of the property tags.

3. Can you give a look at ensuring all of the tag-related are adequately described within the javadoc for TRECCollection.

Ta.

Comment by Richard McCreadie [ 22/Feb/11 ]

New patch. Moved tag computation into setTags(). Refactored getDocument to move the tag scan code into a separate getTag() method. Passes the testSingleDocumentSingleTermProperyTags() test case.

Comment by Craig Macdonald [ 23/Feb/11 ]

Ok. Looks very good. Three minor comments:
1. Can you add documentation to TRECCollection, and probably configure_indexing.html
2. TrecDocTags should not be hard-coded
3. Can you include the updated test case in the patch?

Comment by Richard McCreadie [ 03/Mar/11 ]

Added JavsDoc to TRECCollection, fixed hard codeing of TRECDocTags and updated a new test case

Comment by Craig Macdonald [ 03/Mar/11 ]

Committed. Thanks for your patience Richard!

Generated at Wed Dec 13 10:52:50 GMT 2017 using JIRA 7.1.1#71004-sha1:d6b2c0d9b7051e9fb5e4eb8ce177ca56d91d7bd8.