[TR-140] Indexing support for query-biased summarisation Created: 16/Sep/10  Updated: 05/Apr/11  Resolved: 04/Apr/11

Status: Resolved
Project: Terrier Core
Component/s: .indexing
Affects Version/s: 3.0
Fix Version/s: 3.5

Type: Bug Priority: Major
Reporter: Craig Macdonald Assignee: Richard McCreadie
Resolution: Fixed  
Labels: None

Attachments: File TREC-200.patch     File TREC-200.v2.patch     File TREC-200.v3.patch     File TREC-200.v4.patch    
Issue Links:
Block
is blocked by TR-146 Tokenisation should be done separatel... Resolved
Related
relates to TR-150 TRECCollection parse DOCHDR tags, inc... Resolved

 Description   
Currently, while the Decorate object has support for query-biased summarisation, there is no tokenisation support for creating the summaries of documents necessary to support query-biased summarisation.

 Comments   
Comment by Craig Macdonald [ 03/Mar/11 ]

The key ingredient here is to alter (or subclass?) each Document implementation to define a StringBuilder for the abstract, which is saved as a String to the documentProperties object at the end of the document.

As characters that are tokenised are lowercased and have punctuation removed, you need to ensure that this doesn't happen to the abstracts. However, some cleaning is necessary, e.g. to remove things like double spaces.
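
A minimal sketch of the kind of light cleaning described above, assuming that "cleaning" means collapsing runs of whitespace while leaving case and punctuation alone. The class and method names (AbstractCleaner, clean) are hypothetical and are not taken from Terrier or from the attached patches.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Terrier): tidies text destined for an abstract. */
final class AbstractCleaner {
    // Matches one or more whitespace characters (spaces, tabs, newlines).
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    private AbstractCleaner() {}

    /**
     * Collapses each whitespace run to a single space and trims both ends.
     * Capitalisation and punctuation are left untouched.
     */
    static String clean(String raw) {
        return WHITESPACE_RUN.matcher(raw).replaceAll(" ").trim();
    }
}
```

Calling AbstractCleaner.clean() on each piece of text before it is appended to the abstract would keep the text readable without applying the tokeniser's lowercasing.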

Comment by Rodrygo L. T. Santos [ 17/Mar/11 ]

Once tokenisation is dealt with separately from Document implementations, this should become much easier.

Comment by Craig Macdonald [ 24/Mar/11 ]

This issue is ready to be looked at now.

Essentially, what we want is for the block of 100 characters in each document to be saved in the docProperties object as "title" metadata, and (an overlapping) block of 2000 characters to be saved as "abstract" or similar.
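
As a rough illustration of the two overlapping blocks, the sketch below assumes both are taken from the start of the document text, which the comment does not state explicitly. The class LeadingTextProperties and its methods are hypothetical; a plain Map stands in for the docProperties object.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: derives "title" and "abstract" metadata from leading text. */
final class LeadingTextProperties {
    static final int TITLE_LENGTH = 100;      // characters kept as "title"
    static final int ABSTRACT_LENGTH = 2000;  // characters kept as "abstract"

    private LeadingTextProperties() {}

    /** Returns a map with both entries; the two blocks overlap by construction. */
    static Map<String, String> extract(String text) {
        Map<String, String> props = new HashMap<>();
        props.put("title", prefix(text, TITLE_LENGTH));
        props.put("abstract", prefix(text, ABSTRACT_LENGTH));
        return props;
    }

    /** Returns at most maxChars characters from the start of text. */
    static String prefix(String text, int maxChars) {
        return text.length() <= maxChars ? text : text.substring(0, maxChars);
    }
}
```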

As we have altered Document implementations to use a Tokeniser, you should append a copy of anything that is handed to the tokeniser. However, we will want to remove duplicate punctuation etc, so that it is useful for tokenisation.
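
One possible way to remove duplicate punctuation from the copied text is sketched below, assuming "duplicate" means the same punctuation character repeated back to back. The class name PunctuationDeduplicator is hypothetical. Note that this simple rule also shortens an ellipsis ("...") to a single full stop, which may or may not be desirable.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: collapses repeated punctuation in text kept for summaries. */
final class PunctuationDeduplicator {
    // A punctuation character followed by one or more copies of itself.
    private static final Pattern REPEATED_PUNCT = Pattern.compile("(\\p{Punct})\\1+");

    private PunctuationDeduplicator() {}

    /** Replaces each run of an identical punctuation character with a single copy. */
    static String collapse(String text) {
        return REPEATED_PUNCT.matcher(text).replaceAll("$1");
    }
}
```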

Comment by Richard McCreadie [ 25/Mar/11 ]

Modified TaggedDocument and FileDocument to support the saving of abstracts in the document properties.

TaggedDocument Properties:
/** The names of the abstracts to be saved (comma separated list) **/
TaggedDocument.abstracts
/** The fields that the named abstracts come from (comma separated list) **/
TaggedDocument.abstracts.fields
/** The maximum length of each named abstract (comma separated list) **/
TaggedDocument.abstracts.lengths

FileDocument Properties:
/** The names of the abstracts to be saved (comma separated list) **/
FileDocument.abstract
/** The maximum length of each named abstract (comma separated list) **/
FileDocument.abstract.length
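
To show how the three parallel comma-separated TaggedDocument lists might line up, here is a sketch that reads them from a java.util.Properties object. The property names are the ones listed in this comment (a later comment mentions changing fields to tags, so the committed names may differ). The AbstractSpec class is hypothetical, and the values used in the usage note are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical sketch: one named abstract, its source field, and its length limit. */
final class AbstractSpec {
    final String name;
    final String field;
    final int maxLength;

    AbstractSpec(String name, String field, int maxLength) {
        this.name = name;
        this.field = field;
        this.maxLength = maxLength;
    }

    /** Zips the three comma-separated lists; entry i of each list describes abstract i. */
    static List<AbstractSpec> parse(Properties props) {
        String[] names = split(props.getProperty("TaggedDocument.abstracts", ""));
        String[] fields = split(props.getProperty("TaggedDocument.abstracts.fields", ""));
        String[] lengths = split(props.getProperty("TaggedDocument.abstracts.lengths", ""));
        if (names.length != fields.length || names.length != lengths.length) {
            throw new IllegalArgumentException("abstract property lists differ in length");
        }
        List<AbstractSpec> specs = new ArrayList<>();
        for (int i = 0; i < names.length; i++) {
            specs.add(new AbstractSpec(names[i], fields[i], Integer.parseInt(lengths[i])));
        }
        return specs;
    }

    /** Splits on commas, ignoring surrounding whitespace; an empty value yields no entries. */
    private static String[] split(String csv) {
        String trimmed = csv.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s*,\\s*");
    }
}
```

For example, setting TaggedDocument.abstracts=title,abstract with fields TITLE,BODY and lengths 100,2000 (illustrative values only) would yield two specs: "title" drawn from TITLE with a limit of 100, and "abstract" drawn from BODY with a limit of 2000.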

Currently, text is saved as-is, without cleaning.

TestTRECDocument and TestFileDocument have been updated with new test cases.

Comment by Craig Macdonald [ 25/Mar/11 ]

Hi Richard,

Thanks for the patch and your work on this issue.

As I read the test cases, they only check if the first character is saved. IMHO, it should save a given number of sentences, including punctuation and capitalisation, up to the property limit.

Craig

Comment by Richard McCreadie [ 25/Mar/11 ]

More comprehensive test cases

Comment by Richard McCreadie [ 25/Mar/11 ]

FileDocument: Changed to append to a StringBuilder for the abstract
TaggedDocument: Added 'Else' special case
TestTRECDocument: Added test case for 'Else' special case

Comment by Richard McCreadie [ 25/Mar/11 ]

Minor fix to account for space added when joining fields in the 'Else' case

Comment by Richard McCreadie [ 01/Apr/11 ]

Minor change to fix an issue where the property was not added correctly

Comment by Richard McCreadie [ 01/Apr/11 ]

I have now tested this with the Terrier Web interface and it works as expected. I believe that we can close this issue.

Comment by Craig Macdonald [ 02/Apr/11 ]

Some changes to this patch:

  • Updated to current trunk
  • Changed fields to tags
  • Amended test cases to easier examples

Some more changes are required:

  • The tests currently aren't passing. I'm not sure if this is due to my changes or to other changes in TaggedDocument.
  • I think the abstract strings should only be built when the end of the document is reached. Currently, a string append occurs on every call to saveToAbstract(), and saveToAbstract() is called for each block of text.
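
The deferred-build idea in the last point can be sketched as follows. Only the method name saveToAbstract() comes from this thread; the AbstractAccumulator class, its constructor, and finish() are hypothetical and do not reflect the actual patch.

```java
/** Hypothetical sketch: buffers abstract text and builds the String once, at end of document. */
final class AbstractAccumulator {
    private final StringBuilder buffer = new StringBuilder();
    private final int maxLength;

    AbstractAccumulator(int maxLength) {
        this.maxLength = maxLength;
    }

    /** Called once per block of text; appends characters only until the limit is reached. */
    void saveToAbstract(String text) {
        int remaining = maxLength - buffer.length();
        if (remaining <= 0) {
            return; // limit already reached, nothing more to keep
        }
        buffer.append(text, 0, Math.min(text.length(), remaining));
    }

    /** Called at end of document; the only place where a String is created. */
    String finish() {
        return buffer.toString();
    }
}
```

With this shape, each call costs an amortised append into the buffer rather than a fresh String concatenation, and the result of finish() is what would be stored in the document properties.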

Comment by Richard McCreadie [ 04/Apr/11 ]

Fixed the test case failures and added a string buffer for the 'Else' case.

Comment by Craig Macdonald [ 04/Apr/11 ]

Final patch with agreed changes (use StringBuilders throughout TaggedDocument).

Comment by Craig Macdonald [ 04/Apr/11 ]

Committed v4. Thanks Richard!

Comment by Craig Macdonald [ 05/Apr/11 ]

Clarified title.

Generated at Wed Dec 13 11:14:40 GMT 2017 using JIRA 7.1.1#71004-sha1:d6b2c0d9b7051e9fb5e4eb8ce177ca56d91d7bd8.