  1. Terrier Core
  2. TR-140

Indexing support for query-biased summarisation

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0
    • Fix Version/s: 3.5
    • Component/s: .indexing
    • Labels:
      None

      Description

      Currently, while the Decorate object has support for query-biased summarisation, there is no tokenisation support for creating the summaries of documents necessary to support query-biased summarisation.

        Attachments

        1. TREC-200.patch
          17 kB
        2. TREC-200.v2.patch
          19 kB
        3. TREC-200.v3.patch
          22 kB
        4. TREC-200.v4.patch
          21 kB

            Activity

            craigm Craig Macdonald added a comment -

            The key ingredient here is to alter (or subclass???) each Document implementation to define a StringBuilder for the abstract, which is saved as a String to the documentProperties object at the end of the document.

            As characters that are tokenised are lowercased and have punctuation removed, you need to ensure that this doesn't happen to the abstracts. However, some cleaning is necessary to remove things like double spaces, etc.
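
The cleaning step mentioned above could look something like the following. This is only a minimal sketch: the class and method names (AbstractCleaner, clean) are hypothetical and not part of Terrier, and it assumes that "cleaning" means collapsing runs of whitespace while leaving case and punctuation untouched.

```java
/** Hypothetical helper, not a Terrier class. */
final class AbstractCleaner {
    private AbstractCleaner() {}

    /**
     * Collapses every run of whitespace (spaces, tabs, newlines) into a
     * single space and trims both ends. Case and punctuation are kept.
     */
    static String clean(String raw) {
        return raw.replaceAll("\\s+", " ").trim();
    }
}
```

For example, AbstractCleaner.clean("Hello,   World!\n\nSecond  line. ") yields "Hello, World! Second line.".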

            rodrygo Rodrygo L. T. Santos added a comment -

            Once tokenisation is dealt with separately from Document implementations, this should become much easier.

            craigm Craig Macdonald added a comment -

            This issue is ready to be looked at now.

            Essentially, what we want is for the block of 100 characters in each document to be saved in the docProperties object as "title" metadata, and (an overlapping) block of 2000 characters to be saved as "abstract" or similar.

            As we have altered Document implementations to use a Tokeniser, you should append a copy of anything that is handed to the tokeniser. However, we will want to remove duplicate punctuation etc, so that it is useful for tokenisation.
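
As a rough illustration of the idea in this comment, the sketch below keeps a copy of each piece of text handed to the tokeniser and, at the end of the document, derives two overlapping blocks from it. The class name, method names and the plain Map used in place of the real document properties are assumptions made for the example; only the "title" and "abstract" keys and the 100 and 2000 character sizes come from the comment.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch, not a Terrier class. */
final class LeadingTextCapture {
    static final int TITLE_LENGTH = 100;     // size taken from the comment
    static final int ABSTRACT_LENGTH = 2000; // size taken from the comment

    private final StringBuilder buffer = new StringBuilder();

    /** Call with a copy of each block of text that is handed to the tokeniser. */
    void append(String text) {
        if (buffer.length() >= ABSTRACT_LENGTH) {
            return; // nothing more is needed once the longest block is full
        }
        if (buffer.length() > 0) {
            buffer.append(' '); // separator between blocks (an assumption)
        }
        buffer.append(text);
        if (buffer.length() > ABSTRACT_LENGTH) {
            buffer.setLength(ABSTRACT_LENGTH); // truncate to the limit
        }
    }

    /** Builds the two overlapping blocks once, at the end of the document. */
    Map<String, String> finish() {
        String all = buffer.toString();
        Map<String, String> properties = new HashMap<>();
        properties.put("title", all.substring(0, Math.min(TITLE_LENGTH, all.length())));
        properties.put("abstract", all);
        return properties;
    }
}
```

Because both values are cut from the same buffer, the "title" block is always a prefix of the "abstract" block, which is one way to read "overlapping" here.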

            richardm Richard McCreadie added a comment -

            Modified TaggedDocument and FileDocument to support the saving of abstracts in the document properties.

            TaggedDocument Properties:
            /** The names of the abstracts to be saved (comma separated list) **/
            TaggedDocument.abstracts
            /** The fields that the named abstracts come from (comma separated list) **/
            TaggedDocument.abstracts.fields
            /** The maximum length of each named abstract (comma separated list) **/
            TaggedDocument.abstracts.lengths

            FileDocument Properties:
            /** The names of the abstracts to be saved (comma separated list) **/
            FileDocument.abstract
            /** The maximum length of each named abstract (comma separated list) **/
            FileDocument.abstract.length
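
To make the relationship between the three TaggedDocument properties concrete, here is a small sketch of how parallel comma-separated lists like these might be read. It is not the code from the patch: the AbstractSpecs class and its parse method are hypothetical, the property names are used as written in this comment (a later comment mentions fields being changed to tags), and the example values are invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical sketch, not the code from the patch. */
final class AbstractSpecs {
    /** One named abstract: where it comes from and how long it may be. */
    static final class Spec {
        final String name;
        final String field;
        final int maxLength;

        Spec(String name, String field, int maxLength) {
            this.name = name;
            this.field = field;
            this.maxLength = maxLength;
        }
    }

    private AbstractSpecs() {}

    /** Reads the three parallel lists and pairs them up by position. */
    static List<Spec> parse(Properties props) {
        String[] names = split(props.getProperty("TaggedDocument.abstracts", ""));
        String[] fields = split(props.getProperty("TaggedDocument.abstracts.fields", ""));
        String[] lengths = split(props.getProperty("TaggedDocument.abstracts.lengths", ""));
        if (names.length != fields.length || names.length != lengths.length) {
            throw new IllegalArgumentException("abstract property lists differ in length");
        }
        List<Spec> specs = new ArrayList<>();
        for (int i = 0; i < names.length; i++) {
            specs.add(new Spec(names[i], fields[i], Integer.parseInt(lengths[i])));
        }
        return specs;
    }

    /** Splits a comma-separated value, tolerating spaces around commas. */
    private static String[] split(String value) {
        String trimmed = value.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s*,\\s*");
    }
}
```

With the invented values TaggedDocument.abstracts=title,abstract, TaggedDocument.abstracts.fields=TITLE,BODY and TaggedDocument.abstracts.lengths=100,2000, this would produce two specs: "title" from TITLE capped at 100 characters, and "abstract" from BODY capped at 2000.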

            Currently, text is saved as is without cleaning.

            TestTRECDocument and TestFileDocument have been updated with new test cases.

            craigm Craig Macdonald added a comment -

            Hi Richard,

            Thanks for the patch and your work on this issue.

            As I read the test cases, they only check if the first character is saved. IMHO, it should save a given number of sentences, including punctuation and capitalisation, until the property limit.

            Craig

            richardm Richard McCreadie added a comment -

            More comprehensive test cases

            richardm Richard McCreadie added a comment -

            FileDocument: Changed to append to a StringBuilder for the abstract
            TaggedDocument: Added 'Else' special case
            TestTRECDocument: Added test case for 'Else' special case

            richardm Richard McCreadie added a comment -

            Minor fix to account for space added when joining fields in the 'Else' case

            richardm Richard McCreadie added a comment -

            Minor change to fix an issue where the property was not added correctly

            richardm Richard McCreadie added a comment -

            I have now tested this with the Terrier Web interface and it works as expected. I believe that we can close this issue.

            craigm Craig Macdonald added a comment -

            Some changes to this patch:

            • Updated to current trunk
            • Changed fields to tags
            • Amended test cases to easier examples

            Some more changes are required:

            • The tests currently aren't passing. I'm not sure whether this is due to my changes or to other changes in TaggedDocument.
            • I think the abstract strings should only be built when the end of document is reached. Currently, we have many string appends occurring (for each call to saveToAbstract()). saveToAbstract() will be called for each block of text.
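
The second point above can be sketched as follows: each named abstract accumulates into its own StringBuilder, and a String is created only once, when the end of the document is reached. This is an illustration under assumptions rather than the patch itself; the DeferredAbstracts class, the register and endOfDocument methods, and the two-argument form of saveToAbstract() are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch, not the code from the patch. */
final class DeferredAbstracts {
    private final Map<String, StringBuilder> builders = new LinkedHashMap<>();
    private final Map<String, Integer> limits = new LinkedHashMap<>();

    /** Declares a named abstract and its maximum length. */
    void register(String name, int maxLength) {
        builders.put(name, new StringBuilder());
        limits.put(name, maxLength);
    }

    /** Called once per block of text; only appends to a builder, no Strings are built. */
    void saveToAbstract(String name, String text) {
        StringBuilder sb = builders.get(name);
        if (sb == null) {
            return; // text for an abstract that was never registered is ignored
        }
        int limit = limits.get(name);
        if (sb.length() >= limit) {
            return; // already full
        }
        if (sb.length() > 0) {
            sb.append(' '); // separator between blocks (an assumption)
        }
        sb.append(text);
        if (sb.length() > limit) {
            sb.setLength(limit); // truncate to the configured maximum
        }
    }

    /** Converts each builder to a String exactly once, at the end of the document. */
    Map<String, String> endOfDocument() {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> entry : builders.entrySet()) {
            result.put(entry.getKey(), entry.getValue().toString());
        }
        return result;
    }
}
```

The design choice is simply to keep the per-block work down to a StringBuilder append, so that the cost of creating Strings is paid once per abstract rather than once per block of text.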
            richardm Richard McCreadie added a comment -

            Fixed the test case failures and added a string buffer for the 'Else' case.

            craigm Craig Macdonald added a comment -

            Final patch with agreed changes (use StringBuilders throughout TaggedDocument).

            craigm Craig Macdonald added a comment -

            Committed v4. Thanks Richard!

            craigm Craig Macdonald added a comment -

            Clarified title.


              People

              • Assignee:
                richardm Richard McCreadie
                Reporter:
                craigm Craig Macdonald