  1. Terrier Core
  2. TR-140

Indexing support for query-biased summarisation

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0
    • Fix Version/s: 3.5
    • Component/s: .indexing
    • Labels:
      None

      Description

      Currently, while the Decorate object has support for query-biased summarisation, there is no tokenisation support for creating the document summaries that it requires.

        Attachments

        1. TREC-200.patch
          17 kB
        2. TREC-200.v2.patch
          19 kB
        3. TREC-200.v3.patch
          22 kB
        4. TREC-200.v4.patch
          21 kB

          Issue Links

            Activity

            craigm Craig Macdonald created issue -
            richardm Richard McCreadie made changes -
            Field Original Value New Value
            Assignee Craig Macdonald [ craigm ] Richard McCreadie [ richardm ]
            craigm Craig Macdonald added a comment -

            The key ingredient here is to alter (or subclass?) each Document implementation to define a StringBuilder for the abstract, which is saved as a String to the documentProperties object at the end of the document.

            Because characters that are tokenised are lowercased and have punctuation removed, you need to ensure that this doesn't happen to the abstracts. However, some cleaning is necessary, to remove things like double spaces, etc.
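
            A minimal sketch of the kind of cleaning described above, assuming that "cleaning" means collapsing runs of whitespace while leaving case and punctuation untouched. The class and method names are hypothetical and are not taken from the patch.

```java
/** Hypothetical helper: cleans abstract text without lowercasing it or stripping punctuation. */
final class AbstractCleaner {
    private AbstractCleaner() {}

    /** Collapses each run of whitespace to one space and drops leading/trailing whitespace. */
    static String clean(CharSequence raw) {
        StringBuilder out = new StringBuilder(raw.length());
        boolean pendingSpace = false;
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (Character.isWhitespace(c)) {
                // Remember the gap, but never emit a space before the first real character.
                pendingSpace = out.length() > 0;
            } else {
                if (pendingSpace) {
                    out.append(' ');
                }
                out.append(c); // case and punctuation are preserved as-is
                pendingSpace = false;
            }
        }
        return out.toString();
    }
}
```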

            rodrygo Rodrygo L. T. Santos added a comment -

            Once tokenisation is dealt with separately from Document implementations, this should become much easier.

            rodrygo Rodrygo L. T. Santos made changes -
            Link This issue is blocked by TREC-225 [ TREC-225 ]
            craigm Craig Macdonald added a comment -

            This issue is ready to be looked at now.

            Essentially, what we want is for a block of 100 characters from each document to be saved in the docProperties object as "title" metadata, and an (overlapping) block of 2000 characters to be saved as "abstract" or similar.

            As we have altered the Document implementations to use a Tokeniser, you should append a copy of anything that is handed to the tokeniser. However, we will want to remove duplicate punctuation etc., so that it is useful for tokenisation.
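
            The two overlapping blocks could be sketched as follows. This assumes that both blocks are taken from the start of the text and that "duplicate punctuation" means immediate repeats of the same punctuation character; neither point is confirmed in this thread. The class, its methods, and the plain Map standing in for the docProperties object are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: derive "title" and "abstract" metadata from the text handed to the tokeniser. */
final class LeadingBlocks {
    static final int TITLE_LENGTH = 100;     // from the comment above
    static final int ABSTRACT_LENGTH = 2000; // from the comment above

    private LeadingBlocks() {}

    /** Collapses immediate repeats of one punctuation character ("!!!" becomes "!", "..." becomes "."). */
    static String collapsePunctuation(String text) {
        return text.replaceAll("(\\p{Punct})\\1+", "$1");
    }

    /** Returns at most maxChars leading characters of text. */
    static String head(String text, int maxChars) {
        return text.length() <= maxChars ? text : text.substring(0, maxChars);
    }

    /** Builds both entries; the abstract overlaps the title because both start at offset 0 (an assumption). */
    static Map<String, String> toProperties(String text) {
        String cleaned = collapsePunctuation(text);
        Map<String, String> props = new HashMap<>();
        props.put("title", head(cleaned, TITLE_LENGTH));
        props.put("abstract", head(cleaned, ABSTRACT_LENGTH));
        return props;
    }
}
```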

            richardm Richard McCreadie added a comment -

            Modified TaggedDocument and FileDocument to support the saving of abstracts in the document properties.

            TaggedDocument Properties:
            /** The names of the abstracts to be saved (comma separated list) **/
            TaggedDocument.abstracts
            /** The fields that the named abstracts come from (comma separated list) **/
            TaggedDocument.abstracts.fields
            /** The maximum length of each named abstract (comma separated list) **/
            TaggedDocument.abstracts.lengths

            FileDocument Properties:
            /** The names of the abstracts to be saved (comma separated list) **/
            FileDocument.abstract
            /** The maximum length of each named abstract (comma separated list) **/
            FileDocument.abstract.length

            Currently, text is saved as is without cleaning.

            TestTRECDocument and TestFileDocument have been updated with new test cases.
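
            The three TaggedDocument properties are parallel comma-separated lists, so entry i of each list describes the same abstract. A minimal sketch of reading them into per-abstract settings with java.util.Properties follows. The AbstractSpec class is hypothetical, and how the patch itself parses these properties is not shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical holder for one named abstract: its name, source field, and maximum length. */
final class AbstractSpec {
    final String name;
    final String field;
    final int maxLength;

    AbstractSpec(String name, String field, int maxLength) {
        this.name = name;
        this.field = field;
        this.maxLength = maxLength;
    }

    /** Reads the three parallel lists; property names are those listed in the comment above. */
    static List<AbstractSpec> parse(Properties props) {
        String[] names = split(props.getProperty("TaggedDocument.abstracts", ""));
        String[] fields = split(props.getProperty("TaggedDocument.abstracts.fields", ""));
        String[] lengths = split(props.getProperty("TaggedDocument.abstracts.lengths", ""));
        if (names.length != fields.length || names.length != lengths.length) {
            throw new IllegalArgumentException("abstract names, fields and lengths must have the same number of entries");
        }
        List<AbstractSpec> specs = new ArrayList<>();
        for (int i = 0; i < names.length; i++) {
            specs.add(new AbstractSpec(names[i], fields[i], Integer.parseInt(lengths[i])));
        }
        return specs;
    }

    /** Splits on commas, tolerating surrounding spaces; an empty value yields no entries. */
    private static String[] split(String value) {
        String v = value.trim();
        return v.isEmpty() ? new String[0] : v.split("\\s*,\\s*");
    }
}
```

            For example, setting the three properties to "title,body", "TITLE,BODY" and "100,2000" (illustrative values only) would yield two specs, the second being an abstract named "body" of at most 2000 characters.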

            richardm Richard McCreadie made changes -
            Attachment TREC-200.patch [ 10283 ]
            craigm Craig Macdonald added a comment -

            Hi Richard,

            Thanks for the patch and your work on this issue.

            As I read the test cases, they only check whether the first character is saved. IMHO, it should save a given number of sentences, including punctuation and capitalisation, up to the property limit.

            Craig
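
            One way to express "a given number of sentences, up to the property limit" is to cut at sentence boundaries with the JDK's java.text.BreakIterator. The sketch below is an assumption about how such a check could work, not the behaviour of the patch; the class name and parameters are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

/** Hypothetical sketch: keep whole sentences, with case and punctuation, within a character limit. */
final class SentenceTruncator {
    private SentenceTruncator() {}

    static String truncate(String text, int maxSentences, int maxChars) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        it.first();
        int end = 0;
        int sentences = 0;
        // Advance one sentence boundary at a time until either limit would be exceeded.
        for (int boundary = it.next();
             boundary != BreakIterator.DONE && sentences < maxSentences;
             boundary = it.next()) {
            if (boundary > maxChars) {
                break;
            }
            end = boundary;
            sentences++;
        }
        if (end == 0) {
            // Not even one sentence fits: fall back to a hard cut at the character limit.
            end = Math.min(text.length(), maxChars);
        }
        return text.substring(0, end).trim();
    }
}
```

            Falling back to a hard cut when the first sentence is longer than the limit is a design choice made for this sketch; returning an empty abstract would be another option.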

            richardm Richard McCreadie made changes -
            Attachment TREC-200.patch [ 10283 ]
            richardm Richard McCreadie added a comment -

            More comprehensive test cases

            richardm Richard McCreadie made changes -
            Attachment TREC-200.patch [ 10284 ]
            richardm Richard McCreadie made changes -
            Attachment TREC-200.patch [ 10284 ]
            richardm Richard McCreadie added a comment -

            FileDocument: Changed to append stringbuilder for abstract
            TaggedDocument: Added 'Else' special case
            TestTRECDocument: Added test case for 'Else' special case

            richardm Richard McCreadie made changes -
            Attachment TREC-200.patch [ 10285 ]
            richardm Richard McCreadie made changes -
            Attachment TREC-200.patch [ 10285 ]
            richardm Richard McCreadie added a comment -

            Minor fix to account for space added when joining fields in the 'Else' case

            richardm Richard McCreadie made changes -
            Attachment TREC-200.patch [ 10286 ]
            richardm Richard McCreadie made changes -
            Attachment TREC-200.patch [ 10286 ]
            richardm Richard McCreadie added a comment -

            Minor change to fix an issue where the property was not added correctly

            richardm Richard McCreadie made changes -
            Attachment TREC-200.patch [ 10292 ]
            richardm Richard McCreadie added a comment -

            I have now tested this with the Terrier Web interface and it works as expected. I believe that we can close this issue.

            craigm Craig Macdonald made changes -
            Link This issue relates to TREC-240 [ TREC-240 ]
            craigm Craig Macdonald added a comment -

            Some changes to this patch:

            • Updated to current trunk
            • Changed fields to tags
            • Amended test cases to use simpler examples

            Some more changes are required:

            • The tests currently aren't passing. I'm not sure whether this is due to my changes or to other changes in TaggedDocument.
            • I think the abstract strings should only be built when the end of document is reached. Currently, we have many string appends occurring (for each call to saveToAbstract()). saveToAbstract() will be called for each block of text.
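
            The second point could look like the sketch below: each call to saveToAbstract() only appends to a per-abstract StringBuilder, and the Strings are created once, at end of document. The method name saveToAbstract() comes from the comment above, but its signature here, the class, and the endOfDocument() method are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: accumulate abstract text in buffers and build Strings once per document. */
final class DeferredAbstracts {
    private final Map<String, StringBuilder> buffers = new LinkedHashMap<>();
    private final int maxLength;

    DeferredAbstracts(int maxLength) {
        this.maxLength = maxLength;
    }

    /** Called for each block of text; appends in place and never creates an intermediate String. */
    void saveToAbstract(String name, CharSequence text) {
        StringBuilder sb = buffers.computeIfAbsent(name, k -> new StringBuilder());
        int room = maxLength - sb.length();
        if (room <= 0) {
            return; // this abstract is already full
        }
        if (sb.length() > 0) {
            if (room < 2) {
                return; // no space left for a separator plus at least one character
            }
            sb.append(' ');
            room--;
        }
        sb.append(text, 0, Math.min(room, text.length()));
    }

    /** Called once at end of document; this is the only place where Strings are built. */
    Map<String, String> endOfDocument() {
        Map<String, String> props = new LinkedHashMap<>();
        for (Map.Entry<String, StringBuilder> e : buffers.entrySet()) {
            props.put(e.getKey(), e.getValue().toString());
        }
        buffers.clear();
        return props;
    }
}
```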
            craigm Craig Macdonald made changes -
            Attachment TREC-200.v2.patch [ 10293 ]
            richardm Richard McCreadie added a comment -

            Fixed the failing test cases and added a string buffer for the 'Else' case.

            richardm Richard McCreadie made changes -
            Attachment TREC-200.v3.patch [ 10294 ]
            craigm Craig Macdonald added a comment -

            Final patch with agreed changes (use StringBuilders throughout TaggedDocument).

            craigm Craig Macdonald made changes -
            Attachment TREC-200.v4.patch [ 10295 ]
            craigm Craig Macdonald added a comment -

            Committed v4. Thanks Richard!

            craigm Craig Macdonald made changes -
            Status Open [ 1 ] Resolved [ 5 ]
            Resolution Fixed [ 1 ]
            craigm Craig Macdonald made changes -
            Project TREC [ 10010 ] Terrier Core [ 10000 ]
            Key TREC-200 TR-140
            Issue Type Improvement [ 4 ] Bug [ 1 ]
            Workflow jira [ 10456 ] Terrier Open Source [ 10533 ]
            Affects Version/s 3.0 [ 10030 ]
            Affects Version/s 3.0 [ 10020 ]
            Component/s .indexing [ 10002 ]
            Component/s Core [ 10020 ]
            Fix Version/s 3.1 [ 10040 ]
            Fix Version/s 3.1 [ 10021 ]
            craigm Craig Macdonald added a comment -

            Clarified title.

            craigm Craig Macdonald made changes -
            Summary Query-biased summarisation out of the box Indexing support for query-biased summarisation

              People

              • Assignee:
                richardm Richard McCreadie
                Reporter:
                craigm Craig Macdonald
