Terrier Core / TR-146

Tokenisation should be done separately from Document parsing

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.5
    • Component/s: .indexing, tests
    • Labels:
      None

      Description

      Several Document implementations hard-code the tokenisation used for a particular language; TRECUTFCollection$UTFDocument is one example.

      Tokenisation should be separated from the rest of document parsing, which handles other tasks such as HTML parsing and metadata collection. The tokenisation for a given language could then be changed without touching the document parsers.
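
      A minimal sketch of the proposed separation, for illustration only. All names here (TokenisationSketch, Tokeniser, AsciiTokeniser, UnicodeTokeniser, TaggedDocument) are hypothetical and are not taken from the Terrier code base. The document class only strips markup and delegates term splitting to whichever tokeniser it is given:

```java
import java.util.ArrayList;
import java.util.List;

public class TokenisationSketch {

    /** Hypothetical interface: turns plain text into a list of terms. */
    interface Tokeniser {
        List<String> tokenise(String text);
    }

    /** Keeps runs of ASCII letters and digits only, lowercased. */
    static class AsciiTokeniser implements Tokeniser {
        public List<String> tokenise(String text) {
            return split(text, true);
        }
    }

    /** Keeps runs of any Unicode letter or digit, lowercased. */
    static class UnicodeTokeniser implements Tokeniser {
        public List<String> tokenise(String text) {
            return split(text, false);
        }
    }

    /** Shared splitting loop; walks code points so supplementary characters are handled. */
    static List<String> split(String text, boolean asciiOnly) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean keep = Character.isLetterOrDigit(cp) && (!asciiOnly || cp < 128);
            if (keep) {
                current.appendCodePoint(Character.toLowerCase(cp));
            } else if (current.length() > 0) {
                // A non-term character ends the current term.
                terms.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    /** Hypothetical document parser: handles markup, knows nothing about languages. */
    static class TaggedDocument {
        private final Tokeniser tokeniser;

        TaggedDocument(Tokeniser tokeniser) {
            this.tokeniser = tokeniser;
        }

        List<String> terms(String raw) {
            // Crude tag stripping stands in for real HTML/TREC parsing.
            String text = raw.replaceAll("<[^>]*>", " ");
            return tokeniser.tokenise(text);
        }
    }

    public static void main(String[] args) {
        String raw = "<DOC><TITLE>Caf\u00e9 menu</TITLE> 2011</DOC>";
        System.out.println(new TaggedDocument(new AsciiTokeniser()).terms(raw));
        System.out.println(new TaggedDocument(new UnicodeTokeniser()).terms(raw));
    }
}
```

      With this shape, supporting another language means adding a Tokeniser implementation, and the same document parser can be reused unchanged.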

          Issue Links

            Activity

            rodrygo Rodrygo L. T. Santos created issue -
            rodrygo Rodrygo L. T. Santos made changes -
            Field Original Value New Value
            Link This issue is related to TREC-224 [ TREC-224 ]
            rodrygo Rodrygo L. T. Santos made changes -
            Link This issue blocks TREC-220 [ TREC-220 ]
            rodrygo Rodrygo L. T. Santos made changes -
            Link This issue blocks TREC-200 [ TREC-200 ]
            craigm Craig Macdonald made changes -
            Link This issue blocks TREC-228 [ TREC-228 ]
            craigm Craig Macdonald made changes -
            Link This issue blocks TREC-227 [ TREC-227 ]
            craigm Craig Macdonald made changes -
            Status: Open [ 1 ] → Resolved [ 5 ]
            Assignee: Rodrygo L. T. Santos [ rodrygo ] → Craig Macdonald [ craigm ]
            Resolution Fixed [ 1 ]
            craigm Craig Macdonald made changes -
            Project: TREC [ 10010 ] → Terrier Core [ 10000 ]
            Key: TREC-225 → TR-146
            Issue Type: New Feature [ 2 ] → Bug [ 1 ]
            Workflow: jira [ 10502 ] → Terrier Open Source [ 10539 ]
            Component/s .indexing [ 10002 ]
            Component/s tests [ 10006 ]
            Component/s Core [ 10020 ]
            Fix Version/s 3.1 [ 10040 ]
            Fix Version/s 3.1 [ 10021 ]

              People

              • Assignee:
                craigm Craig Macdonald
                Reporter:
                rodrygo Rodrygo L. T. Santos
              • Watchers:
                1

                Dates

                • Created:
                  Updated:
                  Resolved: