Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.6
    • Fix Version/s: 3.6
    • Component/s: .indexing
    • Labels:
      None

      Description

      Terrier should be extended to better process MathML documents, for participants in the NTCIR Math track.

      The attached tar contains the following classes:
       - MathMLCollection: Extends TRECCollection, i.e. follows the TaggedDocument parsing strategy, but does not look for property tags or a docno tag.
       - FuzzyTagSet: Extends TagSet. Instead of requiring an exact match between the current tag and the tags to process, it checks whether the current tag or any of its enclosing tags matches one of the tags listed to process.
       - FuzzyTaggedDocument: A TaggedDocument that uses FuzzyTagSet rather than TagSet.
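
      The fuzzy matching described for FuzzyTagSet can be sketched roughly as below. This is an illustration only, not the attached code and not Terrier's TagSet API: the class FuzzyTagMatchSketch and its methods are hypothetical names, and the case-insensitive comparison is an assumption chosen to mirror a setting such as TrecDocTags.casesensitive=false.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch: exact vs. fuzzy (enclosing-tag) matching. Not the attached FuzzyTagSet. */
public class FuzzyTagMatchSketch {
    // Tags whose content should be processed, e.g. "p,conv:math".
    private final Set<String> processTags = new HashSet<>();
    // Currently open tags; the head of the deque is the innermost tag.
    private final Deque<String> openTags = new ArrayDeque<>();

    public FuzzyTagMatchSketch(String commaSeparatedTags) {
        for (String tag : commaSeparatedTags.split(",")) {
            String trimmed = tag.trim();
            if (!trimmed.isEmpty()) {
                processTags.add(normalise(trimmed));
            }
        }
    }

    // Assumption: tags are compared case-insensitively.
    private static String normalise(String tag) {
        return tag.toLowerCase(Locale.ROOT);
    }

    /** Record that an opening tag was seen. */
    public void open(String tag) {
        openTags.push(normalise(tag));
    }

    /** Record that the innermost open tag was closed. */
    public void close() {
        if (!openTags.isEmpty()) {
            openTags.pop();
        }
    }

    /** Exact strategy: only the innermost tag is checked. */
    public boolean exactMatch() {
        return !openTags.isEmpty() && processTags.contains(openTags.peek());
    }

    /** Fuzzy strategy: the innermost tag or any enclosing tag may match. */
    public boolean fuzzyMatch() {
        for (String tag : openTags) {
            if (processTags.contains(tag)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        FuzzyTagMatchSketch matcher = new FuzzyTagMatchSketch("p,conv:math");
        // Simulates <body><conv:math><mrow><mi> ... text here ...
        matcher.open("body");
        matcher.open("conv:math");
        matcher.open("mrow");
        matcher.open("mi");
        System.out.println("exact: " + matcher.exactMatch()); // prints "exact: false"
        System.out.println("fuzzy: " + matcher.fuzzyMatch()); // prints "fuzzy: true"
    }
}
```

      In this sketch, text nested inside MathML child elements such as mrow or mi is still accepted because an enclosing conv:math tag is listed for processing, whereas an exact check on the innermost tag would reject it.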

        Attachments

          Activity

          richardm Richard McCreadie added a comment -

          Indexing Configuration:

          trec.collection.class=MathMLCollection
          TrecDocTags.doctag=body
          TrecDocTags.process=p,conv:math
          TrecDocTags.skip=
          TrecDocTags.casesensitive=false

          termpipelines=Stopwords,PorterStemmer
          tokeniser=TRECFullUTFTokenizer

          block.indexing=true
          blocks.size=1

          indexer.meta.forward.keys=docno,abstract
          indexer.meta.forward.keylens=10,1000

          TaggedDocument.abstracts=abstract
          TaggedDocument.abstracts.tags=ELSE
          TaggedDocument.abstracts.tags.casesensitive=false
          TaggedDocument.abstracts.lengths=1000

          metaindex.compressed.crop.long=true

          trec.document.class=org.terrier.indexing.FuzzyTaggedDocument

          richardm Richard McCreadie added a comment -

          MathML indexing requires Terrier 3.6 or later.


            People

            • Assignee:
              richardm Richard McCreadie
              Reporter:
              richardm Richard McCreadie
            • Watchers:
              0

              Dates

              • Created:
                Updated: