Terrier Core / TR-235

LexiconBuilder fails on empty term

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.5
    • Fix Version/s: 3.6
    • Component/s: .indexing
    • Labels:
      None

      Description

      INFO - Collection #0 took 55 seconds to build the runs for 1666 documents

      INFO - Key docno values are sorted in meta index, consider binary searching zdata file
      INFO - Merging 1 runs...
      INFO - Collection #0 took 0 seconds to merge

      INFO - Collection #0 total time 55
      INFO - Optimising structure lexicon
      INFO - Optimsing lexicon with 68611 entries
      A problem occurred: java.nio.BufferUnderflowException
      java.nio.BufferUnderflowException
              at java.nio.Buffer.nextGetIndex(Unknown Source)
              at java.nio.HeapByteBuffer.get(Unknown Source)
              at org.apache.hadoop.io.Text.bytesToCodePoint(Text.java:536)
              at org.apache.hadoop.io.Text.charAt(Text.java:121)
              at org.terrier.structures.FSOMapFileLexicon.optimise(FSOMapFileLexicon.java:528)
              at org.terrier.structures.FSOMapFileLexicon.optimise(FSOMapFileLexicon.java:473)
              at org.terrier.structures.indexing.LexiconBuilder.optimise(LexiconBuilder.java:830)
              at org.terrier.indexing.BasicIndexer.finishedInvertedIndexBuild(BasicIndexer.java:449)
              at org.terrier.indexing.BasicSinglePassIndexer.createInvertedIndex(BasicSinglePassIndexer.java:302)
              at org.terrier.indexing.BasicSinglePassIndexer.createDirectIndex(BasicSinglePassIndexer.java:155)
              at org.terrier.indexing.Indexer.index(Indexer.java:346)
              at org.terrier.applications.TRECIndexing.createSinglePass(TRECIndexing.java:220)
              at org.terrier.applications.TrecTerrier.run(TrecTerrier.java:382)
              at org.terrier.applications.TrecTerrier.applyOptions(TrecTerrier.java:564)
              at org.terrier.applications.TrecTerrier.main(TrecTerrier.java:235)

      C:\terrier-3.5\bin>
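
The trace ends in a relative read on a java.nio buffer, reached through Text.charAt. A minimal sketch of how an empty term could produce this exception class, using only java.nio: the class and method names below are hypothetical and this is not Terrier or Hadoop code, it only assumes that the optimiser reads the first character of each term.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class EmptyTermDemo {
    // Hypothetical helper: tries to read the first byte of a term's UTF-8 encoding.
    static boolean firstByteReadable(String term) {
        ByteBuffer buf = ByteBuffer.wrap(term.getBytes(StandardCharsets.UTF_8));
        try {
            buf.get(); // relative get: throws when no bytes remain
            return true;
        } catch (BufferUnderflowException e) {
            return false; // an empty term has zero bytes to read
        }
    }

    public static void main(String[] args) {
        System.out.println(firstByteReadable("ab")); // prints "true"
        System.out.println(firstByteReadable(""));   // prints "false"
    }
}
```

A term of length 1 or 2 still has a first byte to read; only the empty string underflows in this sketch.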

          Activity

          mhidy Abdelkader EL MAHDAOUY added a comment -

          The same problem occurs in version 4.0.
          The LexiconBuilder fails to merge lexicons when stemmed terms of length 2 occur.
          To work around this, I made a small change to the processTerm function:

          public void processTerm(String t) {
              if (t == null) return;
              String stemmed = stem(t);           // stem once and reuse the result
              if (stemmed.length() <= 2) return;  // skip stems of length 2 or less
              next.processTerm(stemmed);
          }

          Thank You

          craigm Craig Macdonald added a comment -

          Hi Abdelkader,

          I'm trying to understand your use case. Is it valid that:
          (a) your stemmer receives terms of length 2 to stem?
          (b) your stemmer outputs terms of length 0?

          For us, there isn't a convincing use case to have empty terms in the lexicon.
          There are use cases that require terms of length 1 or 2, though.

          I think the problem is with your stemmer, and not the LexiconBuilder?

          On the other hand, we need to check if/why there is any regression in 4.0.

          Craig

          mhidy Abdelkader EL MAHDAOUY added a comment -

          Thanks Craig,

          Both (a) and (b) are correct. After removing diacritics, some words have length 2, and the stemmer sometimes fails to process a term because of the complex morphology.

          Thanks again

          craigm Craig Macdonald added a comment -

          Hi Abdelkader,

          "Both (a) and (b) are correct"

          So are we agreed the problem is with the stemmer and not with the LexiconBuilder?

          mhidy Abdelkader EL MAHDAOUY added a comment -

          Hi Craig,

          Yes, the problem is in my stemmer and not in the LexiconBuilder.

          Thank You
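
Given that conclusion, one way to guard the stemmer stage is to drop only empty stems and let terms of length 1 or 2 through. This is a minimal sketch under stated assumptions: TermSink and NonEmptyStemFilter are hypothetical stand-ins written for this example, not Terrier's own classes, and the toy stemmer only strips trailing "s" characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

class NonEmptyStemFilterDemo {
    // Hypothetical stand-in for the next stage of a term pipeline.
    interface TermSink {
        void processTerm(String term);
    }

    // Stems each term once and forwards it only if the stem is non-empty.
    static class NonEmptyStemFilter implements TermSink {
        private final UnaryOperator<String> stemmer;
        private final TermSink next;

        NonEmptyStemFilter(UnaryOperator<String> stemmer, TermSink next) {
            this.stemmer = stemmer;
            this.next = next;
        }

        @Override
        public void processTerm(String term) {
            if (term == null) return;
            String stem = stemmer.apply(term);
            // Drop only null or empty stems; stems of length 1 or 2 are kept.
            if (stem == null || stem.isEmpty()) return;
            next.processTerm(stem);
        }
    }

    // Runs the given terms through the filter and collects what reaches the sink.
    static List<String> run(UnaryOperator<String> stemmer, String... terms) {
        List<String> out = new ArrayList<>();
        TermSink filter = new NonEmptyStemFilter(stemmer, out::add);
        for (String t : terms) filter.processTerm(t);
        return out;
    }

    public static void main(String[] args) {
        // Toy stemmer (assumption for the demo): strips trailing 's' characters.
        UnaryOperator<String> toy = t -> t.replaceAll("s+$", "");
        System.out.println(run(toy, "cats", "s", "ab", null)); // prints "[cat, ab]"
    }
}
```

Unlike a length() <= 2 check, this keeps the short terms that other use cases depend on and filters only the empty output that the lexicon cannot hold.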


            People

            • Assignee: Craig Macdonald (craigm)
            • Reporter: Abdelkader EL MAHDAOUY (mhidy)