[TR-235] LexiconBuilder fails on empty term Created: 08/Sep/13  Updated: 22/Jun/14  Resolved: 01/Apr/14

Status: Resolved
Project: Terrier Core
Component/s: .indexing
Affects Version/s: 3.5
Fix Version/s: 3.6

Type: Bug Priority: Blocker
Reporter: Abdelkader EL MAHDAOUY Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None

INFO - Collection #0 took 55 seconds to build the runs for 1666 documents

INFO - Key docno values are sorted in meta index, consider binary searching zdata file
INFO - Merging 1 runs...
INFO - Collection #0 took 0 seconds to merge

INFO - Collection #0 total time 55
INFO - Optimising structure lexicon
INFO - Optimsing lexicon with 68611 entries
A problem occurred: java.nio.BufferUnderflowException
        at java.nio.Buffer.nextGetIndex(Unknown Source)
        at java.nio.HeapByteBuffer.get(Unknown Source)
        at org.apache.hadoop.io.Text.bytesToCodePoint(Text.java:536)
        at org.apache.hadoop.io.Text.charAt(Text.java:121)
        at org.terrier.structures.FSOMapFileLexicon.optimise(FSOMapFileLexicon.j
        at org.terrier.structures.FSOMapFileLexicon.optimise(FSOMapFileLexicon.j
        at org.terrier.structures.indexing.LexiconBuilder.optimise(LexiconBuilde
        at org.terrier.indexing.BasicIndexer.finishedInvertedIndexBuild(BasicInd
        at org.terrier.indexing.BasicSinglePassIndexer.createInvertedIndex(Basic
        at org.terrier.indexing.BasicSinglePassIndexer.createDirectIndex(BasicSi
        at org.terrier.indexing.Indexer.index(Indexer.java:346)
        at org.terrier.applications.TRECIndexing.createSinglePass(TRECIndexing.j
        at org.terrier.applications.TrecTerrier.run(TrecTerrier.java:382)
        at org.terrier.applications.TrecTerrier.applyOptions(TrecTerrier.java:56
        at org.terrier.applications.TrecTerrier.main(TrecTerrier.java:235)


Comment by Abdelkader EL MAHDAOUY [ 08/Sep/13 ]

I'm working on Arabic information retrieval, so both the tokenizer and the stemmer are extended to support Arabic characters.
When I index the collection without stemming, the index is created successfully, but when I use the stemmer, the exception occurs.

Comment by Craig Macdonald [ 09/Sep/13 ]

I think the problem is that you have an empty token in your lexicon.
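
The top frames of the stack trace (Buffer.nextGetIndex, HeapByteBuffer.get) are consistent with a read from a buffer that has no bytes left, which is what an empty term would produce. The following is a minimal sketch using only java.nio; EmptyTermDemo and its methods are hypothetical names for illustration and are not taken from Terrier or Hadoop.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Terrier or Hadoop.
class EmptyTermDemo {

    // Reads the first byte of the term's UTF-8 encoding through a ByteBuffer.
    static byte firstByte(String term) {
        ByteBuffer buf = ByteBuffer.wrap(term.getBytes(StandardCharsets.UTF_8));
        // Relative get(): throws BufferUnderflowException when no bytes remain.
        return buf.get();
    }

    // Returns true if reading the first byte of the term underflows the buffer.
    static boolean underflows(String term) {
        try {
            firstByte(term);
            return false;
        } catch (BufferUnderflowException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("\"ab\" underflows: " + underflows("ab"));
        System.out.println("\"\" underflows: " + underflows(""));
    }
}
```

Under this assumption, any code that inspects the first character of a term without checking its length would fail in the same way on an empty term.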


Comment by Richard McCreadie [ 01/Apr/14 ]

I can confirm that this issue occurs when an empty string is considered to be a term in the lexicon.

I don't think that this is a case that Terrier should try to catch, since an empty term is invalid.

We can either:

  • Leave it as is
  • Log a warning if the term length is 0

Comment by Richard McCreadie [ 01/Apr/14 ]

Added an exception that is thrown if an empty term is added to the index, and a unit test to check that the exception is thrown.
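
The kind of guard described here might look roughly like the sketch below. It is not the code committed to Terrier: GuardedLexicon, addTerm, and the choice of IllegalArgumentException are all assumptions made for illustration, since the comment does not say which class was changed or which exception type was used.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a lexicon writer; NOT Terrier's actual class or API.
class GuardedLexicon {
    private final List<String> terms = new ArrayList<>();

    // Rejects null or empty terms up front, so the failure is reported where
    // the bad term is produced rather than later during lexicon optimisation.
    void addTerm(String term) {
        if (term == null || term.isEmpty()) {
            // Exception type is an assumption; the issue does not name one.
            throw new IllegalArgumentException("Empty term cannot be added to the lexicon");
        }
        terms.add(term);
    }

    int size() {
        return terms.size();
    }

    public static void main(String[] args) {
        GuardedLexicon lexicon = new GuardedLexicon();
        lexicon.addTerm("ab");
        try {
            lexicon.addTerm("");
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
        System.out.println("Terms stored: " + lexicon.size());
    }
}
```

A unit test for such a guard would add an empty term and assert that the exception is raised, as the comment describes.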

Committed to build 3766.

Comment by Richard McCreadie [ 01/Apr/14 ]

All unit tests pass. Resolving issue.

Comment by Abdelkader EL MAHDAOUY [ 20/Jun/14 ]

The same problem occurred in version 4.0.
The problem is that the LexiconBuilder fails to merge lexicons when stemmed terms of length 2 occur.
To deal with this problem, I made a small change to the function processTerm:

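public void processTerm(String t)
{
    if (t == null) return;
    // Stem once and reuse the result.
    String stemmed = stem(t);
    // Skip stems of length 2 or less, which also covers the empty stem.
    if (stemmed.length() <= 2) return;
    next.processTerm(stemmed);
}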

Thank You

Comment by Craig Macdonald [ 20/Jun/14 ]

Hi Abdelkader,

I'm trying to understand your use case. Is it valid that:
(a) your stemmer receives terms of length 2 to stem?
(b) your stemmer outputs terms of length 0?

For us, there isn't a convincing use case to have empty terms in the lexicon.
There are use cases that require terms of length 1 or 2, though.

I think the problem is with your stemmer, and not the LexiconBuilder?

On the other hand, we need to check if/why there is any regression in 4.0.
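
The distinction drawn in this comment (empty stems are invalid, while stems of length 1 or 2 can be legitimate) could be handled on the stemmer side along the lines of the sketch below. NonEmptyStemStage and its members are hypothetical and do not use Terrier's term pipeline classes; the toy stemmer in main is likewise made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.UnaryOperator;

// Hypothetical pipeline stage; names are illustrative, not Terrier's API.
class NonEmptyStemStage {
    private final UnaryOperator<String> stemmer;
    private final Consumer<String> next;

    NonEmptyStemStage(UnaryOperator<String> stemmer, Consumer<String> next) {
        this.stemmer = stemmer;
        this.next = next;
    }

    // Stems the term and forwards it, dropping only null or empty results.
    // Stems of length 1 or 2 are passed through unchanged.
    void processTerm(String t) {
        if (t == null) return;
        String stemmed = stemmer.apply(t);
        if (stemmed == null || stemmed.isEmpty()) return;
        next.accept(stemmed);
    }

    // Convenience helper: runs the given terms through a stage and collects the output.
    static List<String> run(UnaryOperator<String> stemmer, String... terms) {
        List<String> out = new ArrayList<>();
        NonEmptyStemStage stage = new NonEmptyStemStage(stemmer, out::add);
        for (String term : terms) {
            stage.processTerm(term);
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy stemmer (an assumption): reduces "xx" to the empty string, keeps everything else.
        UnaryOperator<String> toyStemmer = s -> s.equals("xx") ? "" : s;
        System.out.println(run(toyStemmer, "ab", "xx", "a"));
    }
}
```

Compared with filtering on length <= 2, this keeps short but valid terms in the index and discards only the stems that the lexicon cannot represent.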


Comment by Abdelkader EL MAHDAOUY [ 21/Jun/14 ]

Thanks Craig,

Both (a) and (b) are correct. After removing diacritics, some words are left with length 2, and the stemmer sometimes fails to process a term because of the complex morphology.

Thanks again

Comment by Craig Macdonald [ 22/Jun/14 ]

Hi Abdelkader,

"Both (a) and (b) are correct"

So are we agreed the problem is with the stemmer and not with the LexiconBuilder?

Comment by Abdelkader EL MAHDAOUY [ 22/Jun/14 ]

Hi Craig,

Yes, the problem is in my stemmer and not in the LexiconBuilder.

Thank You

Generated at Sat Aug 08 12:24:27 BST 2020 using JIRA 7.1.1#71004-sha1:d6b2c0d9b7051e9fb5e4eb8ce177ca56d91d7bd8.