Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 4.0
    • Fix Version/s: None
    • Component/s: .indexing
    • Labels:
      None

      Description

      Indexing in batch mode of xml documents causes this error:
       
      INFO - Collection #0 took 11 seconds to index (8644 documents)
      INFO - 3 lexicons to merge
      INFO - Optimising structure lexicon
      INFO - Optimising lexicon with 86507 entries
      INFO - Started building the inverted index...
      INFO - Started building the inverted index...
      INFO - Iteration 1 of 1 iterations
      A problem occurred: java.lang.NullPointerException
      java.lang.NullPointerException
              at org.terrier.structures.bit.BitPostingIndexInputStream.loadPostingIterator(BitPostingIndexInputStream.java:242)
              at org.terrier.structures.bit.BitPostingIndexInputStream.getNextPostings(BitPostingIndexInputStream.java:183)
              at org.terrier.structures.indexing.classical.InvertedIndexBuilder.traverseDirectFile(InvertedIndexBuilder.java:524)
              at org.terrier.structures.indexing.classical.InvertedIndexBuilder.createInvertedIndex(InvertedIndexBuilder.java:315)
              at org.terrier.structures.indexing.classical.BasicIndexer.createInvertedIndex(BasicIndexer.java:427)
              at org.terrier.structures.indexing.Indexer.index(Indexer.java:348)
              at org.terrier.applications.TRECIndexing.index(TRECIndexing.java:122)
              at org.terrier.applications.TrecTerrier.run(TrecTerrier.java:407)
              at org.terrier.applications.TrecTerrier.applyOptions(TrecTerrier.java:588)
              at org.terrier.applications.TrecTerrier.main(TrecTerrier.java:245)


      The xml documents follow this format:
      <DOC>
      <DOCNO> doc1 </DOCNO>
      Content of the document goes here
      </DOC>
      <DOC>

      This feature worked fine on 3.5, but it fails on 4.0.
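
The document layout shown above can be sanity-checked on a small excerpt before indexing. The following is a minimal sketch, not Terrier's own TRECCollection parser: the class name `TrecDocScan` and its methods are hypothetical, and the tag names `DOC` and `DOCNO` are taken from the sample. It lists the DOCNO of every complete document and reports whether opening and closing `DOC` tags balance. The excerpt above ends with a lone `<DOC>`, which is presumably just the start of the next document, so an imbalance on that excerpt says nothing about the cause of the exception.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrecDocScan {
    // One complete <DOC>...</DOC> block; DOTALL lets '.' span line breaks.
    private static final Pattern DOC =
        Pattern.compile("<DOC>(.*?)</DOC>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    // The identifier inside a block, with surrounding whitespace trimmed.
    private static final Pattern DOCNO =
        Pattern.compile("<DOCNO>\\s*(.*?)\\s*</DOCNO>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Returns the DOCNO of every complete DOC block (null where a block has none). */
    public static List<String> docnos(String text) {
        List<String> ids = new ArrayList<>();
        Matcher doc = DOC.matcher(text);
        while (doc.find()) {
            Matcher id = DOCNO.matcher(doc.group(1));
            ids.add(id.find() ? id.group(1) : null);
        }
        return ids;
    }

    /** Opening DOC tags minus closing DOC tags; non-zero means the text is not balanced. */
    public static int tagImbalance(String text) {
        return count(text, "<DOC>") - count(text, "</DOC>");
    }

    // Case-insensitive count of non-overlapping occurrences of tag in text.
    private static int count(String text, String tag) {
        String haystack = text.toLowerCase();
        String needle = tag.toLowerCase();
        int n = 0;
        int i = haystack.indexOf(needle);
        while (i >= 0) {
            n++;
            i = haystack.indexOf(needle, i + needle.length());
        }
        return n;
    }

    public static void main(String[] args) {
        // The excerpt from the report, including its trailing unmatched <DOC>.
        String sample = "<DOC>\n<DOCNO> doc1 </DOCNO>\n"
            + "Content of the document goes here\n</DOC>\n<DOC>\n";
        System.out.println("DOCNOs: " + docnos(sample));
        System.out.println("tag imbalance: " + tagImbalance(sample));
    }
}
```

The sketch holds the whole text in memory, so it is only suitable for a small excerpt, not for a collection of millions of documents.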

          Activity

          craigm Craig Macdonald added a comment -

          Can you paste your terrier.properties configuration?

          Craig

          ivan.kitanovski Ivan Kitanovski added a comment -

          It's the default configuration:

          #default controls for query expansion
          querying.postprocesses.order=QueryExpansion
          querying.postprocesses.controls=qe:QueryExpansion
          #default controls for the web-based interface. SimpleDecorate
          #is the simplest metadata decorator. For more control, see Decorate.
          querying.postfilters.order=SimpleDecorate,SiteFilter,Scope
          querying.postfilters.controls=decorate:SimpleDecorate,site:SiteFilter,scope:Scope

          #default and allowed controls
          querying.default.controls=
          querying.allowed.controls=scope,qe,qemodel,start,end,site,scope

          #document tags specification
          #for processing the contents of
          #the documents, ignoring DOCHDR
          TrecDocTags.doctag=DOC
          TrecDocTags.idtag=DOCNO
          TrecDocTags.skip=DOCHDR
          #set to true if the tags can be of various case
          TrecDocTags.casesensitive=false

          #query tags specification
          TrecQueryTags.doctag=TOP
          TrecQueryTags.idtag=NUM
          TrecQueryTags.process=TOP,NUM,TITLE
          TrecQueryTags.skip=DESC,NARR

          #stop-words file
          stopwords.filename=stopword-list.txt

          #the processing stages a term goes through
          termpipelines=Stopwords,PorterStemmer

          Thanks
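
The configuration pasted above uses plain `key=value` lines with `#` comments, which `java.util.Properties` can read. As a hedged sketch for confirming which document-tag settings a given file actually contains, the class below parses such text and prints a few keys. The class name `ShowIndexingProps` is hypothetical, and the sketch only reports what is written in the file; it says nothing about defaults Terrier may apply for keys that are absent.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

public class ShowIndexingProps {
    /** Parses properties text (key=value lines, '#' comments) from any Reader. */
    public static Properties parse(Reader in) throws IOException {
        Properties props = new Properties();
        props.load(in);
        return props;
    }

    public static void main(String[] args) throws IOException {
        // A fragment of the configuration quoted in the comment above.
        String sample = String.join("\n",
            "#document tags specification",
            "TrecDocTags.doctag=DOC",
            "TrecDocTags.idtag=DOCNO",
            "TrecDocTags.skip=DOCHDR",
            "TrecDocTags.casesensitive=false",
            "termpipelines=Stopwords,PorterStemmer");
        Properties props = parse(new StringReader(sample));
        String[] keys = {
            "TrecDocTags.doctag", "TrecDocTags.idtag",
            "TrecDocTags.skip", "termpipelines", "invertedfile.processpointers"
        };
        for (String key : keys) {
            // Keys missing from the text are shown explicitly rather than as null.
            System.out.println(key + " = " + props.getProperty(key, "<not set>"));
        }
    }
}
```

To inspect a real file, pass a `java.io.FileReader` for that file to `parse` instead of the `StringReader`.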

          craigm Craig Macdonald added a comment -

          It's not really XML; your sample document just looks like a TREC document, and you are indexing it with TRECCollection.

          Would you be able to re-run with Java assertions enabled?

          i.e. add -ea to the Java command line in anyclass.sh?

          Craig
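
Regarding the `-ea` request above: Java `assert` statements are skipped unless the JVM is started with assertions enabled, so it can help to confirm that the flag actually reached the JVM. The class below is a minimal, hypothetical check (the name `AssertionCheck` is not part of Terrier) that relies on the standard idiom of a side effect inside an `assert`.

```java
public class AssertionCheck {
    /** Returns true only when assertions are enabled for this class. */
    public static boolean assertionsEnabled() {
        boolean enabled = false;
        // The assignment is evaluated only when assertions are switched on;
        // with assertions off, the whole statement is skipped.
        assert enabled = true;
        return enabled;
    }

    public static void main(String[] args) {
        System.out.println("assertions enabled: " + assertionsEnabled());
    }
}
```

Running it once as `java AssertionCheck` and once as `java -ea AssertionCheck` shows the difference the flag makes.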

          rekabsaz Navid Rekabsaz added a comment -

          I ran into the same problem, so I moved back to 3.5, where it does not seem to occur. One observation: it does not happen when I index a portion of the collection of up to 10K documents, but it does happen when I index the whole collection of ~2M documents (CLEFIP 2013). Here are the properties:

          #default controls for query expansion
          querying.postprocesses.order=QueryExpansion
          querying.postprocesses.controls=qe:QueryExpansion

          querying.postfilters.order=SimpleDecorate,SiteFilter,Scope
          querying.postfilters.controls=decorate:SimpleDecorate,site:SiteFilter,scope:Scope

          #default and allowed controls
          querying.default.controls=
          querying.allowed.controls=scope,qe,qemodel,start,end,site,scope

          TrecDocTags.doctag=DOC
          TrecDocTags.idtag=ID
          TrecDocTags.skip=UCID,XPATH
          TrecDocTags.casesensitive=false

          indexer.meta.forward.keylens=100
          invertedfile.processpointers=2000000

          #query tags specification
          TrecQueryTags.doctag=TOP
          TrecQueryTags.idtag=NUM
          TrecQueryTags.process=TOP,NUM,TITLE
          TrecQueryTags.skip=DESC,NARR

          stopwords.filename=stopword-list.txt

          termpipelines=Stopwords,PorterStemmer

          Thanks
          Navid


            People

            • Assignee:
              craigm Craig Macdonald
              Reporter:
              ivan.kitanovski Ivan Kitanovski
            • Watchers:
              2

              Dates

              • Created:
                Updated:
                Resolved: