  1. Terrier Core
  2. TR-174

Indexing a directory breaks on certain PDF or Excel files

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.5
    • Fix Version/s: 3.6
    • Component/s: .indexing
    • Labels:
      None

      Description

      I installed Terrier 3.5 on Windows XP and started desktop_terrier.
      I then chose a directory to index and started indexing.
      After about 50 documents Terrier threw an exception because it was not able to index one particular PDF document (some other PDFs worked).
      Is there any way to tell Terrier to skip such exceptions and carry on with indexing?

      Here is the exception/log:

      Set TERRIER_HOME to be D:\Java\terrier
      WARNING: The file terrier.properties was not found at location D:\Java\terrier\etc\terrier.properties
      Assuming the value of terrier.home from the corresponding system property.
      INFO - Deleting: D:\Java\terrier\var\index\data_1.direct.bf: true
      INFO - Deleting: D:\Java\terrier\var\index\data_1.document.fsarrayfile: true
      INFO - Deleting: D:\Java\terrier\var\index\data_1.meta.idx: true
      INFO - Deleting: D:\Java\terrier\var\index\data_1.meta.zdata: true
      INFO - creating the data structures data_1
      INFO - BlockIndexer creating direct index
      INFO - NEXT: D:\Virtual Machines\host\Privat\_dokumente
      .....
      java.lang.NullPointerException
      at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:254)
      at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:773)
      at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:139)
      at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:211)
      at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:185)
      at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:161)
      at org.terrier.indexing.PDFDocument.getReader(PDFDocument.java:111)
      at org.terrier.indexing.FileDocument.<init>(FileDocument.java:130)
      at org.terrier.indexing.PDFDocument.<init>(PDFDocument.java:68)
      at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
      at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
      at java.lang.reflect.Constructor.newInstance(Unknown Source)
      at org.terrier.indexing.SimpleFileCollection.makeDocument(SimpleFileCollection.java:342)
      at org.terrier.indexing.SimpleFileCollection.getDocument(SimpleFileCollection.java:303)
      at org.terrier.indexing.BlockIndexer.createDirectIndex(BlockIndexer.java:357)
      at org.terrier.indexing.Indexer.index(Indexer.java:346)
      at org.terrier.applications.desktop.DesktopTerrier.runIndex(DesktopTerrier.java:1129)
      at org.terrier.applications.desktop.DesktopTerrier.access$1100(DesktopTerrier.java:114)
      at org.terrier.applications.desktop.DesktopTerrier$8$1.run(DesktopTerrier.java:498)
      ERROR - An unexpected exception occured while indexing. Indexing has been aborted.
      java.lang.NullPointerException
      at org.terrier.indexing.tokenisation.EnglishTokeniser$EnglishTokenStream.next(EnglishTokeniser.java:97)
      at org.terrier.indexing.tokenisation.EnglishTokeniser$EnglishTokenStream.next(EnglishTokeniser.java:76)
      at org.terrier.indexing.FileDocument.getNextTerm(FileDocument.java:221)
      at org.terrier.indexing.BlockIndexer.createDirectIndex(BlockIndexer.java:371)
      at org.terrier.indexing.Indexer.index(Indexer.java:346)
      at org.terrier.applications.desktop.DesktopTerrier.runIndex(DesktopTerrier.java:1129)
      at org.terrier.applications.desktop.DesktopTerrier.access$1100(DesktopTerrier.java:114)
      at org.terrier.applications.desktop.DesktopTerrier$8$1.run(DesktopTerrier.java:498)
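The behaviour requested above (skip a document whose parser fails and carry on with the rest) can be sketched roughly as follows. This is a minimal illustration, not Terrier code: `SkipOnFailureIndexer`, `indexAll` and the `parser` function are hypothetical names, and the parser is simulated with a lambda instead of a real PDF or Excel reader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Hypothetical sketch, not Terrier code: index what can be parsed, skip what cannot. */
class SkipOnFailureIndexer {
    /** Names of documents that were parsed successfully. */
    final List<String> indexed = new ArrayList<>();
    /** Names of documents whose parser failed or produced nothing. */
    final List<String> skipped = new ArrayList<>();

    /**
     * Runs the (assumed) parser on every file name. A runtime failure in one
     * document is recorded and the loop moves on, so a single bad file does
     * not abort the whole indexing run.
     */
    void indexAll(List<String> fileNames, Function<String, String> parser) {
        for (String name : fileNames) {
            try {
                String text = parser.apply(name);
                if (text == null) {
                    // The parser returned nothing usable: treat it as a skip.
                    skipped.add(name);
                    continue;
                }
                indexed.add(name);
            } catch (RuntimeException e) {
                System.err.println("Skipping " + name + ": " + e);
                skipped.add(name);
            }
        }
    }

    public static void main(String[] args) {
        SkipOnFailureIndexer indexer = new SkipOnFailureIndexer();
        indexer.indexAll(Arrays.asList("a.pdf", "broken.pdf", "c.xls"), name -> {
            if (name.startsWith("broken")) {
                // Stand-in for a parser bug such as the NullPointerException in the log.
                throw new NullPointerException("simulated parser failure");
            }
            return "text of " + name;
        });
        System.out.println("indexed=" + indexer.indexed + " skipped=" + indexer.skipped);
    }
}
```

Catching the failure per document keeps the rest of the collection indexable; whether a skipped file should also be reported to the user is a separate design decision.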

        Attachments

          Activity

          tutysara tutysara added a comment -

          I applied the given patch.
          The folder was indexed successfully.
          However, I now get an exception when I try to search using a keyword.

          Here are the logs:

          INFO - Collection #0 took 26seconds to index (1335 documents)

          INFO - 1 lexicons to merge
          INFO - Optimising structure lexicon
          INFO - Optimsing lexicon with 9988 entries
          INFO - Started building the block inverted index...
          INFO - creating block inverted index
          INFO - Iteration 1 of 1 iterations
          INFO - Scanning lexicon for 2000000 pointers
          INFO - time to process part of lexicon: 0.094
          INFO - time to traverse direct file: 0.422
          INFO - time to write inverted file: 0.078
          INFO - time to perform one iteration: 0.594
          INFO - number of pointers processed: 124495
          INFO - Finished generating inverted file, rewriting lexicon
          INFO - Optimising structure lexicon
          INFO - Optimsing lexicon with 9988 entries
          INFO - Finished building the block inverted index...
          INFO - Time elapsed for inverted file: 0
          INFO - Structure meta reading lookup file into memory
          INFO - Structure meta reading reverse map for key docno directly from disk
          INFO - Structure meta loading data file into memory
          ERROR - IOException reading FSOrderedMapFile
          java.io.EOFException
          at java.io.RandomAccessFile.readByte(RandomAccessFile.java:591)
          at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
          at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
          at org.apache.hadoop.io.Text.readFields(Text.java:263)
          at org.terrier.structures.seralization.FixedSizeTextFactory$FixedSizeText.readFields(FixedSizeTextFactory.java:65)
          at org.terrier.structures.collections.FSOrderedMapFile.getEntry(FSOrderedMapFile.java:729)
          at org.terrier.structures.collections.FSOrderedMapFile.get(FSOrderedMapFile.java:772)
          at org.terrier.structures.collections.FSOrderedMapFile.get(FSOrderedMapFile.java:1)
          at org.terrier.structures.MapLexicon.getLexiconEntry(MapLexicon.java:92)
          at org.terrier.structures.MapLexicon.getLexiconEntry(MapLexicon.java:1)
          at org.terrier.matching.PostingListManager.addSingleTerm(PostingListManager.java:195)
          at org.terrier.matching.PostingListManager.<init>(PostingListManager.java:169)
          at org.terrier.matching.taat.Full.match(Full.java:73)
          at org.terrier.querying.Manager.runMatching(Manager.java:676)
          at org.terrier.applications.desktop.DesktopTerrier.runQuery(DesktopTerrier.java:1002)
          at org.terrier.applications.desktop.DesktopTerrier.access$15(DesktopTerrier.java:973)
          at org.terrier.applications.desktop.DesktopTerrier$11.run(DesktopTerrier.java:962)
          at java.awt.event.InvocationEvent.dispatch(InvocationEvent.java:209)
          at java.awt.EventQueue.dispatchEvent(EventQueue.java:597)
          at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:273)
          at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:183)
          at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:173)
          at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:168)
          at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:160)
          at java.awt.EventDispatchThread.run(EventDispatchThread.java:121)

          pi Bartholomew Cubbins added a comment -

          Greetings, I have found the same issue.
          Fixed it (it seems) by adding:

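          @Override
          public String next()
          {
              try {
                  // Guard against the NPE: without a reader there is nothing to tokenise,
                  // so mark the stream as finished and return no term.
                  if (this.br == null) {
                      eos = true;
                      return null;
                  }
                  // ... existing body of next() continues here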

          craigm Craig Macdonald added a comment -

          Thanks Bartholomew. Perhaps other users experiencing this problem (Ulrich, tutysara) can test the patch?

          rendfield Ulrich Kaemmerer added a comment -

          Sorry, I will not do that in the near future.
          The product was not usable for me (indexing breaks after a few files) so I switched to another product.

          craigm Craig Macdonald added a comment -

          Committed for 3.6. I chose to check for null in the constructor of the Tokenisers, rather than for each term.
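A rough sketch of what checking for null once in the constructor, instead of on every term, might look like. This is not the committed Terrier code: `NullSafeTokenStream` and its whitespace tokenisation are assumptions made for illustration, and only the field names `br` and `eos` are taken from the patch quoted earlier in this thread. Reading the two stack traces in the description together, it seems the PDFBox failure left the document without a reader, which the tokeniser then dereferenced; the sketch assumes that is the case.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not the committed Terrier code. */
class NullSafeTokenStream {
    private final BufferedReader br;
    private boolean eos;

    /** The null check happens once, here, rather than in every call to next(). */
    NullSafeTokenStream(Reader reader) {
        if (reader == null) {
            // No text could be extracted: behave like an empty document.
            this.br = null;
            this.eos = true;
        } else {
            this.br = new BufferedReader(reader);
            this.eos = false;
        }
    }

    boolean hasNext() {
        return !eos;
    }

    /** Returns the next whitespace-delimited token, or null once the stream is exhausted. */
    String next() {
        if (eos) {
            return null;
        }
        try {
            StringBuilder sb = new StringBuilder();
            int ch;
            while ((ch = br.read()) != -1) {
                if (Character.isWhitespace(ch)) {
                    if (sb.length() > 0) {
                        return sb.toString();
                    }
                } else {
                    sb.append((char) ch);
                }
            }
            eos = true;
            return sb.length() > 0 ? sb.toString() : null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Convenience helper for the demo: collects all non-null tokens. */
    static List<String> tokens(Reader reader) {
        NullSafeTokenStream ts = new NullSafeTokenStream(reader);
        List<String> out = new ArrayList<>();
        while (ts.hasNext()) {
            String t = ts.next();
            if (t != null) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens(new StringReader("hello pdf world"))); // prints [hello, pdf, world]
        System.out.println(tokens(null));                                // prints []
    }
}
```

With the check in the constructor, a document with no reader simply yields zero terms, and the per-term path stays free of an extra branch on the reader.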


            People

            • Assignee:
              craigm Craig Macdonald
              Reporter:
              rendfield Ulrich Kaemmerer
            • Watchers:
              0

              Dates

              • Created:
                Updated:
                Resolved: