  1. Terrier Core
  2. TR-174

Indexing a directory breaks on certain PDF or Excel files

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.5
    • Fix Version/s: 3.6
    • Component/s: .indexing
    • Labels:
      None

      Description

      I've installed Terrier 3.5 on Windows XP and started desktop_terrier.
      After that, I chose a directory to index and started indexing.
      After about 50 documents, Terrier throws an exception because it was unable to index one particular PDF document (other PDFs worked).
      Is there any way to tell Terrier to skip such exceptions and continue indexing?

      Here is the exception/log:

      Set TERRIER_HOME to be D:\Java\terrier
      WARNING: The file terrier.properties was not found at location D:\Java\terrier\etc\terrier.properties
      Assuming the value of terrier.home from the corresponding system property.
      INFO - Deleting: D:\Java\terrier\var\index\data_1.direct.bf: true
      INFO - Deleting: D:\Java\terrier\var\index\data_1.document.fsarrayfile: true
      INFO - Deleting: D:\Java\terrier\var\index\data_1.meta.idx: true
      INFO - Deleting: D:\Java\terrier\var\index\data_1.meta.zdata: true
      INFO - creating the data structures data_1
      INFO - BlockIndexer creating direct index
      INFO - NEXT: D:\Virtual Machines\host\Privat\_dokumente
      .....
      java.lang.NullPointerException
      at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:254)
      at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:773)
      at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:139)
      at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:211)
      at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:185)
      at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:161)
      at org.terrier.indexing.PDFDocument.getReader(PDFDocument.java:111)
      at org.terrier.indexing.FileDocument.<init>(FileDocument.java:130)
      at org.terrier.indexing.PDFDocument.<init>(PDFDocument.java:68)
      at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
      at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
      at java.lang.reflect.Constructor.newInstance(Unknown Source)
      at org.terrier.indexing.SimpleFileCollection.makeDocument(SimpleFileCollection.java:342)
      at org.terrier.indexing.SimpleFileCollection.getDocument(SimpleFileCollection.java:303)
      at org.terrier.indexing.BlockIndexer.createDirectIndex(BlockIndexer.java:357)
      at org.terrier.indexing.Indexer.index(Indexer.java:346)
      at org.terrier.applications.desktop.DesktopTerrier.runIndex(DesktopTerrier.java:1129)
      at org.terrier.applications.desktop.DesktopTerrier.access$1100(DesktopTerrier.java:114)
      at org.terrier.applications.desktop.DesktopTerrier$8$1.run(DesktopTerrier.java:498)
      ERROR - An unexpected exception occured while indexing. Indexing has been aborted.
      java.lang.NullPointerException
      at org.terrier.indexing.tokenisation.EnglishTokeniser$EnglishTokenStream.next(EnglishTokeniser.java:97)
      at org.terrier.indexing.tokenisation.EnglishTokeniser$EnglishTokenStream.next(EnglishTokeniser.java:76)
      at org.terrier.indexing.FileDocument.getNextTerm(FileDocument.java:221)
      at org.terrier.indexing.BlockIndexer.createDirectIndex(BlockIndexer.java:371)
      at org.terrier.indexing.Indexer.index(Indexer.java:346)
      at org.terrier.applications.desktop.DesktopTerrier.runIndex(DesktopTerrier.java:1129)
      at org.terrier.applications.desktop.DesktopTerrier.access$1100(DesktopTerrier.java:114)
      at org.terrier.applications.desktop.DesktopTerrier$8$1.run(DesktopTerrier.java:498)

          Activity

          craigm Craig Macdonald added a comment -

          Hi Ulrich,

          Yes, I actually found this myself yesterday. Please can you see if the attached patch addresses your problem?

          Craig

          craigm Craig Macdonald added a comment -

          This issue should have a unit test before committing.

          tutysara tutysara added a comment -

          I have the same issue with Excel files.
          I got this stack trace:

          ERROR - An unexpected exception occured while indexing. Indexing has been aborted.
          java.lang.NullPointerException
          at org.terrier.indexing.tokenisation.EnglishTokeniser$EnglishTokenStream.next(EnglishTokeniser.java:97)
          at org.terrier.indexing.tokenisation.EnglishTokeniser$EnglishTokenStream.next(EnglishTokeniser.java:76)
          at org.terrier.indexing.FileDocument.getNextTerm(FileDocument.java:221)
          at org.terrier.indexing.BlockIndexer.createDirectIndex(BlockIndexer.java:371)
          at org.terrier.indexing.Indexer.index(Indexer.java:346)
          at org.terrier.applications.desktop.DesktopTerrier.runIndex(DesktopTerrier.java:1129)
          at org.terrier.applications.desktop.DesktopTerrier.access$1100(DesktopTerrier.java:114)

          The actual problem might be that the file is not readable:

          WARN - WARNING: Problem converting excel documentjava.io.IOException: Invalid header signature; read 723401728380766730, expected -2226271756974174256

          I will try your patch and report the result.
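Note on the warning above: the two numbers in the message can be decoded by hand. The sketch below is an illustration only. It assumes the message comes from the library that parses the Excel file and that the signature is compared as a little-endian 64-bit value; `Ole2Signature` and its methods are hypothetical names, not Terrier or library classes. The expected value matches the OLE2 compound-document magic bytes D0 CF 11 E0 A1 B1 1A E1, while the value actually read is eight 0x0A (line feed) bytes, which would suggest the file is not a binary Excel file at all.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration; not part of Terrier.
final class Ole2Signature {
    // OLE2 magic bytes D0 CF 11 E0 A1 B1 1A E1 interpreted as a little-endian long.
    static final long EXPECTED = 0xE11AB1A1E011CFD0L;

    /** Interprets the first 8 bytes of a file header as a little-endian signed long. */
    static long signatureOf(byte[] header) {
        if (header.length < 8) {
            throw new IllegalArgumentException("need at least 8 header bytes");
        }
        return ByteBuffer.wrap(header, 0, 8).order(ByteOrder.LITTLE_ENDIAN).getLong();
    }

    /** Returns true only if the header starts with the OLE2 signature. */
    static boolean looksLikeOle2(byte[] header) {
        return header.length >= 8 && signatureOf(header) == EXPECTED;
    }

    public static void main(String[] args) {
        // The "expected" number from the log message.
        System.out.println(EXPECTED); // -2226271756974174256
        // The "read" number from the log message, shown in hex (leading zero dropped).
        System.out.println(Long.toHexString(723401728380766730L)); // a0a0a0a0a0a0a0a
    }
}
```

A check like `looksLikeOle2` could be run on a file before handing it to the converter, so that a mislabelled file is skipped with a warning instead of reaching the tokeniser with no reader.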

          rendfield Ulrich Kaemmerer added a comment -

          I've applied that patch, recompiled everything, and re-ran Terrier against the same directory, with the same result as before.
          Indexing crashed and aborted.

          The problem is not that the file could not be indexed, but that the whole process stops after that error.
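The behaviour asked for here, skipping a document that cannot be parsed and carrying on with the rest, can be sketched roughly as follows. This is only an illustration in plain Java: `SkipOnFailureIndexing`, `Report`, and the `parser` function are hypothetical names, not Terrier classes, and this is not the patch attached to this issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch; the real Terrier indexer is structured differently.
final class SkipOnFailureIndexing {

    /** Records which paths were indexed and which were skipped. */
    static final class Report {
        final List<String> indexed = new ArrayList<>();
        final List<String> skipped = new ArrayList<>();
    }

    /**
     * Runs the parser on every path. A path is skipped, rather than aborting
     * the whole run, when the parser throws or returns null (no text extracted).
     */
    static Report indexAll(List<String> paths, Function<String, String> parser) {
        Report report = new Report();
        for (String path : paths) {
            try {
                String text = parser.apply(path);
                if (text == null) {
                    report.skipped.add(path); // conversion produced nothing usable
                    continue;
                }
                report.indexed.add(path);     // a real indexer would tokenise text here
            } catch (RuntimeException e) {
                // Covers parser failures such as the NullPointerException in the log.
                report.skipped.add(path);
            }
        }
        return report;
    }
}
```

The point of the design is that the failure is contained per document: one unreadable PDF or Excel file ends up in `skipped`, and the loop moves on to the next file.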

          tutysara tutysara added a comment -

          I applied the given patch.
          I could get the folder indexed.
          However, I get an exception when I try to search using a keyword.

          Here are the logs:

          INFO - Collection #0 took 26seconds to index (1335 documents)

          INFO - 1 lexicons to merge
          INFO - Optimising structure lexicon
          INFO - Optimsing lexicon with 9988 entries
          INFO - Started building the block inverted index...
          INFO - creating block inverted index
          INFO - Iteration 1 of 1 iterations
          INFO - Scanning lexicon for 2000000 pointers
          INFO - time to process part of lexicon: 0.094
          INFO - time to traverse direct file: 0.422
          INFO - time to write inverted file: 0.078
          INFO - time to perform one iteration: 0.594
          INFO - number of pointers processed: 124495
          INFO - Finished generating inverted file, rewriting lexicon
          INFO - Optimising structure lexicon
          INFO - Optimsing lexicon with 9988 entries
          INFO - Finished building the block inverted index...
          INFO - Time elapsed for inverted file: 0
          INFO - Structure meta reading lookup file into memory
          INFO - Structure meta reading reverse map for key docno directly from disk
          INFO - Structure meta loading data file into memory
          ERROR - IOException reading FSOrderedMapFile
          java.io.EOFException
          at java.io.RandomAccessFile.readByte(RandomAccessFile.java:591)
          at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
          at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
          at org.apache.hadoop.io.Text.readFields(Text.java:263)
          at org.terrier.structures.seralization.FixedSizeTextFactory$FixedSizeText.readFields(FixedSizeTextFactory.java:65)
          at org.terrier.structures.collections.FSOrderedMapFile.getEntry(FSOrderedMapFile.java:729)
          at org.terrier.structures.collections.FSOrderedMapFile.get(FSOrderedMapFile.java:772)
          at org.terrier.structures.collections.FSOrderedMapFile.get(FSOrderedMapFile.java:1)
          at org.terrier.structures.MapLexicon.getLexiconEntry(MapLexicon.java:92)
          at org.terrier.structures.MapLexicon.getLexiconEntry(MapLexicon.java:1)
          at org.terrier.matching.PostingListManager.addSingleTerm(PostingListManager.java:195)
          at org.terrier.matching.PostingListManager.<init>(PostingListManager.java:169)
          at org.terrier.matching.taat.Full.match(Full.java:73)
          at org.terrier.querying.Manager.runMatching(Manager.java:676)
          at org.terrier.applications.desktop.DesktopTerrier.runQuery(DesktopTerrier.java:1002)
          at org.terrier.applications.desktop.DesktopTerrier.access$15(DesktopTerrier.java:973)
          at org.terrier.applications.desktop.DesktopTerrier$11.run(DesktopTerrier.java:962)
          at java.awt.event.InvocationEvent.dispatch(InvocationEvent.java:209)
          at java.awt.EventQueue.dispatchEvent(EventQueue.java:597)
          at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:273)
          at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:183)
          at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:173)
          at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:168)
          at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:160)
          at java.awt.EventDispatchThread.run(EventDispatchThread.java:121)

          pi Bartholomew Cubbins added a comment -

          Greetings, I have found the same issue.
          Fixed it (it seems) by adding:

          @Override
          public String next()
          {
              try {
                  // Guard against the NPE: no reader exists for this document
                  if (this.br == null) {
                      eos = true;
                      return null;
                  }
                  // ... remainder of the method unchanged

          craigm Craig Macdonald added a comment -

          Thanks Bartholomew. Perhaps other users experiencing this problem (Ulrich, tutysara) can test the patch?

          rendfield Ulrich Kaemmerer added a comment -

          Sorry, I will not do that in the near future.
          The product was not usable for me (indexing breaks after a few files) so I switched to another product.

          craigm Craig Macdonald added a comment -

          Committed for 3.6. I chose to check for null in the constructor of the Tokenisers, rather than for each term.
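A minimal sketch of what checking for null in the constructor, rather than for each term, might look like. It assumes, as the stack traces above suggest, that a failed document conversion leaves the tokeniser with a null reader. `NullSafeTokenStream` is a hypothetical class written for illustration with simple whitespace splitting; it is not the committed code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical token stream; not the Terrier tokeniser.
final class NullSafeTokenStream {
    private final BufferedReader br;
    private boolean eos; // end of stream reached

    /** The null check happens once here, so next() never sees a null reader. */
    NullSafeTokenStream(Reader reader) {
        if (reader == null) {
            this.br = null;
            this.eos = true; // an unreadable document behaves like an empty one
        } else {
            this.br = new BufferedReader(reader);
        }
    }

    /** Returns the next whitespace-delimited token, or null at end of stream. */
    String next() {
        if (eos) {
            return null;
        }
        try {
            StringBuilder token = new StringBuilder();
            int ch;
            while ((ch = br.read()) != -1) {
                if (Character.isWhitespace(ch)) {
                    if (token.length() > 0) {
                        return token.toString();
                    }
                } else {
                    token.append((char) ch);
                }
            }
            eos = true;
            return token.length() > 0 ? token.toString() : null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this arrangement, `new NullSafeTokenStream(null).next()` simply returns null, so a caller looping over terms treats the broken document as having no terms.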


            People

            • Assignee: craigm Craig Macdonald
            • Reporter: rendfield Ulrich Kaemmerer
