Terrier Core / TR-174

Indexing a directory breaks on certain PDF or Excel files

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.5
    • Fix Version/s: 3.6
    • Component/s: .indexing
    • Labels: None

      Description

      I've installed Terrier 3.5 on Windows XP and started desktop_terrier.
      After that, I chose a directory to index and started indexing.
      After about 50 documents Terrier throws an exception, because it was not able to index a particular PDF document (some other PDFs worked).
      Is there any way to tell Terrier to skip such exceptions and continue indexing?

      Here is the exception/log:

      Set TERRIER_HOME to be D:\Java\terrier
      WARNING: The file terrier.properties was not found at location D:\Java\terrier\etc\terrier.properties
      Assuming the value of terrier.home from the corresponding system property.
      INFO - Deleting: D:\Java\terrier\var\index\data_1.direct.bf: true
      INFO - Deleting: D:\Java\terrier\var\index\data_1.document.fsarrayfile: true
      INFO - Deleting: D:\Java\terrier\var\index\data_1.meta.idx: true
      INFO - Deleting: D:\Java\terrier\var\index\data_1.meta.zdata: true
      INFO - creating the data structures data_1
      INFO - BlockIndexer creating direct index
      INFO - NEXT: D:\Virtual Machines\host\Privat\_dokumente
      .....
      java.lang.NullPointerException
      at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:254)
      at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:773)
      at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:139)
      at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:211)
      at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:185)
      at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:161)
      at org.terrier.indexing.PDFDocument.getReader(PDFDocument.java:111)
      at org.terrier.indexing.FileDocument.<init>(FileDocument.java:130)
      at org.terrier.indexing.PDFDocument.<init>(PDFDocument.java:68)
      at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
      at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
      at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
      at java.lang.reflect.Constructor.newInstance(Unknown Source)
      at org.terrier.indexing.SimpleFileCollection.makeDocument(SimpleFileCollection.java:342)
      at org.terrier.indexing.SimpleFileCollection.getDocument(SimpleFileCollection.java:303)
      at org.terrier.indexing.BlockIndexer.createDirectIndex(BlockIndexer.java:357)
      at org.terrier.indexing.Indexer.index(Indexer.java:346)
      at org.terrier.applications.desktop.DesktopTerrier.runIndex(DesktopTerrier.java:1129)
      at org.terrier.applications.desktop.DesktopTerrier.access$1100(DesktopTerrier.java:114)
      at org.terrier.applications.desktop.DesktopTerrier$8$1.run(DesktopTerrier.java:498)
      ERROR - An unexpected exception occured while indexing. Indexing has been aborted.
      java.lang.NullPointerException
      at org.terrier.indexing.tokenisation.EnglishTokeniser$EnglishTokenStream.next(EnglishTokeniser.java:97)
      at org.terrier.indexing.tokenisation.EnglishTokeniser$EnglishTokenStream.next(EnglishTokeniser.java:76)
      at org.terrier.indexing.FileDocument.getNextTerm(FileDocument.java:221)
      at org.terrier.indexing.BlockIndexer.createDirectIndex(BlockIndexer.java:371)
      at org.terrier.indexing.Indexer.index(Indexer.java:346)
      at org.terrier.applications.desktop.DesktopTerrier.runIndex(DesktopTerrier.java:1129)
      at org.terrier.applications.desktop.DesktopTerrier.access$1100(DesktopTerrier.java:114)
      at org.terrier.applications.desktop.DesktopTerrier$8$1.run(DesktopTerrier.java:498)
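
The log above appears to show two separate failures: the PDF parser throws first, and the tokeniser then fails on the document that was never opened properly, which aborts the whole run. The following is a minimal sketch of the "skip and continue" behaviour asked for in the description. It is not the change made in TR-174.v1.patch and does not use Terrier's classes; `SkipBadDocuments`, `Extractor`, `Result`, `indexAll` and `demo` are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SkipBadDocuments {

    /** Hypothetical stand-in for a per-file text extractor such as a PDF parser. */
    interface Extractor {
        Reader open(String path) throws Exception;
    }

    /** Outcome of a run: token count per indexed file, plus the files that were skipped. */
    static final class Result {
        final Map<String, Integer> indexed = new LinkedHashMap<>();
        final List<String> skipped = new ArrayList<>();
    }

    /** Counts whitespace-separated tokens; stands in for the real tokenisation step. */
    static int countTokens(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            text.append(buffer, 0, n);
        }
        String trimmed = text.toString().trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Processes every path; a failure in one file is recorded and the loop moves on. */
    static Result indexAll(List<String> paths, Extractor extractor) {
        Result result = new Result();
        for (String path : paths) {
            try (Reader reader = extractor.open(path)) {
                if (reader == null) {
                    // No reader was produced: skip instead of failing later in tokenisation.
                    result.skipped.add(path);
                    continue;
                }
                result.indexed.put(path, countTokens(reader));
            } catch (Exception e) {
                // Unchecked exceptions such as NullPointerException are caught here too.
                System.err.println("Skipping " + path + ": " + e);
                result.skipped.add(path);
            }
        }
        return result;
    }

    /** Simulated run: one file whose parser throws, one that yields no reader, two good ones. */
    static Result demo() {
        Extractor extractor = path -> {
            if (path.endsWith("broken.pdf")) {
                throw new NullPointerException("simulated parser failure");
            }
            if (path.endsWith("empty.xls")) {
                return null;
            }
            return new StringReader("some extracted text");
        };
        return indexAll(Arrays.asList("a.pdf", "broken.pdf", "empty.xls", "b.txt"), extractor);
    }

    public static void main(String[] args) {
        Result r = demo();
        System.out.println("indexed: " + r.indexed);
        System.out.println("skipped: " + r.skipped);
    }
}
```

The point of the sketch is that the per-file `try` block covers both opening and tokenising the document, and a missing reader is handled explicitly, so one bad file cannot take down the whole run.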

          Activity

          rendfield Ulrich Kaemmerer created issue
          craigm Craig Macdonald made changes
            Field            Original Value               New Value
            Status           Open [ 1 ]                   Patch Available [ 10000 ]
          craigm Craig Macdonald made changes
            Status           Patch Available [ 10000 ]    Open [ 1 ]
          craigm Craig Macdonald made changes
            Attachment                                    TR-174.v1.patch [ 10322 ]
          craigm Craig Macdonald made changes
            Status           Open [ 1 ]                   Patch Available [ 10000 ]
          Anonymous made changes
            Status           Patch Available [ 10000 ]    Open [ 1 ]
          pi Bartholomew Cubbins made changes
            Status           Open [ 1 ]                   Patch Available [ 10000 ]
          pi Bartholomew Cubbins made changes
            Status           Patch Available [ 10000 ]    Open [ 1 ]
          craigm Craig Macdonald made changes
            Status           Open [ 1 ]                   Resolved [ 5 ]
            Fix Version/s                                 3.6 [ 10060 ]
            Resolution                                    Fixed [ 1 ]

            People

            • Assignee:
              craigm Craig Macdonald
            • Reporter:
              rendfield Ulrich Kaemmerer
            • Watchers:
              0

              Dates

              • Created:
                Updated:
                Resolved: