[TR-174] Indexing a directory breaks on special pdf- or excel files Created: 04/Aug/11  Updated: 13/Apr/12  Resolved: 13/Apr/12

Status: Resolved
Project: Terrier Core
Component/s: .indexing
Affects Version/s: 3.5
Fix Version/s: 3.6

Type: Bug Priority: Major
Reporter: Ulrich Kaemmerer Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None

Attachments: File TR-174.v1.patch    

 Description   
I installed Terrier 3.5 on Windows XP and started desktop_terrier.
I then chose a directory to index and started indexing.
After about 50 documents Terrier threw an exception because it could not index one particular PDF document (other PDFs worked).
Is there any way to tell Terrier to skip such exceptions and carry on indexing?

Here is the exception/log:

Set TERRIER_HOME to be D:\Java\terrier
WARNING: The file terrier.properties was not found at location D:\Java\terrier\etc\terrier.properties
Assuming the value of terrier.home from the corresponding system property.
INFO - Deleting: D:\Java\terrier\var\index\data_1.direct.bf: true
INFO - Deleting: D:\Java\terrier\var\index\data_1.document.fsarrayfile: true
INFO - Deleting: D:\Java\terrier\var\index\data_1.meta.idx: true
INFO - Deleting: D:\Java\terrier\var\index\data_1.meta.zdata: true
INFO - creating the data structures data_1
INFO - BlockIndexer creating direct index
INFO - NEXT: D:\Virtual Machines\host\Privat\_dokumente
.....
java.lang.NullPointerException
at org.pdfbox.util.PDFStreamEngine.showString(PDFStreamEngine.java:254)
at org.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:773)
at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:139)
at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:211)
at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:185)
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:161)
at org.terrier.indexing.PDFDocument.getReader(PDFDocument.java:111)
at org.terrier.indexing.FileDocument.<init>(FileDocument.java:130)
at org.terrier.indexing.PDFDocument.<init>(PDFDocument.java:68)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at org.terrier.indexing.SimpleFileCollection.makeDocument(SimpleFileCollection.java:342)
at org.terrier.indexing.SimpleFileCollection.getDocument(SimpleFileCollection.java:303)
at org.terrier.indexing.BlockIndexer.createDirectIndex(BlockIndexer.java:357)
at org.terrier.indexing.Indexer.index(Indexer.java:346)
at org.terrier.applications.desktop.DesktopTerrier.runIndex(DesktopTerrier.java:1129)
at org.terrier.applications.desktop.DesktopTerrier.access$1100(DesktopTerrier.java:114)
at org.terrier.applications.desktop.DesktopTerrier$8$1.run(DesktopTerrier.java:498)
ERROR - An unexpected exception occured while indexing. Indexing has been aborted.
java.lang.NullPointerException
at org.terrier.indexing.tokenisation.EnglishTokeniser$EnglishTokenStream.next(EnglishTokeniser.java:97)
at org.terrier.indexing.tokenisation.EnglishTokeniser$EnglishTokenStream.next(EnglishTokeniser.java:76)
at org.terrier.indexing.FileDocument.getNextTerm(FileDocument.java:221)
at org.terrier.indexing.BlockIndexer.createDirectIndex(BlockIndexer.java:371)
at org.terrier.indexing.Indexer.index(Indexer.java:346)
at org.terrier.applications.desktop.DesktopTerrier.runIndex(DesktopTerrier.java:1129)
at org.terrier.applications.desktop.DesktopTerrier.access$1100(DesktopTerrier.java:114)
at org.terrier.applications.desktop.DesktopTerrier$8$1.run(DesktopTerrier.java:498)



 Comments   
Comment by Craig Macdonald [ 05/Aug/11 ]

Hi Ulrich,

Yes, I actually found this myself yesterday. Please can you see if the attached patch addresses your problem?

Craig

Comment by Craig Macdonald [ 05/Aug/11 ]

This issue should have a unit test before committing.

Comment by tutysara [ 10/Aug/11 ]

I have the same issue with Excel files.
I got this stack trace:

ERROR - An unexpected exception occured while indexing. Indexing has been aborted.
java.lang.NullPointerException
at org.terrier.indexing.tokenisation.EnglishTokeniser$EnglishTokenStream.next(EnglishTokeniser.java:97)
at org.terrier.indexing.tokenisation.EnglishTokeniser$EnglishTokenStream.next(EnglishTokeniser.java:76)
at org.terrier.indexing.FileDocument.getNextTerm(FileDocument.java:221)
at org.terrier.indexing.BlockIndexer.createDirectIndex(BlockIndexer.java:371)
at org.terrier.indexing.Indexer.index(Indexer.java:346)
at org.terrier.applications.desktop.DesktopTerrier.runIndex(DesktopTerrier.java:1129)
at org.terrier.applications.desktop.DesktopTerrier.access$1100(DesktopTerrier.java:114)

The actual problem might be that the file is not readable:

WARN - WARNING: Problem converting excel documentjava.io.IOException: Invalid header signature; read 723401728380766730, expected -2226271756974174256

I will try your patch and report the result.
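
The two numbers in the warning above can be decoded by hand. The "expected" value is the 8-byte OLE2 compound-file signature (D0 CF 11 E0 A1 B1 1A E1) read as a little-endian long, and the "read" value is what eight line-feed bytes (0x0A) look like when read the same way, which suggests the file is not a real Excel workbook. The following is only a sketch to check that arithmetic; the class and method names are hypothetical and are not part of Terrier or POI.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, for illustration only.
class Ole2Signature {
    // First 8 bytes of an OLE2 compound file, the container format used by .xls
    static final byte[] MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Interprets the first 8 bytes of the header as a little-endian long
    static long asLittleEndianLong(byte[] header) {
        return ByteBuffer.wrap(header, 0, 8).order(ByteOrder.LITTLE_ENDIAN).getLong();
    }

    // True only if the header is long enough and starts with the OLE2 signature
    static boolean looksLikeOle2(byte[] header) {
        return header != null
            && header.length >= 8
            && asLittleEndianLong(header) == asLittleEndianLong(MAGIC);
    }
}
```

A check like this could be run on the first 8 bytes of a file before handing it to the Excel parser, so that mislabelled files are skipped early.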

Comment by Ulrich Kaemmerer [ 10/Aug/11 ]

I applied the patch, recompiled everything and re-ran Terrier against the same directory, with the same result as before.
Indexing crashed and aborted.

The problem is not that the file could not be indexed, but that the whole process stops after that error.
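
To illustrate the behaviour being asked for (skip the failing document and continue with the rest), here is a minimal sketch. It is not Terrier code: `SkippingIndexer`, `indexAll` and the `parser` function are hypothetical stand-ins, and "indexing" is reduced to recording the path.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for an indexer; not Terrier's Indexer or BlockIndexer.
class SkippingIndexer {
    final List<String> indexed = new ArrayList<>();
    final List<String> skipped = new ArrayList<>();

    // parser is an assumed function that turns a path into text and may throw
    void indexAll(List<String> paths, Function<String, String> parser) {
        for (String path : paths) {
            try {
                String text = parser.apply(path);
                if (text == null) {
                    // Nothing readable in this document: skip it
                    skipped.add(path);
                    continue;
                }
                indexed.add(path);
            } catch (RuntimeException e) {
                // One broken document must not abort the whole run
                System.err.println("WARN - skipping " + path + ": " + e);
                skipped.add(path);
            }
        }
    }
}
```

The point of the design is that the try/catch sits inside the loop, around a single document, rather than around the whole indexing run.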

Comment by tutysara [ 11/Aug/11 ]

I applied the given patch.
The folder was indexed successfully.
However, I now get an exception when I try to search using a keyword.

Here are the logs

INFO - Collection #0 took 26seconds to index (1335 documents)

INFO - 1 lexicons to merge
INFO - Optimising structure lexicon
INFO - Optimsing lexicon with 9988 entries
INFO - Started building the block inverted index...
INFO - creating block inverted index
INFO - Iteration 1 of 1 iterations
INFO - Scanning lexicon for 2000000 pointers
INFO - time to process part of lexicon: 0.094
INFO - time to traverse direct file: 0.422
INFO - time to write inverted file: 0.078
INFO - time to perform one iteration: 0.594
INFO - number of pointers processed: 124495
INFO - Finished generating inverted file, rewriting lexicon
INFO - Optimising structure lexicon
INFO - Optimsing lexicon with 9988 entries
INFO - Finished building the block inverted index...
INFO - Time elapsed for inverted file: 0
INFO - Structure meta reading lookup file into memory
INFO - Structure meta reading reverse map for key docno directly from disk
INFO - Structure meta loading data file into memory
ERROR - IOException reading FSOrderedMapFile
java.io.EOFException
at java.io.RandomAccessFile.readByte(RandomAccessFile.java:591)
at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
at org.apache.hadoop.io.Text.readFields(Text.java:263)
at org.terrier.structures.seralization.FixedSizeTextFactory$FixedSizeText.readFields(FixedSizeTextFactory.java:65)
at org.terrier.structures.collections.FSOrderedMapFile.getEntry(FSOrderedMapFile.java:729)
at org.terrier.structures.collections.FSOrderedMapFile.get(FSOrderedMapFile.java:772)
at org.terrier.structures.collections.FSOrderedMapFile.get(FSOrderedMapFile.java:1)
at org.terrier.structures.MapLexicon.getLexiconEntry(MapLexicon.java:92)
at org.terrier.structures.MapLexicon.getLexiconEntry(MapLexicon.java:1)
at org.terrier.matching.PostingListManager.addSingleTerm(PostingListManager.java:195)
at org.terrier.matching.PostingListManager.<init>(PostingListManager.java:169)
at org.terrier.matching.taat.Full.match(Full.java:73)
at org.terrier.querying.Manager.runMatching(Manager.java:676)
at org.terrier.applications.desktop.DesktopTerrier.runQuery(DesktopTerrier.java:1002)
at org.terrier.applications.desktop.DesktopTerrier.access$15(DesktopTerrier.java:973)
at org.terrier.applications.desktop.DesktopTerrier$11.run(DesktopTerrier.java:962)
at java.awt.event.InvocationEvent.dispatch(InvocationEvent.java:209)
at java.awt.EventQueue.dispatchEvent(EventQueue.java:597)
at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:273)
at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:183)
at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:173)
at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:168)
at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:160)
at java.awt.EventDispatchThread.run(EventDispatchThread.java:121)

Comment by Bartholomew Cubbins [ 15/Nov/11 ]

Greetings, I have found the same issue.
Fixed it (it seems) by adding:

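@Override
public String next()
{
    try {
        // Guard against the NPE: br may be null, e.g. when the document could not be read
        if (this.br == null) {
            eos = true;
            return null;
        }
        // ... (rest of the existing method unchanged)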

Comment by Craig Macdonald [ 16/Nov/11 ]

Thanks Bartholomew. Perhaps other users experiencing this problem (Ulrich, tutysara) can test the patch?

Comment by Ulrich Kaemmerer [ 17/Nov/11 ]

Sorry, I will not do that in the near future.
The product was not usable for me (indexing breaks after a few files) so I switched to another product.

Comment by Craig Macdonald [ 13/Apr/12 ]

Committed for 3.6. I chose to check for null in the constructor of the Tokenisers, rather than for each term.
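
As a rough illustration of that design choice (this is not the committed Terrier code), a token stream can test for a null reader once in its constructor and mark itself as finished, so that `next()` needs no per-term null check. The class name `NullSafeTokenStream`, the whitespace-only tokenisation and the `drain` helper are assumptions made for the sketch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a tokeniser stream; not Terrier's EnglishTokenStream.
class NullSafeTokenStream {
    private final BufferedReader br; // null when the document had no readable content
    private boolean eos;             // true once the end of the stream is reached

    NullSafeTokenStream(Reader reader) {
        // Check once here, so next() never has to test for a null reader
        if (reader == null) {
            this.br = null;
            this.eos = true;
        } else {
            this.br = new BufferedReader(reader);
            this.eos = false;
        }
    }

    boolean hasNext() {
        return !eos;
    }

    // Returns the next whitespace-delimited token, or null at end of stream
    String next() {
        if (eos) {
            return null;
        }
        try {
            StringBuilder sb = new StringBuilder();
            int ch;
            while ((ch = br.read()) != -1) {
                if (Character.isWhitespace(ch)) {
                    if (sb.length() > 0) {
                        return sb.toString();
                    }
                } else {
                    sb.append((char) ch);
                }
            }
            eos = true;
            return sb.length() > 0 ? sb.toString() : null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience helper: collects all tokens from a reader, which may be null
    static List<String> drain(Reader reader) {
        NullSafeTokenStream stream = new NullSafeTokenStream(reader);
        List<String> tokens = new ArrayList<>();
        String token;
        while ((token = stream.next()) != null) {
            tokens.add(token);
        }
        return tokens;
    }
}
```

With this shape, a document whose reader could not be created simply yields zero terms instead of throwing a NullPointerException during indexing.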

Generated at Wed Dec 13 08:53:33 GMT 2017 using JIRA 7.1.1#71004-sha1:d6b2c0d9b7051e9fb5e4eb8ce177ca56d91d7bd8.