Unable to index .pdf files in desktop search
Posted by: arien ()
Date: June 25, 2011 07:58AM

Hi,
When I try to index a folder containing .pdf files, the desktop console throws this error:

java.lang.NullPointerException
at org.terrier.indexing.PDFDocument.getReader(PDFDocument.java:119)
at org.terrier.indexing.FileDocument.<init>(FileDocument.java:130)
at org.terrier.indexing.PDFDocument.<init>(PDFDocument.java:68)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at org.terrier.indexing.SimpleFileCollection.makeDocument(SimpleFileCollection.java:342)
at org.terrier.indexing.SimpleFileCollection.getDocument(SimpleFileCollection.java:303)
at org.terrier.indexing.BasicSinglePassIndexer.createInvertedIndex(BasicSinglePassIndexer.java:201)
at org.terrier.indexing.BasicSinglePassIndexer.createDirectIndex(BasicSinglePassIndexer.java:155)
at org.terrier.indexing.Indexer.index(Indexer.java:346)
at org.terrier.applications.desktop.DesktopTerrier.runIndex(DesktopTerrier.java:1129)
at org.terrier.applications.desktop.DesktopTerrier.access$1100(DesktopTerrier.java:114)
at org.terrier.applications.desktop.DesktopTerrier$8$1.run(DesktopTerrier.java:498)
WARN - WARNING: Problem converting PDF:
java.lang.NullPointerException
at org.terrier.indexing.PDFDocument.getReader(PDFDocument.java:124)
at org.terrier.indexing.FileDocument.<init>(FileDocument.java:130)
at org.terrier.indexing.PDFDocument.<init>(PDFDocument.java:68)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at org.terrier.indexing.SimpleFileCollection.makeDocument(SimpleFileCollection.java:342)
at org.terrier.indexing.SimpleFileCollection.getDocument(SimpleFileCollection.java:303)
at org.terrier.indexing.BasicSinglePassIndexer.createInvertedIndex(BasicSinglePassIndexer.java:201)
at org.terrier.indexing.BasicSinglePassIndexer.createDirectIndex(BasicSinglePassIndexer.java:155)
at org.terrier.indexing.Indexer.index(Indexer.java:346)
at org.terrier.applications.desktop.DesktopTerrier.runIndex(DesktopTerrier.java:1129)
at org.terrier.applications.desktop.DesktopTerrier.access$1100(DesktopTerrier.java:114)
at org.terrier.applications.desktop.DesktopTerrier$8$1.run(DesktopTerrier.java:498)
ERROR - An unexpected exception occured while indexing. Indexing has been aborted.
java.lang.NullPointerException
at org.terrier.indexing.tokenisation.EnglishTokeniser$EnglishTokenStream.next(EnglishTokeniser.java:97)
at org.terrier.indexing.tokenisation.EnglishTokeniser$EnglishTokenStream.next(EnglishTokeniser.java:76)
at org.terrier.indexing.FileDocument.getNextTerm(FileDocument.java:221)
at org.terrier.indexing.BasicSinglePassIndexer.createInvertedIndex(BasicSinglePassIndexer.java:213)
at org.terrier.indexing.BasicSinglePassIndexer.createDirectIndex(BasicSinglePassIndexer.java:155)
at org.terrier.indexing.Indexer.index(Indexer.java:346)
at org.terrier.applications.desktop.DesktopTerrier.runIndex(DesktopTerrier.java:1129)
at org.terrier.applications.desktop.DesktopTerrier.access$1100(DesktopTerrier.java:114)
at org.terrier.applications.desktop.DesktopTerrier$8$1.run(DesktopTerrier.java:498)



In the terrier.properties file I have included the following options, which were not there by default:

desktop.indexing.singlepass=true
desktopsearch.filetype.colors=Text:(221 221 221),TeX:(221 221 221),Bib:(221 221 221),PDF:(236 67 69),HTML:(177 228 250),Word:(100 100 255),Powerpoint:(250 110 49),Excel:(38 183 78),XHTML:(177 228 250),XML:(177 228 250)
desktopsearch.filetype.types=txt:Text,text:Text,tex:TeX,bib:Bib,pdf:PDF,html:HTML,htm:HTML,xhtml:XHTML,xml:XML,doc:Word,ppt:Powerpoint,xls:Excel


Please help me with this

Regards,
Arien
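
For readers unfamiliar with the format, the desktopsearch.filetype.types value in the post above is a comma-separated list of extension:Type pairs. Below is a minimal sketch of how such a value breaks down, using only java.util.Properties. It makes no claim about how Terrier itself parses the property; the class FileTypeMapping and the method parseTypes are hypothetical names used for illustration only.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper, not part of Terrier: shows how an
// "ext:Type,ext:Type" property value maps extensions to type names.
final class FileTypeMapping {
    private FileTypeMapping() {}

    // Splits "txt:Text,pdf:PDF" into {txt=Text, pdf=PDF}.
    // Entries without both an extension and a type are skipped.
    static Map<String, String> parseTypes(String value) {
        Map<String, String> types = new LinkedHashMap<>();
        if (value == null) {
            return types;
        }
        for (String pair : value.split(",")) {
            int colon = pair.indexOf(':');
            if (colon <= 0 || colon == pair.length() - 1) {
                continue; // malformed entry, e.g. "pdf" or ":PDF"
            }
            String extension = pair.substring(0, colon).trim().toLowerCase();
            String type = pair.substring(colon + 1).trim();
            types.put(extension, type);
        }
        return types;
    }

    public static void main(String[] args) throws IOException {
        // A shortened version of the property from the post above.
        Properties props = new Properties();
        props.load(new StringReader(
            "desktopsearch.filetype.types=txt:Text,pdf:PDF,xls:Excel\n"));
        Map<String, String> types =
            parseTypes(props.getProperty("desktopsearch.filetype.types"));
        System.out.println("pdf is handled as: " + types.get("pdf"));
    }
}
```

If an extension such as pdf were missing from the mapping, the lookup would return null, which is one simple thing to rule out before looking at the PDF parser itself.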

Re: Unable to index .pdf files in desktop search
Posted by: arien ()
Date: June 25, 2011 07:59AM

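The forum may render parts of the property values above (such as ":(" and ":P") as smileys. That is probably just a character-encoding issue with the post, but you will get the idea anyway.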

Re: Unable to index .pdf files in desktop search
Posted by: arien ()
Date: June 26, 2011 03:30PM

Hi,
This problem occurs only when indexing a few .pdf files; otherwise, indexing of .pdf files works fine.
Please let me know what I am doing wrong. I am looking forward to using Terrier to build a custom enterprise search engine, with its distributed Hadoop capabilities, for searching HTML, PDF and XLS files. Your help is much appreciated.

thanks

Re: Unable to index .pdf files in desktop search
Posted by: craigm ()
Date: June 26, 2011 04:41PM

I think there is another exception higher up that you haven't pasted.

Craig

Re: Unable to index .pdf files in desktop search
Posted by: arien ()
Date: June 27, 2011 05:38AM

Hi,

The complete error output is below:


INFO - creating the data structures data_1
INFO - BlockIndexer creating direct index
INFO - NEXT: C:\Documents and Settings\Administrator\Desktop\DOCS
INFO - NEXT: C:\Documents and Settings\Administrator\Desktop\DOCS\DOCS
INFO - NEXT: C:\Documents and Settings\Administrator\Desktop\DOCS\HTMLS
INFO - NEXT: C:\Documents and Settings\Administrator\Desktop\DOCS\PDFS
INFO - NEXT: C:\Documents and Settings\Administrator\Desktop\DOCS\TEXT
INFO - NEXT: C:\Documents and Settings\Administrator\Desktop\DOCS\XLS
INFO - NEXT: C:\Documents and Settings\Administrator\Desktop\DOCS\DOCS\java_classesAndTheirMethosss.doc
INFO - NEXT: C:\Documents and Settings\Administrator\Desktop\DOCS\DOCS\New Microsoft Word Document.doc
INFO - NEXT: C:\Documents and Settings\Administrator\Desktop\DOCS\DOCS\VuSVNDetails.doc
INFO - NEXT: C:\Documents and Settings\Administrator\Desktop\DOCS\DOCS\VUWebSupport.doc
INFO - NEXT: C:\Documents and Settings\Administrator\Desktop\DOCS\HTMLS\allclasses-frame.html
INFO - NEXT: C:\Documents and Settings\Administrator\Desktop\DOCS\HTMLS\allclasses-noframe.html
INFO - NEXT: C:\Documents and Settings\Administrator\Desktop\DOCS\HTMLS\basicComponents.html
INFO - NEXT: C:\Documents and Settings\Administrator\Desktop\DOCS\HTMLS\bibliography.html
INFO - NEXT: C:\Documents and Settings\Administrator\Desktop\DOCS\HTMLS\configure_general.html
INFO - NEXT: C:\Documents and Settings\Administrator\Desktop\DOCS\HTMLS\configure_indexing.html
INFO - NEXT: C:\Documents and Settings\Administrator\Desktop\DOCS\HTMLS\configure_retrieval.html
INFO - NEXT: C:\Documents and Settings\Administrator\Desktop\DOCS\HTMLS\constant-values.html
INFO - NEXT: C:\Documents and Settings\Administrator\Desktop\DOCS\HTMLS\contacts
INFO - NEXT: C:\Documents and Settings\Administrator\Desktop\DOCS\PDFS\2.pdf
INFO - NEXT: C:\Documents and Settings\Administrator\Desktop\DOCS\PDFS\7537779.pdf
INFO - NEXT: C:\Documents and Settings\Administrator\Desktop\DOCS\PDFS\amazonir_10Q_20110427.pdf
INFO - NEXT: C:\Documents and Settings\Administrator\Desktop\DOCS\PDFS\fa39edec_anrepeng2003-04.pdf
INFO - NEXT: C:\Documents and Settings\Administrator\Desktop\DOCS\PDFS\Indian_GAAP_Financials-Q2-FY08-09.pdf
INFO - NEXT: C:\Documents and Settings\Administrator\Desktop\DOCS\PDFS\pers_quality_checklist.pdf
INFO - NEXT: C:\Documents and Settings\Administrator\Desktop\DOCS\PDFS\php-and-zend-framework-getting-started.pdf
INFO - NEXT: C:\Documents and Settings\Administrator\Desktop\DOCS\PDFS\Python.pdf
WARN - WARNING: Problem converting PDF:
java.io.IOException: expected='obj' actual='0' org.pdfbox.io.PushBackInputStream@b20090
at org.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:379)
at org.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:147)
at org.terrier.indexing.PDFDocument.getReader(PDFDocument.java:106)
at org.terrier.indexing.FileDocument.<init>(FileDocument.java:130)
at org.terrier.indexing.PDFDocument.<init>(PDFDocument.java:68)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at org.terrier.indexing.SimpleFileCollection.makeDocument(SimpleFileCollection.java:342)
at org.terrier.indexing.SimpleFileCollection.getDocument(SimpleFileCollection.java:303)
at org.terrier.indexing.BlockIndexer.createDirectIndex(BlockIndexer.java:357)
at org.terrier.indexing.Indexer.index(Indexer.java:346)
at org.terrier.applications.desktop.DesktopTerrier.runIndex(DesktopTerrier.java:1129)
at org.terrier.applications.desktop.DesktopTerrier.access$1100(DesktopTerrier.java:114)
at org.terrier.applications.desktop.DesktopTerrier$8$1.run(DesktopTerrier.java:498)
WARN - WARNING: Problem converting PDF:
java.lang.NullPointerException
at org.terrier.indexing.PDFDocument.getReader(PDFDocument.java:119)
at org.terrier.indexing.FileDocument.<init>(FileDocument.java:130)
at org.terrier.indexing.PDFDocument.<init>(PDFDocument.java:68)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at org.terrier.indexing.SimpleFileCollection.makeDocument(SimpleFileCollection.java:342)
at org.terrier.indexing.SimpleFileCollection.getDocument(SimpleFileCollection.java:303)
at org.terrier.indexing.BlockIndexer.createDirectIndex(BlockIndexer.java:357)
at org.terrier.indexing.Indexer.index(Indexer.java:346)
at org.terrier.applications.desktop.DesktopTerrier.runIndex(DesktopTerrier.java:1129)
at org.terrier.applications.desktop.DesktopTerrier.access$1100(DesktopTerrier.java:114)
at org.terrier.applications.desktop.DesktopTerrier$8$1.run(DesktopTerrier.java:498)
WARN - WARNING: Problem converting PDF:
java.lang.NullPointerException
at org.terrier.indexing.PDFDocument.getReader(PDFDocument.java:124)
at org.terrier.indexing.FileDocument.<init>(FileDocument.java:130)
at org.terrier.indexing.PDFDocument.<init>(PDFDocument.java:68)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at org.terrier.indexing.SimpleFileCollection.makeDocument(SimpleFileCollection.java:342)
at org.terrier.indexing.SimpleFileCollection.getDocument(SimpleFileCollection.java:303)
at org.terrier.indexing.BlockIndexer.createDirectIndex(BlockIndexer.java:357)
at org.terrier.indexing.Indexer.index(Indexer.java:346)
at org.terrier.applications.desktop.DesktopTerrier.runIndex(DesktopTerrier.java:1129)
at org.terrier.applications.desktop.DesktopTerrier.access$1100(DesktopTerrier.java:114)
at org.terrier.applications.desktop.DesktopTerrier$8$1.run(DesktopTerrier.java:498)
ERROR - An unexpected exception occured while indexing. Indexing has been aborted.
java.lang.NullPointerException
at org.terrier.indexing.tokenisation.EnglishTokeniser$EnglishTokenStream.next(EnglishTokeniser.java:97)
at org.terrier.indexing.tokenisation.EnglishTokeniser$EnglishTokenStream.next(EnglishTokeniser.java:76)
at org.terrier.indexing.FileDocument.getNextTerm(FileDocument.java:221)
at org.terrier.indexing.BlockIndexer.createDirectIndex(BlockIndexer.java:371)
at org.terrier.indexing.Indexer.index(Indexer.java:346)
at org.terrier.applications.desktop.DesktopTerrier.runIndex(DesktopTerrier.java:1129)
at org.terrier.applications.desktop.DesktopTerrier.access$1100(DesktopTerrier.java:114)
at org.terrier.applications.desktop.DesktopTerrier$8$1.run(DesktopTerrier.java:498)


As you can see, it has already indexed some PDFs before this point, but when it reaches Python.pdf it shows the error. The same happens for a few other PDFs.


Thanks



Edited 1 time(s). Last edit at 06/27/2011 05:41AM by arien.

Re: Unable to index .pdf files in desktop search
Posted by: craigm ()
Date: June 27, 2011 12:05PM

There seems to be something about your PDF that PDFBox can't handle. I note that PDFBox is now an Apache project. Perhaps you could try to integrate the latest version with Terrier instead?

This will mean changing some of the code in PDFDocument.

Craig
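
The traces earlier in the thread suggest a failure chain: PDFBox throws while parsing, PDFDocument.getReader logs a warning, and the tokeniser then hits a NullPointerException, presumably because it was left without a usable reader. Independently of upgrading PDFBox, one defensive pattern is to fall back to empty text when extraction fails, so that a single bad file does not abort the whole indexing run. The following is a minimal sketch of that idea in plain Java. TextExtractor and SafeReaders are hypothetical names for illustration, not Terrier or PDFBox classes, and a real fix would have to be made inside Terrier's PDFDocument.

```java
import java.io.File;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical stand-in for whatever actually parses the PDF
// (for example, a call into PDFBox). Not a Terrier interface.
interface TextExtractor {
    String extract(File file) throws IOException;
}

// Hypothetical helper, not a Terrier class.
final class SafeReaders {
    private SafeReaders() {}

    // Returns the extracted text, or "" if the extractor throws
    // or yields null. Never returns null.
    static String textOrEmpty(File file, TextExtractor extractor) {
        try {
            String text = extractor.extract(file);
            return text == null ? "" : text;
        } catch (IOException | RuntimeException e) {
            // Log and carry on: the file is treated as an empty document.
            System.err.println("WARN - Problem converting " + file + ": " + e);
            return "";
        }
    }

    // Wraps the text in a Reader, so downstream code that pulls
    // terms from a Reader always receives a non-null one.
    static Reader readerFor(File file, TextExtractor extractor) {
        return new StringReader(textOrEmpty(file, extractor));
    }
}
```

With this pattern, a file that triggers the IOException from the log would be indexed as an empty document and the run would continue to the next file. The trade-off is that unparseable PDFs silently contribute no terms, so the warning output should still be checked.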

Re: Unable to index .pdf files in desktop search
Posted by: arien ()
Date: June 27, 2011 01:11PM

Thanks, Craig.

I will try with that.

Re: Unable to index .pdf files in desktop search
Posted by: ounis ()
Date: June 27, 2011 01:43PM

Perhaps it is worth creating a JIRA issue about this. We welcome patches.

Iadh

Re: Unable to index .pdf files in desktop search
Posted by: craigm ()
Date: June 27, 2011 02:23PM

arien,

If you find a solution, can you attach a patch to
[terrier.org]

Craig

Re: Unable to index .pdf files in desktop search
Posted by: craigm ()
Date: September 05, 2011 07:13PM

There is a revised PDFDocument for the latest version of pdfbox available at [terrier.org]

Craig
