[TR-172] Upgrade PDFBox Created: 27/Jun/11  Updated: 06/Mar/14  Resolved: 06/Mar/14

Status: Resolved
Project: Terrier Core
Component/s: .indexing
Affects Version/s: 3.0, 3.5
Fix Version/s: 3.6

Type: Improvement Priority: Minor
Reporter: Craig Macdonald Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None

Attachments: File pdfbox-app-1.6.0.jar, Text File PDFDocument.java, Text File Report.txt

 Description   
PDFBox is now an Apache project. We should upgrade, since the newer releases are likely to include PDF parsing improvements. See also http://terrier.org/forum//read.php?3,1928

 Comments   
Comment by Craig Macdonald [ 05/Sep/11 ]

Revised PDFDocument for upgraded pdfbox.

Comment by Craig Macdonald [ 13/Apr/12 ]

Committed for 3.6

Comment by Rolf Neidhart [ 19/Apr/12 ]

This seems to me to be more than just a minor problem. The old PDFBox could not read my PDFs created with Acrobat 9.
There is another problem: the standard configuration reads the whole PDF into the Java heap space. This causes problems with PDF files larger than 1 gigabyte.
So I had to modify the source of PDFDocument.
I will try to attach my detailed report.

Comment by Rolf Neidhart [ 19/Apr/12 ]

My way of enabling Terrier to parse large PDF documents.

Comment by Craig Macdonald [ 19/May/12 ]

Thanks Rolf, I'll take a look at this.

Comment by Richard McCreadie [ 06/Mar/14 ]

I looked at some alternative implementations, but it seems that they just rely on the load() method, which reads the entire PDF.

I have committed a patch that adds a file size check for each PDF when the reader is first opened; the file is skipped if its size exceeds 300 MB.

Commit 3747.
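
For readers without access to the commit, the size check described above might look roughly like the sketch below. It is not the code from commit 3747: the class name PdfSizeGuard, the method names, and the reading of "300 MB" as 300 * 1024 * 1024 bytes are all assumptions made for illustration. Only the standard java.io.File.length() call is used, so nothing is parsed before the decision to skip is made.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical helper; the name and layout are illustrative, not Terrier's.
class PdfSizeGuard {

    // Assumed limit: 300 MB interpreted as 300 * 1024 * 1024 bytes.
    static final long MAX_PDF_BYTES = 300L * 1024 * 1024;

    // True if a file of sizeBytes is strictly larger than maxBytes.
    static boolean exceedsLimit(long sizeBytes, long maxBytes) {
        return sizeBytes > maxBytes;
    }

    // True if the file on disk is larger than the assumed limit and
    // should be skipped before any PDF parsing is attempted.
    // File.length() returns 0 for a missing file, so such a file is
    // not reported as oversized here.
    static boolean shouldSkip(File pdf) {
        return exceedsLimit(pdf.length(), MAX_PDF_BYTES);
    }

    public static void main(String[] args) throws IOException {
        // Write a small temporary file to exercise the check.
        File sample = File.createTempFile("sample", ".pdf");
        sample.deleteOnExit();
        Files.write(sample.toPath(), new byte[2048]);

        System.out.println("skip at default limit: " + shouldSkip(sample));
        System.out.println("over a 1 KB limit: "
                + exceedsLimit(sample.length(), 1024));
    }
}
```

A caller would test shouldSkip(file) before handing the file to the PDF parser, and log and move on when it returns true. This avoids the out-of-memory failure on very large files, at the cost of leaving those files unindexed.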

Generated at Thu Dec 14 02:34:56 GMT 2017 using JIRA 7.1.1#71004-sha1:d6b2c0d9b7051e9fb5e4eb8ce177ca56d91d7bd8.