Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.0, 3.5
    • Fix Version/s: 3.6
    • Component/s: .indexing
    • Labels:
      None

      Description

      PDFBox is now an Apache project. We should upgrade, as the newer releases likely include PDF parsing improvements. See also http://terrier.org/forum//read.php?3,1928

        Attachments

          Activity

          craigm Craig Macdonald created issue -
          craigm Craig Macdonald made changes -
          Field Original Value New Value
          Project TREC [ 10010 ] Terrier Core [ 10000 ]
          Key TREC-261 TR-172
          Workflow jira [ 10586 ] Terrier Open Source [ 10587 ]
          Affects Version/s 3.5 [ 10040 ]
          Affects Version/s 3.0 [ 10030 ]
          Affects Version/s 3.5 [ 10021 ]
          Affects Version/s 3.0 [ 10020 ]
          Component/s .indexing [ 10002 ]
          Component/s Core [ 10020 ]
          Fix Version/s 4.0 [ 10051 ]
          Fix Version/s 4.0 [ 10050 ]
          craigm Craig Macdonald made changes -
          Status Open [ 1 ] Patch Available [ 10000 ]
          craigm Craig Macdonald made changes -
          Status Patch Available [ 10000 ] Open [ 1 ]
          craigm Craig Macdonald added a comment -

          Revised PDFDocument for upgraded pdfbox.

          craigm Craig Macdonald made changes -
          Attachment PDFDocument.java [ 10323 ]
          Attachment pdfbox-app-1.6.0.jar [ 10324 ]
          craigm Craig Macdonald added a comment -

          Committed for 3.6

          craigm Craig Macdonald made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Fix Version/s 3.6 [ 10060 ]
          Fix Version/s 4.0 [ 10051 ]
          Resolution Fixed [ 1 ]
          rolf Rolf Neidhart added a comment -

          This seems to me to be more than just a minor problem. The old PDFBox could not read my PDFs created with Acrobat 9.
          There is another problem: the standard configuration reads the whole PDF into the Java heap space. This approach causes problems with PDF files larger than 1 gigabyte.
          So I had to modify the source of PDFDocument.
          I will try to attach my detailed report.

          rolf Rolf Neidhart added a comment -

          My way to enable Terrier to parse large PDF documents

          rolf Rolf Neidhart made changes -
          Attachment Report.txt [ 10341 ]
          craigm Craig Macdonald added a comment -

          Thanks Rolf, I'll take a look at this.

          craigm Craig Macdonald made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          Assignee Richard McCreadie [ richardm ] Craig Macdonald [ craigm ]
          richardm Richard McCreadie added a comment - - edited

          I looked at some alternative implementations, but it seems that they all rely on the load() method, which reads the entire PDF.

          I have committed a patch that adds a file size check when the reader first tries to open each PDF. A file is skipped if its size exceeds 300 MB.

          Commit 3747.
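
          A minimal sketch of the kind of file size check described above, not the committed patch itself. The class name, the method names, and the reading of the 300 MB limit as 300 * 1024 * 1024 bytes are assumptions made for illustration; the actual change is in commit 3747.

```java
import java.io.File;

final class PdfSizeGuard {
    // Assumed threshold: 300 MB, interpreted here as 300 * 1024 * 1024 bytes.
    static final long MAX_PDF_BYTES = 300L * 1024 * 1024;

    private PdfSizeGuard() {}

    /** Returns true if a file of the given size is larger than the limit. */
    static boolean exceedsLimit(long sizeInBytes, long maxBytes) {
        return sizeInBytes > maxBytes;
    }

    /**
     * Returns true if the PDF should be skipped instead of being parsed.
     * File.length() returns 0 for a file that does not exist, so a missing
     * file is not skipped here and must be handled by the caller.
     */
    static boolean shouldSkip(File pdf, long maxBytes) {
        return exceedsLimit(pdf.length(), maxBytes);
    }

    public static void main(String[] args) {
        // Report, for each path given on the command line, whether it would be skipped.
        for (String path : args) {
            File pdf = new File(path);
            String verdict = shouldSkip(pdf, MAX_PDF_BYTES) ? "skip" : "parse";
            System.out.println(verdict + ": " + path + " (" + pdf.length() + " bytes)");
        }
    }
}
```

          Checking the size before the parser is opened means that an oversized file is never loaded into the heap, at the cost of leaving such files unindexed.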

          richardm Richard McCreadie made changes -
          Status Reopened [ 4 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]

            People

            • Assignee:
              craigm Craig Macdonald
              Reporter:
              craigm Craig Macdonald
            • Watchers:
              1

              Dates

              • Created:
                Updated:
                Resolved: