Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.0, 3.5
    • Fix Version/s: 3.6
    • Component/s: .indexing
    • Labels:
      None

      Description

      PDFBox is now an Apache project. We should upgrade, as the newer releases are likely to include PDF parsing improvements. See also http://terrier.org/forum//read.php?3,1928

        Attachments

        1. PDFDocument.java
          6 kB
          Craig Macdonald
        2. Report.txt
          4 kB
          Rolf Neidhart

          Activity

          craigm Craig Macdonald added a comment -

          Revised PDFDocument for upgraded pdfbox.

          craigm Craig Macdonald added a comment -

          Committed for 3.6

          rolf Rolf Neidhart added a comment -

          This seems to me to be more than just a minor problem. The old PDFBox could not read my PDFs created with Acrobat 9.
          There is another problem: the standard configuration reads the whole PDF into the Java heap space. This causes problems with PDF files larger than 1 gigabyte.
          So I had to modify the source of PDFDocument.
          I will try to attach my detailed report.

          rolf Rolf Neidhart added a comment -

          My way to enable Terrier to parse large PDF documents.

          craigm Craig Macdonald added a comment -

          Thanks Rolf, I'll take a look at this.

          richardm Richard McCreadie added a comment (edited) -

          I looked at some alternative implementations, but it seems that they just rely on the load() method, which reads the entire PDF.

          I have committed a patch that adds a file size check to each PDF when you first try to open the reader; it will skip the file if its size exceeds 300 MB.

          Commit 3747.
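
          As a rough illustration of the kind of size check described above (this is not the code from commit 3747), here is a minimal sketch. The class name PdfSizeGuard, its method names, and the reading of the limit as 300 * 1024 * 1024 bytes are all assumptions made for the example.

```java
import java.io.File;

/** Hypothetical helper for illustration; not the code from commit 3747. */
class PdfSizeGuard {

    /** Assumed limit: 300 MB, interpreted here as 300 * 1024 * 1024 bytes. */
    static final long MAX_PDF_BYTES = 300L * 1024 * 1024;

    /** Returns true if a file of the given size is over the assumed limit. */
    static boolean exceedsLimit(long sizeInBytes) {
        return sizeInBytes > MAX_PDF_BYTES;
    }

    /**
     * Returns true if the path is an existing regular file whose size is
     * within the limit, i.e. it is safe to hand to a parser that loads
     * the whole document into memory.
     */
    static boolean shouldIndex(File pdf) {
        return pdf.isFile() && !exceedsLimit(pdf.length());
    }
}
```

          A caller would test PdfSizeGuard.shouldIndex(file) before opening the reader, and log and skip the file when it returns false. Checking the size on disk avoids opening the document at all, which matters when load() would otherwise pull the whole file into the heap.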


            People

            • Assignee:
              craigm Craig Macdonald
              Reporter:
              craigm Craig Macdonald
            • Watchers:
              1
