[TR-252] Update Apache POI versions to parse newer Word/Excel/PowerPoint files Created: 24/Mar/11  Updated: 04/Apr/14  Resolved: 13/Apr/12

Status: Resolved
Project: Terrier Core
Component/s: None
Affects Version/s: 3.0, 3.5
Fix Version/s: 3.6

Type: Bug Priority: Minor
Reporter: Craig Macdonald Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None

We can't index .xlsx, .docx, .pptx, etc. documents with the Apache POI version we currently bundle, but newer Apache POI releases can parse them.
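
For context, the older .doc/.xls/.ppt files are OLE2 compound documents, whereas .docx/.xlsx/.pptx are ZIP packages, so a parser written for the former cannot read the latter. The following is a minimal, stdlib-only sketch of telling the two containers apart by their leading bytes. It is not Terrier's or Apache POI's code; the class and method names (OfficeContainerSniffer, sniff) are hypothetical, and a ZIP signature only shows that the file is a ZIP archive, not that it is a valid Office document.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper (not part of Terrier or Apache POI): guesses the
// container format of an Office file from its first few bytes.
class OfficeContainerSniffer {

    enum Container { OLE2, ZIP, UNKNOWN }

    // Signature of an OLE2 compound document (.doc, .xls, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Signature of a ZIP local file header, "PK\3\4" (.docx, .xlsx, .pptx).
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    // Classifies a buffer holding the first bytes of a file.
    static Container sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Container.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Container.ZIP;
        }
        return Container.UNKNOWN;
    }

    // Reads up to 8 bytes from the file and classifies them.
    static Container sniff(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[OLE2_MAGIC.length];
            int n = in.readNBytes(buf, 0, buf.length); // requires Java 9+
            return sniff(Arrays.copyOf(buf, n));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            System.out.println(name + " -> " + sniff(Paths.get(name)));
        }
    }
}
```

Running it with one or more file paths as arguments prints the detected container for each. An indexer could use such a check to pick a parser, although relying on the library's own format detection would be the more robust choice.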

Moreover, our .ppt indexing picks up placeholder terms such as "Click here to edit the title", even though this text is not visible in the presentation itself (perhaps it comes from the slide master?).

Finally, Apache POI appears to provide improved interfaces for extracting text from Microsoft Office files. Perhaps we can switch to these newer interfaces.

Comment by Craig Macdonald [ 13/Apr/12 ]

Fixed for 3.6. Unfortunately, the updated POI jar files are very large and significantly increase the size of Terrier's tar distribution.
