[TR-252] Update Apache POI versions to parse newer Word/Excel/PowerPoint files Created: 24/Mar/11  Updated: 04/Apr/14  Resolved: 13/Apr/12

Status: Resolved
Project: Terrier Core
Component/s: None
Affects Version/s: 3.0, 3.5
Fix Version/s: 3.6

Type: Bug Priority: Minor
Reporter: Craig Macdonald Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None


 Description   
We can't index .xlsx, .docx, .pptx, etc. documents, but Apache POI can parse them.

Moreover, our PPT indexing includes terms like "Click here to edit the title", even though this text isn't visible in the presentation itself (is it coming from the slide master?).

Finally, Apache POI seems to have delivered improved interfaces for extracting text from Microsoft Office files. Perhaps we can use their newer interfaces.
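
For context on why the older parsers cannot read the newer files: the legacy .doc/.xls/.ppt formats are OLE2 compound documents, whereas .docx/.xlsx/.pptx are ZIP packages, so the two families start with different byte signatures. The following is only a minimal sketch of telling them apart from the first bytes of a file, using the JDK alone. It does not use Apache POI or reflect how Terrier dispatches parsers, and the class and method names (OfficeFormatSniffer, sniff) are hypothetical.

```java
import java.util.Arrays;

public class OfficeFormatSniffer {
    // Hypothetical labels for the container type; ZIP does not prove the file
    // is an Office document, only that it is a ZIP archive.
    enum Container { OLE2, ZIP, UNKNOWN }

    // Signature of OLE2 compound documents (legacy .doc, .xls, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature "PK\003\004" (.docx, .xlsx, .pptx are ZIP packages).
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    // Classify a file by its leading bytes.
    static Container sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return Container.OLE2;
        if (startsWith(header, ZIP_MAGIC)) return Container.ZIP;
        return Container.UNKNOWN;
    }

    // True if data begins with prefix; false for null or too-short input.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) return false;
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) {
        byte[] zipHeader = { 0x50, 0x4B, 0x03, 0x04, 0x14, 0x00 };
        System.out.println(sniff(zipHeader)); // prints "ZIP"
    }
}
```

A check like this only identifies the container. Actual text extraction from either family would still be left to a library such as Apache POI.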





 Comments   
Comment by Craig Macdonald [ 13/Apr/12 ]

Fixed for 3.6. Unfortunately, these jar files are very large, significantly increasing the size of Terrier's tar.

Generated at Wed Dec 13 09:00:33 GMT 2017 using JIRA 7.1.1#71004-sha1:d6b2c0d9b7051e9fb5e4eb8ce177ca56d91d7bd8.