[TR-252] Update Apache POI versions to parse newer Word/Excel/PowerPoint files Created: 24/Mar/11 Updated: 04/Apr/14 Resolved: 13/Apr/12
|Affects Version/s:||3.0, 3.5|
|Reporter:||Craig Macdonald||Assignee:||Craig Macdonald|
We currently can't index .xlsx, .docx, .pptx, etc. documents, although Apache POI can parse them.
Moreover, our PPT indexing includes placeholder terms such as "Click here to edit the title", even though this text isn't visible in the presentation itself (perhaps it comes from the slide master?).
Finally, Apache POI seems to have delivered improved interfaces for extracting text from Microsoft Office files. Perhaps we can use their newer interfaces.
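For context on why the older parsers miss these files: legacy .doc/.xls/.ppt files are OLE2 compound documents, while .docx/.xlsx/.pptx are ZIP packages, so the two families start with different magic bytes. The following is a minimal sketch of telling them apart before picking an extractor. The class, enum, and method names are hypothetical and are not part of Terrier or Apache POI; only the two byte signatures are standard.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper, not Terrier or POI code: classifies a file by its leading bytes.
class OfficeFormatSniffer {
    // ZIP_BASED only means "starts like a ZIP"; it may or may not be an OOXML package.
    enum Container { OLE2, ZIP_BASED, UNKNOWN }

    // Signature of an OLE2 compound document (legacy .doc/.xls/.ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Signature of a ZIP local file header ("PK\3\4"), used by .docx/.xlsx/.pptx.
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    static Container classify(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return Container.OLE2;
        if (startsWith(header, ZIP_MAGIC)) return Container.ZIP_BASED;
        return Container.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    // Reads up to n bytes; returns fewer if the stream ends early.
    static byte[] readHeader(InputStream in, int n) throws IOException {
        byte[] buf = new byte[n];
        int off = 0;
        while (off < n) {
            int r = in.read(buf, off, n - off);
            if (r < 0) break;
            off += r;
        }
        return Arrays.copyOf(buf, off);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java OfficeFormatSniffer <file>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println(args[0] + ": " + classify(readHeader(in, 8)));
        }
    }
}
```

A file classified as ZIP_BASED would still need to be opened as a package to confirm it is a Word, Excel, or PowerPoint document; POI performs its own detection, so this is illustrative only.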
|Comment by Craig Macdonald [ 13/Apr/12 ]|
Fixed for 3.6. Unfortunately, the newer POI jar files are very large and significantly increase the size of Terrier's tar.