[TR-383] Default constructors don't work for WARC Collection implementations Created: 08/Feb/16  Updated: 08/Feb/16  Resolved: 08/Feb/16

Status: Resolved
Project: Terrier Core
Component/s: .indexing
Affects Version/s: 4.1
Fix Version/s: 4.2

Type: Bug Priority: Minor
Reporter: Hideo Joho Assignee: Craig Macdonald
Resolution: Fixed  
Labels: None

Attachments: Text File TR-383.patch    

 Description   
Hi

We're trying to index the ClueWeb12 collection using Terrier 4.1. When we try to index a single warc.gz file, we get the following error.

---
[terrier@xxxx terrier]$ ./bin/trec_terrier.sh -i -j
Setting TERRIER_HOME to /home/terrier/terrier
Picked up _JAVA_OPTIONS: -Xmx8192M
Starting building the inverted file ...
15:28:28.191 [main] INFO o.t.structures.indexing.Indexer - Checking memory usage every 20 maxDocPerFlush=0
15:28:28.200 [main] INFO o.t.structures.indexing.Indexer - creating the data structures data_1
15:28:28.200 [main] INFO o.t.structures.indexing.Indexer - Creating IF (no direct file)..
A problem occurred: java.lang.NullPointerException
java.lang.NullPointerException
at org.terrier.indexing.WARC018Collection.readLine(WARC018Collection.java:234)
at org.terrier.indexing.WARC10Collection.nextDocument(WARC10Collection.java:69)
at org.terrier.structures.indexing.singlepass.BasicSinglePassIndexer.createInvertedIndex(BasicSinglePassIndexer.java:200)
at org.terrier.structures.indexing.singlepass.BasicSinglePassIndexer.createDirectIndex(BasicSinglePassIndexer.java:159)
at org.terrier.structures.indexing.Indexer.index(Indexer.java:348)
at org.terrier.applications.TRECIndexing.createSinglePass(TRECIndexing.java:272)
at org.terrier.applications.TrecTerrier.run(TrecTerrier.java:402)
at org.terrier.applications.TrecTerrier.applyOptions(TrecTerrier.java:599)
at org.terrier.applications.TrecTerrier.main(TrecTerrier.java:246)
[terrier@ibq0 terrier]$
---

Any idea?

Here are some details of our setup.

[terrier@xxxx terrier]$ cat etc/terrier.properties
trec.collection.class=WARC10Collection
indexer.meta.forward.keys=docno,url
indexer.meta.forward.keylens=26,256
indexer.meta.reverse.keys=docno
TrecDocTags.skip=SCRIPT,STYLE

[terrier@xxxx terrier]$ cat etc/collection.spec
#add the files to index
/home/terrier/0000tw-00.warc.gz

[terrier@xxxx terrier]$ echo $JAVA_HOME
/opt/jdk1.7.0_60/

[terrier@xxxx terrier]$ java -version
Picked up _JAVA_OPTIONS: -Xmx8192M
java version "1.7.0_60"
Java(TM) SE Runtime Environment (build 1.7.0_60-b19)
Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode)


Thanks,

Hideo

 Comments   
Comment by Craig Macdonald [ 08/Feb/16 ]

Yes. Minor bug.

On line 47, in the default constructor of WARC10Collection, change super() to this(ApplicationSetup.COLLECTION_SPEC);

A new test case will be added in the next release.

Craig
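
For readers without the attached patch, the one-line change described above can be illustrated with a minimal, self-contained sketch. It is not the Terrier source: BaseCollection, BrokenCollection, FixedCollection and the COLLECTION_SPEC constant are hypothetical stand-ins, and the assumption that the parent's no-argument constructor leaves the file list unset is only one plausible reading of the stack trace in the description.

```java
import java.util.Arrays;
import java.util.Iterator;

public class ConstructorDelegationSketch {

    // Hypothetical stand-in for ApplicationSetup.COLLECTION_SPEC.
    static final String COLLECTION_SPEC = "etc/collection.spec";

    // Hypothetical stand-in for the parent collection class.
    static class BaseCollection {
        // Assumption: this stays null unless a collection spec is loaded.
        protected Iterator<String> files;

        BaseCollection() {
            // Does not load any collection spec.
        }

        BaseCollection(String specPath) {
            // A real implementation would read file names from specPath;
            // the sketch hard-codes one entry to stay self-contained.
            this.files = Arrays.asList("0000tw-00.warc.gz").iterator();
        }

        boolean hasNextFile() {
            // Throws NullPointerException when files was never initialised.
            return files.hasNext();
        }
    }

    // Before the fix: the default constructor calls super(), so no spec is loaded.
    static class BrokenCollection extends BaseCollection {
        BrokenCollection() { super(); }
        BrokenCollection(String specPath) { super(specPath); }
    }

    // After the fix: the default constructor delegates to the spec-taking one.
    static class FixedCollection extends BaseCollection {
        FixedCollection() { this(COLLECTION_SPEC); }
        FixedCollection(String specPath) { super(specPath); }
    }

    // Reports whether using the collection fails with a NullPointerException.
    static boolean throwsNpe(BaseCollection c) {
        try {
            c.hasNextFile();
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("broken throws NPE: " + throwsNpe(new BrokenCollection()));
        System.out.println("fixed throws NPE: " + throwsNpe(new FixedCollection()));
    }
}
```

The point of the sketch is only the delegation pattern: when a class is instantiated through its no-argument constructor (as happens when a class name is given in a properties file), that constructor has to route through the one that performs the setup.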

Comment by Craig Macdonald [ 08/Feb/16 ]

Patch attached.

Comment by Hideo Joho [ 08/Feb/16 ]

Solved the problem. Thanks!

Hideo

Comment by Craig Macdonald [ 08/Feb/16 ]
