Terrier Users :  Terrier Forum terrier.org
General discussion about using/developing applications using Terrier 
Why are there so many empty documents when indexing GOV2 with Terrier-2.2.1?
Posted by: deeper2
Date: December 06, 2017 05:13AM

WARN - Adding empty document GX003-28-16136758
INFO - Processing D:\codes\corpus\DOTGOV2\GX003\29.gz
WARN - Adding empty document GX003-29-0949202
WARN - Adding empty document GX003-29-5894824
WARN - Adding empty document GX003-29-12684533
WARN - Adding empty document GX003-29-14760080
WARN - Adding empty document GX003-29-14876623
INFO - Processing D:\codes\corpus\DOTGOV2\GX003\30.gz
WARN - Adding empty document GX003-30-2792609
WARN - Adding empty document GX003-30-3327218
WARN - Adding empty document GX003-30-3390050
WARN - Adding empty document GX003-30-5439668
WARN - Adding empty document GX003-30-5763777
WARN - Adding empty document GX003-30-12853506
INFO - Processing D:\codes\corpus\DOTGOV2\GX003\31.gz
INFO - Processing D:\codes\corpus\DOTGOV2\GX003\32.gz
WARN - Adding empty document GX003-32-14971151
INFO - Processing D:\codes\corpus\DOTGOV2\GX003\33.gz
WARN - Adding empty document GX003-33-9858287
INFO - Processing D:\codes\corpus\DOTGOV2\GX003\34.gz
WARN - Adding empty document GX003-34-10861992
INFO - Processing D:\codes\corpus\DOTGOV2\GX003\35.gz
WARN - Adding empty document GX003-35-6481663
WARN - Adding empty document GX003-35-13164728
INFO - Processing D:\codes\corpus\DOTGOV2\GX003\36.gz
WARN - Adding empty document GX003-36-3859427
WARN - Adding empty document GX003-36-14057384
WARN - Adding empty document GX003-36-16549948
INFO - Processing D:\codes\corpus\DOTGOV2\GX003\37.gz
WARN - Adding empty document GX003-37-3550765
WARN - Adding empty document GX003-37-6011084
WARN - Adding empty document GX003-37-9313499
WARN - Adding empty document GX003-37-14964062
INFO - Processing D:\codes\corpus\DOTGOV2\GX003\38.gz
INFO - Processing D:\codes\corpus\DOTGOV2\GX003\39.gz
WARN - Adding empty document GX003-39-4984126
WARN - Adding empty document GX003-39-9583730
INFO - Processing D:\codes\corpus\DOTGOV2\GX003\40.gz
WARN - Adding empty document GX003-40-14224645
WARN - Adding empty document GX003-40-15578239
INFO - Processing D:\codes\corpus\DOTGOV2\GX003\41.gz
WARN - Adding empty document GX003-41-4619954
WARN - Adding empty document GX003-41-8375790
INFO - Processing D:\codes\corpus\DOTGOV2\GX003\42.gz
WARN - Adding empty document GX003-42-0496623
WARN - Adding empty document GX003-42-8166667
INFO - Processing D:\codes\corpus\DOTGOV2\GX003\43.gz
WARN - Adding empty document GX003-43-8184177
WARN - Adding empty document GX003-43-13139166
INFO - Processing D:\codes\corpus\DOTGOV2\GX003\44.gz

Re: Why are there so many empty documents when indexing GOV2 with Terrier-2.2.1?
Posted by: craigm
Date: December 07, 2017 06:06PM

Have you looked inside any of those documents? Often they are just iframes or the like.
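
A quick way to check is to pull one of the flagged documents out of its bundle and see whether any text is left once the markup is removed. The following is only a rough sketch: it assumes the GOV2 bundles are gzipped files in the usual TREC web layout (<DOC>, <DOCNO> and <DOCHDR> tags), the function names are made up for illustration, and the tag stripping is a crude stand-in, not Terrier's own parser.

```python
import gzip
import re

# Assumption: each bundle holds <DOC>...</DOC> records with a <DOCNO> inside.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>")


def find_document(bundle_path, docno):
    """Return the raw <DOC> record with the given DOCNO, or None if absent."""
    # latin-1 maps every byte to a character, so decoding cannot fail.
    # The whole bundle is read into memory, which is fine for a spot check.
    with gzip.open(bundle_path, "rt", encoding="latin-1") as f:
        data = f.read()
    for match in DOC_RE.finditer(data):
        record = match.group(1)
        found = DOCNO_RE.search(record)
        if found and found.group(1) == docno:
            return record
    return None


def body_after_header(record):
    """Drop everything up to </DOCHDR> (the stored HTTP headers)."""
    _, sep, rest = record.partition("</DOCHDR>")
    return rest if sep else record


def visible_text(html):
    """Crude tag stripping: remove script/style blocks, then all tags."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return " ".join(text.split())


# Example call, using a path and DOCNO taken from the log above:
# record = find_document(r"D:\codes\corpus\DOTGOV2\GX003\29.gz", "GX003-29-0949202")
# if record is not None:
#     print(repr(visible_text(body_after_header(record))))
```

If the printed string is empty, the page has no text outside its tags (for example a frameset or a lone iframe), which would be consistent with the warning.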

Also, why are you still using Terrier-2.2.1 so many years later?

Craig

Re: Why are there so many empty documents when indexing GOV2 with Terrier-2.2.1?
Posted by: deeper2
Date: January 03, 2018 12:27AM

Thank you!
It is easy to stick to this version because we have used it for a long time.
