• Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0
    • Fix Version/s: None
    • Component/s: .indexing
    • Labels:


      I used Terrier 4.0 to index the CLEF 2015 collection (around 1.1 million English-language documents from the medical domain). I did not use a stopword list or any special tokeniser, just the default "English" tokeniser, and I did not use a stemmer.

       Then I wanted to run some experiments with IDF scores for the terms.

      I wrote a Java API-based application to do so, but the results were very strange, e.g. very high scores for terms like 'the' and 'for'. I thought there might be bugs in my code, so I printed the lexicon, grepped for some terms, and got:

      the,term180579 Nt=367 TF=368 @{0 364896201 7} TFf=0,368
      cancer,term9819 Nt=317598 TF=2756499 @{0 80013017 3} TFf=65701,2690798
      medical,term57084 Nt=629 TF=1095 @{0 237510927 3} TFf=0,1095

      Do these values mean that the number of documents containing "medical" or "cancer" is much higher than the number of documents containing "the"?
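
      To show how the strange scores follow from these lexicon values, here is a minimal sketch that does not use the Terrier API at all. It parses the printed lexicon lines above and computes a plain IDF, log2(N / Nt). The class name `LexiconIdf`, its helper methods, the regular expression, and the collection size N = 1,100,000 (taken from "around 1.1 million") are assumptions made for illustration; Terrier's own weighting models may use a different IDF formula.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Terrier: reads printed lexicon lines
// and derives a plain log2(N / Nt) IDF from the Nt field.
public class LexiconIdf {

    // Matches the start of a printed lexicon line, e.g.
    // "the,term180579 Nt=367 TF=368 @{0 364896201 7} TFf=0,368"
    private static final Pattern ENTRY =
        Pattern.compile("^\\s*([^,\\s]+),term\\d+\\s+Nt=(\\d+)\\s+TF=(\\d+)");

    /** Returns the term of a lexicon line, or null if the line does not match. */
    public static String term(String line) {
        Matcher m = ENTRY.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    /** Returns Nt (document frequency) of a lexicon line, or -1 if it does not match. */
    public static long documentFrequency(String line) {
        Matcher m = ENTRY.matcher(line);
        return m.find() ? Long.parseLong(m.group(2)) : -1L;
    }

    /** Plain IDF: log base 2 of (number of documents / document frequency). */
    public static double idf(long numDocs, long nt) {
        return Math.log((double) numDocs / nt) / Math.log(2.0);
    }

    public static void main(String[] args) {
        // ASSUMPTION: approximate collection size, from "around 1.1 million documents".
        long numDocs = 1_100_000L;

        String[] lines = {
            "the,term180579 Nt=367 TF=368 @{0 364896201 7} TFf=0,368",
            "cancer,term9819 Nt=317598 TF=2756499 @{0 80013017 3} TFf=65701,2690798",
            "medical,term57084 Nt=629 TF=1095 @{0 237510927 3} TFf=0,1095"
        };

        for (String line : lines) {
            long nt = documentFrequency(line);
            System.out.printf("%-8s Nt=%-7d idf=%.2f%n", term(line), nt, idf(numDocs, nt));
        }
    }
}
```

      With these inputs, 'the' gets the largest IDF of the three terms because it has the smallest Nt. That matches the high scores I saw, so the unexpected values seem to come from the lexicon statistics themselves rather than from my scoring code.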

      Thanks for your help.



          There are no comments yet on this issue.


            • Assignee:
              craigm Craig Macdonald
              shadisaleh shadi saleh


              • Created: