Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.0
    • Fix Version/s: None
    • Component/s: .indexing
    • Labels:
      None

      Description

      I used Terrier 4.0 to index the CLEF 2015 collection (around 1.1 million English-language documents from the medical domain). I did not use a stopword list, a stemmer, or any special tokeniser, just the default "English" tokeniser.

      I then wanted to run some experiments with idf scores for the terms.

      I wrote a Java API-based application to do this, but the results were very strange (e.g. very high scores for terms like 'the' and 'for'). I suspected a bug in my own code, so I printed the lexicon and grepped for some terms. This is what I got:

      the,term180579 Nt=367 TF=368 @{0 364896201 7} TFf=0,368
      cancer,term9819 Nt=317598 TF=2756499 @{0 80013017 3} TFf=65701,2690798
      medical,term57084 Nt=629 TF=1095 @{0 237510927 3} TFf=0,1095


      Do these values mean that the number of documents containing "medical" or "cancer" is much higher than the number of documents containing "the"?
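
      To make the question concrete, here is a minimal sketch that reads Nt from the lexicon lines above and computes a textbook idf, log(N / Nt). It assumes that Nt is the number of documents containing the term, and it takes N = 1,100,000 as a rough stand-in for "around 1.1 million documents". The class and method names are hypothetical, and it does not call the Terrier API, whose weighting models may define idf differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LexiconIdf {
    // Matches printed lexicon lines such as "the,term180579 Nt=367 TF=368 ..."
    private static final Pattern ENTRY =
        Pattern.compile("^\\s*([^,]+),term\\d+ Nt=(\\d+) TF=(\\d+)");

    /** Returns the term of one lexicon line, or null if the line does not match. */
    static String parseTerm(String line) {
        Matcher m = ENTRY.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    /** Returns the Nt value of one lexicon line, or -1 if the line does not match. */
    static long parseDocFreq(String line) {
        Matcher m = ENTRY.matcher(line);
        return m.find() ? Long.parseLong(m.group(2)) : -1L;
    }

    /** Textbook idf (natural log of N / Nt); not necessarily the variant Terrier uses. */
    static double idf(long numDocs, long docFreq) {
        return Math.log((double) numDocs / docFreq);
    }

    public static void main(String[] args) {
        // ASSUMPTION: approximate collection size taken from the description.
        long numDocs = 1_100_000L;
        String[] lines = {
            "the,term180579 Nt=367 TF=368 @{0 364896201 7} TFf=0,368",
            "cancer,term9819 Nt=317598 TF=2756499 @{0 80013017 3} TFf=65701,2690798",
            "medical,term57084 Nt=629 TF=1095 @{0 237510927 3} TFf=0,1095"
        };
        for (String line : lines) {
            long nt = parseDocFreq(line);
            System.out.printf("%-8s Nt=%d idf=%.3f%n", parseTerm(line), nt, idf(numDocs, nt));
        }
    }
}
```

      With the values printed above, 'the' (Nt=367) gets a much larger idf than 'cancer' (Nt=317598), which matches the strange scores I saw.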


      Thanks for your help.


          Activity

          There are no comments yet on this issue.

            People

            • Assignee:
              craigm Craig Macdonald
            • Reporter:
              shadisaleh shadi saleh
            • Watchers:
              0
