Terrier IR Platform
1.1.1

uk.ac.gla.terrier.utility
Class TagSet

java.lang.Object
  extended by uk.ac.gla.terrier.utility.TagSet

public class TagSet
extends java.lang.Object

A class that models a set of tags to process (white list), a set of tags to skip (black list), a tag that is used as a document delimiter, and a tag whose contents are used as a unique identifier. The text within any tag encountered within the scope of a white-listed tag is processed by default, unless that tag is explicitly black listed.
For example, to index all the text within the DOC tag of a document from a typical TREC collection, without indexing the contents of the DOCHDR tag, we could define the following properties in the properties file:
TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
TrecDocTags.process=
TrecDocTags.skip=DOCHDR

In the source code, we would create an instance of the class as follows:
TagSet TrecIndexToProcess = new TagSet("TrecDocTags");
All tags are converted to uppercase before they are checked against the specified sets of tags, so tag matching is case-insensitive.
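
The white list and black list rule described above can be sketched as follows. This is a minimal illustration, not Terrier's implementation: the class TagFilterSketch and its constructor are hypothetical, and the sketch assumes that an empty white list means "process every tag that is not black listed", which is how the TrecDocTags example above reads.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for illustration only; this is not Terrier's TagSet. */
class TagFilterSketch {
    private final Set<String> whiteList = new HashSet<>();
    private final Set<String> blackList = new HashSet<>();

    /** Both arguments are comma-separated tag lists; either may be empty. */
    TagFilterSketch(String process, String skip) {
        addAll(whiteList, process);
        addAll(blackList, skip);
    }

    /** Splits a comma-separated list and stores each non-empty tag in uppercase. */
    private static void addAll(Set<String> target, String commaSeparated) {
        for (String tag : commaSeparated.split(",")) {
            String trimmed = tag.trim();
            if (!trimmed.isEmpty()) {
                target.add(trimmed.toUpperCase(Locale.ROOT));
            }
        }
    }

    boolean hasWhitelist() {
        return !whiteList.isEmpty();
    }

    boolean isTagToSkip(String tag) {
        return blackList.contains(tag.toUpperCase(Locale.ROOT));
    }

    /** Assumption: the black list always wins, and an empty white list accepts any tag. */
    boolean isTagToProcess(String tag) {
        String upper = tag.toUpperCase(Locale.ROOT);
        if (blackList.contains(upper)) {
            return false;
        }
        return whiteList.isEmpty() || whiteList.contains(upper);
    }

    public static void main(String[] args) {
        // Mirrors TrecDocTags.process= (empty) and TrecDocTags.skip=DOCHDR.
        TagFilterSketch tags = new TagFilterSketch("", "DOCHDR");
        System.out.println(tags.isTagToProcess("text"));   // prints "true"
        System.out.println(tags.isTagToProcess("dochdr")); // prints "false"
    }
}
```

Because tags are normalised to uppercase on both insertion and lookup, "dochdr", "DocHdr" and "DOCHDR" are treated as the same tag.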

Version:
$Revision: 1.20 $
Author:
Vassilis Plachouras, Craig Macdonald

Field Summary
static java.lang.String EMPTY_TAGS
          A prefix for an empty set of tags, that is a set of tags that are not defined in the properties file.
static java.lang.String FIELD_TAGS
          The prefix for the tags to consider as fields, during indexing.
static java.lang.String TREC_DOC_TAGS
          The prefix for the TREC document tags.
static java.lang.String TREC_EXACT_DOC_TAGS
          The prefix for the TREC document exact tags.
static java.lang.String TREC_QUERY_TAGS
          The prefix for the TREC topic tags.
 
Constructor Summary
TagSet(java.lang.String prefix)
          Constructs the tag set for the given prefix, by reading the corresponding properties from the properties file.
 
Method Summary
 java.lang.String getDocTag()
          Returns the document delimiter tag.
 java.lang.String getIdTag()
          Returns the id tag.
 java.lang.String getTagsToProcess()
          Returns a comma-separated list of tags to process.
 java.lang.String getTagsToSkip()
          Returns a comma-separated list of tags to skip.
 boolean hasWhitelist()
          Returns true if the white list is not empty (whiteListSize > 0).
 boolean isDocTag(java.lang.String tag)
          Checks whether the given tag indicates the limits of a document.
 boolean isIdTag(java.lang.String tag)
          Checks whether the given tag is a unique identifier tag, that is the document number of a document, or the identifier of a topic.
 boolean isTagToProcess(java.lang.String tag)
          Checks whether the tag should be processed.
 boolean isTagToSkip(java.lang.String tag)
          Checks whether a tag should be skipped.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

EMPTY_TAGS

public static final java.lang.String EMPTY_TAGS
A prefix for an empty set of tags, that is a set of tags that are not defined in the properties file.

See Also:
Constant Field Values

TREC_DOC_TAGS

public static final java.lang.String TREC_DOC_TAGS
The prefix for the TREC document tags. The corresponding properties in the setup file should start with TrecDocTags.

See Also:
Constant Field Values

TREC_EXACT_DOC_TAGS

public static final java.lang.String TREC_EXACT_DOC_TAGS
The prefix for the TREC document exact tags. The corresponding properties in the setup file should start with TrecExactDocTags.

See Also:
Constant Field Values

TREC_QUERY_TAGS

public static final java.lang.String TREC_QUERY_TAGS
The prefix for the TREC topic tags. The corresponding properties in the setup file should start with TrecQueryTags.

See Also:
Constant Field Values

FIELD_TAGS

public static final java.lang.String FIELD_TAGS
The prefix for the tags to consider as fields, during indexing. The corresponding properties in the setup file should start with FieldTags.

See Also:
Constant Field Values

Constructor Detail

TagSet

public TagSet(java.lang.String prefix)
Constructs the tag set for the given prefix, by reading the corresponding properties from the properties file.

Parameters:
prefix - the common prefix of the properties to read.
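
To make the role of the prefix concrete, the sketch below composes property keys from a prefix and the suffixes used in the class description (doctag, idtag, process, skip). It is only an illustration built on java.util.Properties: the class PrefixedPropertiesSketch and its methods are hypothetical, and how Terrier itself locates and reads its properties file is not covered here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper for illustration only; not part of Terrier. */
class PrefixedPropertiesSketch {

    /** Parses properties-file text held in a string. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Looks up "prefix.suffix", returning "" when the key is absent. */
    static String lookup(Properties props, String prefix, String suffix) {
        return props.getProperty(prefix + "." + suffix, "").trim();
    }

    public static void main(String[] args) {
        Properties props = parse(
                "TrecDocTags.doctag=DOC\n"
              + "TrecDocTags.idtag=DOCNO\n"
              + "TrecDocTags.process=\n"
              + "TrecDocTags.skip=DOCHDR\n");
        String prefix = "TrecDocTags";
        System.out.println("doctag  = " + lookup(props, prefix, "doctag"));
        System.out.println("idtag   = " + lookup(props, prefix, "idtag"));
        System.out.println("process = " + lookup(props, prefix, "process"));
        System.out.println("skip    = " + lookup(props, prefix, "skip"));
    }
}
```

With this layout, switching from document tags to topic tags only requires a different prefix, for example TrecQueryTags, while the suffixes stay the same.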

Method Detail

hasWhitelist

public boolean hasWhitelist()
Checks whether the white list contains at least one tag, that is whether whiteListSize > 0.

Returns:
true if whiteListSize > 0, otherwise false.

isTagToProcess

public boolean isTagToProcess(java.lang.String tag)
Checks whether the tag should be processed.

Parameters:
tag - String the tag to check.
Returns:
boolean true if the tag should be processed

isTagToSkip

public boolean isTagToSkip(java.lang.String tag)
Checks whether a tag should be skipped. Prefer isTagToProcess, as it checks both the white list and the black list.

Parameters:
tag - the tag to check.
Returns:
true if the tag should be skipped, otherwise it returns false.

isIdTag

public boolean isIdTag(java.lang.String tag)
Checks whether the given tag is a unique identifier tag, that is the document number of a document, or the identifier of a topic.

Parameters:
tag - String the tag to check.
Returns:
boolean true if the tag is an identifier tag, otherwise it returns false.

isDocTag

public boolean isDocTag(java.lang.String tag)
Checks whether the given tag indicates the limits of a document.

Parameters:
tag - String the tag to check.
Returns:
boolean true if the tag is a document delimiter tag, otherwise it returns false.
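
As an illustration of what the document delimiter tag and the identifier tag represent, the sketch below pulls the identifier out of each document in a small TREC-style string. It is a simplified, regex-based example under stated assumptions: the class DocDelimiterSketch and the method documentIds are hypothetical, tags are assumed to carry no attributes, and this is not how Terrier parses collections.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical example for illustration only; not part of Terrier. */
class DocDelimiterSketch {

    /** Builds a pattern matching <tag>...</tag>, ignoring case, across line breaks. */
    private static Pattern element(String tag) {
        String quoted = Pattern.quote(tag);
        return Pattern.compile("<" + quoted + ">(.*?)</" + quoted + ">",
                Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    }

    /** Returns the trimmed contents of the first idTag element inside each docTag element. */
    static List<String> documentIds(String text, String docTag, String idTag) {
        Pattern docPattern = element(docTag);
        Pattern idPattern = element(idTag);
        List<String> ids = new ArrayList<>();
        Matcher doc = docPattern.matcher(text);
        while (doc.find()) {
            // Search for the identifier only within the current document's scope.
            Matcher id = idPattern.matcher(doc.group(1));
            if (id.find()) {
                ids.add(id.group(1).trim());
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        String collection =
                "<DOC>\n<DOCNO> D1 </DOCNO>\n<DOCHDR>header</DOCHDR>\nbody text\n</DOC>\n"
              + "<doc><docno>D2</docno>more text</doc>";
        System.out.println(documentIds(collection, "DOC", "DOCNO")); // prints "[D1, D2]"
    }
}
```

The case-insensitive patterns play the same role here as the uppercase conversion described in the class overview.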

getTagsToProcess

public java.lang.String getTagsToProcess()
Returns a comma-separated list of tags to process.

Returns:
String the tags to process

getTagsToSkip

public java.lang.String getTagsToSkip()
Returns a comma-separated list of tags to skip.

Returns:
String the tags to skip

getIdTag

public java.lang.String getIdTag()
Returns the id tag.

Returns:
String the id tag

getDocTag

public java.lang.String getDocTag()
Returns the document delimiter tag.

Returns:
String the document delimiter tag

Terrier Information Retrieval Platform 1.1.1. Copyright 2004-2007 University of Glasgow