org.terrier.utility
Class TagSet

java.lang.Object
  extended by org.terrier.utility.TagSet

public class TagSet
extends Object

A class that models a set of tags to process (white list), a set of tags to skip (black list), a tag that is used as a document delimiter, and a tag whose contents are used as a unique identifier. The text within any tag encountered within the scope of a white-listed tag is processed by default, unless that tag is explicitly black listed.
For example, to index all the text within the DOC tag of a document from a typical TREC collection, without indexing the contents of the DOCHDR tag, we could define the following properties in the properties file:
TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
TrecDocTags.process=
TrecDocTags.skip=DOCHDR
TrecDocTags.casesensitive=false

In the source code, we would create an instance of the class as follows:
TagSet TrecIndexToProcess = new TagSet("TrecDocTags");
All the tags are converted to uppercase, in order to check whether they belong to the specified set of tags.
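
As a rough illustration of how the example properties above might be turned into tag sets, the following sketch loads them with java.util.Properties and splits the comma-separated lists. It is a simplified stand-in, not Terrier's implementation: the class TagSetPropertiesSketch and the methods parseTags and exampleProperties are hypothetical names, and the use of java.util.Properties in place of Terrier's own configuration loading is an assumption made so the example runs on its own.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Properties;
import java.util.Set;

// Hypothetical stand-in for illustration only; not Terrier's TagSet.
class TagSetPropertiesSketch {

    /** Splits a comma-separated tag list, uppercasing tags when not case-sensitive. */
    static Set<String> parseTags(String commaSeparated, boolean caseSensitive) {
        Set<String> tags = new LinkedHashSet<>();
        for (String raw : commaSeparated.split(",")) {
            String tag = raw.trim();
            if (tag.isEmpty()) {
                continue; // an empty property value yields an empty set
            }
            tags.add(caseSensitive ? tag : tag.toUpperCase(Locale.ROOT));
        }
        return tags;
    }

    /** Loads the example TrecDocTags properties from an in-memory string. */
    static Properties exampleProperties() {
        String text = String.join("\n",
                "TrecDocTags.doctag=DOC",
                "TrecDocTags.idtag=DOCNO",
                "TrecDocTags.process=",
                "TrecDocTags.skip=DOCHDR",
                "TrecDocTags.casesensitive=false");
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = exampleProperties();
        String prefix = "TrecDocTags";
        boolean caseSensitive =
                Boolean.parseBoolean(props.getProperty(prefix + ".casesensitive", "true"));
        Set<String> process = parseTags(props.getProperty(prefix + ".process", ""), caseSensitive);
        Set<String> skip = parseTags(props.getProperty(prefix + ".skip", ""), caseSensitive);

        System.out.println("doc tag:  " + props.getProperty(prefix + ".doctag"));
        System.out.println("id tag:   " + props.getProperty(prefix + ".idtag"));
        System.out.println("process:  " + process);
        System.out.println("skip:     " + skip);
    }
}
```

In this sketch an empty process property simply produces an empty white list, which matches the example configuration where only the skip list is populated.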

Author:
Vassilis Plachouras, Craig Macdonald

Field Summary
protected  HashSet<String> blackList
          The set of tags to skip.
protected  String blackListTags
          A comma separated list of tags to skip.
protected  boolean caseSensitive
          Whether this TagSet is case sensitive.
protected  String docTag
          The tag that is used for denoting the beginning of a document.
static String EMPTY_TAGS
          A prefix for an empty set of tags, that is a set of tags that are not defined in the properties file.
static String FIELD_TAGS
          The prefix for the tags to consider as fields, during indexing.
protected  String idTag
          The tag that is used as a unique identifier.
static String TREC_DOC_TAGS
          The prefix for the TREC document tags.
static String TREC_EXACT_DOC_TAGS
          The prefix for the TREC document exact tags.
static String TREC_PROPERTY_TAGS
          The prefix for the TREC property tags.
static String TREC_QUERY_TAGS
          The prefix for the TREC topic tags.
protected  HashSet<String> whiteList
          The set of tags to process.
protected  int whiteListSize
          The size of the whiteList hash set.
protected  String whiteListTags
          A comma separated list of tags to process.
 
Constructor Summary
TagSet(String prefix)
          Constructs the tag set for the given prefix, by reading the corresponding properties from the properties file.
 
Method Summary
 String getDocTag()
          Return the document delimiter tag.
 String getIdTag()
          Return the id tag.
 String getTagsToProcess()
          Returns a comma separated list of tags to process
 String getTagsToSkip()
          Returns a comma separated list of tags to skip
 boolean hasWhitelist()
          Returns true if whiteListSize > 0.
 boolean isCaseSensitive()
          Returns true if this tag set has been specified as case-sensitive
 boolean isDocTag(String tag)
          Checks whether the given tag indicates the limits of a document.
 boolean isIdTag(String tag)
          Checks whether the given tag is a unique identifier tag, that is the document number of a document, or the identifier of a topic.
 boolean isTagToProcess(String tag)
          Checks whether the tag should be processed.
 boolean isTagToSkip(String tag)
          Checks whether a tag should be skipped.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

EMPTY_TAGS

public static final String EMPTY_TAGS
A prefix for an empty set of tags, that is a set of tags that are not defined in the properties file.

See Also:
Constant Field Values

TREC_DOC_TAGS

public static final String TREC_DOC_TAGS
The prefix for the TREC document tags. The corresponding properties in the setup file should start with TrecDocTags.

See Also:
Constant Field Values

TREC_EXACT_DOC_TAGS

public static final String TREC_EXACT_DOC_TAGS
The prefix for the TREC document exact tags. The corresponding properties in the setup file should start with TrecExactDocTags.

See Also:
Constant Field Values

TREC_QUERY_TAGS

public static final String TREC_QUERY_TAGS
The prefix for the TREC topic tags. The corresponding properties in the setup file should start with TrecQueryTags.

See Also:
Constant Field Values

TREC_PROPERTY_TAGS

public static final String TREC_PROPERTY_TAGS
The prefix for the TREC property tags. The corresponding properties in the setup file should start with TrecPropertyTags.

See Also:
Constant Field Values

FIELD_TAGS

public static final String FIELD_TAGS
The prefix for the tags to consider as fields, during indexing. The corresponding properties in the setup file should start with FieldTags.

See Also:
Constant Field Values

whiteList

protected HashSet<String> whiteList
The set of tags to process.


whiteListSize

protected final int whiteListSize
The size of the whiteList hash set.


whiteListTags

protected String whiteListTags
A comma separated list of tags to process.


blackList

protected HashSet<String> blackList
The set of tags to skip.


blackListTags

protected String blackListTags
A comma separated list of tags to skip.


idTag

protected String idTag
The tag that is used as a unique identifier.


docTag

protected String docTag
The tag that is used for denoting the beginning of a document.


caseSensitive

protected boolean caseSensitive
Whether this TagSet is case sensitive. Defaults to true for all sets except TrecDocTags.

Constructor Detail

TagSet

public TagSet(String prefix)
Constructs the tag set for the given prefix, by reading the corresponding properties from the properties file.

Parameters:
prefix - the common prefix of the properties to read.

Method Detail

hasWhitelist

public boolean hasWhitelist()
Returns true if whiteListSize > 0.

Returns:
true if whiteListSize > 0, otherwise false.

isTagToProcess

public boolean isTagToProcess(String tag)
Checks whether the tag should be processed.

Parameters:
tag - String the tag to check.
Returns:
boolean true if the tag should be processed

isTagToSkip

public boolean isTagToSkip(String tag)
Checks whether a tag should be skipped. Prefer isTagToProcess, as it checks both the whitelist and the blacklist.

Parameters:
tag - the tag to check.
Returns:
true if the tag should be skipped, otherwise it returns false.
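
The interplay between the white list and the black list described in the class overview can be sketched as a decision over the stack of currently open tags. This is only one possible reading, not Terrier's implementation: the class TagScopeSketch and its methods inWhiteList, inBlackList and shouldIndex are hypothetical, tags are compared case-insensitively, and an empty white list is assumed to mean that all text is in scope unless black listed.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for illustration only; not Terrier's TagSet.
class TagScopeSketch {
    private final Set<String> whiteList;
    private final Set<String> blackList;

    TagScopeSketch(Set<String> whiteList, Set<String> blackList) {
        this.whiteList = whiteList;
        this.blackList = blackList;
    }

    /** Convenience helper that builds a set of uppercase tag names. */
    static Set<String> setOf(String... tags) {
        Set<String> set = new HashSet<>();
        for (String tag : tags) {
            set.add(tag.toUpperCase(Locale.ROOT));
        }
        return set;
    }

    boolean inWhiteList(String tag) {
        return whiteList.contains(tag.toUpperCase(Locale.ROOT));
    }

    boolean inBlackList(String tag) {
        return blackList.contains(tag.toUpperCase(Locale.ROOT));
    }

    /**
     * Decides whether text nested under the given open tags (outermost first)
     * would be indexed: any black-listed ancestor suppresses it; otherwise it
     * is indexed once a white-listed ancestor is open.
     */
    boolean shouldIndex(List<String> openTags) {
        // Assumption: with no white list, everything is in scope by default.
        boolean inScope = whiteList.isEmpty();
        for (String tag : openTags) {
            if (inBlackList(tag)) {
                return false;
            }
            if (inWhiteList(tag)) {
                inScope = true;
            }
        }
        return inScope;
    }

    public static void main(String[] args) {
        TagScopeSketch tags = new TagScopeSketch(setOf(), setOf("DOCHDR"));
        System.out.println(tags.shouldIndex(Arrays.asList("DOC", "TEXT")));
        System.out.println(tags.shouldIndex(Arrays.asList("DOC", "DOCHDR")));
    }
}
```

Under these assumptions, the TrecDocTags example (empty process list, DOCHDR in the skip list) indexes text under DOC and TEXT but not text under DOCHDR.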

isIdTag

public boolean isIdTag(String tag)
Checks whether the given tag is a unique identifier tag, that is the document number of a document, or the identifier of a topic.

Parameters:
tag - String the tag to check.
Returns:
boolean true if the tag is an identifier tag, otherwise it returns false.

isDocTag

public boolean isDocTag(String tag)
Checks whether the given tag indicates the limits of a document.

Parameters:
tag - String the tag to check.
Returns:
boolean true if the tag is a document delimiter tag, otherwise it returns false.
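
To show how a document delimiter tag and an identifier tag might be used together, the sketch below walks a pre-tokenised stream and collects one identifier per document. It does not use Terrier itself: the class DocSplitSketch, its token format (each tag as a separate "<NAME>" or "</NAME>" string), and the hard-coded DOC and DOCNO tags are all assumptions for illustration, and its isDocTag and isIdTag helpers only mirror the names of the methods documented here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for illustration only; not Terrier's TagSet or parser.
class DocSplitSketch {
    static final String DOC_TAG = "DOC";  // assumed value of the document delimiter tag
    static final String ID_TAG = "DOCNO"; // assumed value of the identifier tag

    static boolean isDocTag(String tag) {
        return DOC_TAG.equals(tag.toUpperCase(Locale.ROOT));
    }

    static boolean isIdTag(String tag) {
        return ID_TAG.equals(tag.toUpperCase(Locale.ROOT));
    }

    /** Collects the identifier of every document found in the token stream. */
    static List<String> documentIds(List<String> tokens) {
        List<String> ids = new ArrayList<>();
        boolean inDoc = false;
        boolean inId = false;
        StringBuilder id = new StringBuilder();
        for (String token : tokens) {
            if (token.startsWith("</")) {
                // Closing tag: strip "</" and ">".
                String tag = token.substring(2, token.length() - 1);
                if (inId && isIdTag(tag)) {
                    inId = false;
                    ids.add(id.toString().trim());
                    id.setLength(0);
                } else if (isDocTag(tag)) {
                    inDoc = false;
                }
            } else if (token.startsWith("<")) {
                // Opening tag: strip "<" and ">".
                String tag = token.substring(1, token.length() - 1);
                if (isDocTag(tag)) {
                    inDoc = true;
                } else if (inDoc && isIdTag(tag)) {
                    inId = true;
                }
            } else if (inId) {
                id.append(token); // text inside the identifier tag
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "<DOC>", "<DOCNO>", " D1 ", "</DOCNO>",
                "<TEXT>", "hello", "</TEXT>", "</DOC>");
        System.out.println(documentIds(tokens));
    }
}
```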

isCaseSensitive

public boolean isCaseSensitive()
Returns true if this tag set has been specified as case-sensitive


getTagsToProcess

public String getTagsToProcess()
Returns a comma separated list of tags to process

Returns:
String the tags to process

getTagsToSkip

public String getTagsToSkip()
Returns a comma separated list of tags to skip

Returns:
String the tags to skip

getIdTag

public String getIdTag()
Return the id tag.

Returns:
String the id tag

getDocTag

public String getDocTag()
Return the document delimiter tag.

Returns:
String the document delimiter tag


Terrier 3.6. Copyright © 2004-2011 University of Glasgow