TagSet (Terrier Information Retrieval Platform version 1.1.1 API Specification)


uk.ac.gla.terrier.utility
Class TagSet

java.lang.Object
  uk.ac.gla.terrier.utility.TagSet

public class TagSet
extends java.lang.Object

A class that models a set of tags to process (white list), a set of tags to skip (black list), a tag that is used as a document delimiter, and a tag whose contents are used as a unique identifier. The text within any tag encountered within the scope of a tag from the white list is processed by default, unless the tag is explicitly black listed.
For example, in order to index all the text within the DOC tag of a document from a typical TREC collection, without indexing the contents of the DOCHDR tag, we could define the following properties in the properties file:
TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
TrecDocTags.process=
TrecDocTags.skip=DOCHDR

In the source code, we would create an instance of the class as follows:
TagSet TrecIndexToProcess = new TagSet("TrecDocTags");
All tags are converted to uppercase before they are checked against the specified sets of tags, so matching is case-insensitive.
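The relationship between the prefix and the four properties can be sketched as follows. This is a minimal illustration, not the Terrier implementation: the class `TagSetSketch`, its fields, and the `example()` helper are hypothetical, and it reads from a plain `java.util.Properties` object instead of Terrier's own properties file handling. Only the property names (`doctag`, `idtag`, `process`, `skip`) and the uppercase conversion are taken from the description above.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Properties;
import java.util.Set;

/** Hypothetical stand-in for illustration only; not the Terrier TagSet class. */
class TagSetSketch {
    final Set<String> whiteList;  // tags listed under <prefix>.process
    final Set<String> blackList;  // tags listed under <prefix>.skip
    final String docTag;          // value of <prefix>.doctag
    final String idTag;           // value of <prefix>.idtag

    TagSetSketch(Properties props, String prefix) {
        whiteList = parse(props.getProperty(prefix + ".process", ""));
        blackList = parse(props.getProperty(prefix + ".skip", ""));
        docTag = upper(props.getProperty(prefix + ".doctag", ""));
        idTag = upper(props.getProperty(prefix + ".idtag", ""));
    }

    /** Normalises a tag name: trims whitespace and converts to uppercase. */
    static String upper(String tag) {
        return tag.trim().toUpperCase(Locale.ROOT);
    }

    /** Splits a comma-separated list into uppercase tag names, ignoring empty entries. */
    static Set<String> parse(String list) {
        Set<String> tags = new LinkedHashSet<>();
        for (String part : list.split(",")) {
            String tag = upper(part);
            if (!tag.isEmpty()) {
                tags.add(tag);
            }
        }
        return tags;
    }

    boolean hasWhitelist() { return !whiteList.isEmpty(); }
    boolean isTagToSkip(String tag) { return blackList.contains(upper(tag)); }
    boolean isDocTag(String tag) { return !docTag.isEmpty() && docTag.equals(upper(tag)); }
    boolean isIdTag(String tag) { return !idTag.isEmpty() && idTag.equals(upper(tag)); }

    /** Builds the sketch from the same four properties shown in the example above. */
    static TagSetSketch example() {
        Properties p = new Properties();
        p.setProperty("TrecDocTags.doctag", "DOC");
        p.setProperty("TrecDocTags.idtag", "DOCNO");
        p.setProperty("TrecDocTags.process", "");
        p.setProperty("TrecDocTags.skip", "DOCHDR");
        return new TagSetSketch(p, "TrecDocTags");
    }
}
```

With these properties the sketch has an empty white list, a black list containing only DOCHDR, and recognises `doc`, `Doc` and `DOC` alike as the document delimiter.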

Version:: $Revision: 1.20 $
Author:: Vassilis Plachouras, Craig Macdonald

Field Summary
`static java.lang.String`	`EMPTY_TAGS` A prefix for an empty set of tags, that is a set of tags that are not defined in the properties file.
`static java.lang.String`	`FIELD_TAGS` The prefix for the tags to consider as fields, during indexing.
`static java.lang.String`	`TREC_DOC_TAGS` The prefix for the TREC document tags.
`static java.lang.String`	`TREC_EXACT_DOC_TAGS` The prefix for the TREC document exact tags.
`static java.lang.String`	`TREC_QUERY_TAGS` The prefix for the TREC topic tags.

Constructor Summary
`TagSet(java.lang.String prefix)` Constructs the tag set for the given prefix, by reading the corresponding properties from the properties file.

Method Summary
`java.lang.String`	`getDocTag()` Returns the document delimiter tag.
`java.lang.String`	`getIdTag()` Returns the id tag.
`java.lang.String`	`getTagsToProcess()` Returns a comma-separated list of tags to process.
`java.lang.String`	`getTagsToSkip()` Returns a comma-separated list of tags to skip.
`boolean`	`hasWhitelist()` Returns true if the white list contains at least one tag (whiteListSize > 0).
`boolean`	`isDocTag(java.lang.String tag)` Checks whether the given tag indicates the limits of a document.
`boolean`	`isIdTag(java.lang.String tag)` Checks whether the given tag is a unique identifier tag, that is the document number of a document, or the identifier of a topic.
`boolean`	`isTagToProcess(java.lang.String tag)` Checks whether the tag should be processed.
`boolean`	`isTagToSkip(java.lang.String tag)` Checks whether a tag should be skipped.

Methods inherited from class java.lang.Object
`equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

EMPTY_TAGS

public static final java.lang.String EMPTY_TAGS

A prefix for an empty set of tags, that is a set of tags that are not defined in the properties file.

See Also:: Constant Field Values

TREC_DOC_TAGS

public static final java.lang.String TREC_DOC_TAGS

The prefix for the TREC document tags. The corresponding properties in the setup file should start with TrecDocTags.

See Also:: Constant Field Values

TREC_EXACT_DOC_TAGS

public static final java.lang.String TREC_EXACT_DOC_TAGS

The prefix for the TREC document exact tags. The corresponding properties in the setup file should start with TrecExactDocTags.

See Also:: Constant Field Values

TREC_QUERY_TAGS

public static final java.lang.String TREC_QUERY_TAGS

The prefix for the TREC topic tags. The corresponding properties in the setup file should start with TrecQueryTags.

See Also:: Constant Field Values

FIELD_TAGS

public static final java.lang.String FIELD_TAGS

The prefix for the tags to consider as fields, during indexing. The corresponding properties in the setup file should start with FieldTags.

See Also:: Constant Field Values

Constructor Detail

TagSet

public TagSet(java.lang.String prefix)

Constructs the tag set for the given prefix, by reading the corresponding properties from the properties file.

Parameters:: prefix - the common prefix of the properties to read.

Method Detail

hasWhitelist

public boolean hasWhitelist()

Returns true if the white list contains at least one tag (whiteListSize > 0).

Returns:: true if whiteListSize > 0, otherwise false.

isTagToProcess

public boolean isTagToProcess(java.lang.String tag)

Checks whether the tag should be processed.

Parameters:: tag - String the tag to check.
Returns:: boolean true if the tag should be processed

isTagToSkip

public boolean isTagToSkip(java.lang.String tag)

Checks whether a tag should be skipped. Prefer isTagToProcess, as it checks both the white list and the black list.

Parameters:: tag - the tag to check.
Returns:: true if the tag should be skipped, otherwise it returns false.
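The interplay of the white list and the black list described in the class overview can be sketched as a check over the tags currently open around a piece of text. This is only one reading of that rule, written under stated assumptions: the class `TagScopeSketch` and the method `shouldProcess` are hypothetical, the sets are assumed to hold uppercase tag names, an empty white list is taken to mean "process everything that is not black listed", and a black listed tag anywhere in the enclosing scope is taken to suppress the text. The Terrier parser may resolve these cases differently.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of Terrier. */
class TagScopeSketch {
    /**
     * Decides whether text nested inside the given open tags should be processed.
     *
     * @param openTags  enclosing tags, outermost first
     * @param whiteList uppercase tags to process (empty means no white list is defined)
     * @param blackList uppercase tags to skip
     */
    static boolean shouldProcess(List<String> openTags,
                                 Set<String> whiteList,
                                 Set<String> blackList) {
        // Assumption: without a white list, text is processed unless black listed.
        boolean inWhiteListedScope = whiteList.isEmpty();
        for (String raw : openTags) {
            String tag = raw.toUpperCase(Locale.ROOT);
            if (blackList.contains(tag)) {
                return false; // explicitly black listed: skip the text
            }
            if (whiteList.contains(tag)) {
                inWhiteListedScope = true; // nested tags inherit the white-listed scope
            }
        }
        return inWhiteListedScope;
    }
}
```

For the TrecDocTags example, text directly inside DOC would be processed, while text inside DOC and DOCHDR would be skipped.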

isIdTag

public boolean isIdTag(java.lang.String tag)

Checks whether the given tag is a unique identifier tag, that is the document number of a document, or the identifier of a topic.

Parameters:: tag - String the tag to check.
Returns:: boolean true if the tag is an identifier tag, otherwise it returns false.

isDocTag

public boolean isDocTag(java.lang.String tag)

Checks whether the given tag indicates the limits of a document.

Parameters:: tag - String the tag to check.
Returns:: boolean true if the tag is a document delimiter tag, otherwise it returns false.
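To show how a document delimiter tag and an identifier tag work together, the following sketch scans TREC-style text and collects the identifier found inside each document. It is a simplified illustration, not how Terrier parses collections: the class `DocIdScanSketch`, the method `documentIds`, and the regular expression are assumptions, tags with attributes are not handled, and the two tag names are assumed to be passed in uppercase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical scanner for illustration only; not part of Terrier. */
class DocIdScanSketch {
    // Matches either a simple opening/closing tag or a run of text between tags.
    private static final Pattern TAG_OR_TEXT =
            Pattern.compile("<(/?)([A-Za-z0-9]+)>|([^<]+)");

    /** Returns the identifier text found inside idTag within each docTag element. */
    static List<String> documentIds(String text, String docTag, String idTag) {
        List<String> ids = new ArrayList<>();
        boolean inDoc = false;
        boolean inId = false;
        Matcher m = TAG_OR_TEXT.matcher(text);
        while (m.find()) {
            if (m.group(2) != null) {
                // A tag: compare in uppercase, as the class description specifies.
                String tag = m.group(2).toUpperCase(Locale.ROOT);
                boolean closing = !m.group(1).isEmpty();
                if (tag.equals(docTag)) {
                    inDoc = !closing;
                } else if (tag.equals(idTag)) {
                    inId = inDoc && !closing;
                }
            } else if (inId) {
                // Text inside the identifier tag of the current document.
                String id = m.group(3).trim();
                if (!id.isEmpty()) {
                    ids.add(id);
                }
            }
        }
        return ids;
    }
}
```

With an actual TagSet instance, the values returned by getDocTag() and getIdTag() could be passed as `docTag` and `idTag`, provided they are in uppercase.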

getTagsToProcess

public java.lang.String getTagsToProcess()

Returns a comma-separated list of tags to process.

Returns:: String the tags to process

getTagsToSkip

public java.lang.String getTagsToSkip()

Returns a comma-separated list of tags to skip.

Returns:: String the tags to skip

getIdTag

public java.lang.String getIdTag()

Returns the id tag.

Returns:: String the id tag

getDocTag

public java.lang.String getDocTag()

Returns the document delimiter tag.

Returns:: String the document delimiter tag