java.lang.Object org.terrier.utility.TagSet
public class TagSet
A class that models a set of tags to process (white list),
a set of tags to skip (black list), a tag that is used as a
document delimiter, and a tag whose contents are
used as a unique identifier. Text within any tag encountered
inside the scope of a white-listed tag is processed
by default, unless that tag is explicitly black listed.
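The rule described above can be illustrated with a minimal, self-contained sketch. It does not use Terrier: the class TagScopeSketch and its method shouldProcessText are hypothetical names invented for this illustration. It also assumes that an empty white list means "process everything that is not skipped" and that a black-listed enclosing tag always takes precedence; the actual TagSet behaviour may differ on both points.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, not part of Terrier: models the white list / black list rule. */
final class TagScopeSketch {
    private final Set<String> whiteList; // upper-case names of tags to process
    private final Set<String> blackList; // upper-case names of tags to skip

    TagScopeSketch(Set<String> whiteList, Set<String> blackList) {
        this.whiteList = whiteList;
        this.blackList = blackList;
    }

    /**
     * Decides whether text should be processed, given the currently open
     * tags listed from the outermost to the innermost.
     */
    boolean shouldProcessText(List<String> openTags) {
        // Assumption: with no white list, everything that is not skipped is processed.
        boolean inProcessedScope = whiteList.isEmpty();
        for (String tag : openTags) {
            String name = tag.toUpperCase(Locale.ROOT);
            if (blackList.contains(name)) {
                return false; // Assumption: a black-listed enclosing tag always wins.
            }
            if (whiteList.contains(name)) {
                inProcessedScope = true; // nested tags inherit the white-listed scope
            }
        }
        return inProcessedScope;
    }

    public static void main(String[] args) {
        TagScopeSketch tags = new TagScopeSketch(Set.of("TEXT"), Set.of("NOTE"));
        System.out.println(tags.shouldProcessText(List.of("DOC", "TEXT", "P")));    // true
        System.out.println(tags.shouldProcessText(List.of("DOC", "TEXT", "NOTE"))); // false
        System.out.println(tags.shouldProcessText(List.of("DOC", "HEAD")));         // false
    }
}
```

In this sketch the P tag is never listed anywhere, yet its text is processed because it sits inside the white-listed TEXT tag, which is the "processed by default" behaviour the description refers to.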
For example, in order to index all the text within
the DOC tag of a document from a typical TREC collection,
without indexing the contents of the DOCHDR tag,
we could define in the properties file the following properties:
TrecDocTags.doctag=DOC
TrecDocTags.idtag=DOCNO
TrecDocTags.process=
TrecDocTags.skip=DOCHDR
TrecDocTags.casesensitive=false
In the source code, we would create an instance of
the class as follows:
TagSet TrecIndexToProcess = new TagSet("TrecDocTags");
All the tags are converted to uppercase, in order to check
whether they belong to the specified set of tags.
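To make the prefix-based configuration and the uppercase conversion concrete, here is a minimal sketch that reads the five properties shown above from a java.util.Properties object. It is not the Terrier implementation: the class PrefixedTagConfigSketch, its fromText helper, and the default of false for the casesensitive property are assumptions made for this illustration. Only the property suffixes (doctag, idtag, process, skip, casesensitive) come from the example above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Properties;
import java.util.Set;

/** Hypothetical stand-in for a prefix-configured tag set; not the Terrier class. */
final class PrefixedTagConfigSketch {
    final boolean caseSensitive;
    final String docTag;
    final String idTag;
    final Set<String> whiteList;
    final Set<String> blackList;

    PrefixedTagConfigSketch(Properties props, String prefix) {
        // Assumption: tags are case-insensitive unless the property says otherwise.
        caseSensitive = Boolean.parseBoolean(props.getProperty(prefix + ".casesensitive", "false"));
        docTag = normalise(props.getProperty(prefix + ".doctag", ""));
        idTag = normalise(props.getProperty(prefix + ".idtag", ""));
        whiteList = parseList(props.getProperty(prefix + ".process", ""));
        blackList = parseList(props.getProperty(prefix + ".skip", ""));
    }

    /** Trims the tag and, when case-insensitive, converts it to upper case. */
    private String normalise(String tag) {
        String trimmed = tag.trim();
        return caseSensitive ? trimmed : trimmed.toUpperCase(Locale.ROOT);
    }

    /** Splits a comma separated list into a set of normalised, non-empty tags. */
    private Set<String> parseList(String commaSeparated) {
        Set<String> tags = new HashSet<>();
        for (String part : commaSeparated.split(",")) {
            String tag = normalise(part);
            if (!tag.isEmpty()) {
                tags.add(tag);
            }
        }
        return tags;
    }

    boolean isDocTag(String tag) { return docTag.equals(normalise(tag)); }
    boolean isIdTag(String tag) { return idTag.equals(normalise(tag)); }
    boolean isTagToSkip(String tag) { return blackList.contains(normalise(tag)); }

    /** Convenience for the sketch: parse properties from an in-memory string. */
    static PrefixedTagConfigSketch fromText(String text, String prefix) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new PrefixedTagConfigSketch(props, prefix);
    }
}
```

With the example properties above and the prefix "TrecDocTags", this sketch treats "doc" and "DOC" as the same document tag, because both the configured value and the queried tag are converted to uppercase before comparison.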
Field Summary | |
---|---|
protected HashSet<String> | blackList - The set of tags to skip. |
protected String | blackListTags - A comma separated list of tags to skip. |
protected boolean | caseSensitive - Whether this TagSet is case sensitive. |
protected String | docTag - The tag that is used for denoting the beginning of a document. |
static String | EMPTY_TAGS - A prefix for an empty set of tags, that is, a set of tags that are not defined in the properties file. |
static String | FIELD_TAGS - The prefix for the tags to consider as fields during indexing. |
protected String | idTag - The tag that is used as a unique identifier. |
static String | TREC_DOC_TAGS - The prefix for the TREC document tags. |
static String | TREC_EXACT_DOC_TAGS - The prefix for the TREC document exact tags. |
static String | TREC_PROPERTY_TAGS - The prefix for the TREC property tags. |
static String | TREC_QUERY_TAGS - The prefix for the TREC topic tags. |
protected HashSet<String> | whiteList - The set of tags to process. |
protected int | whiteListSize - The size of the whiteList hash set. |
protected String | whiteListTags - A comma separated list of tags to process. |
Constructor Summary | |
---|---|
TagSet(String prefix) | Constructs the tag set for the given prefix, by reading the corresponding properties from the properties file. |
Method Summary | |
---|---|
String | getDocTag() - Returns the document delimiter tag. |
String | getIdTag() - Returns the id tag. |
String | getTagsToProcess() - Returns a comma separated list of tags to process. |
String | getTagsToSkip() - Returns a comma separated list of tags to skip. |
boolean | hasWhitelist() - Returns true if whiteListSize > 0. |
boolean | isCaseSensitive() - Returns true if this tag set has been specified as case-sensitive. |
boolean | isDocTag(String tag) - Checks whether the given tag indicates the limits of a document. |
boolean | isIdTag(String tag) - Checks whether the given tag is a unique identifier tag, that is, the document number of a document or the identifier of a topic. |
boolean | isTagToProcess(String tag) - Checks whether the tag should be processed. |
boolean | isTagToSkip(String tag) - Checks whether a tag should be skipped. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
public static final String EMPTY_TAGS
public static final String TREC_DOC_TAGS
public static final String TREC_EXACT_DOC_TAGS
public static final String TREC_QUERY_TAGS
public static final String TREC_PROPERTY_TAGS
public static final String FIELD_TAGS
protected HashSet<String> whiteList
protected final int whiteListSize
protected String whiteListTags
protected HashSet<String> blackList
protected String blackListTags
protected String idTag
protected String docTag
protected boolean caseSensitive
Constructor Detail |
---|
public TagSet(String prefix)
prefix - the common prefix of the properties to read.
Method Detail |
---|
public boolean hasWhitelist()
public boolean isTagToProcess(String tag)
tag - the tag to check.
public boolean isTagToSkip(String tag)
tag - the tag to check.
public boolean isIdTag(String tag)
tag - the tag to check.
public boolean isDocTag(String tag)
tag - the tag to check.
public boolean isCaseSensitive()
public String getTagsToProcess()
public String getTagsToSkip()
public String getIdTag()
public String getDocTag()