Class TagSet


  • public class TagSet
    extends java.lang.Object
    A class that models a set of tags to process (white list), a set of tags to skip (black list), a tag that is used as a document delimiter, and a tag whose contents are used as a unique identifier. The text within any tag encountered within the scope of a white-listed tag is processed by default, unless that tag is explicitly black listed.
    For example, in order to index all the text within the DOC tag of a document from a typical TREC collection, without indexing the contents of the DOCHDR tag, we could define in the properties file the following properties:
    TrecDocTags.doctag=DOC
    TrecDocTags.idtag=DOCNO
    TrecDocTags.process=
    TrecDocTags.skip=DOCHDR
    TrecDocTags.casesensitive=false

    In the source code, we would create an instance of the class as follows:
    TagSet TrecIndexToProcess = new TagSet("TrecDocTags");
    Unless the tag set is case-sensitive, tags are converted to uppercase before checking whether they belong to the specified set of tags.
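    To make the property layout above concrete, the following is a minimal, self-contained sketch of how properties sharing a prefix could be parsed into a set of tags to process and a set of tags to skip. It does not use Terrier's classes: TagPropsSketch, parseTags, exampleProperties, and the use of java.util.Properties are assumptions made for illustration, and the real TagSet may read and normalise its properties differently.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Properties;
import java.util.Set;

// Hypothetical stand-in, not Terrier's TagSet: parses "<prefix>.*" properties.
public class TagPropsSketch {

    // Splits a comma-separated property value into a set of tags,
    // upper-casing each tag when the set is not case-sensitive.
    static Set<String> parseTags(String value, boolean caseSensitive) {
        Set<String> tags = new LinkedHashSet<>();
        if (value == null) {
            return tags;
        }
        for (String raw : value.split(",")) {
            String tag = raw.trim();
            if (!tag.isEmpty()) {
                tags.add(caseSensitive ? tag : tag.toUpperCase(Locale.ROOT));
            }
        }
        return tags;
    }

    // Loads the five example properties from an in-memory string.
    static Properties exampleProperties() {
        String text = String.join("\n",
            "TrecDocTags.doctag=DOC",
            "TrecDocTags.idtag=DOCNO",
            "TrecDocTags.process=",
            "TrecDocTags.skip=DOCHDR",
            "TrecDocTags.casesensitive=false");
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            // Not expected when reading from an open StringReader.
            throw new IllegalStateException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = exampleProperties();
        String prefix = "TrecDocTags";
        // Assumption: a missing casesensitive property is treated as "true" here.
        boolean caseSensitive =
            Boolean.parseBoolean(props.getProperty(prefix + ".casesensitive", "true"));
        Set<String> process = parseTags(props.getProperty(prefix + ".process"), caseSensitive);
        Set<String> skip = parseTags(props.getProperty(prefix + ".skip"), caseSensitive);
        System.out.println("process=" + process + " skip=" + skip);
    }
}
```

    With the example properties, the set of tags to process is empty and the set of tags to skip contains only DOCHDR.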
    Author:
    Vassilis Plachouras, Craig Macdonald
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      static class  TagSet.TagSetFactory  
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected java.util.Set<java.lang.String> blackList
      The set of tags to skip.
      protected int blackListSize
      Size of blackList hashset
      protected java.util.List<java.lang.String> blackListTags  
      protected boolean caseSensitive
      Whether this TagSet is case-sensitive.
      protected java.lang.String docTag
      The tag that is used for denoting the beginning of a document.
      static java.lang.String EMPTY_TAGS
      A prefix for an empty set of tags, that is a set of tags that are not defined in the properties file.
      static java.lang.String FIELD_TAGS
      The prefix for the tags to consider as fields, during indexing.
      protected java.lang.String idTag
      The tag that is used as a unique identifier.
      static java.lang.String TREC_DOC_TAGS
      The prefix for the TREC document tags.
      static java.lang.String TREC_EXACT_DOC_TAGS
      The prefix for the TREC document exact tags.
      static java.lang.String TREC_PROPERTY_TAGS
      The prefix for the TREC property tags.
      static java.lang.String TREC_QUERY_TAGS
      The prefix for the TREC topic tags.
      protected java.util.Set<java.lang.String> whiteList
      The set of tags to process.
      protected int whiteListSize
      Size of whiteList hashset
      protected java.util.List<java.lang.String> whiteListTags  
    • Constructor Summary

      Constructors 
      Constructor Description
      TagSet(java.lang.String prefix)
      Constructs the tag set for the given prefix, by reading the corresponding properties from the properties file.
    • Method Summary

      All Methods Static Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      static TagSet.TagSetFactory factory()  
      java.lang.String getDocTag()
      Return the document delimiter tag.
      java.lang.String getIdTag()
      Return the id tag.
      java.lang.String[] getTagsToProcess()
      Returns the tags to process, as an array of strings.
      java.lang.String[] getTagsToSkip()
      Returns the tags to skip, as an array of strings.
      boolean hasWhitelist()
      Returns true if whiteListSize > 0.
      boolean isCaseSensitive()
      Returns true if this tag set has been specified as case-sensitive
      boolean isDocTag(java.lang.String tag)
      Checks whether the given tag indicates the limits of a document.
      boolean isIdTag(java.lang.String tag)
      Checks whether the given tag is a unique identifier tag, that is, the document number of a document or the identifier of a topic.
      boolean isTagToProcess(java.lang.String tag)
      Checks whether the tag should be processed.
      boolean isTagToSkip(java.lang.String tag)
      Checks whether a tag should be skipped.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • EMPTY_TAGS

        public static final java.lang.String EMPTY_TAGS
        A prefix for an empty set of tags, that is a set of tags that are not defined in the properties file.
        See Also:
        Constant Field Values
      • TREC_DOC_TAGS

        public static final java.lang.String TREC_DOC_TAGS
        The prefix for the TREC document tags. The corresponding properties in the setup file should start with TrecDocTags.
        See Also:
        Constant Field Values
      • TREC_EXACT_DOC_TAGS

        public static final java.lang.String TREC_EXACT_DOC_TAGS
        The prefix for the TREC document exact tags. The corresponding properties in the setup file should start with TrecExactDocTags.
        See Also:
        Constant Field Values
      • TREC_QUERY_TAGS

        public static final java.lang.String TREC_QUERY_TAGS
        The prefix for the TREC topic tags. The corresponding properties in the setup file should start with TrecQueryTags.
        See Also:
        Constant Field Values
      • TREC_PROPERTY_TAGS

        public static final java.lang.String TREC_PROPERTY_TAGS
        The prefix for the TREC property tags. The corresponding properties in the setup file should start with TrecPropertyTags.
        See Also:
        Constant Field Values
      • FIELD_TAGS

        public static final java.lang.String FIELD_TAGS
        The prefix for the tags to consider as fields, during indexing. The corresponding properties in the setup file should start with FieldTags.
        See Also:
        Constant Field Values
      • whiteList

        protected java.util.Set<java.lang.String> whiteList
        The set of tags to process.
      • whiteListSize

        protected final int whiteListSize
        Size of whiteList hashset
      • whiteListTags

        protected java.util.List<java.lang.String> whiteListTags
      • blackList

        protected java.util.Set<java.lang.String> blackList
        The set of tags to skip.
      • blackListSize

        protected final int blackListSize
        Size of blackList hashset
      • blackListTags

        protected java.util.List<java.lang.String> blackListTags
      • idTag

        protected final java.lang.String idTag
        The tag that is used as a unique identifier.
      • docTag

        protected final java.lang.String docTag
        The tag that is used for denoting the beginning of a document.
      • caseSensitive

        protected final boolean caseSensitive
        Whether this TagSet is case-sensitive. Defaults to true for all sets except TrecDocTags.
    • Constructor Detail

      • TagSet

        public TagSet(java.lang.String prefix)
        Constructs the tag set for the given prefix, by reading the corresponding properties from the properties file.
        Parameters:
        prefix - the common prefix of the properties to read.
    • Method Detail

      • hasWhitelist

        public boolean hasWhitelist()
        Returns true if whiteListSize > 0.
        Returns:
        true if whiteListSize > 0, otherwise false.
      • isTagToProcess

        public boolean isTagToProcess(java.lang.String tag)
        Checks whether the tag should be processed.
        Parameters:
        tag - String the tag to check.
        Returns:
        boolean true if the tag should be processed
      • isTagToSkip

        public boolean isTagToSkip(java.lang.String tag)
        Checks whether a tag should be skipped. Prefer isTagToProcess, which checks both the whitelist and the blacklist.
        Parameters:
        tag - the tag to check.
        Returns:
        true if the tag should be skipped, otherwise it returns false.
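        The interplay between the two lists can be sketched as follows. This is a self-contained illustration under stated assumptions, not Terrier's implementation: the class TagFilterSketch and its methods are hypothetical, and the sketch assumes that an empty whitelist means "process every tag that is not blacklisted" and that the blacklist takes precedence over the whitelist.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for illustration; not Terrier's TagSet.
public class TagFilterSketch {
    private final Set<String> whiteList = new HashSet<>();
    private final Set<String> blackList = new HashSet<>();
    private final boolean caseSensitive;

    public TagFilterSketch(Set<String> process, Set<String> skip, boolean caseSensitive) {
        this.caseSensitive = caseSensitive;
        for (String t : process) {
            whiteList.add(normalise(t));
        }
        for (String t : skip) {
            blackList.add(normalise(t));
        }
    }

    // Upper-cases the tag unless the set is case-sensitive.
    private String normalise(String tag) {
        return caseSensitive ? tag : tag.toUpperCase(Locale.ROOT);
    }

    // Consults the blacklist only.
    public boolean isTagToSkip(String tag) {
        return blackList.contains(normalise(tag));
    }

    // Consults both lists. Assumption: the blacklist wins, and an empty
    // whitelist admits every tag that is not blacklisted.
    public boolean isTagToProcess(String tag) {
        if (isTagToSkip(tag)) {
            return false;
        }
        return whiteList.isEmpty() || whiteList.contains(normalise(tag));
    }
}
```

        Under these assumptions, a filter built with an empty whitelist, a blacklist containing DOCHDR, and case sensitivity disabled processes a tag such as TEXT and rejects dochdr in any letter case.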
      • isIdTag

        public boolean isIdTag(java.lang.String tag)
        Checks whether the given tag is a unique identifier tag, that is, the document number of a document or the identifier of a topic.
        Parameters:
        tag - String the tag to check.
        Returns:
        boolean true if the tag is an identifier tag, otherwise it returns false.
      • isDocTag

        public boolean isDocTag(java.lang.String tag)
        Checks whether the given tag indicates the limits of a document.
        Parameters:
        tag - String the tag to check.
        Returns:
        boolean true if the tag is a document delimiter tag, otherwise it returns false.
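        As an illustration of how a reader might combine a document delimiter check with an identifier check, the sketch below extracts identifiers from a TREC-style string. Everything in it is hypothetical: the class DocIdSketch, the method extractIds, and the regex-based tokenising are assumptions, tag names are compared case-insensitively (as with casesensitive=false), and it does not call Terrier's TagSet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not part of Terrier.
public class DocIdSketch {
    // Matches simple opening and closing tags without attributes.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)>");

    // Returns the trimmed text of every idTag element found inside a docTag element.
    static List<String> extractIds(String input, String docTag, String idTag) {
        List<String> ids = new ArrayList<>();
        boolean inDoc = false;
        int idStart = -1; // offset where the current id text begins, or -1
        Matcher m = TAG.matcher(input);
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            String name = m.group(2);
            if (name.equalsIgnoreCase(docTag)) {
                // A document delimiter opens or closes the current document.
                inDoc = !closing;
                idStart = -1;
            } else if (inDoc && name.equalsIgnoreCase(idTag)) {
                if (!closing) {
                    idStart = m.end();
                } else if (idStart >= 0) {
                    ids.add(input.substring(idStart, m.start()).trim());
                    idStart = -1;
                }
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        String sample = "<DOC><DOCNO> d1 </DOCNO><DOCHDR>x</DOCHDR>body</DOC>";
        System.out.println(extractIds(sample, "DOC", "DOCNO"));
    }
}
```

        Identifier tags that appear outside a document delimiter are ignored by this sketch, which mirrors the idea that the document tag marks the limits of a document.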
      • isCaseSensitive

        public boolean isCaseSensitive()
        Returns true if this tag set has been specified as case-sensitive
      • getTagsToProcess

        public java.lang.String[] getTagsToProcess()
        Returns the tags to process, as an array of strings.
        Returns:
        String[] the tags to process
      • getTagsToSkip

        public java.lang.String[] getTagsToSkip()
        Returns the tags to skip, as an array of strings.
        Returns:
        String[] the tags to skip
      • getIdTag

        public java.lang.String getIdTag()
        Return the id tag.
        Returns:
        String the id tag
      • getDocTag

        public java.lang.String getDocTag()
        Return the document delimiter tag.
        Returns:
        String the document delimiter tag