Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.6
    • Fix Version/s: 4.0
    • Component/s: None
    • Labels: None

      Description

      Real-time search: a really simple website search engine, with a small db backend.

      So the idea would be:
      - the collection.spec contains the hostnames; all pages must have one of these as a prefix (a sketch of this check follows the list).
      - a simple crawler might be crawler4j, but I'm not sure it has robots.txt support. However, I think we had a student project that fixed that, and I can find the code.
      - crawled pages are placed in a db; we use a SQL query to place the updated pages in the index?
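      Something like the following sketch could enforce the prefix check with crawler4j (illustrative only: the class name and hostname are made up, the hostnames would really be read from collection.spec, and the shouldVisit signature varies between crawler4j versions):

      import edu.uci.ics.crawler4j.crawler.WebCrawler;
      import edu.uci.ics.crawler4j.url.WebURL;

      public class HostRestrictedCrawler extends WebCrawler {

          // Hostname prefixes; in Terrier these would come from collection.spec.
          private static final String[] ALLOWED_PREFIXES = {
              "http://www.example.org/"
          };

          @Override
          public boolean shouldVisit(WebURL url) {
              String href = url.getURL().toLowerCase();
              for (String prefix : ALLOWED_PREFIXES)
                  if (href.startsWith(prefix))
                      return true;
              return false; // skip pages outside the configured hosts
          }
      }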

        Attachments

          Activity

          richardm Richard McCreadie created issue -
          craigm Craig Macdonald made changes -
          Assignee: Iadh Ounis [ ounis ] → Richard McCreadie [ richardm ]
          craigm Craig Macdonald made changes -
          Link: This issue duplicates TREC-255 [ TREC-255 ]
          richardm Richard McCreadie added a comment -

          Made a major commit on this in r3875:

          - Changed package structure of website search
          - Added crawler4j
          - Added crawler4j indexing code for Terrier
          - Updated build.xml
          - Updated search interface
          - Updated start scripts

          richardm Richard McCreadie added a comment -

          @Craig - have a look at this and try to break it when you get a chance.

          craigm Craig Macdonald made changes -
          Comment [ Did you commit all the jar files, esp Crawler4j? ]
          craigm Craig Macdonald added a comment -
          The constructor TaggedDocument(StringReader, Map<String,String>, Tokeniser, String, String, String) is undefined
          (CrawlStrategy.java, /terrier4core/src/websitesearch/org/terrier/services/websitesearch/crawler4j, line 103)

          Got this error - can you tell me why the new constructor was needed?

          richardm Richard McCreadie added a comment -

          Added the missing files in r3875

          We already had a constructor that took an InputStream, but not one for a Reader; I added one that takes a Reader so it can read HTML text from a StringReader.

          craigm Craig Macdonald added a comment - edited

          I have tried to maintain a contract for Document object constructors: InputStream, properties, tokeniser.

          I think you have done this due to a limitation of crawler4j: it should return the raw bytes, not the decoded HTML, because HTML allows the character set specified by the HTTP headers to be overridden within the document by a <meta> tag. The problem is the current code block in https://code.google.com/p/crawler4j/source/browse/src/main/java/edu/uci/ics/crawler4j/parser/Parser.java:

          try {
              if (page.getContentCharset() == null) {
                  parseData.setHtml(new String(page.getContentData()));
              } else {
                  parseData.setHtml(new String(page.getContentData(), page.getContentCharset()));
              }
          } catch (UnsupportedEncodingException e) {
              e.printStackTrace();
              return false;
          }
          

          As a workaround, use html.getBytes("UTF-8") and set the "encoding" document property to "UTF-8". You can then revert the constructor change for TaggedDocument.
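          That is, something like the following (a sketch only: the class, variable names and property keys are illustrative, assuming the InputStream-based constructor contract described above):

          import java.io.ByteArrayInputStream;
          import java.io.InputStream;
          import java.io.UnsupportedEncodingException;
          import java.util.HashMap;
          import java.util.Map;

          import org.terrier.indexing.Document;
          import org.terrier.indexing.TaggedDocument;
          import org.terrier.indexing.tokenisation.Tokeniser;

          public class CrawledPage2Document {

              /** Re-encodes the HTML string that crawler4j decoded into UTF-8 bytes,
                  records that encoding in the document properties, and keeps the
                  InputStream-based TaggedDocument constructor. */
              public static Document toDocument(String html, String url, Tokeniser tokeniser)
                      throws UnsupportedEncodingException {
                  Map<String,String> docProperties = new HashMap<String,String>();
                  docProperties.put("encoding", "UTF-8"); // the charset we forced
                  docProperties.put("url", url);
                  InputStream docStream = new ByteArrayInputStream(html.getBytes("UTF-8"));
                  return new TaggedDocument(docStream, docProperties, tokeniser);
              }
          }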

          richardm Richard McCreadie added a comment -

          Updated and committed r3878

          craigm Craig Macdonald added a comment - edited

          Is SimpleCrawler still used? Also CrawlerAPI?

          Can you add a documentation page to docs/ showing usage?

          richardm Richard McCreadie added a comment -

          SimpleCrawler is no longer used. CrawlerAPI is an interface that describes the method for crawling. SimpleCrawler and CrawlerProcess implement CrawlerAPI.
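          Roughly, the contract looks like this (a hypothetical sketch only; the method names are illustrative, and the real interface lives in the websitesearch package):

          // Illustrative sketch of the CrawlerAPI contract; not the actual source.
          public interface CrawlerAPI {

              /** Crawl the pages under the given host and add them to the index. */
              void crawl(String hostname);
          }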

          richardm Richard McCreadie added a comment -

          It should just work via http_terrier

          so ./bin/http_terrier.sh 8080 src/webapps/websitesearch

          richardm Richard McCreadie added a comment -

          Updated the interface with new CSS and added index-saving functionality.

          Wrote a documentation page.

          Committed r3958

          richardm Richard McCreadie made changes -
          Status: Open [ 1 ] → Resolved [ 5 ]
          Resolution: Fixed [ 1 ]
          richardm Richard McCreadie made changes -
          Project: TREC [ 10010 ] → Terrier Core [ 10000 ]
          Key: TREC-361 → TR-305
          Issue Type: Improvement [ 4 ] → New Feature [ 2 ]
          Workflow: jira [ 10802 ] → Terrier Open Source [ 10868 ]
          Affects Version/s: 3.6 [ 10060 ] → 3.6 [ 10061 ]
          Component/s: Core [ 10020 ]
          Fix Version/s: 4.0 [ 10051 ] → 4.0 [ 10050 ]

            People

            • Assignee:
              richardm Richard McCreadie
            • Reporter:
              richardm Richard McCreadie
            • Watchers:
              1

              Dates

              • Created:
                Updated:
                Resolved: