[TR-305] Simple website search engine Created: 11/Apr/14  Updated: 16/Jun/14  Resolved: 12/Jun/14

Status: Resolved
Project: Terrier Core
Component/s: None
Affects Version/s: 3.6
Fix Version/s: 4.0

Type: New Feature Priority: Major
Reporter: Richard McCreadie Assignee: Richard McCreadie
Resolution: Fixed  
Labels: None

Issue Links:

A really simple real-time website search engine, with a small DB backend.

So the idea would be:
- collection.spec contains the hostnames. Every page URL must have one of them as a prefix.
- A simple crawler might be crawler4j, but I'm not sure it has robots.txt support. I think we had a student project that fixed that, though, and I can find the code.
- Crawled pages are placed in the DB. We use a SQL query to place the updated pages in the index?
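
The prefix rule in the first bullet could be sketched roughly as below. This is only an illustration: the class name HostPrefixFilter and the method shouldVisit are hypothetical, not taken from the Terrier code, and the prefixes are assumed to have already been read from collection.spec. Matching is done case-insensitively as a simplification, and each prefix is assumed to end in "/" so that "http://terrier.org/" does not also match "http://terrier.org.example.com/".

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: decides whether a URL falls under an allowed prefix. */
class HostPrefixFilter {
    private final List<String> allowedPrefixes;

    HostPrefixFilter(List<String> allowedPrefixes) {
        this.allowedPrefixes = allowedPrefixes;
    }

    /** Returns true if the URL starts with one of the configured prefixes. */
    boolean shouldVisit(String url) {
        // Lower-case both sides so the comparison ignores case (a simplification).
        String normalised = url.toLowerCase(Locale.ROOT);
        for (String prefix : allowedPrefixes) {
            if (normalised.startsWith(prefix.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed contents of collection.spec, one prefix per line.
        HostPrefixFilter filter =
            new HostPrefixFilter(Arrays.asList("http://terrier.org/"));
        System.out.println(filter.shouldVisit("http://terrier.org/docs/index.html"));
        System.out.println(filter.shouldVisit("http://example.com/"));
    }
}
```

A crawler callback that decides which links to follow could delegate to a check like this.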

Comment by Richard McCreadie [ 30/May/14 ]

Made a major commit on this in r3875

Changed package structure of website search
Added crawler4j
Added crawler4j indexing code for Terrier
Updated build.xml
Updated search interface
Updated start scripts

Comment by Richard McCreadie [ 30/May/14 ]

@Craig - have a look at this and try to break it when you get a chance.

Comment by Craig Macdonald [ 30/May/14 ]
The constructor TaggedDocument(StringReader, Map<String,String>, Tokeniser, String, String, String) is undefined (CrawlStrategy.java, /terrier4core/src/websitesearch/org/terrier/services/websitesearch/crawler4j, line 103)

Got this error - can you tell me why the new constructor was needed?

Comment by Richard McCreadie [ 02/Jun/14 ]

Added the missing files in r3875

We already had a constructor that took an InputStream, but not one for a Reader. I added one that takes a Reader so it can read HTML text from a StringReader.

Comment by Craig Macdonald [ 02/Jun/14 ]

I have tried to maintain a contract for Document object constructors: InputStream, properties, tokeniser.

I think you have done this due to a limitation in crawler4j: it should return the raw bytes, not the decoded HTML, because HTML allows the character set specified by the HTTP headers to be overridden within the document by a <meta> tag. The current code block in https://code.google.com/p/crawler4j/source/browse/src/main/java/edu/uci/ics/crawler4j/parser/Parser.java is the problem:

try {
    if (page.getContentCharset() == null) {
        parseData.setHtml(new String(page.getContentData()));
    } else {
        parseData.setHtml(new String(page.getContentData(), page.getContentCharset()));
    }
} catch (UnsupportedEncodingException e) {
    return false;
}

As a workaround, use html.getBytes("UTF-8") and set docProperties.encoding("UTF-8"). You can then revert the constructor for TaggedDocument.
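
A minimal sketch of that workaround, assuming the document properties are a plain Map<String,String> and that the key is literally "encoding" (both inferred from this comment, not checked against the Terrier source). The class and method names are hypothetical. StandardCharsets.UTF_8 is used instead of the string "UTF-8" only to avoid the checked UnsupportedEncodingException.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper illustrating the UTF-8 re-encoding workaround. */
class Utf8Workaround {

    /** Re-encodes the HTML string that crawler4j already decoded as UTF-8 bytes. */
    static byte[] utf8Bytes(String html) {
        return html.getBytes(StandardCharsets.UTF_8);
    }

    /** Wraps the UTF-8 bytes in a stream, matching the InputStream-based constructor contract. */
    static InputStream toUtf8Stream(String html) {
        return new ByteArrayInputStream(utf8Bytes(html));
    }

    /** Builds document properties declaring the encoding; the key name is an assumption. */
    static Map<String, String> utf8Properties() {
        Map<String, String> docProperties = new HashMap<>();
        docProperties.put("encoding", "UTF-8");
        return docProperties;
    }

    public static void main(String[] args) {
        String html = "<html><body>caf\u00e9</body></html>";
        InputStream in = toUtf8Stream(html);
        Map<String, String> docProperties = utf8Properties();
        // The stream and properties would then go to the existing constructor, e.g.
        // new TaggedDocument(in, docProperties, tokeniser), with tokeniser supplied by the caller.
        System.out.println(utf8Bytes(html).length + " bytes, encoding=" + docProperties.get("encoding"));
    }
}
```

Because the string has already been decoded by crawler4j, this does not recover pages whose <meta> charset disagrees with the HTTP header. It only makes the bytes and the declared encoding consistent with each other.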

Comment by Richard McCreadie [ 02/Jun/14 ]

Updated and committed r3878

Comment by Craig Macdonald [ 02/Jun/14 ]

Is SimpleCrawler still used? Also CrawlerAPI?

Can you add a documentation page to docs/ showing usage?

Comment by Richard McCreadie [ 02/Jun/14 ]

SimpleCrawler is no longer used. CrawlerAPI is an interface that describes the method for crawling. SimpleCrawler and CrawlerProcess implement CrawlerAPI.

Comment by Richard McCreadie [ 03/Jun/14 ]

It should just work via http_terrier, so:

./bin/http_terrier.sh 8080 src/webapps/websitesearch

Comment by Richard McCreadie [ 12/Jun/14 ]

Updated the interface with new CSS and added index-saving functionality.

Wrote the documentation page.

Committed r3958
