Topic-focused Web crawling

The ultimate aim of Internet search engines is to index the entire Web by following the links found within known pages. Topic-focused crawlers specialise this task by indexing only the subset of the Web that is relevant to some topic or information need. Typically, the user specifies the topic and a set of start (seed) pages, and the topic-focused crawler traverses the Web via hyperlinks, as a general crawler does. Unlike a general crawler, however, a topic-focused crawler makes a relevance decision at each step: it decides whether the current page should be included in the resulting index, and it prioritises the page's outgoing links, based on various features, according to their expected gain. This project aims to use publicly available hierarchical Web directories to prioritise outgoing links according not only to their relevance to the desired topic but also to their relevance to related topics. The intention is that, when no ‘positive’ links are available, the crawl stays as relevant as possible.
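The idea described above can be illustrated with a minimal best-first crawl over a toy, in-memory link graph. This is only a sketch under stated assumptions, not the project's implementation: the directory tree (PARENT), the page graph (WEB), the distance-based relatedness score and the inclusion threshold are all hypothetical, and the category of a link's target is simply looked up, whereas a real crawler would have to predict it from features such as anchor text.

```python
import heapq

# Hypothetical toy directory: each category maps to its parent (None = root).
PARENT = {
    "Science": None,
    "Biology": "Science",
    "Genetics": "Biology",
    "Ecology": "Biology",
    "Physics": "Science",
    "Sports": None,
}

# Hypothetical toy Web: page -> (directory category, outgoing links).
WEB = {
    "seed": ("Genetics", ["a", "b", "c"]),
    "a": ("Sports", []),
    "b": ("Ecology", ["d"]),
    "c": ("Physics", []),
    "d": ("Genetics", []),
}


def ancestors(category):
    """Return the path from a category up to its root, inclusive."""
    path = []
    while category is not None:
        path.append(category)
        category = PARENT[category]
    return path


def relatedness(category, target):
    """Score in [0, 1]: 1 for the target itself, decaying with tree distance."""
    a, b = ancestors(category), ancestors(target)
    common = set(a) & set(b)
    if not common:
        return 0.0  # different directory trees: treated as unrelated
    # Tree distance measured through the lowest common ancestor.
    distance = min(a.index(c) + b.index(c) for c in common)
    return 1.0 / (1 + distance)


def crawl(seeds, target, threshold=0.3, max_pages=10):
    """Visit pages best-first; return (visit order, pages kept in the index)."""
    frontier = [(-1.0, s) for s in seeds]  # heapq is a min-heap, so negate scores
    heapq.heapify(frontier)
    seen = set(seeds)
    order, index = [], []
    while frontier and len(order) < max_pages:
        _, url = heapq.heappop(frontier)
        category, links = WEB[url]
        order.append(url)
        if relatedness(category, target) >= threshold:
            index.append(url)  # relevance decision for the current page
        for link in links:
            if link not in seen:
                seen.add(link)
                # Assumption: the target's category is known here; a real
                # crawler would estimate it before fetching the page.
                expected_gain = relatedness(WEB[link][0], target)
                heapq.heappush(frontier, (-expected_gain, link))
    return order, index


order, index = crawl(["seed"], "Genetics")
print(order)  # ['seed', 'b', 'd', 'c', 'a']
print(index)  # ['seed', 'b', 'd']
```

In this toy graph none of the seed's links leads directly to a Genetics page, so the crawl follows the link into the sibling category Ecology first, and that page in turn leads to another Genetics page. This is the behaviour the related-topic scoring is meant to encourage.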

People involved

Mark Greenwood (MPhil student)
Prof Goran Nenadic (supervisor)

gnTEAM

Text extraction, analytics, mining