Crawling Strategy

DFS is not a good fit: we don’t want to go arbitrarily deep into a single site while other sites remain unexplored. That leaves BFS, which is typically implemented with a FIFO queue (the URL frontier).
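
As a minimal sketch, a BFS crawl loop over a FIFO frontier might look like this (fetch_page and extract_links are hypothetical helpers standing in for a real downloader and HTML parser):

```python
from collections import deque

def fetch_page(url: str) -> str:
    ...  # hypothetical downloader

def extract_links(html: str) -> list[str]:
    ...  # hypothetical link extractor

def bfs_crawl(seed_urls: list[str]) -> None:
    frontier = deque(seed_urls)   # FIFO queue: pages are visited in BFS order
    seen = set(seed_urls)         # avoid re-queueing URLs we have already seen
    while frontier:
        url = frontier.popleft()
        html = fetch_page(url)
        for link in extract_links(html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
```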

Most pages link heavily to other pages on the same site, so a naive BFS would bombard a single server with many requests in a short window. In addition, some URLs deserve higher priority than others, based on web traffic, PageRank, and so on.

URL Frontier

A URL frontier is the data structure that stores the URLs to be downloaded and addresses the problems above. It ensures:

  • Politeness: keep a separate FIFO queue per host and enforce a delay between consecutive requests to the same host.
  • URL prioritisation: a component assigns each URL a priority (based on web traffic, PageRank, etc.). Selection is randomised with a bias towards higher-priority URLs, so lower-priority URLs still get picked occasionally (see the sketch after this list).
  • Freshness: pages are updated, added, and deleted all the time, so they need to be re-crawled. This can be done by re-crawling important pages more often, or by tracking each page’s update history.
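
A minimal sketch of such a frontier, assuming a fixed politeness delay per host and using the priority of each host’s front URL as a selection weight (class and parameter names are illustrative):

```python
import random
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class URLFrontier:
    def __init__(self, politeness_delay: float = 1.0):
        self.queues = defaultdict(deque)        # host -> FIFO of (url, priority)
        self.next_allowed = defaultdict(float)  # host -> earliest time we may hit it again
        self.delay = politeness_delay

    def add(self, url: str, priority: float = 1.0) -> None:
        host = urlparse(url).netloc
        self.queues[host].append((url, priority))

    def next_url(self) -> str | None:
        now = time.monotonic()
        # Politeness: only consider hosts whose delay has elapsed.
        ready = [h for h, q in self.queues.items() if q and self.next_allowed[h] <= now]
        if not ready:
            return None
        # Prioritisation: weight hosts by the priority of their front URL,
        # so higher-priority URLs are favoured but others can still be picked.
        weights = [self.queues[h][0][1] for h in ready]
        host = random.choices(ready, weights=weights, k=1)[0]
        url, _priority = self.queues[host].popleft()
        self.next_allowed[host] = now + self.delay
        return url
```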

Robots file

Robots.txt, part of the Robots Exclusion Protocol, is a standard websites use to communicate with crawlers. It can state whether the site allows crawling and disallow specific paths.

To avoid re-downloading this file for every page of a host, we cache it per host and refresh it periodically.
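
A sketch using Python’s standard urllib.robotparser with a per-host cache; the one-day refresh interval and the user-agent string are assumptions:

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

ROBOTS_TTL = 24 * 60 * 60  # assumed refresh interval: one day
_robots_cache: dict[str, tuple[RobotFileParser, float]] = {}  # host -> (parser, fetched_at)

def is_allowed(url: str, user_agent: str = "MyCrawler") -> bool:
    host = urlparse(url).netloc
    cached = _robots_cache.get(host)
    if cached is None or time.time() - cached[1] > ROBOTS_TTL:
        parser = RobotFileParser(f"https://{host}/robots.txt")
        parser.read()  # downloads and parses robots.txt once per host
        cached = (parser, time.time())
        _robots_cache[host] = cached
    return cached[0].can_fetch(user_agent, url)
```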

Performance

We can optimise performance in several ways:

  • Distributed web crawlers.
    • Workers can be geographically distributed, and each site is assigned to a worker close to it.
  • Maintain a DNS cache, or use a local resolver that caches lookups.
  • Short timeouts, so that dead or slow servers are skipped quickly (see the sketch after this list).
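
A sketch of the DNS-cache and short-timeout ideas, assuming the third-party requests library; a production crawler would wire the cached resolver into its HTTP client rather than keep the two pieces separate:

```python
import socket
from functools import lru_cache

import requests  # third-party HTTP library, used here only for illustration

@lru_cache(maxsize=100_000)
def resolve(host: str) -> str:
    # Cache DNS lookups so repeated requests to the same host skip resolution.
    return socket.gethostbyname(host)

def fetch(url: str) -> str | None:
    try:
        # (connect timeout, read timeout) in seconds: give up quickly on dead or slow servers.
        return requests.get(url, timeout=(3, 5)).text
    except requests.RequestException:
        return None  # skip this URL and move on
```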

Other issues

  • Server-side rendering: some pages render their content client-side (e.g. with React), so the crawler has to render the page itself (for example in a headless browser) before it can extract links.
  • Spider traps: some websites contain a labyrinth of pages (often auto-generated) that can trap a crawler in an endless crawl; a simple heuristic filter is sketched below.
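
A minimal heuristic filter against spider traps; the thresholds below are illustrative assumptions, and real crawlers tune them and combine them with other signals:

```python
from urllib.parse import urlparse

# Illustrative thresholds; a real crawler would tune these per deployment.
MAX_URL_LENGTH = 512
MAX_PATH_DEPTH = 12
MAX_PAGES_PER_HOST = 10_000

pages_crawled_per_host: dict[str, int] = {}

def looks_like_trap(url: str) -> bool:
    parsed = urlparse(url)
    depth = len([seg for seg in parsed.path.split("/") if seg])
    return (
        len(url) > MAX_URL_LENGTH
        or depth > MAX_PATH_DEPTH
        or pages_crawled_per_host.get(parsed.netloc, 0) > MAX_PAGES_PER_HOST
    )
```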