Crawling Strategy
DFS is not a good choice because we do not want to go arbitrarily deep into one site while everything else we have discovered sits waiting. That leaves BFS, which is implemented with a FIFO queue of URLs to visit (the URL frontier).
A naive BFS has two problems: most of a site's links point to other pages on the same host, so we can end up bombarding a single server with requests, and all URLs are treated equally even though some deserve higher priority based on web traffic, PageRank, and similar signals.
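A minimal sketch of the naive BFS loop, assuming hypothetical download and extract_links helpers (the frontier here is a plain FIFO queue, with none of the fixes discussed next):

```python
from collections import deque

def bfs_crawl(seed_urls, download, extract_links, max_pages=1000):
    """Naive BFS crawl: a plain FIFO queue, no politeness or prioritisation.
    `download` and `extract_links` are placeholder callables."""
    frontier = deque(seed_urls)        # FIFO queue of URLs to visit
    seen = set(seed_urls)              # avoid enqueuing the same URL twice
    crawled = 0
    while frontier and crawled < max_pages:
        url = frontier.popleft()       # oldest URL first: breadth-first order
        page = download(url)
        crawled += 1
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
```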
URL Frontier
A URL frontier is a data structure that stores the URLs to be downloaded and addresses the problems above. It ensures:
- Politeness: keep a separate queue per host and enforce a delay between successive requests to the same host (see the sketch after this list).
- URL prioritisation: a prioritiser component assigns each URL a priority, and selection is randomised with a bias towards higher-priority URLs, so lower-priority URLs can still be picked occasionally.
- Freshness: pages are updated, added, and deleted, so we need to re-crawl. We can re-crawl important pages more often, or use each page's update history to decide how frequently to revisit it.
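A toy frontier covering the politeness part, assuming a fixed per-host delay (the prioritisation bias would sit in front of this, deciding which URLs get added):

```python
import heapq
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class URLFrontier:
    """Toy frontier: one FIFO queue per host, plus a heap of
    (next_allowed_time, host) entries enforcing a politeness delay."""

    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        self.host_queues = defaultdict(deque)  # host -> FIFO queue of its URLs
        self.ready_heap = []                   # (next_allowed_time, host)
        self.scheduled = set()                 # hosts currently on the heap

    def add(self, url):
        host = urlparse(url).netloc
        self.host_queues[host].append(url)
        if host not in self.scheduled:
            heapq.heappush(self.ready_heap, (time.monotonic(), host))
            self.scheduled.add(host)

    def next_url(self):
        """Return the next URL whose host is polite to contact, or None."""
        if not self.ready_heap:
            return None
        next_time, host = heapq.heappop(self.ready_heap)
        wait = next_time - time.monotonic()
        if wait > 0:
            time.sleep(wait)                   # respect the per-host delay
        url = self.host_queues[host].popleft()
        if self.host_queues[host]:
            # Host still has URLs queued: allow it again only after the delay.
            heapq.heappush(self.ready_heap, (time.monotonic() + self.delay, host))
        else:
            self.scheduled.discard(host)
        return url
```

Production frontiers (e.g. the Mercator design) put a set of priority "front queues" ahead of these per-host "back queues", which is where the biased random selection happens.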
Robots file
Robots.txt, formally the Robots Exclusion Protocol, is a standard websites use to communicate with crawlers. It tells crawlers whether the site allows crawling and can disallow specific paths.
To avoid re-downloading robots.txt for every page of a host, we cache the parsed file per host and refresh it periodically.
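A small sketch using Python's built-in robotparser, with a hypothetical time-to-live for the per-host cache:

```python
import time
from urllib import robotparser
from urllib.parse import urlparse

class RobotsCache:
    """Cache one parsed robots.txt per host; refresh after `ttl` seconds."""

    def __init__(self, user_agent="MyCrawler", ttl=24 * 3600):
        self.user_agent = user_agent
        self.ttl = ttl
        self.cache = {}  # host -> (fetched_at, RobotFileParser)

    def allowed(self, url):
        parts = urlparse(url)
        host = parts.netloc
        entry = self.cache.get(host)
        if entry is None or time.time() - entry[0] > self.ttl:
            rp = robotparser.RobotFileParser()
            rp.set_url(f"{parts.scheme}://{host}/robots.txt")
            rp.read()  # downloads and parses the file
            entry = (time.time(), rp)
            self.cache[host] = entry
        return entry[1].can_fetch(self.user_agent, url)
```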
Performance
We can optimise performance in several ways:
- Distributed web crawlers: split the crawl across many worker machines.
- Workers can also be geographically distributed, so each URL is routed to a worker close to the site being crawled.
- Maintain a DNS cache, or use a local caching resolver, so repeated lookups for the same host are cheap.
- Use short timeouts so that dead or slow servers can be skipped over quickly (see the sketch after this list).
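A sketch of the last two points; the resolve helper only illustrates DNS caching and is not wired into fetch (a real crawler would plug a caching resolver into its HTTP client):

```python
import socket
import urllib.error
import urllib.request
from functools import lru_cache

@lru_cache(maxsize=10_000)
def resolve(host):
    """In-process DNS cache: repeated lookups of the same host are free."""
    return socket.gethostbyname(host)

def fetch(url, timeout_seconds=3):
    """Fetch a page with a short timeout so dead or slow servers fail fast
    and can simply be skipped."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as resp:
            return resp.read()
    except (urllib.error.URLError, socket.timeout):
        return None  # skip this URL and move on
```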
Other issues
- Server-side rendering: some pages render their content on the client (e.g. using React), so the crawler has to render the page itself (for example in a headless browser, as sketched below) before parsing it and extracting links.
- Spider traps: some websites contain a labyrinth of auto-generated pages that can trap a crawler indefinitely; simple heuristics to avoid them are sketched at the end of this section.
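One way to handle client-side-rendered pages is to drive a headless browser; a minimal sketch using Playwright (an assumed choice of tool, not one the notes prescribe):

```python
from playwright.sync_api import sync_playwright

def fetch_rendered(url):
    """Load the page in a headless browser so client-side JavaScript runs,
    then return the rendered HTML for parsing and link extraction."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for JS-driven content
        html = page.content()
        browser.close()
        return html
```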
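For spider traps, a common mitigation (an assumption beyond the notes above) is to cap URL length and the number of URLs accepted per host; the thresholds below are hypothetical:

```python
from urllib.parse import urlparse

MAX_URL_LENGTH = 2000        # auto-generated trap URLs tend to grow very long
MAX_PAGES_PER_HOST = 10_000  # hypothetical cap per host

pages_accepted_per_host = {}

def should_skip(url):
    """Cheap trap heuristics: reject overly long URLs and stop accepting new
    URLs from a host once its cap is reached. Real crawlers also keep manual
    blacklists, since traps are hard to detect automatically."""
    if len(url) > MAX_URL_LENGTH:
        return True
    host = urlparse(url).netloc
    count = pages_accepted_per_host.get(host, 0)
    if count >= MAX_PAGES_PER_HOST:
        return True
    pages_accepted_per_host[host] = count + 1  # count URLs accepted for crawl
    return False
```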