As part of our LocalGov PageSpeedy site, we have recently introduced page and document counts alongside our application directory. We hope to build up a comprehensive list of council applications and who uses them.
To provide this information we have built a custom web crawler, the PageSpeedySpider, to crawl and index these numbers from localgov websites across the country.
The spider works like any other web crawler: starting at a site’s homepage, it finds all the links to other pages, documents and domains on that page, then uses them to continue the crawl. As we are primarily concerned with localgov sites, the spider follows these rules:
- We only follow links on the primary council site (so www.council.gov.uk or council.gov.uk). External links off the main site are not crawled.
- We track all document links on the primary domain, either through extension detection (for example ‘.pdf’) or media type (so when the web server tells us it’s a PDF).
- We track all sub-domains of the main council site (for example, planning.council.gov.uk). Against each of these domains we store the first link we find; this link is then used when running our detection scripts to identify applications.
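The rules above can be sketched as a simple link classifier. This is a minimal, hypothetical illustration only (the function name, domain constants and extension list are our own assumptions, not the actual SpeedySpider code), and it shows extension-based detection; media-type detection would additionally inspect the server’s Content-Type header.

```python
from urllib.parse import urlparse

# Hypothetical sketch of the crawl rules described above; the real
# SpeedySpider implementation may differ.

PRIMARY_HOST = "www.council.gov.uk"  # example primary council domain
DOC_EXTENSIONS = (".pdf", ".doc", ".docx", ".xls", ".xlsx")  # assumed list


def classify_link(url: str) -> str:
    """Classify a discovered URL as 'page', 'document', 'subdomain'
    or 'external' under the rules above."""
    parts = urlparse(url)
    host = parts.netloc.lower()
    path = parts.path.lower()
    bare = PRIMARY_HOST.removeprefix("www.")  # council.gov.uk

    if host in (PRIMARY_HOST, bare):
        # On the primary domain: documents are tracked separately
        # (extension detection; Content-Type sniffing not shown here).
        if path.endswith(DOC_EXTENSIONS):
            return "document"
        return "page"
    if host.endswith("." + bare):
        # e.g. planning.council.gov.uk - the crawler would store the
        # first such link found for the detection scripts.
        return "subdomain"
    return "external"  # off the main site: not crawled
```

For example, `classify_link("https://planning.council.gov.uk/apply")` returns `"subdomain"`, while a link to any other domain comes back as `"external"` and is skipped.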
We are constantly improving these scripts and welcome contributions if you know what a site is running.
It is not our intention to run the SpeedySpider every month. Spidering is a slow and time-consuming process that often requires us to tweak configuration and script settings to run on all sites.
At the moment our spidering server* crawls approximately 14 council websites a night, with sites that fail being analyzed and placed back into the queue for recrawling. Once we have crawled all the sites, we will freeze the crawl data.
We plan to run full site crawls only every 6 to 12 months, and against new sites when they are detected by the main PageSpeedy scripts.
* As part of making PageSpeedy quick and easy for anyone to run, the SpeedySpider is currently running on our state-of-the-art Raspberry Pi 2 Model B server.
We are constantly improving and tweaking the spider code. Web crawling is something of an addictive process: you set your crawler running, come back, and notice it has got lost down some rabbit hole, so you tweak the config and regexes and set it off again. And again. And again. The diversity and complexity of websites never leaves you with a ‘simple’ crawl!