Let A Thousand Spiders Crawl

Suppose for a moment that you are the website administrator for widgets.com, the premier website for purchasing widgets. One day, you hear that Intel had a major research breakthrough last year and is introducing a new microchip that lets anyone crawl the entire Internet in less than a second. You’re shocked to hear that they’re selling these new MegaCrawler chips for only a few dollars each. This might mean a lot more traffic for widgets.com, which might mean some late nights for you.

A few months go by, and you’re finally in the middle of that vacation you’ve been meaning to take. Work’s been a lot lately, what with the new Widget shipping soon, so it’s been good for you to take some time and relax. But! Your phone is ringing: widgets.com is down, and nobody can fix it. They need your help. After logging on and checking the website, you conclude that, yep, the website is in fact very down. It seems widgets.com is getting more traffic than it’s ever had, and the site is not handling it very well.

The servers that run your website are trying their hardest, but they just can’t keep up with all the traffic. Nobody ever expected your website to get this much attention in such a short amount of time. As soon as you get a clear look at some of the incoming traffic, you see what’s going on: the MegaCrawler chips!

You check the timing and, yep, it looks like the first batch of MegaCrawler chips started getting delivered over the past few days. People everywhere are now crawling the entire Internet, all the time. You start trying to figure out what to do. Humans usually request only a few pages per minute from a website, so any program that requests more than, say, 100 pages per minute is almost certainly a crawler and should be blocked to get your traffic under control. You modify the firewall to permanently block any computer that visits your website too fast, and after a few minutes, you check widgets.com and see that the site is working perfectly. You go back and enjoy the little bit of vacation you have left.
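To make that concrete, here is a minimal sketch of this kind of rate-based blocking, written in Python for illustration. The sliding one-minute window and the names (should_block, RATE_LIMIT) are assumptions of the sketch; a real site would more likely enforce the limit in the firewall or load balancer than in application code.

```python
import time
from collections import defaultdict, deque

RATE_LIMIT = 100   # pages per minute; anything above this is treated as a crawler
WINDOW = 60.0      # length of the sliding window, in seconds

recent = defaultdict(deque)   # client IP -> timestamps of its recent requests
blocked = set()               # IPs that have been permanently blocked

def should_block(ip, now=None):
    """Record one request from `ip` and return True if it should be blocked."""
    if ip in blocked:
        return True
    now = time.time() if now is None else now
    times = recent[ip]
    times.append(now)
    # Discard timestamps that have fallen out of the one-minute window.
    while times and now - times[0] > WINDOW:
        times.popleft()
    if len(times) > RATE_LIMIT:
        blocked.add(ip)   # the story's permanent block
        return True
    return False
```

Called once per incoming request, a check like this waves ordinary human browsing through while permanently blocking anything that behaves like a MegaCrawler.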

A few weeks go by, and now you’re getting calls from the finance department. The new Widget that everyone worked so hard to get out the door was published on the website last week. There haven’t been nearly as many orders for it as expected, they say, and they were wondering: what’s up with that?

You take a look at your traffic monitoring software and see that overall traffic to widgets.com is about the same as it ever was. Looking at the traffic for the new Widget’s page specifically, you see that people are accessing it, but they’re all finding it by browsing around on widgets.com, not by clicking on search results from search engines. This strikes you as odd, because for every other page on widgets.com, a good deal of traffic comes directly from Google. In fact, most of your traffic comes from Google showing people your pages when they search for a particular type of widget (or just widgets in general). You do a search on Google for the name of the new Widget and see that widgets.com doesn’t even show up in the results. Oh no!

You realize that in your haste to block all the MegaCrawler chips, you blocked all the search engines’ crawlers as well. Google’s crawler couldn’t fetch a fresh copy of widgets.com’s webpages, so it never discovered the page for the new Widget, which was created after Google had been blocked. You unblock Google’s crawler by making a special exception for it in the firewall. It’s a straightforward fix, and you use the tools Google supplies to website administrators (its Search Console) to request that the company direct its computers to crawl your website again as soon as possible. Within the day, the new Widget is showing up in the search results, and the finance people are – well, they aren’t ever all that happy, but at least they’re not particularly worried any more.
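One subtlety in carving out that exception: anyone can put “Googlebot” in a User-Agent header, so the firewall rule shouldn’t trust the label alone. Google documents a reverse-then-forward DNS check for verifying that a request really comes from its crawler; the sketch below (the function name is my own) shows roughly how that check works.

```python
import socket

def is_real_googlebot(ip):
    """Verify `ip` with the reverse-then-forward DNS check
    Google documents for identifying Googlebot."""
    try:
        host = socket.gethostbyaddr(ip)[0]   # reverse DNS lookup
    except OSError:
        return False
    # Genuine Googlebot hosts live under these Google-owned domains.
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # Forward-confirm: the hostname must resolve back to the same IP.
        forward_ips = {info[4][0] for info in socket.getaddrinfo(host, None)}
    except OSError:
        return False
    return ip in forward_ips
```

An IP address that passes this check can then be exempted from the rate limit above.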

And that’s the end of this hypothetical. Except it’s not really a hypothetical. There is no microchip out there that can crawl the entire Internet in less than a second, but computers have gotten very fast and computer hardware has gotten very cheap, so while it is still incredibly expensive to crawl the entire web, crawling a single website extensively costs almost nothing.

Serving a single webpage may be cheap, but it costs real money to run a website and handle all of its incoming traffic. Businesses therefore try to maximize the share of traffic that directly or indirectly creates revenue and minimize the share that does not. Web servers let their administrators decide who is allowed to interact with their websites, and many of those decisions are made on the basis of how much revenue or expense a given stream of traffic creates for the business. In the hypothetical above, the human visitors were never blocked, because they directly created revenue by making purchases on the website. All of the MegaCrawler chips were blocked, because none of them clearly created revenue, directly or indirectly, for the business. Google was specifically unblocked, because the traffic it sent to the website indirectly created revenue later on.

In reality, thanks to the millions of computers that now exist around the world, there are functionally hundreds, if not thousands, of MegaCrawler chips. Almost every website administrator has had to deal with a computer that someone has programmed to crawl their website too aggressively. A reasonable, easy, and frequently chosen response is to block that computer from interacting with the website at all.

While it is not hard for an individual to write and deploy a program that crawls a website so aggressively that the website crashes, it is not possible to crawl the entire web as often and as thoroughly as Google does. The cost of acquiring enough computing hardware to crawl the entire Internet often enough is immense. And even if you could get all that hardware, you would quickly be blocked by websites, because you provide them no economic value in exchange for the resources they spend serving your traffic. A general purpose search engine is only useful once it has crawled most of the web, so you would never get to crawl enough websites to attract the kind of audience that could go on to generate significant revenue for the sites that do allow you to crawl them.