The Googlebot Monopoly

One way to gauge how important the googlebot.com and google.com domains are would be to examine quarterly snapshots of all the IP blacklists and whitelists that website operators have created and used over the past 20 years. We don’t have access to those lists, and assembling them now wouldn’t be feasible, but a reasonable proxy for them does exist: websites adhere to something called the robots exclusion standard. As Wikipedia explains:

The robots exclusion standard, also known as the robots exclusion protocol or simply robots.txt, is a standard used by websites to communicate with web crawlers and other web robots. The standard specifies how to inform the web robot about which areas of the website should not be processed or scanned. Robots are often used by search engines to categorize websites.

The robots exclusion standard is not any sort of law or formal agreement. You can run a website without knowing the standard exists (many people do, in fact). The standard is a statement of the website operator’s intent and a request that others respect that intent. If you crawl a website in a fashion contrary to the guidelines in its robots.txt file, you can expect the operator to block you. That makes the robots.txt file a reasonable proxy for the IP blacklists and whitelists website operators have assembled. Helpfully, the file lives at a predictable location, /robots.txt, where anyone can read it.
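Because the location is fixed, checking a site’s crawl policy takes only a few lines of code. Here is a minimal sketch in Python using the standard library’s urllib.robotparser; the site, the page path, and the second user-agent string are all placeholders, not anything a real operator necessarily names:

```python
from urllib.robotparser import RobotFileParser

# The crawl policy for any site lives at the same well-known path.
parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # fetch and parse the file

# Ask whether a particular crawler may fetch a particular page.
# The user-agent strings and the page path are placeholders.
for agent in ("Googlebot", "SomeNewCrawler"):
    allowed = parser.can_fetch(agent, "https://example.com/some/page")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```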

As a specific example of how robots.txt files typically work, let’s look at the robots.txt for census.gov from October of 2018. It illustrates a common pattern. The first two lines of the file specify that you cannot crawl census.gov unless given explicit permission. The rest of the file specifies that Google, Microsoft, Yahoo, and two other crawlers that are not search engines may not crawl certain pages on census.gov, but are otherwise allowed to crawl whatever else they can find on the website. This tells us that there are two different classes of crawlers in the eyes of the operators of census.gov: those given wide access, and those that are totally denied.
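To make that pattern concrete, here is a simplified reconstruction of the structure just described, not the literal October 2018 file: a default deny-all rule, followed by per-crawler sections that restore access apart from a few excluded paths. Fed to Python’s urllib.robotparser, it produces the two classes directly:

```python
from urllib.robotparser import RobotFileParser

# A simplified reconstruction of the census.gov pattern, NOT the real
# file: deny everyone by default, then let named crawlers back in with
# a handful of excluded paths. The paths here are illustrative.
robots_txt = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/

User-agent: msnbot
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

for agent in ("Googlebot", "msnbot", "SomeNewCrawler"):
    print(agent, parser.can_fetch(agent, "https://www.census.gov/topics/"))
# Googlebot True, msnbot True, SomeNewCrawler False
```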

And, broadly speaking, when we examine the robots.txt files of many websites, we find the same two classes of crawlers. On one side are Google, Microsoft, and the other major search engine providers, which are given a good level of access; on the other side is everyone else, along with crawlers that have misbehaved in the past, who are given much less. Among the privileged, Google clearly stands out as the preferred crawler: it is typically given at least as much access as every other crawler, and sometimes significantly more than any of them.
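This claim is easy to spot-check. The sketch below, under obvious assumptions (a hand-picked list of sites, the homepage as the test URL, a made-up user-agent standing in for an unknown crawler), fetches each site’s robots.txt and compares what Googlebot may fetch with what a newcomer may fetch:

```python
from urllib.robotparser import RobotFileParser

# A hand-picked sample; a real survey would cover far more sites.
sites = [
    "https://www.census.gov",
    "https://www.wikipedia.org",
    "https://www.nytimes.com",
]

for site in sites:
    parser = RobotFileParser()
    parser.set_url(f"{site}/robots.txt")
    try:
        parser.read()
    except OSError:
        print(f"{site}: robots.txt unreachable")
        continue
    # Compare a privileged crawler with an unknown one on the same URL.
    for agent in ("Googlebot", "SomeNewCrawler"):
        verdict = "allowed" if parser.can_fetch(agent, f"{site}/") else "blocked"
        print(f"{site} -> {agent}: {verdict}")
```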

But don’t just take our word for it; let’s take a look at the evidence.