Now that we have a solid idea of the economic dynamics, let’s look at how this manifests in the specifics of crawling the web. Whenever a software program requests a webpage from a website, it provides two things: an IP address for the server to send the requested page back to, and something called a User Agent string.
The User Agent string can be anything the software program chooses to send. For example, the User Agent string a web browser provides to the websites it visits might look something like “Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36”. (To see the User Agent string your browser is using, visit this webpage.) When crawlers visit websites, they typically set their User Agent string to a name that clearly identifies who is operating the crawler: Google’s is “Googlebot,” Microsoft’s is “Bingbot,” and Yahoo’s is “Yahoo! Slurp.” But nothing enforces this; operators can set their crawlers’ User Agent strings to anything, including the strings used by other crawlers.
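To make this concrete, here is a minimal sketch of a client declaring its own User Agent string, written in Python with the requests library; the URL and the User Agent value are placeholders for illustration, not anything a real crawler uses.

```python
# Minimal sketch: a client declaring whatever User Agent string it likes.
# The URL and the User-Agent value below are placeholders.
import requests

headers = {
    "User-Agent": "ExampleCrawler/1.0 (+https://example.com/crawler-info)"
}

response = requests.get("https://example.com/", headers=headers)
print(response.status_code)
```

The same request could just as easily declare itself to be “Googlebot”; nothing in the protocol stops it.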
User Agent strings matter because they illustrate just how much power someone wields over this market. When people try to crawl the web and get blocked by website operators, they sometimes resort to impersonating Google’s crawler by using Google’s User Agent string. This happens often enough that researchers in industry and academia have studied how to block agents that impersonate Google. There is clearly value in being perceived as Google when one is trying to crawl the web.
If the User Agent string were the only thing crawlers had to send when requesting a webpage, Google would be in trouble: anybody could have their programs pretend to be Googlebot and crawl the web as aggressively and completely as Google does. What prevents this is the IP address the crawler provides as the return address for the response. An IP address is a sequence of numbers that computer networks use to uniquely identify computers on the network. (You can see the IP address your computer is using to browse the web by visiting this page.) If a crawler provided any IP address other than one it controlled, the server would send the response to that other address, and the crawler would never see the content of the page it was trying to access.
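As a rough illustration, assuming Python’s standard socket module and example.com as a stand-in host, the address the server replies to is the source address of the network connection itself, which a client cannot simply declare the way it declares a User Agent string:

```python
# Sketch: the server sends its response to the source address of the
# connection, not to an address the client freely claims in its request.
# example.com is a stand-in host.
import socket

with socket.create_connection(("example.com", 80)) as sock:
    local_ip = sock.getsockname()[0]
    print(f"The server will send its response back to {local_ip}")
```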
When website operators consider blocking a crawler, that crawler is typically identified and discussed by the IP address from which its web requests originate. Because Google and the other search engines provide economic benefit to website operators, those operators want to make sure they never block crawlers from the major search engines. Google and the other search engines have published procedures for checking whether a crawler that calls itself Googlebot is actually coming from Google; those procedures let website operators verify that the IP addresses sending the requests really belong to Google.
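A sketch of that check, assuming the reverse-then-forward DNS lookup Google describes, might look like the following; the IP address at the bottom is a placeholder for an address observed in a server’s logs.

```python
# Sketch of the reverse-then-forward DNS check for a crawler claiming to be
# Googlebot, using Python's standard socket module.
import socket

def is_verified_googlebot(ip_address: str) -> bool:
    # Step 1: reverse DNS - which hostname does this IP map to?
    try:
        hostname, _, _ = socket.gethostbyaddr(ip_address)
    except OSError:
        return False
    # Step 2: the hostname must fall under a Google-controlled domain.
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    # Step 3: forward DNS - does that hostname resolve back to the same IP?
    try:
        _, _, resolved_ips = socket.gethostbyname_ex(hostname)
    except OSError:
        return False
    return ip_address in resolved_ips

# Placeholder IP; substitute an address seen in your server logs.
print(is_verified_googlebot("192.0.2.1"))
```

Because only machines whose IPs reverse-resolve into Google’s domains can pass this check, an impersonator sending the right User Agent string from the wrong IP address is still caught.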
The truth of today’s Internet is that to crawl as aggressively and broadly as Google does, you need to be able to prove that you are Google. And the only way to do that is to own the domains “google.com” and “googlebot.com.” Google has exclusive control over those assets, and website operators have been giving special approval to the IPs associated with those two domains for the past two decades. Google’s vast search market dominance rests on its long-term control of these two assets.