The Evidence We've Found So Far - Knuckleheads' Club

What follows is some of the evidence we’ve found so far that supports the idea that crawling the web is a natural monopoly and that Google has control of that monopoly. We are working on writing up more of the evidence we have found so far, and will be posting notifications about updates to this page as we go along.

What We Found In The Robots.txt Files

As a disclaimer, we are ascribing specific intents to the robots.txt files without having communicated with the authors of those files. Sometimes there are comments left in the files that let us know the website operator’s intent; other times it is not clear whether something was a mistake or whether they wanted to give Google as wide access as they did. When we aren’t sure about something, we have tried to note that openly. It can be unclear whether a website operator intended for a part of their website to be exclusively crawled by Google or not, but often those grey areas work out in Google’s favor.

What follows is a collection of websites that we have written up, each section showing off a different type of privilege Google gets when it comes to web crawling. Based on our data warehouse and a rough analysis script, there are about 600,000 more sites showing clear bias towards Google in some way. We’re working our way through them, trying to select the ones that are most representative of the types of advantages Google gets.

You’ve probably never heard of many of these websites and that’s the point. Google’s advantage is not just one big site making one big decision, it’s hundreds of thousands of sites each making a few seemingly tiny decisions that all add up to Google having so much more data than anyone else that it is almost impossible to compete. There are four different types of privilege Google gets when it comes to crawling the web:

Special access to specific pages
Sites where only Google gets to display the page in full
Sites where Google gets extra access because of Google’s Ad Networks
Sites where Google gets something extra, not sure why

Special Access To Specific Pages

Some sites only want Google to crawl certain content for whatever reason.

Based on the smallarmsreview.com robots.txt file, we can see that Google has a lot of access that other crawlers don’t. In particular, only Google gets to see the inventory (Bing vs Google) pages.
Based on the gigazine.net robots.txt file, Bing isn’t allowed to crawl at all, and only Google is allowed to crawl the AMP pages. What this means in practice is that when you search for specific text from one of the recent articles on the website, only Google is able to retrieve the article (Google, Bing, Yahoo).
Based on the dailykos.com robots.txt file, we were able to find some webpages that Google was able to index that Bing does not seem to have access to. For whatever reason, Bing had its access to dailykos.com/news revoked, as indicated by the disallow: /news in the bingbot section. A lot of other crawlers are listed in the file, but Yandex is the only other search engine crawler listed and they don’t compete much outside Russia. By default on this website all crawlers not listed are totally blocked as well, so in effect in the English-speaking world only Google has access to these pages. The pages redirect to /tags for some reason, probably a site redesign. How this plays out in practice is that Google has about 100 results for this specific type of webpage, while Bing has 2.
Wikihow lets Google crawl everything on the site while restricting every other crawler from accessing some important pages (like RSS, which is super helpful to crawlers for discovery).

Facebook gives Google and Bing special access to Facebook Watch

Looking at the robots.txt file for Facebook.com, it’s clear that Facebook gives Bing and Google the ability to crawl through Facebook Watch while restricting all the other allowed crawlers to only crawling videos that they already know about. You have to apply to get permission to scrape Facebook, and many of the expected companies have already applied and got access. Many of them have similar levels of access, but we can see that Google and Bing get something that none of the other crawlers do. This is indicated by the Googlebot and Bingbot user-agents being given access to “/watch” while everyone else is restricted to “/watch?v=”.

What this means is that Google and Bing can go to Facebook Watch and discover the videos there as they come up. The other crawlers have to have a direct link and already know about the video they want to see (i.e. https://www.facebook.com/watch/?v=604309266815747&extid=K3PVJIRXrLhHMqGj) and then they can go directly there. Google and Bing are allowed to discover these videos ahead of time, while all the other crawlers would have to wait for somebody to talk about them someplace else before they could visit them. What this could mean in practice is that if there is an important event happening on Facebook Watch, Google and Bing would be able to start showing links to those pages much earlier than anyone else, as all the other search engines (that were allowed to crawl Facebook in the first place) would have to wait until links to the important videos started appearing someplace else. It also means that overall Google and Bing will have a much more comprehensive index of videos from Facebook Watch than anyone else.
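
The gap between “/watch” and “/watch?v=” can be sketched with Python’s standard urllib.robotparser module. The rules below are a simplified, hypothetical stand-in written from the description above, not a copy of Facebook’s file. OtherBot and the video id are made-up names for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical, simplified rules modelled on the description above.
# This is NOT Facebook's actual robots.txt.
WATCH_ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /watch
Disallow: /

User-agent: OtherBot
Allow: /watch?v=
Disallow: /
"""

watch_parser = RobotFileParser()
watch_parser.parse(WATCH_ROBOTS_TXT.splitlines())


def watch_allowed(agent, path):
    """Return True if the given crawler may fetch the given path."""
    return watch_parser.can_fetch(agent, path)


# The landing page is where new videos can be discovered.
landing = "/watch"
# A single video URL that a crawler must already know about (made-up id).
video = "/watch?v=123"

for agent in ("Googlebot", "OtherBot"):
    print(agent, "landing:", watch_allowed(agent, landing),
          "video:", watch_allowed(agent, video))
```

Under these assumed rules, the crawler limited to “/watch?v=” can fetch a video it already has a link to but cannot fetch the landing page where new videos would be found.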

Sites Where Only Google Gets To Display The Page In Full

Most webpages rely on JavaScript, CSS and images to make the page look nice and to offer more interactive elements than static plain text. The JavaScript, CSS and images for these pages are often in separate files that also need to be downloaded when displaying the webpage. Many websites restrict via robots.txt who is able to access these files, because they can be large and it is expensive for website operators to let others download them. Google has stated that they give website operators better search rankings if the website operators let them download these files. Google renders and displays the page as part of its crawling process, and viewing the page fully rendered is important for indexing content (especially when content is loaded dynamically via something like AJAX). So, it’s advantageous for Google when they are the only ones who can render certain websites in full. The websites that follow only allow Google to download everything needed to render the website in full.

windowsable.com (file)
Warning NSFW jav.guru (file)
cracku.in (file)
talkinfrench.com (file) – only Google gets to download the images, otherwise everybody has the same access
helium10.com (file)
bni.co.id (file)
ffrf.org (file)
cienradios.com (file)
motorsport.com (file)
sushifaq.com (file)

Sites Where Google Gets Extra Access Because Of Google’s Ad Networks

If you want to display a Google ad anywhere on your website, you have to let them crawl the page with the ad on it at some point. The ad crawler doesn’t spider (follow all the links on a page) and it only runs once a week or so. The information about the webpage is used to better target ads and also checked for scams/fraud. The Google ad network crawlers share the same cache as the Google Search crawler. This is all public information that can be found here.

Why this matters is that some websites give the ad crawler full access to their entire website by default. So even if you block every search engine crawler on a page, if there are Google ads on the page the ad crawler will still get access to crawl that page. We only recently realized this was happening and so we don’t have as many concrete examples of this yet. We should have more soon.
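
A configuration of this kind can be sketched as follows. Mediapartners-Google is the user-agent token Google documents for its AdSense crawler; the robots.txt text itself is a hypothetical example rather than a file from any site mentioned here, and /person/example is a made-up path.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is blocked, but the AdSense
# crawler (user-agent token "Mediapartners-Google") is allowed everywhere.
# An empty "Disallow:" line means nothing is disallowed for that group.
AD_ROBOTS_TXT = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /
"""

ad_parser = RobotFileParser()
ad_parser.parse(AD_ROBOTS_TXT.splitlines())

page = "/person/example"  # made-up path for illustration
ad_access = {
    agent: ad_parser.can_fetch(agent, page)
    for agent in ("Mediapartners-Google", "Googlebot", "Bingbot")
}
print(ad_access)
```

In this sketch the search crawlers, Googlebot included, are blocked from the page while the ad crawler is not, which is the situation described above.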

When looking at these examples, turn off any ad blockers you might have on to get the full effect. Sometimes on Google you might also have to click “repeat the search with the omitted results included”. Also, sometimes you need to page all the way to the end to see the full number of results that Google has (they give an overestimate at first and then show you a more realistic number at the end). Additionally, these pages are displayed differently than pages Googlebot has normal access to. It’s clear that they are in the index somehow, but the results in Google Search often restrict how much information they display about the page.

In the callername.com robots.txt file, we can clearly see that all the search engines are blocked from accessing the /person urls (example page). When you look for these pages in Bing, you don’t get anything back. When you look for them in Google, you get many hundreds back (potentially more, but hard to say since Google cuts you off at 30 pages deep).

In the wowhead.com robots.txt file, we can see that all search engines are blocked from accessing the /search, /list and /account pages. When you compare the results, though, Google either has way more access or chooses to display much more than Bing does for whatever reason (/search Bing vs Google, /list Bing vs Google, /account Bing vs Google). This site is strange because Google isn’t displaying that much about these pages in the search results, but it clearly has copies of some of them (particularly of /list).

Sites Where Google Gets Something Extra, Not Sure Why

Sometimes websites give Google things that nobody else has access to and we’re not totally sure why.

The nytimes.com robots.txt (potentially inadvertently) privileges Google and gives the Google Search crawler access to a variety of things that others don’t get. For example, Google gets access to the search results pages (https://www.nytimes.com/search?query=testing) which we’ve heard can be very useful for crawlers when they later go on to rank things. You can see this when you compare the results between Google and Bing.

The nytimes.com file looks like a certain type of mistake that happens often though, where a webmaster doesn’t realize that by specifying a set of directives for Googlebot they override directives for * and those * directives no longer apply to Googlebot. Quoting Google:
“Only one group is valid for a particular crawler. The crawler must determine the correct group of lines by finding the group with the most specific user-agent that still matches. All other groups are ignored by the crawler.”

Sometimes people think the directives are additive and that the * group also applies to bots that have their own group, but that is not the case! So right now, we’re pretty sure The New York Times is giving Google a bunch of access it isn’t giving to anyone else.
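
This kind of mistake is easy to reproduce with Python’s standard urllib.robotparser module, which also picks a single group per crawler. The robots.txt below is a hypothetical example of the pattern and not the nytimes.com file; the /private path is made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt showing the mistake: the webmaster blocks
# /search for everyone under "*", then adds a Googlebot group that
# does not repeat that rule.
GROUP_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /private

User-agent: Googlebot
Disallow: /private
"""

group_parser = RobotFileParser()
group_parser.parse(GROUP_ROBOTS_TXT.splitlines())


def can_crawl(agent, path):
    """Return True if the given crawler may fetch the given path."""
    return group_parser.can_fetch(agent, path)


# Googlebot only reads its own group, so /search is open to it.
print("Googlebot:", can_crawl("Googlebot", "/search?query=testing"))
# A crawler without its own group falls back to "*" and is blocked.
print("Bingbot:", can_crawl("Bingbot", "/search?query=testing"))
```

Repeating the /search rule inside the Googlebot group would be one way to close the gap in this sketch.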

For whatever reason only Google gets access to pbskids.org/cgi-registry/golocal. We think this is a webpage that helps you find your local broadcast station and set it for the duration of your browser session. Seems useful but we haven’t dug in enough to confirm.

For whatever reason only Google gets access to urls that have the phrase categories in them on gulfnews.com (robots.txt file). We think this is just a mistake by an operator who was trying to stop bots from crawling certain backend pages, but it does create a funny thing where only Google has access to webpages that have the word categories in the headlines. For example, Google directly links to this article while Bing can only link to a reaggregated article from msn.com.

Amazon’s Robots.txt Conundrum

Sometimes it is easy to see that Google is given special access but it is difficult to figure out how that access is used in practice. Up until recently, Google was clearly given special access to Amazon.com based on the robots.txt file. Based on our understanding of Amazon’s webpages, Google had up until very recently been given exclusive access to a variety of different things, such as the following:

parts of the Creator Hub,
all of the Customer Reviews,
all of the Customer Profiles,
all of the Top Selected Products and Reviews,
parts of Onsite Associates Program,
parts of Member Reviews pages.

What remains murky to us is how this was all used by Google and why Amazon would want only Google to have access. We’ve tried to find some smoking gun where Google has some text from one of these pages that Bing doesn’t have, but we haven’t been able to do so. It’s a conundrum, and Amazon has seemingly done this sort of thing for almost 10 years (2010, 2014, 2017, 2020). They’ve recently equalized it so that everybody gets the same access, but we still don’t know why it was unequal in the first place. We have a pet theory that Google had a blessed Amazon account it was logging in with to crawl further and deeper, but some former crawl engineers said that probably wasn’t the case. This is something we are going to look into more sometime soon.

Instagram gave Google access to a private undocumented API

Instagram used to let everybody crawl all of their webpages (indicated by this very simple robots.txt). Then around May 18th, 2017, they restricted crawling of Instagram to only a couple crawlers from the major companies (robots.txt). We haven’t yet tracked down exact dates for when this was introduced and when it was removed, but for a while Google was given access to a special API, “/api/v1/guides/guide”, that nobody else was given access to. So far as we can tell this is undocumented and inaccessible generally, so we cannot say what data is/was on the other side of that API.

Evidence And Implications Of Google Being Better At Crawling

One of the creators of Discourse, an online discussion platform that powers a lot of modern forums, has been complaining about Bing’s crawling habits for the last few years. He really started complaining on Twitter recently and talked in explicit terms about how Bing crawls Discourse websites way more and provides way less traffic than Google does. The Discourse team put up hard numbers to back it up, and a Bing product manager showed up in the forum to try and figure out what is going on. Surprisingly, the website operators discuss the political/economic implications of blocking Bing in explicit terms! They want to do the Right Thing here, but it’s hard to figure out what that is.

We have a theory that Google is so much better than Bing at crawling because Google is using data from AdSense, AdWords, Chrome, Analytics and Android all together to target their crawls much more accurately than Bing could ever hope to do. Bing has suggested as much to the UK’s Competition and Markets Authority (fact #80 in this report). As such, forcing Google to divest any of its products that collect data is going to have an impact on Google’s crawling operations. So, the search quality might go down as the index shrinks, and the costs imposed on website operators might increase as Google cannot target crawls as well.

When regulators move to break up big tech, they need to investigate thoroughly how much Google’s advantage in data collection plays a role in crawling. If regulators mandate extremely strict no-interaction structural separation between, say, Chrome and Google Search, and Google’s crawling operation depends significantly on tab data from Chrome, then there will be a harm to consumer welfare and that may not be approved by the courts. Breaking AWS off from Amazon is right and proper and straightforward, but breaking apart Google Search from its data sources used for targeting crawling and enforcing strict structural separation wouldn’t just split up disparate departments sharing a balance sheet but would harm the overall quality of Google Search. Congress was right to call Google an ecosystem of interlocking monopolies.

Applebot

There’s further evidence for Google as website operators’ crawler of choice. Apple News was introduced with iOS 9 in late 2015. Apple built a crawler called Applebot that went out to news websites, collected stories and articles, and then made that content available via the Apple News app. When Apple started operating Applebot, they stated that Applebot would assume the same privileges granted to Googlebot if no instructions were given for Applebot. If a news website blocked every crawler besides Googlebot and didn’t mention Applebot, then Applebot would act like Googlebot and crawl the pages to which Googlebot was given exclusive access. Apple News has since become a significant driver of traffic for website publishers.
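
The fallback Apple described can be sketched roughly like this. It is only an illustration of the stated behavior, not Apple’s implementation: the helper names and the robots.txt text are hypothetical, and the user-agent detection is deliberately naive.

```python
from urllib.robotparser import RobotFileParser


def named_agents(robots_txt):
    """Return the lowercased user-agent tokens a robots.txt file mentions."""
    agents = set()
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def applebot_can_fetch(robots_txt, path):
    """Use Applebot's own rules if the file names it, else Googlebot's."""
    agent = "Applebot" if "applebot" in named_agents(robots_txt) else "Googlebot"
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


# Hypothetical news site that admits only Googlebot and never mentions Applebot.
GOOGLE_ONLY = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

# The same file with an explicit Applebot group added.
APPLEBOT_BLOCKED = GOOGLE_ONLY + "\nUser-agent: Applebot\nDisallow: /\n"

print(applebot_can_fetch(GOOGLE_ONLY, "/news/story"))
print(applebot_can_fetch(APPLEBOT_BLOCKED, "/news/story"))
```

With no Applebot group, the sketch inherits Googlebot’s exclusive access; once the site names Applebot explicitly, those instructions win.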

Sometime in the middle of 2018, Apple changed its stated intent and removed any reference to crawling as if it were Googlebot. It’s unclear if the company still uses Googlebot’s directives as its own or how much access it was actually able to get from this scheme, but it is clear that some major companies are well aware of the preferential treatment that Google receives and seek to emulate it when possible.

Crawl Speeds

Google determines how fast a website should be crawled, not website operators. Website operators can request a change in crawl speeds but their request will only be honored for 90 days. Every other search engine respects website operators’ wishes and will crawl only as fast as the website operator asks them to. Google dictates terms to the website operators and they’re the only ones who get to do that. A good example of this is All Recipes (file). All Recipes gives all crawlers the same level of access to its pages, but only publishes the crawl delay for crawlers other than Google. Crawl delay is how long crawlers need to wait before requesting the next web page from the website. Google has stated that it does not follow the crawl-delay directive when given in robots.txt and that you must use the Google webmaster console to request a temporary slowdown in crawling.
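
A file of that shape can be read with the crawl_delay method of Python’s standard urllib.robotparser module. The robots.txt below is a hypothetical example in the spirit of the description, not the All Recipes file; the /admin path and the one-second delay are made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: identical access rules for everyone, but a
# crawl delay only in the "*" group, which Googlebot does not use
# because it has its own group.
DELAY_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /admin

User-agent: *
Disallow: /admin
Crawl-delay: 1
"""

delay_parser = RobotFileParser()
delay_parser.parse(DELAY_ROBOTS_TXT.splitlines())

# crawl_delay() returns None when the matching group sets no delay.
delays = {
    agent: delay_parser.crawl_delay(agent)
    for agent in ("Googlebot", "Bingbot")
}
print(delays)
```

In this sketch only the crawlers that fall back to the * group are asked to wait between requests.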
