In July of this year, the UK Competition and Markets Authority (CMA) finished their study of online platforms and digital media. It’s an excellent report built upon a shocking amount of information. For example, the CMA was given data about all the queries that happened on both Bing and Google over a period of time. They then used this data to analyze the relative frequency of different types of queries and the total volumes of queries that Bing and Google were receiving. Really, if you’ve made it this far into this website, you will get a big kick out of reading the report, particularly Appendix I.
Appendix I is one of the most important reports that has ever come out about Google and, as far as we can tell, one of the least discussed. Beyond the amazing query data analysis the CMA performed, they also analyzed the phenomenon this website is focused on, namely that Google has an advantage when it comes to crawling the web. They took all the robots.txt files from Common Crawl and analyzed them looking for bias. For each User Agent in each robots.txt file, they categorized the rules provided as “Denied,” “Allowed,” or “Partially Denied.” They also compared what looks to be the raw count of “Allow” directives each bot receives. They found that Google gets more “Allows” than everyone else and concluded that Google has an advantage when crawling the web.
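The CMA doesn’t publish their code, but the categorization they describe can be sketched in a few lines. This is a hypothetical reconstruction, not their actual method: it groups robots.txt rules by User Agent, then labels each agent “Allowed” (no disallows), “Denied” (everything disallowed, nothing allowed back), or “Partially Denied” (everything else). The example robots.txt is invented for illustration.

```python
# Hypothetical sketch of a CMA-style robots.txt categorization.
# For each User-agent group, classify the bot's access as
# "Allowed", "Denied", or "Partially Denied".

def categorize(robots_txt: str) -> dict:
    """Map each user agent named in robots_txt to an access category."""
    rules = {}           # agent -> list of (directive, path)
    current_agents = []  # agents the following rules apply to
    seen_rule = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # strip comments/whitespace
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if seen_rule:        # a rule ended the previous group
                current_agents = []
                seen_rule = False
            current_agents.append(value)
            rules.setdefault(value, [])
        elif field in ("allow", "disallow"):
            seen_rule = True
            for agent in current_agents:
                rules[agent].append((field, value))

    categories = {}
    for agent, agent_rules in rules.items():
        disallows = [p for d, p in agent_rules if d == "disallow" and p]
        allows = [p for d, p in agent_rules if d == "allow" and p]
        if not disallows:
            categories[agent] = "Allowed"
        elif "/" in disallows and not allows:
            categories[agent] = "Denied"
        else:
            categories[agent] = "Partially Denied"
    return categories

example = """
User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow: /private/

User-agent: *
Disallow: /
"""
print(categorize(example))
# {'Googlebot': 'Allowed', 'bingbot': 'Partially Denied', '*': 'Denied'}
```

Run over the millions of robots.txt files in Common Crawl, tallies of these per-agent labels are, as far as we can tell, roughly the shape of the comparison the CMA reports.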
This is important because, as far as we know, it’s the first time that regulators have seriously looked at Google’s advantage when it comes to web crawling. We have been unable to find any previous mention of bias in the relative access that crawlers get in regulatory reports and investigations. Almost all mentions of crawling are perfunctory, not getting beyond “Google crawls, this is what crawling means” before moving on to the actual complaint. In addition, and this is really very wild, the CMA got statements from Microsoft and Google about their relative cache sizes. On page 19 of Appendix I, finding 75, they give ranges showing that Google’s index of webpages is anywhere from 300 to 500 billion pages larger than Bing’s. On page 20, Microsoft stated to the CMA that “Google’s web crawler is more welcome than Bing’s crawler which makes it easier to discover new sites or updates”, partially explaining the discrepancy in index sizes.
Appendix I of the CMA report is an important first step toward understanding what is going on in the search engine market. There is a mountain of work and evidence that we are endeavoring to add. The CMA’s analysis concluded that Google has only a small advantage when it comes to web crawling. We think that advantage will widen considerably when more factors are taken into consideration. The relative percentage of “Allow”s between crawlers isn’t as strong a measure as could be used: there are situations where Google has more access than anyone else without receiving strictly more “Allow” directives than anyone else. Additionally, we think that if you take into account a website’s industrial category, its language, and its ranking/traffic, and look at changes in robots.txt files over time, you would find that Google has much stronger advantages in certain categories, languages, and rank ranges, and that those advantages have grown over time. We have done an analysis very similar to the CMA’s, with similar results, and are currently working on an analysis that takes more of these factors into account.
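To see why raw “Allow” counts can mislead, consider this invented robots.txt:

```
User-agent: Googlebot
Disallow:                   # empty Disallow: Googlebot may crawl everything

User-agent: *
Disallow: /
Allow: /public/index.html   # everyone else gets exactly one page
```

Every other crawler receives one “Allow” directive while Googlebot receives zero, yet Googlebot has vastly more effective access. A count of “Allow” lines alone would score this file as favoring the other crawlers.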
We applaud the CMA for their work on this issue and thank them for breaking new ground when considering how to regulate Google. Their work is an inspiration to us and we could not have been more excited to see it published. The CMA’s report is incredibly solid and we look forward to seeing what they do next to regulate Google and the rest of the tech companies.