Analyzing Google

We’ve been analyzing robots.txt files at scale to produce the evidence we’ve presented so far. For a while now we’ve been renting some expensive servers from Amazon, pulling robots.txt files from Common Crawl, parsing them, and loading them into a data warehouse. So far we have about 20 million unique robots.txt files, all drawn from a single Common Crawl crawl of the web. Our analysis of these files has shown that Google does enjoy a significant privilege when it comes to web crawling, and the warehouse has been extremely effective at surfacing specific examples of websites that privilege Google’s crawlers. Before we’re confident speaking about the files in aggregate, though, we have some refinement of our robots.txt parsing code to do: robots.txt files are pretty messy and we have some known parsing bugs to fix.
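To give a flavour of what “privileging Google’s crawlers” looks like in practice, here is a minimal sketch (illustrative only, not our warehouse code) of how a single robots.txt file might be flagged. It groups Disallow rules by user-agent and checks whether Googlebot is granted broader access than the wildcard agent; names like is_google_privileged are just placeholders for this example.

```python
# Minimal sketch (illustrative, not our production parser): flag a
# robots.txt file whose rules give Googlebot broader access than other bots.

def parse_groups(robots_txt: str) -> dict[str, list[str]]:
    """Group Disallow paths by the user-agents they apply to."""
    groups: dict[str, list[str]] = {}
    current_agents: list[str] = []
    seen_rule = False  # a rule line closes the current user-agent block
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if seen_rule:
                current_agents, seen_rule = [], False
            current_agents.append(value.lower())
            for agent in current_agents:
                groups.setdefault(agent, [])
        elif field in ("disallow", "allow"):
            seen_rule = True
            if field == "disallow" and value:
                for agent in current_agents:
                    groups.setdefault(agent, []).append(value)
    return groups


def is_google_privileged(robots_txt: str) -> bool:
    """Crude check: everyone else is blocked from '/' but Googlebot is not."""
    groups = parse_groups(robots_txt)
    wildcard_blocked = "/" in groups.get("*", [])
    google_rules = groups.get("googlebot")
    google_open = google_rules is not None and "/" not in google_rules
    return wildcard_blocked and google_open


example = """
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

print(is_google_privileged(example))  # True: only Googlebot may crawl the site
```

A real parser has to deal with far messier input than this, which is exactly the refinement work mentioned above.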

We are opening up the club for sign-ups in part because running these computing resources is expensive and the amount of data to sift through is overwhelming at times. By joining the club right now you will be funding the continued analysis of these files and the publication of the results.