Feb. 20, 2023

Understanding Robots.txt for Python Web Scraping

Learn with Examples!

This guide attempts to cover the basics of reading robots.txt files and understanding when to scrape a website with Python.


Bots and Misconceptions

An Internet bot is a software program that automates a task (which may or may not be repetitive). Both web scrapers and web crawlers are bots.

Some bots are harmful. They can be used to illicitly collect data from websites, flood websites with so much traffic that they crash, or generate spam. These bots are the reason why website owners add captcha checks to their web pages: it is in the owner's best interest to ensure that web traffic comes from humans.

While it might be easy to say that bots do more harm than good, that isn't necessarily the case. It shouldn't be surprising, but Google, Bing, DuckDuckGo, and many others all use bots such as web crawlers. Google, for example, isn't shy about calling its web crawler a bot; Googlebot knows about and has indexed billions of web pages.

If bots were alive, they would be very simple creatures. Because they're programs, they do what they're told to do until they have to stop. Googlebot, a web crawler, hops around the web, gathering assorted information from web pages. Other bots are web scrapers, which look for very specific information on the pages they visit: to get the latest headlines, a web scraper might target a specific website, like a news network. Both scrapers and crawlers gather information; the differences lie in what they are gathering and where they're getting it.

Illicit internet bots are those that don't follow the instructions a website's owner sets out for them. Some websites might have a "no scraping" policy; others might ban a specific bot. These rules should be respected: if a website doesn't allow scraping, it's best to listen. Luckily for us, there are automatic ways of checking.

When Not to Scrape

Knowing when it is safe to scrape is very important. While web scraping is generally legal, some websites may not allow it. Scraped data might also have restrictions on it (such as data that has copyright protection prohibiting redistribution). Here are some things to note:

  • Confidential data should generally not be scraped without permission.
  • You should comply with the ToS (Terms of Service) of a website if it explicitly prohibits web scraping.
  • You shouldn't copy or sell data that is copyrighted.

What is a Robots.txt file?

One way a website can tell bots that it doesn't want to be scraped is through its robots.txt file. The robots.txt file tells bots which parts of a site they are (and aren't) permitted to visit.

If a website has a robots.txt file, it will be located at the root of the website. Here is ours: fieryflamingo.com/robots.txt. We're very lenient; we allow all bots to scrape all available pages on our site.

Here's our robots.txt file as it currently exists:


User-agent: *
Allow: /

Ours is a simple example of a robots.txt file. What you need to remember from this section is that robots.txt files are for robots: they tell bots how they should behave on sites, where they are allowed to go, and where they aren't.
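
If you'd like to pull up a site's robots.txt from a script rather than a browser, fetching it is a one-liner. Here's a minimal sketch using the requests library (assuming you have it installed), pointed at our own file:

import requests

# robots.txt, when it exists, lives at the root of the site
response = requests.get('https://fieryflamingo.com/robots.txt')
print(response.text)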

On another note, you, the human, can actually glean some information about a company from their robots.txt page; for example, let's have a look at YouTube's (note the comment).

# robots.txt file for YouTube
# Created in the distant future (the year 2000) after
# the robotic uprising of the mid 90's which wiped out all humans.

User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /comment
Disallow: /get_video
Disallow: /get_video_info
Disallow: /get_midroll_info
Disallow: /live_chat
Disallow: /login
Disallow: /results
Disallow: /signup
Disallow: /t/terms
Disallow: /timedtext_video
Disallow: /verify_age
Disallow: /watch_ajax
Disallow: /watch_fragments_ajax
Disallow: /watch_popup
Disallow: /watch_queue_ajax

Sitemap: https://www.youtube.com/sitemaps/sitemap.xml
Sitemap: https://www.youtube.com/product/sitemap.xml


And then there's Facebook (the killjoy):

# Notice: Collection of data on Facebook through automated means is
# prohibited unless you have express written permission from Facebook
# and may only be conducted for the limited purpose contained in said
# permission.
# See: http://www.facebook.com/apps/site_scraping_tos_terms.php
...
User-agent: *
Disallow: /

Checking the Robots.txt File

A robots.txt file might look like this:

User-agent: *
Disallow: /admin
Allow: /admin/info

User-agent: Bad-Bot
Disallow: /

User-agent: Trusted-Bot
Allow: / 

To understand what we are looking for in a robots.txt file, we need to understand what a user agent is.

User agents let a website gather basic information about its visitors, like what browser and type of device they're using. Your user agent is just a string that your browser sends along with each request; if you want to see yours, you can go to What is my user agent? to find out. Here is an example:

"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"

Bots can be given user agents too; adding one to a bot is quite simple, and we'll get to how below. A bot can also be given a user agent token, which tells websites the bot's name. The user agent of the smartphone version of Googlebot looks like this:

Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1;  http://www.google.com/bot.html)
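
Giving your own bot a user agent is just a matter of attaching a header to the requests it makes. Here's a minimal sketch using the requests library; the token "MyTestBot" and the contact URL in it are made up for illustration:

import requests

# A short token plus a way for site owners to contact you is a friendly convention
headers = {'User-Agent': 'MyTestBot/1.0 (+https://example.com/bot-info)'}

response = requests.get('https://fieryflamingo.com', headers=headers)
print(response.status_code)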

Let's look back at the example robots.txt from earlier:

User-agent: *
Disallow: /admin
Allow: /admin/info

User-agent: Bad-Bot
Disallow: /

User-agent: Trusted-Bot
Allow: / 

The layout is pretty self-explanatory. There are groups of rules, or instructions, for specific bots. Each group provides the following information: the name of the bot (its user agent), which pages the bot can access, and which pages it cannot.

Let's first look at this group:

User-agent: *
Disallow: /admin
Allow: /admin/info

The group refers to all bots (hence the *) and it is telling them that the pages /admin and any subpages after /admin (such as /admin/one) are disallowed, meaning that they shouldn't be scraped. In addition, /admin/info is explicitly allowed as an exception to the first rule.

Because there are no other restrictions to this group, all other webpages are implicitly allowed, meaning that, because they weren't mentioned, they can be scraped.
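
As a quick illustration, here's a toy sketch (not a real robots.txt parser) of how those three outcomes fall out of the rules above, assuming the usual "most specific matching rule wins" precedence:

# The two rules from the group above: (path prefix, allowed?)
RULES = [
    ('/admin', False),       # Disallow: /admin
    ('/admin/info', True),   # Allow: /admin/info
]

def allowed(path):
    matches = [(prefix, ok) for prefix, ok in RULES if path.startswith(prefix)]
    if not matches:
        return True  # no rule mentions this path, so it's implicitly allowed
    # the longest (most specific) matching prefix wins
    return max(matches, key=lambda m: len(m[0]))[1]

print(allowed('/admin'))       # False
print(allowed('/admin/one'))   # False
print(allowed('/admin/info'))  # True
print(allowed('/blog'))        # True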

Let's now look at another group:

User-agent: Bad-Bot
Disallow: /

This group refers to a bad bot whose user agent token is aptly named "Bad-Bot". It is not allowed to scrape any page on this website: Disallow: / covers the root and, with it, every path on the site.

Finally, let's look at this group:

User-agent: Trusted-Bot
Allow: / 

In contrast to the previous example, this bot is completely trusted. In fact, it is so trusted that it has the site's explicit permission to scrape areas of the website that general user agents (*) cannot, such as /admin. This bot is allowed to scrape every page on the site.

Can I Scrape ________ (Insert Here) Site?

One of the ways to figure out if your bot is allowed to scrape a specific website is to check the site's robots.txt file. This is an especially useful practice if you've already set your bot up to have a unique user agent. That way, you'll be able to discern which rules apply specifically to your bot, if any.

Hopefully you don't have any rules that apply to you specifically, though, as that would likely mean you're blocked. To see what I mean, look at Wikipedia's awesome robots.txt file, with a bunch of "blocked" bots. I hope you're not Zealbot, ZyBORG, Download Ninja, or any of the others :)

There are other ways to check if your bot is allowed to scrape a site. Some methods we recommend are to:

  • Contact the owner
  • Read the site's conditions on scraping

Automating the Process

We just went over the basics of analyzing a robots.txt file by hand. To analyze one automatically, there is a useful Python library called reppy. Let's get started with it.

First, to install reppy, the easiest way is to use pip:

pip install reppy 

If you have trouble using pip, go to reppy's Installation page for alternative ways to install it.

reppy should now be good to go. Let's make a new Python file and import it:

# main.py
from reppy.robots import Robots

Great! To test reppy, let's see if a specific bot is allowed by a robots.txt file. For now, let's call our bot 'MyTestBot'. In the file, let's define the URL that we want MyTestBot to scrape. I'm going to use https://wikipedia.org/wiki/Python_(programming_language). Let's see if we're allowed to scrape this page:

from reppy.robots import Robots

# The path we want MyTestBot to scrape, relative to the site root
url = '/wiki/Python_(programming_language)'
user_agent = 'MyTestBot'

# Fetch and parse Wikipedia's robots.txt
robots = Robots.fetch('http://wikipedia.org/robots.txt')

# Check whether the rules allow our user agent token to access that path
if robots.allowed(url, user_agent):
    print("You can scrape this page!")
else:
    print("Sorry, you can't scrape this page :(")

The above code gets, or "fetches", the robots.txt file of the website (wikipedia.org) and checks whether our bot (identified by its user agent token, MyTestBot) is allowed to scrape the URL we provided. Let's see what this returns!

You can scrape this page!

The print statement confirms that our code works as intended. If you go into the robots.txt file, you will indeed see that we are allowed to scrape this page (as there are no rules against it).

Now let's change this code to use a URL we cannot scrape and see if the program returns the correct response.

To find an example URL, go to the robots.txt file and look under User-agent: *. Here are the URLs that are currently disallowed for Wikipedia:

User-agent: *
Allow: /w/api.php?action=mobileview&
Allow: /w/load.php?
Allow: /api/rest_v1/?doc
Disallow: /w/
Disallow: /api/
Disallow: /trap/
Disallow: /wiki/Special:
Disallow: /wiki/Spezial:
Disallow: /wiki/Spesial:

Let's try changing our code to check if we can scrape one of these disallowed URLs.


from reppy.robots import Robots
url = '/trap/'  # disallowed for all user agents (*)
user_agent = 'MyTestBot'
robots = Robots.fetch('http://wikipedia.org/robots.txt')
if robots.allowed(url, user_agent):
    print("You can scrape this page!")
else:
    print("Sorry, you can't scrape this page :(")

And this is the result:

Sorry, you can't scrape this page :(

Great! So we know that the code works correctly. Let's try one more example, using the name of a forbidden bot instead of a forbidden URL. Here are some of those:

User-agent: Zealbot
Disallow: /

User-agent: MSIECrawler
Disallow: /

User-agent: SiteSnagger
Disallow: /

User-agent: WebStripper
Disallow: /

User-agent: WebCopier
Disallow: /


Let's try changing MyTestBot to SiteSnagger and see what happens. We'll use the URL that we were allowed to scrape with MyTestBot (http://wikipedia.org/wiki/Python_(programming_language)) and see whether we have permission to scrape that page with a banned bot.


from reppy.robots import Robots
url = '/wiki/Python_(programming_language)'  # the same page MyTestBot was allowed to scrape
user_agent = 'SiteSnagger'  # one of the bots Wikipedia bans outright
robots = Robots.fetch('http://wikipedia.org/robots.txt')
if robots.allowed(url, user_agent):
    print("You can scrape this page!")
else:
    print("Sorry, you can't scrape this page :(")

And sure enough, we get:

Sorry, you can't scrape this page :(

This method can be applied to any website with a robots.txt file. You don't need to use reppy, but I've found that it works pretty well. There are more ways to use it (including working with a specific agent as an object) that you can find on the project's GitHub page.
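
If you'd rather stick to the standard library, Python ships with urllib.robotparser, which can run the same kind of check without installing anything. Here's a minimal sketch; the expected True/False results assume Wikipedia's rules stay as shown above:

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url('https://wikipedia.org/robots.txt')
parser.read()  # fetch and parse the file

# The same two checks we ran with reppy
print(parser.can_fetch('MyTestBot', '/wiki/Python_(programming_language)'))    # expected: True
print(parser.can_fetch('SiteSnagger', '/wiki/Python_(programming_language)'))  # expected: False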