Homework 7 - Robots
Due by 11:59pm on 2023-12-06.
Objectives
Starter Files
Download hw07.zip. Inside the archive, you will find the test files for this homework.
Introduction
When accessing files on the World Wide Web with a program, there are etiquette rules to follow. One of these is to respect the wishes of web site owners and not access files specified in the site's robots.txt file. This file lists the paths and files on a domain that a "robot" or automated script should not access and that are meant to be accessed only by human visitors.
For Project 4, it is absolutely critical that your program respects the robots.txt file and limits its page loads to the initially specified domain. In addition to just being proper web etiquette, if your program wanders off the specified site and starts traversing the entire internet it could have negative repercussions for you, your fellow students, and the University. This has happened in the past and BYU has been blocked from accessing certain important websites. Please keep this in mind and don't point your program at major websites. Limit it to small sites you control or the ones we give you to test with. After the semester is over, and if you are accessing the web from a non-BYU connection, you can do whatever you want.
This homework focuses on building the functionality you'll need in your project to read and respect the robots.txt file for a given domain. You will build a small module that your full web crawler in Project 4 will use.
Part 1 - Getting Started
Task 1 - Installing libraries
For this homework, you'll need the requests library installed. If you've done Lab 21, you will already have it installed, but if not, you need to install it now.
To do so, open a terminal window where you normally run your python commands and install the package using pip:
pip install requests
This should install the requests library. You can test that it was installed correctly by opening up a Python interpreter and running the following command:
import requests
If that returns an error, something didn't work. If it just returns the interpreter prompt, you are good to go.
Task 2 - Create the LinkValidator module
The code for this project will be contained in a module called LinkValidator. Start by creating a LinkValidator.py file. Code in this file will use the requests library, so go ahead and import it at the top of your file.
You should be writing tests to verify your code is doing what it should. You can either add the if __name__ == "__main__": line to your module and write tests below it, or create pytest or doctest tests in the file along with your code.
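For example, a minimal sketch of that layout might look like the following. The class body gets filled in during Part 2, and the print statement is only a placeholder for your real tests:

# LinkValidator.py
import requests  # imported now; used once this module is part of the crawler

class LinkValidator:
    # Defined in Part 2, Task 1.
    pass

if __name__ == "__main__":
    # Tests placed here run only when this file is executed directly,
    # not when webcrawler.py imports the module.
    print("LinkValidator module loads without errors")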
Part 2 - Handling robots.txt
With everything set up, let's start coding.
Because of the importance of this functionality, we'll walk you through this fairly carefully. In the final project, this will also be the first thing the autograder tests, and it will test it explicitly. If this functionality isn't working properly, the autograder will not test the rest of the project and you will not receive any points on it.
Task 1 - Limiting where your program goes
Because this is so important, we're going to write this as a LinkValidator class that can be tested completely independently of the project. In your LinkValidator.py file, create a LinkValidator class. Note the capitalization; this is important for you to get credit, as the autograder will be creating objects of this class. This will be a fairly simple class with just two methods: __init__() and can_follow_link().
The __init__() method should take the domain name as its first input parameter and a list as its second parameter. The list will contain the paths listed in the robots.txt file that are disallowed. Both parameters should be stored as instance variables. In practice we might only pass in the domain name, but to make testing easier, we'll have the contents of robots.txt passed in as a parameter instead of reading it inside the class. Generating that list is Task 2.
Domains will be of the form "https://cs111.byu.edu" (no trailing slash).
Once the object has the domain and the contents of the robots.txt file saved, you should write the can_follow_link() method. This method takes a URL as its input parameter and does the following checks:
- Does the link start with the stored domain? If it doesn't, return False. Otherwise, proceed to the next check.
- Does the link contain any of the paths in the disallowed path list? If it does, return False; otherwise, return True.
This is a good opportunity to use regular expressions to match the domain and the individual paths in the robots.txt file.
Note that for the second condition, the paths have to immediately follow the domain. So if the domain is "https://cs111.byu.edu" and an item in the path list is "/data", then a URL that begins "https://cs111.byu.edu/data" would match condition 2 and should return False, but a URL like "https://cs111.byu.edu/Projects/Project1/data" would not match, because the "/data" part of the URL is not immediately after the domain.
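One way to express that "immediately follows the domain" requirement is an anchored regular expression. This is only a sketch of the idea; the domain and path values are taken from the examples on this page, and the variable names are illustrative:

import re

domain = "https://cs111.byu.edu"
path = "/data"

# re.escape makes characters like '.' match literally, and anchoring the
# pattern at the start means the path must come right after the domain.
pattern = re.compile("^" + re.escape(domain) + re.escape(path))

print(bool(pattern.match("https://cs111.byu.edu/data/spectra1.txt")))               # True: disallowed
print(bool(pattern.match("https://cs111.byu.edu/Projects/Project1/data/file.txt"))) # False: allowed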
Examples:
If the domain provided is "https://cs111.byu.edu" and the path list is ['/data', '/images', '/lectures'] then you should get the following return values for the specified inputs:
- "https://byu.edu" -
False- doesn't match the domain - "https://cs111.byu.edu/HW/HW01" -
True- matches the domain and doesn't match anything in the path list - "https://cs111.byu.edu/images/logo.png" -
False- matches the domain but also matches/images - "https://cs111.byu.edu/data/spectra1.txt" -
False- matches the domain but also matches/data - "https://cs111.byu.edu/Projects/Project4/images/cat.jpg" -
True- matches the domain and doesn't match any of the paths. It contains/imagesbut not immediately following the domain.
Later, when it is time to check the links, you'll create an instance of your LinkValidator class and call the can_follow_link() method to check to see if it should be followed or not.
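Putting Task 1 together, here is a minimal sketch of what the class could look like. It follows the behavior described above but is not the only valid implementation, and the self-tests at the bottom are just the examples from this page:

# LinkValidator.py (sketch)
import re

class LinkValidator:
    def __init__(self, domain, disallowed_paths):
        # Store both constructor arguments as instance variables.
        self.domain = domain
        self.disallowed_paths = disallowed_paths

    def can_follow_link(self, url):
        # Check 1: the link must start with the stored domain.
        if not url.startswith(self.domain):
            return False
        # Check 2: the link must not have a disallowed path
        # immediately after the domain.
        for path in self.disallowed_paths:
            if re.match("^" + re.escape(self.domain) + re.escape(path), url):
                return False
        return True

if __name__ == "__main__":
    validator = LinkValidator("https://cs111.byu.edu",
                              ["/data", "/images", "/lectures"])
    assert validator.can_follow_link("https://byu.edu") is False
    assert validator.can_follow_link("https://cs111.byu.edu/HW/HW01") is True
    assert validator.can_follow_link("https://cs111.byu.edu/images/logo.png") is False
    assert validator.can_follow_link("https://cs111.byu.edu/Projects/Project4/images/cat.jpg") is True
    print("all sketch tests passed")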
Task 2 - Read in robots.txt
Now that you are ready to use the contents of the robots.txt file, let's read it in. You will create a function called parse_robots() that takes as its input parameter a domain name (i.e. "https://cs111.byu.edu", "https://byu.edu", or "https://frontierexplorer.org", etc.) and returns the list of paths to exclude that are listed in that domain's robots.txt file. This is the list that you will pass to the LinkValidator object.
This function will be part of your web crawler, so create a webcrawler.py file and put this function in it. webcrawler.py will be the main file for Project 4, but we'll start it here in this homework. You will be using the requests library in this function, so you should import it at the top of your webcrawler.py file.
To get the data from the robots.txt file, we need to download the file, read each line, and parse it to add the prohibited paths into the list that will be passed to our LinkValidator. By the web protocol, this file will be located at the top level of the domain and so will be found at
<domain>/robots.txt
For this project, we will be passing in initial URLs of the form "https://cs111.cs.byu.edu/". The LinkValidator class assumes that the domain passed to it is the correct domain to use.
Using the requests library, get the robots.txt file. These files have the form:
User-agent: *
Disallow: /data
Disallow: /images/jpg
Disallow: /Projects/Project4/Project4.md
...
The User-agent tag specifies which user agents the following rules apply to. Typically it is *, which means all agents, although it is possible to specify a particular agent. For this homework you will ignore this line and assume that all the following rules apply to your program.
The Disallow lines are the items that your program should ignore. They are all paths relative to the top of the domain. So /data means that anything in <domain>/data should not be downloaded by your program. You need to extract these entries from the file and create a list of the disallowed items. This is the list that is passed to the LinkValidator object when it is created. The first item /data is an example of a top level directory, /images/jpg means ignore anything in the <domain>/images/jpg subdirectory, but <domain>/images and <domain>/images/png are fine to access. The final line in the example is simply disallowing a specific file.
Again, this is an excellent place to use regular expressions to look for the "Disallow" text and capture the contents that follow it.
Your parse_robots() function should return a list of the paths to avoid. Given the robots.txt file in the example above, parse_robots() should return this list:
['/data', '/images/jpg', '/Projects/Project4/Project4.md']
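Here is a minimal sketch of how parse_robots() could be written. It assumes the domain has no trailing slash and that any response other than 200 means there is no robots.txt to honor; the exact regular expression and error handling are up to you:

# webcrawler.py (sketch)
import re
import requests

def parse_robots(domain):
    # Return the list of disallowed paths from <domain>/robots.txt.
    disallowed = []
    response = requests.get(domain + "/robots.txt")
    if response.status_code != 200:
        # Assumption: no readable robots.txt means nothing is disallowed.
        return disallowed
    for line in response.text.splitlines():
        # Capture whatever follows "Disallow:" on each rule line.
        match = re.match(r"Disallow:\s*(\S+)", line)
        if match:
            disallowed.append(match.group(1))
    return disallowed

if __name__ == "__main__":
    # Example usage with one of the domains mentioned above.
    print(parse_robots("https://cs111.byu.edu"))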
Turn in your work
You'll submit your LinkValidator.py and webcrawler.py files on Canvas via Gradescope, where they will be checked by the autograder. Make sure that you haven't "hard coded" anything specific to the test data we gave you. We do not guarantee that all scenarios are covered by the test data we have provided.