Project 4 - Web Crawler
Objectives
- Learn web basics (HTML, web protocols)
- Learn how to use the Requests, BeautifulSoup, and matplotlib libraries
- Understand basic plotting
Starter Files
Download proj4.zip. Inside the archive, you will find the starter and test files for this project.
Introduction
In this project you will build a simple web crawler that can load web pages, follow links on those pages, and extract data from the pages loaded. This is the final project of the class and by now you should be able to take the specifications for a problem and turn that into a working program.
As part of this project you'll learn about how the web works, some etiquette and protocols when crawling web pages, and how to programmatically read the pages you load and extract data from them. This project consists of three parts:
- Link counting - starting at the home page of a web site created for the class, you'll find all the links, load all the linked pages from the site, and repeat that until you've loaded every linked page. Along the way your program will record how many times each link was referenced on pages in the site. At the end, you'll produce a histogram of the number of links that were referenced 1, 2, 3, ... times as a plot, along with a table of all the links and their reference counts. These will be saved to output files.
- Data Extraction - Given a URL with tabular data on the web page, you will need to extract that data, plot it, and write the plot and data to output files.
- Image manipulation - Find all the images on a specified page, download them, and then run your image manipulation program from Project 1 on them to produce new output images.
In the previous projects, we've walked you through each of the individual steps to accomplish the task at hand. In this project, there will be a lot less of that. We'll tell you what needs to be done and possibly give you some specifications of elements to use, but generally you'll be on your own to design the exact implementation and steps to take to achieve the outcome.
Part 1 - Setup
Task 1 - Install the needed libraries
In order for this project to work, you will need three external libraries that are not part of the default Python installation: Requests, Beautiful Soup, and matplotlib. If you've done Labs 19 & 21, you will already have them installed; if not, you need to install them now.
To do so, open a terminal window where you normally run your python commands and install the packages using pip:
pip install requests
pip install beautifulsoup4
pip install matplotlib
This should install the libraries you need. You can test that they are installed correctly by opening up a python interpreter and running the following commands:
import requests
import bs4
import matplotlib
If any of those return errors, something didn't work. If they just return the interpreter prompt, you are good to go. We won't be using all of the library functionality and will just be importing parts of them in the project. We'll give you the details at the appropriate time.
Task 2 - Create the initial program
For this project, you'll be submitting several files. You'll put your main program in a file named webcrawler.py (which you created in Homework 7) but there will be others you create along the way as well. Right now webcrawler.py probably just has the parse_robots() function you wrote in Homework 7. Now would be a good time to add your initial main function or main block if you haven't already.
Just like in Project 1, we'll be passing command line arguments to this program to tell it which of the three tasks we want it to perform. The three commands this program should accept are:
-c <url> <output filename 1> <output filename 2>
-p <url> <output filename 1> <output filename 2>
-i <url> <output file prefix> <filter to run>
The -c command is for counting the links. The URL is the starting page of the search. Output file 1 will contain the histogram image and output file 2 will be a CSV-formatted file of all the links and their reference counts.
The -p command is to extract and plot data. The URL is the page that contains the data to extract. Output file 1 will contain the data plot and output file 2 will contain the data in CSV format.
The -i command is for finding and manipulating images. The URL is the page from which we want to extract images. The output file prefix is a string that will be prepended to the name of every manipulated image to produce the name of the output image file. The filter to run will be a flag specifying which filter from your image manipulation program to run. Specifically, your program should handle the following filter flags:
- -s - sepia filter
- -g - grayscale filter
- -f - vertical flip
- -m - horizontal flip (mirror)
Some examples of possible commands are:
-c http://cs111.byu.edu/project4 project4plot.png project4data.csv
-p http://cs111.byu.edu/project4/datatable.html data.png data.csv
-i http://cs111.byu.edu/project4/imagegallery.html grey_ -g
-i http://cs111.byu.edu/project4/imagegallery2.html sepia_ -s
Note: These are just examples. These webpages do not exist.
Write code to verify that the passed-in parameters are valid and print an error message if they are not. Your error message should contain the phrase "invalid arguments" somewhere in the text returned to the user (this is what the autograder will be looking for).
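Here is a minimal sketch of what that check might look like; the function name arguments_valid and the exact checks are assumptions, and you are free to structure this however you like:

import sys

def arguments_valid(args):
    # args is sys.argv[1:]; every command takes a flag plus three values
    if len(args) != 4:
        return False
    if args[0] in ('-c', '-p'):
        return True
    if args[0] == '-i':
        # the last argument must be one of the supported filter flags
        return args[3] in ('-s', '-g', '-f', '-m')
    return False

if __name__ == '__main__':
    if not arguments_valid(sys.argv[1:]):
        print("invalid arguments")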
Part 2 - Counting links
This part of the project will implement the -c command to count links. For this operation, you'll be given a starting URL that represents the home page of the site to analyze, along with the names of two output files for storing the final data.
An overview of the basic algorithm of this part of the code is as follows:
1. Given the initial URL:
   1. Read the robots.txt file and store all the values in it in an exclusion list (you'll use the LinkValidator class from Homework 7 for this).
   2. Add the initial URL to the list of links to visit.
2. Get the next link from the list of links to visit:
   1. See if the link has been visited already.
   2. If so, increment the count for that link.
   3. If not:
      1. Add it to the table of visited links with a count of 1.
      2. Verify that it is a link to follow (LinkValidator's can_follow_link() method returns True). If so, continue; if not, jump to step 3.
      3. Load the page.
      4. Extract all the links on the page and add them to the list of links to visit.
3. Repeat step 2 until the list of links to visit is empty.
4. Create a plot from the table of visited link counts and write it to output file 1.
5. Write the data from the table of visited link counts in CSV format to output file 2.
Task 1 - Limiting where your program goes
This is step 2.3.2 from the algorithm above and uses the LinkValidator class you wrote in Homework 7.
Start by reading in the robots.txt from the domain passed in as the first command-line argument for the -c flag. Use your parse_robots() function to read the robots.txt file and then create a LinkValidator object for your program to use.
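As a rough sketch, the setup might look something like the following. The exact signature of parse_robots() and the LinkValidator constructor arguments depend on your Homework 7 code, so treat the names and argument order below as placeholders and match them to what you actually wrote:

import requests
from LinkValidator import LinkValidator   # adjust to however your class is organized

# domain would be something like "https://cs111.byu.edu", derived from the
# starting URL passed on the command line
robots_text = requests.get(domain + "/robots.txt").text
exclusions = parse_robots(robots_text)           # your Homework 7 function
validator = LinkValidator(domain, exclusions)    # placeholder constructor arguments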
Remember this warning from Homework 7: It is absolutely critical that your program respects the robots.txt file and limits its page loads to the initially specified domain. In addition to just being proper web etiquette, if your program wanders off the specified site and starts traversing the entire internet it could have negative repercussions for you, your fellow students, and the University. This has happened in the past and BYU has been blocked from accessing certain important websites. Please keep this in mind and don't point your program at major websites. Limit it to small sites you control or the ones we give you to test with.
Task 2 - Data storage
There are a couple of data items your program will need. The first is a list of all the links you need to visit. Create that list and store the initial URL (passed in on the command line) in the list.
You will be iterating over this list, visiting links in order and adding new links to the end as they are discovered. You could keep track of the index of the item you are currently visiting. You could use list slicing to remove links you have visited, but that is less efficient since you are constantly creating new lists. You could also write a generator that returns the next item in the list; a generator is fairly simple and will just continue to give you the next item until the list is empty.
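If you go the generator route, a minimal sketch might look like this; it walks the list by index so that links appended while you are iterating still get visited:

def links_to_visit(links):
    # Yield each link in order; links appended to the list while this
    # generator is running will still be yielded.
    index = 0
    while index < len(links):
        yield links[index]
        index += 1

to_visit = [start_url]          # start_url comes from the command line
for link in links_to_visit(to_visit):
    pass   # process the link, possibly appending new links to to_visit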
The other data structure is a dictionary that will keep track of the number of times a link appeared on the pages of the website. This will initially be empty but you need to create it.
Task 3 - The main loop
This is step 2 in the algorithm presented above. Using the generator (or whatever method you decided on in Task 2), loop over all the links in your list of links to visit and process it according to the algorithm described. You'll use your LinkValidator to determine if it is a link that should be loaded, keep track of the number of times the link appears in the dictionary you created, and add any links found on the loaded pages to the list of links to visit.
To load a page, use the requests library to get the HTML and the BeautifulSoup library to search through it for <a> references. Your code may look something like this:
import requests
from bs4 import BeautifulSoup

# ONLY DO THIS IF LinkValidator SAYS IT'S OKAY
page = requests.get(link)
html = BeautifulSoup(page.text, "html.parser")
for tag in html.find_all('a'):
    href = tag.get('href')
    # process the link
When processing the link, there are several things that you should be aware of:
- If the link ever contains a pound sign (or hashtag - the '#' symbol), the web crawler should visit and count the link without the pound sign and all following characters.
- Some of the links may be relative links that start from the domain. These start with a forward slash ('/'). These links are relative to the domain and should have the domain prepended to them before being added to your list of links to visit. For example, if the web crawler is scraping "https://cs111.cs.byu.edu/proj/proj4" and finds a link "/labs/lab04", the webpage to visit and count is "https://cs111.cs.byu.edu/labs/lab04".
- Many of the links in the page are going to be relative links that start with the current page like "page2.html" which means they will not start with a protocol specifier (i.e "http://") or a forward slash ("/"). These links are relative to the current page. For example, if you loaded "https://cs111.cs.byu.edu/project4/assets/page1.html" and found a link that had the href "page2.html", the full link would be "https://cs111.cs.byu.edu/project4/assets/page2.html". Notice that the final item in the URL path was replaced with the content in the href.
- Some links may be absolute links with full URLs (that start with "http" or "https"); these can just be added directly unless the link contains a pound sign. Once again, if the link contains a pound sign, the web crawler should strip off the pound sign and all following characters and count and visit the resulting URL. For example, if the link is "https://cs111.cs.byu.edu/proj/proj4#part1", the link that should be visited and counted is "https://cs111.cs.byu.edu/proj/proj4".
- Some links may be to anchor tags and begin with a pound sign. These are references to markers within the page. For example, if you are on the page "https://cs111.cs.byu.edu/project4/selfreferencing.html" and had an href of the form "#section1", this refers to the URL "https://cs111.cs.byu.edu/project4/selfreferencing.html#section1". Notice that there is no forward slash between the parts. For this project, if the web crawler finds a link starting with a pound symbol, the URL to count will be the one currently being scraped; however, the web crawler should NOT visit that webpage again (it would just reload the same page).
Pro Tip: In all cases, if the link contains a pound sign (or hashtag - the "#" symbol), the web crawler should strip off the pound sign and all following characters, then process the stripped result according to the bullet points above.
It is highly recommended that you write a function that will handle this logic given the necessary information and return the link to count and visit.
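For example, such a function might look like the sketch below; the name normalize_link and its parameters are just illustrative:

def normalize_link(href, current_url, domain):
    # Strip a pound sign and everything after it
    href = href.split('#')[0]

    # An href that was only an anchor ("#section1") refers to the current page
    if href == '':
        return current_url

    # Absolute links can be used as-is
    if href.startswith('http://') or href.startswith('https://'):
        return href

    # Domain-relative links start with a forward slash
    if href.startswith('/'):
        return domain + href

    # Page-relative links replace the last item in the current URL's path
    return current_url.rsplit('/', 1)[0] + '/' + href

Note that this only produces the URL to count; your calling code still has to apply the rule that an anchor-only link should be counted but not visited again.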
Once you have all that working, you're ready to crawl the site and generate the link count data. To test your code, crawl the following webpage:
https://cs111.byu.edu/proj/proj4/assets/page1.html
The next step is to plot it and save it to a file.
Task 4 - Generating the plot
Now that we have a list of links and the number of times they were referenced, we need to generate the data for our plot and create the plot itself. For this we will use matplotlib's histogram functionality, the matplotlib.pyplot.hist() function. You should read the documentation for this function at https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html.
The data you need to pass to the function are the values in your dictionary of link counts. When generating your histogram, you'll need to specify the bins parameter to the hist() function. This parameter represents the edges of the bins, i.e. the first bin contains the values greater than or equal to bins[0] and less than bins[1]. The smallest bin edge should be 1 and the largest should be one greater than the maximum count value in your data. For example, if you had the following count data:
[1, 3, 6, 2, 1, 1, 4, 2, 3, 6, 2]
the bins should be:
[1, 2, 3, 4, 5, 6, 7]
You should save the generated histogram image to a file with the filename specified by the <output filename 1> argument passed in on the command line.
The hist() function returns three items; the first two are lists containing the counts in each bin and the bin values themselves. You'll need to capture all three of them but will only use the first two.
Using those values, create a file with the name specified by the <output filename 2> command line argument that contains the data as comma separated values, with each line holding the bin value followed by the count in that bin. For example, given the data above, hist() would return the following lists:
values = [3., 3., 2., 1., 0., 2.]
bins = [1., 2., 3., 4., 5., 6., 7.]
The resultant file would look like:
1.0,3.0
2.0,3.0
3.0,2.0
4.0,1.0
5.0,0.0
6.0,2.0
Note that the last bin value (7 in this case) is not printed to the file.
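Putting those pieces together, the plotting and CSV output might look roughly like the sketch below; link_counts, plot_filename, and csv_filename stand in for your dictionary of counts and the two output filenames from the command line:

import matplotlib.pyplot as plt

counts = list(link_counts.values())           # the reference counts from your dictionary
bin_edges = list(range(1, max(counts) + 2))   # 1 up to one more than the largest count

values, bins, patches = plt.hist(counts, bins=bin_edges)
plt.savefig(plot_filename)                    # <output filename 1>

with open(csv_filename, 'w') as out_file:     # <output filename 2>
    # zip() stops at the shorter list, so the final bin value is not written
    for bin_value, count in zip(bins, values):
        out_file.write(f"{bin_value},{count}\n")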
If you are successfully generating both files, you are done with this part of the project.
Part 3 - Reading and plotting data
In this part of the project, you'll implement the functionality for the -p command line argument. For this you'll find a specific table on a specified web page and read the data from the table, plot it, and save the data to a CSV file.
Task 1 - Load the page
The page to load is specified by the <url> parameter on the command line. Use the requests library to load the page. Print an error and exit if the page doesn't exist. Once you've read in the page, use BeautifulSoup to convert it into an object you can search through.
Task 2 - Find the table and parse the data
You need to find the table with the "#CS111-Project4b" id on the specified page. Once you've found it, you need to read the data from the table. The first column of the table contains the x-values for the data; every subsequent column contains a set of y-values for that row's x-value. Each column should be read into a list of data.
Hint: Once you've found the table, you can extract a list of table row (<tr>) elements; each row contains table data (<td>) elements, with the x-value in the first <td> element and the y-values in all the others. Loop over the table rows, extracting the values (assume they are all floats) and storing them in the data lists.
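One way to structure that is sketched below. It assumes html is the BeautifulSoup object for the loaded page, uses the "#" form of the id as a CSS selector, and skips any row that has no <td> cells (such as a header row):

table = html.select_one("#CS111-Project4b")   # find the table by its id

x_values = []
y_columns = []                                # one list per column of y-values

for row in table.find_all('tr'):
    cells = row.find_all('td')
    if not cells:
        continue                              # skip rows with no <td> cells
    x_values.append(float(cells[0].text))
    while len(y_columns) < len(cells) - 1:
        y_columns.append([])                  # create a list for each y-value column
    for i, cell in enumerate(cells[1:]):
        y_columns[i].append(float(cell.text))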
Task 3 - Plot the data and write the output.
For this part of the program, we'll be using the matplotlib.pyplot.plot() function. You can read the documentation on this function at https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html.
Each set of y-values should be plotted against the x-values and all the lines should be on a single plot. Each set of y-values should have a different color. The first y-value set should be blue, the second, green, then red, then black. We will never have more than four data sets on a page for you to plot.
Once you've created the plot, save it to the file specified as <output file 1> in the command line arguments.
After you've saved the plot, you should create the CSV file containing the data. This will be saved in the file specified by <output file 2> in the command line arguments. Each line should have the x-value, followed by each of the y-values for that x-value, separated by commas. They should be presented in the same order they appeared in the table on the web page.
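A sketch of that, continuing from the x_values and y_columns lists built in the Task 2 sketch (plot_filename and csv_filename are placeholders for the two output filenames):

import matplotlib.pyplot as plt

colors = ['b', 'g', 'r', 'k']                 # blue, green, red, black

for color, y_values in zip(colors, y_columns):
    plt.plot(x_values, y_values, color=color)
plt.savefig(plot_filename)                    # <output file 1>

with open(csv_filename, 'w') as out_file:     # <output file 2>
    for i, x in enumerate(x_values):
        row = [x] + [column[i] for column in y_columns]
        out_file.write(','.join(str(value) for value in row) + '\n')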
Task 4 - Refactoring
Was any of the work you did in this part the same as you did in Part 2? Do you have code duplication? Do you have code that is similar that you could generalize? Maybe you already noticed some and did that as you were writing the code for this part of the project. Or maybe you didn't. If you do have duplicate or similar code, consider refactoring your program to remove the duplications and create functions that can be called in multiple places.
Hint: look at the code that writes the CSV files.
Part 4 - Modifying images
In this part of the project, you'll be finding all the images on the page specified by the <url> command line argument, downloading them, applying the specified filters to them using the code you wrote in Project 1, and writing the new images to disk.
Task 1 - Find the images
Images on a web page are specified by the <img> tag. The URL to the image is in the src attribute of that tag. Just like the links in Part 2, the URLs in the src attribute may be absolute, domain-relative, or page-relative, so you will need to construct the complete URL before you can download the image.
Load the web page specified by the <url> command line argument and create a list of all the URLs to the images on that page. We'll use that list in the next task.
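As a sketch, the image URLs can be collected much like the <a> links in Part 2 (again assuming html is the BeautifulSoup object created from the loaded page):

image_urls = []
for tag in html.find_all('img'):
    src = tag.get('src')
    if src:
        image_urls.append(src)   # convert to a complete URL, as in Part 2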
Task 2 - Download the images
For this part you'll need the code you wrote in Project 1. Copy your image_processing.py file into your Project4 directory. Add a line to webcrawler.py to import the necessary functions from that file. You can either import just the ones you need, import all of them if you don't have name conflicts, or just import the file as a package (so you'd call the functions as image_processing.grayscale() for example).
Since we are going to use our byuimage library to process the image, we first need to save all the images to the local disk so our Image object can open them (it's not designed to take a raw byte stream). To do this, we'll take advantage of the .raw attribute that the response object has that just gives you the raw bytes received. For each image URL we need to do the following:
import shutil

response = requests.get(url, stream=True)
with open(output_filename, 'wb') as out_file:
    shutil.copyfileobj(response.raw, out_file)
del response
This gets the image (url variable) and opens it as a stream (the stream=True part) so it doesn't need to read the entire image at once. This reduces the memory cost of copying it to disk as only some of the image is read into memory at a time. The shutil.copyfileobj() function is what writes the raw response data out to disk.
Before you can run this code, you need to generate the output file name. You should just use the filename provided in the URL. So if the URL is "https://cs111.cs.byu.edu/images/image1.png", the output_filename should be "image1.png".
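One simple way to get that (a sketch) is to take everything after the last forward slash in the URL:

output_filename = url.rsplit('/', 1)[-1]   # "https://.../image1.png" -> "image1.png"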
Task 3 - Process and save the images
Now that the images are downloaded, you need to apply the filter to them and save the modified image with the new filename. You could do this in a separate loop or as part of the loop that downloads the images.
For each image you need to do the following:
- Create the output filename - this is just the original filename with the <output file prefix> command line argument prepended to it. If the filename was "image1.png" and the <output file prefix> was "g_", the output filename would be "g_image1.png".
- Load the original image into an Image object.
- Call the appropriate filter function based on the <filter to run> command line argument (see the sketch below).
- Save the modified image to the output filename determined in step 1.
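One way to select the filter is sketched below. This is a sketch only: the function names in the dictionary are placeholders, and it assumes each Project 1 filter takes an input filename and an output filename, so substitute whatever your image_processing.py actually provides and however it handles loading and saving:

import image_processing

# Map each command-line filter flag to a Project 1 filter function.
# These names are placeholders for your actual functions.
filters = {
    '-s': image_processing.sepia,
    '-g': image_processing.grayscale,
    '-f': image_processing.flip,
    '-m': image_processing.mirror,
}

filter_function = filters[filter_flag]        # filter_flag is the <filter to run> argument
for filename in downloaded_filenames:         # the files saved in Task 2
    filter_function(filename, output_prefix + filename)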
Task 4 - Refactor
Once again, look at your code for any duplicate or similar code and consider creating functions to simplify the code, make it more clear, and/or reduce redundancy.
Turn in your work
Congratulations, you've completed the project.
You'll submit your webcrawler.py, LinkValidator.py, and image_processing.py files on Canvas via Gradescope, where they will be checked by the autograder. We will be testing your program using the website and examples we gave you to test with, as well as other pages. Make sure that you haven't "hard coded" anything specific to the test data. We do not guarantee that all scenarios are tested by the sample data we have provided.
Going further
- We only included image filters that required no additional input parameters. How would you extend this program to also be able to use filters that did take additional parameters?
- You can think of a website as a tree with the homepage as the root node, all the links from that page as depth 2, the links from the depth 2 pages as depth 3, etc. What if you only wanted to crawl a website to a given "depth"? How might you implement that functionality, something common in real web crawlers?