Lab 22 - Web Crawl Scavenger Hunt
Due by 11:59pm on 2023-12-5.
Starter Files
There are no starter files for this lab. Create a lab22.py file.
Topics
Review Lab 21's topics at your discretion. There is no new content in this lab; it applies what you learned in Lab 21 and draws on general programming knowledge.
Scavenger Hunt!
Create a program called lab22.py. This program will take command line arguments containing a
URL, an HTML element, an attribute, and an output file name. The program should
load the page specified by the given URL and find the HTML element with the specified attribute.
The value of this attribute will contain the next URL, HTML element, and attribute (in that order) to look for, stored as a comma-separated list.
For example, if we ran lab22.py with the following command:
python3 lab22.py https://cs111.cs.byu.edu/ p checkpoint1 output.txt
The program should look for the paragraph tag <p> in the CS111 home page with the attribute checkpoint1.
<p checkpoint1="https://www.byu.edu/,a,checkpoint2 ">Do you know Joe?</p>
Once the program does that, the program should then go to BYU's home page, look for an anchor
tag with the attribute checkpoint2, and read its contents for the next location.
This search loop continues until an attribute called final is found. At that point, write the final attribute's content to the output file.
For example, if we have an HTML tag that looks like this:
<img src="doge.jpg" final="Congratulations! You made it!">
We would write Congratulations! You made it! to the output file specified.
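One possible shape for this loop is sketched below, assuming pages are fetched with requests and parsed with BeautifulSoup; if Lab 21 used different tools, adapt accordingly. The hunt function name and the .strip() calls are illustrative choices, not requirements.

import sys
import requests
from bs4 import BeautifulSoup

def hunt(url, tag_name, attribute, output_path):
    # Follow the chain of checkpoint attributes from page to page.
    while True:
        page = requests.get(url)
        soup = BeautifulSoup(page.text, "html.parser")
        # Find the first tag of the requested type that has the attribute.
        element = soup.find(tag_name, attrs={attribute: True})
        value = element[attribute]
        if attribute == "final":
            # End of the hunt: write the final attribute's content to the file.
            with open(output_path, "w") as out:
                out.write(value)
            return
        # Otherwise the value is "next URL,next element,next attribute".
        url, tag_name, attribute = [part.strip() for part in value.split(",")]

if __name__ == "__main__":
    hunt(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4])

Note the .strip() on each piece: attribute values like the checkpoint1 example above may contain stray whitespace around the commas.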
Test Inputs
These inputs increase in scavenger-hunt length, with Input 1 being the shortest.
Input 1:
python3 lab22.py https://cs111.cs.byu.edu/lab/lab22/assets/sample1.html li checkpoint1 output.txt
Note: TAs should walk through this example
Input 2:
python3 lab22.py https://cs111.byu.edu/lab/lab22/assets/webpage1.html p mediumhunt-checkpoint1 output.txt
Input 3:
python3 lab22.py https://cs111.byu.edu/lab/lab22/assets/webpage4.html ul longhunt-checkpoint1 output.txt
Submit
If you attend the lab, you don't have to submit anything.
If you don't attend the lab, you will have to submit working code. Submit the lab22.py
file to Gradescope through the window on the Canvas assignment page.
Going Further for Project 4
In reality, not all links contain the full URL to a webpage. For example, consider the following links:
<a href="/lab/lab21">Link 1</a>
<a href="#scavenger-hunt">Link 2</a>
<a href="sample2.html">Link 3</a>
In Project 4, you will build a web scraper that, given a URL, counts and stores all the links on that webpage. The scraper will then visit every valid link (the links that are on the CS 111 website) and count the links on those pages as well. To help you visit webpages given the kinds of links shown above, we will write a function that properly processes these links.
In order to do this correctly, you have to know how to process hrefs like
the ones given above. The information below will help you do this. If you want to
view examples as you go through the bullet points below, visit
https://cs111.byu.edu/proj/proj4/assets/page1.html
and https://cs111.byu.edu/proj/proj4/assets/page2.html.
Right click on the webpage and click View page source or Inspect to view the HTML and view the links.
- If the link ever contains a pound sign (or hashtag - the '#' symbol), the web crawler should visit and count the link without the pound sign and all following characters.
- Some of the links may be relative links that start from the domain. These start with a forward slash ('/'). These links are relative to the domain and should have the domain prepended to them before being added to your list of links to visit. For example, if the web crawler is scraping "https://cs111.cs.byu.edu/proj/proj4" and finds a link "/labs/lab04", the webpage to visit and count is "https://cs111.cs.byu.edu/labs/lab04".
- Many of the links on the page will be relative links like "page2.html", which means they will not start with a protocol specifier (e.g. "http://") or a forward slash ("/"). These links are relative to the current page. For example, if you loaded "https://cs111.cs.byu.edu/project4/assets/page1.html" and found a link with the href "page2.html", the full link would be "https://cs111.cs.byu.edu/project4/assets/page2.html". Notice that the final item in the URL path was replaced with the content of the href.
- Some links may be absolute links with full URLs (that start with "http" or "https"); these can be added directly unless the link contains a pound sign. Once again, if the link contains a pound sign, the web crawler should strip off the pound sign and all following characters and count and visit the resulting URL. For example, if the link is "https://cs111.cs.byu.edu/proj/proj4#part1", the link that should be visited and counted is "https://cs111.cs.byu.edu/proj/proj4".
- Some links may be to anchor tags and begin with a pound sign. These are references to markers within the page. For example, if you are on the page "https://cs111.cs.byu.edu/project4/selfreferencing.html" and find an href of the form "#section1", it refers to the URL "https://cs111.cs.byu.edu/project4/selfreferencing.html#section1". Notice that there is no forward slash between the parts. For Project 4, if the web crawler finds a link starting with a pound symbol, the URL to count is the one currently being scraped; however, the web crawler should NOT visit that webpage again (it would just reload the same page).
Hint: In all cases, if the link contains a pound sign (or hashtag - the "#" symbol), the web crawler should strip off the pound sign and all following characters, then process the stripped result according to the bullet points above. (As always, feel free to disregard this hint if you feel it will not help you.)
Write a function that handles this logic given the necessary information and returns the link to count and visit. It is up to you what the parameters should be, but the function should return the link to count and visit (as described in the bullet points above). Write the function in such a way that you only have to make a minor adjustment to your current code. A sketch of one possible approach follows.
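As a rough illustration (not the required design), here is one way such a function might look. The name resolve_link and its two parameters, the URL of the page currently being scraped and the raw href, are assumptions you are free to change.

def resolve_link(current_url, href):
    # Strip the pound sign and everything after it (see the hint above).
    stripped = href.split("#")[0]
    if stripped == "":
        # The href was only a fragment like "#section1": count the current
        # page, but do not visit it again.
        return current_url
    if stripped.startswith("http"):
        # Absolute link: the stripped URL can be used as-is.
        return stripped
    if stripped.startswith("/"):
        # Relative to the domain: prepend the scheme and domain of the
        # current page, e.g. "https://cs111.cs.byu.edu" + "/labs/lab04".
        domain = "/".join(current_url.split("/")[:3])
        return domain + stripped
    # Relative to the current page: replace the last item in the URL path,
    # e.g. ".../assets/page1.html" + "page2.html" -> ".../assets/page2.html".
    return current_url.rsplit("/", 1)[0] + "/" + stripped

For instance, if current_url were "https://cs111.cs.byu.edu/proj/proj4/assets/page1.html", the three example links above would resolve to "https://cs111.cs.byu.edu/lab/lab21", the current page itself (counted but not revisited), and "https://cs111.cs.byu.edu/proj/proj4/assets/sample2.html".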
Test your code:
python3 lab22.py https://cs111.cs.byu.edu/lab/lab22/assets/webpage10.html p link-checkpoint1 output.txt
It may be worth visiting https://cs111.cs.byu.edu/lab/lab22/assets/webpage10.html and following the link-checkpoints.