Lab 20 - Web Crawl Scavenger Hunt

Due by 11:59pm on 2023-08-10.

Starter Files

There are no starter files for this lab. Create a lab20.py file.

Topics

Review the Lab 19 topics at your discretion. This lab introduces no new content; it applies what you learned in Lab 19 and draws on general programming knowledge.

Scavenger Hunt!

This problem is very open-ended. You will have to use all the tools you know to solve it.

Create a program called lab20.py. This program takes four command-line arguments: a URL, an HTML element, an attribute, and an output file name. The program should load the page at the given URL and find the specified HTML element that has the specified attribute. The value of that attribute contains the next URL, HTML element, and attribute (in that order) to look for, stored as a comma-separated list.
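One possible way to read the four arguments is sketched below. The function name parse_args and the usage message are illustrative assumptions, not part of the lab specification.

```python
def parse_args(argv):
    """Return (url, tag, attribute, output_file) from a sys.argv-style list.

    Hypothetical helper: the lab only requires that the program accept
    these four arguments in this order.
    """
    if len(argv) != 5:
        # argv[0] is the script name, so a valid call has five entries.
        raise SystemExit("usage: python3 lab20.py URL ELEMENT ATTRIBUTE OUTPUT_FILE")
    _, url, tag, attribute, output_file = argv
    return url, tag, attribute, output_file
```

In lab20.py this could be called as parse_args(sys.argv) after importing sys.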


For example, suppose we ran lab20.py with the following line:

python3 lab20.py https://cs111.cs.byu.edu/ p checkpoint1 output.txt

The program should look for the paragraph tag <p> in the CS111 home page with the attribute checkpoint1.

<p checkpoint1="https://www.byu.edu/,a,checkpoint2 ">Do you know Joe?</p>

Once the program does that, it should go to BYU's home page, look for an anchor tag with the attribute checkpoint2, and read that attribute's value for the next location.
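A single step of the hunt can be sketched as follows. This version uses only Python's standard html.parser module; the tools covered in Lab 19 could be substituted. The names AttributeFinder, find_attribute, and parse_clue are hypothetical. The sketch assumes the first matching element is the one wanted, and it trims whitespace because the example value above ends with a space.

```python
from html.parser import HTMLParser


class AttributeFinder(HTMLParser):
    """Record the value of the first `tag` element that carries `attribute`."""

    def __init__(self, tag, attribute):
        super().__init__()
        self.tag = tag
        self.attribute = attribute
        self.value = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; HTMLParser lowercases names.
        if self.value is None and tag == self.tag:
            for name, value in attrs:
                if name == self.attribute:
                    self.value = value


def find_attribute(html, tag, attribute):
    """Return the attribute's value, or None if no matching element exists."""
    finder = AttributeFinder(tag.lower(), attribute.lower())
    finder.feed(html)
    return finder.value


def parse_clue(value):
    """Split 'url,tag,attribute' into three whitespace-trimmed parts."""
    # rsplit keeps any commas inside the URL itself intact.
    url, tag, attribute = (part.strip() for part in value.rsplit(",", 2))
    return url, tag, attribute
```

Applied to the example paragraph tag above, find_attribute returns the raw attribute value and parse_clue turns it into the next URL, element, and attribute.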


This search loop continues until an attribute called final is found. Once final is found, write the value of the final attribute to the output file.

For example, if we have an HTML tag that looks like this:

<img src="doge.jpg" final="Congratulations! You made it!">

We would write Congratulations! You made it! to the output file specified.
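The whole loop might look like the sketch below. It rests on some assumptions: the last clue names final as the attribute to look for, pages are UTF-8, and max_steps is an arbitrary safety limit. The names run_hunt, fetch, and fetch_page are hypothetical. It uses only the standard library; the tools from Lab 19 could be used instead.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class AttributeFinder(HTMLParser):
    """Record the value of the first `tag` element that carries `attribute`."""

    def __init__(self, tag, attribute):
        super().__init__()
        self.tag, self.attribute, self.value = tag, attribute, None

    def handle_starttag(self, tag, attrs):
        if self.value is None and tag == self.tag:
            for name, value in attrs:
                if name == self.attribute:
                    self.value = value


def find_attribute(html, tag, attribute):
    finder = AttributeFinder(tag.lower(), attribute.lower())
    finder.feed(html)
    return finder.value


def fetch(url):
    """Download a page and return its text (assumes UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def run_hunt(url, tag, attribute, output_file, fetch_page=fetch, max_steps=100):
    """Follow clues until the 'final' attribute is read, then write its value."""
    for _ in range(max_steps):
        value = find_attribute(fetch_page(url), tag, attribute)
        if value is None:
            raise ValueError(f"no <{tag}> with attribute {attribute!r} at {url}")
        if attribute.lower() == "final":
            with open(output_file, "w") as out:
                out.write(value)
            return value
        # Otherwise the value is the next clue: url,tag,attribute.
        url, tag, attribute = (part.strip() for part in value.rsplit(",", 2))
    raise RuntimeError("hunt did not reach a final attribute")
```

Passing the page loader in as fetch_page makes it possible to try the loop on a dictionary of sample pages before running it against the real URLs.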

Test Inputs

The inputs are ordered by the length of the scavenger hunt, with Input 1 being the shortest.

Input 1:

python3 lab20.py https://cs111.cs.byu.edu/lab/lab20/assets/sample1.html li checkpoint1 output.txt

Note: TAs should walk through this example

Input 2:

python3 lab20.py https://cs111.byu.edu/lab/lab20/assets/webpage1.html p mediumhunt-checkpoint1 output.txt

Input 3:

python3 lab20.py https://cs111.byu.edu/lab/lab20/assets/webpage4.html ul longhunt-checkpoint1 output.txt

Submit

If you attend the lab, you don't have to submit anything.

If you don't attend the lab, you will have to submit working code. Submit your lab20.py file to Gradescope through the window on the Canvas assignment page.

© 2023 Brigham Young University, All Rights Reserved