Lab 19 - Requests and Beautiful Soup Library
Files: lab19.zip
Due by 11:59pm on 2023-08-08.
Starter Files
Download lab19.zip. Inside the archive, you will find starter files for the questions in this lab.
Topics
Requests Library
Installing the Requests Library
To install the Requests library, type one of the following into the terminal:
pip install requests
python3 -m pip install requests
To check that you correctly installed the library, type the following in the Python interpreter (type python3 into the terminal) and ensure it does not error:
>>> import requests
To uninstall any library, type one of the following:
pip uninstall library_name
python3 -m pip uninstall library_name
Requests Review
When asking for a web page from some server on the internet, your device (the client) can send several different types of requests. In this lab, we will only be looking at GET requests. The Requests library allows us to grab and use the information from a request in our code. To do this, use the get function. Import the requests library and write the following,
response_object = requests.get("URL")
where "URL" would be a string containing an actual URL.
The response_object has several useful attributes. For example:
- url - Returns the URL of the web page requested
- text - Returns the web page's content as an HTML string
- content - Returns the web page's content as bytes
- status_code - Returns a code representing the status of retrieving the web page. For more information, refer to this Wikipedia article. The most common status code is 200, which means that everything went OK when retrieving the web page.
- headers - Returns a dictionary with additional information sent back with the response
Try out the following code:
response_object = requests.get("https://cs111.byu.edu")
print(response_object.url)
print(response_object.status_code)
print(response_object.headers)
Who is the mascot mentioned in the headers? What was the message to posterity?
Refer to w3schools for more information on the attributes: https://www.w3schools.com/python/ref_requests_response.asp
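A common pattern that ties these attributes together is checking status_code before trusting the page content. Here is a minimal sketch using example.com as a stand-in URL (any reachable page works the same way), assuming you have an internet connection:

```python
import requests

# example.com is a stand-in; substitute any URL you want to inspect.
response_object = requests.get("https://example.com")

if response_object.status_code == 200:    # 200 means the request went OK
    print(response_object.url)            # the URL that was requested
    print(len(response_object.text))      # size of the HTML, as a string
    print(sorted(response_object.headers.keys()))  # header names in the response
else:
    print("Request failed with status", response_object.status_code)
```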
Beautiful Soup Library
Installing the Beautiful Soup Library
To install the Beautiful Soup library, type one of the following into the terminal:
pip install beautifulsoup4
python3 -m pip install beautifulsoup4
To check that you correctly installed the library, type the following in the Python interpreter (by typing python3 into the terminal) and ensure it does not error:
>>> import bs4
To uninstall any library, type one of the following:
pip uninstall library_name
python3 -m pip uninstall library_name
Beautiful Soup Review
One way of accessing a web site's information is to get its HTML and parse it. One library that allows us to parse HTML somewhat easily is the Beautiful Soup library. To start, your code should look similar to the following:
import bs4, requests
r = requests.get("URL")
soup_object = bs4.BeautifulSoup(r.content, features="html.parser") # `features` prevents an unimportant warning for this class
Once we have a beautiful soup object, we have access to several useful methods.
- .prettify() returns a string of all the web page's HTML, nicely indented and formatted. Add the following to the code above, provide a valid URL, and demo it:

print(soup_object.prettify())

Compare it to the actual HTML contents of the web page given by the URL.

- .find_all('tag') takes in as an argument a string containing the tag and returns a list of tag objects. For example, if we wanted to find all the paragraph (<p>) tags in a website, we could do the following:

list_of_tags = soup_object.find_all('p')
print(list_of_tags)
# [<p>Computer Science is amazing!</p>, <p>I want to become a CS Major!</p>]
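Since .find_all works the same on any HTML, here is a self-contained sketch that parses a small hand-written HTML string instead of a live page (the two paragraph strings are invented for illustration):

```python
import bs4

# Invented HTML standing in for a downloaded page (r.content).
html = """<html><body>
<p>Computer Science is amazing!</p>
<p>I want to become a CS Major!</p>
</body></html>"""

soup_object = bs4.BeautifulSoup(html, features="html.parser")

# .find_all returns a list of tag objects, one per <p> tag.
list_of_tags = soup_object.find_all('p')
print(list_of_tags)
# [<p>Computer Science is amazing!</p>, <p>I want to become a CS Major!</p>]
```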
Tags also have several useful methods and attributes:
- .get('attr') - In some cases, we want to access a tag's attributes. Given a tag object, we can use its .get() method to access an attribute's content. For example, if we wanted to access an image tag's width attribute, we could do the following:

img_tag = soup_object.find('img') # .find returns one tag instead of a list of tags
width = img_tag.get('width')
print(width)

If the attribute does not exist, the method will return None.
- .attrs - .attrs is a dictionary. The keys of .attrs are the attribute names, and the values are strings containing each attribute's associated information. For example, with the tag <p style="font-size: 12px; font-color: blue;" id="this_id">Hello, World</p>:

from bs4 import BeautifulSoup
data = '<p style="font-size: 12px; font-color: blue;" id="this_id">Hello, World</p>'
soup = BeautifulSoup(data, features="html.parser")
tag = soup.find('p')
print(tag.attrs)
# {'style': 'font-size: 12px; font-color: blue;', 'id': 'this_id'}
print(tag.attrs['style'])
# 'font-size: 12px; font-color: blue;'
print(tag.get('style'))
# 'font-size: 12px; font-color: blue;'
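To see the difference between .get() and indexing .attrs when an attribute is missing, here is a small self-contained sketch (the img tag is invented for illustration). .get() returns None, while indexing .attrs directly raises a KeyError:

```python
from bs4 import BeautifulSoup

# Invented tag for illustration.
data = '<img src="flag.png" width="200">'
soup = BeautifulSoup(data, features="html.parser")
img_tag = soup.find('img')

print(img_tag.get('width'))    # '200'
print(img_tag.get('height'))   # None -- missing attribute, no error
print(img_tag.attrs)           # {'src': 'flag.png', 'width': '200'}
try:
    img_tag.attrs['height']    # direct indexing raises for missing keys
except KeyError:
    print('no height attribute')
```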
Required Questions
Remember to create a lab19.py file.
Q1: Downloading a Web Page
Write a function called download that has the parameters url and output_file. The function should get the HTML content from the url and write it to the output_file.
When getting the content, use .text rather than .content.
To test, get the HTML contents from the cs111 website's pair programming article and write them to a file called lab19_test.txt. Compare your file to the HTML content on the website.
Q2: Prettify
Write a function called make_pretty which takes in a url and an output_file. The function should save the results of calling .prettify() on the web page given by the url to the output_file.
Q3: Finding Paragraphs
Write a function called find_paragraphs which takes in a url and an output_file. The function should find all paragraph tags <p> on the web page given by the url and write them to the output_file.
Q4: Finding Links
Write a function called find_links which takes in a url and an output_file, finds all the hrefs on the web page given by the url, and writes them to the output_file. Each link should be on its own line (one link per line).
Note: If you see something like /staff or #syllabus-course-policies in your file, these are valid links. The CS111 website was designed that way.
Submit
If you attend the lab, you don't have to submit anything.
If you don't attend the lab, you will have to submit working code. Submit the lab19.py file to Gradescope through the submission window on the Canvas assignment page.