Lab 19 - Requests and Beautiful Soup Library
Due by 11:59pm on 2023-04-02.
Starter Files
Download lab19.zip. Inside the archive, you will find starter files for the questions in this lab.
Topics
Requests Library
Installing the Requests Library
To install the Requests library, type one of the following into the terminal:
pip install requests
python3 -m pip install requests
To check that the library installed correctly, open the Python interpreter (type python3 into the terminal) and ensure the following does not error:
>>> import requests
To uninstall any library, type
pip uninstall library_name
or
python3 -m pip uninstall library_name
Requests Review
When asking for a web page from some server on the internet, your computer, the client, can send several different types of requests. In this lab, we will only be looking at GET requests. The Requests library allows us to grab and utilize the information from the request in our code. To do this, use the get function. Import the requests library and write
response_object = requests.get("URL")
where "URL" is a string containing an actual URL.
The response_object has several attributes we can use. For example:
- url - Returns the URL of the web page requested.
- text - Returns the web page's content as HTML text.
- content - Returns the content of the response as bytes.
- status_code - Returns a code representing the status of retrieving the web page. For more information, refer to this wikipedia article. The most common status code is 200, which means that everything went OK when retrieving the web page.
- headers - Returns a dictionary with additional information returned with the response.
Try out the following code:
response_object = requests.get("https://cs111.byu.edu")
print(response_object.url)
print(response_object.status_code)
print(response_object.headers)
Who is the mascot mentioned in the headers? What was the message to posterity?
Refer to w3schools for more information on the attributes: https://www.w3schools.com/python/ref_requests_response.asp
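The attributes above can be pulled together into a small helper. This is only a sketch: the summarize function and its keys are made up for illustration, and the demo URL is the one from the lab, assumed reachable when you run it yourself.

```python
def summarize(response):
    """Build a small summary dict from a Response-like object's attributes."""
    return {
        "url": response.url,
        "status": response.status_code,
        "ok": response.status_code == 200,  # 200 means everything went OK
        "content_type": response.headers.get("Content-Type", "unknown"),
    }

if __name__ == "__main__":
    try:
        import requests  # imported here so the helper itself needs no library
        r = requests.get("https://cs111.byu.edu", timeout=5)
        print(summarize(r))
    except Exception as err:  # no network or library missing
        print("demo skipped:", err)
```

Because summarize only reads attributes, it works on any object that has url, status_code, and headers, which makes it easy to try out even without a network connection.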
Beautiful Soup Library
Installing the Beautiful Soup Library
To install the Beautiful Soup library, type one of the following into the terminal:
pip install beautifulsoup4
python3 -m pip install beautifulsoup4
To check that the library installed correctly, open the Python interpreter (type python3 into the terminal) and ensure the following does not error:
>>> import bs4
To uninstall any library, type
pip uninstall library_name
or
python3 -m pip uninstall library_name
Beautiful Soup Review
One way of accessing a web site's information is by accessing its HTML and parsing it. One library that allows us to parse HTML somewhat easily is the Beautiful Soup library. To start, your code should look similar to the following:
import bs4, requests
r = requests.get("URL")
soup_object = bs4.BeautifulSoup(r.content, features="html.parser") # `features` prevents an unimportant warning for this class
Once we have a beautiful soup object, we have access to several useful methods.
- .prettify() returns a string of all the web page's HTML nicely indented and formatted. Add the following to the code above, provide a valid URL, and demo it:
  print(soup_object.prettify())
  Compare it to the actual HTML contents of the web page given by the URL.
- .find_all('tag') takes in as an argument a string containing the tag and returns a list of tag objects. For example, if we wanted to find all the paragraph (<p>) tags in a website, we could do the following:
  >>> list_of_tags = soup_object.find_all('p')
  >>> print(list_of_tags)
  [<p>Computer Science is amazing!</p>, <p>I want to become a CS Major!</p>]
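To experiment with prettify and find_all without fetching a live page, you can feed Beautiful Soup an HTML string directly. A minimal sketch; the HTML below is made up so the example runs with no network access:

```python
import bs4

# A made-up page so the example runs offline.
html = """<html><body>
<p>Computer Science is amazing!</p>
<p>I want to become a CS Major!</p>
</body></html>"""

soup = bs4.BeautifulSoup(html, features="html.parser")

print(soup.prettify())           # the same HTML, nicely indented
paragraphs = soup.find_all('p')  # list of <p> tag objects
print(len(paragraphs))           # 2
print(paragraphs[0].text)        # Computer Science is amazing!
```

The same two calls work identically on a soup object built from r.content, as in the setup code above.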
Tags also have several useful methods and attributes:
- .get('attr') - In some cases, we want to access a tag's attributes. Given a tag object, we can use its .get() method to access an attribute's content. For example, if we wanted to access an image tag's width attribute, we can do the following:
  >>> img_tag = soup_object.find('img')  # .find returns one tag instead of a list of tags
  >>> width = img_tag.get('width')
  >>> print(width)
  111
  If the attribute does not exist, the method returns None.
- .attrs - .attrs is a dictionary. The keys of .attrs are the attribute names and the values are strings containing each attribute's associated information. For example, with the tag <p style="font-size: 12px; font-color: blue;" id="this_id">Hello, World</p>:
  >>> from bs4 import BeautifulSoup
  >>> data = '<p style="font-size: 12px; font-color: blue;" id="this_id">Hello, World</p>'
  >>> soup = BeautifulSoup(data, features="html.parser")
  >>> tag = soup.find('p')
  >>> print(tag.attrs)
  {'style': 'font-size: 12px; font-color: blue;', 'id': 'this_id'}
  >>> print(tag.attrs['style'])
  font-size: 12px; font-color: blue;
  >>> print(tag.get('style'))
  font-size: 12px; font-color: blue;
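The .get() and .attrs examples can also be tried offline. A small sketch; the image tag below is made up for illustration:

```python
import bs4

# A made-up snippet with one image tag so the example runs offline.
html = '<img src="logo.png" width="111" alt="logo">'
soup = bs4.BeautifulSoup(html, features="html.parser")

img_tag = soup.find('img')
print(img_tag.get('width'))   # 111
print(img_tag.get('height'))  # None, since the attribute does not exist
print(img_tag.attrs)          # dictionary of all the tag's attributes
```

Note that attribute values come back as strings ('111', not the integer 111), so convert with int() if you need to do arithmetic with them.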
Required Questions
Q1: Downloading a Web Page
Write a function called download that has the parameters url and output_filename. The function should get the HTML text from the url, open the file, and write the text to the provided output_filename.
When getting the text, use .text rather than .content.
To test, get the HTML contents from the CS111 pair programming article and write them to a file called lab19_test.txt. Compare your file to the HTML content on the webpage.
def download(url, output_filename):
"*** YOUR CODE HERE ***"
Test your code:
python3 -m pytest test_lab19.py::test_download
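The file-writing half of this question follows the standard Python pattern. A generic sketch; the text and filename below are stand-ins, since in the lab the text would come from the response's .text:

```python
# Stand-in text; in Q1 this string would come from the response's .text.
page_text = "<html><body><p>stand-in HTML</p></body></html>"

# Open the file in write mode and write the whole string at once.
with open("lab19_demo.txt", "w", encoding="utf-8") as f:
    f.write(page_text)
```

Using a with block ensures the file is closed even if an error occurs while writing.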
Q2: Prettify
Write a function called make_pretty which takes in a url and an output_filename. The function should save the results of calling .prettify() on the web page given by the url to the output_filename.
def make_pretty(url, output_filename):
"*** YOUR CODE HERE ***"
Test your code:
python3 -m pytest test_lab19.py::test_make_pretty
Q3: Finding Paragraphs
Write a function called find_paragraphs which takes in a url and an output_filename. The function should find all paragraph tags <p> on the webpage given by the url and write them to the output_filename.
def find_paragraphs(url, output_filename):
"*** YOUR CODE HERE ***"
Test your code:
python3 -m pytest test_lab19.py::test_find_paragraphs
Q4: Finding Links
Write a function called find_links which takes in a url and an output_filename, finds all the hrefs in the web page, and writes them to the provided output_filename. Each link should be on its own line (1 link per line).
def find_links(url, output_filename):
"*** YOUR CODE HERE ***"
Test your code:
python3 -m pytest test_lab19.py::test_find_links
Note: If you see something like /staff or #syllabus-course-policies in your output file when testing it yourself, these are valid links. The CS111 website was designed with links like that.
Submit
Submit the lab19.py file to Gradescope in the window on the Canvas assignment page.