Lab 21 - Requests and Beautiful Soup Library

Due by 11:59pm on 2023-11-30.

Starter Files

Download lab21.zip. Inside the archive, you will find starter files for the questions in this lab.

Topics

Requests Library

Installing the Requests Library

To install the Requests library, type one of the following into the terminal:

pip install requests
python3 -m pip install requests

To check that you correctly installed the library, type the following in the Python interpreter (type python3 into the terminal) and ensure it does not raise an error:

>>> import requests

To uninstall any library, type pip uninstall library_name or python3 -m pip uninstall library_name

Requests Review

When asking for a web page from some server on the internet, your electronic device (the client) can send several different types of requests. In this lab, we will only look at GET requests. The Requests library allows us to grab the information from the request and use it in our code. To do this, use the get function. Import the requests library and write this code,

response_object = requests.get("URL")

where "URL" would be a string containing an actual URL.

The response_object has several useful attributes, including the following:

  • url - Returns the URL of the web page requested
  • text - Returns the web page's content in HTML
  • content - Returns the response's content as bytes
  • status_code - Returns a code representing the status of retrieving the web page. For more information, refer to this Wikipedia article. The most common status code is 200, which means that everything went OK when retrieving the web page.
  • headers - Returns a dictionary with additional information passed in with a request.

Try out the following code:

response_object = requests.get("https://cs111.byu.edu")
print(response_object.url)
print(response_object.status_code)
print(response_object.headers)

Who is the mascot mentioned in the headers? What was the message to posterity?

Refer to w3schools for more information on the attributes: https://www.w3schools.com/python/ref_requests_response.asp

Beautiful Soup Library

Installing the Beautiful Soup Library

To install the Beautiful Soup library, type one of the following into the terminal:

pip install beautifulsoup4
python3 -m pip install beautifulsoup4

To check that you correctly installed the library, type the following in the Python interpreter (by typing python3 into the terminal) and ensure it does not raise an error:

>>> import bs4

To uninstall any library, type pip uninstall library_name or python3 -m pip uninstall library_name

Beautiful Soup Review

One way of accessing a website's information is by retrieving its HTML and parsing it. One library that lets us parse HTML somewhat easily is Beautiful Soup. To start, your code should look similar to the following:

import bs4, requests

r = requests.get("URL")
soup_object = bs4.BeautifulSoup(r.content, features="html.parser") # `features` prevents a warning that is unimportant for this class

Once we have a beautiful soup object, we have access to several useful methods.

  1. .prettify() returns a string of all the web page's HTML nicely indented and formatted.

    • Add the following to the code above, provide a valid URL, and demo it:
      print(soup_object.prettify()) 
      Compare it to the actual HTML contents of the web page given by the URL.
  2. .find_all('tag') takes in a string containing a tag name and returns a list of matching tag objects.

    • For example, if we wanted to find all the paragraph (<p>) tags in a website, we could do the following:
      list_of_tags = soup_object.find_all('p')
      print(list_of_tags)
      # [<p>Computer Science is amazing!</p>, <p>I want to become a CS Major!</p>]
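Since .find_all returns a list, you can loop over it; each tag's .text attribute gives the text inside the tag. A small self-contained sketch using the example paragraphs above (parsing an HTML string directly, so no network request is needed):

```python
from bs4 import BeautifulSoup

# Parse a small, made-up HTML snippet instead of a live web page
html = "<p>Computer Science is amazing!</p><p>I want to become a CS Major!</p>"
soup = BeautifulSoup(html, features="html.parser")

# Loop over every paragraph tag and print just its inner text
for tag in soup.find_all('p'):
    print(tag.text)
# Computer Science is amazing!
# I want to become a CS Major!
```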

Tags also have several useful methods and attributes:

  1. .get('attr')

    • In some cases, we want to access a tag's attributes. Given a tag object, we can use its .get() method to access an attribute's content. For example, if we wanted to access an image tag's width attribute, we can do the following:
      img_tag = soup_object.find('img') # .find returns one tag instead of a list of tags
      width = img_tag.get('width')
      print(width)
    • If the attribute does not exist, the method will return None.
  2. .attrs

    • .attrs is a dictionary. The keys of .attrs are the attribute names and the values are strings containing each attribute's associated information. For example, with the tag <p style="font-size: 12px; font-color: blue;" id="this_id">Hello, World</p>:
      from bs4 import BeautifulSoup
      data = '<p style="font-size: 12px; font-color: blue;" id="this_id">Hello, World</p>'
      soup = BeautifulSoup(data, features="html.parser")
      tag = soup.find('p')
      print(tag.attrs)
      # {'style': 'font-size: 12px; font-color: blue;', 'id': 'this_id'}
      print(tag.attrs['style'])
      # 'font-size: 12px; font-color: blue;'
      print(tag.get('style'))
      # 'font-size: 12px; font-color: blue;'
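The None behavior of .get() described above can be seen in a short self-contained example (using a made-up image tag rather than a live page):

```python
from bs4 import BeautifulSoup

# A made-up image tag for illustration; no network request needed
html = '<img src="logo.png" width="300">'
soup = BeautifulSoup(html, features="html.parser")

img_tag = soup.find('img')
print(img_tag.get('width'))   # 300
print(img_tag.get('height'))  # None: the tag has no height attribute
```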

Required Questions

Q1: Downloading a Web Page

Write a function called download that has the parameters url and output_file. The function should get the HTML content from the url, open the output_file, and write the content to it. (Note: output_file is the name of the file as a string.)

When getting the content, use .text rather than .content.

To test, get the HTML contents from the cs111 website's pair programming article and write them to a file called lab21_test.txt. Compare your file to the HTML content on the website.
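One possible sketch of such a function, using the pattern from the Requests review above; this is not the only valid solution, and you should write and test your own:

```python
import requests

def download(url, output_file):
    # One possible approach: fetch the page, then write its HTML text to a file
    response = requests.get(url)
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(response.text)
```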

Q2: Prettify

Write a function called make_pretty which takes in a url and an output_file. The function should save the result of calling .prettify() on the web page given by the url to the output_file.

Write a function called find_paragraphs which takes in a url and an output_file. The function should find all paragraph tags (<p>) on the web page given by the url and write them to the output_file.

Write a function called find_links which takes in a url and an output_file. The function should find all the hrefs on the web page given by the url and write them to the output_file, one link per line.

Note: If you see something like /staff or #syllabus-course-policies in your file, these are valid links. The CS111 website was designed that way.
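A possible sketch for find_links, assuming the links of interest are href attributes on anchor (<a>) tags; if other tags on the page also carry hrefs, soup.find_all(href=True) would catch those as well. Again, write and test your own version:

```python
import bs4, requests

def find_links(url, output_file):
    # Sketch: fetch the page, collect every href on an <a> tag, one per line
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.content, features="html.parser")
    with open(output_file, 'w', encoding='utf-8') as f:
        for tag in soup.find_all('a'):
            href = tag.get('href')
            if href is not None:  # skip anchor tags without an href
                f.write(href + '\n')
```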

Submit

If you attend the lab, you don't have to submit anything.

If you don't attend the lab, you will have to submit working code. Submit the lab21.py file to Gradescope through the window on the Canvas assignment page.

© 2023 Brigham Young University, All Rights Reserved