==================
Lab: BeautifulSoup
==================
The goal of this lab is to learn how to scrape web pages using BeautifulSoup.
Introduction
------------
BeautifulSoup is a Python library that parses HTML files and allows you to extract information from them. HTML is the format used to encode web pages.
If you are interested in some question, whether it be related to finance, meteorology, sports, etc., there are almost certainly web sites that provide access to data that can be used to explore your question and gain insight.
Sometimes, you are lucky and find a website that both has the data you want and that helpfully provides the data in a machine-readable format. For instance, it may have a mechanism that allows you to submit queries and download CSV or JSON files back. A website that offers this service is said to "provide an API."
Sadly, most websites are not so helpful. To gather data from them, you will instead need to go rogue and "scrape" them -- submit requests as if they were coming from a web browser being operated by a human, but save the raw HTML output and interpret it programmatically in a Python script instead. When this is necessary, you will need to be able to parse the HTML code for the web page. And when you need to parse HTML, BeautifulSoup is the library to use.
Unfortunately, web pages are designed to present information to humans; web designers rely on the ability of humans to make sense of the data, no matter how it may be formatted. BeautifulSoup cannot analyze an entire web page and understand where the data is; it can only help you extract the data once you know where to find it. Therefore, applying BeautifulSoup to a scraping task involves:
#. inspecting the source code of the web page in a text editor to infer its structure
#. using information about the structure to write code that pulls the data out, employing BeautifulSoup
Installation
------------
If you are completing this lab on a CSIL machine, the required software is likely already present. But, if you are working on your VM, perform the following step to upgrade your modules to working versions (the versions pre-installed on the VM appear to be buggy):
::
    sudo pip3 install --upgrade beautifulsoup4 html5lib
If you experience import issues on a CSIL machine, try:
::
    pip3 install --user --upgrade beautifulsoup4 html5lib
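Either way, you can verify that the installation worked by printing the module versions (a quick sanity check; both packages expose a ``__version__`` attribute):

::

    python3 -c "import bs4, html5lib; print(bs4.__version__, html5lib.__version__)"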
References
----------
Here are links to reference material on HTML and BeautifulSoup that you may find useful during the course of this lab:
- W3C HTML Tutorial
- BeautifulSoup documentation
Step 2: Parsing the HTML
========================
I could then locate the weather observation using this approach:
#. Find all paragraph (``p``) tags with attribute ``clear="both"``. (There is only one.)
#. This gives me a list of length one. Select the first entry in this list.
#. I could then navigate the tree of HTML, starting with this paragraph tag, to get to the next sibling. Critically, this skips over everything nested within the paragraph tag, including the bold (``strong``) tag and the human-readable date.
#. It turns out that the next sibling is the newline character after the closing of the paragraph tag. Not what I want. I'll move on to the next sibling of that.
#. That entry turns out to be the comment (``Data starts here``). I'll move on to the next sibling one more time.
#. Another newline. Let's go one sibling further still.
#. I get a ```` tag at this point. I can use ``.text`` on this tag to pull out its contents.
Using a Python interpreter, try writing lines of code to go through this process. Here are fragments of code that you will find useful:
::
    import bs4

    # Read the saved page and parse it
    html = open(filename).read()
    soup = bs4.BeautifulSoup(html)
    # Find tags by name and attribute value
    tag_list = soup.find_all("p", clear="both")
    ...
    # Move to a tag's next sibling in the tree
    tag = tag.next_sibling
    ...
    # Pull out the text nested within a tag
    data = tag.text
    ...
Note that when you try the line that initializes the ``soup`` object, you will get a warning message, along with advice on how to improve your code by adding an extra parameter to the function call. Go ahead and follow this advice.
Combine these snippets, the step-by-step process for getting the weather observations listed above, and your knowledge of Python to write a series of lines of code that successfully retrieve the weather observation string.
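Here is one way the pieces might fit together: a minimal sketch, assuming the page was saved as ``weather.html`` (a hypothetical filename) and that the warning's advice is to name a parser explicitly (``html5lib``, which you installed above):

::

    import bs4

    html = open("weather.html").read()
    soup = bs4.BeautifulSoup(html, "html5lib")

    # The lone paragraph tag with clear="both"
    tag = soup.find_all("p", clear="both")[0]

    # Hop across the siblings described above
    tag = tag.next_sibling   # the newline after the closing of the paragraph
    tag = tag.next_sibling   # the "Data starts here" comment
    tag = tag.next_sibling   # another newline
    tag = tag.next_sibling   # the tag holding the observation

    data = tag.text
    print(data)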
If you looked over the HTML file and observed that you could just search for the ```` tag and go directly to where you want to be, instead of performing the additional navigation described above, you are correct. At some point in the recent past, the web designer added this tag; before that, you really did have to follow the more indirect process stated above. I've left the indirect version in this lab, however, because it demonstrates more of the techniques you will need in typical situations. When a more direct approach is available, though, feel free to use it.
Step 3: Automating Queries for Any Airport
==========================================
Now, here is a code snippet that loads HTML from a URL:
::
    import urllib3
    import certifi

    # A pool manager that verifies HTTPS certificates
    pm = urllib3.PoolManager(
        cert_reqs='CERT_REQUIRED',
        ca_certs=certifi.where())
    ...
    # Fetch the page; .data holds the raw bytes of the response body
    html = pm.urlopen(url=myurl, method="GET").data
Take these snippets, combine them with your BeautifulSoup-based parsing code and some additional code that builds a URL through string concatenation, and create a single function ``current_weather`` that takes one parameter, an airport code, and returns a string: the latest weather observation from that airport.
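For reference, here is a minimal sketch of what such a function could look like. Note that ``BASE_URL`` is a placeholder, not the real address: substitute the URL pattern of the weather site you inspected earlier.

::

    import bs4
    import urllib3
    import certifi

    # Placeholder! Replace with the actual observation-page URL pattern.
    BASE_URL = "https://example.com/observations/"

    pm = urllib3.PoolManager(
        cert_reqs='CERT_REQUIRED',
        ca_certs=certifi.where())

    def current_weather(airport_code):
        """Return the latest weather observation string for one airport."""
        url = BASE_URL + airport_code
        html = pm.urlopen(url=url, method="GET").data
        soup = bs4.BeautifulSoup(html, "html5lib")
        tag = soup.find_all("p", clear="both")[0]
        for _ in range(4):   # newline, comment, newline, data tag
            tag = tag.next_sibling
        return tag.text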
Once you have this working, take a moment to understand the power of the function you just wrote: although you only inspected the format of the web page for one airport weather observation at one time, you found characteristic landmarks in the formatting of this page that will allow your function to work for any airport at any time, and to be used millions of times, if desired.
Example 2: Climate Data
-----------------------
This data is no longer available.
Example 3: Chicago 'L' Lines
----------------------------
The CTA maintains web pages for each of the 'L' lines in the city; for instance, there is one for the Red Line on ``transitchicago.com``.
Although these pages are valuable, they were even more valuable in the past: during a recent redesign of the site, the amount of information that can be seen at a glance was reduced. Fortunately, it is often (though not always) possible to view older versions of web pages using the `Internet Archive Wayback Machine <https://web.archive.org/>`_. Try entering the same URL into that system and exploring back in time. You should be able to find a version of the site that has more detailed information in its chart. In particular, the capture from February 2017 (the one the example below draws on) is much better than the current one.
Scroll down to the section titled "Route Diagram and Guide" and have a look. This section seems to be chock full of detail about this transit line: a list of stations, information about wheelchair accessibility and parking, and rail and bus transfers. (This part of the web site is the one that has seen the most dramatic decrease in information density recently, if you compare the two versions.)
If you were doing a project on public transportation in Chicago, this page (and the corresponding ones for the other lines) seems like it would be a treasure trove of data. Unfortunately, much of it appears to be graphical. For instance, you can transfer to the Yellow and Purple lines at Howard, but short of writing code to interpret an image (a very complicated task), determining this in your code rather than by visual inspection seems out of reach. Similarly, while there are wheelchair and parking icons for some stations, they are icons, not text.
The HTML tag for images, however, allows a web site designer to provide an alternative textual representation of an image. This is accomplished by adding an ``alt`` attribute to an ``img`` tag. For various reasons, accessibility for screen-reader users chief among them, it is considered good practice to do this, and a well-designed web site will follow this protocol.
Save this page (the archived version) and open it in your text editor. Search for this table and take a look at the image tags. We're in luck!
When we are navigating the tree of elements that results from parsing a web page with BeautifulSoup, we can retrieve the value of a tag's attribute with the syntax ``tag["attr_name"]``. For instance, if ``t`` is an ``img`` tag, we could use ``t["alt"]`` to retrieve its ``alt`` text.
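To see this in action, you could list the ``alt`` text of every image on the page (a quick sketch; tags lacking the attribute are skipped via ``has_attr``):

::

    for img in soup.find_all("img"):
        if img.has_attr("alt"):
            print(img["alt"])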
This table seems to be a little harder to distinctively identify. But there is a nearby ``a`` tag that might serve as a landmark based on one of its attributes.
To take advantage of this, here's what I tried:
::
    soup.find_all("a", name="map")
Please try it yourself.
This didn't work; Python gave me (and, most likely, you) an error message. What happened here?
It turns out that the first parameter of ``find_all``, the one that stores the name of the tag itself (in this case ``a``), is itself named ``name``. Our keyword argument therefore collides with that parameter instead of being treated as an attribute to match against.
This is similar to, but distinct from, the issue that forces you to write ``class_`` instead of ``class`` when matching against an attribute named ``class``: there, the problem is that ``class`` is a Python reserved word.
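For instance, matching on class looks like this (``routeName`` is a made-up class name, purely for illustration):

::

    soup.find_all("span", class_="routeName")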
There is a workaround, however: you can make a dictionary of attributes to look for, and just have a single entry:
::
    soup.find_all("a", attrs={"name": "map"})
Once you have found this ``a`` tag, you can use ``.parent`` and ``.next_sibling`` to get to the table. Examine the nesting of tags carefully to understand what the tree of nested tags looks like in this vicinity.
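Here is one sketch of that navigation. It assumes, hypothetically, that the table follows the anchor's parent; the exact number of hops depends on the nesting you actually observe, so treat the loop as a starting point rather than a definitive answer.

::

    anchor = soup.find_all("a", attrs={"name": "map"})[0]

    # Walk forward from the anchor's parent, skipping the whitespace
    # strings between tags, until we reach the table (assumed present).
    node = anchor.parent.next_sibling
    while getattr(node, "name", None) != "table":
        node = node.next_sibling
    table = node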
Write code to scrape the table and turn it into a useful in-memory representation. Here is one reasonable choice:
- There is a list of stations in order down the line, in the order presented on the page. (For instance, for the Red Line, this is north to south, from Howard to 95th/Dan Ryan.)
- Each entry in this list is a dictionary. It has a key for the name, a key for the "amenities" (accessible, parking), a key for the URL for the station page, a key for the 'L' transfers (colored lines), and a key for the other connections (bus and Metra rail lines). The values associated with each of these keys could simply be text strings or lists of strings retrieved from the table, with processing to remove HTML coding. Here is an example for the Howard station, for instance:
::
    {'name': 'Howard',
     'URL': '/web/20170211053109/http://www.transitchicago.com/travel_information/station.aspx?StopId=71',
     'amenities': ['accessible station', 'automobile parking available'],
     'L-transfers': ['transfer to yellow line', 'transfer to purple line'],
     'connections': '\n  CTA Buses #22, #97, #147, #151, #201, #205, #206\n  Pace Buses #215, #290\n  '}
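To give a sense of the overall shape, here is a rough skeleton of such a scraper. It is a sketch built on assumptions you must verify against the actual HTML: that each station occupies one ``tr`` row, that the 'L' transfer icons are exactly those whose ``alt`` text contains the word "transfer", and that the bus and rail connections sit in the row's last cell.

::

    stations = []
    for row in table.find_all("tr"):
        links = row.find_all("a")
        if not links:
            continue   # header or spacer row: no station link
        alts = [img["alt"] for img in row.find_all("img")
                if img.has_attr("alt")]
        stations.append({
            'name': links[0].text,
            'URL': links[0]["href"],
            'amenities': [a for a in alts if "transfer" not in a],
            'L-transfers': [a for a in alts if "transfer" in a],
            'connections': row.find_all("td")[-1].text,
        })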
Conclusion
----------
For any web page that holds data you want, you should start by seeing if the web site offers JSON or CSV data. But if it doesn't, you have the option of retrieving HTML and using BeautifulSoup.
Doing so involves examining the HTML source code for the page, finding landmarks that allow you to programmatically locate the data of interest within the page, and then extracting it. Working through the page involves using navigation facilities like ``find_all``, ``parent``, and ``next_sibling``; extracting data involves unwrapping the tags around it to get to the text, while respecting and maintaining the structure of the data.
This process can be painstaking, but in the end, working with HTML is manageable, and made easier thanks to BeautifulSoup.
The experience you have gained in this lab is very likely to help you with the upcoming programming assignment, and in the course project.