BeautifulSoup Basics

Getting started with web scraping in Python

Web scraping is a great way to get more data for a unique data science project. Since BeautifulSoup is one of the best ways to do web scraping in Python, here is how to get started with it.

from bs4 import BeautifulSoup
import requests

Along with installing BeautifulSoup itself, you’ll want to install the requests library. Sending a request with only Python’s standard library takes several steps; the requests library makes it much simpler. To request a web page and turn the response into a BeautifulSoup object, use the following code:

url = 'https://en.wikipedia.org/wiki/Web_scraping'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
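
The one-liner above assumes the request succeeds. Here is a slightly more defensive sketch of the same step. The helper name fetch_soup and the 10-second timeout are assumptions made for illustration; they are not part of requests or BeautifulSoup.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and parse it (hypothetical helper, not a library function)."""
    # Without a timeout, requests can wait indefinitely on a stalled server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on a 4xx or 5xx status instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

Calling fetch_soup(url) with the Wikipedia URL above should give the same kind of soup object, except that a 4xx or 5xx response raises an exception rather than being parsed as if it were the article.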

To use this soup object, you’ll now need to do a little digging. On whatever website you want to get information from, right click the data of interest and select “Inspect Element”. Some HTML code should pop up in your browser, highlighting the markup near the spot you clicked. Select a few different elements in the HTML until you find the tag that highlights the information you want. This will get easier the more you do it, or may already be easy if you know some HTML (quick tip: the div tag is often used to split a webpage into sections). For this example, let’s take a look at the table of contents for the web scraping article on Wikipedia.

To get the information into Python, use BeautifulSoup’s .find() method. The first argument is the type of HTML tag (the first word after the ‘<’), and the second is a dictionary of attributes, such as id or class, that the tag must have.

toc = soup.find('div', {'id':'toc', 'class':'toc'})
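
To see how the attribute dictionary narrows the search, here is a small sketch that runs without a network connection. The HTML string is a made-up, simplified stand-in for a real page, not Wikipedia’s actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for a real page.
html = """
<div id="toc" class="toc"><ul><li>History</li></ul></div>
<div id="content" class="body">Article text</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Dictionary form, as used in this post.
toc = soup.find("div", {"id": "toc", "class": "toc"})

# Equivalent keyword form; class_ has a trailing underscore because
# class is a reserved word in Python.
same_toc = soup.find("div", id="toc", class_="toc")

# .find() returns None when nothing matches, so check before using the result.
missing = soup.find("div", {"id": "sidebar"})
```

Here toc and same_toc refer to the first div, while missing is None because no div has the id sidebar.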

Let’s pull all of the section headers out of this table of contents and put them into a list. Taking a look back at the HTML, you might see that each of the headers is in an <li> tag, for ‘list item’. BeautifulSoup is able to grab them all at once with the .find_all() method. This returns a list of every matching tag. To get just the text from those tags, we will have to use the .get_text() method on each of those tags, then store the results in a list.

headers = [header.get_text().split('\n')[0] for header in toc.find_all('li')]

Hopefully you get most of that code, but what’s with the .split('\n')[0]? Well, each subheader tag is actually embedded within the tag above it (i.e. the 2.1 and 2.2 tags are both inside the 2 tag). So the text of each subheader shows up twice in the results: once from its own list item, and once as part of the list item that contains it. Fortunately there is a '\n' (new line) character between each list item, so we can split there and grab only the first piece to get every header only once.
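
The nesting is easier to see on a tiny example. The sketch below uses a made-up table of contents (the markup and section names are assumptions, much simpler than the real page) to show the repeated text and what the split does to it.

```python
from bs4 import BeautifulSoup

# Hypothetical nested table of contents: items 2.1 and 2.2 sit inside item 2.
html = """
<div id="toc" class="toc">
<ul>
<li>1 History</li>
<li>2 Techniques
<ul>
<li>2.1 Human copy-and-paste</li>
<li>2.2 Text pattern matching</li>
</ul>
</li>
</ul>
</div>
"""
toc = BeautifulSoup(html, "html.parser").find("div", {"id": "toc"})

# Without the split, the text of item 2 also contains the text of 2.1 and 2.2.
raw = [li.get_text() for li in toc.find_all("li")]

# Keeping only the text before the first newline leaves one header per item.
headers = [li.get_text().split("\n")[0] for li in toc.find_all("li")]
```

In this sketch, raw[1] contains the text of the outer item and both nested items, while headers holds the four section names exactly once each.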

The great part about this bit of web scraping we’ve done is that it can be used for other Wikipedia articles. If you have the title of any Wikipedia page, you can use the following function to get all of the headers of the article.

def get_wiki_headers(title):
    """
    Uses BeautifulSoup and requests to scrape a Wikipedia page via the article title
    and return a list of strings containing all headers and subheaders in the article.
    """
    url = 'https://en.wikipedia.org/wiki/' + title.replace(' ', '_')
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    toc = soup.find('div', {'id':'toc', 'class':'toc'})
    headers = [header.get_text().split('\n')[0] for header in toc.find_all('li')]
    return headers

Useful, no? Well, you can run this in a loop to get information on tons of articles at once. Or, more likely, use these principles to do your own thing. That’s the point of web scraping, anyway: to get your own unique data to use instead of some dataset on Kaggle that everyone else has used already.
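
One way to run the function in a loop is sketched below. The helper name collect_headers, the one-second pause between requests, and the choice to store None for pages that fail are all assumptions made for illustration. The scraping function is passed in as an argument, so the loop works with get_wiki_headers or any function of the same shape.

```python
import time


def collect_headers(titles, get_headers, delay_seconds=1.0):
    """Call get_headers(title) for each title and return a {title: headers} dict.

    Hypothetical helper: get_headers is any function that takes a title and
    returns a list of strings, such as get_wiki_headers from this post.
    """
    results = {}
    for index, title in enumerate(titles):
        if index > 0:
            # Pause between requests so the loop does not hammer the server.
            time.sleep(delay_seconds)
        try:
            results[title] = get_headers(title)
        except Exception:
            # Broad on purpose for a sketch: a failed request or a page with
            # no table of contents is recorded as None, and the loop goes on.
            results[title] = None
    return results
```

With the function above in scope, a call would look like collect_headers(['Web scraping', 'Data science'], get_wiki_headers).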

BeautifulSoup is much deeper than this, and there are more libraries that may be needed to scrape more complex websites. I hope to write posts about those some day. For now, if you need to know more, you can check out BeautifulSoup’s official documentation.

Student of Data Science