Selenium Basics

More web scraping in Python

BeautifulSoup is a great tool for reading HTML from web pages. But a website has more than just HTML in it, and if you need to get information from somewhere other than the HTML, BeautifulSoup alone won’t cut it.

Enter Selenium. Selenium is an automation tool used to control web browsers. It can also be a powerful tool for web scraping.

To get started you’ll need to download the web driver, which Selenium uses to control the browser. Here is the download for Chrome, which I will use for this tutorial. Use the driver version that matches your browser’s version, which you can find in Settings > About Chrome.

from selenium import webdriver

driver = webdriver.Chrome()
driver.get(url)

For scraping, we will just be using the webdriver from Selenium. If the web driver you downloaded is in the same location as your Python file, this is all you need to do to open up a browser to a particular webpage for Selenium to control.
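
One thing the snippet above leaves out is closing the browser once you are done with it. Here is a minimal sketch of that pattern. The helper name fetch_title is made up for illustration; get(), title, and quit() are standard members of a Selenium driver.

```python
def fetch_title(driver, url):
    """Open url in the given driver, return the page title, then close the browser."""
    try:
        driver.get(url)      # navigate the controlled browser to the page
        return driver.title  # title of the document that was loaded
    finally:
        driver.quit()        # always shut the browser down, even if get() raised
```

With Selenium installed, this could be called as fetch_title(webdriver.Chrome(), url).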

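from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
# Raw string so the backslashes in the Windows path are not treated as escapes
driver = webdriver.Chrome(options=options, executable_path=r'C:\path\to\chromedriver.exe')
driver.get(url)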

If the Chrome driver file is not in the same location as your Python file, the executable_path argument lets you tell Selenium where it is. The Options object and options argument let you set various options for your web driver when it is initialized. Most useful is the headless option, which stops the browser from actually displaying on the screen. If used, Selenium will still be running the browser, but you won’t have to watch while it runs, and it can run quicker without having to display everything it does.
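
Newer Selenium releases (the 4.x line) set these options a little differently: the driver path goes into a Service object, and headless mode is requested with a Chrome command-line argument. Here is a rough sketch, assuming Selenium 4. The helper names chrome_arguments and build_driver and the default window size are made up for illustration.

```python
def chrome_arguments(headless=True, window_size=(1920, 1080)):
    """Build the list of Chrome command-line flags for the driver."""
    args = []
    if headless:
        args.append("--headless")  # run Chrome without showing a window
    width, height = window_size
    args.append(f"--window-size={width},{height}")
    return args


def build_driver(driver_path=None, headless=True):
    """Create a Chrome driver (assumes Selenium 4 is installed)."""
    # Imported here so chrome_arguments() works without Selenium installed
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)

    # Only pass a path when one was given; otherwise use Selenium's default lookup
    service = Service(executable_path=driver_path) if driver_path else Service()
    return webdriver.Chrome(service=service, options=options)
```

Calling build_driver(r'C:\path\to\chromedriver.exe') would then stand in for the webdriver.Chrome(...) line above.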

So how do you get something useful out of this? Well, you can extract HTML from a driver and feed it into BeautifulSoup. It seems oddly circuitous, using Selenium instead of requests, but it is sometimes necessary. Using requests with BeautifulSoup will only get the HTML the server sends, but some sites use JavaScript to fill in the rest of their HTML after the page loads. If you use Selenium to load the site in a browser, the browser runs that JavaScript, and then you can load the now-complete HTML into BeautifulSoup.

from bs4 import BeautifulSoup

html = driver.execute_script('return document.body.innerHTML')
soup = BeautifulSoup(html, 'html.parser')
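
Since the JavaScript needs a moment to run, it can help to wait before grabbing the HTML. Here is a minimal sketch that polls the browser until the document reports it has finished loading. The helper name wait_for_ready and its timeout values are made up for illustration, and document.readyState is only a rough signal: pages that fetch data later may need a longer or more specific wait.

```python
import time


def wait_for_ready(driver, timeout=10.0, poll=0.25):
    """Wait until the page reports it is loaded, then return the body HTML."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # readyState is 'loading', 'interactive', or 'complete'
        if driver.execute_script("return document.readyState") == "complete":
            return driver.execute_script("return document.body.innerHTML")
        time.sleep(poll)  # give the page's scripts more time to run
    raise TimeoutError(f"page was not ready after {timeout} seconds")
```

The returned string can be passed straight to BeautifulSoup(html, 'html.parser').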

And finally, the find and click methods. There are many ways to find web elements with Selenium (check out this documentation). The easiest one to use is .find_element_by_xpath(). To find the XPath of a web element, use the inspect element feature of a web browser, as shown in my BeautifulSoup post. Right-click the tag you want to find in the inspect element window, select ‘Copy’, and then ‘Copy XPath’. Put the resulting string into the .find_element_by_xpath() method to get the needed web element.

element = driver.find_element_by_xpath('xpath')
element.click()

And then you can click the element! You can use this to select filters on a search or calendar, then send the new HTML to BeautifulSoup, and only read the HTML of the filtered items. Very useful when you know you only want a subset of the data a site has.
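
The click-then-read flow can be sketched as one small function. Recent Selenium 4 releases drop the find_element_by_* helpers in favor of find_element(By.XPATH, ...), so this sketch uses that form. The helper name click_and_get_html is made up for illustration, and the XPATH constant simply mirrors the string value of By.XPATH so the example does not need Selenium imported.

```python
XPATH = "xpath"  # same string as selenium.webdriver.common.by.By.XPATH


def click_and_get_html(driver, xpath):
    """Click the element found at xpath, then return the updated body HTML."""
    element = driver.find_element(XPATH, xpath)  # locate the filter or button
    element.click()                              # let the page react to the click
    # Read the HTML again, since the click may have changed the page
    return driver.execute_script("return document.body.innerHTML")
```

The string it returns is the filtered page, ready for BeautifulSoup. On a slow page you may need to wait between the click and the read.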

There are a ton more things you can do with Selenium. But for the basics, this should already get you pretty far in web scraping, when used with BeautifulSoup. With these two mini tutorials, you should be able to scrape data from most any site you find.

Student of Data Science
