BeautifulSoup is a great tool for reading HTML from web pages. But a website has more than just HTML in it, and if you need to get information from somewhere other than the HTML, BeautifulSoup alone won't cut it.
Enter Selenium. Selenium is an automation tool used to control web browsers, and it can also be a powerful tool for web scraping.
To get started you'll need to download the web driver, which Selenium uses to control the browser. Here is the download for Chrome, which I will use for this tutorial. Use whatever version matches your browser's version, which you can find in Settings > About Chrome.
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://example.com')  # navigate to the page you want to scrape
For scraping, we will just be using the webdriver from Selenium. If the web driver you downloaded is in the same location as your Python file, this is all you need to do to open up a browser to a particular webpage for Selenium to control.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
driver = webdriver.Chrome(options=options, executable_path=r'C:\path\to\chromedriver.exe')
If the chromedriver file is not in the same location as your Python file, the executable_path argument will let you tell Selenium where it is. The Options object and the options argument allow you to set various options for your web driver when it is initialized. Most useful is the headless option, which stops the browser from actually displaying on the screen. If used, Selenium will still be running the browser, but you won't have to watch while it runs, and it can run faster without having to display everything it does.
Once the browser is on the page you want, you can hand its rendered HTML over to BeautifulSoup:

from bs4 import BeautifulSoup

html = driver.execute_script('return document.body.innerHTML')
soup = BeautifulSoup(html, 'html.parser')
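From there, the soup object works exactly as it does when scraping with BeautifulSoup alone. A minimal sketch of that parsing step, using a hypothetical HTML string standing in for what driver.execute_script would return from a live page:

```python
from bs4 import BeautifulSoup

# Stand-in for the string returned by
# driver.execute_script('return document.body.innerHTML')
html = '<div class="result"><a href="/item/1">First item</a></div>'

soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a')       # first <a> tag in the parsed document
print(link.text)            # the link's visible text: First item
print(link['href'])         # the link's URL: /item/1
```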
And finally, the find and click methods. There are many ways to find web elements with Selenium (check out this documentation). The easiest one to use is
.find_element_by_xpath(). To find the XPath of a web element, use the inspect element feature of a web browser, as shown in my BeautifulSoup post. Right-click the tag you want to find in the inspect element window, select 'Copy', and then 'Copy XPath'. Put the resulting string into the
.find_element_by_xpath() method to get the needed web element.
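The copied string is just a path through the document's tag tree. As a rough illustration of how such a string addresses one element, here is the same idea using Python's standard-library ElementTree, which supports a subset of XPath syntax (the markup and id below are made up for the example; Selenium's XPath support on a real page is more complete):

```python
import xml.etree.ElementTree as ET

# A tiny made-up document standing in for a real page.
page = ET.fromstring(
    '<body><ul id="results">'
    '<li>apple</li><li>banana</li><li>cherry</li>'
    '</ul></body>'
)

# ".//ul[@id='results']/li[2]" means: the second <li> inside
# the <ul> whose id attribute is "results".
item = page.find(".//ul[@id='results']/li[2]")
print(item.text)  # banana
```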
element = driver.find_element_by_xpath('xpath')
And then you can click the element with element.click()! You can use this to select filters on a search or calendar, then send the new HTML to BeautifulSoup and only read the HTML of the filtered items. Very useful when you know you only want a subset of the data a site has.
There are a ton more things you can do with Selenium, but for the basics, this should already get you pretty far in web scraping when used with BeautifulSoup. With these two mini tutorials, you should be able to scrape data from almost any site you find.