5 Useful Web Scraping Functions in Python

A few miscellaneous methods I’ve found while web scraping

dateutil.parser.parse

Python’s built-in datetime library can parse strings into a datetime object using strptime, but it’s a little hard to use. You have to know the exact format of the date ahead of time, and you have to spell that format out using strptime’s format codes (%Y, %m, %d, and so on).

dateutil is a third-party library, so you will have to install it (the package is called python-dateutil) and import it, but it is very useful. Its parser.parse function, in particular, attempts to parse a string into a datetime object without being told the format. It doesn’t always work, but it is much easier to use than the built-in strptime. In particular, it handles ordinals (‘st’, ‘nd’, ‘rd’, ‘th’), which strptime does not.
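As a rough illustration of the difference, the sketch below parses the same moment both ways. The sample strings are made up for the example, and it assumes python-dateutil is installed.

```python
from datetime import datetime

from dateutil import parser  # third-party: pip install python-dateutil

# strptime needs the exact format spelled out with format codes.
with_strptime = datetime.strptime("2021-03-03 14:30", "%Y-%m-%d %H:%M")

# dateutil works the format out on its own, including the ordinal suffix.
with_dateutil = parser.parse("March 3rd, 2021 2:30 PM")

print(with_strptime == with_dateutil)  # both are datetime(2021, 3, 3, 14, 30)

# Strings it cannot make sense of raise a ValueError (or a subclass of it).
try:
    parser.parse("TBA")
except ValueError as err:
    print("Could not parse:", err)
```

Wrapping the call in try/except like this is one way to keep a scraper running when a page contains a placeholder instead of a real date.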

Check out the documentation here.

urllib.parse.urljoin

When you get a url from a link (‘a’) tag with BeautifulSoup, it is often only a ‘relative’ url. A relative url contains just the part of the address that gets resolved against the url of the page it appears on. If you want a link that is useful outside of the web page itself, you need to combine the relative url with that base url. You can do this manually, by checking the links yourself, working out where the relative part goes, and splicing the strings together, but that is time consuming and easy to get wrong when you are scraping data from different places. urljoin takes the base url and the relative url and combines them for you.
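Here is a small sketch of how urljoin treats a few kinds of href. The page url and the hrefs are invented for the example; in a real scraper they would come from the page you requested and its ‘a’ tags.

```python
from urllib.parse import urljoin

# Hypothetical page the links were scraped from.
page_url = "https://example.com/blog/index.html"

# hrefs as they might appear in the page's <a> tags.
hrefs = ["/about", "post-1.html", "../contact", "https://other.example.org/x"]

absolute = [urljoin(page_url, href) for href in hrefs]
for url in absolute:
    print(url)

# https://example.com/about
# https://example.com/blog/post-1.html
# https://example.com/contact
# https://other.example.org/x
```

Note that an href that is already a full url comes back unchanged, so it is safe to pass every link through urljoin without checking it first.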

Check out the documentation here.

time.sleep

When I scrape a site with Selenium, it is usually only because JavaScript needs to run in order to load the needed data into the HTML. Usually calling driver.get() and then passing the HTML to BeautifulSoup is enough, but every once in a while I come across a really uncooperative website, one that hasn’t finished loading its data by the time I grab the HTML. If you come across a site like that, and it causes errors when you parse the HTML, try adding a time.sleep(1) just after the driver.get() call. It’s pretty hacky, and if you are scraping a lot of pages you’ll want to see whether a shorter pause works, but sometimes it can help.
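A minimal sketch of that pattern is below. It assumes driver is a Selenium WebDriver you have already created; the helper name get_html_after_pause and the one-second default are just placeholders for illustration.

```python
import time


def get_html_after_pause(driver, url, pause_seconds=1.0):
    """Load url, pause so the page's scripts can finish, then return the HTML."""
    driver.get(url)
    time.sleep(pause_seconds)  # blunt fixed wait; try lowering it if the site allows
    return driver.page_source
```

The returned string can then be handed to BeautifulSoup as usual. Keeping the pause as a parameter makes it easy to experiment with shorter values on sites that load quickly.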

Check out the documentation here.

datetime.datetime.now

Say you want to scrape a calendar for events that happen before or after a particular date. Datetime objects can be compared directly, so you can simply use > and <. But if you don’t want to manually set a date every time you run your script, you’ll want to set the target date programmatically. Using datetime.datetime.now(), you can set the target date to the current date and time (or, by adding or subtracting a datetime.timedelta, to a set number of days off from today) every time. You can also use datetime.date.today() if you are using dates instead of datetimes.
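For instance, a filter along these lines keeps only the events after a cutoff. The event list is made-up data standing in for whatever you scraped, and events_after is a hypothetical helper name.

```python
from datetime import datetime, timedelta

# Made-up events standing in for scraped calendar data.
events = [
    ("Past meetup", datetime(2000, 1, 1, 18, 0)),
    ("Far-future conference", datetime(2999, 1, 1, 9, 0)),
]


def events_after(events, cutoff):
    """Return the names of events that start after the cutoff datetime."""
    return [name for name, start in events if start > cutoff]


now = datetime.now()
print(events_after(events, now))                      # events still to come
print(events_after(events, now - timedelta(days=7)))  # cutoff moved back one week
```

Compare like with like: Python raises a TypeError if you use > or < between a date and a datetime, so pick one type for both the scraped values and the cutoff.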

Check out the documentation here.

tag.has_attr

BeautifulSoup allows you to get any attribute from a tag as if the tag were a dictionary (ex: tag[‘class’]). But if the tag does not have the specified attribute, a KeyError is raised. So you’d first want to use tag.has_attr(‘class’) to make sure the tag has the needed attribute before trying to get it. You can also see all of a tag’s attributes using tag.attrs.
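A short sketch of the check-then-read pattern follows. The HTML snippet is invented for the example, and it assumes beautifulsoup4 is installed.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

html = '<a href="/about" class="nav">About</a><a>No link here</a>'
soup = BeautifulSoup(html, "html.parser")

hrefs = []
for tag in soup.find_all("a"):
    if tag.has_attr("href"):  # check first, so tag["href"] cannot raise KeyError
        hrefs.append(tag["href"])
    else:
        print("Skipping tag without href:", tag.attrs)  # attrs is a plain dict

print(hrefs)  # ['/about']
```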

Selenium allows you to get an attribute using element.get_attribute(‘class’). Instead of raising an error, Selenium returns None if the attribute is not found.

Check out the documentation here for BeautifulSoup and here for Selenium.
