I recently completed a project analyzing movie revenues against factors such as genre, release date, and viewer rating. It was a good project for a beginner data scientist, and I enjoyed it. But as I was gathering the data, I grew more curious, and that curiosity developed into a project of its own. I learned a ton as I did this, especially about web scraping, so I’m hoping somebody else may learn by reading about it.
What other factors could affect a movie’s reception?
If you want to save a pandas DataFrame to use later, there are a few options: CSV, Feather, pickle, etc. But if a DataFrame is your completed product, and you want to share it, you can’t beat uploading the data as a Google Sheet. It’s free, you can access it anywhere, and you can easily share it with anyone.
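For the local-file options, here is a minimal sketch of saving and reloading a DataFrame as CSV and as a pickle. The example data and file names are made up for illustration, and Feather is left out because it needs an extra dependency (pyarrow).

```python
import os
import tempfile

import pandas as pd

# Hypothetical example data; any DataFrame works the same way.
df = pd.DataFrame({"title": ["Alien", "Heat"], "year": [1979, 1995]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "movies.csv")
    pkl_path = os.path.join(tmp, "movies.pkl")

    # CSV: plain text that opens anywhere, but dtypes are re-inferred on load.
    df.to_csv(csv_path, index=False)
    from_csv = pd.read_csv(csv_path)

    # Pickle: Python-only, but it keeps dtypes and the index as they were.
    df.to_pickle(pkl_path)
    from_pickle = pd.read_pickle(pkl_path)

print(from_csv)
print(from_pickle)
```

CSV is the safer choice if someone outside Python needs to open the file, while pickle is handy when only you will load it again.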
Pandas can’t write a DataFrame straight to a Google Sheet the way it can with actual file formats. Instead, you need to go through the Google Drive API. There are lots of libraries that provide this access from Python, but for simply…
Epub files are the standard for digital books. If you want a lot of text data, or just have a lot of epubs you want to analyze, you can get that text data using Python. There are a few libraries for managing epub data in Python, but ebooklib is a good one with lots of useful features.
If you haven’t looked inside an epub file yourself, it’s remarkably easy. DRM-free epubs are simply renamed zip files. You can change the .epub extension to .zip and unzip it like any normal zip file.
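Because an epub is a zip archive, Python's standard zipfile module can open one without renaming it. The sketch below is only an illustration: it builds a tiny stand-in archive in memory (the file names inside it are made up, apart from the mimetype and META-INF/container.xml entries that real epubs carry) and lists its contents. The helper name list_epub_contents is hypothetical.

```python
import io
import zipfile


def list_epub_contents(epub_file):
    """Return the names of the files stored inside an epub (a zip archive)."""
    with zipfile.ZipFile(epub_file) as archive:
        return archive.namelist()


# Build a tiny stand-in archive in memory so the example runs without a real book.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr("META-INF/container.xml", "<container/>")
    archive.writestr("chapter1.xhtml", "<html><body><p>Hello</p></body></html>")

buffer.seek(0)
names = list_epub_contents(buffer)
print(names)
```

With a real book you would pass the path to the .epub file instead of the in-memory buffer.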
Looking at the inside of an epub…
A while back, I wrote about a project I did. It was an NLP classification project where I attempted to classify reviews (or general internet comments) about video games as either positive or negative. I wrote another post about gathering the data I needed to train and test these models.
I am here now to say that I messed up. When I gathered this data, I did it wrong. Oh, I got real reviews, actual user-submitted labeled data, but the reviews I got weren’t ones I should have used in my project.
If you look at the “Getting more app…
A few miscellaneous methods I’ve found while web scraping
Python’s built-in datetime library can parse strings into a datetime object using strptime, but it’s a little hard to use. You have to know what format the date is in every time, and you have to know how to write that format out using strptime’s format codes.
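As a quick illustration of that requirement, here is a small sketch of strptime with made-up date strings. The helper name matches_format is hypothetical. Each string needs its own format, and a mismatched format raises a ValueError.

```python
from datetime import datetime

# The format string must describe the input exactly.
parsed_iso = datetime.strptime("2021-03-05 14:30", "%Y-%m-%d %H:%M")

# The same date written day-first needs a different format string.
parsed_dayfirst = datetime.strptime("05/03/2021", "%d/%m/%Y")


def matches_format(text, fmt):
    """Return True if strptime can parse text with fmt, False otherwise."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False


print(parsed_iso)
print(parsed_dayfirst)
print(matches_format("05/03/2021", "%Y-%m-%d"))
```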
Dateutil is a separate library you will have to install, but it is very useful. Its parser function, in particular, attempts to parse a string into a datetime object without any other input. It doesn’t always work, but it is much easier to use than the built-in…
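A minimal sketch of the dateutil parser, assuming the python-dateutil package is installed; the date strings are made up for illustration. No format string is needed, though ambiguous inputs such as 05/03/2021 are read month-first unless you pass dayfirst=True.

```python
from datetime import datetime

from dateutil import parser

# No format string needed: the parser guesses the layout.
parsed_iso = parser.parse("2021-03-05 14:30")
parsed_text = parser.parse("March 5, 2021")

# Ambiguous numeric dates default to month-first; dayfirst=True flips that.
month_first = parser.parse("05/03/2021")
day_first = parser.parse("05/03/2021", dayfirst=True)

print(parsed_iso, parsed_text, month_first, day_first, sep="\n")
```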
BeautifulSoup is a great tool for reading HTML from web pages. But a website has more than just HTML in it. And if you need to get information from somewhere other than the HTML, BeautifulSoup alone won’t cut it.
Enter Selenium. Selenium is an automation tool used to control web browsers. It can also be a powerful tool for web scraping.
To get started, you’ll need to download the web driver, which Selenium uses to control the browser. Here is the download for Chrome, which I will use for this tutorial. …
Web scraping is a great way to get more data for a unique data science project. Since BeautifulSoup is one of the best ways to do web scraping in Python, here is how to get started with it.
from bs4 import BeautifulSoup
Along with installing BeautifulSoup itself, you’ll want to install the requests library. Sending requests in vanilla Python takes too many steps; the requests library makes it much simpler. To send a request to a web page with requests, and turn the response into a BeautifulSoup object, simply use the following code:
import requests
# Fetch the page, then parse its HTML with Python's built-in parser
url = 'https://en.wikipedia.org/wiki/Web_scraping'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')…
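Once you have a soup object, pulling data out of it looks roughly like the sketch below. To keep it runnable without a network connection, it parses a small made-up HTML string rather than a live page; the tags and links in it are illustrative only.

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet standing in for a downloaded page.
html = """
<html><body>
  <h1>Web scraping</h1>
  <a href="/wiki/Data_scraping">Data scraping</a>
  <a href="/wiki/Web_crawler">Web crawler</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first matching tag; get_text pulls out its text.
heading = soup.find("h1").get_text()

# find_all returns every matching tag; attributes are read like dict keys.
links = [a["href"] for a in soup.find_all("a")]

print(heading)
print(links)
```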
The Steam store is the premier online store for video games out there. It’s been around since 2003, but in 2013, they added the ability for users to submit their own reviews. These reviews are labeled as either “Recommended” or “Not Recommended” by the reviewer. This gives you a huge amount of data you can use for your next NLP project. Or perhaps you’re just curious about one game in particular. Either way, I’m going to show you how to scrape those reviews using Python so you can get all the reviews you want easily.
Accessing reviews from one game…
NLP Classification with and Neural Networks
The goal of this project is to create a model that can predict the sentiment of internet talk about video games. Specifically, it will be a neural network model trained on Steam reviews. The reviews are marked either “Recommended” or “Not Recommended”, corresponding to a classification of “positive” or “negative”. Eventually, the project will result in a website that, when supplied with a Twitter hashtag or Reddit thread, will analyze the sentiment of the related comments.
The data is user reviews collected from Steam. Steam user reviews are available for any Steam user to…
Investigate and predict the average housing prices in the next two years using various zip codes in Springfield, MO. Use data from 1996 to 2018 and various time series models to determine which zip code would be the best investment to buy houses in.
Analyzing 15,000 different zip codes (the entirety of the Zillow dataset) isn’t very feasible, so after looking at the data we have decided to focus on the six zip codes in one city: Springfield, Missouri. Since we are only interested in analyzing the trends over time, we will remove the unnecessary columns from our dataset…
Student of Data Science