Web Scraping and Data Analysis via Python

I recently completed a project analyzing movie revenues against factors such as genre, release date, and viewer rating. It was a good project for a beginner data scientist, and I enjoyed it. But as I was gathering the data, I grew more curious, and that curiosity developed into a project of its own. I learned a ton along the way, especially about web scraping, so I’m hoping somebody else may learn by reading about it.

What other factors could affect a movie’s reception?

The problem with answering such a vague question is the need to gather data. Even just coming up with a list…


More web scraping in Python

BeautifulSoup is a great tool for reading HTML from web pages. But a website has more than just HTML in it. And if you need to get information from somewhere other than the HTML, BeautifulSoup alone won’t cut it.

Enter Selenium. Selenium is an automation tool used to control web browsers. It can also be a powerful tool for web scraping.

To get started, you’ll need to download the web driver, which Selenium uses to control the browser. Here is the download for Chrome, which I will use for this tutorial. …


Getting started with web scraping in Python

Web scraping is a great way to get more data for a unique data science project. Since BeautifulSoup is one of the best ways to do web scraping in Python, here is how to get started with it.

from bs4 import BeautifulSoup
import requests

Along with installing BeautifulSoup itself, you’ll want to install the requests library. Sending requests in vanilla Python takes too many steps; the requests library makes it much simpler. To send a request to a web page with requests and turn the response into a BeautifulSoup object, simply use the following code:

url = 'https://en.wikipedia.org/wiki/Web_scraping'
soup = BeautifulSoup(requests.get(url).text…
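Once a page is loaded into a BeautifulSoup object, the usual next step is pulling out specific elements. The following is a minimal sketch of that step, not the article's own code. It assumes a small hard-coded HTML string (made up for illustration) in place of a downloaded page, so it runs without network access, and it uses Python's bundled `html.parser`.

```python
from bs4 import BeautifulSoup

# A small hard-coded page stands in for a downloaded response body,
# so the example runs without network access. The markup is made up.
html = """
<html>
  <head><title>Web scraping</title></head>
  <body>
    <h1 id="firstHeading">Web scraping</h1>
    <p>See <a href="/wiki/Data_scraping">data scraping</a> and
       <a href="/wiki/Web_crawler">web crawlers</a>.</p>
  </body>
</html>
"""

# 'html.parser' is the parser that ships with Python.
soup = BeautifulSoup(html, "html.parser")

# Text of the <title> tag.
page_title = soup.title.string

# First <h1> whose id attribute is "firstHeading".
heading = soup.find("h1", id="firstHeading").get_text()

# The href of every <a> tag that has one.
links = [a["href"] for a in soup.find_all("a", href=True)]

print(page_title)
print(heading)
print(links)
```

With a real page, the `html` string would be replaced by the text of the response returned by requests.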


Using Python to scrape user reviews from the Steam store without using the API.

The Steam store is the premier online store for video games. It’s been around since 2003, but in 2013 it added the ability for users to submit their own reviews. These reviews are labeled as either “Recommended” or “Not Recommended” by the reviewer. This gives you a huge amount of data you can use for your next NLP project. Or perhaps you’re just curious about one game in particular. Either way, I’m going to show you how to scrape those reviews using Python so you can easily get all the reviews you want.

Getting some reviews

Accessing reviews from one game…


NLP Classification with Neural Networks

Business Case

The goal of this project is to create a model that can predict the sentiment of internet talk about video games. Specifically, it will be a neural network model trained on Steam reviews. The reviews are marked either “Recommended” or “Not Recommended”, corresponding to a classification of “positive” or “negative”. Eventually, the project will result in a website that, when supplied with a Twitter hashtag or Reddit thread, will analyze the sentiment of the related comments.

Data Collection

The data is user reviews collected from Steam. Steam user reviews are available for any Steam user to…


Time series analysis in python

https://github.com/MullerAC/springfield-housing-analysis

Purpose

Investigate and predict the average housing prices over the next two years for various zip codes in Springfield, MO. Use data from 1996 to 2018 and various time series models to determine which zip code would be the best investment for buying houses.

Exploratory Data Analysis

Analyzing 15,000 different zip codes (the entirety of the Zillow dataset) isn’t very feasible, so after looking at the data we have decided to focus on the six zip codes in one city: Springfield, Missouri. Since we are only interested in analyzing the trends over time, we will remove the unnecessary columns from our dataset…
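The filtering step described above can be sketched in plain Python. This is only an illustration under stated assumptions: the column names (`RegionName`, `City`, `State`, `SizeRank`), the `YYYY-MM` month columns, and every price below are made up to resemble a wide, one-row-per-zip-code layout, and the project itself may well have used other tools for this.

```python
import csv
import io

# Made-up rows in a wide layout: one row per zip code, a few descriptive
# columns, then one column per month. Column names and prices are
# assumptions for illustration only, not values from the real dataset.
raw = """RegionName,City,State,SizeRank,1996-04,1996-05,1996-06
65802,Springfield,MO,1001,60000,60200,60500
65807,Springfield,MO,1002,75000,75100,75400
62704,Springfield,IL,1003,80000,80300,80600
"""


def is_month(column):
    """Return True for column names shaped like 'YYYY-MM'."""
    year, sep, month = column.partition("-")
    return (
        sep == "-"
        and len(year) == 4
        and len(month) == 2
        and (year + month).isdigit()
    )


# Keep only Springfield, MO rows, and within each row keep only the
# monthly price columns, dropping the descriptive ones.
series_by_zip = {}
for row in csv.DictReader(io.StringIO(raw)):
    if row["City"] == "Springfield" and row["State"] == "MO":
        series_by_zip[row["RegionName"]] = {
            col: float(value) for col, value in row.items() if is_month(col)
        }

for zip_code, prices in series_by_zip.items():
    print(zip_code, prices)
```

Filtering on both city and state matters in a sketch like this because more than one state has a city named Springfield.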


https://github.com/MullerAC/tanzanian-water-well-classification

Overview

Tanzania is a developing country that struggles to get clean water to its population of 57 million. As part of an ongoing competition at DrivenData, this project’s goal was to predict the functionality of water wells given data provided by Taarifa and the Tanzanian Ministry of Water.

This is a ternary classification problem: all points are either functional, nonfunctional, or functional and in need of repair. The data has 39 independent variables relating to the management, use, and location of the pumps. …


For my course in Data Science, I built a linear regression model on the sales of houses in King County. Data for sales in 2014 and 2015 was provided, and Python was used to do the modeling. This is a rundown of my first linear regression modeling project. For the full code, see the GitHub repo: https://github.com/MullerAC/king-county-house-sales
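For readers new to linear regression, here is a minimal sketch of the idea with a single feature, fitted by ordinary least squares in plain Python. It is not the project's code, which is in the repo linked above and may use different libraries and many more features. The living areas and prices below are made-up numbers, and `fit_line` is a hypothetical helper name.

```python
# Made-up data for illustration only: living area (thousands of square
# feet) and sale price (thousands of dollars). Not from the real dataset.
living_area = [1.0, 2.0, 3.0, 4.0, 5.0]
price = [300.0, 410.0, 490.0, 610.0, 690.0]


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(living_area, price)

# Predicted price for a hypothetical 2,500 square foot house.
predicted = slope * 2.5 + intercept

print(slope, intercept, predicted)
```

The slope reads as the estimated change in price for each additional thousand square feet, which is the kind of interpretation a business case can be built on.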

Business Case

It is important to establish a business case when starting a project like this. This is to create a goal and help move towards a meaningful conclusion. My last blog post’s project didn’t have any real goal, so it just stopped when I…

Andrew Muller

Student of Data Science
