Movie Data by the US State it’s Set In

Web Scraping and Data Analysis via Python

Andrew Muller
9 min read · Oct 26, 2020

I recently completed a project analyzing movie revenues against factors such as genre, release date, and viewer rating. It was a good project for a beginner data scientist, and I enjoyed it. But as I was gathering the data, I grew more curious, and that curiosity developed into a project of its own. I learned a ton along the way, especially about web scraping, so I'm hoping somebody else may learn something by reading about it.

What other factors could affect a movie’s reception?

The problem with answering such a vague question is the need to gather data. Even just coming up with a list of what data to use is a daunting task, much less actually gathering the data itself. After spending too long reading about movie data — and pausing this idea to complete the original project — I hit upon something interesting: https://en.wikipedia.org/wiki/Category:Films_set_in_the_United_States_by_state.

It turns out Wikipedia has organized thousands of movies by the state they are set in. A movie’s setting could certainly be a factor in how it is received, and it’s not the type of information usually found in a standard movie dataset. I decided to focus on this data and set about satisfying my curiosity.

Web Scraping

The first step was to gather the data from Wikipedia. The requests and BeautifulSoup libraries are key here, and Selenium will help with some of the trickier bits. Pandas, of course, is needed to store and process the data.

from bs4 import BeautifulSoup
import pandas as pd
import requests
from selenium import webdriver
from time import sleep

Looking at the links on that Wikipedia page, I noticed that many of them contain yet more subcategories. Instead of reading each page for all the subcategories, and then all those pages for yet more subcategories, they can all be read from this one page. This is where we need Selenium: the links are not in the html by default; they only get inserted after the arrow icon next to each link is clicked. After some trial and error, I found this method for clicking every expandable arrow on the webpage.

driver_options = webdriver.ChromeOptions()
driver_options.add_argument('headless')
driver = webdriver.Chrome(options=driver_options)
driver.get('https://en.wikipedia.org/wiki/Category:Films_set_in_the_United_States_by_state')
sleep(2)
to_click = driver.find_elements_by_xpath('//*[@title="expand"]')
sleep(2)
while len(to_click) > 1:
    for element in to_click[1:]:
        element.click()
        sleep(2)
    to_click = driver.find_elements_by_xpath('//*[@title="expand"]')
    sleep(2)
soup = BeautifulSoup(driver.execute_script('return document.body.innerHTML'), 'html.parser')

When I first attempted this code without the sleep calls, I got inconsistent results. I wouldn’t find any arrows one run, then find a hundred the next. Sometimes I would get errors from not being able to find any; sometimes clicks would simply be skipped. Adding the sleep calls increases the time it takes to run, but I only needed to run it once, so it was worth it. The driver options allow Selenium to run headless, without actually opening a browser window on my screen. The execute_script call at the end returns the html of the webpage. This is different from reading page_source, which only returns the base html, not the html as it stands after all the button clicks. I skipped the first item on the page because it contained only documentaries about the states: not exactly what I’m looking for.
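
Looking back, Selenium’s explicit waits would probably be a more robust fix than fixed sleep calls, since they block only until a condition is actually met. Here is a minimal sketch of that idea, reusing the driver from the block above; the 10-second timeout is just an assumption:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Block until at least one expand arrow exists, instead of sleeping a fixed 2 seconds.
wait = WebDriverWait(driver, 10)
to_click = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//*[@title="expand"]')))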

Next I passed the html to BeautifulSoup to search for the links I need. But just grabbing the links wasn’t enough, as I had to associate each link with a state. Luckily the subcategories are all listed under a category that has the state name, so it didn’t end up too difficult. After removing duplicate links (some subsubcategories were listed under multiple subcategories), I ended up with 234 webpages full of movie titles to scrape.

state_links = []
for group in soup.find_all('div', class_='mw-category-group')[1:]:
    for state in group.find_all('li'):
        item = state.find('a')
        name = item.get_text(strip=True).split('in ')[-1].split(' (')[0]
        state_links.append((name, f"https://en.wikipedia.org{item['href']}"))
        for item in state.find('div', class_='CategoryTreeChildren').find_all('a'):
            state_links.append((name, f"https://en.wikipedia.org{item['href']}"))
state_links = [i for n, i in enumerate(state_links) if i not in state_links[:n]]

Now to pull the actual movie links from each page. I already grabbed every subcategory, so I only need the movies themselves. Unfortunately, some of the categories have over 200 movies, so they are split across multiple pages. This meant looping the scraping whenever I found a next page link. Not all of the pages used the same div wrappers, which led to a few conditionals scattered throughout the code.

movie_list = []
for entry in state_links:
    state_name = entry[0]
    url = entry[1]
    next_page = True
    while next_page:
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        page_wrapper = soup.find('div', id='mw-pages')
        if not page_wrapper:
            next_page = False
            continue
        content_wrapper = page_wrapper.find('div', class_='mw-content-ltr')
        if not content_wrapper:
            next_page = False
            continue
        for movie in content_wrapper.find_all('a'):
            movie_dict = {}
            movie_dict['title'] = movie.get_text(strip=True)
            movie_dict['state'] = state_name
            movie_dict['wiki_link'] = f'https://en.wikipedia.org{movie["href"]}'
            movie_list.append(movie_dict)
        next_page_link = content_wrapper.find_previous_sibling('a')
        if next_page_link and (next_page_link.get_text(strip=True) == 'next page'):
            url = f"https://en.wikipedia.org{next_page_link['href']}"
        else:
            next_page = False
wiki_df = pd.DataFrame(movie_list)

After getting the data, I inserted it into a Pandas dataframe. I ended up with the title, Wikipedia link, and state setting for 14868 movies. That is a massive amount of data, but I knew it would be cut down as I gathered more information on each movie.

Since the only information I have for each movie is its Wikipedia link, I’ll have to pull information from there. Most of the movies have an IMDB link in the external links section of their page. The IMDB url contains the IMDB id within it, making it easy to match up to other databases. The code to get this information is simple, but for nearly 15000 pages, even after removing duplicates, it took over 2 hours to run.

def get_imdb_link(url):
    print(url)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    try:
        ext_links = soup.find('span', id='External_links').find_parent().find_next_sibling('ul')
        return ext_links.find('a', text='IMDb').find_previous_sibling()['href']
    except:
        return 'n/a'
wiki_df.drop_duplicates(inplace=True)
wiki_df['imdb_link'] = wiki_df['wiki_link'].map(get_imdb_link)

I then removed any movies that had no IMDB link, cleaned up the movie titles, and extracted the IMDB id from the url. I was left with a dataset of 13254 movies, which I then saved to a csv file.

wiki_df = wiki_df[wiki_df['imdb_link']!='n/a']
wiki_df['title'] = wiki_df['title'].map(lambda x: x.split(' (')[0])
wiki_df['imdb_id'] = wiki_df['imdb_link'].map(lambda x: x[x.find('title/tt')+6:-1])
wiki_df.to_csv('Data/wiki_data.csv', index=False)
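
As an aside, the string slicing above assumes every IMDB url has exactly the same shape; a regular expression would be a little more forgiving. A minimal sketch of that alternative, using a hypothetical extract_imdb_id helper:

import re
def extract_imdb_id(link):
    # IMDb title ids are 'tt' followed by digits, e.g. tt1234567.
    match = re.search(r'tt\d+', link)
    return match.group(0) if match else 'n/a'
wiki_df['imdb_id'] = wiki_df['imdb_link'].map(extract_imdb_id)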

Data Merging and Cleaning

Getting the data I want from IMDB is actually very simple. Their daily updated datasets are freely available to anyone at https://www.imdb.com/interfaces/. I decided to collect genres, release year, and ratings information as it seemed the most interesting to me.
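
If you would rather fetch the files in code than download them by hand, each dataset is published as a gzipped tsv. A minimal sketch, assuming the files are still hosted at datasets.imdbws.com and writing them to the same Data/ paths I use below:

import gzip
import requests
files = {'title.basics.tsv.gz': 'Data/title-basics.tsv',
         'title.ratings.tsv.gz': 'Data/title-ratings.tsv'}
for remote_name, local_path in files.items():
    # Download the gzipped tsv and write the decompressed bytes to disk.
    r = requests.get(f'https://datasets.imdbws.com/{remote_name}')
    with open(local_path, 'wb') as f:
        f.write(gzip.decompress(r.content))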

Pandas makes importing the tsv files simple. Merging the two gives a dataframe of 1086028 movies in a matter of seconds, much quicker than the hours it took to gather 15000 movies from Wikipedia. I also renamed the columns to make them more intuitive.

import pandas as pd
imdb_basics_df = pd.read_csv('Data/title-basics.tsv', sep='\t', usecols=['tconst', 'startYear', 'genres'])
imdb_ratings_df = pd.read_csv('Data/title-ratings.tsv', sep='\t')
imdb_df = imdb_basics_df.merge(imdb_ratings_df, on='tconst')
imdb_df.rename(columns={'tconst':'imdb_id', 'startYear':'year', 'averageRating':'rating', 'numVotes':'votes'}, inplace=True)

Finally I merged the IMDB data with the Wikipedia data, read from the saved csv. Many of the Wikipedia pages did not have IMDB links, and some of the IMDB ids I got from Wikipedia did not match up with anything on the IMDB data. Most Wikipedia data is user generated and edited, so I’d guess that human error is the cause.

wiki_df = pd.read_csv('Data/wiki_data.csv')
compiled_df = wiki_df.merge(imdb_df, on='imdb_id')
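
If you’re curious how many Wikipedia entries fail to match, one way to check is a left merge with an indicator column before committing to the inner merge above. A small sketch:

check_df = wiki_df.merge(imdb_df, on='imdb_id', how='left', indicator=True)
# Rows marked 'left_only' have an IMDB id that never appears in the IMDB datasets.
print(check_df['_merge'].value_counts())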

The genre information is all packed into a single string. To make this more useful, I need to break it out into multiple columns: one column for each genre, holding a boolean value for each row. IMDB is kind enough to provide a list of their genres at https://help.imdb.com/article/contribution/titles/genres/GZDRMS6R742JRGAG#. After turning that into a list, I simply looped over it, adding a new column for each genre. Each new column is mapped from the existing genres column; since both are strings in the same format, a simple substring check is enough.

genres = 'Action | Adult | Adventure | Animation | Biography | Comedy | Crime | Documentary | Drama | Family | Fantasy | Film-Noir | Game-Show | History | Horror | Musical | Music | Mystery | News | Reality-TV | Romance | Sci-Fi | Short | Sport | Talk-Show | Thriller | War | Western'.split(' | ')
for genre in genres:
    compiled_df[genre] = compiled_df['genres'].map(lambda x: genre in x)

Time to do a little cleaning. I turned the year column into integers. Some of the genres, namely Adult, Game-Show, News, Reality-TV, Short, and Talk-Show, don’t fit the kind of movies I’m looking for. Most of them probably only apply to television shows on IMDB rather than movies, but I still removed any titles tagged with those genres and then deleted those columns. Finally, after removing duplicates, I’m left with a clean dataset of 12793 movies to analyze, saved once again to a csv file.

compiled_df['year'] = pd.to_numeric(compiled_df['year'])
genres_to_remove = ['Adult', 'Game-Show', 'News', 'Reality-TV', 'Short', 'Talk-Show']
for genre in genres_to_remove:
    compiled_df = compiled_df[~compiled_df[genre]]
compiled_df.drop(columns=genres_to_remove, inplace=True)
compiled_df.drop_duplicates(inplace=True)
compiled_df.to_csv('Data/compiled_data.csv', index=False)

Analysis

Analyzing the data is actually the easiest part. The data is once again read in using Pandas’ read_csv function.

import pandas as pd
df = pd.read_csv('Data/compiled_data.csv')

I used Pandas’ groupby function to get aggregate data for each state: the mean year, rating, and vote count. Release years are bounded by how long movies have existed, ratings are limited to 0–10, and vote counts are limited by the size of IMDB’s user base. Because of that, outliers should be rare and not very far off, so the mean is a good measure for this data.

df.groupby('state').mean()[['year', 'rating', 'votes']]
Mean release year, viewer rating, and vote count for movies by state setting

I can also take a subset of the entire dataframe and use the sort values function to get individual movies. Putting it all in a function allows me to play around with different options, like sorting by popularity as opposed to rating, getting the lowest value instead of the highest, or disallowing nonfiction (Documentary or Biography genre) movies.

def get_per_state(movies_df, column='rating', allow_nonfiction=True, lowest=False):
    states = list(movies_df['state'].unique())
    genres_to_ignore = ['Biography', 'Documentary']
    i = 0 if lowest else -1
    top_movies = []
    for state in states:
        top_movie = movies_df[(movies_df['state']==state) & (allow_nonfiction | ~movies_df[genres_to_ignore].any(axis=1))].sort_values(by=column).iloc[i].to_dict()
        top_movies.append(top_movie)
    return pd.DataFrame(top_movies)
get_per_state(df, 'rating', False)[['state', 'title', 'rating']]
Highest rated movie by state setting
get_per_state(df, 'votes', False, True)[[ 'state', 'title', 'votes']]
Least popular movie by state setting

Finally, I want to get information on genres per state. To do this, I counted the total number of movies for each state. Then I created a new dataframe where I sum the true values of each genre column for each state. These genre columns were then multiplied by 100 and divided by the total number of movies, creating a percentage. After dropping the unnecessary columns, I was able to find which column had the highest value for each row.

total_movies = df['state'].value_counts().to_dict()
genre_percentages_df = df.groupby('state').sum().iloc[:, 3:]
genre_percentages_df['total'] = genre_percentages_df.index.map(lambda x: total_movies[x])
for column in genre_percentages_df.columns[:-1]:
    genre_percentages_df[column] = genre_percentages_df[column]*100/genre_percentages_df['total']
genre_percentages_df.drop(columns='total', inplace=True)
genre_percentages_df['top_genre'] = genre_percentages_df.idxmax(1)

It turns out most movies include the Drama genre. The most popular genre for every state is Drama. Drama can apply to almost any story, and can be applied on top of any other genre, so I suppose this is to be expected. After removing it from the running, I was left with the varied genre listing I had expected.

genre_percentages_df['top_genre'] = genre_percentages_df.drop(columns=['Drama']).idxmax(1)
genre_percentages_df[['top_genre']]
Most common genre by state setting

That’s it, all my curiosities satisfied. A lot of movies were listed under multiple states, so some more cleaning may have produced significantly different results.

You can check out all the code and datasets on this project’s Github repo: https://github.com/MullerAC/movies-by-state. Hopefully by reading this, you learned a little about gathering, cleaning, and analyzing data in Python. I know I learned a lot by writing it.
