Scraping Steam User Reviews

Using Python to scrape user reviews from the Steam store without using the API.

The Steam store is the premier online store for video games. It has been around since 2003, and in 2013 it added the ability for users to submit their own reviews. Each review is labeled as either “Recommended” or “Not Recommended” by the reviewer, which gives you a huge amount of labeled data for your next NLP project. Or perhaps you’re just curious about one game in particular. Either way, I’m going to show you how to scrape those reviews using Python so you can easily get all the reviews you want.

Getting some reviews

Accessing reviews from one game is pretty easy. Just visit <appid>?json=1, with <appid> replaced by the app id of whatever game you want. To find the app id of a particular game, just check the URL on its store page.

So to get some reviews, it’s just as simple as this:

import requests
response = requests.get(url='').json()

You need the .json() call to parse the body of the response into a Python dictionary. Without it you only have the response object itself, which prints as <Response [200]>.
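To make the shape of that parsed dictionary concrete, here is a small sketch that pulls the text and the label out of each review. The sample dictionary is hand-written for illustration and is far shorter than a real response. The only field names it relies on are 'reviews', 'cursor', 'review' and 'voted_up', which are the ones used elsewhere in this post. The helper name extract_labeled_reviews is my own.

```python
# A hand-written stand-in for a parsed response (illustrative values only).
sample_response = {
    'cursor': 'AoJ4abc=',
    'reviews': [
        {'review': 'Great farming game.', 'voted_up': True},
        {'review': 'Not for me.', 'voted_up': False},
    ],
}

def extract_labeled_reviews(response):
    """Return (text, recommended) pairs from a parsed review response."""
    return [(r['review'], r['voted_up']) for r in response.get('reviews', [])]

pairs = extract_labeled_reviews(sample_response)
for text, recommended in pairs:
    label = 'Recommended' if recommended else 'Not Recommended'
    print(f'{label}: {text}')
```

Working with plain (text, label) pairs like this keeps the rest of a project independent of whatever other fields the response happens to carry.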

Even that can be improved, though. Here is a more easily usable function to get the reviews. The ‘params’ parameter lets us pass parameters beyond json=1, which we will get to soon. The ‘headers’ parameter tells Steam that a browser is making the request and not, for example, a Python program pretending to be a browser. It may not be necessary here, but it might help if you are scraping lots of reviews at once.

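import requests

def get_reviews(appid, params=None):
    # A None default avoids sharing one mutable dict between calls
    if params is None:
        params = {'json': 1}
    url = ''
    # str() lets appid be passed as either an int or a string
    response = requests.get(url=url+str(appid), params=params, headers={'User-Agent': 'Mozilla/5.0'})
    return response.json()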

Getting more reviews

Back to those parameters I mentioned. You can see the documentation here, but the most important is ‘cursor’. A single request returns at most 100 reviews, so if you want more you’ll need to use the cursor. Each response includes a ‘cursor’ value marking where that batch of reviews ended. Passing that cursor in your next request’s parameters starts the next batch at that spot, so you get a completely new set of reviews. The cursor is a seemingly random string of characters and may include characters that need to be encoded to work in a URL.

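params = {'json': 1}
response = get_reviews(413150, params)
# The cursor marks where this batch of reviews ended
cursor = response['cursor']
# Pass it back as bytes; requests percent-encodes it when building the URL
params['cursor'] = cursor.encode()
response_2 = get_reviews(413150, params)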
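If you are curious what that URL encoding looks like, this sketch builds a query string by hand with the standard library's urllib.parse.urlencode. The cursor value is invented for illustration. requests does its own percent-encoding of parameter values when it builds the URL, so this only shows the effect and is not something you need to write yourself.

```python
from urllib.parse import urlencode, parse_qs

# Made-up cursor containing characters that are not safe in a URL.
cursor = 'AoJ4/+abc=='
query = urlencode({'json': 1, 'cursor': cursor})
print(query)  # json=1&cursor=AoJ4%2F%2Babc%3D%3D

# Decoding the query string gives back the original cursor.
decoded = parse_qs(query)['cursor'][0]
print(decoded == cursor)  # True
```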

Or, to put it into a function:

def get_n_reviews(appid, n=100):
    reviews = []
    cursor = '*'  # '*' asks for the first batch of reviews
    params = {
        'json' : 1,
        'filter' : 'all',
        'language' : 'english',
        'day_range' : 9223372036854775807,
        'review_type' : 'all',
        'purchase_type' : 'all'
    }

    while n > 0:
        params['cursor'] = cursor.encode()
        params['num_per_page'] = min(100, n)
        n -= 100

        response = get_reviews(appid, params)
        cursor = response['cursor']
        reviews += response['reviews']

        # A short batch means there are no more reviews to fetch
        if len(response['reviews']) < 100: break

    return reviews

You can change the parameters if you’d like, but this grabs the n most helpful reviews from all time. The check at the end of the loop stops the requests once you reach the end of the game’s reviews (at least, those that meet your parameter criteria), and the function returns all the reviews in one list.
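If you want to try the paging logic without hitting Steam, one option is to pass the fetching step in as a function. The sketch below does that and runs against a fake in-memory fetcher. The names collect_reviews, fetch_page and fake_fetch are hypothetical, and the fake treats the cursor as a simple offset, which is not how real cursors look. The guard against a repeated cursor is my own precaution, not something the documentation is known to require.

```python
def collect_reviews(fetch_page, n=100, page_size=100):
    """Collect up to n reviews by following cursors.

    fetch_page(cursor, count) must return a dict with 'reviews' and 'cursor'.
    """
    reviews = []
    cursor = '*'
    seen_cursors = set()

    # Stop when we have enough reviews or a cursor comes back a second time.
    while len(reviews) < n and cursor not in seen_cursors:
        seen_cursors.add(cursor)
        count = min(page_size, n - len(reviews))
        response = fetch_page(cursor, count)
        batch = response['reviews']
        reviews.extend(batch)
        cursor = response['cursor']
        # A short batch means the reviews have run out.
        if len(batch) < count:
            break

    return reviews[:n]


# Stand-in for the network: 250 fake reviews, with the cursor acting as an offset.
FAKE_REVIEWS = [{'review': f'review {i}', 'voted_up': i % 2 == 0} for i in range(250)]

def fake_fetch(cursor, count):
    start = 0 if cursor == '*' else int(cursor)
    batch = FAKE_REVIEWS[start:start + count]
    return {'reviews': batch, 'cursor': str(start + len(batch))}

print(len(collect_reviews(fake_fetch, n=120)))   # 120
print(len(collect_reviews(fake_fetch, n=1000)))  # 250, all the fake data holds
```

To use it for real, fetch_page could be a small wrapper around get_reviews that fills in the ‘cursor’ and ‘num_per_page’ parameters.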

Getting an app id

Alright, so you’ve got a way to get a bunch of reviews. But how do you get an app id in the first place? Well, I showed earlier how to find the app id of a single game by hand, but that isn’t useful if you want to look it up from Python, or to get a bunch of app ids at once. To do this, we are going to have to do a bit of web scraping.

from bs4 import BeautifulSoup

def get_app_id(game_name):
    # Search the store for the game, restricted to the 'Games' category
    response = requests.get(url=f'{game_name}&category1=998', headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(response.text, 'html.parser')
    # The first search result row carries the app id in 'data-ds-appid'
    app_id = soup.find(class_='search_result_row')['data-ds-appid']
    return app_id

First, we use the requests library to get the HTML of a Steam store search result. The category parameter restricts the search to the ‘Games’ category, ensuring you don’t accidentally get software, a demo, a soundtrack, or some other store item. Then we feed the HTML of the resulting page to a library called BeautifulSoup. BeautifulSoup is an extremely useful web scraping library. You can check out the documentation here, but I’ll go over some of the basic uses for this blog post.

The ‘find’ method grabs the first HTML tag that meets the required characteristics. In this case, we take the first tag that has the class ‘search_result_row’, which is what the Steam store uses for each search result. The index after that method call reads the ‘data-ds-appid’ attribute of the found tag, which is where the Steam store keeps the app id of a game in a search result.
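To see that matching step without touching the network, here is a sketch that runs on a tiny hand-written HTML snippet. The markup is a simplified guess at what a search result looks like, keeping only the class and attribute named above; the second app id and the extra class are made up. It uses Python's built-in html.parser instead of BeautifulSoup so that it runs with nothing installed, and the class name AppIdParser is my own.

```python
from html.parser import HTMLParser

# Simplified, hand-written stand-in for a search results page.
SAMPLE_HTML = '''
<div>
  <a class="search_result_row extra_class" data-ds-appid="413150">Game A</a>
  <a class="search_result_row" data-ds-appid="12345">Game B</a>
  <a class="other_link" href="#">Not a result</a>
</div>
'''

class AppIdParser(HTMLParser):
    """Collect data-ds-appid from every tag with class 'search_result_row'."""

    def __init__(self):
        super().__init__()
        self.app_ids = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # A tag can carry several space-separated classes.
        classes = (attributes.get('class') or '').split()
        if 'search_result_row' in classes and 'data-ds-appid' in attributes:
            self.app_ids.append(attributes['data-ds-appid'])

parser = AppIdParser()
parser.feed(SAMPLE_HTML)
print(parser.app_ids)     # every match, comparable to find_all
print(parser.app_ids[0])  # first match, comparable to find
```

Note that the app ids come back as strings, since they are read straight out of an HTML attribute.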

I’ve given you these answers here, but if you want to find out the tags and attributes used on some other site, you’ll need to do some digging into the site’s HTML. To find the HTML tag of a specific element you see on the site, just right-click it and select ‘Inspect Element’.

Getting more app ids

But now you want to get a bunch of app ids at once. Well, you can do that too.

def get_n_appids(n=100, filter_by='topsellers'):
    appids = []
    url = f'{filter_by}&page='
    page = 0

    # Each results page is assumed to hold 25 games
    while page*25 < n:
        page += 1
        response = requests.get(url=url+str(page), headers={'User-Agent': 'Mozilla/5.0'})
        soup = BeautifulSoup(response.text, 'html.parser')
        for row in soup.find_all(class_='search_result_row'):
            appids.append(row['data-ds-appid'])

    return appids[:n]

This works much like the earlier function: it pulls up a search result page from the Steam store, finds every result row, and takes the app id from each one. By default this grabs the current ‘topsellers’, but if you play around with the search page, you can find the URL for whatever search you want.
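The loop condition in get_n_appids amounts to a ceiling division: keep requesting pages until 25 times the page count covers n. As a quick sanity check, this sketch computes how many pages that comes to, assuming 25 results per page as the function does. The helper name pages_needed is hypothetical.

```python
import math

def pages_needed(n, per_page=25):
    """Number of result pages to request to cover n items."""
    return math.ceil(n / per_page)

for n in (1, 25, 26, 100, 750):
    print(n, pages_needed(n))
```

So asking for 750 app ids means 30 search page requests, which is worth knowing before you start a long scrape.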

Why do this?

I first explored scraping these Steam reviews for an NLP project I worked on a while ago. To gather my data, I used the functions above in a loop to collect all the reviews I needed.

import pandas as pd

reviews = []
appids = get_n_appids(750)
for appid in appids:
    reviews += get_n_reviews(appid, 100)
# Keep only the review text and its Recommended / Not Recommended label
df = pd.DataFrame(reviews)[['review', 'voted_up']]

I actually got too many reviews at first, and my computer couldn’t handle them all! It was so quick to get thousands of labeled reviews that were actually interesting to me. This was data I actually wanted to work with, so I was driven to do new and more interesting things with the project. I think anything I make will come out better the more interested I am in it. If these reviews are at all interesting to you, hopefully you found something useful in these bits of code.

Student of Data Science