Choosing Your Data Carefully

A while back, I wrote about a project I did. It was an NLP classification project where I attempted to classify reviews (or general internet comments) about video games as either positive or negative. I wrote another post about gathering the data I needed to train and test these models.

I am here now to say that I messed up. When I gathered this data, I did it wrong. Oh, I got real reviews, actual user-submitted labeled data, but the reviews I got weren’t ones I should have used in my project.

If you look at the “Getting more app ids” section of that post about gathering reviews, you might see my problem. That code I put in there was the actual code I used on my project.

The app ids I gathered were from the ‘topsellers’ section of Steam. Those are chosen by recency and popularity. I chose this approach because I didn’t see another way to get a large number of app ids, and I didn’t see how it might cause problems.

Of course it caused problems! I only gathered data on popular games! Games that are way more likely to get good reviews. Games that review readers are likely to be somewhat familiar with already. Games from known studios or series.

Biased data. That’s what I had. I had noted that my reviews were 80% positive and 20% negative. Now that I’m gathering data from more varied app ids, I’m seeing that a more accurate split is 75% positive and 25% negative, nearing 70–30.

Just from those numbers you can see how skewed my data was. A 5% difference across hundreds of thousands of data points is a massive number of missed reviews. The more balanced distribution also reduces how much under- or over-sampling I’ll need to do later to address the class imbalance.
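If I do end up resampling, random undersampling is the simplest option. Here’s a minimal sketch with pandas, assuming a dataframe with a review text column and a `voted_up` label column (both names, and the `undersample` helper itself, are my own placeholders, not the project’s actual code):

```python
import pandas as pd

def undersample(df, label_col="voted_up", random_state=0):
    """Randomly drop rows from the majority class until both
    classes match the minority class count."""
    n = df[label_col].value_counts().min()
    return (df.groupby(label_col)
              .sample(n=n, random_state=random_state)
              .reset_index(drop=True))
```

With an 80/20 split, this throws away three quarters of the positive reviews, which is exactly why starting from less-biased data matters: the closer the raw split is to 50/50, the less data resampling has to sacrifice.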

A Better Way to Get App IDs

Yes, I am redoing the project, and I have a new way to gather app ids. Steam’s Web API has an endpoint that returns every app id they have. The list doesn’t distinguish between games, DLC, and soundtracks, but that can be taken care of: you can check an app’s type via its id. Alternatively, you can ignore the distinction; reviews of DLC are likely to be helpful to a neural network making generalized predictions.
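For illustration, here’s one way to look up an app’s type using the Steam store’s `appdetails` endpoint, which returns a `type` field like `"game"`, `"dlc"`, or `"music"`. The helper names are my own, not part of any library:

```python
import requests

APPDETAILS_URL = "https://store.steampowered.com/api/appdetails"

def parse_app_type(payload, appid):
    """Pull the app type ('game', 'dlc', 'music', ...) out of
    an appdetails JSON response, keyed by the appid as a string."""
    entry = payload.get(str(appid), {})
    if not entry.get("success"):
        return None  # delisted or hidden apps come back with success=False
    return entry["data"].get("type")

def get_app_type(appid):
    """Fetch and parse the type of a single app."""
    resp = requests.get(APPDETAILS_URL, params={"appids": appid})
    resp.raise_for_status()
    return parse_app_type(resp.json(), appid)
```

One caveat: `appdetails` is rate-limited and takes one app per call, so filtering a full list of 100,000+ ids this way is slow; ignoring the distinction is the cheaper choice.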

Actually doing it is remarkably simple. A single API call returns every app id on Steam, paired with the name of the app, and it is trivial to pull that data into a dataframe.
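As a sketch of that single call: Steam’s `ISteamApps/GetAppList` endpoint returns every appid with its name, and the JSON drops straight into a pandas dataframe. The helper functions are hypothetical names of my own:

```python
import pandas as pd
import requests

# GetAppList returns every app on Steam; no API key required.
APP_LIST_URL = "https://api.steampowered.com/ISteamApps/GetAppList/v2/"

def apps_to_frame(payload):
    """Convert the GetAppList JSON into a dataframe with
    'appid' and 'name' columns."""
    return pd.DataFrame(payload["applist"]["apps"])

def fetch_app_list():
    """Download the full app list and return it as a dataframe."""
    resp = requests.get(APP_LIST_URL)
    resp.raise_for_status()
    return apps_to_frame(resp.json())
```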

As of a few days ago, when I ran this code, there are 115,472 apps on Steam. Many of these are test apps, software, demos, or other non-game items, but this list represents a huge amount of reviews that I’d simply ignored before.

To gather the actual reviews, I looped through the entire list, gathering 50 from each. Many games have fewer than the 100 reviews I grabbed per title before, which further favored the already-popular titles. By taking 50 from each, I evened the playing field, at least a little. There are still many titles with fewer than 50 reviews, but lowering the cap further would shrink the dataset too much. As it is, I gathered just over 1 million reviews, far more than in my first attempt at this project. It’s a struggle to handle this much data effectively, but I am confident that I will get a better result.
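The per-title pull can be sketched with Steam’s `appreviews` endpoint, requesting `num_per_page=50`; each review’s `voted_up` flag doubles as the positive/negative label. Again, the helper names are mine, not the project’s actual code:

```python
import requests

REVIEWS_URL = "https://store.steampowered.com/appreviews/{appid}"

def parse_reviews(payload):
    """Turn one appreviews JSON response into (text, label) pairs,
    where label is 1 for a recommended review and 0 otherwise."""
    return [(r["review"], int(r["voted_up"]))
            for r in payload.get("reviews", [])]

def fetch_reviews(appid, n=50):
    """Fetch up to n English-language reviews for one app."""
    params = {"json": 1, "num_per_page": n, "language": "english"}
    resp = requests.get(REVIEWS_URL.format(appid=appid), params=params)
    resp.raise_for_status()
    return parse_reviews(resp.json())
```

Looping `fetch_reviews` over 100,000+ appids takes a long wall-clock time, so adding a polite delay between requests and checkpointing results to disk is worth building in from the start.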

You can follow this project on the GitHub repo here. I am still working on it regularly, challenging my earlier assumptions and trying to improve upon the final product. I’m hoping that by doing this, I can make a truly good product, and become a better data scientist in the process.

Student of Data Science