NLP Classification with Neural Networks
The goal of this project is to create a model that can predict the sentiment of online discussion about video games. Specifically, it is a neural network model trained on Steam reviews. Each review is marked either “suggested” or “not suggested”, which corresponds to a classification of “positive” or “negative”. Eventually, the project will result in a website that, when supplied with a Twitter hashtag or Reddit thread, analyzes the sentiment of the related comments.
The data consists of user reviews collected from Steam. Any Steam user can write a review. Each review is labeled either “suggested” or “not suggested” (referring to the game it is about), and can be voted “helpful” or “unhelpful” by other Steam users. Notebook 1 reads the IDs of the 750 most popular games on Steam and gets the 100 most helpful reviews for each. The helpful reviews are chosen because they are more likely to contain meaningful text in English characters that the models can interpret. The end result is 73,096 reviews, as some of the games have fewer than 100 reviews.
I plan to write a separate post later detailing how to obtain reviews from Steam. The code is available on the GitHub repo for this project (see the bottom of this post for the link).
Because the dataset is large, the data is stored in the feather format instead of csv files to reduce file size.
To preprocess the data, bracketed markup tags and punctuation marks are removed, and the remaining text is tokenized using NLTK’s regexp tokenizer, capturing only Latin characters and Arabic numerals. The tokens are then lemmatized using NLTK’s WordNet lemmatizer.
from re import sub
from string import punctuation
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

def remove_markdown(x):
    # strip bracketed tags such as [h1] or [/b]
    return sub(r'\[.*?\]', '', x)

def remove_punctuation(x):
    punctuation_list = list(punctuation) + ['`', '’', '…', '\n']
    return x.translate(str.maketrans('', '', ''.join(punctuation_list)))

def tokenize(x):
    # keep only runs of Latin letters and Arabic numerals, lowercased
    tokenizer = RegexpTokenizer(r'[a-zA-Z0-9]+')
    return tokenizer.tokenize(x.lower())

def lemmatize(x):
    lemmatizer = WordNetLemmatizer()
    return list(map(lemmatizer.lemmatize, x))

X_pre = list(map(remove_markdown, X))
X_pre = list(map(remove_punctuation, X_pre))
X_pre = list(map(tokenize, X_pre))
X_pre = list(map(lemmatize, X_pre))
X_join = [' '.join(x) for x in X_pre]
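To make the cleaning steps concrete, here is a standard-library-only trace of the first three on one made-up review. It is a sketch, not the project code: `re.findall` stands in for NLTK's `RegexpTokenizer` with the same pattern, and lemmatization is left out because it needs the WordNet data to be downloaded.

```python
from re import findall, sub
from string import punctuation

def remove_markdown(x):
    # drop bracketed tags such as [h1] and [/h1]
    return sub(r'\[.*?\]', '', x)

def remove_punctuation(x):
    extra = ['`', '’', '…', '\n']
    # characters are deleted outright, not replaced by spaces
    return x.translate(str.maketrans('', '', punctuation + ''.join(extra)))

def tokenize(x):
    # stand-in for RegexpTokenizer(r'[a-zA-Z0-9]+').tokenize(x.lower())
    return findall(r'[a-zA-Z0-9]+', x.lower())

sample = "[h1]Great game![/h1] Don't miss it, 10/10"  # made-up review text
tokens = tokenize(remove_punctuation(remove_markdown(sample)))
print(tokens)  # ['great', 'game', 'dont', 'miss', 'it', '1010']
```

Because punctuation is deleted rather than replaced with a space, "Don't" becomes "dont" and "10/10" becomes "1010". Words separated only by a newline or a comma would be fused in the same way.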
To engineer the features for modeling, scikit-learn’s tf-idf vectorizer is used with unigrams and bigrams to generate 8000 features for each datapoint. I initially trialed other engineered features, but found this approach performed best on this data. Check out the project files to see the other processing methods. Make sure to perform a train/test split before fitting any vectorizer, so that the vocabulary and idf weights are learned from the training set only.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

tf_bigram = TfidfVectorizer(max_features=8000, ngram_range=(1,2))
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
X_train_bigram = pd.DataFrame(tf_bigram.fit_transform(X_train_join).toarray(), columns=tf_bigram.get_feature_names_out())
X_test_bigram = pd.DataFrame(tf_bigram.transform(X_test_join).toarray(), columns=tf_bigram.get_feature_names_out())
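The split-before-vectorizing advice can be sketched as follows. The eight toy reviews and their labels are hypothetical stand-ins for the cleaned Steam text, and the `test_size` and `random_state` values are arbitrary choices for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the cleaned review strings and their labels.
reviews = [
    "great game love it", "bad port crash often", "love the story",
    "crash on launch", "great story great music", "bad control bad camera",
    "fun with friend", "boring and buggy",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split first, so no test-set text influences the vocabulary or idf weights.
X_train_join, X_test_join, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.25, random_state=42, stratify=labels
)

vectorizer = TfidfVectorizer(max_features=8000, ngram_range=(1, 2))
X_train_vec = vectorizer.fit_transform(X_train_join)  # learn vocabulary on train only
X_test_vec = vectorizer.transform(X_test_join)        # reuse it, never refit

print(X_train_vec.shape, X_test_vec.shape)
```

Terms that appear only in the test reviews are simply ignored by `transform`, which mirrors what happens when the model later sees unseen comments.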
Before attempting a neural network model, I tried more basic machine learning models from the scikit-learn library. None of them outperformed the neural network, so I didn’t include them here, but check out the GitHub repo if you want to know more.
I then created a basic feed-forward neural network using dense and dropout layers. A grid search was performed, but the search could not be very wide, as the models take so long to train. The final model is very close to the first one made, and had a test accuracy of 91%, matching that of the logistic regression model. Although it didn’t perform any better on the test set, I’m hoping it will generalize better to reviews and comments from sources other than Steam.
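from tensorflow.keras.layers import Dense, Dropout, LSTM
from tensorflow.keras import Sequential

model = Sequential()
# hidden layers
model.add(Dense(500, input_dim=8000, activation='relu'))
# output layer: a single sigmoid unit, assumed here to match the binary cross-entropy loss
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# X_train holds the 8000 tf-idf features for each training review
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)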
Unlabeled Data Analysis
To get unlabeled internet data to analyze, I used twint for Twitter searches and praw to read Reddit threads. The script then prints the percentages of comments classified as positive and negative, and samples five random comments from each category. See the documentation for twint or praw if you need help getting the unlabeled data.
I created a very simple local Flask app to run the models. Unfortunately, both the unlabeled data and the models themselves are too large to load onto Heroku, so the app only runs locally for now. To run the app, copy the contents of the flask-files folder to the main directory, then run app.py. Here is a video demonstrating how the app works.
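The reporting step can be sketched with the standard library alone. The function name `summarize_sentiment`, the 1/0 label encoding, and the placeholder comments are assumptions for illustration. The actual script lives in the repo.

```python
import random

def summarize_sentiment(comments, predictions, n_samples=5, seed=None):
    """Return the share of each class and up to n_samples random comments per class."""
    rng = random.Random(seed)
    groups = {
        "positive": [c for c, p in zip(comments, predictions) if p == 1],
        "negative": [c for c, p in zip(comments, predictions) if p == 0],
    }
    total = sum(len(group) for group in groups.values())
    summary = {}
    for name, group in groups.items():
        summary[name] = {
            "percent": 100 * len(group) / total if total else 0.0,
            # never ask for more samples than the class contains
            "samples": rng.sample(group, min(n_samples, len(group))),
        }
    return summary

# Placeholder comments and model outputs (1 = positive, 0 = negative).
comments = [f"comment {i}" for i in range(10)]
predictions = [1, 1, 1, 0, 1, 1, 0, 1, 1, 0]

summary = summarize_sentiment(comments, predictions, seed=0)
for name, stats in summary.items():
    print(f"{name}: {stats['percent']:.1f}%")
    for text in stats["samples"]:
        print("   ", text)
```

Capping the sample size at the class size keeps the script from failing on small threads where one class has fewer than five comments.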
Conclusions and Future Work
The app for this project currently only runs locally. The neural network models are too large for the free tier of Heroku, and even with the smaller sklearn models, the data pulled in for processing pushes the server over its 500 MB limit and crashes it. To make it work (for free), I need to upload a model and a data cleaning pipeline to Amazon’s S3, and access them from the Heroku app.
In addition, all the models overpredict the majority class, positive. This could be mitigated by standard class imbalance techniques, such as SMOTE and Tomek links. Unfortunately, the large amount of data needed for the models to generalize makes it impossible to trial oversampling on my current machine. To run these processes, I need a stronger machine or, more likely, an Amazon SageMaker instance.
To improve the usefulness of the models, I also need to incorporate some method of topic modelling, which could push the results from interesting to useful. I hope to use a tf-idf topic analysis in the final product, which would show what each of the two classes is discussing the most.
Finally, the models may, after all that, still have trouble generalizing. The training data is taken only from Steam, while the unlabeled data comes from other sites, namely Twitter and Reddit. To address this, I could gather training data from other sources. Metacritic seems like a good choice, as the reviews there are scored, and the scores could easily be turned into labels. Getting Steam reviews from a wider variety of games could also help.
Here is the GitHub repo for this project.