Blockbust-a Videoz Classifier (NLP Sentiment Classification)

Task Summary

This post goes through a basic sentiment classifier that I produced for a natural language processing project. It can be applied to any movie reviews that follow the format outlined below.

NOTE

Please note that since this is an active course, only high-level concepts will be covered and no links to scripts will be provided. My hope is that I will be present when going over this post with you and can explain any ambiguity in detail.

Approach

In building this classifier, I used the same TF-IDF vectorizer settings and the same logistic regression settings that I had produced in a previous effort, which did well with sentiment classification. That approach produced a high F1 score, so it was a good starting point for building a new sentiment classifier; the previous classifier was also performing sentiment analysis, so the same principles should apply.

Dependencies

Below is a list of all dependencies used by each of my scripts. There are two .py files required to generate the predictions.

  • blockbusta_classification_SALCE.py is the primary script that is run to generate the predictions. It requires that SalceClasses.py be within the same folder as this script.
import numpy as np
import pandas as pd
import time
from typing import Iterator, Iterable, List, Tuple, Text, Union
from sklearn.impute import SimpleImputer
from SalceClasses import TextToFeatures, TextToLabels, Classifier

Note, "SalceClasses" is a .py module file that accompanies the script I have written to clean up the code execution and separate the class/methods definition.

  • SalceClasses.py is a .py module that accompanies blockbusta_classification_SALCE.py.
from typing import Iterator, Iterable, List, Tuple, Text, Union
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

from scipy.sparse import spmatrix

# An NDArray can either be a numpy array (np.ndarray) or a sparse matrix (spmatrix)
NDArray = Union[np.ndarray, spmatrix]

np.random.seed(42)

Set seed to 42 for reproducibility.

  • analysis.py is the script used for running data diagnostics, developing functionality needed for the main script, and generating some additional data that we will cover in later sections of this post.
import numpy as np
import pandas as pd
import time
import math
from typing import Iterator, Iterable, List, Tuple, Text, Union
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
from SalceClasses import TextToFeatures, TextToLabels, Classifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

Exploratory Data Analysis

NOTE

The data used for this blog will be unavailable, so we can't recreate the specific analysis, but a similar dataset is available on Kaggle (see the IMDB dataset).

The analysis.py file was used to generate some initial insights into the data. Before we tackle anything, it's always a good idea to look at the data. The first insight of interest was the composition of the training data. We want to be certain that our training data's samples are balanced, i.e. that there are not too many samples of any one particular class of review (positive, negative, or not a review). As it turns out, we have a rather stark imbalance in our training data.

training data histogram

We can see that we have nearly double the "not a review" samples vs. either the positive or negative reviews. This imbalance should be addressed prior to training (it was not addressed in the original prediction submission), which will be covered in the Preprocessing section.
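For reference, the balance check in analysis.py was essentially a value count and a bar plot; here is a minimal sketch, assuming the training CSV (called train.csv here, a hypothetical name) is loaded into a pandas DataFrame with the ID/TEXT/LABEL columns shown further below.

import pandas as pd
import matplotlib.pyplot as plt

# "train.csv" is a hypothetical file name; the real training data is not distributed.
train = pd.read_csv("train.csv")

# Count samples per class (positive, negative, not a review) to eyeball the imbalance.
counts = train["LABEL"].value_counts()
print(counts)

counts.plot(kind="bar", title="Training samples per class")
plt.xlabel("Label")
plt.ylabel("Count")
plt.show()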

Since it is good practice to check for NaNs in data prior to any processing, I ran .isna().sum() on the training and test datasets and found the following.

train.isna().sum()
Out[8]: 
ID       0
TEXT     6
LABEL    0
dtype: int64

test.isna().sum()
Out[9]: 
ID      0
TEXT    1
dtype: int64

We have NaNs in both training and test data. This will also be addressed in Preprocessing.

Lastly, looking through some of the individual training and test samples by hand, I noticed that some reviews are in languages other than English. While that initially seemed like a possible issue, it should actually not present any unusual challenges to the model. The model only looks at character grams, and if anything, adding gram features in other languages should give the model added fidelity for classifying unseen data in other languages.

Preprocessing

Building input data

We were able to reuse some classes that preprocess input data and output lists of (label, document) tuples, which is the same format as the data for this task. The only difference for this task is that we are now dealing with three different labels, i.e. three classes, making this a multiclass classification problem instead of a binary one. In the blockbusta_classification_SALCE.py script (available in my repo under the code link below), I repurposed the read_smsspam() method to preprocess this data; it is now a function called read_file().

The read_file() function was repurposed from a previous effort. I wanted it to correctly process both the training and test data, which meant adding some "dynamic" behavior to account for the training data having three columns ['ID', 'TEXT', 'LABEL'] and the test data having two ['ID', 'TEXT']. I also needed to store the array of IDs for the test data so that I could rejoin them with the outputs at the end. For this function, I used pandas and NumPy, which was not ideal, but I am used to working with pandas DataFrames. The function returns text and ID tuples in an iterable object like before, but if it is training data the iterable also includes labels.
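A rough sketch of the behavior described above, with hypothetical details; the actual read_file() in the repo may differ.

import pandas as pd

def read_file(path, is_test=False):
    """Read a CSV with ID/TEXT columns (plus LABEL for training data)."""
    df = pd.read_csv(path)
    if is_test:
        # Test data: keep the IDs so predictions can be rejoined with them later.
        return list(zip(df["ID"], df["TEXT"]))
    # Training data: include the labels as well.
    return list(zip(df["ID"], df["TEXT"], df["LABEL"]))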

As is often the case with real datasets, we have missing values (NaNs), which can cause issues downstream. During early development, the TextToFeatures() class was producing errors due to NaNs in the data. To handle this, I used sklearn's SimpleImputer, configured to replace every NaN entry with the string 'nan'. This should benefit the classifier by giving it features for NaN values to train on rather than disregarding them.
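For illustration, here is roughly how SimpleImputer can be configured to make that replacement (a sketch with toy data, not the exact lines from the script).

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Replace missing TEXT entries with the literal string 'nan' so the
# vectorizer sees a token instead of a missing value.
imputer = SimpleImputer(strategy="constant", fill_value="nan")

texts = pd.DataFrame({"TEXT": ["great movie", np.nan, "terrible"]})
texts["TEXT"] = imputer.fit_transform(texts[["TEXT"]]).ravel()
print(texts["TEXT"].tolist())  # ['great movie', 'nan', 'terrible']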

Addressing the Data imbalance

In the Exploratory Data Analysis section we uncovered an imbalance in our data. Our "not a review" samples numbered almost double those of either the "positive" or "negative" sentiment reviews.

To address this, we want to get those numbers more even, if not exactly even. Initially I considered cutting "not a review" samples out of the training data, but I convinced myself there could be a better way: whatever we gained from balancing the data, we might lose in training by using fewer samples to train our classifier.

Instead, I hopped on Kaggle and managed to find just what I was looking for, an IMDB Movie Review dataset. I downloaded the dataset, and manually modified the labels to match our "1", "2" labels for positive and negative reviews. The dataset is included in the repo as IMDB Dataset.csv. Note, this dataset only includes positive and negative reviews, and no "not a review" data, which is fine for our needs.

In the analysis.py file, I imported the dataset and created pandas DataFrames containing only the "positive" reviews and only the "negative" reviews. Then I calculated the delta in sample counts between "positive" reviews and "not a review", and likewise for "negative". Using those counts, I built new DataFrames of randomly selected "positive" and "negative" reviews in the required quantities (using np.random.permutation), combined them into one DataFrame, and wrote it to a new CSV using to_csv() with index=True to generate IDs (note, I did confirm these IDs would be unique from the training IDs). I then reimported this data and appended it to the original training data. The resulting training dataset is also in the repo as train_addIMDB.csv. After this process, we now have balanced data:

training data histogram

And a lot more of it: we are up from 70,317 training samples in the original dataset to 96,213 samples (25,896 samples added)! Our original classifier training was already long, so it will now be quite a bit longer, but as is usually the case with data analysis, the more data we have to look at, the better.
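For concreteness, the rebalancing described above went roughly like this. The file names train.csv and imdb_extra.csv, the LABEL column name in the Kaggle file, and the assumption that 0 marks "not a review" are all illustrative; only IMDB Dataset.csv, the 1/2 labels, and train_addIMDB.csv come from the actual workflow.

import numpy as np
import pandas as pd

np.random.seed(42)

train = pd.read_csv("train.csv")          # original training data: ID, TEXT, LABEL
imdb = pd.read_csv("IMDB Dataset.csv")    # Kaggle reviews, labels already remapped to 1/2

# How many extra positive (1) and negative (2) samples are needed to match "not a review".
target = (train["LABEL"] == 0).sum()      # assuming 0 marks "not a review"
need_pos = target - (train["LABEL"] == 1).sum()
need_neg = target - (train["LABEL"] == 2).sum()

# Randomly permute each class and take only as many rows as needed.
pos = imdb[imdb["LABEL"] == 1]
neg = imdb[imdb["LABEL"] == 2]
extra_pos = pos.iloc[np.random.permutation(len(pos))[:need_pos]]
extra_neg = neg.iloc[np.random.permutation(len(neg))[:need_neg]]

# Write the additions with index=True to generate fresh IDs, then append to the original.
extra = pd.concat([extra_pos, extra_neg], ignore_index=True)
extra.to_csv("imdb_extra.csv", index=True)

balanced = pd.concat([train, extra], ignore_index=True)
balanced.to_csv("train_addIMDB.csv", index=False)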

SalceClasses.py TextToFeatures()

We can pass Iterables into our TextToFeatures() class where our TfidfVectorizer() sklearn class is defined. For the vectorizer, I used the following settings.

TfidfVectorizer(analyzer='char', ngram_range=(2, 6), binary=True, encoding='utf-8')

The vectorizer uses character grams from 2 to 6 characters in length, and uses binary values rather than counts. Note, this means that I am going to have a LOT of features, which we will see has impacts on training and prediction times.
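A minimal sketch of what the TextToFeatures wrapper might look like around that vectorizer; the real class in SalceClasses.py may differ in its exact interface.

from typing import Iterable, Text, Union
import numpy as np
from scipy.sparse import spmatrix
from sklearn.feature_extraction.text import TfidfVectorizer

NDArray = Union[np.ndarray, spmatrix]

class TextToFeatures:
    def __init__(self, texts: Iterable[Text]):
        # Character 2- to 6-grams with binary indicators instead of raw counts.
        self.vectorizer = TfidfVectorizer(
            analyzer='char', ngram_range=(2, 6), binary=True, encoding='utf-8')
        self.vectorizer.fit(texts)

    def __call__(self, texts: Iterable[Text]) -> NDArray:
        # Transform new documents into the fitted character-gram feature space.
        return self.vectorizer.transform(texts)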

The .fit() method is executed after the texts are input into a class instance. This was my first indication that my approach may be a bit of a slog...

On my first attempt with the provided training set, it took approximately three minutes to fit the features.

fit features time

After expanding the training dataset, the time it took to fit features jumped to 423 seconds.

SalceClasses.py TextToLabels()

No updates were required to this class. And thankfully it was quick to run!
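For context, SalceClasses.py imports LabelEncoder, so TextToLabels is presumably a thin wrapper around it; here is a sketch under that assumption.

from typing import Iterable, Text
import numpy as np
from sklearn.preprocessing import LabelEncoder

class TextToLabels:
    def __init__(self, labels: Iterable[Text]):
        # Map raw label strings to integer class indices.
        self.encoder = LabelEncoder()
        self.encoder.fit(list(labels))

    def __call__(self, labels: Iterable[Text]) -> np.ndarray:
        return self.encoder.transform(list(labels))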

SalceClasses.py Classifier()

We are ready to train our classifier, which uses the following settings.

        self.clf = LogisticRegression(penalty='elasticnet', solver='saga', C=5, class_weight='balanced',l1_ratio=0.25)

I am using an elastic net penalty with the saga solver (a stochastic average gradient method), the regularization parameter C set to 5, and balanced class weights.

By default, the LogisticRegression sklearn class sets its multi_class parameter to 'auto', which will decipher the multiple class labels and train a multiclass problem automatically. Thankfully, this means we just pass our training data in like we did in our homework!
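Putting it together, training and prediction in a Classifier wrapper of this kind look roughly like the following sketch; the real class lives in SalceClasses.py and may differ.

from sklearn.linear_model import LogisticRegression

class Classifier:
    def __init__(self):
        # The elastic-net penalty requires the saga solver; multi_class defaults
        # to 'auto', so the three-class problem is handled without extra work.
        self.clf = LogisticRegression(penalty='elasticnet', solver='saga', C=5,
                                      class_weight='balanced', l1_ratio=0.25)

    def train(self, features, labels):
        self.clf.fit(features, labels)

    def predict(self, features):
        return self.clf.predict(features)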

Here's where I got a nice wake-up call. Running the classifier with the original training data took 🚨48 MINUTES🚨 to train.

original training data classifier training time

With the NEW training dataset, the training time jumped to 🚨🚨114 MINUTES🚨🚨. This will be reiterated in the Reproducibility section, but this is a LONG training process. My personal computer is plenty powerful, though not an HPC by any means, and it still took a long time. While the long runtime was probably avoidable, I had undeniably good results with the first predictions, so no need to fix it if it ain't really broke. And as we will see, the additional data balance results in a meaningful improvement in the final predictions.

Predictions

Once my classifier had finally been trained with the training data, I was able to run my test data through the model and generate predictions. Running the predictions with the classifier trained on the original data took only about 30 seconds,

original predictions time

however when we trained with the additional data we picked up more features, which doubled our prediction time to about 60 seconds.

final classifier times

Results

In the analysis.py file, I experimented with k-fold cross validation, using 5 folds as is typical, and tweaked some parameters. Even after downsizing the data used for CV, it was still taking a prohibitively long time to run (I attempted GridSearchCV, but it took too long). The results I did manage to get didn't demonstrate any meaningful improvement over my original settings, so ultimately I kept the original parameters (they produced a good result on the first try, after all). That is probably an indictment of my classifier and feature selection, but we still produced good results.
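For reference, the k-fold experiments were along these lines; the toy corpus and the max_iter value are illustrative, while the vectorizer and classifier settings match those described above.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Tiny stand-in corpus; the real runs used a downsized slice of the training data.
texts = (["loved this film, great acting all around"] * 5
         + ["terrible movie, a complete waste of time"] * 5
         + ["where can I rent this title next week?"] * 5)
labels = np.array([1] * 5 + [2] * 5 + [0] * 5)

features = TfidfVectorizer(analyzer='char', ngram_range=(2, 6), binary=True).fit_transform(texts)
clf = LogisticRegression(penalty='elasticnet', solver='saga', C=5,
                         class_weight='balanced', l1_ratio=0.25, max_iter=5000)

# 5-fold CV scored with macro F1, matching the competition metric.
scores = cross_val_score(clf, features, labels, cv=5, scoring='f1_macro')
print(scores.mean())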

Submission    Score (F1 Macro)    Delta Baseline
original      0.92474             +0.58913
final         0.94404             +0.60843

Both submissions perform well, and the final submission is a +0.0193 improvement over the original! It seems that balancing the data and using more samples helped after all.

Runtimes:
423.64182209968567  seconds to fit features
0.007401704788208008  seconds to fit labels
6797.002720117569  seconds to train classifier
62.62460160255432  seconds to generate predictions

Error analysis

With this classifier, we reached a macro F1 of nearly 0.95, so the remaining ~5% error can likely be attributed to more nuanced details. Maybe we needed more data; maybe I hadn't quite tweaked all of the features. For example, with more time I might find a better way to partition my grams. I wasn't able to use stop words, since I was using character n-grams rather than word grams. There may have been better word-embedding strategies to employ, and I may be using too many features in general, but ultimately this classifier performs fairly well. With more time, I do believe I could improve its results, but this is pretty good performance given the data.

Reproducibility

I did not have time to create a Jupyter notebook and Docker image as I had hoped, however reproducing my results should be simple.

First, clone my repository using the following link.

HTTPS: TO BE UPDATED

SSH: TO BE UPDATED

Ensure you have all dependencies from the Dependencies section installed in your local python environment.

From your local clone, open the blockbusta_classification_SALCE.py file (note, there is also a file called blockbusta_classification_SALCE submission 1.py with the settings for the original submission) in your preferred IDE (I use Spyder, which I recommend installing via Anaconda; be sure the dependencies are installed in your conda environment). Once open, all you need to do is run the script!

🚨🚨THE SCRIPT WILL TAKE A LONG TIME TO RUN, LIKELY AT LEAST TWO HOURS🚨🚨.

Once it is finished, navigate to the results folder to find an updated version of salceresults.csv, which is already formatted and ready to submit.

If you would like to run the original training dataset, uncomment line 57 and comment out line 58 before you run the script.


If you would like to peruse the other files, feel free to explore analysis.py to reproduce the additional training dataset, or even generate a new permutation of the training data (which it does automatically, and who knows, it could help or hurt the predictions).

NOTE

At this time no code can be provided