Natural Language Processing

This is a demo assignment that is openly available for the Data Science in Practice Course.

If you are in the COGS108 course at UC San Diego, this is NOT a valid version of the assignment for the course.

Important

  • This assignment has hidden tests: tests that are not visible here, but that will be run on your submitted assignment for grading.

    • This means passing all the tests you can see in the notebook here does not guarantee you have the right answer!

    • In particular, many of the tests you can see simply check that the right variable names exist. Hidden tests check the actual values.

      • It is up to you to check the values, and make sure they seem reasonable.

  • A reminder to restart the kernel and re-run the code as a first-line check if things start behaving strangely.

    • For example, note that some cells can only be run once, because they re-write a variable (for example, your dataframe), and change it in a way that means a second execution will fail.

    • Also, running some cells out of order might change the dataframe in ways that may cause an error, which can be fixed by re-running.

Background & Work Flow

  • In this homework assignment, we will be analyzing text data. A common approach to analyzing text data is to use methods that allow us to convert text data into some kind of numerical representation - since we can then use all of our mathematical tools on such data. In this assignment, we will explore 2 feature engineering methods that convert raw text data into numerical vectors:

    • Bag of Words (BoW)

      • BoW encodes an input sentence as the frequency of each word in the sentence.

      • In this approach, all words contribute equally to the feature vectors.

    • Term Frequency - Inverse Document Frequency (TF-IDF)

      • TF-IDF is a measure of how important each term is to a specific document, as compared to an overall corpus.

      • TF-IDF encodes each word as its frequency in the document of interest, divided by a measure of how common the word is across all documents (the corpus).

      • Using this approach, each word contributes differently to the feature vectors.

      • The assumption behind using TF-IDF is that words that appear commonly everywhere are not that informative about what is specifically interesting about a document of interest, so it is tuned to representing a document in terms of the words it uses that are different from other documents.

  • To compare those 2 methods, we will first apply them on the same Movie Review dataset to analyse sentiment (how positive or negative a text is). In order to make the comparison fair, an SVM (support vector machine) classifier will be used to classify positive reviews and negative reviews.

  • SVM is a simple yet powerful and interpretable linear model. To use it as a classifier, we need to have at least 2 splits of the data: training data and test data. The training data is used to tune the weight parameters in the SVM to learn an optimal way to classify the training data. We can then test this trained SVM classifier on the test data, to see how well it works on data that the classifier has not seen before.
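
As a rough illustration of the two encodings described above, the sketch below computes BoW counts and TF-IDF weights by hand for a tiny made-up corpus. It assumes the textbook weighting tf * log(N / df); scikit-learn's TfidfVectorizer uses a smoothed and normalized variant, so its numbers will differ. The corpus and all variable names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not taken from the assignment data)
docs = [
    "the movie was great",
    "the movie was terrible",
    "the acting was great great",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})

# Bag of Words: raw count of each vocabulary word in each document
bow = [[Counter(doc)[word] for word in vocab] for doc in tokenized]

# Document frequency: number of documents that contain each word
n_docs = len(tokenized)
doc_freq = {word: sum(word in doc for doc in tokenized) for word in vocab}

# TF-IDF (textbook form, an assumption here): count * log(N / df)
tfidf = [
    [count * math.log(n_docs / doc_freq[word]) for word, count in zip(vocab, row)]
    for row in bow
]
```

Words that occur in every document ("the", "was") end up with a weight of zero, while words unique to one document ("terrible", "acting") get the largest weight per occurrence.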

# Imports - these are all the imports needed for the assignment
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import nltk package 
#   PennTreeBank word tokenizer 
#   English language stopwords
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

import warnings
warnings.filterwarnings('ignore')

# scikit-learn imports
#   SVM (Support Vector Machine) classifier
#   CountVectorizer, which transforms text data into bag-of-words features
#   TfidfVectorizer, which down-weights words that are widely used across the dataset when transforming text data
#   Metrics functions to evaluate performance
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report, precision_recall_fscore_support

For this assignment we will be using nltk: the Natural Language Toolkit.

To do so, we will need to download some text data.

Natural language processing (NLP) often requires corpus data (lists of words and/or example text data). We will download what we need here, if you don’t already have it.

# In the cell below, we will download some files from nltk. 
#   If you hit an error doing so, come back to this cell, and uncomment and run the code below. 
#   This code disables SSL certificate verification for the download, which works around certificate errors on some systems

# import ssl

# try:
#     _create_unverified_https_context = ssl._create_unverified_context
# except AttributeError:
#     pass
# else:
#     ssl._create_default_https_context = _create_unverified_https_context
# Download the NLTK English tokenizer and the stopwords of all languages
nltk.download('punkt')
nltk.download('stopwords')

Downloading Data

If you download this notebook to run locally, you will also need some data files.

Running the next cell will download the required files for this assignment.

You can also view and download these files from https://github.com/DataScienceInPractice/Data.

from os.path import join as pjoin
from urllib.request import urlretrieve

data_url = 'https://raw.githubusercontent.com/DataScienceInPractice/Data/master/'

# Fill in these values
assignment = 'A6'
data_files = ['custrev_test.tsv', 'custrev_train.tsv', 'rt-polarity.tsv']

for data_file in data_files:
    full_path = pjoin(data_url, assignment, data_file)
    urlretrieve(full_path, filename=data_file)

Part 1: Sentiment Analysis on Movie Review Data (4.25 points)

In part 1 we will apply sentiment analysis to Movie Review (MR) data.

  • The MR data contains more than 10,000 reviews collected from the IMDB website, and each of the reviews is annotated as either positive or negative. The numbers of positive and negative reviews are roughly the same. For more information about the dataset, you can visit http://www.cs.cornell.edu/people/pabo/movie-review-data/

  • For this homework assignment, we’ve already shuffled the data, and truncated the data to contain only 5000 reviews.

In this part of the assignment we will:

  • Transform the raw text data into vectors with the BoW encoding method

  • Split the data into training and test sets

  • Write a function to train an SVM classifier on the training set

  • Test this classifier on the test set and report the results

1a) Import data

Import the text file ‘rt-polarity.tsv’ into a DataFrame called MR_df.

Set the column names as ‘index’, ‘label’, ‘review’

Note that ‘rt-polarity.tsv’ is a tab separated raw text file, in which data is separated by tabs (‘\t’). You can load this file with read_csv, specifying the sep (separator) argument as tabs (‘\t’). You will have to set header as None.

MR_filepath='rt-polarity.tsv'

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(MR_df, pd.DataFrame)
# Check the data
MR_df.head()

1b) Create a function that converts string labels to numerical labels

Function name: convert_label

The function should do the following:

  • if the input label is “pos” return 1.0

  • if the input label is “neg” return 0.0

  • otherwise, return the input label as is

def convert_label(label):
    # YOUR CODE HERE
    raise NotImplementedError()
assert callable(convert_label)

1c) Numerical Labels

Convert all labels in MR_df["label"] to numerical labels, using the convert_label function.

Save them as a new column named “y” in MR_df.

# YOUR CODE HERE
raise NotImplementedError()
assert sorted(set(MR_df['y'])) == [0., 1.]
# Check the MR_df data
MR_df.head()

1d) Convert Text data into vector

We will now create a CountVectorizer object to transform the text data into vectors with numerical values.

To do so, we will initialize a CountVectorizer object, and name it as vectorizer.

We need to pass 4 arguments to initialize a CountVectorizer:

  1. analyzer: 'word' Analyze the data at the word level.

  2. max_features: 2000 Set the maximum number of unique words to keep.

  3. tokenizer: word_tokenize Tokenize the text data using word_tokenize from NLTK.

  4. stop_words: stopwords.words('english') Remove all English stopwords. We do this since they generally don’t provide useful discriminative information.
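
To make the effect of these arguments concrete, here is a minimal pure-Python sketch of roughly what a word-level vectorizer with a stop word list and a vocabulary cap does. It is not CountVectorizer's implementation: the whitespace tokenizer, the three-word stop list, and the cap of 3 are stand-ins chosen for illustration, and all names are hypothetical.

```python
from collections import Counter

# Hypothetical toy inputs (the assignment uses word_tokenize, NLTK stop words, and 2000)
reviews = ["the plot is thin", "the cast is great", "great plot and great cast"]
stop_words = {"the", "is", "and"}
max_features = 3

# 1. Tokenize each review and drop stop words
tokens = [[w for w in review.split() if w not in stop_words] for review in reviews]

# 2. Keep only the max_features most frequent words across the whole corpus
totals = Counter(word for doc in tokens for word in doc)
vocabulary = sorted(word for word, _ in totals.most_common(max_features))

# 3. Encode each review as a vector of counts over that vocabulary
vectors = [[doc.count(word) for word in vocabulary] for doc in tokens]
```

Here the rare word "thin" falls outside the cap, so the first review is encoded using only the word "plot".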

# YOUR CODE HERE
raise NotImplementedError()
assert vectorizer.analyzer == 'word'
assert vectorizer.max_features == 2000
assert vectorizer.tokenizer == word_tokenize
assert vectorizer.stop_words == stopwords.words('english')
assert hasattr(vectorizer, "fit_transform")

1e) Vectorize reviews

Transform reviews MR_df["review"] into vectors using the vectorizer we created above:

The method you will be using is: MR_X = vectorizer.fit_transform(...).toarray()

Note that we apply the toarray method to cast the output to a numpy array. This is something we will do multiple times, turning the sparse matrices that sklearn returns back into regular arrays.

Note that this may produce a warning about stopwords. This is OK.

# YOUR CODE HERE
raise NotImplementedError()
assert type(MR_X) == np.ndarray

1f) Outcome variable

Copy out the y column in MR_df and save it as an np.array named MR_y

Make sure the shape of MR_y is (5000,) - depending upon your earlier approach, you may have to use reshape to do so.

# YOUR CODE HERE
raise NotImplementedError()
assert MR_y.shape == (5000,)

1g) Defining the train & test sets

We first set 80% of the data as the training set to train an SVM classifier. We will then test the learnt classifier on the remaining 20% of data samples (test set). (Reminder: For this homework assignment, we’ve already shuffled the data)

  • Calculate the number of training data samples (80% of total) and store it in num_training

  • Calculate the number of test data samples (20% of total) and store it in num_testing

  • Make sure both of these variables are of type int

# YOUR CODE HERE
raise NotImplementedError()
assert type(num_training) == int
assert type(num_testing) == int

1h) Extracting train & test Data

Split the MR_X and MR_y into training set and test set. You should use the num_training variable to extract the data from MR_X and MR_y.

Extract the first num_training samples as training data, and extract the rest as test data.

Name them as:

  • MR_train_X and MR_train_y for the training set

  • MR_test_X and MR_test_y for the test set
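
Since the data is already shuffled, the split amounts to computing an integer cut point and slicing. Below is a minimal sketch on a made-up list of 10 samples. The names samples, labels, n_train, and n_test are hypothetical, and the same slicing syntax applies to numpy arrays.

```python
# Hypothetical toy data: 10 samples with alternating labels
samples = list(range(10))
labels = [i % 2 for i in samples]

# int() truncates the float product to a whole number of samples
n_train = int(0.8 * len(samples))
n_test = len(samples) - n_train

# First n_train items for training, the remainder for testing
train_X, test_X = samples[:n_train], samples[n_train:]
train_y, test_y = labels[:n_train], labels[n_train:]
```

Computing the test size by subtraction guarantees that the two sizes always sum to the total, even when 80% of the sample count is not a whole number.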

# YOUR CODE HERE
raise NotImplementedError()
assert MR_train_X.shape[0] == MR_train_y.shape[0]
assert MR_test_X.shape[0] == MR_test_y.shape[0]

assert len(MR_train_X) == 4000
assert len(MR_test_y) == 1000

1i) SVM

Define a function called train_SVM that initializes an SVM classifier and trains it

Inputs:

  • X: np.ndarray, training samples,

  • y: np.ndarray, training labels,

  • kernel: string, set the default value of “kernel” as “linear”

Output: a trained classifier clf

Hint: There are 2 steps involved in this function:

  • Initializing an SVM classifier: clf = SVC(...)

  • Training the classifier: clf.fit(X, y)

def train_SVM(X, y, kernel='linear'):
    # YOUR CODE HERE
    raise NotImplementedError()
assert callable(train_SVM)

1j) Train SVM

Train an SVM classifier with the default linear kernel on the samples MR_train_X and the labels MR_train_y

You need to call the function train_SVM you just created. Name the returned object as MR_clf.

Note that this function may take anywhere from many seconds up to a few minutes to run.

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(MR_clf, SVC)
assert hasattr(MR_clf, "predict")

1k) Predict outcome

Predict labels for both training samples and test samples. You will need to use MR_clf.predict(...)

Name the predicted labels for the training samples as MR_predicted_train_y. Name the predicted labels for the testing samples as MR_predicted_test_y.

Your code here will also take a minute to run.

# YOUR CODE HERE
raise NotImplementedError()

Now we will use the function classification_report to print out the performance of the classifier on the training set:

# Your classifier should be able to reach above 90% accuracy 
# on the training set
print(classification_report(MR_train_y,MR_predicted_train_y))

And finally, we check the performance of the trained classifier on the test set:

# Your classifier should be able to reach around 70% accuracy on the test set.
print(classification_report(MR_test_y, MR_predicted_test_y))
assert MR_predicted_train_y.shape == (4000,)
assert MR_predicted_test_y.shape == (1000,)

precision, recall, _, _ = precision_recall_fscore_support(MR_train_y,MR_predicted_train_y)
assert np.isclose(precision[0], 0.91, 0.02)
assert np.isclose(precision[1], 0.92, 0.02)
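
For reference, the per-class precision and recall shown in the classification_report output can be computed by hand from the true and predicted labels. The sketch below does this for a made-up set of 8 labels. precision_recall is a hypothetical helper, not part of scikit-learn, and it assumes each class is both present and predicted at least once.

```python
def precision_recall(y_true, y_pred, positive):
    """Return (precision, recall), treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    predicted_pos = sum(p == positive for _, p in pairs)
    actual_pos = sum(t == positive for t, _ in pairs)
    # Precision: of everything predicted as this class, how much was right
    # Recall: of everything truly in this class, how much was found
    return true_pos / predicted_pos, true_pos / actual_pos

# Hypothetical labels: 1.0 = positive review, 0.0 = negative review
y_true = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
y_pred = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 1.0]

precision_pos, recall_pos = precision_recall(y_true, y_pred, positive=1.0)
precision_neg, recall_neg = precision_recall(y_true, y_pred, positive=0.0)
```

When no label order is passed, precision_recall_fscore_support returns one value per class in sorted label order, so in the asserts above index 0 should correspond to label 0.0 and index 1 to label 1.0.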

Part 2: TF-IDF (1.25 points)

In this part, we will explore TF-IDF on sentiment analysis.

TF-IDF is used as an alternate way to encode text data, as compared to the BoW approach used in Part 1.

To do this, we will:

  • Transform the raw text data into vectors using TF-IDF

  • Train an SVM classifier on the training set and report the performance of this classifier on the test set

2a) Text Data to Vectors

We will create a TfidfVectorizer object to transform the text data into vectors with TF-IDF

To do so, we will initialize a TfidfVectorizer object, and name it as tfidf.

We need to pass 4 arguments into the “TfidfVectorizer” to initialize a “tfidf”:

  1. sublinear_tf: True Set to apply sublinear TF scaling.

  2. analyzer: 'word' Analyze the data at the word level.

  3. max_features: 2000 Set the maximum number of unique words to keep.

  4. tokenizer: word_tokenize Tokenize the text data using word_tokenize from NLTK.
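
The sublinear_tf option replaces each raw term count tf with 1 + log(tf), so a word repeated many times counts for more than a word used once, but not proportionally more. Here is a small sketch of that scaling on hypothetical counts (the function name and the example counts are made up for illustration):

```python
import math

def sublinear_tf(count):
    """Dampen a raw term count: 1 + ln(count) for count > 0, else 0."""
    return 1 + math.log(count) if count > 0 else 0.0

# Hypothetical raw counts of one word in four different documents
raw_counts = [0, 1, 2, 10]
scaled = [sublinear_tf(c) for c in raw_counts]
```

A word that appears 10 times ends up with a weight of roughly 3.3 rather than 10, which limits how much a single repeated word can dominate a review's vector.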

# YOUR CODE HERE
raise NotImplementedError()
assert tfidf.analyzer == 'word'
assert tfidf.max_features == 2000
assert tfidf.tokenizer == word_tokenize
assert tfidf.stop_words is None
assert hasattr(tfidf, "fit_transform")

2b) Transform Reviews

Transform the review column of MR_df into vectors using the tfidf we created above.

Save the transformed data into a variable called MR_tfidf_X

Hint: You might need to cast the datatype of MR_tfidf_X to numpy.ndarray by using .toarray()

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(MR_tfidf_X, np.ndarray)

assert "skills" in set(tfidf.stop_words_)
assert "risky" in set(tfidf.stop_words_)
assert "adopts" in set(tfidf.stop_words_)

2c) Split the data

Split the MR_tfidf_X and MR_y into training set and test set.

Name these variables as:

  • MR_train_tfidf_X and MR_train_tfidf_y for the training set

  • MR_test_tfidf_X and MR_test_tfidf_y for the test set

We will use the same 80/20 split as in part 1. You can use the same num_training variable from part 1 to split up the data.

# YOUR CODE HERE
raise NotImplementedError()
assert MR_train_tfidf_X[0].tolist() == MR_tfidf_X[0].tolist()
assert MR_train_tfidf_X.shape == (4000, 2000)
assert MR_test_tfidf_X.shape == (1000, 2000)

2d) Training

Train an SVM classifier on the samples MR_train_tfidf_X and the labels MR_train_tfidf_y.

You need to call the function train_SVM you created in part 1. Name the returned object as MR_tfidf_clf.

Note that this may take many seconds, up to a few minutes, to run.

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(MR_tfidf_clf, SVC)
assert hasattr(MR_tfidf_clf, "predict")

2e) Prediction

Predict the labels for both the training and test samples (the ‘X’ data). You will need to use MR_tfidf_clf.predict(...)

Name the predicted labels on training samples as MR_pred_train_tfidf_y. Name the predicted labels on testing samples as MR_pred_test_tfidf_y

# YOUR CODE HERE
raise NotImplementedError()

Again, we use classification_report to check the performance on the training set.

# Your classifier should be able to reach above 85% accuracy.
print(classification_report(MR_train_tfidf_y, MR_pred_train_tfidf_y))

Again, check performance on the test set:

# Your classifier should be able to reach around 70% accuracy.
print(classification_report(MR_test_tfidf_y, MR_pred_test_tfidf_y))
precision, recall, _, _ = precision_recall_fscore_support(MR_train_tfidf_y, MR_pred_train_tfidf_y)
assert np.isclose(precision[0], 0.86, 0.02)
assert np.isclose(precision[1], 0.87, 0.02)

Written Answer Question

How does the performance of the TF-IDF classifier compare to the classifier used in part 1?

YOUR ANSWER HERE

Part 3: Sentiment Analysis on Customer Review with TF-IDF (2 points)

In this part, we will use TF-IDF to analyse the sentiment of some Customer Review (CR) data.

The CR data contains 3771 reviews, all collected from the Amazon website. The reviews are annotated by humans as either positive reviews or negative reviews. In this dataset, the 2 classes are not balanced, as there are twice as many positive reviews as negative reviews.

For more information on this dataset, you can visit https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

In this part, we have already split the data into a training set and a test set, in which the training set has labels for the reviews, but the test set doesn’t.

The goal is to train an SVM classifier on the training set, and then predict pos/neg for each review in the test set.

To do so, we will:

  • Use the TF-IDF feature engineering method to encode the raw text data into vectors

  • Train an SVM classifier on the training set

  • Predict labels for the reviews in the test set

The performance of your trained classifier on the test set will be checked by a hidden test.

3a) Loading the data

The customer review task has 2 files:

  • “custrev_train.tsv” contains training data with labels

  • “custrev_test.tsv” contains test data without labels which need to be predicted

Import the raw text file custrev_train.tsv into a DataFrame called CR_train_df. Set the column names as index, label, review.

Import the raw text file custrev_test.tsv into a DataFrame called CR_test_df. Set the column names as index, review.

Note that both will need to be imported with sep and header arguments (like in 1a)

CR_train_file = 'custrev_train.tsv'
CR_test_file = 'custrev_test.tsv'

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(CR_train_df, pd.DataFrame)
assert isinstance(CR_test_df, pd.DataFrame)

3b) Concatenation

Concatenate the 2 DataFrames from the last step into a single DataFrame, and name it CR_df.

# YOUR CODE HERE
raise NotImplementedError()
assert len(CR_df) == 3771

3c) Cleaning

Convert all labels in CR_df["label"] using the convert_label function we defined above. Save these numerical labels as a new column named y in CR_df.

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(CR_df['y'], pd.Series)

3d) Use tfidf

Transform reviews CR_df["review"] into vectors using the tfidf vectorizer we created in part 2. Save the transformed data into a variable called CR_tfidf_X.

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(CR_tfidf_X, np.ndarray)

Here we will collect all training samples & numerical labels from CR_tfidf_X. The code provided below will extract all samples with labels from the dataframe:

# code provided to collect labels
CR_train_X = CR_tfidf_X[~CR_df['y'].isnull()]
CR_train_y = CR_df['y'][~CR_df['y'].isnull()]

# Note: if these asserts fail, something went wrong
#  Go back and check your code (in part 3) above this cell
assert CR_train_X.shape == (3016, 2000)
assert CR_train_y.shape == (3016, )

3e) SVM

Train an SVM classifier on the samples CR_train_X and the labels CR_train_y:

  • You need to call the function train_SVM you created above.

  • Name the returned object as CR_clf.

  • Note that this function may take anywhere from many seconds up to a few minutes to run.

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(CR_clf, SVC)

3f) Predict: training data

Predict labels on the training set, and name the returned variable as CR_pred_train_y

# YOUR CODE HERE
raise NotImplementedError()
# Check the classifier accuracy on the train data
#   Note that your classifier should be able to reach above 90% accuracy.
print(classification_report(CR_train_y, CR_pred_train_y))
precision, recall, _, _ = precision_recall_fscore_support(CR_train_y, CR_pred_train_y)
assert np.isclose(precision[0], 0.90, 0.02)
assert np.isclose(precision[1], 0.91, 0.02)
# Collect all test samples from CR_tfidf_X
CR_test_X = CR_tfidf_X[CR_df['y'].isnull()]

3g) Predict: test set

Predict the labels on the test set. Name the returned variable as CR_pred_test_y

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(CR_test_X, np.ndarray)
assert isinstance(CR_pred_test_y, np.ndarray)

3h) Convert labels

Convert the predicted numerical labels back to string labels (“pos” and “neg”).

Create a column called label in CR_test_df to store the converted labels.
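
One way to map numerical predictions back to strings is a dictionary lookup, the inverse of convert_label. Below is a minimal sketch on made-up predictions. label_names and the predictions list are hypothetical, and with pandas the same dictionary could be applied element-wise.

```python
# Inverse of the "pos" -> 1.0 / "neg" -> 0.0 convention used earlier
label_names = {1.0: "pos", 0.0: "neg"}

# Hypothetical classifier output
predictions = [1.0, 0.0, 1.0, 1.0]
string_labels = [label_names[p] for p in predictions]
```

Using an explicit dictionary makes any unexpected prediction value fail with a KeyError instead of being mislabeled silently.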

# YOUR CODE HERE
raise NotImplementedError()
assert isinstance(CR_test_df['label'], pd.Series)
assert set(CR_test_df['label']) == {'neg', 'pos'}

The hidden assignment tests for the cell above will check that your model predicts the right number of pos/neg reviews in the test data provided.

We now have a model that can predict positive or negative sentiment!

In the cell below, as a written answer question, briefly explain, in your own words, what the BoW and TF-IDF word representations are, and how they differ. Also, think about and write a quick example of when and why it might be useful to automatically analyze the sentiment of text data. [This whole answer can/should be a couple of sentences].

After you answer this question, you are done!

YOUR ANSWER HERE

Complete!

Good work! Have a look back over your answers, and also make sure to Restart & Run All from the kernel menu to double check that everything is working properly. While you can typically use the ‘Validate’ button above, which runs your notebook from top to bottom and checks to ensure all assert statements pass silently, ‘Validate’ may fail on this assignment as the code takes too long to run. Use Restart & Run All instead. When you are ready, submit on datahub!

Note that the final validation is for your reassurance and is not a required step. You can submit without validating. You can also submit without passing all asserts (for partial credit on the assignment). We grade whatever is submitted on datahub. We will grade your most recent submission.