Lab 08: MLB Swing Probability Modeling

Scenario: Suppose you work for a Major League Baseball (MLB) team as part of its analytics department. For a variety of reasons, it is useful to know whether a batter will swing at a particular pitch, even independent of the outcome. You are tasked with developing a well-calibrated probability model that estimates the probability of inducing a swing given the characteristics of a pitch, in addition to other information such as the game situation, for a particular pitcher. (In practice, you would build one model per pitcher.)

Goal

The goal of this lab is to develop a model to estimate the probability that an MLB batter swings at a pitch thrown by a particular pitcher.

Data

This lab will use data from Statcast. We will focus on a model for a specific pitcher, Zac Gallen.

In this lab, the train-test split is done within the 2023 MLB Season. That is, the training data consists of Zac Gallen pitches that occurred between opening day (2023-03-30) and the trade deadline (2023-08-31). The test data covers the remainder of the season, from September 1 (2023-09-01) to the final day of the regular season (2023-10-02). Hidden from you is the production data, which covers the postseason (2023-10-03 to 2023-11-01).

We do this in place of randomly splitting the data in an attempt to create a model that can predict into the future. Imagine this model is created on the final day of the regular season, then possibly used to make baseball decisions during the playoffs.

Because Statcast data is constantly updated and improved, and thus can change at any moment, we provide a snapshot of the data for use in this lab.

Each sample is a pitch thrown by Zac Gallen.

Response

swing

  • [int64] Whether or not the batter swung.

Features

While we will certainly not be able to make any truly causal claims about our model, it is important to understand which variables are controlled by the pitcher. We could imagine a coach using this model to help explain to a pitcher where and how to throw a pitch if they want to induce a swing.

Fully Pitcher Controlled

This variable is fully controlled by the pitcher.

pitch_name

  • [object] The name of the pitch type to be thrown.

Mostly Pitcher Controlled

These variables are largely controlled by the pitcher, but even at the highest levels of baseball, there will be variance based on skill, fatigue, etc.

release_extension

  • [float64] Release extension of pitch in feet as tracked by Statcast.

release_pos_x

  • [float64] Horizontal Release Position of the ball measured in feet from the catcher’s perspective.

release_pos_y

  • [float64] Release position of pitch measured in feet from the catcher’s perspective.

release_pos_z

  • [float64] Vertical Release Position of the ball measured in feet from the catcher’s perspective.

Somewhat Pitcher Controlled

These variables are in some sense controlled by the pitcher, but less so than the previous variables. At the MLB level, pitchers will have some control here, but even at the highest levels, there can be a lot of variance.

release_speed

  • [float64] Velocity of the pitch thrown.

release_spin_rate

  • [float64] Spin rate of pitch tracked by Statcast.

spin_axis

  • [float64] The spin axis in the 2D X-Z plane in degrees from 0 to 360, such that 180 represents a pure backspin fastball and 0 degrees represents a pure topspin (12-6) curveball.

plate_x

  • [float64] Horizontal position of the ball when it crosses home plate from the catcher’s perspective.

plate_z

  • [float64] Vertical position of the ball when it crosses home plate from the catcher’s perspective.

Downstream Pitcher Controlled

These variables are pitch characteristics, and may be somewhat controlled by the pitcher, but they are largely functions of the previous variables.

pfx_x

  • [float64] Horizontal movement in feet from the catcher’s perspective.

pfx_z

  • [float64] Vertical movement in feet from the catcher’s perspective.

Situational Information

These variables describe part of the game situation when the pitch was thrown. (We have omitted some other obvious variables here, like score and inning, just for simplicity.) These are fixed before a pitch is thrown, but could have an effect. Pitchers and batters often act differently based on the game situation. For example, batters are known to “protect” the plate when there are two strikes, and are thus much more likely to swing.

balls

  • [int64] Pre-pitch number of balls in count.

strikes

  • [int64] Pre-pitch number of strikes in count.

on_3b

  • [int64] Pre-pitch MLB Player Id of Runner on 3B.

on_2b

  • [int64] Pre-pitch MLB Player Id of Runner on 2B.

on_1b

  • [int64] Pre-pitch MLB Player Id of Runner on 1B.

outs_when_up

  • [int64] Pre-pitch number of outs.

Fixed Batter Information

These variables give some information about the batter facing the pitch. In particular, are they a righty or lefty, and the size of their strike zone, which is a function of their height.

stand

  • [object] Side of the plate the batter is standing on.

sz_top

  • [float64] Top of the batter’s strike zone set by the operator when the ball is halfway to the plate.

sz_bot

  • [float64] Bottom of the batter’s strike zone set by the operator when the ball is halfway to the plate.

Data in Python

To load the data in Python, use:

import pandas as pd
pitches_train = pd.read_csv("https://cs307.org/lab-08/data/pitches-train.csv")
pitches_test = pd.read_csv("https://cs307.org/lab-08/data/pitches-test.csv")

Prepare Data for Machine Learning

Because the data is already train-test split, we can simply create the X and y variants of the data.

# create X and y for train data
X_train = pitches_train.drop(columns=["swing"])
y_train = pitches_train["swing"]

# create X and y for test data
X_test = pitches_test.drop(columns=["swing"])
y_test = pitches_test["swing"]

Sample Statistics

Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn and create a visualization for your report.
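For a binary response, the overall swing rate and the swing rate by pitch type are natural first summaries. The sketch below shows the relevant pandas calls on a small hypothetical frame (the rows are made up for illustration); in the lab, run the same calls on pitches_train.

```python
import pandas as pd

# Hypothetical mini-sample with the same column names as the lab data;
# in practice, run these calls on pitches_train loaded from the CSV.
pitches_train = pd.DataFrame({
    "pitch_name": ["4-Seam Fastball", "Knuckle Curve", "4-Seam Fastball", "Changeup"],
    "release_speed": [94.1, 82.3, 93.7, 86.0],
    "swing": [1, 0, 1, 0],
})

# Overall swing rate: the baseline probability any model should improve upon
swing_rate = pitches_train["swing"].mean()

# Swing rate and pitch count broken out by pitch type
by_pitch = pitches_train.groupby("pitch_name")["swing"].agg(["mean", "count"])

print(swing_rate)  # 0.5 on this toy sample
print(by_pitch)
```

The same pattern (groupby plus agg) extends to any of the situational variables, such as swing rate by strikes.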

Models

For this lab, you will need to train two separate but related models.

Probability Model: Train a supervised model (likely a classifier) to predict swing from the other variables. However, this model will not be evaluated on its ability to classify pitches as swing or no swing. Instead, we will directly assess its ability to estimate the probability of a swing. Thus, you need a well-calibrated model.

The sklearn user guide page on probability calibration will provide some hints. Importantly, you may need to use CalibratedClassifierCV to further calibrate the probability estimates from a classifier.

In the autograder, we will use two metrics to assess your submitted model:

  • Expected Calibration Error (ECE): This is essentially how far on average the points on a calibration plot are from the “perfect” line.
  • Maximum Calibration Error (MCE): This is essentially the furthest any point on a calibration plot is from the “perfect” line.

We provide Python functions for creating calibration plots and calculating these metrics.
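The course-provided functions are what the autograder uses; for intuition only, here is one plausible sketch of ECE and MCE built on sklearn's calibration_curve. Note that some ECE definitions weight each bin by its count; this unweighted version is an assumption, not necessarily the autograder's exact formula.

```python
import numpy as np
from sklearn.calibration import calibration_curve

def ece_mce(y_true, y_prob, n_bins=10):
    """Sketch of ECE/MCE: bin the predicted probabilities, then take the
    average (ECE) or maximum (MCE) gap between the mean predicted
    probability and the observed positive rate within each bin."""
    frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=n_bins)
    gaps = np.abs(frac_pos - mean_pred)
    return gaps.mean(), gaps.max()

# Toy check: predictions that match the outcomes exactly give zero error
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.0, 0.0, 1.0, 1.0])
ece, mce = ece_mce(y_true, y_prob)
print(ece, mce)  # 0.0 0.0
```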

Novelty Detector: The second model you will train is an unsupervised novelty detector. It should be fit to the training features only. We will test how many observations it flags as novel (outliers) in the test data. Your detector should flag at least one observation as novel in each of the test and production data, but should flag no more than 5% of the observations. Use 1 for inliers and -1 for outliers, as is the default in sklearn.

For this lab, you may train models however you’d like! The only rules are:

Probability Model:

  • Models must start from the given training data, unmodified.
    • Importantly, the types and shapes of X_train and y_train should not be changed.
    • In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has a shape compatible with, and the same variable names and types as, X_train.
    • We assume that you will use a Pipeline and GridSearchCV from sklearn as you will need to deal with heterogeneous data, and you should be using cross-validation.
      • So more specifically, you should create a Pipeline that is fit with GridSearchCV. Done correctly, this will store a “model” that you can submit to the autograder.
  • Your model must have a fit method.
  • Your model must have a predict method that returns numbers.
  • Your model must have a predict_proba method that returns numbers.
  • Your model should be created with scikit-learn version 1.4.0 or newer.
  • Your model should be serialized with joblib version 1.3.2 or newer.
    • Your serialized model must be less than 5MB.

Novelty Detector:

  • Models must start from the given training data, unmodified.
    • Importantly, the types and shapes of X_train and y_train should not be changed.
    • In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has a shape compatible with, and the same variable names and types as, X_train.
  • Your model should be created with scikit-learn version 1.4.0 or newer.
  • Your model should be serialized with joblib version 1.3.2 or newer.
    • Your serialized model must be less than 5MB.
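As one sketch of a novelty detector satisfying these rules, the example below fits an IsolationForest inside a Pipeline on hypothetical training features. The column names and values are placeholders, and IsolationForest is just one of several detectors available in sklearn; note that it is fit without a response.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-ins for the training and test features
rng = np.random.default_rng(307)
X_train = pd.DataFrame({
    "release_speed": rng.normal(93, 1.5, 300),
    "pitch_name": rng.choice(["4-Seam Fastball", "Changeup"], 300),
})
X_test = pd.DataFrame({
    "release_speed": np.append(rng.normal(93, 1.5, 99), 60.0),  # one clear outlier
    "pitch_name": rng.choice(["4-Seam Fastball", "Changeup"], 100),
})

preprocess = ColumnTransformer([
    ("num", SimpleImputer(), ["release_speed"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["pitch_name"]),
])

# contamination controls roughly what fraction of points are flagged;
# tune it so the test flag rate lands between "at least one" and 5%
detector = Pipeline([
    ("preprocess", preprocess),
    ("detect", IsolationForest(contamination=0.01, random_state=1)),
])
detector.fit(X_train)  # unsupervised: features only, no y

flags = detector.predict(X_test)  # 1 = inlier, -1 = novel
prop_novel = (flags == -1).mean()
print(prop_novel)
```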

While you can use any modeling technique, each lab is designed such that a model using only techniques seen so far in the course can pass the checks in the autograder.

To obtain the maximum points via the autograder, your model performance must meet or exceed:

Test ECE: 0.075
Test MCE: 0.135
Test Proportion Novel: 0.05
Production ECE: 0.075
Production MCE: 0.135
Production Proportion Novel: 0.05

The production data mimics data that would be passed through your model after you have put it into production, that is, while it is being used for the stated goal within the scenario of the lab. As such, you do not have access to it. You do, however, have access to the test data.

Model Persistence

To save your models for submission to the autograder, use the dump function from the joblib library. Check PrairieLearn for the filename that the autograder expects.

Because the models for this lab could be quite large, consider using the compress parameter to the dump function!
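For example, saving and reloading with compression might look like the following, where DummyClassifier stands in for your actual fitted model and the file path is illustrative (check PrairieLearn for the expected filename).

```python
import os
import tempfile

import joblib
from sklearn.dummy import DummyClassifier

# Any fitted estimator works here; DummyClassifier is just a small placeholder.
model = DummyClassifier(strategy="most_frequent").fit([[0], [1], [2]], [0, 0, 1])

# compress=3 trades a little save time for a much smaller file,
# which helps stay under the 5MB autograder limit.
path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, path, compress=3)

# Loading the file back recovers a working estimator
reloaded = joblib.load(path)
print(reloaded.predict([[5]]))  # [0]
```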

Discussion

As always, be sure to state a conclusion, that is, whether or not you would use the model you trained and selected for the real world scenario described at the start of the lab! If you are asked to train multiple models, first make clear which model you selected and are considering for use in practice. Discuss any limitations or potential improvements.

Additional discussion topics:

  • If you tried to use this model as an MLB coach, which variables would you ask the pitcher to modify to induce a swing? Why?

When answering discussion prompts: Do not simply answer the prompt! Answer the prompt, but write as if the prompt did not exist. Write your report as if the person reading it did not have access to this document!

Template Notebook

Submission

Before submission, especially of your report, you should be sure to review the Lab Policy page!

On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.