Lab 07: Airline Tweets Sentiment Analysis
Scenario: You work at the intersection of the data and social teams for a major US airline. You are tasked with building a sentiment classifier that will allow customer service representatives to respond to negative tweets about the airline, while positive tweets are automatically acknowledged.
Goal
The goal of this lab is to create a sentiment classifier that can automatically classify tweets directed at US airlines as one of three sentiments: negative, neutral, or positive.
Data
The data for this lab originally comes from Kaggle.
A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as “late flight” or “rude service”).
We are providing a modified version of this data for this lab.
Modifications include:
- Keeping only the airline_sentiment, text, and airline variables.
- Withholding some data that will be considered the production data.
Response
sentiment [object]: the sentiment of the tweet. One of negative, neutral, or positive.
Features
text [object]: the full text of the tweet.
Additional Variables
airline [object]: the airline the tweet was “sent” to.
Data in Python
To load the data in Python, use:
import pandas as pd

tweets = pd.read_csv("https://cs307.org/lab-07/data/tweets.csv")
Prepare Data for Machine Learning
First, train-test split the data by using:
from sklearn.model_selection import train_test_split
tweets_train, tweets_test = train_test_split(
    tweets,
    test_size=0.25,
    random_state=42,
)
Then, to create the X and y variants of the data, use:
# create X and y for train data
X_train = tweets_train["text"]
y_train = tweets_train["sentiment"]

# create X and y for test data
X_test = tweets_test["text"]
y_test = tweets_test["sentiment"]
Here, we are purposefully excluding the airline variable from the creation of models.
Bag-of-Words
To use the text of the tweets as input to machine learning models, you will need to do some preprocessing; raw text cannot simply be fed into the models we have seen so far.
X_train
2233 @JetBlue Then en route to the airport the rebo...
10733 @united now you've lost my bags too. At least...
400 @USAirways Hi, can you attach my AA FF# 94LXA6...
7615 @United, will you fill it? Yes they will. Than...
4099 @AmericanAir thanks! I hope we get movies. Tv'...
...
5734 @united Can i get a refund? I would like to bo...
5191 @VirginAmerica what is your policy on flying a...
5390 @united I'm not sure how you can help. Your fl...
860 @VirginAmerica LAX to EWR - Middle seat on a r...
7270 @united Hopefully my baggage fees will be waiv...
Name: text, Length: 8235, dtype: object
To do so, we will create a so-called bag-of-words. Let’s see what that looks like with a small set of strings.
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
word_counter = CountVectorizer()
word_counts = word_counter.fit_transform(
    [
        "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo",
        "The quick brown fox jumps over the lazy dog",
        "",
    ]
).todense()
for key, value in word_counter.vocabulary_.items():
    print(f"{key}: {value}")
buffalo: 1
the: 8
quick: 7
brown: 0
fox: 3
jumps: 4
over: 6
lazy: 5
dog: 2
print(word_counts)
[[0 8 0 0 0 0 0 0 0]
[1 0 1 1 1 1 1 1 2]
[0 0 0 0 0 0 0 0 0]]
pd.DataFrame(word_counts, columns=sorted(list(word_counter.vocabulary_.keys())))
|   | brown | buffalo | dog | fox | jumps | lazy | over | quick | the |
|---|-------|---------|-----|-----|-------|------|------|-------|-----|
| 0 | 0     | 8       | 0   | 0   | 0     | 0    | 0    | 0     | 0   |
| 1 | 1     | 0       | 1   | 1   | 1     | 1    | 1    | 1     | 2   |
| 2 | 0     | 0       | 0   | 0   | 0     | 0    | 0    | 0     | 0   |
Essentially, we’ve created a number of feature variables, each one counting how many times a word in the vocabulary appears in a sample’s text. This is an example of feature engineering.
Let’s find the 100 most common words in the training tweets sent to the airlines.
top_100_counter = CountVectorizer(max_features=100)
X_top_100 = top_100_counter.fit_transform(X_train)

print("Top 100 Words:")
print(top_100_counter.get_feature_names_out())
print("")
Top 100 Words:
['about' 'after' 'again' 'airline' 'all' 'am' 'americanair' 'amp' 'an'
'and' 'any' 'are' 'as' 'at' 'back' 'bag' 'be' 'been' 'but' 'by' 'call'
'can' 'cancelled' 'co' 'customer' 'delayed' 'do' 'don' 'flight'
'flightled' 'flights' 'for' 'from' 'gate' 'get' 'got' 'had' 'has' 'have'
'help' 'hold' 'hour' 'hours' 'how' 'http' 'if' 'in' 'is' 'it' 'jetblue'
'just' 'late' 'like' 'me' 'my' 'need' 'no' 'not' 'now' 'of' 'on' 'one'
'or' 'our' 'out' 'over' 'phone' 'plane' 'please' 're' 'service' 'so'
'southwestair' 'still' 'thank' 'thanks' 'that' 'the' 'there' 'they'
'this' 'time' 'to' 'today' 'united' 'up' 'us' 'usairways' 've'
'virginamerica' 'was' 'we' 'what' 'when' 'why' 'will' 'with' 'would'
'you' 'your']
X_top_100_dense = X_top_100.todense()
X_top_100_dense
matrix([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 1, 0],
[0, 0, 0, ..., 0, 1, 0],
...,
[0, 0, 0, ..., 0, 1, 1],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]])
X_top_100.shape
(8235, 100)
plane_idx = np.where(top_100_counter.get_feature_names_out() == "plane")
plane_count = np.sum(X_top_100.todense()[:, plane_idx])
print('The Word "plane" Appears:', plane_count)
The Word "plane" Appears: 362
Note that you’ll need to do this same process, but within a pipeline! You might also consider looking into other techniques to process text for input to models.
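As a rough illustration of what that might look like, the sketch below chains a CountVectorizer with a classifier inside a Pipeline. The estimator and its settings (LogisticRegression, max_features=2000) are placeholder choices for illustration, not a prescribed model:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# a minimal sketch: vectorize the tweet text, then classify
# (the vectorizer settings and the estimator are placeholder choices)
text_pipeline = Pipeline([
    ("vectorizer", CountVectorizer(max_features=2000)),
    ("classifier", LogisticRegression(max_iter=1000)),
])

Because the vectorizer is a step inside the pipeline, calling text_pipeline.fit(X_train, y_train) learns the vocabulary from the training tweets only, and cross-validation will re-learn it within each fold.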
Sample Statistics
Before modeling, be sure to look at the data. Calculate the summary statistics requested on PrairieLearn and create a visualization for your report.
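The specific statistics to compute are listed on PrairieLearn. As one possible starting point (the summary and plot below are only suggestions, not the required ones), you could look at how the sentiments are distributed in the training data:

import matplotlib.pyplot as plt

# distribution of the response in the training data (one possible summary)
print(tweets_train["sentiment"].value_counts())

# a simple bar chart of that distribution (one possible visualization)
tweets_train["sentiment"].value_counts().plot(kind="bar")
plt.xlabel("sentiment")
plt.ylabel("count")
plt.show()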
Models
For this lab you will select one model to submit to the autograder. You may use any modeling techniques you’d like. The only rules are:
- Models must start from the given training data, unmodified.
  - Importantly, the types and shapes of X_train and y_train should not be changed.
  - In the autograder, we will call mod.predict(X_test) on your model, where your model is loaded as mod and X_test has a compatible shape with, and the same variable names and types as, X_train.
  - We assume that you will use a Pipeline and GridSearchCV from sklearn, as you will need to deal with heterogeneous data, and you should be using cross-validation.
    - So more specifically, you should create a Pipeline that is fit with GridSearchCV; see the sketch after this list. Done correctly, this will store a “model” that you can submit to the autograder.
- Your model must have a fit method.
- Your model must have a predict method that returns numbers.
- Your model must have a predict_proba method that returns numbers.
- Your model should be created with scikit-learn version 1.4.0 or newer.
- Your model should be serialized with joblib version 1.3.2 or newer.
  - Your serialized model must be less than 5MB.
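To make the Pipeline-plus-GridSearchCV requirement concrete, here is a rough sketch, assuming bag-of-words features and a logistic regression classifier; the steps, estimator, and parameter grid are placeholder choices, not a prescribed solution:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# placeholder pipeline: bag-of-words features, then a classifier
pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# placeholder grid: vocabulary size and regularization strength
param_grid = {
    "vectorizer__max_features": [1000, 5000],
    "classifier__C": [0.1, 1.0, 10.0],
}

# cross-validated grid search; the fitted search object is the "model" to submit
mod = GridSearchCV(pipeline, param_grid, cv=5)
mod.fit(X_train, y_train)

A fitted GridSearchCV refits the best pipeline on all of the training data and forwards fit, predict, and predict_proba (when the final estimator supports it), so it provides the interface described above.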
To obtain the maximum points via the autograder, your model performance must meet or exceed:
- Test Accuracy: 0.79
- Production Accuracy: 0.79
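You can estimate the first of these yourself before submitting; assuming mod is a fitted model as sketched above:

from sklearn.metrics import accuracy_score

# accuracy on the held-out test tweets
test_accuracy = accuracy_score(y_test, mod.predict(X_test))
print("Test Accuracy:", test_accuracy)

The production accuracy can only be checked by the autograder, since the production data is withheld.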
Model Persistence
To save your model for submission to the autograder, use the dump function from the joblib library. Check PrairieLearn for the filename that the autograder expects.
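A minimal sketch of saving a fitted model, where "model.joblib" is only a placeholder name; use the exact filename given on PrairieLearn:

import os
from joblib import dump

# serialize the fitted model ("model.joblib" is a placeholder filename)
dump(mod, "model.joblib")

# check that the serialized model is under the 5MB limit
print(os.path.getsize("model.joblib") / (1024 * 1024), "MB")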
Discussion
As always, be sure to state a conclusion, that is, whether or not you would use the model you trained and selected for the real world scenario described at the start of the lab! If you are asked to train multiple models, first make clear which model you selected and are considering for use in practice. Discuss any limitations or potential improvements.
Additional discussion topics:
- What are the potential mistakes the model could make, and what are the severities of these mistakes?
When answering discussion prompts: Do not simply answer the prompt! Answer the prompt, but write as if the prompt did not exist. Write your report as if the person reading it did not have access to this document!
Template Notebook
Submission
On Canvas, be sure to submit both your source .ipynb file and a rendered .html version of the report.