An analysis of Tesla Tweets – rohit apte

Love it or hate it, Tesla draws very polarized opinions. Twitter is full of arguments both for and against the company. In this post we will see how to analyze those tweets from an NLP perspective.

Disclaimer: This article is intended purely to show how to approach the problem from an NLP perspective. I am currently short Tesla through stocks and options, and any data and results presented here should not be interpreted as research or trading advice.

Fetching Twitter data

There are many libraries for fetching Twitter data. The one I used was Tweepy. I downloaded 25,000 of the most recent tweets and filtered for tweets in English, which left 18,171 tweets over a period of 9 days. Tweepy has a few configurable options. Unless you have a paid subscription you need to account for rate limiting. I also chose to filter out retweets and selected extended mode to get the full text of each tweet.

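import tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
# wait_on_rate_limit=True makes Tweepy sleep until the rate limit window resets
api = tweepy.API(auth, wait_on_rate_limit=True)
# Exclude retweets and request full tweet text (api.search is named api.search_tweets in Tweepy 4.x)
search_hashtag = tweepy.Cursor(api.search, q="TSLA -filter:retweets", tweet_mode='extended').items(25000)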
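
The English filter mentioned above could be done in several ways. The sketch below shows one of them, assuming each tweet object exposes `lang` and `full_text` attributes, as Tweepy status objects do in extended mode. The `english_texts` helper and the sample objects are hypothetical stand-ins, not the code used for this post.

```python
from types import SimpleNamespace

def english_texts(tweets):
    """Return the full text of every tweet whose language code is 'en'."""
    return [t.full_text for t in tweets if t.lang == "en"]

# Hypothetical stand-in objects mimicking the two attributes used above.
sample = [
    SimpleNamespace(lang="en", full_text="TSLA looks stretched here"),
    SimpleNamespace(lang="ja", full_text="テスラ"),
    SimpleNamespace(lang="en", full_text="Adding to my TSLA position"),
]
print(english_texts(sample))
```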

No NLP post is complete without a word cloud! We generate one from the tweet text after removing stop words and punctuation. It's an interesting set of words, both positive and negative.

Sentiment Analysis

I found Vader (Valence Aware Dictionary and sEntiment Reasoner) to be a very good tool for Twitter sentiment analysis. It uses a lexicon and rule-based approach especially attuned to sentiments expressed in social media. Vader returns positive, neutral and negative scores that sum to 1, and a compound score between -1 (extremely negative) and +1 (extremely positive). We use the compound score for our analysis. Here are the results for some sample tweets that it got correct.

If there was no fraud, there was no demand issue, 
no cash crunch, this is what would kill Tesla. Crafting artisan cars in the 
most expensive part of the world while competition is virtually fully 
automated already simply can't fly.
Compound score: -0.9393


@GerberKawasaki Generational opportunity GERBER to add to TSLA longs. 
This might be the last chance to get in before she blows!
Compound score: 0.6239
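
To turn a compound score into a label, a common convention is to treat scores at or above 0.05 as positive and scores at or below -0.05 as negative. The sketch below applies that convention to the two scores above. The `label` helper and its `threshold` parameter are assumptions for illustration, not part of Vader itself.

```python
def label(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores taken from the two sample tweets above.
for score in (-0.9393, 0.6239):
    print(score, label(score))
```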

Vader also gets plenty of classifications wrong, which suggests the language used to discuss a stock is quite nuanced. Each of the following tweets received a neutral score.

Today in "a company that is absolutely not experiencing a cash crunch". $TSLA
Compound score: 0.0

Tesla (TSLA) Stock Ends the Week in Red. Are the Reasons Model Y or 
Musk Himself?
#cryptocurrency #btcnews #altcoins #enigma #tothemoon #altcoins #pos
Compound score: 0.0

Tesla (TSLA:NAS) and China Unicom (CHU:NYS) Upgraded
Compound score: 0.0

One option is to train our own sentiment classifier if we can find a way to label data. But what about clustering tweets and analyzing sentiment by cluster? We may get a better understanding of which ones are classified correctly that way.
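
As a rough illustration of analyzing sentiment by cluster, the sketch below averages compound scores per cluster label. It assumes every tweet already has a cluster label and a compound score. The helper name and the sample values are hypothetical, not results from this analysis.

```python
from collections import defaultdict

def mean_compound_by_cluster(labels, compounds):
    """Average the compound scores of the tweets assigned to each cluster."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for cluster, score in zip(labels, compounds):
        totals[cluster] += score
        counts[cluster] += 1
    return {cluster: totals[cluster] / counts[cluster] for cluster in totals}

# Hypothetical cluster labels and compound scores for five tweets.
labels = [0, 0, 1, 1, 1]
compounds = [0.5, 0.1, -0.9, 0.0, -0.6]
print(mean_compound_by_cluster(labels, compounds))
```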

To cluster tweets we need to vectorize them so we can compute a distance metric. TFIDF (term frequency, inverse document frequency) works very well for this task. It consists of 2 components:

Term Frequency – how often a word occurs in a document

Inverse Document Frequency – how much information the word provides (whether it's common or rare across all documents).
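
The two components can be combined as in the minimal sketch below, which uses the textbook form: term frequency multiplied by log(N / document frequency). Libraries such as scikit-learn use smoothed and normalized variants, so their numbers will differ. The `tfidf` function and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {term: weight} dict per tokenized, non-empty document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            # Term frequency times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical tokenized tweets.
docs = [
    ["tesla", "stock", "short"],
    ["tesla", "model", "y"],
    ["tesla", "sec", "musk"],
]
weights = tfidf(docs)
print(weights[0])
```

A word such as "tesla" that appears in every document gets a weight of 0, while a word that appears in only one document gets the highest weight.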

Before applying TFIDF we need to tokenize our words. I used NLTK’s TweetTokenizer, which preserves mentions and $ tags, and lemmatized the words to collapse words with similar meanings (we could also try stemming). I also removed all http links in tweets since we can't analyze them algorithmically. Finally, I added punctuation to the stop words that TFIDF will ignore. I ran TFIDF using 1000 features. This is a parameter that we can experiment with and tune. This is what a sample subset of the resultant matrix looks like.
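
The cleaning steps above can be sketched with the standard library alone. This is a simplified stand-in for NLTK's TweetTokenizer, not the code used here: it strips links, keeps mentions, hashtags and $ tags whole, and drops stop words and punctuation. Lemmatization is left out, and the regular expressions and the tiny stop word list are assumptions.

```python
import re
import string

URL_PATTERN = re.compile(r"https?://\S+")
# Keep @mentions, #hashtags and $tags whole; otherwise match words or single punctuation marks.
TOKEN_PATTERN = re.compile(r"[@#$]\w+|\w+(?:'\w+)?|[^\w\s]")
# Tiny illustrative stop word list, extended with punctuation characters.
STOP_WORDS = {"the", "a", "to", "is", "in"} | set(string.punctuation)

def clean_tokens(text):
    """Lowercase, remove links, tokenize, then drop stop words and punctuation."""
    text = URL_PATTERN.sub("", text.lower())
    return [tok for tok in TOKEN_PATTERN.findall(text) if tok not in STOP_WORDS]

print(clean_tokens("@elonmusk The $TSLA short is in trouble! https://t.co/abc"))
```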

We finally have a matrix we can use to run KMeans. Determining the number of clusters is a frequently encountered problem in clustering, and is separate from the process of actually clustering the data. I used the Elbow Method to tune this parameter: we try a range of cluster counts and plot the SSE (Sum of Squared Errors) for each. SSE tends to 0 as we increase the cluster count. A plot of SSE against the number of clusters tends to have the shape of an arm, with the “elbow” marking the value at which further reductions in SSE start to diminish. We pick the number of clusters at the elbow point.
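
To make the Elbow Method concrete, here is a small pure-Python sketch that runs a basic k-means on toy 2-D points and prints the SSE for each cluster count. It is only an illustration under stated assumptions: the real analysis runs on the TFIDF matrix, and the farthest-point initialization, fixed iteration count and sample points are choices made for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centers(points, k):
    """Deterministic start: repeatedly add the point farthest from the chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans_sse(points, k, iterations=20):
    """Run basic k-means and return the final sum of squared errors."""
    centers = init_centers(points, k)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

# Toy data with three well separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
sse_by_k = {k: kmeans_sse(points, k) for k in range(1, 6)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 2))
```

On this toy data the SSE drops sharply up to 3 clusters and only slightly after that, which is the elbow.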

I decided to use 50 clusters since that’s where the elbow is. It's worth looking at the distribution of tweets across clusters and at the most important features for the clusters with a high population.

Cluster 6: ’ @elonmusk tesla elon ‘ musk stock #tesla sec time let going think today @tesla
Cluster 7: @elonmusk #tesla @tesla stock time today would going cars short car company see day get
Cluster 36: musk elon sec tesla ceo judge tweets contempt settlement via cramer going tesla's get
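
A listing like the one above can be produced by counting tweets per cluster and ranking each cluster center's feature weights. The sketch below assumes we already have a list of cluster labels, the cluster centers, and the feature names (in scikit-learn these would typically come from the fitted KMeans and vectorizer objects). The helper names and sample values are hypothetical.

```python
from collections import Counter

def cluster_sizes(labels):
    """Count how many tweets were assigned to each cluster."""
    return Counter(labels)

def top_terms(centroid, feature_names, n=3):
    """Return the n feature names with the largest weights in a cluster center."""
    ranked = sorted(range(len(centroid)), key=lambda i: centroid[i], reverse=True)
    return [feature_names[i] for i in ranked[:n]]

# Hypothetical values for a 2-cluster, 4-feature example.
feature_names = ["musk", "sec", "stock", "short"]
centroids = [[0.1, 0.05, 0.6, 0.4], [0.7, 0.5, 0.1, 0.0]]
labels = [0, 0, 1, 0, 1]

sizes = cluster_sizes(labels)
for cluster, centroid in enumerate(centroids):
    print(f"Cluster {cluster} ({sizes[cluster]} tweets):", " ".join(top_terms(centroid, feature_names, 2)))
```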

Finally, to visualize the clusters we first use TSNE to reduce the TFIDF feature matrix to 2 dimensions, and then plot the result using Bokeh. Bokeh lets us hover over points to see the underlying tweet text, which helps us check how well the clustering is working.

Analyzing the tweets and clusters, I realized there is a lot of spam on Twitter. For a cleaner analysis it's worth researching how to remove these tweets.

As usual, the code is available on my GitHub.
