In this article, we are going to learn how to use R and RStudio to scrape tweets and do some basic qualitative data analysis.
R might not be the most popular programming language out there, and some don’t even consider it a programming language at all. One could even say that Python is slowly strangling R to death.
Still, thousands of data scientists use it every day, and devoted developers keep updating their packages. And, to be honest, R has never been more powerful as a statistics tool than it is today.
Obviously, when people hire software developers, they are most often looking for programmers who know their way around Python. But for academic work and pure statistical analysis, R is a solid toolkit well worth learning.
Having said that, let’s get started.
Load the required packages
First, let’s install (if missing) and load the required packages in RStudio by running the following code.
if(!"rtweet" %in% rownames(installed.packages()) == FALSE){
install.packages(“rtweet”)
}
if(!”tm” %in% rownames(installed.packages()) == FALSE){
install.packages(“rtweet”)
}
if(!”SnowballC” %in% rownames(installed.packages()) == FALSE){
install.packages(“rtweet”)
}
if(!”wordcloud” %in% rownames(installed.packages()) == FALSE){
install.packages(“rtweet”)
}
if(!”RColorBrewer” %in% rownames(installed.packages()) == FALSE){
install.packages(“rtweet”)
}
library(rtweet)
library(“tm”)
library(“SnowballC”)
library(“wordcloud”)
library(“RColorBrewer”)
Getting to know rtweet
Before we get started, you are going to need a Twitter account, so either use your own or create one just for scraping. The first time you make a request to Twitter’s API through one of rtweet’s functions, you’ll get a browser popup requesting approval. Click OK and rtweet will save your authorization token for future requests.
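If you’d rather authenticate explicitly (say, on a machine where the browser popup is inconvenient), rtweet also lets you build a token from your own Twitter developer app credentials with create_token(). A minimal sketch, assuming you already have an app set up — every string below is a placeholder:

token <- create_token(
  app = "my_twitter_app",                    # placeholder app name
  consumer_key = "YOUR_CONSUMER_KEY",        # replace with your own keys
  consumer_secret = "YOUR_CONSUMER_SECRET",
  access_token = "YOUR_ACCESS_TOKEN",
  access_secret = "YOUR_ACCESS_SECRET"
)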
The package has several functions that allow you to scrape Twitter in different ways. For this tutorial, we’ll just focus on one of them. You can check the rest as well as the whole documentation here.
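Just to give you a taste of what else is in the box, here are two of the other scraping functions (the argument values are illustrative, and neither call is used in the rest of this tutorial):

timeline <- get_timeline("BBCWorld", n = 100)  # recent tweets from one account
live <- stream_tweets("covid", timeout = 30)   # sample the live stream for 30 seconds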
search_tweets() and ts_plot()
As the name implies, this function returns tweets related to a user-provided search query. Keep in mind that due to Twitter limitations, you can only get tweets from the last 6 to 9 days, and up to 18,000 tweets every 15 minutes.
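If you ever want to see how much of that 15-minute quota you have left, rtweet can query the API’s rate-limit endpoint. A quick check, assuming the token set up earlier:

rate_limit(query = "search/tweets")  # remaining requests for the search endpoint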
Let’s say that I want to gather 18,000 unique tweets about COVID-19:
df <- search_tweets("COVID-19", n = 18000, include_rts = FALSE)
Another argument worth mentioning is geocode, which takes coordinates in “latitude,longitude,radius” form and limits your search to that specific area. There is also retryonratelimit = TRUE, which resends the search query every 15 minutes or so, allowing you to bypass the 18,000-tweet limit.
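Here is how both arguments might look in practice (the coordinates below are made up and point roughly at New York City):

# Tweets within 50 miles of New York City
df_nyc <- search_tweets("COVID-19", n = 1000, geocode = "40.71,-74.00,50mi")

# A larger pull that automatically waits out each 15-minute rate-limit window
df_big <- search_tweets("COVID-19", n = 54000, retryonratelimit = TRUE)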
If everything goes right, you’ll see a progress bar like the one in the image below, which indicates that rtweet is downloading the information.
[Image: progress bar]
After it’s done, you’ll end up with a data frame that looks something like this:
[Image: the resulting data frame, which has 90 variables]
The resulting data frame has 90 different variables you can play around with, but for this tutorial, we are just going to use the time and the body of the tweets.
We can graph the frequency of our search query with the ts_plot function. It produces a frequency plot over a time frame set with the “by” argument, which can take any of the following values: “secs”, “mins”, “hours”, “days”, “weeks”, “months”, or “years”.
For example, I ran ts_plot(df, by = "hours") with the data I gathered and got the following graph:
[Image: tweet frequency by hour]
Unsurprisingly, there were so many tweets about COVID-19 that my data frame only covers tweets from 4:00 PM to 6:00 PM. What we see is a roughly linear increase in tweets about the disease over the first hour and a roughly linear decrease over the next.
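A nice detail is that ts_plot returns a ggplot2 object, so you can restyle the graph with the usual ggplot2 functions. A small sketch (the labels here are just examples):

ts_plot(df, by = "mins") +
  ggplot2::labs(
    x = "Time", y = "Tweet count",
    title = "Frequency of COVID-19 tweets"
  )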
Making a Word Cloud
We now have plenty of qualitative data to work with, and while we could run something as complex as sentiment analysis, I just want to show you the basics of cleaning and preparing the data for deeper analysis.
Our first step is to turn the data into something that’s more manageable by transforming the raw data into a Corpus.
First, extract the text column from the data frame into a new character vector called text:
text <- df$text
Then use the Corpus function to transform the data:
corp <- Corpus(VectorSource(text))
To make sure that the Corpus was built correctly, run inspect(corp) and check that every tweet has been turned into its own document.
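Printing the whole corpus gets unwieldy with 18,000 documents, so you may want to inspect just a slice of it, for instance:

inspect(corp[1:2])  # show only the first two documents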
Now is when the magic happens. Raw social media data is extremely messy, so you need to trim it and delete some things to make it as clean as possible.
Let’s begin by eliminating numbers, since in the end what you want is a word cloud:
corp <- tm_map(corp, removeNumbers)
Words attached to punctuation are interpreted as different strings, so you need to get rid of those pesky dots and commas:
corp <- tm_map(corp, removePunctuation)
Same deal with white space:
corp <- tm_map(corp, stripWhitespace)
You also have to get rid of function words such as “to”, “but”, and “yet” (referred to as “stop words”), since you want to focus on content words:
corp <- tm_map(corp, removeWords, stopwords("english"))
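If you’re curious which words that removes, you can peek at the built-in list:

head(stopwords("english"), 10)  # first ten of tm’s English stop words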
And finally, get rid of special characters:
corp <- tm_map(corp, removeWords, c("/", "@", "#", "covid"))
Wait, why delete covid? Because that was the search query. If you were to run a frequency analysis without removing it, it would come out on top hands down.
Now you need to prepare the data to actually turn it into a word cloud. Basically, you need a frequency table of all the words left in the corpus:
a <- TermDocumentMatrix(corp)               # build a term-document matrix
b <- as.matrix(a)                           # convert it to a plain matrix
c <- sort(rowSums(b), decreasing = TRUE)    # total frequency of each word
d <- data.frame(word = names(c), freq = c)  # frequency table, sorted
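Before plotting, it’s worth a quick sanity check that the top of the table looks reasonable:

head(d, 10)  # the ten most frequent words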
And then finish by actually creating the word cloud:
wordcloud(words = d$word, freq = d$freq, min.freq = 1,
          max.words = 100, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
[Image: the resulting word cloud]