
TEXT MINING: Term Frequency & Word Clouds using R - Rihanna Fanpage!


Text is a huge, largely untapped source of data. With Wikipedia alone estimated to contain 23 billion words across 35 million articles in 291 languages (at the time of writing this blog, of course), there's plenty to analyze. Performing a text analysis will let you find out what people are saying about you!

A great way to apply text analysis to Facebook comments is to compute a simple frequency count for each word used. I'll show you how to do this in R and how to plot these frequencies as a word cloud.

The dataset for this tutorial contains the comments on a video that #Rihanna posted on 14 March on her own Facebook fan page, which has so far been viewed about 5 million times!

To see how to extract this kind of information from Facebook pages, have a look at this blog post. >>> DATA EXTRACTION

Step 1:

Install “tm”, the text mining library, along with the other packages we need for R. Once these are installed, load the libraries into your session.
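Here is a minimal sketch of this step. The post only names tm explicitly; the wordcloud package is my assumption, since we'll need something to draw the plot in Step 7.

install.packages("tm")        # text mining framework
install.packages("wordcloud") # word cloud plotting (assumed, used in Step 7)

library(tm)
library(wordcloud)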

Step 2:

Import Data into R and set up a source for your text.

2.1. We need to set header to FALSE because our CSV file has no header row. If your dataset does have a header or labels, you don't need to set this argument in the read.csv function.

2.2. We need to set stringsAsFactors to FALSE, because we're going to treat our strings as strings rather than as categories (factors).

2.3. You can inspect the comments variable with the str() function to check that the CSV has loaded correctly.

Depending on what your dataset looks like, you'll see something like this in the R console: >>> |'data.frame': 6073 obs. of 1 variable|

(At the time of writing this blog, 6073 comments had been left by Rihanna's fans!)

2.4. We are going to treat all the comments as one text and look at the word counts across those comments. We currently have the text of every comment in a vector of length 6073, where each element corresponds to one comment. Since we're not interested in the differences between individual comments, we can simply paste every comment together, separated by a space. The collapse argument of paste tells R that we want to paste together the elements of a vector, rather than pasting vectors together. The sketch below puts 2.1 through 2.4 together.
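In this sketch, the file name comments.csv is a placeholder for your own export; with header = FALSE, read.csv names the single column V1.

# read the raw comments: no header row, keep strings as character data
comments <- read.csv("comments.csv", header = FALSE, stringsAsFactors = FALSE)

str(comments)  # should print: 'data.frame': 6073 obs. of 1 variable

# collapse all 6073 comments into one long text, separated by spaces
comment_text <- paste(comments$V1, collapse = " ")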

Step 3:

3.1. Create a corpus from that source (a corpus is just another name for a collection of texts)

3.2. Create a document-term matrix, which tells us how frequently each term appears in each document in the corpus. (We'll actually build this matrix in Step 5, after cleaning the text.)
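For 3.1, a sketch of the corpus construction, reusing comment_text from Step 2:

# wrap the single long string in a source, then build a one-document corpus
comment_source <- VectorSource(comment_text)
corpus <- VCorpus(comment_source)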

Step 4:

Cleaning the text. We use tm's multipurpose tm_map function to perform a variety of cleaning tasks.
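The post doesn't list the individual transformations, so the following pass is an assumption based on common practice with tm:

corpus <- tm_map(corpus, content_transformer(tolower))      # lower-case everything
corpus <- tm_map(corpus, removePunctuation)                 # strip punctuation
corpus <- tm_map(corpus, stripWhitespace)                   # collapse repeated spaces
corpus <- tm_map(corpus, removeWords, stopwords("english")) # drop "the", "and", ...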

Step 5:

Creating the document-term matrix

Since we only have one document in this case, our document-term matrix will only have one row (with one column per term).

The tm package stores document-term matrices as sparse matrices for efficiency. Since we only have 6073 comments, collapsed into a single document, we can simply convert our document-term matrix into a normal matrix, which is easier to work with.
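Continuing the sketch with the cleaned corpus from Step 4:

dtm <- DocumentTermMatrix(corpus) # one row (our single document), one column per term
dtm_matrix <- as.matrix(dtm)      # a dense matrix is fine for a single document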

Step 6:

Take the column sums of this matrix, which gives us a named vector, and sort this vector to see the most frequently used words.

Printing the head of the sorted vector gives the six most frequent words:
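A sketch, continuing with dtm_matrix from Step 5:

frequency <- colSums(dtm_matrix)                 # named vector: term -> count
frequency <- sort(frequency, decreasing = TRUE)  # most frequent terms first
head(frequency)                                  # the top six terms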

Step 7:

Plotting a word cloud

We just need a list of the words to be plotted and their frequencies. To get the list of words, we can take the names of our named vector.
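A sketch; plotting the top 100 words is my choice here, not something the post specifies:

words <- names(frequency)  # the terms, already in descending frequency order
# draw the cloud: word size reflects frequency
wordcloud(words[1:100], frequency[1:100])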

Click the button below to download the .R file:

