Intro to TidyText with Are.na

Posted on May 21, 2019

TidyText is an R package that lets one apply the signature tidy approach to R data analysis to various forms of text analysis, from sentiment analysis to topic modeling. Because it works with the tidyverse, text analysis becomes significantly easier to perform.

Below is a brief tutorial on how to gather text and perform a sentiment analysis on the resulting information, done with tidytext in R.

To start, one needs to collect data. While many guides use Gutenberg books, in this case we will use a data source that requires a little more preparation.

First, let's load the necessary libraries and load the data.

In this case, we will be using one of my are.na channels as a source of textual data. For reference, are.na is a social media platform that is unusual in that it is not very textual. Instead, users post links to pictures, websites, and videos, and then organize them into channels. The channel used here is called interesting things, a compilation of mainly articles I found interesting. As are.na returns quite a bit of information from its API, the specific part of the result that we need has to be specified.
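As a rough sketch of this loading step, the following assumes the public are.na v2 endpoint for a channel's contents. The slug interesting-things, the helper names arena_contents_url and fetch_arena_blocks, and the assumption that the blocks sit in a contents element of the response are all illustrative, and not necessarily what was used for this post.

```r
library(dplyr)      # data manipulation (part of the tidyverse)
library(tidytext)   # tokenizing and sentiment lexicons
library(lubridate)  # date parsing
library(jsonlite)   # reading JSON from the API

# Build the URL for a channel's contents (endpoint layout is an assumption)
arena_contents_url <- function(slug) {
  paste0("https://api.are.na/v2/channels/", slug, "/contents")
}

# Download the channel and keep only the part of the result we need.
# Assumes the blocks are returned under `contents`.
fetch_arena_blocks <- function(slug) {
  res <- fromJSON(arena_contents_url(slug), flatten = TRUE)
  res$contents
}

# Hypothetical slug; needs network access, so it is left commented out here.
# blocks <- fetch_arena_blocks("interesting-things")
```

If the call succeeds, blocks should be a data frame with one row per block, which is the starting point for the next step.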

The next step is to convert the data set into a form with only the data we actually need, and then to tokenize it.

The tidyverse is out in full force here: tibble and mutate are among the data manipulation functions that come with it, and they provide a neat functional interface for data analysis.

In this step, we create a tibble, a type of data frame, with the text from the descriptions in an are.na channel, the titles, and the creation dates. We also use lubridate to clean up the date information so that it is stored as a date and not as a character object.
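A minimal sketch of that conversion is below. The blocks data frame is a small invented stand-in for the API result, and its column names (title, description, created_at) are assumptions about what the response contains.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Invented stand-in for the data returned by the API
blocks <- data.frame(
  title = c("An article on cities", "Notes on typography"),
  description = c(
    "A wonderful essay about the failure of urban planning.",
    "The history of a beautiful typeface and the ugly lawsuit around it."
  ),
  created_at = c("2019-03-02T14:05:11.000Z", "2019-04-18T09:30:45.000Z"),
  stringsAsFactors = FALSE
)

# Keep only the columns we need, then parse the timestamp into a date-time
channel_text <- tibble(
  title = blocks$title,
  text = blocks$description,
  created = blocks$created_at
) %>%
  mutate(created = ymd_hms(created))

channel_text
```

After the mutate call, created is a date-time column rather than a character column, so functions such as hour() or month() can be applied to it later.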

Once we have this information, it is time to create the tokens. Before we do that, we should load stop_words, as there are some words we do not want in our dataset. Stop words are words like "the" or "a": words that are used so commonly that they would only skew the analysis toward very generic words.
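Tokenizing and removing stop words can be sketched as follows. The channel_text tibble here is a small invented example, and tidy_words is just an illustrative name for the result.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Invented example data standing in for the channel tibble
channel_text <- tibble(
  title = c("An article on cities", "Notes on typography"),
  text = c(
    "A wonderful essay about the failure of urban planning.",
    "The history of a beautiful typeface and the ugly lawsuit around it."
  )
)

# stop_words ships with tidytext
data(stop_words)

tidy_words <- channel_text %>%
  unnest_tokens(word, text) %>%        # one lowercase word per row
  anti_join(stop_words, by = "word")   # drop words such as "the" and "a"

tidy_words
```

unnest_tokens splits the text column into one word per row, lowercases it, and strips punctuation, while anti_join keeps only the rows whose word does not appear in stop_words.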

Now one can, for example, plot the frequency of the ten most common words.
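One way such a plot might be produced is sketched below, using count from dplyr and a bar chart from ggplot2. The words are invented, and the chart styling is only a suggestion.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Invented tokens standing in for the tokenized channel text
tidy_words <- tibble(
  word = c("design", "design", "design", "city", "city", "essay")
)

# Count each word, most frequent first, and keep at most ten
top_words <- tidy_words %>%
  count(word, sort = TRUE) %>%
  head(10)

# Horizontal bar chart, ordered by frequency
p <- ggplot(top_words, aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Occurrences", title = "Most common words")

print(p)
```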

Sentiment analysis looks at the positive or negative associations that words have, by cross-referencing each word in the text against a lexicon that labels words as positive or negative. In this case, we will use the bing lexicon and see how positive or negative the descriptions are for each block.

By adding an index (which could be a page or a chapter), one can track how much the sentiment changes over time or over the course of a book.

With the index, one could, for example, evaluate the sentiment of the descriptions by hour, and see which times of day have positive versus negative sentiment.
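A small sketch of this cross-reference is below. The block column and the words are invented for illustration; get_sentiments("bing") is the tidytext function that returns the bing lexicon.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(tidytext)

# Invented tokens, with a hypothetical `block` column identifying each block
tidy_words <- tibble(
  block = c(1, 1, 1, 2, 2, 2),
  word = c("wonderful", "beautiful", "failure", "ugly", "terrible", "happy")
)

block_sentiment <- tidy_words %>%
  # keep only words found in the lexicon, tagged positive or negative
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(block, sentiment) %>%
  # one column per sentiment, filling missing counts with zero
  pivot_wider(names_from = sentiment, values_from = n,
              values_fill = list(n = 0L)) %>%
  # net score: positive words minus negative words
  mutate(sentiment = positive - negative)

block_sentiment
```

A net score above zero means a block's description leans positive, and a score below zero means it leans negative. Words that are not in the lexicon are simply dropped by the inner join.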
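As an illustration of using the hour as an index, the sketch below assumes each token still carries the created timestamp of its block. The data and the score column are invented for the example.

```r
library(dplyr)
library(tibble)
library(lubridate)
library(tidytext)

# Invented tokens that keep the creation time of the block they came from
tidy_words <- tibble(
  created = ymd_hms(c(
    "2019-03-02 09:15:00", "2019-03-02 09:15:00",
    "2019-03-05 22:40:00", "2019-03-05 22:40:00", "2019-03-05 22:40:00"
  )),
  word = c("wonderful", "happy", "terrible", "ugly", "beautiful")
)

hourly_sentiment <- tidy_words %>%
  mutate(hour = hour(created)) %>%   # the index: hour of the day
  inner_join(get_sentiments("bing"), by = "word") %>%
  group_by(hour) %>%
  # +1 for each positive word, -1 for each negative word
  summarise(score = sum(ifelse(sentiment == "positive", 1, -1)))

hourly_sentiment
```

The same pattern works for any other index: replace hour(created) with a page number, a chapter, or a date, and the grouping follows.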

While there is plenty more to learn in the field of textual analysis, at minimum, sentiment analysis and knowing what tokenizing is represent a good start.