
I’ve been using Martin Hawksey’s brilliant Google Spreadsheet TAGS for archiving Twitter data for y-e-a-r-s. It works very well, and allowed me to develop little R toys like twitterytics-shiny to interactively explore Twitter archives.

However, I have a few complaints. First, Twitter API authentication needs to be set up for each archive. This task, which requires several steps in Google Spreadsheet, is not trivial. Second, I often ended up getting hundreds or even thousands of duplicates, which may cause the file to exceed Google Spreadsheet’s size limit. Third, the getTweets() function powered by Google Code (I think) sometimes fails for unknown reasons. Although I still greatly enjoy TAGS, for these reasons I have been wishing for quite a while for an easier and more reliable way to archive a theoretically unlimited number of tweets. So when I read a post about archiving tweets with MongoDB (part1 & part2) by Julian Hillebrand today, I couldn’t stop myself from trying it out.


Julian’s code works! But after a bit of playing and tweaking, I think the solution I ended up with became distinct enough to deserve an independent post. So below is what I did, which I think contains some improvements over the original solution.

  1. Set up MongoDB

You can download MongoDB here and install it on your computer as Julian suggested. Or you can use a MongoDB hosting service in the cloud, like MongoLab. I used the free subscription on MongoLab. It was very easy to set up (believe me, this was my first time partying with MongoDB), and you don’t need to install anything else (like the Netbeans MongoDB plugin). Plus, you might want to keep the database available when your laptop is off, so it’s better to use something in the cloud. So I strongly recommend MongoLab. If you choose to follow this path, there are three steps:

  • Register for a MongoLab free subscription
  • Create a new database, say “twitter-mongo”
  • Click into the database and add a database user; record the username/password for later use

It will definitely help to read MongoDB’s manual, but I really only got as far as the first page.
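The host, database name, and user credentials recorded in the steps above are the pieces of a standard MongoDB connection URI of the form mongodb://user:password@host:port/database. Below is a minimal sketch of assembling one in plain Java. The class and method names (MongoUriSketch, buildUri, encode) are made up for illustration and are not taken from the gist; the host is the example one used in the R code later in this post, and the credentials are placeholders.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of the gist: assembles a MongoDB connection URI.
class MongoUriSketch {

    // Percent-encode a credential so characters such as '@' or ':' cannot break the URI.
    static String encode(String value) {
        try {
            // URLEncoder writes a space as '+', so turn those into %20 afterwards.
            return URLEncoder.encode(value, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
    }

    // host may include the port, e.g. "ds053858.mongolab.com:53858".
    static String buildUri(String host, String db, String user, String password) {
        return "mongodb://" + encode(user) + ":" + encode(password) + "@" + host + "/" + db;
    }

    public static void main(String[] args) {
        // Placeholder credentials; replace them with the database user you created.
        System.out.println(buildUri("ds053858.mongolab.com:53858", "twitter-mongo",
                "your_username", "your_pass"));
    }
}
```

Encoding the credentials matters mostly when a password contains characters such as "@" or ":", which would otherwise be read as URI separators.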

  2. Set up a Java project and run the code

Java code is used to retrieve tweets through the Twitter API and save them into MongoDB. Julian explained how his code works in detail. I made some tweaks to make the settings more visible and easier to modify. My code is posted in this gist. Download TwitterLoop.java and do the following:

  • Create a Java project in your favorite Java IDE (e.g., Netbeans) and put the Java file in it
  • Add dependencies: twitter4j (core) and mongodb-java-driver
  • In the Java file, change settings for MongoDB and Twitter API

Then you should be able to run the Java file directly and start collecting tweets. Each time you run the file, you will need to type in the search keyword (e.g. “#mri13”) for Twitter. The program will then retrieve 100 tweets containing that keyword from Twitter every 60 seconds (both numbers can be customized in the Java file) and put the new ones into MongoDB. Theoretically you can keep Java instances running forever. (As Julian mentioned, there should be a better way to do this loop, for example using the Streaming API.)

If you run the Java file twice for two different Twitter archives, say “#mri13” and “#edtechchat”, two MongoDB collections will be created, one named after each keyword.
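To make the polling behaviour described above concrete, here is a minimal sketch of the "fetch, keep only the new ones, wait, repeat" loop in plain Java. It is not the code from the gist: the Twitter search is replaced by a stand-in function returning canned results, the MongoDB collection is replaced by an in-memory map, and the sketch assumes that "new" means "a tweet id that has not been stored yet". The names PollingSketch, Tweet, and pollOnce are invented for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the polling loop; stand-ins replace twitter4j and MongoDB.
class PollingSketch {

    // Minimal tweet record used only in this sketch.
    static final class Tweet {
        final long id;
        final String text;
        Tweet(long id, String text) { this.id = id; this.text = text; }
    }

    // Run one search and store only tweets whose id is not archived yet.
    // Returns the number of tweets that were added.
    static int pollOnce(String keyword,
                        Function<String, List<Tweet>> search,
                        Map<Long, String> archive) {
        int added = 0;
        for (Tweet t : search.apply(keyword)) {
            // putIfAbsent returns null only when the id was not present before.
            if (archive.putIfAbsent(t.id, t.text) == null) {
                added++;
            }
        }
        return added;
    }

    public static void main(String[] args) throws InterruptedException {
        String keyword = "#mri13";
        long intervalMillis = 1_000L; // shortened for the demo; the post uses 60 seconds
        Map<Long, String> archive = new LinkedHashMap<>(); // stand-in for one collection

        // Canned results stand in for a real Twitter search.
        Function<String, List<Tweet>> fakeSearch = k -> Arrays.asList(
                new Tweet(1L, "first tweet about " + k),
                new Tweet(2L, "second tweet about " + k));

        // Two rounds only; a real collector would loop until stopped.
        for (int round = 1; round <= 2; round++) {
            int added = pollOnce(keyword, fakeSearch, archive);
            System.out.println("round " + round + ": " + added + " new, "
                    + archive.size() + " archived");
            if (round < 2) {
                Thread.sleep(intervalMillis);
            }
        }
    }
}
```

Because the second round returns the same two ids, nothing new is stored, which is the property that keeps duplicates out of the archive. In a real collector the map would be one MongoDB collection per keyword.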

  3. Retrieve tweets in R from MongoDB for analysis

After tweets are collected in MongoDB, querying data in R becomes very straightforward.

library(rmongodb)

## Host info and credentials
host <- "ds053858.mongolab.com:53858"
username <- "your_username"
password <- "your_pass"
db <- "twitter-mongo"

## Connect to mongodb
mongo <- mongo.create(host=host, db=db,
                      username=username, password=password)

Check how many collections are in the database:

## Get a list of collections within our namespace
#  here I used each collection for a twitter archive
mongo.get.database.collections(mongo, db)

Do some simple queries in the #mri13 collection:

## Create a string that points to the namespace
#  the collection I'm interested in is "#mri13"
collection <- "#mri13"
namespace <- paste(db, collection, sep=".")

## Check the total number of tweets in "#mri13"
mongo.count(mongo, namespace, mongo.bson.empty())

Build a query to find how many tweets were posted by me:

buf <- mongo.bson.buffer.create()
mongo.bson.buffer.append(buf, "user_name", "bodongchen")
query <- mongo.bson.from.buffer(buf)
# get the count
count <- mongo.count(mongo, namespace, query)
count
 
## Get all tweets posted by me
tweets <- list()
cursor <- mongo.find(mongo, namespace, query)
while (mongo.cursor.next(cursor)) {
  val <- mongo.cursor.value(cursor)
  tweets[[length(tweets)+1]] <- mongo.bson.value(val, "tweet_text")
}
length(tweets)

Retrieve all tweets in this collection/archive as a dataframe:

## Retrieve all tweets and put into a dataframe
library(plyr)
df_arch1 <- data.frame(stringsAsFactors = FALSE)
cursor <- mongo.find(mongo, namespace)
while (mongo.cursor.next(cursor)) {
  # iterate and grab the next record
  tmp <- mongo.bson.to.list(mongo.cursor.value(cursor))
  # flatten the record into a one-row dataframe
  tmp.df <- as.data.frame(t(unlist(tmp)), stringsAsFactors = FALSE)
  # bind to the master dataframe (rbind.fill pads missing columns with NA)
  df_arch1 <- rbind.fill(df_arch1, tmp.df)
}
# number of tweets (rows) and fields (columns)
dim(df_arch1)

Start playing with another collection/archive named “#edtechchat”:

### Try with another collection
collection2 <- "#edtechchat"
namespace2 <- paste(db, collection2, sep=".")
mongo.count(mongo, namespace2, mongo.bson.empty())

The R code above is also included in the gist.

Conclusions

This solution, which brings together MongoDB, Java, and R, seems to me a proof of concept of a reliable and scalable way to automatically archive and analyze tweets. Here, Java can easily be replaced by Python or R. It might evolve into a nice toolbox for hacking Twitter data. I believe it will be a lot of fun.


Bodong Chen



Crisscross Landscapes

Bodong Chen, University of Minnesota
