MongoDB + Java + R = A handy toolbox for archiving and analyzing Twitter data


I’ve been using Martin Hawksey’s brilliant Google Spreadsheet TAGS for archiving Twitter data for y-e-a-r-s. It works very well and has allowed me to develop little R toys like twitterytics-shiny for interactively exploring Twitter archives.

However, I have a few complaints. First, Twitter API authentication needs to be set up for each archive; this task, which requires a few steps in Google Spreadsheet, is not trivial. Second, I often ended up with hundreds or even thousands of duplicates, which may push the file past Google Spreadsheet’s size limit. Third, the getTweets() function, powered by Google Code (I think), sometimes fails for unknown reasons. Although I still greatly enjoy TAGS, for these reasons I have long wished for an easier and more reliable way to archive a theoretically unlimited number of tweets. So when I read a post about *archiving tweets with MongoDB* (part 1 & part 2) by Julian Hillebrand today, I couldn’t stop myself from trying it out.


Julian’s code works! But after a bit of playing and tweaking, the solution I ended up with became quite distinctive and, I think, deserves an independent post. So below is what I did, which contains some improvements over the original solution.

## 1. Set up MongoDB

You can download MongoDB here and install it on your computer, as Julian suggested. Or you can use a MongoDB hosting service in the cloud, like MongoLab. I used the free subscription on MongoLab. It was very easy to set up (believe me, this was my first time playing with MongoDB), and you don’t need to install anything else (like the NetBeans MongoDB plugin). Plus, you might want to keep the archiver running while your laptop is off, so something in the cloud is a better fit. So I strongly recommend MongoLab. If you choose to follow this path, there are roughly three steps: sign up for a MongoLab account, create a new database, and add a database user to it.

It will definitely help to read MongoDB’s manual, but honestly I only got as far as the first page.
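As a sanity check before moving on, the connection details MongoLab gives you can be tested with a few lines of Java (the same driver the next step uses). This is only a hypothetical sketch; the host, port, and credentials below are placeholders from an imaginary MongoLab dashboard, not real values.

{% highlight java %}
import com.mongodb.DB;
import com.mongodb.MongoClient;
import com.mongodb.MongoClientURI;

public class MongoLabPing {
    public static void main(String[] args) throws Exception {
        // MongoLab shows a ready-made connection URI on each database's page;
        // everything in this one is a placeholder
        MongoClientURI uri = new MongoClientURI(
                "mongodb://your_username:your_pass@ds012345.mongolab.com:12345/twitter-mongo");
        MongoClient client = new MongoClient(uri);
        DB db = client.getDB("twitter-mongo");

        // Listing collection names confirms that authentication worked
        System.out.println(db.getCollectionNames());
        client.close();
    }
}
{% endhighlight %}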

## 2. Set up the Java project and run the code

Java code is used to retrieve tweets through the Twitter API and save them into MongoDB. Julian explained how his code works in detail. I made some tweaks to make the settings more visible and easier to modify. My code is posted in this gist. Download it and do a few things: add the required libraries (Twitter4J and the MongoDB Java driver) to your project, and fill in your Twitter API and MongoDB credentials.

Then you should be able to run the Java file directly and start collecting tweets. Each time you run it, you will be asked to type in a search keyword (e.g. “#mri13”). The program then retrieves up to 100 tweets containing that keyword from Twitter every 60 seconds (both numbers can be customized in the Java file) and puts the new ones into MongoDB. In theory, you can keep such Java instances running forever. (As Julian mentioned, there should be a better way to do this loop, for example using the streaming API.)
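To make the moving parts concrete, here is a minimal sketch of such a collection loop, assuming Twitter4J and the 2.x MongoDB Java driver. This is not the code in the gist verbatim: the credential placeholders and the tweet_id field used for de-duplication are my own assumptions, while user_name and tweet_text match the fields queried from R later in this post.

{% highlight java %}
import java.util.Scanner;

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;
import com.mongodb.MongoClientURI;

import twitter4j.Query;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterFactory;
import twitter4j.conf.ConfigurationBuilder;

public class TweetArchiver {

    // Settings kept at the top of the file so they are easy to tweak
    private static final String MONGO_URI =
            "mongodb://your_username:your_pass@ds012345.mongolab.com:12345/twitter-mongo";
    private static final int TWEETS_PER_CALL = 100;    // tweets requested per search call
    private static final long INTERVAL_MS = 60 * 1000; // pause between calls

    public static void main(String[] args) throws Exception {
        // Ask for the search keyword each time the program starts, e.g. "#mri13"
        Scanner in = new Scanner(System.in);
        System.out.print("Search keyword: ");
        String keyword = in.nextLine().trim();

        // Twitter API credentials (placeholders)
        ConfigurationBuilder cb = new ConfigurationBuilder()
                .setOAuthConsumerKey("your_consumer_key")
                .setOAuthConsumerSecret("your_consumer_secret")
                .setOAuthAccessToken("your_access_token")
                .setOAuthAccessTokenSecret("your_access_token_secret");
        Twitter twitter = new TwitterFactory(cb.build()).getInstance();

        // One MongoDB collection per archive, named after the keyword
        MongoClient client = new MongoClient(new MongoClientURI(MONGO_URI));
        DB db = client.getDB("twitter-mongo");
        DBCollection tweets = db.getCollection(keyword);

        while (true) {
            Query query = new Query(keyword);
            query.setCount(TWEETS_PER_CALL);
            for (Status status : twitter.search(query).getTweets()) {
                // Skip tweets that are already archived (de-duplicate on tweet id)
                BasicDBObject byId = new BasicDBObject("tweet_id", status.getId());
                if (tweets.findOne(byId) != null) {
                    continue;
                }
                BasicDBObject doc = new BasicDBObject("tweet_id", status.getId())
                        .append("user_name", status.getUser().getScreenName())
                        .append("tweet_text", status.getText())
                        .append("created_at", status.getCreatedAt());
                tweets.insert(doc);
            }
            Thread.sleep(INTERVAL_MS);
        }
    }
}
{% endhighlight %}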

If you run the Java file twice for two different Twitter archives, say “#mri13” and “#edtechchat”, two MongoDB *collections* with those names will be created.

## 3. Retrieve tweets in R from MongoDB for analysis

After tweets are collected in MongoDB, querying data in R becomes very straightforward.

{% highlight r %}
library(rmongodb)

# Host info and credentials
host <- ""
username <- "your_username"
password <- "your_pass"
db <- "twitter-mongo"

# Connect to MongoDB
mongo <- mongo.create(host=host, db=db, username=username, password=password)
{% endhighlight %}

Check how many collections are in the database:

{% highlight r %}
# Get a list of collections within our namespace;
# here I use each collection for one Twitter archive
mongo.get.database.collections(mongo, db)
{% endhighlight %}

Do some simple queries in the #mri13 collection:

{% highlight r %}
# Create a string that points to the namespace;
# the collection I'm interested in is "#mri13"
collection <- "#mri13"
namespace <- paste(db, collection, sep=".")

# Check the total number of tweets in "#mri13"
mongo.count(mongo, namespace, mongo.bson.empty())
{% endhighlight %}

Build a query to find how many tweets were posted by me:

{% highlight r %}
buf <- mongo.bson.buffer.create()
mongo.bson.buffer.append(buf, "user_name", "bodongchen")
query <- mongo.bson.from.buffer(buf)

# Get the count
count <- mongo.count(mongo, namespace, query)
count

# Get all tweets posted by me
tweets <- list()
cursor <- mongo.find(mongo, namespace, query)
while (mongo.cursor.next(cursor)) {
  val <- mongo.cursor.value(cursor)
  tweets[[length(tweets) + 1]] <- mongo.bson.value(val, "tweet_text")
}
length(tweets)
{% endhighlight %}

Retrieve all tweets in this collection/archive as a dataframe:

{% highlight r %}
# Retrieve all tweets and put them into a data frame
library(plyr)

df_arch1 <- data.frame(stringsAsFactors = FALSE)
cursor <- mongo.find(mongo, namespace)
while (mongo.cursor.next(cursor)) {
  # iterate and grab the next record
  tmp <- mongo.bson.to.list(mongo.cursor.value(cursor))
  # flatten it into a one-row data frame
  tmp.df <- as.data.frame(t(unlist(tmp)), stringsAsFactors = FALSE)
  # bind to the master data frame
  df_arch1 <- rbind.fill(df_arch1, tmp.df)
}
dim(df_arch1)
{% endhighlight %}

Start playing with another collection/archive named “#edtechchat”:

{% highlight r %}
# Try another collection
collection2 <- "#edtechchat"
namespace2 <- paste(db, collection2, sep=".")
mongo.count(mongo, namespace2, mongo.bson.empty())
{% endhighlight %}

The R code above is also included in the gist.


This solution, which brings together MongoDB, Java, and R, looks to me like a proof of concept for a reliable and scalable way to automatically archive and analyze tweets. The Java part could easily be replaced by Python or R. It might evolve into a nice toolbox for hacking Twitter data, and I believe there will be a lot of fun ahead.