I’ve been using Martin Hawksey’s brilliant Google Spreadsheet TAGS for archiving Twitter data for y-e-a-r-s. It works very well, and allowed me to develop little R toys like twitterytics-shiny to interactively explore Twitter archives.
However, I have a few complaints. First, Twitter API authentication needs to be set up for each archive. The task, which requires a few steps in Google Spreadsheet, is not trivial. Second, I often ended up getting hundreds or even thousands of duplicates, which may cause the file size to exceed Google Spreadsheet’s size limit. Third, sometimes the getTweets() function powered by Google Code (I think) fails for unknown reasons. Although I still greatly enjoy TAGS, for these reasons I have been wishing for quite a while to have an easier and more reliable way to archive a theoretically unlimited number of tweets. So when I read a post about archiving tweets with MongoDB (part1 & part2) by Julian Hillebrand today, I couldn’t stop myself from trying it out.
Julian’s code works! But after a bit of playing and tweaking, I think the solution I ended up with had become distinct enough to deserve an independent post. So below is what I did, which I think contains some improvements over the original solution.
You can download MongoDB here and install it on your computer as Julian suggested. Or you can use a MongoDB hosting service in the cloud, like MongoLab. I used the free subscription on MongoLab. It was very easy to set up (believe me, this was my first time partying with MongoDB), and you don’t need to install anything else (like the Netbeans MongoDB plugin). Plus, you might want to keep it running when your laptop is off, so it’s better to use something in the cloud. So I strongly recommend MongoLab. If you choose to follow this path, there are three steps:
- Register for a MongoLab free subscription
- Create a new database, say “twitter-mongo”
- Click into the database and add a database user; record the username/password for later use
It will definitely help to read MongoDB’s manual, but I really only got as far as the first page.
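The username and password recorded in the last step usually end up in a MongoDB connection string of the form mongodb://user:password@host:port/database. Here is a minimal sketch of building one safely. The class name, host, port, and credentials are placeholders I made up for illustration; they are not MongoLab's actual values and not the settings used in the gist.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Sketch: assemble a standard MongoDB connection URI from the recorded credentials. */
final class MongoUriSketch {

    /** Percent-encode a credential so characters such as '@', ':' and '/' cannot break the URI. */
    static String encode(String value) {
        // URLEncoder targets HTML forms and writes a space as '+'; a URI needs "%20" instead.
        return URLEncoder.encode(value, StandardCharsets.UTF_8).replace("+", "%20");
    }

    /** Builds mongodb://user:password@host:port/database with encoded credentials. */
    static String build(String user, String password, String host, int port, String database) {
        return "mongodb://" + encode(user) + ":" + encode(password)
                + "@" + host + ":" + port + "/" + database;
    }

    public static void main(String[] args) {
        // All values below are placeholders (assumptions), not real hosting settings.
        System.out.println(build("archiver", "s3cret/pw", "db.example.com", 27017, "twitter-mongo"));
    }
}
```

Encoding the credentials matters mostly when a generated password contains characters like @ or /, which would otherwise be read as part of the URI structure.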
Set up the Java project and run the code
Java code is used to retrieve tweets through the Twitter API and save them into MongoDB. Julian explained how his code works in detail. I did some tweaks to make the settings more visible and easier to modify. My code is posted in this gist. Download TwitterLoop.java and do the following few things:
- Create a Java project in your favorite Java IDE (e.g., Netbeans) and put the Java file in it
- Add dependencies: twitter4j (core) and mongodb-java-driver
- In the Java file, change the settings for MongoDB and the Twitter API
Then you should be able to run the Java file directly and start collecting tweets. Each time you run the file, you will need to type in the search keyword (e.g. “#mri13”) for Twitter. The program will then repeatedly retrieve 100 tweets containing that keyword from Twitter every 60 seconds (these two numbers can also be customized in the Java file), and put the new ones into MongoDB. Theoretically you can have Java instances running forever. (As Julian mentioned, there should be a better way to do this loop, for example using the Streaming API.)
If you run the Java file twice for two different Twitter archives, say “#mri13” and “#edtechchat”, two MongoDB collections will be created, one named after each keyword.
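To give a rough idea of the loop described above, here is a minimal sketch of polling on a fixed interval and keeping only tweets that have not been stored yet. It is not the code from the gist: the TweetSearch interface is a hypothetical stand-in for the real Twitter search call, and an in-memory map keyed by tweet id stands in for the MongoDB collection, so that the sketch runs without any dependencies.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Sketch of a poll-and-deduplicate loop; names and structure are assumptions. */
final class PollingArchiveSketch {

    /** Minimal stand-in for a tweet: only the fields this sketch needs. */
    static final class Tweet {
        final long id;
        final String text;
        Tweet(long id, String text) { this.id = id; this.text = text; }
    }

    /** Hypothetical stand-in for the Twitter search request. */
    interface TweetSearch {
        List<Tweet> search(String keyword, int maxResults);
    }

    // Stand-in for the MongoDB collection, keyed by tweet id so duplicates are rejected.
    final Map<Long, Tweet> archive = new ConcurrentHashMap<>();
    private final TweetSearch search;
    private final String keyword;
    private final int tweetsPerRequest;

    PollingArchiveSketch(TweetSearch search, String keyword, int tweetsPerRequest) {
        this.search = search;
        this.keyword = keyword;
        this.tweetsPerRequest = tweetsPerRequest;
    }

    /** Runs one search and stores only unseen tweets; returns how many were new. */
    int pollOnce() {
        int added = 0;
        for (Tweet tweet : search.search(keyword, tweetsPerRequest)) {
            if (archive.putIfAbsent(tweet.id, tweet) == null) {
                added++;
            }
        }
        return added;
    }

    /** Polls immediately, then again after each pause of intervalSeconds. */
    ScheduledExecutorService start(long intervalSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                pollOnce();
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all future runs, so log and carry on.
                System.err.println("Poll failed: " + e.getMessage());
            }
        }, 0, intervalSeconds, TimeUnit.SECONDS);
        return scheduler;
    }

    public static void main(String[] args) {
        // Fake search that always returns the same two tweets.
        TweetSearch fake = (kw, max) ->
                List.of(new Tweet(1L, "hello #mri13"), new Tweet(2L, "more #mri13"));
        PollingArchiveSketch archiver = new PollingArchiveSketch(fake, "#mri13", 100);
        System.out.println(archiver.pollOnce()); // prints 2: both tweets are new
        System.out.println(archiver.pollOnce()); // prints 0: both are already archived
    }
}
```

Calling start(60) would repeat the poll with a 60-second pause between runs, matching the two numbers mentioned above. Keying the store by tweet id is just one way to skip duplicates; with a real MongoDB collection, a unique index on the id field could play the same role.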
Retrieve tweets in R from MongoDB for analysis
After tweets are collected in MongoDB, querying data in R becomes very straightforward.
Check how many collections are in the database:
Do some simple queries in the #mri13 collection:
Build a query to find how many tweets were posted by me:
Retrieve all tweets in this collection/archive as a dataframe:
Start playing with another collection/archive named “#edtechchat”:
The R code above is also included in the gist.
This solution, which brings together MongoDB, Java, and R, seems to me a proof of concept of a reliable and scalable way to automatically archive and analyze tweets. Here, Java can be easily replaced by Python or R. It might evolve into a nice toolbox for hacking Twitter data. I believe it will be a lot of fun.