March 11, 2010

Tweeting Sentiment


I finally got my hands on some Twitter data from my collaborative partner and began text mining it. Building the RapidMiner model and then running it took all of 10 minutes. That’s the beauty of RapidMiner: you can build templates and have processes ready to go. Just add data!

But adding the data is usually the hardest and most time-consuming part of text mining, especially getting the right data in the right format! Since we’re working on a proof-of-concept model for now, my collaborative partner had to crawl Twitter, parse the tweets, and then hand-classify 1,500 Twitter posts into Positive, Neutral, and Negative labels! Whew!

Once I got the data, I built a 10-fold cross-validation model to train and test sentiment classification on the tweets and measure its accuracy. Then I identified the words most strongly correlated with each sentiment class. Our results are definitely promising: we achieved close to 80% classification accuracy and nailed all the correlated words. There were some issues with misclassification of positive sentiment as negative and vice versa, which we have to work on, but overall this is a great start.
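
The cross-validation step can be sketched in plain Python for readers who don't have RapidMiner handy. This is only an illustration, not the RapidMiner process described above. The naive Bayes learner is an assumption (the post doesn't say which learner was used), and the function names and the toy tweets are made up for the example. It splits the labeled tweets into k folds, trains on k-1 of them, tests on the held-out fold, and tallies a confusion count so that positive-as-negative mistakes show up explicitly.

```python
import math
import random
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and split on whitespace; a real pipeline would clean far more.
    return text.lower().split()


def train_naive_bayes(docs, labels):
    """Count words per class for a simple multinomial naive Bayes model."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def predict(model, text):
    """Return the class with the highest log-probability for this text."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for token in tokenize(text):
            # Add-one smoothing so an unseen word does not zero out a class.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab) + 1)
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validate(docs, labels, k=10, seed=0):
    """Run k-fold cross-validation; return (accuracy, confusion counts)."""
    indices = list(range(len(docs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    confusion = Counter()  # keys are (actual, predicted) pairs
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in indices if i not in held]
        model = train_naive_bayes(
            [docs[i] for i in train_idx], [labels[i] for i in train_idx]
        )
        for i in held_out:
            confusion[(labels[i], predict(model, docs[i]))] += 1
    correct = sum(n for (actual, pred), n in confusion.items() if actual == pred)
    return correct / len(docs), confusion


# Toy, invented data standing in for the hand-classified tweets.
toy = [
    ("love this great phone", "Positive"),
    ("awful terrible service", "Negative"),
    ("meeting at noon today", "Neutral"),
] * 10
docs = [text for text, _ in toy]
labels = [label for _, label in toy]

accuracy, confusion = cross_validate(docs, labels, k=10)
print("accuracy:", accuracy)
for (actual, predicted), count in sorted(confusion.items()):
    print(actual, "->", predicted, count)
```

The toy set is repetitive on purpose, so its accuracy says nothing about real tweets. The part worth reading on real data is the off-diagonal entries of the confusion counts, such as Positive predicted as Negative.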

We now know how to fine-tune the process and the data, and hopefully we can squeeze out more accuracy through parameter optimization and better crawled data.

Now it’s back to civil engineering for a while, unless you guys want to hire me full time. :)


