I finally got my hands on some Twitter data from my collaborative partner and began text mining it. Building the RapidMiner model and then running it took all of 10 minutes. That’s the beauty of the RapidMiner system: you can build templates and have processes ready to go! Just add data!
But adding the data is usually the hardest and most time-consuming part of text mining, especially getting the right data in the right format! Since we’re working on a proof-of-concept model for now, my collaborative partner had to crawl Twitter, parse the tweets, and then hand-classify 1,500 Twitter posts into Positive, Neutral, and Negative labels! Whew!
Once I got the data, I built a 10-fold cross-validation model to train and test sentiment classification on the tweets and measure its accuracy. Then I identified the words most strongly correlated with each sentiment class. Our results are definitely promising: we achieved nearly 80% classification accuracy and nailed all the correlated words. There were some issues with misclassification of positive sentiment as negative and vice versa, which we have to work on, but overall this is a great start.
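For anyone who wants to see what 10-fold cross-validation is doing under the hood, here is a minimal Python sketch. To be clear, this is not the RapidMiner process itself. The Naive Bayes learner is my own stand-in (any classifier could be swapped in), the function names are made up for illustration, and the toy tweets are invented rather than taken from our data. Each fold is held out once, a model is trained on the other nine, and the (actual, predicted) pairs are tallied so that positive/negative mix-ups show up directly.

```python
import math
import random
from collections import Counter, defaultdict


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def train_nb(docs, labels):
    """Count words per label for a simple multinomial Naive Bayes model."""
    word_counts = defaultdict(Counter)
    label_counts = Counter(labels)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict_nb(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in text.lower().split():
            if tok in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validate(docs, labels, k=10, seed=0):
    """Return overall accuracy and a Counter of (actual, predicted) pairs."""
    confusion = Counter()
    for held_out in k_fold_indices(len(docs), k, seed):
        held = set(held_out)
        train_docs = [d for i, d in enumerate(docs) if i not in held]
        train_labels = [l for i, l in enumerate(labels) if i not in held]
        model = train_nb(train_docs, train_labels)
        for i in held_out:
            confusion[(labels[i], predict_nb(model, docs[i]))] += 1
    correct = sum(c for (actual, pred), c in confusion.items() if actual == pred)
    return correct / len(docs), confusion


# Toy, made-up tweets purely for illustration (not our hand-labeled set).
toy_tweets = ["love this phone", "great service today",
              "hate the delay", "terrible support again"] * 5
toy_labels = ["Positive", "Positive", "Negative", "Negative"] * 5
accuracy, confusion = cross_validate(toy_tweets, toy_labels, k=10)
print(accuracy)
print(confusion)
```

Entries in the confusion tally such as ("Positive", "Negative") are the kind of misclassification mentioned above, so this is the first place to look when tuning.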
We now know how to fine-tune the process and the data, and hopefully we can squeeze out more accuracy through a combination of parameter optimization and better crawled data.
Now it's back to civil engineering for a while, unless you guys want to hire me full time. :)