Use RapidMiner to Discover Twitter Content

Welcome to this new tutorial on how to use RapidMiner to discover Twitter content. I created this process as a way to monitor what’s going on in the Twitter universe and see what topics are being tweeted about. Could I do this in Python scripts? Yes, but that would be a big waste of time for me. RapidMiner makes complex ETL and other tasks simple, so I live and breathe it.

Why this process?

Back when I was in Product Marketing, I had to come up with many different blog posts and ‘collateral’ to help push the RapidMiner cause. I monitored what was happening on KDnuggets, Data Science Central, and of course Twitter. I thought it would be fun to extract key terms and subjects from Twitter (and later websites) to see what’s currently popular and help make a ‘bigger splash’ when we publish something new.

I’ve since applied this model to my new website Yeast Head to see what beer brewing lifestyle bloggers are posting about. The short version is that the terms ‘#recipies’ and ‘#homebrew_#recipes’ are most popular, so I need to make sure to include some recipes going forward. Interestingly enough, there are a lot of retweets with respect to the Homebrewer’s Association, so I’ll be exploiting that for sure.

The Process Design

This process utilizes RapidMiner’s text processing extension, X-means clustering, association rules, and a bunch of averaged attribute weighting schemes. Since I’m not yet scoring incoming tweets to see whether new ones are important or not (that will be a later task), I didn’t do any classification analysis. I did create a temporary label called “Important/Not Important” based on a simple rule: if Retweets > 10, then the tweet has to be important.

This is a problem because I don’t know what the actual retweet threshold is for important (aka viral) tweets, so my attribute weight chart (as above) will be a bit suspect, but it’s a start I suppose.
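For readers who think in code, the temporary labeling rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the RapidMiner process itself; the `Tweet` class, the `label_tweet` function, and the sample tweets are all made up for the example, and the cutoff of 10 is the same unvalidated guess discussed above.

```python
from dataclasses import dataclass

# The post's rule of thumb; an unvalidated guess, not a proven cutoff.
RETWEET_THRESHOLD = 10


@dataclass
class Tweet:
    """Hypothetical minimal tweet record used only for this sketch."""
    text: str
    retweets: int


def label_tweet(tweet: Tweet, threshold: int = RETWEET_THRESHOLD) -> str:
    """Apply the simple rule: strictly more retweets than the threshold means Important."""
    return "Important" if tweet.retweets > threshold else "Not Important"


# Made-up sample data for illustration.
sample = [
    Tweet("Intro to #machinelearning", 3),
    Tweet("New #ai benchmark results", 42),
    Tweet("A #datascience reading list", 10),
]
labels = [label_tweet(t) for t in sample]
```

Note that a tweet with exactly 10 retweets lands in “Not Important”, because the rule is strictly greater than 10.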

The Process

For the process I’m sharing here, I used a Macro to set the search terms to #machinelearning, #datascience, and #ai. When you run this process over and over, you’ll see some interesting Tweeters emerge.

Next Steps

My next steps are to figure out the actual retweet count that truly indicates whether a tweet is important and viral or not. I might write a one-class auto-labeling process or just hand label some important and non-important tweets. That will hone the process and let me really figure out what the best number to watch is.
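One possible starting point, before any hand labeling, is to look at how retweet counts are distributed and pick a cutoff from a high percentile instead of a fixed number. The sketch below is just one heuristic under that assumption, not the one-class approach mentioned above; the function name `retweet_cutoff`, the choice of the 90th percentile, and the sample counts are all hypothetical.

```python
def retweet_cutoff(retweet_counts, percentile=90):
    """Return the nearest-rank percentile of the retweet counts.

    Tweets with strictly more retweets than this value would be
    treated as candidates for the Important label.
    """
    if not retweet_counts:
        raise ValueError("need at least one retweet count")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(retweet_counts)
    # Nearest-rank method: rank = ceil(percentile / 100 * n), in integer math.
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Made-up retweet counts for illustration only.
counts = [0, 1, 2, 3, 5, 8, 13, 40, 120, 900]
cutoff = retweet_cutoff(counts, percentile=90)
viral_candidates = [c for c in counts if c > cutoff]
```

A percentile adapts to each search term’s typical retweet volume, but it still only flags the top slice of whatever was collected, so hand-labeled examples would be needed to check whether that slice is actually “important”.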