Use RapidMiner to Auto Label a Twitter Training Set

rapidminer, d3js, radoop

I’ve been struggling with how to separate the signal from noise in Twitter data. There’s great content and sentiment there but it’s buried by nonsense tweets and noise. How do you find that true signal within the noise?

This question racked my brain until I solved it with a one-class SVM application in RapidMiner.

Autolabeling a Training Set

If you read and used my RapidMiner Twitter Content process from here, you would’ve noted that the process didn’t include any labels. The labels were something “left to do” at the end of the tutorial, and I spent a few days thinking about how to go about it. My first method was to label tweets based on Retweets, and the second was to label tweets based on binning. Both methods are easy, but they didn’t solve the problem at hand. The solution? A One Class SVM model.

Labeling based on Retweets

With this method I created a label with two classes, “Important” and “Not Important”, based on the number of retweets a post had. This was the simplest way to cut the training set into two classes, but I had to choose an arbitrary Retweet value. What was the right number of Retweets? If you look at the tweets surrounding #machinelearning, #ai, and #datascience, you’ll notice that a large share of the retweets trace back to a small handful of “Twitterati”. Not to pick on @KirkDBorne, but when he Tweets something, bots and people Retweet it like crazy.

A large percentage of the tweets he sends link back to content that’s been posted or generated elsewhere. He happens to have a large following that Retweets his stuff like crazy. His Retweets can range in the hundreds, so does this mean those Tweets are “Important” or just a lot of noise? If some Tweet only has 10 Retweets but it’s a great link, does that mean it’s “Not Important”? So what’s the right number of retweets? One? Ten? One hundred? There was no good answer here because I didn’t know what the right number was.
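
To make the problem concrete, here is a minimal Python sketch of threshold labeling. It is not part of the RapidMiner process; the sample tweets, the `label_by_retweets` helper, and the three thresholds are all made up for illustration.

```python
# Hypothetical sample of (tweet text, retweet count); not data from the post.
tweets = [
    ("Big account links to a generic listicle", 250),
    ("Great tutorial on one-class SVMs", 10),
    ("Buy followers now!!!", 0),
    ("New paper on text classification", 45),
]


def label_by_retweets(tweets, threshold):
    """Label a tweet Important when its retweet count meets the threshold."""
    return [
        (text, "Important" if retweets >= threshold else "Not Important")
        for text, retweets in tweets
    ]


# Try a few arbitrary cutoffs and see how the class sizes move.
for threshold in (1, 10, 100):
    labels = label_by_retweets(tweets, threshold)
    important = sum(1 for _, label in labels if label == "Important")
    print(f"threshold={threshold}: {important} of {len(labels)} Important")
```

The labels depend entirely on the cutoff, and at a cutoff of 100 the tutorial with 10 Retweets lands in “Not Important” while the listicle stays “Important”.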

Labeling based on Binning

My next thought was to bin the tweets based on their Retweets into two buckets. Bucket one would be “Not Important” and bucket two would be “Important.” When I did this, I started getting a distribution that looked better. It wasn’t until I examined the buckets that I realized this method glossed over a lot of good tweets.

In essence I was repeating the same mistakes as labeling based on Retweets. So if I trained a model on this, I’d still get shit.
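
A small Python sketch shows why. It assumes equal-width binning into two buckets, which may not be the exact binning I used in RapidMiner; the `bin_by_retweets` helper and the retweet counts are hypothetical.

```python
def bin_by_retweets(counts, n_bins=2):
    """Equal-width binning: split [min, max] into n_bins ranges of equal size."""
    low, high = min(counts), max(counts)
    width = (high - low) / n_bins or 1  # fall back to 1 when all counts are equal
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((count - low) / width), n_bins - 1) for count in counts]


# Hypothetical retweet counts; one heavily retweeted post stretches the range.
retweet_counts = [0, 2, 3, 10, 12, 250]
bins = bin_by_retweets(retweet_counts)

names = {0: "Not Important", 1: "Important"}
for count, bucket in zip(retweet_counts, bins):
    print(count, names[bucket])
```

With these numbers the cut falls at 125 Retweets, so everything except the single viral post ends up in “Not Important”. It is the same arbitrary threshold as before, only now the data picks it.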

Labeling based on a One Class SVM

I realized after trying the above two methods that there was no easy way to do it. I wanted to find a lazy way of autolabeling, but I soon came back to what is important: the training set.

The power and accuracy of any classification model depends on how good its training set is. Never overlook this!

The solution was to use a One Class SVM process in RapidMiner. I would get a set of 100 to 200 Tweets, read through them, and then ONLY label the “Important” ones. What were the “Important” ones? Any Tweet that I thought was interesting to me and my followers.

After I marked the Important Tweets, I imported that data set into RapidMiner and built my process. The process is simple.

The top process branch loads the hand-labeled data, does some Text Processing on it, and feeds it into an SVM set to one-class mode. Now the following is important!

To use a One Class SVM in RapidMiner, you have to train it on only one class, “Important” being that class. When you apply the model to out-of-sample (OOS) data, it will generate an “inside” or “outside” prediction with confidence values. These values show how far the new data point sits inside the “Important” class (meaning it’s Important) or outside of that class. I end up renaming the “inside” and “outside” predictions to “Important” and “Not Important”.

The bottom process takes the OOS data, text processes it, and applies the model for prediction. At the end I do some cleanup where I merge the Tweets together so I can feed them into my Twitter Content model, find my Important words, and actually build a classification model!
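
For readers who don’t have RapidMiner handy, the two branches can be sketched in Python with scikit-learn. This is a rough analogue under stated assumptions, not an export of my process: the tweets are invented, TF-IDF stands in for my Text Processing step, and scikit-learn’s `OneClassSVM` returns +1 and -1 where RapidMiner reports “inside” and “outside”.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Hypothetical hand-labeled tweets: only the Important class is present.
important_tweets = [
    "New tutorial on one-class SVM for text classification",
    "Text mining tweets with machine learning, full process included",
    "How to build a training set for a classification model",
    "Feature selection tips for machine learning on text data",
]

# Hypothetical out-of-sample (OOS) tweets that still need labels.
oos_tweets = [
    "Machine learning tutorial on labeling a training set",
    "Follow me for free followers and giveaways",
    "Classification model for text data with SVM",
]

# Top branch: text processing, then train on the single class.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X_train = vectorizer.fit_transform(important_tweets)
model = OneClassSVM(kernel="rbf", gamma=0.001)  # same starting gamma as the post
model.fit(X_train)

# Bottom branch: same text processing on the OOS data, then apply the model.
X_oos = vectorizer.transform(oos_tweets)
predictions = model.predict(X_oos)        # +1 = inside the class, -1 = outside
scores = model.decision_function(X_oos)   # signed score relative to the boundary

# Rename inside/outside to the labels used in this post.
label_names = {1: "Important", -1: "Not Important"}
labeled = [
    (tweet, label_names[int(pred)], float(score))
    for tweet, pred, score in zip(oos_tweets, predictions, scores)
]

for tweet, label, score in labeled:
    print(f"{label:>13}  {score:+.4f}  {tweet}")
```

The key point carries over: the vectorizer and the model are fit only on the Important tweets, and the OOS data is pushed through that same fitted vectorizer before prediction. With a sample this small the labels themselves mean little.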

Within a few seconds, I had an autolabeled data set! YAH!


While this process is a GREAT first start, there is more work to do. For example, I selected an RBF kernel and a gamma of 0.001 as a starting point. This was a guess, and I need to put together an optimization process to help me figure out the right parameters to use to get a better autolabeling model. I’m also interested in using @mschmitz_’s LIME operator to help me understand the potential outliers when using this autolabeling method.
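
One way such an optimization could look, sketched in Python with scikit-learn rather than in RapidMiner: sweep a few gamma values and score each model against a small validation set that has been hand-checked for both classes. The `sweep_gamma` helper, the validation tweets, and the gamma grid are all hypothetical, and plain accuracy is just one possible score.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Hypothetical Important-only training tweets.
train_important = [
    "New tutorial on one-class SVM for text classification",
    "Text mining tweets with machine learning, full process included",
    "How to build a training set for a classification model",
    "Feature selection tips for machine learning on text data",
]

# Hypothetical validation tweets, hand-checked as (text, is_important).
validation = [
    ("Machine learning tutorial on text classification", True),
    ("Building a training set for an SVM model", True),
    ("Follow me for free followers and giveaways", False),
    ("Win a free phone, click this link now", False),
]


def sweep_gamma(train_texts, validation, gammas):
    """Return validation accuracy for each gamma value tried."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X_train = vectorizer.fit_transform(train_texts)
    X_val = vectorizer.transform([text for text, _ in validation])
    truth = [is_important for _, is_important in validation]

    results = {}
    for gamma in gammas:
        model = OneClassSVM(kernel="rbf", gamma=gamma).fit(X_train)
        predicted = [bool(pred == 1) for pred in model.predict(X_val)]
        correct = sum(p == t for p, t in zip(predicted, truth))
        results[gamma] = correct / len(truth)
    return results


results = sweep_gamma(train_important, validation, [0.001, 0.01, 0.1, 1.0])
for gamma, accuracy in results.items():
    print(f"gamma={gamma}: accuracy={accuracy:.2f}")
```

The catch is that this needs a handful of tweets labeled in both directions, which is a bit more hand labeling than the one-class training set itself requires.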

The Process

As I noted above, this process is a “work in progress”, so use it with caution. It’s a great blueprint because applying One Class SVMs in RapidMiner is easy but sometimes confusing.