Predicting Historical Volatility for the S&P500

Predicting historical volatility is easy with RapidMiner. The attached process recreates a research paper ("Options trading driven by volatility directional accuracy") on how to predict historical volatility (HV) for the S&P 500. The idea is to predict the HV five trading days ahead, from Friday to Friday, and then compare it with the Implied Volatility (IV) of the S&P 500. If the directions of HV and IV converge or diverge, you would execute a specific type of option trade.

I did take some liberties with the research paper. At first I used a Neural Net algorithm to train on the data and got greater than 50% directional accuracy. When I switched to an SVM with an RBF kernel, I got it over 60%. Then, when I added optimization for the training and testing windows and the gamma and C parameters, I managed to get this over 70%.
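The actual process is built in RapidMiner, but the same idea can be sketched in Python. This is a rough stand-in, not the author's process: it uses synthetic prices, an illustrative feature set (the last five HV readings), and a small, assumed parameter grid, with scikit-learn standing in for the RapidMiner operators.

```python
# Sketch: predict the direction of 5-day historical volatility (HV) with an
# RBF-kernel SVM, tuning C and gamma over walk-forward splits.
# NOT the author's RapidMiner process; prices and features are illustrative.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
closes = 2000 * np.exp(np.cumsum(rng.normal(0, 0.01, 600)))  # synthetic S&P-like prices

rets = np.diff(np.log(closes))
# 5-day rolling historical volatility, annualized
hv = np.array([rets[i - 5:i].std() * np.sqrt(252) for i in range(5, len(rets))])

# Features: last 5 HV readings; label: does HV rise 5 trading days ahead?
X = np.array([hv[i - 5:i] for i in range(5, len(hv) - 5)])
y = (hv[10:] > hv[5:-5]).astype(int)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(
    model,
    {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]},  # assumed grid
    cv=TimeSeriesSplit(n_splits=5),  # walk-forward splits avoid look-ahead
)
grid.fit(X, y)
print("cross-validated directional accuracy:", grid.best_score_)
```

`TimeSeriesSplit` plays a similar role to the sliding training/testing windows the post optimizes: each fold trains only on data that precedes its test window.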

I did test this "live" by paper trading it and managed to be right 7 out of 10 times. I did not execute any actual trades.

The data file is here: ^GSPC

Use RapidMiner to Auto Label a Twitter Training Set

I’ve been struggling with how to separate the signal from noise in Twitter data. There’s great content and sentiment there but it’s buried by nonsense tweets and noise. How do you find that true signal within the noise?

This question racked my brain until I solved it with a one-class SVM application in RapidMiner.

Autolabeling a Training Set

If you read and used my RapidMiner Twitter Content process from here, you would've noted that the process didn't include any labels. The labels were "something left to do" at the end of the tutorial, and I spent a few days thinking about how to go about it. My first method was to label tweets based on Retweets, and the second was to label tweets based on Binning. Both of these methods are easy, but they didn't solve the problem at hand. The solution? A One Class SVM model.

Labeling based on Retweets

With this method I created a label class of "Important" and "Not Important" based on the number of retweets a post had. This was the simplest way to cut the training set into two classes, but I had to choose an arbitrary Retweet value. What was the right number of Retweets? If you look at the tweets surrounding #machinelearning, #ai, and #datascience, you'll notice that a large number of retweets come from a small handful of 'Twitterati'. Not to pick on @KirkDBorne, but when he Tweets something, bots and people Retweet it like crazy.

A large percentage of the tweets he sends link back to content that's been posted or generated elsewhere. He happens to have a large following that Retweets his stuff like crazy. His Retweets can range in the 100s, so does that mean those Tweets are 'Important', or just a lot of noise? If some Tweet has only 10 Retweets but it's a great link, does that mean it's 'Not Important'? So what's the right number of retweets? One? Ten? One hundred? There was no good answer here because I didn't know what the right number was.

Labeling based on Binning

My next thought was to bin the tweets based on their Retweets into two buckets. Bucket one would be "Not Important" and bucket two would be "Important." When I did this, I started getting a distribution that looked better. It wasn't until I examined the buckets that I realized this method glossed over a lot of good tweets.
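The binning idea can be sketched in a few lines. This is a hypothetical illustration, not the post's RapidMiner process; the tweets, column names, and bin count are made up, with pandas standing in for RapidMiner's discretization operator.

```python
# Hypothetical sketch of binning tweets into two buckets by retweet count.
# Data and column names are illustrative only.
import pandas as pd

tweets = pd.DataFrame({
    "text": ["great ML paper", "lunch pic", "new dataset", "meme", "tutorial"],
    "retweets": [120, 2, 45, 1, 30],
})

# Two equal-frequency bins over retweet counts; the lower bucket becomes
# "Not Important" and the upper one "Important"
tweets["label"] = pd.qcut(tweets["retweets"], q=2,
                          labels=["Not Important", "Important"])
print(tweets)
```

The problem the post identifies shows up immediately: the bucket boundary is still just a retweet count, so a great tweet with few retweets lands in "Not Important" regardless of its content.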

In essence I was repeating the same mistakes as labeling based on Retweets. So if I trained a model on this, I’d still get shit.

Labeling based on a One Class SVM

I realized after trying the above two methods that there was no easy way to do it. I wanted to find a lazy way of autolabeling but soon came back to what really matters: the training set.

The power and accuracy of any classification model depends on how good its training set is. Never overlook this!

The solution was to use a One Class SVM process in RapidMiner. I would get a set of 100 to 200 Tweets, read through them, and then ONLY label the 'Important' ones. What counted as 'Important'? Any Tweet that I thought was interesting to me and my followers.

After I marked the Important Tweets, I imported that data set into RapidMiner and built my process. The process is simple.

The top process branch loads the hand-labeled data, does some Text Processing on it, and feeds it into an SVM set with a One-Class kernel. Now the following is important!

To use a One Class SVM in RapidMiner, you have to train it on only one class, 'Important' being that class. When you apply the model to out-of-sample (OOS) data, it generates 'inside' and 'outside' predictions with confidence values. These values show how close each new data point is to the inside of the 'Important' class (meaning it's Important) or the outside of it. I end up renaming the 'inside' and 'outside' predictions to 'Important' and 'Not Important'.
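The same inside/outside mechanic can be sketched outside RapidMiner. Below is a minimal scikit-learn version, assuming TF-IDF features; the example tweets are invented and the parameters mirror the post's starting guess, so treat it as a blueprint rather than the actual process.

```python
# Sketch: train a one-class SVM on ONLY the hand-labeled "Important" tweets,
# then score unlabeled tweets as inside (+1) or outside (-1) that class.
# Tweets and parameters are illustrative; scikit-learn stands in for RapidMiner.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM
from sklearn.pipeline import make_pipeline

important = [
    "new open source machine learning library released",
    "great tutorial on gradient boosting",
    "benchmark dataset for NLP models published",
]
unlabeled = [
    "deep learning paper with code and dataset",
    "what I had for lunch today lol",
]

model = make_pipeline(
    TfidfVectorizer(),
    OneClassSVM(kernel="rbf", gamma=0.001, nu=0.1),  # the post's starting gamma
)
model.fit(important)  # one class only: no negative examples needed

# predict() returns +1 ("inside" -> Important) or -1 ("outside" -> Not Important)
labels = ["Important" if p == 1 else "Not Important"
          for p in model.predict(unlabeled)]
print(labels)
```

Renaming +1/-1 to 'Important'/'Not Important' here is the same cleanup step the post describes doing on the 'inside'/'outside' predictions.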

The bottom process takes the OOS data, text processes it, and applies the model for prediction. At the end I do some cleanup where I merge the Tweets back together so I can feed them into my Twitter Content model, find my Important words, and actually build a classification model now!

Within a few seconds, I had an autolabeled data set! YAH!


While this process is a GREAT first start, there is more work to do. For example, I selected an RBF kernel and a gamma of 0.001 as a starting point. This was a guess, and I need to put together an optimization process to help me figure out the right parameters for a better autolabeling model. I'm also interested in using @mschmitz_'s LIME operator to help me understand the potential outliers when using this autolabeling method.
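One possible shape for that optimization, sketched under heavy assumptions: with only positive examples, you can hold some 'Important' tweets out and prefer parameter settings that keep most of the held-out positives "inside". The data, grid, and scoring rule below are all illustrative, not the post's method.

```python
# Hedged sketch of tuning a one-class SVM with positives only: hold out some
# "Important" tweets and pick (gamma, nu) that keeps most of them inside.
# Data and grid are made up; this is one heuristic, not the post's process.
import itertools
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

important = [
    "new machine learning library released",
    "great tutorial on gradient boosting",
    "benchmark dataset for NLP published",
    "paper with code on transformers",
    "open source data science toolkit",
    "course on deep learning fundamentals",
]
X = TfidfVectorizer().fit_transform(important)
train, held_out = X[:4], X[4:]

best = max(
    ((g, n, (OneClassSVM(gamma=g, nu=n).fit(train).predict(held_out) == 1).mean())
     for g, n in itertools.product([0.001, 0.01, 0.1], [0.05, 0.1, 0.5])),
    key=lambda t: t[2],
)
print("gamma=%s nu=%s held-out recall=%.2f" % best)
```

Held-out recall alone favors the loosest possible boundary, so in practice you would also want to check how much obvious noise each setting lets inside before calling a parameter choice "better".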

The Process

As I noted above, this process is a 'work in progress', so use it with caution. It's a great blueprint, because applying One Class SVMs in RapidMiner is easy but sometimes confusing.

What works; What Doesn’t Work

The most important lesson I’ve learned while working at a Startup is to do more of what works and jettison what doesn’t work, quickly. That’s the way to success, the rest is just noise and a waste of time. This lesson can be applied to everything in life.

Data is your friend

We generate data all the time. Whether or not it's captured in a database or a spreadsheet, just by being alive you throw off data points. The trick is to take notice of it, capture it, and then do something with it.

It's the "do something with it" that matters to your success. Your success can be anything that is of value to you: time, money, weight loss, stock trading, whatever. You just need to start capturing data, evaluating it, and taking action on it.

This is where you fail

Many people fail by taking no action on the data they've captured and evaluated. They hope that things are going to get better or that things are going to change. Maybe they will, maybe they won't, but you must act on what the data is telling you now.


My Examples, what Works/Doesn’t Work

  1. My $100 Forex experiment worked really well for a time, then it started to flag. The data was telling me that my trading method was no longer working. Did I listen? Nope. I blew up that account. This didn’t work for me.
  2. Writing RapidMiner tutorials on this blog ended up getting me a great job with RapidMiner, which led to an amazing career in Data Science. Writing and taking an interest in things works.
  3. Day trading doesn’t work for me. I blow up all the time. What works for me is swing and trend trading. Do more of that and no day trading.

Keep it simple, stupid

The other thing I've learned working at a startup is to keep things simple and stupid. You're running so fast trying to make your quarter that you have no time for complex processes. Strip things down to their minimum and go as light as you can. That way you can adjust your strategy and make changes quickly: do more of what works and jettison what doesn't.