Monthly Archives: September 2010

Rapidminer 5.0 Video Tutorial #11 – Pattern Recognition & Landmarking Plugin

I’m back to making new videos again, at least for a little while! This new video showcases the Pattern Recognition & Landmarking plugin that was unveiled at RCOMM 2010.

This plugin is fantastic!  It analyzes your data, ranks the learners most likely to yield the highest accuracy, and then automatically constructs the process for you. It's so great because it helps answer one of my readers' most frequently asked questions: "which learner should I use for my data?"

[flashvideo file=wp-content/uploads/2010/09/Rapidminer5-Vid11.mp4 /]

The video is NOT uploaded to my Youtube channel because it's 13 minutes long.  Here's the HQ video tutorial #11.

What is the WhiBo plugin for Rapidminer?

Today’s guest post, about an awesome new plugin for Rapidminer, is from Milan Vukicevic.  Although I walked in at the very end of his presentation at RCOMM 2010, I sat down with Milan on my last day and he gave me a personal demo of WhiBo.  What I see for this plugin in the financial world is its ability to build algorithms on new data, find patterns, and tweak parameters in ways that were never possible before. Thanks Milan!

WhiBo is a RapidMiner plug-in for component-based design and performance testing of data mining algorithms. Users can design whole algorithms simply by connecting components. These components are building blocks that represent the crucial algorithmic steps that every algorithm of a certain type should have.

WhiBo has an interactive GUI for designing component-based algorithms, which can be built and saved for reuse with just a few clicks, without writing a single line of code. This way, data mining practitioners have more freedom to construct and rebuild algorithms that better adapt to concrete data.

In comparison with traditional algorithms, which can only be adjusted by parameter tuning, this approach offers far more significant possibilities for algorithm adjustment. A component repository for the design and testing of decision tree and partitioning clustering algorithms is provided, and it allows users to design algorithms that can outperform traditional, well-known algorithms. If needed, component-based design also allows simple extension of the repository and the definition of new generic algorithms (e.g. neural networks, SVMs, etc.). When combined with RapidMiner’s pre-processing and visualization operators, WhiBo becomes a powerful tool for pattern recognition and predictive analysis.
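To make the component idea concrete, here is a toy sketch in plain Python. Everything here is made up for illustration and is not WhiBo's actual API; it only shows the general pattern of treating a decision tree's split-evaluation step as a swappable component:

```python
# Hypothetical sketch of component-based algorithm design in the WhiBo
# spirit: the split-evaluation step of a decision tree is a swappable
# component rather than a hard-coded formula. All names are illustrative.
from collections import Counter
import math

def gini(labels):
    """Gini impurity component."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Information-entropy component."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_quality(labels, left_idx, impurity):
    """Impurity reduction of a binary split, parameterized by the impurity
    component -- swap gini for entropy without rewriting the tree logic."""
    left = [labels[i] for i in left_idx]
    right = [labels[i] for i in range(len(labels)) if i not in left_idx]
    n = len(labels)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(labels) - weighted

labels = ["a", "a", "b", "b"]
print(split_quality(labels, {0, 1}, gini))     # a clean split scores high
print(split_quality(labels, {0, 2}, entropy))  # an uninformative split scores 0
```

Swapping `gini` for `entropy` here is the one-click component exchange the plugin provides through its GUI, just without the GUI.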

For more information about WhiBo and the component-based approach to designing and applying data mining algorithms, feel free to contact me at milan.vukicevic *AT*, (remove *AT*). Installation instructions, detailed user and developer documentation, and a list of our publications can be found on

Using the SVM RBF Kernel

I’m happy to announce that today is the first of a two part guest post series. Today’s guest post is by Marin Matijas, who gave a presentation at RCOMM 2010 about Short Term Load Forecasting using Support Vector Machines (SVM). I asked Marin to elaborate a little about his use of the Radial Basis Function (RBF) in Rapidminer’s SVM operator and here’s what he had to say! I did edit the post a bit for readability.  Thanks Marin!

In my RCOMM 2010 presentation, titled “Application of Short Term Load Forecasting using Support Vector Machines in RapidMiner 5.0,” I showed how SVMs can be used to solve a volatile Load Forecasting problem.

Load Forecasting is an old problem; it is almost as old as modern stock exchange forecasting. I compare the two because both are time-series problems, which makes them similar (and also because we are all eagerly waiting for Tom’s videos with more insights on how to predict the financial markets).  The goal of Load Forecasting is to predict the exact values of an electricity (power) load in a given time interval. Typically the load for the day ahead is predicted on an hourly basis. Unlike predictions in the financial markets, where trend prediction is often more important than the ‘exact’ value, here the goal is to predict the (exact) value of the load itself.

Depending on the problem, the Mean Absolute Percentage Error (MAPE) varies, but it is typically between 1 and 10% for 24 intervals or more. Good precision can be obtained because load does not fluctuate much. We typically consume more in winter than in autumn, and more on Monday morning than on Sunday evening, but when averaged, electricity consumption follows certain patterns.

Since load is serial in nature, with patterns repeating on a known basis, windowing was used to take advantage of this property. Support Vector Machines were chosen for the regression because they gave better results than the previously used method. Compared to Artificial Neural Networks, SVMs are also much faster, an important characteristic with large datasets. One key choice for the SVM learner was the Radial Basis Function (RBF) kernel.  It was chosen for three main reasons, discussed below.
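For readers curious what windowing does mechanically, here is a minimal sketch in plain Python (the load values are fabricated; RapidMiner's windowing operator does this for you inside a process):

```python
# A minimal sketch of windowing an hourly load series into (features, label)
# pairs for a regression learner. Each window of past values becomes one
# training example; the label is the value to forecast.

def window(series, width, horizon=1):
    """Turn a series into examples: each row is `width` consecutive
    values, labeled with the value `horizon` steps after the window."""
    examples = []
    for i in range(len(series) - width - horizon + 1):
        features = series[i:i + width]
        label = series[i + width + horizon - 1]
        examples.append((features, label))
    return examples

hourly_load = [510, 495, 480, 470, 475, 520, 610, 700, 740, 735]  # MW, made up
rows = window(hourly_load, width=3)
print(rows[0])  # ([510, 495, 480], 470): the first 3 hours predict the 4th
```

The resulting rows are exactly what gets fed to the SVM learner with the RBF kernel in the forecasting process.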

The first reason is that it is good for non-linear problems. Looking at a typical graph of the electricity grid's daily load, one can easily see that Load Forecasting is a non-linear problem (see graph below).

The other kernel types, linear and sigmoid, may be used, but only under special conditions.  The second reason is that RBF has a gamma parameter, which makes optimizing the SVM in Rapidminer a simpler task.  The third reason is that RBF gave us better results (lower MAPE) than the other kernels, and it tends to be the standard kernel used in other research papers on Load Forecasting.
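The RBF kernel itself is just one line of math, k(x, y) = exp(-gamma * ||x - y||^2), and gamma controls how "local" the similarity is. A quick sketch (example values chosen only to show the effect):

```python
# The RBF kernel: k(x, y) = exp(-gamma * ||x - y||^2).
# Small gamma: distant points still look similar (smoother model).
# Large gamma: the kernel is very local (risk of overfitting).
import math

def rbf_kernel(x, y, gamma):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

x, y = [1.0, 2.0], [2.0, 3.0]          # squared distance = 2
print(rbf_kernel(x, y, gamma=0.1))     # ~0.82: far points still "similar"
print(rbf_kernel(x, y, gamma=10.0))    # ~2e-9: kernel is very local
```

Having a single shape parameter like this is why tuning the RBF kernel in an optimization loop stays manageable.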

This same kernel can be applied to a variety of other non-linear problems, e.g. forecasting options volatility, since a lot of problems are non-linear. The key takeaway when incorporating an RBF kernel in an SVM is how simple parameter and windowing optimization are in Rapidminer. I hope Tom will soon show in a video how simple it is to optimize parameters in RapidMiner, so you can create processes that utilize this powerful group of operators.



mmatijas *at*

Text Mining Annual Reports

I’m playing around with Rapidminer’s powerful text mining tools to dig through annual reports this evening and I’m making progress.  Rapidminer can text mine all sorts of formats but the operators are still a bit tough to use if you don’t know what you’re doing, like me!  Still, I did pick up a thing or two at RCOMM and I’m putting that to good use.

For tonight I decided to mine through the annual reports of $CSCO, $XOM, $INTC, $AMD, and $BP.  Granted, these stocks are in three different industry groups but I’m just poking around to see how they use buzz words like “sustainability” and “greenhouse.” It’s all rather fun and silly, but wait till I post about my Twitter mining experiment.  LOL.
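The buzzword tally itself is simple in spirit. Here is a back-of-the-envelope version in Python (the report text is a fabricated snippet; RapidMiner's text mining operators do this properly via tokenization and term-frequency vectors):

```python
# Toy version of the buzzword count: tokenize a report's text and tally
# how often each term of interest appears. Text below is invented.
import re

def buzzword_counts(text, buzzwords):
    tokens = re.findall(r"[a-z]+", text.lower())
    return {word: tokens.count(word) for word in buzzwords}

report = ("Our sustainability goals reduce greenhouse gas emissions. "
          "Sustainability is central to our strategy.")
print(buzzword_counts(report, ["sustainability", "greenhouse"]))
# {'sustainability': 2, 'greenhouse': 1}
```

Run that over each company's annual report text and you get the comparisons charted below.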

“Sustainability” buzzword

(Note: AMD never used it but BP did the most)

“Greenhouse” buzzword

(Note: AMD never used it but BP did the most)

The Whirlwind that was RCOMM – Part 2

Well, the jet lag finally caught up to me, so I apologize for this late post on RCOMM. Thursday morning was kicked off by yours truly, and I was deeply humbled that the Rapid-I team asked me to be one of their two invited speakers at RCOMM.

For my presentation I chose to talk about Forecasting Historical Volatility for Option Trading. The subject of this talk was the creation, or rather recreation, of a research paper that tried to predict the rise and fall of historical volatility and then utilize option volatility strategies to make a profit.  I created the Rapidminer model from this research paper back in 2007 after an astute NMT reader, who is also a full-time option trader, contacted me about collaborating on such an endeavor.  Long story short, we test-traded the model through the summer of 2007 and it seemed to be working fine until Bear Stearns blew up.  We both got busy with the financial mess that began unfolding before us, and the collaboration was put on indefinite hiatus.

When the Rapid-I team invited me to give a talk, I decided to talk about this experiment because it yielded some interesting results that perhaps the original researchers didn’t think about.  The first thing I did was recreate the model using the newer Time Series Forecasting plugin, including the volatility time period from 2005 to 2010 for the S&P500.  In doing so, I got results that differed from what the research paper predicted.  I drilled further into the details and retrained the model on two distinct time periods, 2005 to 2007 and 2007 to 2009, with the two showing very different results.  With the benefit of hindsight, I determined that in times of orderly/low volatility the historical volatility forecasting trend had greater than 60% accuracy.  In times of high volatility it was only slightly better than a coin flip.  It seems that this strategy for forecasting historical volatility does work, but only when the markets “behave.”
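The regime comparison boils down to computing directional accuracy separately per period. A sketch with fabricated predictions (the real numbers came from the retrained Rapidminer models, not this toy data):

```python
# Sketch of the regime comparison: directional accuracy (did volatility
# rise or fall as predicted?) computed separately for a calm period and
# a turbulent one. All values below are invented for illustration.

def directional_accuracy(predicted, actual):
    """Fraction of periods where the predicted direction matched reality."""
    hits = sum(1 for p, a in zip(predicted, actual) if p == a)
    return hits / len(actual)

# +1 = volatility up, -1 = volatility down
calm_pred     = [+1, -1, -1, +1, -1, +1, -1, +1, -1, -1]
calm_actual   = [+1, -1, -1, +1, +1, +1, -1, -1, -1, -1]
crisis_pred   = [+1, +1, -1, -1, +1, -1, +1, -1, +1, +1]
crisis_actual = [+1, -1, +1, -1, -1, -1, -1, -1, +1, -1]

print(directional_accuracy(calm_pred, calm_actual))      # 0.8 in calm markets
print(directional_accuracy(crisis_pred, crisis_actual))  # 0.5 in turmoil
```

The pattern in the fabricated numbers mirrors the finding above: well over a coin flip when markets are orderly, roughly a coin flip when they are not.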

Marin Matijas followed my talk with a similar time-series project, applying Short Term Load Forecasting using Support Vector Machines in RapidMiner 5.0. I gleaned some interesting insight from his talk about using SVMs to supercharge the option trading system from my own talk, but that’s for another time.  Check back next week for a guest post from Marin, where he details a bit more about using an RBF kernel in an SVM for his time series analysis.

Following Marin’s talk there was a short break during which we chatted, networked, and drank lots of coffee. Then we began the next set of talks, about how data analysis in Rapidminer can be improved.  Alexander Arimond presented on Distributed Pattern Recognition in Data Mining, then Marco Stolpe showed how stream mining can be integrated into Rapidminer (this was really amazing) in his Implementing Hierarchical Heavy Hitters in RapidMiner talk, and lastly for the morning we heard from Olaf Laber of Ingres Vectorwise about how the way databases use memory is about to change forever.

Sounds like a lot for one day, doesn’t it? Well, that was just the morning!  We kicked off the afternoon with two workshops that included the unveiling of the “R” plugin by Sebastian Land and a tour of RapidAnalytics by Simon Fischer.  The rumor is that RapidAnalytics will be released as open source soon.  If that’s true, I’ll be installing it on the NMT server and pulling down lots of daily financial data!

Closing out RCOMM 2010 were two amazing text mining presentations.  I realized that we are on the cusp of something amazing in text mining as I listened intently to Timur Fayruzov’s talk about using the Rapidminer Framework for Protein Interaction Extraction.  Timur unveiled a working system that helps researchers, doctors, and other medical practitioners find protein interactions by text mining research papers.  WOW.  If that didn’t blow me away, Felix Jungermann’s talk about the creation of a new plugin for Information Extraction did.  Under development is a new text-mining-related plugin that attempts to extract information, not just data, from text.  This plugin will be a quantum leap for text mining in Rapidminer for sure, and I’ll be checking for it regularly on the Rapid-I site.

The Whirlwind that was RCOMM – Part 1

Incorporating and expanding on my first RCOMM 2010 post, I’m going to write about the various presentations that I found highly interesting and applicable to financial data mining.  I walked in on Milan Vukicevic, who gave a talk about an upcoming plugin release called WhiBo. Unfortunately, I walked in toward the end of the talk and only caught the Q&A.  Still, I was able to catch up with him on the last day of RCOMM to discuss his application.  Ingo from Rapid-I describes it best: it’s like a mini Rapidminer inside Rapidminer!  Essentially, WhiBo works within the Decision Tree modelers and helps the user fine-tune the splitting parameters.  It also enhances the modelers by detecting better splitting algorithms for your particular data set.

Right after Milan’s talk we had another great one, about Landmarking for Meta Learning, by Sarah Abdelmessih.  This talk was considered a continuation of the PaREN talk about pattern recognition that I missed earlier in the day.   I found Sarah’s discussion of determining the right learner for your particular data set to be very useful.  Why? My readers often ask me: would an SVM learner be better for this data set? Or is k-NN better?  Often it’s a combination of learners, not just one, that gives you the better answer!  The end result is a ranking system of learners for a given data set!  I can’t wait for the PaREN plugin to come out. Man, so many cool things were going on in those few short hours!
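The landmarking idea itself can be illustrated in a few lines: run a few cheap "landmark" learners on the data and use their scores to characterize it and rank candidates. The learners below are deliberately trivial stand-ins, not the ones PaREN actually uses:

```python
# Toy illustration of landmarking: score the data with simple landmark
# learners, then rank them. The data and learners are invented examples.
from collections import Counter

def majority_class_accuracy(labels):
    """Landmark 1: accuracy of always guessing the majority class."""
    most_common = Counter(labels).most_common(1)[0][1]
    return most_common / len(labels)

def one_rule_accuracy(feature, labels):
    """Landmark 2: best single-feature rule (one guess per feature value)."""
    by_value = {}
    for v, y in zip(feature, labels):
        by_value.setdefault(v, []).append(y)
    hits = sum(Counter(ys).most_common(1)[0][1] for ys in by_value.values())
    return hits / len(labels)

feature = ["x", "x", "y", "y", "y", "x"]
labels  = ["a", "a", "b", "b", "a", "a"]
scores = {
    "majority": majority_class_accuracy(labels),
    "one-rule": one_rule_accuracy(feature, labels),
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)  # one-rule beats the majority-class baseline here
```

In real landmarking, those cheap scores become meta-features that predict which full-strength learner (SVM, k-NN, trees, or a combination) will work best on your data set.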

We closed out the day with a workshop by Tobias Malbrecht, a Rapid-I team member, about using the Reporting operators in Rapidminer, followed by the now famous “Who wants to be a Data Miner?” game show.  I think the game show was the funniest thing I’ve seen in a long time!   Contestants pitted themselves against veteran Rapid-I developers, with the surprise of the evening coming at the end. Contestant Matko Bošnjak, from Croatia, finished surprisingly strong after only “picking up” Rapidminer 3 months ago. Not even the veteran Rapid-I guys could finish in the five minutes given, and Matko took home the prize.  I believe he said that he learned how to use Rapidminer from watching my tutorials.

Dinner followed at a nice local establishment only a few hundred meters from the University.  We ate, drank, and chatted the night away.  I met up with Milan, Matko, Ralf Klinkenberg, Ingo & Nadja Mierswa, Markus Hoffman, and Marin Matijas.  Marin was presenting the next day about forecasting electrical load demand using SVMs.  Although our talks differed in subject, we both applied the time series forecasting plugin for Rapidminer and had LOTS to talk about that night and the next, but I’ll leave those adventures for tomorrow.

Sitting in Frankfurt

Yes, I’m about to board my plane back to the USA so this post will have to be a bit short.  I do owe you guys a long series of posts (and new videos) about my time in Dortmund with the Rapid-I team at RCOMM 2010, which will start after I survive the jet lag again!

What I can say is that I was amazed by the papers presented by the many RCOMM 2010 speakers. All of them are leveraging the power of Rapidminer in ways that I never dreamed of!  BUT! That’s not the best part!   The best part of this trip was meeting some amazingly intelligent and dynamic people from all over the world and making new friends.

Ingo posted his Day 1 and Day 2 review of RCOMM 2010 but here’s yours truly in action!

RCOMM 2010 – Having A Blast!

Wow, RCOMM 2010 is so much fun! After an exhausting flight to Frankfurt, I made it to RCOMM 2010 late Tuesday afternoon.  I got to listen to two great talks so far and watch a hilarious game show, “Who wants to be a data miner.”

The Rapid-I team has really done a great job of hosting this event, and it’s amazing to hear how people are using Rapidminer to solve complex tasks and make everyday life better.

After the game show we all went down for dinner at the Krautergarten and had some great food, drink, and of course conversation.  I’ve made lots of new friends and went to bed very late.  Now the trick is to be awake, on time, and coherent for my presentation. lol.

(from the game show event)

(the after RCOMM 2010 dinner)

(ofc I have to enjoy a good German beer, or three)

My Experiment Community

I recently installed the new MyExperiment Community plugin for Rapidminer after it was first suggested on my forums by a poster/reader (hat tip to Ronmac).  I’m glad I did because this plugin enables me to access, upload, and download Rapidminer workflows / processes that users share as part of the community.

The plugin lets you do all of this from within Rapidminer, and there are currently about 50 processes, ranging from Image Mining to Text Mining, available for you to download! Really great stuff, and I wonder why I didn’t install this plugin sooner!