Modeling Robust Data
I mentioned in my supervised learning post that your data model will only be as good as your training data. Too often, and I’ve been guilty of this as well, data modelers throw all kinds of input variables together into a training data set, figuring the neural net learner needs them all. They click “run”, the learner magically creates a model, and the prediction sets start spitting out predictions.
Let me ask you this, Mr./Ms. Modeler: how confident are you that the model itself is significant? Would you be willing to bet your job on it? After all, we’ve all heard the saying, “garbage in, garbage out.” How do we separate the good data from the garbage data, or prevent modeling garbage data in the first place?
To prevent embarrassment, or the loss of your job, there are two main statistical measures that you can use to check your training data for significance. I do this almost every time before I run a neural net learner, because throwing in unnecessary input variables slows down your analysis and takes up valuable memory resources. The two measures I’m talking about are the coefficient of determination (R2) and t-statistics.
Coefficient of determination (R2)
The coefficient of determination (R2) is the measure of variation in your output that is explained by your inputs. Taking my previous post’s example of plankton growth (PG), how do fluctuations in the measures of sea temperature (ST), sunlight intensity (SI), and whale population (WP) truly affect the output PG?
Running this statistical analysis can be an eye opener because you easily see whether your inputs truly do drive your output. This simple test helped me determine that the original option volatility model I built for my client (recreating the model in a research paper at his request) was a piece of junk. The R2 measure for the original model was a mere 5%, meaning that the inputs explained only 5% of the variation in the volatility prediction. My newer volatility model now shows an R2 in the high 80% range. Just knowing this makes me feel confident that the model I’m building isn’t garbage.
For reference, a measure of 0% means your inputs explain none of the variation in your output, and 100% means your model is perfect (you’ll never get this high, nothing’s perfect).
Tip: If you don’t have a statistical package or the means to calculate R2, may I suggest using the LINEST function in Excel. This Excel function allows you to calculate the R2 value and t-stats, which will be discussed next.
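If you would rather script the check than use Excel, here is a minimal sketch of the same R2 calculation in Python with NumPy. The helper name r_squared and the made-up ST, SI, WP, and PG numbers are my own assumptions for illustration, not real plankton data; the fit is an ordinary linear least-squares regression with an intercept.

```python
import numpy as np

def r_squared(X, y):
    """R2 of an ordinary least-squares fit of y on the columns of X (plus an intercept)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])      # prepend the intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares coefficients
    residuals = y - A @ coef
    ss_res = float(residuals @ residuals)          # unexplained variation
    ss_tot = float(((y - y.mean()) ** 2).sum())    # total variation in the output
    return 1.0 - ss_res / ss_tot

# Synthetic, made-up data (an assumption for this sketch only).
rng = np.random.default_rng(0)
n = 200
st = rng.normal(size=n)   # sea temperature
si = rng.normal(size=n)   # sunlight intensity
wp = rng.normal(size=n)   # whale population, given no effect on PG by construction
pg = 2.0 * st + 1.5 * si + rng.normal(scale=0.5, size=n)

r2 = r_squared(np.column_stack([st, si, wp]), pg)
print(f"R2 = {r2:.3f}")
```

Because the synthetic output is built almost entirely from ST and SI, the R2 here should come out high. On real training data, a low value is the warning sign described above.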
T-statistics
A t-statistic is nothing more than a measure of each input’s statistical significance to the output. It could be that the variables ST, SI, and WP together explain 80% of the variation in PG, but when you test each variable, you might find that ST and SI are very significant and WP isn’t. As a rule of thumb, a t-statistic greater than 2 or less than -2 indicates significance (roughly the 95% confidence level). The further the number is from zero in either direction, the more significant that input variable is in explaining your output.
This measure is really good at identifying the weakest input from your training data. In some cases you can actually delete a few input variables from your training data without affecting your coefficient of determination. If that’s the case, then you just saved yourself some CPU time when you build your model.
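Tip: Once again you can calculate t-statistics using Excel’s LINEST function (just read the help section for an explanation). With its statistics option turned on, LINEST returns each coefficient together with its standard error; dividing the coefficient by its standard error gives you the t-statistic.
There you have it, two good ways to check whether the data you are about to model is robust or weak. As always, if you have questions please drop me a comment.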
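For readers working outside Excel, the t-statistics can be sketched in Python with NumPy as each regression coefficient divided by its standard error. The function name ols_t_stats and the synthetic ST, SI, WP, and PG values are assumptions made up for illustration; this is a plain linear regression, not the neural net learner itself.

```python
import numpy as np

def ols_t_stats(X, y):
    """t-statistic of each input in an ordinary least-squares fit (intercept excluded)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    A = np.column_stack([np.ones(n), X])            # prepend the intercept column
    k = A.shape[1]                                  # number of fitted coefficients
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    sigma2 = float(residuals @ residuals) / (n - k) # residual variance
    cov = sigma2 * np.linalg.inv(A.T @ A)           # coefficient covariance matrix
    std_err = np.sqrt(np.diag(cov))
    return (coef / std_err)[1:]                     # drop the intercept's t-stat

# Synthetic, made-up data (an assumption for this sketch only).
rng = np.random.default_rng(1)
n = 200
st = rng.normal(size=n)
si = rng.normal(size=n)
wp = rng.normal(size=n)   # given no effect on PG by construction
pg = 2.0 * st + 1.5 * si + rng.normal(scale=0.5, size=n)

t_stats = ols_t_stats(np.column_stack([st, si, wp]), pg)
for name, t in zip(["ST", "SI", "WP"], t_stats):
    flag = "significant" if abs(t) > 2 else "weak"
    print(f"{name}: t = {t:.2f} ({flag})")
```

ST and SI are built into the output, so their t-stats should sit far outside the ±2 band. WP has no real effect, so its t-stat will usually (though not always) land inside it.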
Update: For clarification purposes as to why I’m using linear statistics to test data that will be used in a nonlinear model, I’m posting a screen shot from the book Data Mining and Business Intelligence by Stephan Kudyba and Richard Hoptroff.
Discussions with the author yielded the following important information: the coefficient of determination is the best indicator for data robustness and works for both linear and nonlinear models. T-stats will be less reliable in a nonlinear model but are important as a “check” for your overall model’s robustness. If a low-scoring t-stat input variable is removed and your R2 barely moves, then your model (linear or nonlinear) is very stable. Conversely, if you remove a low-scoring t-stat input from your model and the R2 swings wildly, you have a very unstable model.
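The stability check described above can be sketched as follows: fit once with all inputs, drop the input with the smallest absolute t-stat, refit, and compare the two R2 values. This is only a rough linear illustration in Python with NumPy; the helper fit_stats and the synthetic data are assumptions of mine, and what counts as “barely moves” is left to your judgment.

```python
import numpy as np

def fit_stats(X, y):
    """Return (R2, t-stats of the inputs) for a least-squares fit with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    sigma2 = ss_res / (n - A.shape[1])              # residual variance
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
    return 1.0 - ss_res / ss_tot, (coef / std_err)[1:]

# Synthetic, made-up data (an assumption for this sketch only).
rng = np.random.default_rng(2)
n = 200
names = ["ST", "SI", "WP"]
X = rng.normal(size=(n, 3))
pg = 2.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)  # WP has no effect

r2_full, t_stats = fit_stats(X, pg)
weakest = int(np.argmin(np.abs(t_stats)))           # input with the lowest |t|
r2_reduced, _ = fit_stats(np.delete(X, weakest, axis=1), pg)

print(f"Dropped {names[weakest]}: R2 {r2_full:.4f} -> {r2_reduced:.4f}")
```

If the two R2 values are nearly identical, the dropped input was contributing little, which matches the “stable” case above. A large drop suggests the input mattered despite its low score.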



August 9th, 2007 at 4:46 am
I think with the T-Statistics you bring up a good point. Very often I find people combining data when the data will confuse, or add no relevance to the scenario. Adding irrelevant data will make your system very unstable and rely more on luck than anything else.
August 9th, 2007 at 5:09 am
Christian, you are 100% correct!
August 9th, 2007 at 8:42 am
Thanks Tom for such a Great post!
For the volatility model of R2 = 80%, what was the performance of this model on the test data set?
Also, I played around with popular trading software called ‘Trading Solutions,’ and it uses correlation as a measure for selecting statistically significant data. Any comments?
~Sarah
August 9th, 2007 at 8:48 am
Sarah: Regarding the R2=80% result, that was the same model that gave me 21 correct signals out of 31 total (68% correct).
Regarding correlation, you have to watch out for that.
Correlation doesn’t necessarily mean causality. You might have a model whose inputs correlate 95% to the output but do not necessarily drive the output. You’re more interested in what drives the output.
I’m not saying that correlation is unimportant but it can be misleading in the case of neural nets.
August 9th, 2007 at 11:50 am
You are using linear statistics to check the quality of the inputs that will be used in a non-linear model. But many non-linear models have low statistical significance in terms of linear measures like the t-stat or the R2. Sometimes the best “statistic” is to ask yourself “does this relationship make any sense?” Quantitative metrics are no substitute for common sense.
August 9th, 2007 at 12:28 pm
Dario, you are correct that I’m using linear statistics to check the quality of inputs that will be used in a nonlinear model. It’s a bit weird to do it this way, and of course common sense takes precedence, but these statistical measures are a good starting point if you have nowhere else to go. I would rather check these measures and then build a nonlinear model than just blindly trust the output my neural net learner gives me.
Many quants got themselves in a lot of trouble recently with the subprime mess when common sense warned them to stay away from that debt.
August 16th, 2007 at 10:22 am
Are you running the r2/t-stat for each input prior to running it through YALE? Or are you running the r2/t-stat on the output of the learned YALE model vs the actual results?
Thanks
Jonathan
August 16th, 2007 at 10:53 am
JW, yes, I run the R2/t-stat prior to running it through the model. I use a validation operator to take samples of training data and try to validate the model as it’s being built.
Depending on the importance of the model I’m building, I will check the predicted output against portions of the training data to get an idea of what my error is.