Modeling Robust Data

Posted on Thu 09 August 2007 in misc • 4 min read

  • Data Analytics
  • Neural Nets
  • Tutorials

    I mentioned in my supervised learning post that your data model will only be as good as your training data. Too often, and I’ve been guilty of this as well, data modelers throw all kinds of input variables together into a training data set, figuring the neural net learner needs them all. They click “run”, the learner magically creates a model, and their prediction sets start spitting out predictions.

    Let me ask you this, Mr./Ms. Modeler: how confident are you that the model itself is significant? Would you be willing to bet your job on it? After all, we’ve all heard the saying, “garbage in, garbage out.” How do we separate the good data from the garbage data, or prevent modeling garbage data in the first place?

    To prevent embarrassment, or the loss of your job, there are two main statistical measures that you can use to check your training data for significance. I do this almost every time before I run a neural net learner, because throwing in unnecessary input variables slows down your analysis and takes up valuable memory. The two measures I’m talking about are the coefficient of determination (R2) and t-statistics.

    Coefficient of determination (R2)

    The coefficient of determination (R2) measures the proportion of the variation in your output that is explained by your inputs. Taking my previous post’s example of plankton growth (PG), how much do fluctuations in sea temperature (ST), sunlight intensity (SI), and whale population (WP) truly affect the output PG?

    Running this statistical analysis can be an eye-opener because you can easily see whether your inputs truly drive your output. This simple test helped me determine that the original option volatility model I built for my client (recreating the model in a research paper at his request) was a piece of junk. The R2 measure for the original model was a mere 5%, meaning that the inputs explained only 5% of the variation in the volatility prediction. My newer volatility model now shows an R2 in the high 80% range. Just knowing this makes me feel confident that the model I’m building isn’t garbage.

    For reference, an R2 of 0% means your inputs explain none of the variation in your output, and 100% means they explain all of it (you’ll never get this high; nothing’s perfect).
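
    To make the definition concrete, here is a minimal Python sketch of the R2 calculation for an ordinary least-squares fit. It assumes NumPy is available; the numbers are synthetic and made up purely for illustration (they are not the plankton data from my earlier post), and the helper name r_squared is my own invention.

```python
import numpy as np

def r_squared(X, y):
    """Return R2 for an ordinary least-squares fit of y on X (intercept added)."""
    A = np.column_stack([np.ones(len(y)), X])     # prepend an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares coefficients
    residuals = y - A @ beta
    ss_res = np.sum(residuals ** 2)               # variation left unexplained
    ss_tot = np.sum((y - y.mean()) ** 2)          # total variation in the output
    return 1.0 - ss_res / ss_tot

# Synthetic, illustrative data: PG is built from ST and SI plus noise; WP is unrelated.
rng = np.random.default_rng(0)
n = 200
ST = rng.normal(15.0, 3.0, n)     # sea temperature
SI = rng.normal(50.0, 10.0, n)    # sunlight intensity
WP = rng.normal(100.0, 20.0, n)   # whale population
PG = 2.0 * ST + 0.5 * SI + rng.normal(0.0, 2.0, n)

r2 = r_squared(np.column_stack([ST, SI, WP]), PG)
print(f"R2 = {r2:.3f}")
```

    Because the synthetic output is mostly signal with a little noise, the R2 here should come out high; with real data you would swap in your own input columns and output.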

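    Tip: If you don’t have a statistical package or another means to calculate R2, may I suggest using the LINEST function in Excel. With its stats argument set to TRUE, this function returns the R2 value along with each coefficient and its standard error; dividing a coefficient by its standard error gives the t-stat, which is discussed next.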


    T-statistics

    A t-statistic is nothing more than a measure of each input’s statistical significance to the output. It could be that the variables ST, SI, and WP together explain 80% of the variation in PG, but when you test each variable individually, you might find that ST and SI are very significant while WP isn’t. As a rule of thumb, a t-statistic greater than 2 or less than -2 indicates significance (roughly the 95% confidence level for a reasonably sized data set). The further the number is from zero in either direction, the more significant that input variable is in explaining your output.
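
    For readers who want to see where the number comes from, here is a minimal Python sketch of the t-statistic calculation for a linear regression: each coefficient divided by its standard error. It assumes NumPy, uses synthetic made-up data in which WP deliberately has no effect on PG, and the helper name t_statistics is hypothetical.

```python
import numpy as np

def t_statistics(X, y):
    """Return OLS t-statistics (intercept first) for y regressed on X."""
    A = np.column_stack([np.ones(len(y)), X])     # prepend an intercept column
    n, k = A.shape
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares coefficients
    residuals = y - A @ beta
    sigma2 = residuals @ residuals / (n - k)      # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)         # covariance of the coefficients
    return beta / np.sqrt(np.diag(cov))           # coefficient / standard error

# Synthetic, illustrative data: PG is built from ST and SI plus noise; WP is unrelated.
rng = np.random.default_rng(0)
n = 200
ST = rng.normal(15.0, 3.0, n)     # sea temperature
SI = rng.normal(50.0, 10.0, n)    # sunlight intensity
WP = rng.normal(100.0, 20.0, n)   # whale population
PG = 2.0 * ST + 0.5 * SI + rng.normal(0.0, 2.0, n)

t = t_statistics(np.column_stack([ST, SI, WP]), PG)
for name, value in zip(["intercept", "ST", "SI", "WP"], t):
    flag = "significant" if abs(value) > 2 else "weak"
    print(f"{name:>9}: t = {value:7.2f}  ({flag})")
```

    In this setup ST and SI should land far outside the ±2 band, while WP, which plays no part in generating PG, will usually fall inside it.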

    This measure is really good at identifying the weakest input in your training data. In some cases you can actually delete a few input variables from your training data without affecting your coefficient of determination. If that’s the case, then you’ve just saved yourself some CPU time when you build your model.
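
    The pruning check described above can be sketched in a few lines of Python: fit once with every input, drop the input with the smallest absolute t-statistic, refit, and compare the two R2 values. This is only an illustration under stated assumptions: it needs NumPy, the data are synthetic and made up, and fit_ols is a hypothetical helper, not a library function.

```python
import numpy as np

def fit_ols(X, y):
    """Fit y on X with an intercept; return (R2, t-statistics of the inputs)."""
    A = np.column_stack([np.ones(len(y)), X])     # prepend an intercept column
    n, k = A.shape
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta
    ss_res = residuals @ residuals
    r2 = 1.0 - ss_res / np.sum((y - y.mean()) ** 2)
    cov = ss_res / (n - k) * np.linalg.inv(A.T @ A)
    t = beta / np.sqrt(np.diag(cov))
    return r2, t[1:]                              # skip the intercept's t-statistic

# Synthetic, illustrative data: PG is built from ST and SI plus noise; WP is unrelated.
rng = np.random.default_rng(0)
n = 200
ST = rng.normal(15.0, 3.0, n)     # sea temperature
SI = rng.normal(50.0, 10.0, n)    # sunlight intensity
WP = rng.normal(100.0, 20.0, n)   # whale population
PG = 2.0 * ST + 0.5 * SI + rng.normal(0.0, 2.0, n)

names = ["ST", "SI", "WP"]
X = np.column_stack([ST, SI, WP])

r2_full, t_full = fit_ols(X, PG)
weakest = int(np.argmin(np.abs(t_full)))          # input with the smallest |t|
X_reduced = np.delete(X, weakest, axis=1)         # remove that column
r2_reduced, _ = fit_ols(X_reduced, PG)

print(f"dropped {names[weakest]}: R2 {r2_full:.4f} -> {r2_reduced:.4f}")
```

    If the R2 barely changes after the drop, the removed input was not pulling its weight and can be left out of the training data.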

    Tip: Once again, you can calculate t-statistics using Excel’s LINEST function (read its help section for an explanation).

    There you have it: two good ways to check whether the data you are about to model is robust or weak. As always, if you have questions, please drop me a comment.

    Update: To clarify why I'm using linear statistics to test data that will be used in a nonlinear model, I'm posting a screen shot from the book Data Mining and Business Intelligence by Stephan Kudyba and Richard Hoptroff.

    [Screen shot: Neural Net and Regression Statistics]

    Discussions with the author yielded the following important information: the coefficient of determination is the best indicator of data robustness and works for both linear and nonlinear models. T-stats are less reliable in a nonlinear model but remain important as a "check" on your overall model's robustness. If an input variable with a low t-stat is removed and your R2 barely moves, then your model (linear or nonlinear) is very stable. Conversely, if you remove an input with a low t-stat and the R2 swings wildly, you have a very unstable model.