Another important building block of any neural net model is the creation of training and validation data sets. The data you feed your neural net model is typically called "training data," and you use it to train the model to learn the relationships in that data. The question then arises: how do you know if the neural net is being trained correctly? Is it learning the right data relationships?
The way to overcome this problem and test the model as the neural net learns is to introduce something called a validation data set. A validation data set is just a random sample that is taken from your training data, held back, and then applied to the model. Once the validation data is applied to the model, the model calculates a predicted value for each data point. This predicted value is then compared to the actual data value and the error between them is determined. The neural net adjusts its weights (more on this later) using the rest of the training data, and the validation error tells you how well what it has learned carries over to data it was not trained on. When the errors converge or can't be reduced any more, the model has been trained.
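To make the idea concrete, here is a minimal sketch in plain Python, not YALE itself. It sets aside a random sample as validation data, then measures the error between predicted and actual values. The names `holdout_split`, `mean_squared_error`, and `toy_model` are made up for this illustration, and the "model" is a stand-in formula rather than a trained neural net.

```python
import random

def holdout_split(rows, validation_fraction=0.1, seed=42):
    """Randomly set aside a fraction of the rows as validation data."""
    rng = random.Random(seed)
    shuffled = rows[:]              # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_val = max(1, int(round(len(shuffled) * validation_fraction)))
    return shuffled[n_val:], shuffled[:n_val]   # (training, validation)

def mean_squared_error(model, rows):
    """Average squared difference between predicted and actual values."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)

# Toy data following y = 2x (an assumption made purely for illustration).
rows = [(x, 2 * x) for x in range(100)]
training, validation = holdout_split(rows, validation_fraction=0.1)

# Stand-in "model" that is always off by one: it predicts y = 2x + 1.
toy_model = lambda x: 2 * x + 1
validation_error = mean_squared_error(toy_model, validation)
print(len(training), len(validation), validation_error)
```

In a real experiment the stand-in model would be replaced by the neural net, and you would watch the validation error as training progresses.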
YALE has a great operator called the Cross Validation operator that creates a validation data set on the fly for you. The Cross Validation operator lets you specify how much of your training data should be used for validation and whether all the data (training and validation) should be used to rebuild your final model.
Tip: YALE has a few other validation operators; explore them when you have time. This particular operator is useful if you want to check performance measures, which I suggest you always do.
The way to use it is to load it right before you place your neural net operators (more on this later) and after your data loading operator. Once you have it in your experiment tree, you can tweak the parameters to your liking. Some important parameters for this operator are "Create Complete Model," "Number of Validations," and "Sampling Type."
The Create Complete Model parameter tells the Cross Validation operator to build the final model from all of the data, training and validation alike, once the validation run is done. If left unchecked, the operator uses the validation data only for testing and essentially removes it from the training data set.
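Here is a rough sketch of what that switch amounts to, written in plain Python rather than YALE. The function names, the `create_complete_model` flag, and the mean-predicting stand-in learner are all hypothetical and only illustrate the idea. The error is always measured on held-out rows, and the flag only decides whether the model you get back was rebuilt on everything.

```python
def train_mean_model(rows):
    """Stand-in learner: always predicts the mean target of its training rows."""
    mean_y = sum(y for _, y in rows) / len(rows)
    return lambda x: mean_y

def validate(rows, n_validation, create_complete_model=False):
    """Hold out the first n_validation rows, train on the rest, report the error."""
    validation, training = rows[:n_validation], rows[n_validation:]
    model = train_mean_model(training)
    # Mean absolute error on rows the model never saw during training.
    error = sum(abs(model(x) - y) for x, y in validation) / len(validation)
    if create_complete_model:
        # Rebuild the final model on all rows, validation included.
        model = train_mean_model(rows)
    return model, error

rows = [(i, float(i)) for i in range(10)]
partial_model, partial_error = validate(rows, 2, create_complete_model=False)
complete_model, complete_error = validate(rows, 2, create_complete_model=True)
print(partial_model(0), complete_model(0), partial_error, complete_error)
```

Note that the reported error is the same either way. The only difference is which data the returned model has seen.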
The Number of Validations parameter is the number of subsets the operator splits your data into; each subset takes one turn as the validation data while the rest is used for training. If you have 100 data points, I suggest 10 validations, so that 10%, or 10 data points, are held out in each round. If you have 10,000 data points, maybe 5 validations, or 20% per round. It all depends on your comfort level and the complexity of the data you're modeling. The last important parameter is the Sampling Type. This pull-down menu lets you choose how to sample your training data for validation data points. You have three choices: linear, shuffled, and stratified sampling (more on this later).
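As a hedged illustration of the three sampling choices, the sketch below splits row indices into `k` validation subsets in plain Python. It is not YALE's implementation. The function `make_folds` and its parameters are invented for this example, and it assumes "linear" means contiguous blocks in the original order, "shuffled" means random assignment, and "stratified" means random assignment that keeps class proportions roughly equal in every subset.

```python
import random
from collections import defaultdict

def make_folds(labels, k, sampling="shuffled", seed=0):
    """Return k lists of row indices, one list per validation subset."""
    n = len(labels)
    if sampling == "linear":
        # Contiguous blocks, keeping the original row order.
        size, extra = divmod(n, k)
        folds, start = [], 0
        for i in range(k):
            end = start + size + (1 if i < extra else 0)
            folds.append(list(range(start, end)))
            start = end
        return folds
    rng = random.Random(seed)
    if sampling == "shuffled":
        # Random order, then deal the rows out to the k subsets.
        order = list(range(n))
        rng.shuffle(order)
        return [order[i::k] for i in range(k)]
    if sampling == "stratified":
        # Shuffle within each class, then deal rows round-robin so every
        # subset gets about the same share of each class.
        by_class = defaultdict(list)
        for idx, label in enumerate(labels):
            by_class[label].append(idx)
        folds = [[] for _ in range(k)]
        position = 0
        for label in sorted(by_class):
            members = by_class[label]
            rng.shuffle(members)
            for idx in members:
                folds[position % k].append(idx)
                position += 1
        return folds
    raise ValueError("sampling must be 'linear', 'shuffled', or 'stratified'")

# 100 rows: 80 of class "a" followed by 20 of class "b".
labels = ["a"] * 80 + ["b"] * 20
linear_folds = make_folds(labels, 10, "linear")
shuffled_folds = make_folds(labels, 10, "shuffled")
stratified_folds = make_folds(labels, 10, "stratified")
```

With data sorted by class like this, linear sampling gives subsets that contain only one class, which is why shuffled or stratified sampling is usually the safer pick for classification data.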
That's it, another important building block explained. I hope that these smaller, but more detailed, tutorials are helpful to you. If they are, how about subscribing to my feed? As always, feel free to drop me an email or comment if you have questions.