I was inspired to write a short post about trimming outliers in RapidMiner after a comment from dc yesterday. Although I've never used this particular set of data pre-processing operators (I always inspect my data visually), I find them interesting and worth a look.
If you right-click and select "New Operator", you'll find many top-level operator categories. Choose the "Pre-Processing" category, then "Data", and then "Outlier".
Once in the outlier directory, you'll find three operators: densitybasedoutlierdetection, distancebasedoutlierdetection, and LOFoutlierdetection.
Here's what each of them does in brief:
- The densitybasedoutlierdetection operator scans your data set and looks for outliers based on density, measured with a distance function you choose (squared distance, euclidean distance, angle);
- The distancebasedoutlierdetection operator uses a k-nearest neighbor algorithm to find outliers, and;
- The LOFoutlierdetection operator computes a local outlier factor (LOF) for each record, using lower and upper bounds on the minimal number of neighboring points (with a distance function) to find outliers.
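To make the k-nearest neighbor idea a little more concrete, here is a minimal sketch in plain Python. It is not RapidMiner's implementation; the function names `knn_outlier_scores` and `top_outliers`, the parameters `k` and `n_outliers`, and the toy data are all assumptions made for illustration. Each record is scored by the euclidean distance to its k-th nearest neighbor, and the records with the largest scores are treated as outliers.

```python
import math


def knn_outlier_scores(points, k):
    """Score each point by the euclidean distance to its k-th nearest neighbor.

    Hypothetical helper for illustration; a larger score means the point
    sits farther away from its neighbors.
    """
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, sorted ascending.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_outliers(points, k, n_outliers):
    """Return the (sorted) indices of the n_outliers highest-scoring points."""
    scores = knn_outlier_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_outliers])


# Toy data: four records clustered near (1, 1) and one far away.
data = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0), (8.0, 8.0)]
print(top_outliers(data, k=2, n_outliers=1))  # prints [4]
```

This brute-force version compares every pair of records, so it is only meant for small examples like the one above.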
In an experiment, these operators will automatically "snip" the outlier data records, and your neural net model is then built from the remaining data. Check out RapidMiner's "Pre-Processing" category for more great data "cleaning" goodies!
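The snip-then-model flow described above can be sketched outside RapidMiner as well. This is a rough illustration under stated assumptions: `flag_outliers` and `fit_line` are hypothetical helpers, the outlier flag comes from a simple k-th nearest neighbor distance, and an ordinary least-squares line stands in for the neural net so the example stays short.

```python
import math


def flag_outliers(rows, k, n_outliers):
    """Return one True/False outlier flag per row.

    Hypothetical helper: a row is flagged when its distance to its k-th
    nearest neighbor is among the n_outliers largest.
    """
    scores = []
    for i, p in enumerate(rows):
        dists = sorted(math.dist(p, q) for j, q in enumerate(rows) if j != i)
        scores.append(dists[k - 1])
    ranked = sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)
    worst = set(ranked[:n_outliers])
    return [i in worst for i in range(len(rows))]


def fit_line(rows):
    """Least-squares fit of y = slope * x + intercept (stand-in for the model)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy (x, y) records: four lie on y = 2x, the last one is far off.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0), (5.0, 30.0)]

# Step 1: flag the outliers. Step 2: snip them. Step 3: fit on what is left.
flags = flag_outliers(rows, k=2, n_outliers=1)
kept = [row for row, is_outlier in zip(rows, flags) if not is_outlier]

print("slope with outlier:   ", fit_line(rows)[0])
print("slope without outlier:", fit_line(kept)[0])
```

On this toy data the single stray record pulls the fitted slope from 2 up to 6, which is the kind of distortion that trimming is meant to prevent.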