Guess What? I Joined H2O.ai!

I’m super excited to be joining the ranks of H2O.ai! I’ve accepted an offer to join their Sales team as a Senior Customer Solutions Engineer, and I started last week. I’m getting back into Sales Engineering, something I missed dearly. H2O.ai really rocks. I love their open source products like Flow, Sparkling Water, and H2O4GPU. These guys are pushing the innovation envelope, and I’ll be helping clients learn and set up their new flagship product, Driverless AI.

I don’t expect this to be easy; it will be very hard, and I’ll be traveling a lot again. That’s OK, because I’ll be learning. A Lot.

What does this mean for my RapidMiner consulting? It means I’ll be closing it down for the foreseeable future.

More info later…

Interpretable Machine Learning Using LIME Framework

I found this talk to be fascinating. I’ve been a big fan of LIME but never really understood the details of how it works under the hood. I understood that it works on an observation-by-observation basis, but I never knew that it permutes data, tests against the black box model, and then builds a simple linear model to explain it.

Really cool. My notes are below the video.


  • Input > black box > output; we don’t understand the black box in models like neural nets
  • Example: will the loan default?
  • Typical classification problem
  • Loan and applicant information relative to historical data
  • Linear relationships are easy
  • Nonlinear relationships via a Decision Tree can still be interpreted
  • Big data creates more complexity and dimensions
  • One way to overcome this: use feature importance
  • Feature importance doesn’t tell us whether a relationship is linear or nonlinear
  • Gets better with partial dependence plots
  • Can’t do partial dependence plots for neural nets
  • You can create Bayesian Networks / shows dependencies of all variables including output variable and strength of relationship
  • Bummer: Not as accurate as some other algorithms
  • Can give you global understanding but not detailed explanation
  • Accuracy vs. interpretability tradeoff. Does it exist?
  • Enter LIME! Local Interpretable Model-agnostic Explanations
  • At a local level, it uses a linear model to explain the prediction
  • Takes an observation, creates fake data (permutation), then calculates a similarity score between the fake and original data, then runs your black box algo (neural nets?) over different combinations of predictors
  • Takes those features with similarity scores and fits a simple model to them to derive weights and scores that explain the prediction
  • Without it, you don’t know whether the model is picking up real signal or noise. You need LIME to verify!
  • Can apply to NLP/Text models
  • Why is it important? Trust / Predict / Improve
  • LIME helps with feature engineering by non-ML practitioners
  • LIME can help comply with GDPR
  • Understanding our models can help protect vulnerable people
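The LIME recipe from the notes above (permute around one observation, weight by similarity, query the black box, fit a simple linear model) can be sketched in a few lines. This is my own minimal illustration of the idea, not the actual LIME library or H2O implementation; the `black_box` function and all parameters are made up for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def black_box(X):
    # Stand-in for any opaque model (e.g. a neural net): a nonlinear function.
    return 1 / (1 + np.exp(-(X[:, 0] * X[:, 1] + X[:, 2])))

def explain_locally(x, predict_fn, n_samples=5000, kernel_width=1.0):
    # 1. Permute: sample fake points in the neighborhood of the observation x.
    X_fake = x + rng.normal(scale=0.5, size=(n_samples, x.size))
    # 2. Similarity: weight each fake point by its closeness to x.
    dists = np.linalg.norm(X_fake - x, axis=1)
    weights = np.exp(-(dists ** 2) / kernel_width ** 2)
    # 3. Query the black box on the fake data.
    y_fake = predict_fn(X_fake)
    # 4. Fit a simple weighted linear model; its coefficients are the
    #    local explanation of the prediction around x.
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(X_fake, y_fake, sample_weight=weights)
    return surrogate.coef_

x = np.array([1.0, 2.0, 0.5])
coefs = explain_locally(x, black_box)
print(coefs)  # local feature weights around x
```

The coefficients only claim to be valid near `x`; a different observation gets its own surrogate model, which is exactly the “observation by observation” behavior described above.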


A few years ago RapidMiner incorporated a fantastic open source library from H2O.ai. That gave the platform Deep Learning, GLM, and GBT algos, something it was lacking for a long time. If you were to look at my usage statistics, I’d bet you’d see that the Deep Learning and GLM algos are my favorites.

Just late last year, H2O.ai released Driverless AI, an automated modeling platform that can scale easily to GPUs.


What I find fascinating is their approach of questioning each step of the way. The above video outlines this problem with lung tumor detection. Is your model learning the shape of the ribs or the size of the tumor? You would hope it was the tumor!

Fascinating video.


I know that H2O.ai relentlessly drives their open source market, and everywhere I look there’s an H2O library being imported or used. It wasn’t a shock to me to see a new update to their Driverless AI product, but what got me giddy was their incorporation of time series. This, I have to check out. Time series can always be a pain, and you can make mistakes easily, especially in the validation phase of things, but this is just plain cool.

I definitely need to check this out more.
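The validation mistake I mean is easy to sketch: if you shuffle time-ordered data into random train/test splits, future rows leak into training. A time-ordered split avoids that. This is my own toy illustration with scikit-learn, not how Driverless AI handles it internally:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n = 100
timestamps = np.arange(n)  # observations already in time order

# Wrong: a shuffled split mixes future rows into the training set.
rng = np.random.default_rng(0)
shuffled = rng.permutation(n)
train_bad, test_bad = shuffled[:80], shuffled[80:]
leaks = timestamps[train_bad].max() > timestamps[test_bad].min()

# Right: every training fold ends before its validation fold begins.
tscv = TimeSeriesSplit(n_splits=4)
ordered_ok = all(
    timestamps[train_idx].max() < timestamps[test_idx].min()
    for train_idx, test_idx in tscv.split(timestamps)
)
print(leaks, ordered_ok)  # the shuffled split leaks; the ordered one doesn't
```

The leaky version often looks great in validation and then falls apart in production, which is exactly why automating this check appeals to me.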

Video Highlights/Notes

  • Many Kaggle Grandmasters at H2O
  • Built to avoid common pitfalls/mistakes of data science
  • Automate tasks: Cross validation, time series, feature engineering, etc
  • Ran it on a Kaggle challenge, came in 18th position.
  • Goal: Build robust models and avoid overfitting
  • Automatic visualization (of big data)
  • No outlier removal, it remains in big data set
  • Want to deploy a good model / must have an approximate interpretation
  • Java deployment package / will have a pure Java deployment
  • Not just talking about models but an entire model pipeline (feature generation, model building, stacking, etc)
  • Typically deployed to a Linux box
  • Will be building a Java scoring logic to score the model pipeline (on roadmap)
  • Sparkling Water will be incorporated into Driverless AI so you can run it easily on Big Data
  • Want to write R/Py scripts to interact with Driverless AI so it makes sense to the data scientist; easy to use, not complex
  • Deep Learning is inside but not enabled yet
  • Compromise: If you want to train many models you select a good sized training set but not huge. There is a # of models vs training time tradeoff
  • User defined functions coming
  • Import the training and testing data. Model will be built on the training data only (won’t look at testing data)
  • Does batch style transformations instead of row by row for training
  • BUT it will do row by row transformations for testing set
  • Uses a genetic algo to create new features
  • Checks overfitting and stops early based on a holdout
  • Uses methods to evaluate and prevent overfitting
  • Only validation scores are provided (out of sample estimates)
  • Interpretability is built in
  • After the model is created, you can build a stacked model
  • Download scoring package, all built in so you can put this into production
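The “batch transformations for training, row-by-row for scoring” idea in the notes above can be sketched with a plain scikit-learn pipeline (my own illustration, not the Driverless AI internals): the transformer learns its statistics from the training set in batch, and new rows are then scored one at a time using those frozen statistics.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_test = rng.normal(size=(10, 3))

# Batch-style: scaler statistics and model weights come from training data only.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

# Row-by-row scoring: each test row is transformed and scored independently,
# reusing the statistics learned from the training set.
row_scores = [pipe.predict_proba(row.reshape(1, -1))[0, 1] for row in X_test]
batch_scores = pipe.predict_proba(X_test)[:, 1]
print(np.allclose(row_scores, batch_scores))  # same answers either way
```

Because the fitted pipeline never touches the test set during training, row-by-row scoring gives exactly the same answers as batch scoring, which is what makes a self-contained scoring package possible.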