I’m super excited to be joining the ranks of H2O.ai! I’ve accepted an offer to join their Sales team as a Senior Customer Solutions Engineer, and I started last week. I’m getting back into Sales Engineering, something I missed dearly.
H2O.ai really rocks. I love their open source products like Flow, Sparkling Water, and H2O4GPU. These guys are pushing the innovation envelope, and I’ll be helping clients learn and set up their new flagship product, Driverless AI.
I don’t expect this to be easy; it will be very hard, and I’ll be traveling a lot again. That’s OK, because I’ll be learning. A lot.
What does this mean for my RapidMiner consulting? It means I’ll be closing it down for the foreseeable future.
I found this talk to be fascinating. I’ve been a big fan of LIME but never really understood the details of how it works under the hood. I understood that it works on an observation-by-observation basis, but I never knew that it permutes data, tests against the black box model, and then builds a simple linear model to explain it.
Really cool. My notes are below the video.
Input > black box > output; a problem when we don’t understand the black box, like neural nets
Example, will the loan default?
Typical classification problem
Loan and applicant information relative to historical data
Linear relationships are easy
Nonlinear relationships via a Decision Tree can still be interpreted
Big data creates more complexity and dimensions
One way to overcome this: use feature importance
Feature importance doesn’t tell us whether a relationship is linear or nonlinear
Gets better with partial dependence plots
Can’t do partial dependence plots for neural nets
You can create Bayesian Networks, which show the dependencies among all variables, including the output variable, and the strength of each relationship
Bummer: Not as accurate as some other algorithms
Can give you global understanding but not detailed explanation
Accuracy vs Interpretability tradeoff. Does it exist?
Enter LIME! Local Interpretable Model-agnostic Explanations
At a local level, it uses a linear model to explain the prediction
Takes an observation, creates fake data (permutation), calculates a similarity score between the fake and original data, then runs your black box algo (e.g. neural nets) on different combinations of predictors
Takes those features, weighted by the similarity scores, and fits a simple model to them to define the weights and scores that explain the prediction
Without that, you don’t know whether the model is picking up on real signal or just noise. You need LIME to verify!
Can apply to NLP/Text models
Why is it important? Trust / Predict / Improve
LIME helps feature engineering by non-ML practitioners
LIME can help comply with GDPR
Understanding our models can help protect vulnerable people
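The permute, score, weight, and fit-a-simple-model loop from my notes above can be sketched in a few lines. This is a toy version of the LIME idea, not the actual lime library; the black box model, kernel width, and perturbation scale are all illustrative assumptions I made up for the sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# A stand-in "black box" trained on synthetic data:
# y depends nonlinearly on feature 0 and linearly on feature 1
X = rng.normal(size=(500, 4))
y = X[:, 0] ** 2 + 3 * X[:, 1] + rng.normal(scale=0.1, size=500)
black_box = RandomForestRegressor(random_state=0).fit(X, y)

def explain_locally(model, x, n_samples=1000, kernel_width=0.75):
    """Explain one prediction: perturb x, score with the black box,
    weight the fakes by similarity, fit a weighted linear surrogate."""
    perturbed = x + rng.normal(scale=1.0, size=(n_samples, x.size))  # fake data
    preds = model.predict(perturbed)                                 # black box output
    dists = np.linalg.norm(perturbed - x, axis=1)
    weights = np.exp(-(dists ** 2) / kernel_width ** 2)              # similarity scores
    surrogate = Ridge(alpha=1.0).fit(perturbed, preds, sample_weight=weights)
    return surrogate.coef_                                           # local feature weights

coefs = explain_locally(black_box, X[0])
print(coefs)  # the weight on feature 1 should roughly recover its true slope of 3
```

The noise features (2 and 3) should come back with weights near zero, which is exactly the "is it signal or noise" check the talk describes.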
A few years ago RapidMiner incorporated a fantastic open source library from H2O.ai. That gave the platform Deep Learning, GLM, and GBT algos, something it had been lacking for a long time. If you were to look at my usage statistics, I’d bet you’d see that the Deep Learning and GLM algos are my favorites.
Late last year H2O.ai released Driverless AI, an automated modeling platform that scales easily to GPUs.
What I find fascinating is their approach of questioning each step along the way. The above video outlines the problem with lung tumor detection: is your model learning the shape of the ribs or the size of the tumor? You would hope it’s the tumor!
I know that H2O.ai relentlessly drives their open source market, and everywhere I look there’s an H2O.ai library being imported or used. It wasn’t a shock to me to see a new update to their Driverless AI product, but what got me giddy was their incorporation of time series. This, I have to check out. Time series can be a pain and it’s easy to make mistakes, especially in the validation phase, but this is just plain cool.
I definitely need to check this out more.
Many Kaggle Grandmasters at H2O
Built Driverless AI to avoid common pitfalls/mistakes of data science
Automate tasks: Cross validation, time series, feature engineering, etc
Ran it on a Kaggle challenge, came in 18th position.
Goal: Build robust models and avoid overfitting
Automatic visualization (of big data)
No outlier removal; outliers remain in the data set
Want to deploy a good model / must have an approximate interpretation
Java deployment package / Driverless AI will have a pure Java deployment
Not just talking about models but an entire model pipeline (feature generation, model building, stacking, etc)
Typically deployed to a Linux box
Will be building Java scoring logic to score the model pipeline (on roadmap)
Sparkling Water will be incorporated into Driverless AI so you can run this easily on Big Data
Want R/Python scripts to interact with Driverless AI so it makes sense to the data scientist and stays simple and easy to use
Deep Learning is inside but not enabled yet
Compromise: If you want to train many models, select a good-sized but not huge training set. There is a number-of-models vs training-time tradeoff
User defined functions coming
Import the training and testing data. The model will be built on training data only (it won’t look at the testing data)
Does batch style transformations instead of row by row for training
BUT it will do row by row transformations for testing set
Uses a genetic algo to create new features
Checks overfitting and stops early based on a holdout
Uses methods to evaluate and prevent overfitting
Only validation scores are provided (out of sample estimates)
Interpretability is built in
After the model is created, you can build a stacked model
Download the scoring package; everything is built in so you can put this into production
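The "generate candidate features, keep only what improves a holdout" loop in these notes can be illustrated with a toy sketch. This is a simple greedy search over interaction features, far simpler than the genetic algorithm Driverless AI actually uses; the data and names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic data with a hidden interaction the raw features miss
X = rng.normal(size=(400, 3))
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=400)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

def score(features_tr, features_val):
    """Fit on training features only; score on the holdout (out-of-sample)."""
    m = LinearRegression().fit(features_tr, y_tr)
    return m.score(features_val, y_val)

# "Population" of candidate engineered features: products of column pairs
candidates = [(i, j) for i in range(3) for j in range(i, 3)]
best_pair, best_score = None, score(X_tr, X_val)
for i, j in candidates:
    f_tr = np.c_[X_tr, X_tr[:, i] * X_tr[:, j]]
    f_val = np.c_[X_val, X_val[:, i] * X_val[:, j]]
    s = score(f_tr, f_val)
    if s > best_score:  # keep a candidate only if the holdout improves
        best_pair, best_score = (i, j), s

print(best_pair, round(best_score, 3))
```

Because every candidate is judged on the holdout rather than the training set, a feature that only memorizes training noise gets rejected, which is the overfitting guard the talk emphasizes.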