What’s new in Driverless AI?

h2o.ai, blog, Driverless AI, H2O World, writers gonna write, datatable

Arno, H2O’s CTO, gave a great 1+ hour overview of what’s new in Driverless AI version 1.4.1. If you check back in a few weeks/months, it’ll be even better. In all honesty, I have never seen a company innovate this fast.

Below are my notes from the video:

  • H2O-3 is the open source product
  • Driverless AI is the commercial product
  • Does Feature Engineering for you
  • When you have Domain Knowledge, Feature Engineering can give you a huge lift
  • Example: Salary, Job Title, Zip Code
  • What about people in this Zip Code with a given # of cars? Generate the mean of their salaries
  • Create out-of-fold estimates
  • Don’t use a row’s own target when computing its feature for training
  • Written in Python, with CUDA and C++ under the hood that Python directs
  • Able to create good models in an automated way
  • Driverless AI does not handle images
  • Handles strings, numbers, and categoricals
  • Datasets can be 100’s of Gigabytes
  • Creates 100’s of models with 1,000’s of new features
  • Creates an ensemble model after it’s done
  • Then creates an exportable model (Java runtime or Python)
  • C++ version is being worked on
  • All standalone models
  • Connect with Python client or via the web browser
  • Changelog is on docs.h2o.ai
  • Tests against Kaggle datasets
  • BNP Paribas Kaggle set, Driverless AI ranked in the top 10 out of the box
  • Driverless AI took 2 hours, whereas it took Grandmasters 2 months
  • Discussed how Logloss is interpreted
  • Uses Reusable Holdout (RH) and subsamples of RH
  • Driverless AI uses unsupervised methods to make supervised models
  • Uses XGBoost, GLM, LightGBM, TensorFlow CNN, and RuleFit
  • Implemented with R’s data.table for feature engineering and munging
  • Working on an open source Python version of R’s data.table
  • Overview of how Driverless AI handles outliers (AutoViz)
  • AutoViz only plots what you should see, not 100’s of scatterplots like Tableau
  • Overview of the GUI and what you can do
  • Validation and Test sets: how and when to use them
  • Checks for data shift between the training and test sets
  • Includes Machine Learning Interpretability suite
  • Does Time Series and NLP
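
To make the out-of-fold idea from the notes concrete, here is a minimal sketch in plain Python. It is not Driverless AI's actual implementation, just my own illustration of the principle: the function name `out_of_fold_target_mean`, the round-robin fold assignment, and the toy zip codes and salaries are all made up. Each row gets the mean salary of its Zip Code computed only from rows in the other folds, so a row's own salary never leaks into its feature.

```python
from collections import defaultdict


def out_of_fold_target_mean(groups, targets, n_folds=3):
    """Mean target per group, computed for each row from the other folds only."""
    n = len(groups)
    global_mean = sum(targets) / n
    # Assumption for this sketch: row i is assigned to fold i % n_folds.
    fold_of = [i % n_folds for i in range(n)]
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums = defaultdict(float)
        counts = defaultdict(int)
        # Accumulate group statistics from every row outside the current fold.
        for i in range(n):
            if fold_of[i] != fold:
                sums[groups[i]] += targets[i]
                counts[groups[i]] += 1
        # Encode rows inside the fold using only those outside statistics.
        # Fall back to the global mean if the group is unseen outside the fold.
        for i in range(n):
            if fold_of[i] == fold:
                g = groups[i]
                encoded[i] = sums[g] / counts[g] if counts[g] else global_mean
    return encoded


# Toy data, invented for illustration only.
zip_codes = ["94043", "94043", "94043", "10001", "10001", "10001"]
salaries = [100.0, 120.0, 140.0, 60.0, 80.0, 70.0]

oof = out_of_fold_target_mean(zip_codes, salaries, n_folds=3)
print(oof)
```

The first row has a salary of 100, but its encoded feature is the mean of the other two salaries in the same Zip Code (120 and 140). A plain group mean would have included the row's own salary, which is exactly the leak the notes warn about.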

And much more! Arno’s presentation style is excellent and he makes Data Science easy to understand.