Machine Learning and Data Munging in H2O Driverless AI with datatable

I missed this presentation at H2O World and I’m glad it was recorded. Pasha Stetsenko and Oleksiy Kononenko give a great presentation on datatable, the Python version of R’s data.table.

H2O World San Francisco, 2019

I’m going to try this new package out in my next Python munging work. It looks incredibly fast. As with all my videos, I’ve added my notes for readers below.

Notes

  • Introduction to using the open source datatable
  • 9 million rows in 7 seconds??
  • Recently implemented Follow the Regularized Leader (FTRL) in Driverless AI:
    • Has a Python frontend with a C++ backend
    • Parallelized with OpenMP and Hogwild
    • Supports boolean, integer, real, and string features
    • Hashing trick based on Murmur hash function
    • Second-order feature interactions
    • One-vs-rest multinomial classification and regression targets (experimental)
  • As simple as ‘import datatable as dt’
  • Use it because it’s reliable and fast, and datatable FTRL is already on Kaggle and open source!!!
  • Datatable comes from the popular R data.table package
  • When Driverless AI started, we knew Pandas was a problem
  • Pandas is memory hungry
  • Realized we needed a Python version of data.table
  • The first customer is Driverless AI
  • Wanted it to be multithreaded and efficient
  • Memory thrifty
  • Memory-mapped data sets (a data set can live in memory or on disk)
  • Native C++ implementation
  • Open Source
  • Fread: A doorway to Driverless AI, reading in data
  • Next step in DAI is to save it to a binary format
  • The file is called ‘.jay’
  • Check it with ‘%%timeit’
  • Opening a .jay file is nearly instant
  • Syntax is very SQL-like; if you’re familiar with R’s data.table, you’ll pick it up quickly
  • See timestamp 16:00 for the basic syntax in use
H2O.ai, datatable
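
The FTRL bullets in the notes above pack in a lot, so here is a rough pure-Python sketch of the general idea: a textbook-style FTRL-Proximal logistic regression with the hashing trick and optional second-order interactions. This is not datatable’s implementation. The class name, parameters, and toy XOR data are my own, and zlib.crc32 is only a standard-library stand-in for the Murmur hash mentioned in the talk.

```python
import math
import zlib
from itertools import combinations


class FTRLProximal:
    """Toy FTRL-Proximal logistic regression over hashed categorical features."""

    def __init__(self, nbins=2 ** 16, alpha=0.1, beta=1.0, l1=0.0, l2=0.0,
                 interactions=False):
        self.nbins, self.alpha, self.beta = nbins, alpha, beta
        self.l1, self.l2 = l1, l2
        self.interactions = interactions
        self.z = [0.0] * nbins  # per-bin adjusted gradient sums
        self.n = [0.0] * nbins  # per-bin sums of squared gradients

    def _indices(self, row):
        # Hashing trick: every "name=value" token maps to one of nbins slots.
        tokens = [f"{name}={value}" for name, value in sorted(row.items())]
        if self.interactions:
            # Second-order interactions: also hash every pair of tokens.
            tokens += [a + "&" + b for a, b in combinations(tokens, 2)]
        # ASSUMPTION: crc32 stands in for Murmur, which is not in the stdlib.
        return sorted({zlib.crc32(t.encode("utf-8")) % self.nbins for t in tokens})

    def _weight(self, i):
        # Weights are derived lazily from z and n; L1 zeroes out small ones.
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0
        sign = 1.0 if z > 0 else -1.0
        return -(z - sign * self.l1) / (
            (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2)

    @staticmethod
    def _sigmoid(x):
        x = max(min(x, 35.0), -35.0)  # clip to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-x))

    def predict(self, row):
        return self._sigmoid(sum(self._weight(i) for i in self._indices(row)))

    def update(self, row, y):
        idx = self._indices(row)
        weights = [self._weight(i) for i in idx]
        p = self._sigmoid(sum(weights))
        g = p - y  # log-loss gradient for a binary indicator feature
        for i, w in zip(idx, weights):
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w
            self.n[i] += g * g
        return p


# Hypothetical XOR-style data: the label depends only on the pair (a, b),
# so a linear model needs the interaction features to fit it.
xor_rows = [({"a": 0, "b": 0}, 0), ({"a": 0, "b": 1}, 1),
            ({"a": 1, "b": 0}, 1), ({"a": 1, "b": 1}, 0)]

model = FTRLProximal(interactions=True)
for _ in range(50):
    for row, label in xor_rows:
        model.update(row, label)

for row, label in xor_rows:
    print(row, label, round(model.predict(row), 3))
```

The model only ever touches the handful of hashed slots active in the current row, which is why this style of learner suits wide, sparse data.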

Questions and Answers

  • Can you create a datatable from Redshift or some other database? No. The suggestion is to connect through Pandas and then convert to datatable
  • Is Python datatable as fully featured as R data.table, and if not, is there a plan to build it out? No, it’s still being built out

Making AI Happen Without Getting Fired

From H2O.ai

I watched Mike Gualtieri’s keynote presentation from H2O World San Francisco (2019) and found it very insightful in a non-technical, MBA sort of way. The gist of the presentation is to really look at all the business connections to doing data science. It’s not just about the problem at hand but rather about setting yourself up for success and, as he puts it, not getting fired!

My notes from the video are below (emphasis mine):

  • Set the proper expectations
  • There is a difference between Pure AI and Pragmatic AI
  • Pure AI is like what you see in movies (e.g. Ex Machina)
  • Pragmatic AI is machine learning. Highly specialized in one thing but does it really well
  • Choose more than one use case
  • The use case you choose could fail. Choose many different kinds
  • Drop the ones that don’t work and optimize the ones that do
  • Ask for comprehensive data access
  • Data will be in silos
  • Get faster with AutoML
  • Data Scientists aren’t expensive; they need better tools to be more efficient
  • Three segments of ML tools
    • Multimodal (drag and drop like RapidMiner/KNIME)
    • Notebook-based (like Jupyter Notebook)
    • Automation-focused (like Driverless AI)
  • Use them to augment your work, go faster
  • Warning: Data-savvy users can use these tools to build ML. Can be dangerous but they can vet use cases
  • Know when to quit
  • Sometimes the use case won’t work. There is no signal in the data and you must quit
  • Stop wasting time
  • Keep production models fresh
  • When code is written, it’s written the same way and runs the same forever
  • ML models decay, so you need to figure out how to keep them fresh at scale
  • Model staging, A/B testing, Monitoring
  • Model deployment via collaboration with DevOps
  • Get Business and IT engaged early
  • Hold meetings with business and IT to get your ducks in a row
  • Ask yourself how it is going to be deployed and how it will impact business processes
  • Ignore the model to protect the jewels
  • You don’t have to do what the model tells you to do (e.g. false positives)
  • Knowledge Engineering: AI and Humans working together
  • Explainability is important

Flux: A Machine Learning Framework for Julia

There was a HUGE announcement on the Julia blog a few days ago. The marriage of a language for machine learning with a compiler just got a bit closer. Julia announced Flux, a machine learning framework for Julia.
