TensorFlow and High Level APIs

I got a chance to watch this great presentation on the upcoming release of TensorFlow v2 by Martin Wicke. He goes over the big changes – and there are a lot – plus how you can upgrade your earlier versions of TensorFlow to the new one. Let’s hope the new version is faster than before! My video notes are below:

  • Since its release, TensorFlow (TF) has grown into a vibrant community
  • The team learned a lot about how people use TF
  • Realized using TF can be painful
  • You can do everything in TF, but what is the best way?
  • The TF 2.0 alpha has just been released
  • Install it with ‘pip install -U --pre tensorflow’
  • Adopted tf.keras as high-level API (SWEET!)
  • Includes eager execution by default
  • TF 2 is a major release that removes duplicate functionality, makes the APIs consistent, and improves compatibility across the TF ecosystem
  • New flexibilities: full low-level API, internal operations are accessible now (tf.raw_ops), and inheritable interfaces for variables, checkpoints, and layers
  • How do I upgrade to TensorFlow 2?
  • Google is starting the process of converting the largest codebase ever
  • Will provide migration guides and best practices
  • Two tools will be shipped: a backward-compatibility module (tf.compat.v1) and a conversion script (tf_upgrade_v2)
  • The reorganization of the API causes a lot of function name changes
Martin Wicke of Google shows the new TensorFlow v2 conversion script
  • Release candidate expected in ‘Spring 2019’ – the timeline might be a bit flexible
  • All on GitHub and project tracker
  • Needs user testing, please go download it
  • Karmel Allison is an Engineering manager for TF and will show off high-level APIs
  • TF adopted Keras
  • Implemented Keras and optimized in TF as tf.keras
  • Keras built from the ground up to be pythonic and simple
  • tf.keras was built for small models, whereas at Google they need to build HUGE models
  • Focused on production-ready estimators
  • How do you bridge the gap between a simple and a scalable API?
  • Debug with eager execution; it’s easy to inspect NumPy arrays
  • TF also consolidated many APIs into Keras
  • There’s one set of Optimizers now, fully scalable
  • One set of Metrics and Losses now
  • One set of Layers
  • Took care of RNN layers in TF: there is one version of the GRU and LSTM layers, and it selects the right CPU/GPU implementation at runtime
  • Easier configurable data parsing now (WOW, I have to check this out)
  • TensorBoard is now integrated into Keras
  • TF distribute strategy for distributing work with Keras
  • Can add or change distribution strategy with a few lines of code
  • TF Models can be exported to SavedModel using the Keras function (and reloaded too)
  • Coming soon: multi-node sync
  • Coming soon: TPUs
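To make the eager-by-default and tf.keras points above concrete, here is a minimal sketch, assuming a TF 2.x install; the layer sizes and input data are made up for illustration:

```python
import tensorflow as tf  # assumes a TF 2.x install

# Eager execution is on by default in TF 2: ops run immediately,
# no graph building or Session required.
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.matmul(a, a)  # a concrete tensor, inspectable right away

# tf.keras is the adopted high-level API: one set of layers,
# optimizers, metrics, and losses.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

x = tf.random.normal((2, 4))
probs = model(x)  # an eager call returns concrete probabilities
```

Note there is no `tf.Session` anywhere – calling the model like a function and reading back values is the whole debugging story.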

There’s a lot in this 22 minute video about TensorFlow v2. Must watch.

Machine Learning and Data Munging in H2O Driverless AI with datatable

I missed this presentation at H2O World and I’m glad it was recorded. Pasha Stetsenko and Oleksii Kononenko give a great presentation on the Python version of R’s data.table, called simply: datatable.

H2O World San Francisco, 2019

I’m going to try this new package out in my next Python munging work. It looks incredibly fast. Just as I do with all my videos, my notes for readers are below.

Notes

  • Introduction to using the open source datatable
  • 9 million rows in 7 seconds??
  • Recently implemented Follow the Regularized Leader (FTRL) in Driverless AI:
    • Has a Python frontend with a C++ backend
    • Parallelized with OpenMP and Hogwild
    • Supports boolean, integer, real, and string feature types
    • Hashing trick based on Murmur hash function
    • Second-order feature interactions
    • One-vs-rest multinomial classification and regression targets (experimental)
  • As simple as ‘import datatable as dt’
  • Use it because it’s reliable and fast, and datatable’s FTRL is already on Kaggle and open source!!!
  • Datatable comes from the popular R data.table package
  • When Driverless AI started, we knew Pandas was a problem
  • Pandas is memory hungry
  • Realized they needed a Python version of data.table
  • The first customer is Driverless AI
  • Wanted it to be multithreaded and efficient
  • Memory thrifty
  • Memory mapped on data sets (data set can live in memory or on disk)
  • Native C++ implementation
  • Open Source
  • Fread: A doorway to Driverless AI, reading in data
  • Next step in DAI is to save it to a binary format
  • The file is called ‘.jay’
  • Check it with ‘%%timeit’
  • Opening a .jay file is nearly instant
  • Syntax is very SQL-like; if you’re familiar with R’s data.table, then you can get this
  • See timestamp 16:00 for the basic syntax in use
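The FTRL bullets above can be sketched in plain Python. This is an illustrative toy, not H2O’s parallel C++ implementation: it is single-threaded, and it substitutes MD5 for the Murmur hash the talk mentions. The feature names and values are made up.

```python
import hashlib
import math

NBINS = 2 ** 20  # size of the hashed feature space


def hash_feature(name, value):
    # Hashing trick: map a (column, value) pair to a fixed bin.
    # H2O uses a Murmur hash; MD5 here is just a deterministic
    # stand-in for illustration.
    digest = hashlib.md5(f"{name}={value}".encode()).hexdigest()
    return int(digest, 16) % NBINS


class FTRLProximal:
    """Toy single-threaded FTRL-Proximal for logistic loss."""

    def __init__(self, alpha=0.5, beta=1.0, l1=0.1, l2=0.1):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # per-coordinate accumulated adjusted gradients
        self.n = {}  # per-coordinate accumulated squared gradients

    def _weight(self, i):
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # L1 regularization keeps small coordinates at zero
        n = self.n.get(i, 0.0)
        return -(z - math.copysign(self.l1, z)) / (
            (self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict(self, idxs):
        score = sum(self._weight(i) for i in idxs)
        return 1.0 / (1.0 + math.exp(-max(min(score, 35.0), -35.0)))

    def update(self, idxs, y):
        g = self.predict(idxs) - y  # logloss gradient for binary features
        for i in idxs:
            n = self.n.get(i, 0.0)
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self._weight(i)
            self.n[i] = n + g * g


# Usage: rows are dicts of raw features; the hashing trick removes
# the need for an explicit vocabulary.
model = FTRLProximal()
rows = [({"color": "red", "size": "L"}, 1),
        ({"color": "blue", "size": "S"}, 0)] * 50
for feats, y in rows:
    model.update([hash_feature(k, v) for k, v in feats.items()], y)
```

The per-coordinate learning rates (via `n`) and the L1 thresholding are what make FTRL a good fit for the sparse, hashed features the presenters describe.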

Question and Answers

  • Can you create a datatable from Redshift or some other DB? No; they suggest connecting with Pandas and then converting to datatable
  • Is python datatable as fully featured as R data.table and if not is there a plan to build it out? No, it’s still being built out

Making AI Happen Without Getting Fired

From H2O.ai

I watched Mike Gualtieri’s keynote presentation from H2O World San Francisco (2019) and found it very insightful in a non-technical, MBA kind of way. The gist of the presentation is to really look at all the business connections to doing data science. It’s not just about the problem at hand but rather setting yourself up for success and, as he puts it, not getting fired!

My notes from the video are below (emphasis mine):

  • Set the proper expectations
  • There is a difference between Pure AI and Pragmatic AI
  • Pure AI is like what you see in movies (e.g. Ex Machina)
  • Pragmatic AI is machine learning. Highly specialized in one thing but does it really well
  • Choose more than one use case
  • The use case you choose could fail. Choose many different kinds
  • Drop the ones that don’t work and optimize the ones that do
  • Ask for comprehensive data access
  • Data will be in silos
  • Get faster with AutoML
  • Data Scientists aren’t expensive, they need better tools to be more efficient
  • Three segments of ML tools
    • Multimodal (drag and drop, like RapidMiner/KNIME)
    • Notebook-based (like Jupyter Notebook)
    • Automation-focused (like Driverless AI)
  • Use them to augment your work, go faster
  • Warning: data-savvy users can use these tools to build ML. It can be dangerous, but they can vet use cases
  • Know when to quit
  • Sometimes the use case won’t work. There is no signal in the data and you must quit
  • Stop wasting time
  • Keep production models fresh
  • Traditional code, once written, runs the same way forever
  • ML models decay, so you need to figure out how to keep them fresh at scale
  • Model staging, A/B testing, Monitoring
  • Model deployment via collaboration with DevOps
  • Get Business and IT engaged early
  • They have meetings with business and IT, get ducks in a row
  • Ask yourself, how is it going to be deployed and how it will impact business process
  • Ignore the model to protect the jewels
  • You don’t have to do what the model tells you to do (e.g. false positives, etc.)
  • Knowledge Engineering: AI and Humans working together
  • Explainability is important
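The staging / A/B testing / monitoring bullet can be sketched in a few lines of plain Python. This is a toy illustration of the idea, not any particular vendor’s tooling; the bucket share and drift tolerance are made-up parameters:

```python
import hashlib


def ab_bucket(user_id, challenger_share=0.1):
    """Deterministically route a user to the champion or challenger model.
    Hashing the id keeps the assignment sticky across requests."""
    h = int(hashlib.sha256(str(user_id).encode()).hexdigest(), 16)
    return "challenger" if (h % 1000) / 1000.0 < challenger_share else "champion"


class DriftMonitor:
    """Track the mean prediction in production; a large shift versus the
    training-time baseline is a cheap first signal that a model has decayed."""

    def __init__(self, baseline_mean, tolerance=0.15):
        self.baseline, self.tol = baseline_mean, tolerance
        self.total, self.count = 0.0, 0

    def record(self, p):
        self.total += p
        self.count += 1

    def is_drifting(self):
        if self.count == 0:
            return False
        return abs(self.total / self.count - self.baseline) > self.tol
```

In practice this is the kind of plumbing that the business/IT/DevOps collaboration bullets are about: someone has to own the routing, the baseline, and what happens when `is_drifting()` fires.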