Blog

TensorFlow and High Level APIs

I got a chance to watch this great presentation on the upcoming release of TensorFlow v2 by Martin Wicke. He goes over the big changes – and there are a lot of them – plus how you can upgrade your earlier versions of TensorFlow to the new one. Let’s hope the new version is faster than before! My video notes are below:

  • Since its release, TensorFlow (TF) has grown into a vibrant community
  • Learned a lot about how people use TF
  • Realized using TF can be painful
  • You can do everything in TF, but what is the best way?
  • TF 2.0 alpha has just been released
  • Do ‘pip install -U --pre tensorflow’
  • Adopted tf.keras as high-level API (SWEET!)
  • Includes eager execution by default
  • TF 2 is a major release that removes duplicate functionality, makes the APIs consistent, and makes them compatible across the TF ecosystem
  • New flexibilities: full low-level API, internal operations are accessible now (tf.raw_ops), and inheritable interfaces for variables, checkpoints, and layers
  • How do I upgrade to TensorFlow 2?
  • Google is starting the process of converting one of the largest codebases ever
  • Will provide migration guides and best practices
  • Two scripts will be shipped: backward compatibility and a conversion script
  • The reorganization of the API causes a lot of function name changes
Martin Wicke of Google shows the new TensorFlow v2 conversion script
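The upgrade flow described above can be sketched as two commands (the input/output file names are my own illustrative placeholders; `tf_upgrade_v2` ships with the TF 2.0 package):

```shell
# Install the TF 2.0 alpha pre-release (note the double-dash --pre flag)
pip install -U --pre tensorflow

# Run the conversion script on 1.x code; it rewrites renamed API calls
# and flags anything it cannot convert automatically
tf_upgrade_v2 --infile model_v1.py --outfile model_v2.py
```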
  • Release candidate in ‘Spring 2019’ (the timeline might be a bit flexible)
  • All on GitHub and project tracker
  • Needs user testing, please go download it
  • Karmel Allison is an Engineering manager for TF and will show off high-level APIs
  • TF adopted Keras
  • Implemented Keras and optimized in TF as tf.keras
  • Keras built from the ground up to be pythonic and simple
  • tf.keras was built for small models, whereas at Google they need to build HUGE models
  • Focused on production-ready estimators
  • How do you bridge the gap between a simple and a scalable API?
  • Debug with eager execution; easy to review NumPy arrays
  • TF also consolidated many APIs into Keras
  • There’s one set of Optimizers now, fully scalable
  • One set of Metrics and Losses now
  • One set of Layers
  • Took care of RNN layers in TF; there is one version of the GRU and LSTM layers, and it selects the right CPU/GPU implementation at runtime
  • Easier configurable data parsing now (WOW, I have to check this out)
  • TensorBoard is now integrated into Keras
  • TF distribute strategy for distributing work with Keras
  • Can add or change distribution strategy with a few lines of code
  • TF Models can be exported to SavedModel using the Keras function (and reloaded too)
  • Coming soon: multi-node sync
  • Coming soon: TPUs
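The consolidation above can be sketched with a minimal tf.keras model – one set of layers, optimizers, losses, and metrics, plus SavedModel export. The model shape, data, and file path are my own illustrative choices, not from the talk:

```python
import numpy as np
import tensorflow as tf  # TF 2.x: eager execution is on by default

# One set of layers, optimizers, losses, and metrics, all under tf.keras
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss=tf.keras.losses.MeanSquaredError(),
              metrics=[tf.keras.metrics.MeanAbsoluteError()])

# Tiny synthetic data, just enough to build and train the model
x = np.random.rand(8, 4).astype("float32")
y = np.random.rand(8, 1).astype("float32")
model.fit(x, y, epochs=1, verbose=0)

# Distribution can be added with a few lines: build the model inside a
# strategy scope, e.g.
#   strategy = tf.distribute.MirroredStrategy()
#   with strategy.scope():
#       model = tf.keras.Sequential([...])

# Export with the Keras function and reload (SavedModel format in TF 2.x)
model.save("my_model")
reloaded = tf.keras.models.load_model("my_model")
```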

There’s a lot in this 22 minute video about TensorFlow v2. Must watch.

Driving Marketing Performance with H2O Driverless AI

I watched this great video of G5 explaining how they use H2O-3, AutoML, and Driverless AI to build an NLP model and put it into production. Really cool. It uses the AWS stack and AWS Lambda. My summary notes are below:

  • G5 started with zero ML experience and in 3 months built an ML pipeline
  • G5 is a leader in marketing optimization for real estate marketing companies
  • They find leads for their customers
  • Owned/Paid/Earned media are breadcrumbs that site visitors leave
  • Clicks are not the most important interaction; it’s a call (90% of the time)
  • How to classify caller intent?
  • Build a training set from unstructured call data
  • Started with a data set of 110,000 unlabeled calls
  • Hired people to listen to the calls and hand-score them
  • Problem: everyone scores things a bit differently
  • Built a questionnaire to find like-minded workers who would score the data the same way
  • Every day took a sample and reviewed them for consistency
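That daily consistency review could look something like this simple pairwise-agreement check – a stdlib sketch with made-up labels, not G5’s actual process:

```python
from itertools import combinations

def pairwise_agreement(scores_by_worker):
    """Fraction of calls on which each pair of workers gave the same label."""
    agreements = {}
    for (w1, s1), (w2, s2) in combinations(scores_by_worker.items(), 2):
        same = sum(a == b for a, b in zip(s1, s2))
        agreements[(w1, w2)] = same / len(s1)
    return agreements

# Hypothetical labels for five sampled calls from three workers
scores = {
    "worker_a": ["lead", "lead", "support", "lead", "spam"],
    "worker_b": ["lead", "lead", "support", "spam", "spam"],
    "worker_c": ["lead", "support", "support", "lead", "spam"],
}
print(pairwise_agreement(scores))
# Low-agreement pairs flag workers who score differently from the group
```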
G5 – Getting Data to Prediction
  • Experimented with H2O-3 for testing
  • Took the training set and ran it through H2O-3 and built a Word2Vec model
  • Used AutoML to understand the parameters of the Word2Vec model
  • Ended up with 500 features and enriched with metadata (day of the week, length of call, etc)
  • Took that processed training set and put it through Driverless AI
  • Driverless AI came up with a model with 95% accuracy, beating the 89% benchmark
  • Driverless AI made it simple to put the model in production
  • Results from Driverless AI’s feature interactions are making G5 consider dropping the Word2Vec model and going completely to Driverless AI
  • DevOps needs to make sure the customers can use the results
  • Reliability / Reproducibility / Scalability / Iterability
  • A phone call comes in, it gets transcribed in AWS Lambda, then vectorized with the same Word2Vec model used in training. This is done so you get the same feature set every time (for model scoring)
  • H2O-3 makes transitions between R and Python easy
  • This model saves 3 minutes per call vs human listening, at 1 million calls a month that is 50,000 hours saved
  • Best part, it reduces the scoring time by 99% vs competitors
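The scoring path above can be sketched as a Lambda-style handler. Every function and field name here is my own placeholder, not G5’s actual code; the key idea is that the same vectorizer runs at training and scoring time. The time-savings arithmetic from the notes also checks out:

```python
def transcribe(audio_url):
    """Placeholder for the transcription step (e.g. a speech-to-text service)."""
    return "hi i would like to schedule a tour of the apartment"

def vectorize(text, dims=8):
    """Placeholder for the Word2Vec step: must be the *same* model used in
    training so the feature set matches at scoring time."""
    return [float(sum(ord(c) for c in w) % 97) for w in text.split()[:dims]]

def score(features):
    """Placeholder for the exported Driverless AI scoring pipeline."""
    return {"intent": "lead", "confidence": 0.95}

def handler(event, context=None):
    # Call comes in -> transcribe -> vectorize -> score, all in one Lambda
    text = transcribe(event["audio_url"])
    features = vectorize(text)
    return score(features)

result = handler({"audio_url": "s3://bucket/call.wav"})

# Savings check: 3 minutes saved per call at 1,000,000 calls per month
hours_saved = 3 * 1_000_000 / 60
assert hours_saved == 50_000
```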

Questions and Answers

  • Do you need to retrain all or part of your manual labels periodically to tackle model shift? Yes, hand scoring continues, and retraining is done and compared with the model in production to see if shift occurs
  • How do you maintain the models, and how often do you refresh them? Right now it’s a monthly cadence of review and update

Machine Learning and Data Munging in H2O Driverless AI with datatable

I missed this presentation at H2O World and I’m glad it was recorded. Pasha Stetsenko and Oleksiy Kononenko give a great presentation on the Python version of R’s data.table, called simply: datatable.

H2O World San Francisco, 2019

I’m going to try this new package out in my next Python munging work. It looks incredibly fast. As I do with all my videos, I add my notes for readers below.

Notes

  • Introduction to using the open source datatable
  • 9 million rows in 7 seconds??
  • Recently implemented Follow the Regularized Leader (FTRL) in Driverless AI:
    • Has a Python frontend with a C++ backend
    • Parallelized with OpenMP and Hogwild
    • Supports boolean, integer, real, and string features
    • Hashing trick based on the Murmur hash function
    • Second-order feature interactions
    • One-vs-rest multinomial classification and regression targets (experimental)
  • As simple as ‘import datatable as dt’
  • Use it because it’s reliable and fast, and datatable FTRL is already in Kaggle and open source!!!
  • Datatable comes from the popular R data.table package
  • When Driverless AI started, we knew Pandas was a problem
  • Pandas is memory hungry
  • Realized we needed a Python version of data.table
  • The first customer is Driverless AI
  • Wanted it to be multithreaded and efficient
  • Memory thrifty
  • Memory-mapped data sets (a data set can live in memory or on disk)
  • Native C++ implementation
  • Open Source
  • Fread: A doorway to Driverless AI, reading in data
  • Next step in DAI is to save it to a binary format
  • The file is called ‘.jay’
  • Check it with ‘%%timeit’
  • Opening a .jay file is nearly instant
  • The syntax is very SQL-like; if you’re familiar with R’s data.table, then you’ll get this
  • See timestamp 16:00 for the basic syntax in use
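The “hashing trick” mentioned for FTRL maps arbitrary feature strings into a fixed number of columns so the model can keep one weight per bin. A stdlib sketch of the idea (using `zlib.crc32` for illustration, whereas datatable’s implementation uses a Murmur hash; the tokens are made up):

```python
import zlib

def hash_features(tokens, n_bins=2 ** 10):
    """Map each feature string into a fixed-size index space.

    Collisions are accepted by design; the learner owns one weight per bin,
    so the feature width stays constant no matter how many raw strings exist.
    """
    vec = [0] * n_bins
    for tok in tokens:
        idx = zlib.crc32(tok.encode("utf-8")) % n_bins
        vec[idx] += 1
    return vec

v = hash_features(["day=Mon", "len=120", "word=tour"])
assert sum(v) == 3       # every token landed in some bin
assert len(v) == 1024    # fixed width regardless of vocabulary size
```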

Questions and Answers

  • Can you create a datatable from Redshift or some other DB? No; they suggest connecting in Pandas and then converting to datatable
  • Is Python datatable as fully featured as R data.table, and if not, is there a plan to build it out? No, it’s still being built out