Machine Learning and Data Munging in H2O Driverless AI with datatable

I missed this presentation at H2O World and I’m glad it was recorded. Pasha Stetsenko and Oleksiy Kononenko give a great presentation on datatable, the Python version of R’s data.table.

H2O World San Francisco, 2019

I’m going to try this new package out in my next Python munging work. It looks incredibly fast. As I do with all my videos, I’ve added my notes for readers below.

Notes

  • Introduction to using the open source datatable
  • 9 million rows in 7 seconds??
  • Recently implemented Follow the Regularized Leader (FTRL) in Driverless AI:
    • Has a Python frontend with a C++ backend
    • Parallelized with OpenMP and Hogwild
    • Supports boolean, integer, real, and string features
    • Hashing trick based on Murmur hash function
    • Second-order feature interactions
    • One-vs-rest multinomial classification and regression targets (experimental)
  • As simple as ‘import datatable as dt’
  • Use it because it’s reliable and fast, and datatable FTRL is already on Kaggle and open source!!!
  • Datatable comes from the popular R data.table package
  • When Driverless AI started, we knew Pandas was a problem
  • Pandas is memory hungry
  • Realized we needed a Python version of data.table
  • The first customer is Driverless AI
  • Wanted it to be multithreaded and efficient
  • Memory thrifty
  • Memory mapped on data sets (data set can live in memory or on disk)
  • Native C++ implementation
  • Open Source
  • Fread: A doorway to Driverless AI, reading in data
  • Next step in DAI is to save it to a binary format
  • The file is called ‘.jay’
  • Check it with ‘%%timeit’
  • Opening a .jay file is nearly instant
  • Syntax is very SQL-like; if you’re familiar with R’s data.table, you’ll pick it up quickly
  • See timestamp 16:00 for the basic syntax in use
H2O.ai, datatable
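
The FTRL bullets in the notes are easier to follow with a toy version in hand. Below is a rough pure-Python sketch of the per-coordinate FTRL-Proximal update combined with the hashing trick. It is not datatable’s C++ implementation: the class name FTRLSketch, the parameter defaults, and the training rows are all made up for illustration, hashlib’s blake2b stands in for the Murmur hash, and there is no parallelism or second-order feature interaction.

```python
import hashlib
import math


def bucket(feature: str, n_buckets: int) -> int:
    """Hashing trick: map a feature string to one of n_buckets weight slots."""
    # blake2b is a stand-in here; the talk mentions a Murmur hash.
    digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little") % n_buckets


class FTRLSketch:
    """Toy binary classifier using per-coordinate FTRL-Proximal updates."""

    def __init__(self, n_buckets=2**20, alpha=0.1, beta=1.0, l1=0.0, l2=0.0):
        self.n_buckets = n_buckets
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # per-bucket adjusted gradient sums
        self.n = {}  # per-bucket sums of squared gradients

    def _indices(self, row):
        # Every "column=value" string becomes a binary feature in some bucket.
        return sorted({bucket(f"{k}={v}", self.n_buckets) for k, v in row.items()})

    def _weight(self, i):
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # L1 regularization keeps small weights at exactly zero
        sign = 1.0 if z > 0 else -1.0
        n = self.n.get(i, 0.0)
        return -(z - sign * self.l1) / ((self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict(self, row):
        score = sum(self._weight(i) for i in self._indices(row))
        score = max(min(score, 35.0), -35.0)  # keep exp() in a safe range
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, row, y):
        g = self.predict(row) - y  # log-loss gradient for a feature value of 1
        for i in self._indices(row):
            n = self.n.get(i, 0.0)
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self._weight(i)
            self.n[i] = n + g * g


# Hypothetical training data: "red" rows are positive, "blue" rows negative.
rows = [({"color": "red", "size": "L"}, 1), ({"color": "blue", "size": "L"}, 0)]
model = FTRLSketch()
for _ in range(50):
    for row, label in rows:
        model.update(row, label)

print(model.predict({"color": "red"}), model.predict({"color": "blue"}))
```

Because features are hashed straight into buckets, string, boolean, and numeric-as-category columns all go through the same code path, which is the appeal of the hashing trick.
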

Questions and Answers

  • Can you create a datatable from Redshift or some other database? No, the suggestion is to connect with Pandas and then convert the result to a datatable
  • Is Python datatable as fully featured as R’s data.table, and if not, is there a plan to build it out? No, it’s still being built out

Functional Programming in Python

I’m spending time trying to understand the differences between writing classes and functions in Python. Which one is better and why? From what I’m gathering, a lot of people are tired of writing classes in general. Classes are used in Object Oriented Programming (OOP), and some Python coders hate it because it means writing too many lines of code when only a few really matter. So those programmers prefer functional programming (FP) in Python instead.

To that end, I’ve been watching both OOP and FP videos on the Internet and have started writing notes on them. Below is a great but also very deep video on functional programming in Python by Daniel Kirsch from PyData 2016. His presentation is about 30 minutes long and ends with a great Q&A session.

Functional Programming in Python

My notes from the video above are below:

  • First Class Functions
  • Higher Order Functions
  • Purity
  • Immutability (not going to talk about it)
  • Composition
  • Partial Application & Currying
  • Purity, a function without ‘side effects’
  • First Class Functions, simply means that functions are like everybody else
  • Can define with ‘def’ or lambda
  • Can use the name of functions as variables and do higher-order programming
  • Decorators “… provide a simple syntax for calling higher-order functions. By definition, a decorator is a function that takes another function and extends the behavior of the latter function without explicitly modifying it.”
  • Partial function applications – “The primary tool supplied by the Functools module is the class partial, which can be used to “wrap” a callable object with default arguments. Partial objects are similar to function objects with slight differences. Partial function application makes it easier to write and maintain the code.”
  • Partial functions are very powerful
  • “Currying transforms a function that takes multiple arguments in such a way that it can be called as a chain of functions. Each with a single argument (Partial Application).” via Wikipedia
  • The important concept for Currying is closures, aka lexical scoping
  • Remembers the variables in the scope where it was defined
  • List comprehensions vs functional equivalents
  • Map function vs list comprehension
  • Filter function vs list comprehension
  • Reduce vs list comprehension
  • Why not write out the loop instead? Using Map/Filter/Reduce is cleaner
  • Function composition: i.e. run a filter and then map: map(f, filter(p, seq))
  • ‘import functools’ is very useful
  • Main takeaways: Functional Programming is possible in Python (to a degree)
  • Main takeaways: Small composable functions are good
  • Main takeaways: FP == Build General Tools and Compose them
  • Python is missing: more list functions
  • Python is missing: Nicer lambda syntax
  • Python is missing: Automatic currying, composition syntax
  • Python is missing: ADTs (Sum Types)
  • Python is missing: Pattern Matching
  • Some remedies for list functions
  • Links provided in video @ 26:00
  • Suggest learning Haskell as a gateway to functional programming.
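
To tie the notes together, here is a small sketch of the main ideas using only the standard library. These are my own toy examples, not code from the talk; the helper names (apply_twice, count_calls, curried_add, compose) are invented for illustration, and compose in particular is not something Python ships with.

```python
from functools import partial, reduce, wraps


# First-class / higher-order: a function can be passed in like any value.
def apply_twice(func, value):
    return func(func(value))


# Decorator: takes a function and returns an extended version of it.
def count_calls(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)

    wrapper.calls = 0
    return wrapper


@count_calls
def add(a, b):
    return a + b


# Partial application: fix the first argument up front.
add_ten = partial(add, 10)


# Manual currying with a closure: inner() remembers `a` from its defining scope.
def curried_add(a):
    def inner(b):
        return a + b

    return inner


# Composition helper: compose(f, g)(x) == f(g(x)).
def compose(*funcs):
    return reduce(lambda f, g: lambda x: f(g(x)), funcs)


numbers = [1, 2, 3, 4, 5, 6]

# map / filter next to their list-comprehension equivalents.
squares_map = list(map(lambda x: x * x, numbers))
squares_comp = [x * x for x in numbers]
evens_filter = list(filter(lambda x: x % 2 == 0, numbers))
evens_comp = [x for x in numbers if x % 2 == 0]

# reduce folds a sequence into one value.
total = reduce(lambda acc, x: acc + x, numbers, 0)

# Run a filter and then a map: map(f, filter(p, seq)).
even_squares = list(map(lambda x: x * x, filter(lambda x: x % 2 == 0, numbers)))

inc_then_double = compose(lambda x: x * 2, lambda x: x + 1)

print(apply_twice(add_ten, 1), curried_add(2)(3), total, even_squares)
```

Note that curried_add(2)(3) is called as a chain of single-argument functions, while add_ten still takes its remaining argument in one ordinary call. That is the practical difference between currying and partial application.
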

Changing Pinboard Tags with Python

Welcome to another automation post! This is a super simple Python script for changing misspelled or wrong tags in your Pinboard account. I started using Pinboard again because it helps me save all these great articles I read on the Interwebz, so I can paraphrase and regurgitate them back to you. Ha!

I need to clean out the Pinboard tags every so often because I hooked it up to Twitter. It works well for me because it saves all my retweets, favs and posts, but there’s a lot of noise. Sometimes I end up with both “DataScience” and “DataScientists” as tags when I really want just “DataScience.” I did some searching around and found the Pinboard Python library. Changing Pinboard tags with Python is EASY!
