I missed this presentation at H2O World, so I'm glad it was recorded. Pasha Stetsenko and Oleksiy Kononenko give a great presentation on the Python version of R's data.table, called simply datatable.
- Introduction to using the open source datatable
- 9 million rows in 7 seconds??
- Recently implemented Follow the Regularized Leader (FTRL) in Driverless AI
- As simple as 'import datatable as dt'
- Use it because it's reliable and fast; datatable FTRL is already on Kaggle and it's open source!
- Datatable comes from the popular R data.table package
- When Driverless AI started, we knew Pandas was a problem
- Pandas is memory hungry
- Realized we needed a Python version of data.table
- The first customer is Driverless AI
- Wanted it to be multithreaded and efficient
- Memory thrifty
- Memory-mapped data sets (a data set can live in memory or on disk)
- Native C++ implementation
- Open Source
- Fread: A doorway to Driverless AI, reading in data
- Next step in DAI is to save it to a binary format
- The file is called '.jay'
- Check it with '%%timeit'
- Opening a .jay file is nearly instant
The syntax is very SQL-like; if you're familiar with R's data.table, you'll pick it up quickly
- See timestamp 16:00 for basic syntax in use
Questions and Answers
- Can you create a datatable from Redshift or some other database? No; the suggestion is to connect through Pandas and then convert to datatable
- Is Python datatable as fully featured as R's data.table, and if not, is there a plan to build it out? No, it's still being built out