Data Science Link Roundup

A list of five data science-related links that I found interesting:

  • TensorFlow 2 is officially out. It has tighter integration with Keras (see the changelog here). I wonder if it will just eat Keras completely.
  • Interesting article on using "AI for portfolio management: from Markowitz to Reinforcement Learning." Some stuff we've seen before, but the part on Reinforcement Learning is interesting.
  • Startup Fiddler raises $10 million. They're trying to develop 'an "explainable" engine that's designed to analyze, validate, and manage AI solutions.' Aren't we all.
  • A few Python libraries that help with explainability. LIME is in there, but the others I did not know about. Must investigate further (a quick LIME sketch follows this list)...
  • Chinese PingAn Technology can predict flu outbreaks with 90% accuracy. It's all about disease containment.
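
Since LIME is the one library from that list I already knew, here's a minimal sketch of what explaining a single tabular prediction with it looks like. This is my own toy example, not code from any of the linked posts; it assumes lime and scikit-learn are installed:

    # A toy LIME example; the iris model here is a stand-in, not from the article
    from lime.lime_tabular import LimeTabularExplainer
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    data = load_iris()
    clf = RandomForestClassifier(random_state=0).fit(data.data, data.target)

    explainer = LimeTabularExplainer(data.data,
                                     feature_names=data.feature_names,
                                     class_names=list(data.target_names),
                                     mode="classification")
    # Explain one prediction: which features pushed the model toward its answer?
    exp = explainer.explain_instance(data.data[0], clf.predict_proba,
                                     num_features=4)
    print(exp.as_list())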

Updated! I added a bunch more links below because I thought it would be more useful!

  • Get a free ebook from H2O.ai on Machine Learning Interpretability (MLI) here. This link takes you to a Twitter post.
  • Oh my, this will NOT end well. Microsoft unleashes an AI bot to generate fake comments to news articles.
  • Scikit-Optimize is a cool new library that lets you auto-optimize your hyperparameters (a quick sketch follows this list).
  • Just like China, we're using AI to rank people. This is damn scary and I don't like it, even if it does some good in some cases.
  • Facebook's Libra is in trouble as Mastercard and Visa rethink it. I say, Vive Le Bitcoin!
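
To make the Scikit-Optimize bullet concrete, here's a minimal sketch of auto-tuning two Random Forest hyperparameters with gp_minimize. It's my own toy example, assuming scikit-optimize and scikit-learn are installed:

    # A toy hyperparameter search with scikit-optimize (skopt)
    from skopt import gp_minimize
    from skopt.space import Integer
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)

    def objective(params):
        max_depth, n_estimators = params
        clf = RandomForestClassifier(max_depth=max_depth,
                                     n_estimators=n_estimators,
                                     random_state=0)
        # gp_minimize minimizes, so return the negative accuracy
        return -cross_val_score(clf, X, y, cv=3).mean()

    space = [Integer(2, 12, name="max_depth"),
             Integer(10, 200, name="n_estimators")]

    result = gp_minimize(objective, space, n_calls=20, random_state=0)
    print("Best accuracy:", -result.fun, "Best params:", result.x)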


TensorFlow and High Level APIs

I got a chance to watch this great presentation on the upcoming release of TensorFlow v2 by Martin Wicke. He goes over the big changes - and there are a lot - plus how you can upgrade your earlier versions of TensorFlow to the new one. Let's hope the new version is faster than before! My video notes are below:

TensorFlow

  • Since its release, TensorFlow (TF) has grown into a vibrant community
  • The team learned a lot about how people use TF
  • Realized using TF can be painful
  • You can do everything in TF, but what's the best way?
  • The TF 2.0 alpha has just been released
  • Do 'pip install -U --pre tensorflow'
  • Adopted tf.keras as the high-level API (SWEET!); a quick sketch follows these notes
  • Includes eager execution by default
  • TF 2 is a major release that removes duplicate functionality, makes the APIs consistent, and keeps them compatible across TF versions
  • New flexibility: the full low-level API, internal operations are now accessible (tf.raw_ops), and inheritable interfaces for variables, checkpoints, and layers
  • How do I upgrade to TensorFlow 2?
  • Google is starting the process of converting one of the largest codebases ever
  • Will provide migration guides and best practices
  • Two tools will be shipped: a backward-compatibility module and a conversion script
  • The reorganization of the API causes a lot of function name changes
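
To make the "tf.keras plus eager by default" notes concrete, here's a minimal sketch of the TF 2 style. It's my own toy example, not code from the talk, and assumes a TensorFlow 2.x install:

    import tensorflow as tf

    # Eager execution is on by default: ops run immediately, no Session needed
    x = tf.constant([[1.0, 2.0]])
    print(tf.square(x))  # prints the values right away

    # tf.keras is the built-in high-level API
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
        tf.keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()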

TensorFlow v2

  • Release candidate in 'Spring 2019' (the timeline might be a bit flexible)
  • All on GitHub and project tracker
  • Needs user testing, please go download it
  • Karmel Allison is an engineering manager for TF and will show off the high-level APIs
  • TF adopted Keras
  • Keras is implemented and optimized inside TF as tf.keras
  • Keras was built from the ground up to be Pythonic and simple
  • tf.keras was built for small models, whereas Google needs to build HUGE models
  • Focused on production-ready estimators
  • How do you bridge the gap between a simple and a scalable API?
  • Debug with eager execution; easy to inspect NumPy arrays
  • TF also consolidated many APIs into Keras
  • There's one set of Optimizers now, fully scalable
  • One set of Metrics and Losses now
  • One set of Layers
  • Took care of RNN layers in TF: there is one version of the GRU and LSTM layers, and the right CPU/GPU implementation is selected at runtime
  • Easier configurable data parsing now (WOW, I have to check this out)
  • TensorBoard is now integrated into Keras
  • tf.distribute strategies for distributing work with Keras
  • Can add or change the distribution strategy with a few lines of code (see the sketch after these notes)
  • TF models can be exported to SavedModel using Keras functions (and reloaded too)
  • Coming soon: multi-node sync
  • Coming soon: TPUs
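
Here's a minimal sketch of the "few lines of code" claim for distribution strategies, plus the SavedModel round trip. Again, this is my own toy example under a TF 2.x install, not code from the talk, and the save path is a made-up placeholder:

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()  # data-parallel across local devices
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
        model.compile(optimizer="sgd", loss="mse")
    # Swapping strategies only changes the two lines above; fit() stays the same

    # Export to the SavedModel format and reload (path is hypothetical)
    model.save("my_model")
    reloaded = tf.keras.models.load_model("my_model")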

There's a lot in this 22-minute video about TensorFlow v2. Must watch.


Getting Started in Data Science Part 2

I'm finally getting around to writing Part 2 of Getting Started in Data Science. The first part can be found here, where I made suggestions for university students interested in the field of Data Science. I even made a video about it.

Pick Two, Master One

Pick two computer languages and become proficient in one and a master of the other. Or pick a platform, like H2O-Flow or RapidMiner, plus a language, and become a master at one but proficient in the other. This way you can set yourself apart from other students or applicants.

The reality is that you will be flipping back and forth between languages in your day-to-day work life. You could be writing a Python script to connect to a database, then pulling in some data and using a D3.js wrapper to make a dashboard. It all depends on where you end up.
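
To give a flavor of that glue work, here's a minimal sketch: connect to a database, pull some data, and dump JSON that a D3.js dashboard could consume. The database file and table names are hypothetical stand-ins:

    import sqlite3
    import pandas as pd

    # 'sales.db' and the 'sales' table are made-up examples
    conn = sqlite3.connect("sales.db")
    df = pd.read_sql_query("SELECT region, SUM(amount) AS total "
                           "FROM sales GROUP BY region", conn)
    conn.close()

    # Write records that a D3.js front end could load for a dashboard
    df.to_json("sales_by_region.json", orient="records")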

Social Equity

I spoke about this in my video: you should get involved socially. Join meetups, go to conferences, and then contribute. Did you do a cool project or solve an interesting problem? Ask to speak about it at a meetup. Public speaking does two things for you: it builds your brand, and it helps you get over the fear of speaking.

I used to 'pooh-pooh' people with communication skills. I used to think all they do is talk and produce nothing. Boy, was I wrong. Communicating is as important as solving whatever problem you're working on.

Another way is to join a club or meetup. This is a great, low-stress way to get out and listen to some interesting speakers in the field. There are tons of meetups happening all the time, and all you need to do is go to meetup.com and search in your area.

If you saw someone give an interesting talk at a meetup, go up to them and tell them you enjoyed their talk. Then ask for a business card or ask if there are any opportunities at their company. Don't be an annoying nudge and email them every day asking about opportunities; check in with them every quarter by sending a nice email with an interesting article you read.

Create something

The next way is to create something. In my past article, I wrote about how Makers have a drive to create. As we say at H2O.ai, Makers Gonna Make. So Make something!

Write a new library for Python or R. Create new RapidMiner processes. Then share them with the world: share them on GitHub, on a blog, or on Medium. It doesn't matter where, but design/build/code something and release it into the wild. Then cultivate its growth.

Become that guy or gal whose software is being used at Google (but can't get a job there, sheesh!).

Make and then Share!

Start a Business

This idea is the hardest but the most rewarding. Become an entrepreneur by starting a business. It doesn't have to be big; look at what Ugly from Uglychart is doing. He's domain flipping and making $125,000 per month. The best part? He's the only employee and doesn't want to get big.

Or you could be like the founders of RapidMiner: build a Data Science platform back in 2007, then build a startup around it! The founders of Instagram designed an app and photo platform for the iPhone and sold it to Facebook. Of course they left Facebook, but I'm sure they're going to be sought after by Venture Capitalists.

The hard part with this suggestion is figuring out what kind of business to start. Are you going to be a consultant, or are you going to build a product? And how are you going to sell it (beware the Freemium Devil)?

In the end, it doesn't matter which route you choose. The most important aspect is to remain involved with a Data Science community. Read up on the latest advances, write code, build things, talk to people, and build your personal brand.


H2O AI World 2018 in London

It's been nearly a whole month since I've been back from H2O AI World 2018 in London. First off, London is always a great city. I love it. Add in H2O World and it was like a machine learning fairy tale.

There were Kaggle Grandmasters, new H2O-3 and Driverless AI releases, and awesome speakers. Oh my!

I wanted to write a detailed 'blow by blow' account of my experiences at H2O World, but I was beaten to the punch. Pavel wrote an awesome post about his experiences in London, and the folks over in Mountain View did the same with their 'Top 5 Things You Should Know About H2O AI World London.'

That's the thing when you work for a fast-paced organization: you gotta be fast too.

My favorite talks

Four talks/presentations really stood out for me, and I'll list them in no particular order below.

The first was Dr. Tanya Berger-Wolf's presentation on combating extinction. The use of image mining on amateur photographs to calculate animal populations was fascinating. I was lucky enough to meet up with her after her talk to get into the nitty-gritty details of the process.

https://youtu.be/4McrCDioiCc

Next was Ashrith's talk on "Modeling Approaches for Malicious Behavioural Detection." I found this fascinating because I have a big interest in anomaly detection and its applications.

https://youtu.be/DXcLfrvkJgs

That was followed by Erin's talk on "Scalable Automatic Machine Learning with H2O." In it, she goes over the recent AutoML enhancements and what the roadmap looks like.

https://youtu.be/KN3iVR497TI

Then there was the Kaggle Grandmaster Panel, which hosted some amazing talent. I found the question-and-answer discussion quite enlightening. On top of that, I got to speak with the youngest Kaggle Grandmaster, Mikel Bober-Irizar, after hours about how he began his journey in Data Science.

https://youtu.be/BNAiHpH_gMM

H2O AI World 2019

Wouldn't you know? H2O AI World 2019 was just announced for February in San Francisco. Hope to see you there.

Update: If you're a supporter of Open Source, check out my post on Startups and Open Source here.


Isolation Forests in H2O.ai

A new feature has been added to open source H2O-3: Isolation Forests. I've always been a fan of understanding outliers and love using One-Class SVMs as a method, but Isolation Forests appear to be better at finding outliers in most cases.

From the H2O.ai blog:

There are multiple approaches to an unsupervised anomaly detection problem that try to exploit the differences between the properties of common and unique observations. The idea behind the Isolation Forest is as follows.

  • We start by building multiple decision trees such that the trees isolate the observations in their leaves. Ideally, each leaf of the tree isolates exactly one observation from your data set. The trees are being split randomly. We assume that if one observation is similar to others in our data set, it will take more random splits to perfectly isolate this observation, as opposed to isolating an outlier.
  • For an outlier that has some feature values significantly different from the other observations, randomly finding the split isolating it should not be too hard. As we build multiple isolation trees, hence the isolation forest, for each observation we can calculate the average number of splits across all the trees that isolate the observation. The average number of splits is then used as a score, where the fewer splits the observation needs, the more likely it is to be anomalous.

While there are other methods of outlier detection, like LOF (local outlier factor), it appears that Isolation Forests tend to be better than One-Class SVMs at finding outliers.

See this handy image from the Scikit-Learn site:

Anomaly Detection Comparison

Interesting indeed. I plan on using this new feature on some work I'm doing for customers.
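
If you want to try it yourself, here's a minimal sketch of the new estimator in the h2o Python package. The data file is a hypothetical stand-in; see the H2O-3 docs for the full API:

    import h2o
    from h2o.estimators import H2OIsolationForestEstimator

    h2o.init()

    # 'my_data.csv' is a made-up example of numeric feature data
    frame = h2o.import_file("my_data.csv")

    iso = H2OIsolationForestEstimator(ntrees=100, sample_size=256, seed=42)
    iso.train(training_frame=frame)

    # Each row gets an anomaly score: higher means fewer random splits were
    # needed to isolate it, i.e. more likely an outlier
    scores = iso.predict(frame)
    print(scores.head())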
