I’m finally getting around to writing Part 2 of Getting Started in Data Science. The first part can be found here. I made suggestions for university students interested in the field of Data Science, and I even made a video about it.
Pick Two, Master One
Pick two programming languages: master one and become proficient in the other. Or pick a platform like H2O Flow or RapidMiner plus a language, and again master one and become proficient in the other. This way you can set yourself apart from other students or applicants.
This is the foreword to an introduction on getting started in data science. I wanted to write a set of ‘getting started’ posts to share with readers how I became a data scientist at RapidMiner, and how I went from a civil engineer with an MBA to working for an amazing startup. Granted, I’m not a classically trained data scientist and I hardly knew how to code, but with the right tools and attitude you can ‘huff’ your way into this field. Will you be a data scientist after reading this series of posts? Of course not, but you’ll have a framework to move forward or, at the least, a better understanding of what we do.
My journey into data science starts with my engineering degree. It taught me the basics of statistics, math, critical thinking, and even how to program in Fortran. I promptly forgot how to code in Fortran, which it turns out was a mistake on my part. I worked as a civil engineer for close to 6 years before I decided to get an MBA with a specialization in technology. It was in MBA school that I took a course titled “Data Mining for Managers” taught by Dr. Stephan Kudyba. Dr. Kudyba turned me on to a passion that I didn’t know existed. The very thought of ‘mining’ data for statistical relationships got me so excited that I ended up starting this website/blog back in 2007. He was the match to this fire.
It was after his class that I found YALE, the initial alpha version of what became RapidMiner. Right off the bat I could tell that it was feature rich but incredibly hard to understand or use. You had to be a PhD to figure it out, which made sense since the founders (Ingo & Ralf) were PhD students at the time. In my Data Mining for Managers class, we never talked about cross validation. We never created a confusion matrix or calculated precision/recall. We just talked about ETL and data preparation, modeling with a neural net, and consuming the results. So, I had work to do if I wanted to use this tool.
Even though it was hard, I chose YALE/RapidMiner because I didn’t need to code. I didn’t have time to teach myself a programming language, which, as I reflect on it, was probably a mistake on my part. I had the chance to take some Java classes back then but decided not to. If I had to do it all over again, I would choose either Java or Python to learn from the very beginning: Java if I wanted to really build out RapidMiner, and Python because it’s fast to prototype in and easy to work with.
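For readers new to those terms, here is a minimal sketch of a confusion matrix and the precision and recall derived from it, in plain Python. The labels are made up purely for illustration and do not come from any real model.

```python
# Hypothetical labels for illustration: 1 = positive class, 0 = negative class.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(actual, predicted))

# The four cells of a binary confusion matrix.
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives

# Precision: of everything predicted positive, how much really was positive?
precision = tp / (tp + fp)
# Recall: of everything that really was positive, how much did we find?
recall = tp / (tp + fn)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Cross validation repeats this kind of scoring on several held-out slices of the data, so the numbers do not depend on a single lucky or unlucky split.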
What are Data Scientists coding in?
This will change from year to year, and you can always check the yearly KDnuggets poll on what data scientists are using, but here are the ones I’m familiar with, along with my comments. My suggestion: pick two but become really proficient in one.
Java is a statically typed language. That means you have to declare variable types explicitly, so it takes more time to write your program. The data science benefit is that platforms like RapidMiner and KNIME run on it, so they’re platform independent. H2O.ai also lets you export a trained model as a POJO (Plain Old Java Object) file so you can quickly put it into production. Then there’s WEKA, another Java-based data science platform. The upside to Java is that it’s very mature and has a ton of libraries to use. Note: H2O Flow is Java-based and runs in your browser. H2O also provides R and Python bindings to boot! Woot!
Another added benefit is that Java works well if you’re working with Hadoop. Of course, every Hadoop distribution is different, but generally they support Java. If Hadoop and Big Data interest you, then also look into Scala. Scala runs on the Java Virtual Machine and is very similar to Java.
I’ll start with a disclaimer: I’ve used R and it has some great packages, but I find it clunky. This is my personal bias, and I’ve worked with people who absolutely love R. It’s very feature-rich open source software that lets you do all aspects of machine learning, with some of the best graphics libraries out there. A lot of universities teach data science courses in R, and I completely understand why. It’s not as heavy to code as Java and, in my opinion, it is a bit easier than Python, but you have to know the syntax. It’s a bit harder to put into production, though you can use it on Hadoop via SparkR. You can download it right away and get started with the thousands of video tutorials out there.
If you’re going to work with R, I suggest downloading RStudio. It’s a very nice workbench that lets you write R scripts, load data, and display charts in one neatly organized place.
I like Python a lot because it fits an engineering mindset. Programming is relatively fast and everything is considered ‘dynamic.’ This flexibility, unfortunately, makes it slower to run. There are so many great open source libraries out there for Python that it’s becoming the de facto programming language for data scientists. There’s scikit-learn, NumPy, Keras, TensorFlow, etc.
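To make ‘dynamic’ concrete, here is a small sketch of dynamic typing. The variable and function names are just illustrative. A name can be rebound to a value of a different type, and type errors only show up when the offending line actually runs. Those run-time checks are part of why the flexibility costs speed.

```python
# A name is not tied to a type; the object it points to carries the type.
value = 42
kind_before = type(value).__name__   # "int"
value = "forty-two"                  # rebinding the same name to a str is allowed
kind_after = type(value).__name__    # "str"


def describe(x):
    """Accept any object; its type is looked up at run time."""
    return f"{x!r} is a {type(x).__name__}"


# Nothing checks this line ahead of time; the error appears only when it executes.
try:
    result = "3" + 4
except TypeError as exc:
    error_message = str(exc)

print(kind_before, kind_after)
print(describe(3.14))
print("TypeError:", error_message)
```

In a statically typed language like Java, the equivalent of `"3" + 4` assigned to an integer variable would be rejected by the compiler before the program ever ran.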
A Python model can be put into production with pickle files and exposed as a REST API via a framework like Flask, but it’s a bit trickier. Still, you can rapidly prototype data science projects with Python, and if you get stuck there are a ton of communities to help you. Just search the Python questions on Stack Overflow.
I use Python extensively for mundane and routine tasks and occasionally do some data science with it.
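As a rough sketch of the pickle step, the snippet below serializes a stand-in ‘model’ and restores it the way a serving process would. The `model` dictionary and the `predict` helper are hypothetical placeholders, not a real trained model. With a real library you would pickle the fitted estimator object instead.

```python
import pickle

# Hypothetical "trained model": the learned parameters of y = slope * x + intercept.
model = {"slope": 2.0, "intercept": 1.0}


def predict(params, x):
    """Score one input with the stored parameters."""
    return params["slope"] * x + params["intercept"]


# Serialize to bytes; pickle.dump(model, file) writes the same bytes to a .pkl file.
blob = pickle.dumps(model)

# Later, in the serving process: restore the object and use it.
restored = pickle.loads(blob)
prediction = predict(restored, 3.0)

print(prediction)
```

A web framework route would then call something like `predict` on each incoming request. One caution that applies to any setup like this: only unpickle files you trust, because loading a pickle can execute arbitrary code.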
I love what Julia can become. It’s a great programming language that reminds me a lot of Python and R BUT it has speed. It has a Just In Time (JIT) compiler that makes it leaps and bounds faster than Python and was designed from the ground up to be parallelized and offloaded to the ‘cloud.’ The negative for now is that it doesn’t have the depth and breadth of libraries that Python has but it’s growing.
I like that it works with Jupyter Notebook, which makes coding a lot easier.
Deep Learning Libraries
Right now there are so many competing Deep Learning libraries out there that it’s hard to choose one. I personally like TensorFlow and Keras (Keras being a wrapper around three DL libraries), but Keras seems to be the dominant one for today.
There are so many other bits of software and programming languages out there that I couldn’t even begin to write about each and every one of them. Like I said above, choose two platforms and/or languages and become a master in one.
For Part 2 I want to talk about taking all these tools and aligning them with a business problem.