Apr 25, 2018 6 min read Tutorials

Getting Started in Data Science - Part 1

My experiences in learning Data Science and working at a Startup. I share my lessons learned and tips on how you can get started in Data Science.

This is the foreword to an introduction to getting started in data science. I wanted to write a set of 'getting started' posts to share with readers how I became a data scientist at RapidMiner, and how I went from a civil engineer with an MBA to working for an amazing startup. Granted, I'm not a classically trained data scientist and I hardly knew how to code, but with the right tools and attitude, you can 'huff' your way into this field.

Will you be a data scientist after reading this series of posts? Of course not, but you'll have a framework to move forward or, at the least, have a better understanding of what we do.

Introduction

My journey into data science started with my engineering degree. It taught me the basics of statistics, math, critical thinking, and even how to program in Fortran. I promptly forgot how to code in Fortran, which it turns out was a mistake on my part. I worked as a civil engineer for close to 6 years before I decided to get an MBA with a specialization in technology.

It was in MBA school that I took a course titled "Data Mining for Managers" taught by Dr. Stephan Kudyba. Dr. Kudyba turned me on to a passion that I didn't know existed. The very thought of 'mining' data for statistical relationships got me so excited that I ended up starting this website/blog back in 2007. He was the match to this fire.

It was after his class that I found YALE, the initial alpha version of what became RapidMiner. Right off the bat, I could tell that it was feature-rich but incredibly hard to understand or use. You had to be a Ph.D. to figure it out, which made sense since the founders (Ingo & Ralf) were Ph.D. students at the time. In my Data Mining for Managers class, we never talked about cross-validation. We never created a confusion matrix or calculated precision/recall. We just talked about ETL and data preparation, modeling with a neural net, and consuming the results. So, I had work to do if I wanted to use this tool.
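
For readers who have not met these terms either, here is a minimal Python sketch, using only the standard library, of what a confusion matrix, precision/recall, and k-fold cross-validation splits look like. The labels are made up, and the helper names (confusion_counts, precision_recall, k_fold_indices) are hypothetical; none of this is RapidMiner's own implementation.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive=1):
    """Count the four cells of a binary confusion matrix."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


def precision_recall(counts):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    predicted_pos = counts["tp"] + counts["fp"]
    actual_pos = counts["tp"] + counts["fn"]
    precision = counts["tp"] / predicted_pos if predicted_pos else 0.0
    recall = counts["tp"] / actual_pos if actual_pos else 0.0
    return precision, recall


def k_fold_indices(n_rows, k):
    """Yield (train, test) index lists so every row is held out exactly once."""
    for fold in range(k):
        test = [i for i in range(n_rows) if i % k == fold]
        train = [i for i in range(n_rows) if i % k != fold]
        yield train, test


# Made-up labels for illustration: 1 = positive class, 0 = negative class.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(actual, predicted)
precision, recall = precision_recall(counts)
folds = list(k_fold_indices(len(actual), k=4))

print(dict(counts), precision, recall)
```

In a real project, each (train, test) pair from k_fold_indices would be used to fit a model on the training rows and score it on the held-out rows, and the metrics would be averaged across folds.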

No Coding

Even though it was hard, I chose YALE/RapidMiner because I didn't need to code. I didn't have time to teach myself a programming language, which, on reflection, was probably a mistake on my part. I had the chance to take some Java classes back then but decided not to. If I had to do it all over again, I would choose either Java or Python to learn from the very beginning: Java if I wanted to build out RapidMiner, and Python because it's fast to prototype and easy to work with.

What are Data Scientists coding in?

This will change from year to year, and you can always check out the KDnuggets yearly poll on what data scientists are using, but here are the ones I'm familiar with, along with my comments. My suggestion: pick two, but become proficient in one.

Java

Java is a statically typed language. That means you have to explicitly declare the type of each variable, so it takes more time to write your program. The data science benefit is that platforms like RapidMiner and KNIME run on it, so they're platform-independent. H2O.ai also lets you export a model as a POJO file (Plain Old Java Object) so you can quickly put it into production. Then there's WEKA, another Java-based data science platform. The upside to Java is that it's very mature and has a ton of libraries to use. Note: H2O-3 is Java-based but it has Python and R APIs.

Another benefit is that Java works well if you're working with Hadoop. Of course, every Hadoop distribution will be different, but generally, it supports Java. If Hadoop and Big Data interest you, then also look into Scala. Scala runs on the Java Virtual Machine and is very similar to Java.

R

I'll start with a disclaimer: I've used R and it has some great packages, but I find it clunky. This is my personal bias, and I've worked with people who love R. It's very feature-rich open-source software that lets you do all aspects of machine learning, with some of the best graphics libraries out there. A lot of universities teach data science-related courses in R and I completely understand why. It's not as heavy to code as Java and it is a bit easier than Python in my opinion, but you have to know the syntax. It's a bit harder to put into production, though you can use it on Hadoop via SparkR. You can download it right away and get started with the thousands of video tutorials out there.

If you're going to work with R, I suggest downloading RStudio. It's a very nice workbench that lets you write R scripts, load data, and display charts in one neatly organized place.

Python

I like Python a lot because it fits an engineering mindset. Programming is relatively fast and everything is considered 'dynamic'. This flexibility, unfortunately, makes it slower at run time. There are so many great open-source libraries out there for Python that it's becoming the de facto programming language for data scientists. There's scikit-learn, NumPy, Keras, TensorFlow, etc.
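
As a rough illustration of what 'dynamic' means here, the sketch below uses a made-up helper, describe, that accepts any object supporting len() without a type declaration. The same flexibility means type checks happen while the program runs, which is generally cited as one reason Python is slower than statically typed languages like Java.

```python
def describe(value):
    """Work on anything that supports len(); no type declaration is required."""
    return f"{type(value).__name__} of length {len(value)}"


# The same name can be rebound to objects of different types at run time.
record = [3, 1, 2]
first = describe(record)

record = "three"
second = describe(record)

# The cost of that flexibility: a type error only surfaces when the line runs.
try:
    describe(42)  # an int has no len()
    failed = False
except TypeError:
    failed = True

print(first, "|", second, "|", failed)
```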

A Python model can be put into production by saving it as a pickle file and exposing it as a REST API via a framework like Flask, but it's a bit trickier than with Java. Still, you can rapidly prototype data science projects with it, and if you get stuck, there are a ton of communities to help you. Just search the Python questions on Stack Overflow.
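
A minimal sketch of the pickle half of that workflow, using only the standard library. The 'model' here is a hypothetical dictionary of parameters and predict is a made-up scoring function, standing in for whatever a real library such as scikit-learn would produce.

```python
import pickle

# Hypothetical "trained" model: just the parameters of a linear scorer.
model = {"weights": [0.4, -0.2], "bias": 0.1, "threshold": 0.5}


def predict(model, features):
    """Return 1 if the weighted sum of the features clears the threshold."""
    score = model["bias"] + sum(
        w * x for w, x in zip(model["weights"], features)
    )
    return 1 if score >= model["threshold"] else 0


# Serialize to bytes (pickle.dump would write the same bytes to a file).
payload = pickle.dumps(model)

# Later, in the serving process: restore the model and score a new row.
restored = pickle.loads(payload)
prediction = predict(restored, [2.0, 1.0])

print(prediction)
```

In a web framework such as Flask, a request handler would load the pickled model once at startup and call predict on the values it receives. Only unpickle files you trust, since loading a pickle can execute arbitrary code.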

I use Python extensively for mundane and routine tasks and occasionally do some data science with it.

Julia

I love what Julia can become. It's a great programming language that reminds me a lot of Python and R, but it has speed. It has a Just-In-Time (JIT) compiler that makes it leaps and bounds faster than Python, and it was designed from the ground up to be parallelized and offloaded to the 'cloud.' The negative for now is that it doesn't have the depth and breadth of libraries that Python has, but it's growing.

I like that it can be integrated with a Jupyter Notebook, which makes things a lot easier to code in.

Deep Learning Libraries

Right now there are so many competing Deep Learning libraries out there that it's hard to choose one. I personally like TensorFlow and Keras (Keras being a wrapper of three DL libraries) but Keras seems to be the dominant one for today.

Others

There are so many other bits of software and programming languages out there that I couldn't even begin to write about every one of them. Like I said above, choose two platforms and/or languages and become a master in one.

In Part 2, I want to talk about taking all these tools and aligning them with a business problem.