Is it Possible to Automate Data Science?

A few months ago I read about a programmer who automated his job down to the point where the coffee machine would make him lattes! Despite the ethical quandary, I thought it was pretty cool to automate your job with scripts. Then I wondered: is it possible to automate data science? Or at least parts of it? This general question proved to be a rabbit hole of exploration.
 
StackExchange has an ongoing discussion about another programmer’s automation of his tasks. He used scripts to prepare customer data into spreadsheets that other employees would use. The task used to take a month, but he was able to cut that time down to 10 minutes. It took him several months to figure out how to build the right scripts, but he now works only 1 to 2 hours a week and gets paid for 40 hours.

Throw Data on the Wall
 
In my life at RapidMiner I interacted with potential customers that wanted to “throw data on the wall and see what sticks.” They wanted to find some automated way to use data science to tell them something novel. This usually raises a red flag in my mind and leads me to ask more detailed questions like:
 
“Do you know the business objective you want to solve/meet?”
 
“Do you have a Data Science team or plan to hire a Data Scientist?”
 
“How do you do your data exploration and glean insight now?”
 
At this point I can ferret out the true reason for the call, or the lack of understanding of the true problem at hand. I’ve even had one potential customer reveal that he called us because he had heard of this “data mining stuff” 6 months earlier and wanted to get in on it quick.
 
I get it. If you have lots of data, where do you begin to make sense of it?

Automate what?
 
The path to insight in your data starts with the data. It’s always going to be messy: missing values, wrong keystrokes, and entries in the wrong places. It’s in a database in one office but in Sally’s spreadsheet in another office. You can’t get any insight until you start extracting the data, transforming it, and loading it for analysis. This is the standard ETL we all know and love to hate.
 
You can automate ETL completely provided you know what format your data needs to be in. This is where tools like SQL and RapidMiner can help with your dirty work. If you haven’t automated your ETL, you’re behind the curve!
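
To make that concrete, here is a minimal ETL sketch using only the Python standard library. Everything in it is an assumption for illustration: the sample CSV, the column names, the `customers` table, and the cleaning rules are all made up, and a real pipeline would read from your own sources.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: stray whitespace, inconsistent case, a missing value.
RAW_CSV = """customer_id,region,revenue
1, east ,100.50
2,WEST,
3,East,75
"""


def extract(text):
    """Read CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Normalize text fields and turn blank revenue into None (SQL NULL)."""
    cleaned = []
    for row in rows:
        revenue = row["revenue"].strip()
        cleaned.append((
            int(row["customer_id"]),
            row["region"].strip().lower(),
            float(revenue) if revenue else None,
        ))
    return cleaned


def load(rows, conn):
    """Write the cleaned rows into a SQLite table ready for analysis."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(customer_id INTEGER PRIMARY KEY, region TEXT, revenue REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO customers VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(text, conn):
    """Chain the three steps so the whole job can run unattended."""
    load(transform(extract(text)), conn)


conn = sqlite3.connect(":memory:")
run_etl(RAW_CSV, conn)
print(conn.execute(
    "SELECT region, COUNT(*), SUM(revenue) FROM customers "
    "GROUP BY region ORDER BY region"
).fetchall())
```

The transform step only works because the target format is known up front (integer IDs, lowercase regions, numeric revenue), which is exactly the condition for automating ETL.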
 
Once all the data is ready, you can model it and test your hypothesis, but with which algorithm?
 
Here’s where the critical thinking comes in. You can’t automate your decision of which model to put into production, but you can automate the modeling and evaluation of it. Once again, here’s where RapidMiner can help.
 
When working with a business group, the ubiquitous Decision Tree algorithm tends to come up. Why? Because business people LOVE the pretty tree it makes and they’ve always used it before.
 
While Decision Trees are a great algorithm, they’re notorious for overfitting. So then use Random Forests! Random Forests do help with the overfitting problem, but are they the right algorithm to use for your particular problem?

Automate Modeling and Evaluation
 
You can automate modeling and evaluation in RapidMiner. It’s easy to try many different algorithms within the same process and build ROC plots. You can output performance measures like LogLoss or AUC to rank which model performed the best. You can even create a leaderboard in RapidMiner Server to ‘automatically’ display which model performed the best!
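
The same idea can be sketched outside RapidMiner. The snippet below is not a RapidMiner process. It is a rough Python equivalent using scikit-learn, and it assumes a synthetic data set, three arbitrarily chosen models, and a made-up helper called `build_leaderboard`. It cross-validates every model on the same folds and ranks them by AUC.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a real, already-cleaned data set (an assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate models; the choice and settings here are illustrative only.
candidates = {
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}


def build_leaderboard(models, X, y, folds=5):
    """Cross-validate every model on the same folds and rank by mean AUC."""
    board = []
    for name, model in models.items():
        scores = cross_validate(
            model, X, y, cv=folds, scoring=["roc_auc", "neg_log_loss"]
        )
        board.append({
            "model": name,
            "auc": scores["test_roc_auc"].mean(),
            # scikit-learn negates log loss so that higher is better; flip it back.
            "log_loss": -scores["test_neg_log_loss"].mean(),
        })
    return sorted(board, key=lambda row: row["auc"], reverse=True)


leaderboard = build_leaderboard(candidates, X, y)
for row in leaderboard:
    print(f"{row['model']:<20} AUC={row['auc']:.3f}  LogLoss={row['log_loss']:.3f}")
```

The ranking is automated, but the decision about which model goes into production is still yours.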
 
I’ve worked with customers that do just that. They used RapidMiner to prototype, optimize, and deploy models in a week. Even if they need bits of Python or R to finish the job, they just automate everything.
 
Yet the question remains: should you do this? The answer is that it depends on whether you know what you are doing. For example, feature generation is something that I’d be very cautious to ‘automate’. Sure, you can create some simple calculations and add them as a new attribute, but in general feature generation is something that requires a bit more thinking and less automation. That is, until you’ve figured out what features work.

In a nutshell, here’s what you can automate, with warnings:

  1. ETL: You bet, automate away if you know what format your data needs to be in
  2. Model Building: Yes, because of the no free lunch theorem you should try multiple models on the same data set. Just be cautious of the algorithms you choose
  3. Evaluation: Yes, just compare each model’s results using the same performance metrics, and more than one of them (e.g. LogLoss, AUC, Kappa)
  4. Feature Generation: No at first.  This is where your thinking comes in on how to include new data or manipulate the existing data to create new features that your model can train on. After that, you can automate it
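
For the evaluation item in the list above, it helps to see what the three named metrics actually measure. The sketch below computes them by hand in plain Python. The labels, the two sets of predicted probabilities (`model_a`, `model_b`), and the 0.5 threshold for Kappa are invented for illustration, not taken from any real project.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood of the true labels; lower is better."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


def auc(y_true, probs):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [p for y, p in zip(y_true, probs) if y == 1]
    neg = [p for y, p in zip(y_true, probs) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def kappa(y_true, probs, threshold=0.5):
    """Cohen's kappa: agreement with the labels beyond what chance would give."""
    preds = [1 if p >= threshold else 0 for p in probs]
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, preds)) / n
    label_rate = sum(y_true) / n  # fraction of true positives in the labels
    pred_rate = sum(preds) / n    # fraction of predicted positives
    expected = label_rate * pred_rate + (1 - label_rate) * (1 - pred_rate)
    # Undefined when expected == 1 (all labels and predictions in one class).
    return (observed - expected) / (1 - expected)


# Hypothetical labels and predicted probabilities from two models.
y_true = [1, 0, 1, 1, 0, 0]
model_a = [0.9, 0.2, 0.6, 0.4, 0.3, 0.7]
model_b = [0.8, 0.1, 0.7, 0.6, 0.4, 0.3]

for name, probs in [("model_a", model_a), ("model_b", model_b)]:
    print(name, round(log_loss(y_true, probs), 3),
          round(auc(y_true, probs), 3), round(kappa(y_true, probs), 3))
```

Scoring both models with the same functions on the same labels keeps the comparison fair, and looking at several metrics guards against a model that only looks good on one of them.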

Data Science Helps Us Ask the Right Questions

We all do Data Science on a daily basis but sometimes we forget why we’re really doing it. It’s not to spend hours coding but rather it’s to answer often ambiguous questions.

We learn to ask the right questions at an early age. At an intersection, for example, a child might ask his parent: “Does red mean we must stop or just should stop?” The validity of the question will be confirmed by the answer in that case. Years later we ask questions about all aspects of our lives — jobs, finance, relationships etc. We hope to ask the right questions at the right time. [via Forbes]

The formulation of the right questions (aka hypothesis) is key.

When we take on complex scientific problems using data science, asking the right questions at each stop is critical to the process. Failure to do so may make the difference between frustration and profound innovation. Aim carefully and with proper consideration in order to sculpt the right question. You may not get a second chance. [emphasis mine]

 

Millennials can’t catch a break

This is just nuts. Millennials just can’t seem to catch a break. Now AI is coming for their jobs.

Research released by Gallup on Thursday indicates a collision between technology and “business as usual” is coming soon, and the fallout will be ugly, especially for Millennials. Automation and artificial intelligence (AI) are among the most disruptive forces descending upon the workplace, says the Gallup report, and 37% of Millennials “are at high risk of having their job replaced by automation, compared with 32% of those in the two older generations.” [via Forbes]

So how can they stay relevant? Look for new trends in hiring. The top one I can think of is Data Science.

If you’re considering a career move, get a beat on what jobs are trending up (software engineer) and which ones are on their way out (reporter). You can boost your skills through a boot camp or with a traditional degree, no matter what your industry is, but know that some companies may prefer a regular degree over a boot-camp certificate or DIY learning.

But those industries might be susceptible to offshoring.

Though the Bureau of Labor Statistics (BLS) says that programmer and coder jobs will decline 8% due to outsourcing to other countries from 2014 to 2024, there will still be plenty of work, and in many cases, it will be too unwieldy to move massive operations overseas.

So in other words, Millennials can’t seem to catch a break. If I were part of that creative and awesome generation, I’d probably go the route of entrepreneurship.