
Why I Left RapidMiner


For those who are wondering why I left RapidMiner, my dream job, there are no gory details to share.

The simple reason is I got burnt out. My time at RapidMiner was some of the best learning and growth years of my entire professional career. I solved problems, made presentations to C-suite executives, and worked with some of the best talent. The flip side was that it wasn't easy, and it sure as hell wasn't a smooth ride. I worked through some of the most tumultuous years at RapidMiner: three years of management changes and radical 180-degree strategy changes while I was there. All this 'chaos' eventually took its toll on me.

Don't get me wrong, I have no ill will towards anyone. I really miss my colleagues and friends, but I made my decision to leave in early May 2017. I had no real plan at that moment; I just needed some 'me' time. I quit in July 2017 and took a few weeks off to be with my family.

As luck would have it, my town's Municipal Engineer approached me and asked if I'd like to work on a town project. He knew that in my former life as an Engineer I had a lot of stormwater management design experience, and this project needed some big stormwater analysis. I decided to take the job and simultaneously start a Data Science consultancy too. I ended up doing Engineering and RapidMiner/Data Science consulting from August 2017 until right about September 2018. What happened in September 2018? Well, that deserves its own post for another time.

I started taking on RapidMiner-related consulting work and doing my stormwater analysis. I was living the life of a consultant: working strange hours, worrying about invoices, worrying about where the next job would come from, etc. Luckily I picked up more Engineering work from my Municipal Engineer to keep me afloat as I navigated the lean times.

In August of 2018, my stormwater project ended up getting approved by the NJDEP and Highlands Council as a Major Development in New Jersey. This was a first in Highlands and NJDEP history, a brand new Community Center and Shelter got the green light to be constructed, all because we engineered a better stormwater management system. Although my work will never be seen - it's all underground - I can take solace that I will be recharging 133% of clean rainwater over the development area into a depleting aquifer. I know that this project will be a benefit to the Community and the environment, and that makes me happy.

Making a positive impact can be really hard at times but the reward is immeasurable.

So there you have it. Nothing to see. It was time to move on to the next adventure. I can take solace that all the RapidMiner adventures and friends will always be a part of me. As Ingo from RapidMiner would say, "onward and upward."

Let's go.


Python, RapidMiner, and Carriage Returns

I've been working on some Python code for a RapidMiner process. What I want to do is simplify my Instagram Hashtag Tool and make it go faster.

Part of that work is extracting the Instagram comments for text processing. I ran into utter hell trying to export those comments into a CSV file that RapidMiner could read. It was exporting the data just fine but wrapping the comments across carriage returns. For some strange reason, RapidMiner cannot read carriage-returned data in a cell; it can only read the first line. Luckily, with the help of some users, I managed to work around it and find a solution on my end: do all the carriage return stripping on my side before export.

The trick is to strip all carriage returns, spaces, tabs, etc. using the regular expression \s, then replace the stripped items with a single space (' ') in place. While this isn't elegant, it had to be done because Instagram comments are so messy to begin with.
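A minimal sketch of that whitespace substitution on a made-up comment (note the pattern is `\s` with a backslash; publishing tools tend to eat it):

```python
import re

# A made-up Instagram-style comment with embedded newlines and tabs
comment = "Great shot!\r\nLove the colors\t#film\n#35mm"

# \s matches any whitespace (spaces, tabs, \r, \n); \s+ collapses each run
clean = re.sub(r'\s+', ' ', comment).strip()

print(clean)  # Great shot! Love the colors #film #35mm
```

With the carriage returns gone, every comment sits on a single line and lands in a single CSV cell.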


#!/usr/bin/env python
# -*- coding: utf-8 -*-

import pandas as pd

# Pull the public JSON for the #film hashtag page
df = pd.read_json('https://www.instagram.com/explore/tags/film/?__a=1')

name = df['graphql']['hashtag']['name']

hashtag_count = df['graphql']['hashtag']['edge_hashtag_to_media']['count']

count_list = []
likes = df['graphql']['hashtag']['edge_hashtag_to_media']['edges']
for i in likes:
    count_list.append((name,
                       hashtag_count,
                       i['node']['id'],
                       i['node']['edge_liked_by']['count'],
                       i['node']['display_url']))

count_out = pd.DataFrame(count_list)
count_out.columns = ['hashtag', 'count', 'user_id', 'likes', 'image_url']

# This just exports out a CSV with the above data. Plan is to use this in a RM process

# Now comes the hard part, preparing the comments for a RM Process.
# This is where the carriage returns killed me for hours

text_export = []
edges = df['graphql']['hashtag']['edge_hashtag_to_media']['edges']
for i in edges:
    captions = i['node']['edge_media_to_caption']['edges']
    for j in captions:
        text_export.append(j['node']['text'])

text_out = pd.DataFrame(text_export)

# This is the key: strip every whitespace run with \s+ and replace it with a space

text_out.replace(r'\s+', ' ', regex=True, inplace=True)

text_out.columns = ['comment']

text_out.to_csv('out2.csv', line_terminator='\r\n')


Exploring H2O.ai

A few years ago RapidMiner incorporated a fantastic open source library from H2O.ai. That gave the platform Deep Learning, GLM, and GBT algorithms, something it had been lacking for a long time. If you were to look at my usage statistics, I'd bet you'd see that the Deep Learning and GLM algorithms are my favorites.

Just late last year H2O.ai released their Driverless AI platform, an automated modeling platform that can scale easily to GPUs.

What I find fascinating is their approach of questioning each step of the way. The above video outlines the problem with lung tumor detection: is your model learning the shape of the ribs or the size of the tumor? You would hope it was the tumor!

Fascinating video.


I know that H2O.ai relentlessly drives their open source market, and everywhere I look there's an H2O.ai library being imported or used. It wasn't a shock to me to see a new update to their Driverless AI product, but what got me giddy was their incorporation of time series. This I have to check out. Time series can always be a pain and you can make mistakes easily, especially in the validation phase, but this is just plain cool.
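The classic time series validation mistake is leakage: a random train/test split lets the model peek at the future. A minimal sketch of the walk-forward (rolling-origin) splits an automated tool has to get right; the column names and sizes here are made up for illustration:

```python
import pandas as pd

# Hypothetical daily series; the column names and values are made up
ts = pd.DataFrame({
    'day': pd.date_range('2018-01-01', periods=10, freq='D'),
    'value': range(10),
})

def walk_forward_splits(df, n_splits=3, test_size=2):
    """Yield (train, validation) pairs where the training rows always
    precede the validation block, so no future information leaks in."""
    n = len(df)
    for i in range(n_splits):
        test_end = n - i * test_size
        test_start = test_end - test_size
        yield df.iloc[:test_start], df.iloc[test_start:test_end]

for train, test in walk_forward_splits(ts):
    print(len(train), 'train rows ->', len(test), 'validation rows')
```

Each validation block only ever sees a model trained on earlier dates, which is exactly the discipline that a shuffled cross validation silently breaks.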

I definitely need to check this out more.

Video Highlights/Notes

  • Many Kaggle Grandmasters at H2O
  • Built Driverless AI to avoid common pitfalls/mistakes of data science
  • Automate tasks: Cross validation, time series, feature engineering, etc
  • Ran it on a Kaggle challenge, finished in 18th place
  • Goal: Build robust models and avoid overfitting
  • Automatic visualization (of big data)
  • No outlier removal; outliers remain in the big data set
  • Want to deploy a good model / must have an approximate interpretation
  • Java deployment package / Driverless AI will have a pure Java deployment
  • Not just talking about models but an entire model pipeline (feature generation, model building, stacking, etc)
  • Typically deployed to a Linux box
  • Will be building a Java scoring logic to score the model pipeline (on roadmap)
  • Sparkling Water will be incorporated into Driverless AI so you can run this easily on Big Data
  • Want R/Python scripts to interact with Driverless AI so it makes sense to the Data Scientist and is easy, not complex, to use
  • Deep Learning is inside but not enabled yet
  • Compromise: If you want to train many models you select a good sized training set but not huge. There is a # of models vs training time tradeoff
  • User defined functions coming
  • Import the training and testing data. Model will be built on training data only (won't look at testing data)
  • Does batch style transformations instead of row by row for training
  • BUT it will do row by row transformations for testing set
  • Uses a genetic algo to create new features
  • Checks overfitting and stops early based on a holdout
  • Uses methods to evaluate and prevent overfitting
  • Only validation scores are provided (out of sample estimates)
  • Interpretability is built in
  • After the model is created, you can build a stacked model
  • Download scoring package, all built in so you can put this into production


Introduction to RapidMiner Server

I made a new video on RapidMiner Server! This is just a high level overview of the Web GUI and how to navigate through it. In future videos I'll be showing you how to productionalize a RapidMiner Studio process, expose Web Services, and even make a Dashboard.


Neural Market Trends is the online home of Thomas Ott.