Mashing Up Julia Language with RapidMiner

If you want to execute any Python in RapidMiner, you have to use the Execute Python operator. This operator makes things so simple that people use the hell out of it. However, it wasn’t so simple in the “old days.” It could still be done, but it required more effort, and that’s what I did with the Julia Language. I mashed up the Julia Language with RapidMiner with only a few extra steps.

The way we mashed up other programs and languages in the old days was to use the Execute Program operator. That operator lets you execute arbitrary programs within a RapidMiner process. Want to kick off some Java at run time? You could do it. Want to use Python? You could do that (and I did) too!
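Kicking something off is just a matter of handing the operator a command line. A trivial, hypothetical command parameter, assuming julia is on your PATH, would be:

    julia -e "println(2 + 2)"

RapidMiner captures whatever the program writes to standard output, which is what makes the mashup possible.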

The best part? You can still use this operator today, and that’s what I did with Julia. Mind you, this tutorial is a simple Proof of Concept; I didn’t do anything fancy, but it works.

What I did was take a RapidMiner sample data set (Golf) and pass it to a Julia script that writes it out as a CSV file. The CSV file is saved to a working directory defined in the Julia script.

Tutorial Processes

A few prerequisites: you’ll need RapidMiner and Julia installed, and the Julia executable needs to be on your PATH environment variable (running julia --version from a fresh command prompt is a quick check). I had some trouble with this in Windows, but it worked fine after I fixed the path.

Below you’ll find the XML for the RapidMiner process and the simple Julia script. I named the script read.jl and called it from my Dropbox folder; you’ll need to repath this on your computer.

The RapidMiner Process

(Process XML: a Retrieve operator fetches the Golf sample data set and feeds it into an Execute Program operator that runs the Julia script.)
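The only setting that really matters in the Execute Program operator is its command parameter. Mine pointed at the script in my Dropbox folder, so it looked something like this (the path is illustrative; repath it for your machine):

    julia C:\Users\you\Dropbox\read.jl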

The Julia Language script

Note: You’ll need to run Pkg.add("DataFrames") in Julia first.
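The script itself is just a few lines. Here’s a minimal sketch of read.jl, assuming Execute Program pipes the example set as comma-separated text to the script’s standard input (this uses the Julia 0.5/0.6-era DataFrames API to match the Pkg.add note above, and the output path is illustrative):

    using DataFrames

    # RapidMiner's Execute Program operator pipes the example set to the
    # script's standard input, so parse STDIN straight into a DataFrame.
    df = readtable(STDIN)

    # Write the data back out as a CSV file in the working directory.
    # Repath this for your computer.
    writetable("C:/Users/you/Dropbox/golf.csv", df)

It’s worth running the script once from the command line with a small CSV piped in, just to make sure the paths work before wiring it into RapidMiner.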

Of course, the next step is to write a more refined Julia script, pass the data back INTO RapidMiner, and then continue processing it downstream.


Big Data and Infrastructure

I have a daily downtime routine. Every evening I set aside about an hour and think. I sit or walk around the house and ruminate about all sorts of random things. Sometimes it’s with a glass of wine, but more often it’s with a cup of black tea and milk. Sometimes my mind wanders to what I did that day or what I didn’t finish. Other times I get inspired to write a new blog post or create a new tutorial. Sometimes it’s an epiphany, like the impact of Big Data on Infrastructure.

Infrastructure Investigation

It’s no secret that I came from the Infrastructure field. I spent many years designing and managing infrastructure projects as a Civil Engineer. Some projects were big, some were small. I traveled to the remotest parts of Montana and North Dakota and worked all over the country. I’ve inspected bridges, roads, and sewers, and written countless reports.

Most of these reports were to highlight deficiencies in some bit of infrastructure. We’d take an inventory of the structure, take photos, and make measurements. Then the report would go to an agency where they’d use it to get budget monies to fix the problem.

Then I moved to the machine learning startup world, and here I am today. My first move was into Pre-sales, and right before I transferred to the Marketing group, I fielded some interesting queries from potential customers. One was from a major freight railroad and the other from a railroad car inspection company. Both of these organizations capture sensor data and make measurements on their infrastructure assets. They measure rail temperatures, gauges, wear patterns, widths, and hours of use.

Big Data Migration

The most interesting part of these queries? The data was migrating from standalone reports into Hadoop clusters. For the first time, at least since I was in the industry, data was coming together from all over the place. The only problem was getting the data out to work on it!

There are now many ways to get the data out and work with it (e.g. Spark, Hive, RapidMiner), but many engineering professionals don’t understand them. Ask any manager at an Infrastructure firm what Hadoop is and they won’t know. Some might have heard of data science and data mining, but they might not know what all the hoopla is about.

The hoopla is this.

Engineers use data to design all kinds of things. Imagine if they had access to a deeper pool of stress-strain data for bridges. Or for rails? What if researchers adjusted the mixture ratios of concrete or tempered steel differently to extract more performance, based on terabytes of data from a central research Hadoop cluster?

These scenarios are not far-fetched at all. I went to a presentation two years ago on forecasting flooding for Hurricane Sandy-type events in the NY area. The room was filled with engineers, and the presenter was from Stevens Institute of Technology. He said they run several wave function calculations to help state governments like New Jersey and New York predict where flooding will occur and how severe it will be.

After the presentation I asked him where they get their data, and he said from a group of computers tied together in a cluster.

The Future

The reality is that more Infrastructure companies are collecting ever-increasing amounts of data. They’re using drones to do bridge inspections, tying river gauges together via the Internet, and using more sensors than ever. These sensors (aka IoT) collect and stream this data somewhere. In the old days it was an Access database. Today it’s a more robust database, and one day it will be a big Hadoop cluster. The average Civil Engineer of my time hasn’t heard of a Hadoop cluster, but they’ve heard of Big Data and wonder what it’s about.

Soon they’ll crush the silos of their data stores, unlock innovation, and build their own clusters.

Imagine the world we can build then?

Machine Learning on a Raspberry Pi

It looks like Google is catching up to the idea of machine learning on a Raspberry Pi! Someone put RapidMiner on a Pi back in 2013 but it was slow because the Pi was underpowered.

The Pi has been a great thin client and a small but capable server. I’ve used it for my Personal Weather Station project and as an FTP server. Based on the news, things are about to get interesting for both Google and Raspberry Pi.

I don’t know what Google is planning to release to the Pi and Maker community, but based on the survey I filled out, they haven’t decided yet. They’re looking at C#, JavaScript, Go, Swift, Python, and all the other usual suspects.

The problem is optimizing the machine learning libraries for the Pi and having enough of them available to make it worthwhile for the community. My guess is that they’ll go with Python, TensorFlow, and Go (Grumpy).

Whatever they decide, I consider this big news for Tinkerers and Makers everywhere. There will be an explosion of innovation if the Google toolkit is comprehensive. The startup barrier to entry has been lowered: all you need is a Pi ($40), a domain, some sweat equity, and a dream.