Spark vs Hadoop

There’s a lot of hullabaloo about Spark vs Hadoop for Big Data these days. If you’re rushing to stand up a Big Data cluster, you’ve probably heard about this new Spark technology. The simplest way to think about the difference is that Hadoop is built for batch jobs, while Spark can do both batch and stream processing. However, the biggest promise of Spark is the ability to code in Scala, Python (PySpark), and soon R (SparkR).

Dynamic programming languages like Python have opened up new ways to program, letting you develop algorithms interactively non-stop instead of the write/compile/test/debug cycle of C, not to mention chasing the inevitable memory management bugs. (Smart Data Collective)

While I don’t see Spark supplanting Hadoop – both rely on HDFS for data storage – I do see Spark making that Hadoop elephant dance on the head of a pin.

As Mr. Schmitz so eloquently pointed out in the comments, neither Hadoop nor Spark will supplant the other; they coexist. What I meant to say in the last paragraph is that Spark will really let you leverage your Hadoop environment!

Using Python with RapidMiner

Just a quick note, I recently recorded a new video on how to use Python with RapidMiner Studio. This was part of the “Everyday Data Science with Tom Ott” series and you can check out more of my RapidMiner videos here.


In this video I show you how to use the Twython package and RapidMiner’s Text Mining extension to load tweets, process the text, and post a retweet based on the processed data!

Update: This video and process was made prior to the introduction of the native Twitter operators, but the original strategy remains.

Update 2: As of late 2017, this video is no longer available on RapidMiner’s YouTube channel. I no longer have access to the video, but I do have the sample process. Just replace the app_key, app_secret, oauth_token, and oauth_token_secret values with your own keys.
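As a rough sketch of where those credentials go: the Twython client takes all four keys in its constructor. The guarded import and the `make_client` helper below are my own illustration, not part of the original sample process.

```python
# Sketch of wiring Twython credentials into a client. The twython
# package is third-party (pip install twython), so the import is
# guarded to let the rest of the sketch run without it.
try:
    from twython import Twython
except ImportError:
    Twython = None

APP_KEY = "your-app-key"                    # replace with your own keys
APP_SECRET = "your-app-secret"
OAUTH_TOKEN = "your-oauth-token"
OAUTH_TOKEN_SECRET = "your-oauth-token-secret"

def make_client():
    """Return an authenticated Twython client, or None if twython is missing."""
    if Twython is None:
        return None
    return Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)

# With real keys in place you could then post a status:
# client = make_client()
# client.update_status(status="Hello from RapidMiner + Twython!")
```

The posting call is left commented out since it needs live credentials and a network connection.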


Twython and the Rise of Something

For the longest time I had a goal to automate Neural Market Trends and broadcast financial predictions to my readers. I wanted to use RapidMiner Server/Studio to build predictions from financial data, push them out to my blog and/or Twitter, then lather, rinse, and repeat.

I had this goal years ago but never got around to it because I was working in a different industry. Since that time a lot has changed: Amazon EC2 and S3 are out now, RapidMiner can live in the cloud, new Financial Econ extensions are available, Python has continued to march forward with packages like Pandas, and Julia is on the horizon.


The fact that I’m finally using Python, or rather the Twython package, to realize my goal is cool. Right now I’m automatically posting a tweet at 4:05 PM, 5:05 PM, and 6:05 PM showing my entry price and closing price for VSLR, INTC, and EWG. For the EWG post I even add the 30-day historical volatility to the tweet. This might be super simple for a Python hacker, but for me it’s a fun tinkering project.
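The 30-day historical volatility in that EWG tweet is just the annualized standard deviation of daily log returns. A minimal sketch of the calculation and the tweet formatting, assuming my own helper names rather than the actual script:

```python
import math
from statistics import stdev

def historical_volatility(closes, trading_days=252):
    """Annualized historical volatility from a list of daily closing prices."""
    # Daily log returns between consecutive closes.
    log_returns = [math.log(today / prev)
                   for prev, today in zip(closes, closes[1:])]
    # Annualize by the square root of trading days per year.
    return stdev(log_returns) * math.sqrt(trading_days)

def tweet_text(ticker, entry, close, hv=None):
    """Format the kind of price update posted to Twitter."""
    msg = f"${ticker} entry: {entry:.2f}, close: {close:.2f}"
    if hv is not None:
        msg += f", 30d HV: {hv:.1%}"
    return msg
```

Feed `historical_volatility` the last 31 closes to get 30 daily returns, then pass the result into `tweet_text` only for the tickers (like EWG) where you want it shown.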

Oh, did I mention that I’m doing all this via a Raspberry Pi? Yes, the Pi is perfect for stuff like this.

My next goal is to resurrect my historical volatility prediction process, automate it, and then use the Pi to mash the RapidMiner results together with an implied volatility calculation and tweet the appropriate option strategy for the upcoming week.
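The comparison step could look something like this. To be clear, this is not my actual prediction process, just a sketch of the common rule of thumb that rich implied volatility favors selling premium and cheap implied volatility favors buying it; the threshold is an arbitrary placeholder.

```python
def suggest_strategy(implied_vol, historical_vol, threshold=0.05):
    """Naive IV-vs-HV comparison; threshold of 0.05 = 5 volatility points."""
    spread = implied_vol - historical_vol
    if spread > threshold:
        return "options look rich: consider selling premium"
    if spread < -threshold:
        return "options look cheap: consider buying premium"
    return "no clear edge: stand aside"
```

The string this returns is exactly the kind of short verdict that could be appended to the weekly tweet.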


I use my laptop or an old Raspberry Pi to automate a lot of Python scripts. These range from Pinboard link extraction for my monthly newsletter to posting blog posts to Twitter. Definitely check out my ongoing list of Raspberry Pi Projects.
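On the Pi, the scheduling itself is just cron. A crontab entry like the one below (the script path is hypothetical) would fire a script at 4:05, 5:05, and 6:05 PM on weekdays, matching the posting schedule above:

```
# min  hour   dom mon dow  command
5      16-18  *   *   1-5  /usr/bin/python3 /home/pi/tweet_prices.py
```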