September 22, 2015

Spark vs Hadoop

There’s a lot of hullabaloo about Spark vs Hadoop for Big Data these days. If you’re rushing to stand up a Big Data cluster, you’ve probably heard about this new Spark technology. The simplest way to think about the difference is that Hadoop is built for batch jobs, while Spark can do both batch and stream processing. However, the biggest promise of Spark is the ability to code in Scala, Python, and soon R.
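To make the batch idea concrete, here’s a minimal word-count sketch in plain Python. This is the classic MapReduce-style batch pattern that Hadoop popularized and that Spark also supports; it’s an illustration only, not actual Hadoop or Spark API code, and the sample data is made up.

```python
from collections import Counter

# Made-up input data standing in for lines of a file on HDFS
lines = ["big data big spark", "spark streaming"]

# "Map" phase: split each line into (word, 1) pairs
pairs = [(word, 1) for line in lines for word in line.split()]

# "Reduce" phase: sum the counts per word
counts = Counter()
for word, n in pairs:
    counts[word] += n

print(counts["spark"])  # 2
```

In actual PySpark the same job would chain `sc.textFile(...)` with `flatMap`, `map`, and `reduceByKey` over an RDD, but the map-then-reduce shape is the same.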

Dynamic programming languages like Python have opened up new ways to program, letting you develop algorithms interactively non-stop instead of the write/compile/test/debug cycle of C, not to mention chasing the inevitable memory management bugs. (Smart Data Collective)

While I don’t see Spark supplanting Hadoop - both can rely on HDFS for data storage - I do see Spark being leveraged to make that Hadoop elephant dance on a pinhead.

As Mr. Schmitz so eloquently pointed out in the comments, neither Hadoop nor Spark supplants the other; they coexist. What I meant to say in the last paragraph is that Spark really lets you leverage your Hadoop environment!

Don't forget to sign up for our monthly newsletter on Data Science and RapidMiner here!


Tags: thoughts, Machine Learning

