There’s a lot of hullabaloo about Spark vs Hadoop for Big Data these days. If you’re rushing to stand up a Big Data cluster, you probably heard about this new Spark technology. The simplest way to think about the differences is that Hadoop is for batch jobs and Spark can do batch and stream processing. However, the biggest promise of Spark is the ability to code in Scala, Python (PySpark), and soon R (SparkR).
Dynamic programming languages like Python have opened up new ways to program, letting you develop algorithms interactively non-stop instead of the write/compile/test/debug cycle of C, not to mention chasing the inevitable memory management bugs. (Smart Data Collective)
While I don’t see Spark supplanting Hadoop – both rely on the HDFS data storage system – I see the leveraging of Spark to make that Hadoop elephant dance on a pin head.
As Mr. Schmitz so eloquently pointed out in the comments, Hadoop and Spark can’t supplant the other, they coexist together. What I mean to say in my last paragraph is that Spark will really let you leverage your Hadoop environment!