image image 10th November 2015

Getting to grips with Apache Spark...

After its first stable release last year, Apache Spark was one of those things that seemed to get mentioned a lot by colleagues but no one was actually using it (see also: Apache Storm). After finally getting to grips with it, it’s been an oddly mixed experience.

As someone who’s been using Hadoop for the last half-dozen years or so (starting with version 0.2 or thereabouts; something I was stuck with for most of that time) it was hard to overlook the similarities. It runs on Hadoop’s YARN, it processes distributed data in HDFS….much like those who claim “I rewrote MapReduce in X lines of code; what’s special about Hadoop?”, I managed to completely miss the point. If MapReduce is the recipe, Hadoop is the whole damn restaurant franchise. Similarly, while map() and reduce() might be a couple of the tools Spark offers, it’s so much more than that.

MapReduce is wonderful when it fits the use-case. Invariably, however, you find yourself trying to leverage the pattern where it really doesn’t fit. At its very basic, Spark offers a whole series of operations that can be performed on data in addition to those two. The versatility of those core elements allows you to produce complex, multi-pass, multi-step data pipelines, all lazily computed—that is, they’re not actually calculated until you actually do something with the data.

MapReduce is a great solution for one-pass computations, but not very efficient for use cases that require multi-pass computations and algorithms. Each step in the data processing workflow has one Map phase and one Reduce phase and you’ll need to convert any use case into MapReduce pattern to leverage this solution. And it tries to do as much of this in-memory as possible to really improve speed (I admit at this point the phrase “in memory” really doesn’t fit with “Big Data” but I’ve yet to actually push its limitations).

Perhaps its most beautiful feature (time to put on the “I ? Python” hat) Spark, although fundamentally built around the JVM framework (and thus supporting Java and Scala) also has libraries for writing full applications in Python and R. Despite this, and my fondness for Python, my experiences have largely been limited to Java.

And it’s been ugly.

Scala, Python and R all lend themselves wonderfully to Spark’s functional paradigm. Java, however, is as Object Oriented as they get and—although I applaud the work trying to leverage Spark functionality into Java—it’s one hell of a paradigm shift, one that exposes the limitations of the latter. Foremost: classes everywhere. Every function is a class which has a method called call() (which does the actual grunt-work of our ‘function’). Need a function that takes two arguments? It’s called Function2 (suffixing your class names with numbers should be outlawed; and yes, there’s a Function3). And tuples? Something I’ve utterly taken for granted in Python, here are classes called Tuple2 (which explicitly have 2 elements; I’m almost scared to check if there’s a Tuple3).

Language-rant aside, Spark’s execution can be more than a little confusing. Hadoop was more explicit in what happened: you wrote your driver code, your mapper, reducer (maybe an InputFormat if you were feeling frisky) and you knew what was going to happen. Here, Spark’s a little less forthcoming about what’s going on (and where) and you’ll spend an inordinate length of time wondering why performance is so bad (because everything’s running on the driver) and when that’s finally sorted, why everything’s whining about the fact that nothing serializes (or pickles or whatever).


The fundamental unit of Spark is the RDD (Resilient Distributed Dataset)—an immutable unit of data partitioned across multiple servers—which get created as part of operations on other RDDs. For the type of data we’ve been working with, this doesn’t quite fit. Our data arrives as a (near-) continuous feed and doesn’t reside on a filesystem anywhere. Enter: Spark Streaming.

Spark Streaming treats the continuous stream of data as contiguous RDDs (a discretized stream or DStream). At a (configurable) regular interval, the data are batched up and processed in individual jobs and voila! They can largely be treated the same as a regular Spark jobs.

For the most part this works wonderfully and gives near-real-time processing of our inbound data. There seem, however, to be a couple of problems: firstly, the batching is temporal (i.e. every X (milli-/micro-/seconds) which for some data isn’t a natural fit. Secondly, the temporal element isn’t necessarily obeyed: if the batch takes longer than the interval to process for any reason (like your Kibana instance going caput and everything grinding to a halt—hypothetically speaking, of course) then the next batch won’t be processed until it completes. By which time you have even more data, which will, of course, take longer to process…

But those concerns are largely architectural. Spark is still a relatively new framework and already we’ve been able to obtain some very impressive results. Of course, if we’d written it in Python…

We Love Data

Want to know more?

Drop us a line – we’re always happy
to chat – we promise we’ll keep the
geek speak to a minimum (unless
that’s your bag in which case we’ll
happily comply)!