“If we’d written this in Python we’d be done by now”

While Python has powerful libraries for processing large datasets, maintained by an active, broad and invariably ingenious community, those libraries are simply symptoms of something intrinsic to the language…


Getting to grips with Apache Spark…

The fundamental unit of Spark is the RDD (Resilient Distributed Dataset): an immutable collection of data partitioned across multiple servers, created either by loading data from storage or by transformations on existing RDDs. For the kind of data we’ve been working with, this doesn’t quite fit: our data arrives as a (near-)continuous feed and never resides on a filesystem. Enter: Spark Streaming.
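The immutability-and-lineage idea behind RDDs can be sketched without Spark itself. The `ToyRDD` class below is a hypothetical stand-in (eager rather than lazy, unlike real RDDs) showing the essential property: `map` and `filter` return new datasets derived from a parent, never mutating the original in place.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Toy model of RDD lineage — NOT real Spark. Each "transformation"
# returns a NEW dataset that remembers its parent, mirroring how
# Spark RDDs are immutable and derived from other RDDs.
# (Real RDDs are also lazily evaluated; this sketch is eager.)
@dataclass(frozen=True)
class ToyRDD:
    partitions: tuple               # data split across "servers"
    parent: Optional["ToyRDD"] = None

    def map(self, fn: Callable) -> "ToyRDD":
        # Apply fn to every element in every partition.
        return ToyRDD(
            partitions=tuple(tuple(fn(x) for x in p) for p in self.partitions),
            parent=self,
        )

    def filter(self, pred: Callable) -> "ToyRDD":
        # Keep only elements satisfying pred.
        return ToyRDD(
            partitions=tuple(tuple(x for x in p if pred(x)) for p in self.partitions),
            parent=self,
        )

    def collect(self) -> list:
        # Flatten all partitions back into one list.
        return [x for p in self.partitions for x in p]


base = ToyRDD(partitions=((1, 2), (3, 4)))   # two "partitions"
doubled = base.map(lambda x: x * 2)          # new dataset, parent = base
evens = doubled.filter(lambda x: x > 4)      # new dataset, parent = doubled
```

After these calls, `base.collect()` still returns `[1, 2, 3, 4]`: nothing was modified, each step only produced a new dataset with a pointer back to the one it came from.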