Do I make the pipe bigger, or buy more pipes?
Scale up vs scale out is a pretty standard conundrum nowadays – particularly with the advent of distributed compute models such as Hadoop. A few years ago, the standard approach in the Business Intelligence & data management world was simply to throw more hardware at it.
I’ve worked in places where servers have half a terabyte of RAM and more cores than I can count… and looking back, I wonder whether this needs to be the case.
Now – just to be clear, this isn’t a rant about why Hadoop is going to fix all your woes. In case you hadn’t realised yet – it won’t.
However, the revelation that vast data processing can be run on commodity hardware should be a big red flashing light for those of us who have thrown hardware at problems in the past.
We are regularly called into businesses where the batch processing jobs fill the existing ‘pipe’ to such an extent that jobs overrun, hang, or, even worse for the poor people who need to do something with the data being published, fail.
I think the core of the problem is that we have all been indoctrinated into thinking that a robust ETL process will fix everything. The issue is – a SINGLE ETL process probably won’t. We’ve convinced ourselves that a single pipe, and therefore a single ETL process, will do what is required, and will scale.
Unfortunately, single-stream processes don’t scale – or at least, not cheaply. They also leave you with a pretty clear single point of failure. Which isn’t great.
The trick, as is often the case with technology, is to think about things a little differently.
If you can view your solution as not just one single huge process, but in fact thousands of tiny processes, then you can do something quite interesting – run enterprise BI on commodity hardware.
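To make the "thousands of tiny processes" idea a bit more concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the sales rows, the chunk size, and the helper names (split_into_chunks, transform_chunk, run_many_small_jobs) are assumptions, and a thread pool on one machine stands in for what would more likely be separate containers or VMs.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(rows, chunk_size):
    """Cut one big batch into many small, independent pieces of work."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def transform_chunk(chunk):
    """Hypothetical 'tiny process': it just totals the amounts in one chunk."""
    return sum(row["amount"] for row in chunk)


def run_many_small_jobs(rows, chunk_size=100, workers=4):
    """Run every chunk as its own small job, then combine the partial results."""
    chunks = split_into_chunks(rows, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_totals = list(pool.map(transform_chunk, chunks))
    return sum(partial_totals)


# Made-up data: 1,000 sales rows worth 1 unit each.
sales = [{"amount": 1} for _ in range(1000)]
total = run_many_small_jobs(sales)
```

The point is not the arithmetic but the shape: no single chunk needs a big machine, and a failed chunk can be retried on its own without rerunning the whole batch.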
Don’t get me wrong – sometimes, you’ll need a 256 GB RAM server with more compute power than Amazon. However, this could well be for a few minutes a day. What do you do with this server when it’s bored?
Wouldn’t it be nice if you could spin up just enough compute power to get you by on demand, and then spin it down when you no longer need it? What if, rather than a single mid-sized pipe, you have 40 pipes which expand and contract depending on the work you need them to do?
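As a rough sketch of what "expand and contract" could mean in practice, the Python below decides how many pipes to run from the size of the backlog. The function name pipes_needed and the figure of 25 jobs per pipe are assumptions for illustration; the cap of 40 simply echoes the number above. How you actually start and stop the workers depends entirely on your platform and is left out.

```python
import math


def pipes_needed(queued_jobs, jobs_per_pipe=25, min_pipes=1, max_pipes=40):
    """Return how many workers ('pipes') to run for the current backlog.

    Assumes each pipe can comfortably handle `jobs_per_pipe` queued jobs.
    The result is clamped between `min_pipes` and `max_pipes`.
    """
    wanted = math.ceil(queued_jobs / jobs_per_pipe)
    return max(min_pipes, min(max_pipes, wanted))


quiet = pipes_needed(0)        # nothing queued: keep the minimum running
normal = pipes_needed(100)     # modest backlog: a handful of pipes
overnight = pipes_needed(5000) # big batch window: capped at the maximum
```

With these made-up numbers, an empty queue keeps 1 pipe alive, 100 queued jobs ask for 4, and 5,000 queued jobs hit the ceiling of 40.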
I know many people don’t have the luxury of using elastic compute. However, many people have a virtualised environment where they could maintain a set of different VMs for specific purposes… or, even better, run in a Docker (https://www.docker.com/) environment.
As with all our solutions – we realise that code is less than half the challenge. Elegant orchestration, simplicity of design and a willingness to use metadata for more than just documentation result in a genuinely great solution.
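One way to picture metadata doing more than documentation is to let it drive the run order. The sketch below is only an illustration under assumptions: the job names and the depends_on field are invented, and it uses the standard library's graphlib module (Python 3.9 or later) to work out a valid order from the dependencies rather than hard-coding one.

```python
from graphlib import TopologicalSorter

# Hypothetical job metadata: each job lists the jobs it must wait for.
JOBS = {
    "load_customers": {"depends_on": []},
    "load_orders": {"depends_on": []},
    "build_sales_mart": {"depends_on": ["load_customers", "load_orders"]},
    "publish_report": {"depends_on": ["build_sales_mart"]},
}


def execution_order(jobs):
    """Derive a run order from the metadata so dependencies always run first."""
    graph = {name: set(meta["depends_on"]) for name, meta in jobs.items()}
    return list(TopologicalSorter(graph).static_order())


order = execution_order(JOBS)
```

Adding a new job then means adding a line of metadata, not rewriting the orchestration, and jobs with no dependency between them (the two loads here) are candidates for running side by side on separate pipes.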
I’d strongly recommend buying some more cheap pipes before you go and blow your annual budget on a new huge one…