Building Cloud Systems at Scale
Firstly, it’s worthwhile discussing what this blog is, and what it isn’t.
It isn’t a one-stop-shop detailing exactly how to build a perfect cloud system at scale because that doesn’t exist (sorry but it’s true!)… anything that tells you otherwise has probably missed something.
This blog is, however, built on experience, including a few failures and, ultimately, many great successes. It will cover our 5 pillars of success or, more simply, what we consider when designing and building any system.
We’ll be discussing the three main cloud providers throughout (Amazon Web Services – AWS, Google Cloud Platform – GCP, and Microsoft Azure), as they are the more commonly used platforms, but these principles are applicable whichever cloud platform you use.
A lot of hard work and testing goes into building any system and ensuring that it works at scale and is, amongst other things, sustainable and secure. That hard work is built on fundamental principles. By following these principles, we recently produced a system capable of thousands of sub-second responses a second over a sustained period.
In fact, we recorded so many transactions per second that, at its peak, the system was busier than Barclaycard’s current record Black Friday figures and the figures that Visa supply added together, and then some…
Official Barclaycard figures, showing record 2019 figures 1,184:
Comparative figures from the same time:
So, what exactly are these pillars of success and how do we achieve them? At The Data Shed, we refer to a 5S’s model:
This is job zero and always should be. No matter how good a system is in any other respect – if it isn’t secure, it isn’t suitable.
Make friends with a good and trusted Penetration Tester, as it is always important to continually test and review your own security. While there are many good tools out there for this, you can’t mark your own homework where security is concerned. An approved Penetration Tester who will work with you to resolve any issues they identify is crucial, and can teach you so much about how to make things better and evolve your system.
Also make sure you understand the standards and look at the accreditations of well-known security bodies, such as ISO and Cyber Essentials, as well as the standards published by the Centre for Internet Security and the National Cyber Security Centre. The Data Shed is Cyber Essentials Plus and ISO 27001 accredited, which has helped us to raise standards across the board. Again, understand the tools available, such as AWS Trusted Advisor, GCP Cloud Platform Security and Azure Advisor, as they will give you a clear idea of what to focus on.
There are plenty of great resources out there, such as the Centre for Internet Security and the National Cyber Security Centre, which produce benchmarks and guidance*.
Stability is not about staying still; inertia in systems can be a killer. Keeping up with technology, cyber-security and general good practice is vital. Stability is achieved through continual, well-managed change in your system. Controlled change includes moving releases to a blue/green model and getting as close as possible to no downtime.
Learn your system and its services, and make sure that they are truly understood. This means knowing not just what they do but also how they do it, which in turn enables you to understand how you can make change. Look for and understand technologies that can help you with this, for example: AWS Auto Scaling Groups, GCP Instance Groups and Azure Autoscale.
Patching is difficult, time-consuming and takes a good deal of planning, so it is easier to burn and recreate instances. It is therefore important to release frequently and to kill your instances as often as you can to keep them up to date. This is a good practice which, when done in the right way, goes a long way towards a stable system.
However, it’s important to say here that you should not rely on these tools. They are there to help and to be used when the time is right, but they will not do the job for you: they cannot react automatically in time to unexpected high demand or rapid spikes in traffic. You have to plan for it.
It is down to you to prepare, plan and run your system with some headroom for normal growth, making sure you do the work for times of predicted high load. Understand your system, work out when these times may come, then monitor and benchmark how your system works, planning from there.
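As a rough illustration of planning with headroom, the sketch below works out how many instances a predicted peak would need if each instance is only ever loaded to a fraction of what it handled in benchmarking. The function name, the 20% default and the example figures are assumptions made for illustration, not numbers from our system.

```python
import math


def instances_needed(peak_rps: float, per_instance_rps: float, headroom: float = 0.2) -> int:
    """Instances required to serve peak_rps while keeping `headroom` spare on each one.

    per_instance_rps is the throughput one instance sustained in your own benchmarks.
    headroom is the fraction of that capacity you want to leave unused (0 <= headroom < 1).
    """
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")
    usable_rps = per_instance_rps * (1 - headroom)
    return math.ceil(peak_rps / usable_rps)


# Hypothetical figures: a predicted peak of 1,000 requests/second,
# with each instance benchmarked at 100 requests/second.
print(instances_needed(1000, 100, headroom=0.2))
```

The per-instance figure should come from your own monitoring and benchmarking, which is why that work has to happen first.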
Load testing is key here. At The Data Shed, we use tools such as Gatling, which can be spun up easily in a different cloud provider to the one the system is in. This gives a more realistic pattern of traffic coming from external sources. The excellent James Smith wrote a blog about this, so I won’t reinvent the wheel here, and instead just advise you to read it.
While load testing is important, make sure it is done in a few different ways and go above and beyond what you expect the load might be. For some context, the way we ran this was to push the service both in a spiky pattern but also under sustained load. We put through 112,445,735 transactions over a two-hour period with peaks and troughs which produced 23 errors in total – this was a great success as the errors all produced a correct error response, which is also crucial.
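To put those numbers in perspective, here is the arithmetic as a short sketch. It only derives the average rate and the error rate from the totals quoted above; because the traffic was deliberately spiky, the peaks will have been higher than this average.

```python
# Totals quoted above for the two-hour load test.
total_transactions = 112_445_735
duration_seconds = 2 * 60 * 60
errors = 23

# Average throughput across the whole run (peaks were higher, troughs lower).
average_tps = total_transactions / duration_seconds

# Fraction of transactions that returned an error response.
error_rate = errors / total_transactions

print(f"average throughput: {average_tps:,.0f} transactions/second")
print(f"error rate: {error_rate:.7%}")
```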
As you would expect with any test, you must understand what is happening, and good insight is vital here. Make sure you can log your test, search those logs, and determine the accurate number of requests hitting your service at any one time, right down to the second.
The final thing here is that some services don’t allow pre-scaling from the portal, and this is where you need to work with your cloud provider. Contact them to pre-scale the service for a set period and to pre-warm it. Don’t try to go from zero to one hundred miles an hour in a split second, as your system won’t cope and you will lose requests.
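One way to get per-second counts is to bucket request timestamps from your logs by second. The sketch below assumes each log entry carries an ISO 8601 timestamp; the function name and sample timestamps are made up for illustration, and in practice your logging platform may do this for you.

```python
from collections import Counter
from datetime import datetime


def requests_per_second(timestamps):
    """Count requests in each one-second bucket, given ISO 8601 timestamp strings."""
    counts = Counter()
    for ts in timestamps:
        # Drop the sub-second part so every request in the same second shares a key.
        second = datetime.fromisoformat(ts).replace(microsecond=0)
        counts[second] += 1
    return counts


# Hypothetical log timestamps.
sample = [
    "2020-11-27T09:00:00.120",
    "2020-11-27T09:00:00.870",
    "2020-11-27T09:00:01.005",
]

counts = requests_per_second(sample)
busiest_second, peak = counts.most_common(1)[0]
print(f"busiest second: {busiest_second} with {peak} requests")
```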
As ever, good testing, planning and communication with your provider will make this process far easier and pain-free for everyone involved.
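The same "don’t go from zero to one hundred" idea applies to your own load tests. As a minimal sketch, a stepped ramp towards a target rate could be generated like this; the function name and the figures are assumptions for illustration, and tools such as Gatling have their own ways of expressing ramps.

```python
def ramp_schedule(target_rps: int, steps: int) -> list[int]:
    """Return evenly spaced request rates that climb to target_rps in `steps` stages."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    # Integer arithmetic keeps each stage a whole number and ends exactly on the target.
    return [target_rps * i // steps for i in range(1, steps + 1)]


# Hypothetical target of 1,000 requests/second reached in four stages.
print(ramp_schedule(1000, 4))
```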
This requires a similar approach to scaling, in that you can only decide the instance sizes when you know what they can handle and how they will react. You must first be able to monitor and understand baseline behaviour before you push things to destruction.
It is also important to know what counts as timely when building your system. Sometimes one second is more than quick enough, but at other times it’s a lifetime. Load testing and real-user testing will help to inform this, and how to solve any issues you may encounter with response times.
For example, sometimes you might need to scale horizontally and have more instances handling things, but on other occasions vertical scaling, with more powerful instances, will be the answer. This is another place where it is important to work closely with your account manager and architects at the cloud provider, who can help and advise.
Test to the limits and beyond. It is important to understand what is merely a problem and what is potentially a portent of something more serious, such as a total failure, and you must know what each looks like.
Make sure you monitor, control, and understand your system and estate, breaking down the time it takes to process a request through each tool or component, as this is something you can actively work to improve.
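As a minimal sketch of that breakdown, a small timing helper can record how long each stage of a request takes. The stage names and the `handle_request` function here are hypothetical stand-ins; in a real system you would more likely lean on your provider’s tracing and monitoring tools.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Elapsed seconds recorded for every stage, keyed by stage name.
timings = defaultdict(list)


@contextmanager
def timed(stage):
    """Record how long the wrapped block takes under the given stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage].append(time.perf_counter() - start)


def handle_request(payload):
    """Hypothetical two-stage request handler, used only to show the timing pattern."""
    with timed("validate"):
        cleaned = payload.strip()
    with timed("process"):
        result = cleaned.upper()
    return result


for _ in range(100):
    handle_request("  hello  ")

for stage, samples in timings.items():
    average_ms = 1000 * sum(samples) / len(samples)
    print(f"{stage}: {len(samples)} samples, average {average_ms:.4f} ms")
```

Once each stage has its own numbers, it becomes much easier to see which component is worth improving first.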
Finally, make sure you are looking after the parts you can control and don’t give out any guarantees or SLAs based on services or infrastructure you don’t. The internet is an unpredictable place at times and much like any journey, just because the speed limit suggests you can make it in a certain time there is no guarantee there won’t be slow-downs, breakdowns or congestion on the way.
Now, I admit this may be better called cost but that didn’t fit with the 5 S’s model and sustainability only came to mind later – so let’s stick with it, as it still tells the story we need.
Much like security, no matter how good the system is, if it’s unaffordable, then all of this effort is for nothing. My suggestion here is to size things appropriately, run with as little headroom as you can and scale for short periods when you need to. Make sure you’re in a position to react quickly, which means you can scale up, incur costs for as short a time as possible, and scale down again just as quickly.
For example, we have found that a few days of extra compute instances in the cloud can cost around $500, and they can be scaled down as soon as you no longer need them. The same instances would cost tens of thousands in physical infrastructure, which you then have forever.
Ensure that you have a pragmatic approach to your architecture. There will always be an ideal-world textbook approach, but perfect is often very expensive and may be overkill, as well as lacking the things you need. Design, test, build and monitor the system you need, always trading costs between services to find the most cost-effective solution.
Also remember using services managed by your cloud provider may look more expensive initially, but you remove so many operational and management costs as well as buying, hosting, and securing the actual infrastructure.
I wish this blog could have outlined the perfect way to deliver a system to the cloud that works 365 days of the year, is totally secure and does everything you need – but as I said earlier, a lot of this comes down to hard work and planning, and there is always more to learn!
However, I still want to leave you with a few key takeaways. The 7 steps to follow below, and the 5S’s model above, provide some food for thought, and hopefully some guidance for when you are planning your next system or migration.