We’re continually optimizing our software to get the best out of various cloud infrastructures, and after the latest updates we ran some large-scale benchmarks on Amazon EC2 and S3, with strong initial results. In one test we took a complex data mart query running on a single instance and scaled it across 16 very large instances (effectively 8 cores each), reducing the runtime from 15 minutes to just over a minute. We aim to be at least 80% efficient across cores and at least 70% efficient across instances, so we’re delighted with these results, especially as we know there are further improvements to come. This level of efficiency applies to mixed workloads as well as to the larger data mart queries, but longer-running queries are easier to benchmark.
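To make the efficiency figures concrete, here is a minimal sketch of the arithmetic, assuming efficiency across instances means the achieved speedup divided by the ideal linear speedup. The function names are ours, for illustration only. Because "just over a minute" is not an exact measurement, the sketch only brackets the result: it computes the ceiling at exactly 60 seconds and the longest runtime that would still meet the 70% target.

```python
def scaling_efficiency(baseline_s, scaled_s, instances):
    """Achieved speedup divided by the ideal (linear) speedup."""
    speedup = baseline_s / scaled_s
    return speedup / instances


def max_runtime_for_target(baseline_s, instances, target):
    """Longest scaled runtime (seconds) that still meets a target efficiency."""
    return baseline_s / (instances * target)


BASELINE_S = 15 * 60  # single-instance runtime reported in the test
INSTANCES = 16        # instances used in the scaled run

# Ceiling: a scaled runtime of exactly one minute would be 15/16 = 0.9375.
ceiling = scaling_efficiency(BASELINE_S, 60, INSTANCES)

# Threshold: any scaled runtime below this still clears the 70% target.
threshold_s = max_runtime_for_target(BASELINE_S, INSTANCES, 0.70)

print(f"efficiency at exactly 60 s: {ceiling:.4f}")
print(f"runtime needed for 70% efficiency: {threshold_s:.1f} s")
```

Under that definition, a runtime of just over a minute sits comfortably inside the 70% target, which is only missed once the scaled run takes longer than roughly 80 seconds.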
Our massive compression of the underlying data set is what allows us to scale on commodity hardware and storage: we compress between 20X and 60X, depending on the profile of the data. It’s also worth noting that in this test we were querying encrypted data, as security should always take precedence in a public cloud environment; without encryption, performance was 10% better. In addition, accessing the data directly from each EC2 instance’s local disk is 18% faster, but the simplicity and scalability of S3 win out.
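As a rough illustration of what the 20X to 60X range means for storage, the sketch below works out the compressed footprint of a hypothetical 1,000 GB raw data set. The raw size is an assumption chosen for round numbers, not a figure from our benchmark, and the function name is ours.

```python
def compressed_size_gb(raw_gb, ratio):
    """Stored size in GB after compressing raw_gb by the given ratio."""
    return raw_gb / ratio


RAW_GB = 1000  # hypothetical raw data set size, for illustration only

# The two ends of the compression range quoted above.
for ratio in (20, 60):
    size = compressed_size_gb(RAW_GB, ratio)
    print(f"{ratio}X compression: {size:.1f} GB stored")
```

At either end of the range, the data set shrinks to tens of gigabytes, which is what makes commodity storage practical.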
On a related note, it’s worth highlighting how well suited Amazon EC2 and S3 are to large scalability tests: this extensive benchmarking exercise cost only a few hundred dollars!


