Load Testing from the Cloud

Over the last few weeks I’ve been tinkering with Amazon’s EC2 service, setting up a Load Tester load engine AMI and running tests from cloud instances.  Our conclusion is that cloud engines can be useful, but a few pitfalls can cause subtle problems or even invalidate your test.

Getting Started

You’ll need to follow the Amazon instructions for getting set up on the service, as laid out in the Getting Started Guide.  This is a somewhat non-trivial process that involves generating an X.509 certificate, and an SSH keypair for accessing your instances.

Creating an AMI is relatively straightforward.  The documentation is pretty good but sometimes difficult to follow, because there are multiple services involved and each has a different authentication scheme.  In particular, it is not always clear where a command should be run.  Some commands must be run on the target image – an existing AMI or the image you are trying to capture – while others can be run from another machine, presumably your workstation with the AMI tools installed.

Once you have an AMI, starting new instances is easy – it can be done from the AWS web portal, the command-line tools, or the API if you are so inclined.  There are ways to retrieve both the public IP address and the private IP address from inside the instance itself, so setting up the load engine is pretty painless and could be easily automated.
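As a rough illustration of the address lookup, here is a minimal Python sketch that asks the EC2 instance metadata endpoint for both addresses.  The function names and the injectable `opener` parameter are our own inventions for the example, not part of Load Tester, and newer instance configurations may require a session token before these requests succeed.

```python
from urllib.request import urlopen

# Link-local instance metadata endpoint, reachable only from inside an instance.
METADATA_BASE = "http://169.254.169.254/latest/meta-data/"


def fetch_metadata(key, opener=urlopen, timeout=2):
    """Return one metadata value (for example 'public-ipv4') as a stripped string."""
    with opener(METADATA_BASE + key, timeout=timeout) as response:
        return response.read().decode("ascii").strip()


def engine_addresses(opener=urlopen):
    """Collect the two addresses a load engine needs when it starts up."""
    return {
        "public": fetch_metadata("public-ipv4", opener),
        "private": fetch_metadata("local-ipv4", opener),
    }
```

Run from a startup script on the instance, `engine_addresses()` gives you what you need to write into the load engine's configuration; the `opener` argument exists only so the lookup can be exercised off-instance.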

Load Testing from the Cloud

This is where it gets interesting.  Cloud machines are, of course, virtual.  This, combined with the nature of cloud services, introduces a number of problems that need to be considered before testing.

Processor Usage

An Amazon Small instance only has one virtual CPU.  During testing, we discovered that it is not unusual for this virtual CPU to be pegged at 100% utilization – and for most of that utilization to fall in the “Steal” category reported by top, which indicates that another VM is heavily using the shared physical CPU.
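If you would rather log steal time than watch top, the same number can be derived from two samples of the aggregate `cpu` line in `/proc/stat` on a Linux load engine.  This is a sketch, not something Load Tester does for you: the function names are ours, and it assumes a kernel that reports at least the first eight counters (user through steal).

```python
def read_cpu_line(path="/proc/stat"):
    """Read the aggregate 'cpu' line, which is the first line of /proc/stat."""
    with open(path) as stat:
        return stat.readline()


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line into a list of integer tick counts."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def steal_percent(before, after):
    """Percentage of CPU time taken by the hypervisor between two samples."""
    old, new = parse_cpu_line(before), parse_cpu_line(after)
    # Counter order: user, nice, system, idle, iowait, irq, softirq, steal.
    # The guest counters that follow are skipped because they are already
    # included in user and nice.
    deltas = [n - o for n, o in zip(new[:8], old[:8])]
    total = sum(deltas)
    if total == 0:
        return 0.0
    return 100.0 * deltas[7] / total
```

Sampling `read_cpu_line()` a few seconds apart before and during a test, and passing the two lines to `steal_percent`, gives a simple record of how much the engine was fighting its neighbors.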

This is a problem for the same reason that running load engines on in-use workstations is a problem:  if the load engine does not have enough processor cycles to handle things in a timely manner, the results returned by the load engines may not be accurate.  High CPU usage can cause timing delays in page durations, skew the distribution of users as the load is ramped up, and slow down page processing (particularly extractors using regular expressions).  In our case, since we run thousands of users on a single load engine, this problem is highly magnified – processing delays can have a substantial impact on overall results.

This problem can be mitigated by using a High-CPU instance, such as Amazon’s High-CPU Medium or an equivalent.  However, it can’t be fully eliminated, because in general your VM is going to be sharing physical hardware with other VMs that are not under your control.  Further, there really is no way to prevent this – even checking the VM before the test would not guarantee that it would not be competing for resources during the test, because the highly variable and temporary nature of cloud computing means that another VM could interfere at any time.  High-CPU instances are more expensive, but for our typical use case that is not a major issue – most load tests are less than 12 hours in duration, and instances can be shut down between tests with ease.

Bandwidth

If your test is bandwidth intensive (downloading a video, for example, or if your pages are heavy with images and JavaScript), cloud load engines may not function the way you expect.  Amazon and other cloud-computing services either do not guarantee a level of bandwidth at all, or guarantee a certain amount of data transfer rather than a sustained rate.  This means that for any given load engine, the available bandwidth is not predictable.  Further, the bandwidth available may change during a test, again because of the variable and temporary nature of the cloud.

While some throughput variability on the internet is to be expected, this presents a problem above and beyond the normal issues.  Page durations for one load engine may not match the page durations on another load engine, because the first load engine is sharing physical hardware with a heavy bandwidth user – for example, a BitTorrent server.  It may be possible to work around this by testing each load engine as it comes up and, if it doesn’t meet the requirements (say, 50 Mbps of available bandwidth), terminating the instance and trying again.  Alternatively, we could assume an average bandwidth availability and simply launch more instances than we need, trusting in over-provisioning to deal with any instances that are short on bandwidth.
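Both workarounds come down to simple arithmetic, sketched below in Python.  The helper names, the engine labels, and the idea of timing a fixed-size download from each engine are assumptions made for the example; the 50 Mbps default is just the figure used above.

```python
import math


def throughput_mbps(bytes_transferred, seconds):
    """Convert a timed transfer into megabits per second."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return bytes_transferred * 8 / seconds / 1_000_000


def partition_engines(measurements, minimum_mbps=50.0):
    """Split engines into (keep, terminate) lists by measured throughput.

    measurements maps an engine name to (bytes_transferred, seconds).
    """
    keep, terminate = [], []
    for name, (size, seconds) in sorted(measurements.items()):
        fast_enough = throughput_mbps(size, seconds) >= minimum_mbps
        (keep if fast_enough else terminate).append(name)
    return keep, terminate


def instances_to_launch(needed, expected_pass_rate):
    """Over-provision: how many to start so that roughly `needed` pass the check."""
    if not 0 < expected_pass_rate <= 1:
        raise ValueError("expected_pass_rate must be in (0, 1]")
    return math.ceil(needed / expected_pass_rate)
```

For instance, if you need 10 engines and expect about 80% of fresh instances to meet the bandwidth requirement, `instances_to_launch(10, 0.8)` suggests starting 13.  A single measurement at startup says nothing about bandwidth later in the test, so treat this as a coarse filter only.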

This problem can also be partially mitigated by scheduling large-scale tests at times when internet bandwidth usage is low.  For the United States, this is typically from 3AM to 7AM Eastern time for the internet as a whole.  Unfortunately, there does not appear to be a way to get this kind of information for EC2 (or any other cloud service that I’m aware of), so it may not be possible to schedule things in a way that is useful.

Load Engine Independence

One of the things we do to improve our services is to scatter our load engines across multiple providers and data centers, spreading them out over the whole country.  This keeps problems at one provider, or with one upstream provider, from interfering with or halting a test.  With cloud machines, this independence is harder to achieve, though still mostly possible.  This may or may not be important for your test, depending on your test profile, but in general it is a good idea.

One way to improve load engine independence in the cloud is to deliberately start your load engine instances in different availability zones on Amazon, or on different providers.  This establishes physical independence, but has limits – Amazon only has three availability zones in the United States.
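A simple way to spread instances evenly is to hand out zones round-robin when you decide where each engine will launch.  The sketch below only plans the assignment; the function name is ours, the zone names in the usage note are illustrative, and the actual launch would still go through whichever EC2 tool or API you use.

```python
from itertools import cycle


def assign_zones(engine_count, zones):
    """Assign engines to zones round-robin so no zone gets more than its share."""
    if not zones:
        raise ValueError("at least one zone is required")
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(engine_count)]
```

With three zones and five engines, `assign_zones(5, ["us-east-1a", "us-east-1b", "us-east-1c"])` places two engines in each of the first two zones and one in the third.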

The next step is to ensure that your VMs are not too close together – for example, behind the same router.  This is harder, because logical networks do not always match physical networks – the same physical router could be handling multiple subnets.  However, it is in the cloud providers’ interest to distribute load.  Amazon tends to use /24 and /23 subnets while scattering instances all over the place, which usually translates to a good distribution for the load engines.  For example, I launched two sample instances, and ended up with these internal IP settings:

10.252.129.0/24 default via 10.252.129.1
10.252.114.0/23 default via 10.252.114.1

Of course, we have no good way of knowing whether these two subnets are served by the same hardware, but it’s probably a pretty safe bet that they aren’t.  As you increase the number of load engines and keep their subnets independent, the probability of them all being on the same hardware or behind the same router diminishes.
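Checking that subnets are distinct is easy to automate once you have collected the internal network of each engine.  Here is a minimal sketch using Python's standard `ipaddress` module; the function name is ours, and, as noted above, distinct subnets are only a hint about physical separation, not proof of it.

```python
import ipaddress
from itertools import combinations


def overlapping_subnets(cidrs):
    """Return the pairs of engine subnets that share address space."""
    networks = [ipaddress.ip_network(cidr) for cidr in cidrs]
    return [
        (str(a), str(b))
        for a, b in combinations(networks, 2)
        if a.overlaps(b)
    ]
```

For the two sample instances above, `overlapping_subnets(["10.252.129.0/24", "10.252.114.0/23"])` returns an empty list, meaning the engines sit on separate logical networks.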

Security

Cloud load engines have the same security requirements as other internet-facing servers.  A load engine is a lightly used machine – the load tests are run, and then it is typically left idle or shut down until the next load testing cycle.  Load engines don’t provide a continuous service like a webserver, and you can just fix them before you run a test if there is a problem; thus there is little incentive to pay attention to them when load tests are not running.

When a load engine is in a public cloud, this natural lack of attention and monitoring becomes a problem.  Fortunately, there is a direct monetary incentive to avoid idling cloud instances, which helps mitigate some issues and will avoid leaving a compromised server up for very long.  However, it’s also easy to just re-use the same image over and over again without updating, and automated attacks can scan thousands of machines per second for vulnerabilities.  This leads to a situation where it is relatively easy to have a vulnerable, unpatched load engine exposed to the internet long enough to get compromised – which will likely have the effect of invalidating your test, if nothing else.  If you’re going to use cloud-based load engines, image updates should be integrated into the same update cycle as your other internet-facing servers, and should follow the same security rules – SSH keys, tight firewall rulesets, no unnecessary services, etc.

Recommendations

If you’re going to use cloud instances as load engines with Load Tester, these are our recommendations:

  1. Use high-CPU instances to avoid physical CPU contention as much as possible
  2. Use more load engines to counteract bandwidth limitations
  3. Test your load engines before the test if you can, to spot any obvious problem engines
  4. Spread the load engines across availability zones and providers to increase independence
  5. Treat cloud-based load test results with some skepticism, and repeat the tests to increase accuracy
  6. Utilize low-bandwidth-usage time periods for cloud testing when you can
  7. Keep your cloud images on the same upgrade cycle as your other internet-facing servers
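For the fifth recommendation, one simple way to decide whether repeated runs agree is to compare the spread of a key metric (average page duration, say) across runs.  The sketch below uses the coefficient of variation; the function names and the 10% tolerance are arbitrary assumptions for illustration, not a Load Tester feature or a recommended threshold.

```python
from statistics import mean, stdev


def run_variation(durations):
    """Coefficient of variation (stdev / mean) of one metric across repeated runs."""
    if len(durations) < 2:
        raise ValueError("need at least two runs to compare")
    return stdev(durations) / mean(durations)


def runs_agree(durations, tolerance=0.10):
    """True when the repeated runs vary by no more than the given tolerance."""
    return run_variation(durations) <= tolerance
```

If `runs_agree` comes back false for runs that should have been identical, that is a cue to look at steal time and bandwidth on the individual engines before trusting any one result.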



Copyright © 2010 Web Performance, Inc.
