Detecting Bad Apple Instances and Other Performance Issues in AWS

Amazon Web Services (AWS) is an awesome and quick way to get access to a lot of computational power; it is known for providing near-instant access to large amounts of computational resources. Between the API and the web interface, you can quickly scale your application with demand. As with any technology service though, it’s hardware based, and hardware does fail. I have personally seen that it is possible for AWS resources to become almost completely unreliable. These bad apple instances exist across the entire AWS infrastructure.

In looking at performance issues within AWS, most of them appear to originate from I/O performance. You can start a bunch of instances and receive widely differing performance levels. This can be caused by a failing disk in the infrastructure or, more likely, a least-optimal routing path for storage. The graph below shows a recent example of performance severely degrading read speeds on an instance, causing my system load to float around 2.5 on the midterm. After stopping and restarting the instance, AWS automatically rerouted the storage path and my load dropped to around .15. This is a huge increase in performance just from stopping and restarting the instance. This is why it is super critical to monitor and to check for these performance issues.

We were able to catch this performance issue with the aid of a couple of tools. Icinga, along with our custom check_sshperf script (available at https://github.com/AppliedTrust/nagios-plugins/blob/master/check_sshperf), allowed us to have agentless threshold checks for system load, network performance, memory usage, and much more. The tools collectd and Graphite can be used to collect and graph information about the system, with the ability to span a wide range in values. Icinga initially alerted us to a high load on the system. Then, using Graphite, we were able to see the impact of load over time and show that performance had been severely impacted.

These tools can help not just with monitoring hardware issues, but also with software misconfigurations that could be impacting memory, storage, or compute resources. Below are just a few of the things we can monitor with Graphite.

Icinga can also be more deeply configured to check everything that makes your application run. So take the time to set up some in-depth monitoring across your application. It will allow you to detect performance problems before the end-user notices.