We often get asked how we deal with SLAs, so our operations team thought a blog post exposing some of our processes would be a good place to start!
The report below shows automated tests exercising the full range of options in the SME Web File Manager.
We also deal with many Cloud Storage endpoints, and to understand the reliability of these endpoints we run automated tests against them every 10 minutes. The test reports are sent to operational administrators, and if tests fail we send an email to PagerDuty, a SaaS service that provides IT on-call schedule management, alerting and incident tracking.
The report below shows one of these tests, in this case adding Cloud Providers from different storage services and syncing their meta-data.
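As a rough illustration of the pattern described above (run a set of endpoint tests on a schedule, report the results, and raise an alert only when something fails), here is a minimal Python sketch. It is not our actual test harness: the names `CheckResult`, `run_checks` and `build_alert` are hypothetical, and the checks themselves are placeholders for whatever a real endpoint test would do.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class CheckResult:
    name: str
    ok: bool
    detail: str = ""


def run_checks(checks: Dict[str, Callable[[], None]]) -> List[CheckResult]:
    """Run each check; a check signals failure by raising an exception."""
    results = []
    for name, check in checks.items():
        try:
            check()
            results.append(CheckResult(name, True))
        except Exception as exc:  # record the failure and keep going
            results.append(CheckResult(name, False, str(exc)))
    return results


def build_alert(results: List[CheckResult]) -> Optional[str]:
    """Return an alert body if any check failed, otherwise None."""
    failed = [r for r in results if not r.ok]
    if not failed:
        return None
    lines = [f"{len(failed)} of {len(results)} endpoint checks failed:"]
    lines += [f"- {r.name}: {r.detail}" for r in failed]
    return "\n".join(lines)
```

In a setup like this, a scheduler such as cron would call `run_checks` every 10 minutes, and a non-empty result from `build_alert` would be emailed to the alerting service.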
Pingdom is used to assess site availability and timing. Server Density is used to monitor the server cluster metrics from our US sites (Atlanta and Phoenix) and our European sites (Amsterdam and United Kingdom).
An example of metrics from two of our servers in Server Density is shown below:
The key thing about the server metrics is that alerts can be set against them. Depending on the level of escalation required, these can be configured either to be sent by Server Density over SMS or to be passed from Server Density to PagerDuty, which calls an admin if the alert is urgent.
Server Density is also used to give us a feel for quality of service in terms of network bandwidth. We have separate $20 Linode instances configured to upload and download large and small files to and from our servers, using both the API and our protocol adaptors. This lets us assess the quality of the connection over the various interfaces, and again alerts are set so that, if our own SLA for this is breached, we can investigate further with our hosting provider.
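To make the two escalation levels concrete, here is a small Python sketch of threshold-based routing. It is only an illustration: the function name, the channel labels and the threshold values are all hypothetical, and this is not how alerts are actually configured inside Server Density.

```python
from typing import Optional


def route_alert(value: float, warn_at: float, urgent_at: float) -> Optional[str]:
    """Pick an alert channel for a metric reading (higher is worse).

    Returns "pagerduty" for urgent readings, "sms" for warnings,
    and None when the reading is within normal limits.
    """
    if value >= urgent_at:
        return "pagerduty"  # urgent: escalate so an admin gets called
    if value >= warn_at:
        return "sms"  # lower level: a text message is enough
    return None


# Example with made-up CPU load thresholds
print(route_alert(value=0.95, warn_at=0.80, urgent_at=0.90))
```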
Below is a graphical look at what this Quality of Connection Service looks like in Server Density.
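The idea behind these connection probes can be sketched in a few lines of Python: time a transfer, convert it to a throughput figure, and compare that against a minimum. This is an assumption-laden sketch rather than our probe code; `measure_throughput_mbps`, `breaches_sla` and the threshold are hypothetical, and the `transfer` callable stands in for a real upload or download over the API or a protocol adaptor.

```python
import time
from typing import Callable


def measure_throughput_mbps(
    transfer: Callable[[], int],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Time one transfer and return its throughput in Mbit/s.

    `transfer` performs the upload or download and returns the
    number of bytes moved.
    """
    start = clock()
    size_bytes = transfer()
    elapsed = clock() - start
    if elapsed <= 0:
        return float("inf")  # too fast to measure with this clock
    return size_bytes * 8 / elapsed / 1_000_000


def breaches_sla(throughput_mbps: float, min_mbps: float) -> bool:
    """True when the measured throughput is below the agreed minimum."""
    return throughput_mbps < min_mbps
```

Running the same measurement with both a large and a small file helps separate bandwidth problems from per-request latency problems, since small files are dominated by connection overhead.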
Pingdom enables us to monitor response times (ping latency) from different locations and also to monitor our uptime. Again, outage alerts are sent via SMS to PagerDuty, which calls the on-duty SME engineer.
Below is an uptime report for our SME EU server cluster. Of the 11 reported outages, 10 were planned, and the majority of the downtime was planned work done out of hours to upgrade our core EU server infrastructure. If we take this into account, then for the last year SME EU has been running at 99.996% uptime.
We also use Chef to automate building our Cloud Appliance, whether on IaaS infrastructure, for internal use, or for customers who want a bare metal install rather than having the SME Appliance delivered as software running on a hypervisor.
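For a sense of scale, an uptime percentage converts to a downtime budget with simple arithmetic. The Python sketch below assumes a 365-day year and uses hypothetical helper names; on that assumption, 99.996% corresponds to roughly 21 minutes of downtime over the year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year


def uptime_percent(downtime_minutes: float,
                   period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Uptime as a percentage of the period."""
    return 100.0 * (1 - downtime_minutes / period_minutes)


def allowed_downtime_minutes(uptime_pct: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime, in minutes, implied by a given uptime percentage."""
    return period_minutes * (1 - uptime_pct / 100.0)


# Downtime implied by 99.996% uptime over one year
print(round(allowed_downtime_minutes(99.996), 1))
```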
We hope this has been a useful overview and welcome any feedback.