Tuesday, September 4, 2012

Deep Dive: CPU Ready Time

In my last post we tackled the topic "VMware ESX/ESXi CPU Scheduler" with extreme prejudice, and this week will be an extension of that post. Let's talk about CPU Ready Time, how 'bout it? (If you said "No", Click Here...) To put it simply: CPU Ready Time is your enemy. To give a more formal definition: CPU Ready Time is time a Virtual Machine spends in a state where it is ready for processor time but cannot be scheduled onto a physical processor; the vCPU is "ready" to run and is waiting on the scheduler. I mentioned in my last post that CPU Ready Time can be a performance killer and that over-allocation of vCPUs is most likely at the heart of the problem. To throw some data and thoughts at you really quick, I love me a good bulleted list and I hope you do too:
  • When possible, keep VMs with like vCPU allocations on the same hosts or clusters. This will help keep CPU Ready Time down to a safe amount. VMware suggests 5% CPU Ready Time is within the safe limit for a VM to live with. We will discuss safe zones a little later in the post. Regardless of recommendations, everyone should establish their own safe-zone and red-zone numbers; even ours don't coincide with VMware's recommendation.
  • An ESXi host will try its best to keep a VM's requests for processor time running on the same physical cores for caching purposes. If moving the VM's processor requests to new cores would be more efficient than waiting for the scheduler to free up time on the current cores (accumulating CPU Ready Time), the ESXi host will give up the warm processor cache and move the existing and new processor requests to different cores. This move takes time and the effects can be felt in performance, but the host has determined it to be a better choice than letting the VM accumulate any more CPU Ready Time.
  • Many times CPU over-allocation occurs when P2V'ing a physical machine into your virtual environment. Although your super serious amazing awesome database server needed 12 CPUs in the physical world, monitor it over time and determine how many of those you can knock out of the picture. Maybe it could be taken down to 8 or even 6 vCPUs.
  • In many cases, I would bet that 60-75% of workloads can fit within the 1-2 vCPU range, and I would also put money on 75-90% of workloads needing no more than 4 vCPUs. As a side note: I was reading an article on Ars Technica and found some of the hypervisor host stats for Hyper-V compared against VMware pretty crazy and, honestly, unnecessary. By the time I need 320-CPU hosts (yay Hyper-V), I would hope I had been smart enough to separate that load across multiple hosts and maybe create a new cluster for high availability instead. I don't doubt there are some massive companies with an insane virtual infrastructure that could benefit from hosts of that size, but most companies and workloads don't come close. END OF SIDE RANT.
  • I recommend keeping a close eye, however you go about doing that in your environment, on CPU utilization and CPU Ready Time, and always look for opportunities to reduce vCPU assignments or group like-allocated VMs together (there's a quick sketch of one way to eyeball this right after this list).
  • If you have hosts or clusters with a huge amount of resources that are barely getting used by the VMs in your environment, you may be able to get away with not caring too much about mixing like-vCPU-assigned VMs. The effects of CPU Ready Time are simply being masked by the pile of spare physical resources. Wait for your workload to climb to a medium amount of usage and your performance will start to ache because of that hidden CPU Ready Time.
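Since I keep saying "keep an eye on it", here is a minimal Python sketch of the two checks I keep coming back to: is CPU Ready in the red zone, and is the VM sitting on more vCPUs than its usage justifies? Nothing here talks to vCenter; the VM names, vCPU counts, and numbers are made up for illustration, and you would feed it whatever your own monitoring pulls from the performance charts.

REALTIME_INTERVAL_MS = 20_000   # each "Real-time" chart sample covers 20 seconds
RED_ZONE_PCT = 2.5              # our red zone; pick your own

vms = {
    # name: (vCPUs, average CPU usage %, average CPU Ready ms per real-time sample)
    "view-connection-01": (4, 9.0, 3100),
    "sql-prod-01":        (8, 71.0, 450),
    "file-server-01":     (1, 12.0, 40),
}

for name, (vcpus, usage_pct, ready_ms) in vms.items():
    ready_pct = ready_ms / REALTIME_INTERVAL_MS * 100
    notes = []
    if ready_pct >= RED_ZONE_PCT:
        notes.append(f"CPU Ready {ready_pct:.1f}% - in the red zone")
    if vcpus > 1 and usage_pct < 25:
        notes.append(f"{vcpus} vCPUs at {usage_pct:.0f}% usage - candidate for fewer vCPUs")
    if notes:
        print(f"{name}: " + "; ".join(notes))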
With my awesome bullet-ed (spell checker wanted a hyphen, lame!) list aside, let's take a look at what in the world was going on in the "cliff hanger" picture I left at the bottom of my last post. Here it is once more:

[Screenshot: vSphere performance chart showing CPU Usage and CPU Ready for the View Connection Server]

So what are we looking at here? This is obviously a vSphere performance chart; it is showing the CPU Usage and CPU Ready time for a production VMware View Connection Server. To make a long story as short as I can, I was looking at the currently assigned memory for this machine when I noticed the VM had been assigned 4 vCPUs.

[Screenshot: the VM's settings showing 4 vCPUs assigned]

I thought to myself, "Why in the world does this VM need 4 vCPUs?" It turns out, after reading some additional View documentation, that 4 vCPUs is the default recommended setting. So anyway, just for fun I thought I would take a look at the CPU usage to see how much this baby was actually using. Here is what I initially saw once I enabled the CPU Ready chart metric:

[Screenshot: Real-time performance chart with the CPU Ready metric enabled]

This VM was getting CPU Ready times on the "Real-time" chart of 2,500 to 3,700 milliseconds consistently. It's one thing to get some bad CPU Ready time in a blip or a moment, but this was consistent and "reliable" CPU Ready time. This VM's performance was basically getting destroyed. To put these numbers into perspective, let's take a look at how to interpret these millisecond values. I recommend learning and understanding what good and bad millisecond values are for CPU Ready Time, since this is how the vSphere performance charts present the metric. Many people want % values, which are not bad, but because vSphere charting works in milliseconds, I find it more beneficial to read the millisecond value and know what it means.

If you know anything about how VMware saves statistical data this won't be news to you, but when you look at any performance chart other than the "Real-time" view (the one that refreshes every 20 seconds), the numbers you see look "higher". That is because the stats are rolled up: the ready milliseconds from all of the 20-second samples inside the longer interval get added together, so a single data point on a rolled-up chart naturally carries a bigger number.

To further describe the roll-up effect, here are some basics to work from.
  • The "Real-time" performance chart is updated every 20 seconds, so each data point covers a 20,000 ms interval. As a good baseline, 120-175 ms is acceptable for CPU Ready Time on the "Real-time" chart; 500 ms is our red zone. For those of you who want percentages, here is the equation:
    • <currentValueInMilliseconds> / <intervalInMilliseconds> * 100
    • For example, 170 / 20,000 * 100 gives us roughly 0.85% CPU Ready time.
    • Doing the math so you don't have to: our 500 ms red zone works out to 2.5%.
  • The "Day" performance chart shows the rolled-up statistics for a day in 5-minute intervals. So if our Real-time interval is 20 seconds and 500 ms is the red zone, our 5-minute interval red zone would be 7,500 ms.
    • (5 * 60 / 20) * 500 = 7,500. Or, to simplify, from here on out our red zone is 1,500 ms for every minute covered by a data point.
  • The "Week" performance chart shows the rolled-up stats for a week in 30-minute intervals, so 30 * 1,500 = 45,000 ms is the red zone for the Week chart. (There is a small sketch of this math right after this list.)
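If you would rather not do the conversion in your head, here is a small Python sketch of the math above. The 2.5% red zone is just our number, and the intervals are the vCenter defaults described in this post; swap in your own if you have changed them.

def ready_pct(ready_ms, interval_seconds):
    """CPU Ready ms from a chart data point -> percentage of that interval."""
    return ready_ms / (interval_seconds * 1000) * 100

def red_zone_ms(interval_seconds, red_zone_pct=2.5):
    """Our red-zone percentage expressed in ms for a given chart interval."""
    return interval_seconds * 1000 * red_zone_pct / 100

charts = {"Real-time": 20, "Day": 5 * 60, "Week": 30 * 60}

for name, interval in charts.items():
    print(f"{name:9} red zone: {red_zone_ms(interval):>8.0f} ms")

print(f"170 ms on the Real-time chart  = {ready_pct(170, 20):.2f}% CPU Ready")
print(f"3200 ms on the Real-time chart = {ready_pct(3200, 20):.1f}% CPU Ready")

Running it spits out the 500 / 7,500 / 45,000 ms red zones from the list above, roughly 0.85% for the 170 ms example, and 16% for the 3,200 ms neighborhood this View Connection Server was living in.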
Although I could do the math, because I am wicked smart, I will let you all do the rest of the math for the Month and Year charts and for any custom charts you make. Make sure you check the stats intervals on your vCenter(s), as the defaults are not always used. You can access the interval settings from the "vCenter Settings" button on the home page inside vCenter. Here's a screenshot for assistance:

[Screenshot: vCenter Settings showing the statistics intervals]

So let me finish the story from earlier about the View Connection Server I was monitoring. To sum things up really quickly: the "Real-time" chart screenshot I shared earlier, showing 2,500 to 3,700 ms of CPU Ready per 20-second sample, means this VM was spending roughly 12-18% of its time (call it 16%, about 3,200 / 20,000) unable to do anything; it was just sitting there waiting! When would numbers like that ever be acceptable?

INITIATE SCIENTIFIC STUDY BENCHMARKS

Advanced Scientific Study #1:
IT Manager: Why is this VM so dang slow?
IT Employee: Uhhhh, well it works 84% of the time...
Outcome: NOT ACCEPTABLE

Advanced Scientific Study #2:
PotentialCustomer: What kind of guarantee can we expect for uptime if we decide to go with your facility?
PotentialDatacenterFacility: Uhhhh, we are up about 84% of the time around here...
Outcome: NOT ACCEPTABLE

Advanced Scientific Study #3:
Wife: How much do you love me?
Husband: Uhhhh, about 84%...
Outcome: DIVORCED

To finish up here (this post is SOOOO LONG), I asked the customer for a downtime window and went in and removed 2 vCPUs. The end result of removing those 2 vCPUs is what is shown in the initial screenshot, highlighted by the "BOOM!". The CPU Ready time drops off and is almost nonexistent. One last picture: here is a screenshot showing the "Real-time" view with the CPU Ready metric active, taken about 30 minutes after I changed the vCPU assignment. Check out the dive CPU Ready time took. AWESOME.

[Screenshot: Real-time chart about 30 minutes after the vCPU change, with CPU Ready nearly flat]

I hope this post has been a helpful tool for understanding CPU Ready time more fully, and that the real-world scenario described here helps tie it all together. Be a Monsterrrrrr at work; destroy CPU Ready time.

2 comments:

  1. Great article. We are running into this currently. Quick question: if we have a machine with high CPU Ready times with 2 vCPUs, and the other machines on the same host are reduced, will this help the machine with 2 vCPUs? For instance, we have a host with 20 machines. 10 of them have 2 vCPUs and 10 have 1 vCPU. If I change all but 1 server down to 1 vCPU, we then have 19 with 1 vCPU and 1 with 2 vCPUs. CPU Ready time should reduce on all servers, correct?

    1. Jason,
      I believe that taking the route of dropping all VMs, aside from one, down to 1 vCPU may actually hurt you more than help you. Think of it like this: we know that ten 1-vCPU VMs will get scheduled onto physical cores faster than VMs with multiple vCPUs, since it's more likely that one core is free at a given moment than that several are free at once. Meanwhile, the 2-vCPU VMs get scheduled whenever 2 cores are free, and 2 cores free up together more often when there are other 2-vCPU VMs cycling on and off the same host. If you add MORE 1-vCPU VMs into the equation, you have roughly doubled the demand for single-core slots in scheduling, and your remaining 2-vCPU VM will potentially have to wait even LONGER as it "fights" its way to the front of the line for 2 slots to open. I would look at a couple of things:
      1) How is CPU usage on the 2-vCPU VMs? If their load can be handled on 1 vCPU, this is the first change I would make. Remember, usage that stays on average below or around 75% with no CPU Ready Time is effective usage.
      2) What is your hardware situation? How many hosts? How many VMs per host? You can effectively mask CPU Ready Time with enough hardware, so if your hardware is underutilized and you are still seeing CPU Ready Time, something else might be the culprit.
      3) If you have a cluster, keeping like-vCPU-allocated VMs on the same host can remedy CPU Ready Time.
      I would also recommend reading about the relaxed vs. strict co-scheduler in ESX/ESXi; your version of vSphere and ESX can affect how much these recommendations help you. If you want to poke at the "slots" idea yourself, there is a toy sketch below. Let me know if I can help out in any other way!
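      Here is that toy Python sketch. It is strictly an illustration, not how the real ESX/ESXi scheduler works: each tick every VM wants CPU with some probability, and a VM only runs if all of its vCPUs can land on free cores at once, strict-co-scheduling style. The core count, demand, and VM mixes below are all made-up numbers; change them around and watch what the 2-vCPU VM's ready fraction does under your own assumptions.

import random

def simulate(vcpu_counts, physical_cores=8, demand=0.6, ticks=100_000, seed=1):
    """Fraction of 'want CPU' ticks each VM spends waiting instead of running."""
    random.seed(seed)
    wanted = [0] * len(vcpu_counts)
    waited = [0] * len(vcpu_counts)
    for _ in range(ticks):
        runnable = [i for i in range(len(vcpu_counts)) if random.random() < demand]
        random.shuffle(runnable)            # no fairness here, just a random order each tick
        free = physical_cores
        for i in runnable:
            wanted[i] += 1
            if vcpu_counts[i] <= free:      # strict: every vCPU must fit at the same time
                free -= vcpu_counts[i]
            else:
                waited[i] += 1              # this tick counts as "ready" time
    return [w / max(1, n) for w, n in zip(waited, wanted)]

# Scenario A: ten 2-vCPU VMs plus ten 1-vCPU VMs; VM 0 is a 2-vCPU VM.
ready_a = simulate([2] * 10 + [1] * 10)
# Scenario B: one 2-vCPU VM plus nineteen 1-vCPU VMs; VM 0 is still the 2-vCPU VM.
ready_b = simulate([2] + [1] * 19)
print(f"2-vCPU VM ready fraction - mixed host: {ready_a[0]:.1%}, mostly 1-vCPU host: {ready_b[0]:.1%}")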
