Have you ever been put in the position of having to address user complaints about how slowly the system is performing, without really knowing where to begin looking? Some analysts begin with Task Manager or the Windows event log. Others who work with SQL Server search the error log for problems or run Activity Monitor from SQL Server Management Studio (SSMS). Sometimes, when I/Os take WAY too long to complete, entries will appear in the error log, but generally none of these sources displays much pertinent information. On Windows Server 2012, Task Manager was enhanced greatly so that the memory, disk, and network subsystems can be observed better in real time; previous versions displayed considerably less detail, particularly with respect to I/O. Unfortunately, Task Manager has two inherent weaknesses: an analyst has to be watching it when the problem occurs and must interpret all the data before it scrolls off the screen within 60 seconds, and the one-second sampling often gives the viewer the impression that things are horrible when they are actually fine. Both move far too fast for anyone to work out a performance problem.
When confronted with a situation like this, it is often helpful to know about a handful of PerfMon counters that can be used to rule out a given area, if not highlight the problem area(s). The counters can be collected over time and then reviewed once a performance problem has occurred. Although the list below is far from complete, these counters provide a very good overview of what is actually going on with the system.
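As a concrete starting point, the sketch below shows one way to capture such a baseline from a script using typeperf (the command-line companion to PerfMon), sampling every five seconds so the data can be reviewed after the fact. The specific counter paths, instance wildcards, and output file name are assumptions chosen for illustration; substitute the counters discussed in the remainder of this article as needed.

```python
# Minimal sketch (assumed counter paths and file name): capture a small PerfMon
# baseline to CSV with typeperf so it can be reviewed after a problem occurs.
import subprocess

counters = [
    r"\Processor(*)\% Processor Time",
    r"\Processor(*)\% Privileged Time",
    r"\Processor(*)\% Interrupt Time",
    r"\Processor(*)\% DPC Time",
    r"\LogicalDisk(*)\% Idle Time",
    r"\LogicalDisk(*)\Avg. Disk sec/Transfer",
    r"\LogicalDisk(*)\Avg. Disk sec/Write",
    r"\Memory\Available MBytes",
]

# -si 5   five-second sample interval (see the note at the end of this article)
# -sc 720 one hour of samples at that interval
# -f CSV  plain CSV output that can be reviewed later
subprocess.run(
    ["typeperf", *counters,
     "-si", "5", "-sc", "720",
     "-f", "CSV", "-o", "perf_baseline.csv"],
    check=True,
)
```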
This metric measures the amount of time that one or more processors are busy performing any kind of work. This metric can be captured for the entire processor complex (the same figure Task Manager shows) or for an individual processor. If this value is greater than 70 percent for long periods, the processors are struggling to keep up with the presented workload. Very high spikes are commonplace, but sustained levels that approach saturation are problematic.
This metric measures the amount of time that one or more processors are busy performing Windows overhead work, e.g., handling the behind-the-scenes work of scheduling threads on processors, servicing disk and network I/Os, and managing memory and devices. This metric comprises Windows kernel time, % Interrupt Time, and % DPC Time, so to compute the true Windows kernel time, the other two must be subtracted from % Privileged Time. Although this metric can be captured for the entire processor complex or for an individual processor, it is best to collect it for each processor, and if its value is consistently greater than 30 percent for any single processor, overhead activities are unusually high and should be investigated using Resource Monitor or Event Tracing for Windows (ETW). The only exception to this rule occurs during system backups, which are usually performed in a mode that is charged to the Windows overhead work category.
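To make the arithmetic concrete, here is a small sketch of the subtraction described above, using made-up per-processor percentages; the 30 percent guideline is then applied to the privileged-time figure for each processor.

```python
# Illustration of the decomposition described above. % Privileged Time includes
# % Interrupt Time and % DPC Time, so the remainder is the true Windows kernel time.
# All sample values below are hypothetical.
def true_kernel_time(privileged_pct, interrupt_pct, dpc_pct):
    return privileged_pct - interrupt_pct - dpc_pct

# Example: 38% privileged time, of which 6% is interrupts and 9% is DPCs,
# leaves 23% of genuine Windows kernel time.
print(true_kernel_time(38.0, 6.0, 9.0))  # 23.0

# Apply the 30 percent guideline per processor (ignoring backup windows).
privileged_by_cpu = {"0": 38.0, "1": 12.0, "2": 9.0, "3": 31.5}
suspects = [cpu for cpu, pct in privileged_by_cpu.items() if pct > 30.0]
print(suspects)  # ['0', '3'] -> investigate with Resource Monitor or ETW
```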
Windows utilizes two kinds of interrupts: immediate and deferred. This metric measures the amount of time that one or more processors are busy handling immediate interrupts, and it contributes to % Privileged Time. Immediate interrupts are those that MUST be handled right away. Interestingly, as Windows has evolved, disk I/O interrupts, which were once handled immediately, are now often deferred. Although this metric can be captured for the entire processor complex or for an individual processor, it is best to collect it for each processor, and if its value is consistently greater than 20 percent for any single processor, immediate interrupt handling activities are unusually high and should be investigated using Resource Monitor or Event Tracing for Windows (ETW). The only general exception to this rule occurs during system backups, which are usually performed in a mode that is charged to the Windows overhead work category. Expect this value to be higher on the processor(s) that handle network connections.
This metric measures the amount of time that one or more processors are busy handling deferred interrupts, and it contributes to % Privileged Time. The most consistent example of deferred interrupts involves network packets: a system that handles an enormous amount of network traffic will always exhibit high numbers of deferred interrupts. Beginning with Windows Server 2008, network packets are handled by multiple processors, although only a single processor handles any given network connection. Although this metric can be captured for the entire processor complex or for an individual processor, it is best to collect it for each processor, and if its value is consistently greater than 25 percent for any single processor, deferred interrupt handling activities are unusually high and should be investigated using Resource Monitor or Event Tracing for Windows (ETW). The only general exception to this rule occurs during system backups, which are usually performed in a mode that is charged to the Windows overhead work category and during which the network is generally utilized extensively. Expect this value to be higher on the processor(s) that handle network connections.
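Since both interrupt-related counters are best interpreted per processor and over a sustained stretch of samples, a simple check like the following captures the 20 and 25 percent guidelines from the two sections above; the per-processor sample series are hypothetical.

```python
# Sketch: flag processors whose immediate-interrupt or DPC time stays above the
# guideline thresholds for several consecutive samples. The series are hypothetical.
def sustained(values, threshold, min_consecutive=3):
    """True if `values` exceeds `threshold` for at least `min_consecutive` samples in a row."""
    run = 0
    for v in values:
        run = run + 1 if v > threshold else 0
        if run >= min_consecutive:
            return True
    return False

interrupt_pct = {"0": [22, 24, 25, 23, 26], "1": [3, 4, 2, 5, 3]}
dpc_pct       = {"0": [28, 30, 27, 29, 31], "1": [6, 5, 7, 4, 6]}

for cpu in interrupt_pct:
    if sustained(interrupt_pct[cpu], 20):
        print(f"CPU {cpu}: immediate interrupt handling is unusually high")
    if sustained(dpc_pct[cpu], 25):
        print(f"CPU {cpu}: deferred interrupt (DPC) handling is unusually high, "
              f"often the processor servicing network connections")
```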
This metric measures the amount of time that one or more physical or logical (depending upon the object) disk LUNs are NOT busy performing any kind of work. This metric is only useful when applied to a specific disk LUN; the aggregate value is meaningless except in a capacity planning context. If this value is LESS than 40 percent for long periods, the physical or logical disk LUN is struggling to keep up with the presented workload. Disk response times often increase exponentially as this value approaches zero. In this situation, either the workload should be reduced or the hardware should be upgraded.
This metric measures the amount of time in seconds that the average I/O request requires to complete. This can be considered a true disk LUN response time because the timer is activated when the I/O is extracted from the disk queue and it is terminated when a processor handles the I/O completion. If this value is greater than 0.040 seconds (40 milliseconds) for long periods, the disk LUN is struggling to keep up with the presented workload, so the pressure should be alleviated by workload reduction or hardware upgrade.
This metric measures the amount of time in seconds that the average write I/O request requires to complete. This value can be especially useful in RAID 5 environments because it will indicate clearly when the disk LUN is asked to handle too many writes. If this value is greater than 0.040 seconds (40 milliseconds) for long periods, the disk LUN is struggling to keep up with the write workload. If this occurs, the write workload should be reduced or a different level of RAID should be used, if appropriate.
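The three disk guidelines above can be combined into one per-LUN sanity check. The sketch below uses hypothetical readings for a single LUN and flags it when idle time stays below 40 percent or average latencies stay above 40 milliseconds for most of the interval.

```python
# Sketch: apply the disk guidelines above to one LUN's samples (hypothetical values).
def mostly(values, predicate, fraction=0.8):
    """True when `predicate` holds for at least `fraction` of the samples."""
    return sum(1 for v in values if predicate(v)) >= fraction * len(values)

idle_pct         = [35, 30, 28, 33, 25, 38]                     # % idle
sec_per_transfer = [0.055, 0.048, 0.062, 0.051, 0.045, 0.058]   # avg sec per I/O
sec_per_write    = [0.070, 0.065, 0.080, 0.072, 0.068, 0.075]   # avg sec per write

if mostly(idle_pct, lambda v: v < 40):
    print("LUN is rarely idle: it is struggling with the presented workload")
if mostly(sec_per_transfer, lambda v: v > 0.040):
    print("Average I/O response time exceeds 40 ms: reduce workload or upgrade hardware")
if mostly(sec_per_write, lambda v: v > 0.040):
    print("Write latency exceeds 40 ms: on RAID 5, consider a different RAID level")
```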
This metric measures the amount of memory available to Windows. Note: this value often includes memory that has been allocated to the Windows cache, so you may observe values that appear to be double-counted as shown in the picture below. Free Memory is the true amount of memory that isn’t being used by anything. Windows always attempts to leave memory that is cached resident as long as possible to reduce the probability of having to perform repeated I/Os. If the available value is less than one GB for long periods, the system is short on physical memory. In SQL Server environments, it may be possible to reduce the amount allocated to SQL Server, but this may create problems with SQL Server performance.
This metric measures the number of read operations performed outside SQL Server. Because these reads can involve ordinary file activity, the counter highlights a memory shortage only indirectly: when this number is consistently high and files are not otherwise being read into memory, the system may be short on memory.
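A small sketch ties the last two signals together: available memory that stays below roughly one GB, combined with a high rate of non-SQL Server reads that cannot be explained by deliberate file activity, points at a physical memory shortage. The sample figures and the read-rate cut-off below are purely illustrative.

```python
# Sketch: combine the two memory signals above (hypothetical samples).
available_mb  = [780, 820, 760, 700, 812, 790]   # available memory in MB
reads_per_sec = [350, 420, 390, 510, 460, 480]   # reads performed outside SQL Server

low_memory  = all(mb < 1024 for mb in available_mb)            # under ~1 GB throughout
heavy_reads = sum(reads_per_sec) / len(reads_per_sec) > 100    # illustrative cut-off

if low_memory and heavy_reads:
    print("Likely physical memory shortage: consider trimming SQL Server's allocation,")
    print("but watch SQL Server performance if you do.")
elif low_memory:
    print("Available memory is consistently low; keep watching the read counters.")
```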
This metric measures the weighted average of the time that unlocked buffers remain in the SQL Server buffer pool. It exhibits unusual behavior: when SQL Server runs out of memory, the value will often plummet, and even after the problem is gone, it increases only very slowly over time. Therefore, although it is tempting to interpret values below the desired threshold of 300 seconds as representing a memory problem, one has to look at the trend, because it may simply be that SQL Server ran out of memory and then recovered. However, a persistently low value can also indicate a chronic SQL Server memory shortage.
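That trend-based reasoning can be expressed as a small helper: a value below the 300-second guideline is only alarming when the counter is not climbing steadily back. The sample series are hypothetical.

```python
# Sketch of the trend check described above (hypothetical sample series).
def assess_buffer_lifetime(samples, threshold=300):
    if min(samples) >= threshold:
        return "healthy"
    rising = all(later >= earlier for earlier, later in zip(samples, samples[1:]))
    if rising:
        return "recovering from a memory-pressure event; keep watching the trend"
    return "persistently low: possible chronic SQL Server memory shortage"

print(assess_buffer_lifetime([45, 60, 90, 130, 180, 240]))   # recovering
print(assess_buffer_lifetime([120, 95, 110, 80, 100, 90]))   # persistently low
```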
This metric measures the number of read operations performed by SQL Server. Windows reads are not part of this number. This metric can be combined with the physical or logical disk metrics to determine whether SQL Server is causing stress on the disk LUNs.
Using this short list of PerfMon counters, an analyst can quickly look at data that has been captured over any given period to determine whether significant problems occurred. Note: the PerfMon sampling interval should be set to no less than five seconds, because anything shorter creates so much noise and distraction that accurate and reasonable assessment of performance becomes virtually impossible.
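To close the loop, the sketch below reviews a previously captured CSV (for example, the typeperf output from the first sketch, or a binary log converted with relog) and flags any disk LUN whose average latency stayed above the 40-millisecond guideline for a full minute of five-second samples. The file name is an assumption, and the loose column matching is needed because PerfMon embeds the machine name in each column header; the same pattern applies to any of the counters discussed above.

```python
# Sketch: scan a captured PerfMon CSV for sustained disk-latency violations.
# Assumes the CSV produced earlier; a .blg can be converted first with:
#   relog perf.blg -f CSV -o perf_baseline.csv
import csv

THRESHOLD_SEC = 0.040   # the 40 ms guideline discussed above
CONSECUTIVE   = 12      # 12 samples x 5-second interval = one full minute

with open("perf_baseline.csv", newline="") as f:
    rows = list(csv.DictReader(f))

timestamp_col = list(rows[0])[0]  # first column holds the sample timestamp
latency_cols = [c for c in rows[0] if "Avg. Disk sec/Transfer" in c]

for col in latency_cols:
    run = 0
    for row in rows:
        try:
            value = float(row[col])
        except ValueError:
            run = 0             # blank samples appear when a counter has no data
            continue
        run = run + 1 if value > THRESHOLD_SEC else 0
        if run >= CONSECUTIVE:
            print(f"{col}: latency above 40 ms for at least a minute "
                  f"around {row[timestamp_col]}")
            break
```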