SQL Server Memory Health

SQL Server Memory Health: Performance Monitor Counters

Often there are times that we to need assess the health of a SQL Server instance, but we have not yet added all the monitoring tools to the environment, which can involve approvals to install software, assign admin rights to accounts, and in general just take time to set up. What do we do if we want to define a quick and consistent approach to collecting data about SQL Server performance from instances we are just looking at for the first time? Microsoft Performance Monitor is that utility.

Performance Monitor is a built-in utility within the Windows operating system (OS) that allows you to capture performance counter data over time. It’s free and easy to use. This makes it a great tool for those without any 3rd party tools or who want to collect data outside of SQL server as either a primary performance monitoring solution or additional data collection to strengthen the overall monitoring footprint; it can be used by DBAs, Sysadmins, and IT Managers to collect the necessary data quickly and efficiently about overall SQL Server health.

When reviewing performance data for SQL Server there are essentially 3 high-level arenas that we need to cover, CPU, IO, and Memory. CPU and IO tend to be the more visible areas of monitoring, as the costs associated with them are high (SQL Server licensing is typically CPU core based and enterprise storage solutions can easily be 6 to 7 figures). While Memory allocation to servers might be the least expensive of the three, its performance implications for the servers is the highest. So let’s review some Memory based performance counters to verify if SQL Server is performing well.

Page Life Expectancy

Page Life Expectancy (PLE) is probably one of the most misunderstood counters. It represents how long data pages stay in memory before being evicted (either because it’s not used anymore or there are other data pages that need to use the buffer pool). Tracking PLE can tell when you have memory pressure on your system, however, the generic recommendation of 300 seconds is not always a good indication of the issue. Rather, tracking the value of time and monitoring for large sudden drops in PLE will be indicative of memory pressure. The goal with this counter is to baseline and know what values are with your system in good health.

Memory Grant Queue Waits

Memory Grant Queue Waits represents the number of SQL Server processes waiting to be granted memory workspace. All queries need memory for their execution and the query optimizer will estimate how much it needs. If there is not enough available, it will wait for memory to become available. This counter should always be as close to 0 as possible. Short bursts could indicate bad query patterns, and long sustained counts above 0 are indicative of memory pressure. If the server has the capacity, you will want to increase the amount of memory that SQL Server can use.

Memory: Pages/Sec

Memory paging is also a sign that the overall memory allocation to the server is too low for the given workload (either within SQL Server or the combination of all services running). When server memory is low, the OS will page the data to disk, which is much slower than an all-in memory operation. Occasional paging might not be problematic, but consistent high numbers is indicative of a system bottleneck.

Memory: Lazy Writes

Lazy Writes are essentially the process that allows ‘dirty pages’ in memory to be written to disk during checkpoints. Dirty pages are data pages within the memory buffer that contain modified data that has yet to be written to disk. SQL Server manages dirty pages within the checkpoint process, so as to not slow all DML queries with physical disk writes, it’s managed as a background process. Checkpoints and flushing dirty pages to disk is a routine operation that every system does but monitoring lazy writes specifically will tell us if the amount of data that is changing is too high for SQL Server to keep up with.   If you see this counter consistently higher than 20, then you likely have memory pressure on the system

Often there are times that we to need assess the health of a SQL Server instance, but we have not yet added all the monitoring tools to the environment, which can involve approvals to install software, assign admin rights to accounts, and in general just take time to set up. What do we do if we want to define a quick and consistent approach to collecting data about SQL Server performance from instances we are just looking at for the first time? Microsoft Performance Monitor is that utility. Performance Monitor is a built-in utility within the Windows operating system (OS) that allows you to capture performance counter data over time. It’s free and easy to use. This makes it a great tool for those without any 3<sup>rd</sup> party tools or who want to collect data outside of SQL server as either a primary performance monitoring solution or additional data collection to strengthen the overall monitoring footprint; it can be used by DBAs, Sysadmins, and IT Managers to collect the necessary data quickly and efficiently about overall SQL Server health. When reviewing performance data for SQL Server there are essentially 3 high-level arenas that we need to cover, CPU, IO, and Memory. CPU and IO tend to be the more visible areas of monitoring, as the costs associated with them are high (SQL Server licensing is typically CPU core based and enterprise storage solutions can easily be 6 to 7 figures). While Memory allocation to servers might be the least expensive of the three, its performance implications for the servers is the highest. So let’s review some Memory based performance counters to verify if SQL Server is performing well. <h2>Page Life Expectancy</h2> Page Life Expectancy (PLE) is probably one of the most misunderstood counters. It represents how long data pages stay in memory before being evicted (either because it’s not used anymore or there are other data pages that need to use the buffer pool). Tracking PLE can tell when you have memory pressure on your system, however, the generic recommendation of 300 seconds is not always a good indication of the issue. Rather, tracking the value of time and monitoring for large sudden drops in PLE will be indicative of memory pressure. The goal with this counter is to baseline and know what values are with your system in good health. <h2>Memory Grant Queue Waits</h2> Memory Grant Queue Waits represents the number of SQL Server processes waiting to be granted memory workspace. All queries need memory for their execution and the query optimizer will estimate how much it needs. If there is not enough available, it will wait for memory to become available. This counter should always be as close to 0 as possible. Short bursts could indicate bad query patterns, and long sustained counts above 0 are indicative of memory pressure. If the server has the capacity, you will want to increase the amount of memory that SQL Server can use. <h2>Memory: Pages/Sec</h2> Memory paging is also a sign that the overall memory allocation to the server is too low for the given workload (either within SQL Server or the combination of all services running). When server memory is low, the OS will page the data to disk, which is much slower than an all-in memory operation. Occasional paging might not be problematic, but consistent high numbers is indicative of a system bottleneck. <h2>Memory: Lazy Writes</h2> Lazy Writes are essentially the process that allows ‘dirty pages’ in memory to be written to disk during checkpoints. Dirty pages are data pages within the memory buffer that contain modified data that has yet to be written to disk. SQL Server manages dirty pages within the checkpoint process, so as to not slow all DML queries with physical disk writes, it’s managed as a background process. Checkpoints and flushing dirty pages to disk is a routine operation that every system does but monitoring lazy writes specifically will tell us if the amount of data that is changing is too high for SQL Server to keep up with.   If you see this counter consistently higher than 20, then you likely have memory pressure on the system <h2>Buffer Cache Hit Ratio</h2> This counter tracks the percent of pages that are already in memory and can be accessed quickly by SQL Server. If the data is not in memory, it will have to read the data off disk first and SQL Server will decide what other data pages will be evicted to complete the request. For OLTP systems you should see a buffer cache hit ratio of >95%, and in some cases, with critical real-time processes, you would always want > 99%. OLAP typically differ in the sense they are not dealing with real-time, latency-sensitive data and can purposefully be aggregating large datasets that will require physical reads from disk. In this case requiring all the data to be consistently within the memory buffer might not be a realistic goal. As the need for the data to be ‘real-time’ sensitive, the buffer cache hit ratio becomes more important to track and maintain a high hit ratio. <h2>Memory: Available Bytes</h2> This counter tracks the available memory on the host, not just what SQL Server is using.  SQL Server memory is configurable but by default is set to utilize all the available memory on the host, which can lead to all sorts of performance issues. Similarly, even with appropriate memory settings within SQL Server, there can be other programs running on the host that is using memory.  This counter can quickly show when memory pressure is affecting the system and either the corresponding processes can be tuned to require less memory, or if more memory is needed in the system. After reviewing these memory counters, you should be able to answer the following questions: <ol> <li>Is my host showing signs of memory pressure?</li> <li>Is SQL Server the source of that memory pressure?</li> <li>Is SQL server itself showing signs of memory pressure even when the overall host metrics are healthy?</li> </ol> These answers will determine if we need to allocate more memory to the server, allocate more memory to SQL Server, or if we need to tune queries to bring metrics back into normal healthy ranges.

Buffer Cache Hit Ratio

This counter tracks the percent of pages that are already in memory and can be accessed quickly by SQL Server. If the data is not in memory, it will have to read the data off disk first and SQL Server will decide what other data pages will be evicted to complete the request. For OLTP systems you should see a buffer cache hit ratio of >95%, and in some cases, with critical real-time processes, you would always want > 99%. OLAP typically differ in the sense they are not dealing with real-time, latency-sensitive data and can purposefully be aggregating large datasets that will require physical reads from disk. In this case requiring all the data to be consistently within the memory buffer might not be a realistic goal. As the need for the data to be ‘real-time’ sensitive, the buffer cache hit ratio becomes more important to track and maintain a high hit ratio.

Memory: Available Bytes

This counter tracks the available memory on the host, not just what SQL Server is using.  SQL Server memory is configurable but by default is set to utilize all the available memory on the host, which can lead to all sorts of performance issues. Similarly, even with appropriate memory settings within SQL Server, there can be other programs running on the host that is using memory.  This counter can quickly show when memory pressure is affecting the system and either the corresponding processes can be tuned to require less memory, or if more memory is needed in the system.

After reviewing these memory counters, you should be able to answer the following questions:

  1. Is my host showing signs of memory pressure?
  2. Is SQL Server the source of that memory pressure?
  3. Is SQL server itself showing signs of memory pressure even when the overall host metrics are healthy?

These answers will determine if we need to allocate more memory to the server, allocate more memory to SQL Server, or if we need to tune queries to bring metrics back into normal healthy ranges.

0 replies

Leave a Reply

Want to join the discussion?
Feel free to contribute!

Leave a Reply

Your email address will not be published. Required fields are marked *