Apr 16, 2015
Have you ever watched the NRL Grand Final or perhaps the finale of The Block? How about the great race, Bathurst? Some amazing results with such matches and shows. Well, I reckon that a good statistician, someone like Nate Silver for example, would have been able to pick the eventual winners in all of those competitions. In the 2012 United States presidential election between Barack Obama and Mitt Romney, Nate Silver correctly predicted the winner of all 50 states and the District of Columbia. He is very well known for his statistical skills.
Alright, so by now you're wondering why I'm wittering on about statistics. You know I'm not just banging on for the sake of it. I do have a point, or several perhaps.
Statistics, to my mind, are as important to the efficient operation of your databases as just about anything else you can name. I've seen a query that took hours to run reduced to just seconds with a carefully crafted set of up-to-date statistics. Yes, they are that powerful, and yes, neglect them at your peril. A query that now takes seconds can soon enough take hours, the reverse of what I described above.
There are about a squillion ways a database can answer a query, some faster and more efficient than others. It is the job of the query optimiser to evaluate the alternatives and choose the best path, or execution plan, for a particular query (what this actually means will be the topic of a further blog post; it's much too complicated to explain here). Using the available indexes may not always be the best or most efficient plan. Statistics are SQL Server objects that contain metrics on the count and distribution of the data within a column, or columns, and the optimiser uses them to help it make that choice. In particular, they are used to estimate the number of rows a query will return.
SQL Server’s Query Optimiser uses distribution statistics to determine how it’s going to satisfy your SQL query. These statistics represent the distribution of the data within a column, or columns. The Query Optimiser uses them to estimate how many rows will be returned from a query plan. With no statistics to show how the data is distributed, the optimiser has no way to compare the efficiency of different plans, so it will frequently be forced to simply scan the table or index. Without statistics, it can’t possibly know if the column has the data you’re looking for without stepping through it. With statistics about the column, the optimiser can make much better choices about how it will access your data and use your indexes.
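You can look at these distribution statistics yourself with DBCC SHOW_STATISTICS. A minimal sketch, assuming a hypothetical Sales.Orders table with an index called IX_Orders_OrderDate (both names are placeholders for your own objects):

```sql
-- Show the header, density vector and histogram for one set of statistics.
-- 'Sales.Orders' and 'IX_Orders_OrderDate' are hypothetical names.
DBCC SHOW_STATISTICS ('Sales.Orders', 'IX_Orders_OrderDate');

-- Or just the histogram: the step-by-step picture of how values are
-- distributed across the column, which is what drives the row estimates.
DBCC SHOW_STATISTICS ('Sales.Orders', 'IX_Orders_OrderDate') WITH HISTOGRAM;
```

The histogram output is where you can see exactly what the optimiser "knows" about your data when it estimates row counts.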
When the data in the database changes, the statistics will become stale and outdated. This is why the performance of your database will change over time even though it appears you've done nothing but add more rows. While you think that your indexing strategy is as good as it'll get (and it might very well be), it won't help if the statistics become stale. When you look at a query execution plan and notice a large discrepancy between the Actual Number of Rows and the Estimated Number of Rows, outdated statistics are a common cause. These outdated statistics can, and do, cause the optimiser to choose an inefficient execution plan. That can have a dramatic effect on your overall performance. Steps must therefore be taken to keep statistics up to date.
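One way to spot stale statistics before they hurt you is to check when each one was last updated and how much the underlying data has changed since. A sketch, assuming SQL Server 2008 R2 SP2 / 2012 SP1 or later for the sys.dm_db_stats_properties DMF:

```sql
-- For each user-table statistic: when it was last updated and how many
-- row modifications have accumulated since then.
SELECT  OBJECT_NAME(s.object_id)            AS table_name,
        s.name                              AS stats_name,
        STATS_DATE(s.object_id, s.stats_id) AS last_updated,
        sp.rows,
        sp.modification_counter
FROM    sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE   OBJECTPROPERTY(s.object_id, 'IsUserTable') = 1
ORDER BY sp.modification_counter DESC;
```

A high modification_counter relative to rows, or a last_updated date from months ago on a busy table, is a good hint that an update is overdue.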
So what are we to do? Well, there are a number of things that you should do:
- Keep Auto Create Statistics enabled. This database property allows SQL Server to automatically create statistics for a single non-indexed column if they are missing when it is used in a WHERE clause or JOIN condition. This ensures the optimiser will have the necessary information to choose a plan. The statistics automatically created by SQL Server will start with _WA_ in their name.
- Keep Auto Update Statistics enabled. This database property allows SQL Server to automatically update the statistics when they are deemed outdated. The update occurs before executing a query if certain conditions are met, or after the query is executed if Auto Update Statistics Asynchronously is used instead. The three conditions that will trigger an update if one is met are:
- The table had 0 rows and increases to one or more rows.
- The table had 500 or fewer rows and there is an increase of 500 rows or more since the last update.
- The table has more than 500 rows and there is an increase of 500 rows plus 20% of the table's row count since the last update.
- Maintenance Plan. You can also proactively update the statistics yourself using T-SQL (sp_updatestats for all statistics in a database, or UPDATE STATISTICS for a single set of statistics) or a maintenance plan task. Scheduling the update during off hours reduces the need to do so during peak times. You can specify the sample size used when you update a statistic: either FULLSCAN or SAMPLE X PERCENT. Obviously, FULLSCAN will give you more accurate results but will cost more in terms of time and resources. Reducing the sample size reduces the accuracy but also reduces the cost, again in terms of time and resources.
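The two automatic settings above are database properties, so you can check and enable them with a query against sys.databases and a couple of ALTER DATABASE statements. A sketch, with MyDatabase standing in for your own database name:

```sql
-- Check the current automatic statistics settings for a database.
-- 'MyDatabase' is a placeholder name.
SELECT  name,
        is_auto_create_stats_on,
        is_auto_update_stats_on,
        is_auto_update_stats_async_on
FROM    sys.databases
WHERE   name = 'MyDatabase';

-- Enable the settings if they are off.
ALTER DATABASE MyDatabase SET AUTO_CREATE_STATISTICS ON;
ALTER DATABASE MyDatabase SET AUTO_UPDATE_STATISTICS ON;
```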
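The manual options look like this in T-SQL. The table and statistic names here (Sales.Orders, IX_Orders_OrderDate) are placeholders for your own objects:

```sql
-- Update every statistic in the current database using the default sample.
EXEC sp_updatestats;

-- Update all statistics on one table with a full scan:
-- most accurate, most expensive.
UPDATE STATISTICS Sales.Orders WITH FULLSCAN;

-- Update a single set of statistics, sampling 25 percent of the rows
-- to trade a little accuracy for a cheaper operation.
UPDATE STATISTICS Sales.Orders IX_Orders_OrderDate WITH SAMPLE 25 PERCENT;
```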
I want to warn you, finally, about a couple of things to look out for. Firstly, if you rebuild an index, the statistics on that index will be updated with the equivalent of a FULLSCAN. So, there is absolutely no point in updating those statistics again straight after a rebuild. A reorganise is another matter: it does not touch the statistics at all, so you should run an update of the statistics after a reorganise. Secondly, you should not rely solely on the Auto Update. As we saw above, for larger tables the auto update is only triggered when around 20% of the rows are affected. This means that for a table with 1 million rows, there will be no update if you add 150,000 rows. As you can imagine, that many changes can dramatically affect the accuracy of the statistics, yet an auto update would not be triggered.
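The rebuild-versus-reorganise distinction can be sketched as follows, again using placeholder names (Sales.Orders, IX_Orders_OrderDate):

```sql
-- A rebuild already refreshes the index's statistics with a full scan,
-- so no separate statistics update is needed afterwards.
ALTER INDEX IX_Orders_OrderDate ON Sales.Orders REBUILD;

-- A reorganise leaves the statistics untouched,
-- so follow it with an explicit update.
ALTER INDEX IX_Orders_OrderDate ON Sales.Orders REORGANIZE;
UPDATE STATISTICS Sales.Orders IX_Orders_OrderDate;
```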
I hope this post has given you some insight into statistics and how vitally important they are to your database. As I said: neglect them at your peril.