Apr 29, 2012

SQL Server query optimization tips: tuning best practices with examples

If we follow the tips and rules below while writing SQL queries, we can improve their performance:

1. Help the query optimizer to estimate correct statistics

In SQL Server, the execution plan is determined by:

1. Cardinality: the total number of rows processed at each step of the query.

2. Query plan: the algorithm used to perform each task, such as searching and sorting.

When we execute a SQL query, create an index, insert records into a table, and so on, SQL Server automatically creates or updates the statistics of the key columns of the tables involved. In simple words, a statistics object keeps information about the total number of distinct records in a table.
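We can inspect these statistics objects directly. A minimal sketch (it assumes the tblMessage table and the NC_Msg index that are created later in this article already exist):

-- List the statistics objects that exist on a table:
SELECT name, auto_created
FROM sys.stats
WHERE object_id = OBJECT_ID('tblMessage')

-- Show the header, density and histogram of one statistics object:
DBCC SHOW_STATISTICS ('tblMessage', NC_Msg)

The histogram section is what the optimizer uses to estimate how many rows a predicate will match.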

There are a few cases where the query optimizer is unable to get correct information from statistics, which decreases the performance of a SQL query. For example:

First we create a table tblMessage, insert 97526 records into it, and create three indexes on it:

CREATE TABLE tblMessage (
    ntMessageID BIGINT IDENTITY PRIMARY KEY,
    ntSeverity INT,
    vcMessage VARCHAR(500),
    dtDate DATETIME
)

INSERT INTO tblMessage
SELECT
    Severity,
    LEFT ([TEXT], 500),
    DATEADD (YEAR,severity,GETDATE())   
FROM Sys.messages

CREATE NONCLUSTERED INDEX NC_Severity
ON tblMessage(ntSeverity)
INCLUDE (vcMessage)

CREATE NONCLUSTERED INDEX NC_Msg
ON tblMessage(vcMessage)

CREATE NONCLUSTERED INDEX NC_Date
ON tblMessage(dtDate)
INCLUDE (ntSeverity,vcMessage)

Note: We set STATISTICS IO ON to see the total logical reads:

SET STATISTICS IO ON

Scenario 1:

a. Consider the following two SQL queries:

Query 1:

SELECT vcMessage FROM tblMessage
WHERE LEFT (vcMessage,1) = 'a'

Query 2:

SELECT vcMessage FROM tblMessage
WHERE vcMessage LIKE 'a%'

From a results point of view, both queries do the same work. Now check the logical reads in the Messages tab:

Logical read (Query 1):  1471
Logical read (Query 2):  79

So, query 1 needs approximately 18 times more logical reads than query 2.

Now check the actual execution plan of both queries:


We can observe that query 1 uses an index scan while query 2 uses an index seek, and the overall cost of query 1 is 96% while that of query 2 is only 4%.

So now we can say query 2 is much better than query 1.

b. Consider the following two SQL queries:

Query 1:

SELECT * FROM tblMessage WHERE YEAR (dtDate) = 2012

Query 2:

SELECT * FROM tblMessage
WHERE dtDate >= '2012-01-01' AND dtDate < '2013-01-01'

From a results point of view, both queries do the same work. Now check the logical reads in the Messages tab:

Logical read (Query 1):  1589
Logical read (Query 2):  10

So, query 1 needs approximately 160 times more logical reads than query 2.

Now if we check the actual execution plan of both queries, we will observe that query 1 uses an index scan while query 2 uses an index seek, and the overall cost of query 1 is 99% while that of query 2 is only 1%.

In the same way, if we analyze the following pairs of queries, query 2 will perform much better than query 1:

c.

Query 1:

SELECT vcMessage FROM tblMessage
WHERE (ntSeverity * 5)/2 = 35

Query 2:

SELECT vcMessage FROM tblMessage
WHERE ntSeverity = (35/5) * 2

Logical read (Query 1): 1490
Logical read (Query 2): 16
Batch query cost (Query 1): 100%
Batch query cost (Query 2): 0%

d.

Query 1:

SELECT vcMessage FROM tblMessage
WHERE LOWER (vcMessage) = LOWER ('Text')

Query 2:

SELECT vcMessage FROM tblMessage
WHERE vcMessage = 'Text'

Or

SELECT vcMessage FROM tblMessage
WHERE vcMessage = 'Text' COLLATE SQL_Latin1_General_CP1_CI_AS

Logical read (Query 1): 1471
Logical read (Query 2): 3
Batch query cost (Query 1): 100%
Batch query cost (Query 2): 0%

Note: The default collation of SQL Server is case-insensitive, so there is no need to use the LOWER function to perform a case-insensitive comparison. If the default collation of your database, table, or column is case-sensitive, we should use the COLLATE clause with any case-insensitive collation.
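If you are not sure which collation is in effect, it can be checked at the server, database, and column level; a small sketch using standard metadata functions:

-- Server and database default collations:
SELECT SERVERPROPERTY ('Collation') AS ServerCollation
SELECT DATABASEPROPERTYEX (DB_NAME(), 'Collation') AS DatabaseCollation

-- Collation of a particular column:
SELECT collation_name FROM sys.columns
WHERE object_id = OBJECT_ID('tblMessage') AND name = 'vcMessage'

A collation name containing CI is case-insensitive; CS is case-sensitive.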

e.

Query 1:

SELECT vcMessage FROM tblMessage
WHERE CAST (ntSeverity AS VARCHAR) = '14'

Query 2:

SELECT vcMessage FROM tblMessage
WHERE ntSeverity = CAST ('14' AS INT)

Logical read (Query 1): 1490
Logical read (Query 2): 16
Batch query cost (Query 1): 99%
Batch query cost (Query 2): 1%

Question: Why does query 2 perform better than query 1 in all of the above examples?

Answer: If you look at the WHERE clause of query 1 in each example above, you will notice that we perform some operation on a column of the table tblMessage, while query 2 uses an equivalent value as a constant expression.

If we perform any of the following operations on a column of a table in a query predicate:

1. Using any system function or user-defined function
2. Scalar operations like addition, multiplication, etc.
3. Type casting

then the SQL Server query optimizer is not able to estimate correct cardinality using statistics.

Query 1:

SELECT vcMessage FROM tblMessage
WHERE LEFT (vcMessage, 1) = 'a'

Query 2:

SELECT vcMessage FROM tblMessage
WHERE vcMessage LIKE 'a%'

Both of the above queries select 4171 records in total.

If we check the details section of the execution plan, we will see:

Actual cardinality: 4171
Estimated cardinality (Query 1): 2432
Estimated cardinality (Query 2): 4131.76

So the estimated cardinality of query 2 (4131.76) is very near the correct cardinality, i.e. the total number of records filtered by the WHERE clause and passed to the SELECT clause (4171).

Tips: If possible, don't perform any operation on a column of a table in the WHERE clause, ON clause, or HAVING clause. Instead, write an equivalent constant expression.

Scenario 2:

Consider the following two SQL statements:

Query 1:

SELECT vcMessage FROM tblMessage
WHERE vcMessage LIKE 'Text%'

Query 2:

DECLARE @Msg AS VARCHAR (500) = 'Text%'
SELECT vcMessage FROM tblMessage
WHERE vcMessage LIKE @Msg

If we check the execution plan of both queries:


Total query cost of query 2 is 96% while that of query 1 is only 4%!

What is wrong with query 2?

Again, the same cause: the query optimizer is not able to get correct cardinality using statistics.

1. Actual cardinality: 55
2. Estimated cardinality (Query 1): 40.9086
3. Estimated cardinality (Query 2): 6222.16

If we use local variables in a query predicate, the SQL Server query optimizer is not able to estimate correct cardinality using statistics.

Tips:

Don't use local variables in the WHERE clause, ON clause, or HAVING clause. Instead, use constant expressions.
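When a local variable cannot be avoided, one documented workaround (not covered above) is the RECOMPILE query hint: the statement is compiled at execution time, so the optimizer can see the variable's actual value and use the histogram, at the price of a compilation on every execution:

DECLARE @Msg AS VARCHAR (500) = 'Text%'
SELECT vcMessage FROM tblMessage
WHERE vcMessage LIKE @Msg
OPTION (RECOMPILE)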

Question: How do we resolve this problem when the variable is a parameter of a stored procedure or function?

Scenario 3:

Consider the following SQL statement:

Query 1:

SELECT vcMessage FROM tblMessage
WHERE ntSeverity * ntMessageID = 0

There are many business scenarios where we use more than one column and perform some arithmetic operation on them in a query predicate. This is called a cross relationship among columns. The above query is an example of a cross relationship between the columns ntSeverity and ntMessageID.

If there is any cross relationship among different columns of a table in a query predicate, the SQL Server query optimizer is not able to estimate correct cardinality using statistics.

To resolve this problem, SQL Server provides computed columns, so we have to create a computed column in the table tblMessage. In this example we are creating another table, tblMessage_Dense, inserting the same records, and creating a non-clustered index:

CREATE TABLE tblMessage_Dense(
    ntMessageID BIGINT IDENTITY PRIMARY KEY,
    ntSeverity INT,
    vcMessage VARCHAR(500),
    dtDate DATETIME,
    ntDensity AS ntSeverity * ntMessageID
)

INSERT tblMessage_Dense(ntSeverity,vcMessage,dtDate)
SELECT
    Severity,
    LEFT ([TEXT], 500),
    DATEADD (YEAR,severity,GETDATE())   
FROM Sys.messages

CREATE NONCLUSTERED INDEX NC_Density
ON tblMessage_Dense(ntDensity)
INCLUDE (vcMessage)

And our new SQL query will be:

Query 2:

SELECT vcMessage FROM tblMessage_Dense
WHERE ntDensity = 0

OK, now compare the performance of both SQL queries:

Logical read (Query 1): 1490
Logical read (Query 2):  9

So, query 1 needs approximately 165 times more logical reads than query 2.

Now check the actual execution plan of both queries:


We can observe that query 1 uses an index scan while query 2 uses an index seek, and the overall cost of query 1 is 99% while that of query 2 is only 1%.

So now we can say query 2 is much better than query 1.

Note: Good news, a non-persisted computed column is not physically stored in the table.
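If the computed expression is expensive, the column can instead be marked PERSISTED, which stores the computed value in the table and keeps it up to date automatically; a hypothetical variation of the table above:

CREATE TABLE tblMessage_DensePersisted (
    ntMessageID BIGINT IDENTITY PRIMARY KEY,
    ntSeverity INT,
    vcMessage VARCHAR(500),
    dtDate DATETIME,
    ntDensity AS ntSeverity * ntMessageID PERSISTED
)

A persisted computed column trades storage space for cheaper reads when the expression is costly to evaluate on the fly.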

Suppose in our application there is a table tblQueue:

CREATE TABLE tblQueue (
    ntQueueID BIGINT IDENTITY PRIMARY KEY,
    vcQueueName VARCHAR(500)
)

INSERT tblQueue
SELECT  '1_' +
    CAST(message_id AS VARCHAR) + '_' +
    CAST(severity AS VARCHAR) + '_' +
    CAST(language_id AS VARCHAR)  FROM Sys.messages

The application very frequently executes a SQL query which extracts a batch ID from the column vcQueueName:

Query 1:

SELECT * FROM tblQueue
WHERE CAST (REVERSE (LEFT (REVERSE (vcQueueName), CHARINDEX ('_', REVERSE (vcQueueName)) - 1)) AS INT) = 1031

In this situation, it would be a good idea to create a computed column BatchID in the table. In this example we are creating a new table, tblQueue_Batch:

CREATE TABLE tblQueue_Batch (
    ntQueueID BIGINT IDENTITY PRIMARY KEY,
    vcQueueName VARCHAR(500),
    ntBatchID AS CAST (REVERSE (LEFT (REVERSE (vcQueueName), CHARINDEX ('_', REVERSE (vcQueueName)) - 1)) AS INT)
)

INSERT tblQueue_Batch
SELECT '1_' +
    CAST (message_id AS VARCHAR) + '_' +
    CAST (severity AS VARCHAR) + '_' +
    CAST (language_id AS VARCHAR) FROM Sys.messages

CREATE NONCLUSTERED INDEX NC_Batch
ON tblQueue_Batch (ntBatchID)
INCLUDE (vcQueueName)

The following query 2 is much more efficient than the previous one:

Query 2:

SELECT * FROM tblQueue_Batch  WHERE ntBatchID = 1031

Logical read (Query 1): 430
Logical read (Query 2): 44
Batch query cost (Query 1): 91%
Batch query cost (Query 2): 9%

Tips:

If there is any cross relationship among columns, or a complex expression on a column in a query predicate, it is better to create a computed column and then create a non-clustered index on it.

Scenario 4:

In SQL Server, when a SQL statement uses a variable in a query predicate, the query optimizer can estimate correct cardinality only if that variable is a parameter of a stored procedure or function. For example:

We are creating two stored procedures:

CREATE PROC uspGetMsg_1 (
    @Severity AS INT
)
AS
BEGIN
    SELECT vcMessage FROM tblMessage
WHERE ntSeverity = @Severity
END

GO

CREATE PROC uspGetMsg_2
AS
BEGIN
    SELECT vcMessage FROM tblMessage
WHERE ntSeverity = 15
END

Query 1:

EXECUTE uspGetMsg_1 15

Query 2:

EXECUTE uspGetMsg_2

From a performance point of view, both queries have exactly the same execution cost. There is no difference whether we use a constant expression or pass it as a parameter.

Now we edit the stored procedure uspGetMsg_2:

ALTER PROC uspGetMsg_2 (
    @Severity AS INT
)
AS
BEGIN

    SET @Severity = @Severity - 1
    SELECT vcMessage FROM tblMessage
WHERE ntSeverity = @Severity
END

Now consider the following two SQL statements:

Query 1:

EXECUTE uspGetMsg_1 15

Query 2:

EXECUTE uspGetMsg_2 16

If we check the execution plan of both queries:


Execution cost of query 2 is 95% while that of query 1 is only 5%.

1. Actual cardinality: 3102
2. Cardinality (Query 1): 3102
3. Cardinality (Query 2): 69839

So, again, the optimizer is not able to estimate correct cardinality for query 2. If you check, you will find that query 2 uses the cardinality of the query:

SELECT vcMessage FROM tblMessage WHERE ntSeverity = 16

while we have updated the value of the variable @Severity to 15, so it should use the cardinality of the query:

SELECT vcMessage FROM tblMessage WHERE ntSeverity = 15

But it does not. So we can say the query optimizer estimates the cardinality from statistics on the basis of the value passed to the parameter of a function or stored procedure, not the updated value of the parameter inside the function or stored procedure.

Tips:

If possible, don't update the value of a parameter of a function or stored procedure before using it in a SQL statement; instead, pass the updated value to the parameter.

Scenario 5:

In SQL Server, we can empty the procedure cache by using the SQL statement:

DBCC FREEPROCCACHE

Note: Don't run this statement on a production server.
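A less drastic, documented alternative is to remove a single plan from the cache by its plan handle instead of flushing everything:

-- Find the plan handle of the cached plan of a procedure:
SELECT cp.plan_handle, st.[text]
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_sql_text (cp.plan_handle) st
WHERE st.[text] LIKE '%uspGetMsg_1%'

-- Then pass that handle to DBCC FREEPROCCACHE, for example:
-- DBCC FREEPROCCACHE (0x0600...)  -- plan handle copied from the query above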

Consider the following two SQL statements:

Query 1:

EXECUTE uspGetMsg_1 16

Query 2:

EXECUTE uspGetMsg_1 15

If we check the execution plan, we will find:

Query 1:

Actual cardinality: 69839
Estimated cardinality: 69839

Query 2:

Actual cardinality: 3102
Estimated cardinality: 69839

Now empty the proc cache:

DBCC FREEPROCCACHE

Now execute the same two queries, but in reverse order:

Query 1:

EXECUTE uspGetMsg_1 15

Query 2:

EXECUTE uspGetMsg_1 16

If we check the execution plan, we will find:

Query 1:

Actual cardinality: 3102
Estimated cardinality: 3102

Query 2:

Actual cardinality: 69839
Estimated cardinality: 3102

Here we can observe that the estimated cardinality of query 2 equals the estimated cardinality of query 1. SQL Server builds the statistics-based estimate for a parameter of a stored procedure or function from the value passed when it is executed the first time. From then on it reuses the same estimate as long as the statistics, query plan, etc. do not change. Due to a wrong cardinality estimate, the query optimizer can choose a wrong execution plan, which may decrease query performance.

We can optimize a SQL query for the value the application passes most often by using the query hint OPTIMIZE FOR. It overrides the default behavior of optimizing the query on the basis of the value passed the first time. For example:

CREATE PROC uspGetMsg(
    @Severity AS INT
)
AS
BEGIN
    SELECT vcMessage FROM tblMessage
    WHERE ntSeverity = @Severity
    OPTION (OPTIMIZE FOR (@Severity = 15))
END

Tips:

Use the OPTIMIZE FOR hint when you want to optimize a SQL query on the basis of a specific parameter value.
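When no single parameter value is typical, a related documented option is OPTIMIZE FOR UNKNOWN (SQL Server 2008 and later), which tells the optimizer to ignore the sniffed value and use the average density from the statistics instead:

SELECT vcMessage FROM tblMessage
WHERE ntSeverity = @Severity
OPTION (OPTIMIZE FOR (@Severity UNKNOWN))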

Scenario 6:

There are many situations where we need to update the value of a parameter according to some condition inside a stored procedure or function. For example:

CREATE PROC uspGetMsg(
    @Severity AS INT
)
AS
BEGIN

    IF @Severity IS NULL
         SET @Severity = (SELECT MAX(ntSeverity) FROM tblMessage)
        
    SELECT vcMessage FROM tblMessage WHERE ntSeverity = @Severity
END

Consider a situation where our application executes the following SQL query for the first time:

EXECUTE uspGetMsg NULL

The query optimizer will estimate the cardinality on the basis of:

SELECT vcMessage FROM tblMessage 
WHERE ntSeverity = NULL

As we know, NULL is never equal to any value; we use IS NULL to compare null values. So this query will not return any rows, and if a query returns no rows, its estimated cardinality will be one.

Let's check the cardinality in the execution plan:
Actual cardinality: 110
Estimated cardinality: 1

It will affect the performance of this query drastically. The solution to this problem is to create two stored procedures. For example:

CREATE PROC uspGetMsg (
    @Severity AS INT
)
AS
BEGIN

    IF @Severity IS NULL
         SET @Severity = (SELECT MAX (ntSeverity) FROM tblMessage)
        
    EXECUTE uspGetMsgNotNull @Severity
END

GO

CREATE PROC uspGetMsgNotNull (
    @Severity AS INT
)
AS
BEGIN   
    SELECT vcMessage FROM tblMessage WHERE ntSeverity = @Severity
END

And from the application, use only the stored procedure uspGetMsg.

Tips:

If you want to update the value of a parameter of a stored procedure or function, create a similar procedure or function and execute it from the base procedure or function, passing the updated value as a parameter.

Scenario 7:

In SQL Server, the auto create statistics and auto update statistics settings are on by default (asynchronous statistics update is off). This means the query optimizer automatically creates and updates statistics as required. More specifically, it auto-creates statistics for primary key columns when we create a table, for the key columns of clustered and non-clustered indexes when we create indexes, and for query predicates when we execute queries; it updates statistics when a threshold is reached after updates, deletions, or insertions of records.

Limitation: The query optimizer can create only single-column statistics, and it cannot create filtered statistics.

So there are many cases where we can improve the performance of SQL statements by creating user-defined statistics. For example:

SELECT vcMessage FROM tblMessage
WHERE ntMessageId < 1000
    AND ntSeverity IN (10, 11, 12, 16)

If we check the execution plan we will get:

Actual cardinality: 705
Estimated cardinality: 839.486

If we create the following filtered statistics:

CREATE STATISTICS Stat_Severity 
ON tblMessage(ntMessageId)
WHERE ntSeverity IN (10, 11, 12, 16)

Now the estimated cardinality of the previous query will be 705.671.

So now the query will perform better.

Tips:

Create user-defined multi-column statistics if a query predicate involves more than one column of a table.
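A minimal sketch of such a multi-column statistics object on the tblMessage table used in this article (the name Stat_Severity_Date is only illustrative):

CREATE STATISTICS Stat_Severity_Date
ON tblMessage (ntSeverity, dtDate)
WITH FULLSCAN

The density information of the column combination (ntSeverity, dtDate) helps the optimizer when a predicate filters on both columns together.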
