Apr 29, 2012

SQL Server query optimization tips: tuning best practices with examples

If we follow the tips and rules below while writing SQL queries, we can improve their performance:

1. Help Query optimizer to estimate correct statistics

In SQL Server, the execution plan is determined by:

1. Cardinality: the estimated number of rows processed at each step of a query.

2. Query plan: the algorithm chosen to perform each task, such as searching, sorting, etc.

When we execute a SQL query, create an index, insert records into a table, etc., SQL Server automatically creates or updates statistics on the key columns of tables. In simple words, a statistics object keeps information about the distribution of values, such as the total number of distinct records, in a table.

There are a few cases where the query optimizer is unable to get correct information from statistics, which decreases the performance of a SQL query. For example:

First, we create a tblMessage table, insert 97,526 records into it, and create three indexes on it:

CREATE TABLE tblMessage (
    ntMessageID BIGINT IDENTITY PRIMARY KEY,
    ntSeverity INT,
    vcMessage VARCHAR(500),
    dtDate DATETIME
)

INSERT INTO tblMessage
SELECT
    Severity,
    LEFT ([TEXT], 500),
    DATEADD (YEAR,severity,GETDATE())   
FROM Sys.messages

CREATE NONCLUSTERED INDEX NC_Severity
ON tblMessage(ntSeverity)
INCLUDE (vcMessage)

CREATE NONCLUSTERED INDEX NC_Msg
ON tblMessage(vcMessage)

CREATE NONCLUSTERED INDEX NC_Date
ON tblMessage(dtDate)
INCLUDE (ntSeverity,vcMessage)

Note: We set STATISTICS IO on to see the total number of logical reads:

SET STATISTICS IO ON
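To see which statistics objects exist on the table and inspect what the optimizer works with, SQL Server provides sp_helpstats and DBCC SHOW_STATISTICS; the table and index names below are the ones created above:

```sql
-- List all statistics on the table, including auto-created ones:
EXEC sp_helpstats 'tblMessage', 'ALL'

-- Show the histogram and density information the optimizer uses
-- for cardinality estimation on the NC_Severity index:
DBCC SHOW_STATISTICS ('tblMessage', NC_Severity)
```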

Scenario 1:

a. Consider the following two SQL queries:

Query 1:

SELECT vcMessage FROM tblMessage
WHERE LEFT (vcMessage,1) = 'a'

Query 2:

SELECT vcMessage FROM tblMessage
WHERE vcMessage LIKE 'a%'

Both of them perform the same task from a query point of view. Now check the logical reads in the Messages tab:

Logical read (Query 1):  1471
Logical read (Query 2):  79

So, Query 1 needs approximately 18 times more logical reads than Query 2.

Now check the actual execution plan of both queries:

We can observe that Query 1 uses an index scan while Query 2 uses an index seek, and the overall cost of Query 1 is 96% while that of Query 2 is only 4%.

So we can say Query 2 is much better than Query 1.

b. Consider the following two SQL queries:

Query 1:

SELECT * FROM tblMessage WHERE YEAR (dtDate) = '2012'

Query 2:

SELECT * FROM tblMessage
WHERE dtDate >= '2012-01-01' AND dtDate < '2013-01-01'

Both of them perform the same task from a query point of view. Now check the logical reads in the Messages tab:

Logical read (Query 1):  1589
Logical read (Query 2):  10

So, Query 1 needs approximately 160 times more logical reads than Query 2.

Now if we check the actual execution plan of both queries, we will observe that Query 1 uses an index scan while Query 2 uses an index seek, and the overall cost of Query 1 is 99% while that of Query 2 is only 1%.

In the same way, if we analyze the following pairs of queries, Query 2 performs much better than Query 1:

c.

Query 1:

SELECT vcMessage FROM tblMessage
WHERE (ntSeverity * 5)/2 = 35

Query 2:

SELECT vcMessage FROM tblMessage
WHERE ntSeverity = (35/5) * 2

Logical read (Query 1): 1490
Logical read (Query 2): 16
Batch query cost (Query 1): 100%
Batch query cost (Query 2): 0%

d.

Query 1:

SELECT vcMessage FROM tblMessage
WHERE LOWER (vcMessage) = LOWER ('Text')

Query 2:

SELECT vcMessage FROM tblMessage
WHERE vcMessage = 'Text'

Or

SELECT vcMessage FROM tblMessage
WHERE vcMessage = 'Text' COLLATE SQL_Latin1_General_CP1_CI_AS

Logical read (Query 1): 1471
Logical read (Query 2): 3
Batch query cost (Query 1): 100%
Batch query cost (Query 2): 0%

Note: The default collation of SQL Server is case insensitive, so there is no need to use the LOWER function to perform a case-insensitive comparison. If the default collation of your database, table, or column is case sensitive, we should use the COLLATE clause with any case-insensitive collation.
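Before relying on a case-insensitive comparison, you can check which collation is actually in effect at the server, database, and column level using built-in metadata functions:

```sql
SELECT SERVERPROPERTY('Collation')                  -- server default collation
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation')   -- current database collation

-- Column-level collation of vcMessage:
SELECT collation_name
FROM sys.columns
WHERE object_id = OBJECT_ID('tblMessage')
    AND name = 'vcMessage'
```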

e.

Query 1:

SELECT vcMessage FROM tblMessage
WHERE CAST (ntSeverity AS VARCHAR) = '14'

Query 2:

SELECT vcMessage FROM tblMessage
WHERE ntSeverity = CAST ('14' AS INT)

Logical read (Query 1): 1490
Logical read (Query 2): 16
Batch query cost (Query 1): 99%
Batch query cost (Query 2): 1%

Question: Why does Query 2 perform better than Query 1 in all of the above examples?

Answer: If you look at the WHERE clause of Query 1 in each example, you will notice that we perform some operation on a field of the table tblMessage, while in Query 2 the equivalent value is used as a constant expression.

If we perform any of the following operations on a field of a table in a query predicate:

1. Using any system function or user-defined function
2. A scalar operation like addition, multiplication etc.
3. Typecasting

then the SQL Server query optimizer is not able to estimate correct cardinality using statistics. For example, reconsider the first pair of queries:

Query 1:

SELECT vcMessage FROM tblMessage
WHERE LEFT (vcMessage, 1) = 'a'

Query 2:

SELECT vcMessage FROM tblMessage
WHERE vcMessage LIKE 'a%'

Both of the above queries select 4171 records in total.

If we check the details section of the execution plan, we will get:

Actual cardinality: 4171
Cardinality (Query 1): 2432
Cardinality (Query 2):  4131.76

So the estimated cardinality of Query 2 (4131.76) is very close to the correct cardinality, i.e., the total number of records filtered by the WHERE clause and passed to the SELECT clause (4171).

Tips: If possible, don't perform any operation on a field of a table in the WHERE clause, ON clause, or HAVING clause. Instead, write an equivalent constant expression.

Scenario 2:

Consider the following two SQL statements:

Query 1:

SELECT vcMessage FROM tblmessage
WHERE vcMessage LIKE 'Text%'

Query 2:

DECLARE @Msg AS VARCHAR (500) = 'Text%'
SELECT vcMessage FROM tblmessage
WHERE vcMessage LIKE @Msg

If we check the execution plans of both queries:

The total query cost of Query 2 is 96% while Query 1 is only 4%!

What is wrong with Query 2?

Again, the same cause: the query optimizer is not able to get the correct cardinality using statistics.

1. Actual cardinality: 55
2. Cardinality (Query 1): 40.9086
3. Cardinality (Query 2): 6222.16

If we use local variables in a query predicate, the SQL Server query optimizer is not able to estimate correct cardinality using statistics.

Tips:

Don't use local variables in the WHERE clause, ON clause, or HAVING clause. Instead, use constant expressions.

Question: How do we resolve this problem when the variable is a parameter of a stored procedure or function?
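A well-known workaround for the local-variable problem, not covered elsewhere in this article and offered here only as a supplementary sketch, is the OPTION (RECOMPILE) query hint. It makes SQL Server compile the statement at execution time, when the variable's actual value is known, so the optimizer can use the statistics histogram instead of a blind guess:

```sql
DECLARE @Msg AS VARCHAR(500) = 'Text%'

-- OPTION (RECOMPILE) defers compilation until the value of @Msg
-- is known, letting the optimizer estimate cardinality correctly:
SELECT vcMessage FROM tblMessage
WHERE vcMessage LIKE @Msg
OPTION (RECOMPILE)
```

The trade-off is that the statement pays compilation cost on every execution, so this hint suits queries run infrequently or with widely varying values.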

Scenario 3:

Consider the following SQL statement:

Query 1:

SELECT vcMessage FROM tblMessage
WHERE ntSeverity * ntMessageID = 0

There are many business scenarios where we use more than one field and perform some arithmetic operation in the query predicate. This is called a cross relationship among fields. The above query is an example of a cross relationship between the fields ntSeverity and ntMessageID.

If there is a cross relationship among different fields of a table in a query predicate, the SQL Server query optimizer is not able to estimate correct cardinality using statistics.

To resolve this problem, SQL Server introduced computed columns. So we have to create a computed column in the table tblMessage. In this example we create another table, tblMessage_Dense, insert the same records, and create a non-clustered index:

CREATE TABLE tblMessage_Dense(
    ntMessageID BIGINT IDENTITY PRIMARY KEY,
    ntSeverity INT,
    vcMessage VARCHAR(500),
    dtDate DATETIME,
    ntDensity AS ntSeverity * ntMessageID
)

INSERT tblMessage_Dense(ntSeverity,vcMessage,dtDate)
SELECT
    Severity,
    LEFT ([TEXT], 500),
    DATEADD (YEAR,severity,GETDATE())   
FROM Sys.messages

CREATE NONCLUSTERED INDEX NC_Density
ON tblMessage_Dense(ntDensity)
INCLUDE (vcMessage)

And our new SQL query will be:

Query 2:

SELECT vcMessage FROM tblMessage_Dense
WHERE ntDensity = 0

OK, now compare the performance of both SQL queries:

Logical read (Query 1): 1490
Logical read (Query 2):  9

So, Query 1 needs approximately 165 times more logical reads than Query 2.

Now check the actual execution plan of both queries:

We can observe that Query 1 uses an index scan while Query 2 uses an index seek, and the overall cost of Query 1 is 99% while that of Query 2 is only 1%.

So we can say Query 2 is much better than Query 1.

Note: Good news, a non-persisted computed column is not physically stored in the table.
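If you ever do want the computed value stored physically, for example to index an expression the engine considers imprecise, you can mark the column PERSISTED. The column name ntDensity2 below is only illustrative:

```sql
-- PERSISTED stores the computed value in the row itself and keeps
-- it updated automatically whenever ntSeverity changes:
ALTER TABLE tblMessage_Dense
ADD ntDensity2 AS ntSeverity * ntMessageID PERSISTED
```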

Suppose in our application there is a table tblQueue:

CREATE TABLE tblQueue (
    ntQueueID BIGINT IDENTITY PRIMARY KEY,
    vcQueueName VARCHAR(500)
)

INSERT tblQueue
SELECT  '1_' +
    CAST(message_id AS VARCHAR) + '_' +
    CAST(severity AS VARCHAR) + '_' +
    CAST(language_id AS VARCHAR)  FROM Sys.messages

This application very frequently executes a SQL query which extracts the batch ID from the column vcQueueName:

Query 1:

SELECT * FROM tblQueue
WHERE CAST (REVERSE (LEFT (REVERSE (vcQueueName), CHARINDEX ('_', REVERSE (vcQueueName)) - 1)) AS INT) = 1031

In this situation, it would be a good idea to create a computed column BatchID in the table. In this example we create a new table, tblQueue_Batch:

CREATE TABLE tblQueue_Batch (
    ntQueueID BIGINT IDENTITY PRIMARY KEY,
    vcQueueName VARCHAR(500),
    ntBatchID AS CAST (REVERSE (LEFT (REVERSE (vcQueueName), CHARINDEX ('_', REVERSE (vcQueueName)) - 1)) AS INT)
)

INSERT tblQueue_Batch
SELECT '1_' +
    CAST (message_id AS VARCHAR) + '_' +
    CAST (severity AS VARCHAR) + '_' +
    CAST (language_id AS VARCHAR) FROM Sys.messages

CREATE NONCLUSTERED INDEX NC_Batch
ON tblQueue_Batch (ntBatchID)
INCLUDE (vcQueueName)

This SQL query is much more efficient than the previous one:

Query 2:

SELECT * FROM tblQueue_Batch  WHERE ntBatchID = 1031

Logical read (Query 1): 430
Logical read (Query 2): 44
Batch query cost (Query 1): 91%
Batch query cost (Query 2): 9%

Tips:

If there is a cross relationship among fields, or a complex expression on a field, in a query predicate, it is better to create a computed column and then create a non-clustered index on it.

Scenario 4:

In SQL Server, the query optimizer can estimate correct cardinality for a SQL statement that uses a variable in a query predicate only if that variable is a parameter of a stored procedure or function. For example:

We create two stored procedures:

CREATE PROC uspGetMsg_1 (
    @Severity AS INT
)
AS
BEGIN
    SELECT vcMessage FROM tblMessage
    WHERE ntSeverity = @Severity
END

Go

CREATE PROC uspGetMsg_2
AS
BEGIN
    SELECT vcMessage FROM tblMessage
    WHERE ntSeverity = 15
END

Query 1:

EXECUTE uspGetMsg_1 15

Query 2:

EXECUTE uspGetMsg_2

From a query performance point of view, both queries have exactly the same execution cost. It makes no difference whether we use a constant expression or pass it as a parameter.

Now we edit the stored procedure uspGetMsg_2:

ALTER PROC uspGetMsg_2 (
    @Severity AS INT
)
AS
BEGIN

    SET @Severity = @Severity - 1
    SELECT vcMessage FROM tblMessage
    WHERE ntSeverity = @Severity
END

Now consider the following two SQL statements:

Query 1:

EXECUTE uspGetMsg_1 15

Query 2:

EXECUTE uspGetMsg_2 16

If we check the execution plans of both queries:

The execution cost of Query 2 is 95% while Query 1 is only 5%.

1. Actual cardinality: 3102
2. Cardinality (Query 1): 3102
3. Cardinality (Query 2): 69839

So again, the optimizer is not able to estimate the correct cardinality for Query 2. If you check, you will find Query 2 uses the cardinality of the query:

SELECT vcMessage FROM tblMessage WHERE ntSeverity = 16

while we have updated the value of the variable @Severity to 15, so it should use the cardinality of the query:

SELECT vcMessage FROM tblMessage WHERE ntSeverity = 15

But it does not. So we can say the query optimizer estimates cardinality from statistics on the basis of the value passed to a parameter of a function or stored procedure, not the updated value of the parameter inside the function or stored procedure.

Tips:

If possible, don't update the value of a parameter of a function or stored procedure before using it in a SQL statement; instead, pass the updated value to the parameter.

Scenario 5:

In SQL Server, we can empty the plan cache by using the SQL statement:

DBCC FREEPROCCACHE

Note: Don't run this statement on a production server.
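On a busy server, a gentler option is to free a single cached plan instead of the whole cache: DBCC FREEPROCCACHE also accepts a plan handle, which can be looked up from the plan cache DMVs. The LIKE filter below is just an example:

```sql
-- Find the plan handle of the cached plan for uspGetMsg_1:
SELECT cp.plan_handle, st.[text]
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) st
WHERE st.[text] LIKE '%uspGetMsg_1%'

-- Then pass that handle to evict only this plan, e.g.:
-- DBCC FREEPROCCACHE (0x06000500...)  -- handle copied from the query above
```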

Consider the following two SQL statements:

Query 1:

EXECUTE uspGetMsg_1 16

Query 2:

EXECUTE uspGetMsg_1 15

If we check the execution plans we will find:

Query 1:

Actual cardinality: 69839
Estimated cardinality: 69839

Query 2:

Actual cardinality: 3102
Estimated cardinality: 69839

Now empty the proc cache:

DBCC FREEPROCCACHE

Now execute the same two queries, but in reverse order:

Query 1:

EXECUTE uspGetMsg_1 15

Query 2:

EXECUTE uspGetMsg_1 16

If we check the execution plans we will find:

Query 1:

Actual cardinality: 3102
Estimated cardinality: 3102

Query 2:

Actual cardinality: 69839
Estimated cardinality: 3102

Here we can observe that the estimated cardinality of Query 2 is equal to the estimated cardinality of Query 1. SQL Server builds the plan for a stored procedure or function on the basis of the parameter values passed the first time it is executed (known as parameter sniffing). From then on it uses the same cardinality estimate, as long as the statistics, query plan, etc. don't change. Due to a wrong cardinality estimate, the query optimizer can choose a wrong execution plan, which may decrease query performance.

We can optimize a SQL query for the value the application passes most often by using the query hint OPTIMIZE FOR. It overrides the default behavior, i.e., optimizing the query on the basis of the values passed the first time. For example:

CREATE PROC uspGetMsg(
    @Severity AS INT
)
AS
BEGIN
    SELECT vcMessage FROM tblMessage
    WHERE ntSeverity = @Severity
    OPTION (OPTIMIZE FOR (@Severity = 15))
END

Tips:

Use the OPTIMIZE FOR hint when you want to optimize a SQL query on the basis of a specific parameter value.
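When no single value is representative, SQL Server 2008 and later also offer OPTIMIZE FOR UNKNOWN, which tells the optimizer to estimate from average column density rather than the first sniffed value. A sketch, using the hypothetical procedure name uspGetMsgAny:

```sql
CREATE PROC uspGetMsgAny (
    @Severity AS INT
)
AS
BEGIN
    -- UNKNOWN makes the optimizer ignore the sniffed parameter value
    -- and estimate cardinality from overall column density instead:
    SELECT vcMessage FROM tblMessage
    WHERE ntSeverity = @Severity
    OPTION (OPTIMIZE FOR (@Severity UNKNOWN))
END
```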

Scenario 6:

There are many situations where we need to update the value of a parameter according to some condition inside a stored procedure or function. For example:

CREATE PROC uspGetMsg(
    @Severity AS INT
)
AS
BEGIN

    IF @Severity IS NULL
         SET @Severity = (SELECT MAX(ntSeverity) FROM tblMessage)
        
    SELECT vcMessage FROM tblMessage WHERE ntSeverity = @Severity
END

Consider a situation where our application executes the following SQL query the first time:

EXECUTE uspGetMsg NULL

The query optimizer will estimate the cardinality on the basis of:

SELECT vcMessage FROM tblMessage 
WHERE ntSeverity = NULL

As we know, NULL is never equal to any value; we use IS NULL to compare null values. So this query will not return any rows, and if a query doesn't return any rows, its estimated cardinality will be one.

Let's check the cardinality in the execution plan:

Actual cardinality: 110
Estimated cardinality: 1

It will affect the performance of this query drastically. The solution to this problem is to create two stored procedures. For example:

CREATE PROC uspGetMsg (
    @Severity AS INT
)
AS
BEGIN

    IF @Severity IS NULL
         SET @Severity = (SELECT MAX (ntSeverity) FROM tblMessage)
        
    EXECUTE uspGetMsgNotNull @Severity
END

Go

CREATE PROC uspGetMsgNotNull (
    @Severity AS INT
)
AS
BEGIN   
    SELECT vcMessage FROM tblMessage WHERE ntSeverity = @Severity
END

And from the application, use only the stored procedure uspGetMsg.

Tips:

If you want to update the value of a parameter of a stored procedure or function, create a similar procedure or function and execute it from the base procedure or function, passing the updated value as a parameter.

Scenario 7:

In SQL Server, the default settings for auto create statistics and auto update statistics are on (asynchronous statistics updates are off by default). This means the query optimizer automatically creates and updates statistics as required. More specifically, it auto-creates statistics on primary key columns when we create a table, on the key columns of clustered and non-clustered indexes when we create indexes, and on query predicate columns when we execute queries; it updates statistics when a threshold is reached after updates, deletions, or insertions of records.

Limitation: The query optimizer can auto-create only single-column statistics, and it cannot auto-create filtered statistics.

So there are many cases where we can improve the performance of SQL statements by creating user-defined statistics. For example:

SELECT vcMessage FROM tblMessage
WHERE ntMessageId < 1000
    AND ntSeverity IN (10, 11, 12, 16)

If we check the execution plan we will get:

Actual cardinality: 705
Estimated cardinality: 839.486

If we create the following filtered statistics:

CREATE STATISTICS Stat_Severity 
ON tblMessage(ntMessageId)
WHERE ntSeverity IN (10, 11, 12, 16)

Now the estimated cardinality of the previous query will be 705.671.

So now the query will perform better.

Tips:

Create user-defined multi-column or filtered statistics if query predicates involve more than one field of a table.
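Following the tip above, multi-column statistics on the two fields used together in the example predicate might look like this; the statistics names are illustrative:

```sql
-- Multi-column statistics on the two columns that appear together
-- in the WHERE clause of the example query:
CREATE STATISTICS Stat_MsgID_Severity
ON tblMessage (ntMessageId, ntSeverity)

-- A filtered variant can also be combined with multiple columns:
CREATE STATISTICS Stat_MsgID_Severity_Filtered
ON tblMessage (ntMessageId, ntSeverity)
WHERE ntMessageId < 1000
```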