Apr 18, 2014

Indexes best practices in Sql server: Where to start creating indexes?

How to create indexes? Where to start creating indexes? What should be considered or what shouldn't be. It is a very common question of beginners. You may get a situation where your lead or your mentor is telling you to create all the indexes on a table. Let's assume this table:

 CREATE TABLE tblEmployee(
     ntEmployeeID BIGINT PRIMARY KEY,
     vcEmpName VARCHAR(200),
     dtDoj DATETIME,
     btGender BIT,
     vcSkill VARCHAR(2000),
     moSalary MONEY,
     ntHirarchyID BIGINT
)

You may think we should create an index on a primary key column that is ntEmployeeId or you may also think on vcEmpName or ntHirarchyID. Let's assume a table has more than 150 columns, then on how many columns should we create indexes?

A very common misconception about the index: People think, by viewing the table design we should create indexes.

But the truth is, by analyzing only table design we cannot create a single index even not a primary index (Maybe table has a primary key but your application only performs insert operation on it).  Table design may help to some extent. Then the again same question, what should be considered?  The answer is very simple: "What queries are executed on the table."   In different words, queries which are executed on a table decides what indexes should be created on the table.

How to know what queries are executed on a table?

Go through your application code. Prepare a list of all inline queries. If your application is using stored procedures or database function, then get all the queries from there. Or you may take help of this script. This script may return all the queries which have executed on a table:

Script_1:

SELECT
     SUBSTRING(
          qt.TEXT,
          (qs.statement_start_offset/2)+1,
          ((CASE qs.statement_end_offset
          WHEN -THEN DATALENGTH(qt.TEXT)
          ELSE qs.statement_end_offset
          END - qs.statement_start_offset)/2)+1
     AS SqlQuery
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) qt
WHERE text LIKE N'%YourTableName%'
     AND text NOT LIKE '%sys.dm_exec_cached_plans%'

Script_2:

SELECT
[text] AS SqlQuery
FROM sys.dm_exec_cached_plans
CROSS APPLY sys.dm_exec_sql_text(plan_handle) AS st
WHERE text LIKE N'%YourTableName%'
     AND text NOT LIKE '%sys.dm_exec_cached_plans%'

Note: In above script replace "YourTableName" with actual table name on which you want to create indexes.

Now we got all the queries. What next? Now follow following index best practices rule.  

SQL Server index best practices:

1. Prepare a list of all the queries which will be executed on a table. If your list contains mostly INSERT, DELETE, UPDATE or MERGE statement, it means nature of your project is OLTP and try to avoid creating indexes. If it mostly contains a SELECT statement or has INSERT, DELETE, UPDATE or MERGE statement which performs the operation on a single row it means nature of your project is OLAP. And we should create indexes on a table. And then: Sort your query list on the basis of the frequency of execution and query execution time. 

2. All tables should have a primary index (clustered index) and should be created on a primary key column.
Scrip to get tables without clustered index:

SELECT
     OBJECT_NAME([OBJECT_ID]) AS TableName
FROM sys.indexes
WHERE index_id = 0
     AND OBJECTPROPERTY([OBJECT_ID],'IsUserTable') = 1

3. Create indexes on a table if the queries which fetch data from a table have at least following clauses:

Clause
Logical Processing Order
ON
1
WHERE
2
GROUP BY
3
HAVING
4
ORDER BY
5
DISTINCT
6
UNION
NA
EXCEPT
NA
INTERSECT
NA

Note: Here UNION means only UNION not UNION ALL

For example, let's assume following SQL queries are frequently executing which references the table tblEmployee where ntEmployeeID is the primary key of the table:

a.
Query:
SELECT TOP(5)* FROM tblEmployee

Discussion: This query doesn't perform filter, sorting or grouping so we should not consider this query while creating a new index.

b.
Query:
SELECT * FROM tblEmployee WHERE ntEmployeeID = 5

Discussion: This query performs a filter on the table tblEmployee. This filter is on the basis of column ntEmployeeID which is the primary key of a table and since the table has already clustered index on the primary table so we should not consider this query while creating a new index.

c.
Query:
SELECT * FROM tblEmployee FOR XML AUTO

Discussion: This query doesn't perform filtering, sorting or grouping, so we should not consider this query while creating a new index.

d.
Query: 
SELECT *
FROM tblEmployee E
INNER JOIN tblSkill S
ON S.ntEmployeeID = E.ntEmployeeID
WHERE S.vcSkill = 'Sql Server'

Discussion: This query performs filtering on the table tblEmployee. This filter is on the basis of column E.ntEmployeeID which is the primary key of a table and since the table has already clustered index on the primary table so we should not consider this query while creating a new index.

Note: We should consider this query while creating indexes on the tblSkill table.

e.
Query:

SELECT vcName,ntAge FROM tblEmployee

UNION ALL

SELECT vcName,ntAge FROM tblEmployee

Discussion: This query doesn't perform filtering, sorting or grouping, so we should not consider this query while creating a new index.

f.
Query:

SELECT vcName,ntAge FROM tblEmployee

UNION

SELECT vcName,ntAge FROM tblEmployee

Discussion: UNION clause always returns distinct values. To get the distinct value SQL server performs sorting on all the columns in the select list (in this example vcName and ntAge) if a select list has not any unique or primary key columns. This query should be considered while creating indexes.

g.
Query:
SELECT * FROM tblEmployee
GROUP BY vcEmpName

Discussion: This query performs grouping operation so the query should be considered while creating indexes.

4.  Create an index on the fields in which data are filtered only using following operators in the WHERE, ON and HAVING clause:

Operator
Type
=
Equality
Range
Range
<=
Range
>=
Range
IN
Range
BETWEEN
Range
LIKE
Range

Note: LIKE is allowed only in case of prefix matching. For example: LIKE 'abc%'  
For examples:

a.
Query:
SELECT vcName,ntAge
FROM tblEmployee
WHERE vcName = 'Greg'

Discussion: Field has filtered by equality operator. This query should be considered while creating indexes.

b.
Query:
SELECT vcName,ntAge
FROM tblEmployee
WHERE vcName LIKE 'Greg%'

Discussion: Field has filtered by prefix LIKE operator. This query should be considered while creating indexes.

c.
Query:
SELECT vcName,ntAge
FROM tblEmployee
WHERE vcName LIKE '%Greg%'

Discussion: Field has filtered by contains LIKE operator. The only prefix LIKE operator is allowed not a suffix or contains LIKE operator. This query should not be considered while creating indexes.

d.
Query:
SELECT vcName,ntAge
FROM tblEmployee
WHERE ntAge > 30

Discussion: Field has filtered by greater than an operator. This query should be considered while creating indexes.

e.
Query:
SELECT vcName,ntAge
FROM tblEmployee
WHERE ntAge <> 30

Discussion: Field has filtered by, not equal to operator. This query should not be considered while creating indexes.

3.  Don't create indexes on the columns which are an expression in the query. For examples:

a.
Query:
SELECT vcName
FROM tblStudent
WHERE LEFT(vcName,4)  = 'Greg'

Discussion: In this query LEFT(vcName,4)  is an expression. This query should not be considered while creating indexes.

b.
Query:
SELECT vcName
FROM tblStudent
WHERE vcFirstName + ' ' + vcLastName  = 'Alain Nisam'

Discussion: In this query vcFirstName + ' ' + vcLastName    is expression. This query should not be considered while creating indexes.

c.
Query:
SELECT vcName
FROM tblStudent
WHERE vcName = 'Greg'

Discussion: In this filtered condition on vcName column, which is not an expression. This query should be considered while creating indexes.

d.
Query:
SELECT * FROM tblStudent
ORDER BY LEFT(vcName,4)

Discussion: In this query LEFT(vcName,4)  is an expression. This query should not be considered while creating indexes.

4.  Try to avoid creating single column indexes if the multi-column index is possible. For examples:

SELECT *
FROM tblEmployee
WHERE a = 5 AND b = 10

In the above query don't create two indexes like this:

CREATE INDEX NCI_a ON tblEmployee(a)
CREATE INDEX NCI_b ON tblEmployee(b)
  
Instead of this create a multi column index:

CREATE INDEX NCI_a_b ON tblEmployee(a,b)

5. Order of column name in an index matters.   While creating an index, keep the column name in the best order. For examples:

Index_1:
CREATE INDEX NCI_a_b ON tblEmployee(a,b)

Index_2:
CREATE INDEX NCI_b_a ON tblEmployee(b,a)

Here index NCI_a_b and NCI_b_a are two totally different indexes. 

5. If you are creating multiple columns index, then keep all the columns or some columns with equality operator first, then only one column with range operators at the end. (See table in point 4).  If there are not any columns with range operator then only we should include columns in the GROUP BY Clause or ORDER BY clause.  For examples:

Query 1:
SELECT *
FROM  tblEmployee 
WHERE b = 5 AND c > 10 AND d = 15 AND e <= 20

Discussion:
Here columns with equality operator are b and d. So the order of columns b and d should be first. We can include both columns b and d or only one column, either b or d
Here columns with range operators are c and e. But we should include the only column in range operator. So at the end, we should include either c or e not both. In this way there can be eight possible indexes:

CREATE INDEX NCI_b_d_c ON tblEmployee(b,d,c)
CREATE INDEX NCI_d_b_c ON tblEmployee(d,b,c)

CREATE INDEX NCI_b_c ON tblEmployee(b,c)
CREATE INDEX NCI_d_c ON tblEmployee(d,c)

CREATE INDEX NCI_b_d_e ON tblEmployee(b,d,e)
CREATE INDEX NCI_d_b_e ON tblEmployee(d,b,e)
CREATE INDEX NCI_b_e ON tblEmployee(b,e)
CREATE INDEX NCI_d_e ON tblEmployee(d,e)
Following are not good indexes:

CREATE INDEX NCI_c_b_a ON tblEmployee(c,b,d) 
CREATE INDEX NCI_b_d_c_e ON tblEmployee(b,d,c,e) 

In the later points, we will discuss out of above eight indexes which one is best.

Query 2:
SELECT * FROM  tblEmployee 
WHERE b = 5 AND e < 10 AND f = 15
ORDER BY d,c

Discussion:
Here columns with equality operator are b and f. So the order of columns b and f should be first. We can include both columns b and f or only one column, either b or f
Here there is one column with range operator, that is column e. So at the end, we should include column e. 
To keep the data is sorted order we should include column d and c but we should not include any column after the column with range operator. So we have two options:

CREATE INDEX NCI_b_f_e ON tblEmployee(b,f,e)

Or

CREATE INDEX NCI_b_f_e ON tblEmployee(b,f,d,c)

6. If we are creating a multi-column index keep columns first according to the logical order of processing of different clauses (See table point 3). For example:

Query:
SELECT *
FROM tblFirst F INNER JOIN tblSecond S
ON F.= S.x
WHERE  F.= 10 AND S.= 50
GROUP BY F.c
ORDER BY F.d

Discussion:
Let's assume data has been fetched first from table tblSecond then from table tblFirst (Why? We will discuss it later)

Logical order of execution of different clause in the table tblSecond:
a. WHERE
Hence, the index should be:

CREATE INDEX NCI_y ON tblSecond(y)

Logical order of execution of different clause in the table tblFirst:
a. ON
b. WHERE
c. GROUP BY
d. ORDER BY
Hence, the index should be:

CREATE INDEX NCI_a_b_c_d ON tblFirst(a,b,c,d)

7. If there are two columns with the equality operators then column which most number of distinct values should include first.   For Examples:

Query:
SELECT *
FROM tblEmployee
WHERE vcSkills = 'Sql Server'
     AND ntLevel = 10
     AND moSalary = 50000

Discussion: Here, there are three columns with equality operators (vcSkills, ntLevel, moSalary). As we know the order of column in an index matters, so how to decide columns order while creating an index? For this we will have to find out total distinct value for each column in the table:

SELECT
     COUNT(DISTINCT vcSkills) AS vcSkills,
     COUNT(DISTINCT ntLevel) AS ntLevel,
     COUNT(DISTINCT moSalary) AS moSalary
FROM tblEmployee

Let's assume we get an output of above query something like this:

vcSkills
ntLevel
moSalary
50000
8000000
3000

It is obvious most number of distinct values is for column ntLevel. So column ntLevel should be included first.

To decide the total distinct value of vcSkills and moSalary where ntLevel = 10 we have to execute following script:

SELECT
     COUNT(DISTINCT vcSkills) AS vcSkills,
     COUNT(DISTINCT moSalary) AS moSalary
FROM tblEmployee
WHERE ntLevel = 10

Let's assume we get an output of above query something like this:

vcSkills
moSalary
60
2000

It is obvious that column moSalary has the most number of distinct. So column moSalary should be included after the column ntLevel. 

So we should create index something like this:

CREATE INDEX NCI_ntLevel_moSalary_vcSkills
ON tblEmployee(ntLevel,moSalary,vcSkills)

8. Don't create useless and low cardinality indexes.
As we know indexes are bad for insert, delete, update and merge queries. So we should create indexes if it is really necessary. We should avoid creating indexes of those columns which have very fewer numbers of distinct values. For example columns of data type BIT. It can have maximum two values so columns of data type bit should not be included as a key column of the index.

Query:
SELECT *
FROM tblEmployee
WHERE vcSkills = 'C'
     AND ntDesignation = 4

Discussion:
Finding the total distinct values of columns vcSkills and ntDesignation

SELECT
     COUNT(DISTINCT vcSkills) AS vcSkills,
     COUNT(DISTINCT ntDesignation) AS ntDesignation
FROM tblEmployee

Let's assume we get an output of above query something like this:

vcSkills
ntDesignation
80000
6
    
Here cardinality of a column is very less so we should not include ntDesignation as a key column of the index. So index should be:

CREATE INDEX NCI_vcSkills ON tblEmployee(vcSkills)

9. Don't create duplicate indexes. Examples of duplicate index:

Index_1:
CREATE INDEX NCI_a ON tblEmployee(a)

Index_2:
CREATE INDEX NCI_a_b ON tblEmployee(a,b)

Index_3:
CREATE INDEX NCI_b_a ON tblEmployee(b,a)

Here Index_1 is a duplicate of index_2 since it is a subset. So we should drop the index_1. But Index_2 and Index_3 are not duplicate indexes, since the order of key column matters.

10.  A query should be covered by an index. SQL server has introduced a special clause named INCLUDE. In this column include all those columns which are not key columns of an index and it used by columns. For examples:

Query:
SELECT a,b
FROM tblEmployee
WHERE c = 10 AND d = 15 AND e <> 0
ORDER BY f

Let assume key columns of the index is c and d. So rests columns are a,b,e and f. So index should be:

CREATE INDEX NCI_c_d ON tblEmployee(c,d)
INCLUDE(a,b,e,f)

2 comments: