Jan 6, 2012

What is clustered index in sql server with examples

What is clustered index in SQL server?

I already told you, when we create a table without any indexes on it is called heap. So we can say, in the previous post, table Student is a heap. We have faced problem to search the details of a student for RollNo = 111 in a heap. In a heap, sql server has to scan all the records to find out a student.  Possible solutions to this problem are:

Suggestion 1: We know roll no of two students cannot be same, why not tell to SQL server, roll number is a unique column?

Suggestion 2: Why not data of heap tblStudent is stored in sorted order on the column RollNo?

We can achieve the suggestion 1 by making column RollNo as a primary key or unique key.

According to suggestion number 2 data of the table tblStudent (more accurately heap) should be stored in the sorted order of RollNo. This new organization of heap tblStudent is called clustered index. So in case of a clustered index, SQL server doesn't create any new objects. Hence clustered index is a table in which is data is sorted on one or more primary or unique key columns.

After changing from tblStudent (Heap) to tblStudent (Clustered index):

tblStudent (Heap)
RollNo
Name
Country
Age
101
Greg
UK
23
112
Sachin
India
21
109
Akaram
Pakistan
22
107
Miyabi
China
18
108
Marry
Russia
27
103
Scott
USA
31
110
Benazir
Banglades
17
111
Miyabi
Japan
24
102
Rahul
India
27
113
Nicolus
France
19

tblStudent (Clustered index):
RollNo
Name
Country
Age
101
Greg
UK
23
102
Rahul
India
27
103
Scott
USA
31
107
Miyabi
China
18
108
Marry
Russia
27
109
Akaram
Pakistan
22
110
Benazir
Banglades
17
111
Miyabi
Japan
24
112
Sachin
India
21
113
Nicolus
France
19

FYQ about clustered index:

Q. Why a table can have only one clustered index?
Answer: Table itself is converted from heap to clustered index. So how a table can be two? Any object can have only one physical organization.

Q. Is it necessary to create a clustered index on primary key columns?

Answer: Not, it is not necessary but it is recommended. We can create clustered index any columns of a table.  

Understanding clustered index in depth?

When we create a clustered index on any table, physical organization of data in the table is changed. Data in the table are stored as binary search tree(B tree). For example, if we will create the clustered index named CL_Student_Roll with RollNo column of Student table as a key column of the clustered index, physical organization of data in the table will look something like this :



Description of above image:

It is B Tree representation where the top node is called root node and bottom nodes are called leaf nodes. All the nodes in between are called intermediate nodes. B Tree can have many intermediate nodes if a table has too many records. In this example, there are no intermediate nodes and we have assumed the size of each node is five i.e. each node can keep maximum five roll numbers. In clustered index leaf nodes keep the memory address of actual (physical) data row.

Now if we will execute following SQL query:

SELECT * FROM Student WHERE RollNo = 111

Since Student table has clustered index so now it will go for index scan instead of table scan to search the RollNo = 111. A scan will start from the root node in the clustered index CL_Student_Roll .

Scanning steps:

Scan 1: 107 < 111 so it will go to next element of a root node.
Scan 2: 110  < 111 so it will go to the next element of a root node. Since there is not any other element in root node so it will follow right most link or path.
Scan 3: In leaf node, 110 < 111 o it will go to next element of a leaf node.
Scan 4: In leaf node, 111 = 111 so it will follow this path and get the actual data.

Using clustered index we can access the actual data only in 4 scanning while using table scan we had to follow 8 scanning. If a table has too many rows then numbers of the scan will decrease drastically as compared to a table scan.

Time complexity of index scan is: O(log n)

If we observe estimated execution plan, it will look like:



Now table scan has changed to index seek. So it will scan using clustered index CL_Student_Roll

Simple syntax for creating clustered index:

CREATE UNIQUE [CLUSTERED] INDEX <IndexName>
ON <ObjectName>(
<ColumnName>  [ASC | DESC ] [ ,...n ]
)

Explanation of each term:

UNIQUE: It is optional. We can specify UNIQUE only when there is no duplicate data in key columns.

<IndexName>: It can be any valid name of the index.

<ObjectName>: It is the name of an object on which we are creating an index. It can be either table name or view name.

<ColumnName>: It is the name of a key column. We can specify more than one key column which is called composite key.

[ASC | DESC]: To specify the order of key column in ascending or descending order.

Some important points about clustered index:

1. If we create a table with primary key, SQL server automatically creates clustered index on that table. No need to create clustered index manually.

2. A table can have only one clustered index.  Hence the physical organization of data in a table is same as a clustered index and any table can have only one physical organization of data.

3. If a clustered index has more than one partition then each partition has a B-tree.

Creating a simple clustered index on student table:

CREATE UNIQUE CLUSTERED INDEX CL_Student_Roll
ON Student(RollNo)

Terms related to estimated execution plan:

Table scan: It is very slow and is used only if a table has no clustered index.

Index scan: It is also slow. It is used when a table has clustered index and either when non-key columns are present in WHERE clause or when a query has not been covered (will discuss later) or both.

Index Seek: It is very fast. Our goal is to achieve this.
If we will keep the mouse on an icon in the execution plan (GUI) a detailed screen will appear like this:


Terms related to detail section:

Predicate:

It is a condition in WHERE clause which is either non-key column or column which has not been covered.
Object: It is a name of the source from where it is getting data. It can be a name of a table, Clustered index or non-clustered index.

Output list: It is a list of names of the columns which it is getting from an object.

Seek Predicate: It is a condition in WHERE clause which is either key column or fully covered.

Our goal is to move the conditions to the predicate section to seek predicate.

Consider the sql query:

SELECT RollNo  FROM Student WHERE Age = 24

If we observe estimated execution plan it will look like:


Predicate is:

[DbName].[dbo].[Student].[Age] = CONVERT _IMPLICIT ( int, [@1], 0)

Object or source is Cl_Student_Roll that is clustered index we have created.

So it is obvious that it is scanning using a clustered index and hence it is index scan. So it is less efficient since searching is being done using the column age which is a non-key column of clustered index CL_Student_Roll. Its key column is RollNo. The time complexity of index scan is same as table scan i.e. O (n). As we know table can have only one clustered index, to make such type of query more efficient SQL server has introduced non-clustered index.

Important points about clustered index:

1. In a create table statement we can specify CLUSTERED and NONCLUSTERED keyword on only those columns which have either PRIMARY KEY or UNIQUE constraints. For example:

CREATE TABLE Emp(
    ID INT PRIMARY KEY CLUSTERED,
    RegNo INT UNIQUE NONCLUSTERED
)

A default value of PRIMARY KEY constraints is CLUSTERED if a table has no existing clustered index or if we have not specified CLUSTERED index on any columns of a UNIQUE constraint. Default value of UNIQUE constraints is NONCLUSTERED

2. Creating a primary key on a table also creates a unique non-clustered index on a table if a table has any existing clustered index.

3. A table or view can have only one clustered index.

4. The physical order of rows of a table is same as the logical order of key columns of the clustered index.

5. Leaf nodes of clustered index keep actual data rows.

6. When we create a clustered index on a table which has existing non-clustered indexes, all non-clustered indexes are rebuilt after the creation of a clustered index. In other words, we can say initially non-clustered indexes which are based on the heap are converted into a clustered index(We will discuss heap in a non-clustered index).  


3 comments:

  1. If i have one table ,i have to use one from clustered or non clustered index to which i should prefered for better performance :

    I am thinking about Clustered index since no nclustered index uses extra key look up & extra storage space

    Please mentioned ur review about above scenario.

    Thanks to all.....

    ReplyDelete
    Replies
    1. First of all it is bad to create non -clustered on primary key (Clustered index key) field. Yes you are correct. Clustered index is always better than non-clustered index if we are filtering or sorting records on the basis of primary key column.

      Delete