Feb 10, 2013

Write sql queries in set based approach sql server : Without cursors and loops

Let assume you have a vehicle whose capacity is to bring 1 million balls from one place to other place.
Your task to transfer 100 million balls from USA to UK via India where you have to color the balls. A person accepted this job. He picks up one ball from USA move to India for coloring then finally UK by using his vehicle. He is doing this job from last 2 years. You may laugh on the person, did he gone mad? Why he is picking up only one ball while capacity of his vehicle 1 million balls. He can do it 10 attempts. 
You will not believe same thing is done by many developers. After sometimes they faces huge processing time, insufficient memory, IO delays,    table lock, deadlock etc. Sometimes they say proudly says it is happening  due our application deals with large value data which looks same as when the person will say I’m dealing with large volume balls!!

Now I am coming to the point. Set based approach means data is not processed row by rows by using some kinds of loops, cursor etc.   Let me explain it by very simple example:    Let’s assume there is a requirement of an application is to add prefix “0” in mobile number of each users if it is not already there.

Approach one:
--Using while loop
DECLARE @ntUserID AS BIGINT
DECLARE @ntTempUserID AS BIGINT = 0
DECLARE @vcMobileNumber AS VARCHAR(15)

WHILE @ntTempUserID IS NOT NULL BEGIN
    
     SELECT TOP(1)
          @ntUserID = ntUserID, @vcMobileNumber = vcMobileNumber
     FROM tblUser
     WHERE ntUserID > @ntTempUserID
     ORDER BY ntUserID
    
     IF @vcMobileNumber IS NOT NULL
AND RIGHT(@vcMobileNumber,1) <> '0' BEGIN

          UPDATE tblUser SET vcMobileNumber = '0' + @vcMobileNumber
          WHERE ntUserID = @ntUserID
     END

     SET @ntTempUserID = @ntUserID
     SET @ntUserID = NULL
    
END

Approach two:
--Using Cursor
DECLARE @ntUserID AS BIGINT
DECLARE @vcMobileNumber AS VARCHAR(15)

DECLARE curUsers CURSOR
FOR SELECT ntUserID,vcMobileNumber FROM tblUser

OPEN curUsers
FETCH NEXT FROM curUsers INTO @ntUserID,@vcMobileNumber

WHILE @@FETCH_STATUS = 0 BEGIN
    
     IF @vcMobileNumber IS NOT NULL
AND RIGHT(@vcMobileNumber,1) <> '0' BEGIN

          UPDATE tblUser
SET vcMobileNumber = '0' + @vcMobileNumber
WHERE ntUserID = @ntUserID
     END

     FETCH NEXT FROM curUsers INTO @ntUserID,@vcMobileNumber
    
END

CLOSE curUsers
DEALLOCATE curUsers

Approach one and two both took around 20 to 25 seconds to update only 50k records.  Both methods are called row-wise approach. Since, it updates one row at time. Below script do same thing but script has written in set based way:

UPDATE tblUser
     SET vcMobileNumber = '0' + vcMobileNumber
WHERE vcMobileNumber IS NOT NULL
     AND RIGHT(vcMobileNumber,1) <> '0'

It took around 1 second to performance task. I am sharing a read world example, a database developer came to me he told me their application load XML data using a stored procedure to SQL Azure database. It is taking much time to load large XML data. I looked into the stored procedure. It was doing some complex transformation before inserting to destination tables. Script had written row wise methods. I only rewrite same script in set based way. It tested and took 45 seconds to load sample data and sent back to him. Next day he also tested and told now me it is working fine and taking 60 seconds. Then I told in your client machine it may take little less time since in my PC it was taking only 45 seconds. Then he replied No Ritesh, It is not about 45 seconds or 60 seconds, initially it was taking around 16 hours!! Now our client is going to be mad.

Now you can understates how much worst to write a transact SQL script in row-based way. It is very simple to recommend write queries in set based way. Avoid using cursor or loop over database objects like tables, views, procedures etc. But reality is very different. It is very easy to think or implement a task using loop or If-else statements. To implement same task without using loops and if-else is little bit difficult. We will have to think a lot of and sometimes it not possible   and will have to uses loops and cursor. Most of the business requirement is possible in set based way. You will get very rare situation when task can be done only in row based way. 

Any sql task is using single SELECT, UPDATE, DELETE, INSERT, MARGE etc. to accomplish it called set based approach. I am explaining it a very simple example. Let's assume there is an organization of 100k employees.    Our task is if salary of an employee is 5000 then triple it if it 10000 then double it. So we are writing transact update query for this:

DECLARE @ntEmployeeID AS INT = 0
DECLARE @moSalary AS MONEY

WHILE EXISTS (
          SELECT TOP (1) *
          FROM tblEmployee
          WHERE ntEmpID > @ntEmployeeID
          )
BEGIN
     SELECT TOP (1) @ntEmployeeID = ntEmpID
          ,@moSalary = moSalary
     FROM tblEmployee
     WHERE ntEmpID <> @ntEmployeeID
     ORDER BY ntEmpID

     IF @moSalary = 5000
          UPDATE tblEmployee
          SET moSalary = 3 * @moSalary
          WHERE ntEmpID = @ntEmployeeID
     ELSE IF @moSalary = 10000
          UPDATE tblEmployee
          SET moSalary = 2 * @moSalary
          WHERE ntEmpID = @ntEmployeeID

     SET @ntEmployeeID += 1

END

We can do same thing using cursor too. Now we are writing same transact query of same task by using set based approach:

UPDATE tblEmployee
SET moSalary = CASE
          WHEN moSalary = 5000
              THEN 3 * moSalary
          WHEN moSalary = 10000
              THEN 2 * moSalary
          END


What is difference between above two methods?

By using the while loop or cursor we are referencing the table tblEmployee 100k times. This query is very much equivalent to write:

UPDATE tblEmployee
SET moSalary = 3 * @moSalary
WHERE ntEmpID = 1

UPDATE tblEmployee
SET moSalary = 3 * @moSalary
WHERE ntEmpID = 2

UPDATE tblEmployee
SET moSalary = 2 * @moSalary
WHERE ntEmpID = 3

UPDATE tblEmployee
SET moSalary = 2 * @moSalary
WHERE ntEmpID = 4


.............................................
..............................................

UPDATE tblEmployee
SET moSalary = 3 * @moSalary
WHERE ntEmpID = 100 k - 1

UPDATE tblEmployee
SET moSalary = 2 * @moSalary
WHERE ntEmpID = 100 k



So each update is a different update statement. So sql server may have to parse and create the parse tree around 100k times, check its execution plan in the cached plan around 100k times and so on. While using the set based approach sql has not such overhead. It is single update statement. Exaction cost by using approach two is very costly compared to set based approach. If table has too many records we may get memory error, locking error, or may take 5 to 10 hours (while using set based approach it is possible withing 1 to 5 minutes) if we will  not follow set based approach.We can do same thing using cursor too. Now we are writing same transact query of same task by using set based approach:

UPDATE tblEmployee
SET moSalary = CASE
          WHEN moSalary = 5000
              THEN 3 * moSalary
          WHEN moSalary = 10000
              THEN 2 * moSalary
          END


What is difference between above two methods?

By using the while loop or cursor we are referencing the table tblEmployee 100k times. This query is very much equivalent to write:

UPDATE tblEmployee
SET moSalary = 3 * @moSalary
WHERE ntEmpID = 1

UPDATE tblEmployee
SET moSalary = 3 * @moSalary
WHERE ntEmpID = 2

UPDATE tblEmployee
SET moSalary = 2 * @moSalary
WHERE ntEmpID = 3

UPDATE tblEmployee
SET moSalary = 2 * @moSalary
WHERE ntEmpID = 4


.............................................
..............................................

UPDATE tblEmployee
SET moSalary = 3 * @moSalary
WHERE ntEmpID = 100 k - 1

UPDATE tblEmployee
SET moSalary = 2 * @moSalary
WHERE ntEmpID = 100 k


So each update is a different update statement. So sql server may have to parse and create the parse tree around 100k times, check its execution plan in the cached plan around 100k times and so on. While using the set based approach sql has not such overhead. It is single update statement. Exaction cost by using approach two is very costly compared to set based approach. If table has too many records we may get memory error, locking error, or may take 5 to 10 hours (while using set based approach it is possible withing 1 to 5 minutes) if we will  not follow set based approach.

How to write queries in set based way:

If there is need to use WHILE LOOPS or CURSOR over database object due to implement some IF-ELSE logic then remove the loop and replace IF-ELSE statement by equivalent CASE statement. For example:

We want to update the point of users according to their different activates:

DECLARE @ntUserID AS BIGINT
DECLARE @vcMobileNumber AS VARCHAR(15)
DECLARE @ntAge AS INT
DECLARE @dtCreateDate AS DATETIME
DECLARE @ntUserPoint AS SMALLINT = 0
DECLARE @TotalFriends AS SMALLINT
DECLARE @TotalPost AS SMALLINT

DECLARE curUsers CURSOR
FOR
SELECT ntUserID
     ,vcMobileNumber
     ,ntAge
     ,dtCreateDate
FROM tblUser

OPEN curUsers

FETCH NEXT
FROM curUsers
INTO @ntUserID
     ,@vcMobileNumber
     ,@ntAge
     ,@dtCreateDate

WHILE @@FETCH_STATUS = 0
BEGIN
     --Point due to age
     IF @ntAge < 15
          SET @ntUserPoint += 5
     ELSE IF @ntAge BETWEEN 15
              AND 30
          SET @ntUserPoint += 15
     ELSE
          SET @ntUserPoint += 10

     --Point due joining date    
     SET @ntUserPoint = DATEDIFF(MONTH, @dtCreateDate, GETDATE())

     --Point Due to total friends
     SELECT @TotalFriends = COUNT(*)
     FROM tblUserContact
     WHERE ntUserID = @ntUserID

     IF @TotalFriends < 50
          SET @ntUserPoint += 10
     ELSE IF @TotalFriends BETWEEN 50
              AND 100
          SET @ntUserPoint += 20
     ELSE
          SET @ntUserPoint += 30

     --Point Due to total posts
     SELECT @TotalPost = COUNT(*)
     FROM tblPost
     WHERE ntUserID = @ntUserID

     IF @TotalPost BETWEEN 5
              AND 20
          SET @ntUserPoint += 50
     ELSE
          SET @ntUserPoint += 100

     UPDATE tblUser
     SET ntUserPoint = @ntUserPoint
     WHERE ntUserID = @ntUserID

     FETCH NEXT
     FROM curUsers
     INTO @ntUserID
          ,@vcMobileNumber
          ,@ntAge
          ,@dtCreateDate
END

CLOSE curUsers
DEALLOCATE curUsers



It is taking around 35 minutes to execute in my system. Now I’m going to write equivalent script in set based way that removing loop and replacing IF-ELSE by equivalent CASE statement:

UPDATE tblUser
SET ntUserPoint = CASE
          WHEN ntAge < 15
              THEN 5
          WHEN ntAge BETWEEN 15
                   AND 30
              THEN 15
          ELSE 10
          END + DATEDIFF(MONTH, dtCreateDate, GETDATE()) + CASE
          WHEN (
                   SELECT COUNT(*)
                   FROM tblUserContact
                   WHERE ntUserID = tblUser.ntUserID
                   ) < 50
              THEN 10
          WHEN (
                   SELECT COUNT(*)
                   FROM tblUserContact
                   WHERE ntUserID = tblUser.ntUserID
                   ) BETWEEN 50
                   AND 100
              THEN 20
          ELSE 30
          END + CASE
          WHEN (
                   SELECT COUNT(*)
                   FROM tblPost
                   WHERE ntUserID = tblUser.ntUserID
                   ) BETWEEN 5
                   AND 20
              THEN 50
          ELSE 100
          END


Script in set based way took only 2 seconds compare to 35 in rows based way.

Query optimization suggestion:

1. Don't use while loop to perform SELECT, UPDATE, DELETE, INSERT, MARGE etc. operation if it possible using set based approach.

2. Don't use sql cursor to perform SELECT, UPDATE, DELETE, INSERT, MARGE etc. operation if it possible using set based approach.

How to know any task is possible without using loop or sql cursor:

It is most challenging task. It is possible by your depth of knowledge of transact sql statements and your expertise level which will make you different from others.  I am assigning few tasks and try to do it by using set based approach. At the end I am also providing the solution but first try to it by yourself which will improve your level.

Write transact sql queries using set based approaches:

1. Suppose there an employee table which has at least two columns vcFirstName and vcLastName. Task is we have to swap the value of vcFirstName and vcLastName with each others.

2. Suppose we have two identical tables say tblSource and tblDestination with different records. Both tables have a column ntID. Our task is to write a transact sql query using set based approach which satisfies:
a. If ntId is present in tblSource but not in tblDestination then insert in tblDestination
b. If ntId is not present in tblSource but it is in tblDestination then delete from tblDestination
c. if ntId is present in both tables then updated the rest columns from tblSource to tblDestination.

3.
ntId
vcData
1
10,20,30
2
20,40

Write a transact sql statement which output is like:

ntId
vcData
1
10
1
20
1
30
2
20
2
40

4. Let's assume we have two identical tables say tblSource and tblTarget. Both tables have at least three columns ntId, vcName, vcPhone and ntUpdateOrder. We have to update the table tblTarget from tblSource following columns on basis of ntId which satisfy:
a. if tblSource.vcName is NULL then keep the old value in tblTarget.vcName otherwise update it by new value.
b.   Update tblTarget.vcPhone by tblSource.vcPhone also each vcPhone must have prefix '0' if not the add it.
c. tblTarget.ntUpdateOrder is incremental integer number from 1 to n order which rows of table tblTarget has been updated.

5. We want to delete the 100 older records from table tblTraget on basis of dtCreated and keep the all deleted records in other archive table tblArchive with date of deletion. Column tblTraget.ntId   is primary key.

6. Let's assume there is table tblEmployee which has at least column vcName and ntAge. Write a transact sql query which will return the vcName and ntAge of those employees which has maximum. If there is more than one such employee then return all. You have to reference the table tblEmployee only at once.

7. Write a transact sql query which will accept two parameters @MinRange and @MaxRange. Value of parameters can from 0 to 1000. Query will return a result set of integer numbers. For examples:
If @MinRange = 2 and @MaxRange = 5 then output will be:

ntIndex 
2
3
4
5


8. Suppose we have to insert the data from tblSource tblImage. Table tblSouce and tblImage has columns ntCategoryId and vcImagePath also tblImage has one more column ntSortOrder. You have to copy the data from tblSource to tblImage in such a way that ntSortOrder has value from 1 to n in each ntCategoryId. For example:

tblSource:
ntCategoryId
vcImagePath
1
I:\imagea_1.jpg
1
I:\imageb_1.jpg
1
I:\imagec_1.jpg
2
I:\imagea_2.jpg
2
I:\imagea_2.jpg

tblImage 
ntCategoryId
vcImagePath
ntSortOrder
1
I:\imagea_1.jpg
1
1
I:\imageb_1.jpg
2
1
I:\imagec_1.jpg
3
2
I:\imagea_2.jpg
1
2
I:\imagea_2.jpg
2

One of the solutions:
  
1.
UPDATE tblEmployee
SET vcFirst_Name = vcLastName
     ,vcLastName = vcFirst_Name

2.
MERGE tblDestination AS D
USING tblSource AS S
     ON D.ntID = S.ntID
WHEN MATCHED
     THEN
          UPDATE
          SET vcName = S.vcName
WHEN NOT MATCHED BY SOURCE
     THEN
          DELETE
WHEN NOT MATCHED BY TARGET
     THEN
          INSERT (vcName)
          VALUES (S.vcName);

3.
SELECT ntId
     ,items
FROM tblData
CROSS APPLY dbo.split(vcData, ',')


Note: Here dbo.split has tabled value function split function. You can easily get it from web.

4.
DECLARE @ntUpdateOrder AS BIGINT = 0

UPDATE T
SET @ntUpdateOrder += 1
     ,vcName = CASE
          WHEN S.vcName IS NULL
              THEN T.vcName
          ELSE S.vcName
          END
     ,vcPhone = CASE
          WHEN LEFT(S.vcPhone, 1) <> '0'
              THEN '0' + S.vcPhone
          ELSE S.vcPhone
          END
     ,ntUpdateOrder = @ntUpdateOrder
FROM tblTarget T
INNER JOIN tblSource S ON T.ntId = S.ntId


5.
DELETE
FROM tblTarget
OUTPUT DELETED.*
     ,GETDATE()
INTO tblArchive
WHERE ntID IN (
          SELECT TOP (100) ntID
          FROM tblTarget
          ORDER BY dtCreated
          )

6.

SELECT TOP (1)
WITH TIES vcName
     ,ntAge
FROM tblEmployee
ORDER BY ntAge DESC

7. It is a tabled value function:
CREATE FUNCTION udfSequenceTable (
     @MinRange INTEGER = 0
     ,@MaxRange INTEGER
     )
RETURNS @SequenceTable TABLE (ntIndex INTEGER)
AS
BEGIN
          ;

     WITH cteSequenceTable
     AS (
          SELECT ntIndex
          FROM (
              SELECT @MinRange AS ntIndex
              ) ST
         
          UNION ALL
         
          SELECT ntIndex + 1
          FROM (
              SELECT 1 AS ntTemp
              ) ST
          INNER JOIN cteSequenceTable ON 1 = 1
          WHERE ntIndex < @MaxRange
          )
     INSERT INTO @SequenceTable
     SELECT ntIndex
     FROM cteSequenceTable
     OPTION (MAXRECURSION 32767)

     RETURN
END


8.
INSERT tblTarget
SELECT *
     ,ROW_NUMBER() OVER (
          PARTITION BY ntCategory ORDER BY ntCategory
          )
FROM tblSource

  
If you have any task which you think it can be possible by set based approach but you are not able write, you can ask here.

2 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. Very Informative. Thanks for the post!

    ReplyDelete