
Jul 8, 2016

Complex or multiple table JOIN best practices (SQL Server performance tuning)

How do we optimize a JOIN when there are many tables in the JOIN statement in SQL Server? How can we decrease the number of tables in a JOIN?

Jul 6, 2016

Efficient / fast way to populate a large volume of data into a table (C#, SQL Server)

Sometimes there is a need to populate/insert a large volume of data from application code (C#, Java, VB.NET) into a table in the database (SQL Server). If the data volume is small, performance is not a big concern, but if the data volume is large we have to consider the most efficient approach. Let us discuss the different approaches and their advantages and disadvantages:

Looping over an insert statement:
One of the easiest ways is to write an insert statement inside a loop, populating records one by one in the database. For example:
foreach (DataRow row in table.Rows)
{
    // One new connection and one round trip to the database per row.
    using (SqlConnection connection = new SqlConnection(connectionString))
    using (SqlCommand cmd = new SqlCommand(
        "INSERT INTO Users(Name, Phone, Age) VALUES (@Name, @Phone, @Age)", connection))
    {
        cmd.CommandType = CommandType.Text;
        cmd.Parameters.AddWithValue("@Name", row["Name"]);
        cmd.Parameters.AddWithValue("@Phone", row["Phone"]);
        cmd.Parameters.AddWithValue("@Age", row["Age"]);
        connection.Open();
        cmd.ExecuteNonQuery();
    }
}
      
The problem with the above script is that there is a separate database call (and a new connection) for each row. For a large volume of records it will take a significant amount of time.

Multiple insert statements in a single query:

In this approach all the insert statements are first concatenated into a single string variable, and then the whole batch is executed against the database in one call. For example:

string Query = @"INSERT INTO Users(Name, Phone, Age) VALUES ('Name_1', 'Phone_1', 'Age_1')
INSERT INTO Users(Name, Phone, Age) VALUES ('Name_2', 'Phone_2', 'Age_2')
.....
.....
INSERT INTO Users(Name, Phone, Age) VALUES ('Name_n', 'Phone_n', 'Age_n')";

using (SqlConnection connection = new SqlConnection(connectionString))
{
    SqlCommand cmd = new SqlCommand(Query);
    cmd.CommandType = CommandType.Text;
    cmd.Connection = connection;
    connection.Open();
    cmd.ExecuteNonQuery();
}

In this approach the database is called only once, but inside the database the rows are still inserted one by one. It is a row-based approach and is not efficient for a large volume of data.

Table value constructor:

In SQL Server the table value constructor is a set-based approach and is very efficient. Instead of repeating the full insert statement, we can build the query string in table value constructor format. For example:
string Query = @"INSERT INTO Users(Name, Phone, Age)
       VALUES ('Name_1', 'Phone_1', 'Age_1'),
              ('Name_2', 'Phone_2', 'Age_2'),
              .....
              .....
              ('Name_n', 'Phone_n', 'Age_n')";
                                 

Although this approach is very efficient, a single table value constructor is limited to 1000 rows. If there are more than 1000 rows the statement above will fail, so the rows have to be split into batches, as shown in the sketch below.
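If the rows come from application code, one way to stay under that limit is to build the VALUES list in chunks of at most 1000 rows and send one INSERT per chunk. Here is a minimal sketch, assuming the same Users table and the same DataTable named table as in the first example (values are concatenated as literals only for brevity; real code should parameterize or at least escape them):

// Sketch: send INSERT ... VALUES (...), (...) statements in batches of up to 1000 rows.
// Requires System.Linq for Cast/Skip/Take/Select.
const int batchSize = 1000; // table value constructor limit per INSERT

using (SqlConnection connection = new SqlConnection(connectionString))
{
    connection.Open();

    for (int offset = 0; offset < table.Rows.Count; offset += batchSize)
    {
        var rowValues = table.Rows.Cast<DataRow>()
            .Skip(offset)
            .Take(batchSize)
            .Select(r => string.Format("('{0}', '{1}', {2})", r["Name"], r["Phone"], r["Age"]));

        string query = "INSERT INTO Users(Name, Phone, Age) VALUES " + string.Join(", ", rowValues);

        using (SqlCommand cmd = new SqlCommand(query, connection))
        {
            cmd.ExecuteNonQuery(); // one round trip per 1000 rows instead of one per row
        }
    }
}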

Passing a data table:
One of the best approaches is passing a data table as a table-valued parameter to a stored procedure. For example:

Create a table type in SQL Server:
CREATE TYPE UserType AS TABLE(
      Name VARCHAR(50) NOT NULL,
      Phone VARCHAR(10),
      Age INT
)

Create a stored procedure which accepts UserType as a parameter:

CREATE PROCEDURE AddUsers(
    @UserTable UserType READONLY
)
AS
BEGIN
    INSERT INTO Users
      SELECT * FROM @UserTable
END

Using C# code, execute the stored procedure and pass the data table as a parameter:

// dt must be a DataTable whose columns match dbo.UserType (see the sketch below).
using (SqlConnection conn = new SqlConnection(connectionString))
using (SqlCommand cmd = new SqlCommand("dbo.AddUsers", conn))
{
    cmd.CommandType = CommandType.StoredProcedure;

    SqlParameter tvp = cmd.Parameters.AddWithValue("@UserTable", dt);
    tvp.SqlDbType = SqlDbType.Structured;
    tvp.TypeName = "dbo.UserType";

    conn.Open();
    cmd.ExecuteNonQuery();
}
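For the call above to work, the DataTable's columns have to line up, in order and type, with the columns of dbo.UserType. A minimal sketch of building such a table (the sample values are only illustrative):

// Sketch: a DataTable shaped like dbo.UserType (Name VARCHAR(50), Phone VARCHAR(10), Age INT).
DataTable dt = new DataTable();
dt.Columns.Add("Name", typeof(string));
dt.Columns.Add("Phone", typeof(string));
dt.Columns.Add("Age", typeof(int));

// In real code the rows come from the application's data source.
dt.Rows.Add("Name_1", "Phone_1", 25);
dt.Rows.Add("Name_2", "Phone_2", 30);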

Using BULK INSERT:

SQL Server supports bulk inserts that load data from a file. First we have to download or copy the data file to a known location using the programming language; then the data can be imported into the table with the BULK INSERT command. For example:

BULK INSERT EXACT_HELP.dbo.Users
FROM 'C:\Data\Users.txt' 
WITH
      ( 
            FIELDTERMINATOR = '|', 
            ROWTERMINATOR = '|\n' 
      ); 

We can also do this from application code by using the C# SqlBulkCopy class.
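Here is a minimal SqlBulkCopy sketch, assuming the same Users table and a populated DataTable named table whose column names match the destination columns:

// Sketch: stream a DataTable into dbo.Users with SqlBulkCopy (System.Data.SqlClient).
using (SqlConnection connection = new SqlConnection(connectionString))
using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connection))
{
    connection.Open();

    bulkCopy.DestinationTableName = "dbo.Users";
    bulkCopy.BatchSize = 10000; // rows per batch sent to the server (optional)

    // Map source columns to destination columns explicitly.
    bulkCopy.ColumnMappings.Add("Name", "Name");
    bulkCopy.ColumnMappings.Add("Phone", "Phone");
    bulkCopy.ColumnMappings.Add("Age", "Age");

    bulkCopy.WriteToServer(table);
}

Unlike BULK INSERT, this does not require the file to be visible to the SQL Server service, because the data is streamed directly from the application.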

Passing data in XML format to SQL Server:

In the programming language, first convert the whole data table into XML format, then pass that XML to a stored procedure to populate the table. For example:

Create a stored procedure:

CREATE PROCEDURE AddUsers(
    @UserTable XML
)
AS
BEGIN

      DECLARE @idoc INT

      EXEC sp_xml_preparedocument @idoc OUTPUT, @UserTable;

      INSERT INTO Users
      SELECT *
      FROM OPENXML (@idoc, '/ROOT/Users', 2)  -- 2 = element-centric, matching DataSet.GetXml() output
      WITH (
            Name VARCHAR(50),
            Phone VARCHAR(10),
            Age INT)

      EXEC sp_xml_removedocument @idoc  -- release the XML document handle

END

C# code to pass XML data:
                  
// The DataSet name and table name must match the OPENXML path '/ROOT/Users'.
DataSet ds = new DataSet("ROOT");
dt1.TableName = "Users";
ds.Tables.Add(dt1);
string UserXml = ds.GetXml();

using (SqlConnection conn = new SqlConnection(connectionString))
using (SqlCommand cmd = new SqlCommand("dbo.AddUsers", conn))
{
    cmd.CommandType = CommandType.StoredProcedure;
    cmd.Parameters.Add("@UserTable", SqlDbType.Xml);
    cmd.Parameters["@UserTable"].Value = UserXml;
    conn.Open();
    cmd.ExecuteNonQuery();
}

Sep 17, 2015

Unwanted MANY TO MANY join execution plan SQL Server

Sometimes we need a MANY TO MANY join in SQL Server, and sometimes we don't. But even when we don't, the SQL Server query optimizer sometimes chooses a MANY TO MANY join that we do not actually need. Let me explain with an example:

Let’s assume we have two tables named UserPhoto and tblUser:

CREATE TABLE [dbo].[UserPhoto] (
       [UserPhotoID] [bigint] PRIMARY KEY IDENTITY(1, 1) NOT NULL
       ,[PhotoID] [int] NULL
       ,[PhotoPath] [varchar](100) NULL
)

CREATE TABLE [dbo].[tblUser] (
       [UserID] [bigint] PRIMARY KEY IDENTITY(1, 1) NOT NULL
       ,[UserName] [varchar](100) NULL
       ,[PhotoID] [int] NULL
)

Creating non-clustered indexes to improve performance:

CREATE NONCLUSTERED INDEX NCI_PhotoID
ON [dbo].[tblUser] ([PhotoID])
INCLUDE (
       [UserID]
       ,[UserName]
)

CREATE NONCLUSTERED INDEX NCI_PhotoID
ON [dbo].[UserPhoto] ([PhotoID])
INCLUDE ([PhotoPath])
  
Let's assume there is a business constraint that a user can have only one photo. So we populate data in both tables such that a user cannot have more than one PhotoID:

INSERT INTO [UserPhoto] ([PhotoID],[PhotoPath])
SELECT CONCAT (language_id, message_id), CONCAT ('C:\Image\', message_id, '.png')
FROM sys.messages

INSERT INTO [tblUser] (UserName,PhotoID)
SELECT CONCAT ('User',PhotoID) ,PhotoID
FROM [UserPhoto]

Now we are going to execute the script below, which joins the two tables on PhotoID:

SET STATISTICS IO ON

SELECT TOP (100000) U.UserID
       ,U.UserName
       ,UP.PhotoPath
FROM tblUser U
INNER JOIN UserPhoto UP ON U.PhotoID = UP.PhotoID

Let’s analyze its execution plan:


If we look at the merge join operator we will see that MANY TO MANY is true, even though our business constraint says a user cannot have more than one photo. So why has the SQL Server query optimizer chosen the costlier MANY TO MANY join?

Now we are going to check the total logical reads:

Table 'UserPhoto'. Scan count 5, logical reads 884

Table 'tblUser'. Scan count 5, logical reads 772

The reason behind this is that the SQL Server query optimizer doesn't know about our business constraint "a user can have one and only one photo". That is why it has generated a bad execution plan, i.e. MANY TO MANY. Hence it is our task to tell the SQL Server query optimizer about our business constraints. In this case we can express the constraint (a user can have one and only one photo) by creating the two UNIQUE NONCLUSTERED indexes below:

CREATE UNIQUE NONCLUSTERED INDEX NCI_PhotoID
ON [dbo].[tblUser] ([PhotoID])
INCLUDE ([UserID],[UserName])
WITH (DROP_EXISTING = ON)

CREATE UNIQUE NONCLUSTERED INDEX NCI_PhotoID
ON [dbo].[UserPhoto] ([PhotoID])
INCLUDE ([PhotoPath])
WITH (DROP_EXISTING = ON)

Now if we execute the same join query again:

SET STATISTICS IO ON

SELECT TOP (100000)
       U.UserID
       ,U.UserName
       ,UP.PhotoPath
FROM tblUser U
INNER JOIN UserPhoto UP ON U.PhotoID = UP.PhotoID

Let’s check execution plan:


Now MANY to MANY is false!!!

And the total logical reads:

Table 'UserPhoto'. Scan count 1, logical reads 487
Table 'tblUser'. Scan count 1, logical reads 424

This time the total logical reads are much lower than before for the same query. This was an example of an unwanted MANY TO MANY join, which unnecessarily degrades query performance.

Sep 13, 2015

User defined scalar value functions and query optimization, SQL Server

Is the use of scalar functions good or bad? A performance analysis

Sometimes we encounter a situation with slightly complex business logic where we need to create a user defined scalar value function, either for re-usability or to simplify the main SQL statement.

The only potential issue here is the execution plan of the query. The SQL Server query optimizer generates a separate plan for the user defined scalar function, which is invoked by the main (base) execution plan. Let me explain with an example:

--Creating sample table
CREATE TABLE tblMessage(
     MessageID BIGINT PRIMARY KEY IDENTITY,
     LanguageID INT,
     MessageType TINYINT
)

--Populating records into it
INSERT INTO tblMessage
SELECT language_id,severity FROM sys.messages

GO

--Creating scalar function to get count
CREATE FUNCTION dbo.udfMsgTypeCount(
     @MessageType TINYINT
)
RETURNS INT
AS
BEGIN
     DECLARE @MsgTypeCount AS INT

     SELECT @MsgTypeCount = COUNT(*) FROM tblMessage WHERE MessageType = @MessageType

     RETURN @MsgTypeCount
END

Now we are going to execute two similar queries, one using the scalar user defined function and the other using an equivalent subquery:

--Approach 1: Using scalar UDF:
SELECT TOP(1000)
     MessageID,
     LanguageID,
     dbo.udfMsgTypeCount(MessageType) AS Typecount
FROM tblMessage
ORDER BY MessageID

--Approach 2: Using subquery
SELECT TOP(1000)
     MessageID,
     LanguageID,
     (SELECT COUNT(*) FROM tblMessage WHERE MessageType = M.MessageType) AS Typecount
FROM tblMessage M
ORDER BY MessageID

Comparing execution plan:


According to the execution plan, the first approach (scalar value function) looks better than the second approach (subquery), since the cost of the first approach is shown as only 0% while the second is 100%.

Comparing execution time:

Approach 1:
CPU time = 87313 ms, elapsed time = 87522 ms.

Approach 2:
CPU time = 312 ms, elapsed time = 588 ms.

According to the execution time, the first approach is actually the bad one: it took around 1.4 minutes while the second approach took only about 0.6 seconds.
  
Question 1: Why is the execution plan wrong?

Here the execution plan is not wrong; it is a limitation of the execution plan. A SQL Server execution plan doesn't include the plan of the invoked user defined scalar value function. If we observe the execution plan of the first approach, it has only a Compute Scalar physical operator (inside the green rectangle) for invoking the UDF.

There is a separate execution plan for the UDF, which we can get with this script:

SELECT query_plan
FROM sys.dm_exec_cached_plans
CROSS APPLY sys.dm_exec_query_plan(plan_handle)
WHERE objectid =  OBJECT_ID('dbo.udfMsgTypeCount')

The scalar UDF's plan above is not visible in the base query's execution plan, and the total cost has not been evaluated correctly. That is why the plan comparison made the first approach look better, while in reality the second approach is much better than the first.

Question 2: I agree with the limitation that the execution plan is unable to integrate the plan of the invoked scalar UDF. But why did the first approach take 1.4 minutes while the second approach took only 0.6 seconds? Why is there such a huge performance difference?

In the case of a scalar UDF, the execution plans of the base query and of the invoked scalar UDF are generated independently. Hence the overall plan may sometimes be less efficient than the combined execution plan generated for the whole script (the second approach). The current demonstration is a good example of this.

If we analyze the actual execution plan in text form (SET STATISTICS PROFILE ON), the second approach has a table spool which keeps only 16 records; it is absent in the first approach, which is why the second approach executes faster:

Using scalar UDF:


Using subquery:


Hence in this situation we should avoid using a scalar user defined function. Apart from this, scalar value functions also inhibit query parallelism.

Sep 6, 2015

Use of triggers and SQL Server query performance: good or bad?

Avoid using DML triggers in inappropriate places

I am not going to say that triggers are bad, but there are many situations where using a trigger degrades performance. For example, sometimes a trigger references the same table and performs a task which could have been done by the statement that fired it. In that case there is no need to create the trigger. Let me explain with an example:

Approach 1: Using trigger

a. Creating a table:
CREATE TABLE tblOrder(
     OrderID BIGINT PRIMARY KEY IDENTITY,
     OrderType INT,
     Qty INT,
     InsertDate DATETIME
)

b. Creating a trigger to update the insertion date:

CREATE TRIGGER Trg_Insert_Date ON tblOrder
AFTER INSERT
AS
BEGIN
    
    UPDATE  O
    SET     InsertDate = GETDATE()
    FROM    tblOrder O
    INNER JOIN inserted I
    ON      O.OrderID = I.OrderID

END

c. Populating records in the order table and enabling statistics:

SET STATISTICS IO ON

INSERT INTO tblOrder(OrderType,Qty)
SELECT message_id, severity FROM Sys.messages

Total logical reads:

Table 'tblOrder'. Scan count 0, logical reads 645023
Table 'Worktable'. Scan count 1, logical reads 585437
Table 'tblOrder'. Scan count 1, logical reads 1036

Approach 2: Using Default value

a. Dropping trigger and creating equivalent default constraint on InsertDate column:

DROP TRIGGER Trg_Insert_Date

ALTER TABLE tblOrder
ADD CONSTRAINT [def_insert_dt]
DEFAULT (GETDATE())
FOR InsertDate

b. Removing all records from table:

TRUNCATE TABLE tblOrder

c. Populating records once again:

SET STATISTICS IO ON

INSERT INTO tblOrder(OrderType,Qty)
SELECT message_id, severity FROM Sys.messages

Total logical reads:

Table 'tblOrder'. Scan count 0, logical reads 645023
Table 'Worktable'. Scan count 1, logical reads 585437

After comparing the logical reads, we can see there are 1036 extra logical reads against tblOrder when we use the trigger. Hence, in this particular example, using a trigger is the worse approach.