Jul 8, 2016

Complex or multiple tables join best practices SQL server performance tuning

How to optimize table JOIN if there are many tables in a JOIN statement in the SQL Server? How to decrease numbers of tables in a JOIN?

 It is highly recommended we should not JOIN more than four or five tables in a single query. This number varies according to data volume of a table.  Why? 

I want to share a real world example. One senior developer joined a new company. There was a database which was facing a lot performance issues. To analyze the root cause, that developer writes a query which joins four tables.  Those tables contain a very large volume of data. Hence that query took time to fetch records and it locked four tables in significant period of time due to which other queries got blocked.  After some time it blocked whole database.  When client came to know that new employee got fired. A very simple mistake became cause this accident.

How to avoid numbers of tables in a JOIN?
Basic question is if there is business need to join more than five tables to fetch some data. What is alternative approach? How to meet this requirement without joining five tables? Answer is very simple. First join only three tables and load the intermediate data in a temporary table. Then join remaining tables with the temporary table. For example:

Business need:
SELECT T1.Id, T2.C, T3.C, T4.C, T5.C
FROM T1 INNER JOIN T2 ON T1.Id = T2.Id
INNER JOIN T3 ON T3.Id = T2.Id
INNER JOIN T4 ON T4.Id = T3.Id
INNER JOIN T5 ON T5.Id = T4.Id

Our approach should be:

--Step 1
SELECT T1.Id, T2.C, T3.C
INTO #T
FROM T1 INNER JOIN T2 ON T1.Id = T2.Id
INNER JOIN T3 ON T3.Id = T2.Id

--Step 2
SELECT #T.*, T4.C, T5.C
FROM #T INNER JOIN T4 ON #T.Id = T4.Id
INNER JOIN T5 ON T5.Id = T4.Id

But this solution is not as simple as looks. There are many performance considerations before splitting complex JOIN statement. A wrong splitting may decrease overall performance of the query and a correct splitting may improve overall query performance significantly. For example:

Data set up:
--Creating any five tables
CREATE TABLE TA(
AID INT PRIMARY KEY IDENTITY,
A1 INT,
A2 DATETIME DEFAULT(GETDATE())
)

CREATE TABLE TB(
BID BIGINT PRIMARY KEY,
B1 VARCHAR(10),
B2 UNIQUEIDENTIFIER
)

CREATE TABLE TC(
CID BIGINT PRIMARY KEY IDENTITY,
C1 UNIQUEIDENTIFIER DEFAULT(NEWID()),
C2 NVARCHAR(10)
)

CREATE TABLE TD(
DID BIGINT PRIMARY KEY IDENTITY,
D1 INT,
D2 NVARCHAR(10)
)

CREATE TABLE TE(
EID BIGINT PRIMARY KEY IDENTITY,
E1 VARCHAR(64),
E2 INT
)

GO

--Populating records
INSERT INTO TA(A1)
SELECT severity FROM sys.messages
GO 10

INSERT INTO TB(BID,B1)
SELECT AID,LEFT(NEWID(),10) FROM TA
GO

INSERT INTO TC(C1,C2)
SELECT NEWID(),B1 FROM TB
GO

INSERT INTO TD(D1,D2)
SELECT CID, V FROM TC ,(SELECT 'XXX' V UNION ALL SELECT 'YYYY') DT
UNION ALL
SELECT TOP(5)  CID ,'ZZZ' FROM TC 
GO

INSERT INTO TE(E1,E2)
SELECT  B1 + CAST(C1 AS VARCHAR(36)), 1
FROM TB INNER JOIN TC ON TB.BID = TC.CID

Let’s assume our business need is to join above all five tables and get output on the basis of those joined tables. Below is our actual query:

DECLARE @P1 AS INT = 16,@P2 VARCHAR(1)= 'Z',@P3 AS INT = 1
SELECT * FROM TA
INNER JOIN TB ON TA.AID = TB.BID
INNER JOIN TC ON TC.C2 = TB.B1
INNER JOIN TD ON TD.D1 = TC.CID
INNER JOIN TE ON TE.E1 = TB.B1 + CAST(TC.C1 AS VARCHAR(36))
WHERE TA.A1 = @P1
      AND LEFT(TD.D2,1) = @P2
      AND TE.E2 = @P3

As we discussed joining five tables is bad practice. Also sometimes it decreases performance.

Average execution time of above query in my PC: 1 minute and 52 seconds
Total logical read:
Workfile: 34 * 97936
TD: 19416
TE: 22226
TC: 17844
TB: 14732
TA: 7838
= 3411880

To simplify above script, we are going to split above join in two parts. First part will keep first three tables and second part remaining tables:

--First part
DECLARE @P1 AS INT = 16,@P2 VARCHAR(1)= 'Z',@P3 AS INT = 1
SELECT *
INTO #T
FROM TA
INNER JOIN TB ON TA.AID = TB.BID
INNER JOIN TC ON TC.C2 = TB.B1
WHERE TA.A1 = @P1
      AND TB.B2 IS NULL

--Second part
SELECT * FROM #T T
INNER JOIN TD ON TD.D1 = T.CID
INNER JOIN TE ON TE.E1 = T.B1 + CAST(T.C1 AS VARCHAR(36))
WHERE  LEFT(TD.D2,1) = @P2
      AND TE.E2 = @P3

Average execution time of above query in my PC: 2 minutes and 38 seconds
Total logical read:
Workfile: 10 * 21544 + 94 * 69336
TD: 19416
TE: 22226
TC: 17844
TB: 14732
TA: 7838
#T: 24992
= 6840072

After simplifying the query overall execution time has increased significantly. Logical read has become double. After simplifying the query performance has decreased by almost 50% than original query. Even that simplified query is better because large number of tables in a single query may keep tables locked for long period of time which is worst.

But this splitting of a complex query is wrong. We have to think a lot before simplifying a complex query. thumb rule is:

Split a complex query in such a way that it has to fetch minimum possible records from a table. In the other word logical read should be minimal.

Best way to split above query:

DECLARE @P1 AS INT = 16,@P2 VARCHAR(1)= 'Z',@P3 AS INT = 1

--First Part
SELECT * INTO #P
FROM  TD
INNER JOIN TC ON TD.D1 = TC.CID
WHERE   LEFT(TD.D2,1) = @P2

--Second part
SELECT *,TB.B1 + CAST(P.C1 AS VARCHAR(36)) ASINTO#Q
FROM #P P INNER JOIN TB ON P.C2 = TB.B1

--Third part
SELECT * FROM
#Q Q
INNER JOIN TA ON TA.AID = Q.BID
INNER JOIN TE ON TE.E1 = Q.N
WHERE TA.A1 = @P1
      AND TE.E2 = @P3

Average execution time of above query in my PC: 22 seconds
Total logical read:
Workfile: 0
TD: 19416
TE: 22226
TC: 17844
TB: 14732
TA: 15
#P: 1
#Q: 1
= 74235


We can notice that logical read has decreased almost 46 times than original query. Here we are gaining huge performance. Hence, a skilfully simplification of a complex query may achieve huge performance. Apart from this, an appropriate index on intermediate temporary table can get extra performance benefit. 

No comments:

Post a Comment