Feb 1, 2013

FIND_IN_SET not using index with comma delimited string in Mysql

In Mysql it is very difficult to optimize if filter key column is in a comma delimited string. We can filter the records by using FIND_IN_SET function but issue with this is, field which is used in this function are not used by the index. In this post we will discuss how to force Mysql to use index with csv data in a tricky way.  First of all I'm explaining the issue by an example:

Let's assume we have table say tblEmp.

CREATE TABLE tblEmployee(
     ntEmpID BIGINT PRIMARY KEY AUTO_INCREMENT,
     vcEmpName VARCHAR(100),
     ntAge INT,
     dtDateOfJoin DATETIME
);

Now I'm going to insert some records into it:

INSERT INTO tblEmployee(vcEmpName,ntAge,dtDateOfJoin)
SELECT
CONCAT('Emp_' ,A.ID),
40 - A.Sortlen,
NOW()
FROM INFORMATION_SCHEMA.COLLATIONS A,
     INFORMATION_SCHEMA.COLLATIONS B,
     INFORMATION_SCHEMA.COLLATIONS C
LIMIT 1000000;

Now I would like to filter the records from tblEmployee while ntEmpID are in a comma delimited string:

SELECT *
FROM tblEmployee
WHERE FIND_IN_SET(ntEmpID,'500000,500001,500002,500003,500004,500005,500006,500007,500008,500009,500010,500011,500012,500013,500014,500015,500016,500017,500018,500019,500020,500021,500022,500023,500024,500025,500026,500027,500028,500029,500030,500031,500032,500033,500034,500035,500036,500037,500038,500039,500040,500041,500042,500043,500044,500045,500046,500047,500048,500049,500050,500051,500052,500053,500054,500055,500056,500057,500058,500059,500060,500061,500062,500063,500064,500065,500066,500067,500068,500069,500070,500071,500072,500073,500074,500075,500076,500077,500078,500079,500080,500081,500082,500083,500084,500085,500086,500087,500088,500089,500090,500091,500092,500093,500094,500095,500096,500097,500098,500099,500100');

When I executed above query in my local system it took around 7 seconds.

While if same value is passed in the IN clause, for example:

SELECT *
FROM tblEmployee
WHERE ntEmpID IN (500000,500001,500002,500003,500004,500005,500006,500007,500008,500009,500010,500011,500012,500013,500014,500015,500016,500017,500018,500019,500020,500021,500022,500023,500024,500025,500026,500027,500028,500029,500030,500031,500032,500033,500034,500035,500036,500037,500038,500039,500040,500041,500042,500043,500044,500045,500046,500047,500048,500049,500050,500051,500052,500053,500054,500055,500056,500057,500058,500059,500060,500061,500062,500063,500064,500065,500066,500067,500068,500069,500070,500071,500072,500073,500074,500075,500076,500077,500078,500079,500080,500081,500082,500083,500084,500085,500086,500087,500088,500089,500090,500091,500092,500093,500094,500095,500096,500097,500098,500099,500100);

It is taking only 0.016 seconds in my local system.  Here there is significant difference in the execution time for same requirement. But problem is we cannot use IN clause since our data is in csv string format :(

Why there is so difference. I now we are comparing the query execution plan by using explain statement:

Explain of first query (Using FIND_IN_SET):
ID
Select_Type
Table
Type
Key
Key_Len
Ref
Rows
Extra
1
Simple
tblEmployee
All
997906
Using Where

Explain of second query (Using IN):
ID
Select_Type
Table
Type
Key
Key_Len
Ref
Rows
Extra
1
Simple
tblEmployee
Range
Primary
8
101
Using Where
    
We can observe the difference. First query has to check 997906 rows while second query has to check only 101 rows to fetch only 101 records. This difference is due to second query is using PRIMARY index that is clustered index(Its key is ntEmpID) while first query is not able to.

After some effort I found out a trick to use index while filter data is in a comma delimited string.

Step 1: Create a table. Say tblIndex:

CREATE TABLE tblIndex(
ntIndex INTEGER PRIMARY KEY AUTO_INCREMENT
);

Step 2: Insert some records into it according to max possible keys in the comma delimited string. In My case comma delimited keys string cannot be more than 400.  

INSERT INTO tblIndex(ntIndex)
SELECT NULL
FROM INFORMATION_SCHEMA.COLLATIONS A,
     INFORMATION_SCHEMA.COLLATIONS B  
LIMIT 400;

Step 3: Write same query in the following way:

SELECT E.*
FROM tblEmployee E
INNER JOIN tblIndex
ON ntEmpID = (SUBSTRING_INDEX(SUBSTRING_INDEX(@vcKeyValue, ',', ntIndex), ',', -1))
WHERE ntIndex <= + LENGTH(@vcKeyValue) - LENGTH(REPLACE(@vcKeyValue,',',''));

When I executed this query in my local system it took only 0.032 seconds :)

It has solved my problem.  Now check the explain statement of this query:

ID
Select_Type
Table
Type
Key
Key_Len
Ref
Rows
Extra
1
Simple
TblIndex
Range
Primary
4
NULL
102
Using Where;Using Index
1
Simple
tblEmployee
Eq_Ref
Primary
8
Func
1
Using Where

Now query is using Primary index!!!

No comments:

Post a Comment