有一網(wǎng)友問:關(guān)于MS SQLSERVER索引優(yōu)化問題:
有表Stress_test(id int, key char(2))
id 上有普通索引;
key 上有簇索引;
id 有有限量的重復(fù);
key 有無限量的重復(fù);
現(xiàn)在我需要按邏輯與查詢表中key='Az' AND key='Bw' AND key='Cv' 的id
求教高手最有效的查詢語(yǔ)句
測(cè)試環(huán)境:
Hardware:P4 2.6+512M+80G
Software:windows server 2003(Enterprise Edition)+Sqlserver 2000 +sp3a
首先我們建立一個(gè)測(cè)試的數(shù)據(jù),為使數(shù)據(jù)盡量的分布和隨即,我們通過RAND()來隨機(jī)產(chǎn)生2個(gè)隨機(jī)數(shù)再組合成一個(gè)字符串,首先插入的數(shù)據(jù)是1,000,000條記錄,然后在循環(huán)插入到58,000,000條記錄。
因?yàn)槭请S機(jī)產(chǎn)生的數(shù)據(jù),所以如果你自己測(cè)試的數(shù)據(jù)集和我測(cè)試的會(huì)不一樣,但對(duì)索引的優(yōu)化和運(yùn)行的效率是一樣的。
下面的“--//測(cè)試腳本”是產(chǎn)生測(cè)試數(shù)據(jù)的腳本,你可以根據(jù)需要修改 @maxgroup, @maxLoop的值,比如測(cè)試1百萬的記錄可以:
Select @maxgroup=1000
Select @maxLoop=1000
如果要測(cè)試5千萬:
Select @maxgroup=5000
Select @maxLoop=10000
所以如果你的SERVER或PC比較慢,請(qǐng)耐心等待.....,
(在我的PC上運(yùn)行的速度是插入1百萬條的時(shí)間是1.14m,插入5千八百萬條的時(shí)間是19.41m,重新建立INDEX的時(shí)間是34.36m)
作為一般的開發(fā)人員很容易就想到的語(yǔ)句:
--語(yǔ)句1
select a.[id] from
(select distinct [id] from stress_test where [key] = 'Az') a,
(select distinct [id] from stress_test where [key] = 'Bw') b ,
(select distinct [id] from stress_test where [key] = 'Cv') c
where a.id = b.id and a.id = c.id
--語(yǔ)句2
select [id]
from stress_test
where [key]='Az' or [key]='Bw' or [key]='Cv'
group by id having(count(distinct [key])=3)
--語(yǔ)句5
SELECT distinct a.[id] FROM stress_test AS a,stress_test AS b,stress_test AS c
WHERE a.[key]='Az' AND b.[key]='Bw' AND c.[key]='Cv'
AND a.[id]=b.[id] AND a.[id]=c.[id]
但作為T-SQL的所謂“高手”可能會(huì)認(rèn)為這種寫法很“土”,也顯得沒有水平,所以會(huì)選擇一些子查詢和外連接的寫法,按常理子查詢的效率是比較高的:
--語(yǔ)句3
select distinct [id] from stress_test A where
not exists (
select 1 from
(select 'Az' as k union all select 'Bw' union all select 'Cv') B
left join stress_test C on C.id=A.id and B.[k]=C.[key]
where C.id is null)
--語(yǔ)句4
select distinct a.id from stress_test a
where not exists
( select * from keytb c
where not exists
( select * from stress_test b
where
b.id = a.id
and
c.kf1 = b.[key]
)
)
我們先分析這幾條語(yǔ)句(針對(duì)5千8百萬條數(shù)據(jù)進(jìn)行分析):
請(qǐng)大家要特別留心Estimated row count的值。
語(yǔ)句1:從執(zhí)行規(guī)劃中我們可以看出,MSSQLSERVER選擇的索引優(yōu)化非常有規(guī)律,先通過CLUSTERED INDEX篩選出符合[KEY]='Az'條件的ID,然后進(jìn)行HASH MATCH,在找出ID相等的;依次類推最終檢索到符合所有條件的記錄。中間的Estimated row count的值都不大。
語(yǔ)句2:從執(zhí)行規(guī)劃中我們可以看出,是先通過CLUSTERED INDEX篩選出符合 [key]='Az' or [key]='Bw' or [key]='Cv' 符合所有條件的ID,然后分組進(jìn)行2次HASH MATCH 所有的ID。我們可以看出Estimated row count的值是越來越少,從最初的369,262到最后排序的只有402。
語(yǔ)句3:從執(zhí)行規(guī)劃中我們可以看是非常復(fù)雜的,是先通過3組 通過CONSTANT SCAN和NON-CLUSTERED INDEX檢索出符合 A.ID=C.ID AND [key]='**' 的記錄3組,然后分組進(jìn)行外鍵匹配,再將3組的數(shù)據(jù)合并,排序,然后再和一個(gè)NON-CLUSTERED INDEX檢索出的記錄集進(jìn)行外鍵匹配,我們可以看出MSSQLSERVER會(huì)對(duì)所有的記錄(5千萬條)記錄進(jìn)行分組,Estimated row count的值是:58,720,000,所以這句T-SQL的瓶頸是對(duì)5千萬條記錄進(jìn)行分組。
語(yǔ)句4:從執(zhí)行規(guī)劃中我們可以看和語(yǔ)句3有相似之處,都要對(duì)所有的記錄(5千萬條)記錄進(jìn)行分組,所以這是檢索的瓶頸,而且使用的索引都是NON-CLUSTERED INDEX。
語(yǔ)句5:從執(zhí)行規(guī)劃中我們可以看出,先通過CLUSTERED INDEX檢索出符合[Key]='Az'的記錄集,然后進(jìn)行HASH MATCH和SORTS,因?yàn)閿?shù)量少所以是非常會(huì)的,在和通過NON-CLUSTERED INDEX檢索[KEY]='Bw'的記錄進(jìn)行INNER JOIN,在和通過CLUSTERED INDEX檢索[KEY]='Cv'的記錄進(jìn)行合并,最后是對(duì)4百萬條數(shù)據(jù)進(jìn)行分組檢索,如果是6列,我們可以看出Estimated row count的值是遞增,越來越大,最后的分組檢索的Estimated row count的值是3.46E+15,這已經(jīng)形成巨大的瓶頸。
我們可以先測(cè)試一下小的數(shù)據(jù)量(50000條);
大家可以下面測(cè)試腳本的:
Select @maxgroup=500
Select @maxLoop=100
----------------------------------------------------------------------
|------------------語(yǔ)句 1----語(yǔ)句 2----語(yǔ)句 3----語(yǔ)句 4----語(yǔ)句 5----|
| 5萬(3列) 5ms 19ms 37ms 59ms 0ms
| 5萬(6列) 1ms 26ms 36ms 36ms 1ms
從測(cè)試的的數(shù)據(jù)來看,語(yǔ)句5的效率是最高的,幾乎沒有花費(fèi)時(shí)間,而語(yǔ)句2的效率只能說是一般。如果測(cè)試到這里就結(jié)束了,我們可以毫不猶豫的選擇語(yǔ)句 5 :-(,繼續(xù)進(jìn)行下面的測(cè)試.....
我們測(cè)試百萬條以上的記錄:
1.先對(duì)1百萬條記錄進(jìn)行測(cè)試(選取3列)
2.先對(duì)1百萬條記錄進(jìn)行測(cè)試(選取6列)
3.對(duì)5千萬條數(shù)據(jù)測(cè)試(選取3列)
4.對(duì)5千萬條數(shù)據(jù)測(cè)試(選取6列)
統(tǒng)計(jì)表1:
----------------------------------------------------------------------
|------------------語(yǔ)句 1----語(yǔ)句 2----語(yǔ)句 3----語(yǔ)句 4----語(yǔ)句 5----|
| 1百萬(3列) 0.77% 0.41% 49.30% 48.99% 0.52%
| 1百萬(6列) 1.61% 0.81% 48.99% 47.44% 1.14%
| 5千萬(3列) 0.14% 0.18% 48.88% 48.86% 1.93%
| 5千萬(6列) 0.00% 0.00% 0.00% 0.00% 100.00%
統(tǒng)計(jì)表2:
----------------------------------------------------------------------
|------------------語(yǔ)句 1----語(yǔ)句 2----語(yǔ)句 3----語(yǔ)句 4----語(yǔ)句 5----|
| 1百萬(3列) 9ms 22ms 723ms 753ms 4ms
| 1百萬(6列) 15ms 38ms 764ms 773ms 11ms
| 5千萬(3列) 575ms 262ms 110117ms 110601ms 12533ms
| 5千萬(6列) 1070ms 576ms 107988ms 109704ms 10m以上
測(cè)試總結(jié):(我們可以比較關(guān)注:語(yǔ)句 2和語(yǔ)句 5)
1.在1百萬條記錄的情況下,語(yǔ)句 5是最快的,但在5千萬條記錄下是最慢的。這說明INDEX的優(yōu)化一定的情況下,數(shù)據(jù)量不同,檢索的效率也是不同的。我們平時(shí)在寫T-SQL時(shí)一般關(guān)注的時(shí)INDEX的使用,只要我們寫的T-SQL是利用CLUSTERED INDEX,我們就認(rèn)為是最優(yōu)化了,其實(shí)這是一個(gè)誤區(qū),我們還要關(guān)注Estimated row count的值,大量的I/O操作是我們應(yīng)該關(guān)注的,所以我們應(yīng)該根據(jù)數(shù)據(jù)量的不同選擇相應(yīng)的T-SQL語(yǔ)句,不要認(rèn)為在小數(shù)據(jù)量下是最高的在大數(shù)據(jù)量的狀態(tài)下也許是最慢的:-(。
2.在執(zhí)行規(guī)劃中最快的,并不是運(yùn)行最快的,我們可以看在1百萬(6列)在這行中,語(yǔ)句 2和語(yǔ)句 5的比例是0.81%:1.14%,但實(shí)際的運(yùn)行效率是,38ms:11ms。所以,我們?cè)谶x擇T-SQL是要考慮本地I/O的速度,所以在優(yōu)化語(yǔ)句時(shí)不僅要看執(zhí)行規(guī)劃還要計(jì)算一下具體的效率。
在測(cè)試的語(yǔ)句上加入:
SET STATISTICS TIME ON/OFF
SET STATISTICS IO ON/OFF
是一個(gè)很好的調(diào)試方法。
3.綜合評(píng)價(jià),語(yǔ)句 2的效率是最高的,執(zhí)行效率沒有隨數(shù)據(jù)量變化而有很大的差別。
4.執(zhí)行規(guī)劃越簡(jiǎn)單的語(yǔ)句(語(yǔ)句1),綜合效率越高,反之則越低(語(yǔ)句3,語(yǔ)句4)。
5.在平時(shí)寫T-SQL語(yǔ)句時(shí),一定要根據(jù)不同的數(shù)據(jù)量進(jìn)行測(cè)試,雖然都是用CLUSTERED INDEX,但檢索的效率卻大相徑庭。
--//測(cè)試腳本
USE Northwind
GO
if exists(select * from sysobjects where name=N'stress_test' and type='U')
Drop table stress_test
GO
--//定義測(cè)試的表stress_test,存放所有的測(cè)試數(shù)據(jù)
Create table stress_test([id] int,[key] char(2))
GO
--//插入測(cè)試的數(shù)據(jù)
Set nocount on
--//變量定義
Declare @id int --//Stress_test ID 值
Declare @key char(2) --//Stress_test [key] 值
Declare @maxgroup int --//組最大的循環(huán)數(shù)
Declare @maxLoop int --//ID最大的循環(huán)數(shù)
Declare @tempGroup int --//臨時(shí)變量
Declare @tempLoop int --//臨時(shí)變量
Declare @tempint1 int --//臨時(shí)變量
Declare @tempint2 int --//臨時(shí)變量
Declare @rowcount int --//記錄事務(wù)提交的行數(shù)
--//初始化變量
Select @id=1
Select @maxgroup=1000
Select @maxLoop=1000
Select @tempGroup=1
Select @tempLoop=1
Select @key=''
Select @rowcount=0
while @tempLoop<=@maxLoop
begin
while @tempGroup<=@maxGroup
begin
select @tempint1=65+convert(int,rand()*50)
select @tempint2=65+convert(int,rand()*100)
if (@tempint1>=122 or @tempint2>=122)
begin
select @tempint1=@tempint1-100
select @tempint2=@tempint2-100
if (@tempint1<=65 or @tempint2<=65)
begin
select @tempint1=@tempint1+57
select @tempint2=@tempint2+57
end
end
select @key=char(@tempint1)+char(@tempint2)
if @rowcount=0
begin tran ins
insert into stress_test([id],[key])values(@id,@key)
select @rowcount=@rowcount+1
if @rowcount>3000 --//判斷當(dāng)行數(shù)達(dá)到3000條時(shí),開始提交事務(wù)
begin
commit tran ins
select @rowcount=0
end
select @tempGroup=@tempgroup+1
end
if @rowcount>0
begin
commit tran ins
select @rowcount=0
end
select @tempGroup=1
select @id=@id+1
select @tempLoop=@tempLoop+1
end
GO
--//刪除KEY值為NULL的記錄
delete stress_test where [key]is null
GO
--//建立簇索引PK_STRESS
Create Clustered index pk_stress on stress_test([Key])
--//建立非簇索引NI_STRESS_ID
Create NonClustered index NI_stress_id on stress_test([id])
GO
--//定義測(cè)試的表keytb
if exists(select * from sysobjects where name=N'keytb' and type='U')
Drop table keytb
GO
create table keytb -----//存放你需要匹配的值的表
(
kf1 varchar(20)
)
--//存放你需要匹配的值,暫定為三個(gè)
insert into keytb(kf1) values('Az');
insert into keytb(kf1) values('Bw');
insert into keytb(kf1) values('Cv');
--insert into keytb(kf1) values('Du');
--insert into keytb(kf1) values('Ex');
--insert into keytb(kf1) values('Fy');
GO
下面我們就開始測(cè)試幾種T-SQL的INDEX優(yōu)化問題:
--先對(duì)1百萬條/1億條記錄進(jìn)行測(cè)試(選取3列)的T-SQL:
PRINT '第一種語(yǔ)句:'
SET STATISTICS TIME ON
SET STATISTICS IO ON
select a.[id] from
(select distinct [id] from stress_test where [key] = 'Az') a,
(select distinct [id] from stress_test where [key] = 'Bw') b ,
(select distinct [id] from stress_test where [key] = 'Cv') c
where a.id = b.id and a.id = c.id
GO
PRINT '第二種語(yǔ)句:'
select [id]
from stress_test
where [key]='Az' or [key]='Bw' or [key]='Cv'
group by id having(count(distinct [key])=3)
GO
PRINT '第三種語(yǔ)句:'
select distinct [id] from stress_test A where
not exists (
select 1 from
(select 'Az' as k union all select 'Bw' union all select 'Cv') B
left join stress_test C on C.id=A.id and B.[k]=C.[key]
where C.id is null)
GO
PRINT '第四種語(yǔ)句:'
select distinct a.id from stress_test a
where not exists
( select * from keytb c
where not exists
( select * from stress_test b
where
b.id = a.id
and
c.kf1 = b.[key]
)
)
GO
PRINT '第五種語(yǔ)句:'
SELECT distinct a.[id] FROM stress_test AS a,stress_test AS b,stress_test AS c
WHERE a.[key]='Ac' AND b.[key]='Bb' AND c.[key]='Ca'
AND a.[id]=b.[id] AND a.[id]=c.[id]
GO
SET STATISTICS TIME OFF
SET STATISTICS IO OFF
--先對(duì)1百萬條/1億條記錄進(jìn)行測(cè)試(選取6列)的T-SQL:
PRINT '第一種語(yǔ)句:'
SET STATISTICS TIME ON
SET STATISTICS IO ON
select a.[id] from
(select distinct [id] from stress_test where [key] = 'Az') a,
(select distinct [id] from stress_test where [key] = 'Bw') b ,
(select distinct [id] from stress_test where [key] = 'Cv') c,
(select distinct [id] from stress_test where [key] = 'Du') d,
(select distinct [id] from stress_test where [key] = 'Ex') e,
(select distinct [id] from stress_test where [key] = 'Fy') f
where a.[id] = b.[id] and a.[id] = c.[id] and a.[id] = d.[id] and a.[id] = e.[id] and a.[id] = f.[id]
GO
PRINT '第二種語(yǔ)句:'
select [id]
from stress_test
where [key]='Az' or [key]='Bw' or [key]='Cv' or [Key]='Du'or [Key]='Ex'or [Key]='Fy'
group by id having(count(distinct [key])=6)
GO
PRINT '第三種語(yǔ)句:'
select distinct [id] from stress_test A where
not exists (
select 1 from
(select 'Az' as k union all select 'Bw' union all select 'Cv'union all select 'Du'union all select 'Ex'union all select 'Fy') B
left join stress_test C on C.id=A.id and B.[k]=C.[key]
where C.id is null)
GO
PRINT '第四種語(yǔ)句:'
select distinct a.id from stress_test a
where not exists
( select * from keytb c
where not exists
( select * from stress_test b
where
b.id = a.id
and
c.kf1 = b.[key]
)
)
GO
PRINT '第五種語(yǔ)句:'
SELECT distinct a.[id] FROM stress_test AS a,stress_test AS b,stress_test AS c,stress_test AS d,stress_test AS e,stress_test AS f
WHERE a.[key]='Az' AND b.[key]='Bw' AND c.[key]='Cv' AND d.[key]='Du' AND e.[key]='Ex' AND f.[key]='Fy'
and a.[id] = b.[id] and a.[id] = c.[id] and a.[id] = d.[id] and a.[id] = e.[id] and a.[id] = f.[id]
GO
SET STATISTICS TIME OFF
SET STATISTICS IO OFF
請(qǐng)參考:
http://expert.csdn.net/Expert/topic/2630/2630484.xml?temp=.9921686