目前進行一個應用系統的效能測試,所測試的交易含有大量的資料庫Select/Insert/Update的動作,執行完一次交易大概要執行大約3、4千行的 Dynamic SQL語句。測試結果發現效能非常地差,同時50個使用者執行該交易,就需要32分鐘才能將交易做完,利用nmon來監控其CPU使用率發現這30多分鐘的CPU都是滿載使用。
因為這樣的效能實在不能接受,因此開始進行DB2及應用程式的效能調整,因為是第一次有機會做這樣的測試,所以一切就在邊作邊學的過程進行,幸好DB2 Infocenter寫的都還滿詳細的,讓我能一步步達到我的目的。接下來就分享整個Tunning的過程。
首先,要找出使應用程式DB使用效能的原因,我們必需啟用DB2的 monitor switch收及一些統計資訊,輸入 DB2 GET MONITOR SWITCHES 指令可以得到目前這些switche的狀態。
[db2inst1@db01 source_db]$ db2 get monitor switches
Monitor Recording Switches
Switch list for db partition number 0
Buffer Pool Activity Information (BUFFERPOOL) = OFF
Lock Information (LOCK) = OFF
Sorting Information (SORT) = OFF
SQL Statement Information (STATEMENT) = OFF
Table Activity Information (TABLE) = OFF
Take Timestamp Information (TIMESTAMP) = ON 12/12/2009 15:13:18.256534
Unit of Work Information (UOW) = OFF
如上面所示,目前除了TIMESTAMP這個預設會打開的switch外,其它都是關閉的狀態,要啟動任一個 switch只需輸入 db2 UPDATE MONITOR SWITCHES USING XXX ON,比方說若要啟用LOCK這個switch的話,就輸入db2 UPDATE MONITOR SWITCHES USING LOCK ON,結果如下
[db2inst1@db01 source_db]$ db2 update monitor switches using lock on
DB20000I The UPDATE MONITOR SWITCHES command completed successfully.
回到DB2的效能檢視,因為DB2的使用率已達到100%,因此暫不朝向有太多lock的方向思考,且db2 v9版以後,已經有很多Database Configuration參數,如heap size、bufferpool等都可以動態調整了,也不朝向Database Configuration的方向思考。排除上述的因子後,決定直接看 SQL執行的效能如何。
要查看SQL執行效能,首先要把 STATEMENT 這個 Monitor Switch打開,如下
[db2inst1@db01 source_db]$ db2 update monitor switches using statement on
DB20000I The UPDATE MONITOR SWITCHES command completed successfully. 接著模擬1個使用者,執行所要測試的交易,交易執行完後,使用 GET SNAPSHOT FOR DYNAMIC SQL ON database_name 的指令,收集執行一次交易,所有會執行的dynamic SQL統計資訊。將其導向一個檔案
[db2inst1@db01 source_db]$ db2 get snapshot for dynamic sql on test_db > executed_sql
檢視該檔案,可發現每一筆錄到的dynamic SQL的Snapshot的格式如下,在這裡會用到的是 Total Execution Time、Total CPU Time及 SQL statement這三個欄位的資訊(用藍色標出的部分)
Number of executions = 1
Number of compilations = 1
Worst preparation time (ms) = 7
Best preparation time (ms) = 7
Internal rows deleted = 0
Internal rows inserted = 0
Rows read = 1590
Internal rows updated = 0
Rows written = 0
Statement sorts = 0
Statement sort overflows = 0
Total sort time = 0
Buffer pool data logical reads = Not Collected
Buffer pool data physical reads = Not Collected
Buffer pool temporary data logical reads = Not Collected
Buffer pool temporary data physical reads = Not Collected
Buffer pool index logical reads = Not Collected
Buffer pool index physical reads = Not Collected
Buffer pool temporary index logical reads = Not Collected
Buffer pool temporary index physical reads = Not Collected
Buffer pool xda logical reads = Not Collected
Buffer pool xda physical reads = Not Collected
Buffer pool temporary xda logical reads = Not Collected
Buffer pool temporary xda physical reads = Not Collected
Total execution time (sec.microsec)= 25.435582
Total user cpu time (sec.microsec) = 25.377444
Total system cpu time (sec.microsec)= 0.000000
Total statistic fabrication time (milliseconds) = 0
Total synchronous runstats time (milliseconds) = 0
Statement text = SELECT CURR_NO, DEPT_NO, GL_ACCT_NO, SUM(CASH_DR_NO_OF_VOH) CSHDRVOH, SUM(TRANSFER_DR_NO_OF_VOH) TRNDRVOH, SUM(CASH_CR_NO_OF_VOH) CSHCRVOH, SUM(TRANSFER_CR_NO_OF_VOH) TRNCRVOH FROM ACCT01 T1 WHERE VOH_SEQ_NO = (SELECT MIN(VOH_SEQ_NO) FROM ACCT01 WHERE T1.VOH_NO = VOH_NO AND T1.BRANCH_NO = BRANCH_NO) AND VOH_DATE = '20090518' AND VOH_DATE = ORG_VOH_DATE AND (VOH_SPLIT_NO = '00' OR VOH_SPLIT_NO = '99') AND BRANCH_NO = '9964' GROUP BY CURR_NO, DEPT_NO, GL_ACCT_NO
因為所錄到的SQL數量非常龐大,無法用肉眼找出那些SQL耗掉最多的CPU time,因此在這裡寫了一個簡單的小程式來過濾這些record。程式內容如下(下列程式會挑出所有執行時間大於0.5秒的SQL,若要改成其它時間,只需在藍色該行的if 敘述中改為其它值
#!/usr/bin/perl
undef $/;
my $data = <>;
my $long_execution_sql_count=0;
while($data=~/(Total execution time \(sec.microsec\)= (.+?)\n.+?\n.+?\n.+?\n.+?\n.+?\n)/g)
{
if ($2>0.5) #這裡使用0.5會挑出所有執行時間大於0.5秒的SQL。
{
print $1,"\n";
$long_execution_sql_count++;
}
}
print "Total Long Execution SQLs: ",$long_execution_sql_count;
使用此小程式來過濾Dynamic SQL snapshot 所錄到的結果,可發現這個應用程式最耗時的SQL語句如下:
Total execution time (sec.microsec)= 27.081611
Total user cpu time (sec.microsec) = 27.067162
Total system cpu time (sec.microsec)= 0.000000
Total statistic fabrication time (milliseconds) = 0
Total synchronous runstats time (milliseconds) = 0
Statement text = SELECT CURR_NO, DEPT_NO, GL_ACCT_NO, SUM(CASH_DR_NO_OF_VOH)CSHDRVOH,
SUM(TRANSFER_DR_NO_OF_VOH)TRNDRVOH, SUM(CASH_CR_NO_OF_VOH)
CSHCRVOH, SUM(TRANSFER_CR_NO_OF_VOH)TRNCRVOH
FROM ACCT01 T1
WHERE VOH_SEQ_NO =
(SELECT MIN(VOH_SEQ_NO)
FROM ACCT01
WHERE T1.VOH_NO =VOH_NO AND T1.BRANCH_NO =BRANCH_NO)AND VOH_DATE
='20090518' AND ORG_VOH_DATE= '20090518' AND (VOH_SPLIT_NO = '00' OR VOH_SPLIT_NO = '99')
AND BRANCH_NO ='9964'
GROUP BY CURR_NO, DEPT_NO, GL_ACCT_NO
是的,只單單執行一次這個SQL就要花費 27秒,很顯然地,它就是要被tune的SQL。接下來使用db2expln這個命令來Explain DB2的Optimizer倒底是如何來執行這個SQL的,下面範例會把explain的結果,輸出到 sql_explain這個檔案中。先把上述要explain的SQL語句,寫入input_sql這個檔案中,執行下列指令
[db2inst1@db01 ~]$ db2expln -database test_db -stmtfile input_sql -terminator ';' -o sql_explain -graph
DB2 Universal Database Version 9.5, 5622-044 (c) Copyright IBM Corp. 1991, 2007
Licensed Material - Program Property of IBM
IBM DB2 Universal Database SQL and XQUERY Explain Tool
Output is available in "sql_explain".
接下來檢視sql_explain這個output檔案,以下擷取 Optimizer Plan這個Tree的結果,可以看到這個SQL已經使用Index Search了,所以應該也不是沒建 index的問題。
Rows
Operator
(ID)
Cost
1.03085
RETURN
( 1)
4517.06
|
1.03085
GRPBY
( 2)
4517.06
|
1.03085
TBSCAN
( 3)
4517.06
|
1.03085
SORT
( 4)
4517.06
|
3.43618
NLJOIN
( 5)
4517.06
/ \
6.87235 0.5
FETCH FILTER
( 6) ( 8)
556.13 3547.89
/ \ |
171.809 5.10504e+06 1
IXSCAN Table: GRPBY
( 7) DB2INST1 ( 9)
93.3323 ACCT01 3547.89
| |
5.10504e+06 0.531085
Index: IXSCAN
DB2ADMIN (10)
ACCT01 3547.89
|
5.10504e+06
Index:
DB2ADMIN
ACCT01
由上面的Tree可以看到,最大的cost來自Index Scan,判斷可能是因為資料太多而導致,再仔細檢視一下這個耗時的SQL,發現它是由一個主要的Select join Sub-Select的結果組成,且主Select與Sub-Select都是指向同一個Table。看到標示紅色部分的SQL的寫法有點奇怪T1.VOH_NO其實就是VOH_NO;而T1.BRANCH_NO其實就是BRANCH_NO,因為ACCT01這個表格有很多record,如果有500萬筆的record的話,這樣寫就等於做500萬筆對500萬筆資料的比對,難怪會消耗大量的CPU time。
SELECT CURR_NO, DEPT_NO, GL_ACCT_NO, SUM(CASH_DR_NO_OF_VOH)CSHDRVOH,
SUM(TRANSFER_DR_NO_OF_VOH)TRNDRVOH, SUM(CASH_CR_NO_OF_VOH)
CSHCRVOH, SUM(TRANSFER_CR_NO_OF_VOH)TRNCRVOH
FROM ACCT01 T1
WHERE VOH_SEQ_NO =
(SELECT MIN(VOH_SEQ_NO)
FROM ACCT01
WHERE T1.VOH_NO =VOH_NO AND T1.BRANCH_NO =BRANCH_NO)AND VOH_DATE
='20090518' AND ORG_VOH_DATE= '20090518' AND (VOH_SPLIT_NO = '00' OR VOH_SPLIT_NO = '99')
AND BRANCH_NO ='9964'
GROUP BY CURR_NO, DEPT_NO, GL_ACCT_NO
經過與程式開發人員確認後,原本的Sub-select還是有其意義,因此需保留。幸好,他想到在該Sub-select中加入另一個條件,縮小資料的範圍,上述的Sub-select修改如下
SELECT MIN(VOH_SEQ_NO) FROM ACCT01
WHERE T1.VOH_NO = VOH_NO AND VOH_DATE = '20090518'
AND T1.BRANCH_NO = BRANCH_NO
結果重新執行這個SQL,就發現CPU time大幅縮小
Total execution time (sec.microsec)= 0.051433
Total user cpu time (sec.microsec) = 0.037532
Total system cpu time (sec.microsec)= 0.000000
Total statistic fabrication time (milliseconds) = 0
Total synchronous runstats time (milliseconds) = 0
Statement text = SELECT CURR_NO, DEPT_NO, GL_ACCT_NO, SUM(CASH_DR_NO_OF_VOH)CSHDRVOH,
SUM(TRANSFER_DR_NO_OF_VOH)TRNDRVOH, SUM(CASH_CR_NO_OF_VOH)
CSHCRVOH, SUM(TRANSFER_CR_NO_OF_VOH)TRNCRVOH
FROM ACCT01 T1
WHERE VOH_SEQ_NO =
(SELECT MIN(VOH_SEQ_NO) FROM ACCT01
WHERE T1.VOH_NO = VOH_NO AND VOH_DATE = '20090518'
AND T1.BRANCH_NO = BRANCH_NO)AND VOH_DATE
='20090518' AND ORG_VOH_DATE= '20090518' AND (VOH_SPLIT_NO = '00' OR VOH_SPLIT_NO = '99')
AND BRANCH_NO ='9964'
GROUP BY CURR_NO, DEPT_NO, GL_ACCT_NO
接著再回到50個人來測試這個交易,發現執行時間有所縮減,由原本的32分鐘,縮減至27分鐘,這樣的縮減還是無法令人滿意…
再做一次 Dynamic SQL statement 的 Snapshot,發現大多數的SQL statement的執行時間都在3秒以內,只是CPU time 只要0.07秒! 這應該是執行時間很久的原因吧!
Total execution time (sec.microsec)= 2.891382
Total user cpu time (sec.microsec) = 0.076323
Total system cpu time (sec.microsec)= 0.000000
Total statistic fabrication time (milliseconds) = 0
Total synchronous runstats time (milliseconds) = 0
Statement text = UPDATE ACCT04 SET CASH_DR_AMT = 0, TRANSFER_DR_AMT = 0, CASH_CR_AMT = 0, TRANSFER_CR_AMT = 0, CASH_DR_NO_OF_VOH = 0, TRANSFER_DR_NO_OF_VOH = 0, CASH_CR_NO_OF_VOH = 0, TRANSFER_CR_NO_OF_VOH = 0, DR_ACCT_CODE_CURR_BAL = DR_ACCT_CODE_PREV_BAL, CR_ACCT_CODE_CURR_BAL = CR_ACCT_CODE_PREV_BAL WHERE TX_DATE = '20090518' AND BRANCH_NO = '9963'
檢視一下ACCT04這個表格,它足足佔用了5G的空間,而且這個交易執行非常多次這個SQL,如果時間都等在從這個表格找資料、更新資料,會大幅拖慢交易執行速度。為了避免這個問題,這裡考慮把這個表格放在獨立的 TABLE SPACE中,然後assign給他獨立的Buffer Pool。
以下為建立這個獨立的Buffer Pool的指令
CREATE BUFFERPOOL ACCT04 IMMEDIATE SIZE 100000 PAGESIZE 4 K ;
再來是建立這個獨立的Table Space的指令,選用上面建立的獨立Buffer Pool
CREATE LARGE TABLESPACE ACCT04_TABSPACE PAGESIZE 4 K MANAGED BY AUTOMATIC STORAGE EXTENTSIZE 64 OVERHEAD 10.5 PREFETCHSIZE 64 TRANSFERRATE 0.14 BUFFERPOOL ACCT04 ;
最後再把ACCT04這個表格建在新建的 TABLE SPACE上,再執行50人的case時,就發現時間可由原本的 32分鐘縮減到13分鐘,且DB2的 CPU utilization也不會在100%使用率持續太久(如下圖),Tuning暫告完成