2009年12月12日 星期六

DB2 Database Performance Tuning實例

目前進行一個應用系統的效能測試,所測試的交易含有大量的資料庫Select/Insert/Update的動作,執行完一次交易大概要執行大約3、4千行的 Dynamic SQL語句。測試結果發現效能非常地差,同時50個使用者執行該交易,就需要32分鐘才能將交易做完,利用nmon來監控其CPU使用率發現這30多分鐘的CPU都是滿載使用。

image

因為這樣的效能實在不能接受,因此開始進行DB2及應用程式的效能調整,因為是第一次有機會做這樣的測試,所以一切就在邊作邊學的過程進行,幸好DB2 Infocenter寫的都還滿詳細的,讓我能一步步達到我的目的。接下來就分享整個Tunning的過程。

首先,要找出使應用程式DB使用效能的原因,我們必需啟用DB2的 monitor switch收及一些統計資訊,輸入 DB2 GET MONITOR SWITCHES 指令可以得到目前這些switche的狀態。

[db2inst1@db01 source_db]$ db2 get monitor switches

            Monitor Recording Switches

Switch list for db partition number 0
Buffer Pool Activity Information  (BUFFERPOOL) = OFF
Lock Information                        (LOCK) = OFF
Sorting Information                     (SORT) = OFF
SQL Statement Information          (STATEMENT) = OFF
Table Activity Information             (TABLE) = OFF
Take Timestamp Information         (TIMESTAMP) = ON  12/12/2009 15:13:18.256534
Unit of Work Information                 (UOW) = OFF

如上面所示,目前除了TIMESTAMP這個預設會打開的switch外,其它都是關閉的狀態,要啟動任一個 switch只需輸入 db2 UPDATE MONITOR SWITCHES USING XXX ON,比方說若要啟用LOCK這個switch的話,就輸入db2 UPDATE MONITOR SWITCHES USING LOCK ON,結果如下

[db2inst1@db01 source_db]$ db2 update monitor switches using lock on
DB20000I  The UPDATE MONITOR SWITCHES command completed successfully.

回到DB2的效能檢視,因為DB2的使用率已達到100%,因此暫不朝向有太多lock的方向思考,且db2 v9版以後,已經有很多Database Configuration參數,如heap size、bufferpool等都可以動態調整了,也不朝向Database Configuration的方向思考。排除上述的因子後,決定直接看 SQL執行的效能如何。

要查看SQL執行效能,首先要把 STATEMENT 這個 Monitor Switch打開,如下

[db2inst1@db01 source_db]$ db2 update monitor switches using statement on
DB20000I  The UPDATE MONITOR SWITCHES command completed successfully.

接著模擬1個使用者,執行所要測試的交易,交易執行完後,使用 GET SNAPSHOT FOR DYNAMIC SQL ON    database_name  的指令,收集執行一次交易,所有會執行的dynamic SQL統計資訊。將其導向一個檔案

[db2inst1@db01 source_db]$ db2 get snapshot for dynamic sql on test_db > executed_sql

檢視該檔案,可發現每一筆錄到的dynamic SQL的Snapshot的格式如下,在這裡會用到的是 Total Execution Time、Total CPU Time及 SQL statement這三個欄位的資訊(用藍色標出的部分)

Number of executions               = 1
Number of compilations             = 1
Worst preparation time (ms)        = 7
Best preparation time (ms)         = 7
Internal rows deleted              = 0
Internal rows inserted             = 0
Rows read                          = 1590
Internal rows updated              = 0
Rows written                       = 0
Statement sorts                    = 0
Statement sort overflows           = 0
Total sort time                    = 0
Buffer pool data logical reads     = Not Collected
Buffer pool data physical reads    = Not Collected
Buffer pool temporary data logical reads   = Not Collected
Buffer pool temporary data physical reads  = Not Collected
Buffer pool index logical reads    = Not Collected
Buffer pool index physical reads   = Not Collected
Buffer pool temporary index logical reads  = Not Collected
Buffer pool temporary index physical reads = Not Collected
Buffer pool xda logical reads      = Not Collected
Buffer pool xda physical reads     = Not Collected
Buffer pool temporary xda logical reads    = Not Collected
Buffer pool temporary xda physical reads   = Not Collected
Total execution time (sec.microsec)= 25.435582
Total user cpu time (sec.microsec) = 25.377444
Total system cpu time (sec.microsec)= 0.000000
Total statistic fabrication time (milliseconds) = 0
Total synchronous runstats time  (milliseconds) = 0
 
Statement text                     = SELECT CURR_NO,   DEPT_NO,        GL_ACCT_NO,     SUM(CASH_DR_NO_OF_VOH) CSHDRVOH,        SUM(TRANSFER_DR_NO_OF_VOH) TRNDRVOH,    SUM(CASH_CR_NO_OF_VOH) CSHCRVOH,        SUM(TRANSFER_CR_NO_OF_VOH) TRNCRVOH   FROM ACCT01  T1  WHERE VOH_SEQ_NO = (SELECT MIN(VOH_SEQ_NO) FROM ACCT01                   WHERE T1.VOH_NO = VOH_NO                        AND T1.BRANCH_NO = BRANCH_NO)   AND VOH_DATE = '20090518'       AND VOH_DATE = ORG_VOH_DATE     AND (VOH_SPLIT_NO = '00' OR VOH_SPLIT_NO = '99')        AND BRANCH_NO = '9964' GROUP BY CURR_NO, DEPT_NO, GL_ACCT_NO

因為所錄到的SQL數量非常龐大,無法用肉眼找出那些SQL耗掉最多的CPU time,因此在這裡寫了一個簡單的小程式來過濾這些record。程式內容如下(下列程式會挑出所有執行時間大於0.5秒的SQL,若要改成其它時間,只需在藍色該行的if 敘述中改為其它值

#!/usr/bin/perl
undef $/;
my $data = <>;
my $long_execution_sql_count=0;
while($data=~/(Total execution time \(sec.microsec\)= (.+?)\n.+?\n.+?\n.+?\n.+?\n.+?\n)/g)
{
        if ($2>0.5)          #這裡使用0.5會挑出所有執行時間大於0.5秒的SQL。
        {
                print $1,"\n";
                $long_execution_sql_count++;
        }
}

print "Total Long Execution SQLs: ",$long_execution_sql_count;

使用此小程式來過濾Dynamic SQL snapshot 所錄到的結果,可發現這個應用程式最耗時的SQL語句如下:

 Total execution time (sec.microsec)= 27.081611 
Total user cpu time (sec.microsec) = 27.067162
 
Total system cpu time (sec.microsec)= 0.000000
Total statistic fabrication time (milliseconds) = 0
Total synchronous runstats time  (milliseconds) = 0
Statement text                     = SELECT CURR_NO, DEPT_NO, GL_ACCT_NO, SUM(CASH_DR_NO_OF_VOH)CSHDRVOH,
          SUM(TRANSFER_DR_NO_OF_VOH)TRNDRVOH, SUM(CASH_CR_NO_OF_VOH)
          CSHCRVOH, SUM(TRANSFER_CR_NO_OF_VOH)TRNCRVOH
  FROM ACCT01 T1
  WHERE VOH_SEQ_NO =
     (SELECT MIN(VOH_SEQ_NO)
     FROM ACCT01
     WHERE T1.VOH_NO =VOH_NO AND T1.BRANCH_NO =BRANCH_NO)AND VOH_DATE
          ='20090518' AND ORG_VOH_DATE= '20090518' AND (VOH_SPLIT_NO = '00' OR VOH_SPLIT_NO = '99')
          AND BRANCH_NO ='9964'
  GROUP BY CURR_NO, DEPT_NO, GL_ACCT_NO

是的,只單單執行一次這個SQL就要花費 27秒,很顯然地,它就是要被tune的SQL。接下來使用db2expln這個命令來Explain DB2的Optimizer倒底是如何來執行這個SQL的,下面範例會把explain的結果,輸出到 sql_explain這個檔案中。先把上述要explain的SQL語句,寫入input_sql這個檔案中,執行下列指令

[db2inst1@db01 ~]$ db2expln -database test_db -stmtfile input_sql -terminator ';' -o sql_explain -graph

DB2 Universal Database Version 9.5, 5622-044 (c) Copyright IBM Corp. 1991, 2007
Licensed Material - Program Property of IBM
IBM DB2 Universal Database SQL and XQUERY Explain Tool

Output is available in "sql_explain".

接下來檢視sql_explain這個output檔案,以下擷取 Optimizer Plan這個Tree的結果,可以看到這個SQL已經使用Index Search了,所以應該也不是沒建 index的問題。

Rows  
                   Operator
                     (ID)  
                     Cost  
                   1.03085
                   RETURN 
                    ( 1)  
                   4517.06
                     |    
                   1.03085
                    GRPBY 
                    ( 2)  
                   4517.06
                     |    
                   1.03085
                   TBSCAN 
                    ( 3)  
                   4517.06
                     |    
                   1.03085
                    SORT  
                    ( 4)  
                   4517.06
                     |    
                   3.43618
                   NLJOIN 
                    ( 5)  
                   4517.06
                  /       \
          6.87235            0.5  
           FETCH           FILTER 
           ( 6)             ( 8)  
          556.13           3547.89
         /       \           |    
  171.809    5.10504e+06      1   
  IXSCAN     Table:         GRPBY 
   ( 7)      DB2INST1       ( 9)  
  93.3323    ACCT01        3547.89
    |                        |    
5.10504e+06               0.531085
Index:                     IXSCAN 
DB2ADMIN                    (10)  
ACCT01                    3547.89 
                            |      
                        5.10504e+06
                        Index:     
                        DB2ADMIN   
                        ACCT01
     

由上面的Tree可以看到,最大的cost來自Index Scan,判斷可能是因為資料太多而導致,再仔細檢視一下這個耗時的SQL,發現它是由一個主要的Select join Sub-Select的結果組成,且主Select與Sub-Select都是指向同一個Table。看到標示紅色部分的SQL的寫法有點奇怪T1.VOH_NO其實就是VOH_NO;而T1.BRANCH_NO其實就是BRANCH_NO,因為ACCT01這個表格有很多record,如果有500萬筆的record的話,這樣寫就等於做500萬筆對500萬筆資料的比對,難怪會消耗大量的CPU time。

SELECT CURR_NO, DEPT_NO, GL_ACCT_NO, SUM(CASH_DR_NO_OF_VOH)CSHDRVOH,
          SUM(TRANSFER_DR_NO_OF_VOH)TRNDRVOH, SUM(CASH_CR_NO_OF_VOH)
          CSHCRVOH, SUM(TRANSFER_CR_NO_OF_VOH)TRNCRVOH
  FROM ACCT01 T1
  WHERE VOH_SEQ_NO =
     (SELECT MIN(VOH_SEQ_NO)
     FROM ACCT01
     WHERE T1.VOH_NO =VOH_NO AND T1.BRANCH_NO =BRANCH_NO
)AND VOH_DATE
          ='20090518' AND ORG_VOH_DATE= '20090518' AND (VOH_SPLIT_NO = '00' OR VOH_SPLIT_NO = '99')
          AND BRANCH_NO ='9964'
  GROUP BY CURR_NO, DEPT_NO, GL_ACCT_NO

經過與程式開發人員確認後,原本的Sub-select還是有其意義,因此需保留。幸好,他想到在該Sub-select中加入另一個條件,縮小資料的範圍,上述的Sub-select修改如下

SELECT MIN(VOH_SEQ_NO) FROM ACCT01
            WHERE T1.VOH_NO = VOH_NO AND VOH_DATE = '20090518'
             AND T1.BRANCH_NO = BRANCH_NO

結果重新執行這個SQL,就發現CPU time大幅縮小

Total execution time (sec.microsec)= 0.051433
Total user cpu time (sec.microsec) = 0.037532

Total system cpu time (sec.microsec)= 0.000000
Total statistic fabrication time (milliseconds) = 0
Total synchronous runstats time  (milliseconds) = 0
Statement text                     = SELECT CURR_NO, DEPT_NO, GL_ACCT_NO, SUM(CASH_DR_NO_OF_VOH)CSHDRVOH,
          SUM(TRANSFER_DR_NO_OF_VOH)TRNDRVOH, SUM(CASH_CR_NO_OF_VOH)
          CSHCRVOH, SUM(TRANSFER_CR_NO_OF_VOH)TRNCRVOH
  FROM ACCT01 T1
  WHERE VOH_SEQ_NO =
     (SELECT MIN(VOH_SEQ_NO) FROM ACCT01
            WHERE T1.VOH_NO = VOH_NO AND VOH_DATE = '20090518'
             AND T1.BRANCH_NO = BRANCH_NO)AND VOH_DATE
          ='20090518' AND ORG_VOH_DATE= '20090518' AND (VOH_SPLIT_NO = '00' OR VOH_SPLIT_NO = '99')
          AND BRANCH_NO ='9964'
  GROUP BY CURR_NO, DEPT_NO, GL_ACCT_NO

接著再回到50個人來測試這個交易,發現執行時間有所縮減,由原本的32分鐘,縮減至27分鐘,這樣的縮減還是無法令人滿意…

再做一次 Dynamic SQL statement 的 Snapshot,發現大多數的SQL statement的執行時間都在3秒以內,只是CPU time 只要0.07秒!  這應該是執行時間很久的原因吧!

Total execution time (sec.microsec)= 2.891382
Total user cpu time (sec.microsec) = 0.076323

Total system cpu time (sec.microsec)= 0.000000
Total statistic fabrication time (milliseconds) = 0
Total synchronous runstats time  (milliseconds) = 0
Statement text                     = UPDATE ACCT04   SET CASH_DR_AMT = 0, TRANSFER_DR_AMT = 0, CASH_CR_AMT = 0,        TRANSFER_CR_AMT = 0, CASH_DR_NO_OF_VOH = 0,     TRANSFER_DR_NO_OF_VOH = 0, CASH_CR_NO_OF_VOH = 0,       TRANSFER_CR_NO_OF_VOH = 0,      DR_ACCT_CODE_CURR_BAL = DR_ACCT_CODE_PREV_BAL,          CR_ACCT_CODE_CURR_BAL = CR_ACCT_CODE_PREV_BAL  WHERE TX_DATE =  '20090518'      AND BRANCH_NO = '9963'

檢視一下ACCT04這個表格,它足足佔用了5G的空間,而且這個交易執行非常多次這個SQL,如果時間都等在從這個表格找資料、更新資料,會大幅拖慢交易執行速度。為了避免這個問題,這裡考慮把這個表格放在獨立的 TABLE SPACE中,然後assign給他獨立的Buffer Pool。

以下為建立這個獨立的Buffer Pool的指令
CREATE BUFFERPOOL ACCT04 IMMEDIATE  SIZE 100000 PAGESIZE 4 K ;
再來是建立這個獨立的Table Space的指令,選用上面建立的獨立Buffer Pool
CREATE  LARGE  TABLESPACE ACCT04_TABSPACE PAGESIZE 4 K  MANAGED BY AUTOMATIC STORAGE EXTENTSIZE 64 OVERHEAD 10.5 PREFETCHSIZE 64 TRANSFERRATE 0.14 BUFFERPOOL  ACCT04 ;
最後再把ACCT04這個表格建在新建的 TABLE SPACE上,再執行50人的case時,就發現時間可由原本的 32分鐘縮減到13分鐘,且DB2的 CPU utilization也不會在100%使用率持續太久(如下圖),Tuning暫告完成

image

沒有留言:

張貼留言