Performance problem in Advanced Replication caused by an excessively high HWM on an mlog$ table

Source: Internet  2008-06-17 05:33:28

This forum post (from the CCIDnet technical community) walks through a performance problem in Advanced Replication caused by an excessively high high-water mark (HWM) on an mlog$ table. Details follow:

After a system upgrade, the transaction-side database was reported to be extremely slow, and users were complaining in large numbers.

On-site inspection showed CPU usage at 80% (normally 10%–20% for this system), and wio (wait I/O) was also very high, indicating heavy disk activity. A while later, the site sent back a STATSPACK report:

STATSPACK report for

DB Name DB Id Instance Inst Num Release Cluster Host

------------ ----------- ------------ -------- ----------- ------- ------------

OLTPDB 3781951398 OLTPDB 1 9.2.0.5.0 NO jnaip21

Snap Id Snap Time Sessions Curs/Sess Comment

------- ------------------ -------- --------- -------------------

Begin Snap: 3 27-Sep-04 13:15:58 32 83.4

End Snap: 4 27-Sep-04 14:10:36 37 83.4

Elapsed: 54.63 (mins)

Cache Sizes (end)

~~~~~~~~~~~~~~~~~

Buffer Cache: 720M Std Block Size: 8K

Shared Pool Size: 112M Log Buffer: 512K

Load Profile

~~~~~~~~~~~~ Per Second Per Transaction

--------------- ---------------

Redo size: 11,763.97 7,400.17

Logical reads: 8,251.68 5,190.75

Block changes: 83.17 52.32

Physical reads: 2,488.23 1,565.23

Physical writes: 6.15 3.87

User calls: 411.88 259.09

Parses: 56.57 35.59

Hard parses: 0.29 0.18

Sorts: 14.59 9.18

Logons: 0.12 0.08

Executes: 80.04 50.35

Transactions: 1.59

% Blocks changed per Read: 1.01 Recursive Call %: 28.14

Rollback per transaction %: 0.17 Rows per Sort: 1818.01

Instance Efficiency Percentages (Target 100%)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Buffer Nowait %: 99.99 Redo NoWait %: 100.00

Buffer Hit %: 69.88 In-memory Sort %: 99.99

Library Hit %: 99.78 Soft Parse %: 99.50

Execute to Parse %: 29.32 Latch Hit %: 99.98

Parse CPU to Parse Elapsd %: 96.72 % Non-Parse CPU: 99.02

Shared Pool Statistics Begin End

------ ------

Memory Usage %: 93.60 93.71

% SQL with executions>1: 20.28 19.05

% Memory for SQL w/exec>1: 25.59 25.88

Top 5 Timed Events

~~~~~~~~~~~~~~~~~~ % Total

Event Waits Time (s) Ela Time

-------------------------------------------- ------------ ----------- --------

db file scattered read 599,637 3,886 63.79

CPU time 1,142 18.74

db file sequential read 238,983 629 10.33

enqueue 263 366 6.00

log file sync 5,154 23 .37

-------------------------------------------------------------

Wait Events for DB: OLTPDB Instance: OLTPDB Snaps: 3 -4

-> s - second

-> cs - centisecond - 100th of a second

-> ms - millisecond - 1000th of a second

-> us - microsecond - 1000000th of a second

-> ordered by wait time desc, waits desc (idle events last)

Avg

Total Wait wait Waits

Event Waits Timeouts Time (s) (ms) /txn

---------------------------- ------------ ---------- ---------- ------ --------

db file scattered read 599,637 0 3,886 6 115.1

db file sequential read 238,983 0 629 3 45.9

enqueue 263 4 366 1390 0.1

log file sync 5,154 0 23 4 1.0

log file parallel write 15,143 8,060 20 1 2.9

db file parallel write 1,008 0 12 12 0.2

control file parallel write 1,065 0 4 4 0.2

buffer busy waits 3,195 0 4 1 0.6

... ...

SQL ordered by Gets for DB: OLTPDB Instance: OLTPDB Snaps: 3 -4

-> End Buffer Gets Threshold: 10000

-> Note that resources reported for PL/SQL includes the resources used by

all SQL statements called within the PL/SQL code. As individual SQL

statements are also reported, it is possible and valid for the summed

total % to exceed 100

CPU Elapsd

Buffer Gets Executions Gets per Exec %Total Time (s) Time (s) Hash Value

--------------- ------------ -------------- ------ -------- --------- ----------

6,509,469 1,060 6,141.0 24.1 66.35 64.61 2833840857

Module: JDBC Thin Client

select lib.code, lib.name, lib.info, leaf.leafdirindex, tp.tpname, from t_commend com, t_leafinfo leaf, t_lib lib,t_tpinfo tp where

2,258,385 343 6,584.2 8.3 65.41 1289.82 3307794804

Module: oracle@qdaip12 (TNS V1-V3)

SELECT /*+ */ "A2"."PHONENUMBER","A2"."TGROUPID","A2"."LOOPNO

","A2"."TCODE" FROM "RU"."T_LOOP" "A2", (SELECT /*+ */

DISTINCT "A3"."LOOPNO" "LOOPNO","A3"."TGROUPID" "TGROUPID"

FROM "RU"."MLOG$_T_LOOP" "A3" WHERE "A3"."SNAPTIME$$">:1

AND "A3"."DMLTYPE$$"<>'D') "A1" WHERE "A2"."LOOPNO"="A1"."LOOPNO

2,251,917 343 6,565.4 8.3 66.51 1306.09 1606105459

Module: oracle@qdaip12 (TNS V1-V3)

SELECT /*+ */ DISTINCT "A1"."LOOPNO","A1"."TGROUPID" FROM "RI

NG"."MLOG$_T_LOOP" "A1" WHERE (("A1"."LOOPNO","A1"."TGROU

PID")<>ALL (SELECT "A2"."LOOPNO","A2"."TGROUPID" FROM "RU".

"T_LOOP" "A2" WHERE "A1"."LOOPNO"="A2"."LOOPNO" AND "A1"."TO

NEGROUPID"="A2"."TGROUPID")) AND "A1"."SNAPTIME$$">:1 AND "A1

2,249,895 343 6,559.5 8.3 66.34 1098.20 4039622144

delete from "RU"."MLOG$_T_LOOP" where snaptime$$ <= :1

... ...

Physical Reads Executions Reads per Exec %Total Time (s) Time (s) Hash Value

--------------- ------------ -------------- ------ -------- --------- ----------

2,129,843 343 6,209.5 26.1 66.34 1098.20 4039622144

delete from "RU"."MLOG$_T_LOOP" where snaptime$$ <= :1

2,129,004 343 6,207.0 26.1 65.41 1289.82 3307794804

Module: oracle@qdaip12 (TNS V1-V3)

SELECT /*+ */ "A2"."PHONENUMBER","A2"."TGROUPID","A2"."LOOPNO

","A2"."TCODE" FROM "RU"."T_LOOP" "A2", (SELECT /*+ */

DISTINCT "A3"."LOOPNO" "LOOPNO","A3"."TGROUPID" "TGROUPID"

FROM "RU"."MLOG$_T_LOOP" "A3" WHERE "A3"."SNAPTIME$$">:1

AND "A3"."DMLTYPE$$"<>'D') "A1" WHERE "A2"."LOOPNO"="A1"."LOOPNO

2,123,001 343 6,189.5 26.0 66.51 1306.09 1606105459

Module: oracle@qdaip12 (TNS V1-V3)

SELECT /*+ */ DISTINCT "A1"."LOOPNO","A1"."TGROUPID" FROM "RI

NG"."MLOG$_T_LOOP" "A1" WHERE (("A1"."LOOPNO","A1"."TGROU

PID")<>ALL (SELECT "A2"."LOOPNO","A2"."TGROUPID" FROM "RU".

"T_LOOP" "A2" WHERE "A1"."LOOPNO"="A2"."LOOPNO" AND "A1"."TO

NEGROUPID"="A2"."TGROUPID")) AND "A1"."SNAPTIME$$">:1 AND "A1

1,724,081 278 6,201.7 21.1 54.05 887.05 2153620124

update "RU"."MLOG$_T_LOOP" set snaptime$$ = :1 where snap

time$$ > to_date('2100-01-01:00:00:00','YYYY-MM-DD:HH24:MI:SS')

Note the following in the STATSPACK report: the Buffer Hit ratio is very low, only 69.88%, and among the Top 5 Timed Events, db file scattered read dominates, accounting for 63.79% of elapsed time. Clearly, multiblock scanning of data files is severely degrading the buffer cache, which almost certainly points to full table scans. (Blocks read by a full table scan are not kept in the buffer cache; they are aged out of memory as quickly as possible, which is why full table scans drive the buffer hit ratio down.)

Looking next at the top SQL, we find that most of the top statements involve a peculiar table: MLOG$_T_LOOP. This table is never referenced by the application code, so the statements must be issued internally by the system. So what is this table for?

To separate transaction processing from query workloads, our system uses Advanced Replication to synchronize certain tables from the processing side to the query side. An mlog$_ table is the change log for a replicated table: it records the data changes made to the base table since the last refresh. The refresh job on the query side (the snapshot side), which we run at a 10-second interval, reads the base table's changes from the mlog$_ table and applies them to the corresponding query-side table. Once the processing side detects that every snapshot site has consumed a change record, it deletes that record. Note also that mlog$ tables carry no indexes, so any query against them is necessarily a full table scan. But since the refresh job runs every 10 seconds, and very little data changes within 10 seconds, such full scans should not affect system performance under normal conditions.
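The setup described above can be sketched as follows. This is a simplified, illustrative sketch: the actual site used Oracle Advanced Replication, and names other than T_LOOP / MLOG$_T_LOOP (the schema prefix and the database link `master_link`) are assumptions for illustration.

```sql
-- Processing (master) side: a materialized view log records
-- row-level changes to T_LOOP. Oracle automatically creates the
-- underlying MLOG$_T_LOOP table to hold the change records.
CREATE MATERIALIZED VIEW LOG ON ru.t_loop;

-- Query (snapshot) side: a fast-refreshable snapshot refreshed
-- every 10 seconds. A fast refresh reads only the change records
-- in MLOG$_T_LOOP instead of rescanning all of T_LOOP.
CREATE MATERIALIZED VIEW t_loop
  REFRESH FAST
  START WITH SYSDATE
  NEXT SYSDATE + 10/86400     -- 10-second interval
  AS SELECT * FROM ru.t_loop@master_link;
```

After every snapshot site has fast-refreshed past a change record, the master side purges that record from MLOG$_T_LOOP, so under normal operation the log stays tiny.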

I checked MLOG$_T_LOOP and found that it indeed holds very little data (usually only a few rows), yet queries against it were incurring over 6,200 physical reads each, which is clearly abnormal. This indicates the table's high-water mark is extremely high. (The HWM records the largest number of blocks the table has ever occupied, like the water line a flood leaves behind. Table scans do not stop at the current actual data volume but read up to the HWM; in other words, they scan every block the table has ever used.)
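One way to confirm an inflated HWM is to compare the table's row count against the blocks it occupies below the HWM. A sketch (the symptom it describes matches this case, but the exact figures you would see are illustrative):

```sql
-- Gather fresh statistics so DBA_TABLES reflects reality.
EXEC DBMS_STATS.GATHER_TABLE_STATS('RU', 'MLOG$_T_LOOP');

-- BLOCKS counts blocks below the HWM; NUM_ROWS shows how little
-- data the table actually holds. A handful of rows spread over
-- thousands of blocks means every full scan (and every refresh)
-- reads all of those mostly-empty blocks.
SELECT num_rows, blocks, empty_blocks
  FROM dba_tables
 WHERE owner = 'RU'
   AND table_name = 'MLOG$_T_LOOP';
```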

From the site we learned that during the upgrade, the synchronization refresh jobs had not been stopped; instead the refresh interval had been extended to two hours. Neither database was shut down during the upgrade, but during that window some data maintenance was performed on the processing side, and in particular the table T_LOOP received a large volume of updates.

The problem is now clear. Because T_LOOP was heavily updated during the upgrade while the snapshots were not being refreshed, mlog$_t_loop accumulated two hours' worth of change records, pushing its high-water mark up. After the refresh job returned to its 10-second interval, the amount of data in mlog$_t_loop dropped again, but its high-water mark did not come down: deleting rows never lowers the HWM.

So we arranged for the site to disconnect the client connections on the processing side, stop the refresh jobs on the snapshot side, and perform one manual refresh to synchronize. We then checked the mlog$ tables on the processing side to make sure they were empty, truncated the mlog$ tables, restarted the synchronization jobs, and finally restored the client connections. After a period of observation, the database returned to normal.
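The remediation steps above can be sketched as SQL. This is a hedged outline, not the site's exact script: the bind variable `:refresh_job` (the snapshot-side job number) is an assumption, and the snapshot-side steps run on the query database while the truncate runs on the processing (master) database.

```sql
-- 1. Snapshot side: stop the scheduled refresh job, then refresh
--    once by hand to drain the remaining change records.
EXEC DBMS_JOB.BROKEN(:refresh_job, TRUE);
EXEC DBMS_SNAPSHOT.REFRESH('T_LOOP', 'F');   -- 'F' = fast refresh

-- 2. Master side: confirm the log is empty, then truncate it.
--    TRUNCATE deallocates the table's extents and resets the HWM,
--    which DELETE never does.
SELECT COUNT(*) FROM ru.mlog$_t_loop;        -- should return 0
TRUNCATE TABLE ru.mlog$_t_loop;

-- 3. Snapshot side: re-enable the 10-second refresh job.
EXEC DBMS_JOB.BROKEN(:refresh_job, FALSE, SYSDATE);
```

Truncating only after verifying the log is empty is essential: truncating an mlog$ table that still holds unconsumed change records would silently lose changes and force a complete refresh of the snapshots.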
