Monday, March 26, 2018

Finding IO statistics in Linux Environment for Slow Disks

I have written two articles (AWR 1, AWR 2) about disk latency and how to read AWR reports to investigate IO slowness. In this article I will explain how to check disk IO performance on Linux systems. We will check how the disks that hold our datafiles and redo log files are performing. These disks could be simple Linux mount points or ASM disks. In the case of ASM, we first need to find the disks that are part of the ASM diskgroups so that we can check the performance of those disks. For example, the following shows how to find the disks that are part of an ASM diskgroup.
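Once the device names are known, per-device latency can be estimated from the kernel counters in /proc/diskstats. The following is only a minimal Python sketch of that idea, not the method used in the full article: the helper names (parse_diskstats, average_wait_ms, sample_device) are hypothetical, and it assumes the standard /proc/diskstats field layout, where the fourth and eighth fields are completed reads and writes and the seventh and eleventh fields are the milliseconds spent on them.

```python
import time
from typing import Dict, NamedTuple


class DiskCounters(NamedTuple):
    reads: int      # reads completed since boot
    read_ms: int    # milliseconds spent reading since boot
    writes: int     # writes completed since boot
    write_ms: int   # milliseconds spent writing since boot


def parse_diskstats(text: str) -> Dict[str, DiskCounters]:
    """Parse the contents of /proc/diskstats into per-device counters."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 11:
            continue  # skip blank or unexpectedly short lines
        counters[fields[2]] = DiskCounters(
            reads=int(fields[3]),
            read_ms=int(fields[6]),
            writes=int(fields[7]),
            write_ms=int(fields[10]),
        )
    return counters


def average_wait_ms(before: DiskCounters, after: DiskCounters) -> float:
    """Average milliseconds per completed IO between two samples."""
    ios = (after.reads - before.reads) + (after.writes - before.writes)
    ms = (after.read_ms - before.read_ms) + (after.write_ms - before.write_ms)
    return ms / ios if ios else 0.0


def sample_device(device: str, interval_s: float = 5.0) -> float:
    """Take two samples of /proc/diskstats and return the average wait."""
    with open("/proc/diskstats") as f:
        before = parse_diskstats(f.read())[device]
    time.sleep(interval_s)
    with open("/proc/diskstats") as f:
        after = parse_diskstats(f.read())[device]
    return average_wait_ms(before, after)
```

Calling sample_device("sda") on a Linux host would return the average wait over a five-second window for a device named sda (the device name is an assumption here). The figure is comparable in spirit to the await column reported by iostat -x, which is also derived from deltas of these counters.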

Monday, March 19, 2018

Reading and Understanding AWR Report for IO or Disk latency - 2

This is the second article about investigating IO latency issues using AWR. The first article can be found here. In this article I will further explain how to check IO latencies at the OS level in Linux.

Log file sync

We had a production server running on a virtual machine (VMware), and after a downtime we started receiving complaints about a slow database. The AWR report showed that the “log file sync” wait event, which falls under the COMMIT wait class, was at the top, and the database was spending more than 30% of its time on log file sync waits. The log file sync wait event can be observed in a very busy OLTP database, but it should not consume as much time as we were seeing, and it should appear at the bottom of the list of top wait events.

Monday, March 12, 2018

Reading and Understanding AWR Report for IO or Disk latency - 1

Recently I performed a failover of my Oracle database (running on Linux) to my standby database, and after the switchover the application team started complaining about extreme slowness. I was using OEM Cloud Control, and the graph was showing high waits for “free buffer waits”, while the alert log started showing “Checkpoint not complete”. Since I had never seen these waits on my previous primary server (now the standby), the first thing that came to my mind was that the disks on the standby server (now the primary) were probably very slow, because the hardware of my servers was very old. The servers also had internal disks (not SAN or NAS). I generated an AWR report for a period when the database was running fine without any performance issue, and then another report based on the latest snapshots, to see what was going wrong with the IO.

Tuesday, March 6, 2018

ORA-00742: Log read detects lost write in thread %d sequence %d block %

During real-time apply on one of my physical standby RAC databases, the managed recovery process crashed with this error message. The following is the entry in the alert log file.

CORRUPTION DETECTED: In redo blocks starting at block 169592count 142 for thread 4 sequence 157

Sat Jul 02 19:12:25 2016

MRP0: Background Media Recovery terminated with error 742