Monday, September 22, 2014

Issues with Bind Variables and Bind Peeking

I would explain bellow that how bind peeking played a disastrous role in one of my databases. For usage of bind variables, a good post by Arup Nanda is available here

From one of my OLTP databases (10.2.0.4), suddenly I started receiving OEM alerts for “run queue length” as follows. (My server has total 24 CPUs)

------------------------------------------------
Target Name=<TARGET_NAME>
Target type=Host
Host=<HOST_NAME>
Occurred At=11-Sep-2014 06:04:13 SGT
Message=CPU Load is 27.08, crossed warning (18) or critical (24) threshold.
Metric=Run Queue Length (5 minute average)
Metric value=27.08
Severity=Critical
Acknowledged=No
Notification Rule Name=Host Availability and Critical States
Notification Rule Owner=SYSMAN
------------------------------------------------

While further investigating the issue, I found that most of the time was consumed by a query which was part of every transaction (running at that particular time) executed by each session. This meant that each session was executing this query and therefore overall most of the resources consumption was only by this single query.  In OEM (or you can query v$sql_plan using SQL_ID), I could see multiple plans listed for this SQL and I suspected that perhaps this current plan caused this query to take longer time to finish, and since each session had to execute this query, session would not return until this query completes. Since each session was executing this SQL, the session was not leaving the CPU and new sessions were waiting for the CPU to be released and eventually raising “Run Queue Length” metrics value.

Since my SQL was using bind variables, immediate thing which came into my mind was that probably this SQL was aged out of shared pool and when it was reloaded, it loaded a new execution plan (the bad one) into the shared pool (plan could change because of bind peeking done during plan generation of SQL). So, I issued “alter system flush shared pool” (not a very good idea to invalidate all cursors), to invalidate the cursors in the shared pool and let it reload all cursors again (including this SQL). This thing worked; as I saw a new plan for this SQL and “run queue length” immediately came down to less than 10.

From dba_hist_sql_plan, I found that there were total 6 different execution plans for this SQL and then I used awrsqrpt.sql script to fetch the execution details of this SQL against each of this plan from AWR repository and following were the interesting details I extracted from SQL AWR reports

Plan with hash value 465940006, Average Physical Reads per execution => 319
Plan with hash value 1122690139, Average Physical Reads per execution => 36
Plan with hash value 976138646, Average Physical Reads per execution => 1285
Plan with hash value 2677530476, Average Physical Reads per execution => 144
Plan with hash value 3841102669, Average Physical Reads per execution => 37
Plan with hash value 2920758856, Average Physical Reads per execution => 155

You can see that plan 976138646 is doing a lot of physical reads and this was actually the culprit at this time as it was slowing down the execution of this particular SQL. After I flushed the shared pool, this plan (976138646) was gone and next plan loaded was 1122690139 which made the things better for me. The other better plan here is 3841102669 which was also in use in the past.

Solution to the problem
Invalidating the cursors can help to reload the new and better plan, but this is not the permanent solution. Starting 11g, adaptive cursor sharing feature is very helpful as Oracle may select a different execution plan based on the values in bind variables during different executions of same SQL.

In 10g, oracle profiles and starting from 11g, sql plan management features could be used to attach the plans with particular SQLs to avoid this kind of situation.

No comments:

Popular Posts - All Times