Wednesday, February 01, 2012

Latch: row cache objects, session_cached_cursors and a database link

What could possibly be common about the above things?

Sometime ago I was involved in tuning a large scale production system which experienced quite a lot of latch contention. In fact the contention was bad enough as to render the entire system unusable during peak hours.

A typical AWR report during times with moderate workload looked like this:

Event                      Waits           Time(s) Avg wait (ms)   % DB time  Wait Class
DB CPU                     21,805          36.95  
latch: row cache objects   14,524,462      11,552  1               19.57      Concurrency
db file sequential read    2,697,778       8,988   3               15.23      User I/O
gc current block 3-way     4,202,356       3,599   1               6.10       Cluster
gc current block 2-way     3,670,293       2,330   1               3.95       Cluster

(in reality it was a 7-node RAC cluster but it is not really relevant to our subject).

It is not hard to spot latch: row cache objects on a second place consuming almost 20% of DB time. This event was rapidly escalating whether load increased quickly bringing the entire cluster into unusable state as far as any front-end application were concerned. The total number of gets during an hour (the interval with which this particular AWR report had been made) totaled 1,281,167,103 which averages a whopping 355,879 gets per second!

The first step was to take a look at the Dictionary Cache statistics and see if there were any obvious deviations:
Cache                   Get Requests    Pct Miss
dc_awr_control          72              9.72
dc_database_links       124,452         0.05
dc_files                46,900          17.14
dc_global_oids          555,599         0.07
dc_histogram_data       8,962,576       1.01
dc_histogram_defs       412,885,094     0.02
dc_object_grants        52,475          1.13
dc_objects              2,860,222       0.64
dc_profiles             61,189          0.01
dc_rollback_segments    392,885         0.00
dc_segments             657,653         7.72
dc_sequences            1,324           67.67
dc_table_scns           3,206           3.31
dc_tablespaces          1,253,322       0.06
dc_users                5,475,824       0.02
global database name    65,092          0.01
kqlsubheap_object       10,910          0.46
outstanding_alerts      721             95.98

dc_histogram_defs really stands out but so far provides no direct clues. Though the entire dictionary cache seems to be quite busy. Let's take a look at instance activity statistics:
                  Per Second   Per Transaction  Per Exec        Per Call
DB Time(s):       16.4         0.5              0.00            0.00
DB CPU(s):        6.1          0.2              0.00            0.00
Redo size:        87,543.5     2,398.6
Logical reads:    336,555.7    9,221.2
Block changes:    449.3        12.3
Physical reads:   1,018.3      27.9
Physical writes:  36.8         1.0
User calls:       41,250.4     1,130.2
Parses:           6,264.4      171.6
Hard parses:      27.1         0.7
W/A MB processed: 33,769,202.4 925,233.0
Logons:           18.3         0.5
Executes:         14,230.5     389.9
Rollbacks:        3.4          0.1
Transactions:     36.5

A couple of things to notice:
  1. The number of executions is not that high (at least for an IBM box with 56 CPU cores, that is).
  2. Execute to Parse ratio is not particularly good nor bad (6,264.4 parses per second compared to 14,230.5 executions).
  3. None of the above seems to be enough to justify 355,879 latch: row cache objects gets per second, even when you consider 27.1 hard parses/sec.

There had to be more going on. Looking at the SQL ordered by Parse Calls section of the report revealed that all the top parsing queries were coming from a database link. Which prompted me to ask additional questions...

It turned out that the system had a somewhat weird architecture. A 7-node RAC cluster served as a back end to a numerous front-end Oracle databases which were essentially acting as PL/SQL-engines, accessing all the data through the database link while dong all the processing. The architectural feasibility of such solution is debatable (to say the least) but let's not steer away from the main topic of our discussion.

I geared to find out whether there was some special case which could cause latch: row cache objects to shoot through the roof when we deal with a lot of queries executed through a database link. After a number of experiments I eventually found a cause which explains the title of this blog post.

Take a look at the following example:

SQL> declare
  2   l_gets number;
  3  begin
  4   select gets into l_gets from v$latch@dblink where name='row cache objects';
  6   for i in 1 .. 1000
  7   loop
  8    for cur in (select null from dual@dblink) loop null; end loop;
  9   end loop;
 11   select sum(gets)-l_gets into l_gets from v$latch@dblink where name='row cache objects';
 12   dbms_output.put_line(to_char(l_gets));
 13  end;
 14  /
PL/SQL procedure successfully completed

Accessing a remote table 1000 times through a database link only caused 79 row cache objects gets on the destination database. However, exactly the same example can produce quite a different result:

SQL> alter session set session_cached_cursors=0;
Session altered
SQL> declare
  2   l_gets number;
  3  begin
  4   select gets into l_gets from v$latch@dblink where name='row cache objects';
  6   for i in 1 .. 1000
  7   loop
  8    for cur in (select null from dual@dblink) loop null; end loop;
  9   end loop;
 11   select sum(gets)-l_gets into l_gets from v$latch@dblink where name='row cache objects';
 12   dbms_output.put_line(to_char(l_gets));
 13  end;
 14  /
PL/SQL procedure successfully completed

Now we jumped from 79 gets to 3082 which is a 39x increase! What's really astonishing is the fact that it's a cursor cache miss on the source which wreaks havoc on the destination. Evidently, the solution to the problem had nothing to do with the database where the problem was observed. The "PL/SQL-databases" had an insufficient setting for session_cached_cursors and once that was corrected the issue disappeared.

It is somewhat ironic that that problem would have never appears had the data and PL/SQL be allowed to be kept in the same database (where they belong).


  1. srivenu kadiyala6:46 AM

    Did you happen to sql trace a session which is hung during the issue time? It would have shown the recursive sql for populating the row cache.

    Also any stats gathering running on big tables with lot of partitions during issue time? Stats gathering on tables/indexes with large partitions tends to first load all histogram data into row cache before their cleanout. This could cause contention for loading of histograms for other tables.

  2. Hi Alex
    Good post. How did you determine that you had to set session_cached_cursors to 0 to reproduce the issue.
    Did you compare the init.ora of source and dest and found this and tried by making them the same

    Thank you

  3. Kumar,

    in reality the source systems had session_cached_cursor set to a default value (which in 10g happens to be 50) but I didn't go by comparing initialization parameters anyway. What I did know is that remote execution were hitting row cache objects latches like there is no tomorrow but I did not know why. It took me sometime of what essentially was trial and error in order to discover the cause. The default value of 50 just happened to be not enough due to the vast amount of different queries submitted through the database links. I set session_cached_cursors=0 in my blog post for the sake of demonstrating the issue.

  4. srivenu kadiyala2:18 PM

    Alex, Forgot one more point, is it a HP server?

  5. Srivenu,

    you will find the answer in the blog post, I mention that it's an IBM box. But the issue is not platform-related and have nothing to do with the vendor.

  6. A very interesting post, I'll be following your blog from now on. Keep up the good work!

    Best regards,