Thursday, July 23, 2015

ORA-15410: Disks in disk group do not have equal size

12.1.0.2 ASM introduced a new feature which might prevent you from adding disks of dissimilar sizes to a normal or high redundancy disk group. The relevant MOS note is Doc ID 1938950.1.

Unfortunately I've found that some information in this note requires further clarification.

COMPATIBLE.ASM

The note suggests that the check is only enforced when the Disk Group has the 'COMPATIBLE.ASM'='12.1.0.2' attribute set. This is incorrect and can be easily demonstrated:
SQL> select name, compatibility from v$asm_diskgroup;

NAME                           COMPATIBILITY
------------------------------ --------------------
DATA                           11.2.0.3.0

SQL> alter diskgroup data add disk '/dev/sdc';
alter diskgroup data add disk '/dev/sdc'
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15410: Disks in disk group DATA do not have equal size.
As you can see, my Disk Group has 'COMPATIBLE.ASM'='11.2.0.3.0', which did not prevent ORA-15410 from being raised.

What happens if you have disks of dissimilar sizes already in the Disk Group?

Let's say you've upgraded from a previous release and your normal/high redundancy disk group had disks of dissimilar sizes in it already. What happens when you try to add a disk after the upgrade to 12.1.0.2 ASM? Some of the questions that immediately jump to mind:
  • Will it always raise an error because the existing disks have different sizes, forcing you to make everything the same size first?
  • Will it let you add a disk as long as its size matches one of the disks you already have?
  • Will it let you add a disk with a new size?
Here I have a 12.1.0.2 normal redundancy Disk Group with two disks of dissimilar size:
SQL> select dg.name group_name,
    dg.type,
    dg.compatibility,
    d.name disk_name,
    d.total_mb
  from v$asm_diskgroup dg, v$asm_disk d
  where dg.group_number=d.group_number
  order by d.disk_number;

GROUP_NAME TYPE   COMPATIBILITY        DISK_NAME    TOTAL_MB
---------- ------ -------------------- ---------- ----------
DATA       NORMAL 12.1.0.2.0           DATA_0000        4096
DATA       NORMAL 12.1.0.2.0           DATA_0001        5120
Let's add a third disk of some other size:
SQL> alter diskgroup data add disk '/dev/sdd' size 2g;

Diskgroup altered.

SQL> select dg.name group_name,
    dg.type,
    dg.compatibility,
    d.name disk_name,
    d.total_mb
  from v$asm_diskgroup dg, v$asm_disk d
  where dg.group_number=d.group_number
  order by d.disk_number;

GROUP_NAME TYPE   COMPATIBILITY        DISK_NAME    TOTAL_MB
---------- ------ -------------------- ---------- ----------
DATA       NORMAL 12.1.0.2.0           DATA_0000        4096
DATA       NORMAL 12.1.0.2.0           DATA_0001        5120
DATA       NORMAL 12.1.0.2.0           DATA_0002        2048
As you can see, as long as you have disks of dissimilar sizes already in the disk group, the check appears to be ignored. I can resize the disks as well:
SQL> alter diskgroup data resize disk DATA_0002 size 4g;

Diskgroup altered.

SQL> select dg.name group_name,
    dg.type,
    dg.compatibility,
    d.name disk_name,
    d.total_mb
  from v$asm_diskgroup dg, v$asm_disk d
  where dg.group_number=d.group_number
  order by d.disk_number;

GROUP_NAME TYPE   COMPATIBILITY        DISK_NAME    TOTAL_MB
---------- ------ -------------------- ---------- ----------
DATA       NORMAL 12.1.0.2.0           DATA_0000        4096
DATA       NORMAL 12.1.0.2.0           DATA_0001        5120
DATA       NORMAL 12.1.0.2.0           DATA_0002        4096
Now what happens if I drop DATA_0001, thus making all disks the same size?
SQL> alter diskgroup data drop disk DATA_0001;

Diskgroup altered.

SQL> alter diskgroup data add disk '/dev/sdc';
alter diskgroup data add disk '/dev/sdc'
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15410: Disks in disk group DATA do not have equal size.
It appears that the moment all disks in the Disk Group became the same size, the check was enforced.

Conclusion

Based on the tests I've demonstrated above, the rules for ORA-15410 on 12.1.0.2 ASM appear to be:
  • If all the disks in a Disk Group are of the same size then the check is enforced, regardless of the 'COMPATIBLE.ASM' value.
  • If a Disk Group contains disks of different sizes then the check is ignored, regardless of the 'COMPATIBLE.ASM' value.
(keeping in mind that the check can only happen for normal/high redundancy Disk Groups)
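If you want to check up front whether a disk group falls under the new behaviour, a quick look at the current disk sizes is enough. A minimal sketch (column names as in v$asm_disk):

select group_number,
       count(distinct total_mb) distinct_sizes,
       min(total_mb) min_mb,
       max(total_mb) max_mb
  from v$asm_disk
 where group_number > 0
 group by group_number;

If distinct_sizes is 1, any disk you add will have to match that size; if it is greater than 1, the check appears to be skipped.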

Wednesday, July 15, 2015

Wrong Results

It is interesting how a combination of technologies in Oracle can play together in a way that produces seemingly bizarre outcomes.

Consider the following query:
SQL> with v as
(
        select 20 n from dual
) select distinct v.n
                from v, t
                where v.n=t.n(+);

no rows selected
Note that I'm doing a left outer join, yet the query somehow managed to lose the single row I have in the subquery factoring clause (and, just in case you're wondering, I've used subquery factoring for simplicity; replacing it with a real table makes no difference).

The first reaction is how could something as simple as this go so terribly wrong?

Let's take a look at the plan:
SQL> with v as
(
        select 20 n from dual
) select distinct v.n
                from v, t
                where v.n=t.n(+);

Execution Plan
----------------------------------------------------------
Plan hash value: 627307704

----------------------------------------------------------------------------------------
| Id  | Operation                       | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |      |     1 |     9 |     6  (17)| 00:00:01 |
|   1 |  HASH UNIQUE                    |      |     1 |     9 |     6  (17)| 00:00:01 |
|*  2 |   FILTER                        |      |       |       |            |          |
|*  3 |    HASH JOIN OUTER              |      |     1 |     9 |     5   (0)| 00:00:01 |
|   4 |     VIEW                        |      |     1 |     3 |     2   (0)| 00:00:01 |
|   5 |      FAST DUAL                  |      |     1 |       |     2   (0)| 00:00:01 |
|*  6 |     MAT_VIEW REWRITE ACCESS FULL| MV_T |    10 |    60 |     3   (0)| 00:00:01 |
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("MV_T"."GID"=0)
   3 - access("V"."N"="MV_T"."N"(+))
   6 - filter("MV_T"."N"(+) IS NOT NULL)
As it turns out query rewrite is in play, but the real oddity lies in the Predicate Information section. Line #2 reads filter("MV_T"."GID"=0), which took me by surprise (where did it come from?) and also explains why the row went missing. A predicate like that essentially turned our query into an inner join!

The answer lies in how this materialized view was created and the cool trick the optimizer tried to do which didn't quite work out. Indeed, I have created the table and the materialized view in the following way:
SQL> create table t as
  2   select mod(level, 10) n, level m
  3    from dual
  4    connect by level <= 1000;
Table created

SQL> create materialized view mv_t enable query rewrite as
  2   select n, sum(m), grouping_id(n) gid
  3    from t
  4    group by rollup(n);
Materialized view created
The materialized view has two levels of aggregation due to the use of a rollup. I'm also using a grouping_id function to identify each grouping level. So what the optimizer tried to do is apply the GID = 0 predicate in order to eliminate the rollup data but, unfortunately, it also turned our query into an inner join equivalent as a result (final query taken from the optimizer trace):
SELECT DISTINCT "V"."N" "N"
 FROM (SELECT 20 "N" FROM "SYS"."DUAL" "DUAL") "V", "ROOT"."MV_T" "MV_T"
 WHERE "MV_T"."GID" = 0
  AND "V"."N" = "MV_T"."N"(+)
Of course the correct predicate in this case should be "MV_T"."GID" (+) = 0.

I have tried the above test case on both 11.2.0.3 and 12.1.0.2.0 with both versions producing the same wrong results.
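If you hit this, a quick way to confirm that the materialized view rewrite is to blame (and to work around it) is to disable query rewrite for the statement, for example with the NO_REWRITE hint -- a sketch of the same test query:

with v as
(
        select 20 n from dual
) select /*+ no_rewrite */ distinct v.n
                from v, t
                where v.n=t.n(+);

With the rewrite out of the way the query runs against the base table and returns the expected row; setting query_rewrite_enabled=false at the session level should have the same effect.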

Saturday, March 07, 2015

Converting non-CDB database to a PDB when TDE is in use

Converting a non-CDB database to a PDB is a rather straightforward process. However, once TDE (Transparent Data Encryption) gets involved certain things become less obvious, so I've decided to write a small guide on how to accomplish that. In order for a non-CDB database to be converted to a PDB its version needs to be at least 12.1.0.2.0.
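A quick way to confirm the source database meets that requirement before you start:

select version from v$instance;

Anything older would have to be upgraded first.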

Encryption Wallet Location

My encryption wallet location is set to the following (sqlnet.ora):
ENCRYPTION_WALLET_LOCATION=
  (SOURCE =
    (METHOD = FILE)
    (METHOD_DATA =
      (DIRECTORY = /u01/app/oracle/admin/$ORACLE_SID/wallet/tde)
    )
  )
Create a non-CDB database

Let's start by creating a non-CDB database. I will call this database db1 and it will later be converted to a pluggable database called pdb1:
dbca -silent \
-createDatabase \
-templateName New_Database.dbt \
-gdbName db1 \
-createAsContainerDatabase false \
-sysPassword oracle \
-systemPassword oracle \
-emConfiguration none \
-datafileDestination /u02/oradata \
-redoLogFileSize 128 \
-recoveryAreaDestination /u02/fra \
-storageType FS \
-characterSet al32utf8 \
-nationalCharacterSet al16utf16 \
-automaticMemoryManagement false \
-initParams filesystemio_options=setall \
-initParams session_cached_cursors=100 \
-totalMemory 1024
Creating and starting Oracle instance
1% complete
...
100% complete
Since I have PSU2 applied I need to run datapatch once the database has been created:
[oracle@ora12cr1 ~]$ . oraenv
ORACLE_SID = [oracle] ? db1
The Oracle base remains unchanged with value /u01/app/oracle
[oracle@ora12cr1 ~]$ $ORACLE_HOME/OPatch/datapatch -verbose
Create the wallet and encrypted table

We're now ready to setup the wallet and create a user with an encrypted table:
[oracle@ora12cr1 ~]$ mkdir -p /u01/app/oracle/admin/db1/wallet/tde
[oracle@ora12cr1 ~]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Sat Mar 7 15:51:21 2015

Copyright (c) 1982, 2014, Oracle.  All rights reserved.


Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> alter system set encryption key identified by "db1";

System altered.

SQL> alter system set db_create_file_dest='/u02/oradata';

System altered.

SQL> create tablespace encrypted datafile size 64m encryption using 'AES256' default storage (encrypt);

Tablespace created.

SQL> create user encrypt_user identified by "encrypt_user" default tablespace encrypted;

User created.

SQL> alter user encrypt_user quota unlimited on encrypted;

User altered.

SQL> create table encrypt_user.z_encrypted as
        select dbms_random.string('x', 100) s
                from dual
                connect by level <= 100;

Table created.
Note that I did not use the new administer key management syntax on purpose and instead created the wallet the old-style way, as I expect most existing environments to be set up like that. We will explore the significance of this in a moment.
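For reference, the equivalent setup using the new syntax would have looked roughly like this (a sketch I did not run here, using the same wallet location and password):

administer key management
        create keystore '/u01/app/oracle/admin/db1/wallet/tde'
        identified by "db1";

administer key management
        set keystore open
        identified by "db1";

administer key management
        set key identified by "db1"
        with backup;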

Create a CDB

Before our non-CDB database can be converted to a PDB and plugged into a CDB we need to create the container database first:
[oracle@ora12cr1 ~]$ dbca -silent \
> -createDatabase \
> -templateName New_Database.dbt \
> -gdbName cdb12cr1 \
> -createAsContainerDatabase true \
> -sysPassword oracle \
> -systemPassword oracle \
> -emConfiguration none \
> -datafileDestination /u02/oradata \
> -redoLogFileSize 128 \
> -recoveryAreaDestination /u02/fra \
> -storageType FS \
> -characterSet al32utf8 \
> -nationalCharacterSet al16utf16 \
> -automaticMemoryManagement false \
> -initParams filesystemio_options=setall \
> -initParams session_cached_cursors=100 \
> -totalMemory 1024
Creating and starting Oracle instance
1% complete
...
100% complete
As before I need to run the datapatch utility:
[oracle@ora12cr1 ~]$ . oraenv
ORACLE_SID = [oracle] ? cdb12cr1
The Oracle base remains unchanged with value /u01/app/oracle
[oracle@ora12cr1 ~]$ $ORACLE_HOME/OPatch/datapatch -verbose

Setup CDB with a wallet

Our newly created CDB needs to be set up with a wallet before we proceed with plugging in a PDB that uses TDE. Let's do it now:
[oracle@ora12cr1 ~]$ mkdir -p /u01/app/oracle/admin/cdb12cr1/wallet/tde
[oracle@ora12cr1 ~]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Sat Mar 7 17:23:58 2015

Copyright (c) 1982, 2014, Oracle.  All rights reserved.


Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> administer key management
        create keystore '/u01/app/oracle/admin/cdb12cr1/wallet/tde'
        identified by "cdb12cr1";  2    3

keystore altered.

SQL> administer key management
        set keystore open
        identified by "cdb12cr1";  2    3

keystore altered.

SQL> administer key management
        set key identified by "cdb12cr1"
        with backup;

keystore altered.

Here I have set up the wallet and the master encryption key using the new syntax.
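Optionally, if you want the keystore to be opened automatically on instance startup, an auto-login keystore can be created on top of the password-based one -- something along these lines (not done here; note that the key export and import performed later still require the password-based keystore to be open):

administer key management
        create auto_login keystore
        from keystore '/u01/app/oracle/admin/cdb12cr1/wallet/tde'
        identified by "cdb12cr1";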

Wallets

Before we move forward let's explore the differences between db1 (created using the old syntax) and cdb12cr1 (created using the new syntax) wallets:
[oracle@ora12cr1 ~]$ orapki wallet display -wallet /u01/app/oracle/admin/db1/wallet/tde
Oracle PKI Tool : Version 12.1.0.2
Copyright (c) 2004, 2014, Oracle and/or its affiliates. All rights reserved.

Enter wallet password:
Requested Certificates:
Subject:        CN=oracle
User Certificates:
Oracle Secret Store entries:
ORACLE.SECURITY.DB.ENCRYPTION.AY/sdMkWZk/4v3YgBKzZtcIAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ORACLE.SECURITY.DB.ENCRYPTION.MASTERKEY
ORACLE.SECURITY.ID.ENCRYPTION.
ORACLE.SECURITY.KB.ENCRYPTION.
Trusted Certificates:
[oracle@ora12cr1 ~]$  orapki wallet display -wallet /u01/app/oracle/admin/cdb12cr1/wallet/tde
Oracle PKI Tool : Version 12.1.0.2
Copyright (c) 2004, 2014, Oracle and/or its affiliates. All rights reserved.

Enter wallet password:
Requested Certificates:
Subject:        CN=oracle
User Certificates:
Oracle Secret Store entries:
ORACLE.SECURITY.DB.ENCRYPTION.Ad+A607CLU+Ivx4f4E7KUYIAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ORACLE.SECURITY.DB.ENCRYPTION.MASTERKEY
ORACLE.SECURITY.ID.ENCRYPTION.
ORACLE.SECURITY.KB.ENCRYPTION.
ORACLE.SECURITY.KM.ENCRYPTION.Ad+A607CLU+Ivx4f4E7KUYIAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Trusted Certificates:
Notice how db1 has the encryption key listed only under ORACLE.SECURITY.DB.ENCRYPTION, while cdb12cr1 also has it listed under the new ORACLE.SECURITY.KM.ENCRYPTION entry.

In practice what I found is that unless your encryption key is listed under ORACLE.SECURITY.KM.ENCRYPTION it will not be exported when doing administer key management export keys. As a result, when you import the export file the master key required to decrypt the data will not be there, potentially leaving you in a peculiar situation, especially if you have used an in-place conversion. Even if you specifically export the key using the with identifier in syntax you will not be able to import it, because exports of specific keys are not permitted to be imported into a PDB.

Add ORACLE.SECURITY.KM.ENCRYPTION to db1's wallet

This step is only required if you created the wallet without using the new administer key management syntax. Re-keying the wallet will generate a new master key while preserving the old one (still required to decrypt the existing data) and will add the missing ORACLE.SECURITY.KM.ENCRYPTION entries at the same time. Execute the following while connected to db1:
SQL> administer key management
        set key identified by "db1"
        with backup;

keystore altered.
If we look at the wallet we can see that it now contains the necessary entries:
[oracle@ora12cr1 ~]$ orapki wallet display -wallet /u01/app/oracle/admin/db1/wallet/tde
Oracle PKI Tool : Version 12.1.0.2
Copyright (c) 2004, 2014, Oracle and/or its affiliates. All rights reserved.

Enter wallet password:
Requested Certificates:
Subject:        CN=oracle
User Certificates:
Oracle Secret Store entries:
ORACLE.SECURITY.DB.ENCRYPTION.AXZlzWtP/U/xv+vihPzeaGkAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ORACLE.SECURITY.DB.ENCRYPTION.AY/sdMkWZk/4v3YgBKzZtcIAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ORACLE.SECURITY.DB.ENCRYPTION.MASTERKEY
ORACLE.SECURITY.ID.ENCRYPTION.
ORACLE.SECURITY.KB.ENCRYPTION.
ORACLE.SECURITY.KM.ENCRYPTION.AXZlzWtP/U/xv+vihPzeaGkAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ORACLE.SECURITY.KM.ENCRYPTION.AY/sdMkWZk/4v3YgBKzZtcIAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
ORACLE.SECURITY.KM.ENCRYPTION.MASTERKEY
Trusted Certificates:
Prepare db1 to be converted into a PDB

Before db1 can be plugged into a container database it needs to be converted to a PDB and its encryption keys exported. Shut down db1 and open it read only:
SQL> show parameter db_name

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_name                              string      db1
SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup mount restrict
ORACLE instance started.

Total System Global Area 1241513984 bytes
Fixed Size                  2923872 bytes
Variable Size             452985504 bytes
Database Buffers          771751936 bytes
Redo Buffers               13852672 bytes
Database mounted.
SQL> alter database open read only;

Database altered.
We can now export the encryption keys:
SQL> administer key management
        set keystore open
        identified by "db1";  2    3

keystore altered.

SQL> administer key management
        export keys with secret "db1"
        to '/u01/app/oracle/admin/cdb12cr1/wallet/tde/db1.exp'
        identified by "db1";  2    3    4

keystore altered.
Note that in order for the export operation to work the wallet needs to be explicitly opened with a password, otherwise you will receive an error. In the case of an auto-login (local) wallet you will have to close the wallet and reopen it with a password.
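Something along these lines should do it (a sketch, using the wallet password set earlier):

administer key management
        set keystore close;

administer key management
        set keystore open
        identified by "db1";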

The next step is to generate the metadata necessary for the PDB conversion and shut down the db1 database:
SQL> exec dbms_pdb.describe(pdb_descr_file => '/u01/app/oracle/db1.xml');

PL/SQL procedure successfully completed.

SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
Convert and plug db1 into a CDB

We can now log into cdb12cr1 and perform an in-place conversion of db1, which does not require any datafiles to be copied. Of course, if something were to go wrong with the conversion process you might end up having to restore your original database from a backup, so use this approach with care.
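If you would rather keep the original datafiles untouched, the PDB can be plugged in with copy instead of nocopy, at the cost of duplicating the datafiles -- something like the following sketch (the file_name_convert paths are placeholders and depend on your layout):

create pluggable database pdb1
        using '/u01/app/oracle/db1.xml'
        copy
        file_name_convert = ('/u02/oradata/db1', '/u02/oradata/cdb12cr1/pdb1');

I'll proceed with the in-place conversion here: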
SQL> show parameter db_name

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
db_name                              string      cdb12cr1
SQL> create pluggable database pdb1
        using '/u01/app/oracle/db1.xml'
        nocopy tempfile reuse;

Pluggable database created.
Before pdb1 can be opened we need to run the script which will convert pdb1's data dictionary:
SQL> alter session set container=pdb1;

Session altered.

SQL> @?/rdbms/admin/noncdb_to_pdb.sql
Once the script completes we can open pdb1:
SQL> alter pluggable database pdb1 open;

Warning: PDB altered with errors.
The warning when opening the PDB tells us that the encryption key is missing, which can be confirmed in the pdb_plug_in_violations view.
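Something along these lines will show the outstanding violation (a sketch; filter as appropriate):

select name, cause, type, message
  from pdb_plug_in_violations
 where name = 'PDB1'
   and status <> 'RESOLVED';

Let's go and import the key now: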
[oracle@ora12cr1 ~]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Sat Mar 7 18:22:23 2015

Copyright (c) 1982, 2014, Oracle.  All rights reserved.


Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> alter session set container=pdb1;

Session altered.

SQL> administer key management
        set keystore open
        identified by "cdb12cr1";  2    3

keystore altered.

SQL> administer key management
        import keys with secret "db1"
        from '/u01/app/oracle/admin/cdb12cr1/wallet/tde/db1.exp'
        identified by "cdb12cr1"
        with backup;

keystore altered.

SQL> alter pluggable database pdb1 close;

Pluggable database altered.

SQL> alter pluggable database pdb1 open;

Pluggable database altered.
The encryption keys are now imported and we no longer get an error. Note that, as with the export, the wallet must be explicitly opened with a password in order to import the keys. We can verify that everything is in order by querying the encrypted table:
SQL> administer key management
        set keystore open
        identified by "cdb12cr1";  2    3

keystore altered.

SQL> select count(*) from encrypt_user.z_encrypted;

  COUNT(*)
----------
       100

Monday, September 08, 2014

Zone Maps On Commit Refresh Oddities

One of the ways Zone Maps can be refreshed when the underlying table data is changed is fast on commit. This is similar to how materialized views can be refreshed with the exception that a Zone Map does not need a materialized view log to do so.

It can also lead to some peculiar side effects.

Test setup

Let's begin by creating a test table with an on commit refresh materialized zone map:
SQL> create table t pctfree 95 clustering by linear order (n) as
  2   select level n, rpad('x', 200, 'x') v
  3    from dual
  4    connect by level <= 5000;
 
Table created
 
SQL> create materialized zonemap zm$t refresh fast on commit on t (n);
 
Done

SQL> select * from zm$t order by 1;
 
  ZONE_ID$    MIN_1_N    MAX_1_N ZONE_LEVEL$ ZONE_STATE$ ZONE_ROWS$
---------- ---------- ---------- ----------- ----------- ----------
3241022750          1        748           0           0        748
3241022750        749       1756           0           0       1008
3241022750       1757       2764           0           0       1008
3241022750       2765       3772           0           0       1008
3241022750       3773       4780           0           0       1008
3241022750       4781       5000           0           0        220
 
6 rows selected
Zone Staleness (first session)

When a session updates a row in the table it will also mark the corresponding zone in the zone map as stale by setting ZONE_STATE$=1:
SQL> update t set n=0 where n=1;
 
1 row updated
 
SQL> select * from zm$t order by 1;
 
  ZONE_ID$    MIN_1_N    MAX_1_N ZONE_LEVEL$ ZONE_STATE$ ZONE_ROWS$
---------- ---------- ---------- ----------- ----------- ----------
3241022750          1        748           0           1        748 <=== Zone marked as stale
3241022750        749       1756           0           0       1008
3241022750       1757       2764           0           0       1008
3241022750       2765       3772           0           0       1008
3241022750       3773       4780           0           0       1008
3241022750       4781       5000           0           0        220
 
6 rows selected
A stale zone is always scanned when performing Zone Map pruning, regardless of the predicates involved. This makes sense: otherwise we would not be able to select the updated row back while performing Zone Map pruning, because the value is now out of range. With the zone marked as stale we have no such problem:
SQL> select n from t where n=0;
 
         N
----------
         0
A zone will be marked as fresh (ZONE_STATE$=0) when the session commits. Note that our session has not committed yet.

Second session

So what happens if some other session updates a row belonging to the stale zone and then commits? Does the zone get marked as fresh again? Let's find out:
SQL> update t set n=2 where n=2;
 
1 row updated
 
SQL> commit;
 
Commit complete
 
SQL> select * from zm$t order by 1;
 
  ZONE_ID$    MIN_1_N    MAX_1_N ZONE_LEVEL$ ZONE_STATE$ ZONE_ROWS$
---------- ---------- ---------- ----------- ----------- ----------
3241022750          1        748           0           0        748 <=== Zone is no longer marked as stale
3241022750        749       1756           0           0       1008
3241022750       1757       2764           0           0       1008
3241022750       2765       3772           0           0       1008
3241022750       3773       4780           0           0       1008
3241022750       4781       5000           0           0        220
 
6 rows selected
Indeed the zone is no longer marked as stale! Now let's go back to our first session...

First session

So what happens if we try to select the same updated row in the first session?
SQL> select n from t where n=0;

no rows selected
We've lost the ability to see the updated row thanks to the second session marking the zone as fresh! Worse yet, we still won't be able to see the row even after this session commits, meaning any query in the database which relies on this Zone Map to do the pruning will be unable to see the data either:
SQL> commit;

Commit complete.

SQL> select n from t where n=0;

no rows selected

SQL> select /*+ no_zonemap(t scan) */ n from t where n=0;

         N
----------
         0
The only way to select the row is to disable Zone Map pruning.

The end result is that we have a perfectly "valid" Zone Map as far as Oracle is concerned, which is nevertheless out of sync with the actual table data:
SQL> select zonemap_name, invalid, stale, unusable, compile_state from dba_zonemaps;
 
ZONEMAP_NAME INVALID STALE   UNUSABLE COMPILE_STATE
------------ ------- ------- -------- -------------------
ZM$T         NO      NO      NO       VALID 

The fact that the Oracle documentation claims on commit refresh zone maps stay transactionally fresh suggests that this behavior is clearly a bug, resulting from incorrect handling of concurrent transactions modifying data that belongs to the same zone.
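One way to bring such a Zone Map back in sync with the table would be to rebuild it manually, e.g.:

alter materialized zonemap zm$t rebuild;

but that obviously defeats the purpose of having fast on commit refresh in the first place.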

Tuesday, August 19, 2014

Zone Map Zone ID's

Just a quick follow-up to my previous post on how Zone IDs are calculated.

Let's take the following example:
SQL> select rid, sys_op_zone_id(rid) zone_id
  2   from
  3   (
  4    select chartorowid('AAAS5KAAHAAABYDAAA') rid
  5     from dual
  6   );
 
RID                     ZONE_ID
------------------ ------------
AAAS5KAAHAAABYDAAA 324580438021

Recall that an extended ROWID has the following format (there is a nice diagram in the Oracle Documentation), which in binary corresponds to:
  • Data Object Number -- 32 bits
  • Relative File Number -- 10 bits
  • Block Number -- 22 bits
  • Row Number -- 16 bits
We know that the Row Number is irrelevant for Zone Maps because they deal with block ranges. With that in mind, a simple conversion to a base-10 number would be:

(Block Number) + (Relative File Number) * 2 ^ 22 + (Data Object Number) * 2 ^ 32

Applying the above formula to our ROWID:
SQL> select
  2    dbms_rowid.rowid_block_number(rid) +
  3    dbms_rowid.rowid_relative_fno(rid) * power(2,22) +
  4    dbms_rowid.rowid_object(rid) * power(2,32) base_10
  5   from
  6   (
  7    select chartorowid('AAAS5KAAHAAABYDAAA') rid
  8     from dual
  9   );
 
        BASE_10
---------------
332370368534019

What's left after that is to slice the number into the Zone Map chunk size (2^10 blocks by default; thanks to Timur for pointing that out in the comments section of the previous post):
SQL> select
  2    trunc((
  3    dbms_rowid.rowid_block_number(rid) +
  4    dbms_rowid.rowid_relative_fno(rid) * power(2,22) +
  5    dbms_rowid.rowid_object(rid) * power(2,32)
  6    )/power(2,10)) zone_id,
  7    sys_op_zone_id(rid) sys_zone_id
  8   from
  9   (
 10    select chartorowid('AAAS5KAAHAAABYDAAA') rid
 11     from dual
 12   );
 
     ZONE_ID  SYS_ZONE_ID
------------ ------------
324580438021 324580438021

That's all there is to it!
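Since the whole thing is simple arithmetic it can also be reversed, i.e. a ZONE_ID$ can be decomposed back into the data object number, the relative file number and the first block of the zone -- a sketch using the same value:

select trunc(zone_id * power(2,10) / power(2,32)) data_object_id,
       trunc(mod(zone_id * power(2,10), power(2,32)) / power(2,22)) relative_fno,
       mod(zone_id * power(2,10), power(2,22)) first_block_in_zone
  from (select 324580438021 zone_id from dual);

For the ROWID above this works out to data object 77386, file 7 and a zone starting at block 5120, which is consistent with block 5635 from the original ROWID falling into the 1024-block range that starts there.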

Monday, August 18, 2014

Zone Maps

Zone Maps are a new feature officially introduced in 12.1.0.2.0, so I've decided to take a closer look.

From the Oracle Documentation:
For full table access, zone maps allow I/O pruning of data based on the physical location of the data on disk, acting like an anti-index.

Let's start by creating a test table:
SQL> create table t pctfree 95 clustering by linear order (n) as
  2   select level n, rpad('x', 200, 'x') v
  3    from dual
  4    connect by level <= 5000;
 
Table created
I've used a high pctfree setting to make sure the table gets spread out on disk -- each row will occupy its own block:
SQL> select count(*) num_rows,
  2    count(distinct dbms_rowid.rowid_block_number(rowid)) num_blocks
  3   from t;
 
  NUM_ROWS NUM_BLOCKS
---------- ----------
      5000       5000
Zone Maps do not require attribute clustering; however, I did use clustering in this particular case to make sure that the value ranges for column N do not overlap when physically stored on disk, which makes it easier to see what's happening when a Zone Map is created.

Now let's create a Zone Map:
SQL> create materialized zonemap zm$t on t (n);
 
Done
The first thing to notice is that we now have an additional table in our schema with the same name as the Zone Map:
SQL> select * from zm$t order by zone_id$;
 
    ZONE_ID$    MIN_1_N    MAX_1_N ZONE_LEVEL$ ZONE_STATE$ ZONE_ROWS$
------------ ---------- ---------- ----------- ----------- ----------
324421046272          1        748           0           0        748
324421046273        749       1756           0           0       1008
324421046274       1757       2764           0           0       1008
324421046275       2765       3772           0           0       1008
324421046276       3773       4780           0           0       1008
324421046277       4781       5000           0           0        220
 
6 rows selected
As you can see we've got six zones defined, most of them covering a range of about a thousand rows, with the exception of the first and the last ones. I can now map each ZONE_ID$ to its respective block range on disk:
SQL> select zone_id$, min_block_id, max_block_id, zone_rows$
  2  from (
  3  select zm$t.zone_id$,
  4    min(dbms_rowid.rowid_block_number(t.rowid)) min_block_id,
  5    max(dbms_rowid.rowid_block_number(t.rowid)) max_block_id,
  6    max(zone_rows$) zone_rows$
  7   from t, zm$t
  8   where t.n between zm$t.min_1_n and zm$t.max_1_n
  9   group by zm$t.zone_id$
 10  ) order by zone_id$;
 
  ZONE_ID$   MIN_BLOCK_ID MAX_BLOCK_ID ZONE_ROWS$
------------ ------------ ------------ ----------
324421046272          179         1023        748
324421046273         1026         2047       1008
324421046274         2050         3071       1008
324421046275         3074         4095       1008
324421046276         4098         5119       1008
324421046277         5122         5343        220
 
6 rows selected
Based on a couple more tests I've done, the algorithm appears to work itself out until the first block number in the segment that divides evenly by 1024, after which all subsequent blocks get mapped to 8MB regions. The last zone has 221 blocks because that's where the last table block happens to be.

Zone Maps seem to be completely independent from the actual extents on disk. In my tests I was able to get multiple zones pointing at the same extent, and in the case above we have multiple extents covered by the same zone. In all cases zones were 8MB in size (more on that a little bit later). By the way, if you're wondering why we have 1008 rows and slightly odd starting block_ids (i.e. 1026 instead of 1024 and so on), remember that the first couple of blocks are used for first level bmbs and do not store actual table data.

So how is this all pieced together?

SYS_OP_ZONE_ID

The SYS_OP_ZONE_ID function computes ZONE_ID$ for a given ROWID value. The calculation appears to be straightforward math based on the ROWID value and assumes 8MB Zone Map chunks -- this explains why zones are aligned to 8MB boundaries, since ZONE_ID$ values in the Zone Map are computed using the same function:
SQL> select rowid, SYS_OP_ZONE_ID(rowid)
  2   from t
  3   where rownum=1;
 
ROWID              SYS_OP_ZONE_ID(ROWID)
------------------ ---------------------
AAAS4kAAFAAAACzAAA          324421046272

Essentially, if we imagine the entire possible database space divided into 8MB regions, this function returns the region to which a particular ROWID value belongs.
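The 8MB figure, by the way, simply corresponds to 1024 blocks with my 8KB block size (adjust if your db_block_size differs):

select 1024 * 8192 / 1024 / 1024 zone_size_mb from dual;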

SYS_ZMAP_FILTER

Let's look at the explain plan for the following query:
SQL> select * from t where n=1;


Execution Plan
----------------------------------------------------------
Plan hash value: 2931408918

--------------------------------------------------------------------------------
| Id  | Operation                              | Name | Rows  | Bytes | Cost (%C
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |      |     1 |   205 |  1380
|*  1 |  TABLE ACCESS STORAGE FULL WITH ZONEMAP| T    |     1 |   205 |  1380
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - storage("N"=1)
       filter(SYS_ZMAP_FILTER('/* ZM_PRUNING */ SELECT "ZONE_ID$", CASE WHEN
              BITAND(zm."ZONE_STATE$",1)=1 THEN 1 ELSE CASE WHEN (zm."MIN_1_N" > :1 OR zm."MAX_1_N"
              < :2) THEN 3 ELSE 2 END END FROM "ROOT"."ZM$T" zm WHERE zm."ZONE_LEVEL$"=0 ORDER BY
              zm."ZONE_ID$"',SYS_OP_ZONE_ID(ROWID),1,1)<3 AND "N"=1)
We can see that SYS_ZMAP_FILTER appears to be the function involved in figuring out which zones need to be accessed in order to execute the query. The condition inside the query (zm."MIN_1_N" > :1 OR zm."MAX_1_N" < :2) is used to eliminate zones which cannot contain the value we're looking for and is dynamically constructed based on the predicate(-s) we have in the query. From there ZONE_ID$ can be mapped back to ROWID ranges (a reverse operation relative to SYS_OP_ZONE_ID) in order to scan only the required portions of the data on disk.

This looks to be a very exciting feature, and I can't help but think it's a little disappointing that it's limited to Exadata storage only, as it could have been very useful on other systems due to the lack of storage index support there.

Thursday, February 20, 2014

AWR Top 5 Timed Foreground Events

I've noticed that from time to time people post how to get the AWR Top 5 Timed Foreground Events over a range of snapshots using a SQL query. Since this is something I've done for years, here is the version of the SQL I use in case somebody finds it useful:
select case wait_rank when 1 then inst_id end "Inst Num",
 case wait_rank when 1 then snap_id end "Snap Id",
 case wait_rank when 1 then begin_snap end "Begin Snap",
 case wait_rank when 1 then end_snap end "End Snap",
 event_name "Event",
 total_waits "Waits",
 time_waited "Time(s)",
 round((time_waited/total_waits)*1000) "Avg wait(ms)",
 round((time_waited/db_time)*100, 2) "% DB time",
 substr(wait_class, 1, 15) "Wait Class"
from (
select
  inst_id,
  snap_id, to_char(begin_snap, 'DD-MM-YY hh24:mi:ss') begin_snap,
  to_char(end_snap, 'hh24:mi:ss') end_snap,
  event_name,
  wait_class,
  total_waits,
  time_waited,
  dense_rank() over (partition by inst_id, snap_id order by time_waited desc)-1 wait_rank,
  max(time_waited) over (partition by inst_id, snap_id) db_time
from (
select
  s.instance_number inst_id,
  s.snap_id,
  s.begin_interval_time begin_snap,
  s.end_interval_time end_snap,
  event_name,
  wait_class,
  total_waits-lag(total_waits, 1, total_waits) over
   (partition by s.startup_time, s.instance_number, stats.event_name order by s.snap_id) total_waits,
  time_waited-lag(time_waited, 1, time_waited) over
   (partition by s.startup_time, s.instance_number, stats.event_name order by s.snap_id) time_waited,
  min(s.snap_id) over (partition by s.startup_time, s.instance_number, stats.event_name) min_snap_id
from (
 select dbid, instance_number, snap_id, event_name, wait_class, total_waits_fg total_waits, round(time_waited_micro_fg/1000000, 2) time_waited
  from dba_hist_system_event
  where wait_class not in ('Idle', 'System I/O')
 union all
 select dbid, instance_number, snap_id, stat_name event_name, null wait_class, null total_waits, round(value/1000000, 2) time_waited
  from dba_hist_sys_time_model
  where stat_name in ('DB CPU', 'DB time')
) stats, dba_hist_snapshot s
 where stats.instance_number=s.instance_number
  and stats.snap_id=s.snap_id
  and stats.dbid=s.dbid
  and s.dbid=3870213301
  and s.instance_number=1
  and stats.snap_id between 190 and 195
) where snap_id > min_snap_id and nvl(total_waits,1) > 0
) where event_name!='DB time' and wait_rank <= 5
order by inst_id, snap_id;

Inst Snap  Begin Snap        End Snap Event                            Waits    Time(s) Avg wait(ms)  % DB time Wait Class
---- ----- ----------------- -------- --------------------------- ---------- ---------- ------------ ---------- ---------------
   1   191 20-02-14 14:10:10 14:20:10 cell smart table scan           631829    9569.43           15      79.03 User I/O
                                      DB CPU                                    1202.09                    9.93 
                                      direct path read temp            66074    1006.82           15       8.32 User I/O
                                      PX Deq: Slave Session Stats      11730     429.91           37       3.55 Other
                                      latch: shared pool               28134     162.47            6       1.34 Concurrency
   1   192 20-02-14 14:20:10 14:30:11 cell smart table scan          1391832    4620.11            3      67.39 User I/O
                                      DB CPU                                    1017.78                   14.85 
                                      direct path read temp            76329     977.95           13      14.26 User I/O
                                      PX Deq: Slave Session Stats      25043     401.53           16       5.86 Other
                                      latch free                       38836      214.1            6       3.12 Other
   1   193 20-02-14 14:30:11 14:40:14 cell smart table scan          2448539   11075.29            5       79.3 User I/O
                                      DB CPU                                    1529.93                   10.95 
                                      PX Deq: Slave Session Stats      44242    1520.01           34      10.88 Other
                                      direct path read temp            77583     985.65           13       7.06 User I/O
                                      latch free                       67518     376.52            6        2.7 Other
   1   194 20-02-14 14:40:14 14:50:15 direct path read temp            99224      857.3            9      71.63 User I/O
                                      DB CPU                                     328.78                   27.47 
                                      name-service call wait              91        5.4           59       0.45 Other
                                      PX Deq: Slave Session Stats         83       0.17            2       0.01 Other
                                      direct path write                  194       0.12            1       0.01 User I/O
   1   195 20-02-14 14:50:15 15:00:18 DB CPU                                    1188.84                   98.15 
                                      log switch/archive                   1      10.01        10010       0.83 Other
                                      direct path read temp              775       3.96            5       0.33 User I/O
                                      cell smart table scan             1393        1.1            1       0.09 User I/O
                                      cell single block physical         148        0.9            6       0.07 User I/O
                                      read                                                                      
 
 
25 rows selected