Introduction
Today I will discuss the impact of a pluggable database failure (media failure in particular) on the other pluggable databases and on the parent container database.
In this article series, I attempt to answer the concerns raised by my dear friend Nassyam Basha in his post "PDB is Painful to CDB any cost – 12c ?".
In this first section, I will simply repeat the demonstration that Nassyam Basha showed in that article.
During the demonstration, it was observed that when a datafile belonging to a pluggable database (PDB) goes missing or becomes corrupt, the CKPT background process (responsible for updating the controlfile and datafile headers and for signalling DBWR to flush dirty buffers to disk) terminates the container database (CDB) instance, which in turn brings down all the pluggable databases attached to that container.
This is a serious concern for the multitenant architecture, and it raises an obvious question about the capability of pluggable databases. It is as if your business ran on a cloud infrastructure in which a single component failure brings down the entire infrastructure; that is not at all desirable, and such a cloud would hardly be a recommended place to deploy.
Demonstration (based on the referenced article)
Let's take a look at a simple simulation of the problem.
I have a container database PRODCDB, which is hosting 4 (four) pluggable databases.
sys@PRODCDB> select name,open_mode,cdb from v$database;

NAME      OPEN_MODE            CDB
--------- -------------------- ---
PRODCDB   READ WRITE           YES

sys@PRODCDB> select name,dbid,open_mode from v$pdbs;

NAME                                 DBID OPEN_MODE
------------------------------ ---------- ----------
PDB$SEED                       4103948816 READ ONLY
PRODPDB1                       4276769587 READ WRITE
PRODPDB2                       4149756065 READ WRITE
PRODPDB3                       4199535790 READ WRITE
PRODPDB4                       4072086604 READ WRITE
Let's simulate a failure by deleting one of the datafiles belonging to a PDB. I am arbitrarily choosing the pluggable database PRODPDB4.
sys@PRODCDB> show con_name

CON_NAME
------------------------------
PRODPDB4

sys@PRODCDB> select tablespace_name,file_id,file_name from dba_data_files order by 1,2;

TABLESPACE_NAME    FILE_ID FILE_NAME
--------------- ---------- --------------------------------------------------------------------------------
SYSAUX                  33 +DATA/PRODCDB/05E6D1ADF1341A67E05305E6A8C088D7/DATAFILE/sysaux.300.864131435
SYSTEM                  32 +DATA/PRODCDB/05E6D1ADF1341A67E05305E6A8C088D7/DATAFILE/system.299.861526861
USERS                   43 +DATA/PRODCDB/05E6D1ADF1341A67E05305E6A8C088D7/DATAFILE/users.375.864212453
USERS                   45 /app/oracle/data/prodcdb/prodpdb4_users_2.dbf
Let's introduce an artificial datafile loss.
11:15:47 sys@PRODCDB> !ls -lrt /app/oracle/data/prodcdb/prodpdb4_users_2.dbf
-rw-r----- 1 oracle oinstall 10493952 Nov 21 11:12 /app/oracle/data/prodcdb/prodpdb4_users_2.dbf

11:16:06 sys@PRODCDB> !rm /app/oracle/data/prodcdb/prodpdb4_users_2.dbf

11:16:23 sys@PRODCDB> !ls -lrt /app/oracle/data/prodcdb/prodpdb4_users_2.dbf
ls: /app/oracle/data/prodcdb/prodpdb4_users_2.dbf: No such file or directory
Now, let's instruct the CKPT process to perform a checkpoint.
11:16:40 sys@PRODCDB> alter system checkpoint;
ERROR:
ORA-03114: not connected to ORACLE

alter system checkpoint
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel
Process ID: 7177
Session ID: 20 Serial number: 9
My container database PRODCDB has terminated. Here are the errors that were logged in the alert log.
Fri Nov 21 11:16:49 2014
Errors in file /app/oracle/diag/rdbms/prodcdb/prodcdb/trace/prodcdb_ckpt_5848.trc:
ORA-63999: data file suffered media failure
ORA-01116: error in opening database file 45
ORA-01110: data file 45: '/app/oracle/data/prodcdb/prodpdb4_users_2.dbf'
ORA-27041: unable to open file
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3
Fri Nov 21 11:16:49 2014
Errors in file /app/oracle/diag/rdbms/prodcdb/prodcdb/trace/prodcdb_ckpt_5848.trc:
ORA-63999: data file suffered media failure
ORA-01116: error in opening database file 45
ORA-01110: data file 45: '/app/oracle/data/prodcdb/prodpdb4_users_2.dbf'
ORA-27041: unable to open file
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3
USER (ospid: 5848): terminating the instance due to error 63999
Fri Nov 21 11:16:50 2014
System state dump requested by (instance=1, osid=5848 (CKPT)), summary=[abnormal instance termination].
System State dumped to trace file /app/oracle/diag/rdbms/prodcdb/prodcdb/trace/prodcdb_diag_5828.trc
Dumping diagnostic data in directory=[cdmp_20141121111650], requested by (instance=1, osid=5848 (CKPT)), summary=[abnormal instance termination].
Now, trying to start the container database (CDB) PRODCDB results in the following errors.
idle> startup
ORACLE instance started.

Total System Global Area  521936896 bytes
Fixed Size                  2290264 bytes
Variable Size             264244648 bytes
Database Buffers          251658240 bytes
Redo Buffers                3743744 bytes
Database mounted.
ORA-01157: cannot identify/lock data file 45 - see DBWR trace file
ORA-01110: data file 45: '/app/oracle/data/prodcdb/prodpdb4_users_2.dbf'
The container database (CDB) was able to MOUNT. However, while opening the database, it (DBWR in particular) could not find the datafile (FILE# 45 in our case) belonging to the pluggable database PRODPDB4.
The quickest solution to bring back the container along with the attached pluggable databases would be to immediately take the lost datafile (FILE# 45 in our case) OFFLINE and start the container and pluggable databases. We can later recover the lost file (using the available backup).
First, identify the pluggable database to which the lost datafile belongs.
idle> select name,dbid,open_mode from v$pdbs where con_id=(select CON_ID from v$datafile where file#=45);

NAME                                 DBID OPEN_MODE
------------------------------ ---------- ----------
PRODPDB4                       4072086604 MOUNTED
Now, take the lost datafile OFFLINE by logging in to the respective pluggable database.
idle> alter session set container=PRODPDB4;

Session altered.

idle> alter database datafile 45 offline;

Database altered.
Now, log in to the container database (CDB$ROOT) and open it (it is already in MOUNT state from the last STARTUP attempt).
idle> show con_name

CON_NAME
------------------------------
CDB$ROOT

idle> alter database open;

Database altered.
Now, we can open the attached pluggable databases as follows.
idle> alter pluggable database all open;

Pluggable database altered.

sys@PRODCDB> select name,dbid,open_mode from v$pdbs;

NAME                                 DBID OPEN_MODE
------------------------------ ---------- ----------
PDB$SEED                       4103948816 READ ONLY
PRODPDB1                       4276769587 READ WRITE
PRODPDB2                       4149756065 READ WRITE
PRODPDB3                       4199535790 READ WRITE
PRODPDB4                       4072086604 READ WRITE
So, our container as well as the pluggable databases are back ONLINE. However, we are yet to restore and recover the missing file belonging to the pluggable database. We can use RMAN to recover the datafile. I am skipping that part here.
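For readers who want a starting point for that skipped step, here is a rough sketch. It assumes that a usable RMAN backup of the lost datafile and the archived redo needed to recover it are available, and that the commands are run by an OS-authenticated user on the database host. The helper names print_rman_steps and print_online_steps are hypothetical, and the script only prints the commands; it does not run anything against the database.

```shell
#!/bin/sh
# Sketch only: prints the commands, it does not execute them.
# print_rman_steps and print_online_steps are hypothetical helper names.

# RMAN commands to restore the lost file from backup and apply redo to it.
#   $1 = datafile number
print_rman_steps() {
    file_no="$1"
    printf 'RESTORE DATAFILE %s;\n' "$file_no"
    printf 'RECOVER DATAFILE %s;\n' "$file_no"
}

# SQL*Plus commands to bring the recovered file back online inside its PDB.
#   $1 = PDB name, $2 = datafile number
print_online_steps() {
    pdb_name="$1"
    file_no="$2"
    printf 'ALTER SESSION SET CONTAINER=%s;\n' "$pdb_name"
    printf 'ALTER DATABASE DATAFILE %s ONLINE;\n' "$file_no"
}

# Values taken from the demonstration above.
print_rman_steps 45
print_online_steps PRODPDB4 45
```

If the printed commands look right for your environment, the first part could be piped to "rman target /" and the second to "sqlplus / as sysdba". Treat this as an outline to adapt rather than a tested procedure.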
The article I referred to at the beginning asked the following questions.
1) if in case of system datafile lost of Pluggable database, what happens to CDB?
2) If i shutdown the PDB, will it impact to any other PDB part of CDB?
3) You can drop PDB anytime, then why CDB can’t stop and startup in case of system or user datafile lost of PDB?
4) If you have 10 PDB’s of one CDB, if there is lost of any single datafile of PDB(pdb1) and i have scenario to startup and shutdown my CDB and other PDBs except damaged(pdb1) why i can’t startup CDB or other PDB’s?
I will try to give quick answers to these questions.
1) Answer: Depends (CDB may terminate). Note, I did not say “it will terminate”. Stay tuned for explanation.
2) Answer: Not at all. Issuing SHUTDOWN inside a PDB is similar to closing that PDB with the ‘ALTER PLUGGABLE DATABASE CLOSE’ command, which has no impact on other PDBs.
sys@PRODCDB> show con_name

CON_NAME
------------------------------
PRODPDB4

sys@PRODCDB> shutdown
Pluggable Database closed.

sys@PRODCDB> alter session set container=CDB$ROOT;

Session altered.

sys@PRODCDB> alter pluggable database PRODPDB3 close;

Pluggable database altered.

sys@PRODCDB> select name,dbid,open_mode from v$pdbs;

NAME                                 DBID OPEN_MODE
------------------------------ ---------- ----------
PDB$SEED                       4103948816 READ ONLY
PRODPDB1                       4276769587 READ WRITE
PRODPDB2                       4149756065 READ WRITE
PRODPDB3                       4199535790 MOUNTED
PRODPDB4                       4072086604 MOUNTED
3) Answer: This is because there is only one controlfile, shared by the CDB and all of its PDBs. For the CDB to open, DBWR must be able to identify every ONLINE datafile listed in the controlfile.
4) Answer: As the demonstration above shows, we can still open the CDB and all the PDBs even when a datafile of one PDB is lost, once that datafile has been taken OFFLINE.
Explanation on the Instance Termination
I was a bit curious here. I was not ready to believe that Oracle would offer a cloud infrastructure (the multitenant pluggable database architecture) that does not function like a cloud. Savvy programmers at Oracle wrote the core of this multitenant architecture, and they would undoubtedly have tested these scenarios before releasing it. I was fairly sure that we were missing something in our simulation, and that this was what led the cloud (the container database) to fail.
With this thought in mind, I started researching and eventually found the cause described in MOS Note Doc ID 1605755.1.
Here is the logic behind the instance termination by CKPT. Prior to Oracle 11.2.0.2, a media failure on any datafile (other than one belonging to the SYSTEM tablespace) resulted in that datafile being taken OFFLINE, provided the database was in ARCHIVELOG mode. In 11.2.0.2, however, Oracle introduced the fix for Bug 7691270, "Crash the DB in case of write errors (rather than just offline files)", after which a media failure on a datafile leads to instance termination.
This behaviour is controlled by a new hidden parameter, _DATAFILE_WRITE_ERRORS_CRASH_INSTANCE, which takes the following values:
_DATAFILE_WRITE_ERRORS_CRASH_INSTANCE=TRUE (default): When set to TRUE, any datafile media failure causes the instance to terminate when a database process tries to write to that datafile.
_DATAFILE_WRITE_ERRORS_CRASH_INSTANCE=FALSE: When set to FALSE, the previous (pre-11.2.0.2) behaviour is restored and the datafile is taken OFFLINE (provided the database is in ARCHIVELOG mode and the datafile does not belong to the SYSTEM tablespace).
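The rule described above can be summarised as a small decision function. This is only a toy model of my reading of the note, not Oracle code: media_failure_outcome is a hypothetical helper name, and the "instance terminated" outcome for the SYSTEM tablespace and NOARCHIVELOG cases is an assumption based on the exceptions stated above.

```shell
#!/bin/sh
# Toy model of the rule described in the text; it is not Oracle code.
# media_failure_outcome is a hypothetical helper name.
#   $1 = value of _DATAFILE_WRITE_ERRORS_CRASH_INSTANCE (TRUE or FALSE)
#   $2 = log mode (ARCHIVELOG or NOARCHIVELOG)
#   $3 = tablespace that owns the failed datafile
media_failure_outcome() {
    crash_param="$1"
    log_mode="$2"
    tablespace="$3"

    # The file is only taken offline when all three conditions hold.
    if [ "$crash_param" = "FALSE" ] &&
       [ "$log_mode" = "ARCHIVELOG" ] &&
       [ "$tablespace" != "SYSTEM" ]; then
        echo "datafile taken offline"
    else
        # Assumption: every other combination ends in instance termination.
        echo "instance terminated"
    fi
}

media_failure_outcome TRUE  ARCHIVELOG USERS    # default setting, as in the first test
media_failure_outcome FALSE ARCHIVELOG USERS    # pre-11.2.0.2 behaviour restored
media_failure_outcome FALSE ARCHIVELOG SYSTEM   # SYSTEM files are never just offlined
```

The first call corresponds to the crash seen in the demonstration, where the parameter was at its default value of TRUE.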
Avoiding the Instance Termination
Now, let's test the same scenario after restoring the earlier media failure behaviour.
Check the value of the hidden parameter _DATAFILE_WRITE_ERRORS_CRASH_INSTANCE (it is TRUE by default).
sys@PRODCDB> SELECT a.ksppinm Param , b.ksppstvl SessionVal ,
             c.ksppstvl InstanceVal, a.ksppdesc Descr
             FROM
             x$ksppi a , x$ksppcv b , x$ksppsv c
             WHERE
             a.indx = b.indx AND
             a.indx = c.indx AND
             a.ksppinm LIKE '/_datafile_write_errors_crash_instance%' escape '/'
             ;

PARAM                                    SESSIONVAL INSTANCEVAL     DESCR
---------------------------------------- ---------- --------------- --------------------------------------------------
_datafile_write_errors_crash_instance    TRUE       TRUE            datafile write errors crash instance
Now, let's set the hidden parameter _DATAFILE_WRITE_ERRORS_CRASH_INSTANCE to FALSE, so that the instance does not crash when a database process tries to write to a datafile that has suffered a media failure.
sys@PRODCDB> alter system set "_datafile_write_errors_crash_instance"=FALSE;

System altered.

sys@PRODCDB> SELECT a.ksppinm Param , b.ksppstvl SessionVal ,
             c.ksppstvl InstanceVal, a.ksppdesc Descr
             FROM
             x$ksppi a , x$ksppcv b , x$ksppsv c
             WHERE
             a.indx = b.indx AND
             a.indx = c.indx AND
             a.ksppinm LIKE '/_datafile_write_errors_crash_instance%' escape '/'
             ;

PARAM                                    SESSIONVAL INSTANCEVAL     DESCR
---------------------------------------- ---------- --------------- --------------------------------------------------
_datafile_write_errors_crash_instance    FALSE      FALSE           datafile write errors crash instance
Let's simulate a media failure again. This time, I expect the database to keep running even after the media failure.
sys@PRODCDB> select tablespace_name,file_id,file_name,online_status from dba_data_files order by 1,2;

TABLESPACE_NAME    FILE_ID FILE_NAME                                                                     ONLINE_
--------------- ---------- ----------------------------------------------------------------------------- -------
SYSAUX                  33 +DATA/PRODCDB/05E6D1ADF1341A67E05305E6A8C088D7/DATAFILE/sysaux.300.864131435  ONLINE
SYSTEM                  32 +DATA/PRODCDB/05E6D1ADF1341A67E05305E6A8C088D7/DATAFILE/system.299.861526861  SYSTEM
USERS                   46 +DATA/PRODCDB/05E6D1ADF1341A67E05305E6A8C088D7/DATAFILE/users.375.864302607   ONLINE
USERS                   47 /app/oracle/data/prodcdb/prodpdb4_users_2.dbf                                 ONLINE

sys@PRODCDB> !rm /app/oracle/data/prodcdb/prodpdb4_users_2.dbf

sys@PRODCDB> !ls -lrt /app/oracle/data/prodcdb/prodpdb4_users_2.dbf
ls: /app/oracle/data/prodcdb/prodpdb4_users_2.dbf: No such file or directory
Let's perform a checkpoint, since a checkpoint writes to the datafile headers and the controlfile. If the checkpoint succeeds, we can be confident that the instance keeps running and that the remaining pluggable databases and their container database are not affected.
sys@PRODCDB> select sysdate from dual;

SYSDATE
--------------------
22-NOV-2014 13:51:08

sys@PRODCDB> alter system checkpoint;

System altered.

sys@PRODCDB> /

System altered.
As expected, the checkpoint worked fine this time with _DATAFILE_WRITE_ERRORS_CRASH_INSTANCE set to FALSE. Let's check the status of the datafile that suffered the media failure.
sys@PRODCDB> select tablespace_name,file_id,file_name,online_status from dba_data_files order by 1,2;

TABLESPACE_NAME    FILE_ID FILE_NAME                                                                     ONLINE_
--------------- ---------- ----------------------------------------------------------------------------- -------
SYSAUX                  33 +DATA/PRODCDB/05E6D1ADF1341A67E05305E6A8C088D7/DATAFILE/sysaux.300.864131435  ONLINE
SYSTEM                  32 +DATA/PRODCDB/05E6D1ADF1341A67E05305E6A8C088D7/DATAFILE/system.299.861526861  SYSTEM
USERS                   46 +DATA/PRODCDB/05E6D1ADF1341A67E05305E6A8C088D7/DATAFILE/users.375.864302607   ONLINE
USERS                   47 /app/oracle/data/prodcdb/prodpdb4_users_2.dbf                                 RECOVER
As we can see, the datafile (FILE# 47 in our case) is now in RECOVER (OFFLINE) state. We can now recover it without impacting the other pluggable databases or the container database.
This time, the alert log simply reported the media failure, along with a notification that the datafile (FILE# 47 in our case) was taken OFFLINE.
Sat Nov 22 13:51:14 2014
Beginning global checkpoint up to RBA [0x46.8150.10], SCN: 2422023
Sat Nov 22 13:51:14 2014
Errors in file /app/oracle/diag/rdbms/prodcdb/prodcdb/trace/prodcdb_ckpt_10263.trc:
ORA-01171: datafile 47 going offline due to error advancing checkpoint
ORA-01116: error in opening database file 47
ORA-01110: data file 47: '/app/oracle/data/prodcdb/prodpdb4_users_2.dbf'
ORA-27041: unable to open file
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3
Completed checkpoint up to RBA [0x46.8150.10], SCN: 2422023
Beginning global checkpoint up to RBA [0x46.8154.10], SCN: 2422029
Completed checkpoint up to RBA [0x46.8154.10], SCN: 2422029
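The two outcomes leave different signatures in the alert log: the crash case logs "terminating the instance due to error 63999", while the offline case logs ORA-01171. As a rough illustration, a small filter can tell them apart. The helper name classify_alert_log is hypothetical, and the patterns are taken only from the two log excerpts in this article, so other releases may word these messages differently.

```shell
#!/bin/sh
# Sketch: classify alert log text read from standard input.
# classify_alert_log is a hypothetical helper name; the patterns come
# from the two alert log excerpts shown in this article.
classify_alert_log() {
    awk '
        /terminating the instance due to error 63999/ { crashed = 1 }
        /ORA-01171/                                   { offlined = 1 }
        END {
            if (crashed)       print "instance terminated"
            else if (offlined) print "datafile taken offline"
            else               print "no media failure signature found"
        }
    '
}

# Example lines copied from the excerpts above.
printf '%s\n' 'USER (ospid: 5848): terminating the instance due to error 63999' |
    classify_alert_log
printf '%s\n' 'ORA-01171: datafile 47 going offline due to error advancing checkpoint' |
    classify_alert_log
```

In practice the function would be fed the relevant portion of the real alert log, for example with tail, rather than a single line.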
Oracle states in the note that this new undocumented parameter was introduced as a fix. In my opinion, however, it behaves more like a bug than a fix (especially for the multitenant architecture), considering that the fix is neither documented nor published.
Reference
PDB is Painful to CDB any cost – 12c ?
Media Failure Of Any PDB Datafile Crashes The Complete CDB (Doc ID 1605755.1)
Bug 7691270 Crash the DB in case of write errors (rather than just offline files)