Thursday 17 March 2016

How to Perform Clusterware Compatibility Testing in Oracle RAC

+ Compatibility testing can be done using the Oracle Certification Environment (OCE) kit.

Installing the Oracle Certification Environment Software for Oracle RAC

The OCE Certification Kit required to certify the system for Oracle RAC 11g Release 1 (11.1) is available for download only. The Single Instance certification tests should be completed prior to installing the OCE kit for Oracle RAC. Refer to the previous section if necessary. Once Single Instance testing has successfully completed, the single instance OCE installations must be archived to allow OCE installations for Oracle RAC to succeed.

The OCE kit for Oracle RAC should be installed separately on each node of the cluster. If the ORACLE_HOME is located on a shared disk, multiple installations of OCE will not be possible. In that case, it will not be possible to run OCE tests simultaneously, and the time required to complete certification will be greatly increased.

To install the OCE Kit:
1.      Download the OCE for Oracle RAC 11g Release 1 (11.1) archive to a suitable location, such as /tmp/oce. The OCE archives are either CPIO archives or compressed CPIO archives.
If compressed, extract as follows:
gunzip -c OCE_ARCHIVE | cpio -idmv
If not compressed, extract as follows:
cpio -idmv < OCE_ARCHIVE
where OCE_ARCHIVE is the name of the archive.
2.      Enter the following command to run the OCE Installer:
$ archive_location/oce_install.sh
The Environment variable screen is displayed.
3.      Enter values for each of the environment variables as described on screen.
Note: You must press the Enter key once to enter the new variable value, and then press the Enter key again to move on to the next variable.
4.      Type Done when finished and press the Enter key.
The installation progress screen is displayed. When all stages are complete, the installer will exit.
5.      Check the $ORACLE_HOME/OCEinstallRAC.log file, and verify that there are no errors.
6.      Download the Binaries Package for OCE for Oracle RAC 11g Release 1 (11.1) that matches your platform, and extract the archive as explained in Step 1.
7.      Enter the following command to run the OCE binaries installation script:
$ /tmp/oce/oce_exes_install.sh
8.      Check the OCE Kit installation log file, $ORACLE_HOME/OCE/install_log.txt, and verify that the installation was successful.
The kit is installed in the $ORACLE_HOME/oce directory.
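
Step 1 gives two extraction commands depending on whether the archive is compressed. The following is a minimal sketch of how that choice could be scripted. The function name oce_extract_cmd is hypothetical (it is not part of the OCE kit), and the sketch assumes that "compressed" means gzip, as the gunzip command in Step 1 suggests. It only prints the command to run, so nothing is extracted by accident.

```shell
#!/bin/sh
# Hypothetical helper: print the extraction command from Step 1 that
# matches the archive type. Assumes compressed archives are gzip files.
oce_extract_cmd() {
    archive="$1"
    # gzip -t exits 0 only if its input is a valid gzip stream.
    if gzip -t < "$archive" 2>/dev/null; then
        echo "gunzip -c $archive | cpio -idmv"
    else
        echo "cpio -idmv < $archive"
    fi
}
```

For example, oce_extract_cmd /tmp/oce/OCE_ARCHIVE prints the command to review before running it by hand.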

Preparing the System for Multi-Node High Availability Services Testing
The Oracle RAC High Availability Services test suite must use shared storage for data files. Depending on the type of shared storage utilized, some preliminary setup may be required. Before running the Oracle RAC High Availability Services test suite, complete the following:
  • If you are using raw devices or logical volumes for shared storage, perform the following steps:

Note:
In this example, the OCE user is oracle, which is a member of the dba group; there are 4 nodes in the cluster; and the OCE logical volumes are located in /dev/ocevg/ directory.

    1. Set up the devices or logical volumes required by the tests.
    2. Ensure raw devices are accessible and writable across all nodes by the OCE user:
       # chown -R oracle:dba /dev/ocevg
       # chmod -R og+w /dev/ocevg
    3. Export $ORACLE_HOME/oce/work, $ORACLE_HOME/dbs, and $ORACLE_HOME/network/admin from node 1 to all other nodes in the cluster.
On node 1:
       # exportfs -i -o rw <node2>:$ORACLE_HOME/oce/work \
         <node3>:$ORACLE_HOME/oce/work \
         <node4>:$ORACLE_HOME/oce/work
       # exportfs -i -o rw <node2>:$ORACLE_HOME/dbs \
         <node3>:$ORACLE_HOME/dbs \
         <node4>:$ORACLE_HOME/dbs
       # exportfs -i -o rw <node2>:$ORACLE_HOME/network/admin \
         <node3>:$ORACLE_HOME/network/admin \
         <node4>:$ORACLE_HOME/network/admin
    4. $ORACLE_HOME/oce/work, $ORACLE_HOME/network/admin, and $ORACLE_HOME/dbs must be mounted on all secondary nodes from the primary (exporting) node.
On all nodes except node 1:
       # mkdir -p $ORACLE_HOME/oce/work
       # chown oracle:dba $ORACLE_HOME/oce/work
       # mount <node1>:$ORACLE_HOME/oce/work $ORACLE_HOME/oce/work
       # mount <node1>:$ORACLE_HOME/dbs $ORACLE_HOME/dbs
       # mount <node1>:$ORACLE_HOME/network/admin \
         $ORACLE_HOME/network/admin
  • If you are using OCFS, NAS, or a vendor clustered file system (CFS), and the ORACLE_HOME directory is not located on a shared partition, then perform the following steps:

Note:
If you are using NAS, ensure that the appropriate mount options are employed when mounting the NAS partition. Oracle requires specific mount options. Consult your NAS Filer documentation for further details.

    1. Symbolically link $ORACLE_HOME/dbs to the OCFS/CFS/NAS partition on all nodes (in this example, the OCFS/CFS/NAS partition is at /sharedfs).
On node 1:
       mkdir /sharedfs/dbs
       chown oracle:dba /sharedfs
On all nodes:
       mv $ORACLE_HOME/dbs $ORACLE_HOME/dbs.BAK
       ln -s /sharedfs/dbs $ORACLE_HOME/dbs
    2. Export $ORACLE_HOME/oce/work and $ORACLE_HOME/network/admin from the primary node.
       # exportfs -i -o rw <node2>:$ORACLE_HOME/oce/work \
         <node3>:$ORACLE_HOME/oce/work \
         <node4>:$ORACLE_HOME/oce/work
       # exportfs -i -o rw <node2>:$ORACLE_HOME/network/admin \
         <node3>:$ORACLE_HOME/network/admin \
         <node4>:$ORACLE_HOME/network/admin
    3. $ORACLE_HOME/oce/work and $ORACLE_HOME/network/admin must be mounted on all secondary nodes from the primary (exporting) node. Default mount options will suffice.
On all nodes except node 1:
       # mkdir -p $ORACLE_HOME/oce/work
       # chown oracle:dba $ORACLE_HOME/oce/work
       # mount <node1>:$ORACLE_HOME/oce/work $ORACLE_HOME/oce/work
       # mount <node1>:$ORACLE_HOME/network/admin \
         $ORACLE_HOME/network/admin
  • If you are using OCFS or CFS accessing a shared Oracle home directory, no setup is required.
  • Ensure that no databases are running.
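
The exportfs commands in the steps above repeat one pattern per directory: the same directory exported read-write to every other node. A minimal sketch of generating them is shown here. The function name print_exports and the node names in the usage note are hypothetical. The function only prints each command so that it can be reviewed before being run as root on node 1.

```shell
#!/bin/sh
# Hypothetical helper: print one exportfs command that exports DIR
# read-write to every node listed after it.
# Usage: print_exports DIR NODE [NODE...]
print_exports() {
    dir="$1"; shift
    targets=""
    for node in "$@"; do
        targets="$targets $node:$dir"
    done
    echo "exportfs -i -o rw$targets"
}
```

With ORACLE_HOME set, a loop such as `for d in oce/work dbs network/admin; do print_exports "$ORACLE_HOME/$d" rac2 rac3 rac4; done` prints the three commands for the raw device case; drop dbs for the OCFS/CFS/NAS case.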

Starting Test Manager
To start Test Manager:
  1. Ensure that the DISPLAY environment variable is set appropriately for your system. To verify that it is, try starting up xclock. If you do not see the clock, or you receive errors, DISPLAY is not set appropriately. You must correct any errors before proceeding.
  2. Enter the following command to launch OCE Test Manager:
     $ORACLE_HOME/oce/bin/startTM.sh > /tmp/OCETM.log 2>&1
The OCE Main Menu and OCE Test Manager windows appear.
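
Before launching Test Manager, a quick scripted guard can catch an unset DISPLAY early. This is a minimal sketch with a hypothetical function name, check_display. It only verifies that the variable is non-empty and does not prove that the X server is reachable, so the xclock check in step 1 is still needed.

```shell
#!/bin/sh
# Hypothetical pre-flight check: fail if DISPLAY is unset or empty.
check_display() {
    if [ -z "${DISPLAY:-}" ]; then
        echo "DISPLAY is not set" >&2
        return 1
    fi
    echo "DISPLAY is set to $DISPLAY"
}
```

It could be chained in front of the launch command, for example `check_display && $ORACLE_HOME/oce/bin/startTM.sh > /tmp/OCETM.log 2>&1`.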
Running a Test for the First Time
If this is your first time running certification tests, you must perform the following steps:
  1. Start Test Manager as described in Starting Test Manager.
  2. From the OCE - Main Menu window, double click Utilities.
  3. Run the bmchk test by selecting it and clicking Execute.
  4. When the test completes, click Results in the Test Manager window to check the outcome. If the test fails, you must analyze the output ($INST_HOME/work/bmchk) and resolve any issues. Do not proceed with testing until bmchk executes successfully.
  5. Run sdbck (the Seed Database Verification utility) test by selecting it and clicking Execute.
  6. When the test completes, click Results in the Test Manager window to check the outcome. If the test fails, you must analyze the output ($INST_HOME/work/sdbck) and resolve any issues. Do not proceed with testing until sdbck executes successfully.
  7. Run cssck (the CSS Daemon Verification utility) test by selecting it and clicking Execute.
  8. When the test completes, click Results in the Test Manager window to check the outcome. If the test fails, you must analyze the output ($INST_HOME/work/cssck) and resolve any issues. Do not proceed with testing until cssck executes successfully.

Running the OCE Test Suites

The OCE release consists of a set of test suites that you run from Test Manager. Each test suite consists of one or more individual tests. To complete the certification, run each of the Test Suites in the kit for the product for which you are certifying your system.
1.      From the OCE - Main Menu screen, double-click Complete test suites.
2.      From the screen that appears, select the test suite you want to run and click Execute to run it.
The test suite runs.
Test Manager creates two entries for the test suite in the Test Manager window: one in the Current Tests field and another in the History field:
o    The entry in the Current Tests field is displayed only for the duration of a test. It displays the time at which you requested the test and, once the test starts, the time at which it started. Test Manager might display some tests with a status of Waiting until resources become available on the system.
o    The entry in the History field displays the time at which you requested the test.
When a test finishes, Test Manager deletes its entry in the Current Tests field and adds another entry to the History field showing when the test finished.






Below are the test plans for Oracle Clusterware Compatibility (Destructive) Testing
(Category: ORACLE HIGH AVAILABILITY FEATURES)

Each test plan lists, in order: Clusterware Test Category [Test Code], Action, Target, Detailed Test Execution, Expected Test Outcome, and Actual Test Outcome.

[D] Oracle HA Features


[HW-CW-09]

Action: Run multiple cluvfy operations during Oracle Clusterware and RAC install
Target: All RAC hosts

Configuration:
GNS:
GNS with DHCP (1)
GNS without DHCP (2)
Without GNS (3)
Preferred option 1; if not applicable, option 2; if still not applicable, option 3

ASM:
Flex ASM (1)
Standard ASM (2)
Preferred option 1; if not applicable, option 2

DB:
CDB

Preconditions:
·          Type `cluvfy` to see all available command syntax and options

Steps:
1- Run the cluvfy pre-condition check
2- Do the next install step
3- Run the cluvfy post-condition check
(`cluvfy comp software -n node_list`) to check the file permissions
There is no need to collect CRS/RDBMS logs for this test. You need to submit the cluvfy output.

Expected Test Outcome:
Vendor Clusterware:
- same as RAC

RAC:
-           Correct cluster verification checks given the state of the cluster hardware and software

Please provide the CVU-related logs under $CRS_HOME/cv/log


[HW-CW-10]

Action: Run concurrent crsctl start/stop crs commands to stop or start Oracle Clusterware in planned mode
Target: All RAC hosts

Configuration:
GNS:
GNS with DHCP (1)
GNS without DHCP (2)
Without GNS (3)
Preferred option 1; if not applicable, option 2; if still not applicable, option 3

ASM:
Flex ASM (1)
Standard ASM (2)
Preferred option 1; if not applicable, option 2

DB:
CDB

Preconditions:
·          Initiate all Workloads
·          Identify both CSS and CRS master nodes
·          Type `crsctl` as root to see all available command syntax and options

Steps:
1- As root user, run the `crsctl stop crs` command concurrently on more than one RAC host to stop the resident Oracle Clusterware stack
2- Wait until the target Oracle Clusterware stack is fully stopped (verify with the `ps` command)
3- As root user, run the `crsctl start crs -wait` command concurrently on more than one RAC host to start the resident Oracle Clusterware stack

Expected Test Outcome:
Vendor Clusterware:
- N/A

RAC:
Stop:  All Oracle Clusterware daemons stop without leaving open ports or zombie processes
Start:  All Oracle Clusterware daemons start without error messages in stdout or any of the CRS, CSS or EVM traces
Start:  All registered HA resource states match the "target" states, as per `crsctl stat res -t`

For 12cR1, collect `crsctl stat res -t` output in a 60s loop from the beginning to the end of the run. Attach the output for auditing.

[HW-CW-11]

Action: Run other concurrent crsctl commands, such as `crsctl check crs`
Target: All RAC hosts

Configuration:
GNS:
GNS with DHCP (1)
GNS without DHCP (2)
Without GNS (3)
Preferred option 1; if not applicable, option 2; if still not applicable, option 3

ASM:
Flex ASM (1)
Standard ASM (2)
Preferred option 1; if not applicable, option 2

DB:
CDB

Preconditions:
·          Initiate all Workloads
·          Identify both CSS and CRS master nodes
·          Type `crsctl` as root to see all available command syntax and options

Steps:
1-        As root user, run `crsctl check crs` concurrently on all nodes
2-        As root user, run `crsctl check cluster -all` concurrently on all nodes

Expected Test Outcome:
Vendor Clusterware:
- same as RAC

RAC:
-           Both the `crsctl check crs` and `crsctl check cluster -all` commands produce the appropriate, useful output, without any error messages
-           Collect the output for step 1 and step 2

[HW-CW-12]

Action: Votedisk and OCR operation

Configuration:
GNS:
GNS with DHCP (1)
GNS without DHCP (2)
Without GNS (3)
Preferred option 1; if not applicable, option 2; if still not applicable, option 3

ASM:
Flex ASM (1)
Standard ASM (2)
Preferred option 1; if not applicable, option 2

DB:
CDB

Preconditions:
·          Make sure the voting disk is on an ASM disk group
·          Make sure ASM OCR files are used
·          Make sure at least one normal-redundancy ASM disk group with three failgroups is created and its `compatible.asm` attribute is set to "11.2"

Steps:
1-        Make sure the CRS stack is running on all nodes;
2-        Run `crsctl query css votedisk` to check the configured voting files (VFs);
3-        Run `crsctl replace votedisk +{ASM_DG_NAME}` (as the CRS user or root user);
4-        Run `crsctl query css votedisk` to get the new VF list;
5-        Run `ocrconfig -add +{ASM_DGNAME}` as root user;
6-        Run `ocrcheck` to verify the OCR files;
7-        Restart the CRS stack and then verify the VF/OCR after it comes back;

Variants:
1. Add up to 5 OCR files and restart the CRS stack;

Expected Test Outcome:
RAC:
-           In 12cR1, up to 5 OCRs are supported;

[HW-CW-13]
Action: Use crsctl commands to manage the Oracle Clusterware stack

Configuration:
GNS:
GNS with DHCP (1)
GNS without DHCP (2)
Without GNS (3)
Preferred option 1; if not applicable, option 2; if still not applicable, option 3

ASM:
Flex ASM (1)
Standard ASM (2)
Preferred option 1; if not applicable, option 2

DB:
CDB

Preconditions:
·          The CRS stack is up and running on all nodes.

Steps:
1-        Run `crsctl check cluster -all` to get the stack status on all cluster nodes. Make sure the stack status of every cluster node is correct;
2-        Run `crsctl stop cluster -all` to stop all CRS resources (CSSD/CRSD/EVMD) along with the application resources;
3-        Run `crsctl status cluster -all` to make sure the CRS resources are OFFLINE;
4-        Run `crsctl start cluster -all` to bring back the whole cluster stack

Expected Test Outcome:
RAC:
-           After running `crsctl stop cluster -all`, use `ps -ef` to make sure all ocssd/evmd/crsd processes are stopped on all cluster nodes.
For 12cR1, collect `crsctl stat res -t` output in a 60s loop from the beginning to the end of the run. Attach the output for auditing.
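
The `ps -ef` check on the stopped stack in HW-CW-13 can be scripted so that the result is easy to attach. The sketch below assumes the process names ocssd, evmd and crsd taken from the expected outcome above. The function name running_daemons is hypothetical. It reads process-list text on standard input and prints the name of each of those daemons that still appears, so empty output means all three are gone.

```shell
#!/bin/sh
# Hypothetical helper: read `ps -ef` style output on stdin and print the
# names of the daemons (ocssd, evmd, crsd) that are still listed.
running_daemons() {
    ps_output=$(cat)
    for d in ocssd evmd crsd; do
        if printf '%s\n' "$ps_output" | grep -q "$d"; then
            echo "$d"
        fi
    done
}
```

On each node, `ps -ef | running_daemons` should print nothing once the stack has stopped.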


[HW-CW-14]

Action: With the OCR stored in an ASM disk group, kill an ASM fatal process

Configuration:
GNS:
GNS with DHCP (1)
GNS without DHCP (2)
Without GNS (3)
Preferred option 1; if not applicable, option 2; if still not applicable, option 3

ASM:
Flex ASM (1)
Standard ASM (2)
Preferred option 1; if not applicable, option 2

DB:
CDB

Preconditions:
·          Initiate Workloads

Steps:
·          Use `ocrcheck -config` to make sure only ASM OCR files are used;
·          Kill the ASM pmon process on the CRSD PE master node;

Variants:
   Repeat the same test on a non-OCR master node.

Expected Test Outcome:
Clusterware:
Because the OCR is stored in ASM, if ASM fails or is brought down on the CRSD PE master, the CRSD PE master will exit and a new CRSD PE master will be selected.

- ASM and CRSD will be restarted automatically.

- The RDBMS instance should connect to another available ASM instance in a Flex ASM environment.

- After CRSD restarts, the state of all resources should be unchanged.

- The new CRSD PE master node should be the old CRSD PE standby master. A new CRSD PE standby master should be elected on another node.

(CRSD should recover the previous state of the resources.)

For 12cR1, collect `crsctl stat res -t` output in a 60s loop from the beginning to the end of the run. Attach the output for auditing.



Collect Logfiles
Run each destructive test, taking note of the test start time, test stop time, and fault injection time. On the surviving node (if applicable), run `date; crsctl stat res -t; sleep 60` in a loop.
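
A minimal sketch of that sampling loop follows. The function name collect_status and the output file in the usage note are hypothetical. Unlike an open-ended loop, this version stops after a fixed number of samples, which is an assumption made here so that the loop ends on its own.

```shell
#!/bin/sh
# Hypothetical helper: print a timestamp and run a command every
# INTERVAL seconds, COUNT times in total.
# Usage: collect_status INTERVAL COUNT COMMAND [ARGS...]
collect_status() {
    interval="$1"; count="$2"; shift 2
    i=0
    while [ "$i" -lt "$count" ]; do
        date
        "$@"
        i=$((i + 1))
        # Sleep between samples, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

For a test expected to last about half an hour, `collect_status 60 30 crsctl stat res -t > /tmp/crs_status.log` would take one sample per minute.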

At the end of the test run, collect the following logs and put them in the directory <CRSHome>/log, using the name format [log_name]_[hostname]. Then tar up and compress them with the file name <VendorName>_<TestCode>.tar.gz (e.g. WidgetCorp_HW-STOR-07.tar.gz):
·         Under <CRSHome>/log/<hostname>, the following logs are required:
o   alert[hostname].log
o   crsd/crsd.log
o   cssd/ocssd.log
o   evmd/evmd.log
o   ohasd/ohasd.log
o   gpnpd/gpnpd.log
o   diskmon/diskmon.log
o   mdnsd/mdnsd.log
o   ctssd/ctssd.log
o   agent/*
o   gipcd/gipcd.log  (11.2.0.2 new feature)
o   cvu/cvulog/*.log (11.2.0.2 new feature)
o   cvu/cvutrc/*     (11.2.0.2 new feature)
o   srvm/*
o   admin/*
o   acfs/*
o   crfmond/*
o   crflogd/*
o   racg/*
o   gnsd/* (if gns configured)
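
The naming rules above can be captured in a few small helpers. This is a sketch only. The function names log_copy_name, bundle_name and bundle_logs are hypothetical, and the staging directory passed to bundle_logs is an assumption: it is expected to already hold the renamed log copies.

```shell
#!/bin/sh
# Hypothetical helpers for the log naming convention described above.

# [log_name]_[hostname]
log_copy_name() {
    echo "$1_$2"
}

# <VendorName>_<TestCode>.tar.gz
bundle_name() {
    echo "$1_$2.tar.gz"
}

# Create the compressed tar file in the current directory.
# Usage: bundle_logs STAGING_DIR VENDOR TESTCODE
bundle_logs() {
    tar -czf "$(bundle_name "$2" "$3")" -C "$1" .
}
```

For example, `bundle_name WidgetCorp HW-STOR-07` yields the file name shown in the example above.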



Monday 14 March 2016

Clusterware Compatibility (Destructive) Testing


Destructive tests include forced failures by software and hardware while the system is running with either minimal or high workload:
·         Oracle software: one or more Oracle background processes are killed manually.
·         OS software: one or more of the cluster daemons is killed manually, or the system is forced to reboot.
·         Hardware: manual removal of network or disk connectivity, or of the power supply.
There are two major categories of cluster compatibility tests:
·         Clusterware (Destructive): Starting with Oracle Database 10g, the certification and validation process has been enhanced to include hardware destructive tests executed under high system load.
·         Cluster File System: Starting with Oracle Database 11g, the certification and validation process has been further enhanced to include a set of destructive and high availability tests, designed to verify the use of a cluster file system to support the various Oracle Clusterware and Real Application Clusters components.


Sunday 13 March 2016

Useful adop options and syntax

1. Using the analytics parameter in the adop apply phase


$ adop phase=apply analytics=yes

Specifying this option will cause adop to run the following scripts and generate the associated output files (reports):

ADZDCMPED.sql - This script is used to display the differences between the run and patch editions, including new and changed objects. The output file location is: /u01/R122_EBS/fs_ne/EBSapps/log/adop/<adop_sessionID>/<apply_directory>/<context_name>/adzdcmped.out

ADZDSHOWED.sql - This script is used to display the editions in the system. The output file location is: /u01/R122_EBS/fs_ne/EBSapps/log/adop/<adop_sessionID>/<apply_directory>/<context_name>/adzdshowed.out

ADZDSHOWOBJS.sql - This script is used to display the summary of editioned objects per edition. The output file location is: /u01/R122_EBS/fs_ne/EBSapps/log/adop/<adop_sessionID>/<apply_directory>/<context_name>/adzdshowobjs.out

ADZDSHOWSM.sql - This script is used to display the status report for the seed data manager. The output file location is: /u01/R122_EBS/fs_ne/EBSapps/log/adop/<adop_sessionID>/<apply_directory>/<context_name>/adzdshowsm.out

Note: The analytics parameter should only be used when required, because of the extra processing needed.



2. flags=autoskip

 e.g. adop phase=apply patches=12345678 flags=autoskip

 This option is the adop alternative to the "Continue as if it were successful" prompt in adpatch. It is very useful when a patch fails to compile forms or reports and exits.
 We need not restart the adop session just to compile a single forms (.fmb, .fmx) file.
 Whenever you use the autoskip flag in an adop cycle, review the autoskip.log log file and fix the issues it reports.

3.  skipsyncerror=(yes|no)  [default: no]

 Specifies whether to ignore errors that may occur during incremental file system synchronization.  This might happen if you applied
 a patch in the previous patching cycle that had errors but decided to continue with the cutover.  When the patch is synchronized on
 the next patching cycle, the apply errors may occur again, but can be ignored.

4. wait_on_failed_job=(yes|no)  [default: no]

 Controls whether the adop apply command exits when all workers have failed. Instead of exiting, you can force adop to wait, and then use adctrl to retry the failed jobs.

e.g. adop phase=apply patches=7777 wait_on_failed_job=yes


5. Abort

The online patching cycle can be aborted at any time prior to cutover.

e.g. adop phase=abort

This is needed if an unrecoverable error has occurred or the user decides that the patch is not needed.
If adop phase=apply failed, the user should try abandon=yes first.
The abort command drops the database patch edition and returns the system to its normal runtime state. Immediately following abort, you must also run a full cleanup and
fs_clone operation to fully remove the effects of the failed online patching cycle.


            

What Happens in ADOP apply Phase

The apply phase is executed in the PATCH file system.
During the apply phase of an adop cycle, adop executes patch drivers to update the patch edition.
An adop cycle can have multiple apply phases, and multiple patches, including customizations, can be installed in this phase.

The production application remains online and accessible to users. The RUN file system is not affected by the changes.
Patches are applied to the copy (patch edition).
Changes are made in the isolation of an edition.
The running application is unaffected by these changes.


workers
Specifies the number of parallel workers; calculated automatically if not specified.
abandon
Abandons a patching session (if set to "yes"); must have the opposite specification to that of the "restart" parameter.
restart
Restarts a patching session (if set to "yes"); must have the opposite specification to that of the "abandon" parameter.

Example: $ adop phase=apply workers=4 abandon=yes restart=no patches=1234567,8909228
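
The rule that restart and abandon must have opposite values is easy to get wrong when building the command line by hand. The sketch below is a hypothetical pre-check, not an adop feature, and the function name check_restart_abandon is made up. It models only the rule as stated above, for the case where both parameters are given explicitly.

```shell
#!/bin/sh
# Hypothetical pre-check: succeed only if restart and abandon are given
# as opposite yes/no values, as the parameter descriptions above require.
# Usage: check_restart_abandon RESTART ABANDON
check_restart_abandon() {
    case "$1:$2" in
        yes:no|no:yes)
            return 0 ;;
        *)
            echo "restart=$1 and abandon=$2 are not opposite yes/no values" >&2
            return 1 ;;
    esac
}
```

A wrapper could call `check_restart_abandon no yes` before running the example command above.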




Autopatch error: The worker should not have status 'Running' or 'Restarted' at this point

If you are restarting a failed patch session in Oracle Applications, you may sometimes encounter the following error:

AutoPatch error:
The worker should not have status 'Running' or 'Restarted' at this point.

Telling workers to quit...

All workers have quit.

Connecting to APPS......Connected successfully.

AutoPatch error:

Error running SQL and EXEC commands in parallel

Cause:

1. The adpatch or adop process was killed at the OS level while the patch was being applied

2. The database was shut down or terminated



Solution:

In the adctrl utility, choose option 4 to change the worker status to Failed.


Review the messages above, then press [Return] to continue.

                    AD Controller Menu
     ---------------------------------------------------

     1.    Show worker status

     2.    Tell worker to restart a failed job

     3.    Tell worker to quit

     4.    Tell manager that a worker failed its job

     5.    Tell manager that a worker acknowledges quit

     6.    Restart a worker on the current machine

     7.    Exit


Enter your choice [1] : 4

Enter the worker number(s)/range(s) or 'all' for all workers,
or press [Return] to go back to the menu : all

Status changed to 'Failed' for worker 1.
Status changed to 'Failed' for worker 2.
Status changed to 'Failed' for worker 3.
Status changed to 'Failed' for worker 4.
Review the messages above, then press [Return] to continue.



Select option "1. Show worker status". The worker status will be "Failed".

Now select option "2. Tell worker to restart a failed job".


Enter the worker number(s)/range(s) or 'all' for all workers,
or press [Return] to go back to the menu : all

Status changed to 'Fixed, restart' for worker 1.
Status changed to 'Fixed, restart' for worker 2.
Status changed to 'Fixed, restart' for worker 3.
Status changed to 'Fixed, restart' for worker 4.


Restart adpatch

In 11i and R12.1.x versions of Oracle Applications:

When applying the patch using adpatch, answer Yes when it prompts

'Do you wish to Continue with Previous adpatch Session'

In R12.2.x, use the restart and abandon parameters to restart the patch session from where it failed:


adop phase=apply patches=123456 restart=yes abandon=no