Saturday, November 19, 2011

Troubleshooting the replication issues

Steps to Troubleshoot CUCM Database Replication Problems in 6.x and 7.x
(written by Bill Benninghoff, heavily borrowing from material written by Laurie Dotter and Nancy Balsbaugh)
If installation has proceeded correctly, then the informix cdr service should be running on the publisher and on each sub in the cluster.   “Cdr” in this context means “Continuous Data Replication”, not call detail records.
 
To set up replication, scripts run during the install process do the following:
 
a.       define replication on the pub
b.       define the template on the pub and realize it (tells pub what to replicate)
c.       define replication for each sub
d.       realize the template on each sub
e.       sync the data between the pub and subs using “cdr sync” or “cdr check”
 
It is possible that this process broke down at one of the steps.
 
If you look at the RTMT replication counter and see that the replication state counter is a 3 or a 4 for a given server, replication has failed for that server.
 
Here are some suggested steps to troubleshoot replication.
 
  1. Check the replication status using the following command, logged in on the pub as admin:
 
    a. utils dbreplication status
This will generate an output file.  Study the file to see whether replication is set up to each server and whether the data is in sync among the servers.
For example:
SERVER                 ID STATE    STATUS     QUEUE  CONNECTION CHANGED
-----------------------------------------------------------------------
g_bldr_ccm4_ccm         2 Active   Local           0               
g_bldr_ccm5_ccm         3 Active   Connected       0 Sep  6 16:27:15
The section above shows that replication is working on the pub (Local) and on the sub (Connected).
Node                  Rows     Extra   Missing  Mismatch Processed
---------------- --------- --------- --------- --------- ---------
g_bldr_ccm4_ccm          0         0         0         0         0
g_bldr_ccm5_ccm          0         0         0         0         0
The section above shows that there are no rows missing between the databases on the two servers.  They are in perfect sync.
Use the “utils dbreplication repair” command if replication is set up but some tables are out of sync.  If only one sub is out of sync, you can run “utils dbreplication repair <nodename>” for that node; otherwise use “utils dbreplication repair all” to fix it for all nodes.
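When the output file is long, the row-comparison tables can be scanned mechanically. The following is a minimal Python sketch, assuming you have copied the output file to a machine with Python and that the tables look like the one shown above. The function name and the non-zero counts in the sample are made up for illustration.

```python
def out_of_sync_nodes(report_text):
    """Return {node: counts} for nodes with non-zero Extra/Missing/Mismatch."""
    problems = {}
    in_table = False
    for line in report_text.splitlines():
        fields = line.split()
        if fields[:2] == ["Node", "Rows"]:
            in_table = True  # header row of a comparison table
        elif in_table and fields and set(line.strip()) <= set("- "):
            continue  # dashed separator under the header
        elif in_table and len(fields) == 6 and all(f.isdigit() for f in fields[1:]):
            node = fields[0]
            rows, extra, missing, mismatch, processed = map(int, fields[1:])
            if extra or missing or mismatch:
                counts = problems.setdefault(
                    node, {"extra": 0, "missing": 0, "mismatch": 0})
                counts["extra"] += extra
                counts["missing"] += missing
                counts["mismatch"] += mismatch
        else:
            in_table = False  # anything else ends the current table
    return problems


# Hypothetical sample: the second node's counts are invented for the example.
SAMPLE = """\
Node                  Rows     Extra   Missing  Mismatch Processed
---------------- --------- --------- --------- --------- ---------
g_bldr_ccm4_ccm          0         0         0         0         0
g_bldr_ccm5_ccm         12         0         3         1         0
"""

print(out_of_sync_nodes(SAMPLE))
```

Any node that shows up in the result is a candidate for “utils dbreplication repair <nodename>”.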
Here is an example of a problem with replication from the output file:
SERVER                 ID STATE    STATUS     QUEUE  CONNECTION CHANGED
-----------------------------------------------------------------------
g_bldr_ccm4_ccm         2 Active   Local           0               
g_bldr_ccm5_ccm         3 Active   Dropped       636 Sep 11 14:01:20
If you see that a server’s status is “Dropped” or “Quiescent” or just missing from the table, then you will need to troubleshoot the network connection between the pub and subs.
Another useful diagnostic command is “cdr list serv”.  You have to be root to run this command and you can run it on the pub and on each sub to show which servers have been defined from the perspective of the server you are on, and what state those defined servers are in.  Here is an example of the output of that command:
SERVER                 ID STATE    STATUS     QUEUE  CONNECTION CHANGED
-----------------------------------------------------------------------
g_acopup01_ccm          2 Active   Local           0               
g_acopus01_ccm          3 Active   Connected       0 Dec  6 19:31:46
g_acopus02_ccm          4 Active   Connected       0 Dec  6 19:31:47
g_acopus03_ccm          5 Active   Connected       0 Dec  6 19:31:47
g_acoput01_ccm          6 Active   Connected       0 Dec  6 19:31:46
g_erepus04_ccm          7 Active   Connected       0 Dec 14 11:08:53
g_erepus05_ccm          8 Active   Connected       0 Dec  6 19:31:46
g_erepus06_ccm          9 Active   Connected       0 Dec  6 19:31:47
g_ereput02_ccm         10 Active   Connected       0 Dec  6 19:31:46
g_londos07_ccm         11 Active   Connected       0 Dec  6 19:31:47
A state of “Active” with a status of “Local” or “Connected” is good.  A bad status would be “Dropped” or “Quiescent”.  If a server is missing from this list, then it is not yet defined.
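On a large cluster it is easy to overlook one bad row. Here is a rough Python sketch that reads saved “cdr list serv” output and reports servers that are dropped or not defined. It assumes the column layout shown above and treats an Active state with a Local or Connected status as healthy. The function name and the third server name are hypothetical.

```python
GOOD_STATUSES = {"Local", "Connected"}


def check_servers(cdr_output, expected_servers):
    """Return {server: problem} for servers that are unhealthy or undefined."""
    seen = {}
    for line in cdr_output.splitlines():
        fields = line.split()
        # Data rows look like: name, numeric id, state, status, queue, ...
        if len(fields) >= 4 and fields[1].isdigit():
            seen[fields[0]] = (fields[2], fields[3])

    report = {}
    for name in expected_servers:
        if name not in seen:
            report[name] = "not defined"
        else:
            state, status = seen[name]
            if state != "Active" or status not in GOOD_STATUSES:
                report[name] = f"{state}/{status}"
    return report


SAMPLE = """\
SERVER                 ID STATE    STATUS     QUEUE  CONNECTION CHANGED
-----------------------------------------------------------------------
g_bldr_ccm4_ccm         2 Active   Local           0
g_bldr_ccm5_ccm         3 Active   Dropped       636 Sep 11 14:01:20
"""

# g_bldr_ccm6_ccm is an invented name standing in for a sub that was never defined.
print(check_servers(SAMPLE, ["g_bldr_ccm4_ccm", "g_bldr_ccm5_ccm", "g_bldr_ccm6_ccm"]))
```

Running the same check against the output captured on the pub and on each sub shows whether the servers agree on who is defined.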
  2. Now assuming that one of your subs is missing or dropped from the list, the first thing to look at is possible network errors.  Do the following to test the network connection:
 
    a. ping the pub from the sub with a large amount of data:
 
ping <pub name> -s 1500
    b. ping the sub from the pub with a large amount of data:
 
ping <sub name> -s 1500
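The reason a 1500-byte ping is a useful test is that it cannot fit in a single packet on a typical link, so it also checks whether fragments get through. The arithmetic can be sketched in Python as follows. The figures here are assumptions, not from the original notes: IPv4 with a 20-byte header and no options, an 8-byte ICMP echo header, and a 1500-byte Ethernet MTU.

```python
import math

IP_HEADER = 20    # bytes, IPv4 header without options (assumed)
ICMP_HEADER = 8   # bytes, ICMP echo header


def fragments_needed(payload_bytes, mtu=1500):
    """Number of IPv4 packets needed to carry one ICMP echo request."""
    icmp_message = ICMP_HEADER + payload_bytes
    if IP_HEADER + icmp_message <= mtu:
        return 1
    # Every fragment except the last must carry a multiple of 8 data bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return math.ceil(icmp_message / per_fragment)


for size in (1472, 1500):
    print(size, "data bytes ->", fragments_needed(size), "packet(s)")
```

Under these assumptions, 1472 data bytes is the largest ping that still fits in one packet, so “-s 1500” forces fragmentation. If small pings succeed and the large one fails, suspect a device on the path that drops fragments.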
    c. Verify that cluster manager (clm) (or ipsec_mgr in 5.1.2 and earlier) is responding to the host by analyzing the /var/log/active/platform/log/clustermgr* logs. (It is platform_mgr in earlier loads.)
Clm is responsible for adding hosts to the iptables rules. Clm on the sub and clm on the pub exchange handshakes.  Clm on the pub then puts the sub in the policy injected state and adds the host to the iptables rules, which allows replication to work.  So if iptables is blocking replication, the clms are not talking. Clm communicates over 8500/udp, often with large packets, which means they are fragmented. If PMTU discovery is broken (i.e., ICMP packets are dropped or not sent) or fragments are not allowed through the network, then clm does not communicate, iptables is not opened, and as a result replication does not work.
In the clm logs on the pub look for entries about communications with the sub, most importantly one saying that the sub was put into policy injected state.
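If the clm logs are large, a small filter can pull out the lines of interest. This Python sketch assumes that the relevant entries contain both the sub's name and the words “policy injected”. The exact wording of the log entries is an assumption, so adjust the phrase to match what your logs actually say. The function names are made up for illustration.

```python
import glob


def policy_injected_lines(log_lines, sub_name, phrase="policy injected"):
    """Return lines that mention both the sub and the phrase (case-insensitive)."""
    sub = sub_name.lower()
    wanted = phrase.lower()  # assumed wording; change it to match your logs
    return [ln.rstrip("\n") for ln in log_lines
            if sub in ln.lower() and wanted in ln.lower()]


def scan_clm_logs(sub_name, pattern="/var/log/active/platform/log/clustermgr*"):
    """Apply the filter to every file matching the log path named in the text."""
    hits = []
    for path in sorted(glob.glob(pattern)):
        with open(path, errors="replace") as fh:
            hits.extend(policy_injected_lines(fh, sub_name))
    return hits
```

An empty result for a sub that should be in the cluster suggests that the handshake never completed, which points back at the network checks above.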
    d. Make sure that the dbl rpc service is running on the subs and on the pub.  Do this by typing this command as root on the subs where replication is not working:
dbl rpchello <pub name>
If that command returns an error, then check whether the dbl rpc service is running by doing this:
            ps -ef | grep dblrpc
If you don’t see anything, that is a problem.  Dblrpc must be running on the sub in order for replication to be set up.  Once replication is established, dblrpc no longer needs to run.
To start up the dbl rpc service on the sub do this as root:
            controlcenter.sh  "A Cisco DB Replicator"  start
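One thing to watch for with the ps check above: the grep process itself usually appears in the listing, so a single matching line does not prove that dblrpc is running. A minimal Python sketch of the same check that ignores the grep line follows. The function names are hypothetical, and it assumes Python is available wherever you run it or that you feed it saved ps output.

```python
import subprocess


def dblrpc_running(ps_output):
    """True if any process line mentions dblrpc, ignoring the grep itself."""
    return any("dblrpc" in line and "grep" not in line
               for line in ps_output.splitlines())


def local_ps_output():
    """Capture `ps -ef` from the machine this script runs on."""
    return subprocess.run(["ps", "-ef"], capture_output=True, text=True).stdout
```

Calling dblrpc_running(local_ps_output()) on the sub gives a yes/no answer. If the answer is no, start the service with the controlcenter.sh command above and try “dbl rpchello” again.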
    e. Another thing you may need to do in order to enable the sub and the pub informix processes to talk with each other is to turn off the Linux firewall, which is done with the iptables command as root:
 
/etc/init.d/iptables stop
    f. As user informix, you can start the following command shell by typing this:
dbaccess
This runs a program in which you can select “connect” and try to connect to the informix database on each server in the cluster.  If you are able to connect from the sub to the pub, then the network connection is good and your problem is something else.
  3. Check the configuration files on the pub and subs:
 
    a. Look at these four files and make sure the entries in these files on the pub match the entries in these files on the subs:
1.       /etc/hosts
2.       /etc/services (very bottom of the file)
3.       /home/informix/.rhosts
4.       /usr/local/active/cm/db/informix/etc/sqlhosts
 
    b. In particular, make sure that in the sqlhosts file there is only one entry for each node in the cluster.
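Both checks (matching entries, and no duplicates in sqlhosts) are easy to script once you have copies of the files. This is a rough Python sketch, assuming the usual Informix sqlhosts layout of whitespace-separated fields with the server name first and “#” starting a comment. It treats the first field as the node identifier, which is an assumption; pass field=2 to compare by host name instead. The function names are made up.

```python
from collections import Counter


def sqlhosts_entries(text):
    """Split sqlhosts text into lists of fields, skipping comments and blanks."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            entries.append(line.split())
    return entries


def duplicate_nodes(text, field=0):
    """Names that appear in more than one entry (field 0 = server name)."""
    counts = Counter(e[field] for e in sqlhosts_entries(text) if len(e) > field)
    return sorted(name for name, n in counts.items() if n > 1)


def differing_entries(pub_text, sub_text):
    """Entries present on only one of the two servers."""
    pub = {tuple(e) for e in sqlhosts_entries(pub_text)}
    sub = {tuple(e) for e in sqlhosts_entries(sub_text)}
    return sorted(pub ^ sub)
```

Feed duplicate_nodes the contents of sqlhosts from each server, and feed differing_entries the pub's copy and a sub's copy. The differing_entries comparison works equally well for /etc/hosts and .rhosts, since it only compares whitespace-separated fields.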
 
 
  4. When you are sure that network connectivity is working, run the following commands to re-establish replication:
a.  on the sub run this as admin:   utils dbreplication stop
b.  on the pub run this as admin:  utils dbreplication stop
c.  on the pub run this as admin:  utils dbreplication reset <name of sub that is not working>
  5. Check if replication is now working:
    a. Go to this directory as root on the pub:
/var/log/active/cm/trace/dbl
Run this command:
ls -lrt
This will list all the files in the directory with the most recent files at the bottom.  Scan that list to see if there is a file with the word “define” in the filename and also the name of the server or servers that are having trouble.
For example:
-rw-rw-rw-  1 root  root    9392 Sep 15 15:55 dbl_repl_cdr_define_nw104a_202-2007_09_15_15_53_27.log
If this file shows up it is a good sign.  The system is attempting to define replication for that server.  Open the file and make sure there are no errors in the file.
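The directory can hold a lot of trace files, so here is a small Python sketch of the same search. It assumes, as in the example filename above, that the define logs contain both the word “define” and the server name in the filename. The search for the word “error” inside the file is a guess at what a failure looks like, not a documented format, and the function names are hypothetical.

```python
from pathlib import Path


def define_logs(trace_dir, server_name):
    """Define logs for one server, oldest first (same order as ls -lrt)."""
    matches = [p for p in Path(trace_dir).iterdir()
               if p.is_file() and "define" in p.name and server_name in p.name]
    return sorted(matches, key=lambda p: p.stat().st_mtime)


def lines_mentioning_errors(path):
    """Lines containing the word 'error' in any case (assumed failure marker)."""
    with open(path, errors="replace") as fh:
        return [ln.rstrip("\n") for ln in fh if "error" in ln.lower()]
```

For example, define_logs("/var/log/active/cm/trace/dbl", "nw104a_202") would return the file shown above if it exists, and the last element of the list is the most recent attempt.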
    b. Another thing to do is to issue this command on the pub:
onstat -m
This will display the tail of the ccm.log.  This is the main informix replication log, which shows what is currently happening with replication.  Look for possible errors in that file.
  6. Additional notes:
 
To see the CLI commands, type:
admin:   utils dbreplication  ?
The four most commonly used commands are:
                 utils dbreplication status 
       (checks each table on all servers to see whether any tables are out of sync)
                 utils dbreplication repair
       (use to sync tables, run it on the pub. Can be run for a sub, or all.
       “utils dbreplication repair <nodename>” will sync one sub.
       “utils dbreplication repair all” will sync all.)
                 utils dbreplication stop
       (use this command on sub and pub before a “utils dbreplication reset”.
         Run the “stop” locally on the sub, then on the pub. If you are going to reset all, run stop on each sub, then on the pub)
                 utils dbreplication reset
        (use to restart replication on one sub or all nodes.
        Use “utils dbreplication reset <nodename>” to reset replication on one sub.
        Use “utils dbreplication reset all” to reset replication on pub and all subs.)
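The ordering of stop and reset matters, so it can help to write the plan down before touching the cluster. The following Python sketch only builds the list of (node, command) steps described above and does not run anything. The function name and the node names in the example are made up.

```python
def reset_plan(pub, subs, target="all"):
    """Ordered (node, command) steps: stop on sub(s), stop on pub, reset on pub."""
    stop_on = list(subs) if target == "all" else [target]
    steps = [(node, "utils dbreplication stop") for node in stop_on]
    steps.append((pub, "utils dbreplication stop"))
    steps.append((pub, f"utils dbreplication reset {target}"))
    return steps


# Hypothetical cluster: one pub and two subs, resetting only sub2.
for node, command in reset_plan("pub1", ["sub1", "sub2"], target="sub2"):
    print(f"{node}: {command}")
```

With target="all", the plan stops every sub first, then the pub, and finishes with “utils dbreplication reset all” on the pub.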