Steps to Troubleshoot CUCM Database Replication Problems in 6.x and 7.x
(written by Bill Benninghoff, heavily borrowing from material written by Laurie Dotter and Nancy Balsbaugh)
If installation has proceeded correctly, then the Informix cdr service should be running on the publisher and on each sub in the cluster. “cdr” in this context means “Continuous Data Replication”, not call detail records.
To set up replication, scripts run during the install process do the following:
a. define replication on the pub
b. define the template on the pub and realize it (tells pub what to replicate)
c. define replication for each sub
d. realize the template on each sub
e. synch the data between the pub and subs using “cdr sync” or “cdr check”
It is possible that this process broke down at one of the steps.
If the RTMT replication state counter shows a 3 or a 4 for a given server, replication has failed for that server.
Here are some suggested steps to troubleshoot replication.
- Check the replication status by running the following command on the pub, logged in as admin:
- utils dbreplication status
This generates an output file. Study the file to see whether replication is set up to each server and whether the data is in synch among the servers.
For example:
SERVER ID STATE STATUS QUEUE CONNECTION CHANGED
-----------------------------------------------------------------------
g_bldr_ccm4_ccm 2 Active Local 0
g_bldr_ccm5_ccm 3 Active Connected 0 Sep 6 16:27:15
This section above means that replication is working on the pub (Local) and on the sub (Connected).
Node Rows Extra Missing Mismatch Processed
---------------- --------- --------- --------- --------- ---------
g_bldr_ccm4_ccm 0 0 0 0 0
g_bldr_ccm5_ccm 0 0 0 0 0
This section above means that there are no rows missing between the databases on the two servers. They are in perfect synch.
Use the “utils dbreplication repair” command if replication is set up but some tables are out of sync. If only one sub is out of sync, you can run it for just that node; otherwise use “utils dbreplication repair all” to fix all nodes.
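The out-of-sync check on the second table can also be done mechanically. The following is an illustrative sketch, not a Cisco-provided tool; it assumes the six-column Node/Rows/Extra/Missing/Mismatch/Processed layout shown above and node names beginning with “g_”:

```shell
#!/bin/sh
# Hypothetical helper: print any node in the "utils dbreplication status"
# output file whose Extra, Missing, or Mismatch count is non-zero.
# Assumes the six-column table layout shown above (an assumption, not a
# documented format guarantee).
check_sync() {
  awk 'NF == 6 && $1 ~ /^g_/ && ($3 > 0 || $4 > 0 || $5 > 0) { print $1 }' "$1"
}
```

Any node it prints is a candidate for a “utils dbreplication repair” run.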
Here is an example of a problem with replication from the output file:
SERVER ID STATE STATUS QUEUE CONNECTION CHANGED
-----------------------------------------------------------------------
g_bldr_ccm4_ccm 2 Active Local 0
g_bldr_ccm5_ccm 3 Active Dropped 636 Sep 11 14:01:20
If you see that a server’s status is “Dropped” or “Quiescent”, or the server is simply missing from the table, then you will need to troubleshoot the network connection between the pub and subs.
Another useful diagnostic command is “cdr list serv”. You have to be root to run this command, and you can run it on the pub and on each sub. It shows which servers have been defined from the perspective of the server you are on, and what state those defined servers are in. Here is an example of the output of that command:
SERVER ID STATE STATUS QUEUE CONNECTION CHANGED
-----------------------------------------------------------------------
g_acopup01_ccm 2 Active Local 0
g_acopus01_ccm 3 Active Connected 0 Dec 6 19:31:46
g_acopus02_ccm 4 Active Connected 0 Dec 6 19:31:47
g_acopus03_ccm 5 Active Connected 0 Dec 6 19:31:47
g_acoput01_ccm 6 Active Connected 0 Dec 6 19:31:46
g_erepus04_ccm 7 Active Connected 0 Dec 14 11:08:53
g_erepus05_ccm 8 Active Connected 0 Dec 6 19:31:46
g_erepus06_ccm 9 Active Connected 0 Dec 6 19:31:47
g_ereput02_ccm 10 Active Connected 0 Dec 6 19:31:46
g_londos07_ccm 11 Active Connected 0 Dec 6 19:31:47
A status of “Local” or “Connected” (with state “Active”) is good. A bad status would be “Dropped” or “Quiescent”. If a server is missing from this list then it is not yet defined.
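That check can be scripted as well. A minimal sketch, assuming the column layout shown above (SERVER, ID, STATE, STATUS, …) and g_-prefixed server names; this is not a Cisco tool:

```shell
#!/bin/sh
# Hypothetical helper: scan "cdr list serv" output and print any server whose
# STATUS column is something other than Local or Connected (e.g. Dropped,
# Quiescent). Column positions are assumed from the sample output above.
check_serv() {
  awk '$1 ~ /^g_/ && $4 != "Local" && $4 != "Connected" { print $1, $4 }' "$1"
}
```

Note it cannot detect a server that is missing entirely; compare its input against your expected node list for that.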
- Now, assuming that one of your subs is missing or dropped from the list, the first thing to look at is possible network errors. Do the following to test the network connection:
- ping the pub from the sub with a large amount of data:
ping <pub name> -s 1500
- ping the sub from the pub with a large amount of data:
ping <sub name> -s 1500
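The two ping tests above can be wrapped in a small script for a cluster with several subs. This is an illustrative sketch only: host names are passed as arguments, and the PING variable is overridable so the logic can be exercised without a live network:

```shell
#!/bin/sh
# Hypothetical wrapper around the large-packet ping test described above.
# PING is overridable (e.g. for testing); hosts are given as arguments.
PING="${PING:-ping}"

check_link() {
  # A 1500-byte payload exceeds a standard 1500-byte MTU once headers are
  # added, so this also exercises fragment handling on the path.
  if $PING -c 3 -s 1500 "$1" >/dev/null 2>&1; then
    echo "OK: $1"
  else
    echo "FAIL: $1"
  fi
}

for h in "$@"; do
  check_link "$h"
done
```

Run it on the pub with the sub names as arguments, and on each sub with the pub name, so the path is tested in both directions.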
- Verify that cluster manager (clm) (or ipsec_mgr in 5.1.2 and earlier) is responding to the host by analyzing /var/log/active/platform/log/clustermgr* logs. (It is platform_mgr in earlier loads).
Clm is responsible for adding hosts to the iptables rules. The clm on the sub and the clm on the pub exchange handshakes; the clm on the pub puts the sub into the policy injected state and adds the host to the iptables rules, allowing replication to work. So if iptables is blocking replication, the clms are not talking. Clm communicates over UDP port 8500, often with large packets, which means they are fragmented. If PMTU discovery is broken (i.e., ICMP packets are dropped or not sent) or fragments are not allowed through the network, then clm does not communicate, iptables is not opened, and as a result replication does not work.
In the clm logs on the pub, look for entries about communications with the sub, most importantly one saying that the sub was put into the policy injected state.
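A quick way to look for that entry is a small grep wrapper. This is a hypothetical helper; the log path and the exact wording of the message vary by release, so treat both as assumptions and adjust the search string to what your clustermgr logs actually contain:

```shell
#!/bin/sh
# Hypothetical helper: look for the "policy injected" entry in a clm log.
# The exact message text is an assumption and may differ by release.
check_clm_log() {
  if grep -qi "policy injected" "$1" 2>/dev/null; then
    echo "handshake ok: sub reached the policy injected state"
  else
    echo "no policy-injected entry: suspect UDP 8500 or fragment filtering"
  fi
}
```

Run it against each clustermgr log under /var/log/active/platform/log/ on the pub.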
- Make sure that the dbl rpc service is running on the pub and on the subs. Do this by typing this command as root on the subs where replication is not working:
dbl rpchello <pub name>
If that command returns an error, check whether the dblrpc service is running:
ps -ef | grep dblrpc
If you don’t see anything, that is a problem. Dblrpc must be running on the sub in order for replication to be set up. Once replication is established, dblrpc no longer needs to run.
To start up the dbl rpc service on the sub do this as root:
controlcenter.sh "A Cisco DB Replicator" start
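The check-and-start sequence above can be sketched as a small script. This is illustrative, not a Cisco tool: PGREP and START are overridable so the logic can be tested offline, and controlcenter.sh is the service-control script named above:

```shell
#!/bin/sh
# Hypothetical wrapper: check whether dblrpc is running, and start
# "A Cisco DB Replicator" via controlcenter.sh if it is not.
# PGREP and START are overridable for testing.
PGREP="${PGREP:-pgrep}"
START="${START:-controlcenter.sh}"

ensure_dblrpc() {
  if $PGREP -f dblrpc >/dev/null 2>&1; then
    echo "dblrpc running"
  else
    echo "dblrpc not running; starting"
    $START "A Cisco DB Replicator" start
  fi
}
```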
- Another thing you may need to do to enable the sub and pub Informix processes to talk to each other is to turn off the Linux firewall, which is done with the iptables command as root:
/etc/init.d/iptables stop
- As user informix, you can start a command shell by typing:
dbaccess
This runs a program in which you can select “Connect” and try to connect to the Informix database on each server in the cluster. If you are able to connect from the sub to the pub, then the network connection is good and your problem is something else.
- Check the configuration files on the pub and subs:
- look at these four files and make sure the entries in these files on the pub match the entries in these files on the subs:
1. /etc/hosts
2. /etc/services (very bottom of the file)
3. /home/informix/.rhosts
4. /usr/local/active/cm/db/informix/etc/sqlhosts
- In particular, make sure that in the sqlhosts file there is only one entry for each node in the cluster.
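The duplicate-entry check on sqlhosts can be automated. A minimal sketch, assuming only that the entry name is the first whitespace-separated column (true of the Informix sqlhosts format) and that comment lines start with “#”:

```shell
#!/bin/sh
# Hypothetical sanity check: print any entry name that appears more than
# once in an sqlhosts-style file (entry name = first column).
check_dupes() {
  awk 'NF && $1 !~ /^#/ { print $1 }' "$1" | sort | uniq -d
}
```

Empty output means every node has exactly one entry; any name printed appears more than once and should be cleaned up.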
- When you are sure that network connectivity is working, run the following commands to re-establish replication:
a. on the sub run this as admin: utils dbreplication stop
b. on the pub run this as admin: utils dbreplication stop
c. on the pub run this as admin: utils dbreplication reset <name of sub that is not working>
- Check if replication is now working:
- go to this directory as root on the pub:
/var/log/active/cm/trace/dbl
Run this command:
ls -lrt
This lists all the files in the directory with the most recent files at the bottom. Scan that list for a file whose name contains the word “define” along with the name of the server or servers that are having trouble.
For example:
-rw-rw-rw- 1 root root 9392 Sep 15 15:55 dbl_repl_cdr_define_nw104a_202-2007_09_15_15_53_27.log
If this file shows up, it is a good sign: the system is attempting to define replication for that server. Open the file and make sure it contains no errors.
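Finding and checking those define logs can be wrapped in a helper. This is a hypothetical sketch that assumes only the filename pattern shown in the example above (dbl_repl_cdr_define_<node>…log) and that failures mention the word “error”:

```shell
#!/bin/sh
# Hypothetical helper: find the cdr define logs for a given node in the dbl
# trace directory and grep each one for errors. Filename pattern is assumed
# from the example above.
check_define_logs() {
  dir="$1"; node="$2"; found=0
  for f in "$dir"/dbl_repl_cdr_define_*"$node"*.log; do
    [ -e "$f" ] || continue
    found=1
    if grep -qi "error" "$f"; then
      echo "errors in: $f"
    else
      echo "clean: $f"
    fi
  done
  [ "$found" -eq 1 ] || echo "no define log found for $node"
}
```

For example: check_define_logs /var/log/active/cm/trace/dbl <sub name>. No define log at all suggests the define step never ran for that sub.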
- Another thing to do is to issue this command on the pub:
onstat -m
This displays the tail of ccm.log, the main Informix replication log, which shows what is currently happening with replication. Look for possible errors in that file.
- Additional notes:
To see the CLI commands, type:
admin: utils dbreplication ?
The four most commonly used commands are:
utils dbreplication status
(checks each table on all servers to see if any tables are out of synch)
utils dbreplication repair
(use to sync tables, run it on the pub. Can be run for a sub, or all.
“utils dbreplication repair <nodename>” will sync one sub.
“utils dbreplication repair all” will sync all.)
utils dbreplication stop
(use this command on the sub and pub before a “utils dbreplication reset”.
Run the “stop” locally on the sub first, then on the pub. If you are going to reset all, run stop on each sub, then on the pub)
utils dbreplication reset
(use to restart replication on one sub or all nodes.
Use “utils dbreplication reset <nodename>” to reset replication on one sub.
Use “utils dbreplication reset all” to reset replication on pub and all subs.)