Monitoring MySQL Replication

From AgileApps Support Wiki

Run the Monitoring Scripts

Use these scripts to track the health of the MySQL replication process:

This script monitors the health of the replication process. It sends an email containing the quartz status (if quartz is configured), the last 20 lines from mysqld.log, and the status of the slave server.
Run it periodically on the slave server, and check the reported values. (See below for information on what to look for.)
Sends you an email if anything new has been placed into mysqld.log (error messages, warnings, or notices).
Run it every 15 minutes on both master and slave servers.
Sends a message if the Slave SQL thread is not working.
Run it every 15 minutes on the slave server.

To use the scripts, substitute your values for the following strings:

  • <username> - Database username.
  • <password> - Database password.
  • <path_to_mysql_log> - The path to the MySQL log. For example: /var/log/mysqld.log
  • <support_email> - The email address that should receive the monitoring emails.

Cron Job Schedule

As root, use the crontab –e command to set up a regular monitoring schedule like this one:

 15  6,18 * * * $path_to_scripts/scripts/mail_replication_log.pl
*/15  *   * * * $path_to_scripts/scripts/monitor_mysql_log.pl 
*/15  *   * * * $path_to_scripts/scripts/scripts/monitor_replication.pl

Those settings run the mail_replication script every day at 6:15 and 18:15. The other two scripts run every 15 minutes.

Reviewing the Messages

  • monitor_replication sends a message only when the Slave SQL thread isn't running. At a minimum, you'll need to restart that process. Ideally, you'll find out what killed it and take preventative measures.
  • monitor_mysql_log sends a message when the MySQL log has been updated. The last 20 lines of the log are included in the message. Scan it for errors, warnings, and notices.
  • mail_replication_log provides a general indication of the health of the system. Things to look for:
  • The Last_Errno and Last_Error values tell you the last error that occurred.
  • The Seconds_Behind_Master value compares timestamps generated by the Quartz scheduler to give you the lag time between the master and server databases. That useful indicator is at the end of the message, where it's easy to find.
  • Ideally, the value should be zero.
  • If there is a large lag, then most likely replication is running, but the synchronization thread is not working properly. The solution is to run the STOP SLAVE and START SLAVE commands.
  • Once running, it's a good idea to run the script manually a few times.
  • You should see the lag time decreasing.
  • If it isn't going down, it's a sign of deeper troubles that require investigation and resolution.
(There might be network issues, for example, or database problems.)

Deciding when to Failover

The failover process turns the replication (slave) server into the new primary database (master) server.

In general, this action should be taken only in an emergency situation.

For example:

  • The master database is down and cannot be restored.
  • The master can be restored from backups, but the restore process will take more than an hour.

Here are the decision criteria:

  • How much time it will take to get the main database running again. (This is preferred option, when practical, as little data is lost. But the platform is non-operational until it happens, or until failover occurs.)
  • How much data will be lost by switching to the replication server. (It will be a minute amount, if the replication process was healthy. But it is likely to be a non-zero amount.)
  • How much time the failover process will take.
  • How much data will be lost by restoring the main database from backups. (The data loss could be significantly higher, depending on when the last backup occurred. For that reason, the backups are best reserved for recovery from a catastrophe.)

Failing Over

Assumptions:

  • Master and slave servers are configured properly.
  • Both have a unique id in my.cnf
  • The master server is not available, forcing a failover to the replication server.

Process:

1. Stop all application servers.
2. Disable replication on the slave server:
  • Execute as root in mysql console: STOP SLAVE;
  • Execute "RESET SLAVE;"
    That command tells the slave server to forget it's position in binary log.
  • Add skip-slave-start to the my.cnf file.
    That command prevents the replication thread from starting when MySQL is launched (as this server is now the master).
3. On each application server, edit $AGILEAPPSHOME/tomcat/conf/server.xml.
Change the database IP address in the url parameter:
<Resource name="jdbc/shareddbpool" auth="Container" type="javax.sql.DataSource"
  ... 
  url="jdbc:mysql://$IP_ADDR_TO_CHANGE:3306/relationals?..."
/>
4. On each application server (if the slave server was configured to run reports),
edit $AGILEAPPSHOME/tomcat/conf/server.xml.
Change the url parameter:
<Resource name="jdbc/REDUNDANTSRC1" auth="Container" type="javax.sql.DataSource" 
  ...
  url="jdbc:mysql://$IP_ADDR_TO_CHANGE/relationals?..." 
/>
5. On each application server (if quartz is configured), edit $AGILEAPPSHOME/tomcat/conf/RN/quartz.properties
Change the database IP address in the url parameter:
org.quartz.dataSource.myDS.URL = jdbc:mysql://$IP_ADDR_TO_CHANGE/relationals?...
6. On each application server, edit $AGILEAPPSHOME/tomcat/conf/RN/activiti.cfg.xml
Change the database IP address:
<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource" >
  <property name="..." value=..." />
  <property name="url" value="jdbc:mysql://$IP_ADDR_TO_CHANGE/relationals?..." />
  ...
</bean>
7. Restart all memcache servers and start all application servers
8. Take a dump of the new master database.
9. Set up a new replication server.
Learn more: Setting Up a MySQL Replication Server