waqas bhatti / notes / the mariadb galera cluster database

updated on 2017-03-23: updated my.cnf for MariaDB 10.1 (added the wsrep_on option and a dedicated user for replication instead of using root), fixed a missing comment # in my.cnf, and added the extra steps necessary if the Debian repo for Percona is unavailable.

originally written 2016-10-01.

Background

Since the HATSouth exoplanet survey is distributed over geographical sites on three different continents (in Chile, Namibia, and Australia), it can be challenging to manage the data flow from each site to the headquarters at Princeton and vice-versa. All processing carried out on the frames taken is logged to a MySQL database, present at each site and at headquarters. These "calibration" databases must be synchronized periodically so that we know where our data are and what to do with the image frames next.

Before 2015, we were using MySQL dumps to files, rsyncing these around the network, and reconciling any commits on each server (while locking entire tables to ensure consistency). This was automated, but it would often fail: some sites would complete their processing without issues while others did not, leaving the entire schema inconsistent and requiring manual intervention to sort things out. This meant pausing our automatic calibration and photometry pipelines, making us fall further and further behind with respect to the HATSouth observations.

Moving to a clustered database

We eventually decided to evaluate clustered databases as a potential solution to our problem. The basic idea was to have a commit to the database at any one of the sites or at headquarters be applied to all database servers in the network immediately, rather than having to wait for reconciliation. This effectively called for a multi-master replication scheme. Since we were already using MySQL, we decided to try out the MariaDB Galera Cluster database.

Below is a short guide to getting started with a MariaDB Galera Cluster, with parameters tuned for multi-master replication over a wide area network. Note that we don't use Debian/RPM packages for installation. We need to run the MariaDB Galera servers on a non-standard port and from a non-default location because we want to keep the system MySQL installation around for other local databases. If this isn't an issue for your particular case, feel free to use the official MariaDB Debian or RPM packages from their repository; these have Galera replication built-in and service control is far more straightforward.

After some teething issues and tweaking of replication parameters over a period of six months or so, we've finally converged on a mostly-stable clustered database that works as we expected. It's a bit slower in the short term than the previous solution since any commit must be acknowledged by all participating servers (by design), but the consistency guarantee outweighs this particular downside. Extended Internet outages (4-6 hours long) sometimes happen at our remote sites, but the network usually recovers automatically. For longer outages, the calibration database node that lost connectivity to the rest of the network shuts down automatically to preserve its own consistency—this doesn't affect observing. We have monitoring enabled on the network, so we get notified of such problems, and can recover by restarting the database node, which then resyncs automatically in most cases.

Installing the database

We use the pre-built binaries available from the MariaDB download page. Most of our computers at the sites run older versions of Debian, so we install some required packages from the Debian archive repository and the Percona Debian repository.

Enable the Percona repository, then install the required packages (we use Debian, but RPM-based installs are similar):

[root@server:~ ]# apt-get update && apt-get install lsof libaio1 libaio-dev socat pv
[root@server:~ ]# apt-get update && apt-get install percona-xtrabackup

NOTE: If you're still stuck on Debian 6, the Percona repositories above won't work. You'll need to get the xtrabackup binaries from Percona, and copy them over to the /usr/bin directory:

[root@server:~ ]# wget https://www.percona.com/downloads/XtraBackup/Percona-XtraBackup-2.4.6/binary/tarball/percona-xtrabackup-2.4.6-Linux-x86_64.tar.gz
[root@server:~ ]# tar xvf percona-xtrabackup-2.4.6-Linux-x86_64.tar.gz
[root@server:~ ]# cd percona-xtrabackup-2.4.6-Linux-x86_64/bin
[root@server:bin ]# cp * /usr/bin
[root@server:bin ]# cd ../man/man1; cp * /usr/local/share/man/man1/
[root@server:man1 ]# mandb

Next, extract the tarball, copy it to the install location (we use /opt/mariadb-galera), add a user and group for the database server, and make the data directory (we use /data/mariadb-galera):

[root@server:~ ]# cd /opt; tar xvf mariadb-tarball.tar.gz
[root@server:/opt ]# mv mariadb-tarball mariadb-galera # fix the dirname
[root@server:/opt ]# groupadd mariadb
[root@server:/opt ]# useradd -g mariadb mariadb
[root@server:/opt ]# cd /data; mkdir mariadb-galera
[root@server:/data ]# chown -R mariadb.mariadb mariadb-galera

# if you have an older Debian (6 or 7), you may need to
# symlink your libcrypto and libssl so mariadb can find them
[root@server:~ ]# cd /usr/lib
[root@server:/usr/lib ]# ln -s libssl.so.0.9.8 libssl.so.6
[root@server:/usr/lib ]# ln -s libcrypto.so.0.9.8 libcrypto.so.6

Configuration files

We'll need to create a my.cnf in the /opt/mariadb-galera directory to activate the replication provider and tune parameters that control the replication.

Below is an example my.cnf file, with wsrep_provider_options replication parameters tuned for use with high-latency WAN links between the various database nodes. It's best to use a very large Galera cache (gcache.size—we use 5 GB), so a full state-snapshot transfer (SST) isn't necessary after an extended outage leaves one or more nodes behind the rest of the cluster.

It's good to use longer timeout values (evs.inactive_timeout, evs.suspect_timeout, evs.keepalive_period, evs.join_retrans_period, evs.install_timeout) so brief outages won't kick a node off the cluster. Another item to check is wsrep_retry_autocommit, which we set to 5, so every commit is tried at least five times before returning with an error. Finally, we set a large number of commit-apply threads (wsrep_slave_threads) so that each node applies changes as fast as possible: we use 8 for our quad-core machines.

Useful logging options include logging all commit-conflicts (wsrep_log_conflicts) and logging everything if you're trying to hunt down a problem (wsrep_debug).

# The following options will be passed to all MariaDB clients
[client]
port            = 3307  # note this is non-standard
socket          = /data/mariadb-galera/mariadb.sock

# The MariaDB server
[mysqld]

# settings for datadir and basedir
port            = 3307 # again non-standard, since we run MySQL and MariaDB both
socket          = /data/mariadb-galera/mariadb.sock
user            = mariadb
datadir = /data/mariadb-galera
basedir = /opt/mariadb-galera

innodb_file_per_table

skip-external-locking
key_buffer_size = 16M
max_allowed_packet = 1M
table_open_cache = 64
sort_buffer_size = 512K
net_buffer_length = 8K
read_buffer_size = 256K
read_rnd_buffer_size = 512K
myisam_sort_buffer_size = 8M

## innodb settings for galera cluster ##
binlog_format = ROW
default_storage_engine = InnoDB
innodb_autoinc_lock_mode = 2
innodb_locks_unsafe_for_binlog=1
innodb_buffer_pool_size = 128M
innodb_flush_log_at_trx_commit = 0  # apparently recommended by Galera

## wsrep settings for galera cluster ##
wsrep_on=ON  # this is needed to turn on replication for MariaDB 10.1
wsrep_cluster_name = [your cluster name]
wsrep_cluster_address = 'gcomm://[IP address of other cluster node],[IP address of another cluster node],[etc...]'
wsrep_node_name = [name of this cluster node]
wsrep_node_address = [IP address of this cluster node]
wsrep_provider = '/opt/mariadb-galera/lib/libgalera_smm.so'

# can set this to rsync as well (but that's unreliable)
# xtrabackup requires extra packages
wsrep_sst_method=xtrabackup-v2

# this is the user that runs the xtrabackup-v2 replication
# for MariaDB 10.1, the root user can't connect remotely,
# so we use a dedicated user for this purpose called 'repliuser'
wsrep_sst_auth=repliuser:[repliuser password]

## uncomment this if you want all nodes to get SSTs from one node
# wsrep_sst_donor = [master node]

# verbose logging
wsrep_debug=OFF # change to ON if you want everything logged
wsrep_log_conflicts=ON

# retry in case of deadlocks in autocommit mode
wsrep_retry_autocommit=5

# increase number of wsrep threads
wsrep_slave_threads=8

# tune these to match timeouts and delays expected from the WAN connection
# between nodes
wsrep_provider_options = "evs.keepalive_period = PT3S;evs.suspect_timeout = PT5M;evs.inactive_timeout = PT7M;evs.join_retrans_period=PT0.6S;evs.install_timeout = PT7M;gcache.size=5G"


# Point the following paths to different dedicated disks
#tmpdir         = /tmp/

#skip-networking
# this isn't really required if skip_networking is commented out
# and you want MariaDB to listen on all interfaces
# if you want MariaDB to listen on only a single interface,
# set its IP address below
# bind-address = [IP address of this cluster node]

# binary logging is required for point in time recovery
# note that this makes tons of binary .log files in the datadir
# comment this out if not needed
log-bin=mysql-bin

# required unique id between 1 and 2^32 - 1
# defaults to 1 if master-host is not set
# but will not function as a master if omitted
server-id       = 1

[mysqldump]
quick
max_allowed_packet = 16M

[mysql]
no-auto-rehash
# enforces a 'where' in update statements
safe-updates

[myisamchk]
key_buffer_size = 20M
sort_buffer_size = 20M
read_buffer = 2M
write_buffer = 2M

[mysqlhotcopy]
interactive-timeout

Replication methods and firewalls

We use the Percona xtrabackup-v2 method for state snapshot transfers (SSTs) across the cluster. It's also possible to use rsync, but that's proven unreliable for us. Using xtrabackup-v2 requires the Percona innobackupex script to be installed (from Percona's apt repository, discussed above).

To make the cluster work correctly, you'll need to open several ports on the server firewalls, as well as on any intervening site firewalls. We use the following port scheme: TCP and UDP 4444 (state snapshot transfers), TCP 4567 (Galera group communication), TCP 4568 (incremental state transfers), and TCP 3307 (the MariaDB client port, used by innobackupex).
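As an illustration, the corresponding firewall rules might look like the following (a sketch only; we assume iptables is the firewall in use, and [IP of peer node] is a placeholder to be repeated for each of the other cluster nodes):

# allow Galera/SST traffic from a peer cluster node (repeat per peer)
[root@server:~ ]# iptables -A INPUT -p tcp -s [IP of peer node] --dport 3307 -j ACCEPT
[root@server:~ ]# iptables -A INPUT -p tcp -s [IP of peer node] --dport 4567 -j ACCEPT
[root@server:~ ]# iptables -A INPUT -p tcp -s [IP of peer node] --dport 4568 -j ACCEPT
[root@server:~ ]# iptables -A INPUT -p tcp -s [IP of peer node] --dport 4444 -j ACCEPT
[root@server:~ ]# iptables -A INPUT -p udp -s [IP of peer node] --dport 4444 -j ACCEPT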

Starting up the cluster

The steps below should be carried out on each node of the cluster. First, install the database's system tables:

[root@server:/opt/mariadb-galera ]# ./scripts/mysql_install_db --defaults-file=/opt/mariadb-galera/my.cnf

Next, start up the server to set up the root user:

[root@server:/opt/mariadb-galera ]# /opt/mariadb-galera/bin/mysqld_safe --defaults-file=/opt/mariadb-galera/my.cnf --pid-file=/data/mariadb-galera/pid --wsrep-new-cluster --skip-grant-tables &

Check the file /data/mariadb-galera/[hostname].err for log messages and sort out any issues. Then, connect to the server and set up the root user and any other users and databases you want.

[root@server:/opt/mariadb-galera ]# ./bin/mysql --socket=/data/mariadb-galera/mariadb.sock -u root
# switch to the mysql database
MariaDB [(none)]> use mysql;

# set password for root
MariaDB [mysql]> update user set password=PASSWORD("[mariadb root password]") where user='root';

# delete anonymous users
MariaDB [mysql]> delete from user where user = '';

# update grant tables with changes
MariaDB [mysql]> flush privileges;

Shut down the servers on all nodes (just use kill %1) after you're done setting them up. Next, we'll set up a dedicated user for replication via the xtrabackup-v2 method. This requires starting up the local server again and adding the user:

# start the server
[root@server:/opt/mariadb-galera ]# /opt/mariadb-galera/bin/mysqld_safe --defaults-file=/opt/mariadb-galera/my.cnf --pid-file=/data/mariadb-galera/pid --wsrep-new-cluster &

# connect to the server, you'll need your root password this time
[root@server:/opt/mariadb-galera ]# ./bin/mysql --socket=/data/mariadb-galera/mariadb.sock -u root -p

# switch to the mysql database
MariaDB [(none)]> use mysql;

# add a user dedicated to replication
# this will be used by the xtrabackup-v2 replication method
# and can log in from anywhere that has access to this DB over the network
# you may want to restrict privileges below if needed

MariaDB [mysql]> create user 'repliuser'@'%' identified by '[repliuser password]';
MariaDB [mysql]> grant all privileges on *.* to 'repliuser'@'%';

# update grant tables with changes
MariaDB [mysql]> flush privileges;

Kill the server again to conclude the initial user setup. Then, start one node in "new cluster" mode (this will be the leader node for now), and the other nodes in normal mode, so they can all find each other.

# on the first node
[root@firstnode:/opt/mariadb-galera ]# nohup /opt/mariadb-galera/bin/mysqld_safe --defaults-file=/opt/mariadb-galera/my.cnf --pid-file=/data/mariadb-galera/pid --wsrep-new-cluster >> /var/log/mariadb-stdout.log &

# on the other nodes
[root@othernode:/opt/mariadb-galera ]# nohup /opt/mariadb-galera/bin/mysqld_safe --defaults-file=/opt/mariadb-galera/my.cnf --pid-file=/data/mariadb-galera/pid >> /var/log/mariadb-stdout.log &

To check if the nodes are connected, connect to one of the nodes via mysql and check the WSREP status variables. wsrep_cluster_status should be Primary, and wsrep_evs_state should be OPERATIONAL.

[root@server:/opt/mariadb-galera ]# ./bin/mysql --socket=/data/mariadb-galera/mariadb.sock -u root

MariaDB [(none)]> show status like '%cluster%';
+--------------------------+--------------------------------------+
| Variable_name            | Value                                |
+--------------------------+--------------------------------------+
| wsrep_cluster_conf_id    | 54                                   |
| wsrep_cluster_size       | 5                                    |
| wsrep_cluster_state_uuid | 9e2320c6-f4fd-11e5-bdbf-b235a8f9fd59 |
| wsrep_cluster_status     | Primary                              |
+--------------------------+--------------------------------------+
4 rows in set (0.00 sec)

MariaDB [(none)]> show status like '%wsrep_evs_state%';
+-----------------+-------------+
| Variable_name   | Value       |
+-----------------+-------------+
| wsrep_evs_state | OPERATIONAL |
+-----------------+-------------+
1 row in set (0.00 sec)

Once the cluster is up, any subsequent restarts of any of the servers should not use the --wsrep-new-cluster flag, since this will dissociate the node from the cluster. As long as one node stays up, the cluster will remain intact. If all nodes go down, then you'll have to do state recovery.

At this point, you can install the service file included with the MariaDB binary distribution (/opt/mariadb-galera/support-files/mysql.server) for use with update-rc.d and /etc/init.d/. We usually don't bother with this; we just add the appropriate command line to /etc/rc.local so the server starts up at reboot, and use kill to make the main mariadb process exit cleanly if needed. The cluster replication usually takes care of handling data resyncs, and we have monitoring for when a cluster node goes down.
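For reference, the line we add to /etc/rc.local is essentially the normal-mode start command from above (shown here as a sketch; place it before the final exit 0, and note that --wsrep-new-cluster must not be used at boot):

# in /etc/rc.local, before the final 'exit 0':
nohup /opt/mariadb-galera/bin/mysqld_safe --defaults-file=/opt/mariadb-galera/my.cnf --pid-file=/data/mariadb-galera/pid >> /var/log/mariadb-stdout.log &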

Database backups

We use the innobackupex script that comes with the Percona xtrabackup package (that we installed earlier) to handle backups. This doesn't require long-held locks on tables so downtime while taking backups is minimal. Here's a wrapper shell script that can be used with cron:

#!/bin/bash

DBUSER=root
DBPASS=[your DB root password here]
DBSOCK=/data/mariadb-galera/mariadb.sock
DBCONF=/opt/mariadb-galera/my.cnf
BAKDIR=/path/to/backup/directory

# find older backups and kill them
echo "removing backups older than 30 days"
find "$BAKDIR" -mindepth 1 -maxdepth 1 -type d -ctime +30 -exec rm -rvf '{}' \;

# do the actual backup for today
echo "starting today's backup"
innobackupex --defaults-file=$DBCONF --user=$DBUSER --password=$DBPASS --socket=$DBSOCK  --ftwrl-wait-threshold=40 --ftwrl-wait-query-type=all --ftwrl-wait-timeout=180 --kill-long-queries-timeout=20 --kill-long-query-type=all --galera-info $BAKDIR
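To run this automatically, a crontab entry along the following lines works (the script path, log file, and schedule below are placeholders):

# run the backup wrapper every night at 02:30, logging its output
30 2 * * * /path/to/backup-script.sh >> /var/log/mariadb-backup.log 2>&1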

After a backup has been completed (wait for it to say "Completed OK"), you'll need to prepare the backup before it's in restore-ready state:

[root@server:~ ]# innobackupex --apply-log /path/to/backup/directory/[datetime of backup]

To restore from a backup:

# restore the backup
[root@server:~ ]# innobackupex --copy-back /path/to/backup/directory/[datetime of backup]

# restore the permissions back to mariadb
[root@server:~ ]# chown -R mariadb.mariadb /data/mariadb-galera

Monitoring the network

The sophisticated way to do this is to install ClusterControl, a web interface for managing and monitoring a Galera replication cluster. We just use a shell script that runs as a periodic cronjob and does the following: it pings each database node, and if the node is up, logs in via SSH and runs mysql with a simple select count(*) statement on one of our smaller replicated tables that gets updated frequently. If all is well, we see the same (steadily increasing) count on all nodes. If there's a mismatch, or if one of the nodes is not pingable, is unreachable via SSH, or has its database server down, we get alert emails. We then log in manually to see what's happening.
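Here's a minimal sketch of that kind of monitoring script. The node names, the table being counted, and the alert email address are placeholders (not our production values), and we assume passwordless SSH from the monitoring host plus a working mail command:

#!/bin/bash
# minimal monitoring sketch: ping each node, query a small replicated table
# over SSH, and email an alert on any failure or row-count mismatch.
# NODES, CHECKQUERY, and ALERT_EMAIL below are placeholders.

NODES="node1.example.org node2.example.org node3.example.org"
ALERT_EMAIL="you@example.org"
CHECKQUERY="select count(*) from somedb.sometable"

COUNTS=""

for NODE in $NODES; do
    # is the node reachable at all?
    if ! ping -c 2 -W 5 "$NODE" > /dev/null 2>&1; then
        echo "$NODE is not pingable" | mail -s "galera monitor: $NODE unreachable" "$ALERT_EMAIL"
        continue
    fi

    # run the check query against the local MariaDB server on the node
    COUNT=$(ssh "$NODE" "/opt/mariadb-galera/bin/mysql \
        --socket=/data/mariadb-galera/mariadb.sock \
        -u root -p'[mariadb root password]' -N -B -e '$CHECKQUERY'" 2>/dev/null)

    if [ -z "$COUNT" ]; then
        echo "could not query MariaDB on $NODE" | mail -s "galera monitor: DB down on $NODE" "$ALERT_EMAIL"
        continue
    fi

    COUNTS="$COUNTS $NODE:$COUNT"
done

# all reachable nodes should report the same count; alert on any mismatch
NUNIQUE=$(echo $COUNTS | tr ' ' '\n' | cut -d: -f2 | sort -u | grep -c .)
if [ "$NUNIQUE" -gt 1 ]; then
    echo "row count mismatch across nodes: $COUNTS" | mail -s "galera monitor: count mismatch" "$ALERT_EMAIL"
fi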

This system has worked fairly well for us. There are usually a few false-alarm alert emails per week because of nodes under load taking too long to respond, but those are easy to verify.

Recovering from multiple-node failure

Sometimes the entire cluster falls apart: some nodes drop out, quorum is lost, and the remaining nodes shut themselves down to preserve consistency. Recovery in this case usually means finding the cluster node with the most advanced state, designating that node as the new leader, and then starting up all the other nodes one by one until they all join the new cluster.

To find the most advanced state of the cluster, we need to temporarily start each node in isolated mode. This is done by commenting out the wsrep_cluster_address = gcomm://... line so the server doesn't try to rejoin the other nodes, and then starting the node using mysqld_safe with the --wsrep-new-cluster option. Then, connect to the server and figure out the replication state.
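Concretely, the isolated start looks like this (the same paths and commands used earlier in this guide):

# in /opt/mariadb-galera/my.cnf, temporarily comment out the cluster address:
# wsrep_cluster_address = 'gcomm://[IP address of other cluster node],...'

# then start the node in isolation:
[root@server:/opt/mariadb-galera ]# /opt/mariadb-galera/bin/mysqld_safe --defaults-file=/opt/mariadb-galera/my.cnf --pid-file=/data/mariadb-galera/pid --wsrep-new-cluster &

# and connect to it to check the replication state:
[root@server:/opt/mariadb-galera ]# ./bin/mysql --socket=/data/mariadb-galera/mariadb.sock -u root -p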

MariaDB [(none)]> show status like '%wsrep_cluster_state_uuid%';
+--------------------------+--------------------------------------+
| Variable_name            | Value                                |
+--------------------------+--------------------------------------+
| wsrep_cluster_state_uuid | 6873b24f-4066-11e5-ba5a-3a0ab9be4ca4 |
+--------------------------+--------------------------------------+
1 row in set (0.00 sec)

MariaDB [(none)]> show status like '%wsrep_last_committed%';
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| wsrep_last_committed | 1086816 |
+----------------------+---------+
1 row in set (0.00 sec)

The wsrep_last_committed value is what you want to compare across all nodes. The node with the highest value is the one with the most advanced state and should be used to reform the cluster. Shut down the server, uncomment the ...gcomm://... line, then restart the nodes one by one:

# on the new leader node
[root@firstnode:/opt/mariadb-galera ]# nohup /opt/mariadb-galera/bin/mysqld_safe --defaults-file=/opt/mariadb-galera/my.cnf --pid-file=/data/mariadb-galera/pid --wsrep-new-cluster >> /var/log/mariadb-stdout.log &

# on the other nodes
[root@othernode:/opt/mariadb-galera ]# nohup /opt/mariadb-galera/bin/mysqld_safe --defaults-file=/opt/mariadb-galera/my.cnf --pid-file=/data/mariadb-galera/pid >> /var/log/mariadb-stdout.log &

If you're lucky, there'll be no need for full state-snapshot transfers (SSTs) across the network, because the Galera cache (we set this to 5 GB) held enough changesets for an incremental state transfer to work. The cluster should come back quickly in this case. If not, a full SST will take place, and you'll have to wait a while (sometimes around 8 hours in our case, given slow internet at some of our sites) for the cluster to come back completely.

Flushing binary logs

Although it's not necessary for Galera replication to use MySQL binary logs, these might be useful to keep around for point-in-time recovery. Over time, these .log files can fill up the MariaDB data directory. To periodically remove them, one can do the following:

MariaDB [(none)]> set global expire_logs_days=10;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> flush logs;
Query OK, 0 rows affected, 20 warnings (3 min 5.57 sec)

The first command above sets the expiration period for the binary logs, so in this example, logs older than 10 days will be removed automatically from then on. The second command flushes the logs immediately, removing anything older than the expiration period.
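To make the expiration persistent across server restarts, the same setting can also go into my.cnf (a minimal example; add it under the [mysqld] section):

# in /opt/mariadb-galera/my.cnf, under [mysqld]:
expire_logs_days = 10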

Useful references