updated on 2017-03-23: updated my.cnf for MariaDB 10.1 (added wsrep_on keyword arg and dedicated user for replication instead of using root), fixed missing comment # in my.cnf, added extra steps necessary if Debian repo for Percona is unavailable.
originally written 2016-10-01.
Background
Since the HATSouth exoplanet survey is distributed over geographical sites on three different continents (in Chile, Namibia, and Australia), it can be challenging to manage the data flow from each site to the headquarters at Princeton and vice-versa. All processing carried out on the frames taken is logged to a MySQL database, present at each site and at headquarters. These "calibration" databases must be synchronized periodically so that we know where our data are and what to do with the image frames next.
Before 2015, we were using MySQL dumps to files, rsyncing these around the network, and reconciling any commits on each server (while locking entire tables to ensure consistency). This was automated, but it would often fail: some sites would complete their processing without issues while others failed, leaving the entire schema inconsistent and requiring manual intervention to sort things out. This meant pausing our automatic calibration and photometry pipelines, making us fall further and further behind the HATSouth observations.
Moving to a clustered database
We eventually decided to evaluate clustered databases as a potential solution to our problem. The basic idea was to have a commit to the database at any one of the sites or at headquarters be applied to all database servers in the network immediately, rather than having to wait for reconciliation. This effectively called for a multi-master replication scheme. Since we were already using MySQL, we decided to try out the MariaDB Galera Cluster database.
Below is a short guide to getting started with a MariaDB Galera Cluster, with parameters tuned for multi-master replication over a wide area network. Note that we don't use Debian/RPM packages for installation. We need to run the MariaDB Galera servers on a non-standard port and from a non-default location because we want to keep the system MySQL installation around for other local databases. If this isn't an issue for your particular case, feel free to use the official MariaDB Debian or RPM packages from their repository; these have Galera replication built-in and service control is far more straightforward.
After some teething issues and tweaking of replication parameters over a period of six months or so, we've finally converged on a mostly-stable clustered database that works as we expected. It's a bit slower in the short term than the previous solution since any commit must be acknowledged by all participating servers (by design), but the consistency guarantee outweighs this particular downside. Extended Internet outages (4-6 hours long) sometimes happen at our remote sites, but the network usually recovers automatically. For longer outages, the calibration database node that lost connectivity to the rest of the network shuts down automatically to preserve its own consistency—this doesn't affect observing. We have monitoring enabled on the network, so we get notified of such problems, and can recover by restarting the database node, which then resyncs automatically in most cases.
Installing the database
We use the pre-built binaries available from the MariaDB download page. Most of our computers at the sites run older versions of Debian, so we install some required packages from the Debian archive repository and from the Percona Debian repository.
Enable the Percona repository, and then install some required packages (we use Debian, but RPM-based installs are similar):
[root@server:~ ]# apt-get update && apt-get install lsof libaio1 libaio-dev socat pv
[root@server:~ ]# apt-get update && apt-get install percona-xtrabackup
NOTE: If you're still stuck on Debian 6, the Percona repositories above won't work. You'll need to get the xtrabackup binaries from Percona, and copy them over to the /usr/bin directory:
[root@server:~ ]# wget https://www.percona.com/downloads/XtraBackup/Percona-XtraBackup-2.4.6/binary/tarball/percona-xtrabackup-2.4.6-Linux-x86_64.tar.gz
[root@server:~ ]# tar xvf percona-xtrabackup-2.4.6-Linux-x86_64.tar.gz
[root@server:~ ]# cd percona-xtrabackup-2.4.6-Linux-x86_64/bin
[root@server:bin ]# cp * /usr/bin
[root@server:bin ]# cd ../man/man1; cp * /usr/local/share/man/man1/
[root@server:man1 ]# mandb
Next, extract the tarball, copy it to the install location (we use /opt/mariadb-galera), add a user and group for the database server, and make the data directory (we use /data/mariadb-galera):
[root@server:~ ]# cd /opt; tar xvf mariadb-tarball.tar.gz
[root@server:/opt ]# mv mariadb-tarball mariadb-galera  # fix the dirname
[root@server:/opt ]# groupadd mariadb
[root@server:/opt ]# useradd -g mariadb mariadb
[root@server:/opt ]# cd /data; mkdir /data/mariadb-galera; cd /data
[root@server:/data ]# chown -R mariadb.mariadb mariadb-galera

# if you have an older Debian (6 or 7), you may need to
# symlink your libcrypto and libssl so mariadb can find them
[root@server:~ ]# cd /usr/lib
[root@server:~ ]# ln -s libssl.so.0.9.8 libssl.so.6
[root@server:~ ]# ln -s libcrypto.so.0.9.8 libcrypto.so.6
Configuration files
We'll need to create a my.cnf in the /opt/mariadb-galera directory to activate the replication provider and tune the parameters that control replication.
Below is an example my.cnf file, with wsrep_provider_options replication parameters tuned for use with high-latency WAN links between the various database nodes. It's best to use a very large Galera cache (gcache.size; we use 5 GB) so that a full state-snapshot transfer (SST) isn't necessary after an extended outage leaves one or more nodes behind the rest of the cluster.
It's also good to use longer timeout values (evs.inactive_timeout, evs.suspect_timeout, evs.keepalive_period, evs.join_retrans_period, evs.install_timeout) so brief outages won't kick a node off the cluster. Another item to check is wsrep_retry_autocommit, which we set to 5 so that every commit is tried at least five times before returning with an error. Finally, we set a large number of commit-apply threads (wsrep_slave_threads) so that each node applies changes as fast as possible; we use 8 for our quad-core machines. Useful logging options include logging all commit conflicts (wsrep_log_conflicts) and logging everything if you're trying to hunt down a problem (wsrep_debug).
# The following options will be passed to all MariaDB clients
[client]
port   = 3307                                  # note this is non-standard
socket = /data/mariadb-galera/mariadb.sock

# The MariaDB server
[mysqld]
# settings for datadir and basedir
port    = 3307                                 # again non-standard, since we run MySQL and MariaDB both
socket  = /data/mariadb-galera/mariadb.sock
user    = mariadb
datadir = /data/mariadb-galera
basedir = /opt/mariadb-galera
innodb_file_per_table
skip-external-locking
key_buffer_size = 16M
max_allowed_packet = 1M
table_open_cache = 64
sort_buffer_size = 512K
net_buffer_length = 8K
read_buffer_size = 256K
read_rnd_buffer_size = 512K
myisam_sort_buffer_size = 8M

## innodb settings for galera cluster ##
binlog_format = ROW
default_storage_engine = InnoDB
innodb_autoinc_lock_mode = 2
innodb_locks_unsafe_for_binlog = 1
innodb_buffer_pool_size = 128M
innodb_flush_log_at_trx_commit = 0   # apparently recommended by Galera

## wsrep settings for galera cluster ##
wsrep_on = ON   # this is needed to turn on replication for MariaDB 10.1
wsrep_cluster_name = [your cluster name]
wsrep_cluster_address = 'gcomm://[IP address of other cluster node],[IP address of another cluster node],[etc...]'
wsrep_node_name = [name of this cluster node]
wsrep_node_address = [IP address of this cluster node]
wsrep_provider = '/opt/mariadb-galera/lib/libgalera_smm.so'

# can set this to rsync as well (but that's unreliable)
# xtrabackup requires extra packages
wsrep_sst_method = xtrabackup-v2

# this is the user that runs the xtrabackup-v2 replication
# for MariaDB 10.1, the root user can't connect remotely,
# so we use a dedicated user for this purpose called 'repliuser'
wsrep_sst_auth = repliuser:[repliuser password]

## uncomment this if you want all nodes to get SSTs from one node
# wsrep_sst_donor = [master node]

# verbose logging
wsrep_debug = OFF   # change to ON if you want everything logged
wsrep_log_conflicts = ON

# retry in case of deadlocks in autocommit mode
wsrep_retry_autocommit = 5

# increase number of wsrep threads
wsrep_slave_threads = 8

# tune these to match timeouts and delays expected from the WAN connection
# between nodes
wsrep_provider_options = "evs.keepalive_period = PT3S;evs.suspect_timeout = PT5M;evs.inactive_timeout = PT7M;evs.join_retrans_period = PT0.6S;evs.install_timeout = PT7M;gcache.size = 5G"

# Point the following paths to different dedicated disks
#tmpdir = /tmp/

#skip-networking

# this isn't really required if skip-networking is commented out
# and you want MariaDB to listen on all interfaces
# if you want MariaDB to listen only on a single interface,
# set its IP address below
# bind-address = [IP address of this cluster node]

# binary logging is required for point in time recovery
# note that this makes tons of binary .log files in the datadir
# comment this out if not needed
log-bin = mysql-bin

# required unique id between 1 and 2^32 - 1
# defaults to 1 if master-host is not set
# but will not function as a master if omitted
server-id = 1

[mysqldump]
quick
max_allowed_packet = 16M

[mysql]
no-auto-rehash
# enforces a 'where' in update statements
safe-updates

[myisamchk]
key_buffer_size = 20M
sort_buffer_size = 20M
read_buffer = 2M
write_buffer = 2M

[mysqlhotcopy]
interactive-timeout
Replication methods and firewalls
We use the Percona xtrabackup-v2 method for state snapshot transfers (SSTs) across the cluster. It's also possible to use rsync, but that's proven unreliable for us. Using xtrabackup-v2 requires the Percona innobackupex script to be installed (from Percona's apt repository, discussed above).
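Before bringing up the cluster, it's worth a quick check on each node that the SST helper programs installed above are actually on the PATH (nothing Galera-specific here, just a sanity check):
# check that the SST helper programs are available
[root@server:~ ]# which innobackupex xtrabackup socat pv
[root@server:~ ]# xtrabackup --version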
To make the cluster work correctly, you'll need to open several ports on the server firewalls, as well as on any intervening site firewalls. We use the following port scheme: TCP and UDP 4444, TCP 4567, TCP 4568, and TCP 3307 (used by innobackupex).
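As an illustration only (assuming an iptables-based firewall; the peer address is a placeholder), the per-node rules for this port scheme would look something like:
# allow Galera cluster traffic from a peer node (repeat for each peer's IP address)
PEER=[IP address of another cluster node]
iptables -A INPUT -p tcp -s "$PEER" -m multiport --dports 3307,4444,4567,4568 -j ACCEPT
iptables -A INPUT -p udp -s "$PEER" --dport 4444 -j ACCEPT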
Starting up the cluster
The steps below should be repeated on each node of the cluster. First, install the database's system tables:
[root@server:/opt/mariadb-galera ]# ./scripts/mysql_install_db --defaults-file=/opt/mariadb-galera/my.cnf
Next, start up the server to set up the root user:
[root@server:/opt/mariadb-galera ]# /opt/mariadb-galera/bin/mysqld_safe --defaults-file=/opt/mariadb-galera/my.cnf --pid-file=/data/mariadb-galera/pid --wsrep-new-cluster --skip-grant-tables &
Check the file /data/mariadb-galera/[hostname].err for log messages and sort out any issues. Then, connect to the server and set up the root user and any other users and databases you want.
[root@server:/opt/mariadb-galera ]# ./bin/mysql --socket=/data/mariadb-galera/mariadb.sock -u root

# switch to the mysql database
MariaDB [(none)]> use mysql;

# set password for root
MariaDB [mysql]> update user set password=PASSWORD("[mariadb root password]") where user='root';

# delete anonymous users
MariaDB [mysql]> delete from user where user = '';

# update grant tables with changes
MariaDB [mysql]> flush privileges;
Shut down the servers on all nodes (just use kill %1) after you're done setting them up. Next, we'll set up a dedicated user for replication via the xtrabackup-v2 method. This requires starting up the local server again and adding the user:
# start the server
[root@server:/opt/mariadb-galera ]# /opt/mariadb-galera/bin/mysqld_safe --defaults-file=/opt/mariadb-galera/my.cnf --pid-file=/data/mariadb-galera/pid --wsrep-new-cluster &

# connect to the server, you'll need your root password this time
[root@server:/opt/mariadb-galera ]# ./bin/mysql --socket=/data/mariadb-galera/mariadb.sock -u root -p

# switch to the mysql database
MariaDB [(none)]> use mysql;

# add a user dedicated to replication
# this will be used by the xtrabackup-v2 replication method
# and can log in from anywhere that has access to this DB over the network
# you may want to restrict privileges below if needed
MariaDB [mysql]> create user 'repliuser'@'%' identified by '[repliuser password]';
MariaDB [mysql]> grant all privileges on *.* to 'repliuser'@'%';

# update grant tables with changes
MariaDB [mysql]> flush privileges;
Kill the server again to conclude the initial user setup. Then, start one node in "new cluster" mode (this will be the leader node for now) and the other nodes in normal mode, so they can all find each other.
# on the first node
[root@firstnode:/opt/mariadb-galera ]# nohup /opt/mariadb-galera/bin/mysqld_safe --defaults-file=/opt/mariadb-galera/my.cnf --pid-file=/data/mariadb-galera/pid --wsrep-new-cluster >> /var/log/mariadb-stdout.log &

# on the other nodes
[root@othernode:/opt/mariadb-galera ]# nohup /opt/mariadb-galera/bin/mysqld_safe --defaults-file=/opt/mariadb-galera/my.cnf --pid-file=/data/mariadb-galera/pid >> /var/log/mariadb-stdout.log &
To check if the nodes are connected, connect to one of the nodes via mysql and check the WSREP status variables. wsrep_cluster_status should be Primary, and wsrep_evs_state should be OPERATIONAL.
[root@server:/opt/mariadb-galera ]# ./bin/mysql --socket=/data/mariadb-galera/mariadb.sock -u root

MariaDB [(none)]> show status like '%cluster%';
+--------------------------+--------------------------------------+
| Variable_name            | Value                                |
+--------------------------+--------------------------------------+
| wsrep_cluster_conf_id    | 54                                   |
| wsrep_cluster_size       | 5                                    |
| wsrep_cluster_state_uuid | 9e2320c6-f4fd-11e5-bdbf-b235a8f9fd59 |
| wsrep_cluster_status     | Primary                              |
+--------------------------+--------------------------------------+
4 rows in set (0.00 sec)

MariaDB [(none)]> show status like '%wsrep_evs_state%';
+-----------------+-------------+
| Variable_name   | Value       |
+-----------------+-------------+
| wsrep_evs_state | OPERATIONAL |
+-----------------+-------------+
1 row in set (0.00 sec)
Once the cluster is up, any subsequent restarts of any of the servers should not use the --wsrep-new-cluster flag, since this will dissociate the node from the cluster. As long as one node stays up, the cluster will remain intact. If all nodes go down, then you'll have to do state recovery (see "Recovering from multiple-node failure" below).
At this point, you can install the service file included with the MariaDB binary distribution (/opt/mariadb-galera/support-files/mysql.server) for use with update-rc.d and /etc/init.d/. We usually don't bother with this and just add the appropriate command line to /etc/rc.local to make sure the server starts up at reboot, and use kill to make the MariaDB main process exit cleanly if needed. The cluster replication usually takes care of handling data resyncs, and we have monitoring for when a cluster node goes down.
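For reference, the /etc/rc.local entry is just the normal startup command from above; something like the sketch below (adjust the log path as needed):
#!/bin/sh -e
# /etc/rc.local: start the MariaDB Galera node at boot in normal mode
# (no --wsrep-new-cluster, so it rejoins the existing cluster)
nohup /opt/mariadb-galera/bin/mysqld_safe \
    --defaults-file=/opt/mariadb-galera/my.cnf \
    --pid-file=/data/mariadb-galera/pid \
    >> /var/log/mariadb-stdout.log 2>&1 &
exit 0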
Database backups
We use the innobackupex script that comes with the Percona xtrabackup package (that we installed earlier) to handle backups. This doesn't require long-held locks on tables so downtime while taking backups is minimal. Here's a wrapper shell script that can be used with cron:
#!/bin/bash

DBUSER=root
DBPASS=[your DB root password here]
DBSOCK=/data/mariadb-galera/mariadb.sock
DBCONF=/opt/mariadb-galera/my.cnf
BAKDIR=/path/to/backup/directory

# find older backups and kill them
echo "removing backups older than 30 days"
find $BAKDIR -type d -ctime +30 -exec rm -rvf '{}' \;

# do the actual backup for today
echo "starting today's backup"
innobackupex --defaults-file=$DBCONF \
             --user=$DBUSER \
             --password=$DBPASS \
             --socket=$DBSOCK \
             --ftwrl-wait-threshold=40 \
             --ftwrl-wait-query-type=all \
             --ftwrl-wait-timeout=180 \
             --kill-long-queries-timeout=20 \
             --kill-long-query-type=all \
             --galera-info \
             $BAKDIR
After a backup has been completed (wait for it to say "Completed OK"), you'll need to prepare the backup to bring it into a restore-ready state:
[root@server:~ ]# innobackupex --apply-log /path/to/backup/directory/[datetime of backup]
To restore from a backup:
# restore the backup
[root@server:~ ]# innobackupex --copy-back /path/to/backup/directory/[datetime of backup]

# restore the permissions back to mariadb
[root@server:~ ]# chown -R mariadb.mariadb /data/mariadb-galera
Monitoring the network
The sophisticated way to do this is to install ClusterControl, a web interface for monitoring a replication cluster. We just use a shell script, sketched below, that runs as a periodic cron job and does the following: it pings each database node and, if the node is up, logs in via SSH and executes mysql with a simple select count(*) statement on one of our smaller replicated tables that gets updated frequently. If all is well, we see the same (steadily increasing) count on all nodes. If there's a mismatch, or if one of the nodes is not pingable, is unreachable via SSH, or has its database server down, we get alert emails. We then log in manually to see what's happening.
This system has worked fairly well for us. There are usually a few false-alarm alert emails per week because of nodes under load taking too long to respond, but those are easy to verify.
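Here is a stripped-down sketch of such a cron check (not our production script): the node list, the replicated table in QUERY, the password, and the alert address are all placeholders, and it assumes passwordless SSH from the monitoring host and a working mail command for alerts.
#!/bin/bash
# minimal cluster consistency check meant to run from cron
# placeholders: NODES, MYSQL credentials, the table in QUERY, and ALERT

NODES="node1.example.org node2.example.org node3.example.org"
MYSQL="/opt/mariadb-galera/bin/mysql --socket=/data/mariadb-galera/mariadb.sock -u root -p[password]"
QUERY="select count(*) from somedb.sometable"
ALERT="admin@example.org"

declare -A COUNTS

for node in $NODES; do
    # is the node up at all?
    if ! ping -c 1 -W 5 "$node" > /dev/null 2>&1; then
        echo "$node is not pingable" | mail -s "DB cluster alert: $node unreachable" "$ALERT"
        continue
    fi
    # run the count query over SSH; empty output means SSH or the DB server failed
    count=$(ssh -o ConnectTimeout=10 "$node" "$MYSQL -N -B -e '$QUERY'" 2>/dev/null)
    if [ -z "$count" ]; then
        echo "could not query the DB on $node" | mail -s "DB cluster alert: $node DB down?" "$ALERT"
    else
        COUNTS[$node]=$count
    fi
done

# alert if the nodes report different row counts
if [ "$(printf '%s\n' "${COUNTS[@]}" | sort -u | wc -l)" -gt 1 ]; then
    for node in "${!COUNTS[@]}"; do
        echo "$node: ${COUNTS[$node]}"
    done | mail -s "DB cluster alert: replicated table count mismatch" "$ALERT"
fi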
Recovering from multiple-node failure
Sometimes the entire cluster falls apart: a few nodes drop out, quorum is lost, and the remaining nodes shut themselves down to preserve consistency. Recovery in this case usually means finding the cluster node with the most advanced state, designating that node as the new leader, and then starting up all the other nodes one by one until they all join the new cluster.
To find the most advanced state of the cluster, we need to temporarily start each node in isolated mode. This is done by commenting out the wsrep_cluster_address = gcomm://... line to make sure the server doesn't expect any incoming node connections, and then starting the node using mysqld_safe with the --wsrep-new-cluster option. Then, connect to the server and figure out the replication state.
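Concretely, the isolated start on each node looks something like this (remember to restore the my.cnf line afterwards):
# in /opt/mariadb-galera/my.cnf, comment out the cluster address first:
# wsrep_cluster_address = 'gcomm://...'

# then start this node on its own
[root@server:/opt/mariadb-galera ]# nohup /opt/mariadb-galera/bin/mysqld_safe --defaults-file=/opt/mariadb-galera/my.cnf --pid-file=/data/mariadb-galera/pid --wsrep-new-cluster >> /var/log/mariadb-stdout.log &

# connect and check the replication state
[root@server:/opt/mariadb-galera ]# ./bin/mysql --socket=/data/mariadb-galera/mariadb.sock -u root -p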
MariaDB [(none)]> show status like '%wsrep_cluster_state_uuid%';
+--------------------------+--------------------------------------+
| Variable_name            | Value                                |
+--------------------------+--------------------------------------+
| wsrep_cluster_state_uuid | 6873b24f-4066-11e5-ba5a-3a0ab9be4ca4 |
+--------------------------+--------------------------------------+
1 row in set (0.00 sec)

MariaDB [(none)]> show status like '%wsrep_last_committed%';
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| wsrep_last_committed | 1086816 |
+----------------------+---------+
1 row in set (0.00 sec)
The wsrep_last_committed value is what you want to compare across all nodes. The node with the highest value has the most advanced state and should be used to re-form the cluster. Shut down the servers, uncomment the ...gcomm://... line on each node, then restart the nodes one by one:
# on the new leader node
[root@firstnode:/opt/mariadb-galera ]# nohup /opt/mariadb-galera/bin/mysqld_safe --defaults-file=/opt/mariadb-galera/my.cnf --pid-file=/data/mariadb-galera/pid --wsrep-new-cluster >> /var/log/mariadb-stdout.log &

# on the other nodes
[root@othernode:/opt/mariadb-galera ]# nohup /opt/mariadb-galera/bin/mysqld_safe --defaults-file=/opt/mariadb-galera/my.cnf --pid-file=/data/mariadb-galera/pid >> /var/log/mariadb-stdout.log &
If you're lucky, there will be no need for full state-snapshot transfers (SSTs) across the network, because the Galera cache (we set this to 5 GB) held enough changesets for an incremental state transfer (IST) to work. The cluster should come back quickly in this case. If not, a full SST will take place, and you'll have to wait a while (sometimes around 8 hours in our case, given the slow internet at some of our sites) for the cluster to come back completely.
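To watch a rejoining node's progress, check the wsrep_local_state_comment status variable on that node; it moves through states like Joining and Joined, and reads Synced once the node has fully caught up (example output from a synced node):
MariaDB [(none)]> show status like 'wsrep_local_state_comment';
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_state_comment | Synced |
+---------------------------+--------+
1 row in set (0.00 sec)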
Flushing binary logs
Although it's not necessary for Galera replication to use MySQL binary logs, these might be useful to keep around for point-in-time recovery. Over time, these .log files can fill up the MariaDB data directory. To periodically remove them, one can do the following:
MariaDB [(none)]> set global expire_logs_days=10;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> flush logs;
Query OK, 0 rows affected, 20 warnings (3 min 5.57 sec)
The first command above sets the expiration period for the binary logs, so logs older than 10 days (in this example) will be removed automatically from now on. The second command flushes the logs immediately, removing anything older than the expiration date.
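Since set global doesn't persist across server restarts, it's also worth putting the same setting into my.cnf so it's applied permanently:
# in /opt/mariadb-galera/my.cnf, under the [mysqld] section:
# automatically expire binary logs older than 10 days
expire_logs_days = 10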
Useful references
- Galera documentation index
- Galera state snapshot transfer setup
- How to reset the cluster quorum manually if everything breaks
- Galera caveats (important!)
- Galera replication parameters reference
- Galera system variables reference
- Troubleshooting Galera cluster issues
- Geo-distributed Galera clusters over WAN
- Adding the Galera web-interface (ClusterControl) to an existing cluster
- Docs for Percona XtraBackup (this does backups and replication)
- The Several Nines blog (useful Galera tips)