Home > Development > Berkeley DB Java Edition High Availability

Berkeley DB Java Edition High Availability

February 17, 2011

It’s been over a year and a half since my last real blog post. What better way to get back into the swing of things than a technical post about some things I’ve been kicking around at work?

We’ve been playing around a bit with Berkeley DB Java Edition, specifically the High Availability components (BDB JE HA…for short). If you’re like me, you may not have realized that there was anything more to the Java edition of BDB than the local B-Tree goodness. As it turns out, BDB JE has an HA component, which allows you to quickly set up networked, replicated BDBs. Start a single, master node and then start up replica nodes, letting them know where the master is. The nodes use a Paxos algorithm to determine a leader, which will be handling all writes for the duration of their tenure as master. The master, predictably, replicates to the replicas. Both the replicas and the master can serve reads, meaning that you can scale reads by adding more replicas.

The HA environment manages itself. Nodes come and go, new leaders are elected, events are sent back and forth notifying the new master and replicas of their statuses. Because BDB JE is a library, developers implement the application logic on top of the local database. Applications (a web application, for instance) receive read and write requests. Reads can be fulfilled locally. Writes can be handled locally by the master. If a write request comes to a replica, the replica must consult it’s knowledge of the HA topology and proxy the write request to the application running on the master.

Our question of the day was how quickly BDB JE HA can elect a new master if the current master is taken out of service. Not content to test with a handful of nodes, we burned an EC2 machine image containing a simple BDB JE HA application. The application was simple: set up a local BDB and either wait for other nodes (if it’s the first) or connect to the first node (if it’s not the first). An initial instance of the AMI was started to get the cluster rolling. One the instance was functional, 59 new instances of the AMI were launched. For these 59 instances, we took advantage of EC2 user-data to provide instance configuration information, which included the hostname and port of the original instance for the other 59 instances to connect to. With this in place, the 60 instances automatically configure themselves into a replicated cluster.

Each node logs the result of a leader election. So we log into a single instance, cat the log and determine which node is presently the leader. Logging into that node and killing the Java process causes the remaining 59 nodes to immediately switch from the REPLICA state to the UNKNOWN state. This indicates that a leader election is underway. After a period of time, one of the 59 nodes declares itself the new MASTER while the other 58 nodes return to life as a REPLICA. Below I have included the log output from this process…

db-ip-10-170-31-125 (Wed Feb 16 18:26:59 UTC 2011) - REPLICA
master is db-ip-10-170-85-7
db-ip-10-170-31-125 (Wed Feb 16 18:31:47 UTC 2011) - UNKNOWN
db-ip-10-170-31-125 (Wed Feb 16 18:32:22 UTC 2011) - REPLICA
master is db-ip-10-170-99-156

Here you can see the node starts life as a replica. At 18:31:47 the current master (db-ip-10-170-85-7) is terminated and the cluster goes into the UNKNOWN state. At 18:32:22 (35 seconds later) the node goes back to the REPLICA state and db-ip-10-170-99-156 is crowned the new cluster master. Repeating the process, murdering the new master, yields a much faster second election…

db-ip-10-170-31-125 (Wed Feb 16 18:51:50 UTC 2011) - UNKNOWN
db-ip-10-170-31-125 (Wed Feb 16 18:51:53 UTC 2011) - REPLICA
master is db-ip-10-170-90-48

This time the election takes only 3 seconds. After repeating this cycle another 5 times, the elections hold steady at around 2-3 seconds. Bear in mind, the master DB is not under any write load during this process, so these are ideal conditions. The leader election takes into account which replica is most caught up replicating the master, so a cluster experiencing higher write volume might take longer to elect a new master.

Anyway, this was a quick, fun project. BDB JE is a nice tool to kick around and getting more familiar and comfortable with EC2 is never a bad thing.

Here’s to it not taking another year and a half before my next post.

Advertisement
Categories: Development
  1. February 17, 2011 at 3:34 pm | #1

    Welcome back!

    • February 17, 2011 at 4:40 pm | #2

      Thanks, Chad! I’d be lying if I said the Etsy engineering blog posts didn’t serve as inspiration.

Comments are closed.
Follow

Get every new post delivered to your Inbox.