Build a Powerful Backend with AWS (MSK + RDS + EC2) & Debezium

Garrett Jester
8 min read · May 18, 2021

In the time I’ve spent working with various startup and development teams, I’ve found that the most successful ones all tend to have mastered two things:

  • Robust, scalable database design — typically on PostgreSQL or MySQL
  • Reliable PubSub messaging — typically using Apache Kafka

These two fundamental pieces of your backend are at the crux of features like push notifications, search indexing, and caching. While database design and pubsub may seem straightforward in development, maintaining them at scale can be extremely frustrating and time-consuming, especially for a small team.

The purpose of this tutorial is to show you how to configure and deploy a robust backend using AWS RDS for PostgreSQL, AWS MSK for Kafka, and Debezium to connect the two. The best part about this configuration: it’s fully managed. This means you can scale to millions of users without having to worry about ugly things like DB replication or Kafka broker-level configs.

Prerequisites:

  • An AWS Account with Console and CLI Access
  • A SQL Client (I recommend the free version of DbVisualizer).
  • Beginner’s Understanding of SQL & Linux Commands

Part I: Launch an MSK Cluster

An MSK Cluster can take about 25 minutes to become available (while AWS provisions all the necessary resources). Let’s kick it off now so it can build in the background.

Build a Custom Kafka Configuration

Before we start, be aware that running a managed Kafka cluster can be pricey. We’ll build the cheapest option, which will cost about $30/month with burstable broker instances.

To start, we’ll use the AWS CLI (v2) to build a custom Kafka configuration named debezium-kafka-config. Start by creating a new properties file with the following options:

auto.create.topics.enable = true
zookeeper.connection.timeout.ms = 1000

Note: If you’re using TextEdit on Mac, click Format > Make Plain Text.

Now we’re going to use the file you just created to build a new MSK configuration on your AWS account:

aws kafka create-configuration --name "debezium-kafka-config" --description "MSK configuration for use with Debezium." --kafka-versions "2.3.1" --server-properties fileb://{PATH-TO-KAFKA-CONFIG}

You should get back a JSON response that includes the ARN of the config you just created.
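It should look roughly like this (the account ID, timestamps, and config ID here are placeholders):

{
    "Arn": "arn:aws:kafka:us-east-1:123456789012:configuration/debezium-kafka-config/abcd1234",
    "CreationTime": "2021-05-18T12:00:00.000Z",
    "LatestRevision": {
        "CreationTime": "2021-05-18T12:00:00.000Z",
        "Description": "MSK configuration for use with Debezium.",
        "Revision": 1
    },
    "Name": "debezium-kafka-config"
}

Keep the ARN handy in case you want to tweak the configuration from the CLI later.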

Note: If the CLI complains that it can’t find your config file (even though the path is correct), you may be running a different major version of the CLI. AWS CLI v1 and v2 handle the file:// and fileb:// prefixes differently (v2 expects raw bytes via fileb://), so try switching prefixes or upgrading to v2.

Deploy a Kafka Cluster

Now that you have your configuration, head over to the MSK Dashboard (on the AWS Console) and we’ll launch a new Kafka cluster.

  1. For creation method, choose Custom create.
  2. Choose the AWS-recommended Kafka version.
  3. Under ‘Configuration’, click Custom Configuration and select the config you just uploaded.
  4. Keep the number of availability zones at 3 (recommended) and select a private subnet in each zone for your broker nodes.
  5. Keep the default security group (we’ll edit this later).
  6. For ‘Brokers’, scroll all the way down to the kafka.t3.small instances. (If you need higher throughput for a production workload, go with one of the kafka.m5 instance options.)
  7. For development / testing purposes I recommend giving each of your broker nodes 30GiB of EBS storage. (Again, higher throughput workloads will require more, but this can always be auto-scaled).
  8. Access control method: None.
  9. Encryption: Check TLS & Plaintext
  10. Choose Basic Monitoring (AWS charges extra for everything else 🥴)
  11. Click Create Cluster.

Congrats — You just successfully built a fully-managed Kafka cluster. 🎉
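While the cluster builds, you can poll its state from the CLI (the --query filter below just trims the output and is optional):

aws kafka list-clusters --query "ClusterInfoList[].{Name:ClusterName,State:State}"

The cluster is ready to use once its state flips from CREATING to ACTIVE.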

Part II: Build a PostgreSQL DB on RDS

This section will walk you through setting up a low-cost PostgreSQL DB from scratch on RDS. If you already have a DB set up, skip to the Debezium configuration section.

Deploy a PostgreSQL DB

Navigate to the RDS Dashboard and click Create Database. Use the following configuration options:

  1. Engine: PostgreSQL with the latest version (currently 12.5).
  2. Template: Dev/Test
  3. Configure your credentials and store them somewhere safe (you’ll need them later).
  4. Storage: General Purpose (SSD)
  5. Instance class: Burstable db.t3.small class.
  6. Turn OFF auto-scaling storage. If the Debezium connector is misconfigured, its replication slot can keep Postgres from discarding old WAL segments, steadily eating storage and inflating your bill. Once you’re confident your connector is properly configured, you can enable auto-scaling.
  7. Standby Instance: No
  8. Public access: Yes
  9. Click Create Database.

Your DB will take a few minutes to become available.
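If you’d rather not keep refreshing the console, the CLI can block until the instance is ready (substitute your own instance identifier):

aws rds wait db-instance-available --db-instance-identifier {YOUR_DB_IDENTIFIER}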

Debezium Configuration

Now we’ll add configuration options to your DB instance so that it works with the Debezium connector.

  1. Go to Parameter Groups from the RDS Dashboard and click Create Parameter Group.
  2. Name this parameter group something like cdc-enabled-postgres and pick the parameter group family that matches your engine version (postgres12 here). Locate the option with the key rds.logical_replication, set its value to 1, and click Create Parameter Group.
  3. Once your DB instance status has transitioned to ‘available’, click Modify.
  4. Scroll to Additional Configuration, set the initial database name to postgres and switch the parameter group to the one you just created.
  5. Schedule the modifications to take effect immediately and click Modify Instance. Your instance state will transition to ‘Modifying’.
  6. Once the instance is modified you’ll need to reboot it for the rds.logical_replication setting to update. Select your instance, then click Actions > Reboot.
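If you prefer scripting, the same parameter-group dance can be done from the CLI. Here’s a rough equivalent (the group name, family, and instance identifier are placeholders, and the family must match your engine version):

aws rds create-db-parameter-group --db-parameter-group-name cdc-enabled-postgres --db-parameter-group-family postgres12 --description "Postgres with logical replication"
aws rds modify-db-parameter-group --db-parameter-group-name cdc-enabled-postgres --parameters "ParameterName=rds.logical_replication,ParameterValue=1,ApplyMethod=pending-reboot"
aws rds modify-db-instance --db-instance-identifier {YOUR_DB_IDENTIFIER} --db-parameter-group-name cdc-enabled-postgres --apply-immediately
aws rds reboot-db-instance --db-instance-identifier {YOUR_DB_IDENTIFIER}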

Next we need to access your DB instance from a SQL client to change user-level configurations for Debezium.

Go to your RDS instance and copy its endpoint.

Now, open a new DB connection via your SQL client. Enter the endpoint you copied as the host/server name and enter the DB credentials that you saved during creation.

Note: If you have difficulty establishing a connection, make sure that your instance’s security group allows access from your IP address.

Verify that your wal_level property is set to logical by running the following command:

SHOW wal_level;
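This should return logical. If it still shows replica (the RDS default), double-check that the new parameter group is attached and that the reboot completed.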

Now we’ll create a Postgres role with the permissions Debezium requires:

CREATE ROLE debezium WITH PASSWORD 'mypassword' LOGIN;
GRANT rds_superuser TO debezium;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO debezium;
GRANT CREATE ON DATABASE postgres TO debezium;

Finally, we need to configure a table to receive ‘heartbeat’ messages from Debezium. Heartbeat messages help solve a known issue with the Postgres connector which can cause the DB instance to run out of storage.

CREATE TABLE debezium_heartbeat (heartbeat VARCHAR NOT NULL);

Note: We’re granting the debezium user superuser access for development purposes, but you’ll likely want more fine-grained permissions in a production environment.

Your Postgres instance is now configured for use with Debezium ✅.

Part III: Build the Debezium Connector for Postgres

In the last step, we’ll put everything together by connecting our Postgres database to our Kafka cluster. To do this, we need to launch a client that will host Kafka Connect and the Debezium connector.

Deploy a Client Instance

One limitation of using MSK to manage a Kafka cluster is that you can only access the cluster from within its VPC. This means we need to launch an EC2 instance to serve as our Kafka client.

  • Deploy a t3.small EC2 instance with the latest Ubuntu image (you can use the Amazon Linux AMI, but some of the commands will differ). Ensure that it’s in the same VPC as your MSK Cluster.
  • Make note of the security group used when launching your instance (you’ll need to allow traffic from this group in your MSK cluster’s security group).
  • SSH into your instance using the .pem key you downloaded at launch.
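For example, assuming an Ubuntu AMI (the key path and hostname are placeholders):

ssh -i {PATH-TO-KEY}.pem ubuntu@{EC2-PUBLIC-DNS}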

Install Kafka Connect

Run the following commands to install the Confluent platform packages (including the Confluent Hub client and Kafka Connect):

sudo apt-get update
wget -qO - https://packages.confluent.io/deb/5.3/archive.key | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://packages.confluent.io/deb/5.3 stable main"
sudo apt-get update && sudo apt-get install confluent-hub-client confluent-common confluent-kafka-2.12

Click on Client Information from the dashboard of your MSK Cluster.

Copy the Plaintext addresses of your bootstrap servers.
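You can also fetch these from the CLI using your cluster’s ARN (visible on the cluster’s summary page); the plaintext addresses come back under the BootstrapBrokerString key:

aws kafka get-bootstrap-brokers --cluster-arn {YOUR_CLUSTER_ARN}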

Edit the Kafka Connect config, pasting the addresses from the previous step in for the bootstrap.servers option:

sudo nano /etc/kafka/connect-distributed.properties

bootstrap.servers=${COPIED_BOOTSTRAP_SERVER_ADDRESSES}
group.id=debezium-cluster
offset.storage.replication.factor=2
config.storage.replication.factor=2
status.storage.replication.factor=2
plugin.path=/usr/share/java,/usr/share/confluent-hub-components
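Note: the three storage topics use a replication factor of 2, which works here because the cluster has three brokers; the factor can never exceed your broker count.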

Install the Debezium Connector

Add a .service file that can be used to start Kafka Connect:

sudo nano /lib/systemd/system/confluent-connect-distributed.service

[Unit]
Description=Apache Kafka - connect-distributed
Documentation=http://docs.confluent.io/
After=network.target
[Service]
Type=simple
User=cp-kafka
Group=confluent
ExecStart=/usr/bin/connect-distributed /etc/kafka/connect-distributed.properties
TimeoutStopSec=180
Restart=no
[Install]
WantedBy=multi-user.target

Now we’ll install the Debezium PostgreSQL Connector.

sudo confluent-hub install debezium/debezium-connector-postgresql:latest

And start Kafka Connect using systemctl:

sudo systemctl daemon-reload
sudo systemctl enable confluent-connect-distributed
sudo systemctl start confluent-connect-distributed

Create a JSON config for the PostgreSQL connector (replace the bracketed placeholders with your own quoted values):

nano postgres.json

{
    "name": "debezium-pg-connector",
    "config": {
        "name": "debezium-pg-connector",
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "tasks.max": "1",
        "slot.name": "debezium",
        "database.hostname": [RDS_ENDPOINT],
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": [MY_PASSWORD],
        "database.dbname": "postgres",
        "database.server.id": "1",
        "database.server.name": "khyber",
        "database.history.kafka.bootstrap.servers": [BOOTSTRAP_SERVERS],
        "plugin.name": "pgoutput",
        "heartbeat.interval.ms": "30000",
        "heartbeat.action.query": "INSERT INTO debezium_heartbeat (heartbeat) VALUES ('thump')",
        "key.converter": "org.apache.kafka.connect.json.JsonConverter",
        "key.converter.schemas.enable": "false",
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
        "value.converter.schemas.enable": "false",
        "internal.key.converter": "org.apache.kafka.connect.json.JsonConverter",
        "internal.key.converter.schemas.enable": "false",
        "internal.value.converter": "org.apache.kafka.connect.json.JsonConverter",
        "internal.value.converter.schemas.enable": "false",
        "transforms": "unwrap",
        "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
        "transforms.unwrap.add.source.fields": "ts_ms"
    }
}

Save your config file and then create your connector through Kafka Connect’s REST API:

curl -X POST -H "Accept: application/json" -H "Content-Type: application/json" http://localhost:8083/connectors -d @postgres.json

You can ensure that your connector is working properly by making a GET request for the connector’s status:

curl http://localhost:8083/connectors/debezium-pg-connector/status

It should return a response similar to the following:
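{
    "name": "debezium-pg-connector",
    "connector": {
        "state": "RUNNING",
        "worker_id": "10.0.1.23:8083"
    },
    "tasks": [
        {
            "id": 0,
            "state": "RUNNING",
            "worker_id": "10.0.1.23:8083"
        }
    ],
    "type": "source"
}

(The worker_id values will show your EC2 instance’s private address.) If anything reports FAILED, the trace field in the same response usually points at the offending config.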

You’ve successfully built the Debezium connector for Postgres. Row level changes will now be sent directly into your Kafka cluster. ✅
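As a quick smoke test, insert a row into a table in the public schema and watch the change event arrive. Debezium names topics {server.name}.{schema}.{table}, so with the config above (and a hypothetical table called my_table):

kafka-console-consumer --bootstrap-server ${COPIED_BOOTSTRAP_SERVER_ADDRESSES} --topic khyber.public.my_table --from-beginning

The Confluent packages installed earlier put kafka-console-consumer on your PATH.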

And that’s it! You have a starter backend for a real-time application that can respond to any change made to your database.

Next Steps

Now that you can produce messages in Kafka (through the Debezium connector), you’ll probably want a way to consume them. There are a couple of ways to do this depending on your end goal:

  • For Search indexing, you can use Confluent’s pre-built Elasticsearch Sink Connector, which has good documentation and will automatically index your row-level changes.
  • For sending notifications to users, you’ll have to make some form of network request to a delivery service. I suggest following my other article, Consume Kafka Messages in Node.js with Lambda Functions, which will give you a bare-bones consumer to get you up and running.
  • If you’re a Java developer and you’re familiar with Apache Flink, you can use Flink’s Kafka Connector to process messages.
