Data Stream

Introduction to Manhattan’s answer for your data replication and archiving requirements.

Introduction

Data Stream is a sub-system of Manhattan Active® Platform that is responsible for replicating data from the production MySQL database to a configurable set of target data stores. Production data is replicated by reading the MySQL binary logs and is synced with the target in near real time. Data Stream is built using open-source components and a framework developed by Manhattan called Gravina.

Change data capture, or CDC, is a well-established software design pattern for a system that monitors and captures changes in data so that other software can respond to those changes. CDC captures row-level changes to database tables and passes the corresponding change events to a CDC stream. Applications can read this stream and process the change events in the order in which they occurred. Change data capture helps to bridge traditional data stores and new cloud-native event-driven architectures.
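To make the pattern concrete, the following is a minimal Python sketch of a consumer replaying row-level change events in the order they occurred. The `ChangeEvent` fields and the `apply_events` helper are hypothetical and chosen for illustration only; they are not the event schema used by Data Stream.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical row-level change event; field names are illustrative only."""
    sequence: int         # position of the event in the change stream
    table: str
    op: str               # "insert", "update", or "delete"
    key: int              # primary key of the affected row
    row: Optional[dict]   # new row image; None for deletes


def apply_events(events, state=None):
    """Replay change events in stream order to rebuild the table state."""
    state = {} if state is None else state
    for event in sorted(events, key=lambda e: e.sequence):
        rows = state.setdefault(event.table, {})
        if event.op == "delete":
            rows.pop(event.key, None)
        else:
            # Inserts and updates both carry the full new row image here.
            rows[event.key] = event.row
    return state


events = [
    ChangeEvent(1, "orders", "insert", 10, {"id": 10, "status": "NEW"}),
    ChangeEvent(2, "orders", "update", 10, {"id": 10, "status": "SHIPPED"}),
    ChangeEvent(3, "orders", "insert", 11, {"id": 11, "status": "NEW"}),
    ChangeEvent(4, "orders", "delete", 11, None),
]
replica = apply_events(events)
```

After the replay, `replica` holds only order 10 in its latest state, which is the same result a downstream system would reach by applying the events one at a time as they arrive.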

Gravina Overview

Gravina is an internal system and framework library built by Manhattan that reads the source MySQL binary log, converts each event into a message in a CDC capturing mechanism (Kafka), and relays the messages to the target system in a format the target system can interpret.

Customers can implement Data Stream to replicate production data to a set of supported target systems, which as of the writing of this document include the following:

The Manhattan Active® Platform engineering team continues to build support for additional replication targets based on product management and customer requirements.

Gravina Architecture

The following is a summary of the Gravina components that enable the Data Stream functionality:

Extractor

Gravina Extractor is a Spring Boot microservice that embeds Shyiko mysql-binlog-connector-java and integrates with Apache Kafka. Extractor reads binary logs from the source MySQL database, converts each event into a message, and publishes the messages to Kafka. The extraction process executes as a single-threaded background job.
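The read, convert, publish loop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Gravina's implementation: the real Extractor is a Java service, while this sketch swaps in plain dictionaries for binary log events and an in-memory list for Kafka. The event fields, the `to_message` and `run_extractor` helpers, and the idea of returning the last position as a checkpoint are all hypothetical.

```python
import json


def to_message(binlog_event):
    """Convert one (hypothetical) row event into a key/value message."""
    key = f"{binlog_event['database']}.{binlog_event['table']}"
    value = json.dumps(
        {
            "op": binlog_event["type"],
            "position": binlog_event["position"],
            "rows": binlog_event["rows"],
        },
        sort_keys=True,
    )
    return key, value


def run_extractor(binlog_events, publish):
    """Single-threaded loop: read each event in log order, convert, publish."""
    last_position = None
    for event in binlog_events:
        key, value = to_message(event)
        publish(key, value)
        # Assumption: remembering the last position lets a restart resume here.
        last_position = event["position"]
    return last_position


published = []  # stands in for a Kafka producer in this sketch
binlog = [
    {"database": "prod", "table": "orders", "type": "WRITE_ROWS",
     "position": 120, "rows": [{"id": 10, "status": "NEW"}]},
    {"database": "prod", "table": "orders", "type": "UPDATE_ROWS",
     "position": 245, "rows": [{"id": 10, "status": "SHIPPED"}]},
]
checkpoint = run_extractor(binlog, lambda k, v: published.append((k, v)))
```

Because the loop is single-threaded, messages are published in exactly the order the events appear in the log, which is what allows downstream consumers to rely on event ordering.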

Kafka

Apache Kafka plays the role of the CDC capture system in the Gravina architecture. It is used to stream events from the extractor to the target consumer, which is one of the supported Gravina replicator components. The messages in Kafka are partitioned by database entity groups into separate Kafka topics so that they can be consumed concurrently by the replicators.
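The sketch below illustrates the idea of routing events to topics by entity group. The table-to-group mapping, the `cdc.` topic prefix, and the `default` fallback group are invented for this example; the actual entity groups and topic names used by Gravina are not described in this document.

```python
from collections import defaultdict

# Hypothetical mapping of tables to entity groups, for illustration only.
ENTITY_GROUPS = {
    "orders": "order",
    "order_lines": "order",
    "items": "item",
}


def topic_for(table, prefix="cdc"):
    """Pick the topic name for a table based on its entity group."""
    group = ENTITY_GROUPS.get(table, "default")
    return f"{prefix}.{group}"


def route(events):
    """Group events by topic while preserving their original order."""
    topics = defaultdict(list)
    for event in events:
        topics[topic_for(event["table"])].append(event)
    return dict(topics)


events = [
    {"table": "orders", "seq": 1},
    {"table": "items", "seq": 2},
    {"table": "order_lines", "seq": 3},
    {"table": "audit_log", "seq": 4},
]
topics = route(events)
```

In this sketch, related tables such as `orders` and `order_lines` land in the same topic, so their events keep their relative order, while unrelated groups end up in separate topics that independent consumers can read in parallel.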

Replicator

Gravina Replicator is a Spring Boot microservice that subscribes to the Kafka topics mentioned above to read the CDC events. These events are then converted to a payload format suitable for the target system. Each replicator implementation is specific to the target it serves (for example, there are replicator implementations for a Google Cloud Pub/Sub target and for a MySQL replica on the customer's Google Cloud SQL endpoint). As shown in the diagram above, to increase throughput, multiple instances of the replicator can be run to process CDC events from Kafka topics in parallel. Likewise, multiple types of Gravina replicators can coexist in the same environment to transmit the replication events to different types of target systems concurrently.
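As a rough illustration of target-specific conversion, the Python sketch below turns the same CDC event into two payloads: JSON bytes for a message-based target and a parameterized SQL statement for a MySQL target. The event shape, the `to_pubsub_payload` and `to_sql` helpers, the assumption that `id` is the primary key, and the use of `REPLACE INTO` as the upsert are all choices made for this example rather than a description of how Gravina replicators are written.

```python
import json


def to_pubsub_payload(event):
    """Serialize the event as JSON bytes for a message-based target."""
    return json.dumps(event, sort_keys=True).encode("utf-8")


def to_sql(event):
    """Build a parameterized MySQL statement for a relational target.

    Assumes table and column names come from a trusted schema and that
    `id` is the primary key of every table.
    """
    table, row = event["table"], event["row"]
    if event["op"] == "delete":
        return f"DELETE FROM {table} WHERE id = %s", (row["id"],)
    columns = sorted(row)
    placeholders = ", ".join(["%s"] * len(columns))
    # REPLACE INTO covers both inserts and updates of the full row image.
    sql = f"REPLACE INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    return sql, tuple(row[c] for c in columns)


event = {"table": "orders", "op": "update",
         "row": {"id": 10, "status": "SHIPPED"}}
payload = to_pubsub_payload(event)
sql, params = to_sql(event)
delete_sql, delete_params = to_sql(
    {"table": "orders", "op": "delete", "row": {"id": 10}}
)
```

Keeping each conversion in its own function mirrors the idea that each replicator type only needs to know the format of the one target it serves.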

Supported Replication Modes (Generally Available)

While Gravina can technically replicate the data stream to a wide range of target stores, presently Manhattan supports the following replication modes as generally available options:

Data Save with Google Cloud SQL

Production data is replicated to a Google Cloud SQL instance owned and managed by Manhattan. The customer has private access to the database instance and read-only authorization to query and report from this database. Manhattan remains responsible for operating, monitoring, and maintaining the database instance.

Data Stream with Google Cloud Pub/Sub

Production data is streamed as CDC events to a Google Cloud Pub/Sub endpoint owned and managed by the customer. Manhattan needs authorization and network access to post events to this Pub/Sub endpoint. The customer remains responsible for operating, monitoring, and maintaining the Pub/Sub endpoint.

Data Stream with Google Cloud SQL

Production data is replicated to a Google Cloud SQL instance owned and managed by the customer. Manhattan needs authorization and network access to write the replication events to this database instance. The customer remains responsible for operating, monitoring, and maintaining the database instance.

Replication Modes in Preview (BETA)

The replication modes described below are in preview for a select set of implementations and are considered beta:

Data Save with Google Cloud Pub/Sub

Production data is streamed as CDC events to a Google Cloud Pub/Sub endpoint owned and managed by Manhattan. The customer has private access to the Pub/Sub endpoint and read-only authorization to consume the messages from the topic subscription. Manhattan remains responsible for operating, monitoring, and maintaining the Pub/Sub endpoint.

Author

  • Kartik Pandya: Vice President, Manhattan Active® Platform, R&D.

Last modified April 25, 2024: Update deploy.yml (aa43072)