
Active API Blog

This is the blog section. It has two categories: News and Releases.

Files in these directories will be listed in reverse chronological order.

Manhattan Active® Platform Release Engineering

Behind the scenes: How we’ve built the CI and release management mechanisms to fully automate the build, testing, and delivery of Manhattan Active® Platform microservices.

Manhattan Active® Platform Release Engineering is a subsystem consisting of processes, conventions, and a set of tools that enable the transit and promotion of code from a developer’s laptop to a production environment. This document describes the inner workings of this subsystem.

Release Engineering consists of two main aspects:

  • Continuous Integration Pipeline: Covers how code committed to Git (Bitbucket) repositories goes through various build and test phases to become production ready. Code is built and released as deliverable binary artifacts (e.g., Docker images, NPM packages, JARs, or Zip archives)
  • Release Pipeline: Covers how the built artifacts are formally released and delivered for deployment in the target runtimes. The release mechanism covers both routine code deliveries to the target runtimes (known as Code Drops) and priority patches that may be released on demand (known as Hotfixes).

CI & Release Pipeline

Continuous Integration

The typical development process can be summarized as the following:

  • Every microservice has a separate codebase in its own Git repository
  • Every development team works on the microservices they are responsible for and pushes the changes to the respective feature branches
  • The continuous integration pull request process builds and tests the code from the feature branches
  • Upon successful build and tests, the development team merges the code to the mainline (trunk or “master”)
  • The continuous integration process then builds and tests the code from the trunk, builds the Docker image, and publishes it to a private Docker registry

In order to be able to build and test the code that developers commit to the Git repositories, we have put in place the Continuous Integration process using Jenkins.

Jenkins Deployment Topology

The topology has been designed to handle a high volume of build traffic while ensuring the security and stability of the CI subsystem. Jenkins is deployed on a Kubernetes cluster, with the worker (“slave” in Jenkins-speak) nodes being Kubernetes pods that are dynamically provisioned when a build event occurs, as sketched below.
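
For illustration, a dynamically provisioned build-agent pod could be described with a pod template roughly like the sketch below. This is a hedged sketch only: the image names, labels, and resource values are assumptions, not the actual Jenkins configuration.

# Illustrative build-agent pod template (hypothetical values)
apiVersion: v1
kind: Pod
metadata:
  labels:
    jenkins: build-agent
spec:
  containers:
    - name: jnlp                              # Jenkins inbound agent container
      image: jenkins/inbound-agent:latest
      resources:
        requests: { cpu: "1", memory: 2Gi }
    - name: docker
      image: docker:dind                      # assumed sidecar for Docker image builds
      securityContext:
        privileged: true
  nodeSelector:
    cloud.google.com/gke-preemptible: "true"  # run on preemptible VMs for cost optimization
  restartPolicy: Never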

Key features of the CI pipeline with Jenkins:

  • Runs on Kubernetes for high scalability and performance
  • Runs an average of 600+ builds per day, or over 4000 builds in a week
  • Supports building, testing, and packaging framework libraries, Docker containers, CDN deliverable packages, mobile builds, and generic zip and tar archives
  • One-click deployment of the Jenkins cluster on Kubernetes with all configuration
  • Automated node provisioning using preemptible virtual machines for cost optimization
  • Blue-green deployment support for the Jenkins cluster
  • Auto-renewal of SSL certificates
  • Multi-zone high availability of the worker nodes

CI Deployment

Continuous Integration Pipelines

The Continuous Integration Pipeline is the backbone of the release engineering process during the development cycles and is responsible for building, testing, and publishing the standard artifacts (e.g., Docker images, NPM packages, WARs, JARs, or Zip archives). The CI pipeline uses Jenkins multi-branch pipeline jobs to perform various tasks specific to the target Bitbucket repository. The pipeline serves the components as well as the frameworks. This section summarizes the set of CI pipelines that are most relevant to developers:

Component pipeline

The component pipeline job runs once for each component, and is triggered when a new commit is pushed to the target branch of the component’s Bitbucket repository. The component pipeline job monitors a set of known branch name patterns for which a build will be automatically triggered:

master Branch

The purpose of the component’s master branch pipeline is to build and test the component’s code from its master branch, where the code from other branches is merged and integrated. Only the artifacts built, tested, and published from the master branch are promoted as release candidates, and subsequently released for the downstream deployments, including the Ops-managed customer environments. One exception to this rule is the priority pipeline (also known as the “Hot Fix” or “X” pipeline, described in the “priority Branch” section below), where code from a branch other than master is promoted to downstream environments for the purpose of addressing priority or “hot” defects.

Commits pushed to the master branch of the component’s Bitbucket repository trigger the pipeline job, which goes through the phases summarized below. Clicking on each phase shows the list of activities and logs performed by that phase:

CI Master

Phases
  • Init: Initializes the job and the workspace for the job

  • Clone: Clones the target component’s repository at the tip commit of the master branch; records the tip commit ID, which will be used for tagging purposes

  • BaseImage: Builds the Docker image for the component. The built Docker image is then tagged as base-<commit_id>-<build_number>, and pushed to Quay.io. This image is then used in subsequent phases. Note that the Docker image is never built again in the entire pipeline. The same Docker image built in this phase “travels” through the rest of the pipeline, and upon successful completion of the pipeline, the same Docker image is tagged as gold and pushed to Quay.io

  • Parallel: Two identical phases (Parallel-Y and Parallel-Z) are triggered in parallel. Both phases run an identical set of tests, with the only difference being the value of the ACTIVE_RELEASE_ID environment variable (an illustrative configuration is sketched after this list). This phase runs the full suite of tests that accompany the component, which typically includes BASIC, EXTENDED and END-TO-END tests (components can customize which of these types of tests to run or skip via their Jenkinsfile using relevant parameters).

    • Parallel-Y: This parallel phase is also known as the “Y Pipeline”. This phase runs the identical set of tests as the Parallel-Z phase, but with the ACTIVE_RELEASE_ID set to the most recently released (“GA”) release identifier. For example, if the most recently released quarterly release is 19.3 (or 2019-06), then the global environment configuration for Parallel-Y phase will have ACTIVE_RELEASE_ID=2019-06. By setting the ACTIVE_RELEASE_ID to the most recent GA release identifier, the tests in this phase will use the feature flags configured to be enabled for that release. In other words, with the example of the 19.3 release, Parallel-Y phase runs the component’s tests assuming the feature flags for the 19.3 release are enabled, but the feature flags for the upcoming 19.4 release are disabled. For full details on how feature flags work, see this document.

    • Parallel-Z: This parallel phase is also known as the “Z Pipeline”. This phase runs the identical set of tests as the Parallel-Y phase, but with the ACTIVE_RELEASE_ID set to the upcoming quarterly release identifier. For example, if the most recently released quarterly release is 19.3, then the upcoming quarterly release is 19.4 (or 2019-09). In this example, ACTIVE_RELEASE_ID for the Parallel-Z pipeline is configured as 2019-09, which would enable all feature flags for this release for the tests that run in this phase.

      Y and Z Pipelines

      • BASIC tests: Include the component’s unit tests
      • EXTENDED tests: Include the component’s functional integration tests that require the component’s dependencies to be running. As part of this phase, the component is launched using launchme.sh, and the dependencies are launched using launch-deps.sh before the tests are executed.
      • END-TO-END tests: Include the component’s end-to-end functional scenarios that span across more than 2 dependencies. These tests are typically designed to cover system-level integration scenarios, rather than testing the component’s functionality in isolation.
  • GoldTag: Upon successful completion of all (100%) tests, the base-<commit_id>-<build_number> image built as part of the BaseImage phase will be tagged as gold. The gold tag of the component’s Docker image is then pushed to Quay.io. Along with the gold tag, the component’s client libraries, if enabled in the component’s Jenkinsfile, are also published to JFrog.

  • Mail: An email notification is sent out to the DL_R&D_CI_NOTIFICATIONS alias with the status and links to the Jenkins pipeline.

  • Support for additional parallel phases as part of the component’s CI pipeline: Certain components require additional test phases to run in parallel, in addition to the standard Parallel-Y and Parallel-Z phases, depending on the Spring profiles configured for the component in the configuration repository.

      Additional Profiles
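
The only difference between the two parallel phases is the release identifier they run against. Conceptually (this is an illustration based on the example releases above, not an actual pipeline file), the phase configuration looks like:

# Illustrative only - the real values come from the pipeline's global environment configuration
parallel-y:
  ACTIVE_RELEASE_ID: "2019-06"   # most recent GA release (19.3): 19.4 feature flags disabled
parallel-z:
  ACTIVE_RELEASE_ID: "2019-09"   # upcoming release (19.4): 19.4 feature flags enabled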

pullrequest Branch

The purpose of the component’s pullrequest branch pipeline is to build and test the component’s code from the given pullrequest branch. The built artifacts are never promoted to downstream environments, but the pipeline provides a way for developers to build and test their code before merging it into the master branch. Commits pushed to any branch named pullrequest/<some_suffix> in the component’s Bitbucket repository trigger the pipeline job, which goes through phases similar to those described above for master, but without the parallel phases. Upon successful completion of the pipeline, the Docker image created in the BaseImage phase is tagged as pr-gold and pushed to Quay.io.

While the default behavior of the pullrequest branch pipeline is to execute each of the pipeline phases described above, the developer has control over which phases are actually executed if a different behavior is desired. While making a commit, a developer can insert a tag (defined for each phase) to control which build phases are executed.

priority Branch

The purpose of the component’s priority branch pipeline is to enable developers to make a priority bug fix on a previously released Code Drop that is presently deployed in production environments. Typically, a defect that causes work stoppage or results in a significant loss of productivity at the customer is considered a Priority bug, and is usually escalated to the level of senior executives. When a Priority bug is reported, the expected time window for fixing it and deploying the fix in the production environment is very short (typically 12 to 48 hours). The Priority Bug-Fix Pipeline helps the development team with a quicker turnaround for producing a bug fix for such a defect.

Commits pushed to any branch named priority/<code_drop_id> in the component’s Bitbucket repository trigger the pipeline job, which goes through phases similar to those described above for master, but without the parallel phases. Upon successful completion of the pipeline, the Docker image created in the BaseImage phase is tagged as xgold and pushed to Quay.io.

future Branch

The purpose of the component’s future branch pipeline is to build and test the component’s code from the given future branch. It allows developers to “branch out” the code from master for significant changes or enhancements that may not make it back into master for a considerable period of time, or perhaps ever. In concept, the future branch pipeline is akin to the pullrequest branch pipeline, but it acts more like its master counterpart by going through the same detailed build phases as the master branch. The primary difference between the master branch pipeline and the future branch pipeline is that the artifacts produced from the future branch are never promoted to a release or deployed to downstream environments.

Commits pushed to any branch named future/<some_suffix> in the component’s Bitbucket repository trigger the pipeline job, which goes through phases similar to those described above for master. Upon successful completion of the pipeline, the Docker image created in the BaseImage phase is tagged as fgold and pushed to Quay.io.
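
To summarize the branch conventions described above, the mapping of monitored branch name patterns to the Docker image tag pushed on a successful run is roughly as follows (an illustrative summary, not an actual configuration file):

# Branch pattern -> tag pushed to Quay.io on success
master: gold
"pullrequest/<suffix>": pr-gold
"priority/<code_drop_id>": xgold
"future/<suffix>": fgold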

Framework pipeline

The framework pipeline job runs once for each framework, and is triggered when a new commit is pushed to the target branch of the framework’s Bitbucket repository. The framework pipeline job monitors a set of known branch name patterns for which a build will be automatically triggered:

master Branch

The purpose of the framework’s master branch pipeline is to build and test the framework’s code from its master branch, where the code from other branches is merged and integrated. This branch publishes the artifacts (i.e., JARs) that will be consumed by the components.

FW master

Release & Maintenance Branch

The purpose of the framework’s Release & Maintenance branch pipeline is to maintain and publish a specific version of the framework JARs. The build process is the same as for master.

Release Engineering

At the end of the continuous integration pipeline, a microservice Docker image is tagged as a Release Candidate or rc. The RC-tagged Docker images go through a series of production assurance validations. These validations include upgrade tests (to ensure compatibility between versions), user interface tests (to test user experience and mobile applications) and business workflow simulations (to ensure that the end-to-end business scenarios continue operating without regressing). If regressions are found, they are treated as critical defects and either addressed or toggled off using feature flags so that the code can be released without regressing the functionality.

Release Candidates

As explained above in the master branch build section for the application components, the commits pushed to the master branch of a component’s repository result in a build pipeline, at the end of which the newly built (and tested) Docker image is published with two tags to Quay.io:

  • Gold tag: com-manh-cp-<component_short_name>:gold (for example: com-manh-cp-inventory:gold)
  • Absolute tag: com-manh-cp-<component_short_name>:<version>-<commit_id>-<build_number> (for example: com-manh-cp-inventory:2.68.0-9cd931d-1910180647)

The gold tag is considered a “rolling” tag - a tag that will get overwritten every time a new image with the gold tag is published. In contrast, the absolute tag is always unique, and will be recorded in Quay.io forever.

While the gold tag designates a component’s Docker image as a “stable” image, it is still not considered ready for release until it goes through additional functional and performance testing. Each component team is responsible for testing the gold-tagged images (commonly, but incorrectly, referred to as “gold images”) and declaring the readiness of that image for release. This is performed by means of the rc tag. The component teams test the gold-tagged image and, upon successful validation, tag the same absolute version as rc using an automated job. In other words, when a Docker image is rc-tagged, for a short period there will be 3 tags, all pointing to the same physical Docker image: gold, the absolute tag, and rc.
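
Putting it together, immediately after rc promotion the tag-to-image relationship for a component can be summarized as below (an illustrative summary, not a file that exists in any repository; the digest is made up):

# com-manh-cp-inventory on Quay.io, immediately after rc promotion
image-digest: sha256:3f9a...          # one physical image
tags:
  - gold                              # rolling: overwritten by the next master build
  - 2.68.0-9cd931d-1910180647         # absolute: unique, recorded forever
  - rc                                # applied after the component team's validation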

Deployment Metadata

Metadata about every deployment artifact that gets delivered to Operations and deployed in the customer stacks is maintained as a YAML file in the release-pipeline repository. The metadata is organized into 5 categories:

  • cattle: The application components

  • pets: The essential services

  • configuration: The application and deployment configuration

  • binaries: The mobile app builds and Point of Sale Edge Server binaries

  • infrastructure: The infrastructure tools that are used for deployment and environment management

Each entry in the metadata is (unsurprisingly) called a “metadata entry”. Each metadata entry defines a few attributes:

    • name: Fully qualified name of the deployment artifact, such as com-manh-cp-inventory , ma-cp-rabbitmq or manh-store-android.
    • shortName: Short name of the deployment artifact, such as inventory, rabbitmq or manh-store-android.
    • repository: The Bitbucket repository name of the deployment artifact
    • groups: One or more groups that this deployment artifact belongs to. If a component is in the all group, it will be considered part of every group. For more details see the following section.
    • runtime: One or more runtimes that are supported by this deployment artifact. Most artifacts support docker and jar runtimes. However, for some special cases, the artifacts can also support the war runtime (such as POS Edge Server).
    • platform: One or more platforms that are applicable to this deployment artifact. Supported values are one or more of aws (deployments on AWS/ECS), Kubernetes (deployments with GCP/GKE/Rubik) and side (SIDE deployments)
    • containerPort: The port on which the application “inside” the Docker container listens to HTTP traffic
    • virtualPort: The host’s port on which the Docker container listens to HTTP traffic (applicable only for SIDE deployments; on ECS and Kubernetes, the port value is assigned dynamically by the underlying orchestrator)
    • hasDb: true if the deployment artifact has a Database dependency; false otherwise.
    • importance: Relative importance of this deployment artifact. Value must be between 0-9. This value is used to decide the priority class of the deployment in Kubernetes - artifacts with more importance are less likely to get preempted.
    • resourceLimit: Memory requests (container memory and JVM -Xmx) for the deployment artifact
    • environmentVariables: Any additional pre-configured environment variables that the deployment artifact may need at runtime (deprecated - do not use)

The deployment metadata is used to generate deployment artifacts for the target deployment. Sidekick provides APIs to fetch the list of deployment artifacts based on the values of group, platform and runtime. The fetched list is then used by the deployment tooling to create the deployment specifications (such as docker-compose.yml for SIDE instances, or Kubernetes spec files for Rubik deployments). A hypothetical metadata entry is sketched below.
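
For illustration, a hypothetical metadata entry for the inventory component might look like the sketch below. The attribute names follow the list above; the values and the exact YAML layout are assumptions, not the actual contents of the release-pipeline repository.

# Hypothetical metadata entry (illustrative values only)
- name: com-manh-cp-inventory
  shortName: inventory
  repository: component-inventory      # assumed repository name
  groups: [all]
  runtime: [docker, jar]
  platform: [aws, kubernetes, side]
  containerPort: 8080                  # port the app listens on inside the container
  virtualPort: 9011                    # host port; used only for SIDE deployments
  hasDb: true
  importance: 7                        # 0-9; higher importance is less likely to be preempted
  resourceLimit: 2Gi                   # container memory / JVM -Xmx sizing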

Groups

The groups attribute in the metadata entry for each deployment artifact contains one or more values. By associating a deployment artifact with one or more groups, the artifact declares that it “belongs to” any of those groups. The “group” is an arbitrary value and not a predefined enumeration of values. However, each group value must be meaningful in the context of Manhattan products. Currently, the following groups are defined by one or more deployment artifacts:

Groups

Code Drops

Manhattan Active® Platform Code Drops are the instrument by which the development team delivers the set of deployment artifacts to the Cloud Operations team (and, for internal use, to services teams). A code drop is a manifest, or a “bill of lading”, containing the set of deployment artifacts and their versions that are being released as part of that code drop.

Code Drop Manifest

Every code drop manifest consists of the following sets of deployment artifacts:

  • Docker images of the application components (affectionately known as the “cattle”), essential services (or “pets”) and infrastructure (or “rancher”) tools. See Deployment Units for the detailed description of the “Pets and Cattle” analogy.
  • Configuration zip containing the archived form of the configuration repository.
  • Mobile App binaries representing the application mobile apps. In general, the “binaries” can be used as a vehicle to release any form of binary, not limited to the mobile apps. But the current purpose of the binaries is to include only the mobile app binaries.

Types of Code Drop Manifests

A code drop can be released in one of three flavors:

  • dev-ready: A code drop that is used for internal purposes and for development only. The dev-ready code drop manifest contains the same set of deployment artifacts, but their versions are marked as gold (for Docker images) and latest (for the Configuration zip or mobile app binaries). This allows the application teams to take a dev-ready drop for their development environments or other internal environments, while ensuring that the standard set of deployment artifacts is all available.

  • stage-ready: A code drop that is used for internal purposes: for development or QA, or for staging the release before it is promoted to downstream environments. The stage-ready code drop manifest contains the same set of deployment artifacts, but their versions are marked as rc (for Docker images) and latest (for the Configuration zip or mobile app binaries). This allows the application teams to take a stage-ready drop for their development environments or other internal environments, while ensuring that the standard set of deployment artifacts is all available.

  • prod-ready: A code drop that is ready to be released to Operations and promoted to downstream environments. The prod-ready code drop manifest contains the same set of deployment artifacts, but their versions are marked as the actual release-ready versions for Docker images, the Configuration zip and mobile binaries. At the time of the code drop event (typically at the end of the 2-week period, after completing the necessary testing and validation activities with the deployment artifacts from the stage-ready manifest), the absolute versions of the then-current rc tags of the Docker images, the most recent Configuration zip file and the mobile binaries are recorded in the prod-ready manifest. As an example, when we prepared the code drop 2.0.60.x on September 27, 2019, the stage-ready versions of the deployment artifacts were recorded as absolute versions, which produced a prod-ready manifest as shown in the example below (compare it with the snippets shown above for the dev-ready and stage-ready manifests - they look nearly identical, except for the difference in the versions):

Prod-Ready
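
Since the original snippets are shown as images, here is a minimal, hypothetical sketch of how the three manifest flavors differ. The keys and versions are illustrative, not the actual release-pipeline format:

# Illustrative comparison of the three manifest flavors (single component shown)
dev-ready:
  com-manh-cp-inventory: gold                          # rolling development tag
  configuration: latest
stage-ready:
  com-manh-cp-inventory: rc                            # release-candidate tag
  configuration: latest
prod-ready:
  com-manh-cp-inventory: 2.68.0-9cd931d-1910180647     # absolute version locked at code drop time
  configuration: 2.0.60.0                              # assumed version format for the configuration zip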

By recording the absolute versions at a given moment, we essentially “lock down” the versions that will be released to the Operations team. This allows the development team to release a precise and static set of versions across the deployment artifacts. In contrast, the dev-ready and stage-ready manifests point to the more “fluid” gold and rc tags, respectively, which are dynamic in nature by design.

Code Drop Version Convention

To identify a prod-ready manifest, or a code drop that is delivered to the Operations team, a specific versioning convention is used that consists of four parts: 2.0.XX.YY, where XX and YY are two-digit numbers. The number XX represents the Code Drop number, and YY represents the Hot Fix number for that code drop. For example, the prod-ready manifest identified as 2.0.59.0 is the original manifest for Code Drop 59. When updated to 2.0.59.1 or 2.0.59.13, the last number represents Hot Fix #1 or #13 in Code Drop 59, in this example.

Code Drop Workflow

In summary, the dev-ready, stage-ready and prod-ready manifests follow the workflow as described in the diagram below:

CD-Workflow

Hot Fixes

As explained in the Code Drop section above, the bi-weekly releases and the shipped manifests are part of a recurring process, and hence scheduled events. However, there are times when high-priority CIIs (usually to report high or critical impact defects, or to ask for other time-sensitive changes) are reported by the services team, Operations or the customers (including internal customers such as PA or Sales). The requester may not be able to wait for the requested changes until the next code drop is released and deployed in their environments. In these cases, upon approval by senior leaders of the application teams (usually at least a Director-level lead), a change may be allowed to be released as a “Hot Fix” on the most recently shipped code drop.

Example of a Hot Fix

A Hot Fix request is typically fulfilled by delivering a Docker container (if the desired change was made in an application component or an essential service), a new Configuration zip (if the change is in the configuration repository), or a new mobile binary. Regardless of the type of deployment artifact being delivered as a Hot Fix, the change is recorded in the most recent prod-ready manifest in the release-pipeline repository branch specific to that code drop, and a new version of the manifest is delivered to the Operations team.

As an example, assume that component-order was released as com-manh-cp-order:1.2.3-abcdefg-1909030621 as part of, say, code drop 2.0.58.4. For a CII that was reported against component-order, a new Hot Fix version was built with a change. Assume that the revised version is com-manh-cp-order:1.2.4-ghijklm-1009132056. To release this Hot Fix, the prod-ready manifest of the code drop 2.0.58.x is updated to record the new version of component-order, and the version of the manifest is incremented to 2.0.58.5.
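
In manifest terms, the update might look like the hypothetical excerpt below; the key names are illustrative, and only the component-order entry and the manifest version change:

# Before the Hot Fix (manifest version 2.0.58.4)
version: 2.0.58.4
cattle:
  component-order: com-manh-cp-order:1.2.3-abcdefg-1909030621
---
# After the Hot Fix is recorded (manifest version 2.0.58.5)
version: 2.0.58.5
cattle:
  component-order: com-manh-cp-order:1.2.4-ghijklm-1009132056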

Cumulative Nature of Hot Fixes

Hot Fixes for a given Code Drop are cumulative. In other words, when the Code Drop 2.0.58.0 is initially delivered to the Operations team, it contains no hot fixes. With each hot fix (or set of hot fixes if made available at the same time) the manifest’s version number is incremented to 2.0.58.1, 2.0.58.2 and so forth. It is possible that a single Hot Fix increment may have a single change (e.g., component-order only), or may have multiple changes (e.g., component-order and component-payment). However, the last digit of the code drop version will always be incremented by 1. In other words, when 2.0.58.2 is delivered, the Operations teams will also get the Hot Fix delivered as part of 2.0.58.1.

Learn More

Authors

  • Pravas Kumar Mahapatra: Sr Manager, Continuous Integration & Release Engineering, Manhattan Active® Platform, R&D.

DevOps @ Manhattan and Manhattan Reliability Engineering

Manhattan’s DevOps philosophy and how our reliability engineering team ensures high SLAs of your environment.

Manhattan Active® Platform is built, delivered, and supported by Manhattan R&D engineering. The engineering teams follow distributed development and continuous integration practices. The development teams responsible for individual microservices are small, and geographically collocated in most cases. This document describes the practices, procedures and tooling that support the DevOps and Reliability Engineering operations:

Development and Release

This section covers the “left” side of the DevOps practices in Manhattan, describing how developers code, build, test and release the software binaries for deployments in customer environments:

Continuous Integration

The typical development process can be summarized as the following:

  • Every microservice has a separate codebase in its own Git repository
  • Every development team works on the microservices they are responsible for and pushes the changes to the respective feature branches
  • The continuous integration pull request process builds and tests the code from the feature branches
  • Upon successful build and tests, the development team merges the code to the mainline (trunk or “master”)
  • The continuous integration process then builds and tests the code from the trunk, builds the Docker image, and publishes it to a private Docker registry

The microservice codebase goes through a set of distinct phases in the continuous integration pipeline. In each phase, a special category of tests is executed for the microservice code, and only upon successful completion of all tests does the code move to the next phase. These phases also act as quality gates and ensure that code is release-ready only if it successfully passes through all quality gates. Each development team develops, builds, and tests their microservices in isolation or with minimal dependencies on other microservices. This model allows for independent development, and loose coupling between microservices. Where necessary, the tests use mocks to represent their dependencies, which allows every microservice to maintain and adhere to service contracts for backward compatibility of interactions between microservices.

Release Engineering

At the end of the continuous integration pipeline, a microservice Docker image is tagged as a Release Candidate or rc. The RC-tagged Docker images go through a series of production assurance validations. These validations include upgrade tests (to ensure compatibility between versions), user interface tests (to test user experience and mobile applications) and business workflow simulations (to ensure that the end-to-end business scenarios continue operating without regressing). If regressions are found, they are treated as critical defects and either addressed or toggled off using feature flags so that the code can be released without regressing the functionality.

After production assurance vetting, the release candidate Docker images go through a process of release to the Manhattan Active® Operations team, where they are promoted to the customer environments - first to the customer’s lower lifecycle environment, followed by the customer’s production environment. Manhattan Active® Operations uses a set of deployment tools built by Manhattan engineering teams to perform the customer environment deployments and upgrades. The deployment tools work natively with Kubernetes and Google Kubernetes Engine to execute the deployment tasks.

Security Focus in Development

Manhattan fully understands the criticality of security in software engineering and the need to build a stronger security posture starting with the foundation: the code. The topics below summarize the key security aspects that relate to secure software development practice at Manhattan:

Securing Codebase

The platform and product codebase is maintained in private Git repositories in Atlassian Bitbucket. Access to these repositories is authenticated via either HTTPS (using OAuth2 SSO with the Manhattan corporate directory) or SSH (using a Git SSH private key). Specific user groups with access permissions, such as read, write and admin, are created for controlling access to the code repositories. Users, including automated systems such as the Continuous Integration and Release Pipeline, are then assigned to the designated groups based on their role and development needs.

Likewise, the built binaries, such as Docker containers or JAR files, are also stored in private binary repositories such as Google Container Registry and the JFrog Maven repository. Access to the binary repositories is also controlled via OAuth2 SSO with the Manhattan corporate directory and Docker login credentials. Access controls apply uniformly to Manhattan development staff and to the automated build, testing and release subsystems.

Writing Secure Code

Manhattan uses a combination of security development practices for ensuring the codebase released to run in production is free of security vulnerabilities and to identify new vulnerabilities or security “hot spots” that may be introduced either in the code that the developers committed or in the 3rd party libraries that are included in the built binaries:

  • Security Code Review: Architects, development leads and other senior developers are expected to perform thorough code reviews of all pull requests before merging them to the mainline. In addition to the functionality and performance characteristics, one of the key aspects of this review is code security. The reviewers use specific rules to review security aspects of the code, including the OWASP Top 10 guidelines, security policies established by Manhattan, and their product expertise to identify potential security gaps and have them addressed.
  • Threat Analysis: Developers receive regular feedback from the security teams (SecOps, Global Security, and Manhattan Active® Platform security) on potential threat scenarios and hypothetical use-cases that involve a breach, referred to as “What If?” cases. Using this feedback, and depending on the severity of the threat as evaluated by the internal security experts, the development teams analyze the potential risks in their code, improve the logic or perform necessary updates, and define the mitigation plans to tackle the potential risk scenarios.
  • Security Code Scans: The complete codebase of every framework and microservice is scanned for security vulnerabilities, code quality, and adherence to coding best practices on every code commit. Manhattan uses SonarCloud to perform code scanning as part of the Continuous Integration pipeline. Additionally, all binaries, including 3rd party libraries used by Manhattan Active® Platform, are checked with static scans via Veracode before being released as part of the upcoming bi-weekly code drop. If these scans identify potential vulnerabilities in Docker containers, JAR files or other libraries that would eventually be deployed in production, then the change advisory board blocks the code drop from being released to the customer environments until the vulnerabilities are addressed.
  • Network Security Monitoring: Manhattan Active® Platform utilizes Lacework for network security monitoring. The monitoring is enabled for quality assurance and UAT environments used by the development teams. Network security monitoring identifies unusual, or potentially risky network connectivity or call behaviors by continuously monitoring TCP traffic (SSH, HTTP, JDBC, etc.) of the end-to-end deployment. Alerts flagged by Lacework are used by the developers for security risk analysis and correcting the network behavior.

DevOps Mindset

Manhattan engineering teams follow DevOps principles and consider customer deployments and operations as integral parts of the development process. While roles, responsibilities and skill sets separate the development and operation staff, the overall team is part of a single R&D organization, and every member of the organization is responsible for accuracy, consistency and validity of the development and operations processes. A few of the key DevOps practices followed by Manhattan R&D are summarized below:

  • Agile development lifecycle: Development teams follow well-defined development cycles, or sprints. The sprints are typically 2-week cycles, but for some teams or during certain periods of the year, the sprints could be 1-week or 3-week long. The development teams typically publish the release candidates for their deployment artifacts once or several times during a sprint. During a sprint, development teams focus on new product features, tooling, addressing technical debt or bug-fixing activities.
  • Feature flags: Feature flags are used rigorously to control the availability of new features, and for “dark deploying” new features that may require additional vetting before becoming generally available. Feature flags are maintained at the level of each individual product feature and could go across more than one microservice. Feature flags are owned and managed by the development teams, who decide when a certain feature is ready for customer consumption. Feature flags allow Manhattan engineering teams to release code from trunk on a frequent basis without having to rely heavily on feature or topic branches for longer periods of time.
  • Streaming Delivery: The tested release candidates are continuously delivered to the production assurance teams, and upon successful validation, promoted to the customer environments frequently. The process of continuously delivering new versions of Manhattan Active® application components and supporting services is referred to as Streaming Delivery.
  • Quarterly Product Releases: While the code for the new features is released throughout the development sprints, these features are not made generally available to customers until the next quarterly product release event. At the start of the new quarter, the features developed during the previous quarter are enabled for the customer use. Enabling (and potentially, subsequent disabling) of new features is governed by Feature flags as explained above.
  • Postmortem: The regression breaks caught during production assurance, or in certain cases reported by customers, are reviewed at regular intervals with the participation of the development teams responsible for the regression break. Regression reviews serve as a mechanism to find underlying deficiencies in the development processes and tooling with a focus on continuous improvement. Likewise, critical incidents such as outages or functional or performance degradations are analyzed for a detailed root cause analysis by the engineering teams. The focus of the no-blame postmortem exercise is to understand the root cause of the problem such that it can be turned into feedback to the development teams for improvements.
  • Mining: Experienced senior engineers are tasked with performing forensic analysis of chronic issues, or potential future problems that are harder to isolate, recreate or debug. While the occurrences of such complex problems are infrequent, when they do surface, they need immediate reaction, troubleshooting and future-proofing.
  • SLA Review: A monthly review is conducted by the Manhattan R&D leadership team to summarize and deliberate on the customer-initiated incidents, to analyze the customer impact of the reported incidents, and to measure the service quality and SLA. While similar to the postmortem process, the focus of the SLA review is to measure operations and development processes and practices against the Service Level Objectives and customer expectations.

Manhattan Reliability Engineering

The Manhattan Reliability Engineering (or MRE) team is part of Manhattan Active® Operations under the Manhattan R&D organization, and consists of a set of software engineers, deployment architects and systems engineers, primarily responsible for ensuring the stability, availability, security, and a lower cost of ownership for the customer environments. The MRE team is tasked with building, instrumenting, and operating a set of automation and management tools that are used to perform a set of key tasks crucial to maintaining reliable systems:

  1. Supervising the environment deployments, upgrades, availability, and scalability, by automating these aspects of environment management and providing visibility into the actions performed by the automated systems.
  2. Monitoring the environments, by automating the collection and visibility of system metrics, logs, and other traces, such that the automated systems can produce the necessary notifications, alerts and reports promptly, accurately, and efficiently.
  3. Diagnosing and troubleshooting systemic or localized problems, or performing the root cause analysis of problems reactively, via the built-in instrumentation of command-line tools, plumbing APIs, or web consoles.

The Reliability Workflow

The MRE team follows the reliability workflow described below for preventing, diagnosing, and treating system or performance issues.

The MRE team focuses on proactively preventing problems by continuously monitoring the customer environments, and by treating the early signs of problems. However, when incidents do occur, the MRE team relies on the reactive alerting mechanism and troubleshooting tools to diagnose the symptoms and potentially address problems before they result in service degradation or outages. Significant focus is put on two key concepts of reliability engineering to strengthen the stability and efficiency of Manhattan Active® Platform:

Mean Time to Repair: Measuring the engineering performance based on how quickly the team can repair or restore an outage. MTTR is directly relevant to the customer as it is a measure of the amount of downtime the customer sees. The MRE team is responsible for continuously improving the monitoring ecosystem with the primary goal of reducing the MTTR.

Automation: Systems, processes and tooling that can be automated help with improving consistency of deployment behavior, minimize the potential for human errors, reduce the operational costs, and create the ability to learn & forecast from previous problems. Automation reduces toil and enables engineering teams to focus more on strategic initiatives. The MRE team is responsible for building and operating sub-systems for automatic deployment & upgrade processes, self-healing, and self-service.

Monitoring Ecosystem

Manhattan Active® Platform consists of a monitoring ecosystem for automating and managing consistency, reliability, and resilience of the customer environments. Some key constituents of the ecosystem are listed below:

LOGS COLLECTOR

Manhattan Active® Platform is integrated with Elasticsearch, Fluentd and Kibana as the toolset at the core of its logging subsystem. These services are instrumented with SLF4J and Docker to stream the logs from the stack components into the Logs Collector:

  • Elasticsearch is the log aggregator across all stack components and supporting services.
  • Fluentd is the mechanism to capture the log streams and index them into Elasticsearch.
  • Kibana is used as a log viewer, and for executing log search queries or dashboards.

Additionally, the Logs Collector has a set of internally available APIs that can be used by other tools to read the log data for generating alerts and visualizations.

METRICS COLLECTOR

Manhattan Active® Platform uses Prometheus, Grafana and Alert Manager as the toolset at the core of its monitoring subsystem:

  • Prometheus scrapes metrics from all stack components and supporting services at fixed intervals and persists them in a time-series database.
  • The time-series database can be queried by Grafana to show various visualizations in configurable dashboards.
  • The metrics data in Prometheus is also queried by Alert Manager to scan for problem conditions and raise alerts. Alert Manager can be configured to emit the alerts via web-hooks to alert notification services such as PagerDuty (an illustrative alerting rule is sketched after this list).
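
For illustration, a Prometheus alerting rule feeding Alert Manager might be expressed roughly as below. The metric name, threshold, and labels are hypothetical, not the actual rules shipped with the platform:

# Illustrative Prometheus alerting rule (hypothetical metric and threshold)
groups:
  - name: component-health
    rules:
      - alert: HighHttpErrorRate
        expr: sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m])) > 5
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Sustained 5xx error rate on application endpoints"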

Metrics Collector collects and persists three types of metrics data:

  • Infrastructure metrics: metrics at the level of GKE node pool, Kubernetes, and Docker
  • Application metrics: metrics collected from the application components at the level of Java and Spring Boot
  • Functional metrics: metrics collected from the application components at the level of business functionality.

MONITOR

The Monitor is a platform component that is part of the Manhattan Active® Platform and performs specific operations on the logs and metrics data to make them usable for alerting based on application conditions. The Monitor uses internal APIs provided by the Logs Collector (via Elasticsearch) and Metrics Collector (via Prometheus) to scan for a predefined set of patterns and labels, and distills alertable data from the raw logs and metrics. The output is read by the Metrics Collector, which can then produce alerts for specific conditions determined by the Monitor. Examples of some of these conditions are listed below:

  • Watching for a specific log pattern, such as exception messages or textual values
  • API call counters and response time aggregation
  • Health-check probes for the supporting services such as messaging or caching
  • Overall health indicators for the system based on key performance metrics

Additionally, the Monitor also integrates every deployment stack with Keyhole - the global monitoring dashboard - such that specific alerting data can be streamed directly from the deployment stack to Keyhole, allowing for near real-time notifications of alert conditions.

KEYHOLE

Keyhole is the global monitoring dashboard built using Grafana and Prometheus running on the Google Cloud Run platform. Keyhole receives alert notifications from all customer environments and, depending on the severity of the alerts, it derives and indicates a deployment stack’s health, along with the recent trend of critical alerts. Keyhole also has the ability to drill down into each deployment stack’s monitoring dashboard from the global dashboard, allowing for easy access to individual customer environments directly from the global dashboard.

Manhattan Active® Operations teams continuously monitor the Keyhole dashboard to get a comprehensive, real-time view of the customer environments across the globe.

SLAMMER

Slammer is the global alert capture and analysis database built using Google BigQuery and Google Cloud Run platform. Slammer receives alert notifications from all customer environments. Using the historical alert data, Slammer can perform various analytics such as alert trends, service quality and SLO analysis.

Slammer periodically publishes the report of key alerts and their trends from the customer environments, providing crucial learnings that are used for continuous improvements and predictions for future scenarios under similar conditions.

ARECIBO

Arecibo is the global activity recorder that captures system interactions from all deployment stacks and persists them in a database built on Google BigQuery. Arecibo enables a centralized view of system interactions such as inbound or outbound HTTP calls, asynchronous message transmissions, and extension-point invocations. The activities recorded by Arecibo are helpful in identifying latency and performance of these system interactions.

Arecibo periodically publishes the report of latency and performance data of system interactions across all deployment stacks. These reports are used for performance engineering and tuning of the system infrastructure and application code.

WIRETAP

Wiretap is a platform component that is part of the Manhattan Active® Platform; it performs specific operations on the application data to produce functional metrics, which are then scraped and persisted by the Metrics Collector as time-series data. These metrics are subsequently inspected for alert conditions to reveal potential inconsistencies in business transactions. Wiretap provides insights into the symptoms of functional problems, or conditions that could lead to such problems.

Examples of some alert conditions that Wiretap captures are listed below:

  • Order header and detail in different status
  • Manifest closed but not all LPNs on manifest are shipped
  • Locations with negative inventory

SONAR

Sonar is the synthetic probe that keeps track of the health of every HTTP endpoint deployed as part of the Manhattan Active® Platform. Sonar “pings” the HTTP endpoints across customer environments at a scheduled frequency and records the HTTP status of the response. If the status, latency, or availability of the response does not meet the previously defined criteria, Sonar reports the error condition as an alert to be evaluated and acted upon by MRE.

LACEWORK

Lacework is a security tracing solution that helps the MRE team identify anomalous activity that deviates from normal behavior in a customer environment and may indicate a threat. Lacework leverages behavioral models based on the usage patterns of the system, and reports unusual behaviors, invocations, and software components, which can then be analyzed by the security analysts to detect threats, risks, and potential exposures.

PINPOINT APM

Pinpoint APM is an open-source application performance monitor agent and visualizer, integrated with the Manhattan Active® Platform. Pinpoint APM agent captures performance metrics and stack traces via bytecode instrumentation and persists these metrics in HBase. These metrics can then be analyzed with Pinpoint APM web console to identify performance bottlenecks or other symptoms of performance degradation.

The graphic below summarizes the monitoring, alerting, and reporting ecosystem deployed as part of the Manhattan Active® Platform:

Monitoring Diagram

Manhattan Active Monitoring

Manhattan Associates monitors our active platform for availability, events, and security.

We use Prometheus to mine application status data via Kubernetes and a custom service (Snitch) that queries component endpoints providing health and other detailed information.

Alertmanager works with Prometheus, which raises events based on configured rules. These events are then sent to our centralized event collection service (Slammer - Service Level Agreement manager). Slammer will hold any critical events to allow self-resolution/healing to take place. If a resolution has not been received in the allotted time, Slammer will create an alert and send it to our paging service (PagerDuty).
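
A minimal sketch of what such an Alertmanager route might look like, assuming a hypothetical Slammer webhook endpoint (the URL and timings are illustrative, not the actual configuration):

# alertmanager.yml (illustrative excerpt)
route:
  receiver: slammer              # send everything to the centralized event collector
  group_wait: 30s
  repeat_interval: 4h
receivers:
  - name: slammer
    webhook_configs:
      - url: https://slammer.example.internal/api/events   # hypothetical endpoint
        send_resolved: true      # lets Slammer observe self-resolution before paging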

We have a custom service (Manhattan Sonar) that actively pings the external endpoints our customers use, to verify that they are available. If a service is unavailable, Sonar sends the event to Slammer via a webhook.

Slammer stores all events in a data lake, which we use to identify trends and produce daily reports for internal big-picture consumption.

From a security perspective, Manhattan Associates’ cloud security team monitors all of our customer and internal environments with Lacework. Lacework uses machine learning to bubble up unusual activities as well as security vulnerabilities and common vulnerabilities and exposures (CVEs). This allows prompt identification and resolution of events of interest.

Incident Response Workflow

The Manhattan Active® Network Operation Center (or NOC, for short) is the team primarily responsible for monitoring the health and availability of the customer environments. The NOC team is tasked with initiating the resolution workflow when a system or functional alert is triggered.

A curated and refined set of system and functional alerts is defined as part of Manhattan Active® Platform deployments. These alerts fall into two broad categories:

  • Proactive alerts act as the detection mechanism for possible conditions that could result in incidents in the future. These alerts also provide feedback to improve the auto-recovery mechanisms built into the Manhattan Active® Platform
  • Reactive alerts act as the detection of a potentially active incident. These alerts carry a higher priority, and result in actions by the NOC and MRE teams to repair the problem as promptly as possible to reduce the mean time to repair the service.

The lists below include some key proactive and reactive alerts that are monitored by the NOC team:

The workflow shown below describes the rules of engagement and communication for resolving alerts with a singular focus of minimizing the downtime or degradation and exceeding the SLA of the Manhattan Active® solutions.

Learn More

Author

  • Kartik Pandya: Vice President, Manhattan Active® Platform, R&D.

HTTP Traffic Routing and TLS Certificates

How Manhattan Active® Platform routes HTTP traffic securely and how the certificate management and renewals are performed.

Introduction

Manhattan Active® Platform is deployed as a distributed application using Google Kubernetes Engine on Google Cloud Platform. MA products expose two HTTPS endpoints by default: the Authentication endpoint and the Application endpoint. These endpoints are exposed externally using the NGINX ingress controller. The NGINX Ingress Controller is an Ingress controller that manages external access to HTTP services in a Kubernetes cluster using NGINX. All configurable traffic routing using Ingress resources, and TLS termination for the Manhattan application, are done at the ingress controller level. Currently, the NGINX ingress controller exposes the MA application to the outside world using the cloud provider’s TCP load balancer.

All the application and authentication URLs are mapped to the cloud load balancer with publicly resolvable DNS names.

TLS Routing

All the traffic to, from and within Manhattan Active® Platform is encrypted with TLS v1.2 by default. The inbound HTTPS load balancer listens on port 443. The HTTPS endpoints accept traffic from the Manhattan Active® Platform web user interface, mobile applications, and REST clients. The ingress controller is configured to serve the application over HTTPS, offloading this functionality from the application. We have configured the SSL certificate in the ingress controller for all the Ingress resource rules created for the MA application.
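
As an illustration, an Ingress resource covering the two endpoints might look roughly like the sketch below. The hostnames, secret name and service names are hypothetical; the actual rules are generated by the deployment tooling:

# Illustrative Ingress sketch - not the generated resource
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ma-app
  annotations:
    kubernetes.io/ingress.class: nginx            # route through the NGINX ingress controller
spec:
  tls:
    - hosts: [app.customer.example.com, auth.customer.example.com]
      secretName: customer-wildcard-tls           # wildcard certificate managed by cert-manager
  rules:
    - host: app.customer.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service: { name: zuul, port: { number: 443 } }
    - host: auth.customer.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service: { name: auth, port: { number: 443 } }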

We have also automated the lifecycle of the TLS certificates with cert-manager and Let’s Encrypt as the certification authority. Cert-manager automates the provisioning of HTTPS certificates within the Kubernetes cluster. It provides custom resources to simplify the provisioning, renewal, and use of those certificates.

End-to-End Encryption

With NGINX, we can achieve end-to-end encryption of all requests in addition to making Layer-7 routing decisions. In this case the clients communicate with NGINX over HTTPS; NGINX decrypts the requests and then re-encrypts them before sending them downstream to the application gateway, the Zuul server.

SSL / TLS Client Certificates

NGINX can handle SSL/TLS client certificates and can be configured to make them optional or required. Client certificates are a way of restricting access to the application to only the authorized clients without requiring a password. We can control the certificates by adding revoked certificates to a certificate revocation list (CRL), which NGINX checks to determine whether a client certificate is still valid.
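
With the NGINX ingress controller, client-certificate verification is typically driven by annotations on the Ingress resource. A hedged sketch, assuming the standard ingress-nginx annotations and a hypothetical CA secret:

# Illustrative annotations for optional client-certificate verification
metadata:
  annotations:
    nginx.ingress.kubernetes.io/auth-tls-secret: default/customer-client-ca   # CA bundle (hypothetical secret)
    nginx.ingress.kubernetes.io/auth-tls-verify-client: "optional"            # set to "on" to require a client certificate
    nginx.ingress.kubernetes.io/auth-tls-verify-depth: "2"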

Cert Manager

Cert-manager adds certificates and certificate issuers as resource types in Kubernetes clusters, and simplifies the process of provisioning, renewing and using these certificates. It can issue certificates from a variety of supported sources, including Let’s Encrypt, HashiCorp Vault, and private PKI. Cert-manager ensures the validity of the certificates, and automatically renews each certificate at a configurable time before its expiration.

Manhattan Active® Platform uses Let’s Encrypt, a global Certificate Authority (CA). The cert-manager ecosystem has API-based automation built with Let’s Encrypt to renew and provision the certificates, which are then distributed to many deployment stacks via a Kubernetes cronjob that detects the need for renewal and requests a renewed certificate from the central certificate governing system. Let’s Encrypt serves as a platform for advancing TLS security best practices. All certificates issued or revoked are publicly recorded and available for anyone to inspect.

How Cert-Manager works in Manhattan Active Platform

Cert-manager runs within a dedicated tools Kubernetes cluster as a series of Deployment resources. It utilizes CustomResourceDefinitions to configure certificate authorities and request certificates. Along with cert-manager, we have configured Let’s Encrypt as the ClusterIssuer resource, which represents a certificate authority. For every customer we create a wildcard domain certificate, which is injected as a Kubernetes secret used by the Ingress resources in the cluster. The Let’s Encrypt CA uses the DNS-01 challenge to validate the domain name of the certificate before issuing it.
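
A minimal sketch of what the ClusterIssuer and a per-customer wildcard Certificate could look like, assuming the standard cert-manager v1 API and a Cloud DNS DNS-01 solver; the names, email address and project ID are hypothetical:

# Illustrative cert-manager resources (hypothetical names and values)
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com                      # hypothetical contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - dns01:
          cloudDNS:
            project: example-dns-project        # hypothetical GCP project for the DNS-01 challenge
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: customer-wildcard
spec:
  secretName: customer-wildcard-tls             # consumed by the Ingress resources in the cluster
  dnsNames:
    - "*.customer.example.com"
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  duration: 2160h        # ~90 days, matching the default noted in the next section
  renewBefore: 360h      # renew ~15 days before expiry (illustrative value)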

How certificates are automatically renewed in Manhattan Active® Platform

Cert-manager will automatically renew Certificates. It calculates when to renew a Certificate based on the issued X.509 certificate’s duration and a ‘renewBefore’ value, which specifies how long before expiry a certificate should be renewed. The default duration configured in Manhattan Active® Platform is 90 days. A cronjob running on the Manhattan tools cluster periodically runs the helper scripts found within cert-manager to generate or renew SSL certificates from Let’s Encrypt.

All the generated and renewed certificates are stored on shared storage used by the Sidekick application. Each customer environment downloads the SSL certificate from the Sidekick application and exposes the certificate via a Kubernetes secret. A dedicated cronjob running in each customer environment validates and fetches the certificate before it expires.

Zuul Routing

Zuul is the API gateway in the Manhattan Active® Platform that provides dynamic routing, monitoring, resiliency, security and more. Traffic on the NGINX ingress controller matching the ingress rules is sent to the Zuul server first, and Zuul then delegates the authenticated user traffic to the other application components running in the microservice deployment. Zuul itself runs as another microservice application on the Kubernetes cluster. The authentication service, known as the OAuth server, is another Manhattan platform component responsible for both authentication and authorization. All client-initiated traffic that is not yet authenticated is redirected to the auth server by the Zuul server to complete the authentication by generating valid JWT tokens.

Zuul and the auth service are exposed to the outside world using the NGINX ingress controller, and Ingress resource routing rules are created on the NGINX controller for both the Zuul and auth services. When client traffic hits the NGINX ingress controller via the cloud load balancer, NGINX compares the host header of the user traffic against the ingress rules on the controller, and if the header matches, it forwards the traffic to the Zuul server. For a first-time login, Zuul redirects the call to the auth server, which results in another NGINX ingress routing call to the auth server for authentication.

Learn More

Authors

  • Giri Prasad Jayaraman: Technical Director, Manhattan Active® Platform, R&D
  • Akhilesh Narayanan: Sr Principal Software Engineer, Manhattan Active® Platform, R&D

Creating a new Blog Document

How to create a new blog document in this repository

The manh Hugo theme includes sample archetypes that can get you started with a new Markdown file based on the kind of document. The starting document includes starter YAML metadata (front matter) which influences the way the document is displayed. The steps to create a new document are as follows:

1. Pull the latest from this Repo

Many developers may be contributing to this Repo and theme changes may have occurred since you last worked in the Repo, so make it a habit to pull frequently.

git checkout main
git pull --recurse-submodules

2. Create a branch for your new document

Create a new branch for your changes which can later be reviewed / approved through a pull request.

Please follow a standard branch naming convention:

  • use all lowercase letters
  • use - for spaces
  • provide a short name of the document being created
git checkout -b new-blog-entry-2022-05-02

3. Run the Hugo Command to generate your new Document in the appropriate location

A series of sub folders below the top level section provide additional organization. Each document is created as a markdown file within its own folder along with an images folder.

Folder                Description
docs/concepts         General Concepts
docs/faq              Frequently Asked Questions - 1 Document per Question
docs/getting-started  Starter Guides
docs/guides           General Guides and Other Documents
docs/reference        Reference Documents
docs/tutorials        Tutorials
blog/article          Blog entries
# Example for creating a new Blog in the "article" sub section
hugo new blog/article/2022-05-02-example

4. Update Front Matter

Update the YAML section at the top of the index.md file created for your document. Important items to update include:

  • title and linkTitle
  • description

Comments in the generated file should help explain the variables and their use; feel free to remove or update comments in this section as part of your documentation.
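
For reference, a typical front matter block might look like the hedged example below; the exact fields depend on the archetype generated for your document:

---
title: "Example Blog Entry"
linkTitle: "Example Blog Entry"
date: 2022-05-02
description: >
  A short summary of what this post covers.
---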

5. Add Content

Finally! Create the documentation that you want to share with your colleagues and the development community. The starting document has some examples of Markdown to help. Using your favorite code editor, you can keep an eye on how the markdown is going to look.

VS Code users: Here is a handy extension - doc markdown

6. Review your Content Locally

Once you have a good working draft, you can review how it will appear in the site by running Hugo on your local machine.

# Returns the local server hosting pages
# by default, this is likely http://localhost:1313/
hugo server

NOTE: You can live edit your documentation, so snap your browser and editor side by side and live edit.

7. Commit your changes and push

Once you have completed your documentation, commit and push your changes to the remote repo

# Supply a commit message and push your changes
git add -A
git commit -m "initial draft"
git push --set-upstream origin new-blog-entry-2022-05-02

8. Create a Pull Request

Now submit your pull request for another colleague to review your documentation.