DevOps @ Manhattan and Manhattan Reliability Engineering
Manhattan Active® Platform is built, delivered, and supported by Manhattan R&D engineering. The engineering teams follow distributed development and continuous integration practices. The development teams responsible for individual microservices are small and, in most cases, geographically co-located. This document describes the practices, procedures, and tooling that support the DevOps and Reliability Engineering operations:
Development and Release
This section covers the “left” side of the DevOps practices in Manhattan, describing how developers code, build, test and release the software binaries for deployments in customer environments:
The typical development process can be summarized as follows:
- Every microservice has a separate codebase in its own Git repository
- Every development team works on the microservices it is responsible for and pushes changes to the respective feature branches
- The continuous integration pull request process builds and tests the code from the feature branches
- Upon successful build and tests, the development team merges the code to the mainline (trunk or “master”)
- The continuous integration process then builds and tests the code from the trunk, builds the Docker image, and publishes to a private Docker registry
The microservice codebase goes through a set of distinct phases in the continuous integration pipeline. In each phase, a specific category of tests is executed for the microservice code, and only upon successful completion of all tests does the code move to the next phase. These phases also act as quality gates and ensure that code is release-ready only if it successfully passes through all of them. Each development team develops, builds, and tests its microservices in isolation or with minimal dependencies on other microservices. This model allows for independent development and loose coupling between microservices. Where necessary, the tests use mocks to represent a microservice's dependencies, which allows every microservice to maintain and adhere to service contracts for backward compatibility of interactions between microservices.
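As an illustration of testing against mocked dependencies, the sketch below stands in for a downstream service with a `unittest.mock` autospec so the service contract is exercised without deploying the dependency. The service, client class, and method names are hypothetical; the document does not describe Manhattan's actual test harness.

```python
from unittest import mock

# Hypothetical client for a downstream "inventory" microservice;
# the real service is not deployed during isolated CI tests.
class InventoryClient:
    def get_available(self, sku: str) -> int:
        raise NotImplementedError("calls the real service over HTTP")

def can_promise(client: InventoryClient, sku: str, qty: int) -> bool:
    """Business logic under test: promise an order line only if stock suffices."""
    return client.get_available(sku) >= qty

# Contract-style test: the autospec mock pins the agreed call signature
# and response shape, so a breaking contract change surfaces here.
client = mock.create_autospec(InventoryClient, instance=True)
client.get_available.return_value = 7

assert can_promise(client, "SKU-1", 5) is True
assert can_promise(client, "SKU-1", 9) is False
client.get_available.assert_called_with("SKU-1")
```

Because the mock is created from the client's spec, a renamed method or changed signature fails the test suite even though no real network call is made.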
At the end of the continuous integration pipeline, a microservice Docker image is tagged as a Release Candidate, or RC. The RC-tagged Docker images go through a series of production assurance validations. These validations include upgrade tests (to ensure compatibility between versions), user interface tests (to test user experience and mobile applications), and business workflow simulations (to ensure that end-to-end business scenarios continue operating without regressing). Any regression found is treated as a critical defect and either addressed or toggled off using feature flags so that the code can be released without the regressed functionality.
After production assurance vetting, the release candidate Docker images go through a process of release to the Manhattan Active® Operations team, where they are promoted to the customer environments - first to the customer’s lower lifecycle environment, followed by the customer’s production environment. Manhattan Active® Operations uses a set of deployment tools built by Manhattan engineering teams to perform the customer environment deployments and upgrades. The deployment tools work natively with Kubernetes and Google Kubernetes Engine to execute the deployment tasks.
Security Focus in Development
Manhattan fully understands the criticality of security in software engineering and the need to build a stronger security posture starting with the foundation: the code. The topics below summarize the key security aspects that relate to secure software development practice at Manhattan:
The platform and product codebase is maintained in private Git repositories in Atlassian Bitbucket. Access to these repositories is authenticated via either HTTPS (using OAuth2 SSO with the Manhattan corporate directory) or SSH (using a Git SSH private key). User groups with specific access permissions, such as read, write, and admin, are created to control access to the code repositories. Users, including automated systems such as the Continuous Integration and Release Pipeline, are then assigned to the designated groups based on their role and development needs.
Likewise, the built binaries, such as Docker containers or JAR files, are also stored in private binary repositories such as Google Container Registry and the JFrog Maven repository. Access to the binary repositories is also controlled via OAuth2 SSO with the Manhattan corporate directory and Docker login credentials. Access controls apply uniformly to Manhattan development staff and automated build, testing, and release subsystems.
Writing Secure Code
Manhattan uses a combination of secure development practices to ensure that the codebase released to production is free of security vulnerabilities, and to identify new vulnerabilities or security “hot spots” introduced either in the code that developers commit or in the 3rd-party libraries included in the built binaries:
- Security Code Review: Architects, development leads and other senior developers are expected to perform thorough code reviews of all pull requests before merging them to the mainline. In addition to the functionality and performance characteristics, one of the key aspects of this review is code security. The reviewers apply specific rules to review security aspects of the code, drawing on the OWASP Top 10 guidelines, security policies established by Manhattan, and their product expertise to identify potential security gaps and have them addressed.
- Threat Analysis: Developers receive regular feedback from the security teams (SecOps, Global Security, and Manhattan Active® Platform security) on potential threat scenarios and hypothetical use-cases that involve a breach, referred to as “What If?” cases. Using this feedback, and depending on the severity of the threat as evaluated by the internal security experts, the development teams analyze the potential risks in their code, improve the logic or perform necessary updates, and define mitigation plans for the potential risk scenarios.
- Security Code Scans: The complete codebase of every framework and microservice is scanned for security vulnerabilities, code quality, and adherence to coding best practices on every code commit. Manhattan uses SonarCloud to perform code scanning as part of the Continuous Integration pipeline. Additionally, all binaries, including 3rd-party libraries used by Manhattan Active® Platform, are checked with static scans via Veracode before being released as part of the upcoming bi-weekly code drop. If these scans identify potential vulnerabilities in Docker containers, JAR files, or other libraries that would eventually be deployed in production, then the change advisory board blocks the code drop from being released to the customer environments until the vulnerabilities are addressed.
- Network Security Monitoring: Manhattan Active® Platform utilizes Lacework for network security monitoring. The monitoring is enabled for quality assurance and UAT environments used by the development teams. Network security monitoring identifies unusual, or potentially risky network connectivity or call behaviors by continuously monitoring TCP traffic (SSH, HTTP, JDBC, etc.) of the end-to-end deployment. Alerts flagged by Lacework are used by the developers for security risk analysis and correcting the network behavior.
Manhattan engineering teams follow DevOps principles and consider customer deployments and operations as integral parts of the development process. While roles, responsibilities and skill sets separate the development and operation staff, the overall team is part of a single R&D organization, and every member of the organization is responsible for accuracy, consistency and validity of the development and operations processes. A few of the key DevOps practices followed by Manhattan R&D are summarized below:
- Agile development lifecycle: Development teams follow well-defined development cycles, or sprints. The sprints are typically 2-week cycles, but for some teams or during certain periods of the year, sprints may be one or three weeks long. The development teams typically publish the release candidates for their deployment artifacts once or several times during a sprint. During a sprint, development teams focus on new product features, tooling, addressing technical debt, or bug-fixing activities.
- Feature flags: Feature flags are used rigorously to control the availability of new features, and for “dark deploying” new features that may require additional vetting before becoming generally available. Feature flags are maintained at the level of each individual product feature and may span more than one microservice. Feature flags are owned and managed by the development teams, who decide when a certain feature is ready for customer consumption. Feature flags allow Manhattan engineering teams to release code from trunk on a frequent basis without having to rely heavily on feature or topic branches for long periods of time.
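A minimal sketch of the flag-gating idea described above, assuming a simple in-memory flag store; Manhattan's actual flag store, flag names, and code paths are not described in this document and are invented here for illustration.

```python
# Hypothetical feature-flag registry: dark-deployed code ships with its
# flag off, and unknown flags default to off so unreleased paths stay dormant.
FLAGS = {
    "new-allocation-logic": False,   # dark-deployed: code present, feature off
    "redesigned-picking-ui": True,   # generally available
}

def is_enabled(flag: str) -> bool:
    return FLAGS.get(flag, False)

def allocate(order: dict) -> str:
    # Both code paths live on trunk; the flag decides which one runs.
    if is_enabled("new-allocation-logic"):
        return "v2-allocation"
    return "v1-allocation"

assert allocate({}) == "v1-allocation"
FLAGS["new-allocation-logic"] = True      # e.g. quarterly release enables it
assert allocate({}) == "v2-allocation"
```

Flipping the flag at the quarterly release event, rather than merging a long-lived branch, is what lets trunk-based delivery and general availability be decoupled.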
- Streaming Delivery: The tested release candidates are continuously delivered to the production assurance teams, and upon successful validation, promoted to the customer environments frequently. The process of continuously delivering new versions of Manhattan Active® application components and supporting services is referred to as Streaming Delivery.
- Quarterly Product Releases: While the code for the new features is released throughout the development sprints, these features are not made generally available to customers until the next quarterly product release event. At the start of the new quarter, the features developed during the previous quarter are enabled for the customer use. Enabling (and potentially, subsequent disabling) of new features is governed by Feature flags as explained above.
- Postmortem: Regressions caught during production assurance or, in certain cases, reported by customers are reviewed at regular intervals with the participation of the development teams responsible for them. Regression reviews serve as a mechanism to find underlying deficiencies in the development processes and tooling, with a focus on continuous improvement. Likewise, critical incidents such as outages or functional or performance degradations undergo detailed root cause analysis by the engineering teams. The focus of the no-blame postmortem exercise is to understand the root cause of the problem so that it can be turned into feedback to the development teams for improvements.
- Mining: Experienced senior engineers are tasked with performing forensic analysis of chronic issues, or potential future problems that are harder to isolate, recreate, or debug. While occurrences of such complex problems are infrequent, when they do surface, they need immediate reaction, troubleshooting, and future-proofing.
- SLA Review: A monthly review conducted by the Manhattan R&D leadership team to summarize and deliberate customer-initiated incidents, analyze the customer impact of reported incidents, and measure service quality against the SLA. Similar to the postmortem process, the focus of the SLA review is to measure operations and development processes and practices against the Service Level Objectives and customer expectations.
Manhattan Reliability Engineering
The Manhattan Reliability Engineering (MRE) team is part of Manhattan Active® Operations under the Manhattan R&D organization and consists of software engineers, deployment architects, and system engineers primarily responsible for ensuring the stability, availability, security, and lower cost of ownership of the customer environments. The MRE team is tasked with building, instrumenting, and operating a set of automation and management tools used to perform key tasks crucial to maintaining reliable systems:
- Supervising the environment deployments, upgrades, availability, and scalability, by automating these aspects of environment management and providing visibility into the actions performed by the automated systems.
- Monitoring the environments, by automating the collection and visibility of system metrics, logs, and other traces, such that the automated systems can produce the necessary notifications, alerts and reports promptly, accurately, and efficiently.
- Diagnosing and troubleshooting systemic or localized problems, and performing reactive root cause analysis, via the built-in instrumentation of command-line tools, plumbing APIs, or web consoles.
The Reliability Workflow
The MRE team follows the reliability workflow described below for preventing, diagnosing, and treating system or performance issues.
The MRE team focuses on proactively preventing problems by continuously monitoring the customer environments and by treating the early signs of problems. However, when incidents do occur, the MRE team relies on the reactive alerting mechanism and troubleshooting tools to diagnose the symptoms and address the problem before it results in service degradation or outages. Significant focus is put on two key concepts of reliability engineering to strengthen the stability and efficiency of Manhattan Active® Platform:
Mean Time to Repair: Measures engineering performance by how quickly the team can repair a problem and restore service after an outage. MTTR is directly relevant to the customer as it is a measure of the amount of downtime the customer sees. The MRE team is responsible for continuously improving the monitoring ecosystem with the primary goal of reducing MTTR.
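The MTTR metric itself is a simple average over incident durations. The sketch below computes it from a hypothetical incident log (the timestamps are made up for illustration):

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (detected, service-restored) timestamp pairs.
incidents = [
    (datetime(2023, 1, 5, 10, 0),  datetime(2023, 1, 5, 10, 45)),  # 45 min
    (datetime(2023, 2, 9, 22, 30), datetime(2023, 2, 9, 23, 0)),   # 30 min
    (datetime(2023, 3, 1, 6, 15),  datetime(2023, 3, 1, 7, 30)),   # 75 min
]

def mttr(incidents) -> timedelta:
    """Mean time to repair: average of (restored - detected) over all incidents."""
    total = sum(((end - start) for start, end in incidents), timedelta())
    return total / len(incidents)

assert mttr(incidents) == timedelta(minutes=50)
```

Because MTTR is an average of downtime the customer actually experiences, shaving minutes off detection and diagnosis (better monitoring) moves the number just as effectively as faster repair.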
Automation: Systems, processes and tooling that can be automated help with improving consistency of deployment behavior, minimize the potential for human errors, reduce the operational costs, and create the ability to learn & forecast from previous problems. Automation reduces toil and enables engineering teams to focus more on strategic initiatives. The MRE team is responsible for building and operating sub-systems for automatic deployment & upgrade processes, self-healing, and self-service.
Manhattan Active® Platform consists of a monitoring ecosystem for automating and managing consistency, reliability, and resilience of the customer environments. Some key constituents of the ecosystem are listed below:
Manhattan Active® Platform is integrated with Elasticsearch, Fluentd, and Kibana as the toolset at the core of its logging subsystem. The platform's services are instrumented with SLF4J and Docker to stream logs from the stack components into the Logs Collector:
- Elasticsearch is the log aggregator across all stack components and supporting services.
- Fluentd is the mechanism to capture the log streams and index them into Elasticsearch.
- Kibana is used as a log viewer, and for executing log search queries or dashboards.
Additionally, the Logs Collector has a set of internally available APIs that can be used by other tools to read the log data for generating alerts and visualizations.
Manhattan Active® Platform uses Prometheus, Grafana and Alert Manager as the toolset at the core of its monitoring subsystem:
- Prometheus scrapes metrics from all stack components and supporting services at fixed intervals and persists them in a time-series database.
- The time-series database can be queried by Grafana to show various visualizations in configurable dashboards.
- The metrics data in Prometheus is also queried by Alert Manager to scan for problem conditions and raise alerts. Alert Manager can be configured to emit the alerts via web-hooks to alert notification services such as PagerDuty.
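To make the scrape-query-alert flow concrete, the sketch below parses a Prometheus instant-query response (the JSON envelope shape follows the documented Prometheus HTTP API; the metric labels and values are invented) and applies a simple threshold of the kind an alerting rule would express:

```python
import json

# Sample body in the shape returned by Prometheus's instant-query API
# (GET /api/v1/query); pod names and values here are hypothetical.
raw = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"pod": "order-service-0"}, "value": [1700000000, "0.93"]},
            {"metric": {"pod": "order-service-1"}, "value": [1700000000, "0.41"]},
        ],
    },
})

def pods_over_threshold(body: str, threshold: float) -> list:
    """Return pods whose sampled value exceeds the threshold,
    mimicking a simple alerting rule evaluated client-side."""
    doc = json.loads(body)
    return [
        r["metric"]["pod"]
        for r in doc["data"]["result"]
        if float(r["value"][1]) > threshold
    ]

assert pods_over_threshold(raw, 0.9) == ["order-service-0"]
```

In production this evaluation runs inside Prometheus/Alert Manager rather than client-side, with Alert Manager forwarding firing alerts to webhooks such as PagerDuty.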
Metrics Collector collects and persists three types of metrics data:
- Infrastructure metrics: metrics at the level of GKE node pool, Kubernetes, and Docker
- Application metrics: metrics collected from the application components at the level of Java and Spring Boot
- Functional metrics: metrics collected from the application components at the level of business functionality.
The Monitor is a platform component of the Manhattan Active® Platform that performs specific operations on the logs and metrics data to make them usable for alerting based on application conditions. The Monitor uses internal APIs provided by the Logs Collector (via Elasticsearch) and Metrics Collector (via Prometheus) to scrape for a predefined set of patterns and labels, and distills alertable data from the raw logs and metrics. The output is read by the Metrics Collector, which can then produce alerts for specific conditions determined by the Monitor. Examples of some of these conditions are listed below:
- Watching for a specific log pattern, such as exception messages or textual values
- API call counters and response time aggregation
- Health-check probes for the supporting services such as messaging or caching
- Overall health indicators for the system based on key performance metrics
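The first condition above, watching for a specific log pattern, can be sketched as a regex pass over raw log lines that distills them into per-pattern counts an alerting rule can threshold on. The log lines and the watched pattern here are illustrative, not Manhattan's actual formats.

```python
import re
from collections import Counter

# Hypothetical raw log lines streamed into the Logs Collector.
logs = [
    "2023-04-01T10:00:01 INFO  OrderService started",
    "2023-04-01T10:00:05 ERROR NullPointerException in AllocationService",
    "2023-04-01T10:00:09 ERROR TimeoutException calling inventory",
    "2023-04-01T10:00:12 ERROR NullPointerException in AllocationService",
]

# Illustrative watch pattern: any ERROR line naming a Java exception type.
WATCH = re.compile(r"ERROR\s+(\w+Exception)")

def alertable_counts(lines) -> Counter:
    """Distill raw log lines into per-pattern counts, the kind of
    alertable data the Monitor derives from Logs Collector output."""
    return Counter(m.group(1) for line in lines if (m := WATCH.search(line)))

counts = alertable_counts(logs)
assert counts["NullPointerException"] == 2
assert counts["TimeoutException"] == 1
```

A downstream alerting rule would then fire when a count for a given pattern crosses a configured threshold within a time window.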
Additionally, the Monitor also integrates every deployment stack with Keyhole - the global monitoring dashboard - such that specific alerting data can be streamed directly from the deployment stack to Keyhole, allowing for near real-time notifications of alert conditions.
Keyhole is the global monitoring dashboard built using Grafana and Prometheus running on the Google Cloud Run platform. Keyhole receives alert notifications from all customer environments and, depending on the severity of alerts, derives and indicates a deployment stack’s health, along with the recent trend of critical alerts. Keyhole also provides the ability to drill down into each deployment stack’s monitoring dashboard, allowing easy access to individual customer environments directly from the global dashboard.
Manhattan Active® Operations teams continuously monitor the Keyhole dashboard to get a comprehensive, real-time view of the customer environments across the globe.
Slammer is the global alert capture and analysis database built using Google BigQuery and Google Cloud Run platform. Slammer receives alert notifications from all customer environments. Using the historical alert data, Slammer can perform various analytics such as alert trends, service quality and SLO analysis.
Slammer periodically publishes the report of key alerts and their trends from the customer environments, providing crucial learnings that are used for continuous improvements and predictions for future scenarios under similar conditions.
Arecibo is the global activity recorder that captures system interactions from all deployment stacks and persists them in a database built on Google BigQuery. Arecibo enables a centralized view of system interactions such as inbound or outbound HTTP calls, asynchronous message transmissions, and extension-point invocations. The activities recorded by Arecibo are helpful in identifying latency and performance of these system interactions.
Arecibo periodically publishes the report of latency and performance data of system interactions across all deployment stacks. These reports are used for performance engineering and tuning of the system infrastructure and application code.
Wiretap is a platform component of the Manhattan Active® Platform that performs specific operations on application data to produce functional metrics, which are then scraped and persisted by the Metrics Collector as time-series data. These metrics are subsequently inspected for alert conditions to reveal potential inconsistencies in business transactions. Wiretap provides insights into the symptoms of functional problems, or conditions that could lead to such problems.
Examples of some alert conditions that Wiretap captures are listed below:
- Order header and detail in different status
- Manifest closed but not all LPNs on manifest are shipped
- Locations with negative inventory
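Two of the Wiretap conditions above (header/detail status mismatch and negative inventory) reduce to simple data-consistency checks. The sketch below shows them over hypothetical record shapes; Wiretap's actual data model is not described in this document.

```python
# Hypothetical order and location records of the kind Wiretap inspects.
orders = [
    {"id": "O1", "header_status": "Shipped",
     "detail_statuses": ["Shipped", "Shipped"]},
    {"id": "O2", "header_status": "Shipped",
     "detail_statuses": ["Shipped", "Allocated"]},   # inconsistent order
]
locations = [{"id": "L1", "on_hand": 4}, {"id": "L2", "on_hand": -2}]

def mismatched_orders(orders) -> list:
    """Orders whose header status disagrees with any detail status."""
    return [o["id"] for o in orders
            if any(s != o["header_status"] for s in o["detail_statuses"])]

def negative_inventory(locations) -> list:
    """Locations reporting a negative on-hand quantity."""
    return [l["id"] for l in locations if l["on_hand"] < 0]

assert mismatched_orders(orders) == ["O2"]
assert negative_inventory(locations) == ["L2"]
```

In the platform, the counts of such inconsistencies are emitted as functional metrics for the Metrics Collector rather than returned directly, so alerting rules can fire when they become non-zero.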
Sonar is the synthetic probe that keeps track of the health of every HTTP endpoint deployed as part of the Manhattan Active® Platform. Sonar “pings” the HTTP endpoints across customer environments at a scheduled frequency and records the HTTP status of the response. If the status, latency, or availability of the response does not meet the previously defined criteria, Sonar reports the error condition as an alert to be evaluated and acted upon by MRE.
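The evaluation step of such a probe can be sketched as below: a recorded probe result is checked against predefined criteria and any violations become alert reasons. The endpoint paths, thresholds, and record shape are assumptions for illustration, not Sonar's actual configuration.

```python
# Illustrative pass/fail criteria for a synthetic HTTP probe.
CRITERIA = {"max_latency_ms": 500, "expected_status": 200}

def evaluate(probe: dict) -> list:
    """Return alert reasons for one recorded probe result; empty means healthy."""
    alerts = []
    if probe["status"] != CRITERIA["expected_status"]:
        alerts.append(f"{probe['url']}: status {probe['status']}")
    if probe["latency_ms"] > CRITERIA["max_latency_ms"]:
        alerts.append(f"{probe['url']}: slow ({probe['latency_ms']} ms)")
    return alerts

# Healthy probe: no alerts raised.
assert evaluate({"url": "/health", "status": 200, "latency_ms": 120}) == []
# Failing probe: both status and latency criteria violated.
assert evaluate({"url": "/orders", "status": 503, "latency_ms": 900}) == [
    "/orders: status 503",
    "/orders: slow (900 ms)",
]
```

In production, the non-empty alert lists would be forwarded as error conditions for MRE to evaluate rather than asserted locally.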
Lacework is a security tracing solution that helps the MRE team identify anomalous activity that deviates from normal behavior in a customer environment and may indicate a threat. Lacework leverages behavioral models based on the usage patterns of the system, and reports unusual behaviors, invocations, and software components, which can then be analyzed by the security analysts to detect threats, risks, and potential exposures.
Pinpoint APM is an open-source application performance monitor agent and visualizer, integrated with the Manhattan Active® Platform. Pinpoint APM agent captures performance metrics and stack traces via bytecode instrumentation and persists these metrics in HBase. These metrics can then be analyzed with Pinpoint APM web console to identify performance bottlenecks or other symptoms of performance degradation.
The graphic below summarizes the monitoring, alerting, and reporting ecosystem deployed as part of the Manhattan Active® Platform:
Manhattan Active Monitoring
Manhattan Associates monitors our active platform for availability, events, and security.
We use Prometheus to mine application status data via Kubernetes and a custom service (Snitch) that queries component endpoints providing health and other detailed information.
Alertmanager works with Prometheus to raise events based on configured rules. These events are then sent to our centralized event collection service (Slammer - Service Level Agreement manager). Slammer holds any critical events to allow self-resolution/healing to take place. If a resolution has not been received in the allotted time, Slammer creates an alert and sends it to our paging service (PagerDuty).
We have a custom service (Manhattan Sonar) that actively pings the external endpoints our customers use to verify that they are available. If a service is unavailable, it sends the event to Slammer via a webhook.
Slammer stores all events in a data lake, which we use to identify trends and produce daily reports for internal big-picture consumption.
From a security perspective, Manhattan Associates' cloud security team monitors all of our customer and internal environments with Lacework. Lacework uses machine learning to surface unusual activities as well as security vulnerabilities and common vulnerabilities and exposures (CVEs). This allows prompt identification and resolution of events of interest.
Incident Response Workflow
The Manhattan Active® Network Operation Center (or NOC, for short) is the team primarily responsible for monitoring the health and availability of the customer environments. The NOC team is tasked with initiating the resolution workflow when a system or functional alert is triggered.
A curated and refined set of system and functional alerts is defined as part of Manhattan Active® Platform deployments. These alerts fall into two broad categories:
- Proactive alerts act as the detection mechanism for conditions that could result in incidents in the future. These alerts also provide feedback to improve the auto-recovery mechanisms built into the Manhattan Active® Platform.
- Reactive alerts act as the detection mechanism for a potentially active incident. These alerts carry a higher priority and result in actions by the NOC and MRE teams to repair the problem as promptly as possible and reduce the mean time to repair the service.
The lists below include some key proactive and reactive alerts that are monitored by the NOC team:
The workflow shown below describes the rules of engagement and communication for resolving alerts with a singular focus of minimizing the downtime or degradation and exceeding the SLA of the Manhattan Active® solutions.
- Kartik Pandya: Vice President, Manhattan Active® Platform, R&D.