- Anthony Georgiades
Current Enterprise IT Shortfalls: Where Service Mesh Wins
Service Mesh Primer [Part 2]
In part 2 of this series, we discuss current enterprise infrastructure challenges and the shortfalls of existing solutions. We then dive deep into Service Mesh architecture, as well as analyze its core technical advantages.
In standardizing runtime operations across services, dynamic infrastructure is required to ensure service-to-service communications do not break down. To assess the validity and viability of service mesh offerings, it is critical to understand the challenges they aim to solve as well as how networking and infrastructure teams currently address these pain points. While microservices provide far-reaching benefits for organizations, they are still distributed systems (services distributed across instances) and thus fall victim to severe failure modes. As a result, many teams have adopted strategies to manage failure, ranging from proactive testing to mitigation and rapid response. New regimes have been adopted, but each comes with new challenges.
Challenge: Routing and Discovery
Within traditional microservices systems, applications and/or services discover and locate each other via either 1) a central server (bank of addresses) or 2) a client that connects to a central server to access said address bank. Load balancers, hardcoded to IPs and placed in front of every service, query the service registry and route requests between instances. In a distributed microservice architecture, the volume and frequency of this communication increase drastically. Requiring a number of load balancers that grows linearly with the number of services, each manually managed by solution architects, creates a serious bottleneck (in both cost and time) for scaling, updating, or maintaining the system within a DevOps culture.
Service mesh can overcome this challenge by creating an automated central registry of all services running in the environment. When a new application is deployed, it automatically populates the central registry with its key identity details: the instance and its corresponding IP address. This allows for immediate discovery of services via a simple query to the registry. Efficient discovery is vital in changing and scaling environments.
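To make the registry idea concrete, here is a minimal in-memory sketch. The class name, method names, service names, and addresses are all hypothetical and chosen for illustration; a real mesh would populate the registry from its orchestrator and health checks rather than from explicit calls.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceRegistry:
    """Hypothetical in-memory registry mapping service names to instance addresses."""

    _instances: dict[str, set[str]] = field(default_factory=dict)

    def register(self, service: str, address: str) -> None:
        # Would be triggered automatically when a new instance starts.
        self._instances.setdefault(service, set()).add(address)

    def deregister(self, service: str, address: str) -> None:
        # Would be triggered when an instance stops or fails a health check.
        self._instances.get(service, set()).discard(address)

    def discover(self, service: str) -> list[str]:
        # A single query returns every known address for the service.
        return sorted(self._instances.get(service, set()))


registry = ServiceRegistry()
registry.register("payments", "10.0.0.5:8080")
registry.register("payments", "10.0.0.6:8080")
registry.register("orders", "10.0.0.9:8080")
registry.deregister("payments", "10.0.0.5:8080")

print(registry.discover("payments"))
```

Callers look services up by name, so instances can come and go without anyone editing a hardcoded address.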
Challenge: Network Security
Despite recent developments towards micro-perimeter security models, most network environments today still rely on broad perimeter security (firewalls, WAFs, SIEMs, etc.) to filter incoming traffic. Once within the network, the environment remains relatively flat with little to no security. In some environments, firewalls are implemented to manage east-west traffic or specific portions of the network. Furthermore, organizations can adopt traditional rules-based approaches for security and load balancers within microservice environments. However, this quickly becomes infeasible, and it would not be uncommon to reach 10,000+ manually managed rules in a growing environment. This hampers both security teams and DevOps teams looking to deploy and deliver quickly. In each instance, new rules must be written or existing rules must be updated manually, creating a bottleneck upon application launch.
The central registry nature of service mesh also enables identity-based security: TLS certificates are distributed to end applications, with each certificate establishing a service identity for a web server, database, API, etc. Lastly, service mesh enables the management of rules, which can be distributed directly to the edge rather than through a central bus (typically required for rule enforcement), improving communication and efficiency.
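As a rough illustration of identity-based rules, the sketch below keys an allow-list on service identity (for example, the name carried in a TLS certificate) rather than on IP address. The service names and the `is_allowed` helper are hypothetical, and certificate verification itself is assumed to have happened before this check.

```python
# Hypothetical allow-list keyed by (caller identity, callee identity).
# Identities are assumed to come from already-verified TLS certificates.
ALLOWED_CALLS = {
    ("web-frontend", "orders-api"),
    ("orders-api", "orders-db"),
}


def is_allowed(source_identity: str, destination_identity: str) -> bool:
    """Return True if the source service may call the destination service."""
    return (source_identity, destination_identity) in ALLOWED_CALLS


print(is_allowed("web-frontend", "orders-api"))
print(is_allowed("web-frontend", "orders-db"))
```

Because rules name services instead of addresses, scaling a service from two instances to two hundred adds no new rules.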
Alternative approaches, such as point-to-point connections and enterprise service buses (ESBs), have emerged, but each comes with its own drawbacks in managing microservice-like architectures where volume and frequency are continuously increasing across a distributed system.
Point-To-Point Connection is the standard networking tool for inter-app communication and management in monolith or service-oriented architectures. In monoliths, this is simply a hardcoded connection between two applications. For SOA it is a one-to-one client-service interaction with each client processed by one service instance. Point-to-point integration becomes extremely tedious and complex as the system scales. Additionally, as client-service communication grows within these environments, the methodology proves too arduous to handle ongoing direct requests.
Enterprise Service Buses (ESBs) as a message queue (vs. the traditional hub-and-spoke model) emerged as an alternative to point-to-point connections, acting as a centralized message router and a single point of entry for client connections within SOA. The message broker implementation routes the client to the appropriate service and acts as a messaging layer, ensuring that inbound requests arrive at the proper location while also providing transformation or service orchestration functionality. In a load-balanced environment, this approach is scalable, with single messages routing to multiple receivers.
Figure 3: ESB Centralizing and Managing Inter-Service Communications
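The centralized routing described above can be sketched as a tiny publish/subscribe bus. This is a simplified illustration, not how any particular ESB product works; the `MessageBus` class, topic name, and handlers are hypothetical.

```python
from collections import defaultdict
from typing import Callable


class MessageBus:
    """Hypothetical central router: every message passes through this one object."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: str) -> int:
        # Stand-in for a transformation step: normalize before delivery.
        normalized = message.strip()
        for handler in self._subscribers[topic]:
            handler(normalized)
        # Report how many receivers got the message.
        return len(self._subscribers[topic])


received: list[tuple[str, str]] = []
bus = MessageBus()
bus.subscribe("order.created", lambda m: received.append(("billing", m)))
bus.subscribe("order.created", lambda m: received.append(("shipping", m)))

delivered = bus.publish("order.created", "  order-42  ")
print(delivered, received)
```

One message reaches multiple receivers, but all traffic and all routing rules live in a single component, which is the source of the single-point-of-failure concern.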
API Gateways have been adopted by organizations alongside message queues (e.g. ESB) for many inter-service communications. API Gateways are web-facing servers that receive requests from the public internet or internal services, in turn routing to the appropriate microservice instance. They centralize management in high-load API call environments such as a microservice architecture (e.g. 300+ services) and allow for creating, publishing, maintaining, monitoring, and securing calls regardless of scale. The ESB in turn centralizes messaging for interservice asynchronous communication (e.g. RabbitMQ), while the API gateways send synchronous calls via a waiting thread from the web-facing server to microservice instances.
Figure 4: API Gateway Centralizing API Management
ESBs and API Gateways, despite their functionality, have inherent drawbacks. ESB/SOA architecture, as a result of uniform runtime operations across services, can limit scaling, maintenance, and monitoring. ESBs also carry regression and single-point-of-failure risks due to their centralized nature, and create bottleneck risk because the central team works at a single point of entry. API Gateways carry the same single-point-of-failure and development-bottleneck risks. They also introduce increased risk of 1) deprecation across new development, implementation, and maintenance, as rules are centrally stored, and 2) vendor lock-in or potential migration challenges.
While service mesh is sometimes erroneously described as ESB 2.0, it is important to understand that ESB development was essentially vendor-driven. The ESB is overly centralized and tightly coupled, with light integration across vendor products and process choreography conflating instances. What was once a series of internal application communications has transitioned into a mesh of service-to-service remote procedure calls (RPCs). RPC also adopts the client-server model; however, lightweight threads that share the same address space allow for concurrent operations, addressing linear scaling issues.
The Service Mesh Architecture
As shown in Figure 5, service mesh is interservice communication infrastructure made up of network proxies deployed alongside containers that serve as gateways for interactions. Proxies receive incoming connections and then re-route or distribute accordingly.
Figure 5: Service Mesh Interservice Communication
Service mesh is decoupled by separation of concerns into control and data planes with levels of abstraction in mind, nearly mirroring traditional telecommunications architecture (which also includes a management plane). In service mesh architecture, the data plane handles inspection, transit, and routing of network traffic. Conversely, the control plane sits out-of-band, providing a central point of management and backend/underlying infrastructure integration (and includes management plane functionality).
Figure 6: Conduit’s Control and Data Plane Architecture
The Control Plane (CP) determines the traffic destination and configures the data plane as the number of proxies becomes unwieldy or when a single point of visibility / control is required. It also provides policy and configuration by taking a set of isolated / stateless proxies and turning them into a service mesh. Of note, it does not directly touch any network packets in the mesh (it operates out-of-band) and is highly redundant given the frequency / volume of communications. Additional aspects include:
● Enforces network policy, service, and discovery in the mesh
● Manages the proxies that route service traffic
● Automates changes to control plane configuration through API (CI/CD)
The Data Plane (DP) is a proxy-based (sidecar container) path that sits between microservices. It deals with actual traffic between applications and is responsible for the communication of services. It also includes networking aspects including routing, forwarding, load balancing, encryption and failure handling. In essence, it is a collection of sidecar proxies which ensure high throughput and low latency that also:
● Is managed and configured by the CP, with each proxy assigned to its designated microservice
● Intercepts and filters each packet upon request, inspecting communication/network traffic from service origination to intended destination
● Acts as its own infrastructure layer to filter traffic between services
Figure 7: Sidecar Proxy Design
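The split between the two planes can be sketched in a few lines. This is a toy model under stated assumptions: the `SidecarProxy` and `ControlPlane` classes, the `push` method, and all service names and addresses are hypothetical, and a real proxy would forward network traffic rather than return a string.

```python
from dataclasses import dataclass, field


@dataclass
class SidecarProxy:
    """Data plane: sits next to one service and handles its traffic."""

    service: str
    routes: dict[str, list[str]] = field(default_factory=dict)
    blocked: set[str] = field(default_factory=set)

    def handle(self, destination: str) -> str:
        # Every outbound request passes through the proxy, which applies
        # whatever configuration the control plane last pushed.
        if destination in self.blocked:
            return "denied"
        instances = self.routes.get(destination)
        if not instances:
            return "no route"
        return f"forwarded to {instances[0]}"


@dataclass
class ControlPlane:
    """Out-of-band: configures proxies but never touches request traffic."""

    proxies: list[SidecarProxy] = field(default_factory=list)

    def attach(self, proxy: SidecarProxy) -> None:
        self.proxies.append(proxy)

    def push(self, routes: dict[str, list[str]], blocked: dict[str, set[str]]) -> None:
        # Distribute the same routing table to every proxy, plus the
        # per-service deny list that applies to that proxy's service.
        for proxy in self.proxies:
            proxy.routes = dict(routes)
            proxy.blocked = set(blocked.get(proxy.service, set()))


cp = ControlPlane()
web = SidecarProxy("web")
orders = SidecarProxy("orders")
cp.attach(web)
cp.attach(orders)
cp.push(
    routes={"orders": ["10.0.0.9:8080"], "db": ["10.0.0.12:5432"]},
    blocked={"web": {"db"}},
)

print(web.handle("orders"))
print(web.handle("db"))
print(orders.handle("db"))
```

Note that `ControlPlane` never appears in the request path: once configuration is pushed, each proxy makes its decisions locally.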
Service mesh is ideal for a microservice environment: it operates at the L7 level, yet is partitioned from the application code while retaining app-level insight to enforce L3 and L4 policies. The control plane retains L7 insight, instructing the data plane to make complex routing decisions based on policies and telemetry data. Within the mesh, a level of abstraction contained in the networking infrastructure (L3/L4) or coded in the application layer (L7 network overlay) is achieved without sacrificing visibility of service requests and routing decisions.
Service mesh should be implemented to manage interservice communications for microservices within containerized application clusters such as managed systems (Kubernetes or Docker), PaaS (OpenShift Online or Pivotal), container-as-a-service environments (AWS Fargate, Google Kubernetes Engine), or other elastically scalable compute environments. Comparable managed compute frameworks can be other distributed systems that efficiently package, deploy, and manage microservices and containers (e.g. Azure Service Fabric, Lightbend Reactive Platform).
Service Mesh Advantages
Service mesh today has become part of the platform for many organizations, and continues to experience increased attention and innovation. Application leaders responsible for development and platform strategies will continue to adopt service meshes to achieve resilient, secure microservices operations. Below we address key use cases, advantages, and the secular trends driving this adoption.
Key Use Cases:
1. Service Discovery, Instrumentation, and Visibility: Provides service-level visibility and telemetry across services running in each organization’s infrastructure:
● Automatically instrument interservice interactions
● Manage and monitor service availability and communications
● Trace and perform analyses on services and connections
2. Scalability: Given dynamism in service discovery and request routing to different services, supports scaling via load balancing across service instances in the cluster
3. Performance Management: Set performance metrics to ensure efficient resource utilization and distribution while monitoring response time and app performance to meet operational metrics
4. Operational Reliability: Monitor telemetry metrics of service performance such as time between request and response, frequency of individual service connections, and resources used
5. Traffic Governance: Configure traffic management, including automatic routing policies, for east/west and ingress/egress traffic without accessing and updating the application itself
6. Security: Lightweight authentication and authorization for interservice communication (via TLS certificates distributed to the edge), with rule-based enforcement, TLS encryption, and RBAC
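The telemetry named in the use cases above (time between request and response, frequency of individual service connections) could be collected along these lines. This is a sketch only: the `Telemetry` class and its methods are hypothetical, and a real mesh would export such data to a metrics backend rather than hold it in memory.

```python
from collections import defaultdict
from statistics import mean


class Telemetry:
    """Hypothetical collector a proxy could report to after each request."""

    def __init__(self) -> None:
        # Latency samples in milliseconds, keyed by (source, destination).
        self._latencies_ms: dict[tuple[str, str], list[float]] = defaultdict(list)

    def record(self, source: str, destination: str, latency_ms: float) -> None:
        self._latencies_ms[(source, destination)].append(latency_ms)

    def summary(self, source: str, destination: str) -> dict[str, float]:
        samples = self._latencies_ms.get((source, destination), [])
        if not samples:
            return {"requests": 0, "avg_ms": 0.0}
        return {"requests": len(samples), "avg_ms": mean(samples)}


telemetry = Telemetry()
for latency in (10.0, 20.0, 30.0):
    telemetry.record("web", "orders", latency)

print(telemetry.summary("web", "orders"))
```

Because the proxies sit on every connection, this data is gathered without any instrumentation in the application code.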
Efficiency and agility are achieved across the aforementioned key use cases without downtime caused by app code updates or API Gateway / ESB sharing. Additional processes include:
● Automatic load balancing
● Fine-grained control of traffic behavior
● Configuration of API supporting access controls, rate limits, and quotas
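As one example of the rate limits mentioned above, a proxy could apply a token bucket per caller. The sketch below is an assumption about how such a limit might be implemented, not a description of any specific mesh; the `TokenBucket` class is hypothetical, and the clock is passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Hypothetical rate limiter: allows bursts up to `capacity`, then refills over time."""

    def __init__(self, capacity: int, refill_per_second: float) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Add tokens for the time elapsed since the last call, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        # Spend one token per request if one is available.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(capacity=2, refill_per_second=1.0)
results = [
    bucket.allow(now=0.0),
    bucket.allow(now=0.0),
    bucket.allow(now=0.0),  # burst exhausted
    bucket.allow(now=1.0),  # one token refilled after one second
]
print(results)
```

In production code `now` would come from a monotonic clock such as `time.monotonic()`.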
Other advantages of the service mesh architecture include:
● Developer Independence → Small teams working in parallel, iterating faster than large teams
● Isolation & Resilience → If one component dies, another can be spun up instantly with zero downtime, supported by timeouts, retries, circuit breakers, fault handling, and load balancing
● Lifecycle Automation → Individual components fit more easily into CI/CD pipelines, at a level of complexity not possible in monoliths
● Relationships to Business → Split along domain boundaries, increasing organization independence and understanding
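Of the resilience mechanisms named in the list above, the circuit breaker is perhaps the least self-explanatory, so a minimal sketch follows. It is deliberately simplified: the `CircuitBreaker` class is hypothetical, it counts consecutive failures only, and it omits the timed "half-open" recovery state that practical implementations usually include.

```python
class CircuitBreaker:
    """Hypothetical breaker: stops calling a service after repeated failures."""

    def __init__(self, failure_threshold: int) -> None:
        self.failure_threshold = failure_threshold
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.failure_threshold

    def call(self, func, *args):
        # While open, fail fast instead of sending more traffic to a sick service.
        if self.is_open:
            raise RuntimeError("circuit open: request not attempted")
        try:
            result = func(*args)
        except Exception:
            self.failures += 1
            raise
        # Any success resets the consecutive-failure count.
        self.failures = 0
        return result


def failing_service() -> str:
    raise ConnectionError("instance unreachable")


breaker = CircuitBreaker(failure_threshold=2)
outcomes = []
for _ in range(3):
    try:
        breaker.call(failing_service)
    except ConnectionError:
        outcomes.append("failed")
    except RuntimeError:
        outcomes.append("short-circuited")

print(outcomes)
```

In a mesh this logic lives in the sidecar proxy, so every service gets the same protection without carrying its own copy of the code.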
It is important to understand that service mesh adoption, and the impact of the secular trends driving it, is largely dependent on the size and technical makeup of the organization. For example, if an enterprise is running a small number of microservices across a shallow topology, service mesh selection may be delayed and substituted with alternative failure management strategies. Furthermore, particular application environments (say, those running three or fewer compute instances) may not be best suited for a mesh approach, as the operational complexity of implementation will outweigh the benefits.
Several of the discussed solutions can achieve fairly similar functionality, implemented in different manners. For one, ESBs and API gateways are able to centralize synchronous and asynchronous service communication through message queues and waiting threads. The ability to offload communication requirements and tasks to a different layer while keeping the application code independent is a key driving force of implementation for growing organizations with agile development needs. Service mesh allows organizations to scale without tackling the delays and stopgaps associated with deploying new services or scaling up and down.
Long-term adoption will be closely correlated with the increased distribution and containerization of applications, and more specifically the rise of microservices and miniservices. As organizations grow, the need for more efficient, reliable, and agile app development and runtime operations becomes essential. Interest in service mesh technology is increasing dramatically (as shown in Figure 8), a leading indicator of mainstream acceptance. However, adoption remains primarily in large-scale and/or tech-forward organizations that require its benefits given the complexity and high-volume nature of their application and service delivery systems.
Figure 8: Google Trends Data on “Service Mesh” (October 2016 - October 2020)