The Global Service Mesh [2/4]: mTLS & Zero Trust with Authorisation Policies

In the distributed systems of 2026, the network must be treated as a hostile environment.

Part 1 established the "Regional Anchor" pattern to bridge Cloud Run into the Mesh. But connectivity is only half the battle. In a high-compliance, production-grade environment, we must move from Network-Centric Security (IP-based) to Identity-Centric Security (Attribute-based).

Today, we dive into the "Zero Trust" layer: how Cloud Service Mesh (CSM) handles automated certificate issuance via Google’s Managed CAS and how to enforce fine-grained L7 Authorisation Policies that go far beyond what a standard VPC Firewall can achieve.

The Identity Framework: SPIFFE and the Trust Domain

Under the hood, CSM uses the SPIFFE (Secure Production Identity Framework for Everyone) standard. In Google Cloud, your identity is not a static token; it is an X.509 certificate whose Subject Alternative Name (SAN) contains a URI-formatted identity.

The standard format for a Cloud Run workload in the mesh is: spiffe://<PROJECT_ID>.svc.id.goog/ns/default/sa/<SERVICE_ACCOUNT_NAME>
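
To make the structure of that identity concrete, here is a minimal Python sketch that splits a SPIFFE ID of the shape shown above into its trust domain, namespace, and service account. The names WorkloadIdentity and parse_spiffe_id are hypothetical helpers for illustration, not part of any Google Cloud or SPIFFE library, and the sketch only accepts the ns/.../sa/... layout used in this article.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class WorkloadIdentity:
    trust_domain: str       # e.g. "<PROJECT_ID>.svc.id.goog"
    namespace: str          # e.g. "default"
    service_account: str    # e.g. "frontend-sa"


def parse_spiffe_id(spiffe_id: str) -> WorkloadIdentity:
    """Split spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>."""
    parsed = urlparse(spiffe_id)
    if parsed.scheme != "spiffe" or not parsed.netloc:
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")

    # The path is expected to be exactly /ns/<namespace>/sa/<service-account>.
    parts = parsed.path.strip("/").split("/")
    well_formed = (
        len(parts) == 4
        and parts[0] == "ns"
        and parts[2] == "sa"
        and parts[1] != ""
        and parts[3] != ""
    )
    if not well_formed:
        raise ValueError(f"unexpected SPIFFE path: {parsed.path!r}")

    return WorkloadIdentity(
        trust_domain=parsed.netloc,
        namespace=parts[1],
        service_account=parts[3],
    )
```

Keeping the trust domain as a separate field is a deliberate choice: two workloads with the same service account name in different projects are different identities, and a policy should compare the full triple.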

When Service A (the Frontend) calls Service B (the Backend), the Envoy sidecars perform a mutual TLS handshake. Validating the certificate chain is only the first step: the receiving sidecar then checks whether the identity presented in the SAN is authorised to perform the specific HTTP verb on the specific URI requested.

1. Architecting the Certificate Lifecycle

We must acknowledge that managing a Private CA is a liability. CSM solves this by integrating natively with Certificate Authority Service (CAS).

The Control Plane Handshake

Workload Registration: When your Cloud Run sidecar boots, it requests a certificate from the mesh control plane, authenticating with its OIDC identity token.
CSR Generation: The control plane verifies the token and has CAS sign a short-lived (usually 24-hour) certificate.
Automatic Rotation: Envoy rotates the certificate in memory before it expires. There is zero downtime, and the certificates never touch the local disk, which significantly reduces the surface area for exfiltration.
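
The rotation step above boils down to simple date arithmetic. The Python sketch below shows one way to reason about it; it is not how Envoy or CSM is implemented. The helper names and the renew_fraction default of 0.5 (renew halfway through the lifetime) are assumptions chosen for illustration, since the actual renewal point is not stated here.

```python
from datetime import datetime, timedelta, timezone


def next_rotation_time(not_before: datetime,
                       not_after: datetime,
                       renew_fraction: float = 0.5) -> datetime:
    """Return the instant at which a certificate should be renewed.

    renew_fraction is an assumed tuning knob: 0.5 means "renew once half
    of the validity window has elapsed".
    """
    if not 0.0 < renew_fraction < 1.0:
        raise ValueError("renew_fraction must be strictly between 0 and 1")
    if not_after <= not_before:
        raise ValueError("certificate validity window is empty")
    lifetime = not_after - not_before
    return not_before + lifetime * renew_fraction


def needs_rotation(now: datetime,
                   not_before: datetime,
                   not_after: datetime,
                   renew_fraction: float = 0.5) -> bool:
    """True once 'now' has reached the renewal point."""
    return now >= next_rotation_time(not_before, not_after, renew_fraction)


# Example: a 24-hour certificate issued at midnight UTC.
issued = datetime(2026, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(hours=24)
rotate_at = next_rotation_time(issued, expires)
```

Renewing well before expiry leaves room for retries if the control plane or CAS is briefly unavailable, which is what makes a short lifetime safe to operate.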

Figure 1: The mTLS handshake flow between Envoy sidecars, orchestrated by the Cloud Service Mesh control plane and Google CAS.

2. Implementing Managed mTLS

To implement this, we define a ServerTlsPolicy and a ClientTlsPolicy.

The Server-Side Guardrail

The ServerTlsPolicy defines how the backend should challenge incoming requests. We specify MTLS to ensure that the client is also verified.

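# server-tls-policy.yaml
# Confirm the plugin instance name for your mesh; Google's Envoy examples use "google_cloud_private_spiffe".
name: "projects/YOUR_PROJECT_ID/locations/global/serverTlsPolicies/orders-mtls-policy"
description: "Enforce mTLS for the Orders Backend"
serverCertificate: # The backend presents its own mesh-issued certificate
  certificateProviderInstance:
    pluginInstance: "google_cloud_cas"
mtlsPolicy:
  clientValidationCa: # Require and validate the client's certificate
    - certificateProviderInstance:
        pluginInstance: "google_cloud_cas" # Native integration with Managed CAS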

The Client-Side Identity

The ClientTlsPolicy tells the frontend sidecar to provide its identity and verify the server's identity against the same trust domain.

# client-tls-policy.yaml
name: "projects/YOUR_PROJECT_ID/locations/global/clientTlsPolicies/frontend-mtls-identity"
serverValidationCa:
  - certificateProviderInstance:
      pluginInstance: "google_cloud_cas"
clientCertificate:
  certificateProviderInstance:
    pluginInstance: "google_cloud_cas"
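
As a language-level analogy for what the two policies ask Envoy to do, the sketch below uses Python's standard ssl module: the server context refuses clients without a certificate, the client context verifies the server, and a small helper pulls SPIFFE URIs out of the peer certificate's SAN. This is an illustration only; in the mesh, Envoy performs these steps and no application code is involved. The function names and the certificate file paths in the comments are hypothetical.

```python
import ssl


def build_server_context() -> ssl.SSLContext:
    """Server side: reject any client that does not present a valid certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.verify_mode = ssl.CERT_REQUIRED  # this is what makes the TLS "mutual"
    # A real server would also load its own identity and the trusted CA
    # (file names are hypothetical):
    #   ctx.load_cert_chain("server.crt", "server.key")
    #   ctx.load_verify_locations(cafile="mesh-ca.pem")
    return ctx


def build_client_context() -> ssl.SSLContext:
    """Client side: verify the server; PROTOCOL_TLS_CLIENT enables this by default."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # A real client would also present its own certificate (hypothetical paths):
    #   ctx.load_cert_chain("client.crt", "client.key")
    #   ctx.load_verify_locations(cafile="mesh-ca.pem")
    return ctx


def spiffe_ids_from_peer_cert(peer_cert: dict) -> list[str]:
    """Extract SPIFFE URIs from the dict returned by SSLSocket.getpeercert()."""
    sans = peer_cert.get("subjectAltName", ())
    return [
        value
        for kind, value in sans
        if kind == "URI" and value.startswith("spiffe://")
    ]
```

Note that Python's built-in hostname check matches DNS names and IP addresses, not URI SANs, so a SPIFFE-aware client would compare the extracted URI against the expected identity itself.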

3. L7 Authorisation Policies: Beyond the Firewall

A firewall is a blunt instrument. It can block a port, but it cannot block a DELETE request that arrives from an authorised IP. AuthorizationPolicy in CSM allows for Attribute-Based Access Control (ABAC).

Scenario: The "Read-Only" Frontend

We want to allow our frontend-sa to GET orders, but we only want a specialised admin-sa to perform POST or DELETE operations.

# auth-policy.yaml
name: "projects/YOUR_PROJECT_ID/locations/global/authorizationPolicies/orders-authz"
action: ALLOW # Default Deny architecture
rules:
  - from:
      - principals:
          - "spiffe://YOUR_PROJECT_ID.svc.id.goog/ns/default/sa/frontend-sa"
    to:
      - operations:
          - methods: ["GET"]
            paths: ["/api/v1/orders/*"]
  - from:
      - principals:
          - "spiffe://YOUR_PROJECT_ID.svc.id.goog/ns/default/sa/admin-sa"
    to:
      - operations:
          - methods: ["POST", "DELETE"]
            paths: ["/api/v1/orders/*"]

Pro-Tip: Always include a path prefix. Authorisation policies are evaluated at the Envoy level, before the request reaches your application. If a rule omits paths, it matches every path, so any principal that the rule allows could also reach administrative endpoints such as /metrics or /healthz.
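
The evaluation logic behind an ALLOW policy with default deny can be sketched in a few lines of Python. This is a simplified model, not Envoy's actual matcher: Rule and is_allowed are hypothetical names, my-project is a placeholder project ID, and glob matching via fnmatchcase only approximates how a real proxy matches path patterns.

```python
from __future__ import annotations

from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Rule:
    principals: tuple[str, ...]
    methods: tuple[str, ...]
    paths: tuple[str, ...] = ()  # empty means "any path", as in the Pro-Tip


def is_allowed(rules: list[Rule], principal: str, method: str, path: str) -> bool:
    """Default deny: a request passes only if some rule matches all attributes."""
    for rule in rules:
        if principal not in rule.principals:
            continue
        if method not in rule.methods:
            continue
        if rule.paths and not any(fnmatchcase(path, p) for p in rule.paths):
            continue
        return True
    return False


FRONTEND = "spiffe://my-project.svc.id.goog/ns/default/sa/frontend-sa"
ADMIN = "spiffe://my-project.svc.id.goog/ns/default/sa/admin-sa"

ORDERS_RULES = [
    Rule(principals=(FRONTEND,), methods=("GET",), paths=("/api/v1/orders/*",)),
    Rule(principals=(ADMIN,), methods=("POST", "DELETE"), paths=("/api/v1/orders/*",)),
]
```

With these rules, a GET from the frontend to /api/v1/orders/42 is allowed, while a DELETE from the same identity falls through every rule and is denied. Dropping the paths tuple from the first rule would let the frontend GET /metrics as well, which is the failure mode the Pro-Tip warns about.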

Figure 2: Authorisation Policy Logic Flow. Envoy evaluates the TLS certificate, extracts the SPIFFE ID, and checks the requested method and path against the ALLOW rules before reaching the application.

4. The Zero-Downtime Migration: Permissive to Strict

One of the most dangerous moves in cloud engineering is flipping the "Strict mTLS" switch on a live system. CSM allows for a Permissive transition.

Permissive Mode: Apply the ServerTlsPolicy with allowOpen: true set alongside mtlsPolicy, but don't update the BackendService to enforce it yet. In this state Envoy accepts both plain text and mTLS.
Observability Check: Monitor the envoy_cluster_mtls_success metric in Cloud Monitoring and confirm that no plaintext callers remain.
Strict Enforcement: Once 100% of traffic is verified as mTLS, remove allowOpen (or set it to false) and update the BackendService to use the policy.

# Update the Backend Service to link the Security Policy
gcloud compute backend-services update order-service-mesh-backend \
  --global \
  --security-settings="serverTlsPolicy=projects/YOUR_PROJECT_ID/locations/global/serverTlsPolicies/orders-mtls-policy"
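
The decision to move from permissive to strict can be written down as an explicit gate rather than a judgment call. The Python sketch below assumes you have already exported two connection counts for the observation window from your monitoring system; the function names, the counters, and the min_samples threshold of 1000 are all hypothetical and should be adapted to your traffic volume.

```python
def mtls_ratio(mtls_connections: int, plaintext_connections: int) -> float:
    """Fraction of observed connections that used mTLS."""
    if mtls_connections < 0 or plaintext_connections < 0:
        raise ValueError("connection counts cannot be negative")
    total = mtls_connections + plaintext_connections
    if total == 0:
        raise ValueError("no traffic observed; cannot compute a ratio")
    return mtls_connections / total


def ready_for_strict(mtls_connections: int,
                     plaintext_connections: int,
                     min_samples: int = 1000) -> bool:
    """Gate for strict enforcement: enough traffic seen, and none of it plaintext."""
    if mtls_connections < 0 or plaintext_connections < 0:
        raise ValueError("connection counts cannot be negative")
    enough_traffic = mtls_connections + plaintext_connections >= min_samples
    return enough_traffic and plaintext_connections == 0
```

Requiring a minimum sample size guards against declaring victory during a quiet period, when a low-frequency plaintext caller such as a nightly batch job has simply not run yet.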

Cloud Run Ingress Settings

Even with a Mesh, you must secure the "Front Door." I recommend setting your Cloud Run Ingress to "Internal and Cloud Load Balancing".

This ensures that the only way to reach your service is through the Mesh's regional anchors (Internal LB) we configured in Part 1. This prevents attackers from bypassing your mTLS policies by hitting the .a.run.app URL directly.

Conclusion

We have successfully moved from a network where "anything goes" to a network where nothing happens without a cryptographic signature. Your Cloud Run services are now isolated, encrypted, and authorised at Layer 7.

What's Next? Security is invisible until it breaks. In Part 3: Global Observability, we will explore how to use the Mesh to automatically generate distributed traces (Cloud Trace) and service-level metrics (Cloud Monitoring) without adding a single line of code to our application.