documentation/docs/ADRs/adr002.md

---
status: draft
date: 2025-12-05
author: Marco De Luca and Tobias Brunner
---

# ADR 002 Kubernetes Distribution

## Context and Problem Statement

We're building a multi-CSP platform that runs VSHN AppCat workloads across different cloud providers. We need to decide how to provision and manage Kubernetes clusters on each CSP. The choice affects operational complexity, cost structure, consistency across environments, and our ability to support AppCat reliably.

Each CSP offers their own managed Kubernetes service (EKS, AKS, GKE, SKS, etc.), but these services differ significantly in behavior, versioning, available features, and upgrade paths. For a platform like Servala, where predictable behavior and consistency are essential for AppCat development and operations, these differences create friction.

## Considered Options

We evaluated two main options for running Kubernetes on each CSP:

1. using the CSP's managed Kubernetes offering
1. deploying our own Kubernetes distribution on top of their compute layer (e.g., Talos Linux).

The table below summarizes the main differences we identified.

| BYO Kubernetes                                                                                                    | Cloud Kubernetes                                                                                   |
| ----------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------- |
| :lucide-check: We have a lot already in place (concept)                                                           | :lucide-check: Out-of-the-box pre-initiated                                                        |
| :lucide-check: Standardized infrastructure across CSPs                                                            | :lucide-check: Potentially better integrated                                                       |
| :lucide-check: Easier for AppCat to develop and test                                                              | :lucide-check: Pre-defined upgrade paths                                                           |
| :lucide-check: Full control over versions, upgrade cadence and feature gates                                      | :lucide-check: Support from CSP                                                                    |
| :lucide-check: Freedom of choice of cluster components                                                            | :lucide-x: Limited flexibility                                                                     |
| :lucide-check: Potentially better security                                                                        | :lucide-x: Inconsistency across CSPs (different k8s flavors, k8s version, CRDs, API feature gates) |
| :lucide-check: Predictable cluster behavior across CSPs                                                           | :lucide-x: Harder for AppCat to test on different environments                                     |
| :lucide-check: Easier to implement in a GitOps-first pattern                                                      | :lucide-x: Opinionated software and constraints                                                    |
| :lucide-check: Potentially cheaper, scalable cost model tied to raw compute offering not per-cluster service fees | :lucide-x: Unpredictable behavior (e.g., noisy neighbors)                                          |
| :lucide-check: Streamlined support and troubleshooting model                                                      |                                                                                                    |
| :lucide-x: Cope with infrastructure                                                                               |                                                                                                    |
| :lucide-x: Manage whole stack                                                                                     |                                                                                                    |

Based on this, BYO Kubernetes is a clear winner.

### Kubernetes distributions

We evaluated the following Kubernetes distributions:

**OpenShift**: Too expensive due to subscription costs and high infrastructure requirements. Adds significant complexity and bloat for a platform that only needs to run hosted AppCat workloads. The overhead doesn't justify the benefits for our use case.

**Rancher**: Past experience with Rancher has been negative. The additional management layer introduces complexity and potential points of failure that we want to avoid.

**k3s**: Lightweight and easy to deploy, but lacks full integration with the underlying operating system. We would still need to manage a traditional Linux distribution separately, which adds operational burden.

**Talos Linux**: Purpose-built for Kubernetes with an immutable, API-driven design. No SSH, no shell, minimal attack surface. The OS and Kubernetes are managed as a single unit with declarative configuration. Produces consistent behavior across all environments.

## Decision

We're choosing a BYO Kubernetes approach using Talos Linux across all CSPs.

Talos Linux provides an immutable, API-driven operating system purpose-built for Kubernetes. It eliminates SSH access, uses a declarative configuration model, and produces identical cluster behavior regardless of where it runs. This gives us the consistency and control we need without the operational burden of managing traditional Linux distributions.

Each CSP will run Talos Linux on their compute layer (usually virtual machines). We control the Kubernetes version, component configuration, security defaults, and upgrade cadence. The same cluster configuration works everywhere, which simplifies AppCat development, testing, and support.

### Consequences

The operational model shifts from consuming managed services to managing infrastructure. We take on responsibility for Kubernetes upgrades, security patches, and cluster lifecycle management. This requires investment in automation and tooling, but we already have experience and concepts in place from previous work.

In return, we get full control over our platform. AppCat can be developed and tested against a single, predictable Kubernetes environment instead of accounting for differences between different flavors of Kubernetes distributions. Troubleshooting becomes easier because cluster behavior is consistent. Cost becomes predictable and tied to compute resources rather than per-cluster service fees.

The security posture improves. Talos Linux has no SSH, no shell, and a minimal attack surface. We define exactly what runs on the nodes. There are no surprises from CSP-specific components or automatic updates we don't control.
adr discussing the choice of kubernetes distro 2025-12-08 15:42:55 +01:00			`---`
			`status: draft`
			`date: 2025-12-05`
			`author: Marco De Luca and Tobias Brunner`
			`---`

			`# ADR 002 Kubernetes Distribution`

			`## Context and Problem Statement`

document decision for talos 2025-12-08 17:05:39 +01:00			`We're building a multi-CSP platform that runs VSHN AppCat workloads across different cloud providers. We need to decide how to provision and manage Kubernetes clusters on each CSP. The choice affects operational complexity, cost structure, consistency across environments, and our ability to support AppCat reliably.`

			`Each CSP offers their own managed Kubernetes service (EKS, AKS, GKE, SKS, etc.), but these services differ significantly in behavior, versioning, available features, and upgrade paths. For a platform like Servala, where predictable behavior and consistency are essential for AppCat development and operations, these differences create friction.`
adr discussing the choice of kubernetes distro 2025-12-08 15:42:55 +01:00
			`## Considered Options`

document decision for talos 2025-12-08 17:05:39 +01:00			`We evaluated two main options for running Kubernetes on each CSP:`
adr discussing the choice of kubernetes distro 2025-12-08 15:42:55 +01:00
document decision for talos 2025-12-08 17:05:39 +01:00			`1. using the CSP's managed Kubernetes offering`
			`1. deploying our own Kubernetes distribution on top of their compute layer (e.g., Talos Linux).`
adr discussing the choice of kubernetes distro 2025-12-08 15:42:55 +01:00
			`The table below summarizes the main differences we identified.`

			`\| BYO Kubernetes \| Cloud Kubernetes \|`
			`\| ----------------------------------------------------------------------------------------------------------------- \| -------------------------------------------------------------------------------------------------- \|`
			`\| :lucide-check: We have a lot already in place (concept) \| :lucide-check: Out-of-the-box pre-initiated \|`
			`\| :lucide-check: Standardized infrastructure across CSPs \| :lucide-check: Potentially better integrated \|`
			`\| :lucide-check: Easier for AppCat to develop and test \| :lucide-check: Pre-defined upgrade paths \|`
			`\| :lucide-check: Full control over versions, upgrade cadence and feature gates \| :lucide-check: Support from CSP \|`
			`\| :lucide-check: Freedom of choice of cluster components \| :lucide-x: Limited flexibility \|`
			`\| :lucide-check: Potentially better security \| :lucide-x: Inconsistency across CSPs (different k8s flavors, k8s version, CRDs, API feature gates) \|`
			`\| :lucide-check: Predictable cluster behavior across CSPs \| :lucide-x: Harder for AppCat to test on different environments \|`
			`\| :lucide-check: Easier to implement in a GitOps-first pattern \| :lucide-x: Opinionated software and constraints \|`
			`\| :lucide-check: Potentially cheaper, scalable cost model tied to raw compute offering not per-cluster service fees \| :lucide-x: Unpredictable behavior (e.g., noisy neighbors) \|`
			`\| :lucide-check: Streamlined support and troubleshooting model \| \|`
			`\| :lucide-x: Cope with infrastructure \| \|`
			`\| :lucide-x: Manage whole stack \| \|`

document decision for talos 2025-12-08 17:05:39 +01:00			`Based on this, BYO Kubernetes is a clear winner.`

			`### Kubernetes distributions`

			`We evaluated the following Kubernetes distributions:`

			`OpenShift: Too expensive due to subscription costs and high infrastructure requirements. Adds significant complexity and bloat for a platform that only needs to run hosted AppCat workloads. The overhead doesn't justify the benefits for our use case.`

			`Rancher: Past experience with Rancher has been negative. The additional management layer introduces complexity and potential points of failure that we want to avoid.`

			`k3s: Lightweight and easy to deploy, but lacks full integration with the underlying operating system. We would still need to manage a traditional Linux distribution separately, which adds operational burden.`

			`Talos Linux: Purpose-built for Kubernetes with an immutable, API-driven design. No SSH, no shell, minimal attack surface. The OS and Kubernetes are managed as a single unit with declarative configuration. Produces consistent behavior across all environments.`

adr discussing the choice of kubernetes distro 2025-12-08 15:42:55 +01:00			`## Decision`

document decision for talos 2025-12-08 17:05:39 +01:00			`We're choosing a BYO Kubernetes approach using Talos Linux across all CSPs.`

			`Talos Linux provides an immutable, API-driven operating system purpose-built for Kubernetes. It eliminates SSH access, uses a declarative configuration model, and produces identical cluster behavior regardless of where it runs. This gives us the consistency and control we need without the operational burden of managing traditional Linux distributions.`

			`Each CSP will run Talos Linux on their compute layer (usually virtual machines). We control the Kubernetes version, component configuration, security defaults, and upgrade cadence. The same cluster configuration works everywhere, which simplifies AppCat development, testing, and support.`

adr discussing the choice of kubernetes distro 2025-12-08 15:42:55 +01:00			`### Consequences`
document decision for talos 2025-12-08 17:05:39 +01:00
			`The operational model shifts from consuming managed services to managing infrastructure. We take on responsibility for Kubernetes upgrades, security patches, and cluster lifecycle management. This requires investment in automation and tooling, but we already have experience and concepts in place from previous work.`

			`In return, we get full control over our platform. AppCat can be developed and tested against a single, predictable Kubernetes environment instead of accounting for differences between different flavors of Kubernetes distributions. Troubleshooting becomes easier because cluster behavior is consistent. Cost becomes predictable and tied to compute resources rather than per-cluster service fees.`

			`The security posture improves. Talos Linux has no SSH, no shell, and a minimal attack surface. We define exactly what runs on the nodes. There are no surprises from CSP-specific components or automatic updates we don't control.`