This guide walks you through building a Kubernetes-as-a-Service portal from scratch. YOU write every line of code. Each milestone teaches Go concepts, K8s internals, and container runtime knowledge directly relevant to the Netflix Compute Runtime role.
The project is at ~/PyCharmProjects/kaas-portal. There's already a scaffold committed from an earlier session - you can reference it for ideas, rewrite it completely, or rm -rf it and start fresh. It's YOUR project.
The job posting references two blog posts that describe work done by this exact team. Read both before you start building - they'll shape how you think about the project.
Source: Netflix Tech Blog - Noisy Neighbor Detection with eBPF (Sept 2024)
Netflix runs Titus, their multi-tenant compute platform. Multiple containers share the same physical host. A "noisy neighbor" is a container (or system process) that hogs host resources - especially CPU - and degrades performance for other containers on the same machine.
Traditional tools like perf add significant overhead and are deployed after the problem is noticed.
By then the noisy neighbor has moved on, or the profiling overhead makes things worse.
They instrument run queue latency - the time a process sits waiting for a CPU after it becomes runnable. This uses three Linux scheduler hooks:
| Hook | When It Fires | What They Do |
|---|---|---|
| `sched_wakeup` | Process transitions from sleeping to runnable | Record timestamp in a BPF hash map, keyed by PID |
| `sched_wakeup_new` | Newly created process becomes runnable | Same - record timestamp |
| `sched_switch` | CPU switches to running a different process | Look up the wakeup timestamp, compute delta = now - wakeup_time. That delta is the run queue latency. |
They use the process's cgroup ID to map each scheduling event back to its container. This is critical: without it, you just have PID-level data. With cgroup mapping, you can say "container X is experiencing high run queue latency because container Y is causing CPU contention."
A container exceeding its cgroup CPU limit gets throttled, which also appears as high run queue latency. You must distinguish throttling (self-inflicted) from actual noisy-neighbor preemption (caused by others). The team found that system processes (not just other containers) are often the real noisy neighbors.
Docker's `--cpus` and `--memory` flags set these same cgroup limits locally, so you can reproduce throttling on your own machine.

Source: Netflix Tech Blog - Mount Mayhem at Netflix (Feb 2026)
Netflix nodes stalled for tens of seconds when starting many containers concurrently. The mount table ballooned during startup because containerd executes thousands of bind mount operations when assembling multi-layer container images. A health check that reads the mount table would take 30+ seconds.
Almost all time was spent trying to grab a global kernel lock in the Linux Virtual Filesystem (VFS) layer.
The hottest code path was path_init() - specifically a sequence lock (seqlock) that serializes
mount table lookups.
When hundreds of containers start simultaneously, each needing dozens of mount operations, they all fight over this single lock.
Using Intel's Topdown Microarchitecture Analysis (TMA), they found that hardware architecture mattered enormously:
| Instance Type | Architecture | Behavior Under Contention |
|---|---|---|
| AWS r5.metal (older) | Dual-socket, NUMA, mesh cache coherence | Severe stalls - cross-socket cache line bouncing |
| AWS m7i.metal / m7a.24xlarge (newer) | Single-socket, distributed cache | Scaled smoothly, far less contention |
Disabling hyperthreading improved latency by up to 30%.
Two approaches were considered:
- The new mount API (`fsopen()` / `fsmount()`) - these use file descriptors instead of path-based lookups, avoiding the global VFS lock. But it requires newer kernels.

1. Verify Go is installed
Run go version. You should see Go 1.26.x (already installed via brew).
2. Understand what changed since 2016-2018
You used Go before modules existed. Here's what's different now:
| Then (2016-2018) | Now (2026) | Why It Matters |
|---|---|---|
| `$GOPATH` and `vendor/` | `go mod` (modules) | No more GOPATH. Run `go mod init <module-path>` in any directory. Dependencies in `go.mod`. |
| No generics | Generics (Go 1.18+) | Type parameters: `func Map[T any](s []T, f func(T) T) []T`. Used in newer K8s libraries. |
| `log.Printf` | `log/slog` (Go 1.21+) | Structured logging in stdlib: `slog.Info("msg", "key", value)` |
| Basic `http.ServeMux` | Enhanced mux (Go 1.22+) | Method-based routing: `mux.HandleFunc("GET /api/users/{id}", handler)`. No need for gorilla/chi for basic apps. |
| `interface{}` | `any` (alias, Go 1.18+) | `any` is just `interface{}`. Cleaner to read. |
| Manual context threading | `context` is everywhere | Every API call, every K8s client method, every DB query takes a `context.Context` as first arg. Non-negotiable pattern. |
| Error wrapping was manual | `fmt.Errorf("...: %w", err)` | The `%w` verb wraps errors. `errors.Is()` and `errors.As()` unwrap them. This is how K8s code handles errors. |
3. Warm up: Write a small program
Before touching the KaaS project, write a standalone Go program that:
- Lives in its own module (`go mod init warmup`)
- Defines a `struct` with JSON tags
- Uses the new `ServeMux` routing (`"GET /hello/{name}"`)
- Uses `slog` for logging
- Shuts down gracefully via `signal.NotifyContext`

Why: This covers the exact patterns you'll use in the KaaS portal, in a throwaway sandbox.
The key pieces are: signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM) gives you a
context that cancels on Ctrl+C. Start the HTTP server in a goroutine. Block on <-ctx.Done(). Then call
server.Shutdown(timeoutCtx). This is the standard Go HTTP server pattern.
Goal: Get a running Go API server with proper project structure, structured logging, and a health endpoint.
1. Initialize the project
You can either start from the existing scaffold at ~/PyCharmProjects/kaas-portal or wipe it and start fresh.
If starting fresh:
- `mkdir -p ~/PyCharmProjects/kaas-portal` and cd into it
- `go mod init github.com/<your-username>/kaas-portal`
- `git init`

2. Choose your project layout
Go doesn't enforce a project structure, but the community convention for non-trivial projects is:
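A sketch of that layout - the directory names here are taken from the paths used later in this guide, so treat it as one reasonable shape rather than the required one:

```
kaas-portal/
├── cmd/
│   └── kaas-portal/      # main package (go run ./cmd/kaas-portal)
├── internal/
│   ├── api/              # HTTP server, handlers, middleware
│   ├── config/           # config Load()
│   └── provider/
│       └── kind/         # Kind provider implementation
├── pkg/
│   └── models/           # public types (Cluster, etc.)
└── go.mod
```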
Why internal/? The Go compiler enforces that packages under internal/ can only be imported by code in the
parent directory tree. This is a hard compiler guarantee, not just convention. It keeps your API surface intentional.
3. Build these pieces yourself
- A config package with a `Load()` function.
- An API server with a `Router()` method that returns `http.Handler`.
- A health endpoint: `GET /healthz` returns `{"status": "ok"}`
- Request logging middleware using `slog`.

4. Test it
Run go run ./cmd/kaas-portal and curl http://localhost:8080/healthz.
Then write an actual Go test. Create internal/api/server_test.go:
- Use `httptest.NewRequest` and `httptest.NewRecorder`

Why: Go's testing story is built into the language (`go test ./...`). No test framework needed. Table-driven tests are the Go idiom - learn this pattern early.
You need to capture the status code, but http.ResponseWriter doesn't expose it after WriteHeader().
The solution: wrap ResponseWriter in your own struct that records the status code. Then call next.ServeHTTP(yourWrapper, r).
Goal: Design the core abstractions β the Cluster model and the Provider interface that every cloud backend will implement.
1. Design the Cluster model
Think about what a cluster is, provider-agnostically. At minimum:
Write this as a struct in pkg/models/cluster.go with JSON tags.
Why pkg/? This is your public API. If someone imported your module, they could use these types. The provider implementations in internal/ will create and return these models.
2. Design the Provider interface
This is where Go's interface model shines. Define an interface with methods like:
- `Name() string`
- `CreateCluster(ctx context.Context, req CreateClusterRequest) (*Cluster, error)`
- `GetCluster(ctx context.Context, id string) (*ClusterDetail, error)`
- `ListClusters(ctx context.Context) ([]Cluster, error)`
- `DeleteCluster(ctx context.Context, id string) error`

In Java/C#, you write class KindProvider implements Provider. In Go, you don't. If a type has the right methods,
it is a Provider. The compiler checks this at the call site, not at the declaration. This is called structural typing
(or "duck typing, but checked at compile time").
This matters for K8s: the entire K8s codebase is built on this pattern. The kubelet talks to containerd via the CRI interface. containerd talks to runc via the OCI runtime interface. Plugins everywhere β all using Go interfaces.
3. Think about error handling
Every method returns error. Think about what errors mean for each operation:
- A custom error type (e.g., `type NotFoundError struct{...}`)

For now, keep it simple - you can use `errors.Is()` and sentinel errors. Refine later.
4. Wire the provider map into your server
Your server should accept a map[string]Provider. Add a GET /api/v1/providers endpoint that returns the list of registered provider names.
Go won't tell you a type fails to implement an interface until you try to use it as one.
To get early feedback, add this line to your provider file:
var _ Provider = (*KindProvider)(nil).
This asserts at compile time that KindProvider implements Provider.
Goal: Implement a provider that creates real local K8s clusters using Kind. By the end, you'll run
curl -X POST .../clusters and get a real, working Kubernetes cluster.
1. Understand what Kind does under the hood
Before writing code, understand what happens when Kind creates a cluster:
- Kind starts a Docker container from a node image (`kindest/node:v1.x.x`) that contains kubelet, kubeadm, and containerd

This is containers running inside containers - the node is a Docker container, and the pods inside it use containerd. Understanding this nesting is key.
2. Add the Kind library dependency
go get sigs.k8s.io/kind
Look at the sigs.k8s.io/kind/pkg/cluster package. The key type is cluster.Provider which has methods like Create(), Delete(), List(), and KubeConfig().
3. Implement the Kind provider
Create internal/provider/kind/kind.go. You need:
- A struct holding a Kind `cluster.Provider` instance and an in-memory map of clusters
- A `sync.RWMutex` to protect the map (multiple goroutines may read/write concurrently)
- `CreateCluster`: call Kind's `Create()`, track state, get kubeconfig
- `ListClusters`: return from your in-memory map
- `DeleteCluster`: call Kind's `Delete()`, update state

Your API server handles concurrent HTTP requests (each in its own goroutine). If two requests try to read/write the clusters map simultaneously, you'll get a data race. `sync.RWMutex` allows:

- Many concurrent readers (`mu.RLock()` / `mu.RUnlock()`)
- Only one writer at a time, with no readers (`mu.Lock()` / `mu.Unlock()`)

Use `go run -race ./cmd/kaas-portal` to run with the race detector - it will catch data races at runtime.
4. Add CRUD handlers
Wire up these endpoints:
| Method | Path | What It Does |
|---|---|---|
| POST | /api/v1/clusters | Create cluster (takes JSON body) |
| GET | /api/v1/clusters | List all clusters (optional ?provider= filter) |
| GET | /api/v1/clusters/{id} | Get single cluster (include kubeconfig) |
| DELETE | /api/v1/clusters/{id} | Delete cluster |
| GET | /api/v1/clusters/{id}/kubeconfig | Get kubeconfig as YAML |
5. Test it end-to-end
Start the server and create an actual cluster:
```bash
# Create a Kind cluster
curl -X POST http://localhost:8080/api/v1/clusters \
  -H "Content-Type: application/json" \
  -d '{"name": "test-1", "provider": "kind", "node_count": 1}'
# This will take 1-2 minutes. When it returns, you have a real K8s cluster.

# List clusters
curl http://localhost:8080/api/v1/clusters | jq

# Get kubeconfig and use it
curl http://localhost:8080/api/v1/clusters/kind-test-1/kubeconfig > /tmp/test-1.kubeconfig
kubectl --kubeconfig /tmp/test-1.kubeconfig get nodes

# Clean up
curl -X DELETE http://localhost:8080/api/v1/clusters/kind-test-1
```
6. Explore what Kind created
While the cluster is running, investigate what's happening at the container runtime level:
```bash
# See the Docker containers Kind created (each is a K8s "node")
docker ps

# Exec into the node container and look at containerd
docker exec -it test-1-control-plane bash

# Inside the node:
crictl ps                          # List containers via CRI (like kubectl for the runtime)
crictl images                      # List images in containerd
cat /etc/containerd/config.toml    # containerd configuration
ps aux | grep kubelet              # The kubelet process
mount | head -30                   # See the mount table (remember the Mount Mayhem blog!)
cat /proc/1/cgroup                 # cgroup of the init process
```
When you run crictl ps inside the Kind node, you're using the same CRI interface that
the Netflix Compute Runtime team works on. When you look at /etc/containerd/config.toml,
you're seeing the same configuration they customize. When you look at the mount table, you're seeing the same
mount point explosion described in the "Mount Mayhem" blog post. This is the stack.
Goal: Use the official Kubernetes Go client to query your clusters β get nodes, list pods, read namespaces. This is the library that kubelet, controllers, and operators are built on.
1. Add the client-go dependency
go get k8s.io/client-go@latest
This is a large dependency. It pulls in the same code that powers kubectl and every K8s controller.
2. Build a cluster info endpoint
Add GET /api/v1/clusters/{id}/info that:
- Builds a clientset from the cluster's kubeconfig (`clientcmd` and `kubernetes.NewForConfig`)
- Returns basic cluster facts such as nodes and namespaces

3. Add a pod listing endpoint
Add GET /api/v1/clusters/{id}/pods that lists pods across all namespaces (or filtered by ?namespace=).
Use clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})
4. Deploy a workload through the API
Add POST /api/v1/clusters/{id}/deployments that creates a Deployment in the cluster.
Accept a simple payload (image, replicas, name) and use client-go to create it.
Every client-go call takes a context.Context. Your HTTP handler gets one from r.Context().
Pass it through to the K8s client calls. If the HTTP client disconnects, the context cancels,
and the K8s API call is abandoned. This is how Go propagates cancellation through entire call chains
without try/catch/finally.
The trick is that clientcmd normally reads from a file. To use an in-memory kubeconfig string,
use clientcmd.RESTConfigFromKubeConfig([]byte(kubeconfigStr)). This gives you a *rest.Config
that you pass to kubernetes.NewForConfig(config).
Goal: Cluster creation takes minutes. Right now the API blocks. Fix this using Go's concurrency primitives β goroutines, channels, and status polling.
1. Make creation async
Change CreateCluster to:
Clients poll GET /api/v1/clusters/{id} to check status.
2. Handle cancellation
What if someone deletes a cluster while it's still provisioning? Think about:
- Storing a `context.WithCancel` cancel function for each provisioning goroutine, so a delete can cancel in-flight work

3. Add a status event stream (stretch goal)
Add GET /api/v1/clusters/{id}/events using Server-Sent Events (SSE) to stream status updates in real-time.
`ctx.Done()` returns a channel that closes when cancelled. Use `select` to listen for it.

Goal: Add a real cloud provider. EKS creation takes 10-15 minutes - your async pattern from Milestone 5 pays off here.
1. Add the AWS SDK
go get github.com/aws/aws-sdk-go-v2 and the EKS, EC2, IAM, and STS service packages.
You used the AWS SDK in Go at Symantec - this will feel familiar, but the SDK has been rewritten (v2). The API design is different: it uses functional options and context everywhere.
2. Implement the EKS provider
This is more complex than Kind. At minimum (matching the SDK packages above), you'll need:

- IAM roles for the cluster and its node group
- VPC subnets for the control plane and worker nodes (EC2)
- Cluster and managed node group creation, polling until each is ACTIVE (EKS)
- Kubeconfig generation, authenticating with a token from STS
3. Handle the cost angle
EKS clusters cost money. Consider:
- A `max_clusters` config to prevent runaway costs

Goal: Add your second cloud provider. By now the Provider interface should make this clean - same interface, different backend.
Same pattern as EKS but with cloud.google.com/go/container. GKE clusters are faster to create (~5 min) and have a simpler IAM model.
The real learning here is: does your Provider interface hold up? If adding GKE requires changing the interface, that's a design smell worth reflecting on.
Once the core is solid, these are directions that go deeper into Netflix-relevant territory:
| Milestone | What You'll Learn | Netflix Relevance |
|---|---|---|
| Custom K8s Controller (CRD) | controller-runtime, reconciliation loops, CRD design | This is how K8s operators work. The Compute Runtime team builds controllers. |
| Node diagnostics endpoint | SSH into nodes, collect metrics, read cgroup stats | Operational troubleshooting - "why is this container slow?" |
| containerd configuration management | containerd config.toml, runtime classes, snapshotter config | Literally what the team customizes β runtime configuration at the node level |
| eBPF-based node monitoring | Write a Go program that uses eBPF to collect scheduling metrics | Directly from the "Noisy Neighbor" blog post - same technology, same use case |
| React frontend | Cluster dashboard, real-time status, log viewer | Not Netflix-relevant, but makes the project a complete portfolio piece |
| Topic | Resource | When to Read |
|---|---|---|
| Modern Go | Effective Go + Go Blog | Milestone 0 - skim for what changed |
| Go modules | go.mod reference | Milestone 0 |
| Go concurrency | Concurrency in Go by Katherine Cox-Buday, chapters 1-4 | Milestone 5 |
| client-go | client-go examples | Milestone 4 |
| Kind internals | Kind design docs | Milestone 3 |
| Container runtime | Container Security by Liz Rice | Milestone 3 (while exploring the Kind node) |
| Linux performance | Systems Performance by Brendan Gregg, chapters 1-6 | Ongoing - start during Milestone 3 |
| eBPF | Netflix blog post + Learning eBPF by Liz Rice | Future milestone on node monitoring |