
Architecture overview: Designing a Self-Managing Linux Fleet

linux-fleet - This article is part of a series.
Part 1: This Article

Introduction

I wanted a scalable Linux fleet that could take care of itself. A system that didn’t need babysitting every time I deployed a machine or crossed into a different network. I am not a systems administrator, yet I keep falling into that role. When juggling different environments with different constraints, that work stops being “just annoying” and starts being a real drain.

What I needed was the opposite of that: machines that behaved the same everywhere, that told me what they were doing, and that didn’t fall out of sync the moment I stopped looking at them. No more VPN acrobatics. No more silent config drift. No more guessing whether a server was fine because the SSH banner looked normal.

Instead of scaling manual work and drowning in my own infrastructure, I built a platform.

A multi-module project, and what to expect from this post

This post walks through the full flow: from a Linux VM booting in some random network to the moment it becomes a fully managed, observable, locked-down node. I’ll dig into each component in separate posts; this one is about how everything snaps together and why the setup ended up this way. I hope it sparks some ideas if you’re facing a similar need in your own operations.

This will probably grow into a small series over time. I’ll add new posts whenever I tweak something in the fleet or stumble onto something worth showing. No schedule in mind, just whenever time allows.

Background: The Problem I Wanted to Solve

On a side project, I’m managing a growing set of Linux servers across multiple environments. The headache started forming in the early stages: deployment, config drift, patching, and no single place to control or observe the fleet. I decided to fix the problem while it was still a small headache instead of waiting for it to explode later.

I was dealing with:

  • servers spun up in different environments, isolated from each other,
  • inconsistent setups,
  • scattered hardening levels,
  • no automatic patching,
  • no unified telemetry or observability,
  • drift appearing silently over time,
  • and too much “did we configure this box the same way as the others?” anxiety.

As I deployed clusters (a cluster here being one or more Linux nodes in a single environment) for different networks or sites over time, I hit a predictable problem almost immediately: I had no visibility into the previous environments. By the time I finished the second deployment, I couldn’t even tell if the first one was alive or dead. And they were my responsibility. Without a unified system, each new environment added uncertainty and operational risk.

The following picture shows this concern as I saw it in my “mind’s eye”:

timeline-cluster-health.svg

That’s far from a clean, consistent, self-managing fleet, so the answer emerged as a layered architecture.

Luckily, almost all servers do the exact same thing and require the same configuration, so the project’s evolution is somewhat linear and complexity is predictable.

The Architecture at a Glance

Here’s what this project accomplished in plain English:

  1. Spinning up 1 or more fresh Linux servers.

  2. They run one Go binary, which:

    • configures SSH
    • creates a service user
    • installs Tailscale
    • installs the operational software (a telemetry ingestion and forwarding server, but that’s not what this post is about)
    • verifies everything went right
    • sends me a Slack “Success” (or “Error”) message
  3. The server automatically joins my zero-trust mesh network.

  4. Prometheus gets installed to begin scraping metrics and Grafana shows dashboards instantly. I have a few custom configurations for Prometheus, like monitoring iptables incoming bytes.

  5. Ansible Semaphore takes over fleet management (patching, config rollouts, updates).

  6. Lynis audits run on schedule, and the results get turned into Prometheus metrics.

  7. Grafana visualizes the endpoint metrics I need, and even the hardening score and drift over time.

  8. Uptime Kuma monitors the fleet.

  9. Alerts fire if anything breaks, regresses, or drifts.

It looks something like this:

general-diagram.png

Each layer solves one problem, and together they form a platform.

Layer 1: Zero-Touch Bootstrap (Go)

The bootstrapper is the entry point.

The goal was simple:

The initial configuration is the same for every node, so the nodes should configure themselves. One command, zero manual work, and each server should enroll into Tailscale on its own.

Running a single binary turns a raw Debian VM into a “known-good” managed node:

  • SSH configured
  • Service user created and configured
  • Tailscale installed and authenticated
  • Telemetry ingestion/forwarding software installed
  • Basic health checks executed
  • A Slack notification sent with the result (success or failure): “Server $hostname is up”

This alone has saved me countless hours.

This small Go binary is genuinely one of the things I’m proudest of in this project. Later posts will go deeper into how it works and the design decisions behind it.
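
To make the flow concrete, here’s a heavily simplified sketch of what a bootstrapper like this can look like. Everything in it (the step list, the service user name, the health check, the Slack webhook variable) is illustrative rather than the actual code, which gets its own post later.

```go
// bootstrap.go: minimal sketch of a zero-touch bootstrapper (hypothetical names).
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
	"os"
	"os/exec"
)

// slackWebhook is assumed to be provided through the environment.
var slackWebhook = os.Getenv("SLACK_WEBHOOK_URL")

// run executes a command and wraps its combined output into the error for context.
func run(name string, args ...string) error {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("%s %v: %w: %s", name, args, err, out)
	}
	return nil
}

// notify posts a plain-text message to a Slack incoming webhook.
func notify(msg string) {
	body := []byte(fmt.Sprintf(`{"text": %q}`, msg))
	if _, err := http.Post(slackWebhook, "application/json", bytes.NewReader(body)); err != nil {
		log.Printf("slack notification failed: %v", err)
	}
}

func main() {
	hostname, _ := os.Hostname()

	steps := []struct {
		name string
		fn   func() error
	}{
		{"configure SSH", func() error {
			// Placeholder: deploy a hardened sshd_config shipped alongside the binary.
			return run("install", "-m", "0600", "sshd_config", "/etc/ssh/sshd_config")
		}},
		{"create service user", func() error {
			return run("useradd", "--system", "--create-home", "svcuser")
		}},
		{"install tailscale", func() error {
			return run("sh", "-c", "curl -fsSL https://tailscale.com/install.sh | sh")
		}},
		{"join tailnet", func() error {
			return run("tailscale", "up", "--authkey", os.Getenv("TS_AUTHKEY"))
		}},
		{"health check", func() error {
			return run("systemctl", "is-active", "--quiet", "ssh")
		}},
	}

	for _, s := range steps {
		if err := s.fn(); err != nil {
			notify(fmt.Sprintf("Error: bootstrap of %s failed at %q: %v", hostname, s.name, err))
			log.Fatal(err)
		}
	}
	notify(fmt.Sprintf("Success: server %s is up", hostname))
}
```

Running the steps in a fixed order and ending with exactly one Slack message, success or error, is what makes this feel like one command instead of a checklist.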


Layer 2: Zero-Trust Access With Tailscale

Once the fleet grew past the first environment, the next nightmare became obvious: managing access through a different VPN client for each environment, one at a time, would turn into a mess. FortiClient here, Palo Alto there, Appgate somewhere else… This just can’t scale.

So the first thing I needed was a networking model that was painless, unified, and safe.

Tailscale solved everything at once:

  • every server joins a private mesh, even on different environments
  • identity-based access replaces SSH key chaos
  • properly configured ACLs give least privilege (who can talk to whom, and on what port?)
  • no VPN servers or jump hosts
  • consistent connectivity across environments

mesh-world-network.png

The bootstrapper enrolls each new “node” into Tailscale.


Layer 3: Observability With Prometheus + Grafana + Uptime Kuma

Once a server is online, I want to know how it behaves. If it coughs, if it slows down, if something starts drifting. I don’t want to ask the server; I want it to report to me. Everything should flow into a central place where I get both real-time signals and historical trends.

Prometheus scrapes:

  • node metrics (RAM, CPU, disk, and more)
  • iptables metrics for incoming bytes (yes, I wrote a Prometheus exporter for iptables; a sketch follows this list). The nodes are mostly telemetry ingestion servers, so ingestion metrics are a godsend for troubleshooting at different layers.
  • security drift metrics (more on this later; this part is also awesome)
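
To give a flavour of that exporter (the real one will get its own post), here is a minimal sketch that shells out to iptables and republishes the per-rule byte counters of the INPUT chain. The port, metric name, and labels are placeholders I picked for illustration.

```go
// iptables_exporter.go: hedged sketch of an exporter for iptables byte counters.
// Metric name, labels, and port are illustrative; the fleet's real exporter differs.
// Needs enough privileges to read iptables counters (typically root).
package main

import (
	"log"
	"net/http"
	"os/exec"
	"strconv"
	"strings"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// bytesGauge mirrors the byte counter of each rule in the INPUT chain.
var bytesGauge = promauto.NewGaugeVec(prometheus.GaugeOpts{
	Name: "iptables_input_rule_bytes",
	Help: "Bytes counted by each iptables INPUT rule, as reported by iptables.",
}, []string{"rule", "target"})

// scrape runs iptables and copies its counters into the gauges.
func scrape() {
	out, err := exec.Command("iptables", "-L", "INPUT", "-v", "-n", "-x").Output()
	if err != nil {
		log.Printf("iptables failed: %v", err)
		return
	}
	for i, line := range strings.Split(string(out), "\n") {
		f := strings.Fields(line)
		if len(f) < 3 {
			continue
		}
		// Rule lines start with numeric pkts/bytes columns; header lines do not.
		b, err := strconv.ParseFloat(f[1], 64)
		if err != nil {
			continue
		}
		bytesGauge.WithLabelValues(strconv.Itoa(i), f[2]).Set(b)
	}
}

func main() {
	go func() {
		for {
			scrape()
			time.Sleep(15 * time.Second)
		}
	}()
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9101", nil))
}
```

In a real exporter these would arguably be exposed as counters, since the kernel values only grow; the gauge here just keeps the sketch short.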

Grafana turns that data into dashboards and alerts.

Uptime Kuma (an open-source uptime and performance monitoring tool) adds simple heartbeat monitoring for all the nodes.
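
Depending on how you wire it, Uptime Kuma can probe the nodes over the tailnet or accept push-style heartbeats. If you go the push route, each node only needs to hit its monitor’s push URL on a timer; a tiny sketch, with a placeholder URL and token:

```go
// heartbeat.go: optional push-style heartbeat to an Uptime Kuma "Push" monitor.
// Kuma generates a per-monitor push URL; the one below is a placeholder.
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	const pushURL = "https://uptime.example.com/api/push/abc123?status=up&msg=OK"
	for {
		resp, err := http.Get(pushURL)
		if err != nil {
			log.Printf("heartbeat failed: %v", err)
		} else {
			resp.Body.Close()
		}
		time.Sleep(60 * time.Second)
	}
}
```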

This layer gives me clarity and peace of mind.

If something breaks, I know what and when, and (with some future improvements) why.

timeline-cluster-health-known.png


Layer 4: Orchestration With Ansible Semaphore

The servers in the fleet need love and care, not just “set and forget”, and that manual love and care takes a lot of time. Monitoring alone isn’t enough: I want the fleet to be actionable, so I can fix things, push changes, and improve the system as a whole.

Ansible Semaphore gives me:

  • reliable patching schedules
  • config rollouts
  • hardening tasks
  • service restarts
  • node initialization tasks
  • ad-hoc operations across the fleet

It’s light, agentless, and works beautifully with Tailscale.

In the series I’ll show examples of the patching pipeline and the hardening sweeps.

Layer 5: Continuous Security Drift Detection (Lynis → Prometheus)

Security posture decays quietly.

Lynis gives me baseline auditing, but I wanted it as time series, not logs.

So I built a small pipeline:

  1. Lynis outputs a JSON report.
  2. A reducer script converts the results into Prometheus metrics.
  3. node_exporter exposes the metrics (textfile collector).
  4. Grafana shows hardening score + warnings over time.
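
A minimal version of steps 2 and 3 could look like the sketch below. The JSON field names, file paths, and metric names are assumptions for illustration; the real reducer and its metrics get their own post.

```go
// lynis_reducer.go: sketch of turning a Lynis report into node_exporter textfile metrics.
// The JSON field names, paths, and metric names below are assumptions for illustration.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

// report mirrors only the fields this sketch cares about.
type report struct {
	HardeningIndex float64  `json:"hardening_index"`
	Warnings       []string `json:"warnings"`
	Suggestions    []string `json:"suggestions"`
}

func main() {
	raw, err := os.ReadFile("/var/log/lynis-report.json")
	if err != nil {
		log.Fatal(err)
	}
	var r report
	if err := json.Unmarshal(raw, &r); err != nil {
		log.Fatal(err)
	}

	// node_exporter's textfile collector picks up *.prom files from a configured directory.
	// Write to a temp file first, then rename, so a scrape never sees a half-written file.
	const dir = "/var/lib/node_exporter/textfile_collector"
	body := fmt.Sprintf(
		"lynis_hardening_index %g\nlynis_warnings_total %d\nlynis_suggestions_total %d\n",
		r.HardeningIndex, len(r.Warnings), len(r.Suggestions),
	)
	tmp := dir + "/lynis.prom.tmp"
	if err := os.WriteFile(tmp, []byte(body), 0o644); err != nil {
		log.Fatal(err)
	}
	if err := os.Rename(tmp, dir+"/lynis.prom"); err != nil {
		log.Fatal(err)
	}
}
```

The write-then-rename step matters once alerts hang off these numbers: Prometheus never scrapes a partially written metrics file.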

This turns baseline auditing into drift monitoring with history, trends, and alerts.

It’s one of the most valuable layers of the whole system.


What the System Achieves

By combining these components, the fleet becomes:

  • self-onboarding
  • self-observable
  • self-managing
  • self-auditing

It requires almost no ongoing human attention beyond improvements and oversight.

Instead of reacting to problems, I get to design systems. I get to corner problems instead of chasing them.

That’s the whole point.


What’s Coming in the Series

Over the next posts, I’ll discuss decisions and some patterns I have implemented on the following topics:

  • Zero-touch bootstrap for nodes’ initial config
  • Tailscale & Zero Trust
  • Observability stack
  • Fleet & Security automation with Ansible Semaphore

Each post will show real code, configs, screenshots, and design choices.


Closing thoughts

This project is ongoing and the improvements might never end (I’m not short on ideas for it), which is why I think of it as a bonsai tree.

It started as a way to avoid repetitive manual work, but it grew into a full platform: one that’s simple, consistent, and surprisingly powerful for the size of its components.

If you’re interested in building something similar, or want to dig into the details, the next posts in this series break it all down.

J Armando G
Cybersecurity & General Tech Enthusiast