In this post, I go through the design process of setting up a monitoring stack with Grafana and Prometheus, along with a couple of custom Prometheus node exporters, inside a Tailscale network, to monitor a SIEM collection infrastructure stack.
Link to the Github repository: https://github.com/Argandov/Engineering-Patterns/tree/main/monitoring-stack
Note: The technical details are outlined in this project’s Github repository, so even though a bit of technical jargon shows up here, I’ll stick to the design criteria throughout this blog post. In other words, this post is not a “How to install Grafana/Prometheus/iptables/Tailscale”.
About Wazuh #
Throughout this post, some of the data presented is related to the Wazuh SIEM. This is solely so I can document my lab’s tests rather than actual production systems.
Intro #
I’ve been working heavily on Detection Engineering lately, and part of what makes detection engineering possible, of course, is setting up and administering the systems that do the detecting in the first place. At first, a couple of monitoring servers or sensors in hybrid networks were a breeze both to set up and to troubleshoot. But I knew from the start that eventually I’d need a solution that would let me “monitor my own monitoring infrastructure”. That time came rather soon, when not being able to answer “Is my SIEM infrastructure working correctly?” at any given moment started to rob me of my precious peace of mind. So I started this design phase, and it was quite surprising how the actual implementation turned out to be even faster than the design.
The components of the “monitored infrastructure” are, roughly, as follows:
- Several monitoring servers, or “sensors”, scattered across hybrid infrastructures, whose job is to send telemetry to a centralized SIEM,
- Sensor administration servers, also scattered across different cloud environments, whose job is “sensor” or “collector” administration (remote configuration, updates, etc.),
- The SIEM itself.
I needed to be able to answer things like:
- Is the SIEM’s peripheral infrastructure (collectors, or sensors) working at all?
- Are they having performance issues? Saturation issues?
- If one sensor stops receiving logs at any given time, would I know?
- For the past X hours/days/weeks, has any sensor had a hiccup?
- For the past X hours/days/weeks, has any sensor stopped receiving Syslog or Windows Event logs?
- “It’s all working now. But have I dropped logs due to any issue at all in the past X hours/days/weeks?”
- What’s my throughput for each ingested technology?
Also very important:
- If something happens (A collector stops working, I lose logs, etc.) would I be able to address the issue promptly? Or would I spend a whole afternoon troubleshooting due to not having any visibility?
I wasn’t able to answer a single one of those questions, so I addressed the issue aggressively.
The challenge #
Using Tailscale as the highway for logs #
One of the main challenges of implementing a monitoring stack with Prometheus + Grafana was that I would need to set up PKI (Public Key Infrastructure) and possibly expose a few services to the internet, which of course expands the attack surface of this stack. That was a big “no-no” for me, so I spent a few days thinking about how to cover this, and eventually Tailscale came to mind. I’ve been using it lightly over the past 1-2 years and it’s awesome, but I wasn’t sure WireGuard was a technology that would support this kind of use case. After some research, I found out I was just being extra cautious: the path was a clear green light.
Tailscale even exposes its own client metrics endpoint at http://localhost:port/metrics, so Prometheus could also be pointed at this endpoint in the future to get a feedback loop on the health of the mesh itself. But that’s for a future iteration of this project.
I had my solution.
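For reference, getting a node into the tailnet is essentially a one-liner per host. This is a minimal sketch; the auth key and hostname below are placeholders, not values from the repo:
# Install Tailscale and join the node to the tailnet
# (auth key and hostname are placeholders).
curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up --authkey=tskey-auth-XXXXXXXXXXXX --hostname=collector-homelab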
Monitoring network throughput via Prometheus? 🤨 #
Now, the other concern I had was that while performance and saturation were important, they weren’t the most important thing I needed (the monitoring servers were carefully tested and scoped, so I wasn’t really worried about saturation at the moment).
Monitoring performance and saturation was maybe 20% of the battle. I needed to monitor network connections on the collector (sensor) servers, in order to know when (not if) something fails.
I had a “chain” of interconnected and interdependent systems:
flowchart LR
subgraph on-premises or VPC [on-premises or VPC]
A[Network Devices]:::grayBox
B[Servers]:::grayBox
C[SIEM Collector]:::collector
end
subgraph VPC [VPC]
D[SIEM Server]:::defaultBox
E[Collector Admin Server]:::adminBox
end
A -- Syslog/Others --> C
B -- Syslog/WinEvents --> C
C -- gRPC --> D
E -. Websocket .- C
classDef grayBox fill:#f6f6f6,stroke:#bbb,stroke-width:1px,color:#000;
classDef collector fill:#cbc8f8,stroke:#444,stroke-width:1.5px;
classDef adminBox fill:#f9c6f9,stroke:#d46ad4,stroke-width:2px,stroke-dasharray: 5 5,color:#000;
classDef defaultBox fill:#fff,stroke:#333,stroke-width:1px;
linkStyle 0 stroke:#00bcd4,stroke-width:2px;
linkStyle 1 stroke:#00bcd4,stroke-width:2px;
linkStyle 2 stroke:#aaa,stroke-dasharray: 5 5,stroke-width:2px;
Being able to triage issues means answering:
- Am I ingesting logs into the SIEM?
- If not, do I know why? (With several orchestrated systems working toward one big goal, it’s often very hard to find the culprit when something fails.)
Of course, there could be a lot of reasons:
- SIEM server updated log parsing and mine broke (I mean, it happens)
- SIEM Server having an issue (But where exactly?)
- SIEM collector having an issue (But where exactly?)
- DHCP server handing a new IP to the SIEM collectors (It shouldn’t happen with fixed IPs, but you never know, especially in networks where different departments are involved… so it was a risk I didn’t want to carry)
- Hell, the Cloud Provider could fail (It happened very recently…)
- DHCP server handing a new IP to my log sources (FWs, Servers, etc.). Again, this shouldn’t happen. But “shouldn’t” != “Won’t”
- Any other reason for which my log sources would stop sending their Syslog and other kinds of logs.
I didn’t want to deal with monitoring inbound bytes/packets per port on the collector side, so I went around in circles trying to avoid it. But it soon became apparent that I absolutely needed to do it. I tried several methods; none of them worked, or they were total overkill. And then I found a YouTube video that was a life saver.
I didn’t know iptables was able to keep per-rule packet and byte counters. It’s not foolproof, because it’s not historical data but a running aggregate, but it’s a very good start. I care about the numbers, but more importantly, I care about the rate of change of these numbers.
The collector server receives logs from different technologies on different ports (e.g. Syslog from FW 1 on port 31000, Syslog from FW 2 on port 31001, and so on), and this iptables solution was perfect. We don’t need conntrack, and we don’t need to sniff the network. We can just use the firewall itself to report its own counters and write them to a Prometheus-readable metrics file.
And so, using a bit of AI, I arrived at the solution. It turns out we can add a logging rule for every port in our iptables ruleset:
sudo iptables -A INPUT -p tcp --dport "$PORT" -j LOG --log-prefix "Monitor TCP $PORT: "
This makes the firewall log matching packets and, more importantly, keep per-rule packet/byte counters. We can retrieve the cumulative counts matched since the rules were added (or the counters last reset) with:
sudo iptables -L -v -n -x | grep "dpt:$port" | grep -i "Monitor TCP $port" | head -n1
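For reference, the line that grep pulls out of the counter listing looks roughly like this (the counter values below are made up); awk’s $1 is the packet counter and $2 is the byte counter:
sudo iptables -L INPUT -v -n -x | grep "dpt:31000"
#    pkts   bytes  target  prot opt in  out  source     destination
#     150   12000  LOG     tcp  --  *   *    0.0.0.0/0  0.0.0.0/0    tcp dpt:31000 LOG flags 0 level 4 prefix "Monitor TCP 31000: "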
Since iptables logs in Syslog format and the counters can be queried directly with the iptables program, it’s trivial to do some bash wrangling to:
- “Scrape” the iptables counters,
- Turn this data into a Prometheus metric file with awk.
(The following is a snippet of the original repo’s scripts.)
OUT="/var/lib/prometheus/node-exporter/bindplane_ports.prom"
: > "$OUT"  # Truncate the metrics file before writing fresh values

# SYSLOG_PORTS (array of monitored ports), CLI (client/environment label) and
# HOST (collector hostname) are defined earlier in the full script.
for port in "${SYSLOG_PORTS[@]}"; do
    # Grab the counter line for this port's monitoring rule
    line=$(sudo iptables -L -v -n -x | grep "dpt:$port" | grep -i "Monitor TCP $port" | head -n1)
    if [[ -n "$line" ]]; then
        pkts=$(echo "$line" | awk '{print $1}')   # 1st column: packet counter
        bytes=$(echo "$line" | awk '{print $2}')  # 2nd column: byte counter
        echo "syslog_port_packets_total{port=\"$port\", client=\"$CLI\", collector=\"$HOST\"} $pkts" >> "$OUT"
        echo "syslog_port_bytes_total{port=\"$port\", client=\"$CLI\", collector=\"$HOST\"} $bytes" >> "$OUT"
    else
        echo "!! No iptables rule found for port $port with log prefix 'Monitor TCP $port'"
    fi
done
When run from a cron job, this takes a snapshot of the data we want, per port. The resulting Prometheus metric files in /var/lib/prometheus/node-exporter/our-export.prom look more or less like this:
# HELP syslog_port_packets_total Total number of packets for the monitored port
# TYPE syslog_port_packets_total counter
syslog_port_packets_total{port="31000", client="homelab", collector="wazuh-manager"} 150
syslog_port_packets_total{port="31001", client="homelab", collector="wazuh-manager"} 200
# HELP syslog_port_bytes_total Total number of bytes for the monitored port
# TYPE syslog_port_bytes_total counter
syslog_port_bytes_total{port="31000", client="homelab", collector="wazuh-manager"} 12000
syslog_port_bytes_total{port="31001", client="homelab", collector="wazuh-manager"} 18000
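Since these are cumulative counters, what we actually care about in Grafana is their rate. As a rough sketch (assuming Prometheus listens on its default 9090 port), the per-port ingest rate can be queried straight from the Prometheus HTTP API:
# Ask Prometheus for the per-port byte rate over the last 5 minutes.
# localhost:9090 is Prometheus's default listen address; adjust as needed.
curl -s -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(syslog_port_bytes_total[5m])'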
Problem solved.
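For completeness, here is roughly how the pieces are wired together on each collector. The script path and five-minute interval are my own assumptions for illustration, not necessarily the repo’s exact values; the textfile-collector flag is standard node_exporter:
# Run the scraper periodically (path and schedule are illustrative):
echo '*/5 * * * * root /usr/local/bin/iptables_port_metrics.sh' | sudo tee /etc/cron.d/iptables-port-metrics

# node_exporter picks up any *.prom file in the directory the script writes to:
node_exporter --collector.textfile.directory=/var/lib/prometheus/node-exporter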
Now the path was clear, and I had a full green light to install and set up Grafana and Prometheus. I learned a couple of things along the way, including a deeper understanding of Prometheus, since my previous implementations of this stack only gave me bare-minimum performance monitoring and I had never needed custom Prometheus exporters.
The implementation #
By leveraging Tailscale’s infrastructure, I just added each component (server) to a tailnet, so the metric flows travel inside a private network over WireGuard. I was expecting to run into networking issues, but everything went very smoothly.
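Before adding each collector as a scrape target, it’s worth a quick sanity check from the Prometheus host that the exporter is reachable over the tailnet. The hostname below is a hypothetical MagicDNS name and 9100 is node_exporter’s default port; the tailnet names then simply go into prometheus.yml as static scrape targets:
# Resolve the collector's tailnet IP and hit its exporter over WireGuard:
tailscale ip -4 collector-homelab
curl -s http://collector-homelab:9100/metrics | grep syslog_port_packets_total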
Scalability #
This implementation scales well: since we identify each monitored node by environment (the client label in our Prometheus .prom files), we can group nodes by that same label, which allows any number of environments with any number of monitored nodes.
Also, since this particular project doesn’t require a lot of historical data, I can simply set Prometheus’s retention to 30 days, or even less.
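As a sketch, retention is a single Prometheus flag; where it gets set depends on the install (systemd unit, container args, etc.):
# Cap local TSDB retention at 30 days:
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=30d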
Zero Trust #
A bit outside the scope of this post, but important to note, is the application of Zero Trust principles to our Tailscale network. Yes, Tailscale applies ZT principles, but the problem is that by default our nodes can all see and connect to each other, and can even enumerate the tailnet itself with the command:
tailscale status --json
This is completely unacceptable, since our collection servers sit on potentially hostile networks (networks outside of my control). To solve this, we can segment properly with Tailscale’s ACLs and control the flows by means of tags:
| Flow | Action |
|---|---|
| Prometheus Server → “Clients” (Collectors) | Allow |
| “Clients” (Collectors) → “Clients” (Collectors) | Block |
And we achieve exactly that by tagging every component by group or purpose, and then whitelisting only the intended flows between systems: Prometheus → collectors on their telemetry port only, my administration computer’s access to (but never from) the collectors’ SSH ports, no traffic at all between collector servers, and so on. A sketch of the tagging side is shown below.
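A minimal sketch, assuming hypothetical tag names (tags must first be declared under tagOwners in the ACL policy, which is edited in the Tailscale admin console):
# On the Prometheus/Grafana server:
sudo tailscale up --advertise-tags=tag:monitoring
# On each collector:
sudo tailscale up --advertise-tags=tag:collector
# The ACL policy then only needs an "accept" rule from tag:monitoring to
# tag:collector on the exporter port; traffic not matched by an accept rule
# (including collector-to-collector) is denied.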
Results of this project #
- A LOT of time (and peace of mind) has been gained since I set this up.
- I’ve had a couple of issues since implementing this, and they were extremely easy to troubleshoot, because I can pinpoint problems A LOT faster by jumping straight to the culprit instead of wondering “what might have happened somewhere in my whole stack?”
Next Steps #
- Setting up alerting on specific thresholds with Grafana Alerting.
- Loki for scraping specific collectors’ logs.
- Monitoring Tailscale metrics.
Future challenge: HA #
I’m currently planning for High Availability; let’s see how that ends up working.