chore(vlab): add lgtm observability node #1051

pau-hedgehog · 2025-10-20T13:58:46Z

No description provided.

Copilot

Pull Request Overview

This PR adds support for an LGTM (Loki, Grafana, Tempo, Prometheus) observability node to the vlab environment. This enables comprehensive monitoring and logging capabilities for fabric deployments.

Key changes:

Introduces a new VMTypeObservability node type that runs the LGTM stack
Adds configuration support for LGTM components with version management
Implements network configuration for observability nodes with external internet access

Reviewed Changes

Copilot reviewed 39 out of 40 changed files in this pull request and generated 1 comment.

File	Description
pkg/hhfab/vlabconfig.go	Adds observability VM type, network configuration, and port forwarding for LGTM services
pkg/hhfab/vlabrunner.go	Integrates observability VMs into the VLAB lifecycle and build process
pkg/fab/comp/lgtm/lgtm.go	Core LGTM component implementation with Helm chart and image management
pkg/fab/versions.go	Adds version specifications for LGTM components
api/fabricator/v1beta1/fabricator_types.go	Defines LGTM configuration types and version structures
api/fabricator/v1beta1/fabnode_types.go	Adds observability node role and external interface support
pkg/controller/fabricator_ctrl.go	Integrates LGTM installation into the reconciliation loop

pkg/fab/comp/lgtm/grafana_crm.json

pau-hedgehog · 2025-10-21T08:06:11Z

You can bring the VLAB+LGTM node locally with:

just lint && just oci=http push && bin/hhfab init --registry-repo localhost:30000 --dev --obs --gw -f && bin/hhfab vlab gen --eslag-leaf-groups=2 --spines-count=2 --bundled-servers=0 --eslag-servers=1 --unbundled-servers=1 && bin/hhfab vlab up -f -v

Once the VLAB is ready, you can reach Grafana at host_ip:3000, with logs and metrics from the VLAB readily available, including our dashboards.

You can even run the new observability tests with some small hacks in the local env:

19:08:30 INF * Running test test="Loki Observability"
19:08:30 DBG Found Loki target name=monitor url=http://loki-gateway.lgtm.svc.cluster.local/loki/api/v1/push
19:08:30 DBG Found Prometheus target name=monitor url=http://prometheus-server.lgtm.svc.cluster.local/api/v1/write
19:08:30 DBG Using Loki endpoint url=http://loki-gateway.lgtm.svc.cluster.local/loki/api/v1 auth_from_env=false env=vlab
19:08:30 INF Checking logs for devices expected="[leaf-01 leaf-02 spine-01 spine-02 gateway-1 alloy-ctrl-lskvc]"
19:08:30 DBG Loki query details hostname=leaf-01 status="200 OK" query="{hostname=\"leaf-01\", env=\"vlab\"}" entries=20 sample="Oct 25 17:08:14.776445+00:00 2025 leaf-01 WARNING syncd#syncd: :- process_packet_for_fdb_event: skipping mac learn for {\"bvid\":\"oid:0x0\",\"mac\":\"0C:20:12:FE:04:01\",\"switch_id\":\"oid:0x2100000000\"}, since BV_ID was not found for mac"
19:08:30 INF Found logs for device hostname=leaf-01 count=20 env=vlab
19:08:30 DBG Loki query details hostname=leaf-02 status="200 OK" query="{hostname=\"leaf-02\", env=\"vlab\"}" entries=20 sample="Oct 25 17:08:10.814582+00:00 2025 leaf-02 WARNING syncd#syncd: :- process_packet_for_fdb_event: skipping mac learn for {\"bvid\":\"oid:0x0\",\"mac\":\"0C:20:12:FE:03:02\",\"switch_id\":\"oid:0x2100000000\"}, since BV_ID was not found for mac"
19:08:30 INF Found logs for device hostname=leaf-02 count=20 env=vlab
19:08:30 DBG Loki query details hostname=spine-01 status="200 OK" query="{hostname=\"spine-01\", env=\"vlab\"}" entries=20 sample="Oct 25 17:08:24.596823+00:00 2025 spine-01 ERR telemetry#telemetry[52]: [xfmr_if_dropcounters.go:35] DbToYang_rpc_get_debugcounters_xfmr - drop counters not supported"
19:08:30 INF Found logs for device hostname=spine-01 count=20 env=vlab
19:08:30 DBG Loki query details hostname=spine-02 status="200 OK" query="{hostname=\"spine-02\", env=\"vlab\"}" entries=20 sample="Oct 25 17:08:02.764956+00:00 2025 spine-02 WARNING systemd[1]: /etc/systemd/system/ssh.service.d/override.conf:5: Unknown key name 'StartLimitIntervalSec' in section 'Service', ignoring."
19:08:30 INF Found logs for device hostname=spine-02 count=20 env=vlab
19:08:30 DBG Loki query details hostname=gateway-1 status="200 OK" query="{hostname=\"gateway-1\", env=\"vlab\"}" entries=8 sample="ts=2025-10-25T16:30:52.177071025Z level=warn msg=\"Skipping resharding, last successful send was beyond threshold\" component_path=/ component_id=prometheus.remote_write.monitor subcomponent=rw remote_name=4658b5 url=http://prometheus-server.lgtm.svc.cluster.local/api/v1/write lastSendTimestamp=1761409836 minSendTimestamp=1761409842\n"
19:08:30 DBG Only stale logs found for device (older than 15 minutes) hostname=gateway-1
19:08:30 DBG Loki query details hostname=alloy-ctrl-lskvc status="200 OK" query="{hostname=\"alloy-ctrl-lskvc\", env=\"vlab\"}" entries=20 sample="172.28.0.48 - - [25/Oct/2025:17:08:28 +0000]  204 \"POST /loki/api/v1/push HTTP/1.1\" 0 \"-\" \"Alloy/v1.11.2 (linux; helm)\" \"-\"\n"
19:08:30 INF Found logs for device hostname=alloy-ctrl-lskvc count=20 env=vlab
19:08:30 WRN Some devices missing logs in Loki missing_count=1 total_count=6
19:08:30 DBG Devices missing logs missing=[gateway-1]
19:08:30 INF PASS test="Loki Observability"
19:08:30 INF * Running test test="Prometheus Observability"
19:08:30 DBG Found Loki target name=monitor url=http://loki-gateway.lgtm.svc.cluster.local/loki/api/v1/push
19:08:30 DBG Found Prometheus target name=monitor url=http://prometheus-server.lgtm.svc.cluster.local/api/v1/write
19:08:30 DBG Using Prometheus endpoint url=http://prometheus-server.lgtm.svc.cluster.local/api/v1 auth_from_env=false env=vlab
19:08:30 INF Checking metrics for switches expected="[leaf-02 spine-01 spine-02 leaf-01]"
19:08:30 DBG Prometheus query details query="fabric_agent_agent_generation{env=\"vlab\"}" status="200 OK" count=4
19:08:30 DBG Prometheus metric hostname=spine-01 value=1 timestamp=2025-10-25T19:08:30+02:00
19:08:30 INF Found metric metric=fabric_agent_agent_generation switch=spine-01 value=1 env=vlab
19:08:30 DBG Prometheus metric hostname=leaf-02 value=1 timestamp=2025-10-25T19:08:30+02:00
19:08:30 INF Found metric metric=fabric_agent_agent_generation switch=leaf-02 value=1 env=vlab
19:08:30 DBG Prometheus metric hostname=leaf-01 value=1 timestamp=2025-10-25T19:08:30+02:00
19:08:30 INF Found metric metric=fabric_agent_agent_generation switch=leaf-01 value=1 env=vlab
19:08:30 DBG Prometheus metric hostname=spine-02 value=1 timestamp=2025-10-25T19:08:30+02:00
19:08:30 INF Found metric metric=fabric_agent_agent_generation switch=spine-02 value=1 env=vlab
19:08:30 INF Verified Prometheus metrics delivery metric=fabric_agent_agent_generation metrics_count=4 switches_checked=4
19:08:30 INF PASS test="Prometheus Observability"
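For readers following the log output, here is a minimal sketch of how a per-device Loki check like the one logged above could be assembled. It is not the test code from this PR: the function names `lokiSelector`, `lokiQueryURL`, and `isStale` are hypothetical, and the use of Loki's `query_range` endpoint is an assumption (the log only shows the base URL and the selector). The 15-minute threshold is taken from the "Only stale logs found" line.

```go
package main

import (
	"fmt"
	"net/url"
	"time"
)

// staleAfter mirrors the "older than 15 minutes" wording in the log output.
const staleAfter = 15 * time.Minute

// lokiSelector builds the LogQL stream selector seen in the log,
// e.g. {hostname="leaf-01", env="vlab"}. Hypothetical helper.
func lokiSelector(hostname, env string) string {
	return fmt.Sprintf("{hostname=%q, env=%q}", hostname, env)
}

// lokiQueryURL appends a query_range request to the base API URL.
// Using query_range here is an assumption, not confirmed by the PR.
func lokiQueryURL(base, hostname, env string, limit int) string {
	params := url.Values{}
	params.Set("query", lokiSelector(hostname, env))
	params.Set("limit", fmt.Sprint(limit))
	return base + "/query_range?" + params.Encode()
}

// isStale reports whether a log entry of the given age should be
// treated as stale rather than as proof of current log delivery.
func isStale(age time.Duration) bool {
	return age > staleAfter
}

func main() {
	base := "http://loki-gateway.lgtm.svc.cluster.local/loki/api/v1"
	fmt.Println(lokiQueryURL(base, "leaf-01", "vlab", 20))

	// gateway-1's newest entry was roughly 38 minutes old in the run above.
	fmt.Println(isStale(38 * time.Minute))
}
```

Treating stale entries separately explains why gateway-1 returned 8 entries with status 200 yet still appears under "Devices missing logs".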

Basically you need to add these entries in /etc/hosts:

127.0.0.2 loki-gateway.lgtm.svc.cluster.local
127.0.0.3 prometheus-server.lgtm.svc.cluster.local

And run this script
start-lgtm-iptables.sh
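Before running the tests locally, it can help to confirm that both host entries are in place. The following is a rough sketch of such a check, assuming the two mappings listed above; `parseHosts`, `missingEntries`, and `required` are hypothetical names for illustration and are not part of hhfab.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// required holds the two mappings listed in the comment above.
var required = map[string]string{
	"loki-gateway.lgtm.svc.cluster.local":      "127.0.0.2",
	"prometheus-server.lgtm.svc.cluster.local": "127.0.0.3",
}

// parseHosts returns a hostname -> IP map from hosts-file formatted text.
// The first mapping for a name wins; comments and blank lines are ignored.
func parseHosts(content string) map[string]string {
	out := map[string]string{}
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.IndexByte(line, '#'); i >= 0 {
			line = line[:i] // strip trailing comment
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		for _, name := range fields[1:] {
			if _, seen := out[name]; !seen {
				out[name] = fields[0]
			}
		}
	}
	return out
}

// missingEntries lists, in sorted order, the wanted names that are
// absent or mapped to a different IP.
func missingEntries(content string, want map[string]string) []string {
	have := parseHosts(content)
	var missing []string
	for name, ip := range want {
		if have[name] != ip {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	sample := "127.0.0.1 localhost\n127.0.0.2 loki-gateway.lgtm.svc.cluster.local\n"
	fmt.Println(missingEntries(sample, required))
}
```

In practice the content would come from reading /etc/hosts; the sample string keeps the sketch self-contained.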

Adds support for an LGTM (Loki, Grafana, Tempo, Prometheus) observability node to the VLAB.

Usage: hhfab init --dev --obs

Initializes fab.yaml with an Observability type node and adds Loki and Prometheus targets pointing to the LGTM stack. --dev also creates a well-known password for Grafana.

Signed-off-by: Pau Capdevila <[email protected]>

pau-hedgehog · 2025-10-26T09:28:09Z

@Frostman, I'll keep this in draft, but I think it's fully functional now. Many things can be improved for sure. The whole idea is to have an observability-capable all-in-one VLAB for testing/demos. Thanks for your consideration.

Frostman · 2025-10-27T05:23:17Z

pkg/support/kuberesources.go


 func redactAlloyTarget(target *alloy.Target) {
-	if target.BasicAuth.Password != "" {
+	if target.BasicAuth != nil && target.BasicAuth.Password != "" {


Could you please make a separate PR with this fix so we can merge it?
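To illustrate why the added nil check matters, here is a small self-contained sketch. The `Target` and `BasicAuth` types below are hypothetical stand-ins for the alloy types in the diff (only the fields needed here), and the "REDACTED" placeholder is an assumption, since the diff does not show what the real function writes.

```go
package main

import "fmt"

// BasicAuth and Target are hypothetical stand-ins for the alloy types
// touched by the diff; only the fields needed for the illustration exist.
type BasicAuth struct {
	Username string
	Password string
}

type Target struct {
	URL       string
	BasicAuth *BasicAuth // optional; nil when the target has no credentials
}

// redactTarget masks the password only when credentials are present.
// Without the nil check, a target with no BasicAuth would cause a
// nil pointer dereference on t.BasicAuth.Password.
func redactTarget(t *Target) {
	if t.BasicAuth != nil && t.BasicAuth.Password != "" {
		t.BasicAuth.Password = "REDACTED" // placeholder value is an assumption
	}
}

func main() {
	// A target like the in-cluster LGTM one, configured without credentials.
	noAuth := &Target{URL: "http://loki-gateway.lgtm.svc.cluster.local/loki/api/v1/push"}
	redactTarget(noAuth)

	withAuth := &Target{
		URL:       "http://example.invalid/push",
		BasicAuth: &BasicAuth{Username: "user", Password: "secret"},
	}
	redactTarget(withAuth)

	fmt.Println(noAuth.BasicAuth == nil, withAuth.BasicAuth.Password)
}
```

The targets created for the LGTM node appear to carry no basic-auth credentials (the test log shows auth_from_env=false), which is presumably how the unguarded dereference surfaced.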

Frostman · 2025-10-27T05:23:43Z

pkg/fab/comp/k3s/localpathconfig.go

@@ -0,0 +1,201 @@
+// Copyright 2025 Hedgehog


I haven't dug into what's done here; why is it required?

I think I needed that in my first iteration, otherwise I was getting failed pods or similar. But I will check if I can streamline and get rid of it

pau-hedgehog self-assigned this Oct 20, 2025

pau-hedgehog requested a review from Copilot October 20, 2025 13:58

Copilot AI reviewed Oct 20, 2025

pau-hedgehog force-pushed the pau/vlab_lgtm branch from 1f99449 to b895d06 Compare October 21, 2025 07:50

pau-hedgehog force-pushed the pau/vlab_lgtm branch from b895d06 to 33a471c Compare October 26, 2025 09:21

Frostman reviewed Oct 27, 2025
