@pau-hedgehog
Contributor

No description provided.

@pau-hedgehog pau-hedgehog self-assigned this Oct 20, 2025
@pau-hedgehog pau-hedgehog requested a review from Copilot October 20, 2025 13:58

Copilot AI left a comment

Pull Request Overview

This PR adds support for an LGTM (Loki, Grafana, Tempo, Prometheus) observability node to the vlab environment. This enables comprehensive monitoring and logging capabilities for fabric deployments.

Key changes:

  • Introduces a new VMTypeObservability node type that runs the LGTM stack
  • Adds configuration support for LGTM components with version management
  • Implements network configuration for observability nodes with external internet access

Reviewed Changes

Copilot reviewed 39 out of 40 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| pkg/hhfab/vlabconfig.go | Adds observability VM type, network configuration, and port forwarding for LGTM services |
| pkg/hhfab/vlabrunner.go | Integrates observability VMs into the VLAB lifecycle and build process |
| pkg/fab/comp/lgtm/lgtm.go | Core LGTM component implementation with Helm chart and image management |
| pkg/fab/versions.go | Adds version specifications for LGTM components |
| api/fabricator/v1beta1/fabricator_types.go | Defines LGTM configuration types and version structures |
| api/fabricator/v1beta1/fabnode_types.go | Adds observability node role and external interface support |
| pkg/controller/fabricator_ctrl.go | Integrates LGTM installation into the reconciliation loop |

@pau-hedgehog
Contributor Author

pau-hedgehog commented Oct 21, 2025

You can bring the VLAB+LGTM node locally with:

just lint && just oci=http push && bin/hhfab init --registry-repo localhost:30000 --dev --obs --gw -f && bin/hhfab vlab gen --eslag-leaf-groups=2 --spines-count=2 --bundled-servers=0 --eslag-servers=1 --unbundled-servers=1 && bin/hhfab vlab up -f -v

Once the VLAB is ready, you can reach Grafana at host_ip:3000, where logs and metrics from the VLAB are readily available, including our dashboards.

You can even run the new observability tests with some small hacks in the local env:

19:08:30 INF * Running test test="Loki Observability"
19:08:30 DBG Found Loki target name=monitor url=http://loki-gateway.lgtm.svc.cluster.local/loki/api/v1/push
19:08:30 DBG Found Prometheus target name=monitor url=http://prometheus-server.lgtm.svc.cluster.local/api/v1/write
19:08:30 DBG Using Loki endpoint url=http://loki-gateway.lgtm.svc.cluster.local/loki/api/v1 auth_from_env=false env=vlab
19:08:30 INF Checking logs for devices expected="[leaf-01 leaf-02 spine-01 spine-02 gateway-1 alloy-ctrl-lskvc]"
19:08:30 DBG Loki query details hostname=leaf-01 status="200 OK" query="{hostname=\"leaf-01\", env=\"vlab\"}" entries=20 sample="Oct 25 17:08:14.776445+00:00 2025 leaf-01 WARNING syncd#syncd: :- process_packet_for_fdb_event: skipping mac learn for {\"bvid\":\"oid:0x0\",\"mac\":\"0C:20:12:FE:04:01\",\"switch_id\":\"oid:0x2100000000\"}, since BV_ID was not found for mac"
19:08:30 INF Found logs for device hostname=leaf-01 count=20 env=vlab
19:08:30 DBG Loki query details hostname=leaf-02 status="200 OK" query="{hostname=\"leaf-02\", env=\"vlab\"}" entries=20 sample="Oct 25 17:08:10.814582+00:00 2025 leaf-02 WARNING syncd#syncd: :- process_packet_for_fdb_event: skipping mac learn for {\"bvid\":\"oid:0x0\",\"mac\":\"0C:20:12:FE:03:02\",\"switch_id\":\"oid:0x2100000000\"}, since BV_ID was not found for mac"
19:08:30 INF Found logs for device hostname=leaf-02 count=20 env=vlab
19:08:30 DBG Loki query details hostname=spine-01 status="200 OK" query="{hostname=\"spine-01\", env=\"vlab\"}" entries=20 sample="Oct 25 17:08:24.596823+00:00 2025 spine-01 ERR telemetry#telemetry[52]: [xfmr_if_dropcounters.go:35] DbToYang_rpc_get_debugcounters_xfmr - drop counters not supported"
19:08:30 INF Found logs for device hostname=spine-01 count=20 env=vlab
19:08:30 DBG Loki query details hostname=spine-02 status="200 OK" query="{hostname=\"spine-02\", env=\"vlab\"}" entries=20 sample="Oct 25 17:08:02.764956+00:00 2025 spine-02 WARNING systemd[1]: /etc/systemd/system/ssh.service.d/override.conf:5: Unknown key name 'StartLimitIntervalSec' in section 'Service', ignoring."
19:08:30 INF Found logs for device hostname=spine-02 count=20 env=vlab
19:08:30 DBG Loki query details hostname=gateway-1 status="200 OK" query="{hostname=\"gateway-1\", env=\"vlab\"}" entries=8 sample="ts=2025-10-25T16:30:52.177071025Z level=warn msg=\"Skipping resharding, last successful send was beyond threshold\" component_path=/ component_id=prometheus.remote_write.monitor subcomponent=rw remote_name=4658b5 url=http://prometheus-server.lgtm.svc.cluster.local/api/v1/write lastSendTimestamp=1761409836 minSendTimestamp=1761409842\n"
19:08:30 DBG Only stale logs found for device (older than 15 minutes) hostname=gateway-1
19:08:30 DBG Loki query details hostname=alloy-ctrl-lskvc status="200 OK" query="{hostname=\"alloy-ctrl-lskvc\", env=\"vlab\"}" entries=20 sample="172.28.0.48 - - [25/Oct/2025:17:08:28 +0000]  204 \"POST /loki/api/v1/push HTTP/1.1\" 0 \"-\" \"Alloy/v1.11.2 (linux; helm)\" \"-\"\n"
19:08:30 INF Found logs for device hostname=alloy-ctrl-lskvc count=20 env=vlab
19:08:30 WRN Some devices missing logs in Loki missing_count=1 total_count=6
19:08:30 DBG Devices missing logs missing=[gateway-1]
19:08:30 INF PASS test="Loki Observability"
19:08:30 INF * Running test test="Prometheus Observability"
19:08:30 DBG Found Loki target name=monitor url=http://loki-gateway.lgtm.svc.cluster.local/loki/api/v1/push
19:08:30 DBG Found Prometheus target name=monitor url=http://prometheus-server.lgtm.svc.cluster.local/api/v1/write
19:08:30 DBG Using Prometheus endpoint url=http://prometheus-server.lgtm.svc.cluster.local/api/v1 auth_from_env=false env=vlab
19:08:30 INF Checking metrics for switches expected="[leaf-02 spine-01 spine-02 leaf-01]"
19:08:30 DBG Prometheus query details query="fabric_agent_agent_generation{env=\"vlab\"}" status="200 OK" count=4
19:08:30 DBG Prometheus metric hostname=spine-01 value=1 timestamp=2025-10-25T19:08:30+02:00
19:08:30 INF Found metric metric=fabric_agent_agent_generation switch=spine-01 value=1 env=vlab
19:08:30 DBG Prometheus metric hostname=leaf-02 value=1 timestamp=2025-10-25T19:08:30+02:00
19:08:30 INF Found metric metric=fabric_agent_agent_generation switch=leaf-02 value=1 env=vlab
19:08:30 DBG Prometheus metric hostname=leaf-01 value=1 timestamp=2025-10-25T19:08:30+02:00
19:08:30 INF Found metric metric=fabric_agent_agent_generation switch=leaf-01 value=1 env=vlab
19:08:30 DBG Prometheus metric hostname=spine-02 value=1 timestamp=2025-10-25T19:08:30+02:00
19:08:30 INF Found metric metric=fabric_agent_agent_generation switch=spine-02 value=1 env=vlab
19:08:30 INF Verified Prometheus metrics delivery metric=fabric_agent_agent_generation metrics_count=4 switches_checked=4
19:08:30 INF PASS test="Prometheus Observability"

Basically, you need to add these entries to /etc/hosts:

127.0.0.2 loki-gateway.lgtm.svc.cluster.local
127.0.0.3 prometheus-server.lgtm.svc.cluster.local

Then run this script:
start-lgtm-iptables.sh

Adds support for an LGTM (Loki, Grafana, Tempo, Prometheus)
observability node to the VLAB. Usage:

hhfab init --dev --obs

Initializes fab.yaml with an Observability type node and adds
Loki and Prometheus targets pointing to the LGTM stack.

--dev also creates a well-known password for Grafana

Signed-off-by: Pau Capdevila <[email protected]>
@pau-hedgehog
Contributor Author

pau-hedgehog commented Oct 26, 2025

@Frostman, I'll keep this in draft, but I think it's fully functional now. Many things can surely be improved. The whole idea is to have an observability-capable all-in-one VLAB for testing and demos. Thanks for your consideration.


 func redactAlloyTarget(target *alloy.Target) {
-	if target.BasicAuth.Password != "" {
+	if target.BasicAuth != nil && target.BasicAuth.Password != "" {
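
The change above guards against a nil `BasicAuth` before reading its password. A minimal sketch of why the guard matters follows. The `Target` and `BasicAuth` types here are hypothetical stand-ins, not the real alloy types, and the sketch assumes `BasicAuth` is a pointer field (as the nil check implies) and that redaction overwrites the password with a placeholder.

```go
package main

import "fmt"

// BasicAuth and Target are hypothetical stand-ins for the alloy types in the diff.
type BasicAuth struct {
	Password string
}

type Target struct {
	BasicAuth *BasicAuth // may be nil when no auth is configured
}

// redacted is a placeholder value chosen for this sketch.
const redacted = "REDACTED"

// redactTarget hides the password, if any. Without the nil check,
// a target with no BasicAuth would panic on the field access.
func redactTarget(t *Target) {
	if t.BasicAuth != nil && t.BasicAuth.Password != "" {
		t.BasicAuth.Password = redacted
	}
}

func main() {
	withAuth := &Target{BasicAuth: &BasicAuth{Password: "secret"}}
	noAuth := &Target{} // BasicAuth is nil, as for targets without credentials

	redactTarget(withAuth)
	redactTarget(noAuth) // safe only because of the nil check

	fmt.Println(withAuth.BasicAuth.Password, noAuth.BasicAuth == nil)
}
```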
Member

Could you please make a separate PR with it so we can merge it

Contributor Author

@@ -0,0 +1,201 @@
// Copyright 2025 Hedgehog
Member

I haven't dug into what's done here. Why is it required?

Contributor Author

I think I needed that in my first iteration; otherwise I was getting failed pods or similar. I will check whether I can streamline it and get rid of it.
