-
Notifications
You must be signed in to change notification settings - Fork 6
feat(agent): externals as VPCs #1033
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
🚀 Temp artifacts published: |
be37778 to
2d64064
Compare
|
🚀 Temp artifacts published: |
2d64064 to
dcf672e
Compare
|
🚀 Temp artifacts published: |
|
@Frostman I ran part of the release-test (specifically all tests with externals) in env-4 with these changes and everything worked fine. I could also manually check that we are advertising the default route via the external in the EVPN AF, and that no routes from the VPCs are advertised via the external. Obviously the real test would be running this with the gateway and checking whether external peering works + whether we can mix and match fabric and GW external peering without issues. HOWEVER, the only way to upgrade the agents was to delete all VPCs and externals, else it would complain of missing VNIs and/or gNMI errors when attempting to modifying VNIs due to collisions. Also the VNIs of the externals change depending on the number of VPCs we have in the system at any one time, which is not great (although I did not observe errors like the gNMI one while testing, once the upgrade was completed). I think we should partition the VNI space or at least have the externals come before the VPCs, given that they are less likely to change. |
dcf672e to
cebcb12
Compare
|
🚀 Temp artifacts published: |
|
I got back to this today and decided to try it in a VM (with the help of the virt-external branch) so that I could try the end-to-end flow with the gateway, but at some point I realized that the external routes were not being advertised to the spines from the attached leaf. The only indication that something was wrong, other than not seeing the routes, was the output of this command, on the leaf attached to the external, whose VRF has VNI 700: compare it to a VNI of a VPC VRF: I'm not sure if something got broken since I first tested this, or it has to do with virtual switches (as opposed to the hybrid envs where I had tried it before). I'm going to double check the output of this command on a hybrid env to make sure I'm not yet again wasting time due to VS limitations. |
cebcb12 to
83be71c
Compare
|
🚀 Temp artifacts published: |
no, env-1 shows the same behavior. Either something broke it along the way, or I had hallucinations when I thought it worked in env-4? |
|
OK, problem solved - I was missing the dedicated IRB VLAN. Manually adding it solved the issue: and now on the spine: I must update the branch to also add this |
|
after making the changes above and manually creating a gateway peering between vpc-3 and the "external-01" VPC (which the gateway sees as a standard VPC), I can see that pings from the peered VPC get to the virtual external, but we are not able to send responses back, because the external does not have a route for that subnet to send it back adding a static route for the IPv4 namespace to the external attachment interface Looks like this is interfering with the local NAT? how are we supposed to handle NAT anyway when there's both the gateway and the external potentially doing it? |
|
🚀 Temp artifacts published: |
bd3dd21 to
488e5b1
Compare
|
🚀 Temp artifacts published: |
|
@Frostman main thing that is missing now is updating |
7f7ba71 to
d129a26
Compare
|
🚀 Temp artifacts published: |
I've replicated what we do for VPCs and I manually confirmed that deleting/creating an external deletes/creates the corresponding VPCInfo object. Note however that I'm always associating the Anyway, tagging as ready for review, as it's mature enough at least as a base for further discussion |
d129a26 to
69d57f7
Compare
|
🚀 Temp artifacts published: |
69d57f7 to
7e5a796
Compare
|
🚀 Temp artifacts published: |
7e5a796 to
8ae043a
Compare
4cd5cbc to
0e89f55
Compare
|
🚀 Temp artifacts published: |
0e89f55 to
d94cbdd
Compare
|
🚀 Temp artifacts published: |
|
@Frostman copying here what we discussed via slack so the information is all in one place: main issue I see with this PR is the VNI space shared between externals and VPCs. We do allocations sequentially for all objects of the same kind and then move to the other type, specifically right now we do externals first and then VPCs. This means that:
maybe if we wanted to keep a single VNI space and address the problems above we could consider checking the existing allocations first and only allocating new stuff / removing old ones, but that complicates the code and (for deletions) forces us to deal with gaps in the allocated resource space. Of course none of this would be a problem if we had proper declarative APIs and the ability to replace config in one go... |
d94cbdd to
5630063
Compare
|
@Frostman I've taken your comment on doing things like we do for the loopback workaround and gave it a shot myself. It's ugly because I don't want to change how we store VNIs for VPCs (for backwards compatibility), so we have to do a double translation, but it appears to work. I'm sure it could be further optimized but at least it will allow us to move this forward faster - I've kept it as a separate commit anyway in case we want to take a different approach. |
|
🚀 Temp artifacts published: |
add l3vni to external VRFs, advertise routes we learn from the external (filtered by BGP community) over EVPN. This way the gateway will see them just as another VPC and we will be able to do external peering via GW with no special code. add also the IRB VLAN, else routes from the external are not going to be advertised over BGP EVPN. for the external outbound route-map, pre-emptively allow any prefix belonging to the external's ipv4namespace, as opposed to selectively adding VPC prefixes to a prefix-list depending on the external peering. this way, routes learned via the gateway will be advertised to the external, which allows traffic to be routed back to the VPC. place externals VNIs and IRB VLANs before the VPCs, as they share the same space and are less likely to change dynamically. This way we will not see a different external VNI/VLAN every time we create or delete a VPC. make the catalog errors warnings for now, to allow agent upgrade. these should be made into full blown errors eventually. Signed-off-by: Emanuele Di Pascale <[email protected]>
Signed-off-by: Emanuele Di Pascale <[email protected]>
Signed-off-by: Emanuele Di Pascale <[email protected]>
Signed-off-by: Sergei Lukianov <[email protected]>
5630063 to
abafe6f
Compare
Signed-off-by: Sergei Lukianov <[email protected]>
abafe6f to
d69beae
Compare
|
🚀 Temp artifacts published: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR extends the VNI (Virtual Network Identifier) allocation system to support External resources in addition to VPCs. The changes unify VPC and External VNI management under a common catalog, ensuring both resource types receive proper VNI allocations and IRB VLAN assignments.
Key changes:
- Renamed
UpdateVPCstoUpdateVNIsto reflect that it now handles both VPCs and Externals - Consolidated catalog name from
CatVPCstoCatVNIswith updated documentation - Added
GetExternalVNImethod and newGwExternalSynccontroller for External resource reconciliation - Updated prefix constants from
LoWorkaround*to cleanerReq*names
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| pkg/manager/librarian/librarian.go | Core logic changes: renamed UpdateVPCs to UpdateVNIs, added External VNI allocation, updated catalog handling for both VPCs and Externals |
| pkg/hhfctl/inspect/conn.go | Updated constant references from LoWorkaroundReqPrefix* to ReqPrefix* |
| pkg/ctrl/vpc_ctrl.go | Updated VPC reconciler to call renamed UpdateVNIs method |
| pkg/ctrl/gw_sync.go | Added GwExternalSync reconciler for External resources with VPCInfo synchronization |
| pkg/ctrl/agent_ctrl.go | Updated agent reconciler to handle External VNIs and call UpdateVNIs |
| pkg/agent/dozer/bcm/plan.go | Removed unused prefix list logic, added IRB VLAN and VNI configuration for externals, updated to use ReqForExt |
| config/rbac/role.yaml | Consolidated RBAC rules for externals and vpcs finalizers |
| config/crd/bases/agent.githedgehog.com_catalogs.yaml | Updated field descriptions to mention Externals with ext@ prefix |
| config/crd/bases/agent.githedgehog.com_agents.yaml | Updated field descriptions to mention Externals with ext@ prefix |
| cmd/main.go | Added setup call for GwExternalSync controller |
| api/vpc/v1beta1/vpc_types.go | Added validation to prevent VPC names from starting with VPCInfoExtPrefix |
| api/vpc/v1beta1/external_types.go | Added VPCInfoExtPrefix constant definition |
| api/agent/v1beta1/catalog_types.go | Updated field comments to document External inclusion |
Comments suppressed due to low confidence (1)
pkg/ctrl/gw_sync.go:214
- The hardcoded subnet entry with key "external" and CIDR "0.0.0.0/0" is a placeholder as indicated by the FIXME comment above. This magic string and default route could cause issues if other code depends on accurate subnet information. Consider using a more descriptive key or documenting this placeholder pattern more explicitly until the proper implementation is added.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
🚀 Temp artifacts published: |
Signed-off-by: Emanuele Di Pascale <[email protected]>
|
🚀 Temp artifacts published: |
add l3vni to external VRFs, advertise routes we learn from the external (filtered by BGP community) over EVPN. This way the gateway will see them just as another VPC and we will be able to do external peering via GW with no special code