Metadata Nodes Keep Dropping from Health Net Check. #55

eekay35 · 2025-06-06T21:59:20Z

Hello!

Versions:
OS - Rocky Linux 9.5
BeeGFS - version 8.0.1
Mellanox OFED - version 24.10

Setup:
Node1 - Meta, Storage, Management, Client
Node2 - Meta, Storage
Node3 - Meta, Storage

We're currently running a 3-server setup in which every server provides metadata and storage, with a single management node (for now). Everything appears to be working fine. However, when we run "beegfs health net" from any client, node1's "node_meta_1" is always connected via RDMA (as expected), but the node2 and node3 metadata nodes seem to randomly drop out of the health net output. I can bring them back instantly by running "beegfs node ping" (which successfully pings all nodes very quickly), but this is only temporary: at some point the node2 and node3 meta nodes drop out of the health net output again.

This doesn't appear to be impacting performance or clients. Still, it is somewhat concerning to see them missing all the time. I've checked the logs and don't see any errors or any information about why they are dropping out of the health net check. All other health checks appear to be fine as well.

Does anyone have any idea why this is happening? Known issue? Is it an issue? Is there some way to track down why this is happening? Any push in the right direction would be greatly appreciated!

Here's an example with the node2 and node3 meta nodes missing:

---------------
Management Node
---------------
management [ID: 1]
   Connections: ethernet: 1 (172.30.190.17:8008);
--------------
Metadata Nodes
--------------
node_meta_1 [ID: 1]
   Connections: rdma: 1 (172.30.194.17:8005);
node_meta_2 [ID: 2]
   Connections: <none>
node_meta_3 [ID: 3]
   Connections: <none>
-------------
Storage Nodes
-------------
node_storage_1 [ID: 1]
   Connections: rdma: 1 (172.30.194.17:8003);
node_storage_2 [ID: 2]
   Connections: rdma: 1 (172.30.194.18:8003);
node_storage_3 [ID: 3]
   Connections: rdma: 1 (172.30.194.19:8003);

Then, I run a beegfs node ping (all successful) and immediately see:

---------------
Management Node
---------------
management [ID: 1]
   Connections: ethernet: 1 (172.30.190.17:8008);
--------------
Metadata Nodes
--------------
node_meta_1 [ID: 1]
   Connections: rdma: 1 (172.30.194.17:8005);
node_meta_2 [ID: 2]
   Connections: rdma: 1 (172.30.194.18:8005);
node_meta_3 [ID: 3]
   Connections: rdma: 1 (172.30.194.19:8005);
-------------
Storage Nodes
-------------
node_storage_1 [ID: 1]
   Connections: rdma: 1 (172.30.194.17:8003);
node_storage_2 [ID: 2]
   Connections: rdma: 1 (172.30.194.18:8003);
node_storage_3 [ID: 3]
   Connections: rdma: 1 (172.30.194.19:8003);

About an hour later, the node2 and node3 meta nodes are gone again.
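
One way to track down when the connections disappear would be to parse the "beegfs health net" output on a schedule and log which nodes show no connections. The snippet below is only a minimal sketch: it assumes the output format is exactly what is pasted above (which may differ between BeeGFS versions), and the function name `nodes_without_connections` is made up for illustration.

```python
import re

# Matches node header lines such as "node_meta_2 [ID: 2]" (format assumed from the output above).
NODE_RE = re.compile(r"^(\S+) \[ID: (\d+)\]$")


def nodes_without_connections(report: str) -> list[str]:
    """Return the names of nodes whose 'Connections:' line reads '<none>'."""
    missing = []
    current = None
    for raw in report.splitlines():
        line = raw.strip()
        match = NODE_RE.match(line)
        if match:
            # Remember which node the next 'Connections:' line belongs to.
            current = match.group(1)
        elif line.startswith("Connections:") and current is not None:
            if line.split(":", 1)[1].strip() == "<none>":
                missing.append(current)
            current = None
    return missing


# Excerpt of the output pasted in this post.
SAMPLE = """\
--------------
Metadata Nodes
--------------
node_meta_1 [ID: 1]
   Connections: rdma: 1 (172.30.194.17:8005);
node_meta_2 [ID: 2]
   Connections: <none>
node_meta_3 [ID: 3]
   Connections: <none>
"""

print(nodes_without_connections(SAMPLE))  # ['node_meta_2', 'node_meta_3']
```

In practice the text would come from running "beegfs health net" on a client (for example via `subprocess.run`) and the result would be written to a log with a timestamp, so the drop-outs can be compared against job activity.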

scaleoutsean · 2025-06-15T03:42:15Z

No idea what the problem is, but if that bothered me, I'd create a node ping cron job on the management node and schedule it every 1800 s.
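
A rough sketch of that idea in Python rather than cron, in case it helps: the helper name `ping_periodically` and its parameters are hypothetical, and the only BeeGFS command assumed is the "beegfs node ping" call already mentioned in this thread. A crontab entry running the same command every 30 minutes would do the same job.

```python
import subprocess
import time


def ping_periodically(command, interval_s, rounds, run=subprocess.run, sleep=time.sleep):
    """Run `command` `rounds` times, sleeping `interval_s` seconds between runs.

    Returns the list of exit codes so failures can be logged or inspected.
    `run` and `sleep` are injectable so the loop can be tried without BeeGFS installed.
    """
    exit_codes = []
    for i in range(rounds):
        result = run(command, check=False)
        exit_codes.append(result.returncode)
        if i < rounds - 1:
            sleep(interval_s)
    return exit_codes


# Example call (not executed here), pinging every 1800 s for one day:
# ping_periodically(["beegfs", "node", "ping"], interval_s=1800, rounds=48)
```

This only keeps the connections visible in the health net output; it does not explain why they go away.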

eekay35 Jun 15, 2025
Author

@scaleoutsean

I appreciate the reply! That was my first thought as well, so that's what I did. But after speaking with BeeGFS tech support, I was told that this is completely normal when no jobs are running that need the metadata nodes. So I removed the cron job, and we'll see what happens when actual jobs start running. Apparently, the meta nodes are simply smart enough to sync up when they need to. Otherwise, they just rest.

Answer selected by iamjoemccormick
