Skip to content

Commit 016d9fb

Browse files
dledfordrolandd
authored andcommitted
IPoIB: fix MCAST_FLAG_BUSY usage
Commit a9c8ba5 ("IPoIB: Fix usage of uninitialized multicast objects") added a new flag MCAST_JOIN_STARTED, but was not very strict in how it was used. We didn't always initialize the completion struct before we set the flag, and we didn't always call complete on the completion struct from all paths that complete it. This made it less than totally effective, and certainly made its use confusing. And in the flush function we would use the presence of this flag to signal that we should wait on the completion struct, but we never cleared this flag, ever. This is further muddied by the fact that we overload the MCAST_FLAG_BUSY flag to mean two different things: we have a join in flight, and we have succeeded in getting an ib_sa_join_multicast. In order to make things clearer and aid in resolving the rtnl deadlock bug I've been chasing, I cleaned this up a bit. 1) Remove the MCAST_JOIN_STARTED flag entirely 2) Un-overload MCAST_FLAG_BUSY so it now only means a join is in-flight 3) Test on mcast->mc directly to see if we have completed ib_sa_join_multicast (using IS_ERR_OR_NULL) 4) Make sure that before setting MCAST_FLAG_BUSY we always initialize the mcast->done completion struct 5) Make sure that before calling complete(&mcast->done), we always clear the MCAST_FLAG_BUSY bit 6) Take the mcast_mutex before we call ib_sa_multicast_join and also take the mutex in our join callback. This forces ib_sa_multicast_join to return and set mcast->mc before we process the callback. This way, our callback can safely clear mcast->mc if there is an error on the join and we will do the right thing as a result in mcast_dev_flush. 7) Because we need the mutex to synchronize mcast->mc, we can no longer call mcast_sendonly_join directly from mcast_send and instead must add sendonly join processing to the mcast_join_task A number of different races are resolved with these changes. These races existed with the old MCAST_FLAG_BUSY usage, the MCAST_JOIN_STARTED flag was an attempt to address them, and while it helped, a determined effort could still trip things up. One race looks something like this: Thread 1 Thread 2 ib_sa_join_multicast (as part of running restart mcast task) alloc member call callback ifconfig ib0 down wait_for_completion callback call completes wait_for_completion in mcast_dev_flush completes mcast->mc is PTR_ERR_OR_NULL so we skip ib_sa_leave_multicast return from callback return from ib_sa_join_multicast set mcast->mc = return from ib_sa_multicast We now have a permanently unbalanced join/leave issue that trips up the refcounting in core/multicast.c Another like this: Thread 1 Thread 2 Thread 3 ib_sa_multicast_join ifconfig ib0 down priv->broadcast = NULL join_complete wait_for_completion mcast->mc is not yet set, so don't clear return from ib_sa_join_multicast and set mcast->mc complete return -EAGAIN (making mcast->mc invalid) call ib_sa_multicast_leave on invalid mcast->mc, hang forever By holding the mutex around ib_sa_multicast_join and taking the mutex early in the callback, we force mcast->mc to be valid at the time we run the callback. This allows us to clear mcast->mc if there is an error and the join is going to fail. We do this before we complete the mcast. In this way, mcast_dev_flush always sees consistent state in regards to mcast->mc membership at the time that the wait_for_completion() returns. Signed-off-by: Doug Ledford <[email protected]> Signed-off-by: Roland Dreier <[email protected]>
1 parent 67d7209 commit 016d9fb

File tree

2 files changed

+101
-57
lines changed

2 files changed

+101
-57
lines changed

drivers/infiniband/ulp/ipoib/ipoib.h

Lines changed: 8 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -98,9 +98,15 @@ enum {
9898

9999
IPOIB_MCAST_FLAG_FOUND = 0, /* used in set_multicast_list */
100100
IPOIB_MCAST_FLAG_SENDONLY = 1,
101-
IPOIB_MCAST_FLAG_BUSY = 2, /* joining or already joined */
101+
/*
102+
* For IPOIB_MCAST_FLAG_BUSY
103+
* When set, in flight join and mcast->mc is unreliable
104+
* When clear and mcast->mc IS_ERR_OR_NULL, need to restart or
105+
* haven't started yet
106+
* When clear and mcast->mc is valid pointer, join was successful
107+
*/
108+
IPOIB_MCAST_FLAG_BUSY = 2,
102109
IPOIB_MCAST_FLAG_ATTACHED = 3,
103-
IPOIB_MCAST_JOIN_STARTED = 4,
104110

105111
MAX_SEND_CQE = 16,
106112
IPOIB_CM_COPYBREAK = 256,

drivers/infiniband/ulp/ipoib/ipoib_multicast.c

Lines changed: 93 additions & 55 deletions
Original file line numberDiff line numberDiff line change
@@ -271,16 +271,27 @@ ipoib_mcast_sendonly_join_complete(int status,
271271
struct ipoib_mcast *mcast = multicast->context;
272272
struct net_device *dev = mcast->dev;
273273

274+
/*
275+
* We have to take the mutex to force mcast_sendonly_join to
276+
* return from ib_sa_multicast_join and set mcast->mc to a
277+
* valid value. Otherwise we were racing with ourselves in
278+
* that we might fail here, but get a valid return from
279+
* ib_sa_multicast_join after we had cleared mcast->mc here,
280+
* resulting in mis-matched joins and leaves and a deadlock
281+
*/
282+
mutex_lock(&mcast_mutex);
283+
274284
/* We trap for port events ourselves. */
275285
if (status == -ENETRESET)
276-
return 0;
286+
goto out;
277287

278288
if (!status)
279289
status = ipoib_mcast_join_finish(mcast, &multicast->rec);
280290

281291
if (status) {
282292
if (mcast->logcount++ < 20)
283-
ipoib_dbg_mcast(netdev_priv(dev), "multicast join failed for %pI6, status %d\n",
293+
ipoib_dbg_mcast(netdev_priv(dev), "sendonly multicast "
294+
"join failed for %pI6, status %d\n",
284295
mcast->mcmember.mgid.raw, status);
285296

286297
/* Flush out any queued packets */
@@ -290,11 +301,15 @@ ipoib_mcast_sendonly_join_complete(int status,
290301
dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
291302
}
292303
netif_tx_unlock_bh(dev);
293-
294-
/* Clear the busy flag so we try again */
295-
status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY,
296-
&mcast->flags);
297304
}
305+
out:
306+
clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
307+
if (status)
308+
mcast->mc = NULL;
309+
complete(&mcast->done);
310+
if (status == -ENETRESET)
311+
status = 0;
312+
mutex_unlock(&mcast_mutex);
298313
return status;
299314
}
300315

@@ -312,19 +327,24 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
312327
int ret = 0;
313328

314329
if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) {
315-
ipoib_dbg_mcast(priv, "device shutting down, no multicast joins\n");
330+
ipoib_dbg_mcast(priv, "device shutting down, no sendonly "
331+
"multicast joins\n");
316332
return -ENODEV;
317333
}
318334

319-
if (test_and_set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) {
320-
ipoib_dbg_mcast(priv, "multicast entry busy, skipping\n");
335+
if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) {
336+
ipoib_dbg_mcast(priv, "multicast entry busy, skipping "
337+
"sendonly join\n");
321338
return -EBUSY;
322339
}
323340

324341
rec.mgid = mcast->mcmember.mgid;
325342
rec.port_gid = priv->local_gid;
326343
rec.pkey = cpu_to_be16(priv->pkey);
327344

345+
mutex_lock(&mcast_mutex);
346+
init_completion(&mcast->done);
347+
set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
328348
mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca,
329349
priv->port, &rec,
330350
IB_SA_MCMEMBER_REC_MGID |
@@ -337,12 +357,14 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
337357
if (IS_ERR(mcast->mc)) {
338358
ret = PTR_ERR(mcast->mc);
339359
clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
340-
ipoib_warn(priv, "ib_sa_join_multicast failed (ret = %d)\n",
341-
ret);
360+
complete(&mcast->done);
361+
ipoib_warn(priv, "ib_sa_join_multicast for sendonly join "
362+
"failed (ret = %d)\n", ret);
342363
} else {
343-
ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting join\n",
344-
mcast->mcmember.mgid.raw);
364+
ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting "
365+
"sendonly join\n", mcast->mcmember.mgid.raw);
345366
}
367+
mutex_unlock(&mcast_mutex);
346368

347369
return ret;
348370
}
@@ -390,60 +412,64 @@ static int ipoib_mcast_join_complete(int status,
390412
ipoib_dbg_mcast(priv, "join completion for %pI6 (status %d)\n",
391413
mcast->mcmember.mgid.raw, status);
392414

415+
/*
416+
* We have to take the mutex to force mcast_join to
417+
* return from ib_sa_multicast_join and set mcast->mc to a
418+
* valid value. Otherwise we were racing with ourselves in
419+
* that we might fail here, but get a valid return from
420+
* ib_sa_multicast_join after we had cleared mcast->mc here,
421+
* resulting in mis-matched joins and leaves and a deadlock
422+
*/
423+
mutex_lock(&mcast_mutex);
424+
393425
/* We trap for port events ourselves. */
394-
if (status == -ENETRESET) {
395-
status = 0;
426+
if (status == -ENETRESET)
396427
goto out;
397-
}
398428

399429
if (!status)
400430
status = ipoib_mcast_join_finish(mcast, &multicast->rec);
401431

402432
if (!status) {
403433
mcast->backoff = 1;
404-
mutex_lock(&mcast_mutex);
405434
if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
406435
queue_delayed_work(ipoib_workqueue,
407436
&priv->mcast_task, 0);
408-
mutex_unlock(&mcast_mutex);
409437

410438
/*
411439
* Defer carrier on work to ipoib_workqueue to avoid a
412440
* deadlock on rtnl_lock here.
413441
*/
414442
if (mcast == priv->broadcast)
415443
queue_work(ipoib_workqueue, &priv->carrier_on_task);
416-
417-
status = 0;
418-
goto out;
419-
}
420-
421-
if (mcast->logcount++ < 20) {
422-
if (status == -ETIMEDOUT || status == -EAGAIN) {
423-
ipoib_dbg_mcast(priv, "multicast join failed for %pI6, status %d\n",
424-
mcast->mcmember.mgid.raw, status);
425-
} else {
426-
ipoib_warn(priv, "multicast join failed for %pI6, status %d\n",
427-
mcast->mcmember.mgid.raw, status);
444+
} else {
445+
if (mcast->logcount++ < 20) {
446+
if (status == -ETIMEDOUT || status == -EAGAIN) {
447+
ipoib_dbg_mcast(priv, "multicast join failed for %pI6, status %d\n",
448+
mcast->mcmember.mgid.raw, status);
449+
} else {
450+
ipoib_warn(priv, "multicast join failed for %pI6, status %d\n",
451+
mcast->mcmember.mgid.raw, status);
452+
}
428453
}
429-
}
430-
431-
mcast->backoff *= 2;
432-
if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
433-
mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
434-
435-
/* Clear the busy flag so we try again */
436-
status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
437454

438-
mutex_lock(&mcast_mutex);
455+
mcast->backoff *= 2;
456+
if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
457+
mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
458+
}
459+
out:
439460
spin_lock_irq(&priv->lock);
440-
if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
461+
clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
462+
if (status)
463+
mcast->mc = NULL;
464+
complete(&mcast->done);
465+
if (status == -ENETRESET)
466+
status = 0;
467+
if (status && test_bit(IPOIB_MCAST_RUN, &priv->flags))
441468
queue_delayed_work(ipoib_workqueue, &priv->mcast_task,
442469
mcast->backoff * HZ);
443470
spin_unlock_irq(&priv->lock);
444471
mutex_unlock(&mcast_mutex);
445-
out:
446-
complete(&mcast->done);
472+
447473
return status;
448474
}
449475

@@ -492,10 +518,9 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
492518
rec.hop_limit = priv->broadcast->mcmember.hop_limit;
493519
}
494520

495-
set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
521+
mutex_lock(&mcast_mutex);
496522
init_completion(&mcast->done);
497-
set_bit(IPOIB_MCAST_JOIN_STARTED, &mcast->flags);
498-
523+
set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
499524
mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port,
500525
&rec, comp_mask, GFP_KERNEL,
501526
ipoib_mcast_join_complete, mcast);
@@ -509,13 +534,12 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
509534
if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
510535
mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
511536

512-
mutex_lock(&mcast_mutex);
513537
if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
514538
queue_delayed_work(ipoib_workqueue,
515539
&priv->mcast_task,
516540
mcast->backoff * HZ);
517-
mutex_unlock(&mcast_mutex);
518541
}
542+
mutex_unlock(&mcast_mutex);
519543
}
520544

521545
void ipoib_mcast_join_task(struct work_struct *work)
@@ -568,31 +592,42 @@ void ipoib_mcast_join_task(struct work_struct *work)
568592
}
569593

570594
if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
571-
if (!test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags))
595+
if (IS_ERR_OR_NULL(priv->broadcast->mc) &&
596+
!test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags))
572597
ipoib_mcast_join(dev, priv->broadcast, 0);
573598
return;
574599
}
575600

576601
while (1) {
577602
struct ipoib_mcast *mcast = NULL;
578603

604+
/*
605+
* Need the mutex so our flags are consistent, need the
606+
* priv->lock so we don't race with list removals in either
607+
* mcast_dev_flush or mcast_restart_task
608+
*/
609+
mutex_lock(&mcast_mutex);
579610
spin_lock_irq(&priv->lock);
580611
list_for_each_entry(mcast, &priv->multicast_list, list) {
581-
if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)
582-
&& !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)
583-
&& !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
612+
if (IS_ERR_OR_NULL(mcast->mc) &&
613+
!test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) &&
614+
!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
584615
/* Found the next unjoined group */
585616
break;
586617
}
587618
}
588619
spin_unlock_irq(&priv->lock);
620+
mutex_unlock(&mcast_mutex);
589621

590622
if (&mcast->list == &priv->multicast_list) {
591623
/* All done */
592624
break;
593625
}
594626

595-
ipoib_mcast_join(dev, mcast, 1);
627+
if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
628+
ipoib_mcast_sendonly_join(mcast);
629+
else
630+
ipoib_mcast_join(dev, mcast, 1);
596631
return;
597632
}
598633

@@ -638,6 +673,9 @@ static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
638673
int ret = 0;
639674

640675
if (test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
676+
ipoib_warn(priv, "ipoib_mcast_leave on an in-flight join\n");
677+
678+
if (!IS_ERR_OR_NULL(mcast->mc))
641679
ib_sa_free_multicast(mcast->mc);
642680

643681
if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
@@ -690,6 +728,8 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
690728
memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
691729
__ipoib_mcast_add(dev, mcast);
692730
list_add_tail(&mcast->list, &priv->multicast_list);
731+
if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
732+
queue_delayed_work(ipoib_workqueue, &priv->mcast_task, 0);
693733
}
694734

695735
if (!mcast->ah) {
@@ -703,8 +743,6 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
703743
if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
704744
ipoib_dbg_mcast(priv, "no address vector, "
705745
"but multicast join already started\n");
706-
else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
707-
ipoib_mcast_sendonly_join(mcast);
708746

709747
/*
710748
* If lookup completes between here and out:, don't
@@ -766,7 +804,7 @@ void ipoib_mcast_dev_flush(struct net_device *dev)
766804

767805
/* seperate between the wait to the leave*/
768806
list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
769-
if (test_bit(IPOIB_MCAST_JOIN_STARTED, &mcast->flags))
807+
if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
770808
wait_for_completion(&mcast->done);
771809

772810
list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {

0 commit comments

Comments
 (0)