kernel-patches-bot

Pull request for series with
subject: Socket migration for SO_REUSEPORT.
version: 2
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=397573

@kernel-patches-bot

Master branch: 34da872
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=397573
version: 2

@kernel-patches-bot

Master branch: e1868b9
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=397573
version: 2

@kernel-patches-bot

Master branch: 2f4b031
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=397573
version: 2

kernel-patches-bot and others added 14 commits December 8, 2020 08:22
This is a preparatory patch for migrating incoming connections in later
commits. It adds a field (num_closed_socks) to struct sock_reuseport so
that TCP_CLOSE sockets can access the reuseport group.

When we close a listening socket and want to migrate its connections to
another listener in the same reuseport group, we have to handle two kinds
of child sockets: those the listening socket holds a reference to, and
those it does not.

The former are the TCP_ESTABLISHED/TCP_SYN_RECV sockets, which sit in the
accept queue of their listening socket, so we can pop them out and push
them into another listener's queue at close() or shutdown(). The latter
are TCP_NEW_SYN_RECV sockets, which are still in the middle of the
three-way handshake and not yet in the accept queue. We cannot reach such
sockets at close() or shutdown(), so we have to migrate these immature
sockets after their listening socket has been closed.

Currently, if their listening socket has been closed, TCP_NEW_SYN_RECV
sockets are freed when the final ACK is received or a SYN+ACK is
retransmitted. If we could select a new listener from the same reuseport
group at that point, no connection would be aborted. However, this is
impossible because reuseport_detach_sock() sets sk_reuseport_cb to NULL
and forbids closed sockets from accessing the reuseport group.

This patch allows TCP_CLOSE sockets to hold sk_reuseport_cb while any child
socket still references them. The point is that reuseport_detach_sock() is
called twice, first from inet_unhash() and then from sk_destruct(). The
first call decrements num_socks and increments num_closed_socks. The second
call, made once all migrated connections have been accepted, decrements
num_closed_socks and sets sk_reuseport_cb to NULL.
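
The two-call behaviour described above can be sketched in user space. This
is a minimal model, not the kernel code: struct reuseport_model, struct
sock_model, the unhashed flag and model_detach_sock() are hypothetical
stand-ins, and only the two counters named in this commit are represented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sock_reuseport: only the two counters. */
struct reuseport_model {
	unsigned int num_socks;        /* listeners still hashed */
	unsigned int num_closed_socks; /* closed listeners still holding the group */
};

/* Hypothetical stand-in for struct sock. */
struct sock_model {
	struct reuseport_model *sk_reuseport_cb;
	int unhashed; /* assumption: set once the first detach has run */
};

/* Models the two calls to reuseport_detach_sock() described in the text. */
static void model_detach_sock(struct sock_model *sk)
{
	struct reuseport_model *reuse = sk->sk_reuseport_cb;

	if (!reuse)
		return; /* already fully detached */

	if (!sk->unhashed) {
		/* First call (inet_unhash() path): stop listening, keep the group. */
		assert(reuse->num_socks > 0);
		reuse->num_socks--;
		reuse->num_closed_socks++;
		sk->unhashed = 1;
	} else {
		/* Second call (sk_destruct() path): drop the group pointer. */
		assert(reuse->num_closed_socks > 0);
		reuse->num_closed_socks--;
		sk->sk_reuseport_cb = NULL;
	}
}
```

Between the two calls the closed socket can still reach the group, which is
what lets immature requests look for a new listener.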

With this change, closed sockets keep sk_reuseport_cb until all child
requests have been freed or accepted. As a consequence, calling listen()
after shutdown() can fail with EADDRINUSE or EBUSY in reuseport_add_sock()
or inet_csk_bind_conflict(), which expect that such sockets do not have a
reuseport group. Therefore, this patch also loosens these validation rules
so that the socket can listen again if it shares its reuseport group with
other listening sockets.

Reviewed-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
As noted in the preceding commit, there are two migration types. In
addition, the kernel runs the same eBPF program to select a listener for
SYN packets.

This patch defines three types to tell the kernel and the eBPF program
whether it is receiving a new request, migrating ESTABLISHED/SYN_RECV
sockets in the accept queue, or migrating a NEW_SYN_RECV socket during
3WHS.
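
As a rough illustration, the three cases can be modelled as an enum plus a
mapping from the triggering event. All identifiers below are hypothetical
model names. Other commits in this series name two of the real constants
(BPF_SK_REUSEPORT_MIGRATE_NO and BPF_SK_REUSEPORT_MIGRATE_REQUEST); the
name of the accept-queue constant is not shown in this text, so it is left
as a placeholder here.

```c
#include <assert.h>

/* Hypothetical model of the three migration types. */
enum migration_type_model {
	MODEL_MIGRATE_NO,      /* new request: select a listener for a SYN */
	MODEL_MIGRATE_QUEUE,   /* placeholder: ESTABLISHED/SYN_RECV in accept queue */
	MODEL_MIGRATE_REQUEST, /* NEW_SYN_RECV socket during 3WHS */
};

/* Hypothetical events that make the kernel pick a listener. */
enum select_event_model {
	MODEL_EVENT_SYN,                   /* a SYN arrives */
	MODEL_EVENT_LISTENER_CLOSE,        /* close()/shutdown() on a listener */
	MODEL_EVENT_ACK_OR_SYNACK_RETRANS, /* final ACK or SYN+ACK timer, listener closed */
};

/* Maps the triggering event to the type passed to the selection logic. */
static enum migration_type_model model_migration_type(enum select_event_model ev)
{
	switch (ev) {
	case MODEL_EVENT_LISTENER_CLOSE:
		return MODEL_MIGRATE_QUEUE;
	case MODEL_EVENT_ACK_OR_SYNACK_RETRANS:
		return MODEL_MIGRATE_REQUEST;
	case MODEL_EVENT_SYN:
	default:
		return MODEL_MIGRATE_NO;
	}
}
```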

Signed-off-by: Kuniyuki Iwashima <[email protected]>
This reverts commit 607904c to use
spin_lock_bh_nested() in the next commit.

Link: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Kuniyuki Iwashima <[email protected]>
CC: Waiman Long <[email protected]>
Acked-by: Waiman Long <[email protected]>
This patch defines a new function to migrate ESTABLISHED/SYN_RECV sockets.

Listening sockets hold incoming connections as a linked list of struct
request_sock in the accept queue, and each request holds a reference to its
full socket and its listener. In inet_csk_reqsk_queue_migrate(), we only
unlink the requests from the closing listener's queue and relink them to
the head of the new listener's queue. We do not touch each request or its
reference to the listener, so the migration completes in O(1) time.
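
The relinking step can be sketched with a singly linked queue that tracks
head and tail. This is a user-space model under assumptions, not the body
of inet_csk_reqsk_queue_migrate(): struct req_model, struct queue_model and
both functions are hypothetical, and locking is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct request_sock in an accept queue. */
struct req_model {
	struct req_model *next;
	int id;
};

/* Hypothetical accept queue with head and tail pointers. */
struct queue_model {
	struct req_model *head;
	struct req_model *tail;
	unsigned int len;
};

/* Appends one request to the tail of a queue. */
static void model_queue_push(struct queue_model *q, struct req_model *req)
{
	req->next = NULL;
	if (q->tail)
		q->tail->next = req;
	else
		q->head = req;
	q->tail = req;
	q->len++;
}

/* Relinks all of old_q in front of new_q without walking either list. */
static void model_queue_migrate(struct queue_model *old_q,
				struct queue_model *new_q)
{
	assert(old_q != new_q);
	if (!old_q->head)
		return; /* nothing to migrate */

	old_q->tail->next = new_q->head; /* old requests go before new ones */
	new_q->head = old_q->head;
	if (!new_q->tail)
		new_q->tail = old_q->tail; /* new queue was empty */
	new_q->len += old_q->len;

	old_q->head = NULL;
	old_q->tail = NULL;
	old_q->len = 0;
}
```

Only a fixed number of pointers change regardless of queue length, which is
where the constant-time claim comes from.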

Moreover, if TFO requests caused RST before 3WHS completed, they are held
in the listener's TFO queue to mitigate DDoS attacks. Thus, we also migrate
the requests in the TFO queue in the same way.

After 3WHS has completed, there are three access patterns to incoming
sockets:

  (1) access to the full socket instead of request_sock
  (2) access to request_sock from accept queue
  (3) access to request_sock from TFO queue

In the first case, the full socket does not have a reference to its request
socket and listener, so we do not need the correct listener set in the
request socket. In the second case, we always have the correct listener and
currently do not use req->rsk_listener. The third case, TCP_SYN_RECV
sockets, needs special care and is handled in the next commit.

Reviewed-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
A TFO request socket is only freed after BOTH 3WHS has completed (or been
aborted) and the child socket has been accepted (or its listener has been
closed). Hence, depending on the order, there can be two kinds of request
sockets in the accept queue.

  3WHS -> accept : TCP_ESTABLISHED
  accept -> 3WHS : TCP_SYN_RECV

Unlike for a TCP_ESTABLISHED socket, accept() does not free the request
socket of a TCP_SYN_RECV socket. It is freed later in
reqsk_fastopen_remove(), which also accesses request_sock.rsk_listener. So,
to complete TFO socket migration, we have to set rsk_listener to the
current listener at accept(), before reqsk_fastopen_remove() runs.
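
A small user-space sketch of the accept-time fix-up follows. Every type and
function name here is a hypothetical model; only the field name
rsk_listener and the two socket states come from the text.

```c
#include <assert.h>
#include <stddef.h>

enum child_state_model { MODEL_TCP_ESTABLISHED, MODEL_TCP_SYN_RECV };

/* Hypothetical stand-in for a listening socket. */
struct listener_model {
	int id;
};

/* Hypothetical stand-in for a TFO request in the accept queue. */
struct tfo_req_model {
	struct listener_model *rsk_listener;
	enum child_state_model child_state;
	int freed; /* set when accept() releases the request */
};

/* Models accept() on a request that may have been migrated. */
static void model_accept(struct listener_model *current_listener,
			 struct tfo_req_model *req)
{
	assert(current_listener != NULL && req != NULL);

	if (req->child_state == MODEL_TCP_SYN_RECV) {
		/* 3WHS not finished: the request outlives accept(), so it
		 * must point at the listener that actually accepted it. */
		req->rsk_listener = current_listener;
	} else {
		/* 3WHS finished first: accept() frees the request. */
		req->freed = 1;
	}
}
```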

Reviewed-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
This patch lets reuseport_detach_sock() return a pointer to struct sock,
which is used only by inet_unhash(). If it is not NULL,
inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
sockets from the closing listener to the selected one.

By default, the kernel selects a new listener randomly. In order to pick
out a different socket every time, we select the last element of socks[] as
the new listener. This behaviour is based on how the kernel moves sockets
in socks[]. (See also [1])

In general, redistributing sockets evenly requires the eBPF program
introduced in a later commit. As a side effect of this default selection,
however, the kernel can redistribute old requests evenly to new listeners
in the specific case where the application replaces listeners by
generations.

For example, we call listen() for four sockets (A, B, C, D), and close()
the first two in turn. The sockets move in socks[] as shown below.

  socks[0] : A <-.      socks[0] : D          socks[0] : D
  socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
  socks[2] : C   |      socks[2] : C --'
  socks[3] : D --'

Then, if C and D have newer settings than A and B, and each socket has a
request (a, b, c, d) in their accept queue, we can redistribute old
requests evenly to new listeners.

  socks[0] : A (a) <-.      socks[0] : D (a + d)      socks[0] : D (a + d)
  socks[1] : B (b)   |  =>  socks[1] : B (b) <-.  =>  socks[1] : C (b + c)
  socks[2] : C (c)   |      socks[2] : C (c) --'
  socks[3] : D (d) --'

Here, (A, D) or (B, C) can have different application settings, but they
MUST have the same settings at the socket API level; otherwise, unexpected
errors may occur. For instance, if only the new listeners have
TCP_SAVE_SYN set, old requests do not hold SYN data, so the application
will see inconsistent state and fail.

Therefore, if there are different kinds of sockets, we must attach an eBPF
program described in later commits.
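
The A/B/C/D walk-through above can be reproduced with a small user-space
model. The types and model_close_listener() are hypothetical. The handling
of the case where the closing socket is itself the last element is an
assumption, since the text does not cover it.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_SOCKS 8

/* Hypothetical listener with a count of queued requests. */
struct listener_model {
	char name;
	unsigned int queued;
};

/* Hypothetical reuseport group: a dense array like socks[]. */
struct group_model {
	struct listener_model *socks[MODEL_MAX_SOCKS];
	unsigned int num_socks;
};

/* Closes sk and returns the listener that inherited its requests. */
static struct listener_model *model_close_listener(struct group_model *g,
						   struct listener_model *sk)
{
	struct listener_model *target = NULL;
	unsigned int i;

	assert(g->num_socks <= MODEL_MAX_SOCKS);
	for (i = 0; i < g->num_socks; i++)
		if (g->socks[i] == sk)
			break;
	if (i == g->num_socks)
		return NULL; /* sk is not in the group */

	/* Fill the vacated slot with the last element, then shrink. */
	g->socks[i] = g->socks[g->num_socks - 1];
	g->socks[g->num_socks - 1] = NULL;
	g->num_socks--;

	/* Default choice: the element that was last before the close.
	 * Assumption: if sk was last, use the new last element. */
	if (g->num_socks > 0)
		target = (i < g->num_socks) ? g->socks[i]
					    : g->socks[g->num_socks - 1];
	if (target) {
		target->queued += sk->queued;
		sk->queued = 0;
	}
	return target;
}
```

Closing A and then B in a group of A, B, C, D leaves D and C with two
requests each, matching the diagram.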

Link: https://lore.kernel.org/netdev/CAEfhGiyG8Y_amDZ2C8dQoQqjZJMHjTY76b=KBkTKcBtA=dhdGQ@mail.gmail.com/
Reviewed-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
This patch renames reuseport_select_sock() to __reuseport_select_sock() and
adds two wrapper functions around it to pass the migration type defined in
the previous commit.

  reuseport_select_sock          : BPF_SK_REUSEPORT_MIGRATE_NO
  reuseport_select_migrated_sock : BPF_SK_REUSEPORT_MIGRATE_REQUEST

As mentioned before, we have to select a new listener for TCP_NEW_SYN_RECV
requests when receiving the final ACK or sending a SYN+ACK. Therefore, this
patch also changes the code to call reuseport_select_migrated_sock() even
if the listening socket is TCP_CLOSE. If we can pick out a listening socket
from the reuseport group, we rewrite request_sock.rsk_listener and resume
processing the request.
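
The control flow for an immature request can be sketched as follows. This
is a hypothetical user-space model: the selection function simply returns
the first listener still in the listening state, standing in for the random
or eBPF-driven choice made by the kernel.

```c
#include <assert.h>
#include <stddef.h>

enum sk_state_model { MODEL_TCP_LISTEN, MODEL_TCP_CLOSE };

/* Hypothetical stand-in for a listener in a reuseport group. */
struct listener_model {
	enum sk_state_model state;
};

/* Hypothetical stand-in for a TCP_NEW_SYN_RECV request. */
struct req_model {
	struct listener_model *rsk_listener;
};

/* Stand-in selection: first socket in the group that still listens. */
static struct listener_model *
model_select_migrated(struct listener_model **group, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (group[i]->state == MODEL_TCP_LISTEN)
			return group[i];
	return NULL;
}

/* Returns 1 if the request can be processed further, 0 if it is dropped. */
static int model_resume_request(struct req_model *req,
				struct listener_model **group, size_t n)
{
	struct listener_model *nsk;

	assert(req != NULL && req->rsk_listener != NULL);
	if (req->rsk_listener->state == MODEL_TCP_LISTEN)
		return 1; /* listener still alive, nothing to migrate */

	nsk = model_select_migrated(group, n);
	if (!nsk)
		return 0; /* no listener left in the group */

	req->rsk_listener = nsk; /* rewrite the listener and carry on */
	return 1;
}
```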

Link: https://lore.kernel.org/bpf/[email protected]/
Reported-by: kernel test robot <[email protected]>
Reviewed-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
This commit adds a new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT to
check if the attached eBPF program is capable of migrating sockets.

When the eBPF program is attached, the kernel runs it for socket migration
only if the expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.
The kernel will change the behaviour depending on the returned value:

  - SK_PASS with selected_sk, select it as a new listener
  - SK_PASS with selected_sk NULL, fall back to the random selection
  - SK_DROP, cancel the migration
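
The three outcomes listed above amount to a small decision function. The
sketch below is a user-space model with hypothetical types; the random
fallback is passed in as an argument so the example stays deterministic.

```c
#include <assert.h>
#include <stddef.h>

/* Model of the program verdict; mirrors SK_DROP / SK_PASS in the text. */
enum sk_action_model { MODEL_SK_DROP, MODEL_SK_PASS };

/* Hypothetical stand-in for a listener. */
struct sock_model {
	int id;
};

/* Hypothetical result of running the eBPF program. */
struct prog_result_model {
	enum sk_action_model action;
	struct sock_model *selected_sk; /* may be NULL */
};

/* Returns the new listener, or NULL with *cancelled set on SK_DROP. */
static struct sock_model *model_pick_listener(struct prog_result_model res,
					      struct sock_model *random_pick,
					      int *cancelled)
{
	assert(cancelled != NULL);
	*cancelled = 0;

	if (res.action == MODEL_SK_DROP) {
		*cancelled = 1; /* the program vetoed the migration */
		return NULL;
	}
	if (res.selected_sk)
		return res.selected_sk; /* the program chose a listener */
	return random_pick; /* SK_PASS without a choice: random selection */
}
```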

Link: https://lore.kernel.org/netdev/[email protected]/
Suggested-by: Martin KaFai Lau <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
This commit introduces a new section (sk_reuseport/migrate) and sets
expected_attach_type for each of the two sections of
BPF_PROG_TYPE_SK_REUSEPORT programs.
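
The section-to-attach-type mapping can be pictured as a simple lookup. This
is a user-space model, not libbpf code: the enum and function are
hypothetical, and MODEL_ATTACH_SELECT stands in for the attach type of the
plain sk_reuseport section, whose real constant name is not given in this
text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the expected attach types. */
enum attach_type_model {
	MODEL_ATTACH_UNKNOWN,
	MODEL_ATTACH_SELECT,            /* placeholder for the select-only type */
	MODEL_ATTACH_SELECT_OR_MIGRATE, /* BPF_SK_REUSEPORT_SELECT_OR_MIGRATE */
};

/* Maps an ELF section name to the attach type the loader would expect. */
static enum attach_type_model model_attach_type_for_section(const char *sec)
{
	assert(sec != NULL);
	if (strcmp(sec, "sk_reuseport/migrate") == 0)
		return MODEL_ATTACH_SELECT_OR_MIGRATE;
	if (strcmp(sec, "sk_reuseport") == 0)
		return MODEL_ATTACH_SELECT;
	return MODEL_ATTACH_UNKNOWN;
}
```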

Signed-off-by: Kuniyuki Iwashima <[email protected]>
This patch adds a u8 migration field to sk_reuseport_kern and
sk_reuseport_md to tell the eBPF program whether the kernel is calling it
to select a listener for a SYN, to migrate sockets in the accept queue, or
to migrate an immature socket during 3WHS.

Note that this field is accessible only if the attached type is
BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.

Link: https://lore.kernel.org/netdev/[email protected]/
Suggested-by: Martin KaFai Lau <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
…ORT.

We will call sock_reuseport.prog for socket migration in the next commit,
so the eBPF program has to know which listener is closing in order to
select the new listener.

Currently, we can get a unique ID for each listener in userspace by calling
bpf_map_lookup_elem() on a BPF_MAP_TYPE_REUSEPORT_SOCKARRAY map.

This patch makes the sk pointer available in sk_reuseport_md so that we can
get the ID with BPF_FUNC_get_socket_cookie() in the eBPF program.

Link: https://lore.kernel.org/netdev/[email protected]/
Suggested-by: Martin KaFai Lau <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
This patch supports socket migration by eBPF. If the attached type is
BPF_SK_REUSEPORT_SELECT_OR_MIGRATE, we can select a new listener by
BPF_FUNC_sk_select_reuseport(). Also, we can cancel migration by returning
SK_DROP. This feature is useful when listeners have different settings at
the socket API level or when we want to free resources as soon as possible.

There are two noteworthy points. First, we select a listening socket in
reuseport_detach_sock() and __reuseport_select_sock(), but we do not have
a struct skb when closing a listener or retransmitting a SYN+ACK. Some
helper functions, however, do not expect skb to be NULL (e.g.
skb_header_pointer() in BPF_FUNC_skb_load_bytes(), skb_tail_pointer() in
BPF_FUNC_skb_load_bytes_relative()), so we temporarily allocate an empty
skb before running the eBPF program. Second, we do not have a struct
request_sock in the unhash path, and the sk_hash of the listener is always
zero, so we pass zero as the hash to bpf_run_sk_reuseport().

Reviewed-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
This patch adds a test for BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.

Reviewed-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
@kernel-patches-bot

Master branch: 8bdd8e2
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=397573
version: 2

@kernel-patches-bot

At least one diff in series https://patchwork.kernel.org/project/netdevbpf/list/?series=397573 expired. Closing PR.

@kernel-patches-bot kernel-patches-bot deleted the series/385817=>bpf-next branch December 11, 2020 10:43
eddyz87 added a commit to eddyz87/bpf that referenced this pull request Jul 30, 2025
Failing tests:
- kernel-patches#110     fexit_bpf2bpf:FAIL
- kernel-patches#124     for_each:FAIL
- kernel-patches#144     iters:FAIL
- kernel-patches#148     kfree_skb:FAIL
- kernel-patches#161     l4lb_all:FAIL
- kernel-patches#193     map_kptr:FAIL
- kernel-patches#23      bpf_loop:FAIL
- kernel-patches#260     pkt_access:FAIL
- kernel-patches#269     prog_run_opts:FAIL
- kernel-patches#280     rbtree_success:FAIL
- kernel-patches#356     res_spin_lock_failure:FAIL
- kernel-patches#364     setget_sockopt:FAIL
- kernel-patches#381     sock_fields:FAIL
- kernel-patches#394     spin_lock:FAIL
- kernel-patches#395     spin_lock_success:FAIL
- kernel-patches#444     test_bpffs:FAIL
- kernel-patches#453     test_profiler:FAIL
- kernel-patches#479     usdt:FAIL
- kernel-patches#488     verifier_bits_iter:FAIL
- kernel-patches#597     verif_scale_pyperf600:FAIL
- kernel-patches#598     verif_scale_pyperf600_bpf_loop:FAIL
- kernel-patches#599     verif_scale_pyperf600_iter:FAIL
- kernel-patches#608     verif_scale_strobemeta_subprogs:FAIL
- kernel-patches#622     xdp_attach:FAIL
- kernel-patches#637     xdp_noinline:FAIL
- kernel-patches#639     xdp_synproxy:FAIL
- kernel-patches#72      cls_redirect:FAIL
- kernel-patches#88      crypto_sanity:FAIL
- kernel-patches#97      dynptr:FAIL

Signed-off-by: Eduard Zingerman <[email protected]>