Socket migration for SO_REUSEPORT. #488
Master branch: 34da872
Master branch: e1868b9
e3a719c to a446d9b
Master branch: 2f4b031
a446d9b to 44c7926
This patch is a preparation patch for migrating incoming connections in the later commits. It adds a field (num_closed_socks) to struct sock_reuseport so that TCP_CLOSE sockets can access the reuseport group.

When we close a listening socket, to migrate its connections to another listener in the same reuseport group, we have to handle two kinds of child sockets. One kind is referenced by the listening socket; the other is not.

The former are the TCP_ESTABLISHED/TCP_SYN_RECV sockets, which sit in the accept queue of their listening socket. So, we can pop them out and push them into another listener's queue at the close() or shutdown() syscalls. The latter, the TCP_NEW_SYN_RECV socket, is still in the three-way handshake and not in the accept queue. Thus, we cannot access such sockets at the close() or shutdown() syscalls. Accordingly, we have to migrate immature sockets after their listening socket has been closed.

Currently, if their listening socket has been closed, TCP_NEW_SYN_RECV sockets are freed when receiving the final ACK or retransmitting SYN+ACKs. At that time, if we could select a new listener from the same reuseport group, no connection would be aborted. However, this is impossible because reuseport_detach_sock() sets sk_reuseport_cb to NULL and forbids access to the reuseport group from closed sockets.

This patch allows TCP_CLOSE sockets to hold sk_reuseport_cb while any child socket references them. The point is that reuseport_detach_sock() is called twice, from inet_unhash() and sk_destruct(). The first call decrements num_socks and increments num_closed_socks. Later, when all migrated connections have been accepted, the second call decrements num_closed_socks and sets sk_reuseport_cb to NULL.

With this change, closed sockets keep sk_reuseport_cb until all child requests have been freed or accepted. Consequently, calling listen() after shutdown() can cause EADDRINUSE or EBUSY in reuseport_add_sock() or inet_csk_bind_conflict(), which expect that such sockets do not have a reuseport group. Therefore, this patch also loosens those validation rules so that the socket can listen again if it shares its reuseport group with other listening sockets.

Reviewed-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
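The two-phase detach described above can be modelled in user space. The following is only a sketch under stated assumptions: `toy_reuseport`, `toy_sock`, `toy_attach()` and `toy_detach()` are hypothetical names invented for illustration, the `listening` flag stands in for the real socket state, and no locking or RCU is modelled. It is not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sock_reuseport; only the counters matter. */
struct toy_reuseport {
	unsigned int num_socks;        /* sockets that are still listening */
	unsigned int num_closed_socks; /* closed sockets still holding the group */
};

/* Hypothetical stand-in for struct sock. */
struct toy_sock {
	struct toy_reuseport *reuseport_cb; /* models sk_reuseport_cb */
	bool listening;
};

/* Join the group as a listener. */
void toy_attach(struct toy_sock *sk, struct toy_reuseport *grp)
{
	sk->reuseport_cb = grp;
	sk->listening = true;
	grp->num_socks++;
}

/*
 * First call (models the unhash path): the socket stops listening but
 * keeps its group pointer, so pending child requests can still reach
 * the group.
 * Second call (models the destruct path): the group pointer is cleared.
 */
void toy_detach(struct toy_sock *sk)
{
	struct toy_reuseport *grp = sk->reuseport_cb;

	if (!grp)
		return;

	if (sk->listening) {
		assert(grp->num_socks > 0);
		grp->num_socks--;
		grp->num_closed_socks++;
		sk->listening = false;
	} else {
		assert(grp->num_closed_socks > 0);
		grp->num_closed_socks--;
		sk->reuseport_cb = NULL;
	}
}
```

Calling `toy_detach()` once leaves `reuseport_cb` set with the socket counted in `num_closed_socks`; only the second call clears the pointer.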
As noted in the preceding commit, there are two migration types. In addition, the kernel will run the same eBPF program to select a listener for SYN packets. This patch defines three types to tell the kernel and the eBPF program whether it is receiving a new request, migrating ESTABLISHED/SYN_RECV sockets in the accept queue, or migrating a NEW_SYN_RECV socket during 3WHS. Signed-off-by: Kuniyuki Iwashima <[email protected]>
This reverts commit 607904c to use spin_lock_bh_nested() in the next commit. Link: https://lore.kernel.org/netdev/[email protected]/ Signed-off-by: Kuniyuki Iwashima <[email protected]> CC: Waiman Long <[email protected]> Acked-by: Waiman Long <[email protected]>
This patch defines a new function to migrate ESTABLISHED/SYN_RECV sockets.

Listening sockets hold incoming connections as a linked list of struct request_sock in the accept queue, and each request has a reference to its full socket and listener. In inet_csk_reqsk_queue_migrate(), we only unlink the requests from the closing listener's queue and relink them to the head of the new listener's queue. We do not process each request and its reference to the listener, so the migration completes in O(1) time.

Moreover, if TFO requests caused RST before 3WHS has completed, they are held in the listener's TFO queue to prevent DDoS attacks. Thus, we also migrate the requests in the TFO queue in the same way.

After 3WHS has completed, there are three access patterns to incoming sockets:

  (1) access to the full socket instead of request_sock
  (2) access to request_sock from the accept queue
  (3) access to request_sock from the TFO queue

In the first case, the full socket does not have a reference to its request socket and listener, so we do not need the correct listener set in the request socket. In the second case, we always have the correct listener and currently do not use req->rsk_listener. However, the third case, TCP_SYN_RECV sockets, needs special care, which the next commit takes.

Reviewed-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
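To illustrate why relinking at the head costs O(1), here is a minimal user-space sketch of splicing one singly linked queue onto the head of another. It assumes each queue tracks both a head and a tail pointer. `toy_req`, `toy_queue`, `toy_queue_push()` and `toy_queue_migrate()` are hypothetical names, and the sketch ignores locking and the TFO queue.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct request_sock. */
struct toy_req {
	struct toy_req *next;
	int id;
};

/* Hypothetical accept queue with head and tail pointers. */
struct toy_queue {
	struct toy_req *head;
	struct toy_req *tail;
	unsigned int len;
};

/* Append one request at the tail, as a new connection would be queued. */
void toy_queue_push(struct toy_queue *q, struct toy_req *req)
{
	req->next = NULL;
	if (q->tail)
		q->tail->next = req;
	else
		q->head = req;
	q->tail = req;
	q->len++;
}

/*
 * Move every request from src to the head of dst.
 * Only a fixed number of pointers change, whatever the queue length,
 * because no individual request is visited.
 */
void toy_queue_migrate(struct toy_queue *src, struct toy_queue *dst)
{
	if (!src->head)
		return; /* nothing to migrate */

	src->tail->next = dst->head; /* old requests go in front */
	if (!dst->tail)
		dst->tail = src->tail; /* dst was empty */
	dst->head = src->head;
	dst->len += src->len;

	src->head = NULL;
	src->tail = NULL;
	src->len = 0;
}
```

Placing the migrated requests at the head means the oldest connections are accepted first by the new listener.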
A TFO request socket is only freed after BOTH 3WHS has completed (or aborted) and the child socket has been accepted (or its listener has been closed). Hence, depending on the order, there can be two kinds of request sockets in the accept queue.

  3WHS -> accept : TCP_ESTABLISHED
  accept -> 3WHS : TCP_SYN_RECV

Unlike for a TCP_ESTABLISHED socket, accept() does not free the request socket for a TCP_SYN_RECV socket. It is freed later at reqsk_fastopen_remove(), which also accesses request_sock.rsk_listener. So, in order to complete TFO socket migration, we have to set the current listener in it at accept(), before reqsk_fastopen_remove().

Reviewed-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
This patch lets reuseport_detach_sock() return a pointer to struct sock, which is used only by inet_unhash(). If it is not NULL, inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV sockets from the closing listener to the selected one.

By default, the kernel selects a new listener randomly. In order to pick out a different socket every time, we select the last element of socks[] as the new listener. This behaviour is based on how the kernel moves sockets in socks[]. (See also [1])

Basically, in order to redistribute sockets evenly, we have to use an eBPF program called in the later commit. As a side effect of this default selection, however, the kernel can redistribute old requests evenly to new listeners in the specific case where the application replaces listeners by generations.

For example, we call listen() for four sockets (A, B, C, D), and close() the first two in turn. The sockets move in socks[] as below.

  socks[0] : A <-.      socks[0] : D          socks[0] : D
  socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
  socks[2] : C   |      socks[2] : C --'
  socks[3] : D --'

Then, if C and D have newer settings than A and B, and each socket has a request (a, b, c, d) in its accept queue, we can redistribute old requests evenly to new listeners.

  socks[0] : A (a) <-.      socks[0] : D (a + d)      socks[0] : D (a + d)
  socks[1] : B (b)   |  =>  socks[1] : B (b) <-.  =>  socks[1] : C (b + c)
  socks[2] : C (c)   |      socks[2] : C (c) --'
  socks[3] : D (d) --'

Here, (A, D) or (B, C) can have different application settings, but they MUST have the same settings at the socket API level; otherwise, unexpected errors may happen. For instance, if only the new listeners have TCP_SAVE_SYN, old requests do not hold SYN data, so the application will face inconsistency and cause an error.

Therefore, if there are different kinds of sockets, we must attach an eBPF program described in later commits.

Link: https://lore.kernel.org/netdev/CAEfhGiyG8Y_amDZ2C8dQoQqjZJMHjTY76b=KBkTKcBtA=dhdGQ@mail.gmail.com/
Reviewed-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
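The movement of sockets in socks[] shown in the diagrams can be reproduced with a small user-space model. This is a sketch only: `toy_group` and `toy_detach()` are hypothetical names, listeners are represented by single characters, and the tie-break used when the closing listener is itself the last element (pick the new last element) is an assumption of this sketch, not something the commit message states.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_SOCKS 8

/* Hypothetical model of the socks[] array; each listener is one letter. */
struct toy_group {
	char socks[TOY_MAX_SOCKS];
	unsigned int num_socks;
};

/*
 * Remove socks[idx] by moving the last element into its slot, and return
 * the listener that inherits the closing listener's requests.
 * Returns '\0' when no listener is left in the group.
 */
char toy_detach(struct toy_group *g, unsigned int idx)
{
	assert(idx < g->num_socks);

	g->num_socks--;
	g->socks[idx] = g->socks[g->num_socks]; /* last element fills the gap */

	if (g->num_socks == 0)
		return '\0';

	/* Assumption: if the closing listener was last, use the new last one. */
	if (idx == g->num_socks)
		return g->socks[idx - 1];

	/* Otherwise the inheritor is the element that was last before removal. */
	return g->socks[idx];
}
```

With the group holding A, B, C, D, detaching index 0 (A) selects D and leaves D, B, C; detaching index 1 (B) then selects C and leaves D, C, matching the diagrams above.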
This patch renames reuseport_select_sock() to __reuseport_select_sock() and adds two wrapper functions for it to pass the migration type defined in the previous commit.

  reuseport_select_sock          : BPF_SK_REUSEPORT_MIGRATE_NO
  reuseport_select_migrated_sock : BPF_SK_REUSEPORT_MIGRATE_REQUEST

As mentioned before, we have to select a new listener for TCP_NEW_SYN_RECV requests when receiving the final ACK or sending a SYN+ACK. Therefore, this patch also changes the code to call reuseport_select_migrated_sock() even if the listening socket is TCP_CLOSE. If we can pick out a listening socket from the reuseport group, we rewrite request_sock.rsk_listener and resume processing the request.

Link: https://lore.kernel.org/bpf/[email protected]/
Reported-by: kernel test robot <[email protected]>
Reviewed-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
This commit adds a new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT to check if the attached eBPF program is capable of migrating sockets.

When the eBPF program is attached, the kernel runs it for socket migration only if the expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE. The kernel will change the behaviour depending on the returned value:

  - SK_PASS with selected_sk, select it as a new listener
  - SK_PASS with selected_sk NULL, fall back to the random selection
  - SK_DROP, cancel the migration

Link: https://lore.kernel.org/netdev/[email protected]/
Suggested-by: Martin KaFai Lau <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
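The three outcomes listed above amount to a small decision function. The sketch below models it in user space under stated assumptions: `toy_verdict`, `toy_listener` and `toy_resolve_listener()` are hypothetical names, and the `fallback` argument stands in for whatever the random selection would have returned. It is not the kernel code path.

```c
#include <stddef.h>

/* Hypothetical mirror of the program's return codes. */
enum toy_verdict {
	TOY_SK_DROP = 0,
	TOY_SK_PASS = 1,
};

/* Hypothetical stand-in for a listening socket. */
struct toy_listener {
	int id;
};

/*
 * Decide which listener receives the migrated requests.
 * Returns NULL when the program cancels the migration.
 */
struct toy_listener *toy_resolve_listener(enum toy_verdict verdict,
					  struct toy_listener *selected_sk,
					  struct toy_listener *fallback)
{
	if (verdict == TOY_SK_DROP)
		return NULL; /* migration cancelled */

	if (selected_sk)
		return selected_sk; /* program picked a listener */

	return fallback; /* SK_PASS without a selection: default pick */
}
```

Returning NULL for the drop case lets the caller treat a cancelled migration the same way as having no listener left to migrate to.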
This commit introduces a new section (sk_reuseport/migrate) and sets expected_attach_type for each of the two sections in a BPF_PROG_TYPE_SK_REUSEPORT program. Signed-off-by: Kuniyuki Iwashima <[email protected]>
This patch adds a u8 migration field to sk_reuseport_kern and sk_reuseport_md to tell the eBPF program whether the kernel calls it for selecting a listener for SYN, migrating sockets in the accept queue, or migrating an immature socket during 3WHS. Note that this field is accessible only if the attached type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE. Link: https://lore.kernel.org/netdev/[email protected]/ Suggested-by: Martin KaFai Lau <[email protected]> Signed-off-by: Kuniyuki Iwashima <[email protected]>
…ORT. We will call sock_reuseport.prog for socket migration in the next commit, so the eBPF program has to know which listener is closing in order to select the new listener. Currently, we can get a unique ID for each listener in the userspace by calling bpf_map_lookup_elem() for BPF_MAP_TYPE_REUSEPORT_SOCKARRAY map. This patch makes the sk pointer available in sk_reuseport_md so that we can get the ID by BPF_FUNC_get_socket_cookie() in the eBPF program. Link: https://lore.kernel.org/netdev/[email protected]/ Suggested-by: Martin KaFai Lau <[email protected]> Signed-off-by: Kuniyuki Iwashima <[email protected]>
This patch supports socket migration by eBPF. If the attached type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE, we can select a new listener by BPF_FUNC_sk_select_reuseport(). Also, we can cancel migration by returning SK_DROP. This feature is useful when listeners have different settings at the socket API level or when we want to free resources as soon as possible.

There are two noteworthy points. The first is that we select a listening socket in reuseport_detach_sock() and __reuseport_select_sock(), but we do not have a struct skb when closing a listener or retransmitting a SYN+ACK. However, some helper functions do not expect skb to be NULL (e.g. skb_header_pointer() in BPF_FUNC_skb_load_bytes(), skb_tail_pointer() in BPF_FUNC_skb_load_bytes_relative()). So we allocate an empty skb temporarily before running the eBPF program. The second is that we do not have a struct request_sock in the unhash path, and the sk_hash of the listener is always zero. So we pass zero as the hash to bpf_run_sk_reuseport().

Reviewed-by: Benjamin Herrenschmidt <[email protected]>
Signed-off-by: Kuniyuki Iwashima <[email protected]>
This patch adds a test for BPF_SK_REUSEPORT_SELECT_OR_MIGRATE. Reviewed-by: Benjamin Herrenschmidt <[email protected]> Signed-off-by: Kuniyuki Iwashima <[email protected]>
Master branch: 8bdd8e2
44c7926 to 0d78128
At least one diff in series https://patchwork.kernel.org/project/netdevbpf/list/?series=397573 expired. Closing PR.
Failing tests:

  - kernel-patches#110 fexit_bpf2bpf:FAIL
  - kernel-patches#124 for_each:FAIL
  - kernel-patches#144 iters:FAIL
  - kernel-patches#148 kfree_skb:FAIL
  - kernel-patches#161 l4lb_all:FAIL
  - kernel-patches#193 map_kptr:FAIL
  - kernel-patches#23 bpf_loop:FAIL
  - kernel-patches#260 pkt_access:FAIL
  - kernel-patches#269 prog_run_opts:FAIL
  - kernel-patches#280 rbtree_success:FAIL
  - kernel-patches#356 res_spin_lock_failure:FAIL
  - kernel-patches#364 setget_sockopt:FAIL
  - kernel-patches#381 sock_fields:FAIL
  - kernel-patches#394 spin_lock:FAIL
  - kernel-patches#395 spin_lock_success:FAIL
  - kernel-patches#444 test_bpffs:FAIL
  - kernel-patches#453 test_profiler:FAIL
  - kernel-patches#479 usdt:FAIL
  - kernel-patches#488 verifier_bits_iter:FAIL
  - kernel-patches#597 verif_scale_pyperf600:FAIL
  - kernel-patches#598 verif_scale_pyperf600_bpf_loop:FAIL
  - kernel-patches#599 verif_scale_pyperf600_iter:FAIL
  - kernel-patches#608 verif_scale_strobemeta_subprogs:FAIL
  - kernel-patches#622 xdp_attach:FAIL
  - kernel-patches#637 xdp_noinline:FAIL
  - kernel-patches#639 xdp_synproxy:FAIL
  - kernel-patches#72 cls_redirect:FAIL
  - kernel-patches#88 crypto_sanity:FAIL
  - kernel-patches#97 dynptr:FAIL

Signed-off-by: Eduard Zingerman <[email protected]>
Pull request for series with
subject: Socket migration for SO_REUSEPORT.
version: 2
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=397573