Commit 69e3c75

Johann Baudy authored and davem330 committed
net: TX_RING and packet mmap
New packet socket feature that makes packet socket more efficient for
transmission.

- It reduces number of system call through a PACKET_TX_RING mechanism,
  based on PACKET_RX_RING (Circular buffer allocated in kernel space
  which is mmapped from user space).
- It minimizes CPU copy using fragmented SKB (almost zero copy).

Signed-off-by: Johann Baudy <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

1 parent f67f340 commit 69e3c75

4 files changed, 616 additions and 135 deletions

Documentation/networking/packet_mmap.txt

Lines changed: 121 additions & 19 deletions
@@ -4,16 +4,18 @@
 
 This file documents the CONFIG_PACKET_MMAP option available with the PACKET
 socket interface on 2.4 and 2.6 kernels. This type of sockets is used for
-capture network traffic with utilities like tcpdump or any other that uses
-the libpcap library.
-
-You can find the latest version of this document at
+capture network traffic with utilities like tcpdump or any other that needs
+raw access to network interface.
 
+You can find the latest version of this document at:
     http://pusa.uv.es/~ulisses/packet_mmap/
 
-Please send me your comments to
+Howto can be found at:
+    http://wiki.gnu-log.net (packet_mmap)
 
+Please send your comments to
     Ulisses Alonso Camaró <[email protected]>
+    Johann Baudy <[email protected]>
 
 -------------------------------------------------------------------------------
 + Why use PACKET_MMAP
@@ -25,19 +27,24 @@ to capture each packet, it requires two if you want to get packet's
 timestamp (like libpcap always does).
 
 In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size
-configurable circular buffer mapped in user space. This way reading packets just
-needs to wait for them, most of the time there is no need to issue a single
-system call. By using a shared buffer between the kernel and the user
-also has the benefit of minimizing packet copies.
-
-It's fine to use PACKET_MMAP to improve the performance of the capture process,
-but it isn't everything. At least, if you are capturing at high speeds (this
-is relative to the cpu speed), you should check if the device driver of your
-network interface card supports some sort of interrupt load mitigation or
-(even better) if it supports NAPI, also make sure it is enabled.
+configurable circular buffer mapped in user space that can be used to either
+send or receive packets. This way reading packets just needs to wait for them,
+most of the time there is no need to issue a single system call. Concerning
+transmission, multiple packets can be sent through one system call to get the
+highest bandwidth.
+By using a shared buffer between the kernel and the user also has the benefit
+of minimizing packet copies.
+
+It's fine to use PACKET_MMAP to improve the performance of the capture and
+transmission process, but it isn't everything. At least, if you are capturing
+at high speeds (this is relative to the cpu speed), you should check if the
+device driver of your network interface card supports some sort of interrupt
+load mitigation or (even better) if it supports NAPI, also make sure it is
+enabled. For transmission, check the MTU (Maximum Transmission Unit) used and
+supported by devices of your network.
 
 --------------------------------------------------------------------------------
-+ How to use CONFIG_PACKET_MMAP
++ How to use CONFIG_PACKET_MMAP to improve capture process
 --------------------------------------------------------------------------------
 
 From the user standpoint, you should use the higher level libpcap library, which
@@ -57,7 +64,7 @@ the low level details or want to improve libpcap by including PACKET_MMAP
 support.
 
 --------------------------------------------------------------------------------
-+ How to use CONFIG_PACKET_MMAP directly
++ How to use CONFIG_PACKET_MMAP directly to improve capture process
 --------------------------------------------------------------------------------
 
 From the system calls stand point, the use of PACKET_MMAP involves
@@ -66,6 +73,7 @@ the following process:
 
 [setup]     socket() -------> creation of the capture socket
             setsockopt() ---> allocation of the circular buffer (ring)
+                              option: PACKET_RX_RING
             mmap() ---------> mapping of the allocated buffer to the
                               user process
 
@@ -96,14 +104,76 @@ Next I will describe PACKET_MMAP settings and it's constraints,
 also the mapping of the circular buffer in the user process and
 the use of this buffer.
 
+--------------------------------------------------------------------------------
++ How to use CONFIG_PACKET_MMAP directly to improve transmission process
+--------------------------------------------------------------------------------
+Transmission process is similar to capture as shown below.
+
+[setup]          socket() -------> creation of the transmission socket
+                 setsockopt() ---> allocation of the circular buffer (ring)
+                                   option: PACKET_TX_RING
+                 bind() ---------> bind transmission socket with a network interface
+                 mmap() ---------> mapping of the allocated buffer to the
+                                   user process
+
+[transmission]   poll() ---------> wait for free packets (optional)
+                 send() ---------> send all packets that are set as ready in
+                                   the ring
+                                   The flag MSG_DONTWAIT can be used to return
+                                   before end of transfer.
+
+[shutdown]       close() --------> destruction of the transmission socket and
+                                   deallocation of all associated resources.
+
+Binding the socket to your network interface is mandatory (with zero copy) to
+know the header size of frames used in the circular buffer.
+
+As capture, each frame contains two parts:
+
+ --------------------
+| struct tpacket_hdr | Header. It contains the status of
+|                    | of this frame
+|--------------------|
+| data buffer        |
+.                    .  Data that will be sent over the network interface.
+.                    .
+ --------------------
+
+ bind() associates the socket to your network interface thanks to
+ sll_ifindex parameter of struct sockaddr_ll.
+
+ Initialization example:
+
+ struct sockaddr_ll my_addr;
+ struct ifreq s_ifr;
+ ...
+
+ strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));
+
+ /* get interface index of eth0 */
+ ioctl(this->socket, SIOCGIFINDEX, &s_ifr);
+
+ /* fill sockaddr_ll struct to prepare binding */
+ my_addr.sll_family = AF_PACKET;
+ my_addr.sll_protocol = ETH_P_ALL;
+ my_addr.sll_ifindex =  s_ifr.ifr_ifindex;
+
+ /* bind socket to eth0 */
+ bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));
+
+ A complete tutorial is available at: http://wiki.gnu-log.net/
+
 --------------------------------------------------------------------------------
 + PACKET_MMAP settings
 --------------------------------------------------------------------------------
 
 
 To setup PACKET_MMAP from user level code is done with a call like
 
+ - Capture process
      setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req))
+ - Transmission process
+     setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req))
 
 The most significant argument in the previous call is the req parameter,
 this parameter must to have the following structure:
@@ -117,11 +187,11 @@ this parameter must to have the following structure:
     };
 
 This structure is defined in /usr/include/linux/if_packet.h and establishes a
-circular buffer (ring) of unswappable memory mapped in the capture process.
+circular buffer (ring) of unswappable memory.
 Being mapped in the capture process allows reading the captured frames and
 related meta-information like timestamps without requiring a system call.
 
-Captured frames are grouped in blocks. Each block is a physically contiguous
+Frames are grouped in blocks. Each block is a physically contiguous
 region of memory and holds tp_block_size/tp_frame_size frames. The total number
 of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because
 
@@ -336,6 +406,7 @@ struct tpacket_hdr). If this field is 0 means that the frame is ready
 to be used for the kernel, If not, there is a frame the user can read
 and the following flags apply:
 
++++ Capture process:
      from include/linux/if_packet.h
 
      #define TP_STATUS_COPY          2
@@ -391,6 +462,37 @@ packets are in the ring:
 It doesn't incur in a race condition to first check the status value and
 then poll for frames.
 
+
+++ Transmission process
+Those defines are also used for transmission:
+
+     #define TP_STATUS_AVAILABLE        0 // Frame is available
+     #define TP_STATUS_SEND_REQUEST     1 // Frame will be sent on next send()
+     #define TP_STATUS_SENDING          2 // Frame is currently in transmission
+     #define TP_STATUS_WRONG_FORMAT     4 // Frame format is not correct
+
+First, the kernel initializes all frames to TP_STATUS_AVAILABLE. To send a
+packet, the user fills a data buffer of an available frame, sets tp_len to
+current data buffer size and sets its status field to TP_STATUS_SEND_REQUEST.
+This can be done on multiple frames. Once the user is ready to transmit, it
+calls send(). Then all buffers with status equal to TP_STATUS_SEND_REQUEST are
+forwarded to the network device. The kernel updates each status of sent
+frames with TP_STATUS_SENDING until the end of transfer.
+At the end of each transfer, buffer status returns to TP_STATUS_AVAILABLE.
+
+    header->tp_len = in_i_size;
+    header->tp_status = TP_STATUS_SEND_REQUEST;
+    retval = send(this->socket, NULL, 0, 0);
+
+The user can also use poll() to check if a buffer is available:
+(status == TP_STATUS_SENDING)
+
+    struct pollfd pfd;
+    pfd.fd = fd;
+    pfd.revents = 0;
+    pfd.events = POLLOUT;
+    retval = poll(&pfd, 1, timeout);
+
 --------------------------------------------------------------------------------
 + THANKS
 --------------------------------------------------------------------------------
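
Reviewer sketch (not part of the commit): the transmission sequence documented in the hunk above (pick an available frame, fill its data buffer, set tp_len, then set TP_STATUS_SEND_REQUEST) can be illustrated on plain memory without a socket. The names demo_frame_hdr, demo_tx_queue and demo_tx_pending are hypothetical, the header is a simplified stand-in for struct tpacket_hdr, and placing the payload directly after the header is an assumption made for the illustration; the real data offset depends on the kernel's frame layout.

```c
#include <stddef.h>
#include <string.h>

/* Tx status values as added to include/linux/if_packet.h by this commit. */
#define TP_STATUS_AVAILABLE     0x0
#define TP_STATUS_SEND_REQUEST  0x1
#define TP_STATUS_SENDING       0x2
#define TP_STATUS_WRONG_FORMAT  0x4

/* Hypothetical, simplified frame header: only the two fields the text
 * mentions. The real struct tpacket_hdr has more members. */
struct demo_frame_hdr {
    unsigned long tp_status;
    unsigned int  tp_len;
};

/* Hypothetical helper: queue one payload in frame idx of a ring made of
 * frame_nr frames of frame_size bytes each.
 * Returns 0 on success, -1 if the index is out of range, the payload does
 * not fit, or the frame is not in the TP_STATUS_AVAILABLE state. */
static int demo_tx_queue(void *ring, size_t frame_size, size_t frame_nr,
                         size_t idx, const void *payload, size_t len)
{
    struct demo_frame_hdr *hdr;

    if (idx >= frame_nr || frame_size < sizeof(*hdr) ||
        len > frame_size - sizeof(*hdr))
        return -1;

    hdr = (struct demo_frame_hdr *)((char *)ring + idx * frame_size);
    if (hdr->tp_status != TP_STATUS_AVAILABLE)
        return -1;

    /* Assumption for this sketch: payload starts right after the header. */
    memcpy((char *)hdr + sizeof(*hdr), payload, len);
    hdr->tp_len = (unsigned int)len;
    /* Status is written last: it marks the frame as ready for send(). */
    hdr->tp_status = TP_STATUS_SEND_REQUEST;
    return 0;
}

/* Hypothetical helper: count the frames a later send() would pick up. */
static size_t demo_tx_pending(const void *ring, size_t frame_size,
                              size_t frame_nr)
{
    size_t i, n = 0;

    for (i = 0; i < frame_nr; i++) {
        const struct demo_frame_hdr *hdr = (const struct demo_frame_hdr *)
            ((const char *)ring + i * frame_size);
        if (hdr->tp_status == TP_STATUS_SEND_REQUEST)
            n++;
    }
    return n;
}
```

With a real PACKET_TX_RING, the ring pointer would come from mmap() on the bound socket, and a single send(fd, NULL, 0, 0) would then flush every frame marked TP_STATUS_SEND_REQUEST, as the documentation text describes.
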

include/linux/if_packet.h

Lines changed: 15 additions & 5 deletions
@@ -46,6 +46,8 @@ struct sockaddr_ll
 #define PACKET_VERSION                  10
 #define PACKET_HDRLEN                   11
 #define PACKET_RESERVE                  12
+#define PACKET_TX_RING                  13
+#define PACKET_LOSS                     14
 
 struct tpacket_stats
 {
@@ -63,14 +65,22 @@ struct tpacket_auxdata
         __u16           tp_vlan_tci;
 };
 
+/* Rx ring - header status */
+#define TP_STATUS_KERNEL        0x0
+#define TP_STATUS_USER          0x1
+#define TP_STATUS_COPY          0x2
+#define TP_STATUS_LOSING        0x4
+#define TP_STATUS_CSUMNOTREADY  0x8
+
+/* Tx ring - header status */
+#define TP_STATUS_AVAILABLE     0x0
+#define TP_STATUS_SEND_REQUEST  0x1
+#define TP_STATUS_SENDING       0x2
+#define TP_STATUS_WRONG_FORMAT  0x4
+
 struct tpacket_hdr
 {
         unsigned long   tp_status;
-#define TP_STATUS_KERNEL        0
-#define TP_STATUS_USER          1
-#define TP_STATUS_COPY          2
-#define TP_STATUS_LOSING        4
-#define TP_STATUS_CSUMNOTREADY  8
         unsigned int    tp_len;
         unsigned int    tp_snaplen;
         unsigned short  tp_mac;
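
Reviewer sketch (not part of the commit): the hunk above gives the Rx and Tx status sets overlapping numeric values (0x1 is both TP_STATUS_USER and TP_STATUS_SEND_REQUEST, 0x2 is both TP_STATUS_COPY and TP_STATUS_SENDING), so a tp_status value can only be interpreted once the ring direction is known. The helper below is a hypothetical illustration of that point; demo_ring_dir and demo_status_name are invented names. Treating the Rx values as bit flags follows the documentation's wording that "the following flags apply", and treating the Tx values as exclusive states is an assumption of this sketch.

```c
/* Status values as defined in the if_packet.h hunk of this commit. */
#define TP_STATUS_KERNEL        0x0
#define TP_STATUS_USER          0x1
#define TP_STATUS_COPY          0x2
#define TP_STATUS_LOSING        0x4
#define TP_STATUS_CSUMNOTREADY  0x8

#define TP_STATUS_AVAILABLE     0x0
#define TP_STATUS_SEND_REQUEST  0x1
#define TP_STATUS_SENDING       0x2
#define TP_STATUS_WRONG_FORMAT  0x4

/* Hypothetical: which kind of ring the frame belongs to. */
enum demo_ring_dir { DEMO_RING_RX, DEMO_RING_TX };

/* Hypothetical helper: name the owner/state of a frame for a given ring
 * direction. On an Rx ring only the USER bit decides ownership; the other
 * bits are extra flags. On a Tx ring the value is compared as a whole. */
static const char *demo_status_name(enum demo_ring_dir dir,
                                    unsigned long status)
{
    if (dir == DEMO_RING_RX) {
        if (status == TP_STATUS_KERNEL)
            return "TP_STATUS_KERNEL";
        if (status & TP_STATUS_USER)
            return "TP_STATUS_USER";
        return "unknown";
    }

    switch (status) {
    case TP_STATUS_AVAILABLE:
        return "TP_STATUS_AVAILABLE";
    case TP_STATUS_SEND_REQUEST:
        return "TP_STATUS_SEND_REQUEST";
    case TP_STATUS_SENDING:
        return "TP_STATUS_SENDING";
    case TP_STATUS_WRONG_FORMAT:
        return "TP_STATUS_WRONG_FORMAT";
    default:
        return "unknown";
    }
}
```

The same raw value 0x1 therefore reads as TP_STATUS_USER on a capture ring and as TP_STATUS_SEND_REQUEST on a transmission ring.
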

include/linux/skbuff.h

Lines changed: 3 additions & 0 deletions
@@ -203,6 +203,9 @@ struct skb_shared_info {
 #ifdef CONFIG_HAS_DMA
         dma_addr_t      dma_maps[MAX_SKB_FRAGS + 1];
 #endif
+        /* Intermediate layers must ensure that destructor_arg
+         * remains valid until skb destructor */
+        void *          destructor_arg;
 };
 
 /* We divide dataref into two halves.  The higher 16 bits hold references
