path: root/net/core/skbuff.c
2026-02-24  net: Drop the lock in skb_may_tx_timestamp()  (Sebastian Andrzej Siewior)

skb_may_tx_timestamp() may acquire sock::sk_callback_lock. The lock must not
be taken in IRQ context; only softirq is okay. A few drivers receive the
timestamp via a dedicated interrupt and complete the TX timestamp from that
handler. This leads to a deadlock if the lock is already write-locked on the
same CPU.

Taking the lock can be avoided. The socket (pointed to by the skb) remains
valid until the skb is released. The ->sk_socket and ->file members are set
to NULL once the user closes the socket, which may happen before the
timestamp arrives. If we happen to observe the pointers while the socket is
closing but before they are set to NULL, we may still use them, because both
pointers (and the file's cred member) are RCU-freed.

Drop the lock. Use READ_ONCE() to obtain the individual pointers. Add a
matching WRITE_ONCE() where the pointers are cleared.

Link: https://lore.kernel.org/all/20260205145104.iWinkXHv@linutronix.de
Fixes: b245be1f4db1a ("net-timestamp: no-payload only sysctl")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260220183858.N4ERjFW6@linutronix.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

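A minimal sketch of the lockless pattern the commit describes (illustrative,
not the exact kernel diff; only READ_ONCE()/WRITE_ONCE() and the
sk_socket/file members come from the text above):

    /* Reader side, possibly running from a hard-IRQ timestamp handler:
     * sample each pointer exactly once; it is either NULL or points to
     * an object that is RCU-freed, so no sk_callback_lock is needed.
     */
    struct socket *sock = READ_ONCE(sk->sk_socket);
    struct file *file = sock ? READ_ONCE(sock->file) : NULL;

    if (file) {
            /* file (and file->f_cred) remain valid under RCU */
    }

    /* Writer side, socket close path: pair the readers with WRITE_ONCE() */
    WRITE_ONCE(sk->sk_socket, NULL);
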
2026-02-17  net: do not delay zero-copy skbs in skb_attempt_defer_free()  (Eric Dumazet)

After the blamed commit, TCP TX zero-copy notifications could be arbitrarily
delayed and cause regressions in applications waiting for them.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Fixes: e20dfbad8aab ("net: fix napi_consume_skb() with alien skbs")
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260216193653.627617-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2026-02-06  net: skb: allow up to 8 skb extension ids  (Oliver Hartkopp)

The skb extension ids range from 0 .. 7 so that their bits fit as flags into
a single byte. The ids are automatically enumerated in enum skb_ext_id in
skbuff.h, where SKB_EXT_NUM is defined as the last value. When having 8 skb
extension ids (0 .. 7), SKB_EXT_NUM becomes 8, which is a valid value for
SKB_EXT_NUM.

Fixes: 96ea3a1e2d31 ("can: add CAN skb extension infrastructure")
Link: https://lore.kernel.org/netdev/aXoMqaA7b2CqJZNA@strlen.de/
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://patch.msgid.link/20260205-skb_ext-v1-1-9ba992ccee8b@hartkopp.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

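A hedged illustration of the constraint (the enumerator list is abbreviated
and partly illustrative; only the auto-enumeration, the one-byte flag field
and SKB_EXT_NUM come from the text above):

    enum skb_ext_id {
            SKB_EXT_BRIDGE_NF,      /* first id, flag bit 0 */
            SKB_EXT_SEC_PATH,       /* flag bit 1 */
            /* ... further ids, e.g. SKB_EXT_CAN ... */
            SKB_EXT_NUM,            /* count of ids, not an id itself */
    };

    /* each id owns one flag bit in a single byte, so 8 ids (0..7), and
     * therefore SKB_EXT_NUM == 8, must still pass the build-time check */
    static_assert(SKB_EXT_NUM <= 8, "skb extension flags must fit in a u8");
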
2026-02-05  net: add vlan_get_protocol_offset_inline() helper  (Eric Dumazet)

skb_protocol() is bloated, and forces slow stack canaries in many fast
paths.

Add vlan_get_protocol_offset_inline(), which deals with the non-vlan common
cases. __vlan_get_protocol_offset() is now out of line. It returns a
vlan_type_depth struct to avoid stack canaries in callers.

struct vlan_type_depth {
        __be16 type;
        u16 depth;
};

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/2 grow/shrink: 0/22 up/down: 0/-6320 (-6320)
Function                       old     new   delta
vlan_get_protocol_dgram         61      59      -2
__pfx_skb_protocol              16       -     -16
__vlan_get_protocol_offset     307     273     -34
tap_get_user                  1374    1207    -167
ip_md_tunnel_xmit             1625    1452    -173
tap_sendmsg                    940     753    -187
netif_skb_features            1079     866    -213
netem_enqueue                 3017    2800    -217
vlan_parse_protocol            271      50    -221
tso_start                      567     344    -223
fq_dequeue                    1908    1685    -223
skb_network_protocol           434     205    -229
ip6_tnl_xmit                  2639    2409    -230
br_dev_queue_push_xmit         474     236    -238
skb_protocol                   258       -    -258
packet_parse_headers           621     357    -264
__ip6_tnl_rcv                 1306    1039    -267
skb_csum_hwoffload_help        515     224    -291
ip_tunnel_xmit                2635    2339    -296
sch_frag_xmit_hook            1582    1233    -349
bpf_skb_ecn_set_ce             868     457    -411
IP6_ECN_decapsulate           1297     768    -529
ip_tunnel_rcv                 2121    1489    -632
ipip6_rcv                     2572    1922    -650
Total: Before=24892803, After=24886483, chg -0.03%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260204053023.1622775-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

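A hedged sketch of the fast-path shape this implies (signature and body are
assumptions; only the helper names and the vlan_type_depth return type come
from the text above):

    static inline struct vlan_type_depth
    vlan_get_protocol_offset_inline(const struct sk_buff *skb)
    {
            __be16 type = skb->protocol;

            /* common case: no 802.1Q/802.1ad tag, nothing to parse */
            if (likely(type != htons(ETH_P_8021Q) &&
                       type != htons(ETH_P_8021AD)))
                    return (struct vlan_type_depth){ .type = type };

            return __vlan_get_protocol_offset(skb);     /* out of line */
    }

Returning the small struct by value keeps the result in registers, which is
what avoids the stack canary in callers.
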
2026-02-05  can: add CAN skb extension infrastructure  (Oliver Hartkopp)

To remove the private CAN bus skb headroom infrastructure, 8 bytes need to
be stored in the skb. The skb extensions are a common pattern and an easy
and efficient way to hold private data travelling along with the skb. We
only need the skb_ext_add() and skb_ext_find() functions to allocate and
access CAN specific content, as the skb helpers to copy/clone/free skbs
automatically take care of skb extensions and their final removal.

This patch introduces the complete CAN skb extensions infrastructure:

- add struct can_skb_ext in new file include/net/can.h
- add include/net/can.h in MAINTAINERS
- add SKB_EXT_CAN to skbuff.c and skbuff.h
- select SKB_EXTENSIONS in Kconfig when CONFIG_CAN is enabled
- check for existing CAN skb extensions in can_rcv() in af_can.c
- add CAN skb extensions allocation at every skb_alloc() location
- duplicate the skb extensions if cloning outgoing skbs (framelen/gw_hops)
- introduce can_skb_ext_add() and can_skb_ext_find() helpers

The patch also corrects an indentation issue in the original code from 2018.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202602010426.PnGrYAk3-lkp@intel.com/
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://patch.msgid.link/20260201-can_skb_ext-v8-2-3635d790fe8b@hartkopp.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

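A hedged sketch of what the two helpers reduce to on top of the generic skb
extension API (skb_ext_add()/skb_ext_find() are the real primitives; the
struct layout shown is illustrative, guessed from the "8 bytes" and the
framelen/gw_hops fields mentioned above):

    #include <linux/skbuff.h>

    struct can_skb_ext {
            u8 framelen;    /* CAN frame length on the wire */
            u8 gw_hops;     /* can-gw hop count */
            /* remaining bytes of the 8 stored bytes elided */
    };

    static inline struct can_skb_ext *can_skb_ext_add(struct sk_buff *skb)
    {
            return skb_ext_add(skb, SKB_EXT_CAN);
    }

    static inline struct can_skb_ext *can_skb_ext_find(const struct sk_buff *skb)
    {
            return skb_ext_find(skb, SKB_EXT_CAN);
    }
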
2026-01-25  net: inline get_netmem() and put_netmem()  (Eric Dumazet)

These helpers are used in network fast paths. Only call out-of-line helpers
for the netmem case. We might consider inlining __get_netmem() and
__put_netmem() in the future.

$ scripts/bloat-o-meter -t vmlinux.3 vmlinux.4
add/remove: 6/6 grow/shrink: 22/1 up/down: 2614/-646 (1968)
Function                            old     new   delta
pskb_carve                         1669    1894    +225
gro_pull_from_frag0                   -     206    +206
get_page                            190     380    +190
skb_segment                        3561    3747    +186
put_page                            595     765    +170
skb_copy_ubufs                     1683    1822    +139
__pskb_trim_head                    276     401    +125
__pskb_copy_fclone                  734     858    +124
skb_zerocopy                       1092    1215    +123
pskb_expand_head                    892    1008    +116
skb_split                           828     940    +112
skb_release_data                    297     409    +112
___pskb_trim                        829     941    +112
__skb_zcopy_downgrade_managed       120     226    +106
tcp_clone_payload                   530     634    +104
esp_ssg_unref                       191     294    +103
dev_gro_receive                    1464    1514     +50
__put_netmem                          -      41     +41
__get_netmem                          -      41     +41
skb_shift                          1139    1175     +36
skb_try_coalesce                    681     714     +33
__pfx_put_page                      112     144     +32
__pfx_get_page                       32      64     +32
__pskb_pull_tail                   1137    1168     +31
veth_xdp_get                        250     267     +17
__pfx_gro_pull_from_frag0             -      16     +16
__pfx___put_netmem                    -      16     +16
__pfx___get_netmem                    -      16     +16
__pfx_put_netmem                     16       -     -16
__pfx_gro_try_pull_from_frag0        16       -     -16
__pfx_get_netmem                     16       -     -16
put_netmem                          114       -    -114
get_netmem                          130       -    -130
napi_gro_frags                      929     771    -158
gro_try_pull_from_frag0             196       -    -196
Total: Before=22565857, After=22567825, chg +0.01%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260122045720.1221017-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

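A plausible shape for the now-inline wrappers (hedged sketch; only the
helper names and the "out of line only for the net_iov case" split come
from the text above):

    static inline void get_netmem(netmem_ref netmem)
    {
            if (netmem_is_net_iov(netmem))
                    __get_netmem(netmem);   /* out of line, rarer case */
            else
                    get_page(netmem_to_page(netmem));
    }

    static inline void put_netmem(netmem_ref netmem)
    {
            if (netmem_is_net_iov(netmem))
                    __put_netmem(netmem);
            else
                    put_page(netmem_to_page(netmem));
    }
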
2026-01-22  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)

Cross-merge networking fixes after downstream PR (net-6.19-rc7).

Conflicts:

drivers/net/ethernet/huawei/hinic3/hinic3_irq.c
  b35a6fd37a00 ("hinic3: Add adaptive IRQ coalescing with DIM")
  fb2bb2a1ebf7 ("hinic3: Fix netif_queue_set_napi queue_index input parameter error")
https://lore.kernel.org/fc0a7fdf08789a52653e8ad05281a0a849e79206.1768915707.git.zhuyikai1@h-partners.com

drivers/net/wireless/ath/ath12k/mac.c
drivers/net/wireless/ath/ath12k/wifi7/hw.c
  31707572108d ("wifi: ath12k: Fix wrong P2P device link id issue")
  c26f294fef2a ("wifi: ath12k: Move ieee80211_ops callback to the arch specific module")
https://lore.kernel.org/20260114123751.6a208818@canb.auug.org.au

Adjacent changes:

drivers/net/wireless/ath/ath12k/mac.c
  8b8d6ee53dfd ("wifi: ath12k: Fix scan state stuck in ABORTING after cancel_remain_on_channel")
  914c890d3b90 ("wifi: ath12k: Add framework for hardware specific ieee80211_ops registration")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2026-01-20  net: add kdoc for napi_consume_skb()  (Jakub Kicinski)

Looks like AI reviewers miss that napi_consume_skb() must have a real budget
passed to it. Let's see if adding a real kdoc will help them figure this
out.

Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20260119224140.1362729-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

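A paraphrased sketch of what such a kdoc needs to convey (not the verbatim
comment from the patch):

    /**
     * napi_consume_skb - consume an skb from NAPI Tx completion
     * @skb: buffer to release
     * @budget: the real NAPI budget handed to the poll method; callers
     *          outside NAPI context must pass 0, which makes the skb take
     *          the regular (non-cached) freeing path
     */
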
2026-01-20  net: fclone allocation small optimization  (Eric Dumazet)

After skb allocation, the initial skb->fclone value is 0
(SKB_FCLONE_UNAVAILABLE). We can replace one RMW sequence with a single OR
instruction:

movzbl 0x7e(%r13),%eax    // skb->fclone = SKB_FCLONE_ORIG;
and    $0xf3,%al
or     $0x4,%al
mov    %al,0x7e(%r13)

->

or     $0x4,0x7e(%r13)    // skb->fclone |= SKB_FCLONE_ORIG;

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260116164402.1872649-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2026-01-20  net: split kmalloc_reserve() to allow inlining  (Eric Dumazet)

kmalloc_reserve() is too big to be inlined. Put the slow path in a new
out-of-line function: kmalloc_pfmemalloc(). Then let kmalloc_reserve() set
skb->pfmemalloc only when/if the slow path is taken.

This makes __alloc_skb() faster:

- kmalloc_reserve() is now automatically inlined by both gcc and clang.
- No more expensive RMW (skb->pfmemalloc = pfmemalloc).
- No more expensive stack canary (for CONFIG_STACKPROTECTOR_STRONG=y).
- Removal of two prefetches that were coming too late for modern cpus.

Text size increase is quite small compared to the cpu savings (~0.7 %):

$ size net/core/skbuff.clang.before.o net/core/skbuff.clang.after.o
   text    data  bss    dec    hex  filename
  72507    5897    0  78404  13244  net/core/skbuff.clang.before.o
  72681    5897    0  78578  132f2  net/core/skbuff.clang.after.o

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260116041359.181104-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

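A hedged sketch of the split (the two function names come from the text
above; signatures and bodies here are assumptions, not the actual diff):

    /* slow path, out of line: only reached under memory pressure */
    void *kmalloc_pfmemalloc(unsigned int size, gfp_t flags,
                             struct sk_buff *skb);

    static inline void *kmalloc_reserve(unsigned int size, gfp_t flags,
                                        struct sk_buff *skb)
    {
            void *data = kmalloc(size, flags);

            if (likely(data))
                    return data;    /* fast path: no skb->pfmemalloc write */

            /* the out-of-line helper retries with pfmemalloc reserves and
             * sets skb->pfmemalloc only when that actually happens */
            return kmalloc_pfmemalloc(size, flags, skb);
    }
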
2026-01-15  net: minor __alloc_skb() optimization  (Eric Dumazet)

We can directly call __finalize_skb_around() instead of
__build_skb_around() because @size is not zero.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260113131017.2310584-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2026-01-15  net: add skb->data_len and (skb->end - skb->tail) to skb_dump()  (Eric Dumazet)

While working on a syzbot report, I found that skb_dump() is lacking two
important parts:

- skb->data_len
- (skb->end - skb->tail); tailroom is zero if the skb is not linear.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260112172621.4188700-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2026-01-15  net: inline napi_skb_cache_get()  (Eric Dumazet)

clang is inlining it already, gcc (14.2) does not. Small space cost (215
bytes on x86_64) but faster sk_buff allocations.

$ scripts/bloat-o-meter -t net/core/skbuff.gcc.before.o net/core/skbuff.gcc.after.o
add/remove: 0/1 grow/shrink: 4/1 up/down: 359/-144 (215)
Function              old    new   delta
__alloc_skb           471    611    +140
napi_build_skb        245    363    +118
napi_alloc_skb        331    416     +85
skb_copy_ubufs       1869   1885     +16
skb_shift            1445   1413     -32
napi_skb_cache_get    112      -    -112
Total: Before=59941, After=60156, chg +0.36%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260112131515.4051589-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

2026-01-12  net: add skbuff_clear() helper  (Eric Dumazet)

clang is unable to inline the memset() calls in net/core/skbuff.c when
initializing an allocated sk_buff:

memset(skb, 0, offsetof(struct sk_buff, tail));

This is unfortunate, because:

1) calling the external memset_orig() helper adds a call/ret and typical
   setup cost.
2) offsetof(struct sk_buff, tail) == 0xb8 = 0x80 + 0x38. On x86_64,
   memset_orig() performs two 64-byte clears, then has to loop 7 times to
   clear the final 56 bytes.

skbuff_clear() makes sure the minimal and optimal code is generated.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260109203836.1667441-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

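One possible shape for such a helper (a hedged sketch; the commit text does
not show the implementation, and this version assumes clearing in
word-sized chunks that the compiler fully unrolls):

    static inline void skbuff_clear(struct sk_buff *skb)
    {
            u64 *p = (u64 *)skb;
            size_t i;

            /* 0xb8 bytes == 23 eight-byte stores, emitted inline */
            BUILD_BUG_ON(offsetof(struct sk_buff, tail) % sizeof(u64));
            for (i = 0; i < offsetof(struct sk_buff, tail) / sizeof(u64); i++)
                    p[i] = 0;
    }
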
2026-01-05  net: fix memory leak in skb_segment_list for GRO packets  (Mohammad Heib)

When skb_segment_list() is called during packet forwarding, it handles
packets that were aggregated by the GRO engine. Historically, the
segmentation logic in skb_segment_list() assumes that individual segments
are split from a parent SKB and may need to carry their own socket memory
accounting. Accordingly, the code transfers truesize from the parent to the
newly created segments.

Prior to commit ed4cccef64c1 ("gro: fix ownership transfer"), this truesize
subtraction in skb_segment_list() was valid because fragments still carried
a reference to the original socket. However, commit ed4cccef64c1 ("gro: fix
ownership transfer") changed this behavior by ensuring that fraglist
entries are explicitly orphaned (skb->sk = NULL) to prevent illegal
orphaning later in the stack.

This change meant that the entire socket memory charge remained with the
head SKB, but the corresponding accounting logic in skb_segment_list() was
never updated. As a result, the current code unconditionally adds each
fragment's truesize to delta_truesize and subtracts it from the parent SKB.
Since the fragments are no longer charged to the socket, this subtraction
results in an effective under-count of memory when the head is freed. This
causes sk_wmem_alloc to remain non-zero, preventing socket destruction and
leading to a persistent memory leak.

The leak can be observed via KMEMLEAK when tearing down the networking
environment:

unreferenced object 0xffff8881e6eb9100 (size 2048):
  comm "ping", pid 6720, jiffies 4295492526
  backtrace:
    kmem_cache_alloc_noprof+0x5c6/0x800
    sk_prot_alloc+0x5b/0x220
    sk_alloc+0x35/0xa00
    inet6_create.part.0+0x303/0x10d0
    __sock_create+0x248/0x640
    __sys_socket+0x11b/0x1d0

Since skb_segment_list() is exclusively used for SKB_GSO_FRAGLIST packets
constructed by GRO, the truesize adjustment is removed. The call to
skb_release_head_state() must be preserved. As documented in commit
cf673ed0e057 ("net: fix fraglist segmentation reference count leak"), it is
still required to correctly drop references to SKB extensions that may be
overwritten during __copy_skb_header().

Fixes: ed4cccef64c1 ("gro: fix ownership transfer")
Signed-off-by: Mohammad Heib <mheib@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260104213101.352887-1-mheib@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-11-27  net: restore napi_consume_skb()'s NULL-handling  (Jakub Kicinski)

Commit e20dfbad8aab ("net: fix napi_consume_skb() with alien skbs") added a
skb->cpu check to napi_consume_skb(), before the point where
napi_consume_skb() validated that skb is not NULL. Add an explicit check to
the early exit condition.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

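A hedged sketch of the restored ordering (illustrative fragment; the actual
patch simply makes the early exit run before any skb dereference):

    void napi_consume_skb(struct sk_buff *skb, int budget)
    {
            /* check the pointer before the skb->cpu (alien skb) test */
            if (unlikely(!skb))
                    return;

            /* ... budget test, alien-skb deferral, cache recycling ... */
    }
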
2025-11-19  net: prefetch the next skb in napi_skb_cache_get()  (Jason Xing)

After getting the current skb in napi_skb_cache_get(), the next skb in the
cache is highly likely to be used soon, so a prefetch would be helpful.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20251118070646.61344-5-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

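A hedged sketch of the idea (the per-cpu cache naming follows skbuff.c; the
exact indexing is illustrative):

    struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
    struct sk_buff *skb = nc->skb_cache[--nc->skb_count];

    /* the next slot is very likely consumed by the following allocation */
    if (nc->skb_count)
            prefetch(nc->skb_cache[nc->skb_count - 1]);
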
2025-11-19  net: use NAPI_SKB_CACHE_FREE to keep 32 as default to do bulk free  (Jason Xing)

- Replace NAPI_SKB_CACHE_HALF with NAPI_SKB_CACHE_FREE
- Only free 32 skbs in napi_skb_cache_put()

Since the first patch adjusted NAPI_SKB_CACHE_SIZE to 128, the number of
packets to be freed in the softirq was increased from 32 to 64. Considering
that a subsequent net_rx_action() calling napi_poll() a few times can
easily consume the 64 available slots, and that we can afford keeping a
higher number of sk_buffs in per-cpu storage, decrease NAPI_SKB_CACHE_FREE
to 32 like before.

So now the logic is 1) keeping 96 skbs, 2) freeing 32 skbs at one time.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20251118070646.61344-4-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-11-19  net: increase default NAPI_SKB_CACHE_BULK to 32  (Jason Xing)

The previous value 16 is a bit conservative, so adjust it along with
NAPI_SKB_CACHE_SIZE, which can minimize triggering memory allocation in
napi_skb_cache_get*().

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20251118070646.61344-3-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-11-19  net: increase default NAPI_SKB_CACHE_SIZE to 128  (Jason Xing)

After commit b61785852ed0 ("net: increase skb_defer_max default to 128")
changed the value of sysctl_skb_defer_max to avoid many calls to
kick_defer_list_purge(), the same reasoning can be applied to
NAPI_SKB_CACHE_SIZE, as was proposed in 2016. It's a trade-off between
using pre-allocated memory in skb_cache and saving on somewhat heavy
function calls in the softirq context.

With this patch applied, we can have more skbs per cpu to accelerate the
sending path that needs to acquire new skbs.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20251118070646.61344-2-kerneljasonxing@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-11-18  net: use napi_skb_cache even in process context  (Eric Dumazet)

This is a followup of commit e20dfbad8aab ("net: fix napi_consume_skb()
with alien skbs").

Now that the per-cpu napi_skb_cache is populated from the TX completion
path, we can make use of this cache, especially for cpus not used by a
driver NAPI poll (the primary user of napi_cache).

We can use the napi_skb_cache only if the current context is not hard IRQ.

With this patch, I consistently reach 130 Mpps on my UDP tx stress test and
reduce SLUB spinlock contention to smaller values. Note there is still some
SLUB contention for skb->head allocations. I had to tune
/sys/kernel/slab/skbuff_small_head/cpu_partial and
/sys/kernel/slab/skbuff_small_head/min_partial depending on the platform
taxonomy.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Tested-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251116202717.1542829-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-11-18  net: __alloc_skb() cleanup  (Eric Dumazet)

This patch refactors __alloc_skb() to prepare for the following one, and
does not change functionality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251116202717.1542829-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-11-18  net: add a new @alloc parameter to napi_skb_cache_get()  (Eric Dumazet)

We want the last patch of this series to be able to get an skb from the
napi_skb_cache from process context, if one is available.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251116202717.1542829-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-11-12  net: clear skb->sk in skb_release_head_state()  (Eric Dumazet)

skb_release_head_state() inlines skb_orphan(). We need to clear skb->sk,
otherwise we can freeze TCP flows on a mostly idle host, because
skb_fclone_busy() would return true as long as the packet is not yet
processed by skb_defer_free_flush().

Fixes: 1fcf572211da ("net: allow skb_release_head_state() to be called multiple times")
Fixes: e20dfbad8aab ("net: fix napi_consume_skb() with alien skbs")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Aditya Garg <gargaditya@linux.microsoft.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20251111151235.1903659-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-11-11  xsk: add indirect call for xsk_destruct_skb  (Jason Xing)

Since Eric proposed the idea of adding indirect call wrappers for UDP and
managed to see a huge improvement[1], the same approach can also be applied
to the xsk scenario. This patch adds an indirect call for xsk and helps the
current copy mode improve performance by a stable ~1%, observed with IXGBE
loaded at 10Gb/sec. If the throughput grows, the positive effect will be
magnified. I applied this patch on top of the batch xmit series[2] and was
able to see a <5% improvement from our internal application, though that
number is a little unstable.

Use INDIRECT wrappers to keep xsk_destruct_skb static, as it used to be
when the mitigation config is off. Be aware that the freeing path can be
very hot: the call frequency can reach around 2,000,000 times per second
with the xdpsock test.

[1]: https://lore.kernel.org/netdev/20251006193103.2684156-2-edumazet@google.com/
[2]: https://lore.kernel.org/all/20251021131209.41491-1-kerneljasonxing@gmail.com/

Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20251031103328.95468-1-kerneljasonxing@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

2025-11-10  net: Preserve metadata on pskb_expand_head  (Jakub Sitnicki)

pskb_expand_head() copies the headroom, including skb metadata, into the
newly allocated head, but then clears the metadata. As a result, metadata
is lost when BPF helpers trigger an skb head reallocation.

Let the skb metadata remain in the newly created copy of the head.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20251105-skb-meta-rx-path-v4-2-5ceb08a9b37b@cloudflare.com

2025-11-07  net: fix napi_consume_skb() with alien skbs  (Eric Dumazet)

There is a lack of NUMA awareness and, more generally, a lack of slab cache
affinity on the TX completion path. Modern drivers are using
napi_consume_skb(), hoping to cache sk_buffs in per-cpu caches so that they
can be recycled in the RX path.

Only use this if the skb was allocated on the same cpu; otherwise use
skb_attempt_defer_free() so that the skb is freed on the original cpu. This
removes contention on SLUB spinlocks and data structures.

After this patch, I get ~50% improvement for a UDP tx workload on an AMD
EPYC 9B45 (IDPF 200Gbit NIC with 32 TX queues): 80 Mpps -> 120 Mpps.

Profiling one of the 32 cpus servicing NIC interrupts:

Before:

mpstat -P 511 1 1
Average: CPU  %usr %nice %sys %iowait %irq  %soft %steal %guest %gnice %idle
Average: 511  0.00  0.00 0.00    0.00 0.00  98.00   0.00   0.00   0.00  2.00

31.01% ksoftirqd/511 [kernel.kallsyms] [k] queued_spin_lock_slowpath
12.45% swapper       [kernel.kallsyms] [k] queued_spin_lock_slowpath
 5.60% ksoftirqd/511 [kernel.kallsyms] [k] __slab_free
 3.31% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
 3.27% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
 2.95% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_start
 2.52% ksoftirqd/511 [kernel.kallsyms] [k] fq_dequeue
 2.32% ksoftirqd/511 [kernel.kallsyms] [k] read_tsc
 2.25% ksoftirqd/511 [kernel.kallsyms] [k] build_detached_freelist
 2.15% ksoftirqd/511 [kernel.kallsyms] [k] kmem_cache_free
 2.11% swapper       [kernel.kallsyms] [k] __slab_free
 2.06% ksoftirqd/511 [kernel.kallsyms] [k] idpf_features_check
 2.01% ksoftirqd/511 [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
 1.97% ksoftirqd/511 [kernel.kallsyms] [k] skb_release_data
 1.52% ksoftirqd/511 [kernel.kallsyms] [k] sock_wfree
 1.34% swapper       [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
 1.23% swapper       [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
 1.15% ksoftirqd/511 [kernel.kallsyms] [k] dma_unmap_page_attrs
 1.11% swapper       [kernel.kallsyms] [k] idpf_tx_splitq_start
 1.03% swapper       [kernel.kallsyms] [k] fq_dequeue
 0.94% swapper       [kernel.kallsyms] [k] kmem_cache_free
 0.93% swapper       [kernel.kallsyms] [k] read_tsc
 0.81% ksoftirqd/511 [kernel.kallsyms] [k] napi_consume_skb
 0.79% swapper       [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
 0.77% ksoftirqd/511 [kernel.kallsyms] [k] skb_free_head
 0.76% swapper       [kernel.kallsyms] [k] idpf_features_check
 0.72% swapper       [kernel.kallsyms] [k] skb_release_data
 0.69% swapper       [kernel.kallsyms] [k] build_detached_freelist
 0.58% ksoftirqd/511 [kernel.kallsyms] [k] skb_release_head_state
 0.56% ksoftirqd/511 [kernel.kallsyms] [k] __put_partials
 0.55% ksoftirqd/511 [kernel.kallsyms] [k] kmem_cache_free_bulk
 0.48% swapper       [kernel.kallsyms] [k] sock_wfree

After:

mpstat -P 511 1 1
Average: CPU  %usr %nice %sys %iowait %irq  %soft %steal %guest %gnice %idle
Average: 511  0.00  0.00 0.00    0.00 0.00  51.49   0.00   0.00   0.00 48.51

19.10% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_hdr
13.86% swapper [kernel.kallsyms] [k] idpf_tx_clean_buf_ring
10.80% swapper [kernel.kallsyms] [k] skb_attempt_defer_free
10.57% swapper [kernel.kallsyms] [k] idpf_tx_splitq_clean_all
 7.18% swapper [kernel.kallsyms] [k] queued_spin_lock_slowpath
 6.69% swapper [kernel.kallsyms] [k] sock_wfree
 5.55% swapper [kernel.kallsyms] [k] dma_unmap_page_attrs
 3.10% swapper [kernel.kallsyms] [k] fq_dequeue
 3.00% swapper [kernel.kallsyms] [k] skb_release_head_state
 2.73% swapper [kernel.kallsyms] [k] read_tsc
 2.48% swapper [kernel.kallsyms] [k] idpf_tx_splitq_start
 1.20% swapper [kernel.kallsyms] [k] idpf_features_check
 1.13% swapper [kernel.kallsyms] [k] napi_consume_skb
 0.93% swapper [kernel.kallsyms] [k] idpf_vport_splitq_napi_poll
 0.64% swapper [kernel.kallsyms] [k] native_send_call_func_single_ipi
 0.60% swapper [kernel.kallsyms] [k] acpi_processor_ffh_cstate_enter
 0.53% swapper [kernel.kallsyms] [k] io_idle
 0.43% swapper [kernel.kallsyms] [k] netif_skb_features
 0.41% swapper [kernel.kallsyms] [k] __direct_call_cpuidle_state_enter2
 0.40% swapper [kernel.kallsyms] [k] native_irq_return_iret
 0.40% swapper [kernel.kallsyms] [k] idpf_tx_buf_hw_update
 0.36% swapper [kernel.kallsyms] [k] sched_clock_noinstr
 0.34% swapper [kernel.kallsyms] [k] handle_softirqs
 0.32% swapper [kernel.kallsyms] [k] net_rx_action
 0.32% swapper [kernel.kallsyms] [k] dql_completed
 0.32% swapper [kernel.kallsyms] [k] validate_xmit_skb
 0.31% swapper [kernel.kallsyms] [k] skb_network_protocol
 0.29% swapper [kernel.kallsyms] [k] skb_csum_hwoffload_help
 0.29% swapper [kernel.kallsyms] [k] x2apic_send_IPI
 0.28% swapper [kernel.kallsyms] [k] ktime_get
 0.24% swapper [kernel.kallsyms] [k] __qdisc_run

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Link: https://patch.msgid.link/20251106202935.1776179-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-11-07  net: allow skb_release_head_state() to be called multiple times  (Eric Dumazet)

Currently, only the skb dst is cleared (thanks to skb_dst_drop()). Make
sure skb->destructor, conntrack and extensions are cleared as well.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://patch.msgid.link/20251106202935.1776179-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-10-20  net: shrink napi_skb_cache_{put,get}() and napi_skb_cache_get_bulk()  (Eric Dumazet)

The following loop in napi_skb_cache_put() is unrolled by the compiler even
if CONFIG_KASAN is not enabled:

for (i = NAPI_SKB_CACHE_HALF; i < NAPI_SKB_CACHE_SIZE; i++)
        kasan_mempool_unpoison_object(nc->skb_cache[i],
                                      kmem_cache_size(net_hotdata.skbuff_cache));

We have 32 copies of this sequence, for a total of 384 bytes:

48 8b 3d 00 00 00 00    mov  net_hotdata.skbuff_cache,%rdi
e8 00 00 00 00          call kmem_cache_size

This is because kmem_cache_size() is neither inline nor const, and
kasan_unpoison_object_data() is an inline function.

Cache the kmem_cache_size() result in a variable, so that the compiler can
remove dead code (and the variable) when/if CONFIG_KASAN is unset.

After this patch, napi_skb_cache_put() is inlined in its callers, and we
avoid one kmem_cache_size() call in napi_skb_cache_get() and
napi_skb_cache_get_bulk().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20251016182911.1132792-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

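The fix reduces to caching the size once (grounded directly in the text
above; the local variable name is illustrative):

    u32 size = kmem_cache_size(net_hotdata.skbuff_cache);

    for (i = NAPI_SKB_CACHE_HALF; i < NAPI_SKB_CACHE_SIZE; i++)
            kasan_mempool_unpoison_object(nc->skb_cache[i], size);
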
2025-10-16  net: add indirect call wrapper in skb_release_head_state()  (Eric Dumazet)

While stress testing UDP senders on a host with expensive indirect calls, I
found cpus processing TX completions were showing a very high cost (20%) in
sock_wfree() due to CONFIG_MITIGATION_RETPOLINE=y.

Take care of TCP and UDP TX destructors and use the INDIRECT_CALL_3()
macro.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Link: https://patch.msgid.link/20251014171907.3554413-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

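A hedged sketch of the pattern (the three candidate destructors are an
assumption based on "TCP and UDP TX destructors"; only the macro name is
given above):

    if (skb->destructor) {
            /* compare-and-branch against likely targets instead of a
             * retpoline-protected indirect call */
            INDIRECT_CALL_3(skb->destructor,
                            tcp_wfree, __sock_wfree, sock_wfree, skb);
            skb->destructor = NULL;
    }
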
2025-10-16  udp: do not use skb_release_head_state() before skb_attempt_defer_free()  (Eric Dumazet)

Michal reported and bisected an issue after recent adoption of
skb_attempt_defer_free() in UDP.

The issue here is that skb_release_head_state() is called twice per skb:
one time from skb_consume_udp(), then a second time from
skb_defer_free_flush() and napi_consume_skb().

As Sabrina suggested, remove the skb_release_head_state() call from
skb_consume_udp(). Add a DEBUG_NET_WARN_ON_ONCE(skb_nfct(skb)) in
skb_attempt_defer_free().

Many thanks to Michal, Sabrina, Paolo and Florian for their help.

Fixes: 6471658dc66c ("udp: use skb_attempt_defer_free()")
Reported-and-bisected-by: Michal Kubecek <mkubecek@suse.cz>
Closes: https://lore.kernel.org/netdev/gpjh4lrotyephiqpuldtxxizrsg6job7cvhiqrw72saz2ubs3h@g6fgbvexgl3r/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Michal Kubecek <mkubecek@suse.cz>
Cc: Sabrina Dubroca <sd@queasysnail.net>
Cc: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20251015052715.4140493-1-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

2025-09-30  net: add NUMA awareness to skb_attempt_defer_free()  (Eric Dumazet)

Instead of sharing sd->defer_list & sd->defer_count with many cpus, add one
pair for each NUMA node.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250928084934.3266948-4-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

2025-09-30  net: use llist for sd->defer_list  (Eric Dumazet)

Get rid of sd->defer_lock and adopt llist operations.

We optimize skb_attempt_defer_free() for the common case, where the packet
is queued. Otherwise sd->defer_count keeps increasing until
skb_defer_free_flush() clears it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250928084934.3266948-3-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

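A hedged sketch of the lockless enqueue this enables (the node member and
the kick heuristic are illustrative; only the llist adoption and the
defer_list/defer_count names come from the text above):

    /* llist_add() is lock-free and returns true if the list was empty */
    if (llist_add(&skb->ll_node, &sd->defer_list))
            kick_defer_list_purge(sd, cpu);   /* wake the owning cpu */
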
2025-09-30  net: make softnet_data.defer_count an atomic  (Eric Dumazet)

This is preparation work to remove the softnet_data.defer_lock, as it is
contended on hosts with a large number of cores.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20250928084934.3266948-2-edumazet@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

2025-09-30  netdevsim: a basic test PSP implementation  (Jakub Kicinski)

Provide a PSP implementation for netdevsim. Use psp_dev_encapsulate() and
psp_dev_rcv() to do actual encapsulation and decapsulation on skbs, but
perform no encryption or decryption.

In order to make encryption with a bad key result in a drop on the peer's
rx side, we stash our psd's generation number in the first byte of each key
before handing it to the peer.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Co-developed-by: Daniel Zahka <daniel.zahka@gmail.com>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Link: https://patch.msgid.link/20250927225420.1443468-2-kuba@kernel.org
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

2025-09-25  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski)

Cross-merge networking fixes after downstream PR (net-6.17-rc8).

Conflicts:

drivers/net/can/spi/hi311x.c
  6b6968084721 ("can: hi311x: fix null pointer dereference when resuming from sleep before interface was enabled")
  27ce71e1ce81 ("net: WQ_PERCPU added to alloc_workqueue users")
https://lore.kernel.org/72ce7599-1b5b-464a-a5de-228ff9724701@kernel.org

net/smc/smc_loopback.c
drivers/dibs/dibs_loopback.c
  a35c04de2565 ("net/smc: fix warning in smc_rx_splice() when calling get_page()")
  cc21191b584c ("dibs: Move data path to dibs layer")
https://lore.kernel.org/74368a5c-48ac-4f8e-a198-40ec1ed3cf5f@kernel.org

Adjacent changes:

drivers/net/dsa/lantiq/lantiq_gswip.c
  c0054b25e2f1 ("net: dsa: lantiq_gswip: move gswip_add_single_port_br() call to port_setup()")
  7a1eaef0a791 ("net: dsa: lantiq_gswip: support model-specific mac_select_pcs()")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-09-23  net: allow alloc_skb_with_frags() to use MAX_SKB_FRAGS  (Jason Baron)

Currently, alloc_skb_with_frags() will only fill (MAX_SKB_FRAGS - 1) slots.
I think it should use all MAX_SKB_FRAGS slots, as callers of
alloc_skb_with_frags() will size their allocation of frags based on
MAX_SKB_FRAGS.

This issue was discovered via a test patch that sets 'order' to 0 in
alloc_skb_with_frags(), which effectively tests/simulates high
fragmentation. In this case sendmsg() on unix sockets will fail every time
for large allocations. If PAGE_SIZE is 4K, then data_len will request 68K
or 17 pages, but alloc_skb_with_frags() can only allocate 64K in this case,
i.e. 16 pages.

Fixes: 09c2c90705bb ("net: allow alloc_skb_with_frags() to allocate bigger packets")
Signed-off-by: Jason Baron <jbaron@akamai.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250922191957.2855612-1-jbaron@akamai.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-09-18  net: modify core data structures for PSP datapath support  (Jakub Kicinski)

Add pointers to psp data structures to core networking structs, and an SKB
extension to carry the PSP information from the drivers to the socket
layer.

Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Co-developed-by: Daniel Zahka <daniel.zahka@gmail.com>
Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250917000954.859376-4-daniel.zahka@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

2025-08-20  net: avoid one loop iteration in __skb_splice_bits  (Pengtao He)

If *len is 0 at the beginning of __splice_segment(), it returns true
directly. But when *len is decremented from a positive value to 0 inside
__splice_segment(), it returns false, and __skb_splice_bits() has to call
__splice_segment() again. Recheck *len after it changes and return true
right away, avoiding that unnecessary extra call to __splice_segment().

Signed-off-by: Pengtao He <hept.hept.hept@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250819021551.8361-1-hept.hept.hept@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-07-09  skbuff: Add MSG_MORE flag to optimize tcp large packet transmission  (Feng Yang)

When using sockmap for forwarding, the average latency for different packet
sizes after sending 10,000 packets is as follows:

size    old(us)  new(us)
512     56       55
1472    58       58
1600    106      81
3000    145      105
5000    182      125

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Feng Yang <yangfeng@kylinos.cn>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250708054053.39551-1-yangfeng59949@163.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-07-08  net: skbuff: Drop unused @skb  (Michal Luczaj)

Since its introduction in commit 6fa01ccd8830 ("skbuff: Add pskb_extract()
helper function"), pskb_carve_frag_list() never used the argument @skb.
Drop it and adapt the only caller. No functional change intended.

Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-07-08  net: skbuff: Drop unused @skb  (Michal Luczaj)

Since its introduction in commit ce098da1497c ("skbuff: Introduce
slab_build_skb()"), __slab_build_skb() never used the @skb argument. Remove
it and adapt both callers. No functional change intended.

Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-07-08  net: splice: Drop unused @gfp  (Michal Luczaj)

Since its introduction in commit 2e910b95329c ("net: Add a function to
splice pages into an skbuff for MSG_SPLICE_PAGES"), skb_splice_from_iter()
never used the @gfp argument. Remove it and adapt callers. No functional
change intended.

Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20250702-splice-drop-unused-v3-2-55f68b60d2b7@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-07-08  net: splice: Drop unused @pipe  (Michal Luczaj)

Since commit 41c73a0d44c9 ("net: speedup skb_splice_bits()"),
__splice_segment() and spd_fill_page() do not use the @pipe argument. Drop
it. While adapting the callers, move one line to enforce reverse xmas tree
order. No functional change intended.

Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20250702-splice-drop-unused-v3-1-55f68b60d2b7@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-06-17  net: netmem: fix skb_ensure_writable with unreadable skbs  (Mina Almasry)

skb_ensure_writable() should succeed when it's trying to write to the
header of unreadable skbs, so it doesn't need an unconditional
skb_frags_readable check. The preceding pskb_may_pull() call will succeed
if write_len is within the head and fail if we're trying to write to the
unreadable payload, so we don't need an additional check.

Removing this check restores DSCP functionality with unreadable skbs, as
it's called from dscp_tg.

Cc: willemb@google.com
Cc: asml.silence@gmail.com
Fixes: 65249feb6b3d ("net: add support for skbs with unreadable frags")
Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250615200733.520113-1-almasrymina@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-05-21  net: fold __skb_checksum() into skb_checksum()  (Eric Biggers)

Now that the only remaining caller of __skb_checksum() is skb_checksum(),
fold __skb_checksum() into skb_checksum(). This makes struct
skb_checksum_ops unnecessary, so remove that too and simply do the
"regular" net checksum. It also makes the wrapper functions
csum_partial_ext() and csum_block_add_ext() unnecessary, so remove those
too and just use the underlying functions.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-7-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

2025-05-21  net: add skb_crc32c()  (Eric Biggers)

Add skb_crc32c(), which calculates the CRC32C of a sk_buff. It will replace
__skb_checksum(), which unnecessarily supports arbitrary checksums.
Compared to __skb_checksum(), skb_crc32c():

- Uses the correct type for CRC32C values (u32, not __wsum).
- Does not require the caller to provide a skb_checksum_ops struct.
- Is faster because it does not use indirect calls and does not use the
  very slow crc32c_combine().

According to commit 2817a336d4d5 ("net: skb_checksum: allow custom
update/combine for walking skb") which added __skb_checksum(), the original
motivation for the abstraction layer was to avoid code duplication for
CRC32C and other checksums in the future. However:

- No additional checksums showed up after CRC32C. __skb_checksum() is only
  used with the "regular" net checksum and CRC32C.
- Indirect calls are expensive. Commit 2544af0344ba ("net: avoid indirect
  calls in L4 checksum calculation") worked around this using the
  INDIRECT_CALL_1 macro. But that only avoided the indirect call for the
  net checksum, and at the cost of an extra branch.
- The checksums use different types (__wsum and u32), causing casts to be
  needed.
- It made the checksums of fragments be combined (rather than chained) for
  both checksums, despite this being highly counterproductive for CRC32C
  due to how slow crc32c_combine() is. This can clearly be seen in commit
  4c2f24549644 ("sctp: linearize early if it's not GSO") which tried to
  work around this performance bug. With a dedicated function for each
  checksum, we can instead just use the proper strategy for each checksum.

As shown by the following tables, the new function skb_crc32c() is faster
than __skb_checksum(), with the improvement varying greatly from 5% to
2500% depending on the case. The largest improvements come from fragmented
packets, mainly due to eliminating the inefficient crc32c_combine(). But
linear packets are improved too, especially shorter ones, mainly due to
eliminating indirect calls. These benchmarks were done on AMD Zen 5. On
that CPU, Linux uses IBRS instead of retpoline; an even greater improvement
might be seen with retpoline:

Linear sk_buffs:

Length in bytes   __skb_checksum cycles   skb_crc32c cycles
===============   =====================   =================
64                43                      18
256               94                      77
1420              204                     161
16384             1735                    1642

Nonlinear sk_buffs (even split between head and one fragment):

Length in bytes   __skb_checksum cycles   skb_crc32c cycles
===============   =====================   =================
64                579                     22
256               829                     77
1420              1506                    194
16384             4365                    1682

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://patch.msgid.link/20250519175012.36581-3-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

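A usage sketch consistent with the description above (hedged: the seed
convention assumed here follows the kernel's crc32c(), taking and returning
a running u32 CRC):

    /* CRC32C over the whole packet, e.g. as SCTP needs it */
    u32 crc = skb_crc32c(skb, 0, skb->len, ~0U);
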
2025-05-13  net: devmem: Implement TX path  (Mina Almasry)

Augment dmabuf binding to be able to handle TX. In addition to all the RX
binding, we also create the tx_vec needed for the TX path.

Provide an API for sendmsg to be able to send dmabufs bound to this device:

- Provide a new dmabuf_tx_cmsg which includes the dmabuf to send from.
- MSG_ZEROCOPY with SCM_DEVMEM_DMABUF cmsg indicates send from dma-buf.

Devmem is uncopyable, so piggyback off the existing MSG_ZEROCOPY
implementation, while disabling instances where MSG_ZEROCOPY falls back to
copying.

We additionally pipe the binding down to the new
zerocopy_fill_skb_from_devmem, which fills a TX skb with net_iov netmems
instead of the traditional page netmems.

We also special-case skb_frag_dma_map to return the dma-address of these
dmabuf net_iovs instead of attempting to map pages.

The TX path may release the dmabuf in a context where we cannot wait. This
happens when the user unbinds a TX dmabuf while there are still references
to its netmems in the TX path. In that case, the netmems will be
put_netmem'd from a context where we can't unmap the dmabuf. Resolve this
by making __net_devmem_dmabuf_binding_free schedule_work'd.

Based on work by Stanislav Fomichev <sdf@fomichev.me>. A lot of the meat of
the implementation came from devmem TCP RFC v1[1], which included the TX
path, but Stan did all the rebasing on top of netmem/net_iov.

Cc: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Kaiyuan Zhang <kaiyuanz@google.com>
Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250508004830.4100853-5-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

2025-05-13  net: add get_netmem/put_netmem support  (Mina Almasry)

Currently net_iovs support only pp ref counts and do not support a page ref
equivalent. This is fine for the RX path, as net_iovs are used exclusively
with the pp and only pp refcounting is needed there. The TX path however
does not use pp ref counts; thus, support for a get_page/put_page
equivalent is needed for netmem.

Support get_netmem/put_netmem. Check the type of the netmem before passing
it to page or net_iov specific code to obtain a page ref equivalent.

For dmabuf net_iovs, we obtain a ref on the underlying binding. This
ensures the entire binding doesn't disappear until all the net_iovs have
been put_netmem'ed. We do not need to track the refcount of individual
dmabuf net_iovs as we don't allocate/free them from a pool similar to what
the buddy allocator does for pages.

This code is written to be extensible by other net_iov implementers.
get_netmem/put_netmem will check the type of the netmem and route it to the
correct helper:

pages           -> [get|put]_page()
dmabuf net_iovs -> net_devmem_[get|put]_net_iov()
new net_iovs    -> new helpers

Signed-off-by: Mina Almasry <almasrymina@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250508004830.4100853-3-almasrymina@google.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

2025-04-17  skb: implement skb_send_sock_locked_with_flags()  (Antonio Quartulli)

When sending an skb over a socket using skb_send_sock_locked(), it is
currently not possible to specify any flag to be set in msghdr->msg_flags.
However, we may want to pass flags the user may have specified, like
MSG_NOSIGNAL.

Extend __skb_send_sock() with a new argument 'flags' and add a new
interface named skb_send_sock_locked_with_flags().

Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
Link: https://patch.msgid.link/20250415-b4-ovpn-v26-12-577f6097b964@openvpn.net
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

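A usage sketch (hedged: the argument order assumed here mirrors
skb_send_sock_locked(), which the new interface extends with a trailing
flags word):

    /* send skb bytes [offset, offset + len) on sk, suppressing SIGPIPE */
    ret = skb_send_sock_locked_with_flags(sk, skb, offset, len,
                                          MSG_NOSIGNAL);
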