path: root/include/net/rps.h
2025-09-03  net: Add rfs_needed() helper  (Christoph Paasch)
Add a helper to check whether RFS is needed. This makes the code a bit cleaner and lets the next patch have MPTCP use the helper to decide whether to iterate over the subflows.

tun_flow_update() was calling sock_rps_record_flow_hash() regardless of the state of rfs_needed. This was not really a bug, as sock_flow_table simply ends up being NULL and everything stays fine, but this commit also implicitly makes tun_flow_update() respect the state of rfs_needed.

Suggested-by: Matthieu Baerts <matttbe@kernel.org>
Signed-off-by: Christoph Paasch <cpaasch@openai.com>
Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20250902-net-next-mptcp-misc-feat-6-18-v2-3-fa02bb3188b1@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
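A minimal sketch of what such a helper can look like, assuming the existing rfs_needed static key that already gates sock_rps_record_flow(); the name rfs_is_needed() and the exact form are assumptions based on this changelog, not necessarily the merged patch:

    /* Hedged sketch: wrap the rfs_needed static key in a predicate
     * so callers read naturally.  Name and placement are assumed.
     */
    static inline bool rfs_is_needed(void)
    {
    #ifdef CONFIG_RPS
        return static_branch_unlikely(&rfs_needed);
    #else
        return false;
    #endif
    }

    /* A caller such as tun_flow_update() could then guard its call: */
        if (rfs_is_needed())
            sock_rps_record_flow_hash(rxhash);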
2025-08-27  net: Prevent RPS table overwrite of active flows  (Krishna Kumar)
This patch fixes an issue where two different flows on the same RX queue produce the same hash, resulting in continuous flow overwrites.

Flow #1: A packet for Flow #1 arrives and the kernel calls the steering function. The driver returns a filter id, which the kernel saves in the selected slot. Later, the driver's service task checks whether any filters have expired and then installs the rule for Flow #1.

Flow #2: A packet for Flow #2 arrives and goes through the same steps, but this time the chosen slot is already in use by Flow #1. The driver returns a new filter id and the kernel saves it in the same slot. When the driver's service task runs, it walks all the flows and checks whether Flow #1 should expire; the kernel returns true because the slot now holds a different filter id, so the driver installs the rule for Flow #2 instead.

Flow #1: Another packet for Flow #1 arrives and the same thing repeats: the slot is overwritten with a new filter id for Flow #1.

This causes a repeated cycle of flow programming for missed packets, wasting CPU cycles while not improving performance. The problem is worst at higher rates when the RPS table is small, but tests show it still happens even with 12,000 connections and an RPS size of 16K per queue (global table size = 144 x 16K slots).

This patch prevents overwriting an rps_dev_flow entry if it is active: it is better to keep doing aRFS for the first flow than to hurt all flows sharing the same hash. Without this, two (or more) flows on one RX queue with the same hash can keep overwriting each other, causing the driver to reprogram the flow repeatedly.

Changes:
1. Add a new 'hash' field to struct rps_dev_flow.
2. Add rps_flow_is_active(): a helper to check whether a flow is active, extracted from rps_may_expire_flow() and further simplified per reviewer feedback.
3. In set_rps_cpu(), avoid overwriting by programming a new filter only if:
   - the slot is not in use, or
   - the slot is in use but the flow is not active, or
   - the slot holds an active flow with the same hash but a different target CPU;
   and save the hash in the rps_dev_flow entry. (A sketch of this guard follows the results below.)
4. rps_may_expire_flow(): use the extracted rps_flow_is_active().

Testing & results:
- Driver: ice (E810 NIC), kernel: net-next
- #CPUs = #RXq = 144 (1:1)
- Number of flows: 12K
- Eight RPS settings from 256 to 32768. Though RPS=256 is not ideal, it is still sufficient to cover 12K flows (256 * 144 rx-queues = 64K global table slots)
- Global table size = 144 * RPS (effectively equal to 256 * RPS)
- Each RPS test duration = 8 mins (org code) + 8 mins (new code)
- Metrics captured on client

Legend for the following tables:
  Steer-C: #times ndo_rx_flow_steer() was Called by set_rps_cpu()
  Steer-L: #times ice_arfs_flow_steer() Looped over aRFS entries
  Add:     #times driver actually programmed aRFS (ice_arfs_build_entry())
  Del:     #times driver deleted the flow (ice_arfs_del_flow_rules())
  Units:   K = 1,000 times, M = 1 million times

Org code:
| RPS   | Latency | CPU  | Add    | Del    | Steer-C | Steer-L |
|-------|---------|------|--------|--------|---------|---------|
| 256   | 227.0   | 93.2 | 1.6M   | 1.6M   | 121.7M  | 267.6M  |
| 512   | 225.9   | 94.1 | 11.5M  | 11.2M  | 65.7M   | 199.6M  |
| 1024  | 223.5   | 95.6 | 16.5M  | 16.5M  | 27.1M   | 187.3M  |
| 2048  | 222.2   | 96.3 | 10.5M  | 10.5M  | 12.5M   | 115.2M  |
| 4096  | 223.9   | 94.1 | 5.5M   | 5.5M   | 7.2M    | 65.9M   |
| 8192  | 224.7   | 92.5 | 2.7M   | 2.7M   | 3.0M    | 29.9M   |
| 16384 | 223.5   | 92.5 | 1.3M   | 1.3M   | 1.4M    | 13.9M   |
| 32768 | 219.6   | 93.2 | 838.1K | 838.1K | 965.1K  | 8.9M    |

New code:
| RPS   | Latency | CPU  | Add    | Del    | Steer-C | Steer-L |
|-------|---------|------|--------|--------|---------|---------|
| 256   | 201.5   | 99.1 | 13.4K  | 5.0K   | 13.7K   | 75.2K   |
| 512   | 202.5   | 98.2 | 11.2K  | 5.9K   | 11.2K   | 55.5K   |
| 1024  | 207.3   | 93.9 | 11.5K  | 9.7K   | 11.5K   | 59.6K   |
| 2048  | 207.5   | 96.7 | 11.8K  | 11.1K  | 15.5K   | 79.3K   |
| 4096  | 206.9   | 96.6 | 11.8K  | 11.7K  | 11.8K   | 63.2K   |
| 8192  | 205.8   | 96.7 | 11.9K  | 11.8K  | 11.9K   | 63.9K   |
| 16384 | 200.9   | 98.2 | 11.9K  | 11.9K  | 11.9K   | 64.2K   |
| 32768 | 202.5   | 98.0 | 11.9K  | 11.9K  | 11.9K   | 64.2K   |

Some observations:
1. Overall latency improved: (1790.19 - 1634.94) / 1790.19 * 100 = 8.67%
2. Overall CPU increased: (777.32 - 751.49) / 751.49 * 100 = 3.44%
3. Flow management (add/delete) stayed almost constant at ~11K, compared to values in the millions before.

Signed-off-by: Krishna Kumar <krikku@gmail.com>
Link: https://patch.msgid.link/20250825031005.3674864-2-krikku@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
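The guard sketched below follows the changelog's description; the field and helper names (the new 'hash' member of struct rps_dev_flow, rps_flow_is_active()) come from the changelog itself, while the activity window of 10 * mask mirrors the existing check in rps_may_expire_flow(). The exact in-tree code may differ:

    /* Hedged sketch: a flow is "active" if its slot points at a valid
     * CPU and that CPU's backlog head has not advanced far past the
     * flow's recorded tail (same window rps_may_expire_flow() uses).
     */
    static bool rps_flow_is_active(struct rps_dev_flow *rflow,
                                   struct rps_dev_flow_table *flow_table,
                                   unsigned int cpu)
    {
        unsigned int queue_head;

        if (cpu >= nr_cpu_ids)
            return false;   /* slot not in use */

        queue_head = READ_ONCE(per_cpu(softnet_data, cpu).input_queue_head);
        return (int)(queue_head - READ_ONCE(rflow->last_qtail)) <
               (int)(10 * flow_table->mask);
    }

    /* In set_rps_cpu(), steer only into a free slot, a stale slot, or
     * a slot this flow already owns but whose target CPU changed:
     */
        rflow = &flow_table->flows[flow_id];
        cpu = READ_ONCE(rflow->cpu);
        if (!rps_flow_is_active(rflow, flow_table, cpu) ||
            (rflow->hash == hash && cpu != next_cpu)) {
            rc = dev->netdev_ops->ndo_rx_flow_steer(dev, skb,
                                                    rxq_index, flow_id);
            if (rc >= 0) {
                rflow->filter = rc;
                rflow->hash = hash;
            }
        }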
2025-05-16  net: rfs: add sock_rps_delete_flow() helper  (Eric Dumazet)
RFS can exhibit lower performance for workloads using short-lived flows and a small set of 4-tuples. This is often the case for load-testers using a pair of hosts, if the server has a single listener port.

Typical use case:

Server: tcp_crr -T128 -F1000 -6 -U -l30 -R 14250
Client: tcp_crr -T128 -F1000 -6 -U -l30 -c -H server | grep local_throughput

This is because the RFS global hash table contains stale information when the same RSS key is recycled for another socket and another cpu. Make sure to undo the changes and go back to the initial state when a flow is disconnected.

Performance of the above test increases by 22%, going from 372604 transactions per second to 457773.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Octavian Purdila <tavip@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20250515100354.3339920-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
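A minimal sketch of such a helper, on the assumption that it simply undoes what sock_rps_record_flow_hash() wrote into the global sock_flow_table; the real helper may differ in detail:

    /* Hedged sketch: on disconnect, forget the socket's entry in the
     * global sock_flow_table so a recycled 4-tuple does not inherit
     * stale CPU steering state.
     */
    static inline void sock_rps_delete_flow(const struct sock *sk)
    {
        struct rps_sock_flow_table *table;
        u32 hash, index;

        if (!static_branch_unlikely(&rfs_needed))
            return;

        hash = READ_ONCE(sk->sk_rxhash);
        if (!hash)
            return;

        rcu_read_lock();
        table = rcu_dereference(net_hotdata.rps_sock_flow_table);
        if (table) {
            index = hash & table->mask;
            if (READ_ONCE(table->ents[index]) != RPS_NO_CPU)
                WRITE_ONCE(table->ents[index], RPS_NO_CPU);
        }
        rcu_read_unlock();
    }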
2025-04-08  net: rps: remove kfree_rcu_mightsleep() use  (Eric Dumazet)
Add an rcu_head to the sd_flow_limit and rps_sock_flow_table structs to use the more conventional and predictable k[v]free_rcu().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250407163602.170356-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
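The conversion pattern, sketched for rps_sock_flow_table (field layout assumed from context; the variable name in the free path is illustrative):

    /* Illustrative sketch: embed an rcu_head so the table can be
     * freed with kvfree_rcu() instead of kfree_rcu_mightsleep().
     */
    struct rps_sock_flow_table {
        struct rcu_head rcu;    /* new: enables kvfree_rcu() */
        u32             mask;
        u32             ents[] ____cacheline_aligned_in_smp;
    };

    /* freeing an old table after a resize, sketched: */
        kvfree_rcu(orig_sock_table, rcu);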
2025-03-25  net: rfs: hash function change  (Eric Dumazet)
RFS is using two kinds of hash tables.

The first one is controlled by /proc/sys/net/core/rps_sock_flow_entries = 2^N, and using the N low-order bits of the l4 hash is good enough there.

Then each RX queue has its own hash table, controlled by /sys/class/net/eth1/queues/rx-$q/rps_flow_cnt = 2^X. The current hash function, using the X low-order bits, is suboptimal because RSS is usually using Func(hash) = (hash % power_of_two); for example, with 32 RX queues, 6 low-order bits have no entropy for a given queue.

Switch this hash function to hash_32(hash, log) to increase the chances of using all possible slots and reduce collisions.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250321171309.634100-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
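The indexing change, sketched under the assumption that the per-queue table stores the log2 of its size (a 'log' field alongside 'mask'); hash_32() comes from linux/hash.h:

    #include <linux/hash.h>

        /* before: low-order bits only; RSS queue selection already
         * consumed them, so many slots can never be hit
         */
        flow_id = hash & flow_table->mask;

        /* after: hash_32() mixes all 32 bits of the l4 hash before
         * taking the top 'log' bits, spreading flows over every slot
         */
        flow_id = hash_32(hash, flow_table->log);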
2024-04-01  net: rps: add rps_input_queue_head_add() helper  (Eric Dumazet)
process_backlog() can batch increments of sd->input_queue_head, saving some memory bandwidth.

Also add READ_ONCE()/WRITE_ONCE() annotations around sd->input_queue_head accesses.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
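A plausible shape for the helper, assuming it lives alongside the other rps_ helpers in include/net/rps.h:

    /* Hedged sketch: one annotated write publishes a whole batch of
     * 'val' dequeued packets instead of 'val' separate increments.
     * process_backlog() counts locally, then calls this once per batch.
     */
    static inline void rps_input_queue_head_add(struct softnet_data *sd,
                                                unsigned int val)
    {
    #ifdef CONFIG_RPS
        WRITE_ONCE(sd->input_queue_head, sd->input_queue_head + val);
    #endif
    }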
2024-04-01  net: rps: change input_queue_tail_incr_save()  (Eric Dumazet)
input_queue_tail_incr_save() increments the sd queue_tail and saves it in the flow's last_qtail.

Two issues here:
- No lock protects the write on last_qtail; we should use appropriate annotations.
- We can perform this write after releasing the per-cpu backlog lock, to decrease the lock hold duration (moving the cache line miss out of the critical section).

Also move input_queue_head_incr() and the rps helpers to include/net/rps.h, adding an rps_ prefix to better reflect their role.

v2: fixed a build issue (Jakub and kernel build bots)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
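A sketch of the split this describes, with helper names assumed from the rps_ renaming the changelog mentions: the tail bump stays where it was, while the flow's last_qtail is published with an annotated write after the backlog lock is dropped:

    /* Hedged sketch: return the new tail so the caller can publish it
     * into the flow entry later, outside the backlog lock.
     */
    static inline u32 rps_input_queue_tail_incr(struct softnet_data *sd)
    {
    #ifdef CONFIG_RPS
        return ++sd->input_queue_tail;
    #else
        return 0;
    #endif
    }

    /* Annotated, lockless publication of last_qtail: */
    static inline void rps_input_queue_tail_save(u32 *dest, u32 tail)
    {
    #ifdef CONFIG_RPS
        WRITE_ONCE(*dest, tail);
    #endif
    }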
2024-03-07  net: move rps_sock_flow_table to net_hotdata  (Eric Dumazet)
rps_sock_flow_table and rps_cpu_mask are used in fast path. Move them to net_hotdata for better cache locality.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-19-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
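The effect of the move, sketched as a before/after fragment of the fast-path lookup (field names per the changelog; surrounding code is illustrative):

    /* before: two standalone globals, each pulling its own cache line */
        sock_flow_table = rcu_dereference(rps_sock_flow_table);
        next_cpu = ident & rps_cpu_mask;

    /* after: both live in the shared net_hotdata structure, so the
     * fast path touches cache lines it is already reading anyway
     */
        sock_flow_table = rcu_dereference(net_hotdata.rps_sock_flow_table);
        next_cpu = ident & net_hotdata.rps_cpu_mask;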
2024-03-07  net: introduce include/net/rps.h  (Eric Dumazet)
Move RPS-related structures and helpers from include/linux/netdevice.h and include/net/sock.h to a new include file.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20240306160031.874438-18-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>