| Age | Commit message (Collapse) | Author |
|
This converts some of the visually simpler cases that have been split
over multiple lines. I only did the ones that are easy to verify the
resulting diff by having just that final GFP_KERNEL argument on the next
line.
Somebody should probably do a proper coccinelle script for this, but for
me the trivial script actually resulted in an assertion failure in the
middle of the script. I probably had made it a bit _too_ trivial.
So after fighting that far a while I decided to just do some of the
syntactically simpler cases with variations of the previous 'sed'
scripts.
The more syntactically complex multi-line cases would mostly really want
whitespace cleanup anyway.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This was done entirely with mindless brute force, using
git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
Single allocations: kmalloc(sizeof(TYPE), ...)
are replaced with: kmalloc_obj(TYPE, ...)
Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with: kmalloc_objs(TYPE, COUNT, ...)
Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with: kmalloc_flex(*PTR, FAM, COUNT, ...)
(where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
|
|
Pull drm fixes from Dave Airlie:
"This is the fixes and cleanups for the end of the merge window, it's
nearly all amdgpu, with some amdkfd, then a pagemap core fix, i915/xe
display fixes, and some xe driver fixes.
Nothing seems out of the ordinary, except amdgpu is a little more
volume than usual.
pagemap:
- drm/pagemap: pass pagemap_addr by reference
amdgpu:
- DML 2.1 fixes
- Panel replay fixes
- Display writeback fixes
- MES 11 old firmware compat fix
- DC CRC improvements
- DPIA fixes
- XGMI fixes
- ASPM fix
- SMU feature bit handling fixes
- DC LUT fixes
- RAS fixes
- Misc memory leak in error path fixes
- SDMA queue reset fixes
- PG handling fixes
- 5 level GPUVM page table fix
- SR-IOV fix
- Queue reset fix
- SMU 13.x fixes
- DC resume lag fix
- MPO fixes
- DCN 3.6 fix
- VSDB fixes
- HWSS clean up
- Replay fixes
- DCE cursor fixes
- DCN 3.5 SR DDR5 latency fixes
- HPD fixes
- Error path unwind fixes
- SMU13/14 mode1 reset fixes
- PSP 15 updates
- SMU 15 updates
- Sync fix in amdgpu_dma_buf_move_notify()
- HAINAN fix
- PSP 13.x fix
- GPUVM locking fix
- Fixes for DC analog support
- DC FAMS fixes
- DML 2.1 fixes
- eDP fixes
- Misc DC fixes
- Fastboot fix
- 3DLUT fixes
- GPUVM fixes
- 64bpp format fix
- Fix for MacBooks with switchable gfx
amdkfd:
- Fix possible double deletion of validate list
- Event setup fix
- Device disconnect regression fix
- APU GTT as VRAM fix
- Fix piority inversion with MQDs
- NULL check fix
radeon:
- HAINAN fix
i915/xe display:
- Regresion fix for HDR 4k displays (#15503)
- Fixup for Dell XPS 13 7390 eDP rate limit
- Memory leak fix on ACPI _DSM handling
- Add missing slice count check during DP mode validation
xe:
- drm/xe: Prevent VFs from exposing the CCS mode sysfs file
- SRIOV related fixes
- PAT cache fix
- MMIO read fix
- W/a fixes
- Adjust type of xe_modparam.force_vram_bar_size
- Wedge mode fix
- HWMon fix
* tag 'drm-next-2026-02-21' of https://gitlab.freedesktop.org/drm/kernel: (143 commits)
drm/amd/display: Remove unneeded DAC link encoder register
drm/amd/display: Enable DAC in DCE link encoder
drm/amd/display: Set CRTC source for DAC using registers
drm/amd/display: Initialize DAC in DCE link encoder using VBIOS
drm/amd/display: Turn off DAC in DCE link encoder using VBIOS
drm/amd/display: Don't call find_analog_engine() twice
drm/amdgpu: fix 4-level paging if GMC supports 57-bit VA v2
drm/amdgpu: keep vga memory on MacBooks with switchable graphics
drm/amdgpu: Set atomics to true for xgmi
drm/amdkfd: Check for NULL return values
drm/amd/display: Use same max plane scaling limits for all 64 bpp formats
drm/amdgpu: Set vmid0 PAGE_TABLE_DEPTH for GFX12.1
drm/amdkfd: Disable MQD queue priority
drm/amd/display: Remove conditional for shaper 3DLUT power-on
drm/amd/display: Check return of shaper curve to HW format
drm/amd/display: Correct logic check error for fastboot
drm/amd/display: Skip eDP detection when no sink
Revert "drm/amd/display: Add Gfx Base Case For Linear Tiling Handling"
Revert "drm/amd/display: Correct hubp GfxVersion verification"
Revert "drm/amd/display: Add Handling for gfxversion DcGfxBase"
...
|
|
This patch fixes issues when the code moves forward with a potential
NULL pointer, without checking.
Removed one redundant NULL check for a function parameter. This check
is already done in the only caller.
Signed-off-by: Andrew Martin <andrew.martin@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
- "ocfs2: give ocfs2 the ability to reclaim suballocator free bg" saves
disk space by teaching ocfs2 to reclaim suballocator block group
space (Heming Zhao)
- "Add ARRAY_END(), and use it to fix off-by-one bugs" adds the
ARRAY_END() macro and uses it in various places (Alejandro Colomar)
- "vmcoreinfo: support VMCOREINFO_BYTES larger than PAGE_SIZE" makes
the vmcore code future-safe, if VMCOREINFO_BYTES ever exceeds the
page size (Pnina Feder)
- "kallsyms: Prevent invalid access when showing module buildid" cleans
up kallsyms code related to module buildid and fixes an invalid
access crash when printing backtraces (Petr Mladek)
- "Address page fault in ima_restore_measurement_list()" fixes a
kexec-related crash that can occur when booting the second-stage
kernel on x86 (Harshit Mogalapalli)
- "kho: ABI headers and Documentation updates" updates the kexec
handover ABI documentation (Mike Rapoport)
- "Align atomic storage" adds the __aligned attribute to atomic_t and
atomic64_t definitions to get natural alignment of both types on
csky, m68k, microblaze, nios2, openrisc and sh (Finn Thain)
- "kho: clean up page initialization logic" simplifies the page
initialization logic in kho_restore_page() (Pratyush Yadav)
- "Unload linux/kernel.h" moves several things out of kernel.h and into
more appropriate places (Yury Norov)
- "don't abuse task_struct.group_leader" removes the usage of
->group_leader when it is "obviously unnecessary" (Oleg Nesterov)
- "list private v2 & luo flb" adds some infrastructure improvements to
the live update orchestrator (Pasha Tatashin)
* tag 'mm-nonmm-stable-2026-02-12-10-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (107 commits)
watchdog/hardlockup: simplify perf event probe and remove per-cpu dependency
procfs: fix missing RCU protection when reading real_parent in do_task_stat()
watchdog/softlockup: fix sample ring index wrap in need_counting_irqs()
kcsan, compiler_types: avoid duplicate type issues in BPF Type Format
kho: fix doc for kho_restore_pages()
tests/liveupdate: add in-kernel liveupdate test
liveupdate: luo_flb: introduce File-Lifecycle-Bound global state
liveupdate: luo_file: Use private list
list: add kunit test for private list primitives
list: add primitives for private list manipulations
delayacct: fix uapi timespec64 definition
panic: add panic_force_cpu= parameter to redirect panic to a specific CPU
netclassid: use thread_group_leader(p) in update_classid_task()
RDMA/umem: don't abuse current->group_leader
drm/pan*: don't abuse current->group_leader
drm/amd: kill the outdated "Only the pthreads threading model is supported" checks
drm/amdgpu: don't abuse current->group_leader
android/binder: use same_thread_group(proc->tsk, current) in binder_mmap()
android/binder: don't abuse current->group_leader
kho: skip memoryless NUMA nodes when reserving scratch areas
...
|
|
checks
Nowadays task->group_leader->mm != task->mm is only possible if a) task is
not a group leader and b) task->group_leader->mm == NULL because
task->group_leader has already exited using sys_exit().
I don't think that drm/amd tries to detect/nack this case.
Link: https://lkml.kernel.org/r/aXY_yLVHd63UlWtm@redhat.com
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Christan König <christian.koenig@amd.com>
Acked-by: Felix Kuehling <felix.kuehling@amd.com>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Boris Brezillon <boris.brezillon@collabora.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Simon Horman <horms@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-6.20-2026-01-16:
amdgpu:
- SR-IOV fixes
- Rework SMU mailbox handling
- Drop MMIO_REMAP domain
- UserQ fixes
- MES cleanups
- Panel Replay updates
- HDMI fixes
- Backlight fixes
- SMU 14.x fixes
- SMU 15 updates
amdkfd:
- Fix a memory leak
- Fixes for systems with non-4K pages
- LDS/Scratch cleanup
- MES process eviction fix
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patch.msgid.link/20260116202609.23107-1-alexander.deucher@amd.com
|
|
https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-6.20-2026-01-09:
amdgpu:
- GPUVM updates
- Initial support for larger GPU address spaces
- Initial SMUIO 15.x support
- Documentation updates
- Initial PSP 15.x support
- Initial IH 7.1 support
- Initial IH 6.1.1 support
- SMU 13.0.12 updates
- RAS updates
- Initial MMHUB 3.4 support
- Initial MMHUB 4.2 support
- Initial GC 12.1 support
- Initial GC 11.5.4 support
- HDMI fixes
- Panel replay improvements
- DML updates
- DC FP fixes
- Initial SDMA 6.1.4 support
- Initial SDMA 7.1 support
- Userq updates
- DC HPD refactor
- SwSMU cleanups and refactoring
- TTM memory ops parallelization
- DCN 3.5 fixes
- DP audio fixes
- Clang fixes
- Misc spelling fixes and cleanups
- Initial SDMA 7.11.4 support
- Convert legacy DRM logging helpers to new drm logging helpers
- Initial JPEG 5.3 support
- Add support for changing UMA size via the driver
- DC analog fixes
- GC 9 gfx queue reset support
- Initial SMU 15.x support
amdkfd:
- Reserved SDMA rework
- Refactor SPM
- Initial GC 12.1 support
- Initial GC 11.5.4 support
- Initial SDMA 7.1 support
- Initial SDMA 6.1.4 support
- Increase the kfd process hash table
- Per context support
- Topology fixes
radeon:
- Convert legacy DRM logging helpers to new drm logging helpers
- Use devm for i2c adapters
- Variable sized array fix
- Misc cleanups
UAPI:
- KFD context support. Proposed userspace:
https://github.com/ROCm/rocm-systems/pull/1705
https://github.com/ROCm/rocm-systems/pull/1701
- Add userq metadata queries for more queue types. Proposed userspace:
https://gitlab.freedesktop.org/yogeshmohan/mesa/-/commits/userq_query
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patch.msgid.link/20260109154713.3242957-1-alexander.deucher@amd.com
Signed-off-by: Dave Airlie <airlied@redhat.com>
|
|
To allocate kernel BO from VRAM domain for MQD in the following patch.
No functional change because kernel BO allocate all from GTT domain.
Rename amdgpu_amdkfd_alloc_gtt_mem to amdgpu_amdkfd_alloc_kernel_mem
Rename amdgpu_amdkfd_free_gtt_mem to amdgpu_amdkfd_free_kernel_mem
Rename mem_kfd_mem_obj gtt_mem to mem
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
This patch allows kfd driver function correctly when AMD gpu devices got
unplug/replug at run time.
When an AMD gpu device got unplug kfd driver gracefully terminates existing
kfd processes after stops all queues by sending SIGBUS to user process. After
that user space can still use remaining AMD gpu devices. When all AMD gpu
devices at system got removed kfd driver will not response new requests.
Unplugged AMD gpu devices can be re-plugged. kfd driver will use added devices
to function as usual.
The purpose of this patch is having kfd driver behavior as expected during and
after AMD gpu devices unplug/replug at run time.
Signed-off-by: Xiaogang Chen <Xiaogang.Chen@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
signal_eviction_fence() is declared to return bool, but returns -EINVAL
when no eviction fence is present. This makes the "no fence" or "the
NULL-fence" path evaluate to true and triggers a Smatch warning.
v2: Return true instead to explicitly indicate that there is no eviction
fence to signal and that eviction is already complete. This matches the
existing caller logic where a NULL fence means "nothing to do" and
allows restore handling to proceed normally. (Christian)
Fixes the below:
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c:2099 signal_eviction_fence()
warn: '(-22)' is not bool
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c
2090 static bool signal_eviction_fence(struct kfd_process *p)
^^^^
2091 {
2092 struct dma_fence *ef;
2093 bool ret;
2094
2095 rcu_read_lock();
2096 ef = dma_fence_get_rcu_safe(&p->ef);
2097 rcu_read_unlock();
2098 if (!ef)
--> 2099 return -EINVAL;
This should be either true or false.
Probably true because presumably
it has been tested?
2100
2101 ret = dma_fence_check_and_signal(ef);
2102 dma_fence_put(ef);
2103
2104 return ret;
2105 }
Fixes: 37865e02e6cc ("drm/amdkfd: Fix eviction fence handling")
Reported by: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Philip Yang <Philip.Yang@amd.com>
Cc: Gang BA <Gang.Ba@amd.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
There are a couple of spelling mistakes, one in a pr_warn message
and one in a seq_printf message. Fix these.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
GFX 12.1.0 will support enabling/disabling XNACK on a per-
process basis. This change enables the per process XNACK feature.
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
This commit implemetns a new ioctl AMDKFD_IOC_CREATE_PROCESS
that creates a new secondary kfd_progress on the FD.
To keep backward compatibility, userspace programs need to invoke
this ioctl explicitly on a FD to create a secondary
kfd_process which replacing its primary kfd_process.
This commit bumps ioctl minor version.
Signed-off-by: Zhu Lingshan <lingshan.zhu@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Add interrupt handling for GFX 12.1.0 similar to what is done
for GFX 9.4.3.
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
This commit introduces a new helper function
kfd_lookup_process_by_id which can find a
kfd process that identified by its context id from
the kfd process table
Signed-off-by: Zhu Lingshan <lingshan.zhu@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
In kq_initialize, queue->process of a HIQ should
be NULL as initialized, because it does not belong
to any kfd_process.
This commit decommisions the function kfd_get_process() because
it can not locate a specific kfd_process among multiple
contexts and not any code path calls it after this commit.
Signed-off-by: Zhu Lingshan <lingshan.zhu@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
This commit introduces a new id field for
struct kfd process, which helps identify
a kfd process among multiple contexts that
all belong to a single user space program.
The sysfs entry of a secondary kfd process
is placed under the sysfs entry folder of
its primary kfd process.
The naming format of the sysfs entry of a secondary
kfd process is "context_%u" where %u is the context id.
Signed-off-by: Zhu Lingshan <lingshan.zhu@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Life cycle of a KFD secondary context(kfd_process) is tied
to the opened file. Therefore this commit destroy a kfd
secondary context when close the fd it belonging to.
This commit extracts the code removing the kfd_process
from the kfd_process_table to a separate function and
call it in kfd_process_notifier_release_internal unconditionally.
Signed-off-by: Zhu Lingshan <lingshan.zhu@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
KFD creates sysfs entries for a kfd_process in
function kfd_create_process when creating it.
This commit extracts the code creating sysfs
entries to a separate function because it
would be invoked in other code path like
creating secondary kfd contexts (kfd_process).
Signed-off-by: Zhu Lingshan <lingshan.zhu@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Up until this commit, the kfd multiple contexts feature has
not been fully implemented in mainline kernel yet.
For backawrd compatibility, not break existing use cases,
this commit changes function find_process_by_mm, let it
always return the primary kfd_process.
Signed-off-by: Zhu Lingshan <lingshan.zhu@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
The first kfd_process is created through open(),
this commit marks it as the primary kfd_process
by assigning a primary id for its context_id.
Only the primary process should register the mmu_notifier.
Signed-off-by: Zhu Lingshan <lingshan.zhu@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
amdkfd is one of the few users which relies on the return code of
dma_fence_signal(), which, so far, informs the caller whether the fence
had already been signaled.
As there are barely any users, dma_fence signaling functions shall get
the return value void. To do so, the few users must be ported to a
function which preserves the old behavior.
Replace the call to dma_fence_signal() with one to
dma_fence_check_and_signal().
Suggested-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Link: https://patch.msgid.link/20251201105011.19386-5-phasta@kernel.org
|
|
pm_runtime_put_autosuspend(), pm_runtime_put_sync_autosuspend(),
pm_runtime_autosuspend() and pm_request_autosuspend() now include a call
to pm_runtime_mark_last_busy(). Remove the now-redundant explicit call to
pm_runtime_mark_last_busy().
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Move dequeue user queues and destroy user queues from
kfd_process_wq_release to mmu notifier release callback, to ensure no
system memory access from GPU because the process memory is going to
free from CPU after mmu release notifier callback returns.
Destroy queue releases the svm prange queue_refcount, this also removes
fake flase positive warning message "Freeing queue vital buffer" message
if application crash or killed.
Suggested-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
current switch partition only check if kfd_processes_table is empty.
kfd_prcesses_table entry is deleted in kfd_process_notifier_release, but
kfd_process tear down is in kfd_process_wq_release.
consider two processes:
Process A (workqueue) -> kfd_process_wq_release -> Access kfd_node member
Process B switch partition -> amdgpu_xcp_pre_partition_switch -> amdgpu_amdkfd_device_fini_sw
-> kfd_node tear down.
Process A and B may trigger a race as shown in dmesg log.
This patch is to resolve the race by adding an atomic kfd_process counter
kfd_processes_count, it increment as create kfd process, decrement as
finish kfd_process_wq_release.
v2: Put kfd_processes_count per kfd_dev, move decrement to kfd_process_destroy_pdds
and bug fix. (Philip Yang)
[3966658.307702] divide error: 0000 [#1] SMP NOPTI
[3966658.350818] i10nm_edac
[3966658.356318] CPU: 124 PID: 38435 Comm: kworker/124:0 Kdump: loaded Tainted
[3966658.356890] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
[3966658.362839] nfit
[3966658.366457] RIP: 0010:kfd_get_num_sdma_engines+0x17/0x40 [amdgpu]
[3966658.366460] Code: 00 00 e9 ac 81 02 00 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 48 8b 4f 08 48 8b b7 00 01 00 00 8b 81 58 26 03 00 99 <f7> be b8 01 00 00 80 b9 70 2e 00 00 00 74 0b 83 f8 02 ba 02 00 00
[3966658.380967] x86_pkg_temp_thermal
[3966658.391529] RSP: 0018:ffffc900a0edfdd8 EFLAGS: 00010246
[3966658.391531] RAX: 0000000000000008 RBX: ffff8974e593b800 RCX: ffff888645900000
[3966658.391531] RDX: 0000000000000000 RSI: ffff888129154400 RDI: ffff888129151c00
[3966658.391532] RBP: ffff8883ad79d400 R08: 0000000000000000 R09: ffff8890d2750af4
[3966658.391532] R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000000
[3966658.391533] R13: ffff8883ad79d400 R14: ffffe87ff662ba00 R15: ffff8974e593b800
[3966658.391533] FS: 0000000000000000(0000) GS:ffff88fe7f600000(0000) knlGS:0000000000000000
[3966658.391534] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3966658.391534] CR2: 0000000000d71000 CR3: 000000dd0e970004 CR4: 0000000002770ee0
[3966658.391535] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[3966658.391535] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[3966658.391536] PKRU: 55555554
[3966658.391536] Call Trace:
[3966658.391674] deallocate_sdma_queue+0x38/0xa0 [amdgpu]
[3966658.391762] process_termination_cpsch+0x1ed/0x480 [amdgpu]
[3966658.399754] intel_powerclamp
[3966658.402831] kfd_process_dequeue_from_all_devices+0x5b/0xc0 [amdgpu]
[3966658.402908] kfd_process_wq_release+0x1a/0x1a0 [amdgpu]
[3966658.410516] coretemp
[3966658.434016] process_one_work+0x1ad/0x380
[3966658.434021] worker_thread+0x49/0x310
[3966658.438963] kvm_intel
[3966658.446041] ? process_one_work+0x380/0x380
[3966658.446045] kthread+0x118/0x140
[3966658.446047] ? __kthread_bind_mask+0x60/0x60
[3966658.446050] ret_from_fork+0x1f/0x30
[3966658.446053] Modules linked in: kpatch_20765354(OEK)
[3966658.455310] kvm
[3966658.464534] mptcp_diag xsk_diag raw_diag unix_diag af_packet_diag netlink_diag udp_diag act_pedit act_mirred act_vlan cls_flower kpatch_21951273(OEK) kpatch_18424469(OEK) kpatch_19749756(OEK)
[3966658.473462] idxd_mdev
[3966658.482306] kpatch_17971294(OEK) sch_ingress xt_conntrack amdgpu(OE) amdxcp(OE) amddrm_buddy(OE) amd_sched(OE) amdttm(OE) amdkcl(OE) intel_ifs iptable_mangle tcm_loop target_core_pscsi tcp_diag target_core_file inet_diag target_core_iblock target_core_user target_core_mod coldpgs kpatch_18383292(OEK) ip6table_nat ip6table_filter ip6_tables ip_set_hash_ipportip ip_set_hash_ipportnet ip_set_hash_ipport ip_set_bitmap_port xt_comment iptable_nat nf_nat iptable_filter ip_tables ip_set ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 sn_core_odd(OE) i40e overlay binfmt_misc tun bonding(OE) aisqos(OE) aisqos_hotfixes(OE) rfkill uio_pci_generic uio cuse fuse nf_tables nfnetlink intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm idxd_mdev
[3966658.491237] vfio_pci
[3966658.501196] vfio_pci vfio_virqfd mdev vfio_iommu_type1 vfio iax_crypto intel_pmt_telemetry iTCO_wdt intel_pmt_class iTCO_vendor_support irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl intel_cstate snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep snd_seq
[3966658.508537] vfio_virqfd
[3966658.517569] snd_seq_device ipmi_ssif isst_if_mbox_pci isst_if_mmio pcspkr snd_pcm idxd intel_uncore ses isst_if_common intel_vsec idxd_bus enclosure snd_timer mei_me snd i2c_i801 i2c_smbus mei i2c_ismt soundcore joydev acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad vfat fat
[3966658.526851] mdev
[3966658.536096] nfsd auth_rpcgss nfs_acl lockd grace slb_vtoa(OE) sunrpc dm_mod hookers mlx5_ib(OE) ast i2c_algo_bit drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_ttm_helper ttm mlx5_core(OE) mlxfw(OE)
[3966658.540381] vfio_iommu_type1
[3966658.544341] nvme mpt3sas tls drm nvme_core pci_hyperv_intf raid_class psample libcrc32c crc32c_intel mlxdevm(OE) i2c_core
[3966658.551254] vfio
[3966658.558742] scsi_transport_sas wmi pinctrl_emmitsburg sd_mod t10_pi sg ahci libahci libata rdma_ucm(OE) ib_uverbs(OE) rdma_cm(OE) iw_cm(OE) ib_cm(OE) ib_umad(OE) ib_core(OE) ib_ucm(OE) mlx_compat(OE)
[3966658.563004] iax_crypto
[3966658.570988] [last unloaded: diagnose]
[3966658.571027] ---[ end trace cc9dbb180f9ae537 ]---
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Philip.Yang<Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
The KFD currently bars a compute partition mode switch while a KFD
process exists.
Since cgroup excluded devices remain excluded for the lifetime of a KFD
process and user space is able to mode switch single devices, allow
users to mode switch a device with any running process that has been
cgroup excluded from this device.
Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Harish Kasiviswanathan <harish.kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
the entries will be appearing at
/sys/kernel/debug/kfd/proc/<pid>/pasid_<gpuid>.
Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
rocm-smi will be able to show the events for KFD process
start/end, it is the implementation of this feature.
Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
If HW scheduler hangs and mode1 reset is used to recover GPU, KFD signal
user space to abort the processes. After process abort exit, user queues
still use the GPU to access system memory before h/w is reset while KFD
cleanup worker free system memory and free VRAM.
There is use-after-free race bug that KFD allocate and reuse the freed
system memory, and user queue write to the same system memory to corrupt
the data structure and cause driver crash.
To fix this race, KFD cleanup worker terminate user queues, then flush
reset_domain wq to wait for any GPU ongoing reset complete, and then
free outstanding BOs.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
If waiting for gpu reset done in KFD release_work, thers is WARNING:
possible circular locking dependency detected
#2 kfd_create_process
kfd_process_mutex
flush kfd release work
#1 kfd release work
wait for amdgpu reset work
#0 amdgpu_device_gpu_reset
kgd2kfd_pre_reset
kfd_process_mutex
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock((work_completion)(&p->release_work));
lock((wq_completion)kfd_process_wq);
lock((work_completion)(&p->release_work));
lock((wq_completion)amdgpu-reset-dev);
To fix this, KFD create process move flush release work outside
kfd_process_mutex.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
In the kfd_process_device_init_vm function, a valid error code is now
returned when the associated Process Address Space ID (PASID) is not
present.
If the address space virtual memory (avm) does not have an associated
PASID, the function sets the ret variable to -EINVAL before proceeding
to the error handling section. This ensures that the calling function,
such as kfd_ioctl_acquire_vm, can appropriately handle the error,
thereby preventing any issues during virtual memory initialization.
Fixes the below:
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c:1694 kfd_process_device_init_vm()
warn: missing error code 'ret'
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c
1647 int kfd_process_device_init_vm(struct kfd_process_device *pdd,
1648 struct file *drm_file)
1649 {
...
1690
1691 if (unlikely(!avm->pasid)) {
1692 dev_warn(pdd->dev->adev->dev, "WARN: vm %p has no pasid associated",
1693 avm);
--> 1694 goto err_get_pasid;
ret = -EINVAL?
1695 }
Fixes: 8544374c0f82 ("drm/amdkfd: Have kfd driver use same PASID values from graphic driver")
Reported by: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Xiaogang Chen <xiaogang.chen@amd.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Curret kfd does not allocate pasid values, instead uses pasid value for each
vm from graphic driver. So should not prevent graphic driver from releasing
pasid values since the values are allocated by graphic driver, not kfd driver
anymore. This patch does not stop graphic driver release pasid values.
Fixes: 8544374c0f82 ("drm/amdkfd: Have kfd driver use same PASID values from graphic driver")
Signed-off-by: Xiaogang Chen <xiaogang.chen@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Current kfd driver has its own PASID value for a kfd process and uses it to
locate vm at interrupt handler or mapping between kfd process and vm. That
design is not working when a physical gpu device has multiple spatial
partitions, ex: adev in CPX mode. This patch has kfd driver use same pasid
values that graphic driver generated which is per vm per pasid.
These pasid values are passed to fw/hardware. We do not need change interrupt
handler though more pasid values are used. Also, pasid values at log are
replaced by user process pid; pasid values are not exposed to user. Users see
their process pids that have meaning in user space.
Signed-off-by: Xiaogang Chen <xiaogang.chen@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
kfd_process_wq_release() signals eviction fence by
dma_fence_signal() which wanrs if dma_fence
is NULL.
kfd_process->ef is initialized by kfd_process_device_init_vm()
through ioctl. That means the fence is NULL for a new
created kfd_process, and close a kfd_process right
after open it will trigger the warning.
This commit conditionally signals the eviction fence
in kfd_process_wq_release() only when it is available.
[ 503.660882] WARNING: CPU: 0 PID: 9 at drivers/dma-buf/dma-fence.c:467 dma_fence_signal+0x74/0xa0
[ 503.782940] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
[ 503.789640] RIP: 0010:dma_fence_signal+0x74/0xa0
[ 503.877620] Call Trace:
[ 503.880066] <TASK>
[ 503.882168] ? __warn+0xcd/0x260
[ 503.885407] ? dma_fence_signal+0x74/0xa0
[ 503.889416] ? report_bug+0x288/0x2d0
[ 503.893089] ? handle_bug+0x53/0xa0
[ 503.896587] ? exc_invalid_op+0x14/0x50
[ 503.900424] ? asm_exc_invalid_op+0x16/0x20
[ 503.904616] ? dma_fence_signal+0x74/0xa0
[ 503.908626] kfd_process_wq_release+0x6b/0x370 [amdgpu]
[ 503.914081] process_one_work+0x654/0x10a0
[ 503.918186] worker_thread+0x6c3/0xe70
[ 503.921943] ? srso_alias_return_thunk+0x5/0xfbef5
[ 503.926735] ? srso_alias_return_thunk+0x5/0xfbef5
[ 503.931527] ? __kthread_parkme+0x82/0x140
[ 503.935631] ? __pfx_worker_thread+0x10/0x10
[ 503.939904] kthread+0x2a8/0x380
[ 503.943132] ? __pfx_kthread+0x10/0x10
[ 503.946882] ret_from_fork+0x2d/0x70
[ 503.950458] ? __pfx_kthread+0x10/0x10
[ 503.954210] ret_from_fork_asm+0x1a/0x30
[ 503.958142] </TASK>
[ 503.960328] ---[ end trace 0000000000000000 ]---
Fixes: 967d226eaae8 ("dma-buf: add WARN_ON() illegal dma-fence signaling")
Signed-off-by: Zhu Lingshan <lingshan.zhu@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Initial support for GC 9.5.0.
v2: squash in pqm_clean_queue_resource() fix from Lijo
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
When using MES creating a pdd will require talking to the GPU to
setup the relevant context. The code here forgot to wake up the GPU
in case it was in suspend, this causes KVM to EFAULT for passthrough
GPU for example. This issue can be masked if the GPU was woken up by
other things (e.g. opening the KMS node) first and have not yet gone to sleep.
v4: do the allocation of proc_ctx_bo in a lazy fashion
when the first queue is created in a process (Felix)
Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Reviewed-by: Yunxiang Li <Yunxiang.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
|
|
kfd process kref count(process->ref) is initialized to 1 by kref_init. After
it is created not need to increase its kref. Instad add kfd process kref at kfd
process mmu notifier allocation since we already decrease the kref at
free_notifier of mmu_notifier_ops, so pair them.
When user process opens kfd node multiple times the kfd process kref is
increased each time to balance with kfd node close operation.
Signed-off-by: Xiaogang Chen <xiaogang.chen@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
In kfd_procfs_show(), the sdma_activity_work_handler is a local variable
and the sdma_activity_work_handler.sdma_activity_work should initialize
with INIT_WORK_ONSTACK() instead of INIT_WORK().
Fixes: 32cb59f31362 ("drm/amdkfd: Track SDMA utilization per process")
Signed-off-by: Yuan Can <yuancan@huawei.com>
Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
'kfd_get_cu_occupancy()'
The `kfd_get_cu_occupancy` function previously declared a large
`cu_occupancy` array as a local variable, which could lead to stack
overflows due to excessive stack usage. This commit replaces the static
array allocation with dynamic memory allocation using `kcalloc`,
thereby reducing the stack size.
This change avoids the risk of stack overflows in kernel space, in
scenarios where `AMDGPU_MAX_QUEUES` is large. The allocated memory is
freed using `kfree` before the function returns to prevent memory
leaks.
Fixes the below with gcc W=1:
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c: In function ‘kfd_get_cu_occupancy’:
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_process.c:322:1: warning: the frame size of 1056 bytes is larger than 1024 bytes [-Wframe-larger-than=]
322 | }
| ^
Fixes: 6ae9e1aba97e ("drm/amdkfd: Update logic for CU occupancy calculations")
Cc: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Suggested-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Process device data pdd->vram_usage is read by rocm-smi via sysfs, this
is currently missing the svm_bo usage accounting, so "rocm-smi
--showpids" per process VRAM usage report is incorrect.
Add pdd->vram_usage accounting when svm_bo allocation and release,
change to atomic64_t type because it is updated outside process mutex
now.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Only creating a new reference for each process instead of each VM.
Fixes: 9a1c1339abf9 ("drm/amdkfd: Run restore_workers on freezable WQs")
Suggested-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Lang Yu <lang.yu@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Make CU occupancy calculations work on GFX 9.4.3 by
updating the logic to handle multiple XCCs correctly.
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Currently, the code uses the IH_VMID_X_LUT register to map
a queue's vmid to the corresponding PASID. This logic is racy
since CP can update the VMID-PASID mapping anytime especially
when there are more processes than number of vmids. Update the
logic to calculate CU occupancy by matching doorbell offset of
the queue with valid wave counts against the process's queues.
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Support per-queue reset for GFX9. The recommendation is for the driver
to target reset the HW queue via a SPI MMIO register write.
Since this requires pipe and HW queue info and MEC FW is limited to
doorbell reports of hung queues after an unmap failure, scan the HW
queue slots defined by SET_RESOURCES first to identify the user queue
candidates to reset.
Only signal reset events to processes that have had a queue reset.
If queue reset fails, fall back to GPU reset.
Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Pass pointer reference to amdgpu_bo_unref to clear the correct pointer,
otherwise amdgpu_bo_unref clear the local variable, the original pointer
not set to NULL, this could cause use-after-free bug.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Convert some pr_* to some dev_* APIs to identify the device.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Add gfx v9_4_4 ip block support
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Le Ma <le.ma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
There is a race condition when re-creating a kfd_process for a process.
This has been observed when a process under the debugger executes
exec(3). In this scenario:
- The process executes exec.
- This will eventually release the process's mm, which will cause the
kfd_process object associated with the process to be freed
(kfd_process_free_notifier decrements the reference count to the
kfd_process to 0). This causes kfd_process_ref_release to enqueue
kfd_process_wq_release to the kfd_process_wq.
- The debugger receives the PTRACE_EVENT_EXEC notification, and tries to
re-enable AMDGPU traps (KFD_IOC_DBG_TRAP_ENABLE).
- When handling this request, KFD tries to re-create a kfd_process.
This eventually calls kfd_create_process and kobject_init_and_add.
At this point the call to kobject_init_and_add can fail because the
old kfd_process.kobj has not been freed yet by kfd_process_wq_release.
This patch proposes to avoid this race by making sure to drain
kfd_process_wq before creating a new kfd_process object. This way, we
know that any cleanup task is done executing when we reach
kobject_init_and_add.
Signed-off-by: Lancelot SIX <lancelot.six@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|