| Age | Commit message | Author |
|
This was done entirely with mindless brute force, using
    git grep -l '\<k[vmz]*alloc_objs*(.*, GFP_KERNEL)' |
        xargs sed -i 's/\(alloc_objs*(.*\), GFP_KERNEL)/\1)/'
to convert the new alloc_obj() users that had a simple GFP_KERNEL
argument to just drop that argument.
Note that due to the extreme simplicity of the scripting, any slightly
more complex cases spread over multiple lines would not be triggered:
they definitely exist, but this covers the vast bulk of the cases, and
the resulting diff is also then easier to check automatically.
For the same reason the 'flex' versions will be done as a separate
conversion.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:
  Single allocations:      kmalloc(sizeof(TYPE), ...)
  are replaced with:       kmalloc_obj(TYPE, ...)
  Array allocations:       kmalloc_array(COUNT, sizeof(TYPE), ...)
  are replaced with:       kmalloc_objs(TYPE, COUNT, ...)
  Flex array allocations:  kmalloc(struct_size(PTR, FAM, COUNT), ...)
  are replaced with:       kmalloc_flex(*PTR, FAM, COUNT, ...)
  (where TYPE may also be *VAR)
The resulting allocations no longer return "void *", instead returning
"TYPE *".
Signed-off-by: Kees Cook <kees@kernel.org>
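A minimal user-space sketch of the idea behind the typed helpers, assuming
GCC or Clang (for __typeof__). The demo_* macros and structs are hypothetical
stand-ins, not the kernel macros: calloc() replaces the kernel allocator,
there are no GFP flags, and the flex variant skips the overflow checking that
struct_size() provides. It only illustrates how a macro can return "TYPE *"
instead of "void *" for the single, array and flex cases.

```c
#include <stdlib.h>

/*
 * Hypothetical user-space stand-ins, NOT the kernel macros.
 * __typeof__ accepts a type name or an expression, which is what
 * allows TYPE to also be written as *VAR.
 */
#define demo_alloc_obj(TYPE) \
	((__typeof__(TYPE) *)calloc(1, sizeof(TYPE)))

#define demo_alloc_objs(TYPE, COUNT) \
	((__typeof__(TYPE) *)calloc((COUNT), sizeof(TYPE)))

/* Header plus COUNT trailing elements; overflow checking is omitted. */
#define demo_alloc_flex(TYPE, FAM, COUNT) \
	((__typeof__(TYPE) *)calloc(1, sizeof(TYPE) + \
		(size_t)(COUNT) * sizeof(((__typeof__(TYPE) *)0)->FAM[0])))

struct demo_queue {
	int id;
	unsigned int depth;
};

struct demo_ring {
	size_t len;
	int slots[];	/* flexible array member */
};
```

Because each macro yields a typed pointer, assigning the result to a pointer
of a different struct type draws a compiler diagnostic instead of being
silently accepted.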
|
|
Make allocate_mqd consistent with other callbacks.
Prepare for next patch to use mqd_manager->mqd_size.
Signed-off-by: Lang Yu <lang.yu@amd.com>
Reviewed-by: David Belanger <david.belanger@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Prepare to allocate the kernel BO for the MQD from the VRAM domain in the
following patch. No functional change, because kernel BOs are still all
allocated from the GTT domain.
Rename amdgpu_amdkfd_alloc_gtt_mem to amdgpu_amdkfd_alloc_kernel_mem
Rename amdgpu_amdkfd_free_gtt_mem to amdgpu_amdkfd_free_kernel_mem
Rename mem_kfd_mem_obj gtt_mem to mem
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Kent Russell <kent.russell@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Each queue of the process is removed individually, so there is no need
to suspend the whole MES. Suspending MES also stops kernel mode queues,
causing unnecessary timeouts when running mixed workloads.
Fixes: 079ae5118e1f ("drm/amdkfd: fix suspend/resume all calls in mes based eviction path")
Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/4765
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
If dqm->ops.initialize() fails, add deallocate_hiq_sdma_mqd()
to release the memory allocated by allocate_hiq_sdma_mqd().
Move deallocate_hiq_sdma_mqd() up to ensure proper function
visibility at the point of use.
Fixes: 11614c36bc8f ("drm/amdkfd: Allocate MQD trunk for HIQ and SDMA")
Signed-off-by: Haoxiang Li <lihaoxiang@isrc.iscas.ac.cn>
Signed-off-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Oak Zeng <Oak.Zeng@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
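A minimal user-space sketch of the unwind pattern described above, under the
assumption that the MQD buffer is allocated before initialize() runs. All
demo_* names are hypothetical; malloc()/free() stand in for the real MQD
allocation. The release helper is defined ahead of the function that needs
it on the error path, mirroring the "move up" part of the change.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the device queue manager. */
struct demo_dqm {
	void *hiq_sdma_mqd;	/* buffer set up before initialize() runs */
	int (*initialize)(struct demo_dqm *dqm);
};

static int demo_allocate_mqd(struct demo_dqm *dqm)
{
	dqm->hiq_sdma_mqd = malloc(64);
	return dqm->hiq_sdma_mqd ? 0 : -1;
}

/* Defined before its first use so the error path below can call it. */
static void demo_deallocate_mqd(struct demo_dqm *dqm)
{
	free(dqm->hiq_sdma_mqd);
	dqm->hiq_sdma_mqd = NULL;
}

static int demo_init_ok(struct demo_dqm *dqm) { (void)dqm; return 0; }
static int demo_init_fail(struct demo_dqm *dqm) { (void)dqm; return -1; }

/* Returns 0 on success; on failure nothing stays allocated. */
static int demo_dqm_create(struct demo_dqm *dqm)
{
	int ret = demo_allocate_mqd(dqm);

	if (ret)
		return ret;

	ret = dqm->initialize(dqm);
	if (ret)
		goto err_dealloc;

	return 0;

err_dealloc:
	demo_deallocate_mqd(dqm);
	return ret;
}
```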
|
|
On GFX 12.1, pass the xcc id of the master XCC to choose the correct
MES Pipe to send the add_queue/remove_queue requests to MES.
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Michael Chen <michael.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Currently, we do not turn off retry faults in VM_CONTEXT_CNTL value
when passing it to MES if XNACK is off. This creates a situation where
XNACK is disabled in SQ but enabled in UTCL2, which is not recommended.
As a result, turn off/on retry faults in both SQ and UTCL2 when passing
vm_context_cntl value to MES if XNACK is disabled/enabled.
Suggested-by: Jay Cornwall <jay.cornwall@amd.com>
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
GFX 12.1.0 will support enabling/disabling XNACK on a per-process
basis. This change enables the per-process XNACK feature.
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
This patch adds the following functionality for GFX 12.1.0:
1. Add a new MQD manager for GFX v12.1.0.
2. Add a new 12.1.0 specific device queue manager file.
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
This commit removes DIQ support because it has been
marked as DEPRECATED since 2022.
Signed-off-by: Zhu Lingshan <lingshan.zhu@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
This patch initializes key variables and removes unused ones.
Signed-off-by: Andrew Martin <andrew.martin@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
We would need to reserve SDMA queues per KFD node.
As a result, rework the SDMA reserved queue handling to make it per
KFD node.
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
This patch fixes the formatting in the patch
"amdkfd: Do not wait for queue op response during reset"
Signed-off-by: Ahmad Rehman <Ahmad.Rehman@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
This patch adds the condition to not wait for
the queue response for unmap, if the gpu is in reset.
Signed-off-by: Ahmad Rehman <Ahmad.Rehman@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Properly check the return values for function, as done elsewhere.
Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Suspending/resuming all gangs should be done with the device lock held.
Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Harish Kasiviswanathan <harish.kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
GPUs with multi-xcc have multiple MQDs per queue. This patch saves and
restores all the MQDs within the partition.
Signed-off-by: David Yat Sin <David.YatSin@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit a578f2a58c3ab38f0643b1b6e7534af860233cb1)
Cc: stable@vger.kernel.org
|
|
Add a parameter to amdgpu_sdma_reset_engine() to let the
caller handle the kernel rings. This allows the kernel
rings to back up their unprocessed state if the reset comes in
via the drm scheduler rather than KFD.
Reviewed-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
If ring reset is disabled, skip resetting queues. Instead, fall back to
device based reset.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
SDMA doesn't support oversubscription. Creating queues beyond the HW limit
is the user's responsibility and is not supposed to be a KFD error.
Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Move the kfd suspend/resume code into the caller. That
is where the KFD is likely to detect a reset so on the KFD
side there is no need to call them. Also add a mutex to
lock the actual reset sequence.
v2: make the locking per instance
Fixes: bac38ca8c475 ("drm/amdkfd: implement per queue sdma reset for gfx 9.4+")
Reviewed-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
[Why]
If a reset is detected and KFD needs to evict working queues, the HWS queue
removal fails. The remaining queues are then not evicted and stay in the
active state. After the reset is done, KFD uses HWS to terminate the remaining
active queues, but HWS has been reset, so removing the queues fails again.
[How]
Keep removing all queues even if HWS returns a failure.
It will not affect cpsch as it checks reset_domain->sem.
v2: If any queue fails, evict queue returns an error.
v3: Declare err inside the if-block.
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Yifan Zha <Yifan.Zha@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
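A minimal user-space sketch of the "keep going, report the first error" loop
described under [How]. The demo_* names and the error value are hypothetical.
The sketch also assumes that a queue is marked inactive in the driver-side
bookkeeping whether or not the scheduler request succeeds; the commit message
does not spell that detail out.

```c
#include <stddef.h>

/* Hypothetical per-queue state for the sketch. */
struct demo_evict_queue {
	int active;
	int remove_fails;	/* simulates the scheduler rejecting the request */
};

static int demo_remove_queue(const struct demo_evict_queue *q)
{
	return q->remove_fails ? -5 : 0;	/* arbitrary error code */
}

/*
 * Visit every active queue even if an earlier removal failed, and
 * return the first error seen (0 if every removal succeeded).
 */
static int demo_evict_all(struct demo_evict_queue *queues, size_t n)
{
	int ret = 0;

	for (size_t i = 0; i < n; i++) {
		if (!queues[i].active)
			continue;

		/* Assumption: bookkeeping is updated regardless of the result. */
		queues[i].active = 0;

		int err = demo_remove_queue(&queues[i]);

		if (err && !ret)
			ret = err;
	}
	return ret;
}
```

Returning early on the first failure would leave the later queues active,
which is the situation described under [Why].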
|
|
Replace pm_update_grace_period() with the cleaner
pm_config_dequeue_wait_counts(). Previously, grace_period was
overloaded as both a variable and a macro, making it inflexible to
configure additional dequeue wait times.
pm_config_dequeue_wait_counts() now takes in a cmd / variable. This
allows flexibility to update different dequeue wait times.
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Jonathan Kim <jonathan.kim@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Add support for more per-process flags starting with option to configure
MFMA precision for gfx 9.5
v2: Change flag name to KFD_PROC_FLAG_MFMA_HIGH_PRECISION
Remove unused else condition
v3: Bump the KFD API version
v4: Missed SH_MEM_CONFIG__PRECISION_MODE__SHIFT define. Added it.
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Amber Lin <Amber.Lin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Set per-process static sh_mem config only once during process
initialization. Move all static changes from update_qpd() which is
called each time a queue is created to set_cache_memory_policy() which
is called once during process initialization.
set_cache_memory_policy() is currently defined only for cik and vi
family. So this commit only focuses on these two. A separate commit will
address other asics.
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Amber Lin <Amber.Lin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
To reset hung SDMA queues on GFX 9.4+ for the GFX9 family, a soft reset
must be issued through SMU. Since soft resets will reset an entire SDMA
engine, use a common KGD call to do the reset as the KGD will handle
avoiding a reset of in flight GFX and paging queues on that engine.
In addition, create a common call for all reset types to simplify
the handling of module parameter settings that block gpu resets.
Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Harish Kasiviswanathan <harish.kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
With the GPU reset-domain worker implemented, the KFD hw_exception worker is
no longer needed; just call amdgpu_amdkfd_gpu_reset directly from
kfd_hws_hang.
Suggested-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
'svm_range_cpu_invalidate_pagetables'
This commit addresses a circular locking dependency in the
svm_range_cpu_invalidate_pagetables function. The function previously
held a lock while determining whether to perform an unmap or eviction
operation, which could lead to deadlocks.
Fixes the below:
[ 223.418794] ======================================================
[ 223.418820] WARNING: possible circular locking dependency detected
[ 223.418845] 6.12.0-amdstaging-drm-next-lol-050225 #14 Tainted: G U OE
[ 223.418869] ------------------------------------------------------
[ 223.418889] kfdtest/3939 is trying to acquire lock:
[ 223.418906] ffff8957552eae38 (&dqm->lock_hidden){+.+.}-{3:3}, at: evict_process_queues_cpsch+0x43/0x210 [amdgpu]
[ 223.419302]
but task is already holding lock:
[ 223.419303] ffff8957556b83b0 (&prange->lock){+.+.}-{3:3}, at: svm_range_cpu_invalidate_pagetables+0x9d/0x850 [amdgpu]
[ 223.419447] Console: switching to colour dummy device 80x25
[ 223.419477] [IGT] amd_basic: executing
[ 223.419599]
which lock already depends on the new lock.
[ 223.419611]
the existing dependency chain (in reverse order) is:
[ 223.419621]
-> #2 (&prange->lock){+.+.}-{3:3}:
[ 223.419636] __mutex_lock+0x85/0xe20
[ 223.419647] mutex_lock_nested+0x1b/0x30
[ 223.419656] svm_range_validate_and_map+0x2f1/0x15b0 [amdgpu]
[ 223.419954] svm_range_set_attr+0xe8c/0x1710 [amdgpu]
[ 223.420236] svm_ioctl+0x46/0x50 [amdgpu]
[ 223.420503] kfd_ioctl_svm+0x50/0x90 [amdgpu]
[ 223.420763] kfd_ioctl+0x409/0x6d0 [amdgpu]
[ 223.421024] __x64_sys_ioctl+0x95/0xd0
[ 223.421036] x64_sys_call+0x1205/0x20d0
[ 223.421047] do_syscall_64+0x87/0x140
[ 223.421056] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 223.421068]
-> #1 (reservation_ww_class_mutex){+.+.}-{3:3}:
[ 223.421084] __ww_mutex_lock.constprop.0+0xab/0x1560
[ 223.421095] ww_mutex_lock+0x2b/0x90
[ 223.421103] amdgpu_amdkfd_alloc_gtt_mem+0xcc/0x2b0 [amdgpu]
[ 223.421361] add_queue_mes+0x3bc/0x440 [amdgpu]
[ 223.421623] unhalt_cpsch+0x1ae/0x240 [amdgpu]
[ 223.421888] kgd2kfd_start_sched+0x5e/0xd0 [amdgpu]
[ 223.422148] amdgpu_amdkfd_start_sched+0x3d/0x50 [amdgpu]
[ 223.422414] amdgpu_gfx_enforce_isolation_handler+0x132/0x270 [amdgpu]
[ 223.422662] process_one_work+0x21e/0x680
[ 223.422673] worker_thread+0x190/0x330
[ 223.422682] kthread+0xe7/0x120
[ 223.422690] ret_from_fork+0x3c/0x60
[ 223.422699] ret_from_fork_asm+0x1a/0x30
[ 223.422708]
-> #0 (&dqm->lock_hidden){+.+.}-{3:3}:
[ 223.422723] __lock_acquire+0x16f4/0x2810
[ 223.422734] lock_acquire+0xd1/0x300
[ 223.422742] __mutex_lock+0x85/0xe20
[ 223.422751] mutex_lock_nested+0x1b/0x30
[ 223.422760] evict_process_queues_cpsch+0x43/0x210 [amdgpu]
[ 223.423025] kfd_process_evict_queues+0x8a/0x1d0 [amdgpu]
[ 223.423285] kgd2kfd_quiesce_mm+0x43/0x90 [amdgpu]
[ 223.423540] svm_range_cpu_invalidate_pagetables+0x4a7/0x850 [amdgpu]
[ 223.423807] __mmu_notifier_invalidate_range_start+0x1f5/0x250
[ 223.423819] copy_page_range+0x1e94/0x1ea0
[ 223.423829] copy_process+0x172f/0x2ad0
[ 223.423839] kernel_clone+0x9c/0x3f0
[ 223.423847] __do_sys_clone+0x66/0x90
[ 223.423856] __x64_sys_clone+0x25/0x30
[ 223.423864] x64_sys_call+0x1d7c/0x20d0
[ 223.423872] do_syscall_64+0x87/0x140
[ 223.423880] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 223.423891]
other info that might help us debug this:
[ 223.423903] Chain exists of:
&dqm->lock_hidden --> reservation_ww_class_mutex --> &prange->lock
[ 223.423926] Possible unsafe locking scenario:
[ 223.423935] CPU0 CPU1
[ 223.423942] ---- ----
[ 223.423949] lock(&prange->lock);
[ 223.423958] lock(reservation_ww_class_mutex);
[ 223.423970] lock(&prange->lock);
[ 223.423981] lock(&dqm->lock_hidden);
[ 223.423990]
*** DEADLOCK ***
[ 223.423999] 5 locks held by kfdtest/3939:
[ 223.424006] #0: ffffffffb82b4fc0 (dup_mmap_sem){.+.+}-{0:0}, at: copy_process+0x1387/0x2ad0
[ 223.424026] #1: ffff89575eda81b0 (&mm->mmap_lock){++++}-{3:3}, at: copy_process+0x13a8/0x2ad0
[ 223.424046] #2: ffff89575edaf3b0 (&mm->mmap_lock/1){+.+.}-{3:3}, at: copy_process+0x13e4/0x2ad0
[ 223.424066] #3: ffffffffb82e76e0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: copy_page_range+0x1cea/0x1ea0
[ 223.424088] #4: ffff8957556b83b0 (&prange->lock){+.+.}-{3:3}, at: svm_range_cpu_invalidate_pagetables+0x9d/0x850 [amdgpu]
[ 223.424365]
stack backtrace:
[ 223.424374] CPU: 0 UID: 0 PID: 3939 Comm: kfdtest Tainted: G U OE 6.12.0-amdstaging-drm-next-lol-050225 #14
[ 223.424392] Tainted: [U]=USER, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 223.424401] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO WIFI/X570 AORUS PRO WIFI, BIOS F36a 02/16/2022
[ 223.424416] Call Trace:
[ 223.424423] <TASK>
[ 223.424430] dump_stack_lvl+0x9b/0xf0
[ 223.424441] dump_stack+0x10/0x20
[ 223.424449] print_circular_bug+0x275/0x350
[ 223.424460] check_noncircular+0x157/0x170
[ 223.424469] ? __bfs+0xfd/0x2c0
[ 223.424481] __lock_acquire+0x16f4/0x2810
[ 223.424490] ? srso_return_thunk+0x5/0x5f
[ 223.424505] lock_acquire+0xd1/0x300
[ 223.424514] ? evict_process_queues_cpsch+0x43/0x210 [amdgpu]
[ 223.424783] __mutex_lock+0x85/0xe20
[ 223.424792] ? evict_process_queues_cpsch+0x43/0x210 [amdgpu]
[ 223.425058] ? srso_return_thunk+0x5/0x5f
[ 223.425067] ? mark_held_locks+0x54/0x90
[ 223.425076] ? evict_process_queues_cpsch+0x43/0x210 [amdgpu]
[ 223.425339] ? srso_return_thunk+0x5/0x5f
[ 223.425350] mutex_lock_nested+0x1b/0x30
[ 223.425358] ? mutex_lock_nested+0x1b/0x30
[ 223.425367] evict_process_queues_cpsch+0x43/0x210 [amdgpu]
[ 223.425631] kfd_process_evict_queues+0x8a/0x1d0 [amdgpu]
[ 223.425893] kgd2kfd_quiesce_mm+0x43/0x90 [amdgpu]
[ 223.426156] svm_range_cpu_invalidate_pagetables+0x4a7/0x850 [amdgpu]
[ 223.426423] ? srso_return_thunk+0x5/0x5f
[ 223.426436] __mmu_notifier_invalidate_range_start+0x1f5/0x250
[ 223.426450] copy_page_range+0x1e94/0x1ea0
[ 223.426461] ? srso_return_thunk+0x5/0x5f
[ 223.426474] ? srso_return_thunk+0x5/0x5f
[ 223.426484] ? lock_acquire+0xd1/0x300
[ 223.426494] ? copy_process+0x1718/0x2ad0
[ 223.426502] ? srso_return_thunk+0x5/0x5f
[ 223.426510] ? sched_clock_noinstr+0x9/0x10
[ 223.426519] ? local_clock_noinstr+0xe/0xc0
[ 223.426528] ? copy_process+0x1718/0x2ad0
[ 223.426537] ? srso_return_thunk+0x5/0x5f
[ 223.426550] copy_process+0x172f/0x2ad0
[ 223.426569] kernel_clone+0x9c/0x3f0
[ 223.426577] ? __schedule+0x4c9/0x1b00
[ 223.426586] ? srso_return_thunk+0x5/0x5f
[ 223.426594] ? sched_clock_noinstr+0x9/0x10
[ 223.426602] ? srso_return_thunk+0x5/0x5f
[ 223.426610] ? local_clock_noinstr+0xe/0xc0
[ 223.426619] ? schedule+0x107/0x1a0
[ 223.426629] __do_sys_clone+0x66/0x90
[ 223.426643] __x64_sys_clone+0x25/0x30
[ 223.426652] x64_sys_call+0x1d7c/0x20d0
[ 223.426661] do_syscall_64+0x87/0x140
[ 223.426671] ? srso_return_thunk+0x5/0x5f
[ 223.426679] ? common_nsleep+0x44/0x50
[ 223.426690] ? srso_return_thunk+0x5/0x5f
[ 223.426698] ? trace_hardirqs_off+0x52/0xd0
[ 223.426709] ? srso_return_thunk+0x5/0x5f
[ 223.426717] ? syscall_exit_to_user_mode+0xcc/0x200
[ 223.426727] ? srso_return_thunk+0x5/0x5f
[ 223.426736] ? do_syscall_64+0x93/0x140
[ 223.426748] ? srso_return_thunk+0x5/0x5f
[ 223.426756] ? up_write+0x1c/0x1e0
[ 223.426765] ? srso_return_thunk+0x5/0x5f
[ 223.426775] ? srso_return_thunk+0x5/0x5f
[ 223.426783] ? trace_hardirqs_off+0x52/0xd0
[ 223.426792] ? srso_return_thunk+0x5/0x5f
[ 223.426800] ? syscall_exit_to_user_mode+0xcc/0x200
[ 223.426810] ? srso_return_thunk+0x5/0x5f
[ 223.426818] ? do_syscall_64+0x93/0x140
[ 223.426826] ? syscall_exit_to_user_mode+0xcc/0x200
[ 223.426836] ? srso_return_thunk+0x5/0x5f
[ 223.426844] ? do_syscall_64+0x93/0x140
[ 223.426853] ? srso_return_thunk+0x5/0x5f
[ 223.426861] ? irqentry_exit+0x6b/0x90
[ 223.426869] ? srso_return_thunk+0x5/0x5f
[ 223.426877] ? exc_page_fault+0xa7/0x2c0
[ 223.426888] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 223.426898] RIP: 0033:0x7f46758eab57
[ 223.426906] Code: ba 04 00 f3 0f 1e fa 64 48 8b 04 25 10 00 00 00 45 31 c0 31 d2 31 f6 bf 11 00 20 01 4c 8d 90 d0 02 00 00 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 41 89 c0 85 c0 75 2c 64 48 8b 04 25 10 00
[ 223.426930] RSP: 002b:00007fff5c3e5188 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
[ 223.426943] RAX: ffffffffffffffda RBX: 00007f4675f8c040 RCX: 00007f46758eab57
[ 223.426954] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[ 223.426965] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 223.426975] R10: 00007f4675e81a50 R11: 0000000000000246 R12: 0000000000000001
[ 223.426986] R13: 00007fff5c3e5470 R14: 00007fff5c3e53e0 R15: 00007fff5c3e5410
[ 223.427004] </TASK>
v2: To resolve this issue, the allocation of the process context buffer
(`proc_ctx_bo`) has been moved from the `add_queue_mes` function to the
`pqm_create_queue` function. This change ensures that the buffer is
allocated only when the first queue for a process is created and only if
the Micro Engine Scheduler (MES) is enabled. (Felix)
v3: Fix typo s/Memory Execution Scheduler (MES)/Micro Engine Scheduler
in commit message. (Lijo)
Fixes: 438b39ac74e2 ("drm/amdkfd: pause autosuspend when creating pdd")
Cc: Jesse Zhang <jesse.zhang@amd.com>
Cc: Yunxiang Li <Yunxiang.Li@amd.com>
Cc: Philip Yang <Philip.Yang@amd.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
The KFD driver currently has its own PASID value for each KFD process and
uses it to locate the VM in the interrupt handler and to map between a KFD
process and its VM. That design does not work when a physical GPU device has
multiple spatial partitions, e.g. adev in CPX mode. This patch makes the KFD
driver use the same PASID values that the graphics driver generates, which
are one PASID per VM.
These PASID values are passed to fw/hardware. The interrupt handler does not
need to change even though more PASID values are used. Also, PASID values in
the log are replaced by the user process pid; PASID values are not exposed to
the user. Users see their process pids, which have meaning in user space.
Signed-off-by: Xiaogang Chen <xiaogang.chen@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
The purpose of halt_if_hws_hang is to preserve GPU state for driver
debugging when queue preemption fails. Issuing per-queue reset may
kill wavefronts which caused the preemption failure.
Signed-off-by: Jay Cornwall <jay.cornwall@amd.com>
Reviewed-by: Jonathan Kim <Jonathan.Kim@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.12.x
|
|
This patch checks and warns if pdd is NULL.
Signed-off-by: Andrew Martin <Andrew.Martin@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
When using MES, creating a pdd requires talking to the GPU to set up
the relevant context. The code here forgot to wake up the GPU in case
it was suspended, which causes KVM to EFAULT for a passthrough GPU,
for example. This issue can be masked if the GPU was woken up by
something else first (e.g. opening the KMS node) and has not yet gone
back to sleep.
v4: do the allocation of proc_ctx_bo in a lazy fashion
when the first queue is created in a process (Felix)
Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Reviewed-by: Yunxiang Li <Yunxiang.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
|
|
On start scheduling, add back the KFD queues that were originally
removed on stop scheduling.
Signed-off-by: Shaoyun Liu <shaoyun.liu@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Since adding mb() in unmap_queues_cpsch should be sufficient, remove the
added volatile qualifier as recommended.
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Make sure KFD_FENCE_INIT is written to fence_addr before
pm_send_query_status is called, to avoid a qcm fence timeout caused by
incorrect ordering.
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
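A minimal user-space sketch of the ordering requirement, assuming C11
atomics. The demo_* names and fence values are hypothetical, the "device"
here completes immediately in the same thread, and
atomic_thread_fence(memory_order_seq_cst) is only a rough user-space analogue
of the kernel's mb(). The point is the order of operations: the init value is
published before the query is submitted, so a completion write cannot be
overwritten by a late init write.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical fence values for the sketch. */
#define DEMO_FENCE_INIT      0u
#define DEMO_FENCE_COMPLETED 1u

/* Stand-in for submitting the query; the "device" completes at once. */
static void demo_send_query_status(_Atomic uint64_t *fence_addr)
{
	atomic_store_explicit(fence_addr, DEMO_FENCE_COMPLETED,
			      memory_order_release);
}

/* Returns 1 if the fence reads back as completed after the query. */
static int demo_query_and_check(_Atomic uint64_t *fence_addr)
{
	/* 1. Reset the fence to its init value. */
	atomic_store_explicit(fence_addr, DEMO_FENCE_INIT,
			      memory_order_relaxed);

	/* 2. Full barrier: the init write is ordered before the submission. */
	atomic_thread_fence(memory_order_seq_cst);

	/* 3. Only now hand the fence address to the consumer. */
	demo_send_query_status(fence_addr);

	return atomic_load_explicit(fence_addr, memory_order_acquire) ==
	       DEMO_FENCE_COMPLETED;
}
```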
|
|
get_wave_state is not defined for SDMA queues, so copy_context_work_handler
crashes when it calls it for an SDMA queue.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Jonathan Kim <jonathan.kim@amd.com>
Tested-by: Jonathan Kim <jonathan.kim@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Make CU occupancy calculations work on GFX 9.4.3 by
updating the logic to handle multiple XCCs correctly.
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Currently, the code uses the IH_VMID_X_LUT register to map
a queue's vmid to the corresponding PASID. This logic is racy
since CP can update the VMID-PASID mapping anytime especially
when there are more processes than number of vmids. Update the
logic to calculate CU occupancy by matching doorbell offset of
the queue with valid wave counts against the process's queues.
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
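A minimal user-space sketch of the doorbell-matching idea, with hypothetical
demo_* names and a made-up data layout. It assumes the hardware queue slots
have already been read back into an array and that a wave count of zero means
"no valid wave count". It sums waves only and leaves out the conversion from
waves to CU occupancy.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one hardware queue slot. */
struct demo_hw_queue {
	uint32_t doorbell_off;
	uint32_t wave_cnt;	/* 0 means no valid wave count */
};

/*
 * Sum the wave counts of hardware queues whose doorbell offset matches
 * one of the process's queues. No VMID-to-PASID lookup is involved.
 */
static uint32_t demo_process_wave_count(const struct demo_hw_queue *hw,
					size_t n_hw,
					const uint32_t *proc_doorbells,
					size_t n_proc)
{
	uint32_t total = 0;

	for (size_t i = 0; i < n_hw; i++) {
		if (hw[i].wave_cnt == 0)
			continue;
		for (size_t j = 0; j < n_proc; j++) {
			if (hw[i].doorbell_off == proc_doorbells[j]) {
				total += hw[i].wave_cnt;
				break;
			}
		}
	}
	return total;
}
```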
|
|
If a queue is being destroyed but causes a HWS hang on removal, the KFD
may issue an unnecessary gpu reset if the destroyed queue can be fixed
by a queue reset.
This is because the queue has been removed from the KFD's queue list
prior to the preemption action on destroy so the reset call will fail to
match the HQD PQ reset information against the KFD's queue record to do
the actual reset.
To fix this, deactivate the queue prior to preemption since it's being
destroyed anyways and remove the queue from the KFD's queue list after
preemption.
Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Based on the recommendation of MEC FW, update BadOpcode interrupt
handling by unmapping all queues, removing the queue that got the
interrupt from scheduling and remapping rest of the queues back when
using MES scheduler. This is done to prevent the case where unmapping
of the bad queue can fail thereby causing a GPU reset.
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Acked-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
MEC FW expects MES to unmap all queues when a VM fault is observed
on a queue and then resume them once the affected process is terminated.
Use the MES Suspend and Resume APIs to achieve this.
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Provide amdgpu_amdkfd_stop_sched() for amdgpu to stop KFD scheduling
compute work on HIQ. amdgpu_amdkfd_start_sched() resumes the scheduling.
When amdgpu_amdkfd_stop_sched is called, KFD will unmap queues from
runlist. If users send ioctls to KFD to create queues, they'll be added
but those queues won't be mapped to runlist (so not scheduled) until
amdgpu_amdkfd_start_sched is called.
v2: fix build (Alex)
Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
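A minimal user-space sketch of the stop/start behaviour described above,
using a hypothetical demo_sched structure rather than the real KFD data
structures. It assumes a simple per-queue "mapped" flag in place of the
runlist: queues created while scheduling is stopped are recorded but stay
unmapped until scheduling is started again.

```c
#include <stddef.h>

#define DEMO_MAX_QUEUES 8

/* Hypothetical scheduler state for the sketch. */
struct demo_sched {
	int halted;
	size_t n_queues;
	int mapped[DEMO_MAX_QUEUES];	/* 1 if the queue is on the runlist */
};

/* Unmap every queue and remember that scheduling is stopped. */
static void demo_stop_sched(struct demo_sched *s)
{
	s->halted = 1;
	for (size_t i = 0; i < s->n_queues; i++)
		s->mapped[i] = 0;
}

/* Adds a queue; it is mapped only if scheduling is running. */
static int demo_create_queue(struct demo_sched *s)
{
	if (s->n_queues == DEMO_MAX_QUEUES)
		return -1;
	s->mapped[s->n_queues] = !s->halted;
	return (int)s->n_queues++;
}

/* Map every known queue, including those created while stopped. */
static void demo_start_sched(struct demo_sched *s)
{
	s->halted = 0;
	for (size_t i = 0; i < s->n_queues; i++)
		s->mapped[i] = 1;
}
```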
|
|
Support per-queue reset for GFX9. The recommendation is for the driver
to target reset the HW queue via a SPI MMIO register write.
Since this requires pipe and HW queue info and MEC FW is limited to
doorbell reports of hung queues after an unmap failure, scan the HW
queue slots defined by SET_RESOURCES first to identify the user queue
candidates to reset.
Only signal reset events to processes that have had a queue reset.
If queue reset fails, fall back to GPU reset.
Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Certain GPUs have better copy performance over xGMI on specific
SDMA engines depending on the source and destination GPU.
Allow users to create SDMA queues on these recommended engines.
Close to 2x overall performance has been observed with this
optimization.
Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Add helper function kfd_queue_acquire_buffers to get queue wptr_bo
reference from queue write_ptr if it is mapped to the KFD node with
expected size.
Add wptr_bo to structure queue_properties because structure queue is
allocated after queue buffers are validated, then we can remove wptr_bo
parameter from pqm_create_queue.
Rename structure queue wptr_bo_gart to hold the wptr_bo reference for GART
mapping and unmapping. Move the MES wptr_bo_gart mapping to init_user_queue,
the same location as the queue ctx_bo GART mapping.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
Pass a reference to the original pointer to amdgpu_bo_unref so that the
correct pointer is cleared. Otherwise amdgpu_bo_unref clears only the local
variable and the original pointer is not set to NULL, which could cause a
use-after-free bug.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
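A minimal user-space sketch of the pointer-clearing problem, with a
hypothetical demo_bo_unref() that takes a pointer to the caller's pointer and
sets it to NULL, in the style the message describes for amdgpu_bo_unref. The
demo_* types are stand-ins and the refcounting is deliberately simplistic.

```c
#include <stdlib.h>

/* Hypothetical refcounted object. */
struct demo_bo {
	int refcount;
};

struct demo_holder {
	struct demo_bo *wptr_bo;
};

static struct demo_bo *demo_bo_create(void)
{
	struct demo_bo *bo = calloc(1, sizeof(*bo));

	if (bo)
		bo->refcount = 1;
	return bo;
}

/* Drops one reference and clears the pointer that *bo refers to. */
static void demo_bo_unref(struct demo_bo **bo)
{
	if (!*bo)
		return;
	if (--(*bo)->refcount == 0)
		free(*bo);
	*bo = NULL;
}

/* Buggy: only the local copy is cleared, h->wptr_bo keeps its old value. */
static void demo_release_buggy(struct demo_holder *h)
{
	struct demo_bo *bo = h->wptr_bo;

	demo_bo_unref(&bo);
}

/* Fixed: the holder's own pointer is cleared. */
static void demo_release_fixed(struct demo_holder *h)
{
	demo_bo_unref(&h->wptr_bo);
}
```

With the buggy variant, a later access through h->wptr_bo after the last
reference is dropped would touch freed memory; the fixed variant leaves NULL
behind, which a caller can check.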
|
|
We recently added locking to add_queue_mes() but this error path was
overlooked. Add an unlock to the error path.
Fixes: 1802b042a343 ("drm/amdgpu/kfd: remove is_hws_hang and is_resetting")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
is_hws_hang and is_resetting serve pretty much the same purpose, and
both duplicate the work of the reset_domain lock, so just check that
directly instead. This also eliminates a few bugs listed below and gets
rid of dqm->ops.pre_reset.
kfd_hws_hang did not need to avoid scheduling another reset. If the
on-going reset decided to skip GPU reset we have a bad time, otherwise
the extra reset will get cancelled anyway.
remove_queue_mes forgot to check is_resetting flag compared to the
pre-MES path unmap_queue_cpsch, so it did not block hw access during
reset correctly.
Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|
|
It was an enablement vehicle for MES 11 and was never
productized. Remove it.
v2: drop additional checks in the GFX10 code.
v3: drop mes_api_def.h
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
|