Bug 4267 - dmaengine: idxd: Fix descriptor bof flag issue in dmaengine
Summary: dmaengine: idxd: Fix descriptor bof flag issue in dmaengine
Status: RESOLVED FIXED
Alias: None
Product: ANCK 5.10 Dev
Classification: ANCK
Component: drivers
Version: unspecified
Hardware: x86_64 Linux
Importance: P2-High S2-major
Target Milestone: ---
Assignee: GuixinLiu
QA Contact: shuming
URL:
Whiteboard:
Keywords: Bugfix
Depends on:
Blocks:
 
Reported: 2023-03-01 16:28 UTC by xiaochenshen
Modified: 2023-08-16 15:45 UTC
Description xiaochenshen intel_group 2023-03-01 16:28:57 UTC
Description of problem:
Running memory compaction with the DSA page copy engine enabled in an ECS VM instance randomly triggers #GP errors or kernel crashes, as seen in dmesg.

Version-Release number of selected component (if applicable):
ANCK 5.10 Dev.

How reproducible:
Always.

Steps to Reproduce:

Machine: ECS VM instance
OS: alinux3 with the latest ANCK devel-5.10 kernel
Reproduction Scripts:
```
#
#    ...omitting DSA WQ config steps...
#

# enable batch_migrate
echo 1 > /sys/kernel/mm/migrate/batch_migrate_enabled

# enable dma
echo 1 > /sys/kernel/mm/migrate/dma_migrate_enabled

# start memory compaction
echo 1 > /proc/sys/vm/compact_memory
```

Actual results:
I tried several times, and the logs differed on each run: either a #GP error or a kernel crash.

Expected results:
No #GP errors or kernel crashes.

Additional info:
Regression commit in devel-5.10 branch:
454304c9ea39 anolis: dmaengine: idxd: Add block_on_fault flag to DMA descriptor by default
Comment 1 xiaochenshen intel_group 2023-03-01 16:38:27 UTC
Root cause:
Commit 454304c9ea39 ("anolis: dmaengine: idxd: Add block_on_fault flag to DMA descriptor by default") sets the block_on_fault flag in the descriptor for all data operations by default. The DSA page copy engine uses the Batch operation, but the DSA descriptor for Batch does not support the block_on_fault flag. The resulting Batch operation failure leads to #GP errors or kernel crashes.

The fix:
Set the Block on Fault flag in the descriptor if the Block on Fault bit in
the WQCFG register is set.

Clear the Block on Fault flag in the descriptor explicitly if the flag is
not supported by the data operation (e.g., Batch, No-op).
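
For illustration, the two rules above can be sketched in plain C as follows. This is not the code from the fixing patch: the enum, the flag macro, and the function names are hypothetical stand-ins, the flag value is illustrative, and the list of operations that lack Block on Fault support covers only the two examples named above.
```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical opcode names, not the driver's real definitions. */
enum sketch_opcode {
	SKETCH_OP_NOOP,
	SKETCH_OP_BATCH,
	SKETCH_OP_MEMMOVE,
};

/* Hypothetical bit for the Block on Fault descriptor flag (illustrative value). */
#define SKETCH_FLAG_BOF 0x2u

/* Assumption: only No-op and Batch are treated as lacking Block on Fault support. */
static bool sketch_op_supports_bof(enum sketch_opcode op)
{
	switch (op) {
	case SKETCH_OP_NOOP:
	case SKETCH_OP_BATCH:
		return false;
	default:
		return true;
	}
}

/*
 * Compute the descriptor flags for one operation.
 * wq_bof_enabled models the Block on Fault bit in the WQCFG register.
 */
static uint32_t sketch_fixup_flags(uint32_t flags, enum sketch_opcode op,
				   bool wq_bof_enabled)
{
	/* Rule 1: set the flag if the work queue has Block on Fault enabled. */
	if (wq_bof_enabled)
		flags |= SKETCH_FLAG_BOF;

	/* Rule 2: clear it explicitly for operations that do not support it. */
	if (!sketch_op_supports_bof(op))
		flags &= ~SKETCH_FLAG_BOF;

	return flags;
}
```
With this sketch, a Batch descriptor never carries the flag regardless of the WQ setting, while a memmove descriptor gets it only when the WQ bit is set. Other flag bits pass through unchanged.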
Comment 2 xiaochenshen intel_group 2023-03-03 18:33:40 UTC
NOTE: This issue affects both bare-metal and guest VM environments.

The fixing PR:
https://gitee.com/anolis/cloud-kernel/pulls/1305
Comment 3 GuixinLiu alibaba_cloud_group 2023-08-16 15:45:43 UTC
Already fixed.