Description of problem:
Ran memory compaction with the DSA page copy engine enabled in an ECS VM instance. Random #GP errors or kernel crashes show up in dmesg.

Version-Release number of selected component (if applicable):
ANCK 5.10 Dev.

How reproducible:
Always.

Steps to Reproduce:
Machine: ECS VM instance
OS: alinux3 with latest anck devel-5.10 kernel
Reproduction Scripts:
```
#
# ...omitting DSA WQ config steps...
#

# enable batch_migrate
echo 1 > /sys/kernel/mm/migrate/batch_migrate_enabled

# enable dma
echo 1 > /sys/kernel/mm/migrate/dma_migrate_enabled

# start memory compaction
echo 1 > /proc/sys/vm/compact_memory
```

Actual results:
I tried several times, and the logs were different every time: either a #GP error or a kernel crash.

Expected results:
No #GP errors or kernel crashes.

Additional info:
Regression commit in devel-5.10 branch:
454304c9ea39 anolis: dmaengine: idxd: Add block_on_fault flag to DMA descriptor by default
Root cause:
Commit 454304c9ea39 ("anolis: dmaengine: idxd: Add block_on_fault flag to DMA descriptor by default") sets the block_on_fault flag in the descriptor for all data operations by default. The DSA page copy engine uses the Batch operation, but the DSA descriptor for Batch does not support the block_on_fault flag. The resulting Batch operation failure leads to the #GP errors or kernel crash.

The fix:
Set the Block on Fault flag in the descriptor if the Block on Fault bit in the WQCFG register is set. Explicitly clear the Block on Fault flag in the descriptor if the flag is not supported by the data operation (e.g., Batch, No-op).
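The two steps of the fix can be sketched as a small user-space model. This is only an illustration of the flag logic described above, not the actual patch: every `sketch_*` / `SKETCH_*` name is hypothetical, and the opcode and flag values are assumptions that should be checked against the idxd driver headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical opcodes for illustration; real values live in the idxd headers. */
enum sketch_opcode {
    SKETCH_OP_NOOP = 0,
    SKETCH_OP_BATCH = 1,
    SKETCH_OP_MEMMOVE = 3,
};

/* Hypothetical bit position of Block on Fault in the descriptor flags. */
#define SKETCH_FLAG_BOF 0x0002u

/* Batch and No-op descriptors do not accept the Block on Fault flag. */
static bool sketch_op_supports_bof(enum sketch_opcode op)
{
    switch (op) {
    case SKETCH_OP_NOOP:
    case SKETCH_OP_BATCH:
        return false;
    default:
        return true;
    }
}

/*
 * Compute descriptor flags following the two steps of the fix:
 * 1. set Block on Fault if the WQ has it enabled (WQCFG bit);
 * 2. explicitly clear it for operations that do not support it.
 */
static uint32_t sketch_desc_flags(uint32_t flags, enum sketch_opcode op,
                                  bool wq_bof_enabled)
{
    if (wq_bof_enabled)
        flags |= SKETCH_FLAG_BOF;
    if (!sketch_op_supports_bof(op))
        flags &= ~SKETCH_FLAG_BOF;
    return flags;
}
```

With this model, a Batch descriptor never carries the flag regardless of the WQ setting, while a memmove descriptor carries it only when the WQ has Block on Fault enabled.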
NOTE: This issue affects both bare-metal hosts and guest VMs.

The fixing PR: https://gitee.com/anolis/cloud-kernel/pulls/1305
already fixed.