Bug 12364 - Revert "softirq: Let ksoftirqd do its job" to avoid softirq starvation
Summary: Revert "softirq: Let ksoftirqd do its job" to avoid softirq starvation
Status: RESOLVED FIXED
Alias: None
Product: ANCK 5.10 Dev
Classification: ANCK
Component: net
Version: unspecified
Hardware: All Linux
Importance: P3-Medium S3-normal
Target Milestone: ---
Assignee: dtcccc
QA Contact: shuming
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-12-13 10:33 UTC by wenjianhn
Modified: 2024-12-19 10:39 UTC
CC List: 3 users

See Also:


Description wenjianhn 2024-12-13 10:33:57 UTC
Description of problem:

    Since commit 4cd13c21b207 ("softirq: Let ksoftirqd do its job"),
    pending softirqs are no longer always handled immediately. Instead,
    if there are pending softirqs and ksoftirqd is in state TASK_RUNNING,
    the handling of the softirqs is deferred, and they are supposed
    to be handled by ksoftirqd when ksoftirqd gets scheduled.
    
    If a user space process with a real-time policy, or a kernel function,
    misbehaves by never relinquishing the CPU while ksoftirqd is in state
    TASK_RUNNING, all softirqs get deferred, while ksoftirqd, which is
    supposed to handle the deferred softirqs, never gets to run.
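
The starvation described above can be illustrated with a toy model. This is only a sketch, not kernel code: the `Cpu` class, its fields, and `simulate()` are hypothetical names invented for illustration, and the model assumes ksoftirqd is already in TASK_RUNNING when the CPU hog starts. Each tick raises one softirq and then returns from an interrupt. With deferral enabled and a task that never yields, nothing is ever handled.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    """Toy model of one CPU (hypothetical, for illustration only)."""
    defer_to_ksoftirqd: bool          # True models behaviour after 4cd13c21b207
    ksoftirqd_runnable: bool = False  # ksoftirqd is TASK_RUNNING, waiting for the CPU
    pending: int = 0                  # softirqs raised but not yet handled
    handled: int = 0                  # softirqs that have been handled

    def raise_softirq(self) -> None:
        self.pending += 1

    def irq_exit(self) -> None:
        """Return from a hardware interrupt."""
        if self.pending == 0:
            return
        if self.defer_to_ksoftirqd and self.ksoftirqd_runnable:
            return  # deferred: left for ksoftirqd to pick up later
        self._run_softirqs()

    def schedule(self, hog_yields: bool) -> None:
        """ksoftirqd only gets the CPU if the current task gives it up."""
        if self.ksoftirqd_runnable and hog_yields:
            self._run_softirqs()
            self.ksoftirqd_runnable = False  # ksoftirqd goes back to sleep

    def _run_softirqs(self) -> None:
        self.handled += self.pending
        self.pending = 0


def simulate(defer: bool, ticks: int = 1000, hog_yields: bool = False):
    """Return (pending, handled) after `ticks` interrupts on one CPU."""
    # Assumption: ksoftirqd was woken before the hog took over the CPU.
    cpu = Cpu(defer_to_ksoftirqd=defer, ksoftirqd_runnable=True)
    for _ in range(ticks):
        cpu.raise_softirq()
        cpu.irq_exit()
        cpu.schedule(hog_yields)
    return cpu.pending, cpu.handled


if __name__ == "__main__":
    print("with deferral, hog never yields:", simulate(defer=True))
    print("deferral reverted, hog never yields:", simulate(defer=False))
    print("with deferral, hog yields:", simulate(defer=True, hog_yields=True))
```

In this model, the deferral is harmless whenever the running task yields, which is why the problem only shows up with a misbehaving real-time task or a kernel path that holds the CPU.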

Real-world problems I have seen so far:

    1. OS hang (related to rtnl and rcu_barrier()) due to RCU_SOFTIRQ starvation
    2. Timekeeping watchdog issue due to TIMER_SOFTIRQ starvation
    3. P99 latency issue due to NET_RX_SOFTIRQ starvation

Proposal:

    Please consider cherry-picking commit d15121be7485 ("Revert "softirq: Let ksoftirqd do its job""). RHEL 9 did so last year; see https://access.redhat.com/errata/RHSA-2023:7370.
Comment 1 小龙 admin 2024-12-17 14:39:45 UTC
The PR Link: https://gitee.com/anolis/cloud-kernel/pulls/4236
Comment 2 dtcccc alibaba_cloud_group 2024-12-19 10:39:00 UTC
done