Description of problem:

A memory leak occurs in cgroup wmark reclaim when MGLRU is enabled.

Call stack:

    kmalloc_trace+124
    set_mm_walk+100
    try_to_inc_max_seq.isra.0+82
    try_to_shrink_lruvec+553
    lru_gen_shrink_lruvec+71
    shrink_node_memcgs+386
    shrink_node+390
    shrink_zones.constprop.0+133
    do_try_to_free_pages+157
    try_to_free_mem_cgroup_pages+263
    wmark_work_func+153
    process_one_work+397
    worker_thread+631
    kthread+228
    ret_from_fork+48
    ret_from_fork_asm+27

The walk allocated by `walk = kzalloc(sizeof(*walk) ...` is never freed.

In direct reclaim, MGLRU allocates the walk in set_mm_walk() and frees it in clear_mm_walk(). In kswapd, however, MGLRU uses a statically allocated walk, so no free is needed.

cgroup wmark reclaim calls direct reclaim from kswapd, which leads to an inconsistency: set_mm_walk() hits `VM_WARN_ON_ONCE(current_is_kswapd())` but still calls kzalloc(), while clear_mm_walk() frees the walk if and only if the current task is not kswapd.

The leak consumes ~60GB of kmalloc-256 slab memory per day in our production cluster.

Version-Release number of selected component (if applicable):

How reproducible:

In a cgroup v1 environment, this can be reproduced with:

```
ubuntu2404 :: ~ » slabtop -o | grep kmalloc-256
  1792   1792 100%    0.25K     56       32       448K kmalloc-256
     0      0   0%    0.25K      0       32         0K dma-kmalloc-256
ubuntu2404 :: ~ » echo y > /sys/kernel/mm/lru_gen/enabled
ubuntu2404 :: ~ » mkdir -p /sys/fs/cgroup/memory/mytest
ubuntu2404 :: ~ » echo 200M > /sys/fs/cgroup/memory/mytest/memory.limit_in_bytes
ubuntu2404 :: ~ » echo 50 > /sys/fs/cgroup/memory/mytest/memory.wmark_ratio
ubuntu2404 :: ~ » echo $$ > /sys/fs/cgroup/memory/mytest/cgroup.procs
ubuntu2404 :: ~ » python3 -c 'import numpy as np;a = np.random.rand(1024, 1024, 20)'
ubuntu2404 :: ~ » slabtop -o | grep kmalloc-256
  4128   4128 100%    0.25K    129       32      1032K kmalloc-256
     0      0   0%    0.25K      0       32         0K dma-kmalloc-256
ubuntu2404 :: ~ » python3 -c 'import numpy as np;a = np.random.rand(1024, 1024, 20)'
ubuntu2404 :: ~ » slabtop -o | grep kmalloc-256
  6432   6432 100%    0.25K    201       32      1608K kmalloc-256
     0      0   0%    0.25K      0       32         0K dma-kmalloc-256
```

kmalloc-256 usage keeps growing and is never freed.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
The PR Link: https://gitee.com/anolis/cloud-kernel/pulls/6224
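
For illustration only, the mismatch described above can be modeled in userspace C. This is a simplified sketch, not the kernel code: `model_set_mm_walk()`, `model_clear_mm_walk()`, `leaked_after()`, and the `node_reclaim` flag are hypothetical names introduced here, and `calloc()`/`free()` stand in for kzalloc()/kfree(). The assumption is that kswapd normally takes the static-walk branch, while wmark reclaim runs with the kswapd flag set but falls into the allocating branch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stand-in for the kernel's walk state; 256 bytes to echo the kmalloc-256 cache. */
struct model_walk { char pad[256]; };

static struct model_walk static_walk;    /* statically allocated walk used by kswapd */
static struct model_walk *current_walk;  /* walk attached to the current task */
static long live_allocs;                 /* heap walks allocated and not yet freed */

/* node_reclaim is a hypothetical flag: true when kswapd takes its static-walk branch. */
static void model_set_mm_walk(bool is_kswapd, bool node_reclaim)
{
    if (is_kswapd && node_reclaim) {
        current_walk = &static_walk;
    } else if (!current_walk) {
        /* Per the report, the kernel warns here in kswapd context but allocates anyway. */
        current_walk = calloc(1, sizeof(*current_walk));
        if (current_walk)
            live_allocs++;
    }
}

static void model_clear_mm_walk(bool is_kswapd)
{
    struct model_walk *walk = current_walk;

    current_walk = NULL;
    /* The walk is freed only outside kswapd context, matching the report. */
    if (!is_kswapd && walk) {
        free(walk);
        live_allocs--;
    }
}

/* Run a number of reclaim passes and return how many heap walks were left behind. */
static long leaked_after(int passes, bool is_kswapd, bool node_reclaim)
{
    long before = live_allocs;

    for (int i = 0; i < passes; i++) {
        model_set_mm_walk(is_kswapd, node_reclaim);
        model_clear_mm_walk(is_kswapd);
    }
    return live_allocs - before;
}
```

In this model, `leaked_after(100, true, false)` (the wmark case) leaves 100 allocations outstanding, while plain direct reclaim (`false, false`) and ordinary kswapd reclaim (`true, true`) leave none. Any fix has to make the allocate and free conditions agree for the wmark path.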