This proposal is an interim countermeasure, designed with the possibility of
backporting in mind, for a problem where a Linux system can lock up forever
due to the behavior of the memory allocator.
About the current behavior of the memory allocator:
The memory allocator keeps looping rather than failing an allocation request
unless "the requested page order is larger than PAGE_ALLOC_COSTLY_ORDER",
"the __GFP_NORETRY flag is passed to the allocation request", or "the
TIF_MEMDIE flag was set on the current thread by the OOM killer". As a
result, the system can fall into a forever-stalling state without any kernel
messages, resulting in unexplained system hang-up troubles.
( https://lwn.net/Articles/627419/ )
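The behavior can be summarized with the following sketch. This is an
illustration of the rule only, not the actual mm/page_alloc.c code:

  #include <linux/gfp.h>
  #include <linux/sched.h>

  /*
   * Illustration only: "keep looping" is the default, and only the three
   * conditions described above allow the allocation request to fail.
   */
  static bool should_keep_looping(gfp_t gfp_mask, unsigned int order)
  {
          /* Costly (high-order) requests are allowed to fail. */
          if (order > PAGE_ALLOC_COSTLY_ORDER)
                  return false;
          /* The caller explicitly opted out of retrying. */
          if (gfp_mask & __GFP_NORETRY)
                  return false;
          /* The current thread was already chosen by the OOM killer. */
          if (test_thread_flag(TIF_MEMDIE))
                  return false;
          return true;
  }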
There are at least three cases where a thread falls into an infinite loop
inside the memory allocator.
The first case is the too_many_isolated() throttling loop inside
shrink_inactive_list(). This throttling is intended to avoid invoking the
OOM killer unnecessarily, but a certain type of memory pressure can make
too_many_isolated() return true forever, so that nobody can escape from
shrink_inactive_list(). If all threads trying to allocate memory are caught
in the too_many_isolated() loop, nobody can proceed.
( http://marc.info/?l=linux-kernel&m=140051046730378 and
http://marc.info/?l=linux-mm&m=141671817211121 ; a reproducer program for
this case is shared only with security@xxxxxxxxxx members and some
individuals. )
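For reference, the throttling loop in shrink_inactive_list() looks roughly
like the following (paraphrased from mm/vmscan.c; details differ between
kernel versions):

          while (unlikely(too_many_isolated(zone, file, sc))) {
                  congestion_wait(BLK_RW_ASYNC, HZ/10);

                  /* We are about to die and free our memory. Return now. */
                  if (fatal_signal_pending(current))
                          return SWAP_CLUSTER_MAX;
          }

If too_many_isolated() never stops returning true, the only way out of this
loop is a fatal signal, and a thread which never receives one loops here
forever.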
The second case is allocation requests without the __GFP_FS flag. Such
requests do not invoke the OOM killer; this is intended to avoid invoking
the OOM killer unnecessarily because there might be memory reclaimable by
allocation requests with the __GFP_FS flag. But it is possible that all
threads doing __GFP_FS allocation requests (including kswapd, which is
capable of reclaiming memory with the __GFP_FS flag) are blocked and nobody
can perform memory reclaim operations. As a result, the memory allocator
gives nobody a chance to invoke the OOM killer, and falls into an infinite
loop.
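In other words, the allocator's slowpath never reaches the OOM killer for
such requests. A simplified sketch of the gate (an illustration, not the
exact check in mm/page_alloc.c) is:

  /*
   * Illustration only: a request which cannot pass this gate keeps looping
   * in the hope that somebody else reclaims memory or invokes the OOM
   * killer on its behalf.
   */
  static bool can_invoke_oom_killer(gfp_t gfp_mask)
  {
          /* The OOM killer does not compensate for IO-less reclaim. */
          if (!(gfp_mask & __GFP_FS))
                  return false;
          /* Callers which can fail do not need the OOM killer. */
          if (gfp_mask & __GFP_NORETRY)
                  return false;
          return true;
  }

If every thread whose request could pass this gate is blocked, nobody who is
still running can invoke the OOM killer, and the loop continues forever.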
The third case is that the OOM victim is unable to release memory because
it is blocked by an invisible dependency after a __GFP_FS allocation request
invoked the OOM killer. This case can occur when the OOM victim is blocked
waiting for a lock while a thread doing an allocation request with that lock
held is waiting for the OOM victim to release its mm struct. For example, we
can reproduce this case on an XFS filesystem by doing !__GFP_FS allocation
requests with an inode's mutex held. We cannot expect that there is memory
reclaimable by __GFP_FS allocations because the OOM killer was already
invoked. And since there is already an OOM victim, the OOM killer is not
invoked again even if threads doing __GFP_FS allocations are running. As a
result, allocation requests by a thread which is blocking the OOM victim can
fall into an infinite loop regardless of whether the allocation request is
__GFP_FS or not. We call such a state an OOM deadlock.
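Reduced to its essentials, the dependency in the XFS example looks like the
following. This is an illustration only, not real XFS code; inode_mutex
stands for the inode's mutex mentioned above:

  #include <linux/gfp.h>
  #include <linux/mutex.h>

  static DEFINE_MUTEX(inode_mutex);

  /* Thread A: chosen as the OOM victim by a __GFP_FS allocation elsewhere. */
  static void thread_a(void)
  {
          mutex_lock(&inode_mutex);       /* blocks forever: thread B holds it */
          /* Never gets here, so it never exits and never releases its mm. */
          mutex_unlock(&inode_mutex);
  }

  /* Thread B: holds the mutex and allocates memory while holding it. */
  static void thread_b(void)
  {
          mutex_lock(&inode_mutex);
          /*
           * A GFP_NOFS (!__GFP_FS) allocation: it cannot invoke the OOM
           * killer, it does not fail, and it loops waiting for the OOM
           * victim (thread A) to release memory, which never happens.
           */
          (void)alloc_page(GFP_NOFS);
          mutex_unlock(&inode_mutex);     /* never reached */
  }

Thread A waits for the mutex held by thread B, while thread B's allocation
waits for thread A to release its memory; neither can make progress.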
There are programs which are protected from the OOM killer by setting
/proc/$pid/oom_score_adj to -1000. /usr/sbin/sshd (an example of such
programs) is helpful for restarting programs killed by the OOM killer
because /usr/sbin/sshd offers a means to log in to the system. However,
under the OOM deadlock state, /usr/sbin/sshd cannot offer a means to log in
because /usr/sbin/sshd will stall forever inside allocation requests
(e.g. page faults).
Those who set /proc/sys/vm/panic_on_oom to 0 do not expect the system to
fall into a forever-inoperable state when the OOM killer is invoked.
Instead, they expect the system to remain in an operable state via the OOM
killer when the OOM killer is invoked. But the current behavior makes it
impossible to log in to the system and impossible to recover via SysRq-f
(manually kill a process), either because "the workqueue has fallen into an
infinite loop inside the memory allocator" or because "SysRq-f chooses an
OOM victim which already got the TIF_MEMDIE flag and is stuck due to an
invisible dependency". As a result, they have to choose from SysRq-i
(manually kill all processes), SysRq-c (manually trigger a kernel panic) or
SysRq-b (manually reset the system). Asking them to choose one of these
SysRq commands is an unnecessarily large sacrifice. They also carry the
penalty that they need to go to the console in order to issue a SysRq
command, because the infinite loop inside the memory allocator prevents them
from logging in to the system via /usr/sbin/sshd . And since administrators
set /proc/sys/vm/panic_on_oom to 0 without understanding that such sacrifice
and penalty exist, they rush to the support center reporting that their
systems hit an unexplained hang-up problem. I do want to solve this
situation.
The above description is about the third case. But for the first and second
cases, people also carry the penalty that their systems fall into a
forever-inoperable state until they go to the console and trigger SysRq-f
manually. The first and second cases can happen regardless of the
/proc/sys/vm/panic_on_oom setting because the OOM killer is not involved,
but administrators use that setting without understanding that such cases
exist. And even if they rush to the support center with a vmcore captured
via SysRq-c, we cannot analyze how long the threads spent looping inside the
memory allocator because the current implementation gives no hint.
About proposals for mitigating this problem:
There have been several proposals which try to reduce the possibility of
OOM deadlock without the use of a timeout. Two of them are explained here.
One proposal is to allow small allocation requests to fail in order to avoid
lockups caused by looping forever inside the memory allocator.
( https://lwn.net/Articles/636017/ and https://lwn.net/Articles/636797/ )
But if such allocation requests start failing under memory pressure, a lot of
memory allocation failure paths which have almost never been tested will be
used, and various obscure bugs (e.g.
http://marc.info/?l=dri-devel&m=142189369426813 ) will show up. Thus, it is
too risky to backport. Also, as long as there are __GFP_NOFAIL allocations
(either explicit or via open-coded retry loops), this approach cannot
completely avoid the OOM deadlock.
If we allow small memory allocations to fail rather than loop inside the
memory allocator, allocation requests caused by page faults start failing.
As a side effect, when an allocation request issued by a page fault fails,
either "the OOM killer is invoked and some mm struct is chosen by the OOM
killer" or "that thread is killed by a SIGBUS signal sent from the kernel"
will occur.
If important processes which are protected from the OOM killer by setting
/proc/$pid/oom_score_adj to -1000 are killed by the SIGBUS signal rather
than OOM-killable processes being killed via the OOM killer,
/proc/$pid/oom_score_adj becomes useless. Also, we can observe a kernel
panic triggered by the global init process being killed by the SIGBUS
signal.
( http://marc.info/?l=linux-kernel&m=142676304911566 )
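The SIGBUS side effect comes from the page fault path: when the fault
handler cannot allocate memory, the architecture code either falls back to
the OOM killer or kills the faulting thread. A simplified sketch of that
decision (modeled loosely on the x86 fault handling of that era, not copied
from it) is:

  /*
   * Illustration only: how a failed allocation during a page fault turns
   * into either an OOM kill or a SIGBUS to the faulting thread.
   */
  static void handle_fault_error(unsigned int fault_flags)
  {
          if (fault_flags & VM_FAULT_OOM) {
                  /* Let the OOM killer choose some victim (maybe not us). */
                  pagefault_out_of_memory();
                  return;
          }
          if (fault_flags & VM_FAULT_SIGBUS) {
                  /* The faulting thread itself is killed by SIGBUS. */
                  force_sig(SIGBUS, current);
          }
  }

If the faulting thread which receives SIGBUS is the global init process, the
kernel panics because init must not die.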
Regarding !__GFP_FS allocation requests caused by page faults, there will
be no difference (except for the SIGBUS case explained above) between
"directly invoking the OOM killer while looping inside the memory allocator"
and "indirectly invoking the OOM killer after failing the allocation
request".
However, the penalty carried by failing !__GFP_FS allocation requests not
caused by page faults is large. For example, we experienced in Linux 3.19
that the ext4 filesystem started to trigger filesystem error actions
(remount as read-only, which prevents programs from working correctly, or
kernel panic, which stops the whole system) when memory was extremely tight,
because we unexpectedly allowed !__GFP_FS allocations to fail without
retrying. ( http://marc.info/?l=linux-ext4&m=142443125221571 ) And we
restored the original behavior for now.
It is observed that this proposal (which allows memory allocations to fail)
likely carries a larger penalty than trying to keep the system in an
operable state by invoking the OOM killer. Allowing small allocations to
fail is not as easy as people think.
Another proposal is to reserve some amount of memory for allocation
requests which can invoke the OOM killer, by manipulating the zone
watermarks. ( https://lwn.net/Articles/642057/ ) But this proposal will not
help if the threads which are blocking the OOM victim are doing allocation
requests which cannot invoke the OOM killer, or if threads which are not
blocking the OOM victim consume the reserve by doing allocation requests
which can invoke the OOM killer. Also, manipulating the zone watermarks
could have a performance impact because direct reclaim becomes more likely
to be invoked.
Since the dependency which must be resolved in order to avoid the OOM
deadlock is not visible to the memory allocator, we cannot avoid heuristic
approaches for detecting the OOM deadlock state. What has already been
proposed many times, and is proposed here again, is to invoke the OOM killer
based on a timeout.
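As a rough illustration of the direction (the identifiers below, such as
memalloc_timeout_jiffies, are hypothetical and do not correspond to an
existing kernel API), the idea is to record when an allocation request
entered the slowpath and to allow it to invoke the OOM killer once a
configurable timeout has expired:

  #include <linux/jiffies.h>

  /*
   * Illustration only; the real change would hook into the slowpath of
   * mm/page_alloc.c and the timeout would likely be a sysctl knob.
   */
  static bool should_force_oom(unsigned long slowpath_start)
  {
          /* Hypothetical timeout, e.g. 10 seconds expressed in jiffies. */
          const unsigned long memalloc_timeout_jiffies = 10 * HZ;

          /*
           * If a request which normally loops forever has been looping
           * longer than the timeout, assume an OOM deadlock and let the
           * caller invoke (or re-invoke) the OOM killer.
           */
          return time_after(jiffies, slowpath_start + memalloc_timeout_jiffies);
  }

The timeout value itself is a heuristic; the point is that some upper bound
exists so that the system no longer stalls silently forever and the time
spent looping becomes visible for analysis.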