cpu3 stuck in printk more time in spin lock low_water_lock cause cpu0
get spin lock fail and system crashed.
CRs-Fixed: 969097
Change-Id: I75356a4b4171ae2888ce6cce792f569b5ca8cdcf
Signed-off-by: Tingting Yang <tingting@codeaurora.org>
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
Signed-off-by: Minming Qi <mqi@codeaurora.org>
* refs/heads/tmp-aef17a58:
Linux 4.9.101
kernel/exit.c: avoid undefined behaviour when calling wait4()
futex: futex_wake_op, fix sign_extend32 sign bits
proc: do not access cmdline nor environ from file-backed areas
nfp: TX time stamp packets before HW doorbell is rung
l2tp: revert "l2tp: fix missing print session offset info"
Revert "ARM: dts: imx6qdl-wandboard: Fix audio channel swap"
lockd: lost rollback of set_grace_period() in lockd_down_net()
xfrm: fix xfrm_do_migrate() with AEAD e.g(AES-GCM)
futex: Remove duplicated code and fix undefined behaviour
serial: sccnxp: Fix error handling in sccnxp_probe()
sctp: delay the authentication for the duplicated cookie-echo chunk
sctp: fix the issue that the cookie-ack with auth can't get processed
tcp: ignore Fast Open on repair mode
bonding: send learning packets for vlans on slave
net/mlx5: Avoid cleaning flow steering table twice during error flow
bonding: do not allow rlb updates to invalid mac
tg3: Fix vunmap() BUG_ON() triggered from tg3_free_consistent().
tcp_bbr: fix to zero idle_restart only upon S/ACKed data
sctp: use the old asoc when making the cookie-ack chunk in dupcook_d
sctp: remove sctp_chunk_put from fail_mark err path in sctp_ulpevent_make_rcvmsg
sctp: handle two v4 addrs comparison in sctp_inet6_cmp_addr
r8169: fix powering up RTL8168h
qmi_wwan: do not steal interfaces from class drivers
openvswitch: Don't swap table in nlattr_set() after OVS_ATTR_NESTED is found
net: support compat 64-bit time in {s,g}etsockopt
net_sched: fq: take care of throttled flows before reuse
net/mlx5: E-Switch, Include VF RDMA stats in vport statistics
net/mlx4_en: Verify coalescing parameters are in range
net: ethernet: ti: cpsw: fix packet leaking in dual_mac mode
net: ethernet: sun: niu set correct packet size in skb
llc: better deal with too small mtu
ipv4: fix memory leaks in udp_sendmsg, ping_v4_sendmsg
dccp: fix tasklet usage
bridge: check iface upper dev when setting master via ioctl
8139too: Use disable_irq_nosync() in rtl8139_poll_controller()
ANDROID: sdcardfs: Don't d_drop in d_revalidate
FROMLIST: brcmfmac: fix initialization of struct cfg80211_inform_bss variable
FROMLIST: brcmfmac: reports boottime_ns while informing bss
Change-Id: Idfe62af1b38254bed44364aa6ef001c38a5ad285
Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>
Changes in 4.9.101
8139too: Use disable_irq_nosync() in rtl8139_poll_controller()
bridge: check iface upper dev when setting master via ioctl
dccp: fix tasklet usage
ipv4: fix memory leaks in udp_sendmsg, ping_v4_sendmsg
llc: better deal with too small mtu
net: ethernet: sun: niu set correct packet size in skb
net: ethernet: ti: cpsw: fix packet leaking in dual_mac mode
net/mlx4_en: Verify coalescing parameters are in range
net/mlx5: E-Switch, Include VF RDMA stats in vport statistics
net_sched: fq: take care of throttled flows before reuse
net: support compat 64-bit time in {s,g}etsockopt
openvswitch: Don't swap table in nlattr_set() after OVS_ATTR_NESTED is found
qmi_wwan: do not steal interfaces from class drivers
r8169: fix powering up RTL8168h
sctp: handle two v4 addrs comparison in sctp_inet6_cmp_addr
sctp: remove sctp_chunk_put from fail_mark err path in sctp_ulpevent_make_rcvmsg
sctp: use the old asoc when making the cookie-ack chunk in dupcook_d
tcp_bbr: fix to zero idle_restart only upon S/ACKed data
tg3: Fix vunmap() BUG_ON() triggered from tg3_free_consistent().
bonding: do not allow rlb updates to invalid mac
net/mlx5: Avoid cleaning flow steering table twice during error flow
bonding: send learning packets for vlans on slave
tcp: ignore Fast Open on repair mode
sctp: fix the issue that the cookie-ack with auth can't get processed
sctp: delay the authentication for the duplicated cookie-echo chunk
serial: sccnxp: Fix error handling in sccnxp_probe()
futex: Remove duplicated code and fix undefined behaviour
xfrm: fix xfrm_do_migrate() with AEAD e.g(AES-GCM)
lockd: lost rollback of set_grace_period() in lockd_down_net()
Revert "ARM: dts: imx6qdl-wandboard: Fix audio channel swap"
l2tp: revert "l2tp: fix missing print session offset info"
nfp: TX time stamp packets before HW doorbell is rung
proc: do not access cmdline nor environ from file-backed areas
futex: futex_wake_op, fix sign_extend32 sign bits
kernel/exit.c: avoid undefined behaviour when calling wait4()
Linux 4.9.101
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
* refs/heads/tmp-05baf14:
Linux 4.9.93
spi: davinci: fix up dma_mapping_error() incorrect patch
Revert "ip6_vti: adjust vti mtu according to mtu of lower device"
Revert "mtip32xx: use runtime tag to initialize command header"
Revert "spi: bcm-qspi: shut up warning about cfi header inclusion"
Revert "ARM: dts: omap3-n900: Fix the audio CODEC's reset pin"
Revert "ARM: dts: am335x-pepper: Fix the audio CODEC's reset pin"
Fix slab name "biovec-(1<<(21-12))"
net: hns: Fix ethtool private flags
md/raid10: reset the 'first' at the end of loop
ARM: dts: am57xx-idk-common: Add overide powerhold property
ARM: dts: am57xx-beagle-x15-common: Add overide powerhold property
ARM: dts: dra7: Add power hold and power controller properties to palmas
Documentation: pinctrl: palmas: Add ti,palmas-powerhold-override property definition
vt: change SGR 21 to follow the standards
Input: i8042 - enable MUX on Sony VAIO VGN-CS series to fix touchpad
Input: i8042 - add Lenovo ThinkPad L460 to i8042 reset list
Input: ALPS - fix TrackStick detection on Thinkpad L570 and Latitude 7370
staging: comedi: ni_mio_common: ack ai fifo error interrupts.
crypto: x86/cast5-avx - fix ECB encryption when long sg follows short one
crypto: ahash - Fix early termination in hash walk
parport_pc: Add support for WCH CH382L PCI-E single parallel port card.
media: usbtv: prevent double free in error case
mei: remove dev_err message on an unsupported ioctl
USB: serial: cp210x: add ELDAT Easywave RX09 id
USB: serial: ftdi_sio: add support for Harman FirmwareHubEmulator
USB: serial: ftdi_sio: add RT Systems VX-8 cable
arm64: idmap: Use "awx" flags for .idmap.text .pushsection directives
arm64: entry: Reword comment about post_ttbr_update_workaround
arm64: Force KPTI to be disabled on Cavium ThunderX
arm64: kpti: Add ->enable callback to remap swapper using nG mappings
arm64: kpti: Make use of nG dependent on arm64_kernel_unmapped_at_el0()
arm64: Turn on KPTI only on CPUs that need it
arm64: cputype: Add MIDR values for Cavium ThunderX2 CPUs
arm64: capabilities: Handle duplicate entries for a capability
arm64: Allow checking of a CPU-local erratum
arm64: Take into account ID_AA64PFR0_EL1.CSV3
arm64: Kconfig: Reword UNMAP_KERNEL_AT_EL0 kconfig entry
arm64: Kconfig: Add CONFIG_UNMAP_KERNEL_AT_EL0
arm64: use RET instruction for exiting the trampoline
arm64: kaslr: Put kernel vectors address in separate data page
arm64: entry: Add fake CPU feature for unmapping the kernel at EL0
arm64: tls: Avoid unconditional zeroing of tpidrro_el0 for native tasks
arm64: entry: Hook up entry trampoline to exception vectors
arm64: entry: Explicitly pass exception level to kernel_ventry macro
arm64: mm: Map entry trampoline into trampoline and kernel page tables
arm64: entry: Add exception trampoline page for exceptions from EL0
module: extend 'rodata=off' boot cmdline parameter to module mappings
arm64: factor out entry stack manipulation
arm64: mm: Invalidate both kernel and user ASIDs when performing TLBI
arm64: mm: Add arm64_kernel_unmapped_at_el0 helper
arm64: mm: Allocate ASIDs in pairs
arm64: mm: Move ASID from TTBR0 to TTBR1
arm64: mm: Use non-global mappings for kernel space
usb: dwc2: Improve gadget state disconnection handling
scsi: virtio_scsi: always read VPD pages for multiqueue too
llist: clang: introduce member_address_is_nonnull()
Bluetooth: Fix missing encryption refresh on Security Request
netfilter: x_tables: add and use xt_check_proc_name
netfilter: bridge: ebt_among: add more missing match size checks
xfrm: Refuse to insert 32 bit userspace socket policies on 64 bit systems
net: xfrm: use preempt-safe this_cpu_read() in ipcomp_alloc_tfms()
RDMA/ucma: Introduce safer rdma_addr_size() variants
RDMA/ucma: Check that device exists prior to accessing it
RDMA/ucma: Check that device is connected prior to access it
RDMA/ucma: Ensure that CM_ID exists prior to access it
RDMA/ucma: Fix use-after-free access in ucma_close
RDMA/ucma: Check AF family prior resolving address
xfrm_user: uncoditionally validate esn replay attribute struct
mm/vmscan.c: fix unsequenced modification and access warning
selinux: Remove redundant check for unknown labeling behavior
arm64: avoid overflow in VA_START and PAGE_OFFSET
btrfs: Remove extra parentheses from condition in copy_items()
mac80211: ibss: Fix channel type enum in ieee80211_sta_join_ibss()
mac80211: Fix clang warning about constant operand in logical operation
netfilter: ctnetlink: Make some parameters integer to avoid enum mismatch
HID: sony: Use LED_CORE_SUSPENDRESUME
cfg80211: Fix array-bounds warning in fragment copy
nl80211: Fix enum type of variable in nl80211_put_sta_rate()
xgene_enet: remove bogus forward declarations
usb: gadget: remove redundant self assignment
frv: declare jiffies to be located in the .data section
jiffies.h: declare jiffies and jiffies_64 with ____cacheline_aligned_in_smp
fs: compat: Remove warning from COMPATIBLE_IOCTL
selinux: Remove unnecessary check of array base in selinux_set_mapping()
cpumask: Add helper cpumask_available()
genirq: Use cpumask_available() for check of cpumask variable
netfilter: nf_nat_h323: fix logical-not-parentheses warning
Input: mousedev - fix implicit conversion warning
dm ioctl: remove double parentheses
PCI: Make PCI_ROM_ADDRESS_MASK a 32-bit constant
kprobes/x86: Fix to set RWX bits correctly before releasing trampoline
partitions/msdos: Unable to mount UFS 44bsd partitions
powerpc/64s: Fix i-side SLB miss bad address handler saving nonvolatile GPRs
powerpc/64s: Fix lost pending interrupt due to race causing lost update to irq_happened
ipc/shm.c: add split function to shm_vm_ops
ceph: only dirty ITER_IOVEC pages for direct read
perf/hwbp: Simplify the perf-hwbp code, fix documentation
ALSA: pcm: potential uninitialized return values
ALSA: pcm: Use dma_bytes as size parameter in dma_mmap_coherent()
ALSA: usb-audio: Add native DSD support for TEAC UD-301
mtd: jedec_probe: Fix crash in jedec_read_mfr()
ARM: 8746/1: vfp: Go back to clearing vfp_current_hw_state[]
ANDROID: fuse: Add null terminator to path in canonical path to avoid issue
ANDROID: sdcardfs: Fix sdcardfs to stop creating cases-sensitive duplicate entries.
ANDROID: cpufreq: times: skip printing invalid frequencies
ANDROID: cpufreq: Add time_in_state to /proc/uid directories
ANDROID: proc: Add /proc/uid directory
ANDROID: cpufreq: times: track per-uid time in state
ANDROID: cpufreq: track per-task time in state
arm64: fix show_data fallout from KERN_CONT changes
arm: fix show_data fallout from KERN_CONT changes
Conflicts:
arch/arm64/include/asm/assembler.h
arch/arm64/include/asm/cputype.h
arch/arm64/include/asm/sysreg.h
arch/arm64/kernel/cpufeature.c
kernel/sched/core.c
Change-Id: If39e1c5577a1c9345b1b2739f4a5368422cef135
Signed-off-by: Blagovest Kolenichev <bkolenichev@codeaurora.org>
If a recursive fault is detected during do_exit(), tasks are left
to sit and wait in an un-interruptible sleep until the system
reboots (typically manually). Add Kconfig option to change this
behaviour and force a panic.
This is particularly important if a critical system task encounters
a recursive fault (ex. a kworker). Otherwise, the system may be
unusable, but since the scheduler is still running system watchdogs
may continue to be pet.
Change-Id: Ifc26fc79d6066f05a3b2c4d27f78bf4f8d2bd640
Signed-off-by: Matt Wagantall <mattw@codeaurora.org>
Add time in state data to task structs, and create
/proc/<pid>/time_in_state files to show how long each individual task
has run at each frequency.
Create a CONFIG_CPU_FREQ_TIMES option to enable/disable this tracking.
Signed-off-by: Connor O'Brien <connoro@google.com>
Bug: 72339335
Bug: 70951257
Test: Read /proc/<pid>/time_in_state
Change-Id: Ia6456754f4cb1e83b2bc35efa8fbe9f8696febc8
A killed task can stay in the task list long after its
memory has been returned to the system, therefore
ignore any tasks whose mm struct has been freed.
Change-Id: I76394b203b4ab2312437c839976f0ecb7b6dde4e
Signed-off-by: Liam Mark <lmark@codeaurora.org>
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
The scheduler needs to do some book-keeping when a task exits.
Call the scheduler hook sched_exit() that takes care of all of
that necessary book-keeping.
Change-Id: I551aead70248b06f9d918e6d032075d6ecaa7fed
Signed-off-by: Syed Rameez Mustafa <rameezmustafa@codeaurora.org>
Contains:
sched/tune: fix accounting for runnable tasks (1/5)
The accounting for tasks into boost groups of different CPUs is currently
broken mainly because:
a) we do not properly track the change of boost group of a RUNNABLE task
b) there are race conditions between migration code and accounting code
This patch provides a fixes to ensure enqueue/dequeue
accounting also for throttled tasks.
Without this patch is can happen that a task is enqueued into a throttled
RQ thus not being accounted for the boosting of the corresponding RQ.
We could argue that a throttled task should not boost a CPU, however:
a) properly implementing CPU boosting considering throttled tasks will
increase a lot the complexity of the solution
b) it's not easy to quantify the benefits introduced by such a more
complex solution
Since task throttling requires the usage of the CFS bandwidth controller,
which is not widely used on mobile systems (at least not by Android kernels
so far), for the time being we go for the simple solution and boost also
for throttled RQs.
sched/tune: fix accounting for runnable tasks (2/5)
This patch provides the code required to enforce proper locking.
A per boost group spinlock has been added to grant atomic
accounting of tasks as well as to serialise enqueue/dequeue operations,
triggered by tasks migrations, with cgroups's attach/detach operations.
sched/tune: fix accounting for runnable tasks (3/5)
This patch adds cgroups {allow,can,cancel}_attach callbacks.
Since a task can be migrated between boost groups while it's running,
the CGroups's attach callbacks have been added to properly migrate
boost contributions of RUNNABLE tasks.
The RQ's lock is used to serialise enqueue/dequeue operations, triggered
by tasks migrations, with cgroups's attach/detach operations. While the
SchedTune's CPU lock is used to grant atrocity of the accounting within
the CPU.
NOTE: the current implementation does not allows a concurrent CPU migration
and CGroups change.
sched/tune: fix accounting for runnable tasks (4/5)
This fixes accounting for exiting tasks by adding a dedicated call early
in the do_exit() syscall, which disables SchedTune accounting as soon as a
task is flagged PF_EXITING.
This flag is set before the multiple dequeue/enqueue dance triggered
by cgroup_exit() which is useful only to inject useless tasks movements
thus increasing possibilities for race conditions with the migration code.
The schedtune_exit_task() call does the last dequeue of a task from its
current boost group. This is a solution more aligned with what happens in
mainline kernels (>v4.4) where the exit_cgroup does not move anymore a dying
task to the root control group.
sched/tune: fix accounting for runnable tasks (5/5)
To avoid accounting issues at startup, this patch disable the SchedTune
accounting until the required data structures have been properly
initialized.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
[jstultz: fwdported to 4.4]
Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Andres Oportus <andresoportus@google.com>
KASAN allocates memory from the page allocator as part of
kmem_cache_free(), and that can reference current->mempolicy through any
number of allocation functions. It needs to be NULL'd out before the
final reference is dropped to prevent a use-after-free bug:
BUG: KASAN: use-after-free in alloc_pages_current+0x363/0x370 at addr ffff88010b48102c
CPU: 0 PID: 15425 Comm: trinity-c2 Not tainted 4.8.0-rc2+ #140
...
Call Trace:
dump_stack
kasan_object_err
kasan_report_error
__asan_report_load2_noabort
alloc_pages_current <-- use after free
depot_save_stack
save_stack
kasan_slab_free
kmem_cache_free
__mpol_put <-- free
do_exit
This patch sets current->mempolicy to NULL before dropping the final
reference.
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1608301442180.63329@chino.kir.corp.google.com
Fixes: cd11016e5f ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB")
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <stable@vger.kernel.org> [4.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull scheduler updates from Ingo Molnar:
- introduce and use task_rcu_dereference()/try_get_task_struct() to fix
and generalize task_struct handling (Oleg Nesterov)
- do various per entity load tracking (PELT) fixes and optimizations
(Peter Zijlstra)
- cputime virt-steal time accounting enhancements/fixes (Wanpeng Li)
- introduce consolidated cputime output file cpuacct.usage_all and
related refactorings (Zhao Lei)
- ... plus misc fixes and enhancements
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Panic on scheduling while atomic bugs if kernel.panic_on_warn is set
sched/cpuacct: Introduce cpuacct.usage_all to show all CPU stats together
sched/cpuacct: Use loop to consolidate code in cpuacct_stats_show()
sched/cpuacct: Merge cpuacct_usage_index and cpuacct_stat_index enums
sched/fair: Rework throttle_count sync
sched/core: Fix sched_getaffinity() return value kerneldoc comment
sched/fair: Reorder cgroup creation code
sched/fair: Apply more PELT fixes
sched/fair: Fix PELT integrity for new tasks
sched/cgroup: Fix cpu_cgroup_fork() handling
sched/fair: Fix PELT integrity for new groups
sched/fair: Fix and optimize the fork() path
sched/cputime: Add steal time support to full dynticks CPU time accounting
sched/cputime: Fix prev steal time accouting during CPU hotplug
KVM: Fix steal clock warp during guest CPU hotplug
sched/debug: Always show 'nr_migrations'
sched/fair: Use task_rcu_dereference()
sched/api: Introduce task_rcu_dereference() and try_get_task_struct()
sched/idle: Optimize the generic idle loop
sched/fair: Fix the wrong throttled clock time for cfs_rq_clock_task()
Generally task_struct is only protected by RCU if it was found on a
RCU protected list (say, for_each_process() or find_task_by_vpid()).
As Kirill pointed out rq->curr isn't protected by RCU, the scheduler
drops the (potentially) last reference without RCU gp, this means
that we need to fix the code which uses foreign_rq->curr under
rcu_read_lock().
Add a new helper which can be used to dereference rq->curr or any
other pointer to task_struct assuming that it should be cleared or
updated before the final put_task_struct(). It returns non-NULL
only if this task can't go away before rcu_read_unlock().
( Also add try_get_task_struct() to make it easier to use this API
correctly. )
Suggested-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[ Updated comments; added try_get_task_struct()]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Link: http://lkml.kernel.org/r/20160518170218.GY3192@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following program (simplified version of generated by syzkaller)
#include <pthread.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <stdio.h>
#include <signal.h>
void *thread_func(void *arg)
{
ptrace(PTRACE_TRACEME, 0,0,0);
return 0;
}
int main(void)
{
pthread_t thread;
if (fork())
return 0;
while (getppid() != 1)
;
pthread_create(&thread, NULL, thread_func, NULL);
pthread_join(thread, NULL);
return 0;
}
creates an unreapable zombie if /sbin/init doesn't use __WALL.
This is not a kernel bug, at least in a sense that everything works as
expected: debugger should reap a traced sub-thread before it can reap the
leader, but without __WALL/__WCLONE do_wait() ignores sub-threads.
Unfortunately, it seems that /sbin/init in most (all?) distributions
doesn't use it and we have to change the kernel to avoid the problem.
Note also that most init's use sys_waitid() which doesn't allow __WALL, so
the necessary user-space fix is not that trivial.
This patch just adds the "ptrace" check into eligible_child(). To some
degree this matches the "tsk->ptrace" in exit_notify(), ->exit_signal is
mostly ignored when the tracee reports to debugger. Or WSTOPPED, the
tracer doesn't need to set this flag to wait for the stopped tracee.
This obviously means the user-visible change: __WCLONE and __WALL no
longer have any meaning for debugger. And I can only hope that this won't
break something, but at least strace/gdb won't suffer.
We could make a more conservative change. Say, we can take __WCLONE into
account, or !thread_group_leader(). But it would be nice to not
complicate these historical/confusing checks.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: <syzkaller@googlegroups.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need to call exit_thread from copy_process in a fail path. So make it
accept task_struct as a parameter.
[v2]
* s390: exit_thread_runtime_instr doesn't make sense to be called for
non-current tasks.
* arm: fix the comment in vfp_thread_copy
* change 'me' to 'tsk' for task_struct
* now we can change only archs that actually have exit_thread
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Liqin <liqin.linux@gmail.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Howells <dhowells@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Ley Foon Tan <lftan@altera.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Steven Miao <realmz6@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When oom_reaper manages to unmap all the eligible vmas there shouldn't
be much of the freable memory held by the oom victim left anymore so it
makes sense to clear the TIF_MEMDIE flag for the victim and allow the
OOM killer to select another task.
The lack of TIF_MEMDIE also means that the victim cannot access memory
reserves anymore but that shouldn't be a problem because it would get
the access again if it needs to allocate and hits the OOM killer again
due to the fatal_signal_pending resp. PF_EXITING check. We can safely
hide the task from the OOM killer because it is clearly not a good
candidate anymore as everyhing reclaimable has been torn down already.
This patch will allow to cap the time an OOM victim can keep TIF_MEMDIE
and thus hold off further global OOM killer actions granted the oom
reaper is able to take mmap_sem for the associated mm struct. This is
not guaranteed now but further steps should make sure that mmap_sem for
write should be blocked killable which will help to reduce such a lock
contention. This is not done by this patch.
Note that exit_oom_victim might be called on a remote task from
__oom_reap_task now so we have to check and clear the flag atomically
otherwise we might race and underflow oom_victims or wake up waiters too
early.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Argangeli <andrea@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kcov provides code coverage collection for coverage-guided fuzzing
(randomized testing). Coverage-guided fuzzing is a testing technique
that uses coverage feedback to determine new interesting inputs to a
system. A notable user-space example is AFL
(http://lcamtuf.coredump.cx/afl/). However, this technique is not
widely used for kernel testing due to missing compiler and kernel
support.
kcov does not aim to collect as much coverage as possible. It aims to
collect more or less stable coverage that is function of syscall inputs.
To achieve this goal it does not collect coverage in soft/hard
interrupts and instrumentation of some inherently non-deterministic or
non-interesting parts of kernel is disbled (e.g. scheduler, locking).
Currently there is a single coverage collection mode (tracing), but the
API anticipates additional collection modes. Initially I also
implemented a second mode which exposes coverage in a fixed-size hash
table of counters (what Quentin used in his original patch). I've
dropped the second mode for simplicity.
This patch adds the necessary support on kernel side. The complimentary
compiler support was added in gcc revision 231296.
We've used this support to build syzkaller system call fuzzer, which has
found 90 kernel bugs in just 2 months:
https://github.com/google/syzkaller/wiki/Found-Bugs
We've also found 30+ bugs in our internal systems with syzkaller.
Another (yet unexplored) direction where kcov coverage would greatly
help is more traditional "blob mutation". For example, mounting a
random blob as a filesystem, or receiving a random blob over wire.
Why not gcov. Typical fuzzing loop looks as follows: (1) reset
coverage, (2) execute a bit of code, (3) collect coverage, repeat. A
typical coverage can be just a dozen of basic blocks (e.g. an invalid
input). In such context gcov becomes prohibitively expensive as
reset/collect coverage steps depend on total number of basic
blocks/edges in program (in case of kernel it is about 2M). Cost of
kcov depends only on number of executed basic blocks/edges. On top of
that, kernel requires per-thread coverage because there are always
background threads and unrelated processes that also produce coverage.
With inlined gcov instrumentation per-thread coverage is not possible.
kcov exposes kernel PCs and control flow to user-space which is
insecure. But debugfs should not be mapped as user accessible.
Based on a patch by Quentin Casasnovas.
[akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
[akpm@linux-foundation.org: unbreak allmodconfig]
[akpm@linux-foundation.org: follow x86 Makefile layout standards]
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Tavis Ormandy <taviso@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Kees Cook <keescook@google.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: David Drysdale <drysdale@google.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
task_stopped_code()->task_is_stopped_or_traced() doesn't look right, the
traced task must never be TASK_STOPPED.
We can not add WARN_ON(task_is_stopped(p)), but this is only because
do_wait() can race with PTRACE_ATTACH from another thread.
[akpm@linux-foundation.org: teeny cleanup]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Roland McGrath <roland@hack.frob.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Pedro Alves <palves@redhat.com>
Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull scheduler changes from Ingo Molnar:
"The main changes in this cycle were:
- sched/fair load tracking fixes and cleanups (Byungchul Park)
- Make load tracking frequency scale invariant (Dietmar Eggemann)
- sched/deadline updates (Juri Lelli)
- stop machine fixes, cleanups and enhancements for bugs triggered by
CPU hotplug stress testing (Oleg Nesterov)
- scheduler preemption code rework: remove PREEMPT_ACTIVE and related
cleanups (Peter Zijlstra)
- Rework the sched_info::run_delay code to fix races (Peter Zijlstra)
- Optimize per entity utilization tracking (Peter Zijlstra)
- ... misc other fixes, cleanups and smaller updates"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
sched: Don't scan all-offline ->cpus_allowed twice if !CONFIG_CPUSETS
sched: Move cpu_active() tests from stop_two_cpus() into migrate_swap_stop()
sched: Start stopper early
stop_machine: Kill cpu_stop_threads->setup() and cpu_stop_unpark()
stop_machine: Kill smp_hotplug_thread->pre_unpark, introduce stop_machine_unpark()
stop_machine: Change cpu_stop_queue_two_works() to rely on stopper->enabled
stop_machine: Introduce __cpu_stop_queue_work() and cpu_stop_queue_two_works()
stop_machine: Ensure that a queued callback will be called before cpu_stop_park()
sched/x86: Fix typo in __switch_to() comments
sched/core: Remove a parameter in the migrate_task_rq() function
sched/core: Drop unlikely behind BUG_ON()
sched/core: Fix task and run queue sched_info::run_delay inconsistencies
sched/numa: Fix task_tick_fair() from disabling numa_balancing
sched/core: Add preempt_count invariant check
sched/core: More notrace annotations
sched/core: Kill PREEMPT_ACTIVE
sched/core, sched/x86: Kill thread_info::saved_preempt_count
sched/core: Simplify preempt_count tests
sched/core: Robustify preemption leak checks
sched/core: Stop setting PREEMPT_ACTIVE
...
Currently, __srcu_read_lock() cannot be invoked from restricted
environments because it contains calls to preempt_disable() and
preempt_enable(), both of which can invoke lockdep, which is a bad
idea in some restricted execution modes. This commit therefore moves
the preempt_disable() and preempt_enable() from __srcu_read_lock()
to srcu_read_lock(). It also inserts the preempt_disable() and
preempt_enable() around the call to __srcu_read_lock() in do_exit().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
When we warn about a preempt_count leak; reset the preempt_count to
the known good value such that the problem does not ripple forward.
This is most important on x86 which has a per cpu preempt_count that is
not saved/restored (after this series). So if you schedule with an
invalid (!2*PREEMPT_DISABLE_OFFSET) preempt_count the next task is
messed up too.
Enforcing this invariant limits the borkage to just the one task.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
s,critiera,criteria,
While at it, add a comma, because it makes sense grammatically.
Signed-off-by: Frans Klaver <fransklaver@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.com>
There is a helpful comment in do_exit() that states we sync the mm's RSS
info before statistics gathering.
The function that does the statistics gathering is called right above that
comment.
Change the code to obey the comment.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename unmark_oom_victim() to exit_oom_victim(). Marking and unmarking
are related in functionality, but the interface is not symmetrical at
all: one is an internal OOM killer function used during the killing, the
other is for an OOM victim to signal its own death on exit later on.
This has locking implications, see follow-up changes.
While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which
is easier on the eye.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All users of exec_domain are gone, now we can get rid
of that abandoned feature.
To not break existing userspace we keep a dummy
/proc/execdomains file which will always contain
"0-0 Linux [kernel]".
Signed-off-by: Richard Weinberger <richard@nod.at>
Commit 5695be142e ("OOM, PM: OOM killed task shouldn't escape PM
suspend") has left a race window when OOM killer manages to
note_oom_kill after freeze_processes checks the counter. The race
window is quite small and really unlikely and partial solution deemed
sufficient at the time of submission.
Tejun wasn't happy about this partial solution though and insisted on a
full solution. That requires the full OOM and freezer's task freezing
exclusion, though. This is done by this patch which introduces oom_sem
RW lock and turns oom_killer_disable() into a full OOM barrier.
oom_killer_disabled check is moved from the allocation path to the OOM
level and we take oom_sem for reading for both the check and the whole
OOM invocation.
oom_killer_disable() takes oom_sem for writing so it waits for all
currently running OOM killer invocations. Then it disable all the further
OOMs by setting oom_killer_disabled and checks for any oom victims.
Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The
last victim wakes up all waiters enqueued by oom_killer_disable().
Therefore this function acts as the full OOM barrier.
The page fault path is covered now as well although it was assumed to be
safe before. As per Tejun, "We used to have freezing points deep in file
system code which may be reacheable from page fault." so it would be
better and more robust to not rely on freezing points here. Same applies
to the memcg OOM killer.
out_of_memory tells the caller whether the OOM was allowed to trigger and
the callers are supposed to handle the situation. The page allocation
path simply fails the allocation same as before. The page fault path will
retry the fault (more on that later) and Sysrq OOM trigger will simply
complain to the log.
Normally there wouldn't be any unfrozen user tasks after
try_to_freeze_tasks so the function will not block. But if there was an
OOM killer racing with try_to_freeze_tasks and the OOM victim didn't
finish yet then we have to wait for it. This should complete in a finite
time, though, because
- the victim cannot loop in the page fault handler (it would die
on the way out from the exception)
- it cannot loop in the page allocator because all the further
allocation would fail and __GFP_NOFAIL allocations are not
acceptable at this stage
- it shouldn't be blocked on any locks held by frozen tasks
(try_to_freeze expects lockless context) and kernel threads and
work queues are not frozen yet
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Suggested-by: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patchset addresses a race which was described in the changelog for
5695be142e ("OOM, PM: OOM killed task shouldn't escape PM suspend"):
: PM freezer relies on having all tasks frozen by the time devices are
: getting frozen so that no task will touch them while they are getting
: frozen. But OOM killer is allowed to kill an already frozen task in order
: to handle OOM situtation. In order to protect from late wake ups OOM
: killer is disabled after all tasks are frozen. This, however, still keeps
: a window open when a killed task didn't manage to die by the time
: freeze_processes finishes.
The original patch hasn't closed the race window completely because that
would require a more complex solution as it can be seen by this patchset.
The primary motivation was to close the race condition between OOM killer
and PM freezer _completely_. As Tejun pointed out, even though the race
condition is unlikely the harder it would be to debug weird bugs deep in
the PM freezer when the debugging options are reduced considerably. I can
only speculate what might happen when a task is still runnable
unexpectedly.
On a plus side and as a side effect the oom enable/disable has a better
(full barrier) semantic without polluting hot paths.
I have tested the series in KVM with 100M RAM:
- many small tasks (20M anon mmap) which are triggering OOM continually
- s2ram which resumes automatically is triggered in a loop
echo processors > /sys/power/pm_test
while true
do
echo mem > /sys/power/state
sleep 1s
done
- simple module which allocates and frees 20M in 8K chunks. If it sees
freezing(current) then it tries another round of allocation before calling
try_to_freeze
- debugging messages of PM stages and OOM killer enable/disable/fail added
and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before
it wakes up waiters.
- rebased on top of the current mmotm which means some necessary updates
in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but
I think this should be OK because __thaw_task shouldn't interfere with any
locking down wake_up_process. Oleg?
As expected there are no OOM killed tasks after oom is disabled and
allocations requested by the kernel thread are failing after all the tasks
are frozen and OOM disabled. I wasn't able to catch a race where
oom_killer_disable would really have to wait but I kinda expected the race
is really unlikely.
[ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB
[ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1
[ 243.636072] (elapsed 2.837 seconds) done.
[ 243.641985] Trying to disable OOM killer
[ 243.643032] Waiting for concurent OOM victims
[ 243.644342] OOM killer disabled
[ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done.
[ 243.652983] Suspending console(s) (use no_console_suspend to debug)
[ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010
[...]
[ 243.992600] PM: suspend of devices complete after 336.667 msecs
[ 243.993264] PM: late suspend of devices complete after 0.660 msecs
[ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs
[ 243.994717] ACPI: Preparing to enter system sleep state S3
[ 243.994795] PM: Saving platform NVS memory
[ 243.994796] Disabling non-boot CPUs ...
The first 2 patches are simple cleanups for OOM. They should go in
regardless the rest IMO.
Patches 3 and 4 are trivial printk -> pr_info conversion and they should
go in ditto.
The main patch is the last one and I would appreciate acks from Tejun and
Rafael. I think the OOM part should be OK (except for __thaw_task vs.
task_lock where a look from Oleg would appreciated) but I am not so sure I
haven't screwed anything in the freezer code. I have found several
surprises there.
This patch (of 5):
This patch is just a preparatory and it doesn't introduce any functional
change.
Note:
I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to
wait for the oom victim and to prevent from new killing. This is
just a side effect of the flag. The primary meaning is to give the oom
victim access to the memory reserves and that shouldn't be necessary
here.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
wait_consider_task() checks EXIT_ZOMBIE after EXIT_DEAD/EXIT_TRACE and
both checks can fail if we race with EXIT_ZOMBIE -> EXIT_DEAD/EXIT_TRACE
change in between, gcc needs to reload p->exit_state after
security_task_wait(). In this case ->notask_error will be wrongly
cleared and do_wait() can hang forever if it was the last eligible
child.
Many thanks to Arne who carefully investigated the problem.
Note: this bug is very old but it was pure theoretical until commit
b3ab03160d ("wait: completely ignore the EXIT_DEAD tasks"). Before
this commit "-O2" was probably enough to guarantee that compiler won't
read ->exit_state twice.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reported-by: Arne Goedeke <el@laramies.com>
Tested-by: Arne Goedeke <el@laramies.com>
Cc: <stable@vger.kernel.org> [3.15+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull tty/serial driver updates from Greg KH:
"Here's the big tty/serial driver update for 3.19-rc1.
There are a number of TTY core changes/fixes in here from Peter Hurley
that have all been teted in linux-next for a long time now. There are
also the normal serial driver updates as well, full details in the
changelog below"
* tag 'tty-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (219 commits)
serial: pxa: hold port.lock when reporting modem line changes
tty-hvsi_lib: Deletion of an unnecessary check before the function call "tty_kref_put"
tty: Deletion of unnecessary checks before two function calls
n_tty: Fix read_buf race condition, increment read_head after pushing data
serial: of-serial: add PM suspend/resume support
Revert "serial: of-serial: add PM suspend/resume support"
Revert "serial: of-serial: fix up PM ops on no_console_suspend and port type"
serial: 8250: don't attempt a trylock if in sysrq
serial: core: Add big-endian iotype
serial: samsung: use port->fifosize instead of hardcoded values
serial: samsung: prefer to use fifosize from driver data
serial: samsung: fix style problems
serial: samsung: wait for transfer completion before clock disable
serial: icom: fix error return code
serial: tegra: clean up tty-flag assignments
serial: Fix io address assign flow with Fintek PCI-to-UART Product
serial: mxs-auart: fix tx_empty against shift register
serial: mxs-auart: fix gpio change detection on interrupt
serial: mxs-auart: Fix mxs_auart_set_ldisc()
serial: 8250_dw: Use 64-bit access for OCTEON.
...
Shift "release dead children" loop from forget_original_parent() to its
caller, exit_notify(). It is safe to reap them even if our parent reaps
us right after we drop tasklist_lock, those children no longer have any
connection to the exiting task.
And this allows us to avoid write_lock_irq(tasklist_lock) right after it
was released by forget_original_parent(), we can simply call it with
tasklist_lock held.
While at it, move the comment about forget_original_parent() up to
this function.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that pid_ns logic was isolated we can change forget_original_parent()
to return right after find_child_reaper() when father->children is empty,
there is nothing to reparent in this case.
In particular this avoids find_alive_thread() and this can help if the
whole process exits and it has a lot of PF_EXITING threads at the start of
the thread list, this can easily lead to O(nr_threads ** 2) iterations.
Trivial test case (tested under KVM, 2 CPUs):
static void *tfunc(void *arg)
{
pause();
return NULL;
}
static int child(unsigned int nt)
{
pthread_t pt;
while (nt--)
assert(pthread_create(&pt, NULL, tfunc, NULL) == 0);
pthread_kill(pt, SIGTRAP);
pause();
return 0;
}
int main(int argc, const char *argv[])
{
int stat;
unsigned int nf = atoi(argv[1]);
unsigned int nt = atoi(argv[2]);
while (nf--) {
if (!fork())
return child(nt);
wait(&stat);
assert(stat == SIGTRAP);
}
return 0;
}
$ time ./test 16 16536 shows:
real user sys
- 5m37.628s 0m4.437s 8m5.560s
+ 0m50.032s 0m7.130s 1m4.927s
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
find_new_reaper() does 2 completely different things. Not only it finds a
reaper, it also updates pid_ns->child_reaper or kills the whole namespace
if the caller is ->child_reaper.
Now that has_child_subreaper logic doesn't depend on child_reaper check we
can move that pid_ns code into a separate helper. IMHO this makes the
code more clean, and this allows the next changes.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change find_new_reaper() to use for_each_thread() instead of deprecated
while_each_thread(). We do not bother to check "thread != father" in the
1st loop, we can rely on PF_EXITING check.
Note: this means the minor behavioural change: for_each_thread() starts
from the group leader. But this should be fine, nobody should make any
assumption about do_wait(__WNOTHREAD) when it comes to reparented tasks.
And this can avoid the pointless reparenting to a short-living thread
While zombie leaders are not that common.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
find_new_reaper() assumes that "has_child_subreaper" logic is safe as
long as we are not the exiting ->child_reaper and this is doubly wrong:
1. In fact it is safe if "pid_ns->child_reaper == father"; there must
be no children after zap_pid_ns_processes() returns, so it doesn't
matter what we return in this case and even pid_ns->child_reaper is
wrong otherwise: we can't reparent to ->child_reaper == current.
This is not a bug, but this is confusing.
2. It is not safe if we are not pid_ns->child_reaper but from the same
thread group. We drop tasklist_lock before zap_pid_ns_processes(),
so another thread can lock it and choose the new reaper from the
upper namespace if has_child_subreaper == T, and this is obviously
wrong.
This is not that bad, zap_pid_ns_processes() won't return until the
the new reaper reaps all zombies, but this should be fixed anyway.
We could change for_each_thread() loop to use ->exit_state instead of
PF_EXITING which we had to use until 8aac62706a, or we could change
copy_signal() to check CLONE_NEWPID before setting has_child_subreaper,
but lets change this code so that it is clear we can't look outside of
our namespace, otherwise same_thread_group(reaper, child_reaper) check
will look wrong and confusing anyway.
We can simply start from "father" and fix the problem. We can't wrongly
return a thread from the same thread group if ->is_child_subreaper == T,
we know that all threads have PF_EXITING set.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1. wait_task_zombie() uses p->real_parent to get psig/siglock. This is
correct but needs tasklist_lock, ->real_parent can exit.
We can use "current" instead. This is our natural child, its parent
must be our sub-thread.
2. Read psig/sig outside of ->siglock, ->signal is no longer protected
by this lock.
3. Fix the outdated comments about tasklist_lock. We can not race with
__exit_signal(), the whole thread group is dead, nobody but us can
call it.
Also clarify the usage of ->stats_lock and ->siglock.
Note: thread_group_cputime_adjusted() is sub-optimal in this case, we
probably want to export cputime_adjust() to avoid thread_group_cputime().
The comment says "all threads" but there are no other threads.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that EXIT_DEAD is the terminal state we can kill "int traced"
variable and check "state == EXIT_DEAD" instead to cleanup the code. In
particular, this way it is clear that the check obviously doesn't need
tasklist_lock.
Also fix the type of "unsigned long state", "long" was always wrong
although this doesn't matter because cmpxchg/xchg uses typeof(*ptr).
[akpm@linux-foundation.org: don't make me google the C Operator Precedence table]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sterling Alexander <stalexan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that forget_original_parent() uses ->ptrace_entry for EXIT_DEAD tasks,
we can simply pass "dead_children" list to exit_ptrace() and remove
another release_task() loop. Plus this way we do not need to drop and
reacquire tasklist_lock.
Also shift the list_empty(ptraced) check, if we want this optimization it
makes sense to eliminate the function call altogether.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
Cc: Sterling Alexander <stalexan@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roland McGrath <roland@hack.frob.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>