[solved] Random crashes kcompactd0 Tainted

Hello guys,

I am running Clear Linux on my Intel NUC 8i5BEH.
For some reason, it crashes randomly.

Here is what I see in the logs (the message repeats indefinitely). Can someone help me understand the issue, please?

Dec 19 06:08:48 myyqo systemd[1]: Starting Update Software content...
Dec 19 06:08:48 myyqo swupd[1731714]: Update started
Dec 19 06:08:49 myyqo swupd[1731714]: Version on server (31940) is not newer than system version (31940)
Dec 19 06:08:49 myyqo swupd[1731714]: Update complete - System already up-to-date at version 31940
Dec 19 06:08:49 myyqo systemd[1]: Started swupd telemetry probe.
Dec 19 06:08:49 myyqo systemd[1]: swupd-update.service: Succeeded.
Dec 19 06:08:49 myyqo systemd[1]: Started Update Software content.
Dec 19 06:08:49 myyqo systemd[1]: swupd-probe.service: Succeeded.
Dec 19 06:08:49 myyqo systemd[1]: Started Telemetrics Daemon.
Dec 19 06:08:49 myyqo systemd[1]: Started Telemetrics Post Daemon.
Dec 19 06:14:49 myyqo systemd[1]: telemprobd.service: Succeeded.
Dec 19 06:14:50 myyqo systemd[1]: telempostd.service: Succeeded.
Dec 19 06:43:46 myyqo containerd[493]: time="2019-12-19T06:43:46.144200988+01:00" level=error msg="get state for 46ca100e12e4eaa98b8200b78ae3acdda7e4fd2ef5f12c7aa3c146f0eb3d2fe4" error="context deadline exceeded: unknown"
Dec 19 06:43:46 myyqo containerd[493]: time="2019-12-19T06:43:46.144234010+01:00" level=warning msg="unknown status" status=0
Dec 19 06:44:13 myyqo kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Dec 19 06:44:13 myyqo kernel: rcu:         1-....: (59999 ticks this GP) idle=836/1/0x4000000000000002 softirq=30284222/30284222 fqs=14560 last_accelerate: 0000/5753, Nonlazy posted: ...
Dec 19 06:44:13 myyqo kernel:         (t=60002 jiffies g=109858665 q=214976)
Dec 19 06:44:13 myyqo kernel: NMI backtrace for cpu 1
Dec 19 06:44:13 myyqo kernel: CPU: 1 PID: 65 Comm: kcompactd0 Tainted: G      D           5.4.2-875.native #1
Dec 19 06:44:13 myyqo kernel: Hardware name: Intel(R) Client Systems NUC8i5BEH/NUC8BEB, BIOS BECFL357.86A.0075.2019.1023.1448 10/23/2019
Dec 19 06:44:13 myyqo kernel: Call Trace:
Dec 19 06:44:13 myyqo kernel:  <IRQ>
Dec 19 06:44:13 myyqo kernel:  dump_stack+0x58/0x78
Dec 19 06:44:13 myyqo kernel:  ? lapic_can_unplug_cpu.cold+0x38/0x38
Dec 19 06:44:13 myyqo kernel:  nmi_cpu_backtrace.cold+0x14/0x53
Dec 19 06:44:13 myyqo kernel:  nmi_trigger_cpumask_backtrace+0xf7/0xfc
Dec 19 06:44:13 myyqo kernel:  arch_trigger_cpumask_backtrace+0x14/0x20
Dec 19 06:44:13 myyqo kernel:  rcu_dump_cpu_stacks+0x97/0xd1
Dec 19 06:44:13 myyqo kernel:  rcu_sched_clock_irq.cold+0x1dc/0x3d1
Dec 19 06:44:13 myyqo kernel:  ? account_system_index_time+0x8c/0xa0
Dec 19 06:44:13 myyqo kernel:  update_process_times+0x27/0x50
Dec 19 06:44:13 myyqo kernel:  tick_sched_handle+0x24/0x60
Dec 19 06:44:13 myyqo kernel:  tick_sched_timer+0x38/0x90
Dec 19 06:44:13 myyqo kernel:  __hrtimer_run_queues+0xee/0x250
Dec 19 06:44:13 myyqo kernel:  ? tick_sched_do_timer+0x70/0x70
Dec 19 06:44:13 myyqo kernel:  hrtimer_interrupt+0x104/0x220
Dec 19 06:44:13 myyqo kernel:  smp_apic_timer_interrupt+0x6c/0x140
Dec 19 06:44:13 myyqo kernel:  apic_timer_interrupt+0xf/0x20
Dec 19 06:44:13 myyqo kernel:  </IRQ>
Dec 19 06:44:13 myyqo kernel: RIP: 0010:queued_spin_lock_slowpath+0x3b/0x1c0
Dec 19 06:44:13 myyqo kernel: Code: ff 75 4f f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 29 85 c0 74 0f 8b 07 84 c0 74 09 <0f> ae e8 8b 07 84 c0 75 f7 b8 01 00 00 00 66 89 07 31 c0 89 c2 89
Dec 19 06:44:13 myyqo kernel: RSP: 0018:ffffae22002cbab0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
Dec 19 06:44:13 myyqo kernel: RAX: 0000000000000101 RBX: ffffae22002cbb78 RCX: 0000000000000000
Dec 19 06:44:13 myyqo kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffd724c84e7568
Dec 19 06:44:13 myyqo kernel: RBP: ffffae22002cbab8 R08: ffff9ae15a45f010 R09: 0000000000000000
Dec 19 06:44:13 myyqo kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffd724c40747c0
Dec 19 06:44:13 myyqo kernel: R13: ffffd724c40747c0 R14: 0000000000000270 R15: ffff9ae035a0c840
Dec 19 06:44:13 myyqo kernel:  ? _raw_spin_lock+0x21/0x30
Dec 19 06:44:13 myyqo kernel:  page_vma_mapped_walk+0x343/0x6a0
Dec 19 06:44:13 myyqo kernel:  try_to_unmap_one+0x137/0xa70
Dec 19 06:44:13 myyqo kernel:  ? do_page_add_anon_rmap+0xa7/0xf0
Dec 19 06:44:13 myyqo kernel:  rmap_walk_anon+0xf1/0x260
Dec 19 06:44:13 myyqo kernel:  rmap_walk+0x4b/0x70
Dec 19 06:44:13 myyqo kernel:  try_to_unmap+0xab/0xe0
Dec 19 06:44:13 myyqo kernel:  ? page_remove_rmap+0x300/0x300
Dec 19 06:44:13 myyqo kernel:  ? page_not_mapped+0x20/0x20
Dec 19 06:44:13 myyqo kernel:  ? page_get_anon_vma+0x80/0x80
Dec 19 06:44:13 myyqo kernel:  ? invalid_mkclean_vma+0x20/0x20
Dec 19 06:44:13 myyqo kernel:  migrate_pages+0x7f0/0xb40
Dec 19 06:44:13 myyqo kernel:  ? __bpf_trace_mm_compaction_kcompactd_sleep+0x20/0x20
Dec 19 06:44:13 myyqo kernel:  ? fast_isolate_freepages+0x700/0x700
Dec 19 06:44:13 myyqo kernel:  compact_zone+0x6c3/0xcf0
Dec 19 06:44:13 myyqo kernel:  kcompactd_do_work+0xeb/0x270
Dec 19 06:44:13 myyqo kernel:  ? sched_clock_cpu+0x11/0xc0
Dec 19 06:44:13 myyqo kernel:  kcompactd+0x8c/0x1c0
Dec 19 06:44:13 myyqo kernel:  ? finish_wait+0x80/0x80
Dec 19 06:44:13 myyqo kernel:  kthread+0x101/0x140
Dec 19 06:44:13 myyqo kernel:  ? kcompactd_do_work+0x270/0x270
Dec 19 06:44:13 myyqo kernel:  ? kthread_park+0xa0/0xa0
Dec 19 06:44:13 myyqo kernel:  ret_from_fork+0x35/0x40
Dec 19 06:44:19 myyqo kernel: rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 1-... } 62710 jiffies s: 39725 root: 0x2/.
Dec 19 06:44:19 myyqo kernel: rcu: blocking rcu_node structures:
Dec 19 06:44:19 myyqo kernel: Task dump for CPU 1:
Dec 19 06:44:19 myyqo kernel: kcompactd0      R  running task        0    65      2 0x80004008
Dec 19 06:44:19 myyqo kernel: Call Trace:
Dec 19 06:44:19 myyqo kernel:  ? kcompactd+0x8c/0x1c0
Dec 19 06:44:19 myyqo kernel:  ? finish_wait+0x80/0x80
Dec 19 06:44:19 myyqo kernel:  ? kthread+0x101/0x140
Dec 19 06:44:19 myyqo kernel:  ? kcompactd_do_work+0x270/0x270
Dec 19 06:44:19 myyqo kernel:  ? kthread_park+0xa0/0xa0
Dec 19 06:44:19 myyqo kernel:  ? ret_from_fork+0x35/0x40
Dec 19 06:46:00 myyqo kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 89s! [kcompactd0:65]
Dec 19 06:46:00 myyqo kernel: Modules linked in: xt_nat veth nf_conntrack_netlink xt_addrtype xt_conntrack iptable_filter iptable_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bpfilter ip_tables wireguard ip6_udp_tunnel udp_tunnel overlay intel_wmi_thunderbolt mei_hdcp wmi_bmof cdc_acm iwlmvm mac80211 iwlwifi cfg80211 e1000e i2c_i801 mei_me rfkill mei intel_pch_thermal wmi thermal pinctrl_cannonlake pinctrl_intel intel_pmc_core
Dec 19 06:46:00 myyqo kernel: CPU: 1 PID: 65 Comm: kcompactd0 Tainted: G      D           5.4.2-875.native #1
Dec 19 06:46:00 myyqo kernel: Hardware name: Intel(R) Client Systems NUC8i5BEH/NUC8BEB, BIOS BECFL357.86A.0075.2019.1023.1448 10/23/2019

Looks like a serious kernel bug. Can you consider turning on telemetry? Then we can try and possibly capture these events for more data.

Sure thing. I already enabled telemetry during the installation.

sudo telemctl opt-in
Already opted in. Nothing to do.

I had another crash, this time with a different message but the same behaviour (a hard reboot was necessary):

Dec 25 02:46:45 myyqo containerd[497]: time="2019-12-25T02:46:45.017269192+01:00" level=warning msg="unknown status" status=0
Dec 25 02:49:45 myyqo containerd[497]: time="2019-12-25T02:49:45.032308503+01:00" level=error msg="get state for 6c3533f75b5f0481c6dc52cc9eef3d82982619f4715170287ade6447903f7f4d" error="context deadline exceeded: unknown"
Dec 25 02:49:45 myyqo containerd[497]: time="2019-12-25T02:49:45.032362111+01:00" level=warning msg="unknown status" status=0
Dec 25 02:51:15 myyqo containerd[497]: time="2019-12-25T02:51:15.040837105+01:00" level=error msg="get state for 6c3533f75b5f0481c6dc52cc9eef3d82982619f4715170287ade6447903f7f4d" error="context deadline exceeded: unknown"
Dec 25 02:51:15 myyqo containerd[497]: time="2019-12-25T02:51:15.040890857+01:00" level=warning msg="unknown status" status=0
Dec 25 02:52:45 myyqo containerd[497]: time="2019-12-25T02:52:45.047988380+01:00" level=error msg="get state for 6c3533f75b5f0481c6dc52cc9eef3d82982619f4715170287ade6447903f7f4d" error="context deadline exceeded: unknown"
Dec 25 02:52:45 myyqo containerd[497]: time="2019-12-25T02:52:45.048041174+01:00" level=warning msg="unknown status" status=0
Dec 25 02:52:45 myyqo systemd[1]: Starting Update Software content...
Dec 25 02:52:45 myyqo swupd[1364905]: Update started
Dec 25 02:52:46 myyqo swupd[1364905]: Preparing to update from 31970 to 31980
Dec 25 02:53:17 myyqo containerd[497]: time="2019-12-25T02:53:17.276378793+01:00" level=error msg="get state for 6aad0c6c208dcb55726ff5271cb730653eb03d399da459bca4c69661176009b6" error="context deadline exceeded: unknown"
Dec 25 02:53:17 myyqo containerd[497]: time="2019-12-25T02:53:17.276412150+01:00" level=warning msg="unknown status" status=0
Dec 25 02:53:47 myyqo kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Dec 25 02:53:47 myyqo kernel: rcu:         2-....: (59999 ticks this GP) idle=c92/1/0x4000000000000002 softirq=18615608/18615608 fqs=14988 last_accelerate: 0000/bc29, Nonlazy posted: ...
Dec 25 02:53:47 myyqo kernel:         (t=60008 jiffies g=79055269 q=102866)
Dec 25 02:53:47 myyqo kernel: NMI backtrace for cpu 2
Dec 25 02:53:47 myyqo kernel: CPU: 2 PID: 1364905 Comm: swupd Tainted: G      D           5.4.4-879.native #1
Dec 25 02:53:47 myyqo kernel: Hardware name: Intel(R) Client Systems NUC8i5BEH/NUC8BEB, BIOS BECFL357.86A.0075.2019.1023.1448 10/23/2019
Dec 25 02:53:47 myyqo kernel: Call Trace:
Dec 25 02:53:47 myyqo kernel:  <IRQ>
Dec 25 02:53:47 myyqo kernel:  dump_stack+0x58/0x78
Dec 25 02:53:47 myyqo kernel:  ? lapic_can_unplug_cpu.cold+0x38/0x38
Dec 25 02:53:47 myyqo kernel:  nmi_cpu_backtrace.cold+0x14/0x53
Dec 25 02:53:47 myyqo kernel:  nmi_trigger_cpumask_backtrace+0xf7/0xfc
Dec 25 02:53:47 myyqo kernel:  arch_trigger_cpumask_backtrace+0x14/0x20
Dec 25 02:53:47 myyqo kernel:  rcu_dump_cpu_stacks+0x97/0xd1
Dec 25 02:53:47 myyqo kernel:  rcu_sched_clock_irq.cold+0x1dc/0x3d1
Dec 25 02:53:47 myyqo kernel:  ? account_system_index_time+0x8c/0xa0
Dec 25 02:53:47 myyqo kernel:  update_process_times+0x27/0x50
Dec 25 02:53:47 myyqo kernel:  tick_sched_handle+0x24/0x60
Dec 25 02:53:47 myyqo kernel:  tick_sched_timer+0x38/0x90
Dec 25 02:53:47 myyqo kernel:  __hrtimer_run_queues+0xee/0x250
Dec 25 02:53:47 myyqo kernel:  ? tick_sched_do_timer+0x70/0x70
Dec 25 02:53:47 myyqo kernel:  hrtimer_interrupt+0x104/0x220
Dec 25 02:53:47 myyqo kernel:  smp_apic_timer_interrupt+0x6c/0x140
Dec 25 02:53:47 myyqo kernel:  apic_timer_interrupt+0xf/0x20
Dec 25 02:53:47 myyqo kernel:  </IRQ>
Dec 25 02:53:47 myyqo kernel: RIP: 0010:queued_spin_lock_slowpath+0x40/0x1c0
Dec 25 02:53:47 myyqo kernel: Code: ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 29 85 c0 74 0f 8b 07 84 c0 74 09 0f ae e8 8b 07 <84> c0 75 f7 b8 01 00 00 00 66 89 07 31 c0 89 c2 89 c1 89 c6 89 c7
Dec 25 02:53:47 myyqo kernel: RSP: 0000:ffffb51503177700 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
Dec 25 02:53:47 myyqo kernel: RAX: 0000000000000101 RBX: ffffb515031777c8 RCX: 0000000000000000
Dec 25 02:53:47 myyqo kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: fffff94784616328
Dec 25 02:53:47 myyqo kernel: RBP: ffffb51503177708 R08: ffff906ed8585568 R09: 0000000000000000
Dec 25 02:53:47 myyqo kernel: R10: 0000000000000000 R11: 0000000000000000 R12: fffff94784b1c0c0
Dec 25 02:53:47 myyqo kernel: R13: fffff94784b1c0c0 R14: 0000000000000d78 R15: ffff906f51b9a6c0
Dec 25 02:53:47 myyqo kernel:  ? _raw_spin_lock+0x21/0x30
Dec 25 02:53:47 myyqo kernel:  page_vma_mapped_walk+0x343/0x6a0
Dec 25 02:53:47 myyqo kernel:  try_to_unmap_one+0x137/0xa70
Dec 25 02:53:47 myyqo kernel:  rmap_walk_anon+0xf1/0x260

BTW, in case you try to match the timestamps in the logs against the telemetry, the logs are in UTC+1 (Paris time) on my side.
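
For anyone lining these logs up with UTC-based records, here is a minimal sketch of the conversion. It assumes the year 2019 (taken from the containerd timestamps, since the syslog prefix carries no year) and a fixed UTC+1 offset; the names `PARIS_WINTER` and `journal_to_utc` are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC+1 offset (Paris winter time), as stated in the post.
PARIS_WINTER = timezone(timedelta(hours=1))

def journal_to_utc(stamp, year=2019):
    """Convert a syslog-style prefix such as 'Dec 19 06:44:13' to UTC.

    The year is an assumption passed in by the caller, because the prefix
    itself has none. Note that %b expects English month names under the
    default C/English locale.
    """
    local = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return local.replace(tzinfo=PARIS_WINTER).astimezone(timezone.utc)

# First RCU stall from the Dec 19 log
print(journal_to_utc("Dec 19 06:44:13").isoformat())
# prints 2019-12-19T05:44:13+00:00
```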

@ahkok, do you know if the issue can be due to hardware memory issues?

It happens randomly, and each time the “tainted” process is different. I have removed one memory stick to see if it is more stable.
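
A note on reading those header lines, with a small sketch: `Comm:` names the task that was running on the stalled CPU, while `Tainted: G D` is a kernel-wide flag string rather than a property of that task. Per the kernel's tainted-kernels documentation, `D` means the kernel has already hit an oops or BUG since boot. The snippet below only pulls the fields out of lines like the ones pasted above; `HEADER`, `stalled_tasks` and `sample` are illustrative names, not part of any tool.

```python
import re

# Matches the header the kernel prints before a backtrace, e.g.
# "CPU: 1 PID: 65 Comm: kcompactd0 Tainted: G      D           5.4.2-875.native #1"
HEADER = re.compile(
    r"CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) Comm: (?P<comm>\S+) "
    r"Tainted: (?P<taint>.*?)\s+(?P<kernel>\d+\.\d+\.\S+) #\d+"
)

def stalled_tasks(lines):
    """Return (cpu, pid, comm, taint_flags, kernel) for each header line found."""
    found = []
    for line in lines:
        m = HEADER.search(line)
        if m:
            flags = "".join(m["taint"].split())  # drop the padding between flags
            found.append(
                (int(m["cpu"]), int(m["pid"]), m["comm"], flags, m["kernel"])
            )
    return found

# The two header lines from the logs in this thread
sample = [
    "Dec 19 06:44:13 myyqo kernel: CPU: 1 PID: 65 Comm: kcompactd0 "
    "Tainted: G      D           5.4.2-875.native #1",
    "Dec 25 02:53:47 myyqo kernel: CPU: 2 PID: 1364905 Comm: swupd "
    "Tainted: G      D           5.4.4-879.native #1",
]
for task in stalled_tasks(sample):
    print(task)
```

In both crashes the taint flags are the same (`GD`) and only the stalled task differs (kcompactd0, then swupd), and both were spinning in the same lock path (`queued_spin_lock_slowpath` under `page_vma_mapped_walk`).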

Entirely likely - please use e.g. memtest86+ to test your RAM. See https://www.memtest86.com/download.htm

Ok thanks for the info.

Actually, I already ran a memtest (but not a deep test, for lack of time). The results so far are:

  • with both memory sticks installed, I do see errors
  • with only one memory stick, no errors (but I have to run a longer test to confirm)
  • same with the other memory stick on its own

I guess there is an issue with dual channel on my NUC.

Or the sticks are not the same type. It might actually work to swap them. Were they a pair when you purchased them?

Yes, it is a kit: Crucial CT2K8G4SFD824A 16 GB Kit DDR4, 2400 MT/s, PC4-19200, Dual Rank x8

I already tried swapping them, but nothing changed.

This alone would suggest the sticks are bad, and it is worth checking whether they are covered under warranty by the seller.

I took more time to test with memtest86+, and you were right: one memory stick has errors.

I’ll replace it.

Thanks for the help 🙂