Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFE: SELinux should not contain hardcoded tests of filesystem type names #2

Open
stephensmalley opened this issue Nov 17, 2016 · 1 comment

Comments

@stephensmalley
Copy link
Member

Generalize the current hardcoded tests of specific filesystem type names used to determine whether to support setting per-file security contexts via setxattr on a genfscon-labeled filesystem and whether to initially label the files from policy based on pathname from the root of the filesystem. The former is only safe if the filesystem either implements its own setxattr handler for security labels or the filesystem pins its inodes in memory, as otherwise the label may not be preserved for the lifetime of the file. The latter is only safe if the filesystem does not permit userspace to modify the directory tree (i.e. no .create/.link/.rename methods or filesystem is not mountable by userspace), as otherwise userspace can potentially cause files to move in and out of a given label or to be accessible under different labels depending on which path is first looked up. We currently permit the former for sysfs (implements its own handler that saves/restores the value when the inode is evicted and later re-created from a backing data structure), and for pstore, debugfs, and rootfs (all of which pin their inodes in memory). We currently permit the latter for debugfs, sysfs, and pstore, as the first two do not permit any userspace manipulation of directories and the latter only permits unlink, which causes no issues by itself. We either need some way to detect which filesystems are safe to use in the kernel or specify the whitelists of filesystem type names in the policy.

@pcmoore pcmoore changed the title SELinux should not contain hardcoded tests of filesystem type names RFE: SELinux should not contain hardcoded tests of filesystem type names Nov 18, 2016
pcmoore pushed a commit that referenced this issue Dec 5, 2016
During rpm resume we restore the fences, but we do not have the
protection of struct_mutex. This rules out updating the activity
tracking on the fences, and requires us to rely on the rpm as the
serialisation barrier instead.

[  350.298052] [drm:intel_runtime_resume [i915]] Resuming device
[  350.308606]
[  350.310520] ===============================
[  350.315560] [ INFO: suspicious RCU usage. ]
[  350.320554] 4.8.0-rc8-bsw-rapl+ #3133 Tainted: G     U  W
[  350.327208] -------------------------------
[  350.331977] ../drivers/gpu/drm/i915/i915_gem_request.h:371 suspicious rcu_dereference_protected() usage!
[  350.342619]
[  350.342619] other info that might help us debug this:
[  350.342619]
[  350.351593]
[  350.351593] rcu_scheduler_active = 1, debug_locks = 0
[  350.358952] 3 locks held by Xorg/320:
[  350.363077]  #0:  (&dev->mode_config.mutex){+.+.+.}, at: [<ffffffffa030589c>] drm_modeset_lock_all+0x3c/0xd0 [drm]
[  350.375162]  #1:  (crtc_ww_class_acquire){+.+.+.}, at: [<ffffffffa03058a6>] drm_modeset_lock_all+0x46/0xd0 [drm]
[  350.387022]  #2:  (crtc_ww_class_mutex){+.+.+.}, at: [<ffffffffa0305056>] drm_modeset_lock+0x36/0x110 [drm]
[  350.398236]
[  350.398236] stack backtrace:
[  350.403196] CPU: 1 PID: 320 Comm: Xorg Tainted: G     U  W       4.8.0-rc8-bsw-rapl+ #3133
[  350.412457] Hardware name: Intel Corporation CHERRYVIEW C0 PLATFORM/Braswell CRB, BIOS BRAS.X64.X088.R00.1510270350 10/27/2015
[  350.425212]  0000000000000000 ffff8801680a78c8 ffffffff81332187 ffff88016c5c5000
[  350.433611]  0000000000000001 ffff8801680a78f8 ffffffff810ca6da ffff88016cc8b0f0
[  350.442012]  ffff88016cc80000 ffff88016cc80000 ffff880177ad0000 ffff8801680a7948
[  350.450409] Call Trace:
[  350.453165]  [<ffffffff81332187>] dump_stack+0x67/0x90
[  350.458931]  [<ffffffff810ca6da>] lockdep_rcu_suspicious+0xea/0x120
[  350.466002]  [<ffffffffa039e8dd>] fence_update+0xbd/0x670 [i915]
[  350.472766]  [<ffffffffa039efe2>] i915_gem_restore_fences+0x52/0x70 [i915]
[  350.480496]  [<ffffffffa0368f42>] vlv_resume_prepare+0x72/0x570 [i915]
[  350.487839]  [<ffffffffa0369802>] intel_runtime_resume+0x102/0x210 [i915]
[  350.495442]  [<ffffffff8137f26f>] pci_pm_runtime_resume+0x7f/0xb0
[  350.502274]  [<ffffffff8137f1f0>] ? pci_restore_standard_config+0x40/0x40
[  350.509883]  [<ffffffff814401c5>] __rpm_callback+0x35/0x70
[  350.516037]  [<ffffffff8137f1f0>] ? pci_restore_standard_config+0x40/0x40
[  350.523646]  [<ffffffff81440224>] rpm_callback+0x24/0x80
[  350.529604]  [<ffffffff8137f1f0>] ? pci_restore_standard_config+0x40/0x40
[  350.537212]  [<ffffffff814417bd>] rpm_resume+0x4ad/0x740
[  350.543161]  [<ffffffff81441aa1>] __pm_runtime_resume+0x51/0x80
[  350.549824]  [<ffffffffa03889c8>] intel_runtime_pm_get+0x28/0x90 [i915]
[  350.557265]  [<ffffffffa0388a53>] intel_display_power_get+0x23/0x50 [i915]
[  350.565001]  [<ffffffffa03ef23d>] intel_atomic_commit_tail+0xdfd/0x10b0 [i915]
[  350.573106]  [<ffffffffa034b2e9>] ? drm_atomic_helper_swap_state+0x159/0x300 [drm_kms_helper]
[  350.582659]  [<ffffffff81615091>] ? _raw_spin_unlock+0x31/0x50
[  350.589205]  [<ffffffffa034b2e9>] ? drm_atomic_helper_swap_state+0x159/0x300 [drm_kms_helper]
[  350.598787]  [<ffffffffa03ef8a5>] intel_atomic_commit+0x3b5/0x500 [i915]
[  350.606319]  [<ffffffffa03061dc>] ? drm_atomic_set_crtc_for_connector+0xcc/0x100 [drm]
[  350.615209]  [<ffffffffa0306b49>] drm_atomic_commit+0x49/0x50 [drm]
[  350.622242]  [<ffffffffa034dee8>] drm_atomic_helper_set_config+0x88/0xc0 [drm_kms_helper]
[  350.631419]  [<ffffffffa02f94ac>] drm_mode_set_config_internal+0x6c/0x120 [drm]
[  350.639623]  [<ffffffffa02fa94c>] drm_mode_setcrtc+0x22c/0x4d0 [drm]
[  350.646760]  [<ffffffffa02f0f19>] drm_ioctl+0x209/0x460 [drm]
[  350.653217]  [<ffffffffa02fa720>] ? drm_mode_getcrtc+0x150/0x150 [drm]
[  350.660536]  [<ffffffff810c984a>] ? __lock_is_held+0x4a/0x70
[  350.666885]  [<ffffffff81202303>] do_vfs_ioctl+0x93/0x6b0
[  350.672939]  [<ffffffff8120f843>] ? __fget+0x113/0x200
[  350.678797]  [<ffffffff8120f735>] ? __fget+0x5/0x200
[  350.684361]  [<ffffffff81202964>] SyS_ioctl+0x44/0x80
[  350.690030]  [<ffffffff81001deb>] do_syscall_64+0x5b/0x120
[  350.696184]  [<ffffffff81615ada>] entry_SYSCALL64_slow_path+0x25/0x25

Note we also have to remember the lesson from commit 4fc788f
("drm/i915: Flush delayed fence releases after reset") where we have to
flush any changes to the fence on restore.

v2: Replace call to release user mmaps with an assertion that they have
already been zapped.

Fixes: 49ef529 ("drm/i915: Move fence tracking from object to vma")
Reported-by: Ville Syrjälä <[email protected]>
Signed-off-by: Chris Wilson <[email protected]>
Cc: Ville Syrjälä <[email protected]>
Cc: Joonas Lahtinen <[email protected]>
Cc: Mika Kuoppala <[email protected]>
Reviewed-by: Joonas Lahtinen <[email protected]>
Link: http://patchwork.freedesktop.org/patch/msgid/[email protected]
(cherry picked from commit 4676dc8)
Signed-off-by: Jani Nikula <[email protected]>
pcmoore pushed a commit that referenced this issue Dec 5, 2016
The following panic was caught when run ocfs2 disconfig single test
(block size 512 and cluster size 8192).  ocfs2_journal_dirty() return
-ENOSPC, that means credits were used up.

The total credit should include 3 times of "num_dx_leaves" from
ocfs2_dx_dir_rebalance(), because 2 times will be consumed in
ocfs2_dx_dir_transfer_leaf() and 1 time will be consumed in
ocfs2_dx_dir_new_cluster() -> __ocfs2_dx_dir_new_cluster() ->
ocfs2_dx_dir_format_cluster().  But only two times is included in
ocfs2_dx_dir_rebalance_credits(), fix it.

This can cause read-only fs(v4.1+) or panic for mainline linux depending
on mount option.

  ------------[ cut here ]------------
  kernel BUG at fs/ocfs2/journal.c:775!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront fb_sys_fops sysimgblt sysfillrect syscopyarea parport_pc parport acpi_cpufreq i2c_piix4 i2c_core pcspkr ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod
  CPU: 2 PID: 10601 Comm: dd Not tainted 4.1.12-71.el6uek.bug24939243.x86_64 #2
  Hardware name: Xen HVM domU, BIOS 4.4.4OVM 02/11/2016
  task: ffff8800b6de6200 ti: ffff8800a7d48000 task.ti: ffff8800a7d48000
  RIP: ocfs2_journal_dirty+0xa7/0xb0 [ocfs2]
  RSP: 0018:ffff8800a7d4b6d8  EFLAGS: 00010286
  RAX: 00000000ffffffe4 RBX: 00000000814d0a9c RCX: 00000000000004f9
  RDX: ffffffffa008e990 RSI: ffffffffa008f1ee RDI: ffff8800622b6460
  RBP: ffff8800a7d4b6f8 R08: ffffffffa008f288 R09: ffff8800622b6460
  R10: 0000000000000000 R11: 0000000000000282 R12: 0000000002c8421e
  R13: ffff88006d0cad00 R14: ffff880092beef60 R15: 0000000000000070
  FS:  00007f9b83e92700(0000) GS:ffff8800be880000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fb2c0d1a000 CR3: 0000000008f80000 CR4: 00000000000406e0
  Call Trace:
    ocfs2_dx_dir_transfer_leaf+0x159/0x1a0 [ocfs2]
    ocfs2_dx_dir_rebalance+0xd9b/0xea0 [ocfs2]
    ocfs2_find_dir_space_dx+0xd3/0x300 [ocfs2]
    ocfs2_prepare_dx_dir_for_insert+0x219/0x450 [ocfs2]
    ocfs2_prepare_dir_for_insert+0x1d6/0x580 [ocfs2]
    ocfs2_mknod+0x5a2/0x1400 [ocfs2]
    ocfs2_create+0x73/0x180 [ocfs2]
    vfs_create+0xd8/0x100
    lookup_open+0x185/0x1c0
    do_last+0x36d/0x780
    path_openat+0x92/0x470
    do_filp_open+0x4a/0xa0
    do_sys_open+0x11a/0x230
    SyS_open+0x1e/0x20
    system_call_fastpath+0x12/0x71
  Code: 1d 3f 29 09 00 48 85 db 74 1f 48 8b 03 0f 1f 80 00 00 00 00 48 8b 7b 08 48 83 c3 10 4c 89 e6 ff d0 48 8b 03 48 85 c0 75 eb eb 90 <0f> 0b eb fe 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54
  RIP  ocfs2_journal_dirty+0xa7/0xb0 [ocfs2]
  ---[ end trace 91ac5312a6ee1288 ]---
  Kernel panic - not syncing: Fatal exception
  Kernel Offset: disabled

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Junxiao Bi <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Joseph Qi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
pcmoore pushed a commit that referenced this issue Dec 15, 2016
Missing initialization of udp_tunnel_sock_cfg causes to following
kernel panic, while kernel tries to execute gro_receive().

While being there, we converted udp_port_cfg to use the same
initialization scheme as udp_tunnel_sock_cfg.

------------[ cut here ]------------
kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
BUG: unable to handle kernel paging request at ffffffffa0588c50
IP: [<ffffffffa0588c50>] __this_module+0x50/0xffffffffffff8400 [ib_rxe]
PGD 1c09067 PUD 1c0a063 PMD bb394067 PTE 80000000ad5e8163
Oops: 0011 [#1] SMP
Modules linked in: ib_rxe ip6_udp_tunnel udp_tunnel
CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.7.0-rc3+ #2
Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011
task: ffff880235e4e680 ti: ffff880235e68000 task.ti: ffff880235e68000
RIP: 0010:[<ffffffffa0588c50>]
[<ffffffffa0588c50>] __this_module+0x50/0xffffffffffff8400 [ib_rxe]
RSP: 0018:ffff880237343c80  EFLAGS: 00010282
RAX: 00000000dffe482d RBX: ffff8800ae330900 RCX: 000000002001b712
RDX: ffff8800ae330900 RSI: ffff8800ae102578 RDI: ffff880235589c00
RBP: ffff880237343cb0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800ae33e262
R13: ffff880235589c00 R14: 0000000000000014 R15: ffff8800ae102578
FS:  0000000000000000(0000) GS:ffff880237340000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffa0588c50 CR3: 0000000001c06000 CR4: 00000000000006e0
Stack:
ffffffff8160860e ffff8800ae330900 ffff8800ae102578 0000000000000014
000000000000004e ffff8800ae102578 ffff880237343ce0 ffffffff816088fb
0000000000000000 ffff8800ae330900 0000000000000000 00000000ffad0000
Call Trace:
<IRQ>
[<ffffffff8160860e>] ? udp_gro_receive+0xde/0x130
[<ffffffff816088fb>] udp4_gro_receive+0x10b/0x2d0
[<ffffffff81611373>] inet_gro_receive+0x1d3/0x270
[<ffffffff81594e29>] dev_gro_receive+0x269/0x3b0
[<ffffffff81595188>] napi_gro_receive+0x38/0x120
[<ffffffffa011caee>] mlx5e_handle_rx_cqe+0x27e/0x340 [mlx5_core]
[<ffffffffa011d076>] mlx5e_poll_rx_cq+0x66/0x6d0 [mlx5_core]
[<ffffffffa011d7ae>] mlx5e_napi_poll+0x8e/0x400 [mlx5_core]
[<ffffffff815949a0>] net_rx_action+0x160/0x380
[<ffffffff816a9197>] __do_softirq+0xd7/0x2c5
[<ffffffff81085c35>] irq_exit+0xf5/0x100
[<ffffffff816a8f16>] do_IRQ+0x56/0xd0
[<ffffffff816a6dcc>] common_interrupt+0x8c/0x8c
<EOI>
[<ffffffff81061f96>] ? native_safe_halt+0x6/0x10
[<ffffffff81037ade>] default_idle+0x1e/0xd0
[<ffffffff8103828f>] arch_cpu_idle+0xf/0x20
[<ffffffff810c37dc>] default_idle_call+0x3c/0x50
[<ffffffff810c3b13>] cpu_startup_entry+0x323/0x3c0
[<ffffffff81050d8c>] start_secondary+0x15c/0x1a0
RIP  [<ffffffffa0588c50>] __this_module+0x50/0xffffffffffff8400 [ib_rxe]
RSP <ffff880237343c80>
CR2: ffffffffa0588c50
---[ end trace 489ee31fa7614ac5 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception in interrupt
------------[ cut here ]------------

Fixes: 8700e3e ("Soft RoCE driver")
Signed-off-by: Yonatan Cohen <[email protected]>
Reviewed-by: Moni Shoua <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
Signed-off-by: Doug Ledford <[email protected]>
pcmoore pushed a commit that referenced this issue Dec 15, 2016
Tushar Dave says:

====================
sparc: Enable sun4v hypervisor PCI IOMMU v2 APIs and ATU

ATU (Address Translation Unit) is a new IOMMU in SPARC supported with
sun4v hypervisor PCI IOMMU v2 APIs.

Current SPARC IOMMU supports only 32bit address ranges and one TSB
per PCIe root complex that has a 2GB per root complex DVMA space
limit. The limit has become a scalability bottleneck nowadays that
a typical 10G/40G NIC can consume 500MB DVMA space per instance.
When DVMA resource is exhausted, devices will not be usable
since the driver can't allocate DVMA.

For example, we recently experienced legacy IOMMU limitation while
using i40e driver in system with large number of CPUs (e.g. 128).
Four ports of i40e, each request 128 QP (Queue Pairs). Each queue has
512 (default) descriptors. So considering only RX queues (because RX
premap DMA buffers), i40e takes 4*128*512 number of DMA entries in
IOMMU table. Legacy IOMMU can have at max (2G/8K)- 1 entries available
in table. So bringing up four instance of i40e alone saturate existing
IOMMU resource.

ATU removes bottleneck by allowing guest os to create IOTSB of size
32G (or more) with 64bit address ranges available in ATU HW. 32G is
more than enough DVMA space to be shared by all PCIe devices under
root complex contrast to 2G space provided by legacy IOMMU.

ATU allows PCIe devices to use 64bit DMA addressing. Devices
which choose to use 32bit DMA mask will continue to work with the
existing legacy IOMMU.

The patch set is tested on sun4v (T1000, T2000, T3, T4, T5, T7, S7)
and sun4u SPARC.

Thanks.
-Tushar

v2->v3:
- Patch #5 addresses comment by Joe Perches.
 -- use %s, __func__ instead of embedding the function name.

v1->v2:
- Patch #2 addresses comments by Dave M.
 -- use page allocator to allocate IOTSB.
 -- use true/false with boolean variables.
====================

Signed-off-by: David S. Miller <[email protected]>
pcmoore pushed a commit that referenced this issue Feb 23, 2017
Mathieu reported that the LTTNG modules are broken as of 4.10-rc1 due to
the removal of the cpu hotplug notifiers.

Usually I don't care much about out of tree modules, but LTTNG is widely
used in distros. There are two ways to solve that:

1) Reserve a hotplug state for LTTNG

2) Add a dynamic range for the prepare states.

While #1 is the simplest solution, #2 is the proper one as we can convert
in tree users, which do not care about ordering, to the dynamic range as
well.

Add a dynamic range which allows LTTNG to request states in the prepare
stage.

Reported-and-tested-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Mathieu Desnoyers <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sebastian Sewior <[email protected]>
Cc: Steven Rostedt <[email protected]>
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1701101353010.3401@nanos
Signed-off-by: Thomas Gleixner <[email protected]>
pcmoore pushed a commit that referenced this issue Feb 23, 2017
The lockdep splat below hints at a bug in RCU usage in dm-crypt that
was introduced with commit c538f6e ("dm crypt: add ability to use
keys from the kernel key retention service").  The kernel keyring
function user_key_payload() is in fact a wrapper for
rcu_dereference_protected() which must not be called with only
rcu_read_lock() section mark.

Unfortunately the kernel keyring subsystem doesn't currently provide
an interface that allows the use of an RCU read-side section.  So for
now we must drop RCU in favour of rwsem until a proper function is
made available in the kernel keyring subsystem.

===============================
[ INFO: suspicious RCU usage. ]
4.10.0-rc5 #2 Not tainted
-------------------------------
./include/keys/user-type.h:53 suspicious rcu_dereference_protected() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
2 locks held by cryptsetup/6464:
 #0:  (&md->type_lock){+.+.+.}, at: [<ffffffffa02472a2>] dm_lock_md_type+0x12/0x20 [dm_mod]
 #1:  (rcu_read_lock){......}, at: [<ffffffffa02822f8>] crypt_set_key+0x1d8/0x4b0 [dm_crypt]
stack backtrace:
CPU: 1 PID: 6464 Comm: cryptsetup Not tainted 4.10.0-rc5 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
Call Trace:
 dump_stack+0x67/0x92
 lockdep_rcu_suspicious+0xc5/0x100
 crypt_set_key+0x351/0x4b0 [dm_crypt]
 ? crypt_set_key+0x1d8/0x4b0 [dm_crypt]
 crypt_ctr+0x341/0xa53 [dm_crypt]
 dm_table_add_target+0x147/0x330 [dm_mod]
 table_load+0x111/0x350 [dm_mod]
 ? retrieve_status+0x1c0/0x1c0 [dm_mod]
 ctl_ioctl+0x1f5/0x510 [dm_mod]
 dm_ctl_ioctl+0xe/0x20 [dm_mod]
 do_vfs_ioctl+0x8e/0x690
 ? ____fput+0x9/0x10
 ? task_work_run+0x7e/0xa0
 ? trace_hardirqs_on_caller+0x122/0x1b0
 SyS_ioctl+0x3c/0x70
 entry_SYSCALL_64_fastpath+0x18/0xad
RIP: 0033:0x7f392c9a4ec7
RSP: 002b:00007ffef6383378 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffef63830a0 RCX: 00007f392c9a4ec7
RDX: 000000000124fcc0 RSI: 00000000c138fd09 RDI: 0000000000000005
RBP: 00007ffef6383090 R08: 00000000ffffffff R09: 00000000012482b0
R10: 2a28205d34383336 R11: 0000000000000246 R12: 00007f392d803a08
R13: 00007ffef63831e0 R14: 0000000000000000 R15: 00007f392d803a0b

Fixes: c538f6e ("dm crypt: add ability to use keys from the kernel key retention service")
Reported-by: Milan Broz <[email protected]>
Signed-off-by: Ondrej Kozina <[email protected]>
Reviewed-by: Mikulas Patocka <[email protected]>
Signed-off-by: Mike Snitzer <[email protected]>
pcmoore pushed a commit that referenced this issue Feb 23, 2017
Syzkaller fuzzer managed to trigger this:

    BUG: sleeping function called from invalid context at mm/shmem.c:852
    in_atomic(): 1, irqs_disabled(): 0, pid: 529, name: khugepaged
    3 locks held by khugepaged/529:
     #0:  (shrinker_rwsem){++++..}, at: [<ffffffff818d7ef1>] shrink_slab.part.59+0x121/0xd30 mm/vmscan.c:451
     #1:  (&type->s_umount_key#29){++++..}, at: [<ffffffff81a63630>] trylock_super+0x20/0x100 fs/super.c:392
     #2:  (&(&sbinfo->shrinklist_lock)->rlock){+.+.-.}, at: [<ffffffff818fd83e>] spin_lock include/linux/spinlock.h:302 [inline]
     #2:  (&(&sbinfo->shrinklist_lock)->rlock){+.+.-.}, at: [<ffffffff818fd83e>] shmem_unused_huge_shrink+0x28e/0x1490 mm/shmem.c:427
    CPU: 2 PID: 529 Comm: khugepaged Not tainted 4.10.0-rc5+ #201
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
       shmem_undo_range+0xb20/0x2710 mm/shmem.c:852
       shmem_truncate_range+0x27/0xa0 mm/shmem.c:939
       shmem_evict_inode+0x35f/0xca0 mm/shmem.c:1030
       evict+0x46e/0x980 fs/inode.c:553
       iput_final fs/inode.c:1515 [inline]
       iput+0x589/0xb20 fs/inode.c:1542
       shmem_unused_huge_shrink+0xbad/0x1490 mm/shmem.c:446
       shmem_unused_huge_scan+0x10c/0x170 mm/shmem.c:512
       super_cache_scan+0x376/0x450 fs/super.c:106
       do_shrink_slab mm/vmscan.c:378 [inline]
       shrink_slab.part.59+0x543/0xd30 mm/vmscan.c:481
       shrink_slab mm/vmscan.c:2592 [inline]
       shrink_node+0x2c7/0x870 mm/vmscan.c:2592
       shrink_zones mm/vmscan.c:2734 [inline]
       do_try_to_free_pages+0x369/0xc80 mm/vmscan.c:2776
       try_to_free_pages+0x3c6/0x900 mm/vmscan.c:2982
       __perform_reclaim mm/page_alloc.c:3301 [inline]
       __alloc_pages_direct_reclaim mm/page_alloc.c:3322 [inline]
       __alloc_pages_slowpath+0xa24/0x1c30 mm/page_alloc.c:3683
       __alloc_pages_nodemask+0x544/0xae0 mm/page_alloc.c:3848
       __alloc_pages include/linux/gfp.h:426 [inline]
       __alloc_pages_node include/linux/gfp.h:439 [inline]
       khugepaged_alloc_page+0xc2/0x1b0 mm/khugepaged.c:750
       collapse_huge_page+0x182/0x1fe0 mm/khugepaged.c:955
       khugepaged_scan_pmd+0xfdf/0x12a0 mm/khugepaged.c:1208
       khugepaged_scan_mm_slot mm/khugepaged.c:1727 [inline]
       khugepaged_do_scan mm/khugepaged.c:1808 [inline]
       khugepaged+0xe9b/0x1590 mm/khugepaged.c:1853
       kthread+0x326/0x3f0 kernel/kthread.c:227
       ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430

The iput() from atomic context was a bad idea: if after igrab() somebody
else calls iput() and we left with the last inode reference, our iput()
would lead to inode eviction and therefore sleeping.

This patch should fix the situation.

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Kirill A. Shutemov <[email protected]>
Reported-by: Dmitry Vyukov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
pcmoore pushed a commit that referenced this issue Feb 23, 2017
This fixes a race condition that may occur whenever ST micro busy end
interrupt is raised just after being unmasked but before leaving mmci
interrupt context.

A dead-lock has been found if connecting mmci ST Micro variant whose amba
id is 0x10480180 to some new eMMC that supports internal caches.  Whenever
mmci driver enables cache control by programming eMMC's EXT_CSD register,
block driver may request to flush the eMMC internal caches causing mmci
driver to send a MMC_SWITCH command to the card with FLUSH_CACHE operation.
And because busy end interrupt may be mistakenly cleared while not yet
processed, this mmc request may never complete.  As a result, mmcqd task
may be stuck forever.

Here is an instance caught by lockup detector which shows that mmcqd task
was hung while waiting for mmc_flush_cache command to complete:

..
[  240.251595] INFO: task mmcqd/1:52 blocked for more than 120 seconds.
[  240.257973]       Not tainted 4.1.13-00510-g9d91424 #2
[  240.263109] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  240.270955] mmcqd/1         D c047504c     0    52      2 0x00000000
[  240.277359] [<c047504c>] (__schedule) from [<c04754a0>] (schedule+0x40/0x98)
[  240.284418] [<c04754a0>] (schedule) from [<c0477d40>] (schedule_timeout+0x148/0x188)
[  240.292191] [<c0477d40>] (schedule_timeout) from [<c0476040>] (wait_for_common+0xa4/0x170)
[  240.300491] [<c0476040>] (wait_for_common) from [<c02efc1c>] (mmc_wait_for_req_done+0x4c/0x13c)
[  240.309224] [<c02efc1c>] (mmc_wait_for_req_done) from [<c02efd90>] (mmc_wait_for_cmd+0x64/0x84)
[  240.317953] [<c02efd90>] (mmc_wait_for_cmd) from [<c02f5b14>] (__mmc_switch+0xa4/0x2a8)
[  240.325964] [<c02f5b14>] (__mmc_switch) from [<c02f5d40>] (mmc_switch+0x28/0x30)
[  240.333389] [<c02f5d40>] (mmc_switch) from [<c02f0984>] (mmc_flush_cache+0x54/0x80)
[  240.341073] [<c02f0984>] (mmc_flush_cache) from [<c02ff0c4>] (mmc_blk_issue_rq+0x114/0x4e8)
[  240.349459] [<c02ff0c4>] (mmc_blk_issue_rq) from [<c03008d4>] (mmc_queue_thread+0xc0/0x180)
[  240.357844] [<c03008d4>] (mmc_queue_thread) from [<c003cf90>] (kthread+0xdc/0xf4)
[  240.365339] [<c003cf90>] (kthread) from [<c0010068>] (ret_from_fork+0x14/0x2c)
..
..
[  240.664311] INFO: task partprobe:564 blocked for more than 120 seconds.
[  240.670943]       Not tainted 4.1.13-00510-g9d91424 #2
[  240.676078] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  240.683922] partprobe       D c047504c     0   564    486 0x00000000
[  240.690318] [<c047504c>] (__schedule) from [<c04754a0>] (schedule+0x40/0x98)
[  240.697396] [<c04754a0>] (schedule) from [<c0477d40>] (schedule_timeout+0x148/0x188)
[  240.705149] [<c0477d40>] (schedule_timeout) from [<c0476040>] (wait_for_common+0xa4/0x170)
[  240.713446] [<c0476040>] (wait_for_common) from [<c01f3300>] (submit_bio_wait+0x58/0x64)
[  240.721571] [<c01f3300>] (submit_bio_wait) from [<c01fbbd8>] (blkdev_issue_flush+0x60/0x88)
[  240.729957] [<c01fbbd8>] (blkdev_issue_flush) from [<c010ff84>] (blkdev_fsync+0x34/0x44)
[  240.738083] [<c010ff84>] (blkdev_fsync) from [<c0109594>] (do_fsync+0x3c/0x64)
[  240.745319] [<c0109594>] (do_fsync) from [<c000ffc0>] (ret_fast_syscall+0x0/0x3c)
..

Here is the detailed sequence showing when this issue may happen:

1) At probe time, mmci device is initialized and card busy detection based
on DAT[0] monitoring is enabled.

2) Later during run time, since card reported to support internal caches, a
MMCI_SWITCH command is sent to eMMC device with FLUSH_CACHE operation. On
receiving this command, eMMC may enter busy state (for a relatively short
time in the case of the dead-lock).

3) Then mmci interrupt is raised and mmci_irq() is called:

MMCISTATUS register is read and is equal to 0x01000440. So the following
status bits are set:
- MCI_CMDRESPEND (= 6)
- MCI_DATABLOCKEND (= 10)
- MCI_ST_CARDBUSY (= 24)

Since MMCIMASK0 register is 0x3FF, status variable is set to 0x00000040 and
BIT MCI_CMDRESPEND is cleared by writing MMCICLEAR register.

Then mmci_cmd_irq() is called. Considering the following conditions:
- host->busy_status is 0,
- this is a "busy response",
- reading again MMCISTATUS register gives 0x1000400,
MMCIMASK0 is updated to unmask MCI_ST_BUSYEND bit.

Thus, MMCIMASK0 is set to 0x010003FF and host->busy_status is set to wait
for busy end completion.

Back again in status loop of mmci_irq(), we quickly go through
mmci_data_irq() as there are no data in that case.  And we finally go
through following test at the end of while(status) loop:

/*
 * Don't poll for busy completion in irq context.
 */
if (host->variant->busy_detect && host->busy_status)
	status &= ~host->variant->busy_detect_flag;

Because status variable is not yet null (is equal to 0x40), we do not leave
interrupt context yet but we loop again into while(status) loop. So we run
across following steps:

a) MMCISTATUS register is read again and this time is equal to 0x01000400.
So that following bits are set:
- MCI_DATABLOCKEND (= 10)
- MCI_ST_CARDBUSY (= 24)

Since MMCIMASK0 register is equal to 0x010003FF:

b) status variable is set to 0x01000000.
c) MCI_ST_CARDBUSY bit is cleared by writing MMCICLEAR register.

Then, mmci_cmd_irq() is called one more time. Since host->busy_status is
set and that MCI_ST_CARDBUSY is set in status variable, we just return from
this function.

Back again in mmci_irq(), status variable is set to 0 and we finally leave
the while(status) loop. As a result we leave interrupt context, waiting for
busy end interrupt event.

Now, consider that busy end completion is raised IN BETWEEN steps 3.a) and
3.c). In such a case, we may mistakenly clear busy end interrupt at step
3.c) while it has not yet been processed. This will result in mmc command
to wait forever for a busy end completion that will never happen.

To fix the problem, this patch implements the following changes:

Considering that the mmci seems to be triggering the IRQ on both edges
while monitoring DAT0 for busy completion and that same status bit is used
to monitor start and end of busy detection, special care must be taken to
make sure that both start and end interrupts are always cleared one after
the other.

1) Clearing of card busy bit is moved in mmc_cmd_irq() function where
unmasking of busy end bit is effectively handled.
2) Just before unmasking busy end event, busy start event is cleared by
writing card busy bit in MMCICLEAR register.
3) Finally, once we are no more busy with a command, busy end event is
cleared writing again card busy bit in MMCICLEAR register.

This patch has been tested with the ST Accordo5 machine, not yet supported
upstream but relies on the mmci driver.

Signed-off-by: Sarang Mairal <[email protected]>
Signed-off-by: Jean-Nicolas Graux <[email protected]>
Reviewed-by: Linus Walleij <[email protected]>
Tested-by: Ulf Hansson <[email protected]>
Signed-off-by: Ulf Hansson <[email protected]>
pcmoore pushed a commit that referenced this issue Feb 23, 2017
…failed send

Dan Carpenter kindly reported:
<quote>
The patch d27a7cb: "zfcp: trace on request for open and close of
WKA port" from Aug 10, 2016, leads to the following static checker
warning:

	drivers/s390/scsi/zfcp_fsf.c:1615 zfcp_fsf_open_wka_port()
	warn: 'req' was already freed.

drivers/s390/scsi/zfcp_fsf.c
  1609          zfcp_fsf_start_timer(req, ZFCP_FSF_REQUEST_TIMEOUT);
  1610          retval = zfcp_fsf_req_send(req);
  1611          if (retval)
  1612                  zfcp_fsf_req_free(req);
                                          ^^^
Freed.

  1613  out:
  1614          spin_unlock_irq(&qdio->req_q_lock);
  1615          if (req && !IS_ERR(req))
  1616                  zfcp_dbf_rec_run_wka("fsowp_1", wka_port, req->req_id);
                                                                  ^^^^^^^^^^^
Use after free.

  1617          return retval;
  1618  }

Same thing for zfcp_fsf_close_wka_port() as well.
</quote>

Rather than relying on req being NULL (or ERR_PTR) for all cases where
we don't want to trace or should not trace,
simply check retval which is unconditionally initialized with -EIO != 0
and it can only become 0 on successful retval = zfcp_fsf_req_send(req).
With that we can also remove the then again unnecessary unconditional
initialization of req which was introduced with that earlier commit.

Reported-by: Dan Carpenter <[email protected]>
Suggested-by: Benjamin Block <[email protected]>
Signed-off-by: Steffen Maier <[email protected]>
Fixes: d27a7cb ("zfcp: trace on request for open and close of WKA port")
Cc: <[email protected]> #2.6.38+
Reviewed-by: Benjamin Block <[email protected]>
Reviewed-by: Jens Remus <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
pcmoore pushed a commit that referenced this issue Feb 23, 2017
Olof reported that on a machine which has a BIOS wreckaged TSC the
timestamps in dmesg are making a large jump because the TSC value is
jumping forward after resetting the TSC ADJUST register to a sane value.

This can be avoided by calling the TSC ADJUST saniziting function before
initializing the per cpu sched clock machinery. That takes the offset into
account and avoid the time jump.

What cannot be avoided is that the 'Firmware Bug' warnings on the secondary
CPUs are printed with the large time offsets because it would be too much
effort and ugly hackery to print those warnings into a buffer and emit them
after the adjustemt on the starting CPUs. It's a firmware bug and should be
fixed in firmware. The weird timestamps are collateral damage and just
illustrate the sillyness of the BIOS folks:

[    0.397445] smp: Bringing up secondary CPUs ...
[    0.402100] x86: Booting SMP configuration:
[    0.406343] .... node  #0, CPUs:      #1
[1265776479.930667] [Firmware Bug]: TSC ADJUST differs: Reference CPU0: -2978888639075328 CPU1: -2978888639183101
[1265776479.944664] TSC ADJUST synchronize: Reference CPU0: 0 CPU1: -2978888639183101
[    0.508119]  #2
[1265776480.032346] [Firmware Bug]: TSC ADJUST differs: Reference CPU0: -2978888639075328 CPU2: -2978888639183677
[1265776480.044192] TSC ADJUST synchronize: Reference CPU0: 0 CPU2: -2978888639183677
[    0.607643]  #3
[1265776480.131874] [Firmware Bug]: TSC ADJUST differs: Reference CPU0: -2978888639075328 CPU3: -2978888639184530
[1265776480.143720] TSC ADJUST synchronize: Reference CPU0: 0 CPU3: -2978888639184530
[    0.707108] smp: Brought up 1 node, 4 CPUs
[    0.711271] smpboot: Total of 4 processors activated (21698.88 BogoMIPS)

Reported-by: Olof Johansson <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
pcmoore pushed a commit that referenced this issue Feb 23, 2017
We cannot do printk() from tk_debug_account_sleep_time(), because
tk_debug_account_sleep_time() is called under tk_core seq lock.
The reason why printk() is unsafe there is that console_sem may
invoke scheduler (up()->wake_up_process()->activate_task()), which,
in turn, can return back to timekeeping code, for instance, via
get_time()->ktime_get(), deadlocking the system on tk_core seq lock.

[   48.950592] ======================================================
[   48.950622] [ INFO: possible circular locking dependency detected ]
[   48.950622] 4.10.0-rc7-next-20170213+ #101 Not tainted
[   48.950622] -------------------------------------------------------
[   48.950622] kworker/0:0/3 is trying to acquire lock:
[   48.950653]  (tk_core){----..}, at: [<c01cc624>] retrigger_next_event+0x4c/0x90
[   48.950683]
               but task is already holding lock:
[   48.950683]  (hrtimer_bases.lock){-.-...}, at: [<c01cc610>] retrigger_next_event+0x38/0x90
[   48.950714]
               which lock already depends on the new lock.

[   48.950714]
               the existing dependency chain (in reverse order) is:
[   48.950714]
               -> #5 (hrtimer_bases.lock){-.-...}:
[   48.950744]        _raw_spin_lock_irqsave+0x50/0x64
[   48.950775]        lock_hrtimer_base+0x28/0x58
[   48.950775]        hrtimer_start_range_ns+0x20/0x5c8
[   48.950775]        __enqueue_rt_entity+0x320/0x360
[   48.950805]        enqueue_rt_entity+0x2c/0x44
[   48.950805]        enqueue_task_rt+0x24/0x94
[   48.950836]        ttwu_do_activate+0x54/0xc0
[   48.950836]        try_to_wake_up+0x248/0x5c8
[   48.950836]        __setup_irq+0x420/0x5f0
[   48.950836]        request_threaded_irq+0xdc/0x184
[   48.950866]        devm_request_threaded_irq+0x58/0xa4
[   48.950866]        omap_i2c_probe+0x530/0x6a0
[   48.950897]        platform_drv_probe+0x50/0xb0
[   48.950897]        driver_probe_device+0x1f8/0x2cc
[   48.950897]        __driver_attach+0xc0/0xc4
[   48.950927]        bus_for_each_dev+0x6c/0xa0
[   48.950927]        bus_add_driver+0x100/0x210
[   48.950927]        driver_register+0x78/0xf4
[   48.950958]        do_one_initcall+0x3c/0x16c
[   48.950958]        kernel_init_freeable+0x20c/0x2d8
[   48.950958]        kernel_init+0x8/0x110
[   48.950988]        ret_from_fork+0x14/0x24
[   48.950988]
               -> #4 (&rt_b->rt_runtime_lock){-.-...}:
[   48.951019]        _raw_spin_lock+0x40/0x50
[   48.951019]        rq_offline_rt+0x9c/0x2bc
[   48.951019]        set_rq_offline.part.2+0x2c/0x58
[   48.951049]        rq_attach_root+0x134/0x144
[   48.951049]        cpu_attach_domain+0x18c/0x6f4
[   48.951049]        build_sched_domains+0xba4/0xd80
[   48.951080]        sched_init_smp+0x68/0x10c
[   48.951080]        kernel_init_freeable+0x160/0x2d8
[   48.951080]        kernel_init+0x8/0x110
[   48.951080]        ret_from_fork+0x14/0x24
[   48.951110]
               -> #3 (&rq->lock){-.-.-.}:
[   48.951110]        _raw_spin_lock+0x40/0x50
[   48.951141]        task_fork_fair+0x30/0x124
[   48.951141]        sched_fork+0x194/0x2e0
[   48.951141]        copy_process.part.5+0x448/0x1a20
[   48.951171]        _do_fork+0x98/0x7e8
[   48.951171]        kernel_thread+0x2c/0x34
[   48.951171]        rest_init+0x1c/0x18c
[   48.951202]        start_kernel+0x35c/0x3d4
[   48.951202]        0x8000807c
[   48.951202]
               -> #2 (&p->pi_lock){-.-.-.}:
[   48.951232]        _raw_spin_lock_irqsave+0x50/0x64
[   48.951232]        try_to_wake_up+0x30/0x5c8
[   48.951232]        up+0x4c/0x60
[   48.951263]        __up_console_sem+0x2c/0x58
[   48.951263]        console_unlock+0x3b4/0x650
[   48.951263]        vprintk_emit+0x270/0x474
[   48.951293]        vprintk_default+0x20/0x28
[   48.951293]        printk+0x20/0x30
[   48.951324]        kauditd_hold_skb+0x94/0xb8
[   48.951324]        kauditd_thread+0x1a4/0x56c
[   48.951324]        kthread+0x104/0x148
[   48.951354]        ret_from_fork+0x14/0x24
[   48.951354]
               -> #1 ((console_sem).lock){-.....}:
[   48.951385]        _raw_spin_lock_irqsave+0x50/0x64
[   48.951385]        down_trylock+0xc/0x2c
[   48.951385]        __down_trylock_console_sem+0x24/0x80
[   48.951385]        console_trylock+0x10/0x8c
[   48.951416]        vprintk_emit+0x264/0x474
[   48.951416]        vprintk_default+0x20/0x28
[   48.951416]        printk+0x20/0x30
[   48.951446]        tk_debug_account_sleep_time+0x5c/0x70
[   48.951446]        __timekeeping_inject_sleeptime.constprop.3+0x170/0x1a0
[   48.951446]        timekeeping_resume+0x218/0x23c
[   48.951477]        syscore_resume+0x94/0x42c
[   48.951477]        suspend_enter+0x554/0x9b4
[   48.951477]        suspend_devices_and_enter+0xd8/0x4b4
[   48.951507]        enter_state+0x934/0xbd4
[   48.951507]        pm_suspend+0x14/0x70
[   48.951507]        state_store+0x68/0xc8
[   48.951538]        kernfs_fop_write+0xf4/0x1f8
[   48.951538]        __vfs_write+0x1c/0x114
[   48.951538]        vfs_write+0xa0/0x168
[   48.951568]        SyS_write+0x3c/0x90
[   48.951568]        __sys_trace_return+0x0/0x10
[   48.951568]
               -> #0 (tk_core){----..}:
[   48.951599]        lock_acquire+0xe0/0x294
[   48.951599]        ktime_get_update_offsets_now+0x5c/0x1d4
[   48.951629]        retrigger_next_event+0x4c/0x90
[   48.951629]        on_each_cpu+0x40/0x7c
[   48.951629]        clock_was_set_work+0x14/0x20
[   48.951660]        process_one_work+0x2b4/0x808
[   48.951660]        worker_thread+0x3c/0x550
[   48.951660]        kthread+0x104/0x148
[   48.951690]        ret_from_fork+0x14/0x24
[   48.951690]
               other info that might help us debug this:

[   48.951690] Chain exists of:
                 tk_core --> &rt_b->rt_runtime_lock --> hrtimer_bases.lock

[   48.951721]  Possible unsafe locking scenario:

[   48.951721]        CPU0                    CPU1
[   48.951721]        ----                    ----
[   48.951721]   lock(hrtimer_bases.lock);
[   48.951751]                                lock(&rt_b->rt_runtime_lock);
[   48.951751]                                lock(hrtimer_bases.lock);
[   48.951751]   lock(tk_core);
[   48.951782]
                *** DEADLOCK ***

[   48.951782] 3 locks held by kworker/0:0/3:
[   48.951782]  #0:  ("events"){.+.+.+}, at: [<c0156590>] process_one_work+0x1f8/0x808
[   48.951812]  #1:  (hrtimer_work){+.+...}, at: [<c0156590>] process_one_work+0x1f8/0x808
[   48.951843]  #2:  (hrtimer_bases.lock){-.-...}, at: [<c01cc610>] retrigger_next_event+0x38/0x90
[   48.951843]   stack backtrace:
[   48.951873] CPU: 0 PID: 3 Comm: kworker/0:0 Not tainted 4.10.0-rc7-next-20170213+
[   48.951904] Workqueue: events clock_was_set_work
[   48.951904] [<c0110208>] (unwind_backtrace) from [<c010c224>] (show_stack+0x10/0x14)
[   48.951934] [<c010c224>] (show_stack) from [<c04ca6c0>] (dump_stack+0xac/0xe0)
[   48.951934] [<c04ca6c0>] (dump_stack) from [<c019b5cc>] (print_circular_bug+0x1d0/0x308)
[   48.951965] [<c019b5cc>] (print_circular_bug) from [<c019d2a8>] (validate_chain+0xf50/0x1324)
[   48.951965] [<c019d2a8>] (validate_chain) from [<c019ec18>] (__lock_acquire+0x468/0x7e8)
[   48.951995] [<c019ec18>] (__lock_acquire) from [<c019f634>] (lock_acquire+0xe0/0x294)
[   48.951995] [<c019f634>] (lock_acquire) from [<c01d0ea0>] (ktime_get_update_offsets_now+0x5c/0x1d4)
[   48.952026] [<c01d0ea0>] (ktime_get_update_offsets_now) from [<c01cc624>] (retrigger_next_event+0x4c/0x90)
[   48.952026] [<c01cc624>] (retrigger_next_event) from [<c01e4e24>] (on_each_cpu+0x40/0x7c)
[   48.952056] [<c01e4e24>] (on_each_cpu) from [<c01cafc4>] (clock_was_set_work+0x14/0x20)
[   48.952056] [<c01cafc4>] (clock_was_set_work) from [<c015664c>] (process_one_work+0x2b4/0x808)
[   48.952087] [<c015664c>] (process_one_work) from [<c0157774>] (worker_thread+0x3c/0x550)
[   48.952087] [<c0157774>] (worker_thread) from [<c015d644>] (kthread+0x104/0x148)
[   48.952087] [<c015d644>] (kthread) from [<c0107830>] (ret_from_fork+0x14/0x24)

Replace printk() with printk_deferred(), which does not call into
the scheduler.

Fixes: 0bf43f1 ("timekeeping: Prints the amounts of time spent during suspend")
Reported-and-tested-by: Tony Lindgren <[email protected]>
Signed-off-by: Sergey Senozhatsky <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: "Rafael J . Wysocki" <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: John Stultz <[email protected]>
Cc: "[4.9+]" <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
pcmoore pushed a commit that referenced this issue Feb 23, 2017
Commit 6664498 ("packet: call fanout_release, while UNREGISTERING a
netdev"), unfortunately, introduced the following issues.

1. calling mutex_lock(&fanout_mutex) (fanout_release()) from inside
rcu_read-side critical section. rcu_read_lock disables preemption, most often,
which prohibits calling sleeping functions.

[  ] include/linux/rcupdate.h:560 Illegal context switch in RCU read-side critical section!
[  ]
[  ] rcu_scheduler_active = 1, debug_locks = 0
[  ] 4 locks held by ovs-vswitchd/1969:
[  ]  #0:  (cb_lock){++++++}, at: [<ffffffff8158a6c9>] genl_rcv+0x19/0x40
[  ]  #1:  (ovs_mutex){+.+.+.}, at: [<ffffffffa04878ca>] ovs_vport_cmd_del+0x4a/0x100 [openvswitch]
[  ]  #2:  (rtnl_mutex){+.+.+.}, at: [<ffffffff81564157>] rtnl_lock+0x17/0x20
[  ]  #3:  (rcu_read_lock){......}, at: [<ffffffff81614165>] packet_notifier+0x5/0x3f0
[  ]
[  ] Call Trace:
[  ]  [<ffffffff813770c1>] dump_stack+0x85/0xc4
[  ]  [<ffffffff810c9077>] lockdep_rcu_suspicious+0x107/0x110
[  ]  [<ffffffff810a2da7>] ___might_sleep+0x57/0x210
[  ]  [<ffffffff810a2fd0>] __might_sleep+0x70/0x90
[  ]  [<ffffffff8162e80c>] mutex_lock_nested+0x3c/0x3a0
[  ]  [<ffffffff810de93f>] ? vprintk_default+0x1f/0x30
[  ]  [<ffffffff81186e88>] ? printk+0x4d/0x4f
[  ]  [<ffffffff816106dd>] fanout_release+0x1d/0xe0
[  ]  [<ffffffff81614459>] packet_notifier+0x2f9/0x3f0

2. calling mutex_lock(&fanout_mutex) inside spin_lock(&po->bind_lock).
"sleeping function called from invalid context"

[  ] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620
[  ] in_atomic(): 1, irqs_disabled(): 0, pid: 1969, name: ovs-vswitchd
[  ] INFO: lockdep is turned off.
[  ] Call Trace:
[  ]  [<ffffffff813770c1>] dump_stack+0x85/0xc4
[  ]  [<ffffffff810a2f52>] ___might_sleep+0x202/0x210
[  ]  [<ffffffff810a2fd0>] __might_sleep+0x70/0x90
[  ]  [<ffffffff8162e80c>] mutex_lock_nested+0x3c/0x3a0
[  ]  [<ffffffff816106dd>] fanout_release+0x1d/0xe0
[  ]  [<ffffffff81614459>] packet_notifier+0x2f9/0x3f0

3. calling dev_remove_pack(&fanout->prot_hook), from inside
spin_lock(&po->bind_lock) or rcu_read-side critical-section. dev_remove_pack()
-> synchronize_net(), which might sleep.

[  ] BUG: scheduling while atomic: ovs-vswitchd/1969/0x00000002
[  ] INFO: lockdep is turned off.
[  ] Call Trace:
[  ]  [<ffffffff813770c1>] dump_stack+0x85/0xc4
[  ]  [<ffffffff81186274>] __schedule_bug+0x64/0x73
[  ]  [<ffffffff8162b8cb>] __schedule+0x6b/0xd10
[  ]  [<ffffffff8162c5db>] schedule+0x6b/0x80
[  ]  [<ffffffff81630b1d>] schedule_timeout+0x38d/0x410
[  ]  [<ffffffff810ea3fd>] synchronize_sched_expedited+0x53d/0x810
[  ]  [<ffffffff810ea6de>] synchronize_rcu_expedited+0xe/0x10
[  ]  [<ffffffff8154eab5>] synchronize_net+0x35/0x50
[  ]  [<ffffffff8154eae3>] dev_remove_pack+0x13/0x20
[  ]  [<ffffffff8161077e>] fanout_release+0xbe/0xe0
[  ]  [<ffffffff81614459>] packet_notifier+0x2f9/0x3f0

4. fanout_release() races with calls from different CPU.

To fix the above problems, remove the call to fanout_release() under
rcu_read_lock(). Instead, call __dev_remove_pack(&fanout->prot_hook) and
netdev_run_todo will be happy that &dev->ptype_specific list is empty. In order
to achieve this, I moved dev_{add,remove}_pack() out of fanout_{add,release} to
__fanout_{link,unlink}. So, call to {,__}unregister_prot_hook() will make sure
fanout->prot_hook is removed as well.

Fixes: 6664498 ("packet: call fanout_release, while UNREGISTERING a netdev")
Reported-by: Eric Dumazet <[email protected]>
Signed-off-by: Anoob Soman <[email protected]>
Acked-by: Eric Dumazet <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
pcmoore pushed a commit that referenced this issue May 17, 2017
This is a story about 4 distinct (and very old) btrfs bugs.

Commit c8b9781 ("Btrfs: Add zlib compression support") added
three data corruption bugs for inline extents (bugs #1-3).

Commit 93c82d5 ("Btrfs: zero page past end of inline file items")
fixed bug #1:  uncompressed inline extents followed by a hole and more
extents could get non-zero data in the hole as they were read.  The fix
was to add a memset in btrfs_get_extent to zero out the hole.

Commit 166ae5a ("btrfs: fix inline compressed read err corruption")
fixed bug #2:  compressed inline extents which contained non-zero bytes
might be replaced with zero bytes in some cases.  This patch removed an
unhelpful memset from uncompress_inline, but the case where memset is
required was missed.

There is also a memset in the decompression code, but this only covers
decompressed data that is shorter than the ram_bytes from the extent
ref record.  This memset doesn't cover the region between the end of the
decompressed data and the end of the page.  It has also moved around a
few times over the years, so there's no single patch to refer to.

This patch fixes bug #3:  compressed inline extents followed by a hole
and more extents could get non-zero data in the hole as they were read
(i.e. bug #3 is the same as bug #1, but s/uncompressed/compressed/).
The fix is the same:  zero out the hole in the compressed case too,
by putting a memset back in uncompress_inline, but this time with
correct parameters.

The last and oldest bug, bug #0, is the cause of the offending inline
extent/hole/extent pattern.  Bug #0 is a subtle and mostly-harmless quirk
of behavior somewhere in the btrfs write code.  In a few special cases,
an inline extent and hole are allowed to persist where they normally
would be combined with later extents in the file.

A fast reproducer for bug #0 is presented below.  A few offending extents
are also created in the wild during large rsync transfers with the -S
flag.  A Linux kernel build (git checkout; make allyesconfig; make -j8)
will produce a handful of offending files as well.  Once an offending
file is created, it can present different content to userspace each
time it is read.

Bug #0 is at least 4 and possibly 8 years old.  I verified every vX.Y
kernel back to v3.5 has this behavior.  There are fossil records of this
bug's effects in commits all the way back to v2.6.32.  I have no reason
to believe bug #0 wasn't present at the beginning of btrfs compression
support in v2.6.29, but I can't easily test kernels that old to be sure.

It is not clear whether bug #0 is worth fixing.  A fix would likely
require injecting extra reads into currently write-only paths, and most
of the exceptional cases caused by bug #0 are already handled now.

Whether we like them or not, bug #0's inline extents followed by holes
are part of the btrfs de-facto disk format now, and we need to be able
to read them without data corruption or an infoleak.  So enough about
bug #0, let's get back to bug #3 (this patch).

An example of on-disk structure leading to data corruption found in
the wild:

        item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160
                inode generation 50 transid 50 size 47424 nbytes 49141
                block group 0 mode 100644 links 1 uid 0 gid 0
                rdev 0 flags 0x0(none)
        item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20
                inode ref index 3 namelen 10 name: DB_File.so
        item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362
                inline extent data size 1341 ram 4085 compress(zlib)
        item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53
                extent data disk byte 5367308288 nr 20480
                extent data offset 0 nr 45056 ram 45056
                extent compression(zlib)

Different data appears in userspace during each read of the 11 bytes
between 4085 and 4096.  The extent in item 63 is not long enough to
fill the first page of the file, so a memset is required to fill the
space between item 63 (ending at 4085) and item 64 (beginning at 4096)
with zero.

Here is a reproducer from Liu Bo, which demonstrates another method
of creating the same inline extent and hole pattern:

Using 'page_poison=on' kernel command line (or enable
CONFIG_PAGE_POISONING) run the following:

	# touch foo
	# chattr +c foo
	# xfs_io -f -c "pwrite -W 0 1000" foo
	# xfs_io -f -c "falloc 4 8188" foo
	# od -x foo
	# echo 3 >/proc/sys/vm/drop_caches
	# od -x foo

This produce the following on my box:

Correct output:  file contains 1000 data bytes followed
by zeros:

	0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
	*
	0001740 cdcd cdcd cdcd cdcd 0000 0000 0000 0000
	0001760 0000 0000 0000 0000 0000 0000 0000 0000
	*
	0020000

Actual output:  the data after the first 1000 bytes
will be different each run:

	0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
	*
	0001740 cdcd cdcd cdcd cdcd 6c63 7400 635f 006d
	0001760 5f74 6f43 7400 435f 0053 5f74 7363 7400
	0002000 435f 0056 5f74 6164 7400 645f 0062 5f74
	(...)

Signed-off-by: Zygo Blaxell <[email protected]>
Reviewed-by: Liu Bo <[email protected]>
Reviewed-by: Chris Mason <[email protected]>
Signed-off-by: Chris Mason <[email protected]>
pcmoore pushed a commit that referenced this issue May 17, 2017
The rcu_barrier() takes the cpu_hotplug mutex which itself is not
reclaim-safe, and so rcu_barrier() is illegal from inside the shrinker.

[  309.661373] =========================================================
[  309.661376] [ INFO: possible irq lock inversion dependency detected ]
[  309.661380] 4.11.0-rc1-CI-CI_DRM_2333+ #1 Tainted: G        W
[  309.661383] ---------------------------------------------------------
[  309.661386] gem_exec_gttfil/6435 just changed the state of lock:
[  309.661389]  (rcu_preempt_state.barrier_mutex){+.+.-.}, at: [<ffffffff81100731>] _rcu_barrier+0x31/0x160
[  309.661399] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[  309.661402]  (cpu_hotplug.lock){+.+.+.}
[  309.661404]

               and interrupts could create inverse lock ordering between them.

[  309.661410]
               other info that might help us debug this:
[  309.661414]  Possible interrupt unsafe locking scenario:

[  309.661417]        CPU0                    CPU1
[  309.661419]        ----                    ----
[  309.661421]   lock(cpu_hotplug.lock);
[  309.661425]                                local_irq_disable();
[  309.661432]                                lock(rcu_preempt_state.barrier_mutex);
[  309.661441]                                lock(cpu_hotplug.lock);
[  309.661446]   <Interrupt>
[  309.661448]     lock(rcu_preempt_state.barrier_mutex);
[  309.661453]
                *** DEADLOCK ***

[  309.661460] 4 locks held by gem_exec_gttfil/6435:
[  309.661464]  #0:  (sb_writers#10){.+.+.+}, at: [<ffffffff8120d83d>] vfs_write+0x17d/0x1f0
[  309.661475]  #1:  (debugfs_srcu){......}, at: [<ffffffff81320491>] debugfs_use_file_start+0x41/0xa0
[  309.661486]  #2:  (&attr->mutex){+.+.+.}, at: [<ffffffff8123a3e7>] simple_attr_write+0x37/0xe0
[  309.661495]  #3:  (&dev->struct_mutex){+.+.+.}, at: [<ffffffffa0091b4a>] i915_drop_caches_set+0x3a/0x150 [i915]
[  309.661540]
               the shortest dependencies between 2nd lock and 1st lock:
[  309.661547]  -> (cpu_hotplug.lock){+.+.+.} ops: 829 {
[  309.661553]     HARDIRQ-ON-W at:
[  309.661560]                       __lock_acquire+0x5e5/0x1b50
[  309.661565]                       lock_acquire+0xc9/0x220
[  309.661572]                       __mutex_lock+0x6e/0x990
[  309.661576]                       mutex_lock_nested+0x16/0x20
[  309.661583]                       get_online_cpus+0x61/0x80
[  309.661590]                       kmem_cache_create+0x25/0x1d0
[  309.661596]                       debug_objects_mem_init+0x30/0x249
[  309.661602]                       start_kernel+0x341/0x3fe
[  309.661607]                       x86_64_start_reservations+0x2a/0x2c
[  309.661612]                       x86_64_start_kernel+0x173/0x186
[  309.661619]                       verify_cpu+0x0/0xfc
[  309.661622]     SOFTIRQ-ON-W at:
[  309.661627]                       __lock_acquire+0x611/0x1b50
[  309.661632]                       lock_acquire+0xc9/0x220
[  309.661636]                       __mutex_lock+0x6e/0x990
[  309.661641]                       mutex_lock_nested+0x16/0x20
[  309.661646]                       get_online_cpus+0x61/0x80
[  309.661650]                       kmem_cache_create+0x25/0x1d0
[  309.661655]                       debug_objects_mem_init+0x30/0x249
[  309.661660]                       start_kernel+0x341/0x3fe
[  309.661664]                       x86_64_start_reservations+0x2a/0x2c
[  309.661669]                       x86_64_start_kernel+0x173/0x186
[  309.661674]                       verify_cpu+0x0/0xfc
[  309.661677]     RECLAIM_FS-ON-W at:
[  309.661682]                          mark_held_locks+0x6f/0xa0
[  309.661687]                          lockdep_trace_alloc+0xb3/0x100
[  309.661693]                          kmem_cache_alloc_trace+0x31/0x2e0
[  309.661699]                          __smpboot_create_thread.part.1+0x27/0xe0
[  309.661704]                          smpboot_create_threads+0x61/0x90
[  309.661709]                          cpuhp_invoke_callback+0x9c/0x8a0
[  309.661713]                          cpuhp_up_callbacks+0x31/0xb0
[  309.661718]                          _cpu_up+0x7a/0xc0
[  309.661723]                          do_cpu_up+0x5f/0x80
[  309.661727]                          cpu_up+0xe/0x10
[  309.661734]                          smp_init+0x71/0xb3
[  309.661738]                          kernel_init_freeable+0x94/0x19e
[  309.661743]                          kernel_init+0x9/0xf0
[  309.661748]                          ret_from_fork+0x2e/0x40
[  309.661752]     INITIAL USE at:
[  309.661757]                      __lock_acquire+0x234/0x1b50
[  309.661761]                      lock_acquire+0xc9/0x220
[  309.661766]                      __mutex_lock+0x6e/0x990
[  309.661771]                      mutex_lock_nested+0x16/0x20
[  309.661775]                      get_online_cpus+0x61/0x80
[  309.661780]                      __cpuhp_setup_state+0x44/0x170
[  309.661785]                      page_alloc_init+0x23/0x3a
[  309.661790]                      start_kernel+0x124/0x3fe
[  309.661794]                      x86_64_start_reservations+0x2a/0x2c
[  309.661799]                      x86_64_start_kernel+0x173/0x186
[  309.661804]                      verify_cpu+0x0/0xfc
[  309.661807]   }
[  309.661813]   ... key      at: [<ffffffff81e37690>] cpu_hotplug+0xb0/0x100
[  309.661817]   ... acquired at:
[  309.661821]    lock_acquire+0xc9/0x220
[  309.661825]    __mutex_lock+0x6e/0x990
[  309.661829]    mutex_lock_nested+0x16/0x20
[  309.661833]    get_online_cpus+0x61/0x80
[  309.661837]    _rcu_barrier+0x9f/0x160
[  309.661841]    rcu_barrier+0x10/0x20
[  309.661847]    netdev_run_todo+0x5f/0x310
[  309.661852]    rtnl_unlock+0x9/0x10
[  309.661856]    default_device_exit_batch+0x133/0x150
[  309.661862]    ops_exit_list.isra.0+0x4d/0x60
[  309.661866]    cleanup_net+0x1d8/0x2c0
[  309.661872]    process_one_work+0x1f4/0x6d0
[  309.661876]    worker_thread+0x49/0x4a0
[  309.661881]    kthread+0x107/0x140
[  309.661884]    ret_from_fork+0x2e/0x40

[  309.661890] -> (rcu_preempt_state.barrier_mutex){+.+.-.} ops: 179 {
[  309.661896]    HARDIRQ-ON-W at:
[  309.661901]                     __lock_acquire+0x5e5/0x1b50
[  309.661905]                     lock_acquire+0xc9/0x220
[  309.661910]                     __mutex_lock+0x6e/0x990
[  309.661914]                     mutex_lock_nested+0x16/0x20
[  309.661919]                     _rcu_barrier+0x31/0x160
[  309.661923]                     rcu_barrier+0x10/0x20
[  309.661928]                     netdev_run_todo+0x5f/0x310
[  309.661932]                     rtnl_unlock+0x9/0x10
[  309.661936]                     default_device_exit_batch+0x133/0x150
[  309.661941]                     ops_exit_list.isra.0+0x4d/0x60
[  309.661946]                     cleanup_net+0x1d8/0x2c0
[  309.661951]                     process_one_work+0x1f4/0x6d0
[  309.661955]                     worker_thread+0x49/0x4a0
[  309.661960]                     kthread+0x107/0x140
[  309.661964]                     ret_from_fork+0x2e/0x40
[  309.661968]    SOFTIRQ-ON-W at:
[  309.661972]                     __lock_acquire+0x611/0x1b50
[  309.661977]                     lock_acquire+0xc9/0x220
[  309.661981]                     __mutex_lock+0x6e/0x990
[  309.661986]                     mutex_lock_nested+0x16/0x20
[  309.661990]                     _rcu_barrier+0x31/0x160
[  309.661995]                     rcu_barrier+0x10/0x20
[  309.661999]                     netdev_run_todo+0x5f/0x310
[  309.662003]                     rtnl_unlock+0x9/0x10
[  309.662008]                     default_device_exit_batch+0x133/0x150
[  309.662013]                     ops_exit_list.isra.0+0x4d/0x60
[  309.662017]                     cleanup_net+0x1d8/0x2c0
[  309.662022]                     process_one_work+0x1f4/0x6d0
[  309.662027]                     worker_thread+0x49/0x4a0
[  309.662031]                     kthread+0x107/0x140
[  309.662035]                     ret_from_fork+0x2e/0x40
[  309.662039]    IN-RECLAIM_FS-W at:
[  309.662043]                        __lock_acquire+0x638/0x1b50
[  309.662048]                        lock_acquire+0xc9/0x220
[  309.662053]                        __mutex_lock+0x6e/0x990
[  309.662058]                        mutex_lock_nested+0x16/0x20
[  309.662062]                        _rcu_barrier+0x31/0x160
[  309.662067]                        rcu_barrier+0x10/0x20
[  309.662089]                        i915_gem_shrink_all+0x33/0x40 [i915]
[  309.662109]                        i915_drop_caches_set+0x141/0x150 [i915]
[  309.662114]                        simple_attr_write+0xc7/0xe0
[  309.662119]                        full_proxy_write+0x4f/0x70
[  309.662124]                        __vfs_write+0x23/0x120
[  309.662128]                        vfs_write+0xc6/0x1f0
[  309.662133]                        SyS_write+0x44/0xb0
[  309.662138]                        entry_SYSCALL_64_fastpath+0x1c/0xb1
[  309.662142]    INITIAL USE at:
[  309.662147]                    __lock_acquire+0x234/0x1b50
[  309.662151]                    lock_acquire+0xc9/0x220
[  309.662156]                    __mutex_lock+0x6e/0x990
[  309.662160]                    mutex_lock_nested+0x16/0x20
[  309.662165]                    _rcu_barrier+0x31/0x160
[  309.662169]                    rcu_barrier+0x10/0x20
[  309.662174]                    netdev_run_todo+0x5f/0x310
[  309.662178]                    rtnl_unlock+0x9/0x10
[  309.662183]                    default_device_exit_batch+0x133/0x150
[  309.662188]                    ops_exit_list.isra.0+0x4d/0x60
[  309.662192]                    cleanup_net+0x1d8/0x2c0
[  309.662197]                    process_one_work+0x1f4/0x6d0
[  309.662202]                    worker_thread+0x49/0x4a0
[  309.662206]                    kthread+0x107/0x140
[  309.662210]                    ret_from_fork+0x2e/0x40
[  309.662214]  }
[  309.662220]  ... key      at: [<ffffffff81e4e1c8>] rcu_preempt_state+0x508/0x780
[  309.662225]  ... acquired at:
[  309.662229]    check_usage_forwards+0x12b/0x130
[  309.662233]    mark_lock+0x360/0x6f0
[  309.662237]    __lock_acquire+0x638/0x1b50
[  309.662241]    lock_acquire+0xc9/0x220
[  309.662245]    __mutex_lock+0x6e/0x990
[  309.662249]    mutex_lock_nested+0x16/0x20
[  309.662253]    _rcu_barrier+0x31/0x160
[  309.662257]    rcu_barrier+0x10/0x20
[  309.662279]    i915_gem_shrink_all+0x33/0x40 [i915]
[  309.662298]    i915_drop_caches_set+0x141/0x150 [i915]
[  309.662303]    simple_attr_write+0xc7/0xe0
[  309.662307]    full_proxy_write+0x4f/0x70
[  309.662311]    __vfs_write+0x23/0x120
[  309.662315]    vfs_write+0xc6/0x1f0
[  309.662319]    SyS_write+0x44/0xb0
[  309.662323]    entry_SYSCALL_64_fastpath+0x1c/0xb1

[  309.662329]
               stack backtrace:
[  309.662335] CPU: 1 PID: 6435 Comm: gem_exec_gttfil Tainted: G        W       4.11.0-rc1-CI-CI_DRM_2333+ #1
[  309.662342] Hardware name: Hewlett-Packard HP Compaq 8100 Elite SFF PC/304Ah, BIOS 786H1 v01.13 07/14/2011
[  309.662348] Call Trace:
[  309.662354]  dump_stack+0x67/0x92
[  309.662359]  print_irq_inversion_bug.part.19+0x1a4/0x1b0
[  309.662365]  check_usage_forwards+0x12b/0x130
[  309.662369]  mark_lock+0x360/0x6f0
[  309.662374]  ? print_shortest_lock_dependencies+0x1a0/0x1a0
[  309.662379]  __lock_acquire+0x638/0x1b50
[  309.662383]  ? __mutex_unlock_slowpath+0x3e/0x2e0
[  309.662388]  ? trace_hardirqs_on+0xd/0x10
[  309.662392]  ? _rcu_barrier+0x31/0x160
[  309.662396]  lock_acquire+0xc9/0x220
[  309.662400]  ? _rcu_barrier+0x31/0x160
[  309.662404]  ? _rcu_barrier+0x31/0x160
[  309.662409]  __mutex_lock+0x6e/0x990
[  309.662412]  ? _rcu_barrier+0x31/0x160
[  309.662416]  ? _rcu_barrier+0x31/0x160
[  309.662421]  ? synchronize_rcu_expedited+0x35/0xb0
[  309.662426]  ? _raw_spin_unlock_irqrestore+0x52/0x60
[  309.662434]  mutex_lock_nested+0x16/0x20
[  309.662438]  _rcu_barrier+0x31/0x160
[  309.662442]  rcu_barrier+0x10/0x20
[  309.662464]  i915_gem_shrink_all+0x33/0x40 [i915]
[  309.662484]  i915_drop_caches_set+0x141/0x150 [i915]
[  309.662489]  simple_attr_write+0xc7/0xe0
[  309.662494]  full_proxy_write+0x4f/0x70
[  309.662498]  __vfs_write+0x23/0x120
[  309.662503]  ? rcu_read_lock_sched_held+0x75/0x80
[  309.662507]  ? rcu_sync_lockdep_assert+0x2a/0x50
[  309.662512]  ? __sb_start_write+0x102/0x210
[  309.662516]  ? vfs_write+0x17d/0x1f0
[  309.662520]  vfs_write+0xc6/0x1f0
[  309.662524]  ? trace_hardirqs_on_caller+0xe7/0x200
[  309.662529]  SyS_write+0x44/0xb0
[  309.662533]  entry_SYSCALL_64_fastpath+0x1c/0xb1
[  309.662537] RIP: 0033:0x7f507eac24a0
[  309.662541] RSP: 002b:00007fffda8720e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  309.662548] RAX: ffffffffffffffda RBX: ffffffff81482bd3 RCX: 00007f507eac24a0
[  309.662552] RDX: 0000000000000005 RSI: 00007fffda8720f0 RDI: 0000000000000005
[  309.662557] RBP: ffffc9000048bf88 R08: 0000000000000000 R09: 000000000000002c
[  309.662561] R10: 0000000000000014 R11: 0000000000000246 R12: 00007fffda872230
[  309.662566] R13: 00007fffda872228 R14: 0000000000000201 R15: 00007fffda8720f0
[  309.662572]  ? __this_cpu_preempt_check+0x13/0x20

Fixes: 0eafec6 ("drm/i915: Enable lockless lookup of request tracking via RCU")
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100192
Signed-off-by: Chris Wilson <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: <[email protected]> # v4.9+
Link: http://patchwork.freedesktop.org/patch/msgid/[email protected]
Reviewed-by: Daniel Vetter <[email protected]>
(cherry picked from commit bd784b7)
Signed-off-by: Jani Nikula <[email protected]>
Link: http://patchwork.freedesktop.org/patch/msgid/[email protected]
pcmoore pushed a commit that referenced this issue May 17, 2017
Dmitry reported a lockdep splat [1] (false positive) that we can fix
by releasing the spinlock before calling icmp_send() from ip_expire()

This is a false positive because sending an ICMP message can not
possibly re-enter the IP frag engine.

[1]
[ INFO: possible circular locking dependency detected ]
4.10.0+ #29 Not tainted
-------------------------------------------------------
modprobe/12392 is trying to acquire lock:
 (_xmit_ETHER#2){+.-...}, at: [<ffffffff837a8182>] spin_lock
include/linux/spinlock.h:299 [inline]
 (_xmit_ETHER#2){+.-...}, at: [<ffffffff837a8182>] __netif_tx_lock
include/linux/netdevice.h:3486 [inline]
 (_xmit_ETHER#2){+.-...}, at: [<ffffffff837a8182>]
sch_direct_xmit+0x282/0x6d0 net/sched/sch_generic.c:180

but task is already holding lock:
 (&(&q->lock)->rlock){+.-...}, at: [<ffffffff8389a4d1>] spin_lock
include/linux/spinlock.h:299 [inline]
 (&(&q->lock)->rlock){+.-...}, at: [<ffffffff8389a4d1>]
ip_expire+0x51/0x6c0 net/ipv4/ip_fragment.c:201

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&(&q->lock)->rlock){+.-...}:
       validate_chain kernel/locking/lockdep.c:2267 [inline]
       __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3340
       lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3755
       __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
       _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
       spin_lock include/linux/spinlock.h:299 [inline]
       ip_defrag+0x3a2/0x4130 net/ipv4/ip_fragment.c:669
       ip_check_defrag+0x4e3/0x8b0 net/ipv4/ip_fragment.c:713
       packet_rcv_fanout+0x282/0x800 net/packet/af_packet.c:1459
       deliver_skb net/core/dev.c:1834 [inline]
       dev_queue_xmit_nit+0x294/0xa90 net/core/dev.c:1890
       xmit_one net/core/dev.c:2903 [inline]
       dev_hard_start_xmit+0x16b/0xab0 net/core/dev.c:2923
       sch_direct_xmit+0x31f/0x6d0 net/sched/sch_generic.c:182
       __dev_xmit_skb net/core/dev.c:3092 [inline]
       __dev_queue_xmit+0x13e5/0x1e60 net/core/dev.c:3358
       dev_queue_xmit+0x17/0x20 net/core/dev.c:3423
       neigh_resolve_output+0x6b9/0xb10 net/core/neighbour.c:1308
       neigh_output include/net/neighbour.h:478 [inline]
       ip_finish_output2+0x8b8/0x15a0 net/ipv4/ip_output.c:228
       ip_do_fragment+0x1d93/0x2720 net/ipv4/ip_output.c:672
       ip_fragment.constprop.54+0x145/0x200 net/ipv4/ip_output.c:545
       ip_finish_output+0x82d/0xe10 net/ipv4/ip_output.c:314
       NF_HOOK_COND include/linux/netfilter.h:246 [inline]
       ip_output+0x1f0/0x7a0 net/ipv4/ip_output.c:404
       dst_output include/net/dst.h:486 [inline]
       ip_local_out+0x95/0x170 net/ipv4/ip_output.c:124
       ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492
       ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512
       raw_sendmsg+0x26de/0x3a00 net/ipv4/raw.c:655
       inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:761
       sock_sendmsg_nosec net/socket.c:633 [inline]
       sock_sendmsg+0xca/0x110 net/socket.c:643
       ___sys_sendmsg+0x4a3/0x9f0 net/socket.c:1985
       __sys_sendmmsg+0x25c/0x750 net/socket.c:2075
       SYSC_sendmmsg net/socket.c:2106 [inline]
       SyS_sendmmsg+0x35/0x60 net/socket.c:2101
       do_syscall_64+0x2e8/0x930 arch/x86/entry/common.c:281
       return_from_SYSCALL_64+0x0/0x7a

-> #0 (_xmit_ETHER#2){+.-...}:
       check_prev_add kernel/locking/lockdep.c:1830 [inline]
       check_prevs_add+0xa8f/0x19f0 kernel/locking/lockdep.c:1940
       validate_chain kernel/locking/lockdep.c:2267 [inline]
       __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3340
       lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3755
       __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
       _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
       spin_lock include/linux/spinlock.h:299 [inline]
       __netif_tx_lock include/linux/netdevice.h:3486 [inline]
       sch_direct_xmit+0x282/0x6d0 net/sched/sch_generic.c:180
       __dev_xmit_skb net/core/dev.c:3092 [inline]
       __dev_queue_xmit+0x13e5/0x1e60 net/core/dev.c:3358
       dev_queue_xmit+0x17/0x20 net/core/dev.c:3423
       neigh_hh_output include/net/neighbour.h:468 [inline]
       neigh_output include/net/neighbour.h:476 [inline]
       ip_finish_output2+0xf6c/0x15a0 net/ipv4/ip_output.c:228
       ip_finish_output+0xa29/0xe10 net/ipv4/ip_output.c:316
       NF_HOOK_COND include/linux/netfilter.h:246 [inline]
       ip_output+0x1f0/0x7a0 net/ipv4/ip_output.c:404
       dst_output include/net/dst.h:486 [inline]
       ip_local_out+0x95/0x170 net/ipv4/ip_output.c:124
       ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492
       ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512
       icmp_push_reply+0x372/0x4d0 net/ipv4/icmp.c:394
       icmp_send+0x156c/0x1c80 net/ipv4/icmp.c:754
       ip_expire+0x40e/0x6c0 net/ipv4/ip_fragment.c:239
       call_timer_fn+0x241/0x820 kernel/time/timer.c:1268
       expire_timers kernel/time/timer.c:1307 [inline]
       __run_timers+0x960/0xcf0 kernel/time/timer.c:1601
       run_timer_softirq+0x21/0x80 kernel/time/timer.c:1614
       __do_softirq+0x31f/0xbe7 kernel/softirq.c:284
       invoke_softirq kernel/softirq.c:364 [inline]
       irq_exit+0x1cc/0x200 kernel/softirq.c:405
       exiting_irq arch/x86/include/asm/apic.h:657 [inline]
       smp_apic_timer_interrupt+0x76/0xa0 arch/x86/kernel/apic/apic.c:962
       apic_timer_interrupt+0x93/0xa0 arch/x86/entry/entry_64.S:707
       __read_once_size include/linux/compiler.h:254 [inline]
       atomic_read arch/x86/include/asm/atomic.h:26 [inline]
       rcu_dynticks_curr_cpu_in_eqs kernel/rcu/tree.c:350 [inline]
       __rcu_is_watching kernel/rcu/tree.c:1133 [inline]
       rcu_is_watching+0x83/0x110 kernel/rcu/tree.c:1147
       rcu_read_lock_held+0x87/0xc0 kernel/rcu/update.c:293
       radix_tree_deref_slot include/linux/radix-tree.h:238 [inline]
       filemap_map_pages+0x6d4/0x1570 mm/filemap.c:2335
       do_fault_around mm/memory.c:3231 [inline]
       do_read_fault mm/memory.c:3265 [inline]
       do_fault+0xbd5/0x2080 mm/memory.c:3370
       handle_pte_fault mm/memory.c:3600 [inline]
       __handle_mm_fault+0x1062/0x2cb0 mm/memory.c:3714
       handle_mm_fault+0x1e2/0x480 mm/memory.c:3751
       __do_page_fault+0x4f6/0xb60 arch/x86/mm/fault.c:1397
       do_page_fault+0x54/0x70 arch/x86/mm/fault.c:1460
       page_fault+0x28/0x30 arch/x86/entry/entry_64.S:1011

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&(&q->lock)->rlock);
                               lock(_xmit_ETHER#2);
                               lock(&(&q->lock)->rlock);
  lock(_xmit_ETHER#2);

 *** DEADLOCK ***

10 locks held by modprobe/12392:
 #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff81329758>]
__do_page_fault+0x2b8/0xb60 arch/x86/mm/fault.c:1336
 #1:  (rcu_read_lock){......}, at: [<ffffffff8188cab6>]
filemap_map_pages+0x1e6/0x1570 mm/filemap.c:2324
 #2:  (&(ptlock_ptr(page))->rlock#2){+.+...}, at: [<ffffffff81984a78>]
spin_lock include/linux/spinlock.h:299 [inline]
 #2:  (&(ptlock_ptr(page))->rlock#2){+.+...}, at: [<ffffffff81984a78>]
pte_alloc_one_map mm/memory.c:2944 [inline]
 #2:  (&(ptlock_ptr(page))->rlock#2){+.+...}, at: [<ffffffff81984a78>]
alloc_set_pte+0x13b8/0x1b90 mm/memory.c:3072
 #3:  (((&q->timer))){+.-...}, at: [<ffffffff81627e72>]
lockdep_copy_map include/linux/lockdep.h:175 [inline]
 #3:  (((&q->timer))){+.-...}, at: [<ffffffff81627e72>]
call_timer_fn+0x1c2/0x820 kernel/time/timer.c:1258
 #4:  (&(&q->lock)->rlock){+.-...}, at: [<ffffffff8389a4d1>] spin_lock
include/linux/spinlock.h:299 [inline]
 #4:  (&(&q->lock)->rlock){+.-...}, at: [<ffffffff8389a4d1>]
ip_expire+0x51/0x6c0 net/ipv4/ip_fragment.c:201
 #5:  (rcu_read_lock){......}, at: [<ffffffff8389a633>]
ip_expire+0x1b3/0x6c0 net/ipv4/ip_fragment.c:216
 #6:  (slock-AF_INET){+.-...}, at: [<ffffffff839b3313>] spin_trylock
include/linux/spinlock.h:309 [inline]
 #6:  (slock-AF_INET){+.-...}, at: [<ffffffff839b3313>] icmp_xmit_lock
net/ipv4/icmp.c:219 [inline]
 #6:  (slock-AF_INET){+.-...}, at: [<ffffffff839b3313>]
icmp_send+0x803/0x1c80 net/ipv4/icmp.c:681
 #7:  (rcu_read_lock_bh){......}, at: [<ffffffff838ab9a1>]
ip_finish_output2+0x2c1/0x15a0 net/ipv4/ip_output.c:198
 #8:  (rcu_read_lock_bh){......}, at: [<ffffffff836d1dee>]
__dev_queue_xmit+0x23e/0x1e60 net/core/dev.c:3324
 #9:  (dev->qdisc_running_key ?: &qdisc_running_key){+.....}, at:
[<ffffffff836d3a27>] dev_queue_xmit+0x17/0x20 net/core/dev.c:3423

stack backtrace:
CPU: 0 PID: 12392 Comm: modprobe Not tainted 4.10.0+ #29
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:16 [inline]
 dump_stack+0x2ee/0x3ef lib/dump_stack.c:52
 print_circular_bug+0x307/0x3b0 kernel/locking/lockdep.c:1204
 check_prev_add kernel/locking/lockdep.c:1830 [inline]
 check_prevs_add+0xa8f/0x19f0 kernel/locking/lockdep.c:1940
 validate_chain kernel/locking/lockdep.c:2267 [inline]
 __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3340
 lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3755
 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
 _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
 spin_lock include/linux/spinlock.h:299 [inline]
 __netif_tx_lock include/linux/netdevice.h:3486 [inline]
 sch_direct_xmit+0x282/0x6d0 net/sched/sch_generic.c:180
 __dev_xmit_skb net/core/dev.c:3092 [inline]
 __dev_queue_xmit+0x13e5/0x1e60 net/core/dev.c:3358
 dev_queue_xmit+0x17/0x20 net/core/dev.c:3423
 neigh_hh_output include/net/neighbour.h:468 [inline]
 neigh_output include/net/neighbour.h:476 [inline]
 ip_finish_output2+0xf6c/0x15a0 net/ipv4/ip_output.c:228
 ip_finish_output+0xa29/0xe10 net/ipv4/ip_output.c:316
 NF_HOOK_COND include/linux/netfilter.h:246 [inline]
 ip_output+0x1f0/0x7a0 net/ipv4/ip_output.c:404
 dst_output include/net/dst.h:486 [inline]
 ip_local_out+0x95/0x170 net/ipv4/ip_output.c:124
 ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492
 ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512
 icmp_push_reply+0x372/0x4d0 net/ipv4/icmp.c:394
 icmp_send+0x156c/0x1c80 net/ipv4/icmp.c:754
 ip_expire+0x40e/0x6c0 net/ipv4/ip_fragment.c:239
 call_timer_fn+0x241/0x820 kernel/time/timer.c:1268
 expire_timers kernel/time/timer.c:1307 [inline]
 __run_timers+0x960/0xcf0 kernel/time/timer.c:1601
 run_timer_softirq+0x21/0x80 kernel/time/timer.c:1614
 __do_softirq+0x31f/0xbe7 kernel/softirq.c:284
 invoke_softirq kernel/softirq.c:364 [inline]
 irq_exit+0x1cc/0x200 kernel/softirq.c:405
 exiting_irq arch/x86/include/asm/apic.h:657 [inline]
 smp_apic_timer_interrupt+0x76/0xa0 arch/x86/kernel/apic/apic.c:962
 apic_timer_interrupt+0x93/0xa0 arch/x86/entry/entry_64.S:707
RIP: 0010:__read_once_size include/linux/compiler.h:254 [inline]
RIP: 0010:atomic_read arch/x86/include/asm/atomic.h:26 [inline]
RIP: 0010:rcu_dynticks_curr_cpu_in_eqs kernel/rcu/tree.c:350 [inline]
RIP: 0010:__rcu_is_watching kernel/rcu/tree.c:1133 [inline]
RIP: 0010:rcu_is_watching+0x83/0x110 kernel/rcu/tree.c:1147
RSP: 0000:ffff8801c391f120 EFLAGS: 00000a03 ORIG_RAX: ffffffffffffff10
RAX: dffffc0000000000 RBX: ffff8801c391f148 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 000055edd4374000 RDI: ffff8801dbe1ae0c
RBP: ffff8801c391f1a0 R08: 0000000000000002 R09: 0000000000000000
R10: dffffc0000000000 R11: 0000000000000002 R12: 1ffff10038723e25
R13: ffff8801dbe1ae00 R14: ffff8801c391f680 R15: dffffc0000000000
 </IRQ>
 rcu_read_lock_held+0x87/0xc0 kernel/rcu/update.c:293
 radix_tree_deref_slot include/linux/radix-tree.h:238 [inline]
 filemap_map_pages+0x6d4/0x1570 mm/filemap.c:2335
 do_fault_around mm/memory.c:3231 [inline]
 do_read_fault mm/memory.c:3265 [inline]
 do_fault+0xbd5/0x2080 mm/memory.c:3370
 handle_pte_fault mm/memory.c:3600 [inline]
 __handle_mm_fault+0x1062/0x2cb0 mm/memory.c:3714
 handle_mm_fault+0x1e2/0x480 mm/memory.c:3751
 __do_page_fault+0x4f6/0xb60 arch/x86/mm/fault.c:1397
 do_page_fault+0x54/0x70 arch/x86/mm/fault.c:1460
 page_fault+0x28/0x30 arch/x86/entry/entry_64.S:1011
RIP: 0033:0x7f83172f2786
RSP: 002b:00007fffe859ae80 EFLAGS: 00010293
RAX: 000055edd4373040 RBX: 00007f83175111c8 RCX: 000055edd4373238
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007f8317510970
RBP: 00007fffe859afd0 R08: 0000000000000009 R09: 0000000000000000
R10: 0000000000000064 R11: 0000000000000000 R12: 000055edd4373040
R13: 0000000000000000 R14: 00007fffe859afe8 R15: 0000000000000000

Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: Dmitry Vyukov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
pcmoore pushed a commit that referenced this issue Jul 6, 2017
'perf annotate' is dropping the cr* fields from branch instructions.

Fix it by adding support to display branch instructions having
multiple operands.

Power Arch objdump of int_sqrt:

 20.36 | c0000000004d2694:   subf   r10,r10,r3
       | c0000000004d2698: v bgt    cr6,c0000000004d26a0 <int_sqrt+0x40>
  1.82 | c0000000004d269c:   mr     r3,r10
 29.18 | c0000000004d26a0:   mr     r10,r8
       | c0000000004d26a4: v bgt    cr7,c0000000004d26ac <int_sqrt+0x4c>
       | c0000000004d26a8:   mr     r10,r7

Power Arch Before Patch:

 20.36 |       subf   r10,r10,r3
       |     v bgt    40
  1.82 |       mr     r3,r10
 29.18 | 40:   mr     r10,r8
       |     v bgt    4c
       |       mr     r10,r7

Power Arch After patch:

 20.36 |       subf   r10,r10,r3
       |     v bgt    cr6,40
  1.82 |       mr     r3,r10
 29.18 | 40:   mr     r10,r8
       |     v bgt    cr7,4c
       |       mr     r10,r7

Also support AArch64 conditional branch instructions, which can
have up to three operands:

Aarch64 Non-simplified (raw objdump) view:

       │ffff0000083cd11c: ↑ cbz    w0, ffff0000083cd100 <security_fil▒
...
  4.44 │ffff000│083cd134: ↓ tbnz   w0, #26, ffff0000083cd190 <securit▒
...
  1.37 │ffff000│083cd144: ↓ tbnz   w22, #5, ffff0000083cd1a4 <securit▒
       │ffff000│083cd148:   mov    w19, #0x20000                   //▒
  1.02 │ffff000│083cd14c: ↓ tbz    w22, #2, ffff0000083cd1ac <securit▒
...
  0.68 │ffff000└──3cd16c: ↑ cbnz   w0, ffff0000083cd120 <security_fil▒

Aarch64 Simplified, before this patch:

       │    ↑ cbz    40
...
  4.44 │   │↓ tbnz   w0, #26, ffff0000083cd190 <security_file_permiss▒
...
  1.37 │   │↓ tbnz   w22, #5, ffff0000083cd1a4 <security_file_permiss▒
       │   │  mov    w19, #0x20000                   // #131072
  1.02 │   │↓ tbz    w22, #2, ffff0000083cd1ac <security_file_permiss▒
...
  0.68 │   └──cbnz   60

the cbz operand is missing, and the tbz doesn't get simplified processing
at all because the parsing function failed to match an address.

Aarch64 Simplified, After this patch applied:

       │    ↑ cbz    w0, 40
...
  4.44 │   │↓ tbnz   w0, #26, d0
...
  1.37 │   │↓ tbnz   w22, #5, e4
       │   │  mov    w19, #0x20000                   // #131072
  1.02 │   │↓ tbz    w22, #2, ec
...
  0.68 │   └──cbnz   w0, 60

Originally-by: Ravi Bangoria <[email protected]>
Tested-by: Ravi Bangoria <[email protected]>
Reported-by: Anton Blanchard <[email protected]>
Reported-by: Robin Murphy <[email protected]>
Signed-off-by: Kim Phillips <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Taeung Song <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
pcmoore pushed a commit that referenced this issue Jul 6, 2017
Since the introduction of .init_rq_fn() and .exit_rq_fn() it is
essential that the memory allocated for struct request_queue
stays around until all blk_exit_rl() calls have finished. Hence
make blk_init_rl() take a reference on struct request_queue.

This patch fixes the following crash:

general protection fault: 0000 [#2] SMP
CPU: 3 PID: 28 Comm: ksoftirqd/3 Tainted: G      D         4.12.0-rc2-dbg+ #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
task: ffff88013a108040 task.stack: ffffc9000071c000
RIP: 0010:free_request_size+0x1a/0x30
RSP: 0018:ffffc9000071fd38 EFLAGS: 00010202
RAX: 6b6b6b6b6b6b6b6b RBX: ffff880067362a88 RCX: 0000000000000003
RDX: ffff880067464178 RSI: ffff880067362a88 RDI: ffff880135ea4418
RBP: ffffc9000071fd40 R08: 0000000000000000 R09: 0000000100180009
R10: ffffc9000071fd38 R11: ffffffff81110800 R12: ffff88006752d3d8
R13: ffff88006752d3d8 R14: ffff88013a108040 R15: 000000000000000a
FS:  0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fa8ec1edb00 CR3: 0000000138ee8000 CR4: 00000000001406e0
Call Trace:
 mempool_destroy.part.10+0x21/0x40
 mempool_destroy+0xe/0x10
 blk_exit_rl+0x12/0x20
 blkg_free+0x4d/0xa0
 __blkg_release_rcu+0x59/0x170
 rcu_process_callbacks+0x260/0x4e0
 __do_softirq+0x116/0x250
 smpboot_thread_fn+0x123/0x1e0
 kthread+0x109/0x140
 ret_from_fork+0x31/0x40

Fixes: commit e9c787e ("scsi: allocate scsi_cmnd structures as part of struct request")
Signed-off-by: Bart Van Assche <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: <[email protected]> # v4.11+
Signed-off-by: Jens Axboe <[email protected]>
pcmoore pushed a commit that referenced this issue Jul 6, 2017
Yuval Mintz says:

====================
bnx2x: Fix malicious VFs indication

It was discovered that for a VF there's a simple [yet uncommon] scenario
which would cause device firmware to declare that VF as malicious -
Add a vlan interface on top of a VF and disable txvlan offloading for
that VF [causing VF to transmit packets where vlan is on payload].

Patch #1 corrects driver transmission to prevent this issue.
Patch #2 is a by-product correcting PF behavior once a VF is declared
malicious.
====================

Signed-off-by: David S. Miller <[email protected]>
pcmoore pushed a commit that referenced this issue Jul 6, 2017
When HSR interface is setup using ip link command, an annoying warning
appears with the trace as below:-

[  203.019828] hsr_get_node: Non-HSR frame
[  203.019833] Modules linked in:
[  203.019848] CPU: 0 PID: 158 Comm: sd-resolve Tainted: G        W       4.12.0-rc3-00052-g9fa6bf70 #2
[  203.019853] Hardware name: Generic DRA74X (Flattened Device Tree)
[  203.019869] [<c0110280>] (unwind_backtrace) from [<c010c2f4>] (show_stack+0x10/0x14)
[  203.019880] [<c010c2f4>] (show_stack) from [<c04b9f64>] (dump_stack+0xac/0xe0)
[  203.019894] [<c04b9f64>] (dump_stack) from [<c01374e8>] (__warn+0xd8/0x104)
[  203.019907] [<c01374e8>] (__warn) from [<c0137548>] (warn_slowpath_fmt+0x34/0x44)
root@am57xx-evm:~# [  203.019921] [<c0137548>] (warn_slowpath_fmt) from [<c081126c>] (hsr_get_node+0x148/0x170)
[  203.019932] [<c081126c>] (hsr_get_node) from [<c0814240>] (hsr_forward_skb+0x110/0x7c0)
[  203.019942] [<c0814240>] (hsr_forward_skb) from [<c0811d64>] (hsr_dev_xmit+0x2c/0x34)
[  203.019954] [<c0811d64>] (hsr_dev_xmit) from [<c06c0828>] (dev_hard_start_xmit+0xc4/0x3bc)
[  203.019963] [<c06c0828>] (dev_hard_start_xmit) from [<c06c13d8>] (__dev_queue_xmit+0x7c4/0x98c)
[  203.019974] [<c06c13d8>] (__dev_queue_xmit) from [<c0782f54>] (ip6_finish_output2+0x330/0xc1c)
[  203.019983] [<c0782f54>] (ip6_finish_output2) from [<c0788f0c>] (ip6_output+0x58/0x454)
[  203.019994] [<c0788f0c>] (ip6_output) from [<c07b16cc>] (mld_sendpack+0x420/0x744)

As this is an expected path to hsr_get_node() with frame coming from
the master interface, add a check to ensure packet is not from the
master port and then warn.

Signed-off-by: Murali Karicheri <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
pcmoore pushed a commit that referenced this issue Sep 5, 2017
Andre Wild reported the following warning:

  WARNING: CPU: 2 PID: 1205 at kernel/cpu.c:240 lockdep_assert_cpus_held+0x4c/0x60
  Modules linked in:
  CPU: 2 PID: 1205 Comm: bash Not tainted 4.13.0-rc2-00022-gfd2b2c57ec20 #10
  Hardware name: IBM 2964 N96 702 (z/VM 6.4.0)
  task: 00000000701d8100 task.stack: 0000000073594000
  Krnl PSW : 0704f00180000000 0000000000145e24 (lockdep_assert_cpus_held+0x4c/0x60)
  ...
  Call Trace:
   lockdep_assert_cpus_held+0x42/0x60)
   stop_machine_cpuslocked+0x62/0xf0
   build_all_zonelists+0x92/0x150
   numa_zonelist_order_handler+0x102/0x150
   proc_sys_call_handler.isra.12+0xda/0x118
   proc_sys_write+0x34/0x48
   __vfs_write+0x3c/0x178
   vfs_write+0xbc/0x1a0
   SyS_write+0x66/0xc0
   system_call+0xc4/0x2b0
   locks held by bash/1205:
   #0:  (sb_writers#4){.+.+.+}, at: vfs_write+0xa6/0x1a0
   #1:  (zl_order_mutex){+.+...}, at: numa_zonelist_order_handler+0x44/0x150
   #2:  (zonelists_mutex){+.+...}, at: numa_zonelist_order_handler+0xf4/0x150
  Last Breaking-Event-Address:
    lockdep_assert_cpus_held+0x48/0x60

This can be easily triggered with e.g.

    echo n > /proc/sys/vm/numa_zonelist_order

In commit 3f906ba ("mm/memory-hotplug: switch locking to a percpu
rwsem") memory hotplug locking was changed to fix a potential deadlock.

This also switched the stop_machine() invocation within
build_all_zonelists() to stop_machine_cpuslocked() which now expects
that online cpus are locked when being called.

This assumption is not true if build_all_zonelists() is being called
from numa_zonelist_order_handler().

In order to fix this simply add a mem_hotplug_begin()/mem_hotplug_done()
pair to numa_zonelist_order_handler().

Link: http://lkml.kernel.org/r/[email protected]
Fixes: 3f906ba ("mm/memory-hotplug: switch locking to a percpu rwsem")
Signed-off-by: Heiko Carstens <[email protected]>
Reported-by: Andre Wild <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
pcmoore pushed a commit that referenced this issue Sep 5, 2017
This patch fixes a bug associated with iscsit_reset_np_thread()
that can occur during parallel configfs rmdir of a single iscsi_np
used across multiple iscsi-target instances, that would result in
hung task(s) similar to below where configfs rmdir process context
was blocked indefinately waiting for iscsi_np->np_restart_comp
to finish:

[ 6726.112076] INFO: task dcp_proxy_node_:15550 blocked for more than 120 seconds.
[ 6726.119440]       Tainted: G        W  O     4.1.26-3321 #2
[ 6726.125045] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6726.132927] dcp_proxy_node_ D ffff8803f202bc88     0 15550      1 0x00000000
[ 6726.140058]  ffff8803f202bc88 ffff88085c64d960 ffff88083b3b1ad0 ffff88087fffeb08
[ 6726.147593]  ffff8803f202c000 7fffffffffffffff ffff88083f459c28 ffff88083b3b1ad0
[ 6726.155132]  ffff88035373c100 ffff8803f202bca8 ffffffff8168ced2 ffff8803f202bcb8
[ 6726.162667] Call Trace:
[ 6726.165150]  [<ffffffff8168ced2>] schedule+0x32/0x80
[ 6726.170156]  [<ffffffff8168f5b4>] schedule_timeout+0x214/0x290
[ 6726.176030]  [<ffffffff810caef2>] ? __send_signal+0x52/0x4a0
[ 6726.181728]  [<ffffffff8168d7d6>] wait_for_completion+0x96/0x100
[ 6726.187774]  [<ffffffff810e7c80>] ? wake_up_state+0x10/0x10
[ 6726.193395]  [<ffffffffa035d6e2>] iscsit_reset_np_thread+0x62/0xe0 [iscsi_target_mod]
[ 6726.201278]  [<ffffffffa0355d86>] iscsit_tpg_disable_portal_group+0x96/0x190 [iscsi_target_mod]
[ 6726.210033]  [<ffffffffa0363f7f>] lio_target_tpg_store_enable+0x4f/0xc0 [iscsi_target_mod]
[ 6726.218351]  [<ffffffff81260c5a>] configfs_write_file+0xaa/0x110
[ 6726.224392]  [<ffffffff811ea364>] vfs_write+0xa4/0x1b0
[ 6726.229576]  [<ffffffff811eb111>] SyS_write+0x41/0xb0
[ 6726.234659]  [<ffffffff8169042e>] system_call_fastpath+0x12/0x71

It would happen because each iscsit_reset_np_thread() sets state
to ISCSI_NP_THREAD_RESET, sends SIGINT, and then blocks waiting
for completion on iscsi_np->np_restart_comp.

However, if iscsi_np was active processing a login request and
more than a single iscsit_reset_np_thread() caller to the same
iscsi_np was blocked on iscsi_np->np_restart_comp, iscsi_np
kthread process context in __iscsi_target_login_thread() would
flush pending signals and only perform a single completion of
np->np_restart_comp before going back to sleep within transport
specific iscsit_transport->iscsi_accept_np code.

To address this bug, add a iscsi_np->np_reset_count and update
__iscsi_target_login_thread() to keep completing np->np_restart_comp
until ->np_reset_count has reached zero.

Reported-by: Gary Guo <[email protected]>
Tested-by: Gary Guo <[email protected]>
Cc: Mike Christie <[email protected]>
Cc: Hannes Reinecke <[email protected]>
Cc: [email protected] # 3.10+
Signed-off-by: Nicholas Bellinger <[email protected]>
pcmoore pushed a commit that referenced this issue Sep 5, 2017
Dean Jenkins says:

====================
asix: Improve robustness

Please consider taking these patches to improve the robustness of the ASIX USB
to Ethernet driver.

Failures prompting an ASIX driver code review
=============================================

On an ARM i.MX6 embedded platform some strange one-off and two-off failures were
observed in and around the ASIX USB to Ethernet driver. This was observed on a
highly modified kernel 3.14 with the ASIX driver containing back-ported changes
from kernel.org up to kernel 4.8 approximately.

a) A one-off failure in asix_rx_fixup_internal():

There was an occurrence of an attempt to write off the end of the netdev buffer
which was trapped by skb_over_panic() in skb_put().

[20030.846440] skbuff: skb_over_panic: text:7f2271c0 len:120 put:60 head:8366ecc0 data:8366ed02 tail:0x8366ed7a end:0x8366ed40 dev:eth0
[20030.863007] Kernel BUG at 8044ce38 [verbose debug info unavailable]

[20031.215345] Backtrace:
[20031.217884] [<8044cde0>] (skb_panic) from [<8044d50c>] (skb_put+0x50/0x5c)
[20031.227408] [<8044d4bc>] (skb_put) from [<7f2271c0>] (asix_rx_fixup_internal+0x1c4/0x23c [asix])
[20031.242024] [<7f226ffc>] (asix_rx_fixup_internal [asix]) from [<7f22724c>] (asix_rx_fixup_common+0x14/0x18 [asix])
[20031.260309] [<7f227238>] (asix_rx_fixup_common [asix]) from [<7f21f7d4>] (usbnet_bh+0x74/0x224 [usbnet])
[20031.269879] [<7f21f760>] (usbnet_bh [usbnet]) from [<8002f834>] (call_timer_fn+0xa4/0x1f0)
[20031.283961] [<8002f790>] (call_timer_fn) from [<80030834>] (run_timer_softirq+0x230/0x2a8)
[20031.302782] [<80030604>] (run_timer_softirq) from [<80028780>] (__do_softirq+0x15c/0x37c)
[20031.321511] [<80028624>] (__do_softirq) from [<80028c38>] (irq_exit+0x8c/0xe8)
[20031.339298] [<80028bac>] (irq_exit) from [<8000e9c8>] (handle_IRQ+0x8c/0xc8)
[20031.350038] [<8000e93c>] (handle_IRQ) from [<800085c8>] (gic_handle_irq+0xb8/0xf8)
[20031.365528] [<80008510>] (gic_handle_irq) from [<8050de80>] (__irq_svc+0x40/0x70)

Analysis of the logic of the ASIX driver (containing backported changes from
kernel.org up to kernel 4.8 approximately) suggested that the software could not
trigger skb_over_panic(). The analysis of the kernel BUG() crash information
suggested that the netdev buffer was written with 2 minimal 60 octet length
Ethernet frames (ASIX hardware drops the 4 octet FCS field) and the 2nd Ethernet
frame attempted to write off the end of the netdev buffer.

Note that the netdev buffer should only contain 1 Ethernet frame so if an
attempt to write 2 Ethernet frames into the buffer is made then that is wrong.
However, the logic of the asix_rx_fixup_internal() only allows 1 Ethernet frame
to be written into the netdev buffer.

Potentially this failure was due to memory corruption because it was only seen
once.

b) Two-off failures in the NAPI layer's backlog queue:

There were 2 crashes in the NAPI layer's backlog queue presumably after
asix_rx_fixup_internal() called usbnet_skb_return().

[24097.273945] Unable to handle kernel NULL pointer dereference at virtual address 00000004

[24097.398944] PC is at process_backlog+0x80/0x16c

[24097.569466] Backtrace:
[24097.572007] [<8045ad98>] (process_backlog) from [<8045b64c>] (net_rx_action+0xcc/0x248)
[24097.591631] [<8045b580>] (net_rx_action) from [<80028780>] (__do_softirq+0x15c/0x37c)
[24097.610022] [<80028624>] (__do_softirq) from [<800289cc>] (run_ksoftirqd+0x2c/0x84)

and

[ 1059.828452] Unable to handle kernel NULL pointer dereference at virtual address 00000000

[ 1059.953715] PC is at process_backlog+0x84/0x16c

[ 1060.140896] Backtrace:
[ 1060.143434] [<8045ad98>] (process_backlog) from [<8045b64c>] (net_rx_action+0xcc/0x248)
[ 1060.163075] [<8045b580>] (net_rx_action) from [<80028780>] (__do_softirq+0x15c/0x37c)
[ 1060.181474] [<80028624>] (__do_softirq) from [<80028c38>] (irq_exit+0x8c/0xe8)
[ 1060.199256] [<80028bac>] (irq_exit) from [<8000e9c8>] (handle_IRQ+0x8c/0xc8)
[ 1060.210006] [<8000e93c>] (handle_IRQ) from [<800085c8>] (gic_handle_irq+0xb8/0xf8)
[ 1060.225492] [<80008510>] (gic_handle_irq) from [<8050de80>] (__irq_svc+0x40/0x70)

The embedded board was only using an ASIX USB to Ethernet adaptor eth0.

Analysis suggested that the doubly-linked list pointers of the backlog queue had
been corrupted because one of the link pointers was NULL.

Potentially this failure was due to memory corruption because it was only seen
twice.

Results of the ASIX driver code review
======================================

During the code review some weaknesses were observed in the ASIX driver and the
following patches have been created to improve the robustness.

Brief overview of the patches
-----------------------------

1. asix: Add rx->ax_skb = NULL after usbnet_skb_return()

The current ASIX driver sends the received Ethernet frame to the NAPI layer of
the network stack via the call to usbnet_skb_return() in
asix_rx_fixup_internal() but retains the rx->ax_skb pointer to the netdev
buffer. The driver no longer needs the rx->ax_skb pointer at this point because
the NAPI layer now has the Ethernet frame.

This means that asix_rx_fixup_internal() must not use rx->ax_skb after the call
to usbnet_skb_return() because it could corrupt the handling of the Ethernet
frame within the network layer.

Therefore, to remove the risk of erroneous usage of rx->ax_skb, set rx->ax_skb
to NULL after the call to usbnet_skb_return(). This avoids potential erroneous
freeing of rx->ax_skb and erroneous writing to the netdev buffer.  If the
software now somehow inappropriately reused rx->ax_skb, then a NULL pointer
dereference of rx->ax_skb would occur which makes investigation easier.

2. asix: Ensure asix_rx_fixup_info members are all reset

This patch creates reset_asix_rx_fixup_info() to allow all the
asix_rx_fixup_info structure members to be consistently reset to initial
conditions.

Call reset_asix_rx_fixup_info() upon each detectable error condition so that the
next URB is processed from a known state.

Otherwise, there is a risk that some members of the asix_rx_fixup_info structure
may be incorrect after an error occurred so potentially leading to a
malfunction.

3. asix: Fix small memory leak in ax88772_unbind()

This patch creates asix_rx_fixup_common_free() to allow the rx->ax_skb to be
freed when necessary.

asix_rx_fixup_common_free() is called from ax88772_unbind() before the parent
private data structure is freed.

Without this patch, there is a risk of a small netdev buffer memory leak each
time ax88772_unbind() is called during the reception of an Ethernet frame that
spans across 2 URBs.

Testing
=======

The patches have been sanity tested on a 64-bit Linux laptop running kernel
4.13-rc2 with the 3 patches applied on top.

The ASIX USB to Adaptor used for testing was (output of lsusb):
ID 0b95:772b ASIX Electronics Corp. AX88772B

Test #1
-------

The test ran a flood ping test script which slowly incremented the ICMP Echo
Request's payload from 0 to 5000 octets. This eventually causes IPv4
fragmentation to occur which causes Ethernet frames to be sent very close to
each other so increases the probability that an Ethernet frame will span 2 URBs.
The test showed that all pings were successful. The test took about 15 minutes
to complete.

Test #2
-------

A script was run on the laptop to periodically run ifdown and ifup every second
so that the ASIX USB to Adaptor was up for 1 second and down for 1 second.

From a Linux PC connected to the laptop, the following ping command was used
ping -f -s 5000 <ip address of laptop>

The large ICMP payload causes IPv4 fragmentation resulting in multiple
Ethernet frames per original IP packet.

Kernel debug within the ASIX driver was enabled to see whether any ASIX errors
were generated. The test was run for about 24 hours and no ASIX errors were
seen.

Patches
=======

The 3 patches have been rebased off the net-next repo master branch with HEAD
fbbeefd net: fec: Allow reception of frames bigger than 1522 bytes
====================

Signed-off-by: David S. Miller <[email protected]>
pcmoore pushed a commit that referenced this issue Sep 5, 2017
MEI device performs link reset during system suspend sequence.
The link reset cannot be performed while device is in
runtime suspend state. The resume sequence is bypassed with
suspend direct complete optimization,so the optimization should be
disabled for mei devices.

Fixes:
 [  192.940537] Restarting tasks ...
 [  192.940610] PGI is not set
 [  192.940619] ------------[ cut here ]------------ [  192.940623]
 WARNING: CPU: 0
 me.c:653 mei_me_pg_exit_sync+0x351/0x360 [  192.940624] Modules
 linked
 in:
 [  192.940627] CPU: 0 PID: 1661 Comm: kworker/0:3 Not tainted
 4.13.0-rc2+
 #2 [  192.940628] Hardware name: Dell Inc. XPS 13 9343/0TM99H, BIOS
 A11
 12/08/2016 [  192.940630] Workqueue: pm pm_runtime_work <snip> [
 192.940642] Call Trace:
 [  192.940646]  ? pci_pme_active+0x1de/0x1f0 [  192.940649]  ?
 pci_restore_standard_config+0x50/0x50
 [  192.940651]  ? kfree+0x172/0x190
 [  192.940653]  ? kfree+0x172/0x190
 [  192.940655]  ? pci_restore_standard_config+0x50/0x50
 [  192.940663]  mei_me_pm_runtime_resume+0x3f/0xc0
 [  192.940665]  pci_pm_runtime_resume+0x7a/0xa0 [  192.940667]
 __rpm_callback+0xb9/0x1e0 [  192.940668]  ?
 preempt_count_add+0x6d/0xc0 [  192.940670]  rpm_callback+0x24/0x90 [
 192.940672]  ? pci_restore_standard_config+0x50/0x50
 [  192.940674]  rpm_resume+0x4e8/0x800 [  192.940676]
 pm_runtime_work+0x55/0xb0 [  192.940678]
 process_one_work+0x184/0x3e0 [  192.940680]
 worker_thread+0x4d/0x3a0 [ 192.940681]  ?
 preempt_count_sub+0x9b/0x100 [  192.940683]
 kthread+0x122/0x140 [  192.940684]  ? process_one_work+0x3e0/0x3e0 [
 192.940685]  ? __kthread_create_on_node+0x1a0/0x1a0
 [  192.940688]  ret_from_fork+0x27/0x40 [  192.940690] Code: 96 3a
 9e ff 48 8b 7d 98 e8 cd 21 58 00 83 bb bc 01 00 00
 04 0f 85 40 fe ff ff e9 41 fe ff ff 48 c7 c7 5f 04 99 96 e8 93 6b 9f
 ff <0f> ff e9 5d fd ff ff e8 33 fe 99 ff 0f 1f 00 0f 1f 44 00 00 55
 [  192.940719] ---[ end trace
 a86955597774ead8 ]--- [  192.942540] done.

Suggested-by: Rafael J. Wysocki <[email protected]>
Reported-by: Dominik Brodowski <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Dominik Brodowski <[email protected]>
Signed-off-by: Alexander Usyskin <[email protected]>
Signed-off-by: Tomas Winkler <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
pcmoore pushed a commit that referenced this issue Sep 5, 2017
pfkey_broadcast() might be called from non process contexts,
we can not use GFP_KERNEL in these cases [1].

This patch partially reverts commit ba51b6b ("net: Fix RCU splat in
af_key"), only keeping the GFP_ATOMIC forcing under rcu_read_lock()
section.

[1] : syzkaller reported :

in_atomic(): 1, irqs_disabled(): 0, pid: 2932, name: syzkaller183439
3 locks held by syzkaller183439/2932:
 #0:  (&net->xfrm.xfrm_cfg_mutex){+.+.+.}, at: [<ffffffff83b43888>] pfkey_sendmsg+0x4c8/0x9f0 net/key/af_key.c:3649
 #1:  (&pfk->dump_lock){+.+.+.}, at: [<ffffffff83b467f6>] pfkey_do_dump+0x76/0x3f0 net/key/af_key.c:293
 #2:  (&(&net->xfrm.xfrm_policy_lock)->rlock){+...+.}, at: [<ffffffff83957632>] spin_lock_bh include/linux/spinlock.h:304 [inline]
 #2:  (&(&net->xfrm.xfrm_policy_lock)->rlock){+...+.}, at: [<ffffffff83957632>] xfrm_policy_walk+0x192/0xa30 net/xfrm/xfrm_policy.c:1028
CPU: 0 PID: 2932 Comm: syzkaller183439 Not tainted 4.13.0-rc4+ #24
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:16 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:52
 ___might_sleep+0x2b2/0x470 kernel/sched/core.c:5994
 __might_sleep+0x95/0x190 kernel/sched/core.c:5947
 slab_pre_alloc_hook mm/slab.h:416 [inline]
 slab_alloc mm/slab.c:3383 [inline]
 kmem_cache_alloc+0x24b/0x6e0 mm/slab.c:3559
 skb_clone+0x1a0/0x400 net/core/skbuff.c:1037
 pfkey_broadcast_one+0x4b2/0x6f0 net/key/af_key.c:207
 pfkey_broadcast+0x4ba/0x770 net/key/af_key.c:281
 dump_sp+0x3d6/0x500 net/key/af_key.c:2685
 xfrm_policy_walk+0x2f1/0xa30 net/xfrm/xfrm_policy.c:1042
 pfkey_dump_sp+0x42/0x50 net/key/af_key.c:2695
 pfkey_do_dump+0xaa/0x3f0 net/key/af_key.c:299
 pfkey_spddump+0x1a0/0x210 net/key/af_key.c:2722
 pfkey_process+0x606/0x710 net/key/af_key.c:2814
 pfkey_sendmsg+0x4d6/0x9f0 net/key/af_key.c:3650
sock_sendmsg_nosec net/socket.c:633 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:643
 ___sys_sendmsg+0x755/0x890 net/socket.c:2035
 __sys_sendmsg+0xe5/0x210 net/socket.c:2069
 SYSC_sendmsg net/socket.c:2080 [inline]
 SyS_sendmsg+0x2d/0x50 net/socket.c:2076
 entry_SYSCALL_64_fastpath+0x1f/0xbe
RIP: 0033:0x445d79
RSP: 002b:00007f32447c1dc8 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000445d79
RDX: 0000000000000000 RSI: 000000002023dfc8 RDI: 0000000000000008
RBP: 0000000000000086 R08: 00007f32447c2700 R09: 00007f32447c2700
R10: 00007f32447c2700 R11: 0000000000000202 R12: 0000000000000000
R13: 00007ffe33edec4f R14: 00007f32447c29c0 R15: 0000000000000000

Fixes: ba51b6b ("net: Fix RCU splat in af_key")
Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: Dmitry Vyukov <[email protected]>
Cc: David Ahern <[email protected]>
Acked-by: David Ahern <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
pcmoore pushed a commit that referenced this issue Sep 5, 2017
syzkaller reported that DCCP could have a non empty
write queue at dismantle time.

WARNING: CPU: 1 PID: 2953 at net/core/stream.c:199 sk_stream_kill_queues+0x3ce/0x520 net/core/stream.c:199
Kernel panic - not syncing: panic_on_warn set ...

CPU: 1 PID: 2953 Comm: syz-executor0 Not tainted 4.13.0-rc4+ #2
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:16 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:52
 panic+0x1e4/0x417 kernel/panic.c:180
 __warn+0x1c4/0x1d9 kernel/panic.c:541
 report_bug+0x211/0x2d0 lib/bug.c:183
 fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:190
 do_trap_no_signal arch/x86/kernel/traps.c:224 [inline]
 do_trap+0x260/0x390 arch/x86/kernel/traps.c:273
 do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:310
 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:323
 invalid_op+0x1e/0x30 arch/x86/entry/entry_64.S:846
RIP: 0010:sk_stream_kill_queues+0x3ce/0x520 net/core/stream.c:199
RSP: 0018:ffff8801d182f108 EFLAGS: 00010297
RAX: ffff8801d1144140 RBX: ffff8801d13cb280 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff85137b00 RDI: ffff8801d13cb280
RBP: ffff8801d182f148 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801d13cb4d0
R13: ffff8801d13cb3b8 R14: ffff8801d13cb300 R15: ffff8801d13cb3b8
 inet_csk_destroy_sock+0x175/0x3f0 net/ipv4/inet_connection_sock.c:835
 dccp_close+0x84d/0xc10 net/dccp/proto.c:1067
 inet_release+0xed/0x1c0 net/ipv4/af_inet.c:425
 sock_release+0x8d/0x1e0 net/socket.c:597
 sock_close+0x16/0x20 net/socket.c:1126
 __fput+0x327/0x7e0 fs/file_table.c:210
 ____fput+0x15/0x20 fs/file_table.c:246
 task_work_run+0x18a/0x260 kernel/task_work.c:116
 exit_task_work include/linux/task_work.h:21 [inline]
 do_exit+0xa32/0x1b10 kernel/exit.c:865
 do_group_exit+0x149/0x400 kernel/exit.c:969
 get_signal+0x7e8/0x17e0 kernel/signal.c:2330
 do_signal+0x94/0x1ee0 arch/x86/kernel/signal.c:808
 exit_to_usermode_loop+0x21c/0x2d0 arch/x86/entry/common.c:157
 prepare_exit_to_usermode arch/x86/entry/common.c:194 [inline]
 syscall_return_slowpath+0x3a7/0x450 arch/x86/entry/common.c:263

Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: Dmitry Vyukov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
pcmoore pushed a commit that referenced this issue Sep 5, 2017
The following commit:

  39a0526 ("x86/mm: Factor out LDT init from context init")

renamed init_new_context() to init_new_context_ldt() and added a new
init_new_context() which calls init_new_context_ldt().  However, the
error code of init_new_context_ldt() was ignored.  Consequently, if a
memory allocation in alloc_ldt_struct() failed during a fork(), the
->context.ldt of the new task remained the same as that of the old task
(due to the memcpy() in dup_mm()).  ldt_struct's are not intended to be
shared, so a use-after-free occurred after one task exited.

Fix the bug by making init_new_context() pass through the error code of
init_new_context_ldt().

This bug was found by syzkaller, which encountered the following splat:

    BUG: KASAN: use-after-free in free_ldt_struct.part.2+0x10a/0x150 arch/x86/kernel/ldt.c:116
    Read of size 4 at addr ffff88006d2cb7c8 by task kworker/u9:0/3710

    CPU: 1 PID: 3710 Comm: kworker/u9:0 Not tainted 4.13.0-rc4-next-20170811 #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
     __dump_stack lib/dump_stack.c:16 [inline]
     dump_stack+0x194/0x257 lib/dump_stack.c:52
     print_address_description+0x73/0x250 mm/kasan/report.c:252
     kasan_report_error mm/kasan/report.c:351 [inline]
     kasan_report+0x24e/0x340 mm/kasan/report.c:409
     __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:429
     free_ldt_struct.part.2+0x10a/0x150 arch/x86/kernel/ldt.c:116
     free_ldt_struct arch/x86/kernel/ldt.c:173 [inline]
     destroy_context_ldt+0x60/0x80 arch/x86/kernel/ldt.c:171
     destroy_context arch/x86/include/asm/mmu_context.h:157 [inline]
     __mmdrop+0xe9/0x530 kernel/fork.c:889
     mmdrop include/linux/sched/mm.h:42 [inline]
     exec_mmap fs/exec.c:1061 [inline]
     flush_old_exec+0x173c/0x1ff0 fs/exec.c:1291
     load_elf_binary+0x81f/0x4ba0 fs/binfmt_elf.c:855
     search_binary_handler+0x142/0x6b0 fs/exec.c:1652
     exec_binprm fs/exec.c:1694 [inline]
     do_execveat_common.isra.33+0x1746/0x22e0 fs/exec.c:1816
     do_execve+0x31/0x40 fs/exec.c:1860
     call_usermodehelper_exec_async+0x457/0x8f0 kernel/umh.c:100
     ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431

    Allocated by task 3700:
     save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
     save_stack+0x43/0xd0 mm/kasan/kasan.c:447
     set_track mm/kasan/kasan.c:459 [inline]
     kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
     kmem_cache_alloc_trace+0x136/0x750 mm/slab.c:3627
     kmalloc include/linux/slab.h:493 [inline]
     alloc_ldt_struct+0x52/0x140 arch/x86/kernel/ldt.c:67
     write_ldt+0x7b7/0xab0 arch/x86/kernel/ldt.c:277
     sys_modify_ldt+0x1ef/0x240 arch/x86/kernel/ldt.c:307
     entry_SYSCALL_64_fastpath+0x1f/0xbe

    Freed by task 3700:
     save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
     save_stack+0x43/0xd0 mm/kasan/kasan.c:447
     set_track mm/kasan/kasan.c:459 [inline]
     kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
     __cache_free mm/slab.c:3503 [inline]
     kfree+0xca/0x250 mm/slab.c:3820
     free_ldt_struct.part.2+0xdd/0x150 arch/x86/kernel/ldt.c:121
     free_ldt_struct arch/x86/kernel/ldt.c:173 [inline]
     destroy_context_ldt+0x60/0x80 arch/x86/kernel/ldt.c:171
     destroy_context arch/x86/include/asm/mmu_context.h:157 [inline]
     __mmdrop+0xe9/0x530 kernel/fork.c:889
     mmdrop include/linux/sched/mm.h:42 [inline]
     __mmput kernel/fork.c:916 [inline]
     mmput+0x541/0x6e0 kernel/fork.c:927
     copy_process.part.36+0x22e1/0x4af0 kernel/fork.c:1931
     copy_process kernel/fork.c:1546 [inline]
     _do_fork+0x1ef/0xfb0 kernel/fork.c:2025
     SYSC_clone kernel/fork.c:2135 [inline]
     SyS_clone+0x37/0x50 kernel/fork.c:2129
     do_syscall_64+0x26c/0x8c0 arch/x86/entry/common.c:287
     return_from_SYSCALL_64+0x0/0x7a

Here is a C reproducer:

    #include <asm/ldt.h>
    #include <pthread.h>
    #include <signal.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void *fork_thread(void *_arg)
    {
        fork();
    }

    int main(void)
    {
        struct user_desc desc = { .entry_number = 8191 };

        syscall(__NR_modify_ldt, 1, &desc, sizeof(desc));

        for (;;) {
            if (fork() == 0) {
                pthread_t t;

                srand(getpid());
                pthread_create(&t, NULL, fork_thread, NULL);
                usleep(rand() % 10000);
                syscall(__NR_exit_group, 0);
            }
            wait(NULL);
        }
    }

Note: the reproducer takes advantage of the fact that alloc_ldt_struct()
may use vmalloc() to allocate a large ->entries array, and after
commit:

  5d17a73 ("vmalloc: back off when the current task is killed")

it is possible for userspace to fail a task's vmalloc() by
sending a fatal signal, e.g. via exit_group().  It would be more
difficult to reproduce this bug on kernels without that commit.

This bug only affected kernels with CONFIG_MODIFY_LDT_SYSCALL=y.

Signed-off-by: Eric Biggers <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Cc: <[email protected]> [v4.6+]
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: Dmitry Vyukov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Fixes: 39a0526 ("x86/mm: Factor out LDT init from context init")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
pcmoore pushed a commit that referenced this issue Dec 2, 2024
…ndex

Intel SoundWire machine driver always uses Pin number 2 and above.
Currently, the pin number is used as the FW DAI index directly. As a
result, FW DAI 0 and 1 are never used. That worked fine because we use
up to 2 DAIs in a SDW link. Convert the topology pin index to ALH dai
index, the mapping is using 2-off indexing, iow, pin #2 is ALH dai #0.

The issue exists since beginning. And the Fixes tag is the first commit
that this commit can be applied.

Fixes: b66bfc3 ("ASoC: SOF: sof-audio: Fix broken early bclk feature for SSP")
Signed-off-by: Bard Liao <[email protected]>
Reviewed-by: Péter Ujfalusi <[email protected]>
Reviewed-by: Liam Girdwood <[email protected]>
Reviewed-by: Kai Vehmanen <[email protected]>
Reviewed-by: Ranjani Sridharan <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Mark Brown <[email protected]>
pcmoore pushed a commit that referenced this issue Dec 2, 2024
…to HEAD

KVM/riscv changes for 6.13 part #2

- Svade and Svadu extension support for Host and Guest/VM
pcmoore pushed a commit that referenced this issue Jan 20, 2025
Since the netlink attribute range validation provides inclusive
checking, the *max* of attribute NL80211_ATTR_MLO_LINK_ID should be
IEEE80211_MLD_MAX_NUM_LINKS - 1 otherwise causing an off-by-one.

One crash stack for demonstration:
==================================================================
BUG: KASAN: wild-memory-access in ieee80211_tx_control_port+0x3b6/0xca0 net/mac80211/tx.c:5939
Read of size 6 at addr 001102080000000c by task fuzzer.386/9508

CPU: 1 PID: 9508 Comm: syz.1.386 Not tainted 6.1.70 #2
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x177/0x231 lib/dump_stack.c:106
 print_report+0xe0/0x750 mm/kasan/report.c:398
 kasan_report+0x139/0x170 mm/kasan/report.c:495
 kasan_check_range+0x287/0x290 mm/kasan/generic.c:189
 memcpy+0x25/0x60 mm/kasan/shadow.c:65
 ieee80211_tx_control_port+0x3b6/0xca0 net/mac80211/tx.c:5939
 rdev_tx_control_port net/wireless/rdev-ops.h:761 [inline]
 nl80211_tx_control_port+0x7b3/0xc40 net/wireless/nl80211.c:15453
 genl_family_rcv_msg_doit+0x22e/0x320 net/netlink/genetlink.c:756
 genl_family_rcv_msg net/netlink/genetlink.c:833 [inline]
 genl_rcv_msg+0x539/0x740 net/netlink/genetlink.c:850
 netlink_rcv_skb+0x1de/0x420 net/netlink/af_netlink.c:2508
 genl_rcv+0x24/0x40 net/netlink/genetlink.c:861
 netlink_unicast_kernel net/netlink/af_netlink.c:1326 [inline]
 netlink_unicast+0x74b/0x8c0 net/netlink/af_netlink.c:1352
 netlink_sendmsg+0x882/0xb90 net/netlink/af_netlink.c:1874
 sock_sendmsg_nosec net/socket.c:716 [inline]
 __sock_sendmsg net/socket.c:728 [inline]
 ____sys_sendmsg+0x5cc/0x8f0 net/socket.c:2499
 ___sys_sendmsg+0x21c/0x290 net/socket.c:2553
 __sys_sendmsg net/socket.c:2582 [inline]
 __do_sys_sendmsg net/socket.c:2591 [inline]
 __se_sys_sendmsg+0x19e/0x270 net/socket.c:2589
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x45/0x90 arch/x86/entry/common.c:81
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Update the policy to ensure correct validation.

Fixes: 7b0a0e3 ("wifi: cfg80211: do some rework towards MLO link APIs")
Signed-off-by: Lin Ma <[email protected]>
Suggested-by: Cengiz Can <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Johannes Berg <[email protected]>
pcmoore pushed a commit that referenced this issue Jan 20, 2025
Its used from trace__run(), for the 'perf trace' live mode, i.e. its
strace-like, non-perf.data file processing mode, the most common one.

The trace__run() function will set trace->host using machine__new_host()
that is supposed to give a machine instance representing the running
machine, and since we'll use perf_env__arch_strerrno() to get the right
errno -> string table, we need to use machine->env, so initialize it in
machine__new_host().

Before the patch:

  (gdb) run trace --errno-summary -a sleep 1
  <SNIP>
   Summary of events:

   gvfs-afc-volume (3187), 2 events, 0.0%

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     pselect6               1      0     0.000     0.000     0.000     0.000      0.00%

   GUsbEventThread (3519), 2 events, 0.0%

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     poll                   1      0     0.000     0.000     0.000     0.000      0.00%
  <SNIP>
  Program received signal SIGSEGV, Segmentation fault.
  0x00000000005caba0 in perf_env__arch_strerrno (env=0x0, err=110) at util/env.c:478
  478		if (env->arch_strerrno == NULL)
  (gdb) bt
  #0  0x00000000005caba0 in perf_env__arch_strerrno (env=0x0, err=110) at util/env.c:478
  #1  0x00000000004b75d2 in thread__dump_stats (ttrace=0x14f58f0, trace=0x7fffffffa5b0, fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>) at builtin-trace.c:4673
  #2  0x00000000004b78bf in trace__fprintf_thread (fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>, thread=0x10fa0b0, trace=0x7fffffffa5b0) at builtin-trace.c:4708
  #3  0x00000000004b7ad9 in trace__fprintf_thread_summary (trace=0x7fffffffa5b0, fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>) at builtin-trace.c:4747
  #4  0x00000000004b656e in trace__run (trace=0x7fffffffa5b0, argc=2, argv=0x7fffffffde60) at builtin-trace.c:4456
  #5  0x00000000004ba43e in cmd_trace (argc=2, argv=0x7fffffffde60) at builtin-trace.c:5487
  #6  0x00000000004c0414 in run_builtin (p=0xec3068 <commands+648>, argc=5, argv=0x7fffffffde60) at perf.c:351
  #7  0x00000000004c06bb in handle_internal_command (argc=5, argv=0x7fffffffde60) at perf.c:404
  #8  0x00000000004c0814 in run_argv (argcp=0x7fffffffdc4c, argv=0x7fffffffdc40) at perf.c:448
  #9  0x00000000004c0b5d in main (argc=5, argv=0x7fffffffde60) at perf.c:560
  (gdb)

After:

  root@number:~# perf trace -a --errno-summary sleep 1
  <SNIP>
     pw-data-loop (2685), 1410 events, 16.0%

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     epoll_wait           188      0   983.428     0.000     5.231    15.595      8.68%
     ioctl                 94      0     0.811     0.004     0.009     0.016      2.82%
     read                 188      0     0.322     0.001     0.002     0.006      5.15%
     write                141      0     0.280     0.001     0.002     0.018      8.39%
     timerfd_settime       94      0     0.138     0.001     0.001     0.007      6.47%

   gnome-control-c (179406), 1848 events, 20.9%

     syscall            calls  errors  total       min       avg       max       stddev
                                       (msec)    (msec)    (msec)    (msec)        (%)
     --------------- --------  ------ -------- --------- --------- ---------     ------
     poll                 222      0   959.577     0.000     4.322    21.414     11.40%
     recvmsg              150      0     0.539     0.001     0.004     0.013      5.12%
     write                300      0     0.442     0.001     0.001     0.007      3.29%
     read                 150      0     0.183     0.001     0.001     0.009      5.53%
     getpid               102      0     0.101     0.000     0.001     0.008      7.82%

  root@number:~#

Fixes: 54373b5 ("perf env: Introduce perf_env__arch_strerrno()")
Reported-by: Veronika Molnarova <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
Acked-by: Veronika Molnarova <[email protected]>
Acked-by: Michael Petlan <[email protected]>
Tested-by: Michael Petlan <[email protected]>
Link: https://lore.kernel.org/r/Z0XffUgNSv_9OjOi@x1
Signed-off-by: Namhyung Kim <[email protected]>
pcmoore pushed a commit that referenced this issue Jan 20, 2025
Kernel will hang on destroy admin_q while we create ctrl failed, such
as following calltrace:

PID: 23644    TASK: ff2d52b40f439fc0  CPU: 2    COMMAND: "nvme"
 #0 [ff61d23de260fb78] __schedule at ffffffff8323bc15
 #1 [ff61d23de260fc08] schedule at ffffffff8323c014
 #2 [ff61d23de260fc28] blk_mq_freeze_queue_wait at ffffffff82a3dba1
 #3 [ff61d23de260fc78] blk_freeze_queue at ffffffff82a4113a
 #4 [ff61d23de260fc90] blk_cleanup_queue at ffffffff82a33006
 #5 [ff61d23de260fcb0] nvme_rdma_destroy_admin_queue at ffffffffc12686ce
 #6 [ff61d23de260fcc8] nvme_rdma_setup_ctrl at ffffffffc1268ced
 #7 [ff61d23de260fd28] nvme_rdma_create_ctrl at ffffffffc126919b
 #8 [ff61d23de260fd68] nvmf_dev_write at ffffffffc024f362
 #9 [ff61d23de260fe38] vfs_write at ffffffff827d5f25
    RIP: 00007fda7891d574  RSP: 00007ffe2ef06958  RFLAGS: 00000202
    RAX: ffffffffffffffda  RBX: 000055e8122a4d90  RCX: 00007fda7891d574
    RDX: 000000000000012b  RSI: 000055e8122a4d90  RDI: 0000000000000004
    RBP: 00007ffe2ef079c0   R8: 000000000000012b   R9: 000055e8122a4d90
    R10: 0000000000000000  R11: 0000000000000202  R12: 0000000000000004
    R13: 000055e8122923c0  R14: 000000000000012b  R15: 00007fda78a54500
    ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

This due to we have quiesced admi_q before cancel requests, but forgot
to unquiesce before destroy it, as a result we fail to drain the
pending requests, and hang on blk_mq_freeze_queue_wait() forever. Here
try to reuse nvme_rdma_teardown_admin_queue() to fix this issue and
simplify the code.

Fixes: 958dc1d ("nvme-rdma: add clean action for failed reconnection")
Reported-by: Yingfu.zhou <[email protected]>
Signed-off-by: Chunguang.xu <[email protected]>
Signed-off-by: Yue.zhao <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Hannes Reinecke <[email protected]>
Signed-off-by: Keith Busch <[email protected]>
pcmoore pushed a commit that referenced this issue Jan 20, 2025
Konstantin Shkolnyy says:

====================
vsock/test: fix wrong setsockopt() parameters

Parameters were created using wrong C types, which caused them to be of
wrong size on some architectures, causing problems.

The problem with SO_RCVLOWAT was found on s390 (big endian), while x86-64
didn't show it. After the fix, all tests pass on s390.
Then Stefano Garzarella pointed out that SO_VM_SOCKETS_* calls might have
a similar problem, which turned out to be true, hence, the second patch.

Changes for v8:
- Fix whitespace warnings from "checkpatch.pl --strict"
- Add maintainers to Cc:
Changes for v7:
- Rebase on top of https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git
- Add the "net" tags to the subjects
Changes for v6:
- rework the patch #3 to avoid creating a new file for new functions,
and exclude vsock_perf from calling the new functions.
- add "Reviewed-by:" to the patch #2.
Changes for v5:
- in the patch #2 replace the introduced uint64_t with unsigned long long
to match documentation
- add a patch #3 that verifies every setsockopt() call.
Changes for v4:
- add "Reviewed-by:" to the first patch, and add a second patch fixing
SO_VM_SOCKETS_* calls, which depends on the first one (hence, it's now
a patch series.)
Changes for v3:
- fix the same problem in vsock_perf and update commit message
Changes for v2:
- add "Fixes:" lines to the commit message
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
pcmoore pushed a commit that referenced this issue Jan 20, 2025
Add more test cases for LPM trie in test_maps:

1) test_lpm_trie_update_flags
It constructs various use cases for BPF_EXIST and BPF_NOEXIST and check
whether the return value of update operation is expected.

2) test_lpm_trie_update_full_maps
It tests the update operations on a full LPM trie map. Adding new node
will fail and overwriting the value of existed node will succeed.

3) test_lpm_trie_iterate_strs and test_lpm_trie_iterate_ints
There two test cases test whether the iteration through get_next_key is
sorted and expected. These two test cases delete the minimal key after
each iteration and check whether next iteration returns the second
minimal key. The only difference between these two test cases is the
former one saves strings in the LPM trie and the latter saves integers.
Without the fix of get_next_key, these two cases will fail as shown
below:
  test_lpm_trie_iterate_strs(1091):FAIL:iterate #2 got abc exp abS
  test_lpm_trie_iterate_ints(1142):FAIL:iterate #1 got 0x2 exp 0x1

Signed-off-by: Hou Tao <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
pcmoore pushed a commit that referenced this issue Jan 20, 2025
Hou Tao says:

====================
This patch set fixes several issues for LPM trie. These issues were
found during adding new test cases or were reported by syzbot.

The patch set is structured as follows:

Patch #1~#2 are clean-ups for lpm_trie_update_elem().
Patch #3 handles BPF_EXIST and BPF_NOEXIST correctly for LPM trie.
Patch #4 fixes the accounting of n_entries when doing in-place update.
Patch #5 fixes the exact match condition in trie_get_next_key() and it
may skip keys when the passed key is not found in the map.
Patch #6~#7 switch from kmalloc() to bpf memory allocator for LPM trie
to fix several lock order warnings reported by syzbot. It also enables
raw_spinlock_t for LPM trie again. After these changes, the LPM trie will
be closer to being usable in any context (though the reentrance check of
trie->lock is still missing, but it is on my todo list).
Patch #8: move test_lpm_map to map_tests to make it run regularly.
Patch #9: add test cases for the issues fixed by patch #3~#5.

Please see individual patches for more details. Comments are always
welcome.

Change Log:
v3:
  * patch #2: remove the unnecessary NULL-init for im_node
  * patch #6: alloc the leaf node before disabling IRQ to low
    the possibility of -ENOMEM when leaf_size is large; Free
    these nodes outside the trie lock (Suggested by Alexei)
  * collect review and ack tags (Thanks for Toke & Daniel)

v2: https://lore.kernel.org/bpf/[email protected]/
  * collect review tags (Thanks for Toke)
  * drop "Add bpf_mem_cache_is_mergeable() helper" patch
  * patch #3~#4: add fix tag
  * patch #4: rename the helper to trie_check_add_elem() and increase
    n_entries in it.
  * patch #6: use one bpf mem allocator and update commit message to
    clarify that using bpf mem allocator is more appropriate.
  * patch #7: update commit message to add the possible max running time
    for update operation.
  * patch #9: update commit message to specify the purpose of these test
    cases.

v1: https://lore.kernel.org/bpf/[email protected]/
====================

Link: https://lore.kernel.org/all/[email protected]/
Signed-off-by: Alexei Starovoitov <[email protected]>
pcmoore pushed a commit that referenced this issue Jan 20, 2025
When virtnet_close is followed by virtnet_open, some TX completions can
possibly remain unconsumed, until they are finally processed during the
first NAPI poll after the netdev_tx_reset_queue(), resulting in a crash
[1]. Commit b96ed2c ("virtio_net: move netdev_tx_reset_queue() call
before RX napi enable") was not sufficient to eliminate all BQL crash
cases for virtio-net.

This issue can be reproduced with the latest net-next master by running:
`while :; do ip l set DEV down; ip l set DEV up; done` under heavy network
TX load from inside the machine.

netdev_tx_reset_queue() can actually be dropped from virtnet_open path;
the device is not stopped in any case. For BQL core part, it's just like
traffic nearly ceases to exist for some period. For stall detector added
to BQL, even if virtnet_close could somehow lead to some TX completions
delayed for long, followed by virtnet_open, we can just take it as stall
as mentioned in commit 6025b91 ("net: dqs: add NIC stall detector
based on BQL"). Note also that users can still reset stall_max via sysfs.

So, drop netdev_tx_reset_queue() from virtnet_enable_queue_pair(). This
eliminates the BQL crashes. As a result, netdev_tx_reset_queue() is now
explicitly required in freeze/restore path. This patch adds it to
immediately after free_unused_bufs(), following the rule of thumb:
netdev_tx_reset_queue() should follow any SKB freeing not followed by
netdev_tx_completed_queue(). This seems the most consistent and
streamlined approach, and now netdev_tx_reset_queue() runs whenever
free_unused_bufs() is done.

[1]:
------------[ cut here ]------------
kernel BUG at lib/dynamic_queue_limits.c:99!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 7 UID: 0 PID: 1598 Comm: ip Tainted: G    N 6.12.0net-next_main+ #2
Tainted: [N]=TEST
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), \
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
RIP: 0010:dql_completed+0x26b/0x290
Code: b7 c2 49 89 e9 44 89 da 89 c6 4c 89 d7 e8 ed 17 47 00 58 65 ff 0d
4d 27 90 7e 0f 85 fd fe ff ff e8 ea 53 8d ff e9 f3 fe ff ff <0f> 0b 01
d2 44 89 d1 29 d1 ba 00 00 00 00 0f 48 ca e9 28 ff ff ff
RSP: 0018:ffffc900002b0d08 EFLAGS: 00010297
RAX: 0000000000000000 RBX: ffff888102398c80 RCX: 0000000080190009
RDX: 0000000000000000 RSI: 000000000000006a RDI: 0000000000000000
RBP: ffff888102398c00 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000000000ca R11: 0000000000015681 R12: 0000000000000001
R13: ffffc900002b0d68 R14: ffff88811115e000 R15: ffff8881107aca40
FS:  00007f41ded69500(0000) GS:ffff888667dc0000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000556ccc2dc1a0 CR3: 0000000104fd8003 CR4: 0000000000772ef0
PKRU: 55555554
Call Trace:
 <IRQ>
 ? die+0x32/0x80
 ? do_trap+0xd9/0x100
 ? dql_completed+0x26b/0x290
 ? dql_completed+0x26b/0x290
 ? do_error_trap+0x6d/0xb0
 ? dql_completed+0x26b/0x290
 ? exc_invalid_op+0x4c/0x60
 ? dql_completed+0x26b/0x290
 ? asm_exc_invalid_op+0x16/0x20
 ? dql_completed+0x26b/0x290
 __free_old_xmit+0xff/0x170 [virtio_net]
 free_old_xmit+0x54/0xc0 [virtio_net]
 virtnet_poll+0xf4/0xe30 [virtio_net]
 ? __update_load_avg_cfs_rq+0x264/0x2d0
 ? update_curr+0x35/0x260
 ? reweight_entity+0x1be/0x260
 __napi_poll.constprop.0+0x28/0x1c0
 net_rx_action+0x329/0x420
 ? enqueue_hrtimer+0x35/0x90
 ? trace_hardirqs_on+0x1d/0x80
 ? kvm_sched_clock_read+0xd/0x20
 ? sched_clock+0xc/0x30
 ? kvm_sched_clock_read+0xd/0x20
 ? sched_clock+0xc/0x30
 ? sched_clock_cpu+0xd/0x1a0
 handle_softirqs+0x138/0x3e0
 do_softirq.part.0+0x89/0xc0
 </IRQ>
 <TASK>
 __local_bh_enable_ip+0xa7/0xb0
 virtnet_open+0xc8/0x310 [virtio_net]
 __dev_open+0xfa/0x1b0
 __dev_change_flags+0x1de/0x250
 dev_change_flags+0x22/0x60
 do_setlink.isra.0+0x2df/0x10b0
 ? rtnetlink_rcv_msg+0x34f/0x3f0
 ? netlink_rcv_skb+0x54/0x100
 ? netlink_unicast+0x23e/0x390
 ? netlink_sendmsg+0x21e/0x490
 ? ____sys_sendmsg+0x31b/0x350
 ? avc_has_perm_noaudit+0x67/0xf0
 ? cred_has_capability.isra.0+0x75/0x110
 ? __nla_validate_parse+0x5f/0xee0
 ? __pfx___probestub_irq_enable+0x3/0x10
 ? __create_object+0x5e/0x90
 ? security_capable+0x3b/0x70
 rtnl_newlink+0x784/0xaf0
 ? avc_has_perm_noaudit+0x67/0xf0
 ? cred_has_capability.isra.0+0x75/0x110
 ? stack_depot_save_flags+0x24/0x6d0
 ? __pfx_rtnl_newlink+0x10/0x10
 rtnetlink_rcv_msg+0x34f/0x3f0
 ? do_syscall_64+0x6c/0x180
 ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
 ? __pfx_rtnetlink_rcv_msg+0x10/0x10
 netlink_rcv_skb+0x54/0x100
 netlink_unicast+0x23e/0x390
 netlink_sendmsg+0x21e/0x490
 ____sys_sendmsg+0x31b/0x350
 ? copy_msghdr_from_user+0x6d/0xa0
 ___sys_sendmsg+0x86/0xd0
 ? __pte_offset_map+0x17/0x160
 ? preempt_count_add+0x69/0xa0
 ? __call_rcu_common.constprop.0+0x147/0x610
 ? preempt_count_add+0x69/0xa0
 ? preempt_count_add+0x69/0xa0
 ? _raw_spin_trylock+0x13/0x60
 ? trace_hardirqs_on+0x1d/0x80
 __sys_sendmsg+0x66/0xc0
 do_syscall_64+0x6c/0x180
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f41defe5b34
Code: 15 e1 12 0f 00 f7 d8 64 89 02 b8 ff ff ff ff eb bf 0f 1f 44 00 00
f3 0f 1e fa 80 3d 35 95 0f 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00
f0 ff ff 77 4c c3 0f 1f 00 55 48 89 e5 48 83 ec 20 89 55
RSP: 002b:00007ffe5336ecc8 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f41defe5b34
RDX: 0000000000000000 RSI: 00007ffe5336ed30 RDI: 0000000000000003
RBP: 00007ffe5336eda0 R08: 0000000000000010 R09: 0000000000000001
R10: 00007ffe5336f6f9 R11: 0000000000000202 R12: 0000000000000003
R13: 0000000067452259 R14: 0000556ccc28b040 R15: 0000000000000000
 </TASK>
[...]

Fixes: c8bd1f7 ("virtio_net: add support for Byte Queue Limits")
Cc: <[email protected]> # v6.11+
Signed-off-by: Koichiro Den <[email protected]>
Acked-by: Jason Wang <[email protected]>
Reviewed-by: Xuan Zhuo <[email protected]>
[ pabeni: trimmed possibly troublesome separator ]
Signed-off-by: Paolo Abeni <[email protected]>
pcmoore pushed a commit that referenced this issue Jan 20, 2025
…nts'

Koichiro Den says:

====================
virtio_net: correct netdev_tx_reset_queue() invocation points

When virtnet_close is followed by virtnet_open, some TX completions can
possibly remain unconsumed, until they are finally processed during the
first NAPI poll after the netdev_tx_reset_queue(), resulting in a crash
[1]. Commit b96ed2c ("virtio_net: move netdev_tx_reset_queue() call
before RX napi enable") was not sufficient to eliminate all BQL crash
scenarios for virtio-net.

This issue can be reproduced with the latest net-next master by running:
`while :; do ip l set DEV down; ip l set DEV up; done` under heavy network
TX load from inside the machine.

This patch series resolves the issue and also addresses similar existing
problems:

(a). Drop netdev_tx_reset_queue() from open/close path. This eliminates the
     BQL crashes due to the problematic open/close path.

(b). As a result of (a), netdev_tx_reset_queue() is now explicitly required
     in freeze/restore path. Add netdev_tx_reset_queue() immediately after
     free_unused_bufs() invocation.

(c). Fix missing resetting in virtnet_tx_resize().
     virtnet_tx_resize() has lacked proper resetting since commit
     c8bd1f7 ("virtio_net: add support for Byte Queue Limits").

(d). Fix missing resetting in the XDP_SETUP_XSK_POOL path.
     Similar to (c), this path lacked proper resetting. Call
     netdev_tx_reset_queue() when virtqueue_reset() has actually recycled
     unused buffers.

This patch series consists of six commits:
  [1/6]: Resolves (a) and (b).                      # also -stable 6.11.y
  [2/6]: Minor fix to make [4/6] streamlined.
  [3/6]: Prerequisite for (c).                      # also -stable 6.11.y
  [4/6]: Resolves (c) (incl. Prerequisite for (d))  # also -stable 6.11.y
  [5/6]: Preresuisite for (d).
  [6/6]: Resolves (d).

Changes for v4:
  - move netdev_tx_reset_queue() out of free_unused_bufs()
  - submit to net, not net-next
Changes for v3:
  - replace 'flushed' argument with 'recycle_done'
Changes for v2:
  - add tx queue resetting for (b) to (d) above

v3: https://lore.kernel.org/all/[email protected]/
v2: https://lore.kernel.org/all/[email protected]/
v1: https://lore.kernel.org/all/[email protected]/

[1]:
------------[ cut here ]------------
kernel BUG at lib/dynamic_queue_limits.c:99!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 7 UID: 0 PID: 1598 Comm: ip Tainted: G    N 6.12.0net-next_main+ #2
Tainted: [N]=TEST
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), \
BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
RIP: 0010:dql_completed+0x26b/0x290
Code: b7 c2 49 89 e9 44 89 da 89 c6 4c 89 d7 e8 ed 17 47 00 58 65 ff 0d
4d 27 90 7e 0f 85 fd fe ff ff e8 ea 53 8d ff e9 f3 fe ff ff <0f> 0b 01
d2 44 89 d1 29 d1 ba 00 00 00 00 0f 48 ca e9 28 ff ff ff
RSP: 0018:ffffc900002b0d08 EFLAGS: 00010297
RAX: 0000000000000000 RBX: ffff888102398c80 RCX: 0000000080190009
RDX: 0000000000000000 RSI: 000000000000006a RDI: 0000000000000000
RBP: ffff888102398c00 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000000000ca R11: 0000000000015681 R12: 0000000000000001
R13: ffffc900002b0d68 R14: ffff88811115e000 R15: ffff8881107aca40
FS:  00007f41ded69500(0000) GS:ffff888667dc0000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000556ccc2dc1a0 CR3: 0000000104fd8003 CR4: 0000000000772ef0
PKRU: 55555554
Call Trace:
 <IRQ>
 ? die+0x32/0x80
 ? do_trap+0xd9/0x100
 ? dql_completed+0x26b/0x290
 ? dql_completed+0x26b/0x290
 ? do_error_trap+0x6d/0xb0
 ? dql_completed+0x26b/0x290
 ? exc_invalid_op+0x4c/0x60
 ? dql_completed+0x26b/0x290
 ? asm_exc_invalid_op+0x16/0x20
 ? dql_completed+0x26b/0x290
 __free_old_xmit+0xff/0x170 [virtio_net]
 free_old_xmit+0x54/0xc0 [virtio_net]
 virtnet_poll+0xf4/0xe30 [virtio_net]
 ? __update_load_avg_cfs_rq+0x264/0x2d0
 ? update_curr+0x35/0x260
 ? reweight_entity+0x1be/0x260
 __napi_poll.constprop.0+0x28/0x1c0
 net_rx_action+0x329/0x420
 ? enqueue_hrtimer+0x35/0x90
 ? trace_hardirqs_on+0x1d/0x80
 ? kvm_sched_clock_read+0xd/0x20
 ? sched_clock+0xc/0x30
 ? kvm_sched_clock_read+0xd/0x20
 ? sched_clock+0xc/0x30
 ? sched_clock_cpu+0xd/0x1a0
 handle_softirqs+0x138/0x3e0
 do_softirq.part.0+0x89/0xc0
 </IRQ>
 <TASK>
 __local_bh_enable_ip+0xa7/0xb0
 virtnet_open+0xc8/0x310 [virtio_net]
 __dev_open+0xfa/0x1b0
 __dev_change_flags+0x1de/0x250
 dev_change_flags+0x22/0x60
 do_setlink.isra.0+0x2df/0x10b0
 ? rtnetlink_rcv_msg+0x34f/0x3f0
 ? netlink_rcv_skb+0x54/0x100
 ? netlink_unicast+0x23e/0x390
 ? netlink_sendmsg+0x21e/0x490
 ? ____sys_sendmsg+0x31b/0x350
 ? avc_has_perm_noaudit+0x67/0xf0
 ? cred_has_capability.isra.0+0x75/0x110
 ? __nla_validate_parse+0x5f/0xee0
 ? __pfx___probestub_irq_enable+0x3/0x10
 ? __create_object+0x5e/0x90
 ? security_capable+0x3b/0x7�[I0
 rtnl_newlink+0x784/0xaf0
 ? avc_has_perm_noaudit+0x67/0xf0
 ? cred_has_capability.isra.0+0x75/0x110
 ? stack_depot_save_flags+0x24/0x6d0
 ? __pfx_rtnl_newlink+0x10/0x10
 rtnetlink_rcv_msg+0x34f/0x3f0
 ? do_syscall_64+0x6c/0x180
 ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
 ? __pfx_rtnetlink_rcv_msg+0x10/0x10
 netlink_rcv_skb+0x54/0x100
 netlink_unicast+0x23e/0x390
 netlink_sendmsg+0x21e/0x490
 ____sys_sendmsg+0x31b/0x350
 ? copy_msghdr_from_user+0x6d/0xa0
 ___sys_sendmsg+0x86/0xd0
 ? __pte_offset_map+0x17/0x160
 ? preempt_count_add+0x69/0xa0
 ? __call_rcu_common.constprop.0+0x147/0x610
 ? preempt_count_add+0x69/0xa0
 ? preempt_count_add+0x69/0xa0
 ? _raw_spin_trylock+0x13/0x60
 ? trace_hardirqs_on+0x1d/0x80
 __sys_sendmsg+0x66/0xc0
 do_syscall_64+0x6c/0x180
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f41defe5b34
Code: 15 e1 12 0f 00 f7 d8 64 89 02 b8 ff ff ff ff eb bf 0f 1f 44 00 00
f3 0f 1e fa 80 3d 35 95 0f 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00
f0 ff ff 77 4c c3 0f 1f 00 55 48 89 e5 48 83 ec 20 89 55
RSP: 002b:00007ffe5336ecc8 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f41defe5b34
RDX: 0000000000000000 RSI: 00007ffe5336ed30 RDI: 0000000000000003
RBP: 00007ffe5336eda0 R08: 0000000000000010 R09: 0000000000000001
R10: 00007ffe5336f6f9 R11: 0000000000000202 R12: 0000000000000003
R13: 0000000067452259 R14: 0000556ccc28b040 R15: 0000000000000000
 </TASK>
[...]
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Paolo Abeni <[email protected]>
pcmoore pushed a commit that referenced this issue Jan 20, 2025
…ux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 fixes for 6.13, part #2

 - Fix confusion with implicitly-shifted MDCR_EL2 masks breaking
   SPE/TRBE initialization

 - Align nested page table walker with the intended memory attribute
   combining rules of the architecture

 - Prevent userspace from constraining the advertised ASID width,
   avoiding horrors of guest TLBIs not matching the intended context in
   hardware

 - Don't leak references on LPIs when insertion into the translation
   cache fails
pcmoore pushed a commit that referenced this issue Jan 20, 2025
The vmemmap's, which is used for RV64 with SPARSEMEM_VMEMMAP, page
tables are populated using pmd (page middle directory) hugetables.
However, the pmd allocation is not using the generic mechanism used by
the VMA code (e.g. pmd_alloc()), or the RISC-V specific
create_pgd_mapping()/alloc_pmd_late(). Instead, the vmemmap page table
code allocates a page, and calls vmemmap_set_pmd(). This results in
that the pmd ctor is *not* called, nor would it make sense to do so.

Now, when tearing down a vmemmap page table pmd, the cleanup code
would unconditionally, and incorrectly call the pmd dtor, which
results in a crash (best case).

This issue was found when running the HMM selftests:

  | tools/testing/selftests/mm# ./test_hmm.sh smoke
  | ... # when unloading the test_hmm.ko module
  | page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10915b
  | flags: 0x1000000000000000(node=0|zone=1)
  | raw: 1000000000000000 0000000000000000 dead000000000122 0000000000000000
  | raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
  | page dumped because: VM_BUG_ON_PAGE(ptdesc->pmd_huge_pte)
  | ------------[ cut here ]------------
  | kernel BUG at include/linux/mm.h:3080!
  | Kernel BUG [#1]
  | Modules linked in: test_hmm(-) sch_fq_codel fuse drm drm_panel_orientation_quirks backlight dm_mod
  | CPU: 1 UID: 0 PID: 514 Comm: modprobe Tainted: G        W          6.12.0-00982-gf2a4f1682d07 #2
  | Tainted: [W]=WARN
  | Hardware name: riscv-virtio qemu/qemu, BIOS 2024.10 10/01/2024
  | epc : remove_pgd_mapping+0xbec/0x1070
  |  ra : remove_pgd_mapping+0xbec/0x1070
  | epc : ffffffff80010a68 ra : ffffffff80010a68 sp : ff20000000a73940
  |  gp : ffffffff827b2d88 tp : ff6000008785da40 t0 : ffffffff80fbce04
  |  t1 : 0720072007200720 t2 : 706d756420656761 s0 : ff20000000a73a50
  |  s1 : ff6000008915cff8 a0 : 0000000000000039 a1 : 0000000000000008
  |  a2 : ff600003fff0de20 a3 : 0000000000000000 a4 : 0000000000000000
  |  a5 : 0000000000000000 a6 : c0000000ffffefff a7 : ffffffff824469b8
  |  s2 : ff1c0000022456c0 s3 : ff1ffffffdbfffff s4 : ff6000008915c000
  |  s5 : ff6000008915c000 s6 : ff6000008915c000 s7 : ff1ffffffdc00000
  |  s8 : 0000000000000001 s9 : ff1ffffffdc00000 s10: ffffffff819a31f0
  |  s11: ffffffffffffffff t3 : ffffffff8000c950 t4 : ff60000080244f00
  |  t5 : ff60000080244000 t6 : ff20000000a73708
  | status: 0000000200000120 badaddr: ffffffff80010a68 cause: 0000000000000003
  | [<ffffffff80010a68>] remove_pgd_mapping+0xbec/0x1070
  | [<ffffffff80fd238e>] vmemmap_free+0x14/0x1e
  | [<ffffffff8032e698>] section_deactivate+0x220/0x452
  | [<ffffffff8032ef7e>] sparse_remove_section+0x4a/0x58
  | [<ffffffff802f8700>] __remove_pages+0x7e/0xba
  | [<ffffffff803760d8>] memunmap_pages+0x2bc/0x3fe
  | [<ffffffff02a3ca28>] dmirror_device_remove_chunks+0x2ea/0x518 [test_hmm]
  | [<ffffffff02a3e026>] hmm_dmirror_exit+0x3e/0x1018 [test_hmm]
  | [<ffffffff80102c14>] __riscv_sys_delete_module+0x15a/0x2a6
  | [<ffffffff80fd020c>] do_trap_ecall_u+0x1f2/0x266
  | [<ffffffff80fde0a2>] _new_vmalloc_restore_context_a0+0xc6/0xd2
  | Code: bf51 7597 0184 8593 76a5 854a 4097 0029 80e7 2c00 (9002) 7597
  | ---[ end trace 0000000000000000 ]---
  | Kernel panic - not syncing: Fatal exception in interrupt

Add a check to avoid calling the pmd dtor, if the calling context is
vmemmap_free().

Fixes: c75a74f ("riscv: mm: Add memory hotplugging support")
Signed-off-by: Björn Töpel <[email protected]>
Reviewed-by: Alexandre Ghiti <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Cc: [email protected]
Signed-off-by: Palmer Dabbelt <[email protected]>
pcmoore pushed a commit that referenced this issue Jan 20, 2025
This reworks hci_cb_list to not use mutex hci_cb_list_lock to avoid bugs
like the bellow:

BUG: sleeping function called from invalid context at kernel/locking/mutex.c:585
in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 5070, name: kworker/u9:2
preempt_count: 0, expected: 0
RCU nest depth: 1, expected: 0
4 locks held by kworker/u9:2/5070:
 #0: ffff888015be3948 ((wq_completion)hci0#2){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3229 [inline]
 #0: ffff888015be3948 ((wq_completion)hci0#2){+.+.}-{0:0}, at: process_scheduled_works+0x8e0/0x1770 kernel/workqueue.c:3335
 #1: ffffc90003b6fd00 ((work_completion)(&hdev->rx_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3230 [inline]
 #1: ffffc90003b6fd00 ((work_completion)(&hdev->rx_work)){+.+.}-{0:0}, at: process_scheduled_works+0x91b/0x1770 kernel/workqueue.c:3335
 #2: ffff8880665d0078 (&hdev->lock){+.+.}-{3:3}, at: hci_le_create_big_complete_evt+0xcf/0xae0 net/bluetooth/hci_event.c:6914
 #3: ffffffff8e132020 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:298 [inline]
 #3: ffffffff8e132020 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:750 [inline]
 #3: ffffffff8e132020 (rcu_read_lock){....}-{1:2}, at: hci_le_create_big_complete_evt+0xdb/0xae0 net/bluetooth/hci_event.c:6915
CPU: 0 PID: 5070 Comm: kworker/u9:2 Not tainted 6.8.0-syzkaller-08073-g480e035fc4c7 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
Workqueue: hci0 hci_rx_work
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
 __might_resched+0x5d4/0x780 kernel/sched/core.c:10187
 __mutex_lock_common kernel/locking/mutex.c:585 [inline]
 __mutex_lock+0xc1/0xd70 kernel/locking/mutex.c:752
 hci_connect_cfm include/net/bluetooth/hci_core.h:2004 [inline]
 hci_le_create_big_complete_evt+0x3d9/0xae0 net/bluetooth/hci_event.c:6939
 hci_event_func net/bluetooth/hci_event.c:7514 [inline]
 hci_event_packet+0xa53/0x1540 net/bluetooth/hci_event.c:7569
 hci_rx_work+0x3e8/0xca0 net/bluetooth/hci_core.c:4171
 process_one_work kernel/workqueue.c:3254 [inline]
 process_scheduled_works+0xa00/0x1770 kernel/workqueue.c:3335
 worker_thread+0x86d/0xd70 kernel/workqueue.c:3416
 kthread+0x2f0/0x390 kernel/kthread.c:388
 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243
 </TASK>

Reported-by: [email protected]
Tested-by: [email protected]
Closes: https://syzkaller.appspot.com/bug?extid=2fb0835e0c9cefc34614
Signed-off-by: Luiz Augusto von Dentz <[email protected]>
pcmoore pushed a commit that referenced this issue Jan 20, 2025
This fixes the circular locking dependency warning below, by
releasing the socket lock before enterning iso_listen_bis, to
avoid any potential deadlock with hdev lock.

[   75.307983] ======================================================
[   75.307984] WARNING: possible circular locking dependency detected
[   75.307985] 6.12.0-rc6+ #22 Not tainted
[   75.307987] ------------------------------------------------------
[   75.307987] kworker/u81:2/2623 is trying to acquire lock:
[   75.307988] ffff8fde1769da58 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO)
               at: iso_connect_cfm+0x253/0x840 [bluetooth]
[   75.308021]
               but task is already holding lock:
[   75.308022] ffff8fdd61a10078 (&hdev->lock)
               at: hci_le_per_adv_report_evt+0x47/0x2f0 [bluetooth]
[   75.308053]
               which lock already depends on the new lock.

[   75.308054]
               the existing dependency chain (in reverse order) is:
[   75.308055]
               -> #1 (&hdev->lock){+.+.}-{3:3}:
[   75.308057]        __mutex_lock+0xad/0xc50
[   75.308061]        mutex_lock_nested+0x1b/0x30
[   75.308063]        iso_sock_listen+0x143/0x5c0 [bluetooth]
[   75.308085]        __sys_listen_socket+0x49/0x60
[   75.308088]        __x64_sys_listen+0x4c/0x90
[   75.308090]        x64_sys_call+0x2517/0x25f0
[   75.308092]        do_syscall_64+0x87/0x150
[   75.308095]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   75.308098]
               -> #0 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO){+.+.}-{0:0}:
[   75.308100]        __lock_acquire+0x155e/0x25f0
[   75.308103]        lock_acquire+0xc9/0x300
[   75.308105]        lock_sock_nested+0x32/0x90
[   75.308107]        iso_connect_cfm+0x253/0x840 [bluetooth]
[   75.308128]        hci_connect_cfm+0x6c/0x190 [bluetooth]
[   75.308155]        hci_le_per_adv_report_evt+0x27b/0x2f0 [bluetooth]
[   75.308180]        hci_le_meta_evt+0xe7/0x200 [bluetooth]
[   75.308206]        hci_event_packet+0x21f/0x5c0 [bluetooth]
[   75.308230]        hci_rx_work+0x3ae/0xb10 [bluetooth]
[   75.308254]        process_one_work+0x212/0x740
[   75.308256]        worker_thread+0x1bd/0x3a0
[   75.308258]        kthread+0xe4/0x120
[   75.308259]        ret_from_fork+0x44/0x70
[   75.308261]        ret_from_fork_asm+0x1a/0x30
[   75.308263]
               other info that might help us debug this:

[   75.308264]  Possible unsafe locking scenario:

[   75.308264]        CPU0                CPU1
[   75.308265]        ----                ----
[   75.308265]   lock(&hdev->lock);
[   75.308267]                            lock(sk_lock-
                                                AF_BLUETOOTH-BTPROTO_ISO);
[   75.308268]                            lock(&hdev->lock);
[   75.308269]   lock(sk_lock-AF_BLUETOOTH-BTPROTO_ISO);
[   75.308270]
                *** DEADLOCK ***

[   75.308271] 4 locks held by kworker/u81:2/2623:
[   75.308272]  #0: ffff8fdd66e52148 ((wq_completion)hci0#2){+.+.}-{0:0},
                at: process_one_work+0x443/0x740
[   75.308276]  #1: ffffafb488b7fe48 ((work_completion)(&hdev->rx_work)),
                at: process_one_work+0x1ce/0x740
[   75.308280]  #2: ffff8fdd61a10078 (&hdev->lock){+.+.}-{3:3}
                at: hci_le_per_adv_report_evt+0x47/0x2f0 [bluetooth]
[   75.308304]  #3: ffffffffb6ba4900 (rcu_read_lock){....}-{1:2},
                at: hci_connect_cfm+0x29/0x190 [bluetooth]

Fixes: 02171da ("Bluetooth: ISO: Add hcon for listening bis sk")
Signed-off-by: Iulia Tanasescu <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>
pcmoore pushed a commit that referenced this issue Jan 20, 2025
This fixes the circular locking dependency warning below, by reworking
iso_sock_recvmsg, to ensure that the socket lock is always released
before calling a function that locks hdev.

[  561.670344] ======================================================
[  561.670346] WARNING: possible circular locking dependency detected
[  561.670349] 6.12.0-rc6+ #26 Not tainted
[  561.670351] ------------------------------------------------------
[  561.670353] iso-tester/3289 is trying to acquire lock:
[  561.670355] ffff88811f600078 (&hdev->lock){+.+.}-{3:3},
               at: iso_conn_big_sync+0x73/0x260 [bluetooth]
[  561.670405]
               but task is already holding lock:
[  561.670407] ffff88815af58258 (sk_lock-AF_BLUETOOTH){+.+.}-{0:0},
               at: iso_sock_recvmsg+0xbf/0x500 [bluetooth]
[  561.670450]
               which lock already depends on the new lock.

[  561.670452]
               the existing dependency chain (in reverse order) is:
[  561.670453]
               -> #2 (sk_lock-AF_BLUETOOTH){+.+.}-{0:0}:
[  561.670458]        lock_acquire+0x7c/0xc0
[  561.670463]        lock_sock_nested+0x3b/0xf0
[  561.670467]        bt_accept_dequeue+0x1a5/0x4d0 [bluetooth]
[  561.670510]        iso_sock_accept+0x271/0x830 [bluetooth]
[  561.670547]        do_accept+0x3dd/0x610
[  561.670550]        __sys_accept4+0xd8/0x170
[  561.670553]        __x64_sys_accept+0x74/0xc0
[  561.670556]        x64_sys_call+0x17d6/0x25f0
[  561.670559]        do_syscall_64+0x87/0x150
[  561.670563]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  561.670567]
               -> #1 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO){+.+.}-{0:0}:
[  561.670571]        lock_acquire+0x7c/0xc0
[  561.670574]        lock_sock_nested+0x3b/0xf0
[  561.670577]        iso_sock_listen+0x2de/0xf30 [bluetooth]
[  561.670617]        __sys_listen_socket+0xef/0x130
[  561.670620]        __x64_sys_listen+0xe1/0x190
[  561.670623]        x64_sys_call+0x2517/0x25f0
[  561.670626]        do_syscall_64+0x87/0x150
[  561.670629]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  561.670632]
               -> #0 (&hdev->lock){+.+.}-{3:3}:
[  561.670636]        __lock_acquire+0x32ad/0x6ab0
[  561.670639]        lock_acquire.part.0+0x118/0x360
[  561.670642]        lock_acquire+0x7c/0xc0
[  561.670644]        __mutex_lock+0x18d/0x12f0
[  561.670647]        mutex_lock_nested+0x1b/0x30
[  561.670651]        iso_conn_big_sync+0x73/0x260 [bluetooth]
[  561.670687]        iso_sock_recvmsg+0x3e9/0x500 [bluetooth]
[  561.670722]        sock_recvmsg+0x1d5/0x240
[  561.670725]        sock_read_iter+0x27d/0x470
[  561.670727]        vfs_read+0x9a0/0xd30
[  561.670731]        ksys_read+0x1a8/0x250
[  561.670733]        __x64_sys_read+0x72/0xc0
[  561.670736]        x64_sys_call+0x1b12/0x25f0
[  561.670738]        do_syscall_64+0x87/0x150
[  561.670741]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  561.670744]
               other info that might help us debug this:

[  561.670745] Chain exists of:
&hdev->lock --> sk_lock-AF_BLUETOOTH-BTPROTO_ISO --> sk_lock-AF_BLUETOOTH

[  561.670751]  Possible unsafe locking scenario:

[  561.670753]        CPU0                    CPU1
[  561.670754]        ----                    ----
[  561.670756]   lock(sk_lock-AF_BLUETOOTH);
[  561.670758]                                lock(sk_lock
                                              AF_BLUETOOTH-BTPROTO_ISO);
[  561.670761]                                lock(sk_lock-AF_BLUETOOTH);
[  561.670764]   lock(&hdev->lock);
[  561.670767]
                *** DEADLOCK ***

Fixes: 07a9342 ("Bluetooth: ISO: Send BIG Create Sync via hci_sync")
Signed-off-by: Iulia Tanasescu <[email protected]>
Signed-off-by: Luiz Augusto von Dentz <[email protected]>
pcmoore pushed a commit that referenced this issue Jan 20, 2025
Aishwarya reports that warnings are sometimes seen when running the
ftrace kselftests, e.g.

| WARNING: CPU: 5 PID: 2066 at arch/arm64/kernel/stacktrace.c:141 arch_stack_walk+0x4a0/0x4c0
| Modules linked in:
| CPU: 5 UID: 0 PID: 2066 Comm: ftracetest Not tainted 6.13.0-rc2 #2
| Hardware name: linux,dummy-virt (DT)
| pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
| pc : arch_stack_walk+0x4a0/0x4c0
| lr : arch_stack_walk+0x248/0x4c0
| sp : ffff800083643d20
| x29: ffff800083643dd0 x28: ffff00007b891400 x27: ffff00007b891928
| x26: 0000000000000001 x25: 00000000000000c0 x24: ffff800082f39d80
| x23: ffff80008003ee8c x22: ffff80008004baa8 x21: ffff8000800533e0
| x20: ffff800083643e10 x19: ffff80008003eec8 x18: 0000000000000000
| x17: 0000000000000000 x16: ffff800083640000 x15: 0000000000000000
| x14: 02a37a802bbb8a92 x13: 00000000000001a9 x12: 0000000000000001
| x11: ffff800082ffad60 x10: ffff800083643d20 x9 : ffff80008003eed0
| x8 : ffff80008004baa8 x7 : ffff800086f2be80 x6 : ffff0000057cf000
| x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff800086f2b690
| x2 : ffff80008004baa8 x1 : ffff80008004baa8 x0 : ffff80008004baa8
| Call trace:
|  arch_stack_walk+0x4a0/0x4c0 (P)
|  arch_stack_walk+0x248/0x4c0 (L)
|  profile_pc+0x44/0x80
|  profile_tick+0x50/0x80 (F)
|  tick_nohz_handler+0xcc/0x160 (F)
|  __hrtimer_run_queues+0x2ac/0x340 (F)
|  hrtimer_interrupt+0xf4/0x268 (F)
|  arch_timer_handler_virt+0x34/0x60 (F)
|  handle_percpu_devid_irq+0x88/0x220 (F)
|  generic_handle_domain_irq+0x34/0x60 (F)
|  gic_handle_irq+0x54/0x140 (F)
|  call_on_irq_stack+0x24/0x58 (F)
|  do_interrupt_handler+0x88/0x98
|  el1_interrupt+0x34/0x68 (F)
|  el1h_64_irq_handler+0x18/0x28
|  el1h_64_irq+0x6c/0x70
|  queued_spin_lock_slowpath+0x78/0x460 (P)

The warning in question is:

  WARN_ON_ONCE(state->common.pc == orig_pc))

... in kunwind_recover_return_address(), which is triggered when
return_to_handler() is encountered in the trace, but
ftrace_graph_ret_addr() cannot find a corresponding original return
address on the fgraph return stack.

This happens because the stacktrace code encounters an exception
boundary where the LR was not live at the time of the exception, but the
LR happens to contain return_to_handler(); either because the task
recently returned there, or due to unfortunate usage of the LR at a
scratch register. In such cases attempts to recover the return address
via ftrace_graph_ret_addr() may fail, triggering the WARN_ON_ONCE()
above and aborting the unwind (hence the stacktrace terminating after
reporting the PC at the time of the exception).

Handling unreliable LR values in these cases is likely to require some
larger rework, so for the moment avoid this problem by restoring the old
behaviour of skipping the LR at exception boundaries, which the
stacktrace code did prior to commit:

  c2c6b27 ("arm64: stacktrace: unwind exception boundaries")

This commit is effectively a partial revert, keeping the structures and
logic to explicitly identify exception boundaries while still skipping
reporting of the LR. The logic to explicitly identify exception
boundaries is still useful for general robustness and as a building
block for future support for RELIABLE_STACKTRACE.

Fixes: c2c6b27 ("arm64: stacktrace: unwind exception boundaries")
Signed-off-by: Mark Rutland <[email protected]>
Reported-by: Aishwarya TCV <[email protected]>
Cc: Will Deacon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Catalin Marinas <[email protected]>
pcmoore pushed a commit that referenced this issue Jan 20, 2025
The arm64 stacktrace code has a few error conditions where a
WARN_ON_ONCE() is triggered before the stacktrace is terminated and an
error is returned to the caller. The conditions shouldn't be triggered
when unwinding the current task, but it is possible to trigger these
when unwinding another task which is not blocked, as the stack of that
task is concurrently modified. Kent reports that these warnings can be
triggered while running filesystem tests on bcachefs, which calls the
stacktrace code directly.

To produce a meaningful stacktrace of another task, the task in question
should be blocked, but the stacktrace code is expected to be robust to
cases where it is not blocked. Note that this is purely about not
unuduly scaring the user and/or crashing the kernel; stacktraces in such
cases are meaningless and may leak kernel secrets from the stack of the
task being unwound.

Ideally we'd pin the task in a blocked state during the unwind, as we do
for /proc/${PID}/wchan since commit:

  42a20f8 ("sched: Add wrapper for get_wchan() to keep task blocked")

... but a bunch of places don't do that, notably /proc/${PID}/stack,
where we don't pin the task in a blocked state, but do restrict the
output to privileged users since commit:

  f8a00ce ("proc: restrict kernel stack dumps to root")

... and so it's possible to trigger these warnings accidentally, e.g. by
reading /proc/*/stack (as root):

| for n in $(seq 1 10); do
|     while true; do cat /proc/*/stack > /dev/null 2>&1; done &
| done
| ------------[ cut here ]------------
| WARNING: CPU: 3 PID: 166 at arch/arm64/kernel/stacktrace.c:207 arch_stack_walk+0x1c8/0x370
| Modules linked in:
| CPU: 3 UID: 0 PID: 166 Comm: cat Not tainted 6.13.0-rc2-00003-g3dafa7a7925d #2
| Hardware name: linux,dummy-virt (DT)
| pstate: 81400005 (Nzcv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
| pc : arch_stack_walk+0x1c8/0x370
| lr : arch_stack_walk+0x1b0/0x370
| sp : ffff800080773890
| x29: ffff800080773930 x28: fff0000005c44500 x27: fff00000058fa038
| x26: 000000007ffff000 x25: 0000000000000000 x24: 0000000000000000
| x23: ffffa35a8d9600ec x22: 0000000000000000 x21: fff00000043a33c0
| x20: ffff800080773970 x19: ffffa35a8d960168 x18: 0000000000000000
| x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
| x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
| x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
| x8 : ffff8000807738e0 x7 : ffff8000806e3800 x6 : ffff8000806e3818
| x5 : ffff800080773920 x4 : ffff8000806e4000 x3 : ffff8000807738e0
| x2 : 0000000000000018 x1 : ffff8000806e3800 x0 : 0000000000000000
| Call trace:
|  arch_stack_walk+0x1c8/0x370 (P)
|  stack_trace_save_tsk+0x8c/0x108
|  proc_pid_stack+0xb0/0x134
|  proc_single_show+0x60/0x120
|  seq_read_iter+0x104/0x438
|  seq_read+0xf8/0x140
|  vfs_read+0xc4/0x31c
|  ksys_read+0x70/0x108
|  __arm64_sys_read+0x1c/0x28
|  invoke_syscall+0x48/0x104
|  el0_svc_common.constprop.0+0x40/0xe0
|  do_el0_svc+0x1c/0x28
|  el0_svc+0x30/0xcc
|  el0t_64_sync_handler+0x10c/0x138
|  el0t_64_sync+0x198/0x19c
| ---[ end trace 0000000000000000 ]---

Fix this by only warning when unwinding the current task. When unwinding
another task the error conditions will be handled by returning an error
without producing a warning.

The two warnings in kunwind_next_frame_record_meta() were added recently
as part of commit:

  c2c6b27 ("arm64: stacktrace: unwind exception boundaries")

The warning when recovering the fgraph return address has changed form
many times, but was originally introduced back in commit:

  9f41631 ("arm64: fix unwind_frame() for filtered out fn for function graph tracing")

Fixes: c2c6b27 ("arm64: stacktrace: unwind exception boundaries")
Fixes: 9f41631 ("arm64: fix unwind_frame() for filtered out fn for function graph tracing")
Signed-off-by: Mark Rutland <[email protected]>
Reported-by: Kent Overstreet <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Will Deacon <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Catalin Marinas <[email protected]>
pcmoore pushed a commit that referenced this issue Jan 20, 2025
The current implementation removes cache tags after disabling ATS,
leading to potential memory leaks and kernel crashes. Specifically,
CACHE_TAG_DEVTLB type cache tags may still remain in the list even
after the domain is freed, causing a use-after-free condition.

This issue really shows up when multiple VFs from different PFs
passed through to a single user-space process via vfio-pci. In such
cases, the kernel may crash with kernel messages like:

 BUG: kernel NULL pointer dereference, address: 0000000000000014
 PGD 19036a067 P4D 1940a3067 PUD 136c9b067 PMD 0
 Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
 CPU: 74 UID: 0 PID: 3183 Comm: testCli Not tainted 6.11.9 #2
 RIP: 0010:cache_tag_flush_range+0x9b/0x250
 Call Trace:
  <TASK>
  ? __die+0x1f/0x60
  ? page_fault_oops+0x163/0x590
  ? exc_page_fault+0x72/0x190
  ? asm_exc_page_fault+0x22/0x30
  ? cache_tag_flush_range+0x9b/0x250
  ? cache_tag_flush_range+0x5d/0x250
  intel_iommu_tlb_sync+0x29/0x40
  intel_iommu_unmap_pages+0xfe/0x160
  __iommu_unmap+0xd8/0x1a0
  vfio_unmap_unpin+0x182/0x340 [vfio_iommu_type1]
  vfio_remove_dma+0x2a/0xb0 [vfio_iommu_type1]
  vfio_iommu_type1_ioctl+0xafa/0x18e0 [vfio_iommu_type1]

Move cache_tag_unassign_domain() before iommu_disable_pci_caps() to fix
it.

Fixes: 3b1d9e2 ("iommu/vt-d: Add cache tag assignment interface")
Cc: [email protected]
Signed-off-by: Lu Baolu <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Joerg Roedel <[email protected]>
pcmoore pushed a commit that referenced this issue Jan 20, 2025
…s_lock

For storing a value to a queue attribute, the queue_attr_store function
first freezes the queue (->q_usage_counter(io)) and then acquire
->sysfs_lock. This seems not correct as the usual ordering should be to
acquire ->sysfs_lock before freezing the queue. This incorrect ordering
causes the following lockdep splat which we are able to reproduce always
simply by accessing /sys/kernel/debug file using ls command:

[   57.597146] WARNING: possible circular locking dependency detected
[   57.597154] 6.12.0-10553-gb86545e02e8c #20 Tainted: G        W
[   57.597162] ------------------------------------------------------
[   57.597168] ls/4605 is trying to acquire lock:
[   57.597176] c00000003eb56710 (&mm->mmap_lock){++++}-{4:4}, at: __might_fault+0x58/0xc0
[   57.597200]
               but task is already holding lock:
[   57.597207] c0000018e27c6810 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: iterate_dir+0x94/0x1d4
[   57.597226]
               which lock already depends on the new lock.

[   57.597233]
               the existing dependency chain (in reverse order) is:
[   57.597241]
               -> #5 (&sb->s_type->i_mutex_key#3){++++}-{4:4}:
[   57.597255]        down_write+0x6c/0x18c
[   57.597264]        start_creating+0xb4/0x24c
[   57.597274]        debugfs_create_dir+0x2c/0x1e8
[   57.597283]        blk_register_queue+0xec/0x294
[   57.597292]        add_disk_fwnode+0x2e4/0x548
[   57.597302]        brd_alloc+0x2c8/0x338
[   57.597309]        brd_init+0x100/0x178
[   57.597317]        do_one_initcall+0x88/0x3e4
[   57.597326]        kernel_init_freeable+0x3cc/0x6e0
[   57.597334]        kernel_init+0x34/0x1cc
[   57.597342]        ret_from_kernel_user_thread+0x14/0x1c
[   57.597350]
               -> #4 (&q->debugfs_mutex){+.+.}-{4:4}:
[   57.597362]        __mutex_lock+0xfc/0x12a0
[   57.597370]        blk_register_queue+0xd4/0x294
[   57.597379]        add_disk_fwnode+0x2e4/0x548
[   57.597388]        brd_alloc+0x2c8/0x338
[   57.597395]        brd_init+0x100/0x178
[   57.597402]        do_one_initcall+0x88/0x3e4
[   57.597410]        kernel_init_freeable+0x3cc/0x6e0
[   57.597418]        kernel_init+0x34/0x1cc
[   57.597426]        ret_from_kernel_user_thread+0x14/0x1c
[   57.597434]
               -> #3 (&q->sysfs_lock){+.+.}-{4:4}:
[   57.597446]        __mutex_lock+0xfc/0x12a0
[   57.597454]        queue_attr_store+0x9c/0x110
[   57.597462]        sysfs_kf_write+0x70/0xb0
[   57.597471]        kernfs_fop_write_iter+0x1b0/0x2ac
[   57.597480]        vfs_write+0x3dc/0x6e8
[   57.597488]        ksys_write+0x84/0x140
[   57.597495]        system_call_exception+0x130/0x360
[   57.597504]        system_call_common+0x160/0x2c4
[   57.597516]
               -> #2 (&q->q_usage_counter(io)#21){++++}-{0:0}:
[   57.597530]        __submit_bio+0x5ec/0x828
[   57.597538]        submit_bio_noacct_nocheck+0x1e4/0x4f0
[   57.597547]        iomap_readahead+0x2a0/0x448
[   57.597556]        xfs_vm_readahead+0x28/0x3c
[   57.597564]        read_pages+0x88/0x41c
[   57.597571]        page_cache_ra_unbounded+0x1ac/0x2d8
[   57.597580]        filemap_get_pages+0x188/0x984
[   57.597588]        filemap_read+0x13c/0x4bc
[   57.597596]        xfs_file_buffered_read+0x88/0x17c
[   57.597605]        xfs_file_read_iter+0xac/0x158
[   57.597614]        vfs_read+0x2d4/0x3b4
[   57.597622]        ksys_read+0x84/0x144
[   57.597629]        system_call_exception+0x130/0x360
[   57.597637]        system_call_common+0x160/0x2c4
[   57.597647]
               -> #1 (mapping.invalidate_lock#2){++++}-{4:4}:
[   57.597661]        down_read+0x6c/0x220
[   57.597669]        filemap_fault+0x870/0x100c
[   57.597677]        xfs_filemap_fault+0xc4/0x18c
[   57.597684]        __do_fault+0x64/0x164
[   57.597693]        __handle_mm_fault+0x1274/0x1dac
[   57.597702]        handle_mm_fault+0x248/0x484
[   57.597711]        ___do_page_fault+0x428/0xc0c
[   57.597719]        hash__do_page_fault+0x30/0x68
[   57.597727]        do_hash_fault+0x90/0x35c
[   57.597736]        data_access_common_virt+0x210/0x220
[   57.597745]        _copy_from_user+0xf8/0x19c
[   57.597754]        sel_write_load+0x178/0xd54
[   57.597762]        vfs_write+0x108/0x6e8
[   57.597769]        ksys_write+0x84/0x140
[   57.597777]        system_call_exception+0x130/0x360
[   57.597785]        system_call_common+0x160/0x2c4
[   57.597794]
               -> #0 (&mm->mmap_lock){++++}-{4:4}:
[   57.597806]        __lock_acquire+0x17cc/0x2330
[   57.597814]        lock_acquire+0x138/0x400
[   57.597822]        __might_fault+0x7c/0xc0
[   57.597830]        filldir64+0xe8/0x390
[   57.597839]        dcache_readdir+0x80/0x2d4
[   57.597846]        iterate_dir+0xd8/0x1d4
[   57.597855]        sys_getdents64+0x88/0x2d4
[   57.597864]        system_call_exception+0x130/0x360
[   57.597872]        system_call_common+0x160/0x2c4
[   57.597881]
               other info that might help us debug this:

[   57.597888] Chain exists of:
                 &mm->mmap_lock --> &q->debugfs_mutex --> &sb->s_type->i_mutex_key#3

[   57.597905]  Possible unsafe locking scenario:

[   57.597911]        CPU0                    CPU1
[   57.597917]        ----                    ----
[   57.597922]   rlock(&sb->s_type->i_mutex_key#3);
[   57.597932]                                lock(&q->debugfs_mutex);
[   57.597940]                                lock(&sb->s_type->i_mutex_key#3);
[   57.597950]   rlock(&mm->mmap_lock);
[   57.597958]
                *** DEADLOCK ***

[   57.597965] 2 locks held by ls/4605:
[   57.597971]  #0: c0000000137c12f8 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0xcc/0x154
[   57.597989]  #1: c0000018e27c6810 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: iterate_dir+0x94/0x1d4

Prevent the above lockdep warning by acquiring ->sysfs_lock before
freezing the queue while storing a queue attribute in queue_attr_store
function. Later, we also found[1] another function __blk_mq_update_nr_
hw_queues where we first freeze queue and then acquire the ->sysfs_lock.
So we've also updated lock ordering in __blk_mq_update_nr_hw_queues
function and ensured that in all code paths we follow the correct lock
ordering i.e. acquire ->sysfs_lock before freezing the queue.

[1] https://lore.kernel.org/all/CAFj5m9Ke8+EHKQBs_Nk6hqd=LGXtk4mUxZUN5==ZcCjnZSBwHw@mail.gmail.com/

Reported-by: [email protected]
Fixes: af28141 ("block: freeze the queue in queue_attr_store")
Tested-by: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Nilay Shroff <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Jens Axboe <[email protected]>
pcmoore pushed a commit that referenced this issue Jan 20, 2025
Guangguan Wang says:

====================
net: several fixes for smc

v1 -> v2:
rewrite patch #2 suggested by Paolo.
====================

Signed-off-by: David S. Miller <[email protected]>
pcmoore pushed a commit that referenced this issue Jan 20, 2025
The mapping VMA address is saved in VAS window struct when the
paste address is mapped. This VMA address is used during migration
to unmap the paste address if the window is active. The paste
address mapping will be removed when the window is closed or with
the munmap(). But the VMA address in the VAS window is not updated
with munmap() which is causing invalid access during migration.

The KASAN report shows:
[16386.254991] BUG: KASAN: slab-use-after-free in reconfig_close_windows+0x1a0/0x4e8
[16386.255043] Read of size 8 at addr c00000014a819670 by task drmgr/696928

[16386.255096] CPU: 29 UID: 0 PID: 696928 Comm: drmgr Kdump: loaded Tainted: G    B              6.11.0-rc5-nxgzip #2
[16386.255128] Tainted: [B]=BAD_PAGE
[16386.255148] Hardware name: IBM,9080-HEX Power11 (architected) 0x820200 0xf000007 of:IBM,FW1110.00 (NH1110_016) hv:phyp pSeries
[16386.255181] Call Trace:
[16386.255202] [c00000016b297660] [c0000000018ad0ac] dump_stack_lvl+0x84/0xe8 (unreliable)
[16386.255246] [c00000016b297690] [c0000000006e8a90] print_report+0x19c/0x764
[16386.255285] [c00000016b297760] [c0000000006e9490] kasan_report+0x128/0x1f8
[16386.255309] [c00000016b297880] [c0000000006eb5c8] __asan_load8+0xac/0xe0
[16386.255326] [c00000016b2978a0] [c00000000013f898] reconfig_close_windows+0x1a0/0x4e8
[16386.255343] [c00000016b297990] [c000000000140e58] vas_migration_handler+0x3a4/0x3fc
[16386.255368] [c00000016b297a90] [c000000000128848] pseries_migrate_partition+0x4c/0x4c4
...

[16386.256136] Allocated by task 696554 on cpu 31 at 16377.277618s:
[16386.256149]  kasan_save_stack+0x34/0x68
[16386.256163]  kasan_save_track+0x34/0x80
[16386.256175]  kasan_save_alloc_info+0x58/0x74
[16386.256196]  __kasan_slab_alloc+0xb8/0xdc
[16386.256209]  kmem_cache_alloc_noprof+0x200/0x3d0
[16386.256225]  vm_area_alloc+0x44/0x150
[16386.256245]  mmap_region+0x214/0x10c4
[16386.256265]  do_mmap+0x5fc/0x750
[16386.256277]  vm_mmap_pgoff+0x14c/0x24c
[16386.256292]  ksys_mmap_pgoff+0x20c/0x348
[16386.256303]  sys_mmap+0xd0/0x160
...

[16386.256350] Freed by task 0 on cpu 31 at 16386.204848s:
[16386.256363]  kasan_save_stack+0x34/0x68
[16386.256374]  kasan_save_track+0x34/0x80
[16386.256384]  kasan_save_free_info+0x64/0x10c
[16386.256396]  __kasan_slab_free+0x120/0x204
[16386.256415]  kmem_cache_free+0x128/0x450
[16386.256428]  vm_area_free_rcu_cb+0xa8/0xd8
[16386.256441]  rcu_do_batch+0x2c8/0xcf0
[16386.256458]  rcu_core+0x378/0x3c4
[16386.256473]  handle_softirqs+0x20c/0x60c
[16386.256495]  do_softirq_own_stack+0x6c/0x88
[16386.256509]  do_softirq_own_stack+0x58/0x88
[16386.256521]  __irq_exit_rcu+0x1a4/0x20c
[16386.256533]  irq_exit+0x20/0x38
[16386.256544]  interrupt_async_exit_prepare.constprop.0+0x18/0x2c
...

[16386.256717] Last potentially related work creation:
[16386.256729]  kasan_save_stack+0x34/0x68
[16386.256741]  __kasan_record_aux_stack+0xcc/0x12c
[16386.256753]  __call_rcu_common.constprop.0+0x94/0xd04
[16386.256766]  vm_area_free+0x28/0x3c
[16386.256778]  remove_vma+0xf4/0x114
[16386.256797]  do_vmi_align_munmap.constprop.0+0x684/0x870
[16386.256811]  __vm_munmap+0xe0/0x1f8
[16386.256821]  sys_munmap+0x54/0x6c
[16386.256830]  system_call_exception+0x1a0/0x4a0
[16386.256841]  system_call_vectored_common+0x15c/0x2ec

[16386.256868] The buggy address belongs to the object at c00000014a819670
                which belongs to the cache vm_area_struct of size 168
[16386.256887] The buggy address is located 0 bytes inside of
                freed 168-byte region [c00000014a819670, c00000014a819718)

[16386.256915] The buggy address belongs to the physical page:
[16386.256928] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x14a81
[16386.256950] memcg:c0000000ba430001
[16386.256961] anon flags: 0x43ffff800000000(node=4|zone=0|lastcpupid=0x7ffff)
[16386.256975] page_type: 0xfdffffff(slab)
[16386.256990] raw: 043ffff800000000 c00000000501c080 0000000000000000 5deadbee00000001
[16386.257003] raw: 0000000000000000 00000000011a011a 00000001fdffffff c0000000ba430001
[16386.257018] page dumped because: kasan: bad access detected

This patch adds close() callback in vas_vm_ops vm_operations_struct
which will be executed during munmap() before freeing VMA. The VMA
address in the VAS window is set to NULL after holding the window
mmap_mutex.

Fixes: 37e6764 ("powerpc/pseries/vas: Add VAS migration handler")
Signed-off-by: Haren Myneni <[email protected]>
Signed-off-by: Madhavan Srinivasan <[email protected]>
Link: https://patch.msgid.link/[email protected]
pcmoore pushed a commit that referenced this issue Jan 20, 2025
Access to genmask field in struct nft_set_ext results in unaligned
atomic read:

[   72.130109] Unable to handle kernel paging request at virtual address ffff0000c2bb708c
[   72.131036] Mem abort info:
[   72.131213]   ESR = 0x0000000096000021
[   72.131446]   EC = 0x25: DABT (current EL), IL = 32 bits
[   72.132209]   SET = 0, FnV = 0
[   72.133216]   EA = 0, S1PTW = 0
[   72.134080]   FSC = 0x21: alignment fault
[   72.135593] Data abort info:
[   72.137194]   ISV = 0, ISS = 0x00000021, ISS2 = 0x00000000
[   72.142351]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[   72.145989]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[   72.150115] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000237d27000
[   72.154893] [ffff0000c2bb708c] pgd=0000000000000000, p4d=180000023ffff403, pud=180000023f84b403, pmd=180000023f835403,
+pte=0068000102bb7707
[   72.163021] Internal error: Oops: 0000000096000021 [#1] SMP
[...]
[   72.170041] CPU: 7 UID: 0 PID: 54 Comm: kworker/7:0 Tainted: G            E      6.13.0-rc3+ #2
[   72.170509] Tainted: [E]=UNSIGNED_MODULE
[   72.170720] Hardware name: QEMU QEMU Virtual Machine, BIOS edk2-stable202302-for-qemu 03/01/2023
[   72.171192] Workqueue: events_power_efficient nft_rhash_gc [nf_tables]
[   72.171552] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[   72.171915] pc : nft_rhash_gc+0x200/0x2d8 [nf_tables]
[   72.172166] lr : nft_rhash_gc+0x128/0x2d8 [nf_tables]
[   72.172546] sp : ffff800081f2bce0
[   72.172724] x29: ffff800081f2bd40 x28: ffff0000c2bb708c x27: 0000000000000038
[   72.173078] x26: ffff0000c6780ef0 x25: ffff0000c643df00 x24: ffff0000c6778f78
[   72.173431] x23: 000000000000001a x22: ffff0000c4b1f000 x21: ffff0000c6780f78
[   72.173782] x20: ffff0000c2bb70dc x19: ffff0000c2bb7080 x18: 0000000000000000
[   72.174135] x17: ffff0000c0a4e1c0 x16: 0000000000003000 x15: 0000ac26d173b978
[   72.174485] x14: ffffffffffffffff x13: 0000000000000030 x12: ffff0000c6780ef0
[   72.174841] x11: 0000000000000000 x10: ffff800081f2bcf8 x9 : ffff0000c3000000
[   72.175193] x8 : 00000000000004be x7 : 0000000000000000 x6 : 0000000000000000
[   72.175544] x5 : 0000000000000040 x4 : ffff0000c3000010 x3 : 0000000000000000
[   72.175871] x2 : 0000000000003a98 x1 : ffff0000c2bb708c x0 : 0000000000000004
[   72.176207] Call trace:
[   72.176316]  nft_rhash_gc+0x200/0x2d8 [nf_tables] (P)
[   72.176653]  process_one_work+0x178/0x3d0
[   72.176831]  worker_thread+0x200/0x3f0
[   72.176995]  kthread+0xe8/0xf8
[   72.177130]  ret_from_fork+0x10/0x20
[   72.177289] Code: 54fff984 d503201f d2800080 91003261 (f820303f)
[   72.177557] ---[ end trace 0000000000000000 ]---

Align struct nft_set_ext to word size to address this and
documentation it.

pahole reports that this increases the size of elements for rhash and
pipapo in 8 bytes on x86_64.

Fixes: 7ffc748 ("netfilter: nft_set_hash: skip duplicated elements pending gc run")
Signed-off-by: Pablo Neira Ayuso <[email protected]>
pcmoore pushed a commit that referenced this issue Jan 20, 2025
syzbot reports that a recent fix causes nesting issues between the (now)
raw timeoutlock and the eventfd locking:

=============================
[ BUG: Invalid wait context ]
6.13.0-rc4-00080-g9828a4c0901f #29 Not tainted
-----------------------------
kworker/u32:0/68094 is trying to lock:
ffff000014d7a520 (&ctx->wqh#2){..-.}-{3:3}, at: eventfd_signal_mask+0x64/0x180
other info that might help us debug this:
context-{5:5}
6 locks held by kworker/u32:0/68094:
 #0: ffff0000c1d98148 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work+0x4e8/0xfc0
 #1: ffff80008d927c78 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work+0x53c/0xfc0
 #2: ffff0000c59bc3d8 (&ctx->completion_lock){+.+.}-{3:3}, at: io_kill_timeouts+0x40/0x180
 #3: ffff0000c59bc358 (&ctx->timeout_lock){-.-.}-{2:2}, at: io_kill_timeouts+0x48/0x180
 #4: ffff800085127aa0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x8/0x38
 #5: ffff800085127aa0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x8/0x38
stack backtrace:
CPU: 7 UID: 0 PID: 68094 Comm: kworker/u32:0 Not tainted 6.13.0-rc4-00080-g9828a4c0901f #29
Hardware name: linux,dummy-virt (DT)
Workqueue: iou_exit io_ring_exit_work
Call trace:
 show_stack+0x1c/0x30 (C)
 __dump_stack+0x24/0x30
 dump_stack_lvl+0x60/0x80
 dump_stack+0x14/0x20
 __lock_acquire+0x19f8/0x60c8
 lock_acquire+0x1a4/0x540
 _raw_spin_lock_irqsave+0x90/0xd0
 eventfd_signal_mask+0x64/0x180
 io_eventfd_signal+0x64/0x108
 io_req_local_work_add+0x294/0x430
 __io_req_task_work_add+0x1c0/0x270
 io_kill_timeout+0x1f0/0x288
 io_kill_timeouts+0xd4/0x180
 io_uring_try_cancel_requests+0x2e8/0x388
 io_ring_exit_work+0x150/0x550
 process_one_work+0x5e8/0xfc0
 worker_thread+0x7ec/0xc80
 kthread+0x24c/0x300
 ret_from_fork+0x10/0x20

because after the preempt-rt fix for the timeout lock nesting inside
the io-wq lock, we now have the eventfd spinlock nesting inside the
raw timeout spinlock.

Rather than play whack-a-mole with other nesting on the timeout lock,
split the deletion and killing of timeouts so queueing the task_work
for the timeout cancelations can get done outside of the timeout lock.

Reported-by: [email protected]
Fixes: 020b40f ("io_uring: make ctx->timeout_lock a raw spinlock")
Signed-off-by: Jens Axboe <[email protected]>
pcmoore pushed a commit that referenced this issue Jan 20, 2025
…le_direct_reclaim()

The task sometimes continues looping in throttle_direct_reclaim() because
allow_direct_reclaim(pgdat) keeps returning false.  

 #0 [ffff80002cb6f8d0] __switch_to at ffff8000080095ac
 #1 [ffff80002cb6f900] __schedule at ffff800008abbd1c
 #2 [ffff80002cb6f990] schedule at ffff800008abc50c
 #3 [ffff80002cb6f9b0] throttle_direct_reclaim at ffff800008273550
 #4 [ffff80002cb6fa20] try_to_free_pages at ffff800008277b68
 #5 [ffff80002cb6fae0] __alloc_pages_nodemask at ffff8000082c4660
 #6 [ffff80002cb6fc50] alloc_pages_vma at ffff8000082e4a98
 #7 [ffff80002cb6fca0] do_anonymous_page at ffff80000829f5a8
 #8 [ffff80002cb6fce0] __handle_mm_fault at ffff8000082a5974
 #9 [ffff80002cb6fd90] handle_mm_fault at ffff8000082a5bd4

At this point, the pgdat contains the following two zones:

        NODE: 4  ZONE: 0  ADDR: ffff00817fffe540  NAME: "DMA32"
          SIZE: 20480  MIN/LOW/HIGH: 11/28/45
          VM_STAT:
                NR_FREE_PAGES: 359
        NR_ZONE_INACTIVE_ANON: 18813
          NR_ZONE_ACTIVE_ANON: 0
        NR_ZONE_INACTIVE_FILE: 50
          NR_ZONE_ACTIVE_FILE: 0
          NR_ZONE_UNEVICTABLE: 0
        NR_ZONE_WRITE_PENDING: 0
                     NR_MLOCK: 0
                    NR_BOUNCE: 0
                   NR_ZSPAGES: 0
            NR_FREE_CMA_PAGES: 0

        NODE: 4  ZONE: 1  ADDR: ffff00817fffec00  NAME: "Normal"
          SIZE: 8454144  PRESENT: 98304  MIN/LOW/HIGH: 68/166/264
          VM_STAT:
                NR_FREE_PAGES: 146
        NR_ZONE_INACTIVE_ANON: 94668
          NR_ZONE_ACTIVE_ANON: 3
        NR_ZONE_INACTIVE_FILE: 735
          NR_ZONE_ACTIVE_FILE: 78
          NR_ZONE_UNEVICTABLE: 0
        NR_ZONE_WRITE_PENDING: 0
                     NR_MLOCK: 0
                    NR_BOUNCE: 0
                   NR_ZSPAGES: 0
            NR_FREE_CMA_PAGES: 0

In allow_direct_reclaim(), while processing ZONE_DMA32, the sum of
inactive/active file-backed pages calculated in zone_reclaimable_pages()
based on the result of zone_page_state_snapshot() is zero.  

Additionally, since this system lacks swap, the calculation of inactive/
active anonymous pages is skipped.

        crash> p nr_swap_pages
        nr_swap_pages = $1937 = {
          counter = 0
        }

As a result, ZONE_DMA32 is deemed unreclaimable and skipped, moving on to
the processing of the next zone, ZONE_NORMAL, despite ZONE_DMA32 having
free pages significantly exceeding the high watermark.

The problem is that the pgdat->kswapd_failures hasn't been incremented.

        crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_failures
        $1935 = 0x0

This is because the node deemed balanced.  The node balancing logic in
balance_pgdat() evaluates all zones collectively.  If one or more zones
(e.g., ZONE_DMA32) have enough free pages to meet their watermarks, the
entire node is deemed balanced.  This causes balance_pgdat() to exit early
before incrementing the kswapd_failures, as it considers the overall
memory state acceptable, even though some zones (like ZONE_NORMAL) remain
under significant pressure.


The patch ensures that zone_reclaimable_pages() includes free pages
(NR_FREE_PAGES) in its calculation when no other reclaimable pages are
available (e.g., file-backed or anonymous pages).  This change prevents
zones like ZONE_DMA32, which have sufficient free pages, from being
mistakenly deemed unreclaimable.  By doing so, the patch ensures proper
node balancing, avoids masking pressure on other zones like ZONE_NORMAL,
and prevents infinite loops in throttle_direct_reclaim() caused by
allow_direct_reclaim(pgdat) repeatedly returning false.


The kernel hangs due to a task stuck in throttle_direct_reclaim(), caused
by a node being incorrectly deemed balanced despite pressure in certain
zones, such as ZONE_NORMAL.  This issue arises from
zone_reclaimable_pages() returning 0 for zones without reclaimable file-
backed or anonymous pages, causing zones like ZONE_DMA32 with sufficient
free pages to be skipped.

The lack of swap or reclaimable pages results in ZONE_DMA32 being ignored
during reclaim, masking pressure in other zones.  Consequently,
pgdat->kswapd_failures remains 0 in balance_pgdat(), preventing fallback
mechanisms in allow_direct_reclaim() from being triggered, leading to an
infinite loop in throttle_direct_reclaim().

This patch modifies zone_reclaimable_pages() to account for free pages
(NR_FREE_PAGES) when no other reclaimable pages exist.  This ensures zones
with sufficient free pages are not skipped, enabling proper balancing and
reclaim behavior.

[[email protected]: coding-style cleanups]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 5a1c84b ("mm: remove reclaim and compaction retry approximations")
Signed-off-by: Seiji Nishikawa <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
pcmoore pushed a commit that referenced this issue Jan 20, 2025
We found a timeout problem with the pldm command on our system.  The
reason is that the MCTP-I3C driver has a race condition when receiving
multiple-packet messages in multi-thread, resulting in a wrong packet
order problem.

We identified this problem by adding a debug message to the
mctp_i3c_read function.

According to the MCTP spec, a multiple-packet message must be composed
in sequence, and if there is a wrong sequence, the whole message will be
discarded and wait for the next SOM.
For example, SOM → Pkt Seq #2 → Pkt Seq #1 → Pkt Seq #3 → EOM.

Therefore, we try to solve this problem by adding a mutex to the
mctp_i3c_read function.  Before the modification, when a command
requesting a multiple-packet message response is sent consecutively, an
error usually occurs within 100 loops.  After the mutex, it can go
through 40000 loops without any error, and it seems to run well.

Fixes: c8755b2 ("mctp i3c: MCTP I3C driver")
Signed-off-by: Leo Yang <[email protected]>
Link: https://patch.msgid.link/[email protected]
[[email protected]: dropped already answered question from changelog]
Signed-off-by: Paolo Abeni <[email protected]>
pcmoore pushed a commit that referenced this issue Jan 20, 2025
irq_chip functions may be called in raw spinlock context. Therefore, we
must also use a raw spinlock for our own internal locking.

This fixes the following lockdep splat:

[    5.349336] =============================
[    5.353349] [ BUG: Invalid wait context ]
[    5.357361] 6.13.0-rc5+ #69 Tainted: G        W
[    5.363031] -----------------------------
[    5.367045] kworker/u17:1/44 is trying to lock:
[    5.371587] ffffff88018b02c0 (&chip->gpio_lock){....}-{3:3}, at: xgpio_irq_unmask (drivers/gpio/gpio-xilinx.c:433 (discriminator 8))
[    5.380079] other info that might help us debug this:
[    5.385138] context-{5:5}
[    5.387762] 5 locks held by kworker/u17:1/44:
[    5.392123] #0: ffffff8800014958 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work (kernel/workqueue.c:3204)
[    5.402260] #1: ffffffc082fcbdd8 (deferred_probe_work){+.+.}-{0:0}, at: process_one_work (kernel/workqueue.c:3205)
[    5.411528] #2: ffffff880172c900 (&dev->mutex){....}-{4:4}, at: __device_attach (drivers/base/dd.c:1006)
[    5.419929] #3: ffffff88039c8268 (request_class#2){+.+.}-{4:4}, at: __setup_irq (kernel/irq/internals.h:156 kernel/irq/manage.c:1596)
[    5.428331] #4: ffffff88039c80c8 (lock_class#2){....}-{2:2}, at: __setup_irq (kernel/irq/manage.c:1614)
[    5.436472] stack backtrace:
[    5.439359] CPU: 2 UID: 0 PID: 44 Comm: kworker/u17:1 Tainted: G        W          6.13.0-rc5+ #69
[    5.448690] Tainted: [W]=WARN
[    5.451656] Hardware name: xlnx,zynqmp (DT)
[    5.455845] Workqueue: events_unbound deferred_probe_work_func
[    5.461699] Call trace:
[    5.464147] show_stack+0x18/0x24 C
[    5.467821] dump_stack_lvl (lib/dump_stack.c:123)
[    5.471501] dump_stack (lib/dump_stack.c:130)
[    5.474824] __lock_acquire (kernel/locking/lockdep.c:4828 kernel/locking/lockdep.c:4898 kernel/locking/lockdep.c:5176)
[    5.478758] lock_acquire (arch/arm64/include/asm/percpu.h:40 kernel/locking/lockdep.c:467 kernel/locking/lockdep.c:5851 kernel/locking/lockdep.c:5814)
[    5.482429] _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:111 kernel/locking/spinlock.c:162)
[    5.486797] xgpio_irq_unmask (drivers/gpio/gpio-xilinx.c:433 (discriminator 8))
[    5.490737] irq_enable (kernel/irq/internals.h:236 kernel/irq/chip.c:170 kernel/irq/chip.c:439 kernel/irq/chip.c:432 kernel/irq/chip.c:345)
[    5.494060] __irq_startup (kernel/irq/internals.h:241 kernel/irq/chip.c:180 kernel/irq/chip.c:250)
[    5.497645] irq_startup (kernel/irq/chip.c:270)
[    5.501143] __setup_irq (kernel/irq/manage.c:1807)
[    5.504728] request_threaded_irq (kernel/irq/manage.c:2208)

Fixes: a32c7ca ("gpio: gpio-xilinx: Add interrupt support")
Signed-off-by: Sean Anderson <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Bartosz Golaszewski <[email protected]>
pcmoore pushed a commit that referenced this issue Jan 20, 2025
This commit addresses a circular locking dependency issue within the GFX
isolation mechanism. The problem was identified by a warning indicating
a potential deadlock due to inconsistent lock acquisition order.

- The `amdgpu_gfx_enforce_isolation_ring_begin_use` and
  `amdgpu_gfx_enforce_isolation_ring_end_use` functions previously
  acquired `enforce_isolation_mutex` and called `amdgpu_gfx_kfd_sch_ctrl`,
  leading to potential deadlocks. ie., If `amdgpu_gfx_kfd_sch_ctrl` is
  called while `enforce_isolation_mutex` is held, and
  `amdgpu_gfx_enforce_isolation_handler` is called while `kfd_sch_mutex` is
  held, it can create a circular dependency.

By ensuring consistent lock usage, this fix resolves the issue:

[  606.297333] ======================================================
[  606.297343] WARNING: possible circular locking dependency detected
[  606.297353] 6.10.0-amd-mlkd-610-311224-lof #19 Tainted: G           OE
[  606.297365] ------------------------------------------------------
[  606.297375] kworker/u96:3/3825 is trying to acquire lock:
[  606.297385] ffff9aa64e431cb8 ((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)){+.+.}-{0:0}, at: __flush_work+0x232/0x610
[  606.297413]
               but task is already holding lock:
[  606.297423] ffff9aa64e432338 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}, at: amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu]
[  606.297725]
               which lock already depends on the new lock.

[  606.297738]
               the existing dependency chain (in reverse order) is:
[  606.297749]
               -> #2 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}:
[  606.297765]        __mutex_lock+0x85/0x930
[  606.297776]        mutex_lock_nested+0x1b/0x30
[  606.297786]        amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu]
[  606.298007]        amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu]
[  606.298225]        amdgpu_ring_alloc+0x48/0x70 [amdgpu]
[  606.298412]        amdgpu_ib_schedule+0x176/0x8a0 [amdgpu]
[  606.298603]        amdgpu_job_run+0xac/0x1e0 [amdgpu]
[  606.298866]        drm_sched_run_job_work+0x24f/0x430 [gpu_sched]
[  606.298880]        process_one_work+0x21e/0x680
[  606.298890]        worker_thread+0x190/0x350
[  606.298899]        kthread+0xe7/0x120
[  606.298908]        ret_from_fork+0x3c/0x60
[  606.298919]        ret_from_fork_asm+0x1a/0x30
[  606.298929]
               -> #1 (&adev->enforce_isolation_mutex){+.+.}-{3:3}:
[  606.298947]        __mutex_lock+0x85/0x930
[  606.298956]        mutex_lock_nested+0x1b/0x30
[  606.298966]        amdgpu_gfx_enforce_isolation_handler+0x87/0x370 [amdgpu]
[  606.299190]        process_one_work+0x21e/0x680
[  606.299199]        worker_thread+0x190/0x350
[  606.299208]        kthread+0xe7/0x120
[  606.299217]        ret_from_fork+0x3c/0x60
[  606.299227]        ret_from_fork_asm+0x1a/0x30
[  606.299236]
               -> #0 ((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)){+.+.}-{0:0}:
[  606.299257]        __lock_acquire+0x16f9/0x2810
[  606.299267]        lock_acquire+0xd1/0x300
[  606.299276]        __flush_work+0x250/0x610
[  606.299286]        cancel_delayed_work_sync+0x71/0x80
[  606.299296]        amdgpu_gfx_kfd_sch_ctrl+0x287/0x4d0 [amdgpu]
[  606.299509]        amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu]
[  606.299723]        amdgpu_ring_alloc+0x48/0x70 [amdgpu]
[  606.299909]        amdgpu_ib_schedule+0x176/0x8a0 [amdgpu]
[  606.300101]        amdgpu_job_run+0xac/0x1e0 [amdgpu]
[  606.300355]        drm_sched_run_job_work+0x24f/0x430 [gpu_sched]
[  606.300369]        process_one_work+0x21e/0x680
[  606.300378]        worker_thread+0x190/0x350
[  606.300387]        kthread+0xe7/0x120
[  606.300396]        ret_from_fork+0x3c/0x60
[  606.300406]        ret_from_fork_asm+0x1a/0x30
[  606.300416]
               other info that might help us debug this:

[  606.300428] Chain exists of:
                 (work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work) --> &adev->enforce_isolation_mutex --> &adev->gfx.kfd_sch_mutex

[  606.300458]  Possible unsafe locking scenario:

[  606.300468]        CPU0                    CPU1
[  606.300476]        ----                    ----
[  606.300484]   lock(&adev->gfx.kfd_sch_mutex);
[  606.300494]                                lock(&adev->enforce_isolation_mutex);
[  606.300508]                                lock(&adev->gfx.kfd_sch_mutex);
[  606.300521]   lock((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work));
[  606.300536]
                *** DEADLOCK ***

[  606.300546] 5 locks held by kworker/u96:3/3825:
[  606.300555]  #0: ffff9aa5aa1f5d58 ((wq_completion)comp_1.1.0){+.+.}-{0:0}, at: process_one_work+0x3f5/0x680
[  606.300577]  #1: ffffaa53c3c97e40 ((work_completion)(&sched->work_run_job)){+.+.}-{0:0}, at: process_one_work+0x1d6/0x680
[  606.300600]  #2: ffff9aa64e463c98 (&adev->enforce_isolation_mutex){+.+.}-{3:3}, at: amdgpu_gfx_enforce_isolation_ring_begin_use+0x1c3/0x5d0 [amdgpu]
[  606.300837]  #3: ffff9aa64e432338 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}, at: amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu]
[  606.301062]  #4: ffffffff8c1a5660 (rcu_read_lock){....}-{1:2}, at: __flush_work+0x70/0x610
[  606.301083]
               stack backtrace:
[  606.301092] CPU: 14 PID: 3825 Comm: kworker/u96:3 Tainted: G           OE      6.10.0-amd-mlkd-610-311224-lof #19
[  606.301109] Hardware name: Gigabyte Technology Co., Ltd. X570S GAMING X/X570S GAMING X, BIOS F7 03/22/2024
[  606.301124] Workqueue: comp_1.1.0 drm_sched_run_job_work [gpu_sched]
[  606.301140] Call Trace:
[  606.301146]  <TASK>
[  606.301154]  dump_stack_lvl+0x9b/0xf0
[  606.301166]  dump_stack+0x10/0x20
[  606.301175]  print_circular_bug+0x26c/0x340
[  606.301187]  check_noncircular+0x157/0x170
[  606.301197]  ? register_lock_class+0x48/0x490
[  606.301213]  __lock_acquire+0x16f9/0x2810
[  606.301230]  lock_acquire+0xd1/0x300
[  606.301239]  ? __flush_work+0x232/0x610
[  606.301250]  ? srso_alias_return_thunk+0x5/0xfbef5
[  606.301261]  ? mark_held_locks+0x54/0x90
[  606.301274]  ? __flush_work+0x232/0x610
[  606.301284]  __flush_work+0x250/0x610
[  606.301293]  ? __flush_work+0x232/0x610
[  606.301305]  ? __pfx_wq_barrier_func+0x10/0x10
[  606.301318]  ? mark_held_locks+0x54/0x90
[  606.301331]  ? srso_alias_return_thunk+0x5/0xfbef5
[  606.301345]  cancel_delayed_work_sync+0x71/0x80
[  606.301356]  amdgpu_gfx_kfd_sch_ctrl+0x287/0x4d0 [amdgpu]
[  606.301661]  amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu]
[  606.302050]  ? srso_alias_return_thunk+0x5/0xfbef5
[  606.302069]  amdgpu_ring_alloc+0x48/0x70 [amdgpu]
[  606.302452]  amdgpu_ib_schedule+0x176/0x8a0 [amdgpu]
[  606.302862]  ? drm_sched_entity_error+0x82/0x190 [gpu_sched]
[  606.302890]  amdgpu_job_run+0xac/0x1e0 [amdgpu]
[  606.303366]  drm_sched_run_job_work+0x24f/0x430 [gpu_sched]
[  606.303388]  process_one_work+0x21e/0x680
[  606.303409]  worker_thread+0x190/0x350
[  606.303424]  ? __pfx_worker_thread+0x10/0x10
[  606.303437]  kthread+0xe7/0x120
[  606.303449]  ? __pfx_kthread+0x10/0x10
[  606.303463]  ret_from_fork+0x3c/0x60
[  606.303476]  ? __pfx_kthread+0x10/0x10
[  606.303489]  ret_from_fork_asm+0x1a/0x30
[  606.303512]  </TASK>

v2: Refactor lock handling to resolve circular dependency (Alex)

- Introduced a `sched_work` flag to defer the call to
  `amdgpu_gfx_kfd_sch_ctrl` until after releasing
  `enforce_isolation_mutex`.
- This change ensures that `amdgpu_gfx_kfd_sch_ctrl` is called outside
  the critical section, preventing the circular dependency and deadlock.
- The `sched_work` flag is set within the mutex-protected section if
  conditions are met, and the actual function call is made afterward.
- This approach ensures consistent lock acquisition order.

Fixes: afefd6f ("drm/amdgpu: Implement Enforce Isolation Handler for KGD/KFD serialization")
Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Signed-off-by: Srinivasan Shanmugam <[email protected]>
Suggested-by: Alex Deucher <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
(cherry picked from commit 0b6b2dd38336d5fd49214f0e4e6495e658e3ab44)
Cc: [email protected]
pcmoore pushed a commit that referenced this issue Jan 20, 2025
syz reports an out of bounds read:

==================================================================
BUG: KASAN: slab-out-of-bounds in ocfs2_match fs/ocfs2/dir.c:334
[inline]
BUG: KASAN: slab-out-of-bounds in ocfs2_search_dirblock+0x283/0x6e0
fs/ocfs2/dir.c:367
Read of size 1 at addr ffff88804d8b9982 by task syz-executor.2/14802

CPU: 0 UID: 0 PID: 14802 Comm: syz-executor.2 Not tainted 6.13.0-rc4 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1
04/01/2014
Sched_ext: serialise (enabled+all), task: runnable_at=-10ms
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x229/0x350 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0x164/0x530 mm/kasan/report.c:489
kasan_report+0x147/0x180 mm/kasan/report.c:602
ocfs2_match fs/ocfs2/dir.c:334 [inline]
ocfs2_search_dirblock+0x283/0x6e0 fs/ocfs2/dir.c:367
ocfs2_find_entry_id fs/ocfs2/dir.c:414 [inline]
ocfs2_find_entry+0x1143/0x2db0 fs/ocfs2/dir.c:1078
ocfs2_find_files_on_disk+0x18e/0x530 fs/ocfs2/dir.c:1981
ocfs2_lookup_ino_from_name+0xb6/0x110 fs/ocfs2/dir.c:2003
ocfs2_lookup+0x30a/0xd40 fs/ocfs2/namei.c:122
lookup_open fs/namei.c:3627 [inline]
open_last_lookups fs/namei.c:3748 [inline]
path_openat+0x145a/0x3870 fs/namei.c:3984
do_filp_open+0xe9/0x1c0 fs/namei.c:4014
do_sys_openat2+0x135/0x1d0 fs/open.c:1402
do_sys_open fs/open.c:1417 [inline]
__do_sys_openat fs/open.c:1433 [inline]
__se_sys_openat fs/open.c:1428 [inline]
__x64_sys_openat+0x15d/0x1c0 fs/open.c:1428
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf6/0x210 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f01076903ad
Code: c3 e8 a7 2b 00 00 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89
f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01
f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f01084acfc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 00007f01077cbf80 RCX: 00007f01076903ad
RDX: 0000000000105042 RSI: 0000000020000080 RDI: ffffffffffffff9c
RBP: 00007f01077cbf80 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000000001ff R11: 0000000000000246 R12: 0000000000000000
R13: 00007f01077cbf80 R14: 00007f010764fc90 R15: 00007f010848d000
</TASK>
==================================================================

And a general protection fault in ocfs2_prepare_dir_for_insert:

==================================================================
loop0: detected capacity change from 0 to 32768
JBD2: Ignoring recovery information on journal
ocfs2: Mounting device (7,0) on (node local, slot 0) with ordered data
mode.
Oops: general protection fault, probably for non-canonical address
0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
CPU: 0 UID: 0 PID: 5096 Comm: syz-executor792 Not tainted
6.11.0-rc4-syzkaller-00002-gb0da640826ba #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:ocfs2_find_dir_space_id fs/ocfs2/dir.c:3406 [inline]
RIP: 0010:ocfs2_prepare_dir_for_insert+0x3309/0x5c70 fs/ocfs2/dir.c:4280
Code: 00 00 e8 2a 25 13 fe e9 ba 06 00 00 e8 20 25 13 fe e9 4f 01 00 00
e8 16 25 13 fe 49 8d 7f 08 49 8d 5f 09 48 89 f8 48 c1 e8 03 <42> 0f b6
04 20 84 c0 0f 85 bd 23 00 00 48 89 d8 48 c1 e8 03 42 0f
RSP: 0018:ffffc9000af9f020 EFLAGS: 00010202
RAX: 0000000000000001 RBX: 0000000000000009 RCX: ffff88801e27a440
RDX: 0000000000000000 RSI: 0000000000000400 RDI: 0000000000000008
RBP: ffffc9000af9f830 R08: ffffffff8380395b R09: ffffffff838090a7
R10: 0000000000000002 R11: ffff88801e27a440 R12: dffffc0000000000
R13: ffff88803c660878 R14: f700000000000088 R15: 0000000000000000
FS:  000055555a677380(0000) GS:ffff888020800000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000560bce569178 CR3: 000000001de5a000 CR4: 0000000000350ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
ocfs2_mknod+0xcaf/0x2b40 fs/ocfs2/namei.c:292
vfs_mknod+0x36d/0x3b0 fs/namei.c:4088
do_mknodat+0x3ec/0x5b0
__do_sys_mknodat fs/namei.c:4166 [inline]
__se_sys_mknodat fs/namei.c:4163 [inline]
__x64_sys_mknodat+0xa7/0xc0 fs/namei.c:4163
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f2dafda3a99
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 17 00 00 90 48 89
f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08
0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8
64 89 01 48
RSP: 002b:00007ffe336a6658 EFLAGS: 00000246 ORIG_RAX:
0000000000000103
RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
00007f2dafda3a99
RDX: 00000000000021c0 RSI: 0000000020000040 RDI:
00000000ffffff9c
RBP: 00007f2dafe1b5f0 R08: 0000000000004480 R09:
000055555a6784c0
R10: 0000000000000103 R11: 0000000000000246 R12:
00007ffe336a6680
R13: 00007ffe336a68a8 R14: 431bde82d7b634db R15:
00007f2dafdec03b
</TASK>
==================================================================

The two reports are all caused invalid negative i_size of dir inode.  For
ocfs2, dir_inode can't be negative or zero.

Here add a check in which is called by ocfs2_check_dir_for_entry().  It
fixes the second report as ocfs2_check_dir_for_entry() must be called
before ocfs2_prepare_dir_for_insert().  Also set a up limit for dir with
OCFS2_INLINE_DATA_FL.  The i_size can't be great than blocksize.

Link: https://lkml.kernel.org/r/[email protected]
Reported-by: Jiacheng Xu <[email protected]>
Link: https://lore.kernel.org/ocfs2-devel/[email protected]/T/#u
Reported-by: [email protected]
Link: https://lore.kernel.org/all/[email protected]/T/
Signed-off-by: Su Yue <[email protected]>
Reviewed-by: Heming Zhao <[email protected]>
Reviewed-by: Joseph Qi <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Jun Piao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
pcmoore pushed a commit that referenced this issue Jan 20, 2025
Fix a lockdep warning [1] observed during the write combining test.

The warning indicates a potential nested lock scenario that could lead
to a deadlock.

However, this is a false positive alarm because the SF lock and its
parent lock are distinct ones.

The lockdep confusion arises because the locks belong to the same object
class (i.e., struct mlx5_core_dev).

To resolve this, the code has been refactored to avoid taking both
locks. Instead, only the parent lock is acquired.

[1]
raw_ethernet_bw/2118 is trying to acquire lock:
[  213.619032] ffff88811dd75e08 (&dev->wc_state_lock){+.+.}-{3:3}, at:
               mlx5_wc_support_get+0x18c/0x210 [mlx5_core]
[  213.620270]
[  213.620270] but task is already holding lock:
[  213.620943] ffff88810b585e08 (&dev->wc_state_lock){+.+.}-{3:3}, at:
               mlx5_wc_support_get+0x10c/0x210 [mlx5_core]
[  213.622045]
[  213.622045] other info that might help us debug this:
[  213.622778]  Possible unsafe locking scenario:
[  213.622778]
[  213.623465]        CPU0
[  213.623815]        ----
[  213.624148]   lock(&dev->wc_state_lock);
[  213.624615]   lock(&dev->wc_state_lock);
[  213.625071]
[  213.625071]  *** DEADLOCK ***
[  213.625071]
[  213.625805]  May be due to missing lock nesting notation
[  213.625805]
[  213.626522] 4 locks held by raw_ethernet_bw/2118:
[  213.627019]  #0: ffff88813f80d578 (&uverbs_dev->disassociate_srcu){.+.+}-{0:0},
                at: ib_uverbs_ioctl+0xc4/0x170 [ib_uverbs]
[  213.628088]  #1: ffff88810fb23930 (&file->hw_destroy_rwsem){.+.+}-{3:3},
                at: ib_init_ucontext+0x2d/0xf0 [ib_uverbs]
[  213.629094]  #2: ffff88810fb23878 (&file->ucontext_lock){+.+.}-{3:3},
                at: ib_init_ucontext+0x49/0xf0 [ib_uverbs]
[  213.630106]  #3: ffff88810b585e08 (&dev->wc_state_lock){+.+.}-{3:3},
                at: mlx5_wc_support_get+0x10c/0x210 [mlx5_core]
[  213.631185]
[  213.631185] stack backtrace:
[  213.631718] CPU: 1 UID: 0 PID: 2118 Comm: raw_ethernet_bw Not tainted
               6.12.0-rc7_internal_net_next_mlx5_89a0ad0 #1
[  213.632722] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
               rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[  213.633785] Call Trace:
[  213.634099]
[  213.634393]  dump_stack_lvl+0x7e/0xc0
[  213.634806]  print_deadlock_bug+0x278/0x3c0
[  213.635265]  __lock_acquire+0x15f4/0x2c40
[  213.635712]  lock_acquire+0xcd/0x2d0
[  213.636120]  ? mlx5_wc_support_get+0x18c/0x210 [mlx5_core]
[  213.636722]  ? mlx5_ib_enable_lb+0x24/0xa0 [mlx5_ib]
[  213.637277]  __mutex_lock+0x81/0xda0
[  213.637697]  ? mlx5_wc_support_get+0x18c/0x210 [mlx5_core]
[  213.638305]  ? mlx5_wc_support_get+0x18c/0x210 [mlx5_core]
[  213.638902]  ? rcu_read_lock_sched_held+0x3f/0x70
[  213.639400]  ? mlx5_wc_support_get+0x18c/0x210 [mlx5_core]
[  213.640016]  mlx5_wc_support_get+0x18c/0x210 [mlx5_core]
[  213.640615]  set_ucontext_resp+0x68/0x2b0 [mlx5_ib]
[  213.641144]  ? debug_mutex_init+0x33/0x40
[  213.641586]  mlx5_ib_alloc_ucontext+0x18e/0x7b0 [mlx5_ib]
[  213.642145]  ib_init_ucontext+0xa0/0xf0 [ib_uverbs]
[  213.642679]  ib_uverbs_handler_UVERBS_METHOD_GET_CONTEXT+0x95/0xc0
                [ib_uverbs]
[  213.643426]  ? _copy_from_user+0x46/0x80
[  213.643878]  ib_uverbs_cmd_verbs+0xa6b/0xc80 [ib_uverbs]
[  213.644426]  ? ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0x130/0x130
               [ib_uverbs]
[  213.645213]  ? __lock_acquire+0xa99/0x2c40
[  213.645675]  ? lock_acquire+0xcd/0x2d0
[  213.646101]  ? ib_uverbs_ioctl+0xc4/0x170 [ib_uverbs]
[  213.646625]  ? reacquire_held_locks+0xcf/0x1f0
[  213.647102]  ? do_user_addr_fault+0x45d/0x770
[  213.647586]  ib_uverbs_ioctl+0xe0/0x170 [ib_uverbs]
[  213.648102]  ? ib_uverbs_ioctl+0xc4/0x170 [ib_uverbs]
[  213.648632]  __x64_sys_ioctl+0x4d3/0xaa0
[  213.649060]  ? do_user_addr_fault+0x4a8/0x770
[  213.649528]  do_syscall_64+0x6d/0x140
[  213.649947]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
[  213.650478] RIP: 0033:0x7fa179b0737b
[  213.650893] Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c
               89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8
               10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d
               7d 2a 0f 00 f7 d8 64 89 01 48
[  213.652619] RSP: 002b:00007ffd2e6d46e8 EFLAGS: 00000246 ORIG_RAX:
               0000000000000010
[  213.653390] RAX: ffffffffffffffda RBX: 00007ffd2e6d47f8 RCX:
               00007fa179b0737b
[  213.654084] RDX: 00007ffd2e6d47e0 RSI: 00000000c0181b01 RDI:
               0000000000000003
[  213.654767] RBP: 00007ffd2e6d47c0 R08: 00007fa1799be010 R09:
               0000000000000002
[  213.655453] R10: 00007ffd2e6d4960 R11: 0000000000000246 R12:
               00007ffd2e6d487c
[  213.656170] R13: 0000000000000027 R14: 0000000000000001 R15:
               00007ffd2e6d4f70

Fixes: d98995b ("net/mlx5: Reimplement write combining test")
Signed-off-by: Yishai Hadas <[email protected]>
Reviewed-by: Michael Guralnik <[email protected]>
Reviewed-by: Larysa Zaremba <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
pcmoore pushed a commit that referenced this issue Jan 20, 2025
Clear the port select structure on error so no stale values left after
definers are destroyed. That's because the mlx5_lag_destroy_definers()
always try to destroy all lag definers in the tt_map, so in the flow
below lag definers get double-destroyed and cause kernel crash:

  mlx5_lag_port_sel_create()
    mlx5_lag_create_definers()
      mlx5_lag_create_definer()     <- Failed on tt 1
        mlx5_lag_destroy_definers() <- definers[tt=0] gets destroyed
  mlx5_lag_port_sel_create()
    mlx5_lag_create_definers()
      mlx5_lag_create_definer()     <- Failed on tt 0
        mlx5_lag_destroy_definers() <- definers[tt=0] gets double-destroyed

 Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
 Mem abort info:
   ESR = 0x0000000096000005
   EC = 0x25: DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
   FSC = 0x05: level 1 translation fault
 Data abort info:
   ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
 user pgtable: 64k pages, 48-bit VAs, pgdp=0000000112ce2e00
 [0000000000000008] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
 Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
 Modules linked in: iptable_raw bonding ip_gre ip6_gre gre ip6_tunnel tunnel6 geneve ip6_udp_tunnel udp_tunnel ipip tunnel4 ip_tunnel rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_fwctl(OE) fwctl(OE) mlx5_core(OE) mlxdevm(OE) ib_core(OE) mlxfw(OE) memtrack(OE) mlx_compat(OE) openvswitch nsh nf_conncount psample xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc netconsole overlay efi_pstore sch_fq_codel zram ip_tables crct10dif_ce qemu_fw_cfg fuse ipv6 crc_ccitt [last unloaded: mlx_compat(OE)]
  CPU: 3 UID: 0 PID: 217 Comm: kworker/u53:2 Tainted: G           OE      6.11.0+ #2
  Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
  Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
  Workqueue: mlx5_lag mlx5_do_bond_work [mlx5_core]
  pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  pc : mlx5_del_flow_rules+0x24/0x2c0 [mlx5_core]
  lr : mlx5_lag_destroy_definer+0x54/0x100 [mlx5_core]
  sp : ffff800085fafb00
  x29: ffff800085fafb00 x28: ffff0000da0c8000 x27: 0000000000000000
  x26: ffff0000da0c8000 x25: ffff0000da0c8000 x24: ffff0000da0c8000
  x23: ffff0000c31f81a0 x22: 0400000000000000 x21: ffff0000da0c8000
  x20: 0000000000000000 x19: 0000000000000001 x18: 0000000000000000
  x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffff8b0c9350
  x14: 0000000000000000 x13: ffff800081390d18 x12: ffff800081dc3cc0
  x11: 0000000000000001 x10: 0000000000000b10 x9 : ffff80007ab7304c
  x8 : ffff0000d00711f0 x7 : 0000000000000004 x6 : 0000000000000190
  x5 : ffff00027edb3010 x4 : 0000000000000000 x3 : 0000000000000000
  x2 : ffff0000d39b8000 x1 : ffff0000d39b8000 x0 : 0400000000000000
  Call trace:
   mlx5_del_flow_rules+0x24/0x2c0 [mlx5_core]
   mlx5_lag_destroy_definer+0x54/0x100 [mlx5_core]
   mlx5_lag_destroy_definers+0xa0/0x108 [mlx5_core]
   mlx5_lag_port_sel_create+0x2d4/0x6f8 [mlx5_core]
   mlx5_activate_lag+0x60c/0x6f8 [mlx5_core]
   mlx5_do_bond_work+0x284/0x5c8 [mlx5_core]
   process_one_work+0x170/0x3e0
   worker_thread+0x2d8/0x3e0
   kthread+0x11c/0x128
   ret_from_fork+0x10/0x20
  Code: a9025bf5 aa0003f6 a90363f7 f90023f9 (f9400400)
  ---[ end trace 0000000000000000 ]---

Fixes: dc48516 ("net/mlx5: Lag, add support to create definers for LAG")
Signed-off-by: Mark Zhang <[email protected]>
Reviewed-by: Leon Romanovsky <[email protected]>
Reviewed-by: Mark Bloch <[email protected]>
Reviewed-by: Jacob Keller <[email protected]>
Signed-off-by: Tariq Toukan <[email protected]>
Signed-off-by: Paolo Abeni <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants