-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFE: SELinux should not contain hardcoded tests of filesystem type names #2
Labels
Comments
pcmoore
changed the title
SELinux should not contain hardcoded tests of filesystem type names
RFE: SELinux should not contain hardcoded tests of filesystem type names
Nov 18, 2016
pcmoore
pushed a commit
that referenced
this issue
Dec 5, 2016
During rpm resume we restore the fences, but we do not have the protection of struct_mutex. This rules out updating the activity tracking on the fences, and requires us to rely on the rpm as the serialisation barrier instead. [ 350.298052] [drm:intel_runtime_resume [i915]] Resuming device [ 350.308606] [ 350.310520] =============================== [ 350.315560] [ INFO: suspicious RCU usage. ] [ 350.320554] 4.8.0-rc8-bsw-rapl+ #3133 Tainted: G U W [ 350.327208] ------------------------------- [ 350.331977] ../drivers/gpu/drm/i915/i915_gem_request.h:371 suspicious rcu_dereference_protected() usage! [ 350.342619] [ 350.342619] other info that might help us debug this: [ 350.342619] [ 350.351593] [ 350.351593] rcu_scheduler_active = 1, debug_locks = 0 [ 350.358952] 3 locks held by Xorg/320: [ 350.363077] #0: (&dev->mode_config.mutex){+.+.+.}, at: [<ffffffffa030589c>] drm_modeset_lock_all+0x3c/0xd0 [drm] [ 350.375162] #1: (crtc_ww_class_acquire){+.+.+.}, at: [<ffffffffa03058a6>] drm_modeset_lock_all+0x46/0xd0 [drm] [ 350.387022] #2: (crtc_ww_class_mutex){+.+.+.}, at: [<ffffffffa0305056>] drm_modeset_lock+0x36/0x110 [drm] [ 350.398236] [ 350.398236] stack backtrace: [ 350.403196] CPU: 1 PID: 320 Comm: Xorg Tainted: G U W 4.8.0-rc8-bsw-rapl+ #3133 [ 350.412457] Hardware name: Intel Corporation CHERRYVIEW C0 PLATFORM/Braswell CRB, BIOS BRAS.X64.X088.R00.1510270350 10/27/2015 [ 350.425212] 0000000000000000 ffff8801680a78c8 ffffffff81332187 ffff88016c5c5000 [ 350.433611] 0000000000000001 ffff8801680a78f8 ffffffff810ca6da ffff88016cc8b0f0 [ 350.442012] ffff88016cc80000 ffff88016cc80000 ffff880177ad0000 ffff8801680a7948 [ 350.450409] Call Trace: [ 350.453165] [<ffffffff81332187>] dump_stack+0x67/0x90 [ 350.458931] [<ffffffff810ca6da>] lockdep_rcu_suspicious+0xea/0x120 [ 350.466002] [<ffffffffa039e8dd>] fence_update+0xbd/0x670 [i915] [ 350.472766] [<ffffffffa039efe2>] i915_gem_restore_fences+0x52/0x70 [i915] [ 350.480496] [<ffffffffa0368f42>] vlv_resume_prepare+0x72/0x570 [i915] [ 350.487839] [<ffffffffa0369802>] intel_runtime_resume+0x102/0x210 [i915] [ 350.495442] [<ffffffff8137f26f>] pci_pm_runtime_resume+0x7f/0xb0 [ 350.502274] [<ffffffff8137f1f0>] ? pci_restore_standard_config+0x40/0x40 [ 350.509883] [<ffffffff814401c5>] __rpm_callback+0x35/0x70 [ 350.516037] [<ffffffff8137f1f0>] ? pci_restore_standard_config+0x40/0x40 [ 350.523646] [<ffffffff81440224>] rpm_callback+0x24/0x80 [ 350.529604] [<ffffffff8137f1f0>] ? pci_restore_standard_config+0x40/0x40 [ 350.537212] [<ffffffff814417bd>] rpm_resume+0x4ad/0x740 [ 350.543161] [<ffffffff81441aa1>] __pm_runtime_resume+0x51/0x80 [ 350.549824] [<ffffffffa03889c8>] intel_runtime_pm_get+0x28/0x90 [i915] [ 350.557265] [<ffffffffa0388a53>] intel_display_power_get+0x23/0x50 [i915] [ 350.565001] [<ffffffffa03ef23d>] intel_atomic_commit_tail+0xdfd/0x10b0 [i915] [ 350.573106] [<ffffffffa034b2e9>] ? drm_atomic_helper_swap_state+0x159/0x300 [drm_kms_helper] [ 350.582659] [<ffffffff81615091>] ? _raw_spin_unlock+0x31/0x50 [ 350.589205] [<ffffffffa034b2e9>] ? drm_atomic_helper_swap_state+0x159/0x300 [drm_kms_helper] [ 350.598787] [<ffffffffa03ef8a5>] intel_atomic_commit+0x3b5/0x500 [i915] [ 350.606319] [<ffffffffa03061dc>] ? drm_atomic_set_crtc_for_connector+0xcc/0x100 [drm] [ 350.615209] [<ffffffffa0306b49>] drm_atomic_commit+0x49/0x50 [drm] [ 350.622242] [<ffffffffa034dee8>] drm_atomic_helper_set_config+0x88/0xc0 [drm_kms_helper] [ 350.631419] [<ffffffffa02f94ac>] drm_mode_set_config_internal+0x6c/0x120 [drm] [ 350.639623] [<ffffffffa02fa94c>] drm_mode_setcrtc+0x22c/0x4d0 [drm] [ 350.646760] [<ffffffffa02f0f19>] drm_ioctl+0x209/0x460 [drm] [ 350.653217] [<ffffffffa02fa720>] ? drm_mode_getcrtc+0x150/0x150 [drm] [ 350.660536] [<ffffffff810c984a>] ? __lock_is_held+0x4a/0x70 [ 350.666885] [<ffffffff81202303>] do_vfs_ioctl+0x93/0x6b0 [ 350.672939] [<ffffffff8120f843>] ? __fget+0x113/0x200 [ 350.678797] [<ffffffff8120f735>] ? __fget+0x5/0x200 [ 350.684361] [<ffffffff81202964>] SyS_ioctl+0x44/0x80 [ 350.690030] [<ffffffff81001deb>] do_syscall_64+0x5b/0x120 [ 350.696184] [<ffffffff81615ada>] entry_SYSCALL64_slow_path+0x25/0x25 Note we also have to remember the lesson from commit 4fc788f ("drm/i915: Flush delayed fence releases after reset") where we have to flush any changes to the fence on restore. v2: Replace call to release user mmaps with an assertion that they have already been zapped. Fixes: 49ef529 ("drm/i915: Move fence tracking from object to vma") Reported-by: Ville Syrjälä <[email protected]> Signed-off-by: Chris Wilson <[email protected]> Cc: Ville Syrjälä <[email protected]> Cc: Joonas Lahtinen <[email protected]> Cc: Mika Kuoppala <[email protected]> Reviewed-by: Joonas Lahtinen <[email protected]> Link: http://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit 4676dc8) Signed-off-by: Jani Nikula <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Dec 5, 2016
The following panic was caught when run ocfs2 disconfig single test (block size 512 and cluster size 8192). ocfs2_journal_dirty() return -ENOSPC, that means credits were used up. The total credit should include 3 times of "num_dx_leaves" from ocfs2_dx_dir_rebalance(), because 2 times will be consumed in ocfs2_dx_dir_transfer_leaf() and 1 time will be consumed in ocfs2_dx_dir_new_cluster() -> __ocfs2_dx_dir_new_cluster() -> ocfs2_dx_dir_format_cluster(). But only two times is included in ocfs2_dx_dir_rebalance_credits(), fix it. This can cause read-only fs(v4.1+) or panic for mainline linux depending on mount option. ------------[ cut here ]------------ kernel BUG at fs/ocfs2/journal.c:775! invalid opcode: 0000 [#1] SMP Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront fb_sys_fops sysimgblt sysfillrect syscopyarea parport_pc parport acpi_cpufreq i2c_piix4 i2c_core pcspkr ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod CPU: 2 PID: 10601 Comm: dd Not tainted 4.1.12-71.el6uek.bug24939243.x86_64 #2 Hardware name: Xen HVM domU, BIOS 4.4.4OVM 02/11/2016 task: ffff8800b6de6200 ti: ffff8800a7d48000 task.ti: ffff8800a7d48000 RIP: ocfs2_journal_dirty+0xa7/0xb0 [ocfs2] RSP: 0018:ffff8800a7d4b6d8 EFLAGS: 00010286 RAX: 00000000ffffffe4 RBX: 00000000814d0a9c RCX: 00000000000004f9 RDX: ffffffffa008e990 RSI: ffffffffa008f1ee RDI: ffff8800622b6460 RBP: ffff8800a7d4b6f8 R08: ffffffffa008f288 R09: ffff8800622b6460 R10: 0000000000000000 R11: 0000000000000282 R12: 0000000002c8421e R13: ffff88006d0cad00 R14: ffff880092beef60 R15: 0000000000000070 FS: 00007f9b83e92700(0000) GS:ffff8800be880000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fb2c0d1a000 CR3: 0000000008f80000 CR4: 00000000000406e0 Call Trace: ocfs2_dx_dir_transfer_leaf+0x159/0x1a0 [ocfs2] ocfs2_dx_dir_rebalance+0xd9b/0xea0 [ocfs2] ocfs2_find_dir_space_dx+0xd3/0x300 [ocfs2] ocfs2_prepare_dx_dir_for_insert+0x219/0x450 [ocfs2] ocfs2_prepare_dir_for_insert+0x1d6/0x580 [ocfs2] ocfs2_mknod+0x5a2/0x1400 [ocfs2] ocfs2_create+0x73/0x180 [ocfs2] vfs_create+0xd8/0x100 lookup_open+0x185/0x1c0 do_last+0x36d/0x780 path_openat+0x92/0x470 do_filp_open+0x4a/0xa0 do_sys_open+0x11a/0x230 SyS_open+0x1e/0x20 system_call_fastpath+0x12/0x71 Code: 1d 3f 29 09 00 48 85 db 74 1f 48 8b 03 0f 1f 80 00 00 00 00 48 8b 7b 08 48 83 c3 10 4c 89 e6 ff d0 48 8b 03 48 85 c0 75 eb eb 90 <0f> 0b eb fe 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 RIP ocfs2_journal_dirty+0xa7/0xb0 [ocfs2] ---[ end trace 91ac5312a6ee1288 ]--- Kernel panic - not syncing: Fatal exception Kernel Offset: disabled Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Junxiao Bi <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Joseph Qi <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Dec 15, 2016
Missing initialization of udp_tunnel_sock_cfg causes to following kernel panic, while kernel tries to execute gro_receive(). While being there, we converted udp_port_cfg to use the same initialization scheme as udp_tunnel_sock_cfg. ------------[ cut here ]------------ kernel tried to execute NX-protected page - exploit attempt? (uid: 0) BUG: unable to handle kernel paging request at ffffffffa0588c50 IP: [<ffffffffa0588c50>] __this_module+0x50/0xffffffffffff8400 [ib_rxe] PGD 1c09067 PUD 1c0a063 PMD bb394067 PTE 80000000ad5e8163 Oops: 0011 [#1] SMP Modules linked in: ib_rxe ip6_udp_tunnel udp_tunnel CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.7.0-rc3+ #2 Hardware name: Red Hat KVM, BIOS Bochs 01/01/2011 task: ffff880235e4e680 ti: ffff880235e68000 task.ti: ffff880235e68000 RIP: 0010:[<ffffffffa0588c50>] [<ffffffffa0588c50>] __this_module+0x50/0xffffffffffff8400 [ib_rxe] RSP: 0018:ffff880237343c80 EFLAGS: 00010282 RAX: 00000000dffe482d RBX: ffff8800ae330900 RCX: 000000002001b712 RDX: ffff8800ae330900 RSI: ffff8800ae102578 RDI: ffff880235589c00 RBP: ffff880237343cb0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8800ae33e262 R13: ffff880235589c00 R14: 0000000000000014 R15: ffff8800ae102578 FS: 0000000000000000(0000) GS:ffff880237340000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffa0588c50 CR3: 0000000001c06000 CR4: 00000000000006e0 Stack: ffffffff8160860e ffff8800ae330900 ffff8800ae102578 0000000000000014 000000000000004e ffff8800ae102578 ffff880237343ce0 ffffffff816088fb 0000000000000000 ffff8800ae330900 0000000000000000 00000000ffad0000 Call Trace: <IRQ> [<ffffffff8160860e>] ? udp_gro_receive+0xde/0x130 [<ffffffff816088fb>] udp4_gro_receive+0x10b/0x2d0 [<ffffffff81611373>] inet_gro_receive+0x1d3/0x270 [<ffffffff81594e29>] dev_gro_receive+0x269/0x3b0 [<ffffffff81595188>] napi_gro_receive+0x38/0x120 [<ffffffffa011caee>] mlx5e_handle_rx_cqe+0x27e/0x340 [mlx5_core] [<ffffffffa011d076>] mlx5e_poll_rx_cq+0x66/0x6d0 [mlx5_core] [<ffffffffa011d7ae>] mlx5e_napi_poll+0x8e/0x400 [mlx5_core] [<ffffffff815949a0>] net_rx_action+0x160/0x380 [<ffffffff816a9197>] __do_softirq+0xd7/0x2c5 [<ffffffff81085c35>] irq_exit+0xf5/0x100 [<ffffffff816a8f16>] do_IRQ+0x56/0xd0 [<ffffffff816a6dcc>] common_interrupt+0x8c/0x8c <EOI> [<ffffffff81061f96>] ? native_safe_halt+0x6/0x10 [<ffffffff81037ade>] default_idle+0x1e/0xd0 [<ffffffff8103828f>] arch_cpu_idle+0xf/0x20 [<ffffffff810c37dc>] default_idle_call+0x3c/0x50 [<ffffffff810c3b13>] cpu_startup_entry+0x323/0x3c0 [<ffffffff81050d8c>] start_secondary+0x15c/0x1a0 RIP [<ffffffffa0588c50>] __this_module+0x50/0xffffffffffff8400 [ib_rxe] RSP <ffff880237343c80> CR2: ffffffffa0588c50 ---[ end trace 489ee31fa7614ac5 ]--- Kernel panic - not syncing: Fatal exception in interrupt Kernel Offset: disabled ---[ end Kernel panic - not syncing: Fatal exception in interrupt ------------[ cut here ]------------ Fixes: 8700e3e ("Soft RoCE driver") Signed-off-by: Yonatan Cohen <[email protected]> Reviewed-by: Moni Shoua <[email protected]> Signed-off-by: Leon Romanovsky <[email protected]> Signed-off-by: Doug Ledford <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Dec 15, 2016
Tushar Dave says: ==================== sparc: Enable sun4v hypervisor PCI IOMMU v2 APIs and ATU ATU (Address Translation Unit) is a new IOMMU in SPARC supported with sun4v hypervisor PCI IOMMU v2 APIs. Current SPARC IOMMU supports only 32bit address ranges and one TSB per PCIe root complex that has a 2GB per root complex DVMA space limit. The limit has become a scalability bottleneck nowadays that a typical 10G/40G NIC can consume 500MB DVMA space per instance. When DVMA resource is exhausted, devices will not be usable since the driver can't allocate DVMA. For example, we recently experienced legacy IOMMU limitation while using i40e driver in system with large number of CPUs (e.g. 128). Four ports of i40e, each request 128 QP (Queue Pairs). Each queue has 512 (default) descriptors. So considering only RX queues (because RX premap DMA buffers), i40e takes 4*128*512 number of DMA entries in IOMMU table. Legacy IOMMU can have at max (2G/8K)- 1 entries available in table. So bringing up four instance of i40e alone saturate existing IOMMU resource. ATU removes bottleneck by allowing guest os to create IOTSB of size 32G (or more) with 64bit address ranges available in ATU HW. 32G is more than enough DVMA space to be shared by all PCIe devices under root complex contrast to 2G space provided by legacy IOMMU. ATU allows PCIe devices to use 64bit DMA addressing. Devices which choose to use 32bit DMA mask will continue to work with the existing legacy IOMMU. The patch set is tested on sun4v (T1000, T2000, T3, T4, T5, T7, S7) and sun4u SPARC. Thanks. -Tushar v2->v3: - Patch #5 addresses comment by Joe Perches. -- use %s, __func__ instead of embedding the function name. v1->v2: - Patch #2 addresses comments by Dave M. -- use page allocator to allocate IOTSB. -- use true/false with boolean variables. ==================== Signed-off-by: David S. Miller <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Feb 23, 2017
Mathieu reported that the LTTNG modules are broken as of 4.10-rc1 due to the removal of the cpu hotplug notifiers. Usually I don't care much about out of tree modules, but LTTNG is widely used in distros. There are two ways to solve that: 1) Reserve a hotplug state for LTTNG 2) Add a dynamic range for the prepare states. While #1 is the simplest solution, #2 is the proper one as we can convert in tree users, which do not care about ordering, to the dynamic range as well. Add a dynamic range which allows LTTNG to request states in the prepare stage. Reported-and-tested-by: Mathieu Desnoyers <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Reviewed-by: Mathieu Desnoyers <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Sebastian Sewior <[email protected]> Cc: Steven Rostedt <[email protected]> Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1701101353010.3401@nanos Signed-off-by: Thomas Gleixner <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Feb 23, 2017
The lockdep splat below hints at a bug in RCU usage in dm-crypt that was introduced with commit c538f6e ("dm crypt: add ability to use keys from the kernel key retention service"). The kernel keyring function user_key_payload() is in fact a wrapper for rcu_dereference_protected() which must not be called with only rcu_read_lock() section mark. Unfortunately the kernel keyring subsystem doesn't currently provide an interface that allows the use of an RCU read-side section. So for now we must drop RCU in favour of rwsem until a proper function is made available in the kernel keyring subsystem. =============================== [ INFO: suspicious RCU usage. ] 4.10.0-rc5 #2 Not tainted ------------------------------- ./include/keys/user-type.h:53 suspicious rcu_dereference_protected() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 2 locks held by cryptsetup/6464: #0: (&md->type_lock){+.+.+.}, at: [<ffffffffa02472a2>] dm_lock_md_type+0x12/0x20 [dm_mod] #1: (rcu_read_lock){......}, at: [<ffffffffa02822f8>] crypt_set_key+0x1d8/0x4b0 [dm_crypt] stack backtrace: CPU: 1 PID: 6464 Comm: cryptsetup Not tainted 4.10.0-rc5 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014 Call Trace: dump_stack+0x67/0x92 lockdep_rcu_suspicious+0xc5/0x100 crypt_set_key+0x351/0x4b0 [dm_crypt] ? crypt_set_key+0x1d8/0x4b0 [dm_crypt] crypt_ctr+0x341/0xa53 [dm_crypt] dm_table_add_target+0x147/0x330 [dm_mod] table_load+0x111/0x350 [dm_mod] ? retrieve_status+0x1c0/0x1c0 [dm_mod] ctl_ioctl+0x1f5/0x510 [dm_mod] dm_ctl_ioctl+0xe/0x20 [dm_mod] do_vfs_ioctl+0x8e/0x690 ? ____fput+0x9/0x10 ? task_work_run+0x7e/0xa0 ? trace_hardirqs_on_caller+0x122/0x1b0 SyS_ioctl+0x3c/0x70 entry_SYSCALL_64_fastpath+0x18/0xad RIP: 0033:0x7f392c9a4ec7 RSP: 002b:00007ffef6383378 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007ffef63830a0 RCX: 00007f392c9a4ec7 RDX: 000000000124fcc0 RSI: 00000000c138fd09 RDI: 0000000000000005 RBP: 00007ffef6383090 R08: 00000000ffffffff R09: 00000000012482b0 R10: 2a28205d34383336 R11: 0000000000000246 R12: 00007f392d803a08 R13: 00007ffef63831e0 R14: 0000000000000000 R15: 00007f392d803a0b Fixes: c538f6e ("dm crypt: add ability to use keys from the kernel key retention service") Reported-by: Milan Broz <[email protected]> Signed-off-by: Ondrej Kozina <[email protected]> Reviewed-by: Mikulas Patocka <[email protected]> Signed-off-by: Mike Snitzer <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Feb 23, 2017
Syzkaller fuzzer managed to trigger this: BUG: sleeping function called from invalid context at mm/shmem.c:852 in_atomic(): 1, irqs_disabled(): 0, pid: 529, name: khugepaged 3 locks held by khugepaged/529: #0: (shrinker_rwsem){++++..}, at: [<ffffffff818d7ef1>] shrink_slab.part.59+0x121/0xd30 mm/vmscan.c:451 #1: (&type->s_umount_key#29){++++..}, at: [<ffffffff81a63630>] trylock_super+0x20/0x100 fs/super.c:392 #2: (&(&sbinfo->shrinklist_lock)->rlock){+.+.-.}, at: [<ffffffff818fd83e>] spin_lock include/linux/spinlock.h:302 [inline] #2: (&(&sbinfo->shrinklist_lock)->rlock){+.+.-.}, at: [<ffffffff818fd83e>] shmem_unused_huge_shrink+0x28e/0x1490 mm/shmem.c:427 CPU: 2 PID: 529 Comm: khugepaged Not tainted 4.10.0-rc5+ #201 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: shmem_undo_range+0xb20/0x2710 mm/shmem.c:852 shmem_truncate_range+0x27/0xa0 mm/shmem.c:939 shmem_evict_inode+0x35f/0xca0 mm/shmem.c:1030 evict+0x46e/0x980 fs/inode.c:553 iput_final fs/inode.c:1515 [inline] iput+0x589/0xb20 fs/inode.c:1542 shmem_unused_huge_shrink+0xbad/0x1490 mm/shmem.c:446 shmem_unused_huge_scan+0x10c/0x170 mm/shmem.c:512 super_cache_scan+0x376/0x450 fs/super.c:106 do_shrink_slab mm/vmscan.c:378 [inline] shrink_slab.part.59+0x543/0xd30 mm/vmscan.c:481 shrink_slab mm/vmscan.c:2592 [inline] shrink_node+0x2c7/0x870 mm/vmscan.c:2592 shrink_zones mm/vmscan.c:2734 [inline] do_try_to_free_pages+0x369/0xc80 mm/vmscan.c:2776 try_to_free_pages+0x3c6/0x900 mm/vmscan.c:2982 __perform_reclaim mm/page_alloc.c:3301 [inline] __alloc_pages_direct_reclaim mm/page_alloc.c:3322 [inline] __alloc_pages_slowpath+0xa24/0x1c30 mm/page_alloc.c:3683 __alloc_pages_nodemask+0x544/0xae0 mm/page_alloc.c:3848 __alloc_pages include/linux/gfp.h:426 [inline] __alloc_pages_node include/linux/gfp.h:439 [inline] khugepaged_alloc_page+0xc2/0x1b0 mm/khugepaged.c:750 collapse_huge_page+0x182/0x1fe0 mm/khugepaged.c:955 khugepaged_scan_pmd+0xfdf/0x12a0 mm/khugepaged.c:1208 khugepaged_scan_mm_slot mm/khugepaged.c:1727 [inline] khugepaged_do_scan mm/khugepaged.c:1808 [inline] khugepaged+0xe9b/0x1590 mm/khugepaged.c:1853 kthread+0x326/0x3f0 kernel/kthread.c:227 ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430 The iput() from atomic context was a bad idea: if after igrab() somebody else calls iput() and we left with the last inode reference, our iput() would lead to inode eviction and therefore sleeping. This patch should fix the situation. Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Kirill A. Shutemov <[email protected]> Reported-by: Dmitry Vyukov <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Feb 23, 2017
This fixes a race condition that may occur whenever ST micro busy end interrupt is raised just after being unmasked but before leaving mmci interrupt context. A dead-lock has been found if connecting mmci ST Micro variant whose amba id is 0x10480180 to some new eMMC that supports internal caches. Whenever mmci driver enables cache control by programming eMMC's EXT_CSD register, block driver may request to flush the eMMC internal caches causing mmci driver to send a MMC_SWITCH command to the card with FLUSH_CACHE operation. And because busy end interrupt may be mistakenly cleared while not yet processed, this mmc request may never complete. As a result, mmcqd task may be stuck forever. Here is an instance caught by lockup detector which shows that mmcqd task was hung while waiting for mmc_flush_cache command to complete: .. [ 240.251595] INFO: task mmcqd/1:52 blocked for more than 120 seconds. [ 240.257973] Not tainted 4.1.13-00510-g9d91424 #2 [ 240.263109] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 240.270955] mmcqd/1 D c047504c 0 52 2 0x00000000 [ 240.277359] [<c047504c>] (__schedule) from [<c04754a0>] (schedule+0x40/0x98) [ 240.284418] [<c04754a0>] (schedule) from [<c0477d40>] (schedule_timeout+0x148/0x188) [ 240.292191] [<c0477d40>] (schedule_timeout) from [<c0476040>] (wait_for_common+0xa4/0x170) [ 240.300491] [<c0476040>] (wait_for_common) from [<c02efc1c>] (mmc_wait_for_req_done+0x4c/0x13c) [ 240.309224] [<c02efc1c>] (mmc_wait_for_req_done) from [<c02efd90>] (mmc_wait_for_cmd+0x64/0x84) [ 240.317953] [<c02efd90>] (mmc_wait_for_cmd) from [<c02f5b14>] (__mmc_switch+0xa4/0x2a8) [ 240.325964] [<c02f5b14>] (__mmc_switch) from [<c02f5d40>] (mmc_switch+0x28/0x30) [ 240.333389] [<c02f5d40>] (mmc_switch) from [<c02f0984>] (mmc_flush_cache+0x54/0x80) [ 240.341073] [<c02f0984>] (mmc_flush_cache) from [<c02ff0c4>] (mmc_blk_issue_rq+0x114/0x4e8) [ 240.349459] [<c02ff0c4>] (mmc_blk_issue_rq) from [<c03008d4>] (mmc_queue_thread+0xc0/0x180) [ 240.357844] [<c03008d4>] (mmc_queue_thread) from [<c003cf90>] (kthread+0xdc/0xf4) [ 240.365339] [<c003cf90>] (kthread) from [<c0010068>] (ret_from_fork+0x14/0x2c) .. .. [ 240.664311] INFO: task partprobe:564 blocked for more than 120 seconds. [ 240.670943] Not tainted 4.1.13-00510-g9d91424 #2 [ 240.676078] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 240.683922] partprobe D c047504c 0 564 486 0x00000000 [ 240.690318] [<c047504c>] (__schedule) from [<c04754a0>] (schedule+0x40/0x98) [ 240.697396] [<c04754a0>] (schedule) from [<c0477d40>] (schedule_timeout+0x148/0x188) [ 240.705149] [<c0477d40>] (schedule_timeout) from [<c0476040>] (wait_for_common+0xa4/0x170) [ 240.713446] [<c0476040>] (wait_for_common) from [<c01f3300>] (submit_bio_wait+0x58/0x64) [ 240.721571] [<c01f3300>] (submit_bio_wait) from [<c01fbbd8>] (blkdev_issue_flush+0x60/0x88) [ 240.729957] [<c01fbbd8>] (blkdev_issue_flush) from [<c010ff84>] (blkdev_fsync+0x34/0x44) [ 240.738083] [<c010ff84>] (blkdev_fsync) from [<c0109594>] (do_fsync+0x3c/0x64) [ 240.745319] [<c0109594>] (do_fsync) from [<c000ffc0>] (ret_fast_syscall+0x0/0x3c) .. Here is the detailed sequence showing when this issue may happen: 1) At probe time, mmci device is initialized and card busy detection based on DAT[0] monitoring is enabled. 2) Later during run time, since card reported to support internal caches, a MMCI_SWITCH command is sent to eMMC device with FLUSH_CACHE operation. On receiving this command, eMMC may enter busy state (for a relatively short time in the case of the dead-lock). 3) Then mmci interrupt is raised and mmci_irq() is called: MMCISTATUS register is read and is equal to 0x01000440. So the following status bits are set: - MCI_CMDRESPEND (= 6) - MCI_DATABLOCKEND (= 10) - MCI_ST_CARDBUSY (= 24) Since MMCIMASK0 register is 0x3FF, status variable is set to 0x00000040 and BIT MCI_CMDRESPEND is cleared by writing MMCICLEAR register. Then mmci_cmd_irq() is called. Considering the following conditions: - host->busy_status is 0, - this is a "busy response", - reading again MMCISTATUS register gives 0x1000400, MMCIMASK0 is updated to unmask MCI_ST_BUSYEND bit. Thus, MMCIMASK0 is set to 0x010003FF and host->busy_status is set to wait for busy end completion. Back again in status loop of mmci_irq(), we quickly go through mmci_data_irq() as there are no data in that case. And we finally go through following test at the end of while(status) loop: /* * Don't poll for busy completion in irq context. */ if (host->variant->busy_detect && host->busy_status) status &= ~host->variant->busy_detect_flag; Because status variable is not yet null (is equal to 0x40), we do not leave interrupt context yet but we loop again into while(status) loop. So we run across following steps: a) MMCISTATUS register is read again and this time is equal to 0x01000400. So that following bits are set: - MCI_DATABLOCKEND (= 10) - MCI_ST_CARDBUSY (= 24) Since MMCIMASK0 register is equal to 0x010003FF: b) status variable is set to 0x01000000. c) MCI_ST_CARDBUSY bit is cleared by writing MMCICLEAR register. Then, mmci_cmd_irq() is called one more time. Since host->busy_status is set and that MCI_ST_CARDBUSY is set in status variable, we just return from this function. Back again in mmci_irq(), status variable is set to 0 and we finally leave the while(status) loop. As a result we leave interrupt context, waiting for busy end interrupt event. Now, consider that busy end completion is raised IN BETWEEN steps 3.a) and 3.c). In such a case, we may mistakenly clear busy end interrupt at step 3.c) while it has not yet been processed. This will result in mmc command to wait forever for a busy end completion that will never happen. To fix the problem, this patch implements the following changes: Considering that the mmci seems to be triggering the IRQ on both edges while monitoring DAT0 for busy completion and that same status bit is used to monitor start and end of busy detection, special care must be taken to make sure that both start and end interrupts are always cleared one after the other. 1) Clearing of card busy bit is moved in mmc_cmd_irq() function where unmasking of busy end bit is effectively handled. 2) Just before unmasking busy end event, busy start event is cleared by writing card busy bit in MMCICLEAR register. 3) Finally, once we are no more busy with a command, busy end event is cleared writing again card busy bit in MMCICLEAR register. This patch has been tested with the ST Accordo5 machine, not yet supported upstream but relies on the mmci driver. Signed-off-by: Sarang Mairal <[email protected]> Signed-off-by: Jean-Nicolas Graux <[email protected]> Reviewed-by: Linus Walleij <[email protected]> Tested-by: Ulf Hansson <[email protected]> Signed-off-by: Ulf Hansson <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Feb 23, 2017
…failed send Dan Carpenter kindly reported: <quote> The patch d27a7cb: "zfcp: trace on request for open and close of WKA port" from Aug 10, 2016, leads to the following static checker warning: drivers/s390/scsi/zfcp_fsf.c:1615 zfcp_fsf_open_wka_port() warn: 'req' was already freed. drivers/s390/scsi/zfcp_fsf.c 1609 zfcp_fsf_start_timer(req, ZFCP_FSF_REQUEST_TIMEOUT); 1610 retval = zfcp_fsf_req_send(req); 1611 if (retval) 1612 zfcp_fsf_req_free(req); ^^^ Freed. 1613 out: 1614 spin_unlock_irq(&qdio->req_q_lock); 1615 if (req && !IS_ERR(req)) 1616 zfcp_dbf_rec_run_wka("fsowp_1", wka_port, req->req_id); ^^^^^^^^^^^ Use after free. 1617 return retval; 1618 } Same thing for zfcp_fsf_close_wka_port() as well. </quote> Rather than relying on req being NULL (or ERR_PTR) for all cases where we don't want to trace or should not trace, simply check retval which is unconditionally initialized with -EIO != 0 and it can only become 0 on successful retval = zfcp_fsf_req_send(req). With that we can also remove the then again unnecessary unconditional initialization of req which was introduced with that earlier commit. Reported-by: Dan Carpenter <[email protected]> Suggested-by: Benjamin Block <[email protected]> Signed-off-by: Steffen Maier <[email protected]> Fixes: d27a7cb ("zfcp: trace on request for open and close of WKA port") Cc: <[email protected]> #2.6.38+ Reviewed-by: Benjamin Block <[email protected]> Reviewed-by: Jens Remus <[email protected]> Signed-off-by: Martin K. Petersen <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Feb 23, 2017
Olof reported that on a machine which has a BIOS wreckaged TSC the timestamps in dmesg are making a large jump because the TSC value is jumping forward after resetting the TSC ADJUST register to a sane value. This can be avoided by calling the TSC ADJUST saniziting function before initializing the per cpu sched clock machinery. That takes the offset into account and avoid the time jump. What cannot be avoided is that the 'Firmware Bug' warnings on the secondary CPUs are printed with the large time offsets because it would be too much effort and ugly hackery to print those warnings into a buffer and emit them after the adjustemt on the starting CPUs. It's a firmware bug and should be fixed in firmware. The weird timestamps are collateral damage and just illustrate the sillyness of the BIOS folks: [ 0.397445] smp: Bringing up secondary CPUs ... [ 0.402100] x86: Booting SMP configuration: [ 0.406343] .... node #0, CPUs: #1 [1265776479.930667] [Firmware Bug]: TSC ADJUST differs: Reference CPU0: -2978888639075328 CPU1: -2978888639183101 [1265776479.944664] TSC ADJUST synchronize: Reference CPU0: 0 CPU1: -2978888639183101 [ 0.508119] #2 [1265776480.032346] [Firmware Bug]: TSC ADJUST differs: Reference CPU0: -2978888639075328 CPU2: -2978888639183677 [1265776480.044192] TSC ADJUST synchronize: Reference CPU0: 0 CPU2: -2978888639183677 [ 0.607643] #3 [1265776480.131874] [Firmware Bug]: TSC ADJUST differs: Reference CPU0: -2978888639075328 CPU3: -2978888639184530 [1265776480.143720] TSC ADJUST synchronize: Reference CPU0: 0 CPU3: -2978888639184530 [ 0.707108] smp: Brought up 1 node, 4 CPUs [ 0.711271] smpboot: Total of 4 processors activated (21698.88 BogoMIPS) Reported-by: Olof Johansson <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Feb 23, 2017
We cannot do printk() from tk_debug_account_sleep_time(), because tk_debug_account_sleep_time() is called under tk_core seq lock. The reason why printk() is unsafe there is that console_sem may invoke scheduler (up()->wake_up_process()->activate_task()), which, in turn, can return back to timekeeping code, for instance, via get_time()->ktime_get(), deadlocking the system on tk_core seq lock. [ 48.950592] ====================================================== [ 48.950622] [ INFO: possible circular locking dependency detected ] [ 48.950622] 4.10.0-rc7-next-20170213+ #101 Not tainted [ 48.950622] ------------------------------------------------------- [ 48.950622] kworker/0:0/3 is trying to acquire lock: [ 48.950653] (tk_core){----..}, at: [<c01cc624>] retrigger_next_event+0x4c/0x90 [ 48.950683] but task is already holding lock: [ 48.950683] (hrtimer_bases.lock){-.-...}, at: [<c01cc610>] retrigger_next_event+0x38/0x90 [ 48.950714] which lock already depends on the new lock. [ 48.950714] the existing dependency chain (in reverse order) is: [ 48.950714] -> #5 (hrtimer_bases.lock){-.-...}: [ 48.950744] _raw_spin_lock_irqsave+0x50/0x64 [ 48.950775] lock_hrtimer_base+0x28/0x58 [ 48.950775] hrtimer_start_range_ns+0x20/0x5c8 [ 48.950775] __enqueue_rt_entity+0x320/0x360 [ 48.950805] enqueue_rt_entity+0x2c/0x44 [ 48.950805] enqueue_task_rt+0x24/0x94 [ 48.950836] ttwu_do_activate+0x54/0xc0 [ 48.950836] try_to_wake_up+0x248/0x5c8 [ 48.950836] __setup_irq+0x420/0x5f0 [ 48.950836] request_threaded_irq+0xdc/0x184 [ 48.950866] devm_request_threaded_irq+0x58/0xa4 [ 48.950866] omap_i2c_probe+0x530/0x6a0 [ 48.950897] platform_drv_probe+0x50/0xb0 [ 48.950897] driver_probe_device+0x1f8/0x2cc [ 48.950897] __driver_attach+0xc0/0xc4 [ 48.950927] bus_for_each_dev+0x6c/0xa0 [ 48.950927] bus_add_driver+0x100/0x210 [ 48.950927] driver_register+0x78/0xf4 [ 48.950958] do_one_initcall+0x3c/0x16c [ 48.950958] kernel_init_freeable+0x20c/0x2d8 [ 48.950958] kernel_init+0x8/0x110 [ 48.950988] ret_from_fork+0x14/0x24 [ 48.950988] -> #4 (&rt_b->rt_runtime_lock){-.-...}: [ 48.951019] _raw_spin_lock+0x40/0x50 [ 48.951019] rq_offline_rt+0x9c/0x2bc [ 48.951019] set_rq_offline.part.2+0x2c/0x58 [ 48.951049] rq_attach_root+0x134/0x144 [ 48.951049] cpu_attach_domain+0x18c/0x6f4 [ 48.951049] build_sched_domains+0xba4/0xd80 [ 48.951080] sched_init_smp+0x68/0x10c [ 48.951080] kernel_init_freeable+0x160/0x2d8 [ 48.951080] kernel_init+0x8/0x110 [ 48.951080] ret_from_fork+0x14/0x24 [ 48.951110] -> #3 (&rq->lock){-.-.-.}: [ 48.951110] _raw_spin_lock+0x40/0x50 [ 48.951141] task_fork_fair+0x30/0x124 [ 48.951141] sched_fork+0x194/0x2e0 [ 48.951141] copy_process.part.5+0x448/0x1a20 [ 48.951171] _do_fork+0x98/0x7e8 [ 48.951171] kernel_thread+0x2c/0x34 [ 48.951171] rest_init+0x1c/0x18c [ 48.951202] start_kernel+0x35c/0x3d4 [ 48.951202] 0x8000807c [ 48.951202] -> #2 (&p->pi_lock){-.-.-.}: [ 48.951232] _raw_spin_lock_irqsave+0x50/0x64 [ 48.951232] try_to_wake_up+0x30/0x5c8 [ 48.951232] up+0x4c/0x60 [ 48.951263] __up_console_sem+0x2c/0x58 [ 48.951263] console_unlock+0x3b4/0x650 [ 48.951263] vprintk_emit+0x270/0x474 [ 48.951293] vprintk_default+0x20/0x28 [ 48.951293] printk+0x20/0x30 [ 48.951324] kauditd_hold_skb+0x94/0xb8 [ 48.951324] kauditd_thread+0x1a4/0x56c [ 48.951324] kthread+0x104/0x148 [ 48.951354] ret_from_fork+0x14/0x24 [ 48.951354] -> #1 ((console_sem).lock){-.....}: [ 48.951385] _raw_spin_lock_irqsave+0x50/0x64 [ 48.951385] down_trylock+0xc/0x2c [ 48.951385] __down_trylock_console_sem+0x24/0x80 [ 48.951385] console_trylock+0x10/0x8c [ 48.951416] vprintk_emit+0x264/0x474 [ 48.951416] vprintk_default+0x20/0x28 [ 48.951416] printk+0x20/0x30 [ 48.951446] tk_debug_account_sleep_time+0x5c/0x70 [ 48.951446] __timekeeping_inject_sleeptime.constprop.3+0x170/0x1a0 [ 48.951446] timekeeping_resume+0x218/0x23c [ 48.951477] syscore_resume+0x94/0x42c [ 48.951477] suspend_enter+0x554/0x9b4 [ 48.951477] suspend_devices_and_enter+0xd8/0x4b4 [ 48.951507] enter_state+0x934/0xbd4 [ 48.951507] pm_suspend+0x14/0x70 [ 48.951507] state_store+0x68/0xc8 [ 48.951538] kernfs_fop_write+0xf4/0x1f8 [ 48.951538] __vfs_write+0x1c/0x114 [ 48.951538] vfs_write+0xa0/0x168 [ 48.951568] SyS_write+0x3c/0x90 [ 48.951568] __sys_trace_return+0x0/0x10 [ 48.951568] -> #0 (tk_core){----..}: [ 48.951599] lock_acquire+0xe0/0x294 [ 48.951599] ktime_get_update_offsets_now+0x5c/0x1d4 [ 48.951629] retrigger_next_event+0x4c/0x90 [ 48.951629] on_each_cpu+0x40/0x7c [ 48.951629] clock_was_set_work+0x14/0x20 [ 48.951660] process_one_work+0x2b4/0x808 [ 48.951660] worker_thread+0x3c/0x550 [ 48.951660] kthread+0x104/0x148 [ 48.951690] ret_from_fork+0x14/0x24 [ 48.951690] other info that might help us debug this: [ 48.951690] Chain exists of: tk_core --> &rt_b->rt_runtime_lock --> hrtimer_bases.lock [ 48.951721] Possible unsafe locking scenario: [ 48.951721] CPU0 CPU1 [ 48.951721] ---- ---- [ 48.951721] lock(hrtimer_bases.lock); [ 48.951751] lock(&rt_b->rt_runtime_lock); [ 48.951751] lock(hrtimer_bases.lock); [ 48.951751] lock(tk_core); [ 48.951782] *** DEADLOCK *** [ 48.951782] 3 locks held by kworker/0:0/3: [ 48.951782] #0: ("events"){.+.+.+}, at: [<c0156590>] process_one_work+0x1f8/0x808 [ 48.951812] #1: (hrtimer_work){+.+...}, at: [<c0156590>] process_one_work+0x1f8/0x808 [ 48.951843] #2: (hrtimer_bases.lock){-.-...}, at: [<c01cc610>] retrigger_next_event+0x38/0x90 [ 48.951843] stack backtrace: [ 48.951873] CPU: 0 PID: 3 Comm: kworker/0:0 Not tainted 4.10.0-rc7-next-20170213+ [ 48.951904] Workqueue: events clock_was_set_work [ 48.951904] [<c0110208>] (unwind_backtrace) from [<c010c224>] (show_stack+0x10/0x14) [ 48.951934] [<c010c224>] (show_stack) from [<c04ca6c0>] (dump_stack+0xac/0xe0) [ 48.951934] [<c04ca6c0>] (dump_stack) from [<c019b5cc>] (print_circular_bug+0x1d0/0x308) [ 48.951965] [<c019b5cc>] (print_circular_bug) from [<c019d2a8>] (validate_chain+0xf50/0x1324) [ 48.951965] [<c019d2a8>] (validate_chain) from [<c019ec18>] (__lock_acquire+0x468/0x7e8) [ 48.951995] [<c019ec18>] (__lock_acquire) from [<c019f634>] (lock_acquire+0xe0/0x294) [ 48.951995] [<c019f634>] (lock_acquire) from [<c01d0ea0>] (ktime_get_update_offsets_now+0x5c/0x1d4) [ 48.952026] [<c01d0ea0>] (ktime_get_update_offsets_now) from [<c01cc624>] (retrigger_next_event+0x4c/0x90) [ 48.952026] [<c01cc624>] (retrigger_next_event) from [<c01e4e24>] (on_each_cpu+0x40/0x7c) [ 48.952056] [<c01e4e24>] (on_each_cpu) from [<c01cafc4>] (clock_was_set_work+0x14/0x20) [ 48.952056] [<c01cafc4>] (clock_was_set_work) from [<c015664c>] (process_one_work+0x2b4/0x808) [ 48.952087] [<c015664c>] (process_one_work) from [<c0157774>] (worker_thread+0x3c/0x550) [ 48.952087] [<c0157774>] (worker_thread) from [<c015d644>] (kthread+0x104/0x148) [ 48.952087] [<c015d644>] (kthread) from [<c0107830>] (ret_from_fork+0x14/0x24) Replace printk() with printk_deferred(), which does not call into the scheduler. Fixes: 0bf43f1 ("timekeeping: Prints the amounts of time spent during suspend") Reported-and-tested-by: Tony Lindgren <[email protected]> Signed-off-by: Sergey Senozhatsky <[email protected]> Cc: Petr Mladek <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: "Rafael J . Wysocki" <[email protected]> Cc: Steven Rostedt <[email protected]> Cc: John Stultz <[email protected]> Cc: "[4.9+]" <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Feb 23, 2017
Commit 6664498 ("packet: call fanout_release, while UNREGISTERING a netdev"), unfortunately, introduced the following issues. 1. calling mutex_lock(&fanout_mutex) (fanout_release()) from inside rcu_read-side critical section. rcu_read_lock disables preemption, most often, which prohibits calling sleeping functions. [ ] include/linux/rcupdate.h:560 Illegal context switch in RCU read-side critical section! [ ] [ ] rcu_scheduler_active = 1, debug_locks = 0 [ ] 4 locks held by ovs-vswitchd/1969: [ ] #0: (cb_lock){++++++}, at: [<ffffffff8158a6c9>] genl_rcv+0x19/0x40 [ ] #1: (ovs_mutex){+.+.+.}, at: [<ffffffffa04878ca>] ovs_vport_cmd_del+0x4a/0x100 [openvswitch] [ ] #2: (rtnl_mutex){+.+.+.}, at: [<ffffffff81564157>] rtnl_lock+0x17/0x20 [ ] #3: (rcu_read_lock){......}, at: [<ffffffff81614165>] packet_notifier+0x5/0x3f0 [ ] [ ] Call Trace: [ ] [<ffffffff813770c1>] dump_stack+0x85/0xc4 [ ] [<ffffffff810c9077>] lockdep_rcu_suspicious+0x107/0x110 [ ] [<ffffffff810a2da7>] ___might_sleep+0x57/0x210 [ ] [<ffffffff810a2fd0>] __might_sleep+0x70/0x90 [ ] [<ffffffff8162e80c>] mutex_lock_nested+0x3c/0x3a0 [ ] [<ffffffff810de93f>] ? vprintk_default+0x1f/0x30 [ ] [<ffffffff81186e88>] ? printk+0x4d/0x4f [ ] [<ffffffff816106dd>] fanout_release+0x1d/0xe0 [ ] [<ffffffff81614459>] packet_notifier+0x2f9/0x3f0 2. calling mutex_lock(&fanout_mutex) inside spin_lock(&po->bind_lock). "sleeping function called from invalid context" [ ] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620 [ ] in_atomic(): 1, irqs_disabled(): 0, pid: 1969, name: ovs-vswitchd [ ] INFO: lockdep is turned off. [ ] Call Trace: [ ] [<ffffffff813770c1>] dump_stack+0x85/0xc4 [ ] [<ffffffff810a2f52>] ___might_sleep+0x202/0x210 [ ] [<ffffffff810a2fd0>] __might_sleep+0x70/0x90 [ ] [<ffffffff8162e80c>] mutex_lock_nested+0x3c/0x3a0 [ ] [<ffffffff816106dd>] fanout_release+0x1d/0xe0 [ ] [<ffffffff81614459>] packet_notifier+0x2f9/0x3f0 3. calling dev_remove_pack(&fanout->prot_hook), from inside spin_lock(&po->bind_lock) or rcu_read-side critical-section. dev_remove_pack() -> synchronize_net(), which might sleep. [ ] BUG: scheduling while atomic: ovs-vswitchd/1969/0x00000002 [ ] INFO: lockdep is turned off. [ ] Call Trace: [ ] [<ffffffff813770c1>] dump_stack+0x85/0xc4 [ ] [<ffffffff81186274>] __schedule_bug+0x64/0x73 [ ] [<ffffffff8162b8cb>] __schedule+0x6b/0xd10 [ ] [<ffffffff8162c5db>] schedule+0x6b/0x80 [ ] [<ffffffff81630b1d>] schedule_timeout+0x38d/0x410 [ ] [<ffffffff810ea3fd>] synchronize_sched_expedited+0x53d/0x810 [ ] [<ffffffff810ea6de>] synchronize_rcu_expedited+0xe/0x10 [ ] [<ffffffff8154eab5>] synchronize_net+0x35/0x50 [ ] [<ffffffff8154eae3>] dev_remove_pack+0x13/0x20 [ ] [<ffffffff8161077e>] fanout_release+0xbe/0xe0 [ ] [<ffffffff81614459>] packet_notifier+0x2f9/0x3f0 4. fanout_release() races with calls from different CPU. To fix the above problems, remove the call to fanout_release() under rcu_read_lock(). Instead, call __dev_remove_pack(&fanout->prot_hook) and netdev_run_todo will be happy that &dev->ptype_specific list is empty. In order to achieve this, I moved dev_{add,remove}_pack() out of fanout_{add,release} to __fanout_{link,unlink}. So, call to {,__}unregister_prot_hook() will make sure fanout->prot_hook is removed as well. Fixes: 6664498 ("packet: call fanout_release, while UNREGISTERING a netdev") Reported-by: Eric Dumazet <[email protected]> Signed-off-by: Anoob Soman <[email protected]> Acked-by: Eric Dumazet <[email protected]> Signed-off-by: David S. Miller <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
May 17, 2017
This is a story about 4 distinct (and very old) btrfs bugs. Commit c8b9781 ("Btrfs: Add zlib compression support") added three data corruption bugs for inline extents (bugs #1-3). Commit 93c82d5 ("Btrfs: zero page past end of inline file items") fixed bug #1: uncompressed inline extents followed by a hole and more extents could get non-zero data in the hole as they were read. The fix was to add a memset in btrfs_get_extent to zero out the hole. Commit 166ae5a ("btrfs: fix inline compressed read err corruption") fixed bug #2: compressed inline extents which contained non-zero bytes might be replaced with zero bytes in some cases. This patch removed an unhelpful memset from uncompress_inline, but the case where memset is required was missed. There is also a memset in the decompression code, but this only covers decompressed data that is shorter than the ram_bytes from the extent ref record. This memset doesn't cover the region between the end of the decompressed data and the end of the page. It has also moved around a few times over the years, so there's no single patch to refer to. This patch fixes bug #3: compressed inline extents followed by a hole and more extents could get non-zero data in the hole as they were read (i.e. bug #3 is the same as bug #1, but s/uncompressed/compressed/). The fix is the same: zero out the hole in the compressed case too, by putting a memset back in uncompress_inline, but this time with correct parameters. The last and oldest bug, bug #0, is the cause of the offending inline extent/hole/extent pattern. Bug #0 is a subtle and mostly-harmless quirk of behavior somewhere in the btrfs write code. In a few special cases, an inline extent and hole are allowed to persist where they normally would be combined with later extents in the file. A fast reproducer for bug #0 is presented below. A few offending extents are also created in the wild during large rsync transfers with the -S flag. A Linux kernel build (git checkout; make allyesconfig; make -j8) will produce a handful of offending files as well. Once an offending file is created, it can present different content to userspace each time it is read. Bug #0 is at least 4 and possibly 8 years old. I verified every vX.Y kernel back to v3.5 has this behavior. There are fossil records of this bug's effects in commits all the way back to v2.6.32. I have no reason to believe bug #0 wasn't present at the beginning of btrfs compression support in v2.6.29, but I can't easily test kernels that old to be sure. It is not clear whether bug #0 is worth fixing. A fix would likely require injecting extra reads into currently write-only paths, and most of the exceptional cases caused by bug #0 are already handled now. Whether we like them or not, bug #0's inline extents followed by holes are part of the btrfs de-facto disk format now, and we need to be able to read them without data corruption or an infoleak. So enough about bug #0, let's get back to bug #3 (this patch). An example of on-disk structure leading to data corruption found in the wild: item 61 key (606890 INODE_ITEM 0) itemoff 9662 itemsize 160 inode generation 50 transid 50 size 47424 nbytes 49141 block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0 flags 0x0(none) item 62 key (606890 INODE_REF 603050) itemoff 9642 itemsize 20 inode ref index 3 namelen 10 name: DB_File.so item 63 key (606890 EXTENT_DATA 0) itemoff 8280 itemsize 1362 inline extent data size 1341 ram 4085 compress(zlib) item 64 key (606890 EXTENT_DATA 4096) itemoff 8227 itemsize 53 extent data disk byte 5367308288 nr 20480 extent data offset 0 nr 45056 ram 45056 extent compression(zlib) Different data appears in userspace during each read of the 11 bytes between 4085 and 4096. The extent in item 63 is not long enough to fill the first page of the file, so a memset is required to fill the space between item 63 (ending at 4085) and item 64 (beginning at 4096) with zero. Here is a reproducer from Liu Bo, which demonstrates another method of creating the same inline extent and hole pattern: Using 'page_poison=on' kernel command line (or enable CONFIG_PAGE_POISONING) run the following: # touch foo # chattr +c foo # xfs_io -f -c "pwrite -W 0 1000" foo # xfs_io -f -c "falloc 4 8188" foo # od -x foo # echo 3 >/proc/sys/vm/drop_caches # od -x foo This produce the following on my box: Correct output: file contains 1000 data bytes followed by zeros: 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd * 0001740 cdcd cdcd cdcd cdcd 0000 0000 0000 0000 0001760 0000 0000 0000 0000 0000 0000 0000 0000 * 0020000 Actual output: the data after the first 1000 bytes will be different each run: 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd * 0001740 cdcd cdcd cdcd cdcd 6c63 7400 635f 006d 0001760 5f74 6f43 7400 435f 0053 5f74 7363 7400 0002000 435f 0056 5f74 6164 7400 645f 0062 5f74 (...) Signed-off-by: Zygo Blaxell <[email protected]> Reviewed-by: Liu Bo <[email protected]> Reviewed-by: Chris Mason <[email protected]> Signed-off-by: Chris Mason <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
May 17, 2017
The rcu_barrier() takes the cpu_hotplug mutex which itself is not reclaim-safe, and so rcu_barrier() is illegal from inside the shrinker. [ 309.661373] ========================================================= [ 309.661376] [ INFO: possible irq lock inversion dependency detected ] [ 309.661380] 4.11.0-rc1-CI-CI_DRM_2333+ #1 Tainted: G W [ 309.661383] --------------------------------------------------------- [ 309.661386] gem_exec_gttfil/6435 just changed the state of lock: [ 309.661389] (rcu_preempt_state.barrier_mutex){+.+.-.}, at: [<ffffffff81100731>] _rcu_barrier+0x31/0x160 [ 309.661399] but this lock took another, RECLAIM_FS-unsafe lock in the past: [ 309.661402] (cpu_hotplug.lock){+.+.+.} [ 309.661404] and interrupts could create inverse lock ordering between them. [ 309.661410] other info that might help us debug this: [ 309.661414] Possible interrupt unsafe locking scenario: [ 309.661417] CPU0 CPU1 [ 309.661419] ---- ---- [ 309.661421] lock(cpu_hotplug.lock); [ 309.661425] local_irq_disable(); [ 309.661432] lock(rcu_preempt_state.barrier_mutex); [ 309.661441] lock(cpu_hotplug.lock); [ 309.661446] <Interrupt> [ 309.661448] lock(rcu_preempt_state.barrier_mutex); [ 309.661453] *** DEADLOCK *** [ 309.661460] 4 locks held by gem_exec_gttfil/6435: [ 309.661464] #0: (sb_writers#10){.+.+.+}, at: [<ffffffff8120d83d>] vfs_write+0x17d/0x1f0 [ 309.661475] #1: (debugfs_srcu){......}, at: [<ffffffff81320491>] debugfs_use_file_start+0x41/0xa0 [ 309.661486] #2: (&attr->mutex){+.+.+.}, at: [<ffffffff8123a3e7>] simple_attr_write+0x37/0xe0 [ 309.661495] #3: (&dev->struct_mutex){+.+.+.}, at: [<ffffffffa0091b4a>] i915_drop_caches_set+0x3a/0x150 [i915] [ 309.661540] the shortest dependencies between 2nd lock and 1st lock: [ 309.661547] -> (cpu_hotplug.lock){+.+.+.} ops: 829 { [ 309.661553] HARDIRQ-ON-W at: [ 309.661560] __lock_acquire+0x5e5/0x1b50 [ 309.661565] lock_acquire+0xc9/0x220 [ 309.661572] __mutex_lock+0x6e/0x990 [ 309.661576] mutex_lock_nested+0x16/0x20 [ 309.661583] get_online_cpus+0x61/0x80 [ 309.661590] kmem_cache_create+0x25/0x1d0 [ 309.661596] debug_objects_mem_init+0x30/0x249 [ 309.661602] start_kernel+0x341/0x3fe [ 309.661607] x86_64_start_reservations+0x2a/0x2c [ 309.661612] x86_64_start_kernel+0x173/0x186 [ 309.661619] verify_cpu+0x0/0xfc [ 309.661622] SOFTIRQ-ON-W at: [ 309.661627] __lock_acquire+0x611/0x1b50 [ 309.661632] lock_acquire+0xc9/0x220 [ 309.661636] __mutex_lock+0x6e/0x990 [ 309.661641] mutex_lock_nested+0x16/0x20 [ 309.661646] get_online_cpus+0x61/0x80 [ 309.661650] kmem_cache_create+0x25/0x1d0 [ 309.661655] debug_objects_mem_init+0x30/0x249 [ 309.661660] start_kernel+0x341/0x3fe [ 309.661664] x86_64_start_reservations+0x2a/0x2c [ 309.661669] x86_64_start_kernel+0x173/0x186 [ 309.661674] verify_cpu+0x0/0xfc [ 309.661677] RECLAIM_FS-ON-W at: [ 309.661682] mark_held_locks+0x6f/0xa0 [ 309.661687] lockdep_trace_alloc+0xb3/0x100 [ 309.661693] kmem_cache_alloc_trace+0x31/0x2e0 [ 309.661699] __smpboot_create_thread.part.1+0x27/0xe0 [ 309.661704] smpboot_create_threads+0x61/0x90 [ 309.661709] cpuhp_invoke_callback+0x9c/0x8a0 [ 309.661713] cpuhp_up_callbacks+0x31/0xb0 [ 309.661718] _cpu_up+0x7a/0xc0 [ 309.661723] do_cpu_up+0x5f/0x80 [ 309.661727] cpu_up+0xe/0x10 [ 309.661734] smp_init+0x71/0xb3 [ 309.661738] kernel_init_freeable+0x94/0x19e [ 309.661743] kernel_init+0x9/0xf0 [ 309.661748] ret_from_fork+0x2e/0x40 [ 309.661752] INITIAL USE at: [ 309.661757] __lock_acquire+0x234/0x1b50 [ 309.661761] lock_acquire+0xc9/0x220 [ 309.661766] __mutex_lock+0x6e/0x990 [ 309.661771] mutex_lock_nested+0x16/0x20 [ 309.661775] get_online_cpus+0x61/0x80 [ 309.661780] __cpuhp_setup_state+0x44/0x170 [ 309.661785] page_alloc_init+0x23/0x3a [ 309.661790] start_kernel+0x124/0x3fe [ 309.661794] x86_64_start_reservations+0x2a/0x2c [ 309.661799] x86_64_start_kernel+0x173/0x186 [ 309.661804] verify_cpu+0x0/0xfc [ 309.661807] } [ 309.661813] ... key at: [<ffffffff81e37690>] cpu_hotplug+0xb0/0x100 [ 309.661817] ... acquired at: [ 309.661821] lock_acquire+0xc9/0x220 [ 309.661825] __mutex_lock+0x6e/0x990 [ 309.661829] mutex_lock_nested+0x16/0x20 [ 309.661833] get_online_cpus+0x61/0x80 [ 309.661837] _rcu_barrier+0x9f/0x160 [ 309.661841] rcu_barrier+0x10/0x20 [ 309.661847] netdev_run_todo+0x5f/0x310 [ 309.661852] rtnl_unlock+0x9/0x10 [ 309.661856] default_device_exit_batch+0x133/0x150 [ 309.661862] ops_exit_list.isra.0+0x4d/0x60 [ 309.661866] cleanup_net+0x1d8/0x2c0 [ 309.661872] process_one_work+0x1f4/0x6d0 [ 309.661876] worker_thread+0x49/0x4a0 [ 309.661881] kthread+0x107/0x140 [ 309.661884] ret_from_fork+0x2e/0x40 [ 309.661890] -> (rcu_preempt_state.barrier_mutex){+.+.-.} ops: 179 { [ 309.661896] HARDIRQ-ON-W at: [ 309.661901] __lock_acquire+0x5e5/0x1b50 [ 309.661905] lock_acquire+0xc9/0x220 [ 309.661910] __mutex_lock+0x6e/0x990 [ 309.661914] mutex_lock_nested+0x16/0x20 [ 309.661919] _rcu_barrier+0x31/0x160 [ 309.661923] rcu_barrier+0x10/0x20 [ 309.661928] netdev_run_todo+0x5f/0x310 [ 309.661932] rtnl_unlock+0x9/0x10 [ 309.661936] default_device_exit_batch+0x133/0x150 [ 309.661941] ops_exit_list.isra.0+0x4d/0x60 [ 309.661946] cleanup_net+0x1d8/0x2c0 [ 309.661951] process_one_work+0x1f4/0x6d0 [ 309.661955] worker_thread+0x49/0x4a0 [ 309.661960] kthread+0x107/0x140 [ 309.661964] ret_from_fork+0x2e/0x40 [ 309.661968] SOFTIRQ-ON-W at: [ 309.661972] __lock_acquire+0x611/0x1b50 [ 309.661977] lock_acquire+0xc9/0x220 [ 309.661981] __mutex_lock+0x6e/0x990 [ 309.661986] mutex_lock_nested+0x16/0x20 [ 309.661990] _rcu_barrier+0x31/0x160 [ 309.661995] rcu_barrier+0x10/0x20 [ 309.661999] netdev_run_todo+0x5f/0x310 [ 309.662003] rtnl_unlock+0x9/0x10 [ 309.662008] default_device_exit_batch+0x133/0x150 [ 309.662013] ops_exit_list.isra.0+0x4d/0x60 [ 309.662017] cleanup_net+0x1d8/0x2c0 [ 309.662022] process_one_work+0x1f4/0x6d0 [ 309.662027] worker_thread+0x49/0x4a0 [ 309.662031] kthread+0x107/0x140 [ 309.662035] ret_from_fork+0x2e/0x40 [ 309.662039] IN-RECLAIM_FS-W at: [ 309.662043] __lock_acquire+0x638/0x1b50 [ 309.662048] lock_acquire+0xc9/0x220 [ 309.662053] __mutex_lock+0x6e/0x990 [ 309.662058] mutex_lock_nested+0x16/0x20 [ 309.662062] _rcu_barrier+0x31/0x160 [ 309.662067] rcu_barrier+0x10/0x20 [ 309.662089] i915_gem_shrink_all+0x33/0x40 [i915] [ 309.662109] i915_drop_caches_set+0x141/0x150 [i915] [ 309.662114] simple_attr_write+0xc7/0xe0 [ 309.662119] full_proxy_write+0x4f/0x70 [ 309.662124] __vfs_write+0x23/0x120 [ 309.662128] vfs_write+0xc6/0x1f0 [ 309.662133] SyS_write+0x44/0xb0 [ 309.662138] entry_SYSCALL_64_fastpath+0x1c/0xb1 [ 309.662142] INITIAL USE at: [ 309.662147] __lock_acquire+0x234/0x1b50 [ 309.662151] lock_acquire+0xc9/0x220 [ 309.662156] __mutex_lock+0x6e/0x990 [ 309.662160] mutex_lock_nested+0x16/0x20 [ 309.662165] _rcu_barrier+0x31/0x160 [ 309.662169] rcu_barrier+0x10/0x20 [ 309.662174] netdev_run_todo+0x5f/0x310 [ 309.662178] rtnl_unlock+0x9/0x10 [ 309.662183] default_device_exit_batch+0x133/0x150 [ 309.662188] ops_exit_list.isra.0+0x4d/0x60 [ 309.662192] cleanup_net+0x1d8/0x2c0 [ 309.662197] process_one_work+0x1f4/0x6d0 [ 309.662202] worker_thread+0x49/0x4a0 [ 309.662206] kthread+0x107/0x140 [ 309.662210] ret_from_fork+0x2e/0x40 [ 309.662214] } [ 309.662220] ... key at: [<ffffffff81e4e1c8>] rcu_preempt_state+0x508/0x780 [ 309.662225] ... acquired at: [ 309.662229] check_usage_forwards+0x12b/0x130 [ 309.662233] mark_lock+0x360/0x6f0 [ 309.662237] __lock_acquire+0x638/0x1b50 [ 309.662241] lock_acquire+0xc9/0x220 [ 309.662245] __mutex_lock+0x6e/0x990 [ 309.662249] mutex_lock_nested+0x16/0x20 [ 309.662253] _rcu_barrier+0x31/0x160 [ 309.662257] rcu_barrier+0x10/0x20 [ 309.662279] i915_gem_shrink_all+0x33/0x40 [i915] [ 309.662298] i915_drop_caches_set+0x141/0x150 [i915] [ 309.662303] simple_attr_write+0xc7/0xe0 [ 309.662307] full_proxy_write+0x4f/0x70 [ 309.662311] __vfs_write+0x23/0x120 [ 309.662315] vfs_write+0xc6/0x1f0 [ 309.662319] SyS_write+0x44/0xb0 [ 309.662323] entry_SYSCALL_64_fastpath+0x1c/0xb1 [ 309.662329] stack backtrace: [ 309.662335] CPU: 1 PID: 6435 Comm: gem_exec_gttfil Tainted: G W 4.11.0-rc1-CI-CI_DRM_2333+ #1 [ 309.662342] Hardware name: Hewlett-Packard HP Compaq 8100 Elite SFF PC/304Ah, BIOS 786H1 v01.13 07/14/2011 [ 309.662348] Call Trace: [ 309.662354] dump_stack+0x67/0x92 [ 309.662359] print_irq_inversion_bug.part.19+0x1a4/0x1b0 [ 309.662365] check_usage_forwards+0x12b/0x130 [ 309.662369] mark_lock+0x360/0x6f0 [ 309.662374] ? print_shortest_lock_dependencies+0x1a0/0x1a0 [ 309.662379] __lock_acquire+0x638/0x1b50 [ 309.662383] ? __mutex_unlock_slowpath+0x3e/0x2e0 [ 309.662388] ? trace_hardirqs_on+0xd/0x10 [ 309.662392] ? _rcu_barrier+0x31/0x160 [ 309.662396] lock_acquire+0xc9/0x220 [ 309.662400] ? _rcu_barrier+0x31/0x160 [ 309.662404] ? _rcu_barrier+0x31/0x160 [ 309.662409] __mutex_lock+0x6e/0x990 [ 309.662412] ? _rcu_barrier+0x31/0x160 [ 309.662416] ? _rcu_barrier+0x31/0x160 [ 309.662421] ? synchronize_rcu_expedited+0x35/0xb0 [ 309.662426] ? _raw_spin_unlock_irqrestore+0x52/0x60 [ 309.662434] mutex_lock_nested+0x16/0x20 [ 309.662438] _rcu_barrier+0x31/0x160 [ 309.662442] rcu_barrier+0x10/0x20 [ 309.662464] i915_gem_shrink_all+0x33/0x40 [i915] [ 309.662484] i915_drop_caches_set+0x141/0x150 [i915] [ 309.662489] simple_attr_write+0xc7/0xe0 [ 309.662494] full_proxy_write+0x4f/0x70 [ 309.662498] __vfs_write+0x23/0x120 [ 309.662503] ? rcu_read_lock_sched_held+0x75/0x80 [ 309.662507] ? rcu_sync_lockdep_assert+0x2a/0x50 [ 309.662512] ? __sb_start_write+0x102/0x210 [ 309.662516] ? vfs_write+0x17d/0x1f0 [ 309.662520] vfs_write+0xc6/0x1f0 [ 309.662524] ? trace_hardirqs_on_caller+0xe7/0x200 [ 309.662529] SyS_write+0x44/0xb0 [ 309.662533] entry_SYSCALL_64_fastpath+0x1c/0xb1 [ 309.662537] RIP: 0033:0x7f507eac24a0 [ 309.662541] RSP: 002b:00007fffda8720e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 309.662548] RAX: ffffffffffffffda RBX: ffffffff81482bd3 RCX: 00007f507eac24a0 [ 309.662552] RDX: 0000000000000005 RSI: 00007fffda8720f0 RDI: 0000000000000005 [ 309.662557] RBP: ffffc9000048bf88 R08: 0000000000000000 R09: 000000000000002c [ 309.662561] R10: 0000000000000014 R11: 0000000000000246 R12: 00007fffda872230 [ 309.662566] R13: 00007fffda872228 R14: 0000000000000201 R15: 00007fffda8720f0 [ 309.662572] ? __this_cpu_preempt_check+0x13/0x20 Fixes: 0eafec6 ("drm/i915: Enable lockless lookup of request tracking via RCU") Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100192 Signed-off-by: Chris Wilson <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: <[email protected]> # v4.9+ Link: http://patchwork.freedesktop.org/patch/msgid/[email protected] Reviewed-by: Daniel Vetter <[email protected]> (cherry picked from commit bd784b7) Signed-off-by: Jani Nikula <[email protected]> Link: http://patchwork.freedesktop.org/patch/msgid/[email protected]
pcmoore
pushed a commit
that referenced
this issue
May 17, 2017
Dmitry reported a lockdep splat [1] (false positive) that we can fix by releasing the spinlock before calling icmp_send() from ip_expire() This is a false positive because sending an ICMP message can not possibly re-enter the IP frag engine. [1] [ INFO: possible circular locking dependency detected ] 4.10.0+ #29 Not tainted ------------------------------------------------------- modprobe/12392 is trying to acquire lock: (_xmit_ETHER#2){+.-...}, at: [<ffffffff837a8182>] spin_lock include/linux/spinlock.h:299 [inline] (_xmit_ETHER#2){+.-...}, at: [<ffffffff837a8182>] __netif_tx_lock include/linux/netdevice.h:3486 [inline] (_xmit_ETHER#2){+.-...}, at: [<ffffffff837a8182>] sch_direct_xmit+0x282/0x6d0 net/sched/sch_generic.c:180 but task is already holding lock: (&(&q->lock)->rlock){+.-...}, at: [<ffffffff8389a4d1>] spin_lock include/linux/spinlock.h:299 [inline] (&(&q->lock)->rlock){+.-...}, at: [<ffffffff8389a4d1>] ip_expire+0x51/0x6c0 net/ipv4/ip_fragment.c:201 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&(&q->lock)->rlock){+.-...}: validate_chain kernel/locking/lockdep.c:2267 [inline] __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3340 lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3755 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151 spin_lock include/linux/spinlock.h:299 [inline] ip_defrag+0x3a2/0x4130 net/ipv4/ip_fragment.c:669 ip_check_defrag+0x4e3/0x8b0 net/ipv4/ip_fragment.c:713 packet_rcv_fanout+0x282/0x800 net/packet/af_packet.c:1459 deliver_skb net/core/dev.c:1834 [inline] dev_queue_xmit_nit+0x294/0xa90 net/core/dev.c:1890 xmit_one net/core/dev.c:2903 [inline] dev_hard_start_xmit+0x16b/0xab0 net/core/dev.c:2923 sch_direct_xmit+0x31f/0x6d0 net/sched/sch_generic.c:182 __dev_xmit_skb net/core/dev.c:3092 [inline] __dev_queue_xmit+0x13e5/0x1e60 net/core/dev.c:3358 dev_queue_xmit+0x17/0x20 net/core/dev.c:3423 neigh_resolve_output+0x6b9/0xb10 net/core/neighbour.c:1308 neigh_output include/net/neighbour.h:478 [inline] ip_finish_output2+0x8b8/0x15a0 net/ipv4/ip_output.c:228 ip_do_fragment+0x1d93/0x2720 net/ipv4/ip_output.c:672 ip_fragment.constprop.54+0x145/0x200 net/ipv4/ip_output.c:545 ip_finish_output+0x82d/0xe10 net/ipv4/ip_output.c:314 NF_HOOK_COND include/linux/netfilter.h:246 [inline] ip_output+0x1f0/0x7a0 net/ipv4/ip_output.c:404 dst_output include/net/dst.h:486 [inline] ip_local_out+0x95/0x170 net/ipv4/ip_output.c:124 ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492 ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512 raw_sendmsg+0x26de/0x3a00 net/ipv4/raw.c:655 inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:761 sock_sendmsg_nosec net/socket.c:633 [inline] sock_sendmsg+0xca/0x110 net/socket.c:643 ___sys_sendmsg+0x4a3/0x9f0 net/socket.c:1985 __sys_sendmmsg+0x25c/0x750 net/socket.c:2075 SYSC_sendmmsg net/socket.c:2106 [inline] SyS_sendmmsg+0x35/0x60 net/socket.c:2101 do_syscall_64+0x2e8/0x930 arch/x86/entry/common.c:281 return_from_SYSCALL_64+0x0/0x7a -> #0 (_xmit_ETHER#2){+.-...}: check_prev_add kernel/locking/lockdep.c:1830 [inline] check_prevs_add+0xa8f/0x19f0 kernel/locking/lockdep.c:1940 validate_chain kernel/locking/lockdep.c:2267 [inline] __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3340 lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3755 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151 spin_lock include/linux/spinlock.h:299 [inline] __netif_tx_lock include/linux/netdevice.h:3486 [inline] sch_direct_xmit+0x282/0x6d0 net/sched/sch_generic.c:180 __dev_xmit_skb net/core/dev.c:3092 [inline] __dev_queue_xmit+0x13e5/0x1e60 net/core/dev.c:3358 dev_queue_xmit+0x17/0x20 net/core/dev.c:3423 neigh_hh_output include/net/neighbour.h:468 [inline] neigh_output include/net/neighbour.h:476 [inline] ip_finish_output2+0xf6c/0x15a0 net/ipv4/ip_output.c:228 ip_finish_output+0xa29/0xe10 net/ipv4/ip_output.c:316 NF_HOOK_COND include/linux/netfilter.h:246 [inline] ip_output+0x1f0/0x7a0 net/ipv4/ip_output.c:404 dst_output include/net/dst.h:486 [inline] ip_local_out+0x95/0x170 net/ipv4/ip_output.c:124 ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492 ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512 icmp_push_reply+0x372/0x4d0 net/ipv4/icmp.c:394 icmp_send+0x156c/0x1c80 net/ipv4/icmp.c:754 ip_expire+0x40e/0x6c0 net/ipv4/ip_fragment.c:239 call_timer_fn+0x241/0x820 kernel/time/timer.c:1268 expire_timers kernel/time/timer.c:1307 [inline] __run_timers+0x960/0xcf0 kernel/time/timer.c:1601 run_timer_softirq+0x21/0x80 kernel/time/timer.c:1614 __do_softirq+0x31f/0xbe7 kernel/softirq.c:284 invoke_softirq kernel/softirq.c:364 [inline] irq_exit+0x1cc/0x200 kernel/softirq.c:405 exiting_irq arch/x86/include/asm/apic.h:657 [inline] smp_apic_timer_interrupt+0x76/0xa0 arch/x86/kernel/apic/apic.c:962 apic_timer_interrupt+0x93/0xa0 arch/x86/entry/entry_64.S:707 __read_once_size include/linux/compiler.h:254 [inline] atomic_read arch/x86/include/asm/atomic.h:26 [inline] rcu_dynticks_curr_cpu_in_eqs kernel/rcu/tree.c:350 [inline] __rcu_is_watching kernel/rcu/tree.c:1133 [inline] rcu_is_watching+0x83/0x110 kernel/rcu/tree.c:1147 rcu_read_lock_held+0x87/0xc0 kernel/rcu/update.c:293 radix_tree_deref_slot include/linux/radix-tree.h:238 [inline] filemap_map_pages+0x6d4/0x1570 mm/filemap.c:2335 do_fault_around mm/memory.c:3231 [inline] do_read_fault mm/memory.c:3265 [inline] do_fault+0xbd5/0x2080 mm/memory.c:3370 handle_pte_fault mm/memory.c:3600 [inline] __handle_mm_fault+0x1062/0x2cb0 mm/memory.c:3714 handle_mm_fault+0x1e2/0x480 mm/memory.c:3751 __do_page_fault+0x4f6/0xb60 arch/x86/mm/fault.c:1397 do_page_fault+0x54/0x70 arch/x86/mm/fault.c:1460 page_fault+0x28/0x30 arch/x86/entry/entry_64.S:1011 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&(&q->lock)->rlock); lock(_xmit_ETHER#2); lock(&(&q->lock)->rlock); lock(_xmit_ETHER#2); *** DEADLOCK *** 10 locks held by modprobe/12392: #0: (&mm->mmap_sem){++++++}, at: [<ffffffff81329758>] __do_page_fault+0x2b8/0xb60 arch/x86/mm/fault.c:1336 #1: (rcu_read_lock){......}, at: [<ffffffff8188cab6>] filemap_map_pages+0x1e6/0x1570 mm/filemap.c:2324 #2: (&(ptlock_ptr(page))->rlock#2){+.+...}, at: [<ffffffff81984a78>] spin_lock include/linux/spinlock.h:299 [inline] #2: (&(ptlock_ptr(page))->rlock#2){+.+...}, at: [<ffffffff81984a78>] pte_alloc_one_map mm/memory.c:2944 [inline] #2: (&(ptlock_ptr(page))->rlock#2){+.+...}, at: [<ffffffff81984a78>] alloc_set_pte+0x13b8/0x1b90 mm/memory.c:3072 #3: (((&q->timer))){+.-...}, at: [<ffffffff81627e72>] lockdep_copy_map include/linux/lockdep.h:175 [inline] #3: (((&q->timer))){+.-...}, at: [<ffffffff81627e72>] call_timer_fn+0x1c2/0x820 kernel/time/timer.c:1258 #4: (&(&q->lock)->rlock){+.-...}, at: [<ffffffff8389a4d1>] spin_lock include/linux/spinlock.h:299 [inline] #4: (&(&q->lock)->rlock){+.-...}, at: [<ffffffff8389a4d1>] ip_expire+0x51/0x6c0 net/ipv4/ip_fragment.c:201 #5: (rcu_read_lock){......}, at: [<ffffffff8389a633>] ip_expire+0x1b3/0x6c0 net/ipv4/ip_fragment.c:216 #6: (slock-AF_INET){+.-...}, at: [<ffffffff839b3313>] spin_trylock include/linux/spinlock.h:309 [inline] #6: (slock-AF_INET){+.-...}, at: [<ffffffff839b3313>] icmp_xmit_lock net/ipv4/icmp.c:219 [inline] #6: (slock-AF_INET){+.-...}, at: [<ffffffff839b3313>] icmp_send+0x803/0x1c80 net/ipv4/icmp.c:681 #7: (rcu_read_lock_bh){......}, at: [<ffffffff838ab9a1>] ip_finish_output2+0x2c1/0x15a0 net/ipv4/ip_output.c:198 #8: (rcu_read_lock_bh){......}, at: [<ffffffff836d1dee>] __dev_queue_xmit+0x23e/0x1e60 net/core/dev.c:3324 #9: (dev->qdisc_running_key ?: &qdisc_running_key){+.....}, at: [<ffffffff836d3a27>] dev_queue_xmit+0x17/0x20 net/core/dev.c:3423 stack backtrace: CPU: 0 PID: 12392 Comm: modprobe Not tainted 4.10.0+ #29 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0x2ee/0x3ef lib/dump_stack.c:52 print_circular_bug+0x307/0x3b0 kernel/locking/lockdep.c:1204 check_prev_add kernel/locking/lockdep.c:1830 [inline] check_prevs_add+0xa8f/0x19f0 kernel/locking/lockdep.c:1940 validate_chain kernel/locking/lockdep.c:2267 [inline] __lock_acquire+0x2149/0x3430 kernel/locking/lockdep.c:3340 lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3755 __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline] _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151 spin_lock include/linux/spinlock.h:299 [inline] __netif_tx_lock include/linux/netdevice.h:3486 [inline] sch_direct_xmit+0x282/0x6d0 net/sched/sch_generic.c:180 __dev_xmit_skb net/core/dev.c:3092 [inline] __dev_queue_xmit+0x13e5/0x1e60 net/core/dev.c:3358 dev_queue_xmit+0x17/0x20 net/core/dev.c:3423 neigh_hh_output include/net/neighbour.h:468 [inline] neigh_output include/net/neighbour.h:476 [inline] ip_finish_output2+0xf6c/0x15a0 net/ipv4/ip_output.c:228 ip_finish_output+0xa29/0xe10 net/ipv4/ip_output.c:316 NF_HOOK_COND include/linux/netfilter.h:246 [inline] ip_output+0x1f0/0x7a0 net/ipv4/ip_output.c:404 dst_output include/net/dst.h:486 [inline] ip_local_out+0x95/0x170 net/ipv4/ip_output.c:124 ip_send_skb+0x3c/0xc0 net/ipv4/ip_output.c:1492 ip_push_pending_frames+0x64/0x80 net/ipv4/ip_output.c:1512 icmp_push_reply+0x372/0x4d0 net/ipv4/icmp.c:394 icmp_send+0x156c/0x1c80 net/ipv4/icmp.c:754 ip_expire+0x40e/0x6c0 net/ipv4/ip_fragment.c:239 call_timer_fn+0x241/0x820 kernel/time/timer.c:1268 expire_timers kernel/time/timer.c:1307 [inline] __run_timers+0x960/0xcf0 kernel/time/timer.c:1601 run_timer_softirq+0x21/0x80 kernel/time/timer.c:1614 __do_softirq+0x31f/0xbe7 kernel/softirq.c:284 invoke_softirq kernel/softirq.c:364 [inline] irq_exit+0x1cc/0x200 kernel/softirq.c:405 exiting_irq arch/x86/include/asm/apic.h:657 [inline] smp_apic_timer_interrupt+0x76/0xa0 arch/x86/kernel/apic/apic.c:962 apic_timer_interrupt+0x93/0xa0 arch/x86/entry/entry_64.S:707 RIP: 0010:__read_once_size include/linux/compiler.h:254 [inline] RIP: 0010:atomic_read arch/x86/include/asm/atomic.h:26 [inline] RIP: 0010:rcu_dynticks_curr_cpu_in_eqs kernel/rcu/tree.c:350 [inline] RIP: 0010:__rcu_is_watching kernel/rcu/tree.c:1133 [inline] RIP: 0010:rcu_is_watching+0x83/0x110 kernel/rcu/tree.c:1147 RSP: 0000:ffff8801c391f120 EFLAGS: 00000a03 ORIG_RAX: ffffffffffffff10 RAX: dffffc0000000000 RBX: ffff8801c391f148 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000055edd4374000 RDI: ffff8801dbe1ae0c RBP: ffff8801c391f1a0 R08: 0000000000000002 R09: 0000000000000000 R10: dffffc0000000000 R11: 0000000000000002 R12: 1ffff10038723e25 R13: ffff8801dbe1ae00 R14: ffff8801c391f680 R15: dffffc0000000000 </IRQ> rcu_read_lock_held+0x87/0xc0 kernel/rcu/update.c:293 radix_tree_deref_slot include/linux/radix-tree.h:238 [inline] filemap_map_pages+0x6d4/0x1570 mm/filemap.c:2335 do_fault_around mm/memory.c:3231 [inline] do_read_fault mm/memory.c:3265 [inline] do_fault+0xbd5/0x2080 mm/memory.c:3370 handle_pte_fault mm/memory.c:3600 [inline] __handle_mm_fault+0x1062/0x2cb0 mm/memory.c:3714 handle_mm_fault+0x1e2/0x480 mm/memory.c:3751 __do_page_fault+0x4f6/0xb60 arch/x86/mm/fault.c:1397 do_page_fault+0x54/0x70 arch/x86/mm/fault.c:1460 page_fault+0x28/0x30 arch/x86/entry/entry_64.S:1011 RIP: 0033:0x7f83172f2786 RSP: 002b:00007fffe859ae80 EFLAGS: 00010293 RAX: 000055edd4373040 RBX: 00007f83175111c8 RCX: 000055edd4373238 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007f8317510970 RBP: 00007fffe859afd0 R08: 0000000000000009 R09: 0000000000000000 R10: 0000000000000064 R11: 0000000000000000 R12: 000055edd4373040 R13: 0000000000000000 R14: 00007fffe859afe8 R15: 0000000000000000 Signed-off-by: Eric Dumazet <[email protected]> Reported-by: Dmitry Vyukov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Jul 6, 2017
'perf annotate' is dropping the cr* fields from branch instructions. Fix it by adding support to display branch instructions having multiple operands. Power Arch objdump of int_sqrt: 20.36 | c0000000004d2694: subf r10,r10,r3 | c0000000004d2698: v bgt cr6,c0000000004d26a0 <int_sqrt+0x40> 1.82 | c0000000004d269c: mr r3,r10 29.18 | c0000000004d26a0: mr r10,r8 | c0000000004d26a4: v bgt cr7,c0000000004d26ac <int_sqrt+0x4c> | c0000000004d26a8: mr r10,r7 Power Arch Before Patch: 20.36 | subf r10,r10,r3 | v bgt 40 1.82 | mr r3,r10 29.18 | 40: mr r10,r8 | v bgt 4c | mr r10,r7 Power Arch After patch: 20.36 | subf r10,r10,r3 | v bgt cr6,40 1.82 | mr r3,r10 29.18 | 40: mr r10,r8 | v bgt cr7,4c | mr r10,r7 Also support AArch64 conditional branch instructions, which can have up to three operands: Aarch64 Non-simplified (raw objdump) view: │ffff0000083cd11c: ↑ cbz w0, ffff0000083cd100 <security_fil▒ ... 4.44 │ffff000│083cd134: ↓ tbnz w0, #26, ffff0000083cd190 <securit▒ ... 1.37 │ffff000│083cd144: ↓ tbnz w22, #5, ffff0000083cd1a4 <securit▒ │ffff000│083cd148: mov w19, #0x20000 //▒ 1.02 │ffff000│083cd14c: ↓ tbz w22, #2, ffff0000083cd1ac <securit▒ ... 0.68 │ffff000└──3cd16c: ↑ cbnz w0, ffff0000083cd120 <security_fil▒ Aarch64 Simplified, before this patch: │ ↑ cbz 40 ... 4.44 │ │↓ tbnz w0, #26, ffff0000083cd190 <security_file_permiss▒ ... 1.37 │ │↓ tbnz w22, #5, ffff0000083cd1a4 <security_file_permiss▒ │ │ mov w19, #0x20000 // #131072 1.02 │ │↓ tbz w22, #2, ffff0000083cd1ac <security_file_permiss▒ ... 0.68 │ └──cbnz 60 the cbz operand is missing, and the tbz doesn't get simplified processing at all because the parsing function failed to match an address. Aarch64 Simplified, After this patch applied: │ ↑ cbz w0, 40 ... 4.44 │ │↓ tbnz w0, #26, d0 ... 1.37 │ │↓ tbnz w22, #5, e4 │ │ mov w19, #0x20000 // #131072 1.02 │ │↓ tbz w22, #2, ec ... 0.68 │ └──cbnz w0, 60 Originally-by: Ravi Bangoria <[email protected]> Tested-by: Ravi Bangoria <[email protected]> Reported-by: Anton Blanchard <[email protected]> Reported-by: Robin Murphy <[email protected]> Signed-off-by: Kim Phillips <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Christian Borntraeger <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Taeung Song <[email protected]> Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Jul 6, 2017
Since the introduction of .init_rq_fn() and .exit_rq_fn() it is essential that the memory allocated for struct request_queue stays around until all blk_exit_rl() calls have finished. Hence make blk_init_rl() take a reference on struct request_queue. This patch fixes the following crash: general protection fault: 0000 [#2] SMP CPU: 3 PID: 28 Comm: ksoftirqd/3 Tainted: G D 4.12.0-rc2-dbg+ #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014 task: ffff88013a108040 task.stack: ffffc9000071c000 RIP: 0010:free_request_size+0x1a/0x30 RSP: 0018:ffffc9000071fd38 EFLAGS: 00010202 RAX: 6b6b6b6b6b6b6b6b RBX: ffff880067362a88 RCX: 0000000000000003 RDX: ffff880067464178 RSI: ffff880067362a88 RDI: ffff880135ea4418 RBP: ffffc9000071fd40 R08: 0000000000000000 R09: 0000000100180009 R10: ffffc9000071fd38 R11: ffffffff81110800 R12: ffff88006752d3d8 R13: ffff88006752d3d8 R14: ffff88013a108040 R15: 000000000000000a FS: 0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa8ec1edb00 CR3: 0000000138ee8000 CR4: 00000000001406e0 Call Trace: mempool_destroy.part.10+0x21/0x40 mempool_destroy+0xe/0x10 blk_exit_rl+0x12/0x20 blkg_free+0x4d/0xa0 __blkg_release_rcu+0x59/0x170 rcu_process_callbacks+0x260/0x4e0 __do_softirq+0x116/0x250 smpboot_thread_fn+0x123/0x1e0 kthread+0x109/0x140 ret_from_fork+0x31/0x40 Fixes: commit e9c787e ("scsi: allocate scsi_cmnd structures as part of struct request") Signed-off-by: Bart Van Assche <[email protected]> Acked-by: Tejun Heo <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Cc: Jan Kara <[email protected]> Cc: <[email protected]> # v4.11+ Signed-off-by: Jens Axboe <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Jul 6, 2017
Yuval Mintz says: ==================== bnx2x: Fix malicious VFs indication It was discovered that for a VF there's a simple [yet uncommon] scenario which would cause device firmware to declare that VF as malicious - Add a vlan interface on top of a VF and disable txvlan offloading for that VF [causing VF to transmit packets where vlan is on payload]. Patch #1 corrects driver transmission to prevent this issue. Patch #2 is a by-product correcting PF behavior once a VF is declared malicious. ==================== Signed-off-by: David S. Miller <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Jul 6, 2017
When HSR interface is setup using ip link command, an annoying warning appears with the trace as below:- [ 203.019828] hsr_get_node: Non-HSR frame [ 203.019833] Modules linked in: [ 203.019848] CPU: 0 PID: 158 Comm: sd-resolve Tainted: G W 4.12.0-rc3-00052-g9fa6bf70 #2 [ 203.019853] Hardware name: Generic DRA74X (Flattened Device Tree) [ 203.019869] [<c0110280>] (unwind_backtrace) from [<c010c2f4>] (show_stack+0x10/0x14) [ 203.019880] [<c010c2f4>] (show_stack) from [<c04b9f64>] (dump_stack+0xac/0xe0) [ 203.019894] [<c04b9f64>] (dump_stack) from [<c01374e8>] (__warn+0xd8/0x104) [ 203.019907] [<c01374e8>] (__warn) from [<c0137548>] (warn_slowpath_fmt+0x34/0x44) root@am57xx-evm:~# [ 203.019921] [<c0137548>] (warn_slowpath_fmt) from [<c081126c>] (hsr_get_node+0x148/0x170) [ 203.019932] [<c081126c>] (hsr_get_node) from [<c0814240>] (hsr_forward_skb+0x110/0x7c0) [ 203.019942] [<c0814240>] (hsr_forward_skb) from [<c0811d64>] (hsr_dev_xmit+0x2c/0x34) [ 203.019954] [<c0811d64>] (hsr_dev_xmit) from [<c06c0828>] (dev_hard_start_xmit+0xc4/0x3bc) [ 203.019963] [<c06c0828>] (dev_hard_start_xmit) from [<c06c13d8>] (__dev_queue_xmit+0x7c4/0x98c) [ 203.019974] [<c06c13d8>] (__dev_queue_xmit) from [<c0782f54>] (ip6_finish_output2+0x330/0xc1c) [ 203.019983] [<c0782f54>] (ip6_finish_output2) from [<c0788f0c>] (ip6_output+0x58/0x454) [ 203.019994] [<c0788f0c>] (ip6_output) from [<c07b16cc>] (mld_sendpack+0x420/0x744) As this is an expected path to hsr_get_node() with frame coming from the master interface, add a check to ensure packet is not from the master port and then warn. Signed-off-by: Murali Karicheri <[email protected]> Signed-off-by: David S. Miller <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Sep 5, 2017
Andre Wild reported the following warning: WARNING: CPU: 2 PID: 1205 at kernel/cpu.c:240 lockdep_assert_cpus_held+0x4c/0x60 Modules linked in: CPU: 2 PID: 1205 Comm: bash Not tainted 4.13.0-rc2-00022-gfd2b2c57ec20 #10 Hardware name: IBM 2964 N96 702 (z/VM 6.4.0) task: 00000000701d8100 task.stack: 0000000073594000 Krnl PSW : 0704f00180000000 0000000000145e24 (lockdep_assert_cpus_held+0x4c/0x60) ... Call Trace: lockdep_assert_cpus_held+0x42/0x60) stop_machine_cpuslocked+0x62/0xf0 build_all_zonelists+0x92/0x150 numa_zonelist_order_handler+0x102/0x150 proc_sys_call_handler.isra.12+0xda/0x118 proc_sys_write+0x34/0x48 __vfs_write+0x3c/0x178 vfs_write+0xbc/0x1a0 SyS_write+0x66/0xc0 system_call+0xc4/0x2b0 locks held by bash/1205: #0: (sb_writers#4){.+.+.+}, at: vfs_write+0xa6/0x1a0 #1: (zl_order_mutex){+.+...}, at: numa_zonelist_order_handler+0x44/0x150 #2: (zonelists_mutex){+.+...}, at: numa_zonelist_order_handler+0xf4/0x150 Last Breaking-Event-Address: lockdep_assert_cpus_held+0x48/0x60 This can be easily triggered with e.g. echo n > /proc/sys/vm/numa_zonelist_order In commit 3f906ba ("mm/memory-hotplug: switch locking to a percpu rwsem") memory hotplug locking was changed to fix a potential deadlock. This also switched the stop_machine() invocation within build_all_zonelists() to stop_machine_cpuslocked() which now expects that online cpus are locked when being called. This assumption is not true if build_all_zonelists() is being called from numa_zonelist_order_handler(). In order to fix this simply add a mem_hotplug_begin()/mem_hotplug_done() pair to numa_zonelist_order_handler(). Link: http://lkml.kernel.org/r/[email protected] Fixes: 3f906ba ("mm/memory-hotplug: switch locking to a percpu rwsem") Signed-off-by: Heiko Carstens <[email protected]> Reported-by: Andre Wild <[email protected]> Acked-by: Michal Hocko <[email protected]> Cc: KAMEZAWA Hiroyuki <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Sep 5, 2017
This patch fixes a bug associated with iscsit_reset_np_thread() that can occur during parallel configfs rmdir of a single iscsi_np used across multiple iscsi-target instances, that would result in hung task(s) similar to below where configfs rmdir process context was blocked indefinately waiting for iscsi_np->np_restart_comp to finish: [ 6726.112076] INFO: task dcp_proxy_node_:15550 blocked for more than 120 seconds. [ 6726.119440] Tainted: G W O 4.1.26-3321 #2 [ 6726.125045] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 6726.132927] dcp_proxy_node_ D ffff8803f202bc88 0 15550 1 0x00000000 [ 6726.140058] ffff8803f202bc88 ffff88085c64d960 ffff88083b3b1ad0 ffff88087fffeb08 [ 6726.147593] ffff8803f202c000 7fffffffffffffff ffff88083f459c28 ffff88083b3b1ad0 [ 6726.155132] ffff88035373c100 ffff8803f202bca8 ffffffff8168ced2 ffff8803f202bcb8 [ 6726.162667] Call Trace: [ 6726.165150] [<ffffffff8168ced2>] schedule+0x32/0x80 [ 6726.170156] [<ffffffff8168f5b4>] schedule_timeout+0x214/0x290 [ 6726.176030] [<ffffffff810caef2>] ? __send_signal+0x52/0x4a0 [ 6726.181728] [<ffffffff8168d7d6>] wait_for_completion+0x96/0x100 [ 6726.187774] [<ffffffff810e7c80>] ? wake_up_state+0x10/0x10 [ 6726.193395] [<ffffffffa035d6e2>] iscsit_reset_np_thread+0x62/0xe0 [iscsi_target_mod] [ 6726.201278] [<ffffffffa0355d86>] iscsit_tpg_disable_portal_group+0x96/0x190 [iscsi_target_mod] [ 6726.210033] [<ffffffffa0363f7f>] lio_target_tpg_store_enable+0x4f/0xc0 [iscsi_target_mod] [ 6726.218351] [<ffffffff81260c5a>] configfs_write_file+0xaa/0x110 [ 6726.224392] [<ffffffff811ea364>] vfs_write+0xa4/0x1b0 [ 6726.229576] [<ffffffff811eb111>] SyS_write+0x41/0xb0 [ 6726.234659] [<ffffffff8169042e>] system_call_fastpath+0x12/0x71 It would happen because each iscsit_reset_np_thread() sets state to ISCSI_NP_THREAD_RESET, sends SIGINT, and then blocks waiting for completion on iscsi_np->np_restart_comp. However, if iscsi_np was active processing a login request and more than a single iscsit_reset_np_thread() caller to the same iscsi_np was blocked on iscsi_np->np_restart_comp, iscsi_np kthread process context in __iscsi_target_login_thread() would flush pending signals and only perform a single completion of np->np_restart_comp before going back to sleep within transport specific iscsit_transport->iscsi_accept_np code. To address this bug, add a iscsi_np->np_reset_count and update __iscsi_target_login_thread() to keep completing np->np_restart_comp until ->np_reset_count has reached zero. Reported-by: Gary Guo <[email protected]> Tested-by: Gary Guo <[email protected]> Cc: Mike Christie <[email protected]> Cc: Hannes Reinecke <[email protected]> Cc: [email protected] # 3.10+ Signed-off-by: Nicholas Bellinger <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Sep 5, 2017
Dean Jenkins says: ==================== asix: Improve robustness Please consider taking these patches to improve the robustness of the ASIX USB to Ethernet driver. Failures prompting an ASIX driver code review ============================================= On an ARM i.MX6 embedded platform some strange one-off and two-off failures were observed in and around the ASIX USB to Ethernet driver. This was observed on a highly modified kernel 3.14 with the ASIX driver containing back-ported changes from kernel.org up to kernel 4.8 approximately. a) A one-off failure in asix_rx_fixup_internal(): There was an occurrence of an attempt to write off the end of the netdev buffer which was trapped by skb_over_panic() in skb_put(). [20030.846440] skbuff: skb_over_panic: text:7f2271c0 len:120 put:60 head:8366ecc0 data:8366ed02 tail:0x8366ed7a end:0x8366ed40 dev:eth0 [20030.863007] Kernel BUG at 8044ce38 [verbose debug info unavailable] [20031.215345] Backtrace: [20031.217884] [<8044cde0>] (skb_panic) from [<8044d50c>] (skb_put+0x50/0x5c) [20031.227408] [<8044d4bc>] (skb_put) from [<7f2271c0>] (asix_rx_fixup_internal+0x1c4/0x23c [asix]) [20031.242024] [<7f226ffc>] (asix_rx_fixup_internal [asix]) from [<7f22724c>] (asix_rx_fixup_common+0x14/0x18 [asix]) [20031.260309] [<7f227238>] (asix_rx_fixup_common [asix]) from [<7f21f7d4>] (usbnet_bh+0x74/0x224 [usbnet]) [20031.269879] [<7f21f760>] (usbnet_bh [usbnet]) from [<8002f834>] (call_timer_fn+0xa4/0x1f0) [20031.283961] [<8002f790>] (call_timer_fn) from [<80030834>] (run_timer_softirq+0x230/0x2a8) [20031.302782] [<80030604>] (run_timer_softirq) from [<80028780>] (__do_softirq+0x15c/0x37c) [20031.321511] [<80028624>] (__do_softirq) from [<80028c38>] (irq_exit+0x8c/0xe8) [20031.339298] [<80028bac>] (irq_exit) from [<8000e9c8>] (handle_IRQ+0x8c/0xc8) [20031.350038] [<8000e93c>] (handle_IRQ) from [<800085c8>] (gic_handle_irq+0xb8/0xf8) [20031.365528] [<80008510>] (gic_handle_irq) from [<8050de80>] (__irq_svc+0x40/0x70) Analysis of the logic of the ASIX driver (containing backported changes from kernel.org up to kernel 4.8 approximately) suggested that the software could not trigger skb_over_panic(). The analysis of the kernel BUG() crash information suggested that the netdev buffer was written with 2 minimal 60 octet length Ethernet frames (ASIX hardware drops the 4 octet FCS field) and the 2nd Ethernet frame attempted to write off the end of the netdev buffer. Note that the netdev buffer should only contain 1 Ethernet frame so if an attempt to write 2 Ethernet frames into the buffer is made then that is wrong. However, the logic of the asix_rx_fixup_internal() only allows 1 Ethernet frame to be written into the netdev buffer. Potentially this failure was due to memory corruption because it was only seen once. b) Two-off failures in the NAPI layer's backlog queue: There were 2 crashes in the NAPI layer's backlog queue presumably after asix_rx_fixup_internal() called usbnet_skb_return(). [24097.273945] Unable to handle kernel NULL pointer dereference at virtual address 00000004 [24097.398944] PC is at process_backlog+0x80/0x16c [24097.569466] Backtrace: [24097.572007] [<8045ad98>] (process_backlog) from [<8045b64c>] (net_rx_action+0xcc/0x248) [24097.591631] [<8045b580>] (net_rx_action) from [<80028780>] (__do_softirq+0x15c/0x37c) [24097.610022] [<80028624>] (__do_softirq) from [<800289cc>] (run_ksoftirqd+0x2c/0x84) and [ 1059.828452] Unable to handle kernel NULL pointer dereference at virtual address 00000000 [ 1059.953715] PC is at process_backlog+0x84/0x16c [ 1060.140896] Backtrace: [ 1060.143434] [<8045ad98>] (process_backlog) from [<8045b64c>] (net_rx_action+0xcc/0x248) [ 1060.163075] [<8045b580>] (net_rx_action) from [<80028780>] (__do_softirq+0x15c/0x37c) [ 1060.181474] [<80028624>] (__do_softirq) from [<80028c38>] (irq_exit+0x8c/0xe8) [ 1060.199256] [<80028bac>] (irq_exit) from [<8000e9c8>] (handle_IRQ+0x8c/0xc8) [ 1060.210006] [<8000e93c>] (handle_IRQ) from [<800085c8>] (gic_handle_irq+0xb8/0xf8) [ 1060.225492] [<80008510>] (gic_handle_irq) from [<8050de80>] (__irq_svc+0x40/0x70) The embedded board was only using an ASIX USB to Ethernet adaptor eth0. Analysis suggested that the doubly-linked list pointers of the backlog queue had been corrupted because one of the link pointers was NULL. Potentially this failure was due to memory corruption because it was only seen twice. Results of the ASIX driver code review ====================================== During the code review some weaknesses were observed in the ASIX driver and the following patches have been created to improve the robustness. Brief overview of the patches ----------------------------- 1. asix: Add rx->ax_skb = NULL after usbnet_skb_return() The current ASIX driver sends the received Ethernet frame to the NAPI layer of the network stack via the call to usbnet_skb_return() in asix_rx_fixup_internal() but retains the rx->ax_skb pointer to the netdev buffer. The driver no longer needs the rx->ax_skb pointer at this point because the NAPI layer now has the Ethernet frame. This means that asix_rx_fixup_internal() must not use rx->ax_skb after the call to usbnet_skb_return() because it could corrupt the handling of the Ethernet frame within the network layer. Therefore, to remove the risk of erroneous usage of rx->ax_skb, set rx->ax_skb to NULL after the call to usbnet_skb_return(). This avoids potential erroneous freeing of rx->ax_skb and erroneous writing to the netdev buffer. If the software now somehow inappropriately reused rx->ax_skb, then a NULL pointer dereference of rx->ax_skb would occur which makes investigation easier. 2. asix: Ensure asix_rx_fixup_info members are all reset This patch creates reset_asix_rx_fixup_info() to allow all the asix_rx_fixup_info structure members to be consistently reset to initial conditions. Call reset_asix_rx_fixup_info() upon each detectable error condition so that the next URB is processed from a known state. Otherwise, there is a risk that some members of the asix_rx_fixup_info structure may be incorrect after an error occurred so potentially leading to a malfunction. 3. asix: Fix small memory leak in ax88772_unbind() This patch creates asix_rx_fixup_common_free() to allow the rx->ax_skb to be freed when necessary. asix_rx_fixup_common_free() is called from ax88772_unbind() before the parent private data structure is freed. Without this patch, there is a risk of a small netdev buffer memory leak each time ax88772_unbind() is called during the reception of an Ethernet frame that spans across 2 URBs. Testing ======= The patches have been sanity tested on a 64-bit Linux laptop running kernel 4.13-rc2 with the 3 patches applied on top. The ASIX USB to Adaptor used for testing was (output of lsusb): ID 0b95:772b ASIX Electronics Corp. AX88772B Test #1 ------- The test ran a flood ping test script which slowly incremented the ICMP Echo Request's payload from 0 to 5000 octets. This eventually causes IPv4 fragmentation to occur which causes Ethernet frames to be sent very close to each other so increases the probability that an Ethernet frame will span 2 URBs. The test showed that all pings were successful. The test took about 15 minutes to complete. Test #2 ------- A script was run on the laptop to periodically run ifdown and ifup every second so that the ASIX USB to Adaptor was up for 1 second and down for 1 second. From a Linux PC connected to the laptop, the following ping command was used ping -f -s 5000 <ip address of laptop> The large ICMP payload causes IPv4 fragmentation resulting in multiple Ethernet frames per original IP packet. Kernel debug within the ASIX driver was enabled to see whether any ASIX errors were generated. The test was run for about 24 hours and no ASIX errors were seen. Patches ======= The 3 patches have been rebased off the net-next repo master branch with HEAD fbbeefd net: fec: Allow reception of frames bigger than 1522 bytes ==================== Signed-off-by: David S. Miller <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Sep 5, 2017
MEI device performs link reset during system suspend sequence. The link reset cannot be performed while device is in runtime suspend state. The resume sequence is bypassed with suspend direct complete optimization,so the optimization should be disabled for mei devices. Fixes: [ 192.940537] Restarting tasks ... [ 192.940610] PGI is not set [ 192.940619] ------------[ cut here ]------------ [ 192.940623] WARNING: CPU: 0 me.c:653 mei_me_pg_exit_sync+0x351/0x360 [ 192.940624] Modules linked in: [ 192.940627] CPU: 0 PID: 1661 Comm: kworker/0:3 Not tainted 4.13.0-rc2+ #2 [ 192.940628] Hardware name: Dell Inc. XPS 13 9343/0TM99H, BIOS A11 12/08/2016 [ 192.940630] Workqueue: pm pm_runtime_work <snip> [ 192.940642] Call Trace: [ 192.940646] ? pci_pme_active+0x1de/0x1f0 [ 192.940649] ? pci_restore_standard_config+0x50/0x50 [ 192.940651] ? kfree+0x172/0x190 [ 192.940653] ? kfree+0x172/0x190 [ 192.940655] ? pci_restore_standard_config+0x50/0x50 [ 192.940663] mei_me_pm_runtime_resume+0x3f/0xc0 [ 192.940665] pci_pm_runtime_resume+0x7a/0xa0 [ 192.940667] __rpm_callback+0xb9/0x1e0 [ 192.940668] ? preempt_count_add+0x6d/0xc0 [ 192.940670] rpm_callback+0x24/0x90 [ 192.940672] ? pci_restore_standard_config+0x50/0x50 [ 192.940674] rpm_resume+0x4e8/0x800 [ 192.940676] pm_runtime_work+0x55/0xb0 [ 192.940678] process_one_work+0x184/0x3e0 [ 192.940680] worker_thread+0x4d/0x3a0 [ 192.940681] ? preempt_count_sub+0x9b/0x100 [ 192.940683] kthread+0x122/0x140 [ 192.940684] ? process_one_work+0x3e0/0x3e0 [ 192.940685] ? __kthread_create_on_node+0x1a0/0x1a0 [ 192.940688] ret_from_fork+0x27/0x40 [ 192.940690] Code: 96 3a 9e ff 48 8b 7d 98 e8 cd 21 58 00 83 bb bc 01 00 00 04 0f 85 40 fe ff ff e9 41 fe ff ff 48 c7 c7 5f 04 99 96 e8 93 6b 9f ff <0f> ff e9 5d fd ff ff e8 33 fe 99 ff 0f 1f 00 0f 1f 44 00 00 55 [ 192.940719] ---[ end trace a86955597774ead8 ]--- [ 192.942540] done. Suggested-by: Rafael J. Wysocki <[email protected]> Reported-by: Dominik Brodowski <[email protected]> Cc: Rafael J. Wysocki <[email protected]> Cc: Dominik Brodowski <[email protected]> Signed-off-by: Alexander Usyskin <[email protected]> Signed-off-by: Tomas Winkler <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Sep 5, 2017
pfkey_broadcast() might be called from non process contexts, we can not use GFP_KERNEL in these cases [1]. This patch partially reverts commit ba51b6b ("net: Fix RCU splat in af_key"), only keeping the GFP_ATOMIC forcing under rcu_read_lock() section. [1] : syzkaller reported : in_atomic(): 1, irqs_disabled(): 0, pid: 2932, name: syzkaller183439 3 locks held by syzkaller183439/2932: #0: (&net->xfrm.xfrm_cfg_mutex){+.+.+.}, at: [<ffffffff83b43888>] pfkey_sendmsg+0x4c8/0x9f0 net/key/af_key.c:3649 #1: (&pfk->dump_lock){+.+.+.}, at: [<ffffffff83b467f6>] pfkey_do_dump+0x76/0x3f0 net/key/af_key.c:293 #2: (&(&net->xfrm.xfrm_policy_lock)->rlock){+...+.}, at: [<ffffffff83957632>] spin_lock_bh include/linux/spinlock.h:304 [inline] #2: (&(&net->xfrm.xfrm_policy_lock)->rlock){+...+.}, at: [<ffffffff83957632>] xfrm_policy_walk+0x192/0xa30 net/xfrm/xfrm_policy.c:1028 CPU: 0 PID: 2932 Comm: syzkaller183439 Not tainted 4.13.0-rc4+ #24 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:52 ___might_sleep+0x2b2/0x470 kernel/sched/core.c:5994 __might_sleep+0x95/0x190 kernel/sched/core.c:5947 slab_pre_alloc_hook mm/slab.h:416 [inline] slab_alloc mm/slab.c:3383 [inline] kmem_cache_alloc+0x24b/0x6e0 mm/slab.c:3559 skb_clone+0x1a0/0x400 net/core/skbuff.c:1037 pfkey_broadcast_one+0x4b2/0x6f0 net/key/af_key.c:207 pfkey_broadcast+0x4ba/0x770 net/key/af_key.c:281 dump_sp+0x3d6/0x500 net/key/af_key.c:2685 xfrm_policy_walk+0x2f1/0xa30 net/xfrm/xfrm_policy.c:1042 pfkey_dump_sp+0x42/0x50 net/key/af_key.c:2695 pfkey_do_dump+0xaa/0x3f0 net/key/af_key.c:299 pfkey_spddump+0x1a0/0x210 net/key/af_key.c:2722 pfkey_process+0x606/0x710 net/key/af_key.c:2814 pfkey_sendmsg+0x4d6/0x9f0 net/key/af_key.c:3650 sock_sendmsg_nosec net/socket.c:633 [inline] sock_sendmsg+0xca/0x110 net/socket.c:643 ___sys_sendmsg+0x755/0x890 net/socket.c:2035 __sys_sendmsg+0xe5/0x210 net/socket.c:2069 SYSC_sendmsg net/socket.c:2080 [inline] SyS_sendmsg+0x2d/0x50 net/socket.c:2076 entry_SYSCALL_64_fastpath+0x1f/0xbe RIP: 0033:0x445d79 RSP: 002b:00007f32447c1dc8 EFLAGS: 00000202 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000445d79 RDX: 0000000000000000 RSI: 000000002023dfc8 RDI: 0000000000000008 RBP: 0000000000000086 R08: 00007f32447c2700 R09: 00007f32447c2700 R10: 00007f32447c2700 R11: 0000000000000202 R12: 0000000000000000 R13: 00007ffe33edec4f R14: 00007f32447c29c0 R15: 0000000000000000 Fixes: ba51b6b ("net: Fix RCU splat in af_key") Signed-off-by: Eric Dumazet <[email protected]> Reported-by: Dmitry Vyukov <[email protected]> Cc: David Ahern <[email protected]> Acked-by: David Ahern <[email protected]> Signed-off-by: David S. Miller <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Sep 5, 2017
syzkaller reported that DCCP could have a non empty write queue at dismantle time. WARNING: CPU: 1 PID: 2953 at net/core/stream.c:199 sk_stream_kill_queues+0x3ce/0x520 net/core/stream.c:199 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 2953 Comm: syz-executor0 Not tainted 4.13.0-rc4+ #2 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:52 panic+0x1e4/0x417 kernel/panic.c:180 __warn+0x1c4/0x1d9 kernel/panic.c:541 report_bug+0x211/0x2d0 lib/bug.c:183 fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:190 do_trap_no_signal arch/x86/kernel/traps.c:224 [inline] do_trap+0x260/0x390 arch/x86/kernel/traps.c:273 do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:310 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:323 invalid_op+0x1e/0x30 arch/x86/entry/entry_64.S:846 RIP: 0010:sk_stream_kill_queues+0x3ce/0x520 net/core/stream.c:199 RSP: 0018:ffff8801d182f108 EFLAGS: 00010297 RAX: ffff8801d1144140 RBX: ffff8801d13cb280 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff85137b00 RDI: ffff8801d13cb280 RBP: ffff8801d182f148 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801d13cb4d0 R13: ffff8801d13cb3b8 R14: ffff8801d13cb300 R15: ffff8801d13cb3b8 inet_csk_destroy_sock+0x175/0x3f0 net/ipv4/inet_connection_sock.c:835 dccp_close+0x84d/0xc10 net/dccp/proto.c:1067 inet_release+0xed/0x1c0 net/ipv4/af_inet.c:425 sock_release+0x8d/0x1e0 net/socket.c:597 sock_close+0x16/0x20 net/socket.c:1126 __fput+0x327/0x7e0 fs/file_table.c:210 ____fput+0x15/0x20 fs/file_table.c:246 task_work_run+0x18a/0x260 kernel/task_work.c:116 exit_task_work include/linux/task_work.h:21 [inline] do_exit+0xa32/0x1b10 kernel/exit.c:865 do_group_exit+0x149/0x400 kernel/exit.c:969 get_signal+0x7e8/0x17e0 kernel/signal.c:2330 do_signal+0x94/0x1ee0 arch/x86/kernel/signal.c:808 exit_to_usermode_loop+0x21c/0x2d0 arch/x86/entry/common.c:157 prepare_exit_to_usermode arch/x86/entry/common.c:194 [inline] syscall_return_slowpath+0x3a7/0x450 arch/x86/entry/common.c:263 Signed-off-by: Eric Dumazet <[email protected]> Reported-by: Dmitry Vyukov <[email protected]> Signed-off-by: David S. Miller <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Sep 5, 2017
The following commit: 39a0526 ("x86/mm: Factor out LDT init from context init") renamed init_new_context() to init_new_context_ldt() and added a new init_new_context() which calls init_new_context_ldt(). However, the error code of init_new_context_ldt() was ignored. Consequently, if a memory allocation in alloc_ldt_struct() failed during a fork(), the ->context.ldt of the new task remained the same as that of the old task (due to the memcpy() in dup_mm()). ldt_struct's are not intended to be shared, so a use-after-free occurred after one task exited. Fix the bug by making init_new_context() pass through the error code of init_new_context_ldt(). This bug was found by syzkaller, which encountered the following splat: BUG: KASAN: use-after-free in free_ldt_struct.part.2+0x10a/0x150 arch/x86/kernel/ldt.c:116 Read of size 4 at addr ffff88006d2cb7c8 by task kworker/u9:0/3710 CPU: 1 PID: 3710 Comm: kworker/u9:0 Not tainted 4.13.0-rc4-next-20170811 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:52 print_address_description+0x73/0x250 mm/kasan/report.c:252 kasan_report_error mm/kasan/report.c:351 [inline] kasan_report+0x24e/0x340 mm/kasan/report.c:409 __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:429 free_ldt_struct.part.2+0x10a/0x150 arch/x86/kernel/ldt.c:116 free_ldt_struct arch/x86/kernel/ldt.c:173 [inline] destroy_context_ldt+0x60/0x80 arch/x86/kernel/ldt.c:171 destroy_context arch/x86/include/asm/mmu_context.h:157 [inline] __mmdrop+0xe9/0x530 kernel/fork.c:889 mmdrop include/linux/sched/mm.h:42 [inline] exec_mmap fs/exec.c:1061 [inline] flush_old_exec+0x173c/0x1ff0 fs/exec.c:1291 load_elf_binary+0x81f/0x4ba0 fs/binfmt_elf.c:855 search_binary_handler+0x142/0x6b0 fs/exec.c:1652 exec_binprm fs/exec.c:1694 [inline] do_execveat_common.isra.33+0x1746/0x22e0 fs/exec.c:1816 do_execve+0x31/0x40 fs/exec.c:1860 call_usermodehelper_exec_async+0x457/0x8f0 kernel/umh.c:100 ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431 Allocated by task 3700: save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59 save_stack+0x43/0xd0 mm/kasan/kasan.c:447 set_track mm/kasan/kasan.c:459 [inline] kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551 kmem_cache_alloc_trace+0x136/0x750 mm/slab.c:3627 kmalloc include/linux/slab.h:493 [inline] alloc_ldt_struct+0x52/0x140 arch/x86/kernel/ldt.c:67 write_ldt+0x7b7/0xab0 arch/x86/kernel/ldt.c:277 sys_modify_ldt+0x1ef/0x240 arch/x86/kernel/ldt.c:307 entry_SYSCALL_64_fastpath+0x1f/0xbe Freed by task 3700: save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59 save_stack+0x43/0xd0 mm/kasan/kasan.c:447 set_track mm/kasan/kasan.c:459 [inline] kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524 __cache_free mm/slab.c:3503 [inline] kfree+0xca/0x250 mm/slab.c:3820 free_ldt_struct.part.2+0xdd/0x150 arch/x86/kernel/ldt.c:121 free_ldt_struct arch/x86/kernel/ldt.c:173 [inline] destroy_context_ldt+0x60/0x80 arch/x86/kernel/ldt.c:171 destroy_context arch/x86/include/asm/mmu_context.h:157 [inline] __mmdrop+0xe9/0x530 kernel/fork.c:889 mmdrop include/linux/sched/mm.h:42 [inline] __mmput kernel/fork.c:916 [inline] mmput+0x541/0x6e0 kernel/fork.c:927 copy_process.part.36+0x22e1/0x4af0 kernel/fork.c:1931 copy_process kernel/fork.c:1546 [inline] _do_fork+0x1ef/0xfb0 kernel/fork.c:2025 SYSC_clone kernel/fork.c:2135 [inline] SyS_clone+0x37/0x50 kernel/fork.c:2129 do_syscall_64+0x26c/0x8c0 arch/x86/entry/common.c:287 return_from_SYSCALL_64+0x0/0x7a Here is a C reproducer: #include <asm/ldt.h> #include <pthread.h> #include <signal.h> #include <stdlib.h> #include <sys/syscall.h> #include <sys/wait.h> #include <unistd.h> static void *fork_thread(void *_arg) { fork(); } int main(void) { struct user_desc desc = { .entry_number = 8191 }; syscall(__NR_modify_ldt, 1, &desc, sizeof(desc)); for (;;) { if (fork() == 0) { pthread_t t; srand(getpid()); pthread_create(&t, NULL, fork_thread, NULL); usleep(rand() % 10000); syscall(__NR_exit_group, 0); } wait(NULL); } } Note: the reproducer takes advantage of the fact that alloc_ldt_struct() may use vmalloc() to allocate a large ->entries array, and after commit: 5d17a73 ("vmalloc: back off when the current task is killed") it is possible for userspace to fail a task's vmalloc() by sending a fatal signal, e.g. via exit_group(). It would be more difficult to reproduce this bug on kernels without that commit. This bug only affected kernels with CONFIG_MODIFY_LDT_SYSCALL=y. Signed-off-by: Eric Biggers <[email protected]> Acked-by: Dave Hansen <[email protected]> Cc: <[email protected]> [v4.6+] Cc: Andrew Morton <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Brian Gerst <[email protected]> Cc: Christoph Hellwig <[email protected]> Cc: Denys Vlasenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: [email protected] Fixes: 39a0526 ("x86/mm: Factor out LDT init from context init") Link: http://lkml.kernel.org/r/[email protected] Signed-off-by: Ingo Molnar <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Dec 2, 2024
…ndex Intel SoundWire machine driver always uses Pin number 2 and above. Currently, the pin number is used as the FW DAI index directly. As a result, FW DAI 0 and 1 are never used. That worked fine because we use up to 2 DAIs in a SDW link. Convert the topology pin index to ALH dai index, the mapping is using 2-off indexing, iow, pin #2 is ALH dai #0. The issue exists since beginning. And the Fixes tag is the first commit that this commit can be applied. Fixes: b66bfc3 ("ASoC: SOF: sof-audio: Fix broken early bclk feature for SSP") Signed-off-by: Bard Liao <[email protected]> Reviewed-by: Péter Ujfalusi <[email protected]> Reviewed-by: Liam Girdwood <[email protected]> Reviewed-by: Kai Vehmanen <[email protected]> Reviewed-by: Ranjani Sridharan <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Mark Brown <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Dec 2, 2024
…to HEAD KVM/riscv changes for 6.13 part #2 - Svade and Svadu extension support for Host and Guest/VM
pcmoore
pushed a commit
that referenced
this issue
Jan 20, 2025
Since the netlink attribute range validation provides inclusive checking, the *max* of attribute NL80211_ATTR_MLO_LINK_ID should be IEEE80211_MLD_MAX_NUM_LINKS - 1 otherwise causing an off-by-one. One crash stack for demonstration: ================================================================== BUG: KASAN: wild-memory-access in ieee80211_tx_control_port+0x3b6/0xca0 net/mac80211/tx.c:5939 Read of size 6 at addr 001102080000000c by task fuzzer.386/9508 CPU: 1 PID: 9508 Comm: syz.1.386 Not tainted 6.1.70 #2 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x177/0x231 lib/dump_stack.c:106 print_report+0xe0/0x750 mm/kasan/report.c:398 kasan_report+0x139/0x170 mm/kasan/report.c:495 kasan_check_range+0x287/0x290 mm/kasan/generic.c:189 memcpy+0x25/0x60 mm/kasan/shadow.c:65 ieee80211_tx_control_port+0x3b6/0xca0 net/mac80211/tx.c:5939 rdev_tx_control_port net/wireless/rdev-ops.h:761 [inline] nl80211_tx_control_port+0x7b3/0xc40 net/wireless/nl80211.c:15453 genl_family_rcv_msg_doit+0x22e/0x320 net/netlink/genetlink.c:756 genl_family_rcv_msg net/netlink/genetlink.c:833 [inline] genl_rcv_msg+0x539/0x740 net/netlink/genetlink.c:850 netlink_rcv_skb+0x1de/0x420 net/netlink/af_netlink.c:2508 genl_rcv+0x24/0x40 net/netlink/genetlink.c:861 netlink_unicast_kernel net/netlink/af_netlink.c:1326 [inline] netlink_unicast+0x74b/0x8c0 net/netlink/af_netlink.c:1352 netlink_sendmsg+0x882/0xb90 net/netlink/af_netlink.c:1874 sock_sendmsg_nosec net/socket.c:716 [inline] __sock_sendmsg net/socket.c:728 [inline] ____sys_sendmsg+0x5cc/0x8f0 net/socket.c:2499 ___sys_sendmsg+0x21c/0x290 net/socket.c:2553 __sys_sendmsg net/socket.c:2582 [inline] __do_sys_sendmsg net/socket.c:2591 [inline] __se_sys_sendmsg+0x19e/0x270 net/socket.c:2589 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x45/0x90 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x63/0xcd Update the policy to ensure correct validation. Fixes: 7b0a0e3 ("wifi: cfg80211: do some rework towards MLO link APIs") Signed-off-by: Lin Ma <[email protected]> Suggested-by: Cengiz Can <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Johannes Berg <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Jan 20, 2025
Its used from trace__run(), for the 'perf trace' live mode, i.e. its strace-like, non-perf.data file processing mode, the most common one. The trace__run() function will set trace->host using machine__new_host() that is supposed to give a machine instance representing the running machine, and since we'll use perf_env__arch_strerrno() to get the right errno -> string table, we need to use machine->env, so initialize it in machine__new_host(). Before the patch: (gdb) run trace --errno-summary -a sleep 1 <SNIP> Summary of events: gvfs-afc-volume (3187), 2 events, 0.0% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ pselect6 1 0 0.000 0.000 0.000 0.000 0.00% GUsbEventThread (3519), 2 events, 0.0% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ poll 1 0 0.000 0.000 0.000 0.000 0.00% <SNIP> Program received signal SIGSEGV, Segmentation fault. 0x00000000005caba0 in perf_env__arch_strerrno (env=0x0, err=110) at util/env.c:478 478 if (env->arch_strerrno == NULL) (gdb) bt #0 0x00000000005caba0 in perf_env__arch_strerrno (env=0x0, err=110) at util/env.c:478 #1 0x00000000004b75d2 in thread__dump_stats (ttrace=0x14f58f0, trace=0x7fffffffa5b0, fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>) at builtin-trace.c:4673 #2 0x00000000004b78bf in trace__fprintf_thread (fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>, thread=0x10fa0b0, trace=0x7fffffffa5b0) at builtin-trace.c:4708 #3 0x00000000004b7ad9 in trace__fprintf_thread_summary (trace=0x7fffffffa5b0, fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>) at builtin-trace.c:4747 #4 0x00000000004b656e in trace__run (trace=0x7fffffffa5b0, argc=2, argv=0x7fffffffde60) at builtin-trace.c:4456 #5 0x00000000004ba43e in cmd_trace (argc=2, argv=0x7fffffffde60) at builtin-trace.c:5487 #6 0x00000000004c0414 in run_builtin (p=0xec3068 <commands+648>, argc=5, argv=0x7fffffffde60) at perf.c:351 #7 0x00000000004c06bb in handle_internal_command (argc=5, argv=0x7fffffffde60) at perf.c:404 #8 0x00000000004c0814 in run_argv (argcp=0x7fffffffdc4c, argv=0x7fffffffdc40) at perf.c:448 #9 0x00000000004c0b5d in main (argc=5, argv=0x7fffffffde60) at perf.c:560 (gdb) After: root@number:~# perf trace -a --errno-summary sleep 1 <SNIP> pw-data-loop (2685), 1410 events, 16.0% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ epoll_wait 188 0 983.428 0.000 5.231 15.595 8.68% ioctl 94 0 0.811 0.004 0.009 0.016 2.82% read 188 0 0.322 0.001 0.002 0.006 5.15% write 141 0 0.280 0.001 0.002 0.018 8.39% timerfd_settime 94 0 0.138 0.001 0.001 0.007 6.47% gnome-control-c (179406), 1848 events, 20.9% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ poll 222 0 959.577 0.000 4.322 21.414 11.40% recvmsg 150 0 0.539 0.001 0.004 0.013 5.12% write 300 0 0.442 0.001 0.001 0.007 3.29% read 150 0 0.183 0.001 0.001 0.009 5.53% getpid 102 0 0.101 0.000 0.001 0.008 7.82% root@number:~# Fixes: 54373b5 ("perf env: Introduce perf_env__arch_strerrno()") Reported-by: Veronika Molnarova <[email protected]> Signed-off-by: Arnaldo Carvalho de Melo <[email protected]> Acked-by: Veronika Molnarova <[email protected]> Acked-by: Michael Petlan <[email protected]> Tested-by: Michael Petlan <[email protected]> Link: https://lore.kernel.org/r/Z0XffUgNSv_9OjOi@x1 Signed-off-by: Namhyung Kim <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Jan 20, 2025
Kernel will hang on destroy admin_q while we create ctrl failed, such as following calltrace: PID: 23644 TASK: ff2d52b40f439fc0 CPU: 2 COMMAND: "nvme" #0 [ff61d23de260fb78] __schedule at ffffffff8323bc15 #1 [ff61d23de260fc08] schedule at ffffffff8323c014 #2 [ff61d23de260fc28] blk_mq_freeze_queue_wait at ffffffff82a3dba1 #3 [ff61d23de260fc78] blk_freeze_queue at ffffffff82a4113a #4 [ff61d23de260fc90] blk_cleanup_queue at ffffffff82a33006 #5 [ff61d23de260fcb0] nvme_rdma_destroy_admin_queue at ffffffffc12686ce #6 [ff61d23de260fcc8] nvme_rdma_setup_ctrl at ffffffffc1268ced #7 [ff61d23de260fd28] nvme_rdma_create_ctrl at ffffffffc126919b #8 [ff61d23de260fd68] nvmf_dev_write at ffffffffc024f362 #9 [ff61d23de260fe38] vfs_write at ffffffff827d5f25 RIP: 00007fda7891d574 RSP: 00007ffe2ef06958 RFLAGS: 00000202 RAX: ffffffffffffffda RBX: 000055e8122a4d90 RCX: 00007fda7891d574 RDX: 000000000000012b RSI: 000055e8122a4d90 RDI: 0000000000000004 RBP: 00007ffe2ef079c0 R8: 000000000000012b R9: 000055e8122a4d90 R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000004 R13: 000055e8122923c0 R14: 000000000000012b R15: 00007fda78a54500 ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b This due to we have quiesced admi_q before cancel requests, but forgot to unquiesce before destroy it, as a result we fail to drain the pending requests, and hang on blk_mq_freeze_queue_wait() forever. Here try to reuse nvme_rdma_teardown_admin_queue() to fix this issue and simplify the code. Fixes: 958dc1d ("nvme-rdma: add clean action for failed reconnection") Reported-by: Yingfu.zhou <[email protected]> Signed-off-by: Chunguang.xu <[email protected]> Signed-off-by: Yue.zhao <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Hannes Reinecke <[email protected]> Signed-off-by: Keith Busch <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Jan 20, 2025
Konstantin Shkolnyy says: ==================== vsock/test: fix wrong setsockopt() parameters Parameters were created using wrong C types, which caused them to be of wrong size on some architectures, causing problems. The problem with SO_RCVLOWAT was found on s390 (big endian), while x86-64 didn't show it. After the fix, all tests pass on s390. Then Stefano Garzarella pointed out that SO_VM_SOCKETS_* calls might have a similar problem, which turned out to be true, hence, the second patch. Changes for v8: - Fix whitespace warnings from "checkpatch.pl --strict" - Add maintainers to Cc: Changes for v7: - Rebase on top of https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git - Add the "net" tags to the subjects Changes for v6: - rework the patch #3 to avoid creating a new file for new functions, and exclude vsock_perf from calling the new functions. - add "Reviewed-by:" to the patch #2. Changes for v5: - in the patch #2 replace the introduced uint64_t with unsigned long long to match documentation - add a patch #3 that verifies every setsockopt() call. Changes for v4: - add "Reviewed-by:" to the first patch, and add a second patch fixing SO_VM_SOCKETS_* calls, which depends on the first one (hence, it's now a patch series.) Changes for v3: - fix the same problem in vsock_perf and update commit message Changes for v2: - add "Fixes:" lines to the commit message ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Jan 20, 2025
Add more test cases for LPM trie in test_maps: 1) test_lpm_trie_update_flags It constructs various use cases for BPF_EXIST and BPF_NOEXIST and check whether the return value of update operation is expected. 2) test_lpm_trie_update_full_maps It tests the update operations on a full LPM trie map. Adding new node will fail and overwriting the value of existed node will succeed. 3) test_lpm_trie_iterate_strs and test_lpm_trie_iterate_ints There two test cases test whether the iteration through get_next_key is sorted and expected. These two test cases delete the minimal key after each iteration and check whether next iteration returns the second minimal key. The only difference between these two test cases is the former one saves strings in the LPM trie and the latter saves integers. Without the fix of get_next_key, these two cases will fail as shown below: test_lpm_trie_iterate_strs(1091):FAIL:iterate #2 got abc exp abS test_lpm_trie_iterate_ints(1142):FAIL:iterate #1 got 0x2 exp 0x1 Signed-off-by: Hou Tao <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Jan 20, 2025
Hou Tao says: ==================== This patch set fixes several issues for LPM trie. These issues were found during adding new test cases or were reported by syzbot. The patch set is structured as follows: Patch #1~#2 are clean-ups for lpm_trie_update_elem(). Patch #3 handles BPF_EXIST and BPF_NOEXIST correctly for LPM trie. Patch #4 fixes the accounting of n_entries when doing in-place update. Patch #5 fixes the exact match condition in trie_get_next_key() and it may skip keys when the passed key is not found in the map. Patch #6~#7 switch from kmalloc() to bpf memory allocator for LPM trie to fix several lock order warnings reported by syzbot. It also enables raw_spinlock_t for LPM trie again. After these changes, the LPM trie will be closer to being usable in any context (though the reentrance check of trie->lock is still missing, but it is on my todo list). Patch #8: move test_lpm_map to map_tests to make it run regularly. Patch #9: add test cases for the issues fixed by patch #3~#5. Please see individual patches for more details. Comments are always welcome. Change Log: v3: * patch #2: remove the unnecessary NULL-init for im_node * patch #6: alloc the leaf node before disabling IRQ to low the possibility of -ENOMEM when leaf_size is large; Free these nodes outside the trie lock (Suggested by Alexei) * collect review and ack tags (Thanks for Toke & Daniel) v2: https://lore.kernel.org/bpf/[email protected]/ * collect review tags (Thanks for Toke) * drop "Add bpf_mem_cache_is_mergeable() helper" patch * patch #3~#4: add fix tag * patch #4: rename the helper to trie_check_add_elem() and increase n_entries in it. * patch #6: use one bpf mem allocator and update commit message to clarify that using bpf mem allocator is more appropriate. * patch #7: update commit message to add the possible max running time for update operation. * patch #9: update commit message to specify the purpose of these test cases. v1: https://lore.kernel.org/bpf/[email protected]/ ==================== Link: https://lore.kernel.org/all/[email protected]/ Signed-off-by: Alexei Starovoitov <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Jan 20, 2025
When virtnet_close is followed by virtnet_open, some TX completions can possibly remain unconsumed, until they are finally processed during the first NAPI poll after the netdev_tx_reset_queue(), resulting in a crash [1]. Commit b96ed2c ("virtio_net: move netdev_tx_reset_queue() call before RX napi enable") was not sufficient to eliminate all BQL crash cases for virtio-net. This issue can be reproduced with the latest net-next master by running: `while :; do ip l set DEV down; ip l set DEV up; done` under heavy network TX load from inside the machine. netdev_tx_reset_queue() can actually be dropped from virtnet_open path; the device is not stopped in any case. For BQL core part, it's just like traffic nearly ceases to exist for some period. For stall detector added to BQL, even if virtnet_close could somehow lead to some TX completions delayed for long, followed by virtnet_open, we can just take it as stall as mentioned in commit 6025b91 ("net: dqs: add NIC stall detector based on BQL"). Note also that users can still reset stall_max via sysfs. So, drop netdev_tx_reset_queue() from virtnet_enable_queue_pair(). This eliminates the BQL crashes. As a result, netdev_tx_reset_queue() is now explicitly required in freeze/restore path. This patch adds it to immediately after free_unused_bufs(), following the rule of thumb: netdev_tx_reset_queue() should follow any SKB freeing not followed by netdev_tx_completed_queue(). This seems the most consistent and streamlined approach, and now netdev_tx_reset_queue() runs whenever free_unused_bufs() is done. [1]: ------------[ cut here ]------------ kernel BUG at lib/dynamic_queue_limits.c:99! Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 7 UID: 0 PID: 1598 Comm: ip Tainted: G N 6.12.0net-next_main+ #2 Tainted: [N]=TEST Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), \ BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 RIP: 0010:dql_completed+0x26b/0x290 Code: b7 c2 49 89 e9 44 89 da 89 c6 4c 89 d7 e8 ed 17 47 00 58 65 ff 0d 4d 27 90 7e 0f 85 fd fe ff ff e8 ea 53 8d ff e9 f3 fe ff ff <0f> 0b 01 d2 44 89 d1 29 d1 ba 00 00 00 00 0f 48 ca e9 28 ff ff ff RSP: 0018:ffffc900002b0d08 EFLAGS: 00010297 RAX: 0000000000000000 RBX: ffff888102398c80 RCX: 0000000080190009 RDX: 0000000000000000 RSI: 000000000000006a RDI: 0000000000000000 RBP: ffff888102398c00 R08: 0000000000000000 R09: 0000000000000000 R10: 00000000000000ca R11: 0000000000015681 R12: 0000000000000001 R13: ffffc900002b0d68 R14: ffff88811115e000 R15: ffff8881107aca40 FS: 00007f41ded69500(0000) GS:ffff888667dc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000556ccc2dc1a0 CR3: 0000000104fd8003 CR4: 0000000000772ef0 PKRU: 55555554 Call Trace: <IRQ> ? die+0x32/0x80 ? do_trap+0xd9/0x100 ? dql_completed+0x26b/0x290 ? dql_completed+0x26b/0x290 ? do_error_trap+0x6d/0xb0 ? dql_completed+0x26b/0x290 ? exc_invalid_op+0x4c/0x60 ? dql_completed+0x26b/0x290 ? asm_exc_invalid_op+0x16/0x20 ? dql_completed+0x26b/0x290 __free_old_xmit+0xff/0x170 [virtio_net] free_old_xmit+0x54/0xc0 [virtio_net] virtnet_poll+0xf4/0xe30 [virtio_net] ? __update_load_avg_cfs_rq+0x264/0x2d0 ? update_curr+0x35/0x260 ? reweight_entity+0x1be/0x260 __napi_poll.constprop.0+0x28/0x1c0 net_rx_action+0x329/0x420 ? enqueue_hrtimer+0x35/0x90 ? trace_hardirqs_on+0x1d/0x80 ? kvm_sched_clock_read+0xd/0x20 ? sched_clock+0xc/0x30 ? kvm_sched_clock_read+0xd/0x20 ? sched_clock+0xc/0x30 ? sched_clock_cpu+0xd/0x1a0 handle_softirqs+0x138/0x3e0 do_softirq.part.0+0x89/0xc0 </IRQ> <TASK> __local_bh_enable_ip+0xa7/0xb0 virtnet_open+0xc8/0x310 [virtio_net] __dev_open+0xfa/0x1b0 __dev_change_flags+0x1de/0x250 dev_change_flags+0x22/0x60 do_setlink.isra.0+0x2df/0x10b0 ? rtnetlink_rcv_msg+0x34f/0x3f0 ? netlink_rcv_skb+0x54/0x100 ? netlink_unicast+0x23e/0x390 ? netlink_sendmsg+0x21e/0x490 ? ____sys_sendmsg+0x31b/0x350 ? avc_has_perm_noaudit+0x67/0xf0 ? cred_has_capability.isra.0+0x75/0x110 ? __nla_validate_parse+0x5f/0xee0 ? __pfx___probestub_irq_enable+0x3/0x10 ? __create_object+0x5e/0x90 ? security_capable+0x3b/0x70 rtnl_newlink+0x784/0xaf0 ? avc_has_perm_noaudit+0x67/0xf0 ? cred_has_capability.isra.0+0x75/0x110 ? stack_depot_save_flags+0x24/0x6d0 ? __pfx_rtnl_newlink+0x10/0x10 rtnetlink_rcv_msg+0x34f/0x3f0 ? do_syscall_64+0x6c/0x180 ? entry_SYSCALL_64_after_hwframe+0x76/0x7e ? __pfx_rtnetlink_rcv_msg+0x10/0x10 netlink_rcv_skb+0x54/0x100 netlink_unicast+0x23e/0x390 netlink_sendmsg+0x21e/0x490 ____sys_sendmsg+0x31b/0x350 ? copy_msghdr_from_user+0x6d/0xa0 ___sys_sendmsg+0x86/0xd0 ? __pte_offset_map+0x17/0x160 ? preempt_count_add+0x69/0xa0 ? __call_rcu_common.constprop.0+0x147/0x610 ? preempt_count_add+0x69/0xa0 ? preempt_count_add+0x69/0xa0 ? _raw_spin_trylock+0x13/0x60 ? trace_hardirqs_on+0x1d/0x80 __sys_sendmsg+0x66/0xc0 do_syscall_64+0x6c/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f41defe5b34 Code: 15 e1 12 0f 00 f7 d8 64 89 02 b8 ff ff ff ff eb bf 0f 1f 44 00 00 f3 0f 1e fa 80 3d 35 95 0f 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 4c c3 0f 1f 00 55 48 89 e5 48 83 ec 20 89 55 RSP: 002b:00007ffe5336ecc8 EFLAGS: 00000202 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f41defe5b34 RDX: 0000000000000000 RSI: 00007ffe5336ed30 RDI: 0000000000000003 RBP: 00007ffe5336eda0 R08: 0000000000000010 R09: 0000000000000001 R10: 00007ffe5336f6f9 R11: 0000000000000202 R12: 0000000000000003 R13: 0000000067452259 R14: 0000556ccc28b040 R15: 0000000000000000 </TASK> [...] Fixes: c8bd1f7 ("virtio_net: add support for Byte Queue Limits") Cc: <[email protected]> # v6.11+ Signed-off-by: Koichiro Den <[email protected]> Acked-by: Jason Wang <[email protected]> Reviewed-by: Xuan Zhuo <[email protected]> [ pabeni: trimmed possibly troublesome separator ] Signed-off-by: Paolo Abeni <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Jan 20, 2025
…nts' Koichiro Den says: ==================== virtio_net: correct netdev_tx_reset_queue() invocation points When virtnet_close is followed by virtnet_open, some TX completions can possibly remain unconsumed, until they are finally processed during the first NAPI poll after the netdev_tx_reset_queue(), resulting in a crash [1]. Commit b96ed2c ("virtio_net: move netdev_tx_reset_queue() call before RX napi enable") was not sufficient to eliminate all BQL crash scenarios for virtio-net. This issue can be reproduced with the latest net-next master by running: `while :; do ip l set DEV down; ip l set DEV up; done` under heavy network TX load from inside the machine. This patch series resolves the issue and also addresses similar existing problems: (a). Drop netdev_tx_reset_queue() from open/close path. This eliminates the BQL crashes due to the problematic open/close path. (b). As a result of (a), netdev_tx_reset_queue() is now explicitly required in freeze/restore path. Add netdev_tx_reset_queue() immediately after free_unused_bufs() invocation. (c). Fix missing resetting in virtnet_tx_resize(). virtnet_tx_resize() has lacked proper resetting since commit c8bd1f7 ("virtio_net: add support for Byte Queue Limits"). (d). Fix missing resetting in the XDP_SETUP_XSK_POOL path. Similar to (c), this path lacked proper resetting. Call netdev_tx_reset_queue() when virtqueue_reset() has actually recycled unused buffers. This patch series consists of six commits: [1/6]: Resolves (a) and (b). # also -stable 6.11.y [2/6]: Minor fix to make [4/6] streamlined. [3/6]: Prerequisite for (c). # also -stable 6.11.y [4/6]: Resolves (c) (incl. Prerequisite for (d)) # also -stable 6.11.y [5/6]: Preresuisite for (d). [6/6]: Resolves (d). Changes for v4: - move netdev_tx_reset_queue() out of free_unused_bufs() - submit to net, not net-next Changes for v3: - replace 'flushed' argument with 'recycle_done' Changes for v2: - add tx queue resetting for (b) to (d) above v3: https://lore.kernel.org/all/[email protected]/ v2: https://lore.kernel.org/all/[email protected]/ v1: https://lore.kernel.org/all/[email protected]/ [1]: ------------[ cut here ]------------ kernel BUG at lib/dynamic_queue_limits.c:99! Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 7 UID: 0 PID: 1598 Comm: ip Tainted: G N 6.12.0net-next_main+ #2 Tainted: [N]=TEST Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), \ BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 RIP: 0010:dql_completed+0x26b/0x290 Code: b7 c2 49 89 e9 44 89 da 89 c6 4c 89 d7 e8 ed 17 47 00 58 65 ff 0d 4d 27 90 7e 0f 85 fd fe ff ff e8 ea 53 8d ff e9 f3 fe ff ff <0f> 0b 01 d2 44 89 d1 29 d1 ba 00 00 00 00 0f 48 ca e9 28 ff ff ff RSP: 0018:ffffc900002b0d08 EFLAGS: 00010297 RAX: 0000000000000000 RBX: ffff888102398c80 RCX: 0000000080190009 RDX: 0000000000000000 RSI: 000000000000006a RDI: 0000000000000000 RBP: ffff888102398c00 R08: 0000000000000000 R09: 0000000000000000 R10: 00000000000000ca R11: 0000000000015681 R12: 0000000000000001 R13: ffffc900002b0d68 R14: ffff88811115e000 R15: ffff8881107aca40 FS: 00007f41ded69500(0000) GS:ffff888667dc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000556ccc2dc1a0 CR3: 0000000104fd8003 CR4: 0000000000772ef0 PKRU: 55555554 Call Trace: <IRQ> ? die+0x32/0x80 ? do_trap+0xd9/0x100 ? dql_completed+0x26b/0x290 ? dql_completed+0x26b/0x290 ? do_error_trap+0x6d/0xb0 ? dql_completed+0x26b/0x290 ? exc_invalid_op+0x4c/0x60 ? dql_completed+0x26b/0x290 ? asm_exc_invalid_op+0x16/0x20 ? dql_completed+0x26b/0x290 __free_old_xmit+0xff/0x170 [virtio_net] free_old_xmit+0x54/0xc0 [virtio_net] virtnet_poll+0xf4/0xe30 [virtio_net] ? __update_load_avg_cfs_rq+0x264/0x2d0 ? update_curr+0x35/0x260 ? reweight_entity+0x1be/0x260 __napi_poll.constprop.0+0x28/0x1c0 net_rx_action+0x329/0x420 ? enqueue_hrtimer+0x35/0x90 ? trace_hardirqs_on+0x1d/0x80 ? kvm_sched_clock_read+0xd/0x20 ? sched_clock+0xc/0x30 ? kvm_sched_clock_read+0xd/0x20 ? sched_clock+0xc/0x30 ? sched_clock_cpu+0xd/0x1a0 handle_softirqs+0x138/0x3e0 do_softirq.part.0+0x89/0xc0 </IRQ> <TASK> __local_bh_enable_ip+0xa7/0xb0 virtnet_open+0xc8/0x310 [virtio_net] __dev_open+0xfa/0x1b0 __dev_change_flags+0x1de/0x250 dev_change_flags+0x22/0x60 do_setlink.isra.0+0x2df/0x10b0 ? rtnetlink_rcv_msg+0x34f/0x3f0 ? netlink_rcv_skb+0x54/0x100 ? netlink_unicast+0x23e/0x390 ? netlink_sendmsg+0x21e/0x490 ? ____sys_sendmsg+0x31b/0x350 ? avc_has_perm_noaudit+0x67/0xf0 ? cred_has_capability.isra.0+0x75/0x110 ? __nla_validate_parse+0x5f/0xee0 ? __pfx___probestub_irq_enable+0x3/0x10 ? __create_object+0x5e/0x90 ? security_capable+0x3b/0x7�[I0 rtnl_newlink+0x784/0xaf0 ? avc_has_perm_noaudit+0x67/0xf0 ? cred_has_capability.isra.0+0x75/0x110 ? stack_depot_save_flags+0x24/0x6d0 ? __pfx_rtnl_newlink+0x10/0x10 rtnetlink_rcv_msg+0x34f/0x3f0 ? do_syscall_64+0x6c/0x180 ? entry_SYSCALL_64_after_hwframe+0x76/0x7e ? __pfx_rtnetlink_rcv_msg+0x10/0x10 netlink_rcv_skb+0x54/0x100 netlink_unicast+0x23e/0x390 netlink_sendmsg+0x21e/0x490 ____sys_sendmsg+0x31b/0x350 ? copy_msghdr_from_user+0x6d/0xa0 ___sys_sendmsg+0x86/0xd0 ? __pte_offset_map+0x17/0x160 ? preempt_count_add+0x69/0xa0 ? __call_rcu_common.constprop.0+0x147/0x610 ? preempt_count_add+0x69/0xa0 ? preempt_count_add+0x69/0xa0 ? _raw_spin_trylock+0x13/0x60 ? trace_hardirqs_on+0x1d/0x80 __sys_sendmsg+0x66/0xc0 do_syscall_64+0x6c/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f41defe5b34 Code: 15 e1 12 0f 00 f7 d8 64 89 02 b8 ff ff ff ff eb bf 0f 1f 44 00 00 f3 0f 1e fa 80 3d 35 95 0f 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 4c c3 0f 1f 00 55 48 89 e5 48 83 ec 20 89 55 RSP: 002b:00007ffe5336ecc8 EFLAGS: 00000202 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f41defe5b34 RDX: 0000000000000000 RSI: 00007ffe5336ed30 RDI: 0000000000000003 RBP: 00007ffe5336eda0 R08: 0000000000000010 R09: 0000000000000001 R10: 00007ffe5336f6f9 R11: 0000000000000202 R12: 0000000000000003 R13: 0000000067452259 R14: 0000556ccc28b040 R15: 0000000000000000 </TASK> [...] ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Jan 20, 2025
…ux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.13, part #2 - Fix confusion with implicitly-shifted MDCR_EL2 masks breaking SPE/TRBE initialization - Align nested page table walker with the intended memory attribute combining rules of the architecture - Prevent userspace from constraining the advertised ASID width, avoiding horrors of guest TLBIs not matching the intended context in hardware - Don't leak references on LPIs when insertion into the translation cache fails
pcmoore
pushed a commit
that referenced
this issue
Jan 20, 2025
The vmemmap's, which is used for RV64 with SPARSEMEM_VMEMMAP, page tables are populated using pmd (page middle directory) hugetables. However, the pmd allocation is not using the generic mechanism used by the VMA code (e.g. pmd_alloc()), or the RISC-V specific create_pgd_mapping()/alloc_pmd_late(). Instead, the vmemmap page table code allocates a page, and calls vmemmap_set_pmd(). This results in that the pmd ctor is *not* called, nor would it make sense to do so. Now, when tearing down a vmemmap page table pmd, the cleanup code would unconditionally, and incorrectly call the pmd dtor, which results in a crash (best case). This issue was found when running the HMM selftests: | tools/testing/selftests/mm# ./test_hmm.sh smoke | ... # when unloading the test_hmm.ko module | page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10915b | flags: 0x1000000000000000(node=0|zone=1) | raw: 1000000000000000 0000000000000000 dead000000000122 0000000000000000 | raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 | page dumped because: VM_BUG_ON_PAGE(ptdesc->pmd_huge_pte) | ------------[ cut here ]------------ | kernel BUG at include/linux/mm.h:3080! | Kernel BUG [#1] | Modules linked in: test_hmm(-) sch_fq_codel fuse drm drm_panel_orientation_quirks backlight dm_mod | CPU: 1 UID: 0 PID: 514 Comm: modprobe Tainted: G W 6.12.0-00982-gf2a4f1682d07 #2 | Tainted: [W]=WARN | Hardware name: riscv-virtio qemu/qemu, BIOS 2024.10 10/01/2024 | epc : remove_pgd_mapping+0xbec/0x1070 | ra : remove_pgd_mapping+0xbec/0x1070 | epc : ffffffff80010a68 ra : ffffffff80010a68 sp : ff20000000a73940 | gp : ffffffff827b2d88 tp : ff6000008785da40 t0 : ffffffff80fbce04 | t1 : 0720072007200720 t2 : 706d756420656761 s0 : ff20000000a73a50 | s1 : ff6000008915cff8 a0 : 0000000000000039 a1 : 0000000000000008 | a2 : ff600003fff0de20 a3 : 0000000000000000 a4 : 0000000000000000 | a5 : 0000000000000000 a6 : c0000000ffffefff a7 : ffffffff824469b8 | s2 : ff1c0000022456c0 s3 : ff1ffffffdbfffff s4 : ff6000008915c000 | s5 : ff6000008915c000 s6 : ff6000008915c000 s7 : ff1ffffffdc00000 | s8 : 0000000000000001 s9 : ff1ffffffdc00000 s10: ffffffff819a31f0 | s11: ffffffffffffffff t3 : ffffffff8000c950 t4 : ff60000080244f00 | t5 : ff60000080244000 t6 : ff20000000a73708 | status: 0000000200000120 badaddr: ffffffff80010a68 cause: 0000000000000003 | [<ffffffff80010a68>] remove_pgd_mapping+0xbec/0x1070 | [<ffffffff80fd238e>] vmemmap_free+0x14/0x1e | [<ffffffff8032e698>] section_deactivate+0x220/0x452 | [<ffffffff8032ef7e>] sparse_remove_section+0x4a/0x58 | [<ffffffff802f8700>] __remove_pages+0x7e/0xba | [<ffffffff803760d8>] memunmap_pages+0x2bc/0x3fe | [<ffffffff02a3ca28>] dmirror_device_remove_chunks+0x2ea/0x518 [test_hmm] | [<ffffffff02a3e026>] hmm_dmirror_exit+0x3e/0x1018 [test_hmm] | [<ffffffff80102c14>] __riscv_sys_delete_module+0x15a/0x2a6 | [<ffffffff80fd020c>] do_trap_ecall_u+0x1f2/0x266 | [<ffffffff80fde0a2>] _new_vmalloc_restore_context_a0+0xc6/0xd2 | Code: bf51 7597 0184 8593 76a5 854a 4097 0029 80e7 2c00 (9002) 7597 | ---[ end trace 0000000000000000 ]--- | Kernel panic - not syncing: Fatal exception in interrupt Add a check to avoid calling the pmd dtor, if the calling context is vmemmap_free(). Fixes: c75a74f ("riscv: mm: Add memory hotplugging support") Signed-off-by: Björn Töpel <[email protected]> Reviewed-by: Alexandre Ghiti <[email protected]> Link: https://lore.kernel.org/r/[email protected] Cc: [email protected] Signed-off-by: Palmer Dabbelt <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Jan 20, 2025
This reworks hci_cb_list to not use mutex hci_cb_list_lock to avoid bugs like the bellow: BUG: sleeping function called from invalid context at kernel/locking/mutex.c:585 in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 5070, name: kworker/u9:2 preempt_count: 0, expected: 0 RCU nest depth: 1, expected: 0 4 locks held by kworker/u9:2/5070: #0: ffff888015be3948 ((wq_completion)hci0#2){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3229 [inline] #0: ffff888015be3948 ((wq_completion)hci0#2){+.+.}-{0:0}, at: process_scheduled_works+0x8e0/0x1770 kernel/workqueue.c:3335 #1: ffffc90003b6fd00 ((work_completion)(&hdev->rx_work)){+.+.}-{0:0}, at: process_one_work kernel/workqueue.c:3230 [inline] #1: ffffc90003b6fd00 ((work_completion)(&hdev->rx_work)){+.+.}-{0:0}, at: process_scheduled_works+0x91b/0x1770 kernel/workqueue.c:3335 #2: ffff8880665d0078 (&hdev->lock){+.+.}-{3:3}, at: hci_le_create_big_complete_evt+0xcf/0xae0 net/bluetooth/hci_event.c:6914 #3: ffffffff8e132020 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:298 [inline] #3: ffffffff8e132020 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:750 [inline] #3: ffffffff8e132020 (rcu_read_lock){....}-{1:2}, at: hci_le_create_big_complete_evt+0xdb/0xae0 net/bluetooth/hci_event.c:6915 CPU: 0 PID: 5070 Comm: kworker/u9:2 Not tainted 6.8.0-syzkaller-08073-g480e035fc4c7 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024 Workqueue: hci0 hci_rx_work Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114 __might_resched+0x5d4/0x780 kernel/sched/core.c:10187 __mutex_lock_common kernel/locking/mutex.c:585 [inline] __mutex_lock+0xc1/0xd70 kernel/locking/mutex.c:752 hci_connect_cfm include/net/bluetooth/hci_core.h:2004 [inline] hci_le_create_big_complete_evt+0x3d9/0xae0 net/bluetooth/hci_event.c:6939 hci_event_func net/bluetooth/hci_event.c:7514 [inline] hci_event_packet+0xa53/0x1540 net/bluetooth/hci_event.c:7569 hci_rx_work+0x3e8/0xca0 net/bluetooth/hci_core.c:4171 process_one_work kernel/workqueue.c:3254 [inline] process_scheduled_works+0xa00/0x1770 kernel/workqueue.c:3335 worker_thread+0x86d/0xd70 kernel/workqueue.c:3416 kthread+0x2f0/0x390 kernel/kthread.c:388 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243 </TASK> Reported-by: [email protected] Tested-by: [email protected] Closes: https://syzkaller.appspot.com/bug?extid=2fb0835e0c9cefc34614 Signed-off-by: Luiz Augusto von Dentz <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Jan 20, 2025
This fixes the circular locking dependency warning below, by releasing the socket lock before enterning iso_listen_bis, to avoid any potential deadlock with hdev lock. [ 75.307983] ====================================================== [ 75.307984] WARNING: possible circular locking dependency detected [ 75.307985] 6.12.0-rc6+ #22 Not tainted [ 75.307987] ------------------------------------------------------ [ 75.307987] kworker/u81:2/2623 is trying to acquire lock: [ 75.307988] ffff8fde1769da58 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO) at: iso_connect_cfm+0x253/0x840 [bluetooth] [ 75.308021] but task is already holding lock: [ 75.308022] ffff8fdd61a10078 (&hdev->lock) at: hci_le_per_adv_report_evt+0x47/0x2f0 [bluetooth] [ 75.308053] which lock already depends on the new lock. [ 75.308054] the existing dependency chain (in reverse order) is: [ 75.308055] -> #1 (&hdev->lock){+.+.}-{3:3}: [ 75.308057] __mutex_lock+0xad/0xc50 [ 75.308061] mutex_lock_nested+0x1b/0x30 [ 75.308063] iso_sock_listen+0x143/0x5c0 [bluetooth] [ 75.308085] __sys_listen_socket+0x49/0x60 [ 75.308088] __x64_sys_listen+0x4c/0x90 [ 75.308090] x64_sys_call+0x2517/0x25f0 [ 75.308092] do_syscall_64+0x87/0x150 [ 75.308095] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 75.308098] -> #0 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO){+.+.}-{0:0}: [ 75.308100] __lock_acquire+0x155e/0x25f0 [ 75.308103] lock_acquire+0xc9/0x300 [ 75.308105] lock_sock_nested+0x32/0x90 [ 75.308107] iso_connect_cfm+0x253/0x840 [bluetooth] [ 75.308128] hci_connect_cfm+0x6c/0x190 [bluetooth] [ 75.308155] hci_le_per_adv_report_evt+0x27b/0x2f0 [bluetooth] [ 75.308180] hci_le_meta_evt+0xe7/0x200 [bluetooth] [ 75.308206] hci_event_packet+0x21f/0x5c0 [bluetooth] [ 75.308230] hci_rx_work+0x3ae/0xb10 [bluetooth] [ 75.308254] process_one_work+0x212/0x740 [ 75.308256] worker_thread+0x1bd/0x3a0 [ 75.308258] kthread+0xe4/0x120 [ 75.308259] ret_from_fork+0x44/0x70 [ 75.308261] ret_from_fork_asm+0x1a/0x30 [ 75.308263] other info that might help us debug this: [ 75.308264] Possible unsafe locking scenario: [ 75.308264] CPU0 CPU1 [ 75.308265] ---- ---- [ 75.308265] lock(&hdev->lock); [ 75.308267] lock(sk_lock- AF_BLUETOOTH-BTPROTO_ISO); [ 75.308268] lock(&hdev->lock); [ 75.308269] lock(sk_lock-AF_BLUETOOTH-BTPROTO_ISO); [ 75.308270] *** DEADLOCK *** [ 75.308271] 4 locks held by kworker/u81:2/2623: [ 75.308272] #0: ffff8fdd66e52148 ((wq_completion)hci0#2){+.+.}-{0:0}, at: process_one_work+0x443/0x740 [ 75.308276] #1: ffffafb488b7fe48 ((work_completion)(&hdev->rx_work)), at: process_one_work+0x1ce/0x740 [ 75.308280] #2: ffff8fdd61a10078 (&hdev->lock){+.+.}-{3:3} at: hci_le_per_adv_report_evt+0x47/0x2f0 [bluetooth] [ 75.308304] #3: ffffffffb6ba4900 (rcu_read_lock){....}-{1:2}, at: hci_connect_cfm+0x29/0x190 [bluetooth] Fixes: 02171da ("Bluetooth: ISO: Add hcon for listening bis sk") Signed-off-by: Iulia Tanasescu <[email protected]> Signed-off-by: Luiz Augusto von Dentz <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Jan 20, 2025
This fixes the circular locking dependency warning below, by reworking iso_sock_recvmsg, to ensure that the socket lock is always released before calling a function that locks hdev. [ 561.670344] ====================================================== [ 561.670346] WARNING: possible circular locking dependency detected [ 561.670349] 6.12.0-rc6+ #26 Not tainted [ 561.670351] ------------------------------------------------------ [ 561.670353] iso-tester/3289 is trying to acquire lock: [ 561.670355] ffff88811f600078 (&hdev->lock){+.+.}-{3:3}, at: iso_conn_big_sync+0x73/0x260 [bluetooth] [ 561.670405] but task is already holding lock: [ 561.670407] ffff88815af58258 (sk_lock-AF_BLUETOOTH){+.+.}-{0:0}, at: iso_sock_recvmsg+0xbf/0x500 [bluetooth] [ 561.670450] which lock already depends on the new lock. [ 561.670452] the existing dependency chain (in reverse order) is: [ 561.670453] -> #2 (sk_lock-AF_BLUETOOTH){+.+.}-{0:0}: [ 561.670458] lock_acquire+0x7c/0xc0 [ 561.670463] lock_sock_nested+0x3b/0xf0 [ 561.670467] bt_accept_dequeue+0x1a5/0x4d0 [bluetooth] [ 561.670510] iso_sock_accept+0x271/0x830 [bluetooth] [ 561.670547] do_accept+0x3dd/0x610 [ 561.670550] __sys_accept4+0xd8/0x170 [ 561.670553] __x64_sys_accept+0x74/0xc0 [ 561.670556] x64_sys_call+0x17d6/0x25f0 [ 561.670559] do_syscall_64+0x87/0x150 [ 561.670563] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 561.670567] -> #1 (sk_lock-AF_BLUETOOTH-BTPROTO_ISO){+.+.}-{0:0}: [ 561.670571] lock_acquire+0x7c/0xc0 [ 561.670574] lock_sock_nested+0x3b/0xf0 [ 561.670577] iso_sock_listen+0x2de/0xf30 [bluetooth] [ 561.670617] __sys_listen_socket+0xef/0x130 [ 561.670620] __x64_sys_listen+0xe1/0x190 [ 561.670623] x64_sys_call+0x2517/0x25f0 [ 561.670626] do_syscall_64+0x87/0x150 [ 561.670629] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 561.670632] -> #0 (&hdev->lock){+.+.}-{3:3}: [ 561.670636] __lock_acquire+0x32ad/0x6ab0 [ 561.670639] lock_acquire.part.0+0x118/0x360 [ 561.670642] lock_acquire+0x7c/0xc0 [ 561.670644] __mutex_lock+0x18d/0x12f0 [ 561.670647] mutex_lock_nested+0x1b/0x30 [ 561.670651] iso_conn_big_sync+0x73/0x260 [bluetooth] [ 561.670687] iso_sock_recvmsg+0x3e9/0x500 [bluetooth] [ 561.670722] sock_recvmsg+0x1d5/0x240 [ 561.670725] sock_read_iter+0x27d/0x470 [ 561.670727] vfs_read+0x9a0/0xd30 [ 561.670731] ksys_read+0x1a8/0x250 [ 561.670733] __x64_sys_read+0x72/0xc0 [ 561.670736] x64_sys_call+0x1b12/0x25f0 [ 561.670738] do_syscall_64+0x87/0x150 [ 561.670741] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 561.670744] other info that might help us debug this: [ 561.670745] Chain exists of: &hdev->lock --> sk_lock-AF_BLUETOOTH-BTPROTO_ISO --> sk_lock-AF_BLUETOOTH [ 561.670751] Possible unsafe locking scenario: [ 561.670753] CPU0 CPU1 [ 561.670754] ---- ---- [ 561.670756] lock(sk_lock-AF_BLUETOOTH); [ 561.670758] lock(sk_lock AF_BLUETOOTH-BTPROTO_ISO); [ 561.670761] lock(sk_lock-AF_BLUETOOTH); [ 561.670764] lock(&hdev->lock); [ 561.670767] *** DEADLOCK *** Fixes: 07a9342 ("Bluetooth: ISO: Send BIG Create Sync via hci_sync") Signed-off-by: Iulia Tanasescu <[email protected]> Signed-off-by: Luiz Augusto von Dentz <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Jan 20, 2025
Aishwarya reports that warnings are sometimes seen when running the ftrace kselftests, e.g. | WARNING: CPU: 5 PID: 2066 at arch/arm64/kernel/stacktrace.c:141 arch_stack_walk+0x4a0/0x4c0 | Modules linked in: | CPU: 5 UID: 0 PID: 2066 Comm: ftracetest Not tainted 6.13.0-rc2 #2 | Hardware name: linux,dummy-virt (DT) | pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) | pc : arch_stack_walk+0x4a0/0x4c0 | lr : arch_stack_walk+0x248/0x4c0 | sp : ffff800083643d20 | x29: ffff800083643dd0 x28: ffff00007b891400 x27: ffff00007b891928 | x26: 0000000000000001 x25: 00000000000000c0 x24: ffff800082f39d80 | x23: ffff80008003ee8c x22: ffff80008004baa8 x21: ffff8000800533e0 | x20: ffff800083643e10 x19: ffff80008003eec8 x18: 0000000000000000 | x17: 0000000000000000 x16: ffff800083640000 x15: 0000000000000000 | x14: 02a37a802bbb8a92 x13: 00000000000001a9 x12: 0000000000000001 | x11: ffff800082ffad60 x10: ffff800083643d20 x9 : ffff80008003eed0 | x8 : ffff80008004baa8 x7 : ffff800086f2be80 x6 : ffff0000057cf000 | x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff800086f2b690 | x2 : ffff80008004baa8 x1 : ffff80008004baa8 x0 : ffff80008004baa8 | Call trace: | arch_stack_walk+0x4a0/0x4c0 (P) | arch_stack_walk+0x248/0x4c0 (L) | profile_pc+0x44/0x80 | profile_tick+0x50/0x80 (F) | tick_nohz_handler+0xcc/0x160 (F) | __hrtimer_run_queues+0x2ac/0x340 (F) | hrtimer_interrupt+0xf4/0x268 (F) | arch_timer_handler_virt+0x34/0x60 (F) | handle_percpu_devid_irq+0x88/0x220 (F) | generic_handle_domain_irq+0x34/0x60 (F) | gic_handle_irq+0x54/0x140 (F) | call_on_irq_stack+0x24/0x58 (F) | do_interrupt_handler+0x88/0x98 | el1_interrupt+0x34/0x68 (F) | el1h_64_irq_handler+0x18/0x28 | el1h_64_irq+0x6c/0x70 | queued_spin_lock_slowpath+0x78/0x460 (P) The warning in question is: WARN_ON_ONCE(state->common.pc == orig_pc)) ... in kunwind_recover_return_address(), which is triggered when return_to_handler() is encountered in the trace, but ftrace_graph_ret_addr() cannot find a corresponding original return address on the fgraph return stack. This happens because the stacktrace code encounters an exception boundary where the LR was not live at the time of the exception, but the LR happens to contain return_to_handler(); either because the task recently returned there, or due to unfortunate usage of the LR at a scratch register. In such cases attempts to recover the return address via ftrace_graph_ret_addr() may fail, triggering the WARN_ON_ONCE() above and aborting the unwind (hence the stacktrace terminating after reporting the PC at the time of the exception). Handling unreliable LR values in these cases is likely to require some larger rework, so for the moment avoid this problem by restoring the old behaviour of skipping the LR at exception boundaries, which the stacktrace code did prior to commit: c2c6b27 ("arm64: stacktrace: unwind exception boundaries") This commit is effectively a partial revert, keeping the structures and logic to explicitly identify exception boundaries while still skipping reporting of the LR. The logic to explicitly identify exception boundaries is still useful for general robustness and as a building block for future support for RELIABLE_STACKTRACE. Fixes: c2c6b27 ("arm64: stacktrace: unwind exception boundaries") Signed-off-by: Mark Rutland <[email protected]> Reported-by: Aishwarya TCV <[email protected]> Cc: Will Deacon <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Catalin Marinas <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Jan 20, 2025
The arm64 stacktrace code has a few error conditions where a WARN_ON_ONCE() is triggered before the stacktrace is terminated and an error is returned to the caller. The conditions shouldn't be triggered when unwinding the current task, but it is possible to trigger these when unwinding another task which is not blocked, as the stack of that task is concurrently modified. Kent reports that these warnings can be triggered while running filesystem tests on bcachefs, which calls the stacktrace code directly. To produce a meaningful stacktrace of another task, the task in question should be blocked, but the stacktrace code is expected to be robust to cases where it is not blocked. Note that this is purely about not unuduly scaring the user and/or crashing the kernel; stacktraces in such cases are meaningless and may leak kernel secrets from the stack of the task being unwound. Ideally we'd pin the task in a blocked state during the unwind, as we do for /proc/${PID}/wchan since commit: 42a20f8 ("sched: Add wrapper for get_wchan() to keep task blocked") ... but a bunch of places don't do that, notably /proc/${PID}/stack, where we don't pin the task in a blocked state, but do restrict the output to privileged users since commit: f8a00ce ("proc: restrict kernel stack dumps to root") ... and so it's possible to trigger these warnings accidentally, e.g. by reading /proc/*/stack (as root): | for n in $(seq 1 10); do | while true; do cat /proc/*/stack > /dev/null 2>&1; done & | done | ------------[ cut here ]------------ | WARNING: CPU: 3 PID: 166 at arch/arm64/kernel/stacktrace.c:207 arch_stack_walk+0x1c8/0x370 | Modules linked in: | CPU: 3 UID: 0 PID: 166 Comm: cat Not tainted 6.13.0-rc2-00003-g3dafa7a7925d #2 | Hardware name: linux,dummy-virt (DT) | pstate: 81400005 (Nzcv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) | pc : arch_stack_walk+0x1c8/0x370 | lr : arch_stack_walk+0x1b0/0x370 | sp : ffff800080773890 | x29: ffff800080773930 x28: fff0000005c44500 x27: fff00000058fa038 | x26: 000000007ffff000 x25: 0000000000000000 x24: 0000000000000000 | x23: ffffa35a8d9600ec x22: 0000000000000000 x21: fff00000043a33c0 | x20: ffff800080773970 x19: ffffa35a8d960168 x18: 0000000000000000 | x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 | x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 | x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000 | x8 : ffff8000807738e0 x7 : ffff8000806e3800 x6 : ffff8000806e3818 | x5 : ffff800080773920 x4 : ffff8000806e4000 x3 : ffff8000807738e0 | x2 : 0000000000000018 x1 : ffff8000806e3800 x0 : 0000000000000000 | Call trace: | arch_stack_walk+0x1c8/0x370 (P) | stack_trace_save_tsk+0x8c/0x108 | proc_pid_stack+0xb0/0x134 | proc_single_show+0x60/0x120 | seq_read_iter+0x104/0x438 | seq_read+0xf8/0x140 | vfs_read+0xc4/0x31c | ksys_read+0x70/0x108 | __arm64_sys_read+0x1c/0x28 | invoke_syscall+0x48/0x104 | el0_svc_common.constprop.0+0x40/0xe0 | do_el0_svc+0x1c/0x28 | el0_svc+0x30/0xcc | el0t_64_sync_handler+0x10c/0x138 | el0t_64_sync+0x198/0x19c | ---[ end trace 0000000000000000 ]--- Fix this by only warning when unwinding the current task. When unwinding another task the error conditions will be handled by returning an error without producing a warning. The two warnings in kunwind_next_frame_record_meta() were added recently as part of commit: c2c6b27 ("arm64: stacktrace: unwind exception boundaries") The warning when recovering the fgraph return address has changed form many times, but was originally introduced back in commit: 9f41631 ("arm64: fix unwind_frame() for filtered out fn for function graph tracing") Fixes: c2c6b27 ("arm64: stacktrace: unwind exception boundaries") Fixes: 9f41631 ("arm64: fix unwind_frame() for filtered out fn for function graph tracing") Signed-off-by: Mark Rutland <[email protected]> Reported-by: Kent Overstreet <[email protected]> Cc: Kees Cook <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Will Deacon <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Catalin Marinas <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Jan 20, 2025
The current implementation removes cache tags after disabling ATS, leading to potential memory leaks and kernel crashes. Specifically, CACHE_TAG_DEVTLB type cache tags may still remain in the list even after the domain is freed, causing a use-after-free condition. This issue really shows up when multiple VFs from different PFs passed through to a single user-space process via vfio-pci. In such cases, the kernel may crash with kernel messages like: BUG: kernel NULL pointer dereference, address: 0000000000000014 PGD 19036a067 P4D 1940a3067 PUD 136c9b067 PMD 0 Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 74 UID: 0 PID: 3183 Comm: testCli Not tainted 6.11.9 #2 RIP: 0010:cache_tag_flush_range+0x9b/0x250 Call Trace: <TASK> ? __die+0x1f/0x60 ? page_fault_oops+0x163/0x590 ? exc_page_fault+0x72/0x190 ? asm_exc_page_fault+0x22/0x30 ? cache_tag_flush_range+0x9b/0x250 ? cache_tag_flush_range+0x5d/0x250 intel_iommu_tlb_sync+0x29/0x40 intel_iommu_unmap_pages+0xfe/0x160 __iommu_unmap+0xd8/0x1a0 vfio_unmap_unpin+0x182/0x340 [vfio_iommu_type1] vfio_remove_dma+0x2a/0xb0 [vfio_iommu_type1] vfio_iommu_type1_ioctl+0xafa/0x18e0 [vfio_iommu_type1] Move cache_tag_unassign_domain() before iommu_disable_pci_caps() to fix it. Fixes: 3b1d9e2 ("iommu/vt-d: Add cache tag assignment interface") Cc: [email protected] Signed-off-by: Lu Baolu <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Jan 20, 2025
…s_lock For storing a value to a queue attribute, the queue_attr_store function first freezes the queue (->q_usage_counter(io)) and then acquire ->sysfs_lock. This seems not correct as the usual ordering should be to acquire ->sysfs_lock before freezing the queue. This incorrect ordering causes the following lockdep splat which we are able to reproduce always simply by accessing /sys/kernel/debug file using ls command: [ 57.597146] WARNING: possible circular locking dependency detected [ 57.597154] 6.12.0-10553-gb86545e02e8c #20 Tainted: G W [ 57.597162] ------------------------------------------------------ [ 57.597168] ls/4605 is trying to acquire lock: [ 57.597176] c00000003eb56710 (&mm->mmap_lock){++++}-{4:4}, at: __might_fault+0x58/0xc0 [ 57.597200] but task is already holding lock: [ 57.597207] c0000018e27c6810 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: iterate_dir+0x94/0x1d4 [ 57.597226] which lock already depends on the new lock. [ 57.597233] the existing dependency chain (in reverse order) is: [ 57.597241] -> #5 (&sb->s_type->i_mutex_key#3){++++}-{4:4}: [ 57.597255] down_write+0x6c/0x18c [ 57.597264] start_creating+0xb4/0x24c [ 57.597274] debugfs_create_dir+0x2c/0x1e8 [ 57.597283] blk_register_queue+0xec/0x294 [ 57.597292] add_disk_fwnode+0x2e4/0x548 [ 57.597302] brd_alloc+0x2c8/0x338 [ 57.597309] brd_init+0x100/0x178 [ 57.597317] do_one_initcall+0x88/0x3e4 [ 57.597326] kernel_init_freeable+0x3cc/0x6e0 [ 57.597334] kernel_init+0x34/0x1cc [ 57.597342] ret_from_kernel_user_thread+0x14/0x1c [ 57.597350] -> #4 (&q->debugfs_mutex){+.+.}-{4:4}: [ 57.597362] __mutex_lock+0xfc/0x12a0 [ 57.597370] blk_register_queue+0xd4/0x294 [ 57.597379] add_disk_fwnode+0x2e4/0x548 [ 57.597388] brd_alloc+0x2c8/0x338 [ 57.597395] brd_init+0x100/0x178 [ 57.597402] do_one_initcall+0x88/0x3e4 [ 57.597410] kernel_init_freeable+0x3cc/0x6e0 [ 57.597418] kernel_init+0x34/0x1cc [ 57.597426] ret_from_kernel_user_thread+0x14/0x1c [ 57.597434] -> #3 (&q->sysfs_lock){+.+.}-{4:4}: [ 57.597446] __mutex_lock+0xfc/0x12a0 [ 57.597454] queue_attr_store+0x9c/0x110 [ 57.597462] sysfs_kf_write+0x70/0xb0 [ 57.597471] kernfs_fop_write_iter+0x1b0/0x2ac [ 57.597480] vfs_write+0x3dc/0x6e8 [ 57.597488] ksys_write+0x84/0x140 [ 57.597495] system_call_exception+0x130/0x360 [ 57.597504] system_call_common+0x160/0x2c4 [ 57.597516] -> #2 (&q->q_usage_counter(io)#21){++++}-{0:0}: [ 57.597530] __submit_bio+0x5ec/0x828 [ 57.597538] submit_bio_noacct_nocheck+0x1e4/0x4f0 [ 57.597547] iomap_readahead+0x2a0/0x448 [ 57.597556] xfs_vm_readahead+0x28/0x3c [ 57.597564] read_pages+0x88/0x41c [ 57.597571] page_cache_ra_unbounded+0x1ac/0x2d8 [ 57.597580] filemap_get_pages+0x188/0x984 [ 57.597588] filemap_read+0x13c/0x4bc [ 57.597596] xfs_file_buffered_read+0x88/0x17c [ 57.597605] xfs_file_read_iter+0xac/0x158 [ 57.597614] vfs_read+0x2d4/0x3b4 [ 57.597622] ksys_read+0x84/0x144 [ 57.597629] system_call_exception+0x130/0x360 [ 57.597637] system_call_common+0x160/0x2c4 [ 57.597647] -> #1 (mapping.invalidate_lock#2){++++}-{4:4}: [ 57.597661] down_read+0x6c/0x220 [ 57.597669] filemap_fault+0x870/0x100c [ 57.597677] xfs_filemap_fault+0xc4/0x18c [ 57.597684] __do_fault+0x64/0x164 [ 57.597693] __handle_mm_fault+0x1274/0x1dac [ 57.597702] handle_mm_fault+0x248/0x484 [ 57.597711] ___do_page_fault+0x428/0xc0c [ 57.597719] hash__do_page_fault+0x30/0x68 [ 57.597727] do_hash_fault+0x90/0x35c [ 57.597736] data_access_common_virt+0x210/0x220 [ 57.597745] _copy_from_user+0xf8/0x19c [ 57.597754] sel_write_load+0x178/0xd54 [ 57.597762] vfs_write+0x108/0x6e8 [ 57.597769] ksys_write+0x84/0x140 [ 57.597777] system_call_exception+0x130/0x360 [ 57.597785] system_call_common+0x160/0x2c4 [ 57.597794] -> #0 (&mm->mmap_lock){++++}-{4:4}: [ 57.597806] __lock_acquire+0x17cc/0x2330 [ 57.597814] lock_acquire+0x138/0x400 [ 57.597822] __might_fault+0x7c/0xc0 [ 57.597830] filldir64+0xe8/0x390 [ 57.597839] dcache_readdir+0x80/0x2d4 [ 57.597846] iterate_dir+0xd8/0x1d4 [ 57.597855] sys_getdents64+0x88/0x2d4 [ 57.597864] system_call_exception+0x130/0x360 [ 57.597872] system_call_common+0x160/0x2c4 [ 57.597881] other info that might help us debug this: [ 57.597888] Chain exists of: &mm->mmap_lock --> &q->debugfs_mutex --> &sb->s_type->i_mutex_key#3 [ 57.597905] Possible unsafe locking scenario: [ 57.597911] CPU0 CPU1 [ 57.597917] ---- ---- [ 57.597922] rlock(&sb->s_type->i_mutex_key#3); [ 57.597932] lock(&q->debugfs_mutex); [ 57.597940] lock(&sb->s_type->i_mutex_key#3); [ 57.597950] rlock(&mm->mmap_lock); [ 57.597958] *** DEADLOCK *** [ 57.597965] 2 locks held by ls/4605: [ 57.597971] #0: c0000000137c12f8 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0xcc/0x154 [ 57.597989] #1: c0000018e27c6810 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: iterate_dir+0x94/0x1d4 Prevent the above lockdep warning by acquiring ->sysfs_lock before freezing the queue while storing a queue attribute in queue_attr_store function. Later, we also found[1] another function __blk_mq_update_nr_ hw_queues where we first freeze queue and then acquire the ->sysfs_lock. So we've also updated lock ordering in __blk_mq_update_nr_hw_queues function and ensured that in all code paths we follow the correct lock ordering i.e. acquire ->sysfs_lock before freezing the queue. [1] https://lore.kernel.org/all/CAFj5m9Ke8+EHKQBs_Nk6hqd=LGXtk4mUxZUN5==ZcCjnZSBwHw@mail.gmail.com/ Reported-by: [email protected] Fixes: af28141 ("block: freeze the queue in queue_attr_store") Tested-by: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Cc: [email protected] Signed-off-by: Nilay Shroff <[email protected]> Reviewed-by: Ming Lei <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Jan 20, 2025
Guangguan Wang says: ==================== net: several fixes for smc v1 -> v2: rewrite patch #2 suggested by Paolo. ==================== Signed-off-by: David S. Miller <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Jan 20, 2025
The mapping VMA address is saved in VAS window struct when the paste address is mapped. This VMA address is used during migration to unmap the paste address if the window is active. The paste address mapping will be removed when the window is closed or with the munmap(). But the VMA address in the VAS window is not updated with munmap() which is causing invalid access during migration. The KASAN report shows: [16386.254991] BUG: KASAN: slab-use-after-free in reconfig_close_windows+0x1a0/0x4e8 [16386.255043] Read of size 8 at addr c00000014a819670 by task drmgr/696928 [16386.255096] CPU: 29 UID: 0 PID: 696928 Comm: drmgr Kdump: loaded Tainted: G B 6.11.0-rc5-nxgzip #2 [16386.255128] Tainted: [B]=BAD_PAGE [16386.255148] Hardware name: IBM,9080-HEX Power11 (architected) 0x820200 0xf000007 of:IBM,FW1110.00 (NH1110_016) hv:phyp pSeries [16386.255181] Call Trace: [16386.255202] [c00000016b297660] [c0000000018ad0ac] dump_stack_lvl+0x84/0xe8 (unreliable) [16386.255246] [c00000016b297690] [c0000000006e8a90] print_report+0x19c/0x764 [16386.255285] [c00000016b297760] [c0000000006e9490] kasan_report+0x128/0x1f8 [16386.255309] [c00000016b297880] [c0000000006eb5c8] __asan_load8+0xac/0xe0 [16386.255326] [c00000016b2978a0] [c00000000013f898] reconfig_close_windows+0x1a0/0x4e8 [16386.255343] [c00000016b297990] [c000000000140e58] vas_migration_handler+0x3a4/0x3fc [16386.255368] [c00000016b297a90] [c000000000128848] pseries_migrate_partition+0x4c/0x4c4 ... [16386.256136] Allocated by task 696554 on cpu 31 at 16377.277618s: [16386.256149] kasan_save_stack+0x34/0x68 [16386.256163] kasan_save_track+0x34/0x80 [16386.256175] kasan_save_alloc_info+0x58/0x74 [16386.256196] __kasan_slab_alloc+0xb8/0xdc [16386.256209] kmem_cache_alloc_noprof+0x200/0x3d0 [16386.256225] vm_area_alloc+0x44/0x150 [16386.256245] mmap_region+0x214/0x10c4 [16386.256265] do_mmap+0x5fc/0x750 [16386.256277] vm_mmap_pgoff+0x14c/0x24c [16386.256292] ksys_mmap_pgoff+0x20c/0x348 [16386.256303] sys_mmap+0xd0/0x160 ... [16386.256350] Freed by task 0 on cpu 31 at 16386.204848s: [16386.256363] kasan_save_stack+0x34/0x68 [16386.256374] kasan_save_track+0x34/0x80 [16386.256384] kasan_save_free_info+0x64/0x10c [16386.256396] __kasan_slab_free+0x120/0x204 [16386.256415] kmem_cache_free+0x128/0x450 [16386.256428] vm_area_free_rcu_cb+0xa8/0xd8 [16386.256441] rcu_do_batch+0x2c8/0xcf0 [16386.256458] rcu_core+0x378/0x3c4 [16386.256473] handle_softirqs+0x20c/0x60c [16386.256495] do_softirq_own_stack+0x6c/0x88 [16386.256509] do_softirq_own_stack+0x58/0x88 [16386.256521] __irq_exit_rcu+0x1a4/0x20c [16386.256533] irq_exit+0x20/0x38 [16386.256544] interrupt_async_exit_prepare.constprop.0+0x18/0x2c ... [16386.256717] Last potentially related work creation: [16386.256729] kasan_save_stack+0x34/0x68 [16386.256741] __kasan_record_aux_stack+0xcc/0x12c [16386.256753] __call_rcu_common.constprop.0+0x94/0xd04 [16386.256766] vm_area_free+0x28/0x3c [16386.256778] remove_vma+0xf4/0x114 [16386.256797] do_vmi_align_munmap.constprop.0+0x684/0x870 [16386.256811] __vm_munmap+0xe0/0x1f8 [16386.256821] sys_munmap+0x54/0x6c [16386.256830] system_call_exception+0x1a0/0x4a0 [16386.256841] system_call_vectored_common+0x15c/0x2ec [16386.256868] The buggy address belongs to the object at c00000014a819670 which belongs to the cache vm_area_struct of size 168 [16386.256887] The buggy address is located 0 bytes inside of freed 168-byte region [c00000014a819670, c00000014a819718) [16386.256915] The buggy address belongs to the physical page: [16386.256928] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x14a81 [16386.256950] memcg:c0000000ba430001 [16386.256961] anon flags: 0x43ffff800000000(node=4|zone=0|lastcpupid=0x7ffff) [16386.256975] page_type: 0xfdffffff(slab) [16386.256990] raw: 043ffff800000000 c00000000501c080 0000000000000000 5deadbee00000001 [16386.257003] raw: 0000000000000000 00000000011a011a 00000001fdffffff c0000000ba430001 [16386.257018] page dumped because: kasan: bad access detected This patch adds close() callback in vas_vm_ops vm_operations_struct which will be executed during munmap() before freeing VMA. The VMA address in the VAS window is set to NULL after holding the window mmap_mutex. Fixes: 37e6764 ("powerpc/pseries/vas: Add VAS migration handler") Signed-off-by: Haren Myneni <[email protected]> Signed-off-by: Madhavan Srinivasan <[email protected]> Link: https://patch.msgid.link/[email protected]
pcmoore
pushed a commit
that referenced
this issue
Jan 20, 2025
Access to genmask field in struct nft_set_ext results in unaligned atomic read: [ 72.130109] Unable to handle kernel paging request at virtual address ffff0000c2bb708c [ 72.131036] Mem abort info: [ 72.131213] ESR = 0x0000000096000021 [ 72.131446] EC = 0x25: DABT (current EL), IL = 32 bits [ 72.132209] SET = 0, FnV = 0 [ 72.133216] EA = 0, S1PTW = 0 [ 72.134080] FSC = 0x21: alignment fault [ 72.135593] Data abort info: [ 72.137194] ISV = 0, ISS = 0x00000021, ISS2 = 0x00000000 [ 72.142351] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 72.145989] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 72.150115] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000237d27000 [ 72.154893] [ffff0000c2bb708c] pgd=0000000000000000, p4d=180000023ffff403, pud=180000023f84b403, pmd=180000023f835403, +pte=0068000102bb7707 [ 72.163021] Internal error: Oops: 0000000096000021 [#1] SMP [...] [ 72.170041] CPU: 7 UID: 0 PID: 54 Comm: kworker/7:0 Tainted: G E 6.13.0-rc3+ #2 [ 72.170509] Tainted: [E]=UNSIGNED_MODULE [ 72.170720] Hardware name: QEMU QEMU Virtual Machine, BIOS edk2-stable202302-for-qemu 03/01/2023 [ 72.171192] Workqueue: events_power_efficient nft_rhash_gc [nf_tables] [ 72.171552] pstate: 21400005 (nzCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) [ 72.171915] pc : nft_rhash_gc+0x200/0x2d8 [nf_tables] [ 72.172166] lr : nft_rhash_gc+0x128/0x2d8 [nf_tables] [ 72.172546] sp : ffff800081f2bce0 [ 72.172724] x29: ffff800081f2bd40 x28: ffff0000c2bb708c x27: 0000000000000038 [ 72.173078] x26: ffff0000c6780ef0 x25: ffff0000c643df00 x24: ffff0000c6778f78 [ 72.173431] x23: 000000000000001a x22: ffff0000c4b1f000 x21: ffff0000c6780f78 [ 72.173782] x20: ffff0000c2bb70dc x19: ffff0000c2bb7080 x18: 0000000000000000 [ 72.174135] x17: ffff0000c0a4e1c0 x16: 0000000000003000 x15: 0000ac26d173b978 [ 72.174485] x14: ffffffffffffffff x13: 0000000000000030 x12: ffff0000c6780ef0 [ 72.174841] x11: 0000000000000000 x10: ffff800081f2bcf8 x9 : ffff0000c3000000 [ 72.175193] x8 : 00000000000004be x7 : 0000000000000000 x6 : 0000000000000000 [ 72.175544] x5 : 0000000000000040 x4 : ffff0000c3000010 x3 : 0000000000000000 [ 72.175871] x2 : 0000000000003a98 x1 : ffff0000c2bb708c x0 : 0000000000000004 [ 72.176207] Call trace: [ 72.176316] nft_rhash_gc+0x200/0x2d8 [nf_tables] (P) [ 72.176653] process_one_work+0x178/0x3d0 [ 72.176831] worker_thread+0x200/0x3f0 [ 72.176995] kthread+0xe8/0xf8 [ 72.177130] ret_from_fork+0x10/0x20 [ 72.177289] Code: 54fff984 d503201f d2800080 91003261 (f820303f) [ 72.177557] ---[ end trace 0000000000000000 ]--- Align struct nft_set_ext to word size to address this and documentation it. pahole reports that this increases the size of elements for rhash and pipapo in 8 bytes on x86_64. Fixes: 7ffc748 ("netfilter: nft_set_hash: skip duplicated elements pending gc run") Signed-off-by: Pablo Neira Ayuso <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Jan 20, 2025
syzbot reports that a recent fix causes nesting issues between the (now) raw timeoutlock and the eventfd locking: ============================= [ BUG: Invalid wait context ] 6.13.0-rc4-00080-g9828a4c0901f #29 Not tainted ----------------------------- kworker/u32:0/68094 is trying to lock: ffff000014d7a520 (&ctx->wqh#2){..-.}-{3:3}, at: eventfd_signal_mask+0x64/0x180 other info that might help us debug this: context-{5:5} 6 locks held by kworker/u32:0/68094: #0: ffff0000c1d98148 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work+0x4e8/0xfc0 #1: ffff80008d927c78 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work+0x53c/0xfc0 #2: ffff0000c59bc3d8 (&ctx->completion_lock){+.+.}-{3:3}, at: io_kill_timeouts+0x40/0x180 #3: ffff0000c59bc358 (&ctx->timeout_lock){-.-.}-{2:2}, at: io_kill_timeouts+0x48/0x180 #4: ffff800085127aa0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x8/0x38 #5: ffff800085127aa0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x8/0x38 stack backtrace: CPU: 7 UID: 0 PID: 68094 Comm: kworker/u32:0 Not tainted 6.13.0-rc4-00080-g9828a4c0901f #29 Hardware name: linux,dummy-virt (DT) Workqueue: iou_exit io_ring_exit_work Call trace: show_stack+0x1c/0x30 (C) __dump_stack+0x24/0x30 dump_stack_lvl+0x60/0x80 dump_stack+0x14/0x20 __lock_acquire+0x19f8/0x60c8 lock_acquire+0x1a4/0x540 _raw_spin_lock_irqsave+0x90/0xd0 eventfd_signal_mask+0x64/0x180 io_eventfd_signal+0x64/0x108 io_req_local_work_add+0x294/0x430 __io_req_task_work_add+0x1c0/0x270 io_kill_timeout+0x1f0/0x288 io_kill_timeouts+0xd4/0x180 io_uring_try_cancel_requests+0x2e8/0x388 io_ring_exit_work+0x150/0x550 process_one_work+0x5e8/0xfc0 worker_thread+0x7ec/0xc80 kthread+0x24c/0x300 ret_from_fork+0x10/0x20 because after the preempt-rt fix for the timeout lock nesting inside the io-wq lock, we now have the eventfd spinlock nesting inside the raw timeout spinlock. Rather than play whack-a-mole with other nesting on the timeout lock, split the deletion and killing of timeouts so queueing the task_work for the timeout cancelations can get done outside of the timeout lock. Reported-by: [email protected] Fixes: 020b40f ("io_uring: make ctx->timeout_lock a raw spinlock") Signed-off-by: Jens Axboe <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Jan 20, 2025
…le_direct_reclaim() The task sometimes continues looping in throttle_direct_reclaim() because allow_direct_reclaim(pgdat) keeps returning false. #0 [ffff80002cb6f8d0] __switch_to at ffff8000080095ac #1 [ffff80002cb6f900] __schedule at ffff800008abbd1c #2 [ffff80002cb6f990] schedule at ffff800008abc50c #3 [ffff80002cb6f9b0] throttle_direct_reclaim at ffff800008273550 #4 [ffff80002cb6fa20] try_to_free_pages at ffff800008277b68 #5 [ffff80002cb6fae0] __alloc_pages_nodemask at ffff8000082c4660 #6 [ffff80002cb6fc50] alloc_pages_vma at ffff8000082e4a98 #7 [ffff80002cb6fca0] do_anonymous_page at ffff80000829f5a8 #8 [ffff80002cb6fce0] __handle_mm_fault at ffff8000082a5974 #9 [ffff80002cb6fd90] handle_mm_fault at ffff8000082a5bd4 At this point, the pgdat contains the following two zones: NODE: 4 ZONE: 0 ADDR: ffff00817fffe540 NAME: "DMA32" SIZE: 20480 MIN/LOW/HIGH: 11/28/45 VM_STAT: NR_FREE_PAGES: 359 NR_ZONE_INACTIVE_ANON: 18813 NR_ZONE_ACTIVE_ANON: 0 NR_ZONE_INACTIVE_FILE: 50 NR_ZONE_ACTIVE_FILE: 0 NR_ZONE_UNEVICTABLE: 0 NR_ZONE_WRITE_PENDING: 0 NR_MLOCK: 0 NR_BOUNCE: 0 NR_ZSPAGES: 0 NR_FREE_CMA_PAGES: 0 NODE: 4 ZONE: 1 ADDR: ffff00817fffec00 NAME: "Normal" SIZE: 8454144 PRESENT: 98304 MIN/LOW/HIGH: 68/166/264 VM_STAT: NR_FREE_PAGES: 146 NR_ZONE_INACTIVE_ANON: 94668 NR_ZONE_ACTIVE_ANON: 3 NR_ZONE_INACTIVE_FILE: 735 NR_ZONE_ACTIVE_FILE: 78 NR_ZONE_UNEVICTABLE: 0 NR_ZONE_WRITE_PENDING: 0 NR_MLOCK: 0 NR_BOUNCE: 0 NR_ZSPAGES: 0 NR_FREE_CMA_PAGES: 0 In allow_direct_reclaim(), while processing ZONE_DMA32, the sum of inactive/active file-backed pages calculated in zone_reclaimable_pages() based on the result of zone_page_state_snapshot() is zero. Additionally, since this system lacks swap, the calculation of inactive/ active anonymous pages is skipped. crash> p nr_swap_pages nr_swap_pages = $1937 = { counter = 0 } As a result, ZONE_DMA32 is deemed unreclaimable and skipped, moving on to the processing of the next zone, ZONE_NORMAL, despite ZONE_DMA32 having free pages significantly exceeding the high watermark. The problem is that the pgdat->kswapd_failures hasn't been incremented. crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_failures $1935 = 0x0 This is because the node deemed balanced. The node balancing logic in balance_pgdat() evaluates all zones collectively. If one or more zones (e.g., ZONE_DMA32) have enough free pages to meet their watermarks, the entire node is deemed balanced. This causes balance_pgdat() to exit early before incrementing the kswapd_failures, as it considers the overall memory state acceptable, even though some zones (like ZONE_NORMAL) remain under significant pressure. The patch ensures that zone_reclaimable_pages() includes free pages (NR_FREE_PAGES) in its calculation when no other reclaimable pages are available (e.g., file-backed or anonymous pages). This change prevents zones like ZONE_DMA32, which have sufficient free pages, from being mistakenly deemed unreclaimable. By doing so, the patch ensures proper node balancing, avoids masking pressure on other zones like ZONE_NORMAL, and prevents infinite loops in throttle_direct_reclaim() caused by allow_direct_reclaim(pgdat) repeatedly returning false. The kernel hangs due to a task stuck in throttle_direct_reclaim(), caused by a node being incorrectly deemed balanced despite pressure in certain zones, such as ZONE_NORMAL. This issue arises from zone_reclaimable_pages() returning 0 for zones without reclaimable file- backed or anonymous pages, causing zones like ZONE_DMA32 with sufficient free pages to be skipped. The lack of swap or reclaimable pages results in ZONE_DMA32 being ignored during reclaim, masking pressure in other zones. Consequently, pgdat->kswapd_failures remains 0 in balance_pgdat(), preventing fallback mechanisms in allow_direct_reclaim() from being triggered, leading to an infinite loop in throttle_direct_reclaim(). This patch modifies zone_reclaimable_pages() to account for free pages (NR_FREE_PAGES) when no other reclaimable pages exist. This ensures zones with sufficient free pages are not skipped, enabling proper balancing and reclaim behavior. [[email protected]: coding-style cleanups] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 5a1c84b ("mm: remove reclaim and compaction retry approximations") Signed-off-by: Seiji Nishikawa <[email protected]> Cc: Mel Gorman <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Jan 20, 2025
We found a timeout problem with the pldm command on our system. The reason is that the MCTP-I3C driver has a race condition when receiving multiple-packet messages in multi-thread, resulting in a wrong packet order problem. We identified this problem by adding a debug message to the mctp_i3c_read function. According to the MCTP spec, a multiple-packet message must be composed in sequence, and if there is a wrong sequence, the whole message will be discarded and wait for the next SOM. For example, SOM → Pkt Seq #2 → Pkt Seq #1 → Pkt Seq #3 → EOM. Therefore, we try to solve this problem by adding a mutex to the mctp_i3c_read function. Before the modification, when a command requesting a multiple-packet message response is sent consecutively, an error usually occurs within 100 loops. After the mutex, it can go through 40000 loops without any error, and it seems to run well. Fixes: c8755b2 ("mctp i3c: MCTP I3C driver") Signed-off-by: Leo Yang <[email protected]> Link: https://patch.msgid.link/[email protected] [[email protected]: dropped already answered question from changelog] Signed-off-by: Paolo Abeni <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Jan 20, 2025
irq_chip functions may be called in raw spinlock context. Therefore, we must also use a raw spinlock for our own internal locking. This fixes the following lockdep splat: [ 5.349336] ============================= [ 5.353349] [ BUG: Invalid wait context ] [ 5.357361] 6.13.0-rc5+ #69 Tainted: G W [ 5.363031] ----------------------------- [ 5.367045] kworker/u17:1/44 is trying to lock: [ 5.371587] ffffff88018b02c0 (&chip->gpio_lock){....}-{3:3}, at: xgpio_irq_unmask (drivers/gpio/gpio-xilinx.c:433 (discriminator 8)) [ 5.380079] other info that might help us debug this: [ 5.385138] context-{5:5} [ 5.387762] 5 locks held by kworker/u17:1/44: [ 5.392123] #0: ffffff8800014958 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work (kernel/workqueue.c:3204) [ 5.402260] #1: ffffffc082fcbdd8 (deferred_probe_work){+.+.}-{0:0}, at: process_one_work (kernel/workqueue.c:3205) [ 5.411528] #2: ffffff880172c900 (&dev->mutex){....}-{4:4}, at: __device_attach (drivers/base/dd.c:1006) [ 5.419929] #3: ffffff88039c8268 (request_class#2){+.+.}-{4:4}, at: __setup_irq (kernel/irq/internals.h:156 kernel/irq/manage.c:1596) [ 5.428331] #4: ffffff88039c80c8 (lock_class#2){....}-{2:2}, at: __setup_irq (kernel/irq/manage.c:1614) [ 5.436472] stack backtrace: [ 5.439359] CPU: 2 UID: 0 PID: 44 Comm: kworker/u17:1 Tainted: G W 6.13.0-rc5+ #69 [ 5.448690] Tainted: [W]=WARN [ 5.451656] Hardware name: xlnx,zynqmp (DT) [ 5.455845] Workqueue: events_unbound deferred_probe_work_func [ 5.461699] Call trace: [ 5.464147] show_stack+0x18/0x24 C [ 5.467821] dump_stack_lvl (lib/dump_stack.c:123) [ 5.471501] dump_stack (lib/dump_stack.c:130) [ 5.474824] __lock_acquire (kernel/locking/lockdep.c:4828 kernel/locking/lockdep.c:4898 kernel/locking/lockdep.c:5176) [ 5.478758] lock_acquire (arch/arm64/include/asm/percpu.h:40 kernel/locking/lockdep.c:467 kernel/locking/lockdep.c:5851 kernel/locking/lockdep.c:5814) [ 5.482429] _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:111 kernel/locking/spinlock.c:162) [ 5.486797] xgpio_irq_unmask (drivers/gpio/gpio-xilinx.c:433 (discriminator 8)) [ 5.490737] irq_enable (kernel/irq/internals.h:236 kernel/irq/chip.c:170 kernel/irq/chip.c:439 kernel/irq/chip.c:432 kernel/irq/chip.c:345) [ 5.494060] __irq_startup (kernel/irq/internals.h:241 kernel/irq/chip.c:180 kernel/irq/chip.c:250) [ 5.497645] irq_startup (kernel/irq/chip.c:270) [ 5.501143] __setup_irq (kernel/irq/manage.c:1807) [ 5.504728] request_threaded_irq (kernel/irq/manage.c:2208) Fixes: a32c7ca ("gpio: gpio-xilinx: Add interrupt support") Signed-off-by: Sean Anderson <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Bartosz Golaszewski <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Jan 20, 2025
This commit addresses a circular locking dependency issue within the GFX isolation mechanism. The problem was identified by a warning indicating a potential deadlock due to inconsistent lock acquisition order. - The `amdgpu_gfx_enforce_isolation_ring_begin_use` and `amdgpu_gfx_enforce_isolation_ring_end_use` functions previously acquired `enforce_isolation_mutex` and called `amdgpu_gfx_kfd_sch_ctrl`, leading to potential deadlocks. ie., If `amdgpu_gfx_kfd_sch_ctrl` is called while `enforce_isolation_mutex` is held, and `amdgpu_gfx_enforce_isolation_handler` is called while `kfd_sch_mutex` is held, it can create a circular dependency. By ensuring consistent lock usage, this fix resolves the issue: [ 606.297333] ====================================================== [ 606.297343] WARNING: possible circular locking dependency detected [ 606.297353] 6.10.0-amd-mlkd-610-311224-lof #19 Tainted: G OE [ 606.297365] ------------------------------------------------------ [ 606.297375] kworker/u96:3/3825 is trying to acquire lock: [ 606.297385] ffff9aa64e431cb8 ((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)){+.+.}-{0:0}, at: __flush_work+0x232/0x610 [ 606.297413] but task is already holding lock: [ 606.297423] ffff9aa64e432338 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}, at: amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu] [ 606.297725] which lock already depends on the new lock. [ 606.297738] the existing dependency chain (in reverse order) is: [ 606.297749] -> #2 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}: [ 606.297765] __mutex_lock+0x85/0x930 [ 606.297776] mutex_lock_nested+0x1b/0x30 [ 606.297786] amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu] [ 606.298007] amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu] [ 606.298225] amdgpu_ring_alloc+0x48/0x70 [amdgpu] [ 606.298412] amdgpu_ib_schedule+0x176/0x8a0 [amdgpu] [ 606.298603] amdgpu_job_run+0xac/0x1e0 [amdgpu] [ 606.298866] drm_sched_run_job_work+0x24f/0x430 [gpu_sched] [ 606.298880] process_one_work+0x21e/0x680 [ 606.298890] worker_thread+0x190/0x350 [ 606.298899] kthread+0xe7/0x120 [ 606.298908] ret_from_fork+0x3c/0x60 [ 606.298919] ret_from_fork_asm+0x1a/0x30 [ 606.298929] -> #1 (&adev->enforce_isolation_mutex){+.+.}-{3:3}: [ 606.298947] __mutex_lock+0x85/0x930 [ 606.298956] mutex_lock_nested+0x1b/0x30 [ 606.298966] amdgpu_gfx_enforce_isolation_handler+0x87/0x370 [amdgpu] [ 606.299190] process_one_work+0x21e/0x680 [ 606.299199] worker_thread+0x190/0x350 [ 606.299208] kthread+0xe7/0x120 [ 606.299217] ret_from_fork+0x3c/0x60 [ 606.299227] ret_from_fork_asm+0x1a/0x30 [ 606.299236] -> #0 ((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)){+.+.}-{0:0}: [ 606.299257] __lock_acquire+0x16f9/0x2810 [ 606.299267] lock_acquire+0xd1/0x300 [ 606.299276] __flush_work+0x250/0x610 [ 606.299286] cancel_delayed_work_sync+0x71/0x80 [ 606.299296] amdgpu_gfx_kfd_sch_ctrl+0x287/0x4d0 [amdgpu] [ 606.299509] amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu] [ 606.299723] amdgpu_ring_alloc+0x48/0x70 [amdgpu] [ 606.299909] amdgpu_ib_schedule+0x176/0x8a0 [amdgpu] [ 606.300101] amdgpu_job_run+0xac/0x1e0 [amdgpu] [ 606.300355] drm_sched_run_job_work+0x24f/0x430 [gpu_sched] [ 606.300369] process_one_work+0x21e/0x680 [ 606.300378] worker_thread+0x190/0x350 [ 606.300387] kthread+0xe7/0x120 [ 606.300396] ret_from_fork+0x3c/0x60 [ 606.300406] ret_from_fork_asm+0x1a/0x30 [ 606.300416] other info that might help us debug this: [ 606.300428] Chain exists of: (work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work) --> &adev->enforce_isolation_mutex --> &adev->gfx.kfd_sch_mutex [ 606.300458] Possible unsafe locking scenario: [ 606.300468] CPU0 CPU1 [ 606.300476] ---- ---- [ 606.300484] lock(&adev->gfx.kfd_sch_mutex); [ 606.300494] lock(&adev->enforce_isolation_mutex); [ 606.300508] lock(&adev->gfx.kfd_sch_mutex); [ 606.300521] lock((work_completion)(&(&adev->gfx.enforce_isolation[i].work)->work)); [ 606.300536] *** DEADLOCK *** [ 606.300546] 5 locks held by kworker/u96:3/3825: [ 606.300555] #0: ffff9aa5aa1f5d58 ((wq_completion)comp_1.1.0){+.+.}-{0:0}, at: process_one_work+0x3f5/0x680 [ 606.300577] #1: ffffaa53c3c97e40 ((work_completion)(&sched->work_run_job)){+.+.}-{0:0}, at: process_one_work+0x1d6/0x680 [ 606.300600] #2: ffff9aa64e463c98 (&adev->enforce_isolation_mutex){+.+.}-{3:3}, at: amdgpu_gfx_enforce_isolation_ring_begin_use+0x1c3/0x5d0 [amdgpu] [ 606.300837] #3: ffff9aa64e432338 (&adev->gfx.kfd_sch_mutex){+.+.}-{3:3}, at: amdgpu_gfx_kfd_sch_ctrl+0x51/0x4d0 [amdgpu] [ 606.301062] #4: ffffffff8c1a5660 (rcu_read_lock){....}-{1:2}, at: __flush_work+0x70/0x610 [ 606.301083] stack backtrace: [ 606.301092] CPU: 14 PID: 3825 Comm: kworker/u96:3 Tainted: G OE 6.10.0-amd-mlkd-610-311224-lof #19 [ 606.301109] Hardware name: Gigabyte Technology Co., Ltd. X570S GAMING X/X570S GAMING X, BIOS F7 03/22/2024 [ 606.301124] Workqueue: comp_1.1.0 drm_sched_run_job_work [gpu_sched] [ 606.301140] Call Trace: [ 606.301146] <TASK> [ 606.301154] dump_stack_lvl+0x9b/0xf0 [ 606.301166] dump_stack+0x10/0x20 [ 606.301175] print_circular_bug+0x26c/0x340 [ 606.301187] check_noncircular+0x157/0x170 [ 606.301197] ? register_lock_class+0x48/0x490 [ 606.301213] __lock_acquire+0x16f9/0x2810 [ 606.301230] lock_acquire+0xd1/0x300 [ 606.301239] ? __flush_work+0x232/0x610 [ 606.301250] ? srso_alias_return_thunk+0x5/0xfbef5 [ 606.301261] ? mark_held_locks+0x54/0x90 [ 606.301274] ? __flush_work+0x232/0x610 [ 606.301284] __flush_work+0x250/0x610 [ 606.301293] ? __flush_work+0x232/0x610 [ 606.301305] ? __pfx_wq_barrier_func+0x10/0x10 [ 606.301318] ? mark_held_locks+0x54/0x90 [ 606.301331] ? srso_alias_return_thunk+0x5/0xfbef5 [ 606.301345] cancel_delayed_work_sync+0x71/0x80 [ 606.301356] amdgpu_gfx_kfd_sch_ctrl+0x287/0x4d0 [amdgpu] [ 606.301661] amdgpu_gfx_enforce_isolation_ring_begin_use+0x2a4/0x5d0 [amdgpu] [ 606.302050] ? srso_alias_return_thunk+0x5/0xfbef5 [ 606.302069] amdgpu_ring_alloc+0x48/0x70 [amdgpu] [ 606.302452] amdgpu_ib_schedule+0x176/0x8a0 [amdgpu] [ 606.302862] ? drm_sched_entity_error+0x82/0x190 [gpu_sched] [ 606.302890] amdgpu_job_run+0xac/0x1e0 [amdgpu] [ 606.303366] drm_sched_run_job_work+0x24f/0x430 [gpu_sched] [ 606.303388] process_one_work+0x21e/0x680 [ 606.303409] worker_thread+0x190/0x350 [ 606.303424] ? __pfx_worker_thread+0x10/0x10 [ 606.303437] kthread+0xe7/0x120 [ 606.303449] ? __pfx_kthread+0x10/0x10 [ 606.303463] ret_from_fork+0x3c/0x60 [ 606.303476] ? __pfx_kthread+0x10/0x10 [ 606.303489] ret_from_fork_asm+0x1a/0x30 [ 606.303512] </TASK> v2: Refactor lock handling to resolve circular dependency (Alex) - Introduced a `sched_work` flag to defer the call to `amdgpu_gfx_kfd_sch_ctrl` until after releasing `enforce_isolation_mutex`. - This change ensures that `amdgpu_gfx_kfd_sch_ctrl` is called outside the critical section, preventing the circular dependency and deadlock. - The `sched_work` flag is set within the mutex-protected section if conditions are met, and the actual function call is made afterward. - This approach ensures consistent lock acquisition order. Fixes: afefd6f ("drm/amdgpu: Implement Enforce Isolation Handler for KGD/KFD serialization") Cc: Christian König <[email protected]> Cc: Alex Deucher <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Suggested-by: Alex Deucher <[email protected]> Reviewed-by: Alex Deucher <[email protected]> Signed-off-by: Alex Deucher <[email protected]> (cherry picked from commit 0b6b2dd38336d5fd49214f0e4e6495e658e3ab44) Cc: [email protected]
pcmoore
pushed a commit
that referenced
this issue
Jan 20, 2025
syz reports an out of bounds read: ================================================================== BUG: KASAN: slab-out-of-bounds in ocfs2_match fs/ocfs2/dir.c:334 [inline] BUG: KASAN: slab-out-of-bounds in ocfs2_search_dirblock+0x283/0x6e0 fs/ocfs2/dir.c:367 Read of size 1 at addr ffff88804d8b9982 by task syz-executor.2/14802 CPU: 0 UID: 0 PID: 14802 Comm: syz-executor.2 Not tainted 6.13.0-rc4 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 Sched_ext: serialise (enabled+all), task: runnable_at=-10ms Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x229/0x350 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0x164/0x530 mm/kasan/report.c:489 kasan_report+0x147/0x180 mm/kasan/report.c:602 ocfs2_match fs/ocfs2/dir.c:334 [inline] ocfs2_search_dirblock+0x283/0x6e0 fs/ocfs2/dir.c:367 ocfs2_find_entry_id fs/ocfs2/dir.c:414 [inline] ocfs2_find_entry+0x1143/0x2db0 fs/ocfs2/dir.c:1078 ocfs2_find_files_on_disk+0x18e/0x530 fs/ocfs2/dir.c:1981 ocfs2_lookup_ino_from_name+0xb6/0x110 fs/ocfs2/dir.c:2003 ocfs2_lookup+0x30a/0xd40 fs/ocfs2/namei.c:122 lookup_open fs/namei.c:3627 [inline] open_last_lookups fs/namei.c:3748 [inline] path_openat+0x145a/0x3870 fs/namei.c:3984 do_filp_open+0xe9/0x1c0 fs/namei.c:4014 do_sys_openat2+0x135/0x1d0 fs/open.c:1402 do_sys_open fs/open.c:1417 [inline] __do_sys_openat fs/open.c:1433 [inline] __se_sys_openat fs/open.c:1428 [inline] __x64_sys_openat+0x15d/0x1c0 fs/open.c:1428 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf6/0x210 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f01076903ad Code: c3 e8 a7 2b 00 00 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f01084acfc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000101 RAX: ffffffffffffffda RBX: 00007f01077cbf80 RCX: 00007f01076903ad RDX: 0000000000105042 RSI: 0000000020000080 RDI: ffffffffffffff9c RBP: 00007f01077cbf80 R08: 0000000000000000 R09: 0000000000000000 R10: 00000000000001ff R11: 0000000000000246 R12: 0000000000000000 R13: 00007f01077cbf80 R14: 00007f010764fc90 R15: 00007f010848d000 </TASK> ================================================================== And a general protection fault in ocfs2_prepare_dir_for_insert: ================================================================== loop0: detected capacity change from 0 to 32768 JBD2: Ignoring recovery information on journal ocfs2: Mounting device (7,0) on (node local, slot 0) with ordered data mode. Oops: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f] CPU: 0 UID: 0 PID: 5096 Comm: syz-executor792 Not tainted 6.11.0-rc4-syzkaller-00002-gb0da640826ba #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:ocfs2_find_dir_space_id fs/ocfs2/dir.c:3406 [inline] RIP: 0010:ocfs2_prepare_dir_for_insert+0x3309/0x5c70 fs/ocfs2/dir.c:4280 Code: 00 00 e8 2a 25 13 fe e9 ba 06 00 00 e8 20 25 13 fe e9 4f 01 00 00 e8 16 25 13 fe 49 8d 7f 08 49 8d 5f 09 48 89 f8 48 c1 e8 03 <42> 0f b6 04 20 84 c0 0f 85 bd 23 00 00 48 89 d8 48 c1 e8 03 42 0f RSP: 0018:ffffc9000af9f020 EFLAGS: 00010202 RAX: 0000000000000001 RBX: 0000000000000009 RCX: ffff88801e27a440 RDX: 0000000000000000 RSI: 0000000000000400 RDI: 0000000000000008 RBP: ffffc9000af9f830 R08: ffffffff8380395b R09: ffffffff838090a7 R10: 0000000000000002 R11: ffff88801e27a440 R12: dffffc0000000000 R13: ffff88803c660878 R14: f700000000000088 R15: 0000000000000000 FS: 000055555a677380(0000) GS:ffff888020800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000560bce569178 CR3: 000000001de5a000 CR4: 0000000000350ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ocfs2_mknod+0xcaf/0x2b40 fs/ocfs2/namei.c:292 vfs_mknod+0x36d/0x3b0 fs/namei.c:4088 do_mknodat+0x3ec/0x5b0 __do_sys_mknodat fs/namei.c:4166 [inline] __se_sys_mknodat fs/namei.c:4163 [inline] __x64_sys_mknodat+0xa7/0xc0 fs/namei.c:4163 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f2dafda3a99 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 17 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffe336a6658 EFLAGS: 00000246 ORIG_RAX: 0000000000000103 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f2dafda3a99 RDX: 00000000000021c0 RSI: 0000000020000040 RDI: 00000000ffffff9c RBP: 00007f2dafe1b5f0 R08: 0000000000004480 R09: 000055555a6784c0 R10: 0000000000000103 R11: 0000000000000246 R12: 00007ffe336a6680 R13: 00007ffe336a68a8 R14: 431bde82d7b634db R15: 00007f2dafdec03b </TASK> ================================================================== The two reports are all caused invalid negative i_size of dir inode. For ocfs2, dir_inode can't be negative or zero. Here add a check in which is called by ocfs2_check_dir_for_entry(). It fixes the second report as ocfs2_check_dir_for_entry() must be called before ocfs2_prepare_dir_for_insert(). Also set a up limit for dir with OCFS2_INLINE_DATA_FL. The i_size can't be great than blocksize. Link: https://lkml.kernel.org/r/[email protected] Reported-by: Jiacheng Xu <[email protected]> Link: https://lore.kernel.org/ocfs2-devel/[email protected]/T/#u Reported-by: [email protected] Link: https://lore.kernel.org/all/[email protected]/T/ Signed-off-by: Su Yue <[email protected]> Reviewed-by: Heming Zhao <[email protected]> Reviewed-by: Joseph Qi <[email protected]> Cc: Mark Fasheh <[email protected]> Cc: Joel Becker <[email protected]> Cc: Junxiao Bi <[email protected]> Cc: Changwei Ge <[email protected]> Cc: Jun Piao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Jan 20, 2025
Fix a lockdep warning [1] observed during the write combining test. The warning indicates a potential nested lock scenario that could lead to a deadlock. However, this is a false positive alarm because the SF lock and its parent lock are distinct ones. The lockdep confusion arises because the locks belong to the same object class (i.e., struct mlx5_core_dev). To resolve this, the code has been refactored to avoid taking both locks. Instead, only the parent lock is acquired. [1] raw_ethernet_bw/2118 is trying to acquire lock: [ 213.619032] ffff88811dd75e08 (&dev->wc_state_lock){+.+.}-{3:3}, at: mlx5_wc_support_get+0x18c/0x210 [mlx5_core] [ 213.620270] [ 213.620270] but task is already holding lock: [ 213.620943] ffff88810b585e08 (&dev->wc_state_lock){+.+.}-{3:3}, at: mlx5_wc_support_get+0x10c/0x210 [mlx5_core] [ 213.622045] [ 213.622045] other info that might help us debug this: [ 213.622778] Possible unsafe locking scenario: [ 213.622778] [ 213.623465] CPU0 [ 213.623815] ---- [ 213.624148] lock(&dev->wc_state_lock); [ 213.624615] lock(&dev->wc_state_lock); [ 213.625071] [ 213.625071] *** DEADLOCK *** [ 213.625071] [ 213.625805] May be due to missing lock nesting notation [ 213.625805] [ 213.626522] 4 locks held by raw_ethernet_bw/2118: [ 213.627019] #0: ffff88813f80d578 (&uverbs_dev->disassociate_srcu){.+.+}-{0:0}, at: ib_uverbs_ioctl+0xc4/0x170 [ib_uverbs] [ 213.628088] #1: ffff88810fb23930 (&file->hw_destroy_rwsem){.+.+}-{3:3}, at: ib_init_ucontext+0x2d/0xf0 [ib_uverbs] [ 213.629094] #2: ffff88810fb23878 (&file->ucontext_lock){+.+.}-{3:3}, at: ib_init_ucontext+0x49/0xf0 [ib_uverbs] [ 213.630106] #3: ffff88810b585e08 (&dev->wc_state_lock){+.+.}-{3:3}, at: mlx5_wc_support_get+0x10c/0x210 [mlx5_core] [ 213.631185] [ 213.631185] stack backtrace: [ 213.631718] CPU: 1 UID: 0 PID: 2118 Comm: raw_ethernet_bw Not tainted 6.12.0-rc7_internal_net_next_mlx5_89a0ad0 #1 [ 213.632722] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 213.633785] Call Trace: [ 213.634099] [ 213.634393] dump_stack_lvl+0x7e/0xc0 [ 213.634806] print_deadlock_bug+0x278/0x3c0 [ 213.635265] __lock_acquire+0x15f4/0x2c40 [ 213.635712] lock_acquire+0xcd/0x2d0 [ 213.636120] ? mlx5_wc_support_get+0x18c/0x210 [mlx5_core] [ 213.636722] ? mlx5_ib_enable_lb+0x24/0xa0 [mlx5_ib] [ 213.637277] __mutex_lock+0x81/0xda0 [ 213.637697] ? mlx5_wc_support_get+0x18c/0x210 [mlx5_core] [ 213.638305] ? mlx5_wc_support_get+0x18c/0x210 [mlx5_core] [ 213.638902] ? rcu_read_lock_sched_held+0x3f/0x70 [ 213.639400] ? mlx5_wc_support_get+0x18c/0x210 [mlx5_core] [ 213.640016] mlx5_wc_support_get+0x18c/0x210 [mlx5_core] [ 213.640615] set_ucontext_resp+0x68/0x2b0 [mlx5_ib] [ 213.641144] ? debug_mutex_init+0x33/0x40 [ 213.641586] mlx5_ib_alloc_ucontext+0x18e/0x7b0 [mlx5_ib] [ 213.642145] ib_init_ucontext+0xa0/0xf0 [ib_uverbs] [ 213.642679] ib_uverbs_handler_UVERBS_METHOD_GET_CONTEXT+0x95/0xc0 [ib_uverbs] [ 213.643426] ? _copy_from_user+0x46/0x80 [ 213.643878] ib_uverbs_cmd_verbs+0xa6b/0xc80 [ib_uverbs] [ 213.644426] ? ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0x130/0x130 [ib_uverbs] [ 213.645213] ? __lock_acquire+0xa99/0x2c40 [ 213.645675] ? lock_acquire+0xcd/0x2d0 [ 213.646101] ? ib_uverbs_ioctl+0xc4/0x170 [ib_uverbs] [ 213.646625] ? reacquire_held_locks+0xcf/0x1f0 [ 213.647102] ? do_user_addr_fault+0x45d/0x770 [ 213.647586] ib_uverbs_ioctl+0xe0/0x170 [ib_uverbs] [ 213.648102] ? ib_uverbs_ioctl+0xc4/0x170 [ib_uverbs] [ 213.648632] __x64_sys_ioctl+0x4d3/0xaa0 [ 213.649060] ? do_user_addr_fault+0x4a8/0x770 [ 213.649528] do_syscall_64+0x6d/0x140 [ 213.649947] entry_SYSCALL_64_after_hwframe+0x4b/0x53 [ 213.650478] RIP: 0033:0x7fa179b0737b [ 213.650893] Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7d 2a 0f 00 f7 d8 64 89 01 48 [ 213.652619] RSP: 002b:00007ffd2e6d46e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 213.653390] RAX: ffffffffffffffda RBX: 00007ffd2e6d47f8 RCX: 00007fa179b0737b [ 213.654084] RDX: 00007ffd2e6d47e0 RSI: 00000000c0181b01 RDI: 0000000000000003 [ 213.654767] RBP: 00007ffd2e6d47c0 R08: 00007fa1799be010 R09: 0000000000000002 [ 213.655453] R10: 00007ffd2e6d4960 R11: 0000000000000246 R12: 00007ffd2e6d487c [ 213.656170] R13: 0000000000000027 R14: 0000000000000001 R15: 00007ffd2e6d4f70 Fixes: d98995b ("net/mlx5: Reimplement write combining test") Signed-off-by: Yishai Hadas <[email protected]> Reviewed-by: Michael Guralnik <[email protected]> Reviewed-by: Larysa Zaremba <[email protected]> Signed-off-by: Tariq Toukan <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
pcmoore
pushed a commit
that referenced
this issue
Jan 20, 2025
Clear the port select structure on error so no stale values left after definers are destroyed. That's because the mlx5_lag_destroy_definers() always try to destroy all lag definers in the tt_map, so in the flow below lag definers get double-destroyed and cause kernel crash: mlx5_lag_port_sel_create() mlx5_lag_create_definers() mlx5_lag_create_definer() <- Failed on tt 1 mlx5_lag_destroy_definers() <- definers[tt=0] gets destroyed mlx5_lag_port_sel_create() mlx5_lag_create_definers() mlx5_lag_create_definer() <- Failed on tt 0 mlx5_lag_destroy_definers() <- definers[tt=0] gets double-destroyed Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008 Mem abort info: ESR = 0x0000000096000005 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x05: level 1 translation fault Data abort info: ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000 CM = 0, WnR = 0, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 user pgtable: 64k pages, 48-bit VAs, pgdp=0000000112ce2e00 [0000000000000008] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000 Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP Modules linked in: iptable_raw bonding ip_gre ip6_gre gre ip6_tunnel tunnel6 geneve ip6_udp_tunnel udp_tunnel ipip tunnel4 ip_tunnel rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_fwctl(OE) fwctl(OE) mlx5_core(OE) mlxdevm(OE) ib_core(OE) mlxfw(OE) memtrack(OE) mlx_compat(OE) openvswitch nsh nf_conncount psample xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc netconsole overlay efi_pstore sch_fq_codel zram ip_tables crct10dif_ce qemu_fw_cfg fuse ipv6 crc_ccitt [last unloaded: mlx_compat(OE)] CPU: 3 UID: 0 PID: 217 Comm: kworker/u53:2 Tainted: G OE 6.11.0+ #2 Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 Workqueue: mlx5_lag mlx5_do_bond_work [mlx5_core] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : mlx5_del_flow_rules+0x24/0x2c0 [mlx5_core] lr : mlx5_lag_destroy_definer+0x54/0x100 [mlx5_core] sp : ffff800085fafb00 x29: ffff800085fafb00 x28: ffff0000da0c8000 x27: 0000000000000000 x26: ffff0000da0c8000 x25: ffff0000da0c8000 x24: ffff0000da0c8000 x23: ffff0000c31f81a0 x22: 0400000000000000 x21: ffff0000da0c8000 x20: 0000000000000000 x19: 0000000000000001 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffff8b0c9350 x14: 0000000000000000 x13: ffff800081390d18 x12: ffff800081dc3cc0 x11: 0000000000000001 x10: 0000000000000b10 x9 : ffff80007ab7304c x8 : ffff0000d00711f0 x7 : 0000000000000004 x6 : 0000000000000190 x5 : ffff00027edb3010 x4 : 0000000000000000 x3 : 0000000000000000 x2 : ffff0000d39b8000 x1 : ffff0000d39b8000 x0 : 0400000000000000 Call trace: mlx5_del_flow_rules+0x24/0x2c0 [mlx5_core] mlx5_lag_destroy_definer+0x54/0x100 [mlx5_core] mlx5_lag_destroy_definers+0xa0/0x108 [mlx5_core] mlx5_lag_port_sel_create+0x2d4/0x6f8 [mlx5_core] mlx5_activate_lag+0x60c/0x6f8 [mlx5_core] mlx5_do_bond_work+0x284/0x5c8 [mlx5_core] process_one_work+0x170/0x3e0 worker_thread+0x2d8/0x3e0 kthread+0x11c/0x128 ret_from_fork+0x10/0x20 Code: a9025bf5 aa0003f6 a90363f7 f90023f9 (f9400400) ---[ end trace 0000000000000000 ]--- Fixes: dc48516 ("net/mlx5: Lag, add support to create definers for LAG") Signed-off-by: Mark Zhang <[email protected]> Reviewed-by: Leon Romanovsky <[email protected]> Reviewed-by: Mark Bloch <[email protected]> Reviewed-by: Jacob Keller <[email protected]> Signed-off-by: Tariq Toukan <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Generalize the current hardcoded tests of specific filesystem type names used to determine whether to support setting per-file security contexts via setxattr on a genfscon-labeled filesystem and whether to initially label the files from policy based on pathname from the root of the filesystem. The former is only safe if the filesystem either implements its own setxattr handler for security labels or the filesystem pins its inodes in memory, as otherwise the label may not be preserved for the lifetime of the file. The latter is only safe if the filesystem does not permit userspace to modify the directory tree (i.e. no .create/.link/.rename methods or filesystem is not mountable by userspace), as otherwise userspace can potentially cause files to move in and out of a given label or to be accessible under different labels depending on which path is first looked up. We currently permit the former for sysfs (implements its own handler that saves/restores the value when the inode is evicted and later re-created from a backing data structure), and for pstore, debugfs, and rootfs (all of which pin their inodes in memory). We currently permit the latter for debugfs, sysfs, and pstore, as the first two do not permit any userspace manipulation of directories and the latter only permits unlink, which causes no issues by itself. We either need some way to detect which filesystems are safe to use in the kernel or specify the whitelists of filesystem type names in the policy.
The text was updated successfully, but these errors were encountered: