Test for-next (regular, SELF kvm) #1624

Open
kdave wants to merge 10000 commits into ci-kvm from for-next

kdave commented Mar 5, 2026

Member

Keep this open, the build tests are on self-hosted workers.

kdave commented Mar 5, 2026

Member Author

Re #1623.

kdave changed the title from "Test for-next (regular, GH kvm) 2" to "Test for-next (regular, GH kvm)" Mar 5, 2026
kdave closed this Mar 5, 2026
kdave reopened this Mar 5, 2026
kdave force-pushed the ci-kvm branch 2 times, most recently from 69fc6c9 to 98bf7e7 Compare March 5, 2026 23:30
kdave force-pushed the for-next branch 3 times, most recently from 934d926 to 0cb5a8a Compare March 13, 2026 11:46
kdave force-pushed the ci-kvm branch 3 times, most recently from 2cd3911 to c09d7cb Compare March 13, 2026 19:11
kdave force-pushed the for-next branch 8 times, most recently from c61e262 to daed989 Compare March 17, 2026 16:00
kdave force-pushed the for-next branch 4 times, most recently from 9dc51d6 to d76ed94 Compare March 19, 2026 13:21
fdmanana and others added 30 commits June 9, 2026 18:22
While debugging a relocation issue I hit an assertion in backref.c, but it
was not very useful, since it did not report the unexpected value that
triggered the assertion. The stack trace was this:

  [583246.338097] assertion failed: !cache->nr_nodes, in fs/btrfs/backref.c:3158
  [583246.339588] ------------[ cut here ]------------
  [583246.340573] kernel BUG at fs/btrfs/backref.c:3158!
  [583246.342075] Oops: invalid opcode: 0000 [#1] SMP PTI
  [583246.343294] CPU: 5 UID: 0 PID: 677957 Comm: btrfs Not tainted 7.1.0-rc4-btrfs-next-234+ #1 PREEMPT(full)
  [583246.345715] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
  [583246.348694] RIP: 0010:btrfs_backref_release_cache.cold+0x61/0x84 [btrfs]
  [583246.350759] Code: 90 d5 7c (...)
  [583246.354923] RSP: 0018:ffffd4fc88c93ad8 EFLAGS: 00010246
  [583246.355982] RAX: 000000000000003e RBX: ffff8dec90d97020 RCX: 0000000000000000
  [583246.357459] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00000000ffffffff
  [583246.359517] RBP: ffff8dec8eeb78c0 R08: 0000000000000000 R09: 3fffffffffefffff
  [583246.361180] R10: ffffd4fc88c93970 R11: 0000000000000003 R12: ffff8decd21f3470
  [583246.363184] R13: 00000000fffffffe R14: ffff8decd21f3000 R15: ffff8decd21f3000
  [583246.364666] FS:  00007f9a51751400(0000) GS:ffff8df3f4255000(0000) knlGS:0000000000000000
  [583246.366287] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [583246.367443] CR2: 00007f9a518ed8f5 CR3: 00000004467c8002 CR4: 0000000000370ef0
  [583246.368969] Call Trace:
  [583246.369541]  <TASK>
  [583246.370040]  relocate_block_group+0xf2/0x520 [btrfs]
  [583246.371243]  btrfs_relocate_block_group+0x9a9/0x22e0 [btrfs]
  [583246.372443]  ? preempt_count_add+0x47/0xa0
  [583247.532978]  ? btrfs_tree_read_lock_nested+0x19/0x90 [btrfs]
  [583247.534520]  ? mutex_lock+0x1a/0x40
  [583247.602233]  ? btrfs_scrub_pause+0x2e/0x120 [btrfs]
  [583247.603543]  btrfs_relocate_chunk+0x3b/0x1a0 [btrfs]
  [583247.604893]  btrfs_balance+0x9d5/0x1920 [btrfs]
  [583247.606189]  ? preempt_count_add+0x69/0xa0
  [583247.607030]  btrfs_ioctl+0x260c/0x2a20 [btrfs]
  [583247.608015]  ? __memcg_slab_free_hook+0x156/0x1a0
  [583247.636971]  __x64_sys_ioctl+0x92/0xe0
  [583247.679247]  do_syscall_64+0x60/0xf20
  [583247.753297]  ? clear_bhb_loop+0x60/0xb0
  [583247.756321]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [583247.787018] RIP: 0033:0x7f9a5186a8db
  [583247.787787] Code: 00 48 89 (...)
  [583247.791410] RSP: 002b:00007fff2ffa6ac0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [583247.792897] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f9a5186a8db
  [583247.794319] RDX: 00007fff2ffa6bb0 RSI: 00000000c4009420 RDI: 0000000000000003
  [583247.795714] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
  [583247.797149] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff2ffa903f
  [583247.798685] R13: 00007fff2ffa6bb0 R14: 0000000000000002 R15: 0000000000000002
  [583247.800136]  </TASK>

So update all simple assertions in backref.c to print out the values when
they aren't testing simple boolean conditions.
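
A minimal userspace sketch of the idea, not the actual btrfs macro: the
names VERBOSE_CHECK, backref_cache_sketch and release_cache_check are
hypothetical, and the message is written to a buffer instead of
triggering a BUG so the behaviour can be observed. It only illustrates
how the failing value can travel with the assertion message.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical check macro: evaluates to 1 when @cond holds. Otherwise
 * it formats the stringified condition plus the caller supplied values
 * into @buf and evaluates to 0.
 */
#define VERBOSE_CHECK(buf, len, cond, fmt, ...)                          \
	((cond) ? 1 :                                                     \
	 (snprintf((buf), (len), "assertion failed: %s :: " fmt, #cond,   \
		   __VA_ARGS__), 0))

/* Toy stand-in for the cache structure; only the counter matters here. */
struct backref_cache_sketch {
	int nr_nodes;
};

/* Same shape as the check in the trace: the node count must be zero. */
int release_cache_check(const struct backref_cache_sketch *cache,
			char *buf, size_t len)
{
	return VERBOSE_CHECK(buf, len, !cache->nr_nodes, "nr_nodes=%d",
			     cache->nr_nodes);
}
```

With nr_nodes set to 3 the buffer holds the condition text followed by
"nr_nodes=3", which is the piece of information the original trace
lacked.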

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

[BUG]
The test case generic/362 fails with the "nodatasum" mount option (*):

 MOUNT_OPTIONS -- -o nodatasum /dev/mapper/test-scratch1 /mnt/scratch

 generic/362  0s ... - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad)
    --- tests/generic/362.out	2024-08-24 15:31:37.200000000 +0930
    +++ /home/adam/xfstests/results//generic/362.out.bad	2026-05-27 10:21:17.574771567 +0930
    @@ -1,2 +1,3 @@
     QA output created by 362
    +First write failed: Input/output error
     Silence is golden
    ...

*: If the test case has been executed before with the default data checksum,
the failure will not reproduce. The following fix is needed to make it
reliably reproducible:
https://lore.kernel.org/linux-btrfs/20260528111659.87113-1-wqu@suse.com/

[CAUSE]
Inside __iomap_dio_rw(), the -EFAULT/-ENOTBLK error is not directly returned.
Thus we never got an error pointer from __iomap_dio_rw().

The call chain looks like this:

 btrfs_direct_write()
 |- btrfs_dio_write()
 |-  __iomap_dio_rw()
 |  |- iomap_iter()
 |  |  |- btrfs_dio_iomap_begin()
 |  |     Now an ordered extent is allocated for the 4K write.
 |  |
 |  |- iomi.status = iomap_dio_iter()
 |  |  Where iomap_dio_iter() returned -EFAULT.
 |  |
 |  |- ret = iomap_iter()
 |  |  |- btrfs_dio_iomap_end()
 |  |  |  |- btrfs_finish_ordered_extent(uptodate = false)
 |  |  |  |  |- can_finish_ordered_extent()
 |  |  |  |     |- btrfs_mark_ordered_extent_error()
 |  |  |  |        |- mapping_set_error()
 |  |  |  |           Now the address space is marked error.
 |  |  |  | return -ENOTBLK
 |  |  |- return -ENOTBLK
 |  |- if (ret == -ENOTBLK) { ret = 0; }
 |     Now the return value is reset to 0.
 |     Thus no error pointer will be returned.
 |
 |- ret = iomap_dio_complete()
 |  Since no byte is submitted, @ret is 0.
 |
 |- Fallback to buffered IO
 |  And the buffered write finished without error
 |
 |- filemap_fdatawait_range()
    |- filemap_check_errors()
       The previous error is recorded, thus an error is returned

However, the buffered write is properly submitted and finished; the error
comes from the btrfs_finish_ordered_extent() call with @uptodate = false.

[FIX]
When a short dio write happens, btrfs_extract_ordered_extent() is called
for any range that is submitted, thus the submitted range always has an
OE covering exactly that range.

The remaining OE range is never submitted, so it should be treated as
truncated, not as an error. That way we can properly reclaim the range
and avoid inserting an unnecessary file extent item, without marking the
mapping with an error.

Extract a helper, btrfs_mark_ordered_extent_truncated(), and utilize
that helper to mark the direct IO ordered extent as truncated, so it
won't cause failure for the later buffered fallback.
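
The difference between the two ways of finishing the unsubmitted range
can be modelled in a few lines. This is a toy sketch under stated
assumptions: mapping_sketch, oe_sketch and the three functions are
hypothetical stand-ins, not the real btrfs structures or the new helper.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy address space: remembers a sticky writeback error. */
struct mapping_sketch {
	int wb_err;
};

/* Toy ordered extent. */
struct oe_sketch {
	unsigned long long num_bytes;     /* range reserved by the OE */
	unsigned long long truncated_len; /* bytes kept once truncated */
	bool truncated;
	bool ioerr;
};

/* Old behaviour: finishing with uptodate == false taints the mapping. */
void finish_oe_as_error(struct mapping_sketch *m, struct oe_sketch *oe)
{
	oe->ioerr = true;
	m->wb_err = -EIO;
}

/* New behaviour: the unsubmitted tail is dropped, nothing is tainted. */
void finish_oe_as_truncated(struct oe_sketch *oe, unsigned long long keep)
{
	oe->truncated = true;
	oe->truncated_len = keep;
}

/* What the wait-and-check step after the buffered fallback reports. */
int check_mapping_errors(const struct mapping_sketch *m)
{
	return m->wb_err;
}
```

In this model only finish_oe_as_error() makes the later check fail, which
matches the "First write failed: Input/output error" symptom described
above.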

[REASON FOR NO FIXES TAG]
The bug itself is pretty old, at commit f85781f ("btrfs: switch to
iomap for direct IO") we're already passing @uptodate=false finishing
the OE.
But at that time OE with IOERR won't call mapping_set_error(), so it's
not exposed.
Later commit d61bec0 ("btrfs: mark ordered extent and inode with
error if we fail to finish") finally exposed the bug, but that commit
is doing a correct job, not the root cause.

Anyway the bug is very old, dating back to 5.1x days, thus only CC to
stable.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

[BUG]
With the previous bug of short direct writes fixed, test case
generic/362 (*) still fails with the following error with nodatasum
mount option:

 generic/362  0s ... - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad)
 - output mismatch (see /home/adam/xfstests/results//generic/362.out.bad)
    --- tests/generic/362.out	2024-08-24 15:31:37.200000000 +0930
    +++ /home/adam/xfstests/results//generic/362.out.bad	2026-05-27 10:13:09.072485767 +0930
    @@ -1,2 +1,3 @@
     QA output created by 362
    +Wrong file size after first write, got 8192 expected 4096
     Silence is golden
    ...

*: If the test case has been executed before with the default data checksum,
the failure will not reproduce. The following fix is needed to make it
reliably reproducible:
https://lore.kernel.org/linux-btrfs/20260528111659.87113-1-wqu@suse.com/

[CAUSE]
Inside btrfs_dio_iomap_begin() for a direct write, we increase the isize
if it's beyond the current isize.

But if the direct io finished short, we do not revert the isize to the
previous value nor to the short write end.

Then if we need to fall back to buffered writes, and the write has
IOCB_APPEND flag, then the buffered write will be positioned at the
incorrect isize.

The call chain looks like this:

 btrfs_direct_write(pos=0, length=4K)
 |- __iomap_dio_rw()
 |  |- iomap_iter()
 |  |  |- btrfs_dio_iomap_begin()
 |  |     |- btrfs_get_blocks_direct_write()
 |  |        |- i_size_write()
 |  |           Which updates the isize to the write end (4K).
 |  |
 |  |- iomap_dio_iter()
 |  |  Failed with -EFAULT on the first page.
 |  |
 |  |- iomap_iter()
 |  |  |- btrfs_dio_iomap_end()
 |  |     Detects a short write, return -ENOTBLK
 |  |- if (ret == -ENOTBLK) { ret = 0;}
 |     Which resets the return value.
 |
 |- ret = iomap_dio_complete()
 |  Which returns 0.
 |
 |- btrfs_buffered_write(iocb, from);
    |- generic_write_checks()
       |- iocb->ki_pos = i_size_read()
          Which is still the new size (4K), rather than the original
          isize 0.

[FIX]
Introduce the following btrfs_dio_data members:

- old_isize

- updated_isize
  If the direct write has enlarged the isize.

Then if we got a short write, and btrfs_dio_data::updated_isize is set,
revert to the correct isize based on old_isize and current file
position.
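
A sketch of the revert rule, assuming the "correct isize" is the larger
of the old isize and the position the short write actually reached. The
struct and function names are hypothetical, and the exact rule used by
the patch may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the two new btrfs_dio_data members. */
struct dio_data_sketch {
	uint64_t old_isize;   /* isize before the direct write touched it */
	bool updated_isize;   /* did this direct write enlarge isize? */
};

/*
 * Return the isize to keep after a short write that stopped at @pos.
 * Assumption: bytes before @pos were really written, so the size must
 * not drop below @pos, nor below the size the file had before.
 */
uint64_t isize_after_short_write(const struct dio_data_sketch *d,
				 uint64_t cur_isize, uint64_t pos)
{
	if (!d->updated_isize)
		return cur_isize;
	return pos > d->old_isize ? pos : d->old_isize;
}
```

For the generic/362 case (old isize 0, nothing written, isize already
bumped to 4K) this yields 0, so an appending buffered fallback starts at
offset 0 instead of 4K.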

Here we call i_size_write() without holding an extent lock, which is safe
only in this very special case:

 - Only a single writer can be enlarging isize
   Enlarging isize will take the exclusive inode lock.

 - Buffered readers need to wait for the OE we're holding
   Buffered readers will lock extent and wait for OE of the folio range.
   Sometimes we can skip the OE wait, but since all page cache is
   invalidated, the OE wait can not be skipped.

But I do not think this is the most elegant solution, nor does it cover
all cases. E.g. if the bio is submitted but the IO fails, we are unable to
do the revert.

I believe a more elegant solution would be to extend the EXTENT_DIO_LOCKED
lifespan for direct writes, so that we can update the isize when a
write beyond EOF finishes successfully.

However that change is too large for a small bug fix.
So only implement the minimal partial fix for now.

[REASON FOR NO FIXES TAG]
The bug is again very old, before commit f85781f ("btrfs: switch to
iomap for direct IO") we are already increasing isize without a
proper rollback for short writes.

Thus only a CC to stable.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Currently btrfs_direct_write() will not try to fault in the pages, but
directly fall back to buffered writes, if the first page of the buffer
can not be faulted in.

For example, during generic/362 with nodatasum mount option, there is a
write at file offset 0, length PAGE_SIZE, and the page is not faulted in.
Then we follow the call chain below and directly fall back to buffered
IO:

 btrfs_direct_write()
 |- btrfs_dio_write()
 |-  __iomap_dio_rw()
 |  |- iomap_iter()
 |  |  |- btrfs_dio_iomap_begin()
 |  |     Now an ordered extent is allocated for the 4K write.
 |  |
 |  |- iomi.status = iomap_dio_iter()
 |  |  Where iomap_dio_iter() returned -EFAULT.
 |  |
 |  |- ret = iomap_iter()
 |  |  |- btrfs_dio_iomap_end()
 |  |  |  | return -ENOTBLK
 |  |  |- return -ENOTBLK
 |  |- if (ret == -ENOTBLK) { ret = 0; }
 |     Now the return value is reset to 0.
 |
 |- ret = iomap_dio_complete()
 |  Since no byte is submitted, @ret is now zero.
 |
 |- if (iov_iter_count() > 0 && (ret == -EFAULT || ret > 0))
 |  @ret is zero, thus not meeting the above retry condition
 |
 |- Fallback to buffered

Just slightly loosen the condition to allow retry faulting in pages after
a zero sized short write.
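
The two predicates can be compared side by side. The old one is the
condition quoted in the call chain above; the new one assumes the
loosening simply turns "ret > 0" into "ret >= 0", which is a guess at the
shape of the change rather than the patch itself. Function names are
hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Condition quoted in the call chain: a zero return never retries. */
bool should_retry_old(size_t bytes_left, long ret)
{
	return bytes_left > 0 && (ret == -EFAULT || ret > 0);
}

/* Assumed loosened form: a zero-sized short write also retries. */
bool should_retry_new(size_t bytes_left, long ret)
{
	return bytes_left > 0 && (ret == -EFAULT || ret >= 0);
}
```

A real implementation also needs a way to stop retrying when faulting in
the pages makes no progress; that part is omitted from the sketch.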

Unlike the previous two, this issue does not cause any real bug; it only
reduces the chance of doing zero-copy direct IO.
Thus the fix requires neither a stable CC nor a Fixes tag.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

…nput

lzo_decompress_bio() validates each on-disk segment length seg_len only
against the workspace cbuf size, not against the compressed input size
(compressed_len, the total folio bytes of the bio).  A crafted extent can
carry a segment whose seg_len passes the cbuf check but runs past the end
of the bio, so copy_compressed_segment() walks off the last folio:
get_current_folio() then returns the NULL folio from bio_next_folio(), and
with CONFIG_BTRFS_ASSERT disabled (default) folio_size(NULL) faults.

 BUG: KASAN: null-ptr-deref in lzo_decompress_bio (fs/btrfs/lzo.c:383)
 Read of size 8 at addr 0000000000000000 by task kworker/u8:1/29
 Workqueue: btrfs-endio simple_end_io_work
  kasan_report (mm/kasan/report.c:590)
  lzo_decompress_bio (fs/btrfs/lzo.c:383)
  end_bbio_compressed_read (fs/btrfs/compression.c:1065)
  btrfs_bio_end_io (fs/btrfs/bio.c:135)
  btrfs_check_read_bio (fs/btrfs/bio.c:180 fs/btrfs/bio.c:285)
  simple_end_io_work
  process_one_work
  worker_thread

Reject any segment whose payload would extend beyond compressed_len before
copying it, treating it as corruption like the other on-disk validation
failures in this function.
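
The added validation amounts to one extra bounds check. Below is a
standalone sketch of it; lzo_segment_fits() and its parameter cur_in (the
current read offset into the compressed input) are hypothetical names,
and the real function tracks its position differently.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true when a segment of @seg_len bytes starting at offset
 * @cur_in fits both in the workspace buffer and in the compressed input.
 */
bool lzo_segment_fits(uint32_t cur_in, uint32_t seg_len,
		      uint32_t compressed_len, uint32_t cbuf_size)
{
	if (seg_len > cbuf_size)        /* the pre-existing check */
		return false;
	if (cur_in > compressed_len)    /* cursor is already out of range */
		return false;
	/* Subtract instead of adding so the comparison cannot overflow. */
	return seg_len <= compressed_len - cur_in;
}
```

A segment that passes the workspace check but starts 96 bytes before the
end of the input with a length of 200 is rejected, which is the crafted
case described above.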

Reported-by: Xiang Mei <xmei5@asu.edu>
Fixes: a6e66e6 ("btrfs: rework lzo_decompress_bio() to make it subpage compatible")
Assisted-by: Claude:claude-opus-4-8
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>

It does not make sense for the single caller to have the responsibility
of locking the relocation mutex before calling the function and then have
the function assert that the lock is held. As this is a function in
relocation.c, move the locking details into it.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

There's no point in having the WARN_ON(1) inside the if statement for the
unexpected error. Move it into the if statement's condition, which brings
a couple of benefits:

1) It marks the branch as unlikely, hinting the compiler to generate
   better code;

2) With the WARN_ON() inside the branch, the stack trace is produced after
   the dumped leaf and error message, which can hide that more important
   information in case we get a truncated dmesg/syslog.
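
A userspace sketch of the resulting pattern. WARN_ON_SKETCH is a
hypothetical stand-in for the kernel macro, written with a GNU C
statement expression (so GCC or Clang is assumed), and the fprintf()
calls stand in for the stack trace, the leaf dump and the error message.

```c
#include <stdio.h>

/*
 * Stand-in for the kernel macro: reports when @cond is true, marks that
 * case as unlikely for the compiler, and evaluates to the condition so
 * it can be used directly inside an if ().
 */
#define WARN_ON_SKETCH(cond) __extension__ ({                        \
	int warned_ = !!(cond);                                      \
	if (__builtin_expect(warned_, 0))                            \
		fprintf(stderr, "WARNING at %s:%d\n",                \
			__FILE__, __LINE__);                         \
	warned_;                                                     \
})

/* Returns how many report lines follow the warning for @ret. */
int handle_unexpected_error(int ret)
{
	int lines = 0;

	if (WARN_ON_SKETCH(ret != 0)) {  /* the warning is emitted first */
		fprintf(stderr, "leaf dump would go here\n");
		fprintf(stderr, "unexpected error %d\n", ret);
		lines = 2;
	}
	return lines;
}
```

With the warning in the condition, the leaf dump and the error message
become the last lines printed for the event.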

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

If we get a failure during relocation, before we update all the extent
buffers that have file extent items pointing to extents from the block
group being relocated, we can trigger a use-after-free on the reloc
control structure (fs_info->reloc_ctl) if we have a concurrent task
that is COWing a subvolume leaf.

This happens like this:

1) Relocation of data block group X starts;

2) Relocation changes its state to UPDATE_DATA_PTRS;

3) A task doing a rename, for example, COWs leaf A from a subvolume tree,
   ends up at btrfs_reloc_cow_block() and extracts fs_info->reloc_ctl
   into a local variable, which it then passes to replace_file_extents();

4) The relocation task gets an error and under the label 'out_put_bg' in
   btrfs_relocate_block_group() calls free_reloc_control(), which frees
   the reloc control structure that the rename task is using;

5) The rename task triggers a use-after-free on the reloc control
   structure that was just freed.

Syzbot reported this recently, with the following stack trace:

   [   88.389822][ T5325] BTRFS error (device loop0 state A): Transaction aborted (error -5)
   [   88.389842][ T5325] BTRFS: error (device loop0 state A) in cleanup_transaction:2067: errno=-5 IO failure
   [   88.389864][ T5325] BTRFS info (device loop0 state EA): forced readonly
   [   88.392277][ T5324] BTRFS: error (device loop0 state EA) in btrfs_sync_log:3572: errno=-5 IO failure
   [   88.396630][ T5325] BTRFS info (device loop0 state EA): balance: ended with status: -5
   [   88.400135][ T5346] ==================================================================
   [   88.400148][ T5346] BUG: KASAN: slab-use-after-free in replace_file_extents+0x85f/0x1590
   [   88.400288][ T5346] Read of size 8 at addr ffff888012312010 by task syz.0.0/5346
   [   88.400299][ T5346]
   [   88.400306][ T5346] CPU: 0 UID: 0 PID: 5346 Comm: syz.0.0 Not tainted syzkaller #0 PREEMPT(full)
   [   88.400319][ T5346] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
   [   88.400325][ T5346] Call Trace:
   [   88.400331][ T5346]  <TASK>
   [   88.400336][ T5346]  dump_stack_lvl+0xe8/0x150
   [   88.400351][ T5346]  print_address_description+0x55/0x1e0
   [   88.400364][ T5346]  ? replace_file_extents+0x85f/0x1590
   [   88.400378][ T5346]  print_report+0x58/0x70
   [   88.400389][ T5346]  kasan_report+0x117/0x150
   [   88.400405][ T5346]  ? replace_file_extents+0x85f/0x1590
   [   88.400420][ T5346]  replace_file_extents+0x85f/0x1590
   [   88.400440][ T5346]  ? __pfx_replace_file_extents+0x10/0x10
   [   88.400452][ T5346]  ? update_ref_for_cow+0xa71/0x1270
   [   88.400473][ T5346]  btrfs_force_cow_block+0xa4d/0x2450
   [   88.400492][ T5346]  ? __pfx_btrfs_force_cow_block+0x10/0x10
   [   88.400508][ T5346]  ? __pfx_btrfs_get_32+0x10/0x10
   [   88.400523][ T5346]  btrfs_cow_block+0x3c4/0xa90
   [   88.400542][ T5346]  push_leaf_left+0x2ac/0x4a0
   [   88.400561][ T5346]  split_leaf+0xd16/0x12e0
   [   88.400574][ T5346]  ? btrfs_bin_search+0x924/0xc70
   [   88.400592][ T5346]  ? __pfx_split_leaf+0x10/0x10
   [   88.400602][ T5346]  ? leaf_space_used+0x177/0x1e0
   [   88.400618][ T5346]  ? btrfs_leaf_free_space+0x14a/0x2f0
   [   88.400634][ T5346]  btrfs_search_slot+0x2641/0x2d20
   [   88.400654][ T5346]  ? __pfx_btrfs_search_slot+0x10/0x10
   [   88.400669][ T5346]  ? rcu_is_watching+0x15/0xb0
   [   88.400681][ T5346]  ? trace_kmem_cache_alloc+0x29/0xe0
   [   88.400694][ T5346]  btrfs_insert_empty_items+0x9c/0x190
   [   88.400711][ T5346]  btrfs_insert_inode_ref+0x229/0xcb0
   [   88.400724][ T5346]  ? __pfx_btrfs_insert_inode_ref+0x10/0x10
   [   88.400736][ T5346]  ? __pfx_btrfs_qgroup_convert_reserved_meta+0x10/0x10
   [   88.400751][ T5346]  ? btrfs_record_root_in_trans+0x124/0x180
   [   88.400767][ T5346]  ? start_transaction+0x8a0/0x1820
   [   88.400778][ T5346]  ? btrfs_set_inode_index+0x5e/0x100
   [   88.400787][ T5346]  btrfs_rename2+0x17bb/0x40d0
   [   88.400800][ T5346]  ? check_noncircular+0xda/0x150
   [   88.400814][ T5346]  ? add_lock_to_list+0xc7/0x100
   [   88.400828][ T5346]  ? __pfx_btrfs_rename2+0x10/0x10
   [   88.400842][ T5346]  ? lockdep_hardirqs_on+0x7a/0x110
   [   88.400901][ T5346]  ? lock_acquire+0x221/0x350
   [   88.400915][ T5346]  ? down_write_nested+0x174/0x210
   [   88.400931][ T5346]  ? __pfx_down_write_nested+0x10/0x10
   [   88.400941][ T5346]  ? do_raw_spin_unlock+0x4d/0x210
   [   88.400952][ T5346]  ? try_break_deleg+0x5b/0x180
   [   88.400963][ T5346]  ? __pfx_btrfs_rename2+0x10/0x10
   [   88.400973][ T5346]  vfs_rename+0xa96/0xeb0
   [   88.400992][ T5346]  ? __pfx_vfs_rename+0x10/0x10
   [   88.401010][ T5346]  ovl_fill_super+0x46b7/0x5e20
   [   88.401030][ T5346]  ? __pfx_ovl_fill_super+0x10/0x10
   [   88.401042][ T5346]  ? xas_create+0x1902/0x1b90
   [   88.401060][ T5346]  ? __pfx___mutex_trylock_common+0x10/0x10
   [   88.401076][ T5346]  ? trace_contention_end+0x3d/0x140
   [   88.401094][ T5346]  ? shrinker_register+0x124/0x230
   [   88.401111][ T5346]  ? __mutex_unlock_slowpath+0x1be/0x6f0
   [   88.401127][ T5346]  ? shrinker_register+0x61/0x230
   [   88.401143][ T5346]  ? __pfx___mutex_lock+0x10/0x10
   [   88.401158][ T5346]  ? __pfx___mutex_unlock_slowpath+0x10/0x10
   [   88.401177][ T5346]  ? __raw_spin_lock_init+0x45/0x100
   [   88.401196][ T5346]  ? sget_fc+0x962/0xa40
   [   88.401208][ T5346]  ? __pfx_set_anon_super_fc+0x10/0x10
   [   88.401222][ T5346]  ? __pfx_ovl_fill_super+0x10/0x10
   [   88.401241][ T5346]  get_tree_nodev+0xbb/0x150
   [   88.401257][ T5346]  vfs_get_tree+0x92/0x2a0
   [   88.401272][ T5346]  do_new_mount+0x341/0xd30
   [   88.401283][ T5346]  ? apparmor_capable+0x126/0x170
   [   88.401301][ T5346]  ? __pfx_do_new_mount+0x10/0x10
   [   88.401311][ T5346]  ? ns_capable+0x89/0xe0
   [   88.401322][ T5346]  ? path_mount+0x690/0x10e0
   [   88.401333][ T5346]  ? user_path_at+0xd4/0x160
   [   88.401346][ T5346]  __se_sys_mount+0x31d/0x420
   [   88.401358][ T5346]  ? __pfx___se_sys_mount+0x10/0x10
   [   88.401370][ T5346]  ? __x64_sys_mount+0x20/0xc0
   [   88.401381][ T5346]  ? entry_SYSCALL_64_after_hwframe+0x77/0x7f
   [   88.401391][ T5346]  do_syscall_64+0x15f/0xf80
   [   88.401403][ T5346]  ? trace_irq_disable+0x3b/0x140
   [   88.401413][ T5346]  ? clear_bhb_loop+0x40/0x90
   [   88.401421][ T5346]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
   [   88.401429][ T5346] RIP: 0033:0x7fa1ff79ce59
   [   88.401436][ T5346] Code: ff c3 66 (...)
   [   88.401443][ T5346] RSP: 002b:00007fa2005affe8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
   [   88.401456][ T5346] RAX: ffffffffffffffda RBX: 00007fa1ffa16180 RCX: 00007fa1ff79ce59
   [   88.401464][ T5346] RDX: 0000200000000100 RSI: 0000200000002240 RDI: 0000000000000000
   [   88.401474][ T5346] RBP: 00007fa1ff832d6f R08: 0000200000000440 R09: 0000000000000000
   [   88.401481][ T5346] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
   [   88.401488][ T5346] R13: 00007fa1ffa16218 R14: 00007fa1ffa16180 R15: 00007ffc734fba78
   [   88.401500][ T5346]  </TASK>
   [   88.401506][ T5346]
   [   88.401510][ T5346] Allocated by task 5325:
   [   88.401516][ T5346]  kasan_save_track+0x3e/0x80
   [   88.401529][ T5346]  __kasan_kmalloc+0x93/0xb0
   [   88.401542][ T5346]  __kmalloc_cache_noprof+0x31c/0x660
   [   88.401554][ T5346]  btrfs_relocate_block_group+0x217/0xc40
   [   88.401568][ T5346]  btrfs_relocate_chunk+0x115/0x820
   [   88.401577][ T5346]  __btrfs_balance+0x1db0/0x2ae0
   [   88.401587][ T5346]  btrfs_balance+0xaf3/0x11b0
   [   88.401596][ T5346]  btrfs_ioctl_balance+0x3d3/0x610
   [   88.401612][ T5346]  __se_sys_ioctl+0xfc/0x170
   [   88.401626][ T5346]  do_syscall_64+0x15f/0xf80
   [   88.401640][ T5346]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
   [   88.401650][ T5346]
   [   88.401653][ T5346] Freed by task 5325:
   [   88.401659][ T5346]  kasan_save_track+0x3e/0x80
   [   88.401671][ T5346]  kasan_save_free_info+0x46/0x50
   [   88.401680][ T5346]  __kasan_slab_free+0x5c/0x80
   [   88.401692][ T5346]  kfree+0x1c5/0x640
   [   88.401703][ T5346]  btrfs_relocate_block_group+0x95d/0xc40
   [   88.401715][ T5346]  btrfs_relocate_chunk+0x115/0x820
   [   88.401724][ T5346]  __btrfs_balance+0x1db0/0x2ae0
   [   88.401733][ T5346]  btrfs_balance+0xaf3/0x11b0
   [   88.401742][ T5346]  btrfs_ioctl_balance+0x3d3/0x610
   [   88.401757][ T5346]  __se_sys_ioctl+0xfc/0x170
   [   88.401770][ T5346]  do_syscall_64+0x15f/0xf80
   [   88.401785][ T5346]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
   [   88.401795][ T5346]
   [   88.401798][ T5346] The buggy address belongs to the object at ffff888012312000
   [   88.401798][ T5346]  which belongs to the cache kmalloc-2k of size 2048
   [   88.401807][ T5346] The buggy address is located 16 bytes inside of
   [   88.401807][ T5346]  freed 2048-byte region [ffff888012312000, ffff888012312800)
   [   88.401819][ T5346]
   [   88.401822][ T5346] The buggy address belongs to the physical page:
   [   88.401829][ T5346] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x12310
   [   88.401840][ T5346] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
   [   88.401849][ T5346] flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff)
   [   88.401860][ T5346] page_type: f5(slab)
   [   88.401871][ T5346] raw: 00fff00000000040 ffff88801ac42000 dead000000000100 dead000000000122
   [   88.401881][ T5346] raw: 0000000000000000 0000000800080008 00000000f5000000 0000000000000000
   [   88.401892][ T5346] head: 00fff00000000040 ffff88801ac42000 dead000000000100 dead000000000122
   [   88.401902][ T5346] head: 0000000000000000 0000000800080008 00000000f5000000 0000000000000000
   [   88.401913][ T5346] head: 00fff00000000003 fffffffffffffe01 00000000ffffffff 00000000ffffffff
   [   88.401923][ T5346] head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000008
   [   88.401929][ T5346] page dumped because: kasan: bad access detected
   [   88.401935][ T5346] page_owner tracks the page as allocated
   [   88.401941][ T5346] page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 9, tgid 9 (kworker/0:0), ts 83905464494, free_ts 83674944822
   [   88.401961][ T5346]  post_alloc_hook+0x231/0x280
   [   88.401975][ T5346]  get_page_from_freelist+0x24ba/0x2540
   [   88.401990][ T5346]  __alloc_frozen_pages_noprof+0x18d/0x380
   [   88.402004][ T5346]  allocate_slab+0x77/0x660
   [   88.402019][ T5346]  refill_objects+0x339/0x3d0
   [   88.402033][ T5346]  __pcs_replace_empty_main+0x321/0x720
   [   88.402043][ T5346]  __kmalloc_node_track_caller_noprof+0x572/0x7b0
   [   88.402055][ T5346]  __alloc_skb+0x2c1/0x7d0
   [   88.402067][ T5346]  mld_newpack+0x14c/0xc90
   [   88.402080][ T5346]  add_grhead+0x5a/0x2a0
   [   88.402093][ T5346]  add_grec+0x1452/0x1740
   [   88.402105][ T5346]  mld_ifc_work+0x6e6/0xe70
   [   88.402116][ T5346]  process_scheduled_works+0xb5d/0x1860
   [   88.402127][ T5346]  worker_thread+0xa53/0xfc0
   [   88.402138][ T5346]  kthread+0x389/0x470
   [   88.402150][ T5346]  ret_from_fork+0x514/0xb70
   [   88.402161][ T5346] page last free pid 5282 tgid 5282 stack trace:
   [   88.402168][ T5346]  __free_frozen_pages+0xbc7/0xd30
   [   88.402180][ T5346]  __slab_free+0x274/0x2c0
   [   88.402191][ T5346]  qlist_free_all+0x99/0x100
   [   88.402201][ T5346]  kasan_quarantine_reduce+0x148/0x160
   [   88.402211][ T5346]  __kasan_slab_alloc+0x22/0x80
   [   88.402221][ T5346]  __kmalloc_cache_noprof+0x2ba/0x660
   [   88.402231][ T5346]  kernfs_fop_open+0x3f0/0xda0
   [   88.402253][ T5346]  do_dentry_open+0x785/0x14e0
   [   88.402262][ T5346]  vfs_open+0x3b/0x340
   [   88.402270][ T5346]  path_openat+0x2e08/0x3860
   [   88.402281][ T5346]  do_file_open+0x23e/0x4a0
   [   88.402292][ T5346]  do_sys_openat2+0x113/0x200
   [   88.402300][ T5346]  __x64_sys_openat+0x138/0x170
   [   88.402309][ T5346]  do_syscall_64+0x15f/0xf80
   [   88.402326][ T5346]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
   [   88.402336][ T5346]
   [   88.402339][ T5346] Memory state around the buggy address:
   [   88.402345][ T5346]  ffff888012311f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
   [   88.402352][ T5346]  ffff888012311f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
   [   88.402359][ T5346] >ffff888012312000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   [   88.402365][ T5346]                          ^
   [   88.402370][ T5346]  ffff888012312080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   [   88.402380][ T5346]  ffff888012312100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   [   88.402385][ T5346] ==================================================================

Fix this by:

1) Making the reloc control structure ref counted;

2) Making every place that accesses fs_info->reloc_ctl outside the
   relocation code, which at the moment are only replace_file_extents() and
   btrfs_init_reloc_root(), get a reference count on the structure.
   There's also btrfs_update_reloc_root() that is called outside the
   relocation code, but this case is safe because it's only called in
   the transaction commit path while under the fs_info->reloc_mutex
   protection, but nevertheless grab a reference to make the code more
   consistent and avoid false alerts from AI reviews;

3) Add a spinlock to protect fs_info->reloc_ctl, since we can not take the
   fs_info->reloc_mutex as that would cause a deadlock since that lock is
   taken in the transaction commit path. That spinlock is taken before
   setting fs_info->reloc_ctl to an allocated structure, setting it to
   NULL and reading fs_info->reloc_ctl;

4) Make sure the structure is freed only when its reference count drops to
   zero.
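
The four points above can be sketched in userspace with C11 atomics. All
names ending in _sketch and the reloc_ctl_* helpers are hypothetical, an
atomic_flag stands in for the kernel spinlock, and the frees counter
exists only so the example can observe when the structure is released.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct reloc_control_sketch {
	atomic_int refs;
};

struct fs_info_sketch {
	atomic_flag lock;                        /* toy spinlock */
	struct reloc_control_sketch *reloc_ctl;  /* protected by lock */
	atomic_int frees;                        /* demo-only counter */
};

static void sketch_lock(struct fs_info_sketch *fs)
{
	while (atomic_flag_test_and_set_explicit(&fs->lock,
						 memory_order_acquire))
		;
}

static void sketch_unlock(struct fs_info_sketch *fs)
{
	atomic_flag_clear_explicit(&fs->lock, memory_order_release);
}

/* Publish a new structure; the published pointer owns one reference. */
int reloc_ctl_install(struct fs_info_sketch *fs)
{
	struct reloc_control_sketch *rc = malloc(sizeof(*rc));

	if (!rc)
		return -1;
	atomic_init(&rc->refs, 1);
	sketch_lock(fs);
	fs->reloc_ctl = rc;
	sketch_unlock(fs);
	return 0;
}

/* Users outside relocation: take a reference under the lock, or NULL. */
struct reloc_control_sketch *reloc_ctl_get(struct fs_info_sketch *fs)
{
	struct reloc_control_sketch *rc;

	sketch_lock(fs);
	rc = fs->reloc_ctl;
	if (rc)
		atomic_fetch_add(&rc->refs, 1);
	sketch_unlock(fs);
	return rc;
}

/* Drop a reference; free only when the count reaches zero. */
void reloc_ctl_put(struct fs_info_sketch *fs,
		   struct reloc_control_sketch *rc)
{
	if (atomic_fetch_sub(&rc->refs, 1) == 1) {
		free(rc);
		atomic_fetch_add(&fs->frees, 1);
	}
}

/* Relocation teardown: unpublish under the lock, then drop its ref. */
void reloc_ctl_unset(struct fs_info_sketch *fs)
{
	struct reloc_control_sketch *rc;

	sketch_lock(fs);
	rc = fs->reloc_ctl;
	fs->reloc_ctl = NULL;
	sketch_unlock(fs);
	if (rc)
		reloc_ctl_put(fs, rc);
}
```

In the scenario from the report, the rename task would hold a reference
from reloc_ctl_get() while relocation calls reloc_ctl_unset(), so the
memory stays valid until the rename task calls reloc_ctl_put().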

Reported-by: syzbot+0eea49bba18051dea35e@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6a1df323.bb0696ed.125a22.000a.GAE@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

That member is to record how many bytes are submitted for a direct
read/write, utilized by iomap_end() callback to handle short IO cases.

However iomap_end() callback is already providing an internally tracked
@written member, which is doing the same accounting and providing the
same value as btrfs_dio_data::submitted.

There is no need to duplicate the work, just remove
btrfs_dio_data::submitted.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>

That function has the following problems:

- Read/write handling scattered across different locations
  E.g. At the beginning there is a dedicated hole read handling, but
  later short read handling is at an if() branch.

- Modifying the @pos and @length parameters for short reads
  It's completely fine to modify those parameters, as they are passed by
  value, but it can still be confusing to read, as normally we would
  assume @pos and @length to be the original range.

  But for short IO handling we modify @pos/@length, and completely
  ignore @written.

- Unnecessary split for ordered extent and changeset handling
  Both OE and changeset are only for writes, but they are handled in two
  different if (write) {} blocks.

Refactor the function so that:

- Handling of reads and writes is concentrated in their own code blocks
  Now the handling of reads is in its own small if () branch.

  This leaves the more complex write handling to take the rest of the
  function, and reduces the indent level.

  This also removes all unnecessary "if (write)" checks.

- Do not modify @pos and @length
  Let short IO handling manually calculate the remaining range.
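
Computing the remaining range from the untouched parameters is a small
calculation. The sketch below shows one way to express it; byte_range and
unsubmitted_range() are hypothetical names, and it assumes @written never
exceeds @length.

```c
#include <stdint.h>

struct byte_range {
	uint64_t start;
	uint64_t len;
};

/*
 * Range the IO did not cover: [pos + written, pos + length).
 * @pos and @length keep describing the original range; the caller
 * guarantees written <= length.
 */
struct byte_range unsubmitted_range(uint64_t pos, uint64_t length,
				    uint64_t written)
{
	struct byte_range r = { pos + written, length - written };

	return r;
}
```

For a request at pos 0 with length 8192 of which 4096 bytes were written,
the remaining range starts at 4096 and is 4096 bytes long.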

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>

The qgroup ioctls update the quota tree, but they currently start their
transactions using the root of the inode passed to the ioctl. This makes
the transaction reservation depend on the path used for the ioctl instead
of the tree being modified.

Start qgroup ioctl transactions on the quota root instead. Take a reference
to fs_info->quota_root under qgroup_ioctl_lock before starting the
transaction, because quota disable can clear and put fs_info->quota_root
after the early quota-enabled check. Keep the reference until the
transaction handle is ended.

Suggested-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Dongjiang Zhu <zhudongjiang@fnnas.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
We set the xattr and then attempt to apply the property. If the apply
fails we then attempt to delete the xattr to avoid an inconsistency.
However we don't verify if the deletion succeeded, so if it fails we
leave an inconsistency between the state in the btree and the in-memory
inode.

So address this by validating first if we can apply the property, then
set the xattr, then apply the property, and this last step should not
fail since the validation succeeded before - assert that it does not fail
but leave code to attempt to delete the xattr if it happens, and then
abort the transaction only if the xattr delete failed.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
…tr_set()

We are using 2 units for properties but we only set one property.
Fix this by using the correct amount: 1 unit.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
There's no need to abort the transaction if we failed to set or delete a
property, as we haven't done any change. However we need to abort if we
set a property or delete a property and then fail to update the inode
item, as that would leave the inode's state in the subvolume tree
inconsistent.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
[TEST FAILURE]
The test case generic/628 will fail if MOUNT_OPTIONS is set to "-o
nodatasum":

 FSTYP         -- btrfs
 PLATFORM      -- Linux/x86_64 btrfs-vm 7.1.0-rc4-custom+ #383 SMP PREEMPT_DYNAMIC Sat May 30 07:35:42 ACST 2026
 MKFS_OPTIONS  -- -O bgt -K /dev/mapper/test-scratch1
 MOUNT_OPTIONS -- -o nodatasum /dev/mapper/test-scratch1 /mnt/scratch

 generic/628  1s ... - output mismatch (see /home/adam/xfstests/results//generic/628.out.bad)
    --- tests/generic/628.out	2022-05-11 11:25:30.816666664 +0930
    +++ /home/adam/xfstests/results//generic/628.out.bad	2026-06-08 18:56:49.878542927 +0930
    @@ -8,8 +8,9 @@
     310f146ce52077fcd3308dcbe7632bb2  SCRATCH_MNT/a
     310f146ce52077fcd3308dcbe7632bb2  SCRATCH_MNT/d
     test reflink flag not set iflag
    +XFS_IOC_CLONE: Invalid argument
     310f146ce52077fcd3308dcbe7632bb2  SCRATCH_MNT/a
    -310f146ce52077fcd3308dcbe7632bb2  SCRATCH_MNT/b
    +d41d8cd98f00b204e9800998ecf8427e  SCRATCH_MNT/b
    ...

[CAUSE]
The direct cause is that after "chattr +S", the btrfs inode will lose its
NODATASUM flag inherited from the mount option. E.g:

 # mkfs.btrfs -f $dev
 # mount $dev $mnt -o nodatasum
 # touch $mnt/foobar
 # sync
 # btrfs ins dump-tree -t 5 $dev | grep "(257 INODE_ITEM 0) itemoff" -A 3
	item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160
		generation 9 transid 9 size 0 nbytes 0
		block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
		sequence 1 flags 0x1(NODATASUM)
		                     ^^^^^^^^^ Proper NODATASUM flag

 # chattr +S $mnt/foobar
 # sync
 # btrfs ins dump-tree -t 5 $dev | grep "(257 INODE_ITEM 0) itemoff" -A 3
 	item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160
		generation 9 transid 10 size 0 nbytes 0
		block group 0 mode 100644 links 1 uid 0 gid 0 rdev 0
		sequence 2 flags 0x20(SYNC)
		                      ^^^^ Only the new SYNC flag

This makes the inode drop the old NODATASUM flag, while the new
reflink destination will still inherit the NODATASUM flag.
The mismatching NODATASUM flags will cause the reflink to fail.

The root cause is that, inside btrfs_fileattr_set(), if no FS_NOCOW_FL is
set, we remove both the NODATASUM and NODATACOW flags.

However we should not touch the NODATASUM flag, as data COW doesn't
require checksums.
Only NODATACOW implies NODATASUM, but DATACOW doesn't imply DATASUM.

The deeper problems are:

- Fileattr API is too binary
  It either clears or sets a flag, there is no "do not change" option.
  That is why "chattr +S" implies "chattr -C", and is forcing us to
  change NODATACOW along with the NODATASUM flag.

- No way to change NODATASUM through fileattr API
  In fact NODATASUM can only be modified through mount option.

The deeper problems are much harder to attack.

[FIX]
Remove NODATACOW flag when FS_NOCOW_FL is not set, but only remove
NODATASUM if "nodatasum" mount option is not set.

This allows the existing "chattr +C" then "chattr -C" to remove
both NODATACOW and NODATASUM flags on a default mount.

But for a mount with the "nodatasum" option, the NODATASUM inode flag
will persist through both "chattr +C" and "chattr -C".
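
The rule in [FIX] can be sketched in userspace as below. This is only an
illustration under stated assumptions: demo_update_nocow_flags() is a
hypothetical helper, the DEMO_INODE_NODATACOW bit value is assumed (only
0x1 and 0x20 appear in the dump-tree output above), and the "set" branch
assumes the zero sized regular file case where NODATACOW also sets
NODATASUM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Demo bit values. 0x1 and 0x20 match the dump-tree output above. */
#define DEMO_INODE_NODATASUM	(UINT64_C(1) << 0)
#define DEMO_INODE_NODATACOW	(UINT64_C(1) << 1)	/* assumed value */
#define DEMO_INODE_SYNC		(UINT64_C(1) << 5)

static_assert(DEMO_INODE_NODATASUM == 0x1, "matches flags 0x1(NODATASUM)");
static_assert(DEMO_INODE_SYNC == 0x20, "matches flags 0x20(SYNC)");

/*
 * Return the new inode flags for a fileattr update.
 * want_nocow:      whether FS_NOCOW_FL is set in the request.
 * mount_nodatasum: whether the fs is mounted with "nodatasum".
 */
static uint64_t demo_update_nocow_flags(uint64_t flags, bool want_nocow,
					bool mount_nodatasum)
{
	if (want_nocow) {
		/* NODATACOW implies NODATASUM. */
		flags |= DEMO_INODE_NODATACOW | DEMO_INODE_NODATASUM;
	} else {
		flags &= ~DEMO_INODE_NODATACOW;
		/* DATACOW does not imply DATASUM: honor the mount option. */
		if (!mount_nodatasum)
			flags &= ~DEMO_INODE_NODATASUM;
	}
	return flags;
}
```

With this rule, the "chattr +S" case from [CAUSE] on a "nodatasum" mount
ends up with flags 0x21 (NODATASUM|SYNC) instead of 0x20.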

Fixes: 7e97b8d ("btrfs: allow setting NOCOW for a zero sized file via ioctl")
Cc: stable@vger.kernel.org
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
When loading a v1 free space cache, __load_free_space_cache() takes
num_entries and num_bitmaps straight from the on-disk
btrfs_free_space_header. That header is stored in the tree_root under a key
with type 0, which the tree-checker has no case for, so neither count is
validated before the load trusts it.

The load loops num_entries times and maps the next page whenever the current
one runs out, going through io_ctl_check_crc() -> io_ctl_map_page(), which
does io_ctl->pages[io_ctl->index++]. But pages[] is allocated in
io_ctl_init() from the cache inode's i_size, not from num_entries:

	num_pages = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
	io_ctl->pages = kcalloc(num_pages, sizeof(struct page *), GFP_NOFS);

So if num_entries claims more records than the pages can hold, io_ctl->index
runs off the end of pages[]. The write side never hits this because
io_ctl_add_entry() and io_ctl_add_bitmap() both stop once
io_ctl->index >= io_ctl->num_pages; the read side just never had the same
check.

To trigger it, take a clean cache (num_entries = <N> here), set num_entries
in the header to 0x10000, and fix up the leaf checksum so it still passes
the tree-checker. The cache inode has i_size = 65536, so num_pages is 16 and
pages[] is a 16-pointer (kmalloc-128) array. The load now tries to read
65536 entries, io_ctl->index walks up to 16, and pages[16] is read past the
array:

  BUG: KASAN: slab-out-of-bounds in io_ctl_check_crc (fs/btrfs/free-space-cache.c:420 fs/btrfs/free-space-cache.c:565)
  Read of size 8 at addr ffff88800c833a80 by task kworker/u8:3/58
   io_ctl_check_crc (fs/btrfs/free-space-cache.c:420 fs/btrfs/free-space-cache.c:565)
   __load_free_space_cache (fs/btrfs/free-space-cache.c:655 fs/btrfs/free-space-cache.c:820)
   load_free_space_cache (fs/btrfs/free-space-cache.c:1017)
   caching_thread (fs/btrfs/block-group.c:880)
   btrfs_work_helper (fs/btrfs/async-thread.c:312)
   process_one_work
   worker_thread
   kthread
   ret_from_fork

free-space-cache.c:420 is io_ctl_map_page(), inlined into io_ctl_check_crc()
at line 565, which is why that is the frame KASAN names. The out-of-bounds
slot is then treated as a struct page and handed to crc32c(), so the bad
read turns into a GP fault.

Add the missing check to io_ctl_check_crc(), which is where both the entry
loop and the bitmap loop end up. When num_entries is too large the load now
fails like any corrupt cache: __load_free_space_cache() drops it and rebuilds
the free space from the extent tree, so a valid cache is never rejected.
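
The arithmetic and the kind of bound the read side was missing can be
sketched in userspace as follows. This is not the kernel patch: struct
demo_io_ctl, demo_num_pages() and demo_map_next_page() are hypothetical
stand-ins, a 4096 byte page size is assumed, and the sketch places the
check in the mapping step rather than in io_ctl_check_crc().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE	UINT64_C(4096)	/* assumed page size */

/* Hypothetical, reduced stand-in for the io_ctl bookkeeping. */
struct demo_io_ctl {
	size_t index;		/* next slot of pages[] to map */
	size_t num_pages;	/* number of slots allocated in pages[] */
};

/* Same rounding as DIV_ROUND_UP(i_size, PAGE_SIZE). */
static size_t demo_num_pages(uint64_t i_size)
{
	return (size_t)((i_size + DEMO_PAGE_SIZE - 1) / DEMO_PAGE_SIZE);
}

/*
 * Advance to the next page only if it exists. Returns false instead of
 * letting index run past the end of the pages[] array.
 */
static bool demo_map_next_page(struct demo_io_ctl *ctl)
{
	assert(ctl != NULL);

	if (ctl->index >= ctl->num_pages)
		return false;
	ctl->index++;
	return true;
}
```

For the i_size of 65536 from the reproducer this gives 16 pages, so the
17th mapping attempt is refused rather than reading pages[16].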

Fixes: 5b0e95b ("Btrfs: inline checksums into the disk free space cache")
Link: https://lore.kernel.org/linux-btrfs/CAPpSM+RMPByMCKXvM5QFKToxsyNccfuFLWMdD0mfd0wh2Ja62w@mail.gmail.com/
Reported-by: Weiming Shi <bestswngs@gmail.com>
Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Reviewed-by: Qu Wenruo <wqu@suse.com>
…ubvol()

If during relocation we fail in insert_dirty_subvol() because
btrfs_update_reloc_root() returned an error, we will leave a root's
reloc_root field pointing to a reloc root that was freed instead of NULL,
resulting later in a use-after-free, or double free attempt during
unmount.

The sequence of steps is this:

1) During relocation the call to btrfs_update_reloc_root() in
   insert_dirty_subvol() fails, so insert_dirty_subvol() returns the
   error to merge_reloc_root() without adding the root to the list
   rc->dirty_subvol_roots;

2) Then merge_reloc_root() aborts the current transaction because
   insert_dirty_subvol() returned an error;

3) Up the call chain, merge_reloc_roots() gets the error, adds the
   reloc root for root X to the local reloc_roots list and jumps to the
   'out' label, where it calls free_reloc_roots() to free all the reloc
   roots in the local reloc_roots list. This frees the reloc root for
   root X;

4) We go up the call chain to relocate_block_group() which calls
   clean_dirty_subvols() to go over dirty roots and set their
   ->reloc_root field to NULL, but root X is not in the dirty_subvol_roots
   list, so its ->reloc_root still points to a reloc root;

5) Relocation finishes, with an error and a transaction abort, but the
   ->reloc_root field for root X still points to the reloc root that was
   freed in step 3;

6) When unmounting the fs we end up calling:

     btrfs_free_fs_roots()
        btrfs_drop_and_free_fs_root()
           --> calls btrfs_put_root() against root X's ->reloc_root
               which is not NULL and points to the reloc root already
               freed in step 3 above

  Resulting in a use-after-free or a double free attempt.

Syzbot reported this with the following dmesg/syslog:

   [  106.004389][ T5339] BTRFS error (device loop0 state A): Transaction aborted (error -5)
   [  106.014266][ T5339] BTRFS: error (device loop0 state A) in merge_reloc_root:1655: errno=-5 IO failure
   [  106.021891][ T1061] BTRFS error (device loop0 state A): error while writing out transaction: -5
   [  106.026964][ T1061] BTRFS warning (device loop0 state A): Skipping commit of aborted transaction.
   [  106.033807][ T5340] BTRFS error (device loop0 state A): bdev /dev/loop0 errs: wr 3, rd 0, flush 0, corrupt 0, gen 0
   [  106.039265][ T1061] BTRFS: error (device loop0 state A) in cleanup_transaction:2067: errno=-5 IO failure
   [  106.044382][ T5339] BTRFS info (device loop0 state EA): forced readonly
   [  106.074329][ T5339] BTRFS: error (device loop0 state EA) in merge_reloc_roots:1887: errno=-5 IO failure
   [  106.081004][ T5356] BTRFS info (device loop0 state EA): scrub: started on devid 1
   [  106.085611][ T5339] BTRFS info (device loop0 state EA): balance: ended with status: -30
   [  106.089517][ T5356] BTRFS info (device loop0 state EA): scrub: not finished on devid 1 with status: -30
   [  106.662365][ T5338] BTRFS info (device loop0 state EA): last unmount of filesystem 3a375e4e-b156-4d76-a2ad-16e198ce1409
   [  106.682946][ T5338] ==================================================================
   [  106.686574][ T5338] BUG: KASAN: slab-use-after-free in btrfs_put_root+0x2f/0x250
   [  106.690090][ T5338] Write of size 4 at addr ffff88803f978630 by task syz.0.0/5338
   [  106.693173][ T5338]
   [  106.694279][ T5338] CPU: 0 UID: 0 PID: 5338 Comm: syz.0.0 Not tainted syzkaller #0 PREEMPT(full)
   [  106.694293][ T5338] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
   [  106.694300][ T5338] Call Trace:
   [  106.694308][ T5338]  <TASK>
   [  106.694314][ T5338]  dump_stack_lvl+0xe8/0x150
   [  106.694331][ T5338]  print_address_description+0x55/0x1e0
   [  106.694343][ T5338]  ? btrfs_put_root+0x2f/0x250
   [  106.694358][ T5338]  print_report+0x58/0x70
   [  106.694368][ T5338]  kasan_report+0x117/0x150
   [  106.694384][ T5338]  ? btrfs_put_root+0x2f/0x250
   [  106.694399][ T5338]  kasan_check_range+0x264/0x2c0
   [  106.694416][ T5338]  btrfs_put_root+0x2f/0x250
   [  106.694430][ T5338]  btrfs_drop_and_free_fs_root+0x160/0x210
   [  106.694447][ T5338]  btrfs_free_fs_roots+0x2f9/0x3c0
   [  106.694464][ T5338]  ? __pfx_btrfs_free_fs_roots+0x10/0x10
   [  106.694479][ T5338]  ? free_root_pointers+0x5bf/0x5f0
   [  106.694494][ T5338]  close_ctree+0x798/0x12d0
   [  106.694511][ T5338]  ? __pfx_close_ctree+0x10/0x10
   [  106.694526][ T5338]  ? _raw_spin_unlock_irqrestore+0x74/0x80
   [  106.694599][ T5338]  ? rcu_preempt_deferred_qs_irqrestore+0x906/0xbc0
   [  106.694620][ T5338]  ? __rcu_read_unlock+0x83/0xe0
   [  106.694636][ T5338]  ? btrfs_put_super+0x48/0x1c0
   [  106.694652][ T5338]  ? __pfx_btrfs_put_super+0x10/0x10
   [  106.694667][ T5338]  generic_shutdown_super+0x13d/0x2d0
   [  106.694682][ T5338]  kill_anon_super+0x3b/0x70
   [  106.694695][ T5338]  btrfs_kill_super+0x41/0x50
   [  106.694710][ T5338]  deactivate_locked_super+0xbc/0x130
   [  106.694722][ T5338]  cleanup_mnt+0x437/0x4d0
   [  106.694736][ T5338]  ? _raw_spin_unlock_irq+0x23/0x50
   [  106.694752][ T5338]  task_work_run+0x1d9/0x270
   [  106.694769][ T5338]  ? __pfx_task_work_run+0x10/0x10
   [  106.694784][ T5338]  ? do_raw_spin_unlock+0x4d/0x210
   [  106.694802][ T5338]  do_exit+0x70f/0x22c0
   [  106.694817][ T5338]  ? trace_irq_disable+0x3b/0x140
   [  106.694835][ T5338]  ? __pfx_do_exit+0x10/0x10
   [  106.694848][ T5338]  ? preempt_schedule_thunk+0x16/0x30
   [  106.694863][ T5338]  ? preempt_schedule_common+0x82/0xd0
   [  106.694878][ T5338]  ? preempt_schedule_thunk+0x16/0x30
   [  106.694892][ T5338]  do_group_exit+0x21b/0x2d0
   [  106.694906][ T5338]  ? entry_SYSCALL_64_after_hwframe+0x77/0x7f
   [  106.694918][ T5338]  __x64_sys_exit_group+0x3f/0x40
   [  106.694932][ T5338]  x64_sys_call+0x221a/0x2240
   [  106.694944][ T5338]  do_syscall_64+0x174/0x580
   [  106.694954][ T5338]  ? clear_bhb_loop+0x40/0x90
   [  106.694967][ T5338]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
   [  106.694978][ T5338] RIP: 0033:0x7f958ef9ce59
   [  106.694988][ T5338] Code: Unable to access opcode bytes at 0x7f958ef9ce2f.
   [  106.694994][ T5338] RSP: 002b:00007fffd4058318 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
   [  106.695008][ T5338] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f958ef9ce59
   [  106.695015][ T5338] RDX: 00007f958c3f8000 RSI: 0000000000000000 RDI: 0000000000000000
   [  106.695022][ T5338] RBP: 0000000000000003 R08: 0000000000000000 R09: 00007f958f1e73e0
   [  106.695028][ T5338] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
   [  106.695034][ T5338] R13: 00007f958f1e73e0 R14: 0000000000000003 R15: 00007fffd40583d0
   [  106.695046][ T5338]  </TASK>
   [  106.695050][ T5338]
   [  106.821635][ T5338] Allocated by task 1061:
   [  106.823446][ T5338]  kasan_save_track+0x3e/0x80
   [  106.825498][ T5338]  __kasan_kmalloc+0x93/0xb0
   [  106.827381][ T5338]  __kmalloc_cache_noprof+0x31c/0x660
   [  106.829525][ T5338]  btrfs_alloc_root+0x75/0x930
   [  106.831458][ T5338]  read_tree_root_path+0x127/0xb00
   [  106.833556][ T5338]  btrfs_read_tree_root+0x34/0x60
   [  106.835553][ T5338]  create_reloc_root+0x6b3/0xcb0
   [  106.837556][ T5338]  btrfs_init_reloc_root+0x2ec/0x4b0
   [  106.839557][ T5338]  record_root_in_trans+0x2ab/0x350
   [  106.841685][ T5338]  btrfs_record_root_in_trans+0x15c/0x180
   [  106.844237][ T5338]  start_transaction+0x39c/0x1820
   [  106.846638][ T5338]  btrfs_finish_one_ordered+0x88e/0x2680
   [  106.849436][ T5338]  btrfs_work_helper+0x37b/0xc20
   [  106.851549][ T5338]  process_scheduled_works+0xb5d/0x1860
   [  106.853807][ T5338]  worker_thread+0xa53/0xfc0
   [  106.855773][ T5338]  kthread+0x389/0x470
   [  106.857548][ T5338]  ret_from_fork+0x514/0xb70
   [  106.859493][ T5338]  ret_from_fork_asm+0x1a/0x30
   [  106.861504][ T5338]
   [  106.862527][ T5338] Freed by task 5339:
   [  106.864224][ T5338]  kasan_save_track+0x3e/0x80
   [  106.866180][ T5338]  kasan_save_free_info+0x46/0x50
   [  106.868371][ T5338]  __kasan_slab_free+0x5c/0x80
   [  106.870462][ T5338]  kfree+0x1c5/0x640
   [  106.872180][ T5338]  __del_reloc_root+0x341/0x3b0
   [  106.874290][ T5338]  free_reloc_roots+0x5f/0x90
   [  106.876282][ T5338]  merge_reloc_roots+0x73f/0x8a0
   [  106.878489][ T5338]  relocate_block_group+0xbcc/0xe70
   [  106.880742][ T5338]  do_nonremap_reloc+0xa8/0x5b0
   [  106.882885][ T5338]  btrfs_relocate_block_group+0x7e6/0xc40
   [  106.885336][ T5338]  btrfs_relocate_chunk+0x115/0x820
   [  106.887502][ T5338]  __btrfs_balance+0x1db0/0x2ae0
   [  106.889543][ T5338]  btrfs_balance+0xaf3/0x11b0
   [  106.891456][ T5338]  btrfs_ioctl_balance+0x3d3/0x610
   [  106.893672][ T5338]  __se_sys_ioctl+0xfc/0x170
   [  106.895530][ T5338]  do_syscall_64+0x174/0x580
   [  106.897518][ T5338]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
   [  106.900101][ T5338]
   [  106.901123][ T5338] The buggy address belongs to the object at ffff88803f978000
   [  106.901123][ T5338]  which belongs to the cache kmalloc-4k of size 4096
   [  106.906907][ T5338] The buggy address is located 1584 bytes inside of
   [  106.906907][ T5338]  freed 4096-byte region [ffff88803f978000, ffff88803f979000)
   [  106.912980][ T5338]
   [  106.914022][ T5338] The buggy address belongs to the physical page:
   [  106.916716][ T5338] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x3f978
   [  106.920390][ T5338] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
   [  106.923834][ T5338] flags: 0x4fff00000000040(head|node=1|zone=1|lastcpupid=0x7ff)
   [  106.927104][ T5338] page_type: f5(slab)
   [  106.928898][ T5338] raw: 04fff00000000040 ffff88801ac42140 dead000000000122 0000000000000000
   [  106.932507][ T5338] raw: 0000000000000000 0000000800040004 00000000f5000000 0000000000000000
   [  106.936193][ T5338] head: 04fff00000000040 ffff88801ac42140 dead000000000122 0000000000000000
   [  106.939856][ T5338] head: 0000000000000000 0000000800040004 00000000f5000000 0000000000000000
   [  106.943601][ T5338] head: 04fff00000000003 fffffffffffffe01 00000000ffffffff 00000000ffffffff
   [  106.947268][ T5338] head: ffffffffffffffff 0000000000000000 00000000ffffffff 0000000000000008
   [  106.950988][ T5338] page dumped because: kasan: bad access detected
   [  106.953710][ T5338] page_owner tracks the page as allocated
   [  106.956198][ T5338] page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd2820(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 24, tgid 24 (kworker/u4:2), ts 105728970387, free_ts 29540875453
   [  106.964984][ T5338]  post_alloc_hook+0x22d/0x280
   [  106.966956][ T5338]  get_page_from_freelist+0x2593/0x2610
   [  106.969307][ T5338]  __alloc_frozen_pages_noprof+0x18d/0x380
   [  106.971839][ T5338]  allocate_slab+0x77/0x660
   [  106.973709][ T5338]  refill_objects+0x339/0x3d0
   [  106.975696][ T5338]  __pcs_replace_empty_main+0x321/0x720
   [  106.978136][ T5338]  __kmalloc_node_track_caller_noprof+0x572/0x7b0
   [  106.981009][ T5338]  __alloc_skb+0x2c1/0x7d0
   [  106.982983][ T5338]  nsim_dev_trap_report_work+0x29a/0xb90
   [  106.985356][ T5338]  process_scheduled_works+0xb5d/0x1860
   [  106.987710][ T5338]  worker_thread+0xa53/0xfc0
   [  106.989847][ T5338]  kthread+0x389/0x470
   [  106.991727][ T5338]  ret_from_fork+0x514/0xb70
   [  106.993722][ T5338]  ret_from_fork_asm+0x1a/0x30
   [  106.995900][ T5338] page last free pid 77 tgid 77 stack trace:
   [  106.998479][ T5338]  __free_frozen_pages+0xc1c/0xd30
   [  107.000819][ T5338]  vfree+0x1d1/0x2f0
   [  107.002631][ T5338]  delayed_vfree_work+0x55/0x80
   [  107.004848][ T5338]  process_scheduled_works+0xb5d/0x1860
   [  107.007366][ T5338]  worker_thread+0xa53/0xfc0
   [  107.009388][ T5338]  kthread+0x389/0x470
   [  107.011177][ T5338]  ret_from_fork+0x514/0xb70
   [  107.013313][ T5338]  ret_from_fork_asm+0x1a/0x30
   [  107.015454][ T5338]
   [  107.016460][ T5338] Memory state around the buggy address:
   [  107.019052][ T5338]  ffff88803f978500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   [  107.022691][ T5338]  ffff88803f978580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   [  107.026264][ T5338] >ffff88803f978600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   [  107.029721][ T5338]                                      ^
   [  107.032062][ T5338]  ffff88803f978680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   [  107.035547][ T5338]  ffff88803f978700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   [  107.038865][ T5338] ==================================================================

Fix this by resetting a root's ->reloc_root if we get an error while
trying to merge a reloc root.

Reported-by: syzbot+b3d472d13f9d7bf20669@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/6a1ebde9.c1435f33.112120.0176.GAE@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
…oots()

If we have an unexpected reloc_root for our root, we jump to the out label
but never drop the reference we obtained for root, resulting in a leak.
Add a missing btrfs_put_root() call.

Fixes: 24213fa ("btrfs: do proper error handling in merge_reloc_roots")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
We have been running with commit root csums enabled for some time and
have noticed a slight uptick in zero csum errors. Investigating those
revealed that they were same transaction reads of extents that were just
relocated, but the extent map generation was long ago.

It turns out that relocation intentionally does not update the extent
generation (replace_file_extents()), but must write a new csum since the
data has moved, so we must account for this with commit root csum reading.

Luckily this is a short lived condition: after the relocation transaction
the commit root will once again have the csum. So we can add a generic
fallback to the lookup to try again with the transaction csum root.

Fixes: f07b855 ("btrfs: try to search for data csums in commit root")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
If the root we got has zero root refs in its root item, we are resetting
the root's ->reloc_root without using barriers like we do everywhere else.
Sashiko complained about this while reviewing another patch, and it's
correct (see the Link tag below).

Also, we should not clear BTRFS_ROOT_DEAD_RELOC_TREE from the root unless
the root points to the reloc root we have.

Fix this by using clear_reloc_root(), which issues the memory barrier
after setting the root's ->reloc_root to NULL and before clearing the bit
BTRFS_ROOT_DEAD_RELOC_TREE from the root.

Link: https://sashiko.dev/#/patchset/cf84f1a217c719e25b6b69e4298dd7afd36c9427.1781194426.git.fdmanana%40suse.com
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
When we set a root's reloc_root to NULL, we do it like this:

   static void clear_reloc_root(struct btrfs_root *root)
   {
       root->reloc_root = NULL;
       /*
        * Need barrier to ensure clear_bit() only happens after
        * root->reloc_root = NULL. Pairs with have_reloc_root().
        */
        smp_wmb();
        clear_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state);
   }

So that a NULL reloc_root is always seen before seeing that the bit
BTRFS_ROOT_DEAD_RELOC_TREE was cleared.

But on the read side we have:

   static bool reloc_root_is_dead(const struct btrfs_root *root)
   {
        smp_rmb();
        if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state))
            return true;
        return false;
   }

And then callers of reloc_root_is_dead() access root->reloc_root.

Because the read memory barrier is placed before testing the bit, the CPU
is completely free to speculatively reorder those two loads. It can read
root->reloc_root before it actually checks the dead tree bit.

Sashiko reported this as an existing problem in another patch review, see
the link in the Link tag below.

Fix this by moving the read memory barrier to happen after testing the bit
and update the comment to reflect current reality.
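
The intended ordering can be approximated in userspace with C11 atomics.
This is only an analogue, not the kernel code: struct demo_root and the
demo_*() functions are hypothetical, and release/acquire fences stand in
for smp_wmb()/smp_rmb(), which are not identical primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two fields involved. */
struct demo_root {
	_Atomic(void *) reloc_root;
	atomic_bool dead;	/* plays the role of the DEAD_RELOC_TREE bit */
};

/* Write side: clear the pointer first, then the bit. */
static void demo_clear_reloc_root(struct demo_root *root)
{
	atomic_store_explicit(&root->reloc_root, (void *)NULL,
			      memory_order_relaxed);
	/* Order the pointer store before the bit store. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&root->dead, false, memory_order_relaxed);
}

/* Read side: test the bit first, then place the barrier. */
static bool demo_reloc_root_is_dead(struct demo_root *root)
{
	bool dead = atomic_load_explicit(&root->dead, memory_order_relaxed);

	/* Order the bit load before any later load of reloc_root. */
	atomic_thread_fence(memory_order_acquire);
	return dead;
}

static void *demo_get_reloc_root(struct demo_root *root)
{
	assert(root != NULL);

	if (demo_reloc_root_is_dead(root))
		return NULL;
	return atomic_load_explicit(&root->reloc_root, memory_order_relaxed);
}
```

With the fence after the bit test, a reader that observes the bit as
cleared also observes the NULL pointer stored before it.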

Link: https://sashiko.dev/#/patchset/cf84f1a217c719e25b6b69e4298dd7afd36c9427.1781194426.git.fdmanana%40suse.com
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
…nded

The loop intends to copy the data in chunks up to 1M but we allocate the
pages array for the entire length and don't cap it to 1M. Fix this by
computing 'nr_pages' using 'copy_len' instead of 'length'.

While at it, also make 'nr_pages' and 'copy_len' const, as they never
change, to make the code more clear.
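
A small sketch of the corrected sizing, assuming a 4096 byte page size.
demo_nr_pages_for_copy() is a hypothetical helper that only mirrors the
'copy_len' and 'nr_pages' computation described above, not the actual
function being fixed.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE	UINT64_C(4096)		/* assumed page size */
#define DEMO_SZ_1M	UINT64_C(1048576)	/* chunk cap of 1M */

static_assert(DEMO_SZ_1M / DEMO_PAGE_SIZE == 256, "a 1M chunk is 256 pages");

/*
 * Size the pages array from the capped chunk length, not from the total
 * length of the data to copy.
 */
static uint64_t demo_nr_pages_for_copy(uint64_t length)
{
	const uint64_t copy_len = length < DEMO_SZ_1M ? length : DEMO_SZ_1M;
	const uint64_t nr_pages =
		(copy_len + DEMO_PAGE_SIZE - 1) / DEMO_PAGE_SIZE;

	return nr_pages;
}
```

Under these assumptions a 16M copy needs a 256 entry array, where sizing
from 'length' would have asked for 4096 entries.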

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
In Meta production, we have observed a large number of hosts running
kernels newer than 6.13 which hit hung tasks on
btrfs_read_folio()->lock_extents_for_read(). The history of this
codepath is interesting.

In 6.12, we merged
commit ac325fc ("btrfs: do not hold the extent lock for entire read")
which holds the extent lock very narrowly while looking up the
extent_map. However, this proved to introduce a serious race with DIO
writes which was fixed in 6.14 with
commit acc18e1 ("btrfs: fix stale page cache after race between readahead and direct IO write")

That latter fix subtly changed the extent unlock point from the pre-6.12
regime. In 6.11, each read endio unlocked the extent it finished
reading, but in 6.14, the extent is locked/unlocked as a unit around the
entire readahead loop, while the individual folios are still unlocked as
the endios finish. This is mostly the same behavior, as all successful
reads will populate the page cache, so subsequent reads won't enter
btrfs and hit the extent lock. But in the case where the readahead
fails, perhaps because of a memory allocation failure doing compressed
reads, the page will not be brought up to date and a later read of an
overlapping range *will* block on the extent lock.

Why is this a problem?

On sufficiently large loaded systems, I have observed that direct
reclaim can run for minutes. Given that, consider two tasks on such a
system reading an overlapping range of a compressed file:

  Task 1 locks the whole range and starts to read. Some allocation for
  the compressed read for folio F fails and we carry on while holding the
  extent lock for the full range.

  Task 2 wants to read F, which is not uptodate and in page cache, so it
  blocks on the extent lock held by Task 1.

  Task 1 keeps getting stuck in direct reclaim (likely, as we already
  assumed an allocation failure above).

  Task 2 stays blocked on the extent lock the whole time.

If you consider the effects of readahead_expand and imagine a file with
a 128k compressed extent followed by many smaller compressed extents,
you can imagine that the expanded window will result in subsequent reads
hitting many extents (128k/4k = 32) per lock window in the worst case.

The system likely wouldn't be all that healthy anyway, so this is
likely not a critical improvement, but it does alleviate this one source
of stress and one thread's slowdown escalating to others.

To bring this behavior back to the old model, we should unlock the
extent at each loop of the readahead loop rather than in one shot at the
end. This allows such overlapping reads to proceed as they should.
Writes are fine because either the page has already been read and has an
appropriate state in the page cache to be invalidated (or not uptodate)
or it is still-to-be-read and the extent lock is still held protecting
it.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
…ck groups

A swap file on btrfs will pin down block groups that cover the swap file
extent.

Pinned down block groups will be skipped for scrub and relocation.

This degradation of critical btrfs maintenance operations has never been
properly communicated to end users, and has already caused problems
including:

- Scrub finished too quickly
  Because the enabled swap file has pinned down most of the block
  groups, any file extents in those block groups, even those not used
  by the swap file, will be skipped by scrub.

- Unbalanced data and metadata usage, where relocation won't help
  For the same reason, pinned down block groups will not be considered
  as relocation targets, thus data extents that are not used by the swap
  file can still be skipped by relocation.

Although we already have kernel messages for both scrub and balance, the
balance one is still info level.

To better communicate those potential long term problems, add the
following output into dmesg:

- Change the message level to warn for __btrfs_balance()

- Total pinned down block group number and size during swapfile activation
- Total released block group number and size during swapfile deactivation
  The above messages have info level.

- The fact that pinned down block groups will not be scrubbed nor
  balanced
  The above message has warning level.

The example output would look like the following, for enabling a 1.2G
swapfile, which pinned down 2G of block groups:

 BTRFS info (device dm-3): swapfile activated on root 5 ino 257, pinned down 2147483648 bytes from 2 block group(s)
 BTRFS warning (device dm-3): block groups with swapfile extents will not be scrubbed or balanced
 Adding 1257468k swap on /mnt/btrfs/foobar.  Priority:-1 extents:1 across:1257468k
 BTRFS info (device dm-3): swapfile deactivated on root 5 ino 257, released 2147483648 bytes from 2 block group(s)

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
The variable-sized buffer buf in struct btrfs_ioctl_search_args_v2 is
declared as __u64[], but it holds a packed byte stream of search results,
where all offsets into the buffer are in bytes.

Declaring buf as __u64[] makes it easy for user space to write incorrect
pointer arithmetic: adding a byte offset directly to a __u64 pointer
scales the offset by 8, landing at byte position offset*8 instead of
offset.

This recently caused an infinite loop in btrfs-progs: the accessor read
all-zero data from misaddressed items, which fed zeroed search keys back
into the ioctl loop and spun forever. The issue was worked around at the
time by disabling TREE_SEARCH_V2 entirely in btrfs-progs (d73e69824854:
"btrfs-progs: temporarily disable usage of v2 of search tree ioctl").

The kernel side already treats buf as a byte buffer, so change the
declaration to __u8[] to match the actual semantics and prevent similar
misuse in user space. The change is ABI compatible: both the structure size
and alignment are unchanged.

Suggested-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: You-Kai Zheng <ykzheng@synology.com>
Fixes: cc68a8a ("btrfs: new ioctl TREE_SEARCH_V2")
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Inside btrfs we always pair -EUCLEAN error with an error message to
indicate which data is corrupted.

However there are 3 cases inside lzo decompression where there is no
error message for corrupted headers.

Add those missing error messages to show exactly where the corruption
is.

Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
[BUG]
A crafted btrfs image can trigger the following crash:

 BUG: unable to handle page fault for address: ffffd1dc42884000
 #PF: supervisor write access in kernel mode
 #PF: error_code(0x0002) - not-present page
 CPU: 9 UID: 0 PID: 1034 Comm: poc Not tainted 7.1.0-rc4-custom+ #383 PREEMPT(full)  46af0a92938a63be7132e0dfd71e62327c51d5c2
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
 RIP: 0010:memcpy+0xc/0x10
 Call Trace:
  <TASK>
  read_extent_buffer+0xe4/0x100 [btrfs 3cf0785dd58fec8c5ff84633b772f17ce1f92a8f]
  btrfs_get_name+0x15e/0x1e0 [btrfs 3cf0785dd58fec8c5ff84633b772f17ce1f92a8f]
  reconnect_path+0x165/0x390
  exportfs_decode_fh_raw+0x337/0x400
  ? drop_caches_sysctl_handler+0xb0/0xb0
  </TASK>
 ---[ end trace 0000000000000000 ]---
 RIP: 0010:memcpy+0xc/0x10
 Kernel panic - not syncing: Fatal exception

[CAUSE]
The crafted image has the following corrupted INODE_REF item:

	item 9 key (258 INODE_REF 257) itemoff 11544 itemsize 4106
		index 2 namelen 4096 name: d\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000

The itemsize matches the namelen, but the namelen is 4096, far larger
than the normal name length limit (BTRFS_NAME_LEN, 255).

Meanwhile the buffer behind @name is only 255 bytes, so copying the name
causes an out-of-bounds write and the crash above.

[FIX]
Add extra namelen verification for INODE_REF, just like what we have
done in ROOT_REF checks.

Now the crafted image can be rejected gracefully:

 BTRFS critical (device dm-2): corrupt leaf: root=5 block=30572544 slot=14 ino=259, invalid inode ref name length, has 4096 expect [1, 255]
 BTRFS error (device dm-2): read time tree block corruption detected on logical 30572544 mirror 2
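
The kind of check described in [FIX] can be sketched in user space as
follows. This is an illustration under stated assumptions, not the
kernel's tree-checker code: check_inode_ref_item(), NAME_LEN_MAX and
INODE_REF_HDR_SIZE are hypothetical names, and the layout assumed for
each ref is an 8-byte index followed by a 2-byte little-endian namelen
and then the name, which is consistent with the itemsize of 4106 for a
namelen of 4096 shown above.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define NAME_LEN_MAX		255u	/* mirrors BTRFS_NAME_LEN */
#define INODE_REF_HDR_SIZE	10u	/* assumed: 8-byte index + 2-byte namelen */

#ifndef EUCLEAN
#define EUCLEAN 117	/* Linux value, for systems that lack it */
#endif

/*
 * Walk every ref packed into one INODE_REF item and reject the item if
 * any name length is outside [1, NAME_LEN_MAX] or runs past the item.
 */
static int check_inode_ref_item(const uint8_t *item, size_t item_size)
{
	size_t cur = 0;

	/* An item must hold at least one header plus one name byte. */
	if (item_size <= INODE_REF_HDR_SIZE)
		return -EUCLEAN;

	while (cur < item_size) {
		uint16_t namelen;

		/* The header itself must fit in the remaining bytes. */
		if (item_size - cur < INODE_REF_HDR_SIZE)
			return -EUCLEAN;

		/* namelen is stored little-endian right after the index. */
		namelen = (uint16_t)(item[cur + 8] | (item[cur + 9] << 8));
		if (namelen == 0 || namelen > NAME_LEN_MAX)
			return -EUCLEAN;

		/* The name must fit inside the item as well. */
		if (item_size - cur - INODE_REF_HDR_SIZE < namelen)
			return -EUCLEAN;

		cur += INODE_REF_HDR_SIZE + namelen;
	}
	return 0;
}
```

Checking the length once at read time means later consumers that copy the
name into a 255-byte buffer do not each need their own bounds check.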

Link: https://lore.kernel.org/linux-btrfs/aik0hEV6ehKx6Ldv@Air.local/
Reported-by: Xiang Mei <xmei5@asu.edu>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
[ Rebase, add a Link: tag, add a simple cause analysis ]
Acked-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
The v2 space cache has been the default mkfs option since btrfs-progs
v5.15, and commit 1e7bec1 ("btrfs: emit a warning about space cache
v1 being deprecated") added a warning that the v1 space cache is
deprecated.

It has been long enough that we should remove the v1 space cache
completely.

As the first step, disable the v1 space cache by:

- Making the "space_cache" mount option fall back to "nospace_cache"
- Making "space_cache=v1" fall back to "nospace_cache"
  This is safer than forcing "space_cache=v2", as forcing the v2 cache
  requires removing the v1 cache and regenerating the v2 cache.
  Such an operation can be slow and takes extra metadata space, thus
  it is not always safe for existing filesystems.

With this done, a v1 cache mount always falls back to no space cache,
and mount options can no longer force v1 space cache usage.
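
The fallback rules in the list above can be summarized as a small lookup.
This is a sketch only: resolve_space_cache() and enum cache_mode are
hypothetical names that do not come from the kernel's mount option
parser, and the sketch assumes that the plain "space_cache" option
triggers the same deprecation warning as "space_cache=v1".

```c
#include <string.h>

/* Hypothetical effective cache modes once v1 is disabled. */
enum cache_mode {
	CACHE_NONE,	/* nospace_cache */
	CACHE_V2,	/* free space tree */
};

/*
 * Map a requested mount option to the effective cache mode.
 * Sets *warn to 1 when a v1 request was downgraded to no space cache.
 * Returns 0 on success, -1 for an option this sketch does not know.
 */
static int resolve_space_cache(const char *opt, enum cache_mode *mode,
			       int *warn)
{
	*warn = 0;

	/* Both spellings of a v1 request fall back to no space cache. */
	if (strcmp(opt, "space_cache") == 0 ||
	    strcmp(opt, "space_cache=v1") == 0) {
		*mode = CACHE_NONE;
		*warn = 1;
		return 0;
	}
	if (strcmp(opt, "space_cache=v2") == 0) {
		*mode = CACHE_V2;
		return 0;
	}
	if (strcmp(opt, "nospace_cache") == 0) {
		*mode = CACHE_NONE;
		return 0;
	}
	return -1;
}
```

Note that no input maps to a v1 mode at all, which is the point of the
change: the option is still accepted, but it can no longer select v1.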

For example, even for a fs with v1 cache:

 # btrfs ins dump-super test.img
 superblock: bytenr=65536, device=test.img
 ---------------------------------------------------------
 csum_type		0 (crc32c)
 csum_size		4
 csum			0xdce44b2c [match]
 bytenr			65536
 flags			0x1
 			( WRITTEN )
 magic			_BHRfS_M [match]
 fsid			7d7c3bba-8211-4206-868d-10eedd5703f8
 metadata_uuid		00000000-0000-0000-0000-000000000000
 label
 generation		9
 root			30605312
 [...]
 compat_ro_flags		0x0 << No FST feature
 incompat_flags		0x361
 			( MIXED_BACKREF |
 			  BIG_METADATA |
 			  EXTENDED_IREF |
 			  SKINNY_METADATA |
 			  NO_HOLES )
 cache_generation	9 <<< Matches generation
 uuid_tree_generation	9

Mounting it results in no space cache rather than the v1 space cache:

 # mount test.img /mnt/btrfs
 # dmesg -t | tail -n 5
 BTRFS: device fsid 7d7c3bba-8211-4206-868d-10eedd5703f8 devid 1 transid 9 /dev/loop0 (7:0) scanned by mount (1264)
 BTRFS info (device loop0): first mount of filesystem 7d7c3bba-8211-4206-868d-10eedd5703f8
 BTRFS info (device loop0): using crc32c checksum algorithm
 BTRFS info (device loop0): turning on async discard
 BTRFS info (device loop0): last unmount of filesystem 7d7c3bba-8211-4206-868d-10eedd5703f8

Even forcing the v1 cache does not work; the mount falls back to the
usual nospace_cache:

 # mount test.img -o space_cache=v1 /mnt/btrfs
 # dmesg -t | tail -n 6
 BTRFS warning: v1 space cache is deprecated, fallback to no space cache
 BTRFS: device fsid 7d7c3bba-8211-4206-868d-10eedd5703f8 devid 1 transid 9 /dev/loop0 (7:0) scanned by mount (1264)
 BTRFS info (device loop0): first mount of filesystem 7d7c3bba-8211-4206-868d-10eedd5703f8
 BTRFS info (device loop0): using crc32c checksum algorithm
 BTRFS info (device loop0): turning on async discard
 BTRFS info (device loop0): last unmount of filesystem 7d7c3bba-8211-4206-868d-10eedd5703f8

And there is no way to force converting a v2 cache back to v1; such an
attempt only clears the free space tree and falls back to no space cache.

 # mkfs.btrfs -f -O fst,^bgt test.img
 # mount -o clear_cache,space_cache=v1 test.img /mnt/btrfs
 # dmesg -t | tail -n 11
 BTRFS warning: v1 space cache is deprecated, fallback to no space cache
 BTRFS: device fsid f59daad2-3ab5-4f33-b752-a36cfb09b674 devid 1 transid 8 /dev/loop0 (7:0) scanned by mount (1419)
 BTRFS info (device loop0): first mount of filesystem f59daad2-3ab5-4f33-b752-a36cfb09b674
 BTRFS info (device loop0): using crc32c checksum algorithm
 BTRFS info (device loop0): rebuilding free space tree
 BTRFS info (device loop0): disabling free space tree
 BTRFS info (device loop0): clearing compat-ro feature flag for FREE_SPACE_TREE (0x1)
 BTRFS info (device loop0): clearing compat-ro feature flag for FREE_SPACE_TREE_VALID (0x2)
 BTRFS info (device loop0): checking UUID tree
 BTRFS info (device loop0): turning on async discard
 BTRFS info (device loop0): force clearing of disk cache
 # mount | grep /mnt/btrfs
 /home/adam/test.img on /mnt/btrfs type btrfs (rw,relatime,discard=async,nospace_cache,subvolid=5,subvol=/)

Signed-off-by: Qu Wenruo <wqu@suse.com>
Since commit bac3c29 ("btrfs: remove 2K block size support") there
is no 2K block size support inside btrfs anymore.

Remove the stale comments of btrfs_supported_blocksize().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Since v5.15 btrfs has supported block size < page size, but only with a
4K block size, even though there is no special reason that we cannot
support 8K/16K/32K block sizes for 64K page size.

That 4K limit is completely artificial, and exists mostly to reduce test
runtime so we do not need to test all the extra block size combinations.

However, it also limits user choice: some users understand what they
are doing and want larger block sizes.
In that case, the fixed 4K block size for the subpage routine is blocking
the way.

Just remove that fixed 4K requirement for block size < page size.

This should not affect regular end users, since mkfs has used 4K block
size as the default for quite a while, and the existing bs == ps support
is still there.

But for power users, this allows extra block size support, and may
provide extra test coverage.
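
As a rough model of the relaxed rule, the check below accepts any
power-of-two block size from 4K up to the page size. It is a sketch under
that assumption, not the body of btrfs_supported_blocksize():
blocksize_supported(), is_power_of_two() and MIN_BLOCKSIZE are
hypothetical names, and block sizes larger than the page size are out of
scope here and simply rejected.

```c
#include <stdbool.h>
#include <stdint.h>

#define MIN_BLOCKSIZE 4096u	/* assumed lower bound (4K) */

static bool is_power_of_two(uint32_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Assumed rule after the change: any power-of-two block size between
 * 4K and the page size is accepted, instead of only 4K or bs == ps.
 */
static bool blocksize_supported(uint32_t blocksize, uint32_t page_size)
{
	if (!is_power_of_two(blocksize))
		return false;
	if (blocksize < MIN_BLOCKSIZE)
		return false;
	return blocksize <= page_size;
}
```

For a 64K page size this admits 4K, 8K, 16K, 32K and 64K, while a 4K page
size still admits only 4K.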

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>