Commit 156cac3

boryas authored and gregkh committed
btrfs: fix EEXIST abort due to non-consecutive gaps in chunk allocation
[ Upstream commit b14c5e0 ]

I have been observing a number of systems aborting at insert_dev_extents()
in btrfs_create_pending_block_groups(). The following is a sample stack
trace of such an abort coming from forced chunk allocation (typically
behind CONFIG_BTRFS_EXPERIMENTAL) but this can theoretically happen to any
DUP chunk allocation.

[81.801] ------------[ cut here ]------------
[81.801] BTRFS: Transaction aborted (error -17)
[81.801] WARNING: fs/btrfs/block-group.c:2876 at btrfs_create_pending_block_groups+0x721/0x770 [btrfs], CPU#1: bash/319
[81.802] Modules linked in: virtio_net btrfs xor zstd_compress raid6_pq null_blk
[81.803] CPU: 1 UID: 0 PID: 319 Comm: bash Kdump: loaded Not tainted 6.19.0-rc6+ #319 NONE
[81.803] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.17.0-2-2 04/01/2014
[81.804] RIP: 0010:btrfs_create_pending_block_groups+0x723/0x770 [btrfs]
[81.806] RSP: 0018:ffffa36241a6bce8 EFLAGS: 00010282
[81.806] RAX: 000000000000000d RBX: ffff8e699921e400 RCX: 0000000000000000
[81.807] RDX: 0000000002040001 RSI: 00000000ffffffef RDI: ffffffffc0608bf0
[81.807] RBP: 00000000ffffffef R08: ffff8e69830f6000 R09: 0000000000000007
[81.808] R10: ffff8e699921e5e8 R11: 0000000000000000 R12: ffff8e6999228000
[81.808] R13: ffff8e6984d82000 R14: ffff8e69966a69c0 R15: ffff8e69aa47b000
[81.809] FS:  00007fec6bdd9740(0000) GS:ffff8e6b1b379000(0000) knlGS:0000000000000000
[81.809] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[81.810] CR2: 00005604833670f0 CR3: 0000000116679000 CR4: 00000000000006f0
[81.810] Call Trace:
[81.810]  <TASK>
[81.810]  __btrfs_end_transaction+0x3e/0x2b0 [btrfs]
[81.811]  btrfs_force_chunk_alloc_store+0xcd/0x140 [btrfs]
[81.811]  kernfs_fop_write_iter+0x15f/0x240
[81.812]  vfs_write+0x264/0x500
[81.812]  ksys_write+0x6c/0xe0
[81.812]  do_syscall_64+0x66/0x770
[81.812]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[81.813] RIP: 0033:0x7fec6be66197
[81.814] RSP: 002b:00007fffb159dd30 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[81.815] RAX: ffffffffffffffda RBX: 00007fec6bdd9740 RCX: 00007fec6be66197
[81.815] RDX: 0000000000000002 RSI: 0000560483374f80 RDI: 0000000000000001
[81.816] RBP: 0000560483374f80 R08: 0000000000000000 R09: 0000000000000000
[81.816] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
[81.817] R13: 00007fec6bfb85c0 R14: 00007fec6bfb5ee0 R15: 00005604833729c0
[81.817]  </TASK>
[81.817] irq event stamp: 20039
[81.818] hardirqs last  enabled at (20047): [<ffffffff99a68302>] __up_console_sem+0x52/0x60
[81.818] hardirqs last disabled at (20056): [<ffffffff99a682e7>] __up_console_sem+0x37/0x60
[81.819] softirqs last  enabled at (19470): [<ffffffff999d2b46>] __irq_exit_rcu+0x96/0xc0
[81.819] softirqs last disabled at (19463): [<ffffffff999d2b46>] __irq_exit_rcu+0x96/0xc0
[81.820] ---[ end trace 0000000000000000 ]---
[81.820] BTRFS: error (device dm-7 state A) in btrfs_create_pending_block_groups:2876: errno=-17 Object already exists

Inspecting these aborts with drgn, I observed a pattern of overlapping
chunk_maps. Note how stripe 1 of the first chunk overlaps in physical
address with stripe 0 of the second chunk.

Physical Start      Physical End        Length  Logical             Type      Stripe
----------------------------------------------------------------------------------------------------
0x0000000102500000  0x0000000142500000  1.0G    0x0000000641d00000  META|DUP  0/2
0x0000000142500000  0x0000000182500000  1.0G    0x0000000641d00000  META|DUP  1/2
0x0000000142500000  0x0000000182500000  1.0G    0x0000000601d00000  META|DUP  0/2
0x0000000182500000  0x00000001c2500000  1.0G    0x0000000601d00000  META|DUP  1/2

Now how could this possibly happen?
All chunk allocation is protected by the chunk_mutex, so racing allocations
should see a consistent view of the CHUNK_ALLOCATED bit in the chunk
allocation extent-io-tree (device->alloc_state, as set by
chunk_map_device_set_bits()). The tree itself is protected by a spin lock,
and clearing/setting the bits is always protected by
fs_info->mapping_tree_lock, so no race is apparent.

It turns out that there is a subtle bug in the logic regarding chunk
allocations that have happened in the current transaction, known as
"pending extents". Chunk allocation, as implemented in
find_free_dev_extent(), is a loop which searches the commit root of the
dev_root and looks for gaps between DEV_EXTENT items. For each such gap, it
then checks the alloc_state bitmap for any pending extents and adjusts the
hole that it finds accordingly. However, the logic in that adjustment
assumes that the first pending extent is the only one in that range. E.g.,
given a layout with two non-consecutive pending extents in a hole passed to
dev_extent_hole_check() via *hole_start and *hole_size:

|----pending A----| real hole |----pending B----|
|               candidate hole                  |
*hole_start                  *hole_start + *hole_size

the code incorrectly returns a "hole" running from the end of pending
extent A to the end of the passed-in hole, failing to account for
pending B.

However, it is not entirely obvious that it is actually possible to produce
such a layout. I was able to reproduce it, but with some contortions: I
continued to use the force chunk allocation sysfs file, I introduced a long
delay (10 seconds) into the start of the cleaner thread, and I prevented
the unused-bgs cleaning logic from ever deleting metadata bgs. These make
it easier to deterministically produce the condition, but they shouldn't
really matter if you imagine the conditions happening by race/luck.
Allocations/frees can happen concurrently with the cleaner thread preparing
to process an unused extent, and together they create some used chunks with
an unused chunk interleaved, all during one transaction. Then
btrfs_delete_unused_bgs() sees the unused one and clears it, leaving a
range with several pending chunk allocations and a gap in the middle. The
basic idea is that the unused_bgs cleanup work happens on a worker, so if
we allocate 3 block groups in one transaction, and the cleaner work kicked
off by the previous transaction comes through and deletes the middle one of
the 3, then the commit root shows no dev extents and we have the bad
pattern in the extent-io-tree. One final consideration is that the code
happens to loop to the next hole if there are no more extents at all, so we
need one more dev extent well past the area we are working in.

Something like the following demonstrates the technique:

# push the BG frontier out to 20G
fallocate -l 20G $mnt/foo
# allocate one more that will prevent the "no more dev extents" luck
fallocate -l 1G $mnt/sticky
# sync
sync
# clear out the allocation area
rm $mnt/foo
sync
_cleaner
# let everything quiesce
sleep 20
sync
# dev tree should have one bg 20G out and the rest at the beginning..
# sort of like an empty FS but with a random sticky chunk.

# kick off the cleaner in the background, remember it will sleep 10s
# before doing interesting work
_cleaner &
sleep 3
# create 3 trivial block groups, all empty, all immediately marked as unused.
echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc"
echo 1 > "$(_btrfs_sysfs_space_info $dev data)/force_chunk_alloc"
echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc"
# let the cleaner thread definitely finish, it will remove the data bg
sleep 10
# this allocation sees the non-consecutive pending metadata chunks with a
# data chunk gap of 1G and allocates a 2G extent in that hole. ENOSPC!
echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc"

As for the fix, it is not that obvious. I could not see a trivial way to do
it even by adding backup loops into find_free_dev_extent(), so I opted to
change the semantics of dev_extent_hole_check() to not stop looping until
it finds a sufficiently big hole. For clarity, this also required splitting
the helper function contains_pending_extent() into two new helpers which
find the first pending extent and the first suitable hole in a range. I
attempted to clean up the documentation and range calculations to be as
consistent and clear as possible for the future.

I also looked at the zoned case and concluded that its loop is different
and not to be unified with this one. As far as I can tell, the zoned check
will only further constrain the hole, so looping back to find more holes is
acceptable. Though given that zoned really only appends, I find it highly
unlikely that it is susceptible to this bug.

Fixes: 1b98450 ("Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole")
Reported-by: Dimitrios Apostolou <jimis@gmx.net>
Closes: https://lore.kernel.org/linux-btrfs/q7760374-q1p4-029o-5149-26p28421s468@tzk.arg/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
1 parent 201091d commit 156cac3

1 file changed: fs/btrfs/volumes.c (183 additions & 60 deletions)
@@ -1505,30 +1505,158 @@ struct btrfs_device *btrfs_scan_one_device(const char *path,
 }
 
 /*
- * Try to find a chunk that intersects [start, start + len] range and when one
- * such is found, record the end of it in *start
+ * Find the first pending extent intersecting a range.
+ *
+ * @device:        the device to search
+ * @start:         start of the range to check
+ * @len:           length of the range to check
+ * @pending_start: output pointer for the start of the found pending extent
+ * @pending_end:   output pointer for the end of the found pending extent (inclusive)
+ *
+ * Search for a pending chunk allocation that intersects the half-open range
+ * [start, start + len).
+ *
+ * Return: true if a pending extent was found, false otherwise.
+ * If the return value is true, store the first pending extent in
+ * [*pending_start, *pending_end]. Otherwise, the two output variables
+ * may still be modified, to something outside the range and should not
+ * be used.
  */
-static bool contains_pending_extent(struct btrfs_device *device, u64 *start,
-				    u64 len)
+static bool first_pending_extent(struct btrfs_device *device, u64 start, u64 len,
+				 u64 *pending_start, u64 *pending_end)
 {
-	u64 physical_start, physical_end;
-
 	lockdep_assert_held(&device->fs_info->chunk_mutex);
 
-	if (btrfs_find_first_extent_bit(&device->alloc_state, *start,
-					&physical_start, &physical_end,
+	if (btrfs_find_first_extent_bit(&device->alloc_state, start,
+					pending_start, pending_end,
 					CHUNK_ALLOCATED, NULL)) {
 
-		if (in_range(physical_start, *start, len) ||
-		    in_range(*start, physical_start,
-			     physical_end + 1 - physical_start)) {
-			*start = physical_end + 1;
+		if (in_range(*pending_start, start, len) ||
+		    in_range(start, *pending_start, *pending_end + 1 - *pending_start)) {
 			return true;
 		}
 	}
 	return false;
 }
 
+/*
+ * Find the first real hole accounting for pending extents.
+ *
+ * @device:        the device containing the candidate hole
+ * @start:         input/output pointer for the hole start position
+ * @len:           input/output pointer for the hole length
+ * @min_hole_size: the size of hole we are looking for
+ *
+ * Given a potential hole specified by [*start, *start + *len), check for
+ * pending chunk allocations within that range. If pending extents are found,
+ * the hole is adjusted to represent the first true free space that is large
+ * enough when accounting for pending chunks.
+ *
+ * Note that this function must handle various cases involving non consecutive
+ * pending extents.
+ *
+ * Returns: true if a suitable hole was found and false otherwise.
+ * If the return value is true, then *start and *len are set to represent the hole.
+ * If the return value is false, then *start is set to the largest hole we
+ * found and *len is set to its length.
+ * If there are no holes at all, then *start is set to the end of the range and
+ * *len is set to 0.
+ */
+static bool find_hole_in_pending_extents(struct btrfs_device *device, u64 *start,
+					 u64 *len, u64 min_hole_size)
+{
+	u64 pending_start, pending_end;
+	u64 end;
+	u64 max_hole_start = 0;
+	u64 max_hole_len = 0;
+
+	lockdep_assert_held(&device->fs_info->chunk_mutex);
+
+	if (*len == 0)
+		return false;
+
+	end = *start + *len - 1;
+
+	/*
+	 * Loop until we either see a large enough hole or check every pending
+	 * extent overlapping the candidate hole.
+	 * At every hole that we observe, record it if it is the new max.
+	 * At the end of the iteration, set the output variables to the max hole.
+	 */
+	while (true) {
+		if (first_pending_extent(device, *start, *len, &pending_start, &pending_end)) {
+			/*
+			 * Case 1: the pending extent overlaps the start of the
+			 * candidate hole. That means the true hole is after the
+			 * pending extent, but we need to find the next pending
+			 * extent to properly size the hole. In the next loop,
+			 * we will reduce to case 2 or 3.
+			 * e.g.,
+			 *
+			 * |----pending A----| real hole |----pending B----|
+			 * |               candidate hole                  |
+			 * *start                                        end
+			 */
+			if (pending_start <= *start) {
+				*start = pending_end + 1;
+				goto next;
+			}
+			/*
+			 * Case 2: The pending extent starts after *start (and
+			 * overlaps [*start, end)), so the first hole just goes
+			 * up to the start of the pending extent.
+			 * e.g.,
+			 *
+			 * |   real hole   |----pending A----|
+			 * |          candidate hole         |
+			 * *start                          end
+			 */
+			*len = pending_start - *start;
+			if (*len > max_hole_len) {
+				max_hole_start = *start;
+				max_hole_len = *len;
+			}
+			if (*len >= min_hole_size)
+				break;
+			/*
+			 * If the hole wasn't big enough, then we advance past
+			 * the pending extent and keep looking.
+			 */
+			*start = pending_end + 1;
+			goto next;
+		} else {
+			/*
+			 * Case 3: There is no pending extent overlapping the
+			 * range [*start, *start + *len - 1], so the only
+			 * remaining hole is the remaining range.
+			 * e.g.,
+			 *
+			 * |  candidate hole  |
+			 * |    real hole     |
+			 * *start           end
+			 */
+
+			if (*len > max_hole_len) {
+				max_hole_start = *start;
+				max_hole_len = *len;
+			}
+			break;
+		}
+next:
+		if (*start > end)
+			break;
+		*len = end - *start + 1;
+	}
+	if (max_hole_len) {
+		*start = max_hole_start;
+		*len = max_hole_len;
+	} else {
+		*start = end + 1;
+		*len = 0;
+	}
+	return max_hole_len >= min_hole_size;
+}
+
 static u64 dev_extent_search_start(struct btrfs_device *device)
 {
 	switch (device->fs_devices->chunk_alloc_policy) {
@@ -1593,59 +1721,57 @@ static bool dev_extent_hole_check_zoned(struct btrfs_device *device,
 }
 
 /*
- * Check if specified hole is suitable for allocation.
+ * Validate and adjust a hole for chunk allocation
+ *
+ * @device:     the device containing the candidate hole
+ * @hole_start: input/output pointer for the hole start position
+ * @hole_size:  input/output pointer for the hole size
+ * @num_bytes:  minimum allocation size required
  *
- * @device:	the device which we have the hole
- * @hole_start: starting position of the hole
- * @hole_size:	the size of the hole
- * @num_bytes:	the size of the free space that we need
+ * Check if the specified hole is suitable for allocation and adjust it if
+ * necessary. The hole may be modified to skip over pending chunk allocations
+ * and to satisfy stricter zoned requirements on zoned filesystems.
  *
- * This function may modify @hole_start and @hole_size to reflect the suitable
- * position for allocation. Returns 1 if hole position is updated, 0 otherwise.
+ * For regular (non-zoned) allocation, if the hole after adjustment is smaller
+ * than @num_bytes, the search continues past additional pending extents until
+ * either a sufficiently large hole is found or no more pending extents exist.
+ *
+ * Return: true if a suitable hole was found and false otherwise.
+ * If the return value is true, then *hole_start and *hole_size are set to
+ * represent the hole we found.
+ * If the return value is false, then *hole_start is set to the largest
+ * hole we found and *hole_size is set to its length.
+ * If there are no holes at all, then *hole_start is set to the end of the range
+ * and *hole_size is set to 0.
  */
 static bool dev_extent_hole_check(struct btrfs_device *device, u64 *hole_start,
 				  u64 *hole_size, u64 num_bytes)
 {
-	bool changed = false;
-	u64 hole_end = *hole_start + *hole_size;
+	bool found = false;
+	const u64 hole_end = *hole_start + *hole_size - 1;
 
-	for (;;) {
-		/*
-		 * Check before we set max_hole_start, otherwise we could end up
-		 * sending back this offset anyway.
-		 */
-		if (contains_pending_extent(device, hole_start, *hole_size)) {
-			if (hole_end >= *hole_start)
-				*hole_size = hole_end - *hole_start;
-			else
-				*hole_size = 0;
-			changed = true;
-		}
+	ASSERT(*hole_size > 0);
 
-		switch (device->fs_devices->chunk_alloc_policy) {
-		default:
-			btrfs_warn_unknown_chunk_allocation(device->fs_devices->chunk_alloc_policy);
-			fallthrough;
-		case BTRFS_CHUNK_ALLOC_REGULAR:
-			/* No extra check */
-			break;
-		case BTRFS_CHUNK_ALLOC_ZONED:
-			if (dev_extent_hole_check_zoned(device, hole_start,
-							hole_size, num_bytes)) {
-				changed = true;
-				/*
-				 * The changed hole can contain pending extent.
-				 * Loop again to check that.
-				 */
-				continue;
-			}
-			break;
-		}
+again:
+	*hole_size = hole_end - *hole_start + 1;
+	found = find_hole_in_pending_extents(device, hole_start, hole_size, num_bytes);
+	if (!found)
+		return found;
+	ASSERT(*hole_size >= num_bytes);
 
+	switch (device->fs_devices->chunk_alloc_policy) {
+	default:
+		btrfs_warn_unknown_chunk_allocation(device->fs_devices->chunk_alloc_policy);
+		fallthrough;
+	case BTRFS_CHUNK_ALLOC_REGULAR:
+		return found;
+	case BTRFS_CHUNK_ALLOC_ZONED:
+		if (dev_extent_hole_check_zoned(device, hole_start, hole_size, num_bytes))
+			goto again;
 		break;
 	}
 
-	return changed;
+	return found;
 }
 
 /*
@@ -1704,7 +1830,7 @@ static int find_free_dev_extent(struct btrfs_device *device, u64 num_bytes,
 		ret = -ENOMEM;
 		goto out;
 	}
-again:
+
 	if (search_start >= search_end ||
 	    test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state)) {
 		ret = -ENOSPC;
@@ -1791,11 +1917,7 @@ static int find_free_dev_extent(struct btrfs_device *device, u64 num_bytes,
 	 */
 	if (search_end > search_start) {
 		hole_size = search_end - search_start;
-		if (dev_extent_hole_check(device, &search_start, &hole_size,
-					  num_bytes)) {
-			btrfs_release_path(path);
-			goto again;
-		}
+		dev_extent_hole_check(device, &search_start, &hole_size, num_bytes);
 
 		if (hole_size > max_hole_size) {
 			max_hole_start = search_start;
@@ -4844,6 +4966,7 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
 	u64 diff;
 	u64 start;
 	u64 free_diff = 0;
+	u64 pending_start, pending_end;
 
 	new_size = round_down(new_size, fs_info->sectorsize);
 	start = new_size;
@@ -4889,7 +5012,7 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 new_size)
 	 * in-memory chunks are synced to disk so that the loop below sees them
 	 * and relocates them accordingly.
 	 */
-	if (contains_pending_extent(device, &start, diff)) {
+	if (first_pending_extent(device, start, diff, &pending_start, &pending_end)) {
 		mutex_unlock(&fs_info->chunk_mutex);
 		ret = btrfs_commit_transaction(trans);
 		if (ret)
