
gh-148323: release the GIL in bytes.join when operands are immutable #148456

Open
picnixz wants to merge 3 commits into python:main from picnixz:perf/release-gil-in-join-148323

Conversation

@picnixz
Member

@picnixz picnixz commented Apr 12, 2026

@picnixz
Member Author

picnixz commented Apr 12, 2026

Ideally, I'd like to run macrobenchmarks with this but I don't know how to do it so I'll ask someone I think knows @eendebakpt

@picnixz picnixz requested a review from vstinner April 12, 2026 16:01
@eendebakpt
Contributor

Ideally, I'd like to run macrobenchmarks with this but I don't know how to do it so I'll ask someone I think knows @eendebakpt

I'll have a look at it today or tomorrow

@picnixz
Member Author

picnixz commented Apr 12, 2026

Thanks! My gut feeling is that it won't decrease performance, but I'm more interested in knowing how much we gain, for the NEWS entry.

@sunmy2019
Member

Please give me some time to review its correctness.

Comment on lines +84 to +87
    PyObject *bufobj = buffers[i].obj;
    if (!bufobj || !PyBytes_CheckExact(bufobj)) {
        drop_gil = 0;
    }
Member


Suggested change

-    PyObject *bufobj = buffers[i].obj;
-    if (!bufobj || !PyBytes_CheckExact(bufobj)) {
-        drop_gil = 0;
-    }
+    if (drop_gil) {
+        PyObject *bufobj = buffers[i].obj;
+        if (!bufobj || !PyBytes_CheckExact(bufobj)) {
+            drop_gil = 0;
+        }
+    }

Member


minor speedup

Member Author

@picnixz picnixz Apr 12, 2026


I don't think it'll matter much honestly. I'll wait for benchmarks in this case. Most of the time will be spent in memcpy() and the rest.

@sunmy2019
Member

The rest LGTM!

@eendebakpt
Contributor

Ideally, I'd like to run macrobenchmarks with this but I don't know how to do it so I'll ask someone I think knows @eendebakpt

What exactly do you want to benchmark? Possible regressions during normal usage of bytes.join, or the improvements in the free-threaded (no-GIL) scenario? And why a macrobenchmark? This change is very specific to join.

I created a script to test the releasing of the GIL (included in the details below). It shows that bytes.join(...) indeed releases the GIL when the arguments are exact bytes. However, the GIL is not released for arguments of type memoryview(bytes). The reason is that PyObject_GetBuffer does not put the underlying PyBytes object in buffers[i].obj but a reference to the memoryview itself.
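A minimal Python-level sketch of the operand types involved (not part of the PR, which inspects the C-level Py_buffer instead): memoryview exposes its exporter via .obj, so a view over bytes still points at the original bytes object and is read-only, while a view over a bytearray is writable.

```python
# Python-level illustration of the operand types discussed above; this is
# not code from the PR, which operates on the C-level Py_buffer instead.
base = b"hello"
view = memoryview(base)

# memoryview exposes the object it wraps via .obj ...
assert view.obj is base
# ... and a view over immutable bytes is read-only.
assert view.readonly is True

# A view over a bytearray is writable, so such an operand is not immutable.
assert memoryview(bytearray(b"hello")).readonly is False
print("ok")
```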

Script for testing scaling
# Multi-threading scaling benchmark for bytes.join().
#
# Targets the change in https://github.com/python/cpython/pull/148456
# (gh-148323): bytes.join() now releases the GIL when the operands are
# immutable, even when they reach bytes.join() through a buffer view
# (e.g. memoryview(bytes_object)).
#
# The GIL is only released when the joined result is large enough
# (>= 1 MiB). The benchmarks below are sized accordingly.
#
#
# Inspired by Tools/ftscalingbench/ftscalingbench.py.

import os
import queue
import sys
import threading
import time

WORK_SCALE = 200

JOIN_OUTPUT_BYTES = 16 << 20  # 16 MiB

# Few, large chunks: bytes.join() acquires a buffer per element on the
# GIL and only the final memcpy runs with the GIL dropped.
CHUNK_SIZE = 1 << 20  # 1 MiB per chunk

ALL_BENCHMARKS = {}

threads = []
in_queues = []
out_queues = []


def register_benchmark(func):
    ALL_BENCHMARKS[func.__name__] = func
    return func


# Build fixed inputs once so each benchmark iteration measures join(),
# not list/memoryview construction.
_CHUNK = b"x" * CHUNK_SIZE
_SEP = b""
_NUM_CHUNKS = JOIN_OUTPUT_BYTES // len(_CHUNK)

_BYTES_LIST = [_CHUNK] * _NUM_CHUNKS
_MEMVIEW_LIST = [memoryview(_CHUNK) for _ in range(_NUM_CHUNKS)]
_MIXED_LIST = [memoryview(_CHUNK) if i % 2 else _CHUNK
               for i in range(_NUM_CHUNKS)]


@register_benchmark
def join_bytes():
    # Pure bytes operands: already released the GIL before the PR.
    # Included as a baseline.
    sep = _SEP
    seq = _BYTES_LIST
    for _ in range(WORK_SCALE):
        sep.join(seq)


@register_benchmark
def join_memoryview():
    # memoryview over exact-bytes objects. Before the PR the GIL was
    # held here; after the PR it is released.
    sep = _SEP
    seq = _MEMVIEW_LIST
    for _ in range(WORK_SCALE):
        sep.join(seq)


@register_benchmark
def join_mixed():
    # Mix of bytes and memoryview(bytes). Exercises the same decision
    # path per element inside bytes.join().
    sep = _SEP
    seq = _MIXED_LIST
    for _ in range(WORK_SCALE):
        sep.join(seq)


def bench_one_thread(func):
    t0 = time.perf_counter_ns()
    func()
    t1 = time.perf_counter_ns()
    return t1 - t0


def bench_parallel(func):
    t0 = time.perf_counter_ns()
    for inq in in_queues:
        inq.put(func)
    for outq in out_queues:
        outq.get()
    t1 = time.perf_counter_ns()
    return t1 - t0


def benchmark(func):
    delta_one_thread = bench_one_thread(func)
    delta_many_threads = bench_parallel(func)

    speedup = delta_one_thread * len(threads) / delta_many_threads
    if speedup >= 1:
        factor = speedup
        direction = "faster"
    else:
        factor = 1 / speedup
        direction = "slower"

    use_color = hasattr(sys.stdout, 'isatty') and sys.stdout.isatty()
    color = reset_color = ""
    if use_color:
        if speedup <= 1.1:
            color = "\x1b[31m"  # red
        elif speedup < len(threads) / 2:
            color = "\x1b[33m"  # yellow
        reset_color = "\x1b[0m"

    one_ms = delta_one_thread / 1_000_000
    many_ms = delta_many_threads / 1_000_000
    print(f"{color}{func.__name__:<20} "
          f"1T {one_ms:7.1f} ms  "
          f"{len(threads)}T {many_ms:7.1f} ms  "
          f"{round(factor, 2):>5}x {direction}{reset_color}")


def determine_num_threads_and_affinity():
    if sys.platform != "linux":
        return [None] * os.cpu_count()

    import subprocess
    try:
        output = subprocess.check_output(
            ["lscpu", "-p=cpu,node,core,MAXMHZ"],
            text=True, env={"LC_NUMERIC": "C"})
    except (FileNotFoundError, subprocess.CalledProcessError):
        return [None] * os.cpu_count()

    table = []
    for line in output.splitlines():
        if line.startswith("#"):
            continue
        cpu, node, core, maxhz = line.split(",")
        if maxhz == "":
            maxhz = "0"
        table.append((int(cpu), int(node), int(core), float(maxhz)))

    cpus = []
    cores = set()
    max_mhz_all = max(row[3] for row in table)
    for cpu, node, core, maxmhz in table:
        if node == 0 and core not in cores and maxmhz == max_mhz_all:
            cpus.append(cpu)
            cores.add(core)
    return cpus


def thread_run(cpu, in_queue, out_queue):
    if cpu is not None and hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, (cpu,))

    while True:
        func = in_queue.get()
        if func is None:
            break
        func()
        out_queue.put(None)


def initialize_threads(opts):
    if opts.threads == -1:
        cpus = determine_num_threads_and_affinity()
    else:
        cpus = [None] * opts.threads  # don't set affinity

    print(f"Running benchmarks with {len(cpus)} threads")
    for cpu in cpus:
        inq = queue.Queue()
        outq = queue.Queue()
        in_queues.append(inq)
        out_queues.append(outq)
        t = threading.Thread(target=thread_run, args=(cpu, inq, outq),
                             daemon=True)
        threads.append(t)
        t.start()


def main(opts):
    global WORK_SCALE
    gil_enabled = (not hasattr(sys, "_is_gil_enabled")
                   or sys._is_gil_enabled())
    if gil_enabled:
        sys.stderr.write(
            "note: running with the GIL enabled; parallel scaling is "
            "expected to be near 1x\n")

    benchmark_names = opts.benchmarks
    if benchmark_names:
        for name in benchmark_names:
            if name not in ALL_BENCHMARKS:
                sys.stderr.write(f"Unknown benchmark: {name}\n")
                sys.exit(1)
    else:
        benchmark_names = ALL_BENCHMARKS.keys()

    WORK_SCALE = opts.scale

    if not opts.baseline_only:
        initialize_threads(opts)

    do_bench = not opts.baseline_only and not opts.parallel_only
    for name in benchmark_names:
        func = ALL_BENCHMARKS[name]
        if do_bench:
            benchmark(func)
            continue

        if opts.parallel_only:
            delta_ns = bench_parallel(func)
        else:
            delta_ns = bench_one_thread(func)

        time_ms = delta_ns / 1_000_000
        print(f"{func.__name__:<20} {time_ms:.1f} ms")


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("-t", "--threads", type=int, default=-1,
                        help="number of threads to use")
    parser.add_argument("--scale", type=int, default=WORK_SCALE,
                        help=f"work scale factor (default={WORK_SCALE})")
    parser.add_argument("--baseline-only", default=False,
                        action="store_true",
                        help="only run the baseline benchmarks (single thread)")
    parser.add_argument("--parallel-only", default=False,
                        action="store_true",
                        help="only run the parallel benchmark (many threads)")
    parser.add_argument("benchmarks", nargs="*",
                        help="benchmarks to run")
    options = parser.parse_args()
    main(options)

@picnixz
Member Author

picnixz commented Apr 12, 2026

However, the GIL is not released for arguments of type memoryview(bytes)

Oh, I thought it would have been. Hmm, then when do we have a buffer whose underlying obj field is a bytes object? Is it possible to construct one from Python? Or maybe I should check that I get a copy of myself, that I'm an exact memoryview, and that I'm read-only? (I'm not sure about the invariants here.)
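For illustration only, here is one possible Python-level shape of the per-operand check being discussed. The PR itself does this in C via PyBytes_CheckExact on buffers[i].obj; operand_is_immutable below is a hypothetical helper, not code from the PR.

```python
# Hypothetical sketch (not the PR's code): treat an operand as safely
# immutable only if it is an exact bytes object, or an exact, read-only
# memoryview whose exporter is an exact bytes object.
def operand_is_immutable(obj):
    if type(obj) is bytes:
        return True
    if type(obj) is memoryview:
        return obj.readonly and type(obj.obj) is bytes
    return False

assert operand_is_immutable(b"abc")
assert operand_is_immutable(memoryview(b"abc"))
assert not operand_is_immutable(bytearray(b"abc"))
assert not operand_is_immutable(memoryview(bytearray(b"abc")))
print("ok")
```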

@picnixz
Member Author

picnixz commented Apr 12, 2026

And why a macrobenchmark

I wanted to know whether the comparisons affected general bytes.join() performance, but now that I've read your message, I probably shouldn't have asked for a macrobenchmark :)

@picnixz
Member Author

picnixz commented Apr 12, 2026

Possible regressions during normal usage of bytes.join or the improvements for the no-GIL scenario?

(I wanted to test both)
