
gh-148323: release the GIL in bytes.join when operands are immutable #148456

Open
picnixz wants to merge 3 commits into python:main from picnixz:perf/release-gil-in-join-148323

Conversation

@picnixz
Member

@picnixz picnixz commented Apr 12, 2026

@picnixz
Member Author

picnixz commented Apr 12, 2026

Ideally, I'd like to run macrobenchmarks with this but I don't know how to do it so I'll ask someone I think knows @eendebakpt

@picnixz picnixz requested a review from vstinner April 12, 2026 16:01
@eendebakpt
Contributor

Ideally, I'd like to run macrobenchmarks with this but I don't know how to do it so I'll ask someone I think knows @eendebakpt

I'll have a look at it today or tomorrow

@picnixz
Member Author

picnixz commented Apr 12, 2026

Thanks! My gut feeling is that it won't decrease performance, but I'm more interested in knowing how much we gain, for the NEWS entry.

@sunmy2019
Member

Please give me some time to review its correctness.

Comment on lines +84 to +87
    PyObject *bufobj = buffers[i].obj;
    if (!bufobj || !PyBytes_CheckExact(bufobj)) {
        drop_gil = 0;
    }
Member


Suggested change

-    PyObject *bufobj = buffers[i].obj;
-    if (!bufobj || !PyBytes_CheckExact(bufobj)) {
-        drop_gil = 0;
-    }
+    if (drop_gil) {
+        PyObject *bufobj = buffers[i].obj;
+        if (!bufobj || !PyBytes_CheckExact(bufobj)) {
+            drop_gil = 0;
+        }
+    }

Member


minor speedup

Member Author

@picnixz picnixz Apr 12, 2026


I don't think it'll matter much honestly. I'll wait for benchmarks in this case. Most of the time will be spent in memcpy() and the rest.

@sunmy2019
Member

The rest LGTM!

@eendebakpt
Contributor

Ideally, I'd like to run macrobenchmarks with this but I don't know how to do it so I'll ask someone I think knows @eendebakpt

What exactly do you want to benchmark? Possible regressions during normal usage of bytes.join, or the improvements in the free-threaded (no-GIL) scenario? And why a macrobenchmark? This change is very specific to join.

I created a script to test the releasing of the GIL (included in the details below). It shows that bytes.join(...) indeed releases the GIL when the arguments are exact bytes. However, the GIL is not released for arguments of type memoryview(bytes). The reason is that PyObject_GetBuffer does not put the underlying PyBytes object in buffers[i].obj but a reference to the memoryview itself.
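A minimal Python-level sketch of the operand types involved (not part of the PR, which inspects the C-level Py_buffer instead): memoryview exposes its exporter via .obj, so a view over bytes still points at the original bytes object and is read-only, while a view over a bytearray is writable.

```python
# Python-level illustration of the operand types discussed above; this is
# not code from the PR, which operates on the C-level Py_buffer instead.
base = b"hello"
view = memoryview(base)

# memoryview exposes the object it wraps via .obj ...
assert view.obj is base
# ... and a view over immutable bytes is read-only.
assert view.readonly is True

# A view over a bytearray is writable, so such an operand is not immutable.
assert memoryview(bytearray(b"hello")).readonly is False
print("ok")
```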

Script for testing scaling
# Multi-threading scaling benchmark for bytes.join().
#
# Targets the change in https://github.com/python/cpython/pull/148456
# (gh-148323): bytes.join() now releases the GIL when the operands are
# immutable, even when they reach bytes.join() through a buffer view
# (e.g. memoryview(bytes_object)).
#
# The GIL is only released when the joined result is large enough
# (>= 1 MiB). The benchmarks below are sized accordingly.
#
#
# Inspired by Tools/ftscalingbench/ftscalingbench.py.

import os
import queue
import sys
import threading
import time

WORK_SCALE = 200

JOIN_OUTPUT_BYTES = 16 << 20  # 16 MiB

# Few, large chunks: bytes.join() acquires a buffer per element on the
# GIL and only the final memcpy runs with the GIL dropped.
CHUNK_SIZE = 1 << 20  # 1 MiB per chunk

ALL_BENCHMARKS = {}

threads = []
in_queues = []
out_queues = []


def register_benchmark(func):
    ALL_BENCHMARKS[func.__name__] = func
    return func


# Build fixed inputs once so each benchmark iteration measures join(),
# not list/memoryview construction.
_CHUNK = b"x" * CHUNK_SIZE
_SEP = b""
_NUM_CHUNKS = JOIN_OUTPUT_BYTES // len(_CHUNK)

_BYTES_LIST = [_CHUNK] * _NUM_CHUNKS
_MEMVIEW_LIST = [memoryview(_CHUNK) for _ in range(_NUM_CHUNKS)]
_MIXED_LIST = [memoryview(_CHUNK) if i % 2 else _CHUNK
               for i in range(_NUM_CHUNKS)]


@register_benchmark
def join_bytes():
    # Pure bytes operands: already released the GIL before the PR.
    # Included as a baseline.
    sep = _SEP
    seq = _BYTES_LIST
    for _ in range(WORK_SCALE):
        sep.join(seq)


@register_benchmark
def join_memoryview():
    # memoryview over exact-bytes objects. Before the PR the GIL was
    # held here; after the PR it is released.
    sep = _SEP
    seq = _MEMVIEW_LIST
    for _ in range(WORK_SCALE):
        sep.join(seq)


@register_benchmark
def join_mixed():
    # Mix of bytes and memoryview(bytes). Exercises the same decision
    # path per element inside bytes.join().
    sep = _SEP
    seq = _MIXED_LIST
    for _ in range(WORK_SCALE):
        sep.join(seq)


def bench_one_thread(func):
    t0 = time.perf_counter_ns()
    func()
    t1 = time.perf_counter_ns()
    return t1 - t0


def bench_parallel(func):
    t0 = time.perf_counter_ns()
    for inq in in_queues:
        inq.put(func)
    for outq in out_queues:
        outq.get()
    t1 = time.perf_counter_ns()
    return t1 - t0


def benchmark(func):
    delta_one_thread = bench_one_thread(func)
    delta_many_threads = bench_parallel(func)

    speedup = delta_one_thread * len(threads) / delta_many_threads
    if speedup >= 1:
        factor = speedup
        direction = "faster"
    else:
        factor = 1 / speedup
        direction = "slower"

    use_color = hasattr(sys.stdout, 'isatty') and sys.stdout.isatty()
    color = reset_color = ""
    if use_color:
        if speedup <= 1.1:
            color = "\x1b[31m"  # red
        elif speedup < len(threads) / 2:
            color = "\x1b[33m"  # yellow
        reset_color = "\x1b[0m"

    one_ms = delta_one_thread / 1_000_000
    many_ms = delta_many_threads / 1_000_000
    print(f"{color}{func.__name__:<20} "
          f"1T {one_ms:7.1f} ms  "
          f"{len(threads)}T {many_ms:7.1f} ms  "
          f"{round(factor, 2):>5}x {direction}{reset_color}")


def determine_num_threads_and_affinity():
    if sys.platform != "linux":
        return [None] * os.cpu_count()

    import subprocess
    try:
        output = subprocess.check_output(
            ["lscpu", "-p=cpu,node,core,MAXMHZ"],
            text=True, env={"LC_NUMERIC": "C"})
    except (FileNotFoundError, subprocess.CalledProcessError):
        return [None] * os.cpu_count()

    table = []
    for line in output.splitlines():
        if line.startswith("#"):
            continue
        cpu, node, core, maxhz = line.split(",")
        if maxhz == "":
            maxhz = "0"
        table.append((int(cpu), int(node), int(core), float(maxhz)))

    cpus = []
    cores = set()
    max_mhz_all = max(row[3] for row in table)
    for cpu, node, core, maxmhz in table:
        if node == 0 and core not in cores and maxmhz == max_mhz_all:
            cpus.append(cpu)
            cores.add(core)
    return cpus


def thread_run(cpu, in_queue, out_queue):
    if cpu is not None and hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, (cpu,))

    while True:
        func = in_queue.get()
        if func is None:
            break
        func()
        out_queue.put(None)


def initialize_threads(opts):
    if opts.threads == -1:
        cpus = determine_num_threads_and_affinity()
    else:
        cpus = [None] * opts.threads  # don't set affinity

    print(f"Running benchmarks with {len(cpus)} threads")
    for cpu in cpus:
        inq = queue.Queue()
        outq = queue.Queue()
        in_queues.append(inq)
        out_queues.append(outq)
        t = threading.Thread(target=thread_run, args=(cpu, inq, outq),
                             daemon=True)
        threads.append(t)
        t.start()


def main(opts):
    global WORK_SCALE
    gil_enabled = (not hasattr(sys, "_is_gil_enabled")
                   or sys._is_gil_enabled())
    if gil_enabled:
        sys.stderr.write(
            "note: running with the GIL enabled; parallel scaling is "
            "expected to be near 1x\n")

    benchmark_names = opts.benchmarks
    if benchmark_names:
        for name in benchmark_names:
            if name not in ALL_BENCHMARKS:
                sys.stderr.write(f"Unknown benchmark: {name}\n")
                sys.exit(1)
    else:
        benchmark_names = ALL_BENCHMARKS.keys()

    WORK_SCALE = opts.scale

    if not opts.baseline_only:
        initialize_threads(opts)

    do_bench = not opts.baseline_only and not opts.parallel_only
    for name in benchmark_names:
        func = ALL_BENCHMARKS[name]
        if do_bench:
            benchmark(func)
            continue

        if opts.parallel_only:
            delta_ns = bench_parallel(func)
        else:
            delta_ns = bench_one_thread(func)

        time_ms = delta_ns / 1_000_000
        print(f"{func.__name__:<20} {time_ms:.1f} ms")


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("-t", "--threads", type=int, default=-1,
                        help="number of threads to use")
    parser.add_argument("--scale", type=int, default=WORK_SCALE,
                        help=f"work scale factor (default={WORK_SCALE})")
    parser.add_argument("--baseline-only", default=False,
                        action="store_true",
                        help="only run the baseline benchmarks (single thread)")
    parser.add_argument("--parallel-only", default=False,
                        action="store_true",
                        help="only run the parallel benchmark (many threads)")
    parser.add_argument("benchmarks", nargs="*",
                        help="benchmarks to run")
    options = parser.parse_args()
    main(options)

@picnixz
Member Author

picnixz commented Apr 12, 2026

However, the GIL is not released for arguments of type memoryview(bytes)

Oh, I thought it would have been. Hmm, then when do we have a buffer whose underlying obj field is a bytes object? Is it possible to construct one from Python? Or maybe I should check that I get a copy of myself, that I'm an exact memoryview, and that I'm read-only? (I'm not sure about the invariants here.)
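For illustration only, here is one possible Python-level shape of the per-operand check being discussed. The PR itself does this in C via PyBytes_CheckExact on buffers[i].obj; operand_is_immutable below is a hypothetical helper, not code from the PR.

```python
# Hypothetical sketch (not the PR's code): treat an operand as safely
# immutable only if it is an exact bytes object, or an exact, read-only
# memoryview whose exporter is an exact bytes object.
def operand_is_immutable(obj):
    if type(obj) is bytes:
        return True
    if type(obj) is memoryview:
        return obj.readonly and type(obj.obj) is bytes
    return False

assert operand_is_immutable(b"abc")
assert operand_is_immutable(memoryview(b"abc"))
assert not operand_is_immutable(bytearray(b"abc"))
assert not operand_is_immutable(memoryview(bytearray(b"abc")))
print("ok")
```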

@picnixz
Member Author

picnixz commented Apr 12, 2026

And why a macrobenchmark

I wanted to know whether the comparisons affected general bytes.join() performance, but now that I've read your message, I probably shouldn't have asked for a macrobenchmark :)

@picnixz
Member Author

picnixz commented Apr 12, 2026

Possible regressions during normal usage of bytes.join or the improvements for the no-GIL scenario?

(I wanted to test both)
