Commit 17529d2

Add a reference counting tutorial.
1 parent 1692854 commit 17529d2

3 files changed

Lines changed: 200 additions & 7 deletions

File tree

Doc/extending/first-extension-module.rst

Lines changed: 7 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -475,7 +475,7 @@ So, we'll need to *encode* the data, and we'll use the UTF-8 encoding for it.
475475
and the C API has special support for it.)
476476

477477
The function to encode a Python string into a UTF-8 buffer is named
478-
:c:func:`PyUnicode_AsUTF8AndSize` [#why-pyunicodeasutf8]_.
478+
:c:func:`PyUnicode_AsUTF8` [#why-pyunicodeasutf8]_.
479479
Call it like this:
480480

481481
.. code-block:: c
@@ -484,13 +484,13 @@ Call it like this:
484484
static PyObject *
485485
spam_system(PyObject *self, PyObject *arg)
486486
{
487-
const char *command = PyUnicode_AsUTF8AndSize(arg, NULL);
487+
const char *command = PyUnicode_AsUTF8(arg);
488488
int status = 3;
489489
PyObject *result = PyLong_FromLong(status);
490490
return result;
491491
}
492492
493-
If :c:func:`PyUnicode_AsUTF8AndSize` is successful, *command* will point to the
493+
If :c:func:`PyUnicode_AsUTF8` is successful, *command* will point to the
494494
resulting C string -- a zero-terminated array of bytes [#embedded-nul]_.
495495
This buffer is managed by the *arg* object, which means we don't need to free
496496
it, but we must follow some rules:
@@ -500,14 +500,14 @@ it, but we must follow some rules:
500500
garbage-collected.
501501
* We must not modify it. This is why we use ``const``.
502502

503-
If :c:func:`PyUnicode_AsUTF8AndSize` was *not* successful, it returns a ``NULL``
503+
If :c:func:`PyUnicode_AsUTF8` was *not* successful, it returns a ``NULL``
504504
pointer.
505505
When calling *any* Python C API, we always need to handle such error cases.
506506
The way to do this in general is left for later chapters of this documentation.
507507
For now, be assured that we are already handling errors from
508508
:c:func:`PyLong_FromLong` correctly.
509509

510-
For the :c:func:`PyUnicode_AsUTF8AndSize` call, the correct way to handle
510+
For the :c:func:`PyUnicode_AsUTF8` call, the correct way to handle
511511
errors is returning ``NULL`` from ``spam_system``.
512512
Add an ``if`` block for this:
513513

@@ -518,7 +518,7 @@ Add an ``if`` block for this:
518518
static PyObject *
519519
spam_system(PyObject *self, PyObject *arg)
520520
{
521-
const char *command = PyUnicode_AsUTF8AndSize(arg);
521+
const char *command = PyUnicode_AsUTF8(arg);
522522
if (command == NULL) {
523523
return NULL;
524524
}
@@ -548,7 +548,7 @@ the ``char *`` buffer, and using its result instead of the ``3``:
548548
static PyObject *
549549
spam_system(PyObject *self, PyObject *arg)
550550
{
551-
const char *command = PyUnicode_AsUTF8AndSize(arg);
551+
const char *command = PyUnicode_AsUTF8(arg);
552552
if (command == NULL) {
553553
return NULL;
554554
}

Doc/extending/index.rst

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -77,6 +77,7 @@ as part of this version of CPython.
7777

7878

7979
#. :ref:`first-extension-module`
80+
#. :ref:`reference-counting-intro`
8081

8182

8283
Guides for intermediate topics
Lines changed: 192 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,192 @@
1+
.. highlight:: c
2+
3+
4+
.. _reference-counting-intro:
5+
6+
7+
*************************************
8+
An introduction to reference counting
9+
*************************************
10+
11+
What is reference counting?
12+
===========================
13+
14+
In CPython, objects are garbage collected through a scheme known as
15+
"reference counting". This means that every object keeps count of the number
16+
of references to them.
17+
18+
For example, take the following code:
19+
20+
.. code-block:: python
21+
22+
a = object() # refcount: 1
23+
24+
In the above code, the ``object()`` has a single reference (``a``), so it has
25+
a reference count of 1. If we add more references, the reference count will
26+
increase:
27+
28+
.. code-block:: python
29+
30+
a = object() # refcount: 1
31+
b = a # refcount: 2
32+
c = b # refcount: 3
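The counts in the comments above can be observed with :func:`sys.getrefcount`. The following is a minimal sketch, not part of the original tutorial. The absolute number reported includes temporary references held by the call itself and differs between CPython versions, so the sketch only compares differences. The names ``refs``, ``before``, ``after`` and ``restored`` are illustrative.

```python
import sys

obj = object()
refs = []                        # extra references are stored in this list

before = sys.getrefcount(obj)
refs.append(obj)                 # one more reference to the same object
refs.append(obj)                 # and another one
after = sys.getrefcount(obj)

print(after - before)            # how many references were added

refs.clear()                     # drop the extra references again
restored = sys.getrefcount(obj)
```

Storing the extra references in a list rather than in new names keeps the comparison independent of how the interpreter handles local variables.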
33+
34+
35+
When a name is unbound, the reference count is decremented. If the reference
36+
count of an object reaches zero, the object is immediately deallocated.
37+
38+
We can visualize this using the :meth:`~object.__del__` method:
39+
40+
.. code-block:: pycon
41+
42+
>>> class Test:
43+
... def __del__(self):
44+
... print("Deleting")
45+
>>> a = Test() # refcount: 1
46+
>>> del a # refcount: 0
47+
Deleting
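A slightly longer sketch, assuming CPython's immediate deallocation, shows that ``__del__`` only runs once the *last* reference goes away. The ``Tracked`` class and the ``events`` list are hypothetical names used for illustration.

```python
events = []

class Tracked:
    def __del__(self):
        # Record the deallocation instead of printing it.
        events.append("deleted")

a = Tracked()
b = a                              # second reference to the same object

del a                              # one reference (b) is still alive
after_first_del = list(events)

del b                              # last reference gone, __del__ runs
after_second_del = list(events)

print(after_first_del, after_second_del)
```

Deleting ``a`` alone does not trigger ``__del__``, because ``b`` still refers to the object.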
48+
49+
50+
Object references in the C API
51+
==============================
52+
53+
In the C API, all objects are represented by a pointer to a :c:type:`PyObject`.
54+
This is known as a "reference".
55+
For our purposes, the ``PyObject`` structure contains two important pieces of
56+
information:
57+
58+
1. The object's type, accessible through :c:macro:`Py_TYPE`.
59+
2. The object's :term:`reference count`, accessible through :c:macro:`Py_REFCNT`.
60+
61+
When using the C API, we need to manage the reference count of an object on our
62+
own. Or, in other words, we need to tell Python where and when we are using an
63+
object. This is done through two macros:
64+
65+
1. :c:macro:`Py_INCREF`, which increments the object's reference count.
66+
2. :c:macro:`Py_DECREF`, which decrements the object's reference count.
67+
If the object's reference count becomes zero, the object's destructor is
68+
invoked.
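The effect of the two macros can be observed from Python without compiling an extension. The sketch below is an illustration, not part of the original tutorial. It assumes CPython and uses :mod:`ctypes` to call ``Py_IncRef`` and ``Py_DecRef``, the function forms of ``Py_INCREF`` and ``Py_DECREF``. The names ``incref`` and ``decref`` are local aliases chosen for this example.

```python
import ctypes
import sys

# Function forms of the Py_INCREF / Py_DECREF macros.
incref = ctypes.pythonapi.Py_IncRef
incref.argtypes = [ctypes.py_object]
incref.restype = None

decref = ctypes.pythonapi.Py_DecRef
decref.argtypes = [ctypes.py_object]
decref.restype = None

obj = object()
before = sys.getrefcount(obj)

incref(obj)                          # like Py_INCREF(obj) in C
after_incref = sys.getrefcount(obj)

decref(obj)                          # like Py_DECREF(obj) in C
after_decref = sys.getrefcount(obj)

print(after_incref - before, after_decref - before)
```

Every ``incref`` here is paired with a ``decref``, which is the discipline C code must follow by hand.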
69+
70+
To understand how this works in practice, let's go back to our ``system``
71+
function, taking note of ``PyObject *`` uses this time:
72+
73+
.. code-block:: c
74+
75+
:emphasize-lines: 1-2, 9
76+
77+
static PyObject *
78+
spam_system(PyObject *self, PyObject *arg)
79+
{
80+
const char *command = PyUnicode_AsUTF8(arg);
81+
if (command == NULL) {
82+
return NULL;
83+
}
84+
int status = system(command);
85+
PyObject *result = PyLong_FromLong(status);
86+
return result;
87+
}
88+
89+
Again, each ``PyObject *`` is a reference. There are two types of references
90+
in the C API:
91+
92+
1. :term:`Strong references <strong reference>`, in which you are responsible
93+
for calling :c:macro:`Py_DECREF` (or otherwise handing off the reference).
94+
At the end of a function, all strong references should have either been
95+
destroyed or handed off (such as by returning them).
96+
2. :term:`Borrowed references <borrowed reference>`, in which you are *not*
97+
responsible for destroying the reference.
98+
99+
In the ``spam_system`` function, ``self`` and ``arg`` are borrowed references
100+
(meaning we must not decrement their reference count), but ``result`` is a
101+
strong reference. ``result`` is returned, so ownership of the strong reference
102+
passes to the caller, who is now responsible for releasing it. (The term
103+
"stealing" usually describes a function taking ownership of an *argument*.)
104+
105+
106+
Reference counting patterns
107+
===========================
108+
109+
In Python's C API, most functions that return an object return a strong reference, and as such,
110+
you need to release those references when you are done with them. For example,
111+
let's say that we wanted to change our ``system`` function to only accept ASCII
112+
strings as an input. We would first call :c:func:`PyUnicode_AsASCIIString` to
113+
convert the string to a Python :class:`bytes` object, and then use
114+
:c:macro:`PyBytes_AS_STRING` to extract the internal ``const char *`` buffer
115+
from it.
116+
117+
To visualize:
118+
119+
.. code-block:: c
120+
121+
:emphasize-lines: 4-8, 10
122+
123+
static PyObject *
124+
spam_system(PyObject *self, PyObject *arg)
125+
{
126+
PyObject *bytes = PyUnicode_AsASCIIString(arg); // Strong reference
127+
if (bytes == NULL) {
128+
return NULL;
129+
}
130+
const char *command = PyBytes_AS_STRING(bytes);
131+
int status = system(command);
132+
Py_DECREF(bytes); // Release the strong reference
133+
PyObject *result = PyLong_FromLong(status);
134+
return result;
135+
}
136+
137+
Note that we have to call ``Py_DECREF(bytes)`` *after* we call ``system``.
138+
The buffer returned by ``PyBytes_AS_STRING`` belongs to ``bytes``, so if we
139+
released ``bytes`` first, the buffer might be freed before ``system`` reads it.
140+
141+
142+
The pitfalls of reference counting
143+
==================================
144+
145+
As mentioned previously, *most* functions will return a strong reference, but not
146+
all of them! In the above example, if ``PyUnicode_AsASCIIString`` were to
147+
return a borrowed reference, then ``Py_DECREF(bytes)`` would release a
148+
reference we do not own, causing a use-after-free later in the program.
149+
150+
Unfortunately, there is no way to determine whether a reference is strong or
151+
borrowed just by looking at it. This can lead to many memory-safety bugs,
152+
and to make matters worse, debugging bugs of this nature is often very difficult.
153+
154+
For example, let's add a bug to ``spam_system`` where we release a borrowed
155+
reference:
156+
157+
.. code-block:: c
158+
159+
:emphasize-lines: 5
160+
161+
static PyObject *
162+
spam_system(PyObject *self, PyObject *arg)
163+
{
164+
const char *command = PyUnicode_AsUTF8(arg);
165+
Py_DECREF(arg); // Bug: arg is a borrowed reference, so we must not release it
166+
if (command == NULL) {
167+
return NULL;
168+
}
169+
int status = system(command);
170+
PyObject *result = PyLong_FromLong(status);
171+
return result;
172+
}
173+
174+
175+
Running the above code will likely result in a crash, but *not* in the
176+
``spam_system`` function. In fact, ``spam_system`` may not even show up in
177+
the stack trace. Because we released a reference that we did not own, the
178+
crash typically occurs after ``spam_system`` returns, when the *caller*
179+
releases an ``arg`` that may already have been deallocated. This can make
180+
it very difficult to track down where a reference counting error was made.
181+
182+
Another common error is forgetting to release a strong reference, in which case
183+
the object will leak its memory. This is known as a "reference leak".
184+
In this case, tools such as `Memray <https://bloomberg.github.io/memray/>`_
185+
are able to identify which objects are leaking, which does make debugging
186+
a little bit easier, but objects often hold references to many other objects,
187+
which will *also* leak, making it even harder to find the cause of the leak.
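A reference leak can be simulated from Python. The sketch below is illustrative only and assumes CPython. It uses ``ctypes.pythonapi.Py_IncRef`` to stand in for C code that forgets a matching ``Py_DECREF``, and a weak reference to observe whether the object is still alive. ``Resource``, ``watcher`` and the other names are hypothetical.

```python
import ctypes
import weakref

incref = ctypes.pythonapi.Py_IncRef
incref.argtypes = [ctypes.py_object]
incref.restype = None

decref = ctypes.pythonapi.Py_DecRef
decref.argtypes = [ctypes.py_object]
decref.restype = None

class Resource:
    pass

obj = Resource()
watcher = weakref.ref(obj)       # does not keep the object alive

incref(obj)                      # simulated bug: no matching decref
del obj                          # the last *name* is gone...
leaked = watcher() is not None   # ...but the object is still alive

# Repair the leak by releasing the forgotten reference.
survivor = watcher()
decref(survivor)
del survivor
freed = watcher() is None

print(leaked, freed)
```

In a real extension nothing holds on to the leaked object, so the final repair step is not available, which is why such leaks have to be found with tooling.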
188+
189+
Because CPython does not track where reference counts are incremented and
190+
decremented, reference counting bugs are notoriously difficult to identify and
191+
fix. This is one of the reasons many developers choose to use other programming
192+
languages and tools when interfacing with Python's C API.
