|
| 1 | +.. highlight:: c |
| 2 | + |
| 3 | + |
| 4 | +.. _reference-counting-intro: |
| 5 | + |
| 6 | + |
| 7 | +************************************* |
| 8 | +An introduction to reference counting |
| 9 | +************************************* |
| 10 | + |
| 11 | +What is reference counting? |
| 12 | +=========================== |
| 13 | + |
| 14 | +In CPython, objects are garbage collected through a scheme known as |
| 15 | +"reference counting". This means that every object keeps a count of the number |
| 16 | +of references to them. |
| 17 | + |
| 18 | +For example, take the following code: |
| 19 | + |
| 20 | +.. code-block:: python |
| 21 | + |
| 22 | +    a = object()  # refcount: 1 |
| 23 | + |
| 24 | +In the above code, the ``object()`` has a single reference (``a``), so it has |
| 25 | +a reference count of 1. If we add more references, the reference count will |
| 26 | +increase: |
| 27 | + |
| 28 | +.. code-block:: python |
| 29 | + |
| 30 | +    a = object()  # refcount: 1 |
| 31 | +    b = a         # refcount: 2 |
| 32 | +    c = b         # refcount: 3 |
| 33 | + |
| 34 | + |
| 35 | +When a name is unbound, the reference count of the object it referred to is decremented. If the reference |
| 36 | +count of an object reaches zero, the object is immediately deallocated. |
| 37 | + |
| 38 | +We can visualize this using the :meth:`~object.__del__` method: |
| 39 | + |
| 40 | +.. code-block:: pycon |
| 41 | + |
| 42 | +    >>> class Test: |
| 43 | +    ...     def __del__(self): |
| 44 | +    ...         print("Deleting") |
| 45 | +    >>> a = Test()  # refcount: 1 |
| 46 | +    >>> del a  # refcount: 0 |
| 47 | +    Deleting |
| 48 | + |
| 49 | + |
| 50 | +Object references in the C API |
| 51 | +============================== |
| 52 | + |
| 53 | +In the C API, all objects are represented by a pointer to a :c:type:`PyObject`. |
| 54 | +This is known as a "reference". |
| 55 | +For our purposes, the ``PyObject`` structure contains two important pieces of |
| 56 | +information: |
| 57 | + |
| 58 | +1. The object's type, accessible through :c:macro:`Py_TYPE`. |
| 59 | +2. The object's :term:`reference count`, accessible through :c:macro:`Py_REFCNT`. |
| 60 | + |
| 61 | +When using the C API, we need to manage the reference count of an object on our |
| 62 | +own. Or, in other words, we need to tell Python where and when we are using an |
| 63 | +object. This is done through two macros: |
| 64 | + |
| 65 | +1. :c:macro:`Py_INCREF`, which increments the object's reference count. |
| 66 | +2. :c:macro:`Py_DECREF`, which decrements the object's reference count. |
| 67 | + If the object's reference count becomes zero, the object's destructor is |
| 68 | + invoked. |
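To make the two operations above concrete, here is a minimal sketch of the bookkeeping they perform. It is a toy model, not CPython's implementation: ``ToyObject``, ``toy_new``, ``toy_incref``, ``toy_decref`` and ``toy_demo`` are hypothetical names invented for this illustration, and the real ``PyObject`` layout and macros are more involved.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for PyObject (hypothetical): a count plus a destructor hook. */
typedef struct ToyObject {
    long refcnt;
    void (*dealloc)(struct ToyObject *);
} ToyObject;

static int toy_dealloc_calls = 0;   /* number of objects destroyed so far */

static void
toy_dealloc(ToyObject *op)
{
    toy_dealloc_calls++;
    free(op);
}

/* Create an object; the caller receives the only reference (refcnt == 1). */
static ToyObject *
toy_new(void)
{
    ToyObject *op = malloc(sizeof(ToyObject));
    if (op == NULL) {
        return NULL;
    }
    op->refcnt = 1;
    op->dealloc = toy_dealloc;
    return op;
}

/* Rough analogue of Py_INCREF: record one more user of the object. */
static void
toy_incref(ToyObject *op)
{
    op->refcnt++;
}

/* Rough analogue of Py_DECREF: run the destructor when the count hits zero. */
static void
toy_decref(ToyObject *op)
{
    if (--op->refcnt == 0) {
        op->dealloc(op);
    }
}

/* Usage: with two references, the object survives the first decref.
   Returns how many objects this call destroyed, or -1 on allocation failure. */
static int
toy_demo(void)
{
    int before = toy_dealloc_calls;
    ToyObject *op = toy_new();      /* refcnt: 1 */
    if (op == NULL) {
        return -1;
    }
    toy_incref(op);                 /* refcnt: 2 */
    toy_decref(op);                 /* refcnt: 1, still alive */
    assert(toy_dealloc_calls == before);
    toy_decref(op);                 /* refcnt: 0, toy_dealloc runs */
    return toy_dealloc_calls - before;
}
```

The point of the sketch is that nothing in the object records *who* holds each reference; only the total is tracked, so every increment must be matched by exactly one decrement somewhere in the program.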
| 69 | + |
| 70 | +To understand how this works in practice, let's go back to our ``system`` |
| 71 | +function, taking note of ``PyObject *`` uses this time: |
| 72 | + |
| 73 | +.. code-block:: c |
| 74 | + |
| 75 | +    :emphasize-lines: 1-2, 9 |
| 76 | + |
| 77 | +    static PyObject * |
| 78 | +    spam_system(PyObject *self, PyObject *arg) |
| 79 | +    { |
| 80 | +        const char *command = PyUnicode_AsUTF8(arg); |
| 81 | +        if (command == NULL) { |
| 82 | +            return NULL; |
| 83 | +        } |
| 84 | +        int status = system(command); |
| 85 | +        PyObject *result = PyLong_FromLong(status); |
| 86 | +        return result; |
| 87 | +    } |
| 88 | + |
| 89 | +Again, each ``PyObject *`` is a reference. There are two types of references |
| 90 | +in the C API: |
| 91 | + |
| 92 | +1. :term:`Strong references <strong reference>`, in which you are responsible |
| 93 | + for calling :c:macro:`Py_DECREF` (or otherwise handing off the reference). |
| 94 | + At the end of a function, all strong references should have either been |
| 95 | + destroyed or handed off (such as by returning them). |
| 96 | +2. :term:`Borrowed references <borrowed reference>`, in which you are *not* |
| 97 | + responsible for destroying the reference. |
| 98 | + |
| 99 | +In the ``spam_system`` function, ``self`` and ``arg`` are borrowed references |
| 100 | +(meaning we must not decrement their reference count), but ``result`` is a |
| 101 | +strong reference. ``result`` is returned, so ownership of the strong reference |
| 102 | +is transferred to the caller, who becomes responsible for releasing it. (The |
| 103 | +term "stealing" a reference usually describes the opposite direction: a function taking over a strong reference that was passed to it as an argument.) |
| 104 | + |
| 105 | + |
| 106 | +Reference counting patterns |
| 107 | +=========================== |
| 108 | + |
| 109 | +In Python's C API, most functions will return a strong reference, and as such, |
| 110 | +you need to release those references when you are done with them. For example, |
| 111 | +let's say that we wanted to change our ``system`` function to only accept ASCII |
| 112 | +strings as an input. We would first call :c:func:`PyUnicode_AsASCIIString` to |
| 113 | +convert the string to a Python :class:`bytes` object, and then use |
| 114 | +:c:macro:`PyBytes_AS_STRING` to extract the internal ``const char *`` buffer |
| 115 | +from it. |
| 116 | + |
| 117 | +To visualize: |
| 118 | + |
| 119 | +.. code-block:: c |
| 120 | + |
| 121 | +    :emphasize-lines: 4-8, 10 |
| 122 | + |
| 123 | +    static PyObject * |
| 124 | +    spam_system(PyObject *self, PyObject *arg) |
| 125 | +    { |
| 126 | +        PyObject *bytes = PyUnicode_AsASCIIString(arg); // Strong reference |
| 127 | +        if (bytes == NULL) { |
| 128 | +            return NULL; |
| 129 | +        } |
| 130 | +        const char *command = PyBytes_AS_STRING(bytes); |
| 131 | +        int status = system(command); |
| 132 | +        Py_DECREF(bytes); // Release the strong reference |
| 133 | +        PyObject *result = PyLong_FromLong(status); |
| 134 | +        return result; |
| 135 | +    } |
| 136 | + |
| 137 | +Note that we have to call ``Py_DECREF(bytes)`` *after* we call ``system``. |
| 138 | +If we did it before, then the string returned by ``PyBytes_AS_STRING`` |
| 139 | +might be freed and cause a crash upon trying to use it in ``system``. |
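The ordering constraint described above can be sketched with a small stand-alone model. This is an illustration under stated assumptions, not CPython code: ``ToyBytes``, ``toy_bytes_new``, ``toy_bytes_as_string``, ``toy_bytes_decref`` and ``toy_command_length`` are hypothetical names, and ``strlen`` stands in for the call to ``system``.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a bytes object (hypothetical): owns a heap buffer. */
typedef struct {
    long refcnt;
    char *buffer;
} ToyBytes;

static int toy_buffers_freed = 0;   /* number of buffers released so far */

/* Create an object holding a copy of `text`; the caller owns the reference. */
static ToyBytes *
toy_bytes_new(const char *text)
{
    assert(text != NULL);
    ToyBytes *op = malloc(sizeof(ToyBytes));
    if (op == NULL) {
        return NULL;
    }
    size_t size = strlen(text) + 1;
    op->buffer = malloc(size);
    if (op->buffer == NULL) {
        free(op);
        return NULL;
    }
    memcpy(op->buffer, text, size);
    op->refcnt = 1;
    return op;
}

/* Like PyBytes_AS_STRING: returns a pointer *into* the object, not a copy. */
static const char *
toy_bytes_as_string(ToyBytes *op)
{
    return op->buffer;
}

/* Dropping the last reference frees the buffer along with the object. */
static void
toy_bytes_decref(ToyBytes *op)
{
    if (--op->refcnt == 0) {
        free(op->buffer);
        toy_buffers_freed++;
        free(op);
    }
}

/* Correct ordering: finish using the buffer, then release its owner. */
static size_t
toy_command_length(const char *text)
{
    ToyBytes *bytes = toy_bytes_new(text);
    if (bytes == NULL) {
        return 0;
    }
    const char *command = toy_bytes_as_string(bytes);
    size_t length = strlen(command);   /* buffer is still alive here */
    toy_bytes_decref(bytes);           /* `command` is dangling from now on */
    return length;
}
```

Swapping the ``strlen`` and ``toy_bytes_decref`` lines in ``toy_command_length`` would read freed memory, which is the same mistake as calling ``Py_DECREF(bytes)`` before ``system(command)``.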
| 140 | + |
| 141 | + |
| 142 | +The pitfalls of reference counting |
| 143 | +================================== |
| 144 | + |
| 145 | +As mentioned previously, *most* functions will return a strong reference, but not |
| 146 | +all of them! In the above example, if ``PyUnicode_AsASCIIString`` were to |
| 147 | +return a borrowed reference, then our ``Py_DECREF(bytes)`` would release a reference we do not own, causing a use-after-free somewhere |
| 148 | +down the call stack. |
| 149 | + |
| 150 | +Unfortunately, there is no way to determine whether a reference is strong or |
| 151 | +borrowed just by looking at it. This can lead to many memory-safety bugs, |
| 152 | +and to make matters worse, debugging bugs of this nature is often very difficult. |
| 153 | + |
| 154 | +For example, let's add a bug to ``spam_system`` where we release a borrowed |
| 155 | +reference: |
| 156 | + |
| 157 | +.. code-block:: c |
| 158 | + |
| 159 | +    :emphasize-lines: 5 |
| 160 | + |
| 161 | +    static PyObject * |
| 162 | +    spam_system(PyObject *self, PyObject *arg) |
| 163 | +    { |
| 164 | +        const char *command = PyUnicode_AsUTF8(arg); |
| 165 | +        Py_DECREF(arg); // Bug: arg is a borrowed reference! |
| 166 | +        if (command == NULL) { |
| 167 | +            return NULL; |
| 168 | +        } |
| 169 | +        int status = system(command); |
| 170 | +        PyObject *result = PyLong_FromLong(status); |
| 171 | +        return result; |
| 172 | +    } |
| 173 | + |
| 174 | + |
| 175 | +Running the above code will most likely result in a crash, but typically *not* in the |
| 176 | +``spam_system`` function. In fact, ``spam_system`` usually won't even show up in the |
| 177 | +stack trace. The crash tends to occur after ``spam_system`` returns and the *caller* |
| 178 | +tries to release its reference to ``arg``, but since we stole the reference, |
| 179 | +``arg`` is now invalid. This can make it very difficult to track down where |
| 180 | +a reference counting error was made. |
| 181 | + |
| 182 | +Another common error is forgetting to release a strong reference, in which case |
| 183 | +the object will leak its memory. This is known as a "reference leak". |
| 184 | +In this case, tools such as `Memray <https://bloomberg.github.io/memray/>`_ |
| 185 | +are able to identify which objects are leaking, which does make debugging |
| 186 | +a little bit easier, but objects often hold references to many other objects, |
| 187 | +which will *also* leak, making it even harder to find the cause of the leak. |
| 188 | + |
| 189 | +Because CPython does not track where reference counts are incremented and |
| 190 | +decremented, reference counting bugs are notoriously difficult to identify and |
| 191 | +fix. This is one of the reasons many developers choose to use other programming |
| 192 | +languages and tools when interfacing with Python's C API. |