microsoft · astafan8 · Jun 12, 2026
@@ -75,3 +75,33 @@ We note that the dataset currently exclusively supports storing data in an
 SQLite database. This is not an intrinsic limitation of the dataset and
 measurement layer. It is possible that at a future state support for writing
 to a different backend will be added.
+
+.. _sec:design_split_storage:
+
+Split Raw Data Storage
+======================
+
+As the main SQLite database grows with many datasets, browsing experiments and
+loading metadata can become slower due to the file size. To address this,
+QCoDeS supports an optional **split raw data storage** mode (see
+:ref:`sec:intro_split_raw_data` for user-facing details).
+
+From a design perspective, this feature adds a thin routing layer inside the
+``DataSet`` class without changing any public interfaces:
+
+- A ``_data_conn`` property transparently returns either the main database
+  connection or a per-dataset raw data connection, depending on the
+  configuration.
+- Write paths (``add_results``) and read paths (``get_parameter_data``,
+  ``DataSetCacheWithDBBackend``, ``number_of_results``, ``__len__``) go
+  through this routing point. ``_BackgroundWriter`` opens and caches its own
+  connection to the same per-dataset file, keyed by the file path.
+- The per-dataset SQLite file is a lightweight database containing only the
+  results table -- no QCoDeS metadata schema. Connections to it are opened
+  with the numpy type adapters registered.
+- Subscriber triggers (used for real-time data callbacks) are created on the
+  data connection so that they fire regardless of which database holds the
+  results table.
+
+The implementation is contained in ``qcodes.dataset.raw_data_storage`` (helper
+functions) and a handful of additions to ``qcodes.dataset.data_set`` (routing
+logic). The ``Measurement`` context manager, ``DataSaver``, and all export
+functions work without modification.
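
A minimal sketch of the routing idea described in this section, using plain `sqlite3` rather than QCoDeS itself. The `RoutedStore` class, the `make_db` helper, and the two-column `results` table are hypothetical stand-ins for illustration; only the `_data_conn` fallback mirrors the property added in this PR.

```python
from __future__ import annotations

import sqlite3


class RoutedStore:
    """Toy stand-in for a dataset with a main and an optional raw-data connection."""

    def __init__(
        self, main: sqlite3.Connection, raw: sqlite3.Connection | None = None
    ) -> None:
        self.conn = main
        self._raw_data_conn = raw

    @property
    def _data_conn(self) -> sqlite3.Connection:
        # Single routing point: prefer the raw-data connection when present,
        # otherwise fall back to the main database connection.
        if self._raw_data_conn is not None:
            return self._raw_data_conn
        return self.conn

    def add_results(self, rows: list[tuple[float, float]]) -> None:
        # Using the connection as a context manager commits the transaction.
        with self._data_conn:
            self._data_conn.executemany(
                'INSERT INTO "results" (x, y) VALUES (?, ?)', rows
            )

    def __len__(self) -> int:
        cursor = self._data_conn.execute('SELECT COUNT(*) FROM "results"')
        return cursor.fetchone()[0]


def make_db() -> sqlite3.Connection:
    """Create an in-memory database holding an empty results table."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE "results" (x REAL, y REAL)')
    return conn


main, raw = make_db(), make_db()
split = RoutedStore(main, raw)
split.add_results([(0.0, 1.0), (1.0, 2.0)])
rows_in_main = main.execute('SELECT COUNT(*) FROM "results"').fetchone()[0]

plain = RoutedStore(make_db())
plain.add_results([(0.0, 1.0)])
```

With a raw-data connection set, rows land only in the second database and the main one keeps an empty table. Without it, the same calls operate on the main connection.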
@@ -75,3 +75,33 @@ For dataset operations, QCoDeS provides functions for:
 - **Exporting datasets**: :doc:`Exporting data to other file formats <../examples/DataSet/Exporting-data-to-other-file-formats>`
 - **Extracting runs between databases**: :doc:`Extracting runs from one DB file to another <../examples/DataSet/Extracting-runs-from-one-DB-file-to-another>` and :func:`qcodes.dataset.extract_runs_into_db`
 - **Bulk export and metadata-only databases**: :func:`qcodes.dataset.export_datasets_and_create_metadata_db` for creating lightweight metadata-only databases while exporting all data to NetCDF files
+
+.. _sec:intro_split_raw_data:
+
+Split Raw Data Storage
+======================
+
+By default, all measurement data (the results table rows) is stored in the same SQLite database alongside metadata such as experiments, runs, parameter layouts, and dependencies. Over time, the main database file can grow very large, which can slow down operations like browsing experiments and loading metadata.
+
+QCoDeS supports an optional **split raw data storage** mode in which the actual measurement data for each ``DataSet`` is written to an individual, per-dataset SQLite file while all metadata remains in the main database. Each per-dataset file is named after the dataset's GUID (e.g. ``<guid>.db``) and is stored in a configurable folder.
+
+This feature is controlled by two configuration options in ``qcodesrc.json``:
+
+- ``dataset.raw_data_to_separate_db`` (bool, default ``false``): enables or disables split storage.
+- ``dataset.raw_data_path`` (string, default ``"{db_location}"``): the folder where per-dataset files are created. The ``{db_location}`` placeholder is expanded to a folder derived from the main database path (e.g. ``~/experiments.db`` becomes ``~/experiments_db/``).
+
+When enabled:
+
+- The main database retains the full results table schema (column definitions) but no data rows are written to it, keeping it lightweight.
+- All ``INSERT`` and ``SELECT`` operations on results data are transparently routed to the per-dataset file.
+- The path to the per-dataset file is persisted in the run's metadata (``raw_data_db_path``), so ``load_by_id`` and related loading functions automatically reconnect to the correct file.
+- All public ``DataSet`` APIs (``get_parameter_data``, ``to_pandas_dataframe``, ``to_xarray_dataset``, ``cache``, ``export``, etc.) work identically whether split storage is enabled or not.
+
+Example runtime configuration::
+
+    import qcodes as qc
+
+    qc.config.dataset.raw_data_to_separate_db = True
+    qc.config.dataset.raw_data_path = "/data/raw_measurements/"
+
+For more details on database management, see the :doc:`Database notebook <../examples/DataSet/Database>`.
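
The `{db_location}` expansion described above can be illustrated with a short sketch. The helper names `expand_raw_data_path` and `raw_data_file_for` are hypothetical and are not the functions in `qcodes.dataset.raw_data_storage`; the naming rule (`experiments.db` becomes `experiments_db/`) is taken from the documentation text, and the fallback for a database path without an extension is an assumption.

```python
from pathlib import Path


def expand_raw_data_path(template: str, main_db_path: str) -> Path:
    """Expand the ``{db_location}`` placeholder relative to the main database."""
    db = Path(main_db_path).expanduser()
    # "experiments.db" -> "experiments_db"; assume "_db" when there is no suffix.
    suffix = db.suffix.lstrip(".") or "db"
    db_location = db.with_name(f"{db.stem}_{suffix}")
    return Path(template.replace("{db_location}", str(db_location))).expanduser()


def raw_data_file_for(guid: str, template: str, main_db_path: str) -> Path:
    """Return the per-dataset file path, named after the dataset's GUID."""
    return expand_raw_data_path(template, main_db_path) / f"{guid}.db"


folder = expand_raw_data_path("{db_location}", "/tmp/experiments.db")
custom = expand_raw_data_path("/data/raw_measurements/", "/tmp/experiments.db")
example_file = raw_data_file_for("example-guid", "{db_location}", "/tmp/experiments.db")
```

A template without the placeholder is returned unchanged, which matches the custom-folder case in the runtime configuration example.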
@@ -167,6 +167,71 @@
     "\n",
     "Moreover, we have also written an [example notebook](Extracting-runs-from-one-DB-file-to-another.ipynb) of transferring `DataSets` between database files that may serve as a template for more complex data organization protocols."
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Split Raw Data Storage\n",
+    "\n",
+    "As the main database grows with many datasets, browsing experiments and loading metadata can become slower. QCoDeS supports an optional **split raw data storage** mode that writes the raw measurement data for each dataset into its own individual SQLite file, while keeping all metadata (experiments, runs, parameters, dependencies) in the main database.\n",
+    "\n",
+    "This keeps the main database lightweight and makes it faster to work with, while still allowing all existing `DataSet` APIs to function identically.\n",
+    "\n",
+    "### Configuration\n",
+    "\n",
+    "Split raw data storage is controlled by two configuration options:\n",
+    "\n",
+    "- `dataset.raw_data_to_separate_db` (bool, default `False`): enables or disables split storage.\n",
+    "- `dataset.raw_data_path` (string, default `\"{db_location}\"`): the folder where per-dataset SQLite files are created. The `{db_location}` placeholder expands to a folder derived from the main database path (e.g. `~/experiments.db` becomes `~/experiments_db/`).\n",
+    "\n",
+    "You can enable it at runtime:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Enable split raw data storage\n",
+    "qc.config.dataset.raw_data_to_separate_db = True\n",
+    "\n",
+    "# Optionally set a custom path for per-dataset files\n",
+    "qc.config.dataset.raw_data_path = \"/data/raw_measurements/\"\n",
+    "\n",
+    "# Or use the default which derives from the main DB location:\n",
+    "# qc.config.dataset.raw_data_path = \"{db_location}\"\n",
+    "# e.g. ~/experiments.db -> ~/experiments_db/"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Or permanently in your `qcodesrc.json`:\n",
+    "\n",
+    "```json\n",
+    "{\n",
+    "  \"dataset\": {\n",
+    "    \"raw_data_to_separate_db\": true,\n",
+    "    \"raw_data_path\": \"{db_location}\"\n",
+    "  }\n",
+    "}\n",
+    "```\n",
+    "\n",
+    "### How It Works\n",
+    "\n",
+    "When split storage is enabled:\n",
+    "\n",
+    "1. When a measurement starts (`mark_started()`), a per-dataset SQLite file named `<guid>.db` is created in the configured folder.\n",
+    "2. All measurement data (results table rows) is written to this per-dataset file instead of the main database.\n",
+    "3. The main database retains the results table schema (column definitions) but contains no data rows, keeping it small.\n",
+    "4. The path to the per-dataset file is saved in the run metadata, so `load_by_id()` and related functions automatically find and reconnect to the correct file.\n",
+    "5. All `DataSet` methods (`get_parameter_data`, `to_pandas_dataframe`, `to_xarray_dataset`, `cache`, `export`, etc.) work transparently with split storage.\n",
+    "\n",
+    "> **Note:** Datasets created with split storage enabled can still be loaded after the configuration is changed back to the default, provided the per-dataset files remain at their original paths. The location is read from the run metadata, not from the current configuration."
+   ]
   }
  ],
  "metadata": {

@@ -79,7 +79,9 @@
         "export_chunked_export_of_large_files_enabled": false,
         "export_chunked_threshold": 1000,
         "in_memory_cache": true,
-        "load_from_exported_file": false
+        "load_from_exported_file": false,
+        "raw_data_to_separate_db": false,
+        "raw_data_path": "{db_location}"
     },
     "telemetry":
     {

@@ -382,6 +382,16 @@
                     "type": "boolean",
                     "default": true,
                     "description": "Should the data be cached in memory as it is measured. Useful to disable for large datasets to save on memory consumption."
+                },
+                "raw_data_to_separate_db": {
+                    "type": "boolean",
+                    "default": false,
+                    "description": "If true, raw measurement data (results tables) will be written to individual per-dataset SQLite files instead of the main database. Metadata remains in the main database."
+                },
+                "raw_data_path": {
+                    "type": "string",
+                    "default": "{db_location}",
+                    "description": "Path to the folder where per-dataset raw data SQLite files are stored. The {db_location} placeholder expands to a directory next to the main .db file and named after it, e.g. for ~/experiments.db the raw data files are stored in ~/experiments_db/"
                 }
             },
             "description": "Settings related to the DataSet and Measurement Context manager",

@@ -66,6 +66,7 @@
     get_run_timestamp_from_run_id,
     get_runid_from_guid,
     get_sample_name_from_experiment_id,
+    get_shaped_parameter_data_for_one_paramtree,
     mark_run_complete,
     remove_trigger,
     run_exists,
@@ -99,6 +100,12 @@
     load_to_xarray_dataset_dict,
     xarray_to_h5netcdf_with_complex_numbers,
 )
+from .raw_data_storage import (
+    connect_to_raw_data_db,
+    create_raw_data_db,
+    get_raw_data_db_path,
+    is_raw_data_storage_enabled,
+)
 from .subscriber import _Subscriber
 
 if TYPE_CHECKING:
@@ -141,22 +148,40 @@ def __init__(self, queue: Queue[Any], conn: AtomicConnection):
 
     def run(self) -> None:
         self.conn = connect(self.path)
+        self._raw_data_conns: dict[str, AtomicConnection] = {}
 
         while self.keep_writing:
             item = self.queue.get()
             if item["keys"] == "stop":
                 self.keep_writing = False
                 self.conn.close()
+                for raw_conn in self._raw_data_conns.values():
+                    raw_conn.close()
             elif item["keys"] == "finalize":
                 _WRITERS[self.path].active_datasets.remove(item["values"])
             else:
-                self.write_results(item["keys"], item["values"], item["table_name"])
+                conn = self._get_conn_for_item(item)
+                self.write_results(
+                    conn, item["keys"], item["values"], item["table_name"]
+                )
             self.queue.task_done()
 
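+    def _get_conn_for_item(self, item: dict[str, Any]) -> AtomicConnection:
+        """
+        Return the connection that the given queue item should be written to.
+
+        Items without a ``raw_data_path`` go to the main database connection.
+        Raw data connections are opened lazily and cached per file path.
+        """
+        raw_data_path = item.get("raw_data_path")
+        if raw_data_path is None:
+            return self.conn
+        if raw_data_path not in self._raw_data_conns:
+            self._raw_data_conns[raw_data_path] = connect_to_raw_data_db(raw_data_path)
+        return self._raw_data_conns[raw_data_path]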
+
     def write_results(
-        self, keys: Sequence[str], values: Sequence[list[Any]], table_name: str
+        self,
+        conn: AtomicConnection,
+        keys: Sequence[str],
+        values: Sequence[list[Any]],
+        table_name: str,
     ) -> None:
-        insert_many_values(self.conn, table_name, keys, values)
+        insert_many_values(conn, table_name, keys, values)
 
     def shutdown(self) -> None:
         """
@@ -272,6 +297,7 @@ def __init__(
         self._cache: DataSetCacheWithDBBackend = DataSetCacheWithDBBackend(self)
         self._results: list[dict[str, VALUE]] = []
         self._in_memory_cache = in_memory_cache
+        self._raw_data_conn: AtomicConnection | None = None
 
         if run_id is not None:
             if not run_exists(self.conn, run_id):
@@ -290,6 +316,13 @@ def __init__(
             self._export_info = ExportInfo.from_str(
                 self.metadata.get("export_info", "")
             )
+            # If this dataset was saved with raw data in a separate db,
+            # re-open that connection for reads. If the file is missing,
+            # ``_data_conn`` falls back to the main database connection.
+            raw_db_path = self._metadata.get("raw_data_db_path")
+            if raw_db_path is not None and Path(raw_db_path).is_file():
+                self._raw_data_conn = connect_to_raw_data_db(
+                    raw_db_path, read_only=read_only
+                )
         else:
             # Actually perform all the side effects needed for the creation
             # of a new dataset. Note that a dataset is created (in the DB)
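
The hunk above re-opens the per-dataset file from the path stored in run metadata. A self-contained sketch of that persist-then-reopen flow, using plain `sqlite3`, may make the behaviour easier to review. The `runs` table layout, the `results-1-1` table name, the `example-guid` file name, and `open_data_conn` are hypothetical simplifications, not the QCoDeS schema or API.

```python
import sqlite3
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
raw_path = workdir / "experiments_db" / "example-guid.db"
raw_path.parent.mkdir(parents=True)

# Measurement start: the main database keeps the schema and the stored path.
main = sqlite3.connect(str(workdir / "experiments.db"))
main.execute("CREATE TABLE runs (run_id INTEGER PRIMARY KEY, raw_data_db_path TEXT)")
main.execute('CREATE TABLE "results-1-1" (x REAL)')
main.execute("INSERT INTO runs VALUES (1, ?)", (str(raw_path),))
main.commit()

# The data rows go to the per-dataset file only.
raw = sqlite3.connect(str(raw_path))
raw.execute('CREATE TABLE "results-1-1" (x REAL)')
raw.executemany('INSERT INTO "results-1-1" VALUES (?)', [(0.1,), (0.2,), (0.3,)])
raw.commit()
raw.close()


def open_data_conn(main_conn: sqlite3.Connection, run_id: int) -> sqlite3.Connection:
    """Reconnect to the per-dataset file if it is recorded and still exists."""
    row = main_conn.execute(
        "SELECT raw_data_db_path FROM runs WHERE run_id = ?", (run_id,)
    ).fetchone()
    stored = row[0] if row is not None else None
    if stored is not None and Path(stored).is_file():
        return sqlite3.connect(stored)
    # Missing file: fall back to the main connection, as _data_conn does.
    return main_conn


data_conn = open_data_conn(main, 1)
loaded = [r[0] for r in data_conn.execute('SELECT x FROM "results-1-1" ORDER BY rowid')]
rows_in_main = main.execute('SELECT COUNT(*) FROM "results-1-1"').fetchone()[0]
data_conn.close()

# Simulate a moved or deleted per-dataset file.
raw_path.unlink()
fallback_is_main = open_data_conn(main, 1) is main
```

In this sketch the fallback is silent: once the file is gone, reads hit the main database, whose results table holds no rows. Whether the real loader should warn in that case is a question for review.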
@@ -358,6 +391,17 @@ def prepare(
     def cache(self) -> DataSetCacheWithDBBackend:
         return self._cache
 
+    @property
+    def _data_conn(self) -> AtomicConnection:
+        """Connection to use for results-table data operations.
+
+        Returns the separate raw-data connection when split storage is
+        active, otherwise falls back to the main database connection.
+        """
+        if self._raw_data_conn is not None:
+            return self._raw_data_conn
+        return self.conn
+
     @property
     def run_id(self) -> int:
         return self._run_id
@@ -420,7 +464,7 @@ def snapshot_raw(self) -> str | None:
     @property
     def number_of_results(self) -> int:
         sql = f'SELECT COUNT(*) FROM "{self.table_name}"'
-        cursor = atomic_transaction(self.conn, sql)
+        cursor = atomic_transaction(self._data_conn, sql)
         return one(cursor, "COUNT(*)")
 
     @property
@@ -682,12 +726,34 @@ def _perform_start_actions(self, start_bg_writer: bool) -> None:
         Perform the actions that must take place once the run has been started
         """
         paramspecs = new_to_old(self._rundescriber.interdeps).paramspecs
+        raw_data_enabled = is_raw_data_storage_enabled()
 
         for spec in paramspecs:
             add_parameter(
-                spec, conn=self.conn, run_id=self.run_id, insert_into_results_table=True
+                spec,
+                conn=self.conn,
+                run_id=self.run_id,
+                insert_into_results_table=True,
             )
 
+        # When raw data split is enabled, create a per-dataset SQLite file
+        # for results data with the full results table.
+        if raw_data_enabled:
+            raw_db_path = get_raw_data_db_path(self.guid)
+            self._raw_data_conn = create_raw_data_db(
+                raw_db_path,
+                self.table_name,
+                self._rundescriber.interdeps.paramspecs,
+            )
+            # Persist the raw data path in metadata so we can find it when
+            # loading the dataset later.
+            raw_path_str = str(raw_db_path)
+            self._metadata["raw_data_db_path"] = raw_path_str
+            with atomic(self.conn) as aconn:
+                add_data_to_dynamic_columns(
+                    aconn, self.run_id, {"raw_data_db_path": raw_path_str}
+                )
+
         desc_str = serial.to_json_for_storage(self.description)
 
         update_run_description(self.conn, self.run_id, desc_str)
@@ -770,14 +836,18 @@ def add_results(self, results: Sequence[Mapping[str, VALUE]]) -> None:
         writer_status = self._writer_status
 
         if writer_status.write_in_background:
-            item = {
+            item: dict[str, Any] = {
                 "keys": list(expected_keys),
                 "values": values,
                 "table_name": self.table_name,
             }
+            if self._raw_data_conn is not None:
+                item["raw_data_path"] = self._raw_data_conn.path_to_dbfile
             writer_status.data_write_queue.put(item)
         else:
-            insert_many_values(self.conn, self.table_name, list(expected_keys), values)
+            insert_many_values(
+                self._data_conn, self.table_name, list(expected_keys), values
+            )
 
     def _raise_if_not_writable(self) -> None:
         if self.pristine:
@@ -869,6 +939,25 @@ def get_parameter_data(
 
         else:
             valid_param_names = self._validate_parameters(*params)
+
+        if self._raw_data_conn is not None:
+            # When raw data lives in a separate DB, we bypass
+            # get_parameter_data (which looks up the rundescriber
+            # from the main DB) and call the lower-level function
+            # directly with the rundescriber we already hold.
+            output: ParameterData = {}
+            for param_name in valid_param_names:
+                output[param_name] = get_shaped_parameter_data_for_one_paramtree(
+                    self._raw_data_conn,
+                    self.table_name,
+                    self._rundescriber,
+                    param_name,
+                    start,
+                    end,
+                    callback,
+                )
+            return output
+
         return get_parameter_data(
             self.conn, self.table_name, valid_param_names, start, end, callback
         )
@@ -1225,7 +1314,7 @@ def get_metadata(self, tag: str) -> VALUE | None:
         return get_data_by_tag_and_table_name(self.conn, tag, self.table_name)
 
     def __len__(self) -> int:
-        return length(self.conn, self.table_name)
+        return length(self._data_conn, self.table_name)
 
     def __repr__(self) -> str:
         out = []
@@ -1878,7 +1967,13 @@ def _get_datasetprotocol_from_guid(
     if _check_if_table_found(conn, result_table_name):
         d = DataSet(conn=conn, run_id=run_id)
     else:
-        d = DataSetInMem._load_from_db(conn=conn, guid=guid)
+        # The results table may be absent from the main DB when raw data
+        # is stored in a separate per-dataset SQLite file.
+        metadata = get_metadata_from_run_id(conn, run_id)
+        if metadata.get("raw_data_db_path") is not None:
+            d = DataSet(conn=conn, run_id=run_id)
+        else:
+            d = DataSetInMem._load_from_db(conn=conn, guid=guid)
 
     return d
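
The three-way choice in this last hunk can be summarised as a small decision function. This is only an illustration of the branching: `choose_loader` is a hypothetical name, and it returns class names as strings instead of constructing QCoDeS objects.

```python
from __future__ import annotations

from typing import Any


def choose_loader(results_table_found: bool, metadata: dict[str, Any]) -> str:
    """Mirror the branching used when loading a dataset by GUID."""
    if results_table_found:
        # The results table is in the main database: regular DataSet.
        return "DataSet"
    if metadata.get("raw_data_db_path") is not None:
        # No results table, but the run points at a per-dataset raw data file.
        return "DataSet"
    # Neither: treat the run as metadata-only and load it in memory.
    return "DataSetInMem"


with_table = choose_loader(True, {})
split_storage = choose_loader(False, {"raw_data_db_path": "/data/raw/example-guid.db"})
metadata_only = choose_loader(False, {})
```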