Draft
30 changes: 30 additions & 0 deletions docs/dataset/dataset_design.rst
@@ -75,3 +75,33 @@ We note that the dataset currently exclusively supports storing data in an
SQLite database. This is not an intrinsic limitation of the dataset and
measurement layer. It is possible that at a future state support for writing
to a different backend will be added.

.. _sec:design_split_storage:

Split Raw Data Storage
======================

As the main SQLite database grows with many datasets, browsing experiments and
loading metadata can become slower due to the file size. To address this,
QCoDeS supports an optional **split raw data storage** mode (see
:ref:`sec:intro_split_raw_data` for user-facing details).

From a design perspective, this feature adds a thin routing layer inside the
``DataSet`` class without changing any public interfaces:

- A ``_data_conn`` property transparently returns either the main database
connection or a per-dataset raw data connection, depending on the
configuration.
- Write paths (``add_results``, ``_BackgroundWriter``) and read paths
(``get_parameter_data``, ``DataSetCacheWithDBBackend``, ``number_of_results``,
``__len__``) all go through this single routing point.
- The per-dataset SQLite file is a lightweight database containing only the
results table and numpy type adapters -- no QCoDeS metadata schema.
- Subscriber triggers (used for real-time data callbacks) are created on the
data connection so that they fire regardless of which database holds the
results table.

The implementation is contained in ``qcodes.dataset.raw_data_storage`` (helper
functions) and a handful of additions to ``qcodes.dataset.data_set`` (routing
logic). The ``Measurement`` context manager, ``DataSaver``, and all export
functions work without modification.
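
The routing idea described above can be illustrated with a minimal, self-contained sketch. It is not the QCoDeS implementation: the class ``RoutedDataSet``, the helper ``make_db`` and the two-column results table are hypothetical names used only for illustration, and plain ``sqlite3`` connections stand in for ``AtomicConnection``. Only the ``_data_conn`` fallback logic mirrors the description.

```python
import sqlite3


class RoutedDataSet:
    """Toy stand-in for a dataset (hypothetical, not the QCoDeS DataSet)."""

    def __init__(self, conn, table_name, raw_data_conn=None):
        self.conn = conn                      # main database (metadata)
        self.table_name = table_name
        self._raw_data_conn = raw_data_conn   # per-dataset file, or None

    @property
    def _data_conn(self):
        # Single routing point: prefer the raw-data connection when present,
        # otherwise fall back to the main database connection.
        if self._raw_data_conn is not None:
            return self._raw_data_conn
        return self.conn

    def add_results(self, rows):
        sql = f'INSERT INTO "{self.table_name}" (x, y) VALUES (?, ?)'
        self._data_conn.executemany(sql, rows)
        self._data_conn.commit()

    def __len__(self):
        sql = f'SELECT COUNT(*) FROM "{self.table_name}"'
        return self._data_conn.execute(sql).fetchone()[0]


def make_db():
    """Create an in-memory database holding an empty results table."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE "results-1-1" (x REAL, y REAL)')
    return conn


main_conn = make_db()
raw_conn = make_db()

# Split storage: rows land in the raw-data database, not in the main one.
split = RoutedDataSet(main_conn, "results-1-1", raw_data_conn=raw_conn)
split.add_results([(0.0, 1.0), (0.1, 1.2)])
rows_in_main = main_conn.execute('SELECT COUNT(*) FROM "results-1-1"').fetchone()[0]
rows_in_raw = len(split)

# Default storage: the same code path writes to the main database.
plain = RoutedDataSet(make_db(), "results-1-1")
plain.add_results([(0.0, 1.0)])
rows_in_plain = len(plain)
```

Because both reads and writes go through one property, callers never need to know which file holds the rows.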
30 changes: 30 additions & 0 deletions docs/dataset/introduction.rst
@@ -75,3 +75,33 @@ For dataset operations, QCoDeS provides functions for:
- **Exporting datasets**: :doc:`Exporting data to other file formats <../examples/DataSet/Exporting-data-to-other-file-formats>`
- **Extracting runs between databases**: :doc:`Extracting runs from one DB file to another <../examples/DataSet/Extracting-runs-from-one-DB-file-to-another>` and :func:`qcodes.dataset.extract_runs_into_db`
- **Bulk export and metadata-only databases**: :func:`qcodes.dataset.export_datasets_and_create_metadata_db` for creating lightweight metadata-only databases while exporting all data to NetCDF files

.. _sec:intro_split_raw_data:

Split Raw Data Storage
======================

By default, all measurement data (the results table rows) is stored in the same SQLite database alongside metadata such as experiments, runs, parameter layouts, and dependencies. Over time, the main database file can grow very large, which can slow down operations like browsing experiments and loading metadata.

QCoDeS supports an optional **split raw data storage** mode in which the actual measurement data for each ``DataSet`` is written to an individual, per-dataset SQLite file while all metadata remains in the main database. Each per-dataset file is named after the dataset's GUID (e.g. ``<guid>.db``) and is stored in a configurable folder.

This feature is controlled by two configuration options in ``qcodesrc.json``:

- ``dataset.raw_data_to_separate_db`` (bool, default ``false``): enables or disables split storage.
- ``dataset.raw_data_path`` (string, default ``"{db_location}"``): the folder where per-dataset files are created. The ``{db_location}`` placeholder is expanded to a folder derived from the main database path (e.g. ``~/experiments.db`` becomes ``~/experiments_db/``).
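
The ``{db_location}`` expansion described above can be sketched as follows. This is an illustration under assumptions, not the QCoDeS code: ``expand_raw_data_path`` is a hypothetical helper, and the rule it applies (a sibling folder named after the database file stem with a ``_db`` suffix) is inferred solely from the ``~/experiments.db`` to ``~/experiments_db/`` example.

```python
from pathlib import Path


def expand_raw_data_path(template: str, main_db_path: str) -> Path:
    """Hypothetical helper: resolve the ``{db_location}`` placeholder."""
    db_path = Path(main_db_path).expanduser()
    # "experiments.db" -> sibling folder "experiments_db" (assumed rule)
    db_location = db_path.with_name(f"{db_path.stem}_db")
    return Path(template.replace("{db_location}", str(db_location))).expanduser()


# Default template: the folder is derived from the main database path.
default_folder = expand_raw_data_path("{db_location}", "/home/user/experiments.db")

# Custom folder: the template contains no placeholder and is used as given.
custom_folder = expand_raw_data_path("/data/raw_measurements/", "/home/user/experiments.db")
```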

When enabled:

- The main database retains the full results table schema (column definitions) but no data rows are written to it, keeping it lightweight.
- All ``INSERT`` and ``SELECT`` operations on results data are transparently routed to the per-dataset file.
- The path to the per-dataset file is persisted in the run's metadata (``raw_data_db_path``), so ``load_by_id`` and related loading functions automatically reconnect to the correct file.
- All public ``DataSet`` APIs (``get_parameter_data``, ``to_pandas_dataframe``, ``to_xarray_dataset``, ``cache``, ``export``, etc.) work identically whether split storage is enabled or not.
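
The "persist the path, then reconnect on load" behaviour in the list above can be sketched with the standard library alone. Everything here is a simplified assumption: the ``runs`` and ``results`` table layouts, the ``load_rows`` helper and the placeholder file name ``example-guid.db`` are hypothetical, and raising an error for a missing file is a choice made for this sketch rather than documented QCoDeS behaviour.

```python
import sqlite3
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
main_db = workdir / "experiments.db"
raw_db = workdir / "experiments_db" / "example-guid.db"  # stands in for <guid>.db
raw_db.parent.mkdir(parents=True)

# Writing: metadata (including the raw-data path) goes to the main database.
main = sqlite3.connect(main_db)
main.execute("CREATE TABLE runs (run_id INTEGER PRIMARY KEY, raw_data_db_path TEXT)")
main.execute("INSERT INTO runs (run_id, raw_data_db_path) VALUES (1, ?)", (str(raw_db),))
main.commit()
main.close()

# Writing: the measured rows go to the per-dataset file.
raw = sqlite3.connect(raw_db)
raw.execute("CREATE TABLE results (x REAL, y REAL)")
raw.executemany("INSERT INTO results VALUES (?, ?)", [(0.0, 1.0), (0.5, 2.0)])
raw.commit()
raw.close()


def load_rows(main_db_path, run_id):
    """Look up the stored path in the main database, then read from that file."""
    main = sqlite3.connect(main_db_path)
    try:
        (stored_path,) = main.execute(
            "SELECT raw_data_db_path FROM runs WHERE run_id = ?", (run_id,)
        ).fetchone()
    finally:
        main.close()
    if stored_path is None or not Path(stored_path).is_file():
        raise FileNotFoundError(f"raw data file missing for run {run_id}")
    raw = sqlite3.connect(stored_path)
    try:
        return raw.execute("SELECT x, y FROM results ORDER BY x").fetchall()
    finally:
        raw.close()


rows = load_rows(main_db, 1)
```

Since the absolute path is what gets stored, moving or renaming the per-dataset folder breaks the lookup in this sketch.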

Example runtime configuration::

import qcodes as qc

qc.config.dataset.raw_data_to_separate_db = True
qc.config.dataset.raw_data_path = "/data/raw_measurements/"

For more details on database management, see the :doc:`Database notebook <../examples/DataSet/Database>`.
65 changes: 65 additions & 0 deletions docs/examples/DataSet/Database.ipynb
@@ -167,6 +167,71 @@
"\n",
"Moreover, we have also written an [example notebook](Extracting-runs-from-one-DB-file-to-another.ipynb) of transferring `DataSets` between database files that may serve as a template for more complex data organization protocols."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Split Raw Data Storage\n",
"\n",
"As the main database grows with many datasets, browsing experiments and loading metadata can become slower. QCoDeS supports an optional **split raw data storage** mode that writes the raw measurement data for each dataset into its own individual SQLite file, while keeping all metadata (experiments, runs, parameters, dependencies) in the main database.\n",
"\n",
"This keeps the main database lightweight and makes it faster to work with, while still allowing all existing `DataSet` APIs to function identically.\n",
"\n",
"### Configuration\n",
"\n",
"Split raw data storage is controlled by two configuration options:\n",
"\n",
"- `dataset.raw_data_to_separate_db` (bool, default `False`): enables or disables split storage.\n",
"- `dataset.raw_data_path` (string, default `\"{db_location}\"`): the folder where per-dataset SQLite files are created. The `{db_location}` placeholder expands to a folder derived from the main database path (e.g. `~/experiments.db` becomes `~/experiments_db/`).\n",
"\n",
"You can enable it at runtime:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Enable split raw data storage\n",
"qc.config.dataset.raw_data_to_separate_db = True\n",
"\n",
"# Optionally set a custom path for per-dataset files\n",
"qc.config.dataset.raw_data_path = \"/data/raw_measurements/\"\n",
"\n",
"# Or use the default which derives from the main DB location:\n",
"# qc.config.dataset.raw_data_path = \"{db_location}\"\n",
"# e.g. ~/experiments.db -> ~/experiments_db/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or permanently in your `qcodesrc.json`:\n",
"\n",
"```json\n",
"{\n",
" \"dataset\": {\n",
" \"raw_data_to_separate_db\": true,\n",
" \"raw_data_path\": \"{db_location}\"\n",
" }\n",
"}\n",
"```\n",
"\n",
"### How It Works\n",
"\n",
"When split storage is enabled:\n",
"\n",
"1. When a measurement starts (`mark_started()`), a per-dataset SQLite file named `<guid>.db` is created in the configured folder.\n",
"2. All measurement data (results table rows) is written to this per-dataset file instead of the main database.\n",
"3. The main database retains the results table schema (column definitions) but contains no data rows, keeping it small.\n",
"4. The path to the per-dataset file is saved in the run metadata, so `load_by_id()` and related functions automatically find and reconnect to the correct file.\n",
"5. All `DataSet` methods (`get_parameter_data`, `to_pandas_dataframe`, `to_xarray_dataset`, `cache`, `export`, etc.) work transparently with split storage.\n",
"\n",
"> **Note:** Datasets created with split storage enabled can always be loaded later, even if the configuration is changed back to the default, as long as the per-dataset files remain at their original paths."
]
}
],
"metadata": {
4 changes: 3 additions & 1 deletion src/qcodes/configuration/qcodesrc.json
@@ -79,7 +79,9 @@
"export_chunked_export_of_large_files_enabled": false,
"export_chunked_threshold": 1000,
"in_memory_cache": true,
- "load_from_exported_file": false
+ "load_from_exported_file": false,
+ "raw_data_to_separate_db": false,
+ "raw_data_path": "{db_location}"
},
"telemetry":
{
10 changes: 10 additions & 0 deletions src/qcodes/configuration/qcodesrc_schema.json
@@ -382,6 +382,16 @@
"type": "boolean",
"default": true,
"description": "Should the data be cached in memory as it is measured. Useful to disable for large datasets to save on memory consumption."
},
"raw_data_to_separate_db": {
"type": "boolean",
"default": false,
"description": "If true, raw measurement data (results tables) will be written to individual per-dataset SQLite files instead of the main database. Metadata remains in the main database."
},
"raw_data_path": {
"type": "string",
"default": "{db_location}",
"description": "Path to the folder where per-dataset raw data SQLite files are stored. {db_location} is a directory in the same folder as the .db file with a matching name, e.g. for ~/experiments.db raw data files will be stored in ~/experiments_db/"
}
},
"description": "Settings related to the DataSet and Measurement Context manager",
113 changes: 104 additions & 9 deletions src/qcodes/dataset/data_set.py
@@ -66,6 +66,7 @@
get_run_timestamp_from_run_id,
get_runid_from_guid,
get_sample_name_from_experiment_id,
get_shaped_parameter_data_for_one_paramtree,
mark_run_complete,
remove_trigger,
run_exists,
Expand Down Expand Up @@ -99,6 +100,12 @@
load_to_xarray_dataset_dict,
xarray_to_h5netcdf_with_complex_numbers,
)
from .raw_data_storage import (
connect_to_raw_data_db,
create_raw_data_db,
get_raw_data_db_path,
is_raw_data_storage_enabled,
)
from .subscriber import _Subscriber

if TYPE_CHECKING:
@@ -141,22 +148,40 @@ def __init__(self, queue: Queue[Any], conn: AtomicConnection):

def run(self) -> None:
self.conn = connect(self.path)
self._raw_data_conns: dict[str, AtomicConnection] = {}

while self.keep_writing:
item = self.queue.get()
if item["keys"] == "stop":
self.keep_writing = False
self.conn.close()
for raw_conn in self._raw_data_conns.values():
raw_conn.close()
elif item["keys"] == "finalize":
_WRITERS[self.path].active_datasets.remove(item["values"])
else:
- self.write_results(item["keys"], item["values"], item["table_name"])
+ conn = self._get_conn_for_item(item)
+ self.write_results(
+ conn, item["keys"], item["values"], item["table_name"]
+ )
self.queue.task_done()

def _get_conn_for_item(self, item: dict[str, Any]) -> AtomicConnection:
raw_data_path = item.get("raw_data_path")
if raw_data_path is None:
return self.conn
if raw_data_path not in self._raw_data_conns:
self._raw_data_conns[raw_data_path] = connect_to_raw_data_db(raw_data_path)
return self._raw_data_conns[raw_data_path]

def write_results(
- self, keys: Sequence[str], values: Sequence[list[Any]], table_name: str
+ self,
+ conn: AtomicConnection,
+ keys: Sequence[str],
+ values: Sequence[list[Any]],
+ table_name: str,
) -> None:
- insert_many_values(self.conn, table_name, keys, values)
+ insert_many_values(conn, table_name, keys, values)

def shutdown(self) -> None:
"""
@@ -272,6 +297,7 @@ def __init__(
self._cache: DataSetCacheWithDBBackend = DataSetCacheWithDBBackend(self)
self._results: list[dict[str, VALUE]] = []
self._in_memory_cache = in_memory_cache
self._raw_data_conn: AtomicConnection | None = None

if run_id is not None:
if not run_exists(self.conn, run_id):
@@ -290,6 +316,13 @@ def __init__(
self._export_info = ExportInfo.from_str(
self.metadata.get("export_info", "")
)
# If this dataset was saved with raw data in a separate db,
# re-open that connection for reads.
raw_db_path = self._metadata.get("raw_data_db_path")
if raw_db_path is not None and Path(raw_db_path).is_file():
self._raw_data_conn = connect_to_raw_data_db(
raw_db_path, read_only=read_only
)
else:
# Actually perform all the side effects needed for the creation
# of a new dataset. Note that a dataset is created (in the DB)
@@ -358,6 +391,17 @@ def prepare(
def cache(self) -> DataSetCacheWithDBBackend:
return self._cache

@property
def _data_conn(self) -> AtomicConnection:
"""Connection to use for results-table data operations.

Returns the separate raw-data connection when split storage is
active, otherwise falls back to the main database connection.
"""
if self._raw_data_conn is not None:
return self._raw_data_conn
return self.conn

@property
def run_id(self) -> int:
return self._run_id
@@ -420,7 +464,7 @@ def snapshot_raw(self) -> str | None:
@property
def number_of_results(self) -> int:
sql = f'SELECT COUNT(*) FROM "{self.table_name}"'
- cursor = atomic_transaction(self.conn, sql)
+ cursor = atomic_transaction(self._data_conn, sql)
return one(cursor, "COUNT(*)")

@property
@@ -682,12 +726,34 @@ def _perform_start_actions(self, start_bg_writer: bool) -> None:
Perform the actions that must take place once the run has been started
"""
paramspecs = new_to_old(self._rundescriber.interdeps).paramspecs
raw_data_enabled = is_raw_data_storage_enabled()

for spec in paramspecs:
add_parameter(
- spec, conn=self.conn, run_id=self.run_id, insert_into_results_table=True
+ spec,
+ conn=self.conn,
+ run_id=self.run_id,
+ insert_into_results_table=True,
)

# When raw data split is enabled, create a per-dataset SQLite file
# for results data with the full results table.
if raw_data_enabled:
raw_db_path = get_raw_data_db_path(self.guid)
self._raw_data_conn = create_raw_data_db(
raw_db_path,
self.table_name,
self._rundescriber.interdeps.paramspecs,
)
# Persist the raw data path in metadata so we can find it when
# loading the dataset later.
raw_path_str = str(raw_db_path)
self._metadata["raw_data_db_path"] = raw_path_str
with atomic(self.conn) as aconn:
add_data_to_dynamic_columns(
aconn, self.run_id, {"raw_data_db_path": raw_path_str}
)

desc_str = serial.to_json_for_storage(self.description)

update_run_description(self.conn, self.run_id, desc_str)
@@ -770,14 +836,18 @@ def add_results(self, results: Sequence[Mapping[str, VALUE]]) -> None:
writer_status = self._writer_status

if writer_status.write_in_background:
- item = {
+ item: dict[str, Any] = {
"keys": list(expected_keys),
"values": values,
"table_name": self.table_name,
}
if self._raw_data_conn is not None:
item["raw_data_path"] = self._raw_data_conn.path_to_dbfile
writer_status.data_write_queue.put(item)
else:
- insert_many_values(self.conn, self.table_name, list(expected_keys), values)
+ insert_many_values(
+ self._data_conn, self.table_name, list(expected_keys), values
+ )

def _raise_if_not_writable(self) -> None:
if self.pristine:
@@ -869,6 +939,25 @@ def get_parameter_data(

else:
valid_param_names = self._validate_parameters(*params)

if self._raw_data_conn is not None:
# When raw data lives in a separate DB, we bypass
# get_parameter_data (which looks up the rundescriber
# from the main DB) and call the lower-level function
# directly with the rundescriber we already hold.
output: ParameterData = {}
for param_name in valid_param_names:
output[param_name] = get_shaped_parameter_data_for_one_paramtree(
self._raw_data_conn,
self.table_name,
self._rundescriber,
param_name,
start,
end,
callback,
)
return output

return get_parameter_data(
self.conn, self.table_name, valid_param_names, start, end, callback
)
@@ -1225,7 +1314,7 @@ def get_metadata(self, tag: str) -> VALUE | None:
return get_data_by_tag_and_table_name(self.conn, tag, self.table_name)

def __len__(self) -> int:
- return length(self.conn, self.table_name)
+ return length(self._data_conn, self.table_name)

def __repr__(self) -> str:
out = []
@@ -1878,7 +1967,13 @@ def _get_datasetprotocol_from_guid(
if _check_if_table_found(conn, result_table_name):
d = DataSet(conn=conn, run_id=run_id)
else:
- d = DataSetInMem._load_from_db(conn=conn, guid=guid)
+ # The results table may be absent from the main DB when raw data
+ # is stored in a separate per-dataset SQLite file.
+ metadata = get_metadata_from_run_id(conn, run_id)
+ if metadata.get("raw_data_db_path") is not None:
+ d = DataSet(conn=conn, run_id=run_id)
+ else:
+ d = DataSetInMem._load_from_db(conn=conn, guid=guid)

return d
