diff --git a/docs/dataset/dataset_design.rst b/docs/dataset/dataset_design.rst index f44795586083..53a223292a52 100644 --- a/docs/dataset/dataset_design.rst +++ b/docs/dataset/dataset_design.rst @@ -75,3 +75,33 @@ We note that the dataset currently exclusively supports storing data in an SQLite database. This is not an intrinsic limitation of the dataset and measurement layer. It is possible that at a future state support for writing to a different backend will be added. + +.. _sec:design_split_storage: + +Split Raw Data Storage +====================== + +As the main SQLite database grows with many datasets, browsing experiments and +loading metadata can become slower due to the file size. To address this, +QCoDeS supports an optional **split raw data storage** mode (see +:ref:`sec:intro_split_raw_data` for user-facing details). + +From a design perspective, this feature adds a thin routing layer inside the +``DataSet`` class without changing any public interfaces: + +- A ``_data_conn`` property transparently returns either the main database + connection or a per-dataset raw data connection, depending on the + configuration. +- Write paths (``add_results``, ``_BackgroundWriter``) and read paths + (``get_parameter_data``, ``DataSetCacheWithDBBackend``, ``number_of_results``, + ``__len__``) all go through this single routing point. +- The per-dataset SQLite file is a lightweight database containing only the + results table and numpy type adapters -- no QCoDeS metadata schema. +- Subscriber triggers (used for real-time data callbacks) are created on the + data connection so that they fire regardless of which database holds the + results table. + +The implementation is contained in ``qcodes.dataset.raw_data_storage`` (helper +functions) and a handful of additions to ``qcodes.dataset.data_set`` (routing +logic). The ``Measurement`` context manager, ``DataSaver``, and all export +functions work without modification. 
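The routing point described in the design note above can be sketched outside QCoDeS with only the standard-library `sqlite3` module. This is a minimal illustration under stated assumptions, not the QCoDeS implementation: `RoutedDataSet`, its fixed two-column results table, and `demo` are hypothetical names introduced here. Only the `_raw_data_conn` / `_data_conn` fallback mirrors what the diff adds to `DataSet`.

```python
from __future__ import annotations

import sqlite3
import tempfile
from pathlib import Path

# Hypothetical fixed schema; the real results table is built from paramspecs.
_DDL = 'CREATE TABLE IF NOT EXISTS "results" (id INTEGER PRIMARY KEY, x REAL, y REAL)'


class RoutedDataSet:
    """Hypothetical stand-in for DataSet that shows only the routing idea."""

    def __init__(self, main_db: Path, raw_db: Path | None = None) -> None:
        self.conn = sqlite3.connect(main_db)
        self._raw_data_conn: sqlite3.Connection | None = None
        # The main database always keeps the (possibly empty) table schema.
        self.conn.execute(_DDL)
        self.conn.commit()
        if raw_db is not None:
            # Split storage: the per-dataset file gets its own results table.
            self._raw_data_conn = sqlite3.connect(raw_db)
            self._raw_data_conn.execute(_DDL)
            self._raw_data_conn.commit()

    @property
    def _data_conn(self) -> sqlite3.Connection:
        # Single routing point: raw-data connection if present, else main.
        if self._raw_data_conn is not None:
            return self._raw_data_conn
        return self.conn

    def add_results(self, rows: list[tuple[float, float]]) -> None:
        self._data_conn.executemany(
            'INSERT INTO "results" (x, y) VALUES (?, ?)', rows
        )
        self._data_conn.commit()

    @property
    def number_of_results(self) -> int:
        cursor = self._data_conn.execute('SELECT COUNT(*) FROM "results"')
        return cursor.fetchone()[0]

    def close(self) -> None:
        if self._raw_data_conn is not None:
            self._raw_data_conn.close()
        self.conn.close()


def demo() -> tuple[int, int]:
    """Return (rows seen through the routing point, rows in the main DB)."""
    with tempfile.TemporaryDirectory() as tmp:
        ds = RoutedDataSet(Path(tmp) / "main.db", Path(tmp) / "abc-123.db")
        ds.add_results([(0.0, 0.0), (1.0, 1.0), (2.0, 4.0)])
        routed = ds.number_of_results
        main_rows = ds.conn.execute('SELECT COUNT(*) FROM "results"').fetchone()[0]
        ds.close()
        return routed, main_rows
```

Because every read and write path goes through the one property, callers do not need to know which file holds the rows. Constructing `RoutedDataSet` without a `raw_db` argument falls back to the main connection, which corresponds to the default (split storage disabled) behaviour.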
diff --git a/docs/dataset/introduction.rst b/docs/dataset/introduction.rst index b5de7a35e4a3..b0dda5c1eb12 100644 --- a/docs/dataset/introduction.rst +++ b/docs/dataset/introduction.rst @@ -75,3 +75,33 @@ For dataset operations, QCoDeS provides functions for: - **Exporting datasets**: :doc:`Exporting data to other file formats <../examples/DataSet/Exporting-data-to-other-file-formats>` - **Extracting runs between databases**: :doc:`Extracting runs from one DB file to another <../examples/DataSet/Extracting-runs-from-one-DB-file-to-another>` and :func:`qcodes.dataset.extract_runs_into_db` - **Bulk export and metadata-only databases**: :func:`qcodes.dataset.export_datasets_and_create_metadata_db` for creating lightweight metadata-only databases while exporting all data to NetCDF files + +.. _sec:intro_split_raw_data: + +Split Raw Data Storage +====================== + +By default, all measurement data (the results table rows) is stored in the same SQLite database alongside metadata such as experiments, runs, parameter layouts, and dependencies. Over time, the main database file can grow very large, which can slow down operations like browsing experiments and loading metadata. + +QCoDeS supports an optional **split raw data storage** mode in which the actual measurement data for each ``DataSet`` is written to an individual, per-dataset SQLite file while all metadata remains in the main database. Each per-dataset file is named after the dataset's GUID (e.g. ``.db``) and is stored in a configurable folder. + +This feature is controlled by two configuration options in ``qcodesrc.json``: + +- ``dataset.raw_data_to_separate_db`` (bool, default ``false``): enables or disables split storage. +- ``dataset.raw_data_path`` (string, default ``"{db_location}"``): the folder where per-dataset files are created. The ``{db_location}`` placeholder is expanded to a folder derived from the main database path (e.g. ``~/experiments.db`` becomes ``~/experiments_db/``). 
+ +When enabled: + +- The main database retains the full results table schema (column definitions) but no data rows are written to it, keeping it lightweight. +- All ``INSERT`` and ``SELECT`` operations on results data are transparently routed to the per-dataset file. +- The path to the per-dataset file is persisted in the run's metadata (``raw_data_db_path``), so ``load_by_id`` and related loading functions automatically reconnect to the correct file. +- All public ``DataSet`` APIs (``get_parameter_data``, ``to_pandas_dataframe``, ``to_xarray_dataset``, ``cache``, ``export``, etc.) work identically whether split storage is enabled or not. + +Example runtime configuration:: + + import qcodes as qc + + qc.config.dataset.raw_data_to_separate_db = True + qc.config.dataset.raw_data_path = "/data/raw_measurements/" + +For more details on database management, see the :doc:`Database notebook <../examples/DataSet/Database>`. diff --git a/docs/examples/DataSet/Database.ipynb b/docs/examples/DataSet/Database.ipynb index 9aeaa6bd818d..a294ec187062 100644 --- a/docs/examples/DataSet/Database.ipynb +++ b/docs/examples/DataSet/Database.ipynb @@ -167,6 +167,71 @@ "\n", "Moreover, we have also written an [example notebook](Extracting-runs-from-one-DB-file-to-another.ipynb) of transferring `DataSets` between database flies that may serve as a template for more complex data organization protocols." ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Split Raw Data Storage\n", + "\n", + "As the main database grows with many datasets, browsing experiments and loading metadata can become slower. 
QCoDeS supports an optional **split raw data storage** mode that writes the raw measurement data for each dataset into its own individual SQLite file, while keeping all metadata (experiments, runs, parameters, dependencies) in the main database.\n", + "\n", + "This keeps the main database lightweight and makes it faster to work with, while still allowing all existing `DataSet` APIs to function identically.\n", + "\n", + "### Configuration\n", + "\n", + "Split raw data storage is controlled by two configuration options:\n", + "\n", + "- `dataset.raw_data_to_separate_db` (bool, default `False`): enables or disables split storage.\n", + "- `dataset.raw_data_path` (string, default `\"{db_location}\"`): the folder where per-dataset SQLite files are created. The `{db_location}` placeholder expands to a folder derived from the main database path (e.g. `~/experiments.db` becomes `~/experiments_db/`).\n", + "\n", + "You can enable it at runtime:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Enable split raw data storage\n", + "qc.config.dataset.raw_data_to_separate_db = True\n", + "\n", + "# Optionally set a custom path for per-dataset files\n", + "qc.config.dataset.raw_data_path = \"/data/raw_measurements/\"\n", + "\n", + "# Or use the default which derives from the main DB location:\n", + "# qc.config.dataset.raw_data_path = \"{db_location}\"\n", + "# e.g. ~/experiments.db -> ~/experiments_db/" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Or permanently in your `qcodesrc.json`:\n", + "\n", + "```json\n", + "{\n", + " \"dataset\": {\n", + " \"raw_data_to_separate_db\": true,\n", + " \"raw_data_path\": \"{db_location}\"\n", + " }\n", + "}\n", + "```\n", + "\n", + "### How It Works\n", + "\n", + "When split storage is enabled:\n", + "\n", + "1. When a measurement starts (`mark_started()`), a per-dataset SQLite file named `.db` is created in the configured folder.\n", + "2. 
All measurement data (results table rows) is written to this per-dataset file instead of the main database.\n", + "3. The main database retains the results table schema (column definitions) but contains no data rows, keeping it small.\n", + "4. The path to the per-dataset file is saved in the run metadata, so `load_by_id()` and related functions automatically find and reconnect to the correct file.\n", + "5. All `DataSet` methods (`get_parameter_data`, `to_pandas_dataframe`, `to_xarray_dataset`, `cache`, `export`, etc.) work transparently with split storage.\n", + "\n", + "> **Note:** Datasets created with split storage enabled can always be loaded later, even if the configuration is changed back to the default, as long as the per-dataset files remain at their original paths." + ] } ], "metadata": { diff --git a/src/qcodes/configuration/qcodesrc.json b/src/qcodes/configuration/qcodesrc.json index 583f36bf8326..977cc3571966 100644 --- a/src/qcodes/configuration/qcodesrc.json +++ b/src/qcodes/configuration/qcodesrc.json @@ -79,7 +79,9 @@ "export_chunked_export_of_large_files_enabled": false, "export_chunked_threshold": 1000, "in_memory_cache": true, - "load_from_exported_file": false + "load_from_exported_file": false, + "raw_data_to_separate_db": false, + "raw_data_path": "{db_location}" }, "telemetry": { diff --git a/src/qcodes/configuration/qcodesrc_schema.json b/src/qcodes/configuration/qcodesrc_schema.json index ddd4929be22c..e7e49931d4f5 100644 --- a/src/qcodes/configuration/qcodesrc_schema.json +++ b/src/qcodes/configuration/qcodesrc_schema.json @@ -382,6 +382,16 @@ "type": "boolean", "default": true, "description": "Should the data be cached in memory as it is measured. Useful to disable for large datasets to save on memory consumption." 
+ }, + "raw_data_to_separate_db": { + "type": "boolean", + "default": false, + "description": "If true, raw measurement data (results tables) will be written to individual per-dataset SQLite files instead of the main database. Metadata remains in the main database." + }, + "raw_data_path": { + "type": "string", + "default": "{db_location}", + "description": "Path to the folder where per-dataset raw data SQLite files are stored. {db_location} is a directory in the same folder as the .db file with a matching name, e.g. for ~/experiments.db raw data files will be stored in ~/experiments_db/" } }, "description": "Settings related to the DataSet and Measurement Context manager", diff --git a/src/qcodes/dataset/data_set.py b/src/qcodes/dataset/data_set.py index f36ad008772f..5d0c7f17e65b 100644 --- a/src/qcodes/dataset/data_set.py +++ b/src/qcodes/dataset/data_set.py @@ -66,6 +66,7 @@ get_run_timestamp_from_run_id, get_runid_from_guid, get_sample_name_from_experiment_id, + get_shaped_parameter_data_for_one_paramtree, mark_run_complete, remove_trigger, run_exists, @@ -99,6 +100,12 @@ load_to_xarray_dataset_dict, xarray_to_h5netcdf_with_complex_numbers, ) +from .raw_data_storage import ( + connect_to_raw_data_db, + create_raw_data_db, + get_raw_data_db_path, + is_raw_data_storage_enabled, +) from .subscriber import _Subscriber if TYPE_CHECKING: @@ -141,22 +148,40 @@ def __init__(self, queue: Queue[Any], conn: AtomicConnection): def run(self) -> None: self.conn = connect(self.path) + self._raw_data_conns: dict[str, AtomicConnection] = {} while self.keep_writing: item = self.queue.get() if item["keys"] == "stop": self.keep_writing = False self.conn.close() + for raw_conn in self._raw_data_conns.values(): + raw_conn.close() elif item["keys"] == "finalize": _WRITERS[self.path].active_datasets.remove(item["values"]) else: - self.write_results(item["keys"], item["values"], item["table_name"]) + conn = self._get_conn_for_item(item) + self.write_results( + conn, item["keys"], 
item["values"], item["table_name"] + ) self.queue.task_done() + def _get_conn_for_item(self, item: dict[str, Any]) -> AtomicConnection: + raw_data_path = item.get("raw_data_path") + if raw_data_path is None: + return self.conn + if raw_data_path not in self._raw_data_conns: + self._raw_data_conns[raw_data_path] = connect_to_raw_data_db(raw_data_path) + return self._raw_data_conns[raw_data_path] + def write_results( - self, keys: Sequence[str], values: Sequence[list[Any]], table_name: str + self, + conn: AtomicConnection, + keys: Sequence[str], + values: Sequence[list[Any]], + table_name: str, ) -> None: - insert_many_values(self.conn, table_name, keys, values) + insert_many_values(conn, table_name, keys, values) def shutdown(self) -> None: """ @@ -272,6 +297,7 @@ def __init__( self._cache: DataSetCacheWithDBBackend = DataSetCacheWithDBBackend(self) self._results: list[dict[str, VALUE]] = [] self._in_memory_cache = in_memory_cache + self._raw_data_conn: AtomicConnection | None = None if run_id is not None: if not run_exists(self.conn, run_id): @@ -290,6 +316,13 @@ def __init__( self._export_info = ExportInfo.from_str( self.metadata.get("export_info", "") ) + # If this dataset was saved with raw data in a separate db, + # re-open that connection for reads. + raw_db_path = self._metadata.get("raw_data_db_path") + if raw_db_path is not None and Path(raw_db_path).is_file(): + self._raw_data_conn = connect_to_raw_data_db( + raw_db_path, read_only=read_only + ) else: # Actually perform all the side effects needed for the creation # of a new dataset. Note that a dataset is created (in the DB) @@ -358,6 +391,17 @@ def prepare( def cache(self) -> DataSetCacheWithDBBackend: return self._cache + @property + def _data_conn(self) -> AtomicConnection: + """Connection to use for results-table data operations. + + Returns the separate raw-data connection when split storage is + active, otherwise falls back to the main database connection. 
+ """ + if self._raw_data_conn is not None: + return self._raw_data_conn + return self.conn + @property def run_id(self) -> int: return self._run_id @@ -420,7 +464,7 @@ def snapshot_raw(self) -> str | None: @property def number_of_results(self) -> int: sql = f'SELECT COUNT(*) FROM "{self.table_name}"' - cursor = atomic_transaction(self.conn, sql) + cursor = atomic_transaction(self._data_conn, sql) return one(cursor, "COUNT(*)") @property @@ -682,12 +726,34 @@ def _perform_start_actions(self, start_bg_writer: bool) -> None: Perform the actions that must take place once the run has been started """ paramspecs = new_to_old(self._rundescriber.interdeps).paramspecs + raw_data_enabled = is_raw_data_storage_enabled() for spec in paramspecs: add_parameter( - spec, conn=self.conn, run_id=self.run_id, insert_into_results_table=True + spec, + conn=self.conn, + run_id=self.run_id, + insert_into_results_table=True, ) + # When raw data split is enabled, create a per-dataset SQLite file + # for results data with the full results table. + if raw_data_enabled: + raw_db_path = get_raw_data_db_path(self.guid) + self._raw_data_conn = create_raw_data_db( + raw_db_path, + self.table_name, + self._rundescriber.interdeps.paramspecs, + ) + # Persist the raw data path in metadata so we can find it when + # loading the dataset later. 
+ raw_path_str = str(raw_db_path) + self._metadata["raw_data_db_path"] = raw_path_str + with atomic(self.conn) as aconn: + add_data_to_dynamic_columns( + aconn, self.run_id, {"raw_data_db_path": raw_path_str} + ) + desc_str = serial.to_json_for_storage(self.description) update_run_description(self.conn, self.run_id, desc_str) @@ -770,14 +836,18 @@ def add_results(self, results: Sequence[Mapping[str, VALUE]]) -> None: writer_status = self._writer_status if writer_status.write_in_background: - item = { + item: dict[str, Any] = { "keys": list(expected_keys), "values": values, "table_name": self.table_name, } + if self._raw_data_conn is not None: + item["raw_data_path"] = self._raw_data_conn.path_to_dbfile writer_status.data_write_queue.put(item) else: - insert_many_values(self.conn, self.table_name, list(expected_keys), values) + insert_many_values( + self._data_conn, self.table_name, list(expected_keys), values + ) def _raise_if_not_writable(self) -> None: if self.pristine: @@ -869,6 +939,25 @@ def get_parameter_data( else: valid_param_names = self._validate_parameters(*params) + + if self._raw_data_conn is not None: + # When raw data lives in a separate DB, we bypass + # get_parameter_data (which looks up the rundescriber + # from the main DB) and call the lower-level function + # directly with the rundescriber we already hold. 
+ output: ParameterData = {} + for param_name in valid_param_names: + output[param_name] = get_shaped_parameter_data_for_one_paramtree( + self._raw_data_conn, + self.table_name, + self._rundescriber, + param_name, + start, + end, + callback, + ) + return output + return get_parameter_data( self.conn, self.table_name, valid_param_names, start, end, callback ) @@ -1225,7 +1314,7 @@ def get_metadata(self, tag: str) -> VALUE | None: return get_data_by_tag_and_table_name(self.conn, tag, self.table_name) def __len__(self) -> int: - return length(self.conn, self.table_name) + return length(self._data_conn, self.table_name) def __repr__(self) -> str: out = [] @@ -1878,7 +1967,13 @@ def _get_datasetprotocol_from_guid( if _check_if_table_found(conn, result_table_name): d = DataSet(conn=conn, run_id=run_id) else: - d = DataSetInMem._load_from_db(conn=conn, guid=guid) + # The results table may be absent from the main DB when raw data + # is stored in a separate per-dataset SQLite file. + metadata = get_metadata_from_run_id(conn, run_id) + if metadata.get("raw_data_db_path") is not None: + d = DataSet(conn=conn, run_id=run_id) + else: + d = DataSetInMem._load_from_db(conn=conn, guid=guid) return d diff --git a/src/qcodes/dataset/data_set_cache.py b/src/qcodes/dataset/data_set_cache.py index d3d361d2ce0d..2338b4143684 100644 --- a/src/qcodes/dataset/data_set_cache.py +++ b/src/qcodes/dataset/data_set_cache.py @@ -553,12 +553,15 @@ def load_data_from_db(self) -> None: self._loaded_from_completed_ds = True if self._data == {}: self.prepare() + # Use the raw-data connection when the dataset stores results + # in a separate per-dataset SQLite file. 
+ data_conn = self._dataset._data_conn ( self._write_status, self._read_status, self._data, ) = load_new_data_from_db_and_append( - self._dataset.conn, + data_conn, self._dataset.table_name, self.rundescriber, self._write_status, diff --git a/src/qcodes/dataset/raw_data_storage.py b/src/qcodes/dataset/raw_data_storage.py new file mode 100644 index 000000000000..c581a97c631f --- /dev/null +++ b/src/qcodes/dataset/raw_data_storage.py @@ -0,0 +1,174 @@ +""" +Module for managing per-dataset raw data SQLite files. + +When the ``dataset.raw_data_to_separate_db`` config option is enabled, +measurement data (results tables) are written to individual SQLite files +- one per dataset - instead of the main QCoDeS database file. All metadata +(runs, experiments, layouts, dependencies) remains in the main database. + +The per-dataset files are stored in the folder given by +``dataset.raw_data_path`` and are named ``.db``. +""" + +from __future__ import annotations + +import logging +import sqlite3 +from pathlib import Path +from typing import TYPE_CHECKING + +import numpy as np + +import qcodes +from qcodes.dataset.export_config import _expand_export_path +from qcodes.dataset.sqlite.connection import AtomicConnection +from qcodes.dataset.sqlite.database import ( + _adapt_array, + _adapt_complex, + _adapt_float, + _convert_array, + _convert_complex, + _convert_numeric, +) +from qcodes.utils.types import complex_types, numpy_floats, numpy_ints + +if TYPE_CHECKING: + from collections.abc import Sequence + + from qcodes.parameters import ParamSpecBase + +log = logging.getLogger(__name__) + +_RAW_DATA_CONFIG_SECTION = "dataset" +_RAW_DATA_ENABLED_KEY = "raw_data_to_separate_db" +_RAW_DATA_PATH_KEY = "raw_data_path" + + +def is_raw_data_storage_enabled() -> bool: + """Return True if per-dataset raw data storage is enabled in config.""" + return bool( + qcodes.config[_RAW_DATA_CONFIG_SECTION].get(_RAW_DATA_ENABLED_KEY, False) + ) + + +def get_raw_data_folder() -> Path: + """Return the 
resolved folder path for raw data SQLite files. + + The path template from config is expanded the same way as the + export path (``{db_location}`` is replaced with a folder derived + from the main database path). + """ + raw_path_template: str = qcodes.config[_RAW_DATA_CONFIG_SECTION].get( + _RAW_DATA_PATH_KEY, "{db_location}" + ) + return Path(_expand_export_path(raw_path_template)).expanduser().absolute() + + +def get_raw_data_db_path(guid: str, folder: Path | None = None) -> Path: + """Return the full path for a dataset's raw data SQLite file. + + Args: + guid: The GUID of the dataset. + folder: Override folder. If *None*, uses :func:`get_raw_data_folder`. + + """ + if folder is None: + folder = get_raw_data_folder() + return folder / f"{guid}.db" + + +def connect_to_raw_data_db( + path: str | Path, + *, + read_only: bool = False, +) -> AtomicConnection: + """Open (or create) a lightweight SQLite connection for raw data. + + Unlike the main QCoDeS :func:`~qcodes.dataset.sqlite.database.connect`, + this does **not** create the full metadata schema (experiments, runs, ...). + It only registers the numpy/sqlite type adapters that QCoDeS needs to + round-trip array and numeric data. + + Args: + path: Path to the raw-data SQLite file. + read_only: Open the database in read-only mode. + + Returns: + An :class:`AtomicConnection` to the raw-data database. 
+ + """ + # Register adapters/converters (idempotent calls) + sqlite3.register_adapter(np.ndarray, _adapt_array) + sqlite3.register_converter("array", _convert_array) + for numpy_int in numpy_ints: + sqlite3.register_adapter(numpy_int, int) + sqlite3.register_converter("numeric", _convert_numeric) + for numpy_float in (float, *numpy_floats): + sqlite3.register_adapter(numpy_float, _adapt_float) + for complex_type in complex_types: + sqlite3.register_adapter(complex_type, _adapt_complex) # type: ignore[arg-type] + sqlite3.register_converter("complex", _convert_complex) + + uri = f"file:{path!s}" + if read_only: + uri += "?mode=ro" + + conn = sqlite3.connect( + uri, + detect_types=sqlite3.PARSE_DECLTYPES, + check_same_thread=True, + uri=True, + factory=AtomicConnection, + ) + return conn + + +def create_raw_data_db( + path: str | Path, + table_name: str, + paramspecs: Sequence[ParamSpecBase], +) -> AtomicConnection: + """Create a per-dataset raw-data SQLite file with a results table. + + The file is created if it does not exist. The parent directory is + created if needed. + + Args: + path: Full path for the new SQLite file. + table_name: Name of the results table to create (matches the name + in the main database). + paramspecs: Parameter specifications describing the columns. + + Returns: + An :class:`AtomicConnection` to the newly created database. 
+ + """ + path = Path(path) + path.parent.mkdir(parents=True, exist_ok=True) + + conn = connect_to_raw_data_db(path) + + if paramspecs: + columns = ",".join(p.sql_repr() for p in paramspecs) + sql = f""" + CREATE TABLE IF NOT EXISTS "{table_name}" ( + id INTEGER PRIMARY KEY, + {columns} + ); + """ + else: + sql = f""" + CREATE TABLE IF NOT EXISTS "{table_name}" ( + id INTEGER PRIMARY KEY + ); + """ + + conn.execute(sql) + conn.commit() + + log.info( + "Created raw data database at %s with table %s", + path, + table_name, + ) + return conn diff --git a/src/qcodes/dataset/subscriber.py b/src/qcodes/dataset/subscriber.py index 6b540599396c..5229878d5190 100644 --- a/src/qcodes/dataset/subscriber.py +++ b/src/qcodes/dataset/subscriber.py @@ -64,7 +64,7 @@ def __init__( self.callback_id = f"callback{self._id}" self.trigger_id = f"sub{self._id}" - conn = dataSet.conn + conn = dataSet._data_conn conn.create_function(self.callback_id, -1, self._cache_data_to_queue) diff --git a/tests/dataset/test_raw_data_storage.py b/tests/dataset/test_raw_data_storage.py new file mode 100644 index 000000000000..96d0c0055faf --- /dev/null +++ b/tests/dataset/test_raw_data_storage.py @@ -0,0 +1,315 @@ +""" +Tests for per-dataset raw data SQLite storage. + +When ``dataset.raw_data_to_separate_db`` is enabled, measurement data +(results tables) are written to individual SQLite files while metadata +remains in the main database. 
+""" + +from __future__ import annotations + +import gc +import sqlite3 +from pathlib import Path +from typing import TYPE_CHECKING + +import numpy as np +import pytest + +import qcodes as qc +from qcodes.dataset import new_data_set, new_experiment +from qcodes.dataset.data_set import DataSet, load_by_id +from qcodes.dataset.descriptions.dependencies import InterDependencies_ +from qcodes.dataset.raw_data_storage import ( + connect_to_raw_data_db, + create_raw_data_db, + get_raw_data_db_path, + get_raw_data_folder, + is_raw_data_storage_enabled, +) +from qcodes.dataset.sqlite.database import initialise_database +from qcodes.parameters import ParamSpecBase + +if TYPE_CHECKING: + from collections.abc import Generator + + +# --------------------------------------------------------------------------- +# Fixtures +# --------------------------------------------------------------------------- + + +@pytest.fixture() +def _raw_data_db(tmp_path: Path) -> Generator[None, None, None]: + """Set up a temp DB with raw_data_to_separate_db enabled.""" + db_path = str(tmp_path / "test.db") + qc.config["core"]["db_location"] = db_path + qc.config["core"]["db_debug"] = False + qc.config["dataset"]["raw_data_to_separate_db"] = True + qc.config["dataset"]["raw_data_path"] = str(tmp_path / "raw_data") + initialise_database() + try: + yield + finally: + qc.config["dataset"]["raw_data_to_separate_db"] = False + qc.config["dataset"]["raw_data_path"] = "{db_location}" + gc.collect() + + +@pytest.fixture() +def _raw_data_experiment(_raw_data_db: None) -> Generator[None, None, None]: + """Create a test experiment inside the raw data DB.""" + e = new_experiment("test-experiment", sample_name="test-sample") + try: + yield + finally: + e.conn.close() + + +# --------------------------------------------------------------------------- +# Unit tests - raw_data_storage module +# --------------------------------------------------------------------------- + + +class TestRawDataStorageHelpers: + def 
test_is_raw_data_storage_enabled_default(self, tmp_path: Path) -> None: + assert not is_raw_data_storage_enabled() + + def test_is_raw_data_storage_enabled_on(self, _raw_data_db: None) -> None: + assert is_raw_data_storage_enabled() + + def test_get_raw_data_folder(self, _raw_data_db: None, tmp_path: Path) -> None: + folder = get_raw_data_folder() + assert folder == (tmp_path / "raw_data") + + def test_get_raw_data_db_path(self, tmp_path: Path) -> None: + folder = tmp_path / "raw_data" + path = get_raw_data_db_path("abc-123", folder) + assert path == folder / "abc-123.db" + + def test_create_raw_data_db(self, tmp_path: Path) -> None: + db_path = tmp_path / "raw" / "test.db" + params = [ + ParamSpecBase("x", "numeric"), + ParamSpecBase("y", "numeric"), + ] + conn = create_raw_data_db(db_path, "results_table", params) + try: + # Verify table exists and has the right columns + cursor = conn.execute("PRAGMA table_info('results_table')") + columns = {row[1] for row in cursor.fetchall()} + assert "id" in columns + assert "x" in columns + assert "y" in columns + finally: + conn.close() + + def test_connect_to_raw_data_db(self, tmp_path: Path) -> None: + db_path = tmp_path / "test.db" + conn = connect_to_raw_data_db(db_path) + try: + # Should be a valid connection with numpy adapters + conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val REAL)") + conn.execute("INSERT INTO t (val) VALUES (?)", (3.14,)) + conn.commit() + cursor = conn.execute("SELECT val FROM t") + assert cursor.fetchone()[0] == pytest.approx(3.14) + finally: + conn.close() + + def test_connect_to_raw_data_db_read_only(self, tmp_path: Path) -> None: + db_path = tmp_path / "test.db" + # Create first + conn = connect_to_raw_data_db(db_path) + conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)") + conn.commit() + conn.close() + # Now open read-only + conn = connect_to_raw_data_db(db_path, read_only=True) + try: + with pytest.raises(sqlite3.OperationalError): + conn.execute("INSERT INTO t (id) VALUES 
(1)") + finally: + conn.close() + + +# --------------------------------------------------------------------------- +# Integration tests - DataSet with split raw data +# --------------------------------------------------------------------------- + + +@pytest.mark.usefixtures("_raw_data_experiment") +class TestDataSetWithSplitRawData: + def _make_dataset_with_data( + self, n_rows: int = 10 + ) -> tuple[DataSet, list[dict[str, float]]]: + """Create a dataset, add data, mark completed, return ds and data.""" + ds = new_data_set("test-split") + x = ParamSpecBase("x", "numeric") + y = ParamSpecBase("y", "numeric") + idps = InterDependencies_(dependencies={y: (x,)}) + ds.set_interdependencies(idps) + ds.mark_started() + + results = [{"x": float(i), "y": float(i**2)} for i in range(n_rows)] + ds.add_results(results) + ds.mark_completed() + return ds, results + + def test_raw_data_conn_is_set(self) -> None: + """When split is enabled, DataSet should have a raw data connection.""" + ds = new_data_set("test-split") + x = ParamSpecBase("x", "numeric") + y = ParamSpecBase("y", "numeric") + idps = InterDependencies_(dependencies={y: (x,)}) + ds.set_interdependencies(idps) + ds.mark_started() + assert ds._raw_data_conn is not None + ds.conn.close() + + def test_raw_data_file_created(self, tmp_path: Path) -> None: + """A per-dataset SQLite file should be created.""" + ds, _ = self._make_dataset_with_data() + raw_folder = tmp_path / "raw_data" + raw_file = raw_folder / f"{ds.guid}.db" + assert raw_file.is_file() + ds.conn.close() + + def test_raw_data_db_path_in_metadata(self) -> None: + """The path to the raw data file should be stored in metadata.""" + ds, _ = self._make_dataset_with_data() + assert "raw_data_db_path" in ds.metadata + assert Path(ds.metadata["raw_data_db_path"]).is_file() + ds.conn.close() + + def test_data_is_in_raw_db_not_main(self) -> None: + """Data should be in the raw data file, not the main DB.""" + ds, _ = self._make_dataset_with_data(n_rows=5) + 
table_name = ds.table_name + main_conn = ds.conn + + # Main DB should have the results table (schema) but no data rows + cursor = main_conn.execute( + "SELECT name FROM sqlite_master WHERE type='table' AND name=?", + (table_name,), + ) + assert cursor.fetchone() is not None + cursor = main_conn.execute(f'SELECT COUNT(*) FROM "{table_name}"') + assert cursor.fetchone()[0] == 0 + + # Raw data DB should have the actual data + raw_conn = ds._raw_data_conn + assert raw_conn is not None + cursor = raw_conn.execute(f'SELECT COUNT(*) FROM "{table_name}"') + raw_count = cursor.fetchone()[0] + assert raw_count == 5 + ds.conn.close() + + def test_number_of_results(self) -> None: + """number_of_results should read from the raw data file.""" + ds, _ = self._make_dataset_with_data(n_rows=7) + assert ds.number_of_results == 7 + ds.conn.close() + + def test_get_parameter_data(self) -> None: + """get_parameter_data should read from the raw data file.""" + ds, results = self._make_dataset_with_data(n_rows=5) + data = ds.get_parameter_data() + assert "y" in data + np.testing.assert_array_almost_equal( + data["y"]["x"], np.array([r["x"] for r in results]) + ) + np.testing.assert_array_almost_equal( + data["y"]["y"], np.array([r["y"] for r in results]) + ) + ds.conn.close() + + def test_cache_loads_from_raw_data(self) -> None: + """The cache should also read from the raw data file.""" + ds, results = self._make_dataset_with_data(n_rows=5) + cache = ds.cache + cache.load_data_from_db() + cache_data = cache.data() + assert "y" in cache_data + np.testing.assert_array_almost_equal( + cache_data["y"]["x"], np.array([r["x"] for r in results]) + ) + ds.conn.close() + + def test_load_by_id_with_split_data(self) -> None: + """Loading by ID should automatically use the raw data connection.""" + ds, results = self._make_dataset_with_data(n_rows=3) + run_id = ds.run_id + ds.conn.close() + + # Re-load from the database + loaded = load_by_id(run_id) + assert isinstance(loaded, DataSet) + assert 
loaded._raw_data_conn is not None + data = loaded.get_parameter_data() + np.testing.assert_array_almost_equal( + data["y"]["y"], np.array([r["y"] for r in results]) + ) + loaded.conn.close() + + def test_multiple_datasets_split(self) -> None: + """Multiple datasets should each get their own raw data file.""" + ds1, _ = self._make_dataset_with_data(n_rows=3) + ds2, _ = self._make_dataset_with_data(n_rows=5) + + assert ds1._raw_data_conn is not None + assert ds2._raw_data_conn is not None + assert ds1._raw_data_conn.path_to_dbfile != ds2._raw_data_conn.path_to_dbfile + assert ds1.number_of_results == 3 + assert ds2.number_of_results == 5 + ds1.conn.close() + ds2.conn.close() + + def test_metadata_remains_in_main_db(self) -> None: + """Run metadata should be in the main DB, not the raw data DB.""" + ds, _ = self._make_dataset_with_data() + main_conn = ds.conn + # runs table should be populated in main DB + cursor = main_conn.execute( + "SELECT name, is_completed FROM runs WHERE run_id=?", (ds.run_id,) + ) + row = cursor.fetchone() + assert row is not None + assert row[0] == "test-split" + assert row[1] == 1 # completed + ds.conn.close() + + +# --------------------------------------------------------------------------- +# Tests that the feature doesn't interfere when disabled +# --------------------------------------------------------------------------- + + +@pytest.mark.usefixtures("experiment") +class TestDataSetWithoutSplitRawData: + def test_raw_data_conn_is_none(self) -> None: + """When split is disabled, _raw_data_conn should be None.""" + ds = new_data_set("test-no-split") + x = ParamSpecBase("x", "numeric") + y = ParamSpecBase("y", "numeric") + idps = InterDependencies_(dependencies={y: (x,)}) + ds.set_interdependencies(idps) + ds.mark_started() + assert ds._raw_data_conn is None + ds.conn.close() + + def test_data_in_main_db(self) -> None: + """When split is disabled, data should be in the main DB as usual.""" + ds = new_data_set("test-no-split") + x = 
ParamSpecBase("x", "numeric") + y = ParamSpecBase("y", "numeric") + idps = InterDependencies_(dependencies={y: (x,)}) + ds.set_interdependencies(idps) + ds.mark_started() + ds.add_results([{"x": 1.0, "y": 2.0}]) + ds.mark_completed() + + cursor = ds.conn.execute(f'SELECT COUNT(*) FROM "{ds.table_name}"') + assert cursor.fetchone()[0] == 1 + assert ds.number_of_results == 1 + ds.conn.close()
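The `{db_location}` expansion that the documentation in this change describes (`~/experiments.db` becomes `~/experiments_db/`) can be illustrated with a small standalone sketch. The diff delegates the real expansion to `_expand_export_path`, whose body is not shown here, so `expand_raw_data_path` below is a hypothetical reimplementation inferred only from the documented example; `raw_db_path` follows the `folder / f"{guid}.db"` rule visible in `get_raw_data_db_path`.

```python
from __future__ import annotations

from pathlib import Path


def expand_raw_data_path(template: str, main_db_path: Path) -> Path:
    """Hypothetical expansion of the ``{db_location}`` placeholder.

    Assumption: the folder sits next to the main database and is named
    ``<stem>_<suffix without dot>``, matching the documented example.
    """
    suffix = main_db_path.suffix.lstrip(".")
    folder_name = f"{main_db_path.stem}_{suffix}" if suffix else main_db_path.stem
    db_folder = main_db_path.parent / folder_name
    return Path(template.replace("{db_location}", str(db_folder)))


def raw_db_path(guid: str, folder: Path) -> Path:
    """Per-dataset file path: the GUID plus a ``.db`` extension."""
    return folder / f"{guid}.db"


# Default template: folder derived from the main database location.
default_folder = expand_raw_data_path(
    "{db_location}", Path("/home/user/experiments.db")
)

# Explicit folder: no placeholder, so the template is used unchanged.
custom_folder = expand_raw_data_path(
    "/data/raw_measurements", Path("/home/user/experiments.db")
)

# "abc-123" is a made-up GUID, as in the unit test above.
example_file = raw_db_path("abc-123", default_folder)
```

The sketch leaves out `expanduser()` and `absolute()`, which `get_raw_data_folder` applies to the expanded path, so that the result does not depend on the machine it runs on.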