Remove DataFiles table from TransformationDB #7752

@chaen

Description

Looking into the performance of the TransformationSystem, and of its DB in particular, the hottest spot is the DataFiles table.
The aim of this table is to deduplicate LFNs in the DB: if multiple transformations are applied to the same file, the LFN is stored only once in DataFiles, and TransformationFiles just refers to it via a foreign key.

When a lot of transformations are running, the DataFiles table can get big (currently 80M rows in LHCb). The queries we run against it are of this type:

SELECT LFN, FileID FROM DataFiles WHERE LFN IN ('a', 'b', 'c')

They can take up to half an hour in our case.
Effectively, the DataFiles table:

  • is inefficient to query (which we do very often, even to insert new files)
  • is subject to race conditions (the code tries to protect against them in various places, but they can still happen)
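
To make the two points above concrete, here is a minimal sketch of the lookup-then-insert pattern that a deduplicating table implies. It is not the actual TransformationDB code: the schema, the helper names (`make_db`, `get_or_create_file_ids`, `add_files`) and the use of SQLite instead of MySQL are all assumptions made for illustration.

```python
import sqlite3


def make_db():
    """Build an in-memory DB with an ASSUMED two-table layout (not the real schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE DataFiles (
            FileID INTEGER PRIMARY KEY AUTOINCREMENT,
            LFN    TEXT NOT NULL
        );
        CREATE INDEX idx_datafiles_lfn ON DataFiles (LFN);
        CREATE TABLE TransformationFiles (
            TransformationID INTEGER NOT NULL,
            FileID           INTEGER NOT NULL REFERENCES DataFiles (FileID),
            PRIMARY KEY (TransformationID, FileID)
        );
        """
    )
    return conn


def get_or_create_file_ids(conn, lfns):
    """Hypothetical helper: map each LFN to its FileID, creating missing rows."""
    lfns = list(dict.fromkeys(lfns))  # drop duplicates, keep order
    if not lfns:
        return {}
    placeholders = ",".join("?" for _ in lfns)
    query = f"SELECT LFN, FileID FROM DataFiles WHERE LFN IN ({placeholders})"
    # First lookup: this is the kind of query quoted in the issue.
    known = dict(conn.execute(query, lfns).fetchall())
    missing = [lfn for lfn in lfns if lfn not in known]
    # Race window: between the SELECT above and the INSERT below, another
    # writer could insert the same LFN, and nothing in this sketch prevents it.
    conn.executemany("INSERT INTO DataFiles (LFN) VALUES (?)", [(m,) for m in missing])
    # Second lookup to fetch the IDs of the rows just created.
    return dict(conn.execute(query, lfns).fetchall())


def add_files(conn, transformation_id, lfns):
    """Attach files to a transformation through the deduplicating table."""
    ids = get_or_create_file_ids(conn, lfns)
    conn.executemany(
        "INSERT OR IGNORE INTO TransformationFiles (TransformationID, FileID) VALUES (?, ?)",
        [(transformation_id, file_id) for file_id in ids.values()],
    )
    conn.commit()
    return ids
```

Even in this simplified form, every insertion costs up to two `IN (...)` lookups on DataFiles, and correctness depends on no other writer touching the same LFN between the lookup and the insert.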

I propose to remove the DataFiles table and add an indexed LFN column to the TransformationFiles table. This may make the DB slightly bigger, but it should improve performance dramatically.
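
As a rough sketch of what the proposal could look like, the example below stores the LFN directly in TransformationFiles and indexes it. The column set, the primary key, the index name and the helper functions are assumptions for illustration only, and SQLite stands in for MySQL (`INSERT OR IGNORE` is SQLite syntax; MySQL spells it `INSERT IGNORE`).

```python
import sqlite3


def make_proposed_db():
    """Build an in-memory DB with an ASSUMED single-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE TransformationFiles (
            TransformationID INTEGER NOT NULL,
            LFN              TEXT    NOT NULL,
            PRIMARY KEY (TransformationID, LFN)
        );
        -- Index on LFN so lookups by file name do not scan the table.
        CREATE INDEX idx_transformationfiles_lfn ON TransformationFiles (LFN);
        """
    )
    return conn


def add_files_direct(conn, transformation_id, lfns):
    """Hypothetical helper: attach files in one statement, with no prior lookup."""
    conn.executemany(
        "INSERT OR IGNORE INTO TransformationFiles (TransformationID, LFN) VALUES (?, ?)",
        [(transformation_id, lfn) for lfn in lfns],
    )
    conn.commit()


def transformations_for_lfn(conn, lfn):
    """Hypothetical helper: list the transformations that reference a given LFN."""
    rows = conn.execute(
        "SELECT TransformationID FROM TransformationFiles WHERE LFN = ? "
        "ORDER BY TransformationID",
        (lfn,),
    )
    return [row[0] for row in rows]
```

With this layout, adding files needs no preliminary SELECT, and uniqueness per transformation is enforced by the key rather than by application code. The trade-off is the one named above: the same LFN string is stored once per transformation that uses it.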
