Make using DataSource faster by using lazy reading #949
Conversation
When I initially designed the whole Frame infrastructure, I had the following idea for reading data lazily: construct some form of […]. From the looks of it, your callback version here goes pretty much in that direction, only that the logic still lives in the Reader, and there is the condition of not going to the next entry before all collections have been lazily read. To make this more generic, one would probably have to add […].
Purely from the point of view of not overloading the existing reader with too much functionality, I would be in favor of having the details of lazy or eager reading entirely hidden behind the existing interface of […]. As a side note, some of the excessive data loading could be front-loaded to the users for a quick workaround, because […].
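The idea of hiding lazy vs. eager reading behind one interface can be sketched in a few lines. This is a hypothetical illustration, not podio's actual API: the names `LazyBuffer`, `readFn`, and `makeBuffer` are made up. The reader hands out a buffer carrying a read callback; the lazy variant invokes it on first access, the eager variant invokes it immediately, and callers cannot tell the difference.

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical sketch: a buffer that defers the actual I/O until first access.
struct LazyBuffer {
  std::function<std::string()> readFn; // set by the reader; does the real read
  std::optional<std::string> data;     // filled on first access only

  const std::string& get() {
    if (!data) {
      data = readFn(); // lazy path: read now, on first use
    }
    return *data; // eager path has already filled `data` up-front
  }
};

// An "eager" buffer simply has the callback invoked immediately, so the
// lazy/eager decision stays entirely inside the reader.
inline LazyBuffer makeBuffer(std::function<std::string()> read, bool lazy) {
  LazyBuffer buf{std::move(read), std::nullopt};
  if (!lazy) {
    buf.get(); // eager: materialize right away
  }
  return buf;
}
```

The point of the sketch is that only the construction site knows whether reading was deferred; everything downstream just calls `get()`.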
Making a […]. I think performance for the ROOTReader could be the same (note that nothing changes for initialization, unlike the RNTupleReader, which freezes the model) as the one that we have, since it reads collections one by one. So it wouldn't be crazy if lazy reading were the default.
The thing I don't like about the current implementation is […]:

podio/include/podio/RNTupleReader.h, lines 99 to 101 in cc2b70d

If we want to make this generally useful, what I would like to guarantee is that the following is also possible and works as expected:

```cpp
auto reader = podio::makeReader("some-file.root"); // <-- Assume for now this is a lazy reader through magic
auto event1 = reader.readEvent(1);
auto event2 = reader.readNextEvent();
auto event3 = reader.readEvent(42);

auto coll1 = event1.get<CollType>("collName");
auto coll2 = event2.get<AnotherType>("name");
auto coll3 = event3.get<ThirdType>("third-name");
```

In the approach in this PR, at the moment this would not work and would presumably break somewhere. Instead, if there is a dedicated […]
```cpp
// Build a minimal RNTupleModel from the full reader's descriptor
auto& fullReader = *m_readers[category][readerIndex];
const auto& desc = fullReader.GetDescriptor();

ROOT::RCreateFieldOptions fieldOpts;
fieldOpts.SetEmulateUnknownTypes(true);
fieldOpts.SetReturnInvalidOnError(true);

auto smallModel = ROOT::RNTupleModel::CreateBare();
const auto& topFieldDesc = desc.GetFieldDescriptor(desc.GetFieldZeroId());
for (const auto& fieldDesc : desc.GetFieldIterable(topFieldDesc)) {
  const auto& fn = fieldDesc.GetFieldName();
  if (std::ranges::find(neededFieldNames, fn) != neededFieldNames.end()) {
    auto field = fieldDesc.CreateField(desc, fieldOpts);
    if (field) {
      smallModel->AddField(std::move(field));
    }
  }
}
smallModel->Freeze();
```
If we had a "global" (per-reader) lazy flag, this could move into initCategory, right? In that case one could create all readers up-front in initCategory, or even in openFiles like we do with the full reader, and simply create one per possible collection?
Yes, the implementation was not finished; I needed to have something working before knowing whether it would speed up with RDataFrame or not, and the callback was a quick way that can be changed relatively easily. I would like to avoid creating more readers that are basically copies of the existing one. In addition, more readers and writers (or ways of reading and writing that are not fundamentally different) could be supported in the future, for example using multithreading for reading and writing (for writing RNTuples, RNTupleParallelWriter). One could also fix the branches for the TTree reader at the beginning, like RNTuple does, for possibly faster performance, and read them all at the same time. If we want one reader or writer for each of these options, we can't have all of them, but it may be possible (and useful) to have both lazy reading and multithreading available at the same time, for example. I see all of these as different paths and options that could be enabled, possibly independently of each other.
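The reviewer's suggestion above, creating one small reader per collection up-front, can be sketched as follows. This is a hypothetical illustration, not podio code: `MockReader` stands in for a ROOT `RNTupleReader` built from a one-field minimal model, and `buildPerCollectionReaders` for the up-front step in `initCategory`.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Stand-in for a minimal-model reader covering exactly one collection.
struct MockReader {
  std::string field; // the single collection this reader covers
};

// Up-front variant of the per-collection readers: build one for every
// possible collection at category-initialization time instead of on first use.
inline std::map<std::string, std::unique_ptr<MockReader>>
buildPerCollectionReaders(const std::vector<std::string>& collections) {
  std::map<std::string, std::unique_ptr<MockReader>> readers;
  for (const auto& name : collections) {
    readers.emplace(name, std::make_unique<MockReader>(MockReader{name}));
  }
  return readers;
}
```

The trade-off is paying all reader-creation cost at initialization in exchange for predictable per-event latency, whereas creating readers on demand (as the PR currently does) only pays for the collections that are actually touched.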
The problem of `DataSource` is that everything is being loaded all the time, unlike for traditional `RDataFrame`s. I have added lazy reading to `ROOTReader` and `RNTupleReader`, which have to differ because the internals of the readers are different. For `ROOTReader` it is relatively trivial with a callback: we just read the collection that we need when it is used. For `RNTupleReader` we create a new ROOT `RNTupleReader` with a minimal model for each collection, since the complete model is fixed at the beginning. (The key difference is: for `ROOTReader` we have one branch for each collection and we read per-branch, for `RNTupleReader` we have the full model.)

Possible questions and comments:

- `ROOTReader` assumes all reading for an event is done before moving on to the next one, which is something that would have to be changed.
- […] `DataSource`, since it does not work in the general case (at least for `ROOTReader`, explained above).

Benchmarks later, but I can read a few GB of TTree and RNTuple files in a time close to using RDataFrame directly on them (tested with single threading only; I think multithreading should bring a similar speedup, since `DataSource` makes several independent readers).

BEGINRELEASENOTES

- […] make `DataSource` look more like the two existing implementations (https://root.cern/doc/v638/classROOT_1_1RDF_1_1RCsvDS.html and https://root.cern/doc/v638/classROOT_1_1RDF_1_1RNTupleDS.html).
- […] `master` (it seems resolving relations is not correct).

ENDRELEASENOTES