Make using DataSource faster by using lazy reading #949
Conversation
When I initially designed the whole Frame infrastructure, I had the following idea for reading data lazily: construct some form of […]. From the looks of it, your callback version here goes pretty much in that direction, only that the logic still lives in the Reader, and there is the condition of not going to the next entry before all collections have been lazily read. To make this more generic, one would probably have to add […].
Purely from the point of view of not overloading the existing reader with too much functionality, I would be in favor of having the details of lazy or eager reading entirely hidden behind the existing interface of […]. As a side note, some of the excessive data loading could be front-loaded to the users for a quick workaround, because […].
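The idea of hiding lazy vs. eager reading behind one interface can be sketched in a few lines. This is a hypothetical illustration, not podio's actual API: the names `LazyBuffer`, `readFn`, and `makeBuffer` are made up. The reader hands out a buffer carrying a read callback; the lazy variant invokes it on first access, the eager variant invokes it immediately, and callers cannot tell the difference.

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical sketch: a buffer that defers the actual I/O until first access.
struct LazyBuffer {
  std::function<std::string()> readFn; // set by the reader; does the real read
  std::optional<std::string> data;     // filled on first access only

  const std::string& get() {
    if (!data) {
      data = readFn(); // lazy path: read now, on first use
    }
    return *data; // eager path has already filled `data` up-front
  }
};

// An "eager" buffer simply has the callback invoked immediately, so the
// lazy/eager decision stays entirely inside the reader.
inline LazyBuffer makeBuffer(std::function<std::string()> read, bool lazy) {
  LazyBuffer buf{std::move(read), std::nullopt};
  if (!lazy) {
    buf.get(); // eager: materialize right away
  }
  return buf;
}
```

The point of the sketch is that only the construction site knows whether reading was deferred; everything downstream just calls `get()`.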
Making a […]. I think performance for the ROOTReader could be the same (note that nothing changes for initialization, unlike the RNTupleReader, which freezes the model) as the one that we have, since it reads collections one by one. So it wouldn't be crazy if lazy reading were the default.
The thing I don't like about the current implementation is […]:

podio/include/podio/RNTupleReader.h, lines 99 to 101 in cc2b70d

If we want to make this generally useful, what I would like to guarantee is that the following is also possible and works as expected:

```cpp
auto reader = podio::makeReader("some-file.root"); // <-- Assume for now this is a lazy reader through magic
auto event1 = reader.readEvent(1);
auto event2 = reader.readNextEvent();
auto event3 = reader.readEvent(42);

auto coll1 = event1.get<CollType>("collName");
auto coll2 = event2.get<AnotherType>("name");
auto coll3 = event3.get<ThirdType>("third-name");
```

In the approach in this PR, at the moment this would not work and would presumably break somewhere. Instead, if there is a dedicated […]
```cpp
// Build a minimal RNTupleModel from the full reader's descriptor
auto& fullReader = *m_readers[category][readerIndex];
const auto& desc = fullReader.GetDescriptor();

ROOT::RCreateFieldOptions fieldOpts;
fieldOpts.SetEmulateUnknownTypes(true);
fieldOpts.SetReturnInvalidOnError(true);

auto smallModel = ROOT::RNTupleModel::CreateBare();
const auto& topFieldDesc = desc.GetFieldDescriptor(desc.GetFieldZeroId());
for (const auto& fieldDesc : desc.GetFieldIterable(topFieldDesc)) {
  const auto& fn = fieldDesc.GetFieldName();
  if (std::ranges::find(neededFieldNames, fn) != neededFieldNames.end()) {
    auto field = fieldDesc.CreateField(desc, fieldOpts);
    if (field) {
      smallModel->AddField(std::move(field));
    }
  }
}
smallModel->Freeze();
```
If we had a "global" (per-reader) lazy flag, this could move into initCategory, right? In that case one could create all readers up-front in initCategory, or even in openFiles like we do with the full reader, and simply create one per possible collection?
Yes, the implementation was not finished; I needed to have something working before knowing whether it would speed up with RDataFrame or not, and the callback was a quick way that can be changed relatively easily. I would like to avoid creating more readers that are basically copies of the existing one. In addition, more readers and writers (or ways of reading and writing that are not fundamentally different) could be supported in the future, for example using multithreading for reading and writing (for writing RNTuples, RNTupleParallelWriter). One could also fix the branches for the TTree reader at the beginning, like RNTuple does, for possibly faster performance, and read them all at the same time. If we want one reader or writer for each of these options, we can't have all of them, but it may be possible (and useful) to have both lazy reading and multithreading available at the same time, for example. I see all of these as different paths and options that could be enabled, possibly independently of each other.
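The reviewer's suggestion above, creating one small reader per collection up-front, can be sketched as follows. This is a hypothetical illustration, not podio code: `MockReader` stands in for a ROOT `RNTupleReader` built from a one-field minimal model, and `buildPerCollectionReaders` for the up-front step in `initCategory`.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Stand-in for a minimal-model reader covering exactly one collection.
struct MockReader {
  std::string field; // the single collection this reader covers
};

// Up-front variant of the per-collection readers: build one for every
// possible collection at category-initialization time instead of on first use.
inline std::map<std::string, std::unique_ptr<MockReader>>
buildPerCollectionReaders(const std::vector<std::string>& collections) {
  std::map<std::string, std::unique_ptr<MockReader>> readers;
  for (const auto& name : collections) {
    readers.emplace(name, std::make_unique<MockReader>(MockReader{name}));
  }
  return readers;
}
```

The trade-off is paying all reader-creation cost at initialization in exchange for predictable per-event latency, whereas creating readers on demand (as the PR currently does) only pays for the collections that are actually touched.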
The problem of `DataSource` is that everything is being loaded all the time, unlike for traditional `RDataFrame`s. I have added lazy reading to `ROOTReader` and `RNTupleReader`, which have to differ because the internals of the readers are different. For `ROOTReader` it is relatively trivial with a callback: we just read the collection that we need when it is used. For `RNTupleReader` we create a new ROOT `RNTupleReader` with a minimal model for each collection, since the complete model is fixed at the beginning. (The key difference is: for `ROOTReader` we have one branch for each collection and we read per-branch, for `RNTupleReader` we have the full model.)

Possible questions and comments:

- `ROOTReader` assumes all reading for an event is done before moving on to the next one, which is something that would have to be changed.
- […] `DataSource`, since it does not work in the general case (at least for `ROOTReader`, explained above).

Benchmarks later, but I can read a few GB of TTree and RNTuple files in a time close to using RDataFrame directly on them (tested with single threading only; I think multithreading should bring a similar speedup, since `DataSource` makes several independent readers).

BEGINRELEASENOTES

- […] make `DataSource` look more like the two existing implementations (https://root.cern/doc/v638/classROOT_1_1RDF_1_1RCsvDS.html and https://root.cern/doc/v638/classROOT_1_1RDF_1_1RNTupleDS.html).
- […] `master` (it seems resolving relations is not correct).

ENDRELEASENOTES