Allow I/O of interpreted class if the underlying data format allows#22234
Allow I/O of interpreted class if the underlying data format allows#22234vepadulano wants to merge 1 commit into
Conversation
fa0c418 to
93748f1
Compare
Test Results 22 files 22 suites 3d 10h 33m 57s ⏱️ Results for commit fb44bbe. ♻️ This comment has been updated with latest results. |
93748f1 to
ed725fc
Compare
RDataFrame needs the type_info of input column types to create the column readers. This is searched first in a map of known simple types, then through TClass. Sometimes TClass might not know about a type_info even though it's available to the interpreter. This is the case for example of runtime-generated classes that are declared to the interpreter but do not have a dictionary, in which case `tclass->IsLoaded()` will return false and the TClass will not expose the type_info but the interpreter knows about it. This commit proposes to extend the search in RDataFrame to include also a call to the interpreter in case the RTTI cannot be found through TClass. This removes a long-standing safeguard introduced in RDataFrame that would prevent users from trying to Snapshot a column if the type had no dictionary. But both TTree and RNTuple support this to different degrees. By removing the safeguard and searching querying the interpreter for the RTTI RDataFrame delegates responsability to the underlying I/O technology. As a side effect, this also helps in the integration with awkward arrays, where a C++ type is generated per different awkward array layout at runtime in the Python application. The awkward data source is still responsible to communicate correctly to TTree or RNTuple what they should store to disk when a Snapshot is called.
ed725fc to
fb44bbe
Compare
Change the implementation of the generated RDataSource class to use the newer signature that returns one column reader. This, together with the changes at root-project/root#22234, allow to restore the behaviour that awkward relied upon to store arrays to disk via RDataFrame. Irrespective of these changes, the current in-memory layout of the awkward array is not fit for storage, leading to writing silently garbage data to disk. Fixes scikit-hep#3885
Change the implementation of the generated RDataSource class to use the newer signature that returns one column reader. This, together with the changes at root-project/root#22234, allow to restore the behaviour that awkward relied upon to store arrays to disk via RDataFrame. Irrespective of these changes, the current in-memory layout of the awkward RecordArray is not fit for storage, leading to writing silently garbage data to disk. Fixes scikit-hep#3885
| std::unique_ptr<TInterpreterValue> v = gInterpreter->MakeInterpreterValue(); | ||
| if (gInterpreter->Evaluate(("typeid(" + name + ')').c_str(), *v)) { | ||
| auto *typeIdAsVoidPtr = v->GetAsPointer(); | ||
| const std::type_info *ti = reinterpret_cast<const std::type_info *>(typeIdAsVoidPtr); | ||
| if (ti) | ||
| return *ti; | ||
| } |
There was a problem hiding this comment.
It fine to have this here for now but we might want to consider eventually moving it to TClass
| TEST_F(RDFSnapshotInterpretedClassRead, ReadTTree) | ||
| { | ||
| ROOT::TestSupport::CheckDiagsRAII diags; | ||
| diags.optionalDiag(kWarning, "TClass::Init", |
There was a problem hiding this comment.
What is the diagnostic optional? Isn't it important for the integrity of the test that we are missing the dictionary? I guess alternatively we can assert somewhere that the corresponding TClass is not loaded and/or has Interpreted state.
There was a problem hiding this comment.
To improve locality, I would move the tear down / cleanup to the read test.
| EXPECT_EQ(*s_charge, 1); | ||
| EXPECT_FLOAT_EQ(*s_pt, 30); | ||
| EXPECT_FLOAT_EQ(*s_eta, 3); | ||
| } No newline at end of file |
There was a problem hiding this comment.
Github points out a missing newline here.
| inline constexpr static auto fgRNTupleFile{"RDFSnapshotInterpretedClassWriteRNTuple.root"}; | ||
| inline constexpr static auto fgRNTupleName{"RDFSnapshotInterpretedClassWriteRNTuple"}; | ||
| inline constexpr static auto fgTTreeFile{"RDFSnapshotInterpretedClassWriteTTree.root"}; | ||
| inline constexpr static auto fgTTreeName{"RDFSnapshotInterpretedClassWriteTTree"}; |
There was a problem hiding this comment.
Consider adding
static void TearDownTestSuite() {
std::remove(...);
std::remove(...);
}This can be used instead of the cleanup fixture. Since it's static, gtest calls it only after all tests are done.
RDataFrame needs the type_info of input column types to create the column readers. This is searched first in a map of known simple types, then through TClass. Sometimes TClass might not know about a type_info even though it's available to the interpreter. This is the case for example of runtime-generated classes that are declared to the interpreter but do not have a dictionary, in which case
tclass->IsLoaded()will return false and the TClass will not expose the type_info but the interpreter knows about it.This commit proposes to extend the search in RDataFrame to include also a call to the interpreter in case the RTTI cannot be found through TClass.
This removes a long-standing safeguard introduced in RDataFrame that would prevent users from trying to Snapshot a column if the type had no dictionary. But both TTree and RNTuple support this to different degrees. By removing the safeguard and querying the interpreter for the RTTI RDataFrame delegates responsability to the underlying I/O technology.
As a side effect, this also helps in the integration with awkward arrays, where a C++ type is generated per different awkward array layout at runtime in the Python application. The awkward data source is still responsible to communicate correctly to TTree or RNTuple what they should store to disk when a Snapshot is called.
FYI @ianna
Fixes #22233