Skip to content

Implementation of fetch_pdb()#4943

Open
jauy123 wants to merge 127 commits intoMDAnalysis:developfrom
jauy123:downloads
Open

Implementation of fetch_pdb()#4943
jauy123 wants to merge 127 commits intoMDAnalysis:developfrom
jauy123:downloads

Conversation

@jauy123
Copy link
Contributor

@jauy123 jauy123 commented Mar 3, 2025

Fixes #4907

Changes made in this Pull Request:

This is a still work in progress, but here's a implementation of @BradyAJohnston 's code wrapped into classes. I still need to write tests and docs for the entire thing.

  • Added two classes: DownloaderBase and 'PDBDownloader' in order to implement downloading structure file from online sources such as the PDB databank.
  • Added requests as a dependency
  • mda.fetch_pdb() is implemented as a wrapper to commonly used option in 'PDBDownloader'

PR Checklist

  • Issue raised/referenced?
  • Tests updated/added?
  • Documentation updated/added?
  • package/CHANGELOG file updated?

Developers Certificate of Origin

I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.


📚 Documentation preview 📚: https://mdanalysis--4943.org.readthedocs.build/en/4943/

@jauy123
Copy link
Contributor Author

jauy123 commented Mar 3, 2025

I'm not sure where to put this code in the codebase, so I create a new folder for it right now. I'm open to it being moved somewhere

Some stuff which I like to still add (besides tests and docs):

  1. Verbose option for PdbDownloader.download() (I think tqdm was a dependency last time I saw?)
  2. Integration with Mdanalysis' logger
  3. Probably could wrap the cache logic into a separate function

@BradyAJohnston BradyAJohnston self-assigned this Mar 4, 2025
@BradyAJohnston
Copy link
Member

BradyAJohnston commented Mar 4, 2025

I think others will have to confirm, but likely we'll want to have requests be an optional dependency to reduce core library dependencies (as the fetching of structures won't be something that lot of users will be doing).

Additional it's not finalised yet but if the mmcif reader in #2367 gets finalised then the default download shouldn't be .pdb (but can remain for now).

@codecov
Copy link

codecov bot commented Mar 4, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.73%. Comparing base (24548e6) to head (20d8a84).

Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #4943   +/-   ##
========================================
  Coverage    92.73%   92.73%           
========================================
  Files          180      182    +2     
  Lines        22475    22499   +24     
  Branches      3190     3194    +4     
========================================
+ Hits         20842    20865   +23     
- Misses        1176     1177    +1     
  Partials       457      457           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jauy123
Copy link
Contributor Author

jauy123 commented Mar 5, 2025

I'm ok with that. I can make the code raise an exception if requests is not in the environment.

@jauy123
Copy link
Contributor Author

jauy123 commented Mar 10, 2025

Assuming that requests will be optional dependency, how would I exactly specify in the build files? Right now, I'm just hard coding it in, so that the github CLI tests can build successfully and run.

@BradyAJohnston
Copy link
Member

You've added it to one of the optional dependency categories which is all that should be required. For the actual files where it is used you'll need to have something setup like the usage of biopython:

try:
import Bio.AlignIO
import Bio.Align
import Bio.Align.Applications
except ImportError:
HAS_BIOPYTHON = False
else:
HAS_BIOPYTHON = True

I'm not an expert on the pipelines so someone else would have to pitch in more on that.

@jauy123
Copy link
Contributor Author

jauy123 commented Mar 19, 2025

Thanks for the comment!

@jauy123
Copy link
Contributor Author

jauy123 commented Mar 19, 2025

I happen to have another question!

Is it normal for some of the tests to not be consistent across each commit? From what I understand, each github CLI has to get and build each MDAnalysis from source, and this instance can potentially timeout from what I observe across each commit.

The macOS (of the latest commit) failed at 97% of test because it reached the max wall time of two hours.

Even then the latest Azure tests failed because of other tests in the source code which I didn't write (namely due to other tests)

From Azure_Tests Win-Python310-64bit-full (commit 651bf267076d2d7da6491608b1b5136915caf2e2)

FAIL MDAnalysisTests/coordinates/test_h5md.py::TestH5MDReaderWithRealTrajectory::test_open_filestream - Issue #2884
XFAIL MDAnalysisTests/coordinates/test_h5md.py::TestH5MDWriterWithRealTrajectory::test_write_with_drivers[core] - occasional PermissionError on windows
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_frame_collect_all_same - reason: memoryreader allows independent coordinates
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_timeseries_values[slice0] - reason: MemoryReader uses deprecated stop inclusive indexing, see Issue #3893
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_timeseries_values[slice1] - reason: MemoryReader uses deprecated stop inclusive indexing, see Issue #3893
XFAIL MDAnalysisTests/coordinates/test_memory.py::TestMemoryReader::test_timeseries_values[slice2] - reason: MemoryReader uses deprecated stop inclusive indexing, see Issue #3893
XFAIL MDAnalysisTests/core/test_topologyattrs.py::TestResids::test_set_atoms
XFAIL MDAnalysisTests/lib/test_util.py::test_which - util.which does not get right binary on Windows
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-[N+]#N] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-N=[N+]=[N-]] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-[O+]=C] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/converters/test_rdkit.py::TestRDKitFunctions::test_order_independant_issue_3339[C-[N+]#[C-]] - Not currently tackled by the RDKitConverter
XFAIL MDAnalysisTests/coordinates/test_dcd.py::TestDCDReader::test_frame_collect_all_same - reason: DCDReader allows independent coordinates.This behaviour is deprecated and will be changedin 3.

@orbeckst
Copy link
Member

In principle, tests should pass everywhere.

The Azure tests time out in the test

_________________________ Test_Fetch_Pdb.test_timeout _________________________

which looks like something that you added. I haven't looked at your code but it might simply be the case that some stuff needs to be written differently for windows.

@orbeckst
Copy link
Member

@jauy123 do you have time to pick up this PR again? Would be great to have the feature in 2.10!

@jauy123
Copy link
Contributor Author

jauy123 commented Jun 12, 2025

I have time again. I was busy starting the end of spring break with comps, classes, and you know what.

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. Primarily add docs and integrate into the docs under package/doc/sphinx/source/documentation_pages. You'll need a new fetch_modules.rst and also link that file from ../index.rst. Check the rendered docs to see that your docs show up correctly. Put yourself in the shoes of someone who wants to learn more about this new functionality.

Changed default file_format to cif.gz
Chaned return type to pathlib.Path

Updated test appropriately
removed true_basename()
However, I don't know where to put it? And I kinda finicking around with the sphinx code in order to get it actually show up

	modified:   package/MDAnalysis/fetch/__init__.py
	modified:   package/MDAnalysis/fetch/pdb.py
	new file:   package/doc/sphinx/source/documentation_pages/auxiliary/fetch_init.rst
	new file:   package/doc/sphinx/source/documentation_pages/auxiliary/fetch_pdb.rst
	modified:   package/doc/sphinx/source/documentation_pages/auxiliary_modules.rst
@jauy123
Copy link
Contributor Author

jauy123 commented Dec 23, 2025

Comments addressed and fixed:
@p-j-smith comment on making pdb_ids required for from_pdb
@p-j-smith comment on making pdb_ids return pathlib.Path instead of strings.
@marinegor comment on making the default file format download as cif.gz instead of pdb.gz -- However since there is no reader for cif yet, a Universe can't actually read cif.gz yet.
@orbeckst comment on removing from_pdb from top level import
@orbeckst various comments on documentation -- I also added various amount of documentation

One thing that I need to do (and something that I need input for from the core devs) is where to place mda.fetch in the docs/sphinx code? I kinda shoehorned in under auxilary just to visualized what I was doing, but I don't see an obvious place where to put this in the web documentation.

@@ -0,0 +1 @@
.. automodule:: MDAnalysis.fetch.__init__
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably need to remove __init__ and just ended it at fetch

@jauy123
Copy link
Contributor Author

jauy123 commented Jan 12, 2026

@orbeckst Can I get input on where to put the fetch module documentation on the website? I sort of just shoved under auxiliary for now but considering how there is an existing auxiliary module. It would probably be better suited to be placed elsewhere.

@orbeckst
Copy link
Member

To integrate a new top-level module:

  1. create new directory fetchers in package/doc/sphinx/source/documentation_pages
  2. create the module doc file fetchers/PDB.rst — it's content is just
.. automodule:: MDAnalysis.fetch.pdb`

(see others for examples). This will use the doc strings from the actual module. (see automodule)
4. create new top level file fetcher_modules.rst in the package/doc/sphinx/source/documentation_pages](https://github.com/MDAnalysis/mdanalysis/tree/develop/package/doc/sphinx/source/documentation_pages) directory
5. add an entry to the toctree in package/doc/sphinx/source/index.rst pointing to fetcher_modules.rst

@jauy123
Copy link
Contributor Author

jauy123 commented Jan 26, 2026

@orbeckst
Ok, I think I added enough documentation for the fetch module. This is was my first time work with sphinx and it took a while to actually get something to work.

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks almost done to me. Just minor fixes.

@jauy123
Copy link
Contributor Author

jauy123 commented Jan 29, 2026

@orbeckst Ok, I added your requests.

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for doing all the work and addressing the change requests.

This looks good to go from my perspective. @p-j-smith @BradyAJohnston can you also have a Quick Look?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

mda.fetch_pdb() to generate Universe from Protein Databank structures

7 participants