Fix duplicate nodes being removed in tree imports #7863

Open
alesan99 wants to merge 3 commits into v7.12.0-prerelease from issue-7853-1

Conversation

@alesan99
Contributor

@alesan99 alesan99 commented Mar 27, 2026

Fixes #7853

Tree imports de-duplicate nodes with the same name (ParentNode -> ChildNode1 and ParentNode -> ChildNode2 need to share the same ParentNode).
However, it looks like some tree files intentionally have non-parent nodes with the same names. To allow those to be imported, I disabled de-duplication for leaf nodes.
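
As a rough illustration of the behavior described above (not the actual import code), here is a minimal sketch that de-duplicates parent nodes by name under the same parent but always creates leaf nodes. `Node`, `build_tree`, and `count_nodes` are hypothetical names, and the sketch assumes each row lists names from the highest rank to the lowest.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list["Node"] = field(default_factory=list)


def build_tree(rows: list[list[str]]) -> Node:
    """Build a tree from rows of names ordered from highest rank to lowest.

    Parent nodes are de-duplicated by name under the same parent.
    The last name in each row is treated as a leaf and is always added.
    """
    root = Node("Root")
    for row in rows:
        names = [name for name in row if name]  # skip empty rank columns
        parent = root
        for depth, name in enumerate(names):
            is_leaf = depth == len(names) - 1
            existing = None
            if not is_leaf:
                # Reuse a parent that already exists under the current node.
                existing = next(
                    (child for child in parent.children if child.name == name),
                    None,
                )
            if existing is None:
                existing = Node(name)
                parent.children.append(existing)
            parent = existing
    return root


def count_nodes(node: Node) -> int:
    """Count all nodes below `node` (the node itself is excluded)."""
    return sum(1 + count_nodes(child) for child in node.children)
```

With one row per node, two rows that end in the same leaf name under the same parent both survive, so the node count (excluding the root) equals the row count.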

Dev note: Also added some minor fixes to some mistakes I saw in the code

Checklist

  • Self-review the PR after opening it to make sure the changes look good and
    self-explanatory (or properly documented)
  • Add relevant issue to release milestone
  • Add pr to documentation list
  • Add automated tests
  • Add a reverse migration if a migration is present in the PR
  • Add migration function to
    def fix_schema_config(stdout: WriteToStdOut | None = None):

Testing instructions

  • Open https://files.specifysoftware.org/taxonfiles/taxonfiles.json and download one or more tree CSV files (the download links are preceded by "file":).
  • Go to the taxon tree view
  • Create a new tree and choose to import the same trees that you downloaded.
  • Run a Query on all taxon records in that tree. Look at the total count (Excluding the root node).
  • Open the taxon tree csv file for that same tree. Look at the total number of rows (Excluding the header).
  • The taxon tree in Specify and the csv file should have the same number of records.
  • Repeat for all the trees you downloaded
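
For the CSV side of the comparison, a small helper like the one below can count data rows. This is only a sketch: `count_data_rows` is a hypothetical name, and it assumes the first row of the file is the header and that blank lines should not count as records.

```python
import csv
import io


def count_data_rows(csv_text: str) -> int:
    """Return the number of non-empty rows after the header row."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    # Ignore rows where every cell is empty or whitespace.
    return sum(1 for row in reader if any(cell.strip() for cell in row))
```

To use it on a downloaded file, read the file with `open(path, newline="", encoding="utf-8").read()` and pass the text in. Parsing with the `csv` module rather than counting lines keeps quoted fields that contain line breaks from inflating the total.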

@alesan99 alesan99 added this to the 7.12.0 milestone Mar 27, 2026
@alesan99 alesan99 requested review from a team March 27, 2026 18:55
@github-project-automation github-project-automation bot moved this to 📋Back Log in General Tester Board Mar 27, 2026
@alesan99
Contributor Author

alesan99 commented Mar 27, 2026

I have personally checked the following trees so far:

  • Geography
  • Chronostratigraphy
  • Aves
  • Botany (Bryophyta)
  • Botany (Polypodiopsida)
  • Botany (Tracheophyta)
  • Fungi -- This one has extra records, not sure which ones yet

This tree looks like it gains extra records

  • Entomology

Though this is likely only because the tree file is formatted differently. It should still be the same number of records if reformatted, but I haven't verified it.

@alesan99 alesan99 linked an issue Mar 27, 2026 that may be closed by this pull request
Collaborator

@lexiclevenger lexiclevenger left a comment

  • The taxon tree in Specify and the csv file should have the same number of trees.

Based on the query below, the counts for Ichthyology and Mammalogy match those in the CSV.

Image

The following do not match:

Fungi: CSV 159,229, Query 159,232
Herpetology: CSV 24,730, Query 24,732
Invertebrate: CSV 79,276, Query 79,282

Database: lsumz_herps_2025_09_09

@github-project-automation github-project-automation bot moved this from 📋Back Log to Dev Attention Needed in General Tester Board Mar 27, 2026
@CarolineDenis CarolineDenis self-requested a review March 30, 2026 16:35
Collaborator

@bhumikaguptaa bhumikaguptaa left a comment

  • The taxon tree in Specify and the csv file should have the same number of trees.

For some trees, there is a discrepancy between the number of records uploaded and the number of records returned by the query.

Botany (bryophyta):
Uploaded: 14425
Queried: 14428

Mammals:
Uploaded: 13557
Queried: 13557

Minerals:
Uploaded: 6202
Queried: 6202

I am requesting changes because I had mixed results, and the number of records uploaded also differed from the .json file (e.g. Minerals was 6189 rows but 6202 got uploaded).

@alesan99
Contributor Author

It looks like nodes were being added to the incorrect parents all along (even on main) if there were multiple parents with the same name in the same rank.
Currently working on a fix
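
To illustrate how this kind of mis-parenting can arise, here is a hypothetical sketch (not the actual import code). `import_rows` and its `key_by_path` flag are made-up names: when existing parents are looked up by rank and name only, a second parent with the same name at the same rank is silently merged into the first, whereas keying on the full chain of ancestor names keeps them apart.

```python
from itertools import count


def import_rows(rows, key_by_path):
    """Return {node_id: (parent_id, name)} for rows of names ordered by rank.

    key_by_path=False looks parents up by (rank index, name) only.
    key_by_path=True looks them up by the full chain of ancestor names.
    """
    ids = count(1)
    nodes = {}   # node id -> (parent id, name); parent id 0 is the root
    lookup = {}  # lookup key -> node id of an already created node
    for row in rows:
        parent_id = 0
        for rank, name in enumerate(row):
            key = tuple(row[: rank + 1]) if key_by_path else (rank, name)
            is_leaf = rank == len(row) - 1
            # Leaves are always created; parents are reused when found.
            node_id = None if is_leaf else lookup.get(key)
            if node_id is None:
                node_id = next(ids)
                nodes[node_id] = (parent_id, name)
                lookup.setdefault(key, node_id)
            parent_id = node_id
    return nodes


def grandparent_name(nodes, leaf_name):
    """Return the name of the grandparent of the first node named leaf_name."""
    parent_id = next(p for p, n in nodes.values() if n == leaf_name)
    return nodes[nodes[parent_id][0]][1]
```

For the rows `("A", "Shared", "x")` and `("B", "Shared", "y")`, the name-only lookup places `y` under the `Shared` node that belongs to `A`, while the path-based lookup creates a separate `Shared` under `B`.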

@kwhuber kwhuber added 1 - Bug Incorrect behavior of the product and removed 1 - Bug Incorrect behavior of the product labels Mar 31, 2026

Labels

None yet

Projects

Status: Dev Attention Needed

Development

Successfully merging this pull request may close these issues.

Tree count in queries doesn't match source data

5 participants