Skip to content

feat!: UTA access class separation of concerns, postgres lib fixes#465

Open
jsstevenson wants to merge 12 commits intomainfrom
uta-refactor
Open

feat!: UTA access class separation of concerns, postgres lib fixes#465
jsstevenson wants to merge 12 commits intomainfrom
uta-refactor

Conversation

@jsstevenson
Copy link
Copy Markdown
Member

@jsstevenson jsstevenson commented Apr 2, 2026

Basically a lot of tech debt fixes

  • Migrate from asyncpg to psycopg. Honestly, asyncpg might be faster but we are generally using psycopg elsewhere so might as well stick with one thing.
  • (breaking) break the DB access class into 2 things: 1) a repository pattern-style lookup class that just lasts as long as a DB connection, and 2) a connection pool-holder class for easily launching repository instances.
    • This means that applications which want to have more direct control over the lifespan of the connection pool can do so directly, and separates the connection-management logistics from the actual query/data transformation logic.
    • Mark the "get AWS secrets" function as deprecated. In the back of my head, I felt like we were trying to move away from secrets manager for DB connection params, but maybe I made that up. Either way, it's not ideal to be writing code specific to our AWS deployment into a library, so I threw down a deprecation warning and maybe we can figure out something better in 10 years
    • Previously we were basically doing a connection pool status check every time we ran a query, and (re)creating the pool if it was gone. It's a little funky to be delaying pool setup/config until runtime like this. I included a child class of the new UtaDatabase class that basically retains this behavior so we can minimize needed changes in the short term (+ less work for me to update tests)
  • (breaking) set schema in the way you are supposed to per postgres docs: ie postgresql://uta_admin@localhost:5432/uta?options=-csearch_path%3Duta_20241220,public rather than /uta/uta_20241220
    • the postgresql://stuff/(db_name)/(schema_name) thing is nonstandard (I think it was basically invented for the hgvs library) and not in keeping with more typical postgres access patterns. In general, there are two standards for describing postgres connections -- as a key-value string and as a URI. So, I think if we're gonna have users go with the URI way, we should adhere to the standard.
    • env var name stays the same, but the old value won't work any more (it'll give you the wrong search path) and the default value is new
    • edit: I thought about it and it would probably be easy to retain backwards compatibility with the old URL format and just declare it deprecated or something, might do that instead
  • Don't hardcode schema into queries anymore. Instead, set the search path when creating the connection. (Still assume uta_20241220 by default, but this is set at connection pool creation via the connection string)
  • Don't hard-code variables into queries anymore, but pass them as params using the psycopg %(argname)s syntax
  • Except for 1-2 exceptions that were too hard, make all queries static instead of dynamically constructing them
  • Allow the public UTA query execution function to accept params as an additional arg
  • Change one UTA DB function that returned a (result, failure_description) tuple where the result and the failure values were mutually exclusive into a function that returned the result value, and raised an exception in case of failure.
  • Refactor the genomic table initializer method, and also make it optional for the pool factory function to attempt creation of the genomic table (I think this addresses UTA database recreates schema on startup #430)

basically I think the ideal FastAPI usage for just UTA should look like

@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncGenerator:
    uta_pool = await create_uta_connection_pool()
    app.state.uta_pool = uta_pool
    yield
    await uta_pool.close()

async def get_uta(request: Request) -> Generator(UtaRepository, None, None):
    async with request.app.state.uta_pool.connection() as conn:
        yield UtaRepository(conn)

@app.get("/check_exists")
async def check_exists(gene: str, uta: Annotated[UtaRepository, Depends(get_uta)]):
    return await uta.gene_exists(gene)

and within coolseqtool itself, or in other contexts where you might want to pass around the entire coolseqtool god class instance, you can do something like

uta_db: UtaDatabase  # assume this exists
async with uta_db.repository() as uta:
    print(await uta.gene_exists("BRAF"))

A few misc engineering notes I thought about while working on this

  • Yes, we want to be using connection pools vs plain connections in basically all contexts. The pool more efficiently recycles connections, so even in contexts where everything is sequential, there's still a speedup
  • We probably want to stay async. It would be a pain to move back. Maybe in the future we can move the rest of the library to async/dial down all the blocking file open calls.

@jsstevenson jsstevenson added the priority:medium Medium priority label Apr 2, 2026
@jsstevenson jsstevenson marked this pull request as ready for review April 2, 2026 01:14
@jsstevenson jsstevenson requested a review from a team as a code owner April 2, 2026 01:14
@korikuzma
Copy link
Copy Markdown
Member

Will review this tomorrow

Copy link
Copy Markdown
Member

@korikuzma korikuzma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, minor things

return f"postgresql://{username}{':' + password if password else ''}@{host}:{port}/{database}?options=-csearch_path%3D{schema},public"


DEFAULT_UTA_DB_URL = "postgresql://uta_admin@localhost:5432/uta?options=-csearch_path%3Duta_20241220,public"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying this out with docker install, needed to set to this. I think since we are expecting people to use our docker instructions, we should update

Suggested change
DEFAULT_UTA_DB_URL = "postgresql://uta_admin@localhost:5432/uta?options=-csearch_path%3Duta_20241220,public"
DEFAULT_UTA_DB_URL = "postgresql://anonymous@localhost:5432/uta?options=-csearch_path%3Duta_20241220,public"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like usage was updated correctly, but forgot to update here

Comment on lines +999 to +1007
if initialize_genomic_table:
try:
async with pool.connection() as conn:
await UtaRepository(conn).create_genomic_table()
# catch all exceptions -- this is probably a critical error, it's good to
# close the pool first
except:
await pool.close()
raise
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think we can actually remove this since Docker compose handles this. Or we can have the default set to False

Suggested change
if initialize_genomic_table:
try:
async with pool.connection() as conn:
await UtaRepository(conn).create_genomic_table()
# catch all exceptions -- this is probably a critical error, it's good to
# close the pool first
except:
await pool.close()
raise

resp = await uta_repo.get_alt_ac_start_or_end("NM_152263.3", 822, 892, None)
assert resp == tpm3_1_8_end_genomic

with pytest.raises(NoMatchingAlignmentError):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consistency?

Suggested change
with pytest.raises(NoMatchingAlignmentError):
with pytest.raises(
NoMatchingAlignmentError,
match=re.escape(
"Unable to find a result where NM_152263.63 has transcript coordinates (tx_exon_start=822, tx_exon_end=892) between an exon's start and end coordinates on gene=None"
),
):

Comment on lines -172 to -175
async def test_data_from_result(test_db, tx_exon_aln_data, data_from_result):
"""Test that data_from_result works correctly."""
resp = test_db.data_from_result(tx_exon_aln_data)
assert resp == data_from_result
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be added back? If not, remove the data_from_result test fixture

Co-authored-by: Kori Kuzma <korikuzma@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:medium Medium priority

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants