Generate seqr project data #593

EddieLF · 2023-10-30T23:48:43Z

Developed this script to help with testing some endpoints being worked on that compile stats from seqr projects.

It largely adapts test/data/generate_data.py, but instead of generating a single project from the pedigree file in the same folder, it generates a dozen projects and populates each with randomly generated pedigrees (random family IDs, participant IDs, and relationships). Each project is then bulked out with random numbers of samples, sequencing groups, assays, and analyses.
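For readers unfamiliar with the idea, here is a rough sketch of what generating a random pedigree can look like. It is not the code from this PR: the function name, the ID formats, and the one-to-three children per family are assumptions made purely for illustration, and the IDs are sequential rather than random to keep the example short.

```python
import random


def generate_pedigree_rows(num_families: int, rng: random.Random) -> list[list[str]]:
    """Build PED-style rows: family, individual, father, mother, sex, affected.

    Hypothetical helper for illustration; not the function used in the script.
    """
    rows: list[list[str]] = []
    for fam_idx in range(num_families):
        family_id = f'FAM{fam_idx:04d}'
        father = f'{family_id}_P1'
        mother = f'{family_id}_P2'
        # Founders have no recorded parents, so the parent columns are empty.
        rows.append([family_id, father, '', '', '1', str(rng.choice([1, 2]))])
        rows.append([family_id, mother, '', '', '2', str(rng.choice([1, 2]))])
        # Assumption: each family gets between one and three children.
        for child_idx in range(rng.randint(1, 3)):
            child = f'{family_id}_C{child_idx + 1}'
            sex = str(rng.choice([1, 2]))
            affected = str(rng.choice([1, 2]))
            rows.append([family_id, child, father, mother, sex, affected])
    return rows
```

Passing a seeded `random.Random` instance keeps the generated pedigrees reproducible between runs, which can be handy when comparing endpoint output.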

It also randomly allocates a subset of the sequencing groups as "aligned" and creates completed CRAM analyses for them. A subset of the aligned sequencing groups is then allocated as the "joint-called" sequencing groups, and an AnnotateDataset custom analysis and an es-index analysis are created containing these sequencing groups.
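The nested allocation (joint-called is a subset of aligned, which is a subset of all sequencing groups) could be sketched roughly as follows. The function name and the two fraction parameters are assumptions for illustration; the script may pick its subset sizes differently.

```python
import random


def pick_aligned_and_joint_called(
    sequencing_group_ids: list[str],
    rng: random.Random,
    aligned_fraction: float = 0.8,
    joint_called_fraction: float = 0.5,
) -> tuple[list[str], list[str]]:
    """Return (aligned, joint_called), where joint_called is drawn from aligned.

    Hypothetical helper; the fractions are illustrative defaults only.
    """
    n_aligned = int(len(sequencing_group_ids) * aligned_fraction)
    aligned = rng.sample(sequencing_group_ids, k=n_aligned)
    # Sampling from the aligned list guarantees the subset relationship.
    n_joint_called = int(len(aligned) * joint_called_fraction)
    joint_called = rng.sample(aligned, k=n_joint_called)
    return aligned, joint_called
```

Drawing the second sample from the first, rather than from the full list, is what guarantees that every joint-called sequencing group also has a CRAM analysis.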

milo-hyben · 2023-10-31T04:24:59Z

test/data/generate_seqr_project_data.py

+                )
+                samples.append(sample)
+
+                for stype in generate_sequencing_type():


The main function seems a bit too long.
I would suggest wrapping the two innermost loops (for stype: for _) in a function. It will be easier to follow.
Maybe even the whole samples list creation could be refactored into a function.
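A rough sketch of the kind of extraction being suggested, using plain dicts in place of the real models. The function names, dict keys, and the one-to-two groups per type are assumptions for illustration and are not taken from the script.

```python
import random


def build_sequencing_groups(
    sequencing_types: list[str], rng: random.Random, max_per_type: int = 2
) -> list[dict]:
    """The two innermost loops (for stype: for _), pulled out of main()."""
    groups: list[dict] = []
    for stype in sequencing_types:
        # Assumption: one to max_per_type sequencing groups per type.
        for _ in range(rng.randint(1, max_per_type)):
            groups.append({'type': stype, 'assays': []})
    return groups


def build_samples(participant_ids: list[str], rng: random.Random) -> list[dict]:
    """The whole samples list creation, also moved into its own function."""
    samples: list[dict] = []
    for participant_id in participant_ids:
        sample = {
            'participant_id': participant_id,
            'sequencing_groups': build_sequencing_groups(['genome', 'exome'], rng),
        }
        samples.append(sample)
    return samples
```

With this shape, main() only needs a single `build_samples(...)` call, and the nested loops can be read and tested on their own.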

Otherwise looking good to me.

EddieLF · 2023-10-31

Thanks for the feedback Milo. Agreed that the main function was too long. I've moved the samples list creation into its own function and added some better comments, which hopefully make it easier to follow.

violetbrina · 2023-11-01

I'd maybe clean it up a bit more. Break down the main function so it just makes function calls. Each commented block should be its own function. It just makes things so much easier to interpret.

I suspect this will be used (and potentially modified) a lot, so it's worth setting it up well in case people come back and want to make significant changes later on.

codecov-commenter · 2023-10-31T06:02:37Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (dd1ba37) 71.39% compared to head (1199cb5) 71.39%.
Report is 4 commits behind head on dev.

Additional details and impacted files

@@           Coverage Diff           @@
##              dev     #593   +/-   ##
=======================================
  Coverage   71.39%   71.39%           
=======================================
  Files         116      116           
  Lines        9283     9283           
=======================================
  Hits         6628     6628           
  Misses       2655     2655


violetbrina

Main function refactor

EddieLF · 2023-11-02T03:25:28Z

Thanks for the feedback @violetbrina.
I've refactored the main function so that the various steps are compartmentalized into their own functions.

main() now does the following in only four lines:

- Creates and inserts the families/participants/pedigrees
- Creates and inserts the samples/sequencing groups/assays
- Inserts the CRAM analyses
- Inserts the AnnotateDataset/ES-Index analyses
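In outline, a main() of that shape might look like the sketch below. All four helper names and their stub bodies are hypothetical stand-ins that return placeholder data instead of calling any API; only the four-step structure mirrors the steps listed above.

```python
def create_pedigree(project: str) -> list[str]:
    """Stub: pretend to insert families/participants, return participant IDs."""
    return [f'{project}-participant-{i}' for i in range(3)]


def create_samples(project: str, participant_ids: list[str]) -> list[str]:
    """Stub: pretend to insert samples/assays, return sequencing group IDs."""
    return [f'{pid}-sg' for pid in participant_ids]


def create_cram_analyses(project: str, sg_ids: list[str]) -> list[dict]:
    """Stub: one completed CRAM analysis per sequencing group."""
    return [{'project': project, 'type': 'cram', 'sg_ids': [sg]} for sg in sg_ids]


def create_joint_called_analyses(project: str, sg_ids: list[str]) -> list[dict]:
    """Stub: an AnnotateDataset custom analysis plus an es-index analysis."""
    return [
        {'project': project, 'type': 'custom', 'sg_ids': sg_ids},
        {'project': project, 'type': 'es-index', 'sg_ids': sg_ids},
    ]


def main(projects: list[str]) -> list[dict]:
    analyses: list[dict] = []
    for project in projects:
        participant_ids = create_pedigree(project)
        sg_ids = create_samples(project, participant_ids)
        analyses.extend(create_cram_analyses(project, sg_ids))
        analyses.extend(create_joint_called_analyses(project, sg_ids))
    return analyses
```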

I also changed the analyses list so that it is initialised as an empty list before the project iteration, and the analyses are then inserted in chunks at the end.
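For illustration, inserting in chunks can be as simple as the sketch below. The names `chunk` and `insert_in_chunks`, the `insert_batch` callable, and the default chunk size of 50 are all assumptions; the script's actual insertion call and chunk size may differ.

```python
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar('T')


def chunk(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError('size must be at least 1')
    for start in range(0, len(items), size):
        yield items[start : start + size]


def insert_in_chunks(
    analyses: Sequence[T],
    insert_batch: Callable[[Sequence[T]], None],
    chunk_size: int = 50,
) -> int:
    """Call insert_batch once per chunk and return the number of items sent."""
    inserted = 0
    for batch in chunk(analyses, chunk_size):
        # insert_batch is a placeholder for whatever performs the real insert.
        insert_batch(batch)
        inserted += len(batch)
    return inserted
```

Collecting everything first and inserting at the end keeps the per-project loop free of API calls for analyses, while the chunking bounds how many are sent at once.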

violetbrina

Looks great to me!

EddieLF added 3 commits October 24, 2023 16:45

New script to create test datasets for seqr endpoint testing

c02d3f8

Change inserted fields

c20280d

Fix variable names and add typehints

97c89e4

EddieLF requested review from milo-hyben, violetbrina and illusional October 30, 2023 23:51

milo-hyben reviewed Oct 31, 2023

EddieLF added 2 commits October 31, 2023 16:55

Refactor main function into smaller functions

af5d3be

Rearrange variables for clarity

d504cc4

violetbrina requested changes Nov 1, 2023

Further atomize script by moving blocks out of main function

1199cb5

EddieLF requested a review from violetbrina November 2, 2023 03:25

violetbrina approved these changes Nov 2, 2023

EddieLF merged commit ff22bab into dev Nov 2, 2023
2 checks passed

EddieLF deleted the generate_seqr_project_data branch November 2, 2023 03:52
