Generate seqr project data #593

Merged
merged 6 commits into from
Nov 2, 2023

Contributor

@EddieLF EddieLF commented Oct 30, 2023

I developed this script to help test some endpoints, currently being worked on, that compile stats from seqr projects.

It largely adapts test/data/generate_data.py, but instead of generating a single project using the pedigree file from the same folder, it generates a dozen projects and populates them with randomly generated pedigrees containing randomly generated family IDs, participant IDs, and relationships. Each project is then bulked out with random numbers of samples, sequencing groups, assays, and analyses.
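As a rough illustration of the random pedigree step described above, here is a minimal sketch. It is not the script from this PR: the function names, the ID formats (`FAM1234`, `_F`, `_M`, `_C1`), and the family shape (two founders plus one to three children) are all assumptions made for the example. The rows follow the usual PED column conventions (`'0'` for an unknown parent, sex 1/2, affected status 1/2).

```python
import random


def make_row(family_id, individual_id, paternal_id, maternal_id, sex, rng):
    """Build one PED-style row as a dict, with a random affected status."""
    return {
        'family_id': family_id,
        'individual_id': individual_id,
        'paternal_id': paternal_id,
        'maternal_id': maternal_id,
        'sex': sex,  # 1 = male, 2 = female
        'affected': rng.choice([1, 2]),  # 1 = unaffected, 2 = affected
    }


def generate_random_pedigree(num_families, seed=None):
    """Return PED-style rows for `num_families` randomly generated families."""
    rng = random.Random(seed)
    rows = []
    # Sampling without replacement guarantees unique (hypothetical) family IDs.
    for number in rng.sample(range(1000, 10000), num_families):
        family_id = f'FAM{number}'
        father_id = f'{family_id}_F'
        mother_id = f'{family_id}_M'
        # Founders have no recorded parents.
        rows.append(make_row(family_id, father_id, '0', '0', 1, rng))
        rows.append(make_row(family_id, mother_id, '0', '0', 2, rng))
        # One to three children, each linked to both founders.
        for child_index in range(1, rng.randint(1, 3) + 1):
            child_id = f'{family_id}_C{child_index}'
            sex = rng.choice([1, 2])
            rows.append(
                make_row(family_id, child_id, father_id, mother_id, sex, rng)
            )
    return rows
```

Passing a fixed `seed` makes the generated pedigree reproducible, which can help when the data backs a test.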

It also randomly allocates a subset of the sequencing groups as "aligned" and creates completed CRAM analyses for these. A subset of the aligned sequencing groups is then allocated as the "joint-called" sequencing groups, and an AnnotateDataset custom analysis and an es-index analysis are created containing these sequencing groups.
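The nested allocation described above (aligned groups drawn from all sequencing groups, joint-called groups drawn from the aligned ones) could be sketched as follows. The fractions, the function names, and the dictionary keys are assumptions for illustration only; they are not taken from the script or from the metamist API.

```python
import random


def pick_nested_subsets(sequencing_group_ids, aligned_fraction=0.8,
                        joint_called_fraction=0.7, seed=None):
    """Pick 'aligned' groups, then 'joint-called' groups from the aligned ones."""
    rng = random.Random(seed)
    all_ids = list(sequencing_group_ids)
    aligned = rng.sample(all_ids, round(len(all_ids) * aligned_fraction))
    # Sampling from `aligned` keeps joint-called a subset of aligned.
    joint_called = rng.sample(
        aligned, round(len(aligned) * joint_called_fraction)
    )
    return aligned, joint_called


def build_analyses(aligned, joint_called):
    """Describe the analyses as plain dicts (keys are hypothetical)."""
    # One completed CRAM analysis per aligned sequencing group.
    analyses = [
        {'type': 'cram', 'status': 'completed', 'sequencing_group_ids': [sg]}
        for sg in aligned
    ]
    # Two analyses covering the whole joint-called set.
    analyses.append({
        'type': 'custom',
        'stage': 'AnnotateDataset',
        'sequencing_group_ids': list(joint_called),
    })
    analyses.append({
        'type': 'es-index',
        'sequencing_group_ids': list(joint_called),
    })
    return analyses
```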

)
samples.append(sample)

for stype in generate_sequencing_type():
Contributor

The main function seems a bit too long.
I would suggest wrapping the two innermost loops in a function (for stype: for _ ). It will be easier to follow.
Maybe the whole samples list creation could even be refactored into a function.

Otherwise looking good to me.
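A minimal sketch of the suggested refactor, assuming the inner loops build sequencing groups per sample. Everything here is hypothetical: `generate_sequencing_type` is a stand-in that takes a random generator (the function in the PR takes no arguments), and the dictionary keys and the list of types are invented for the example.

```python
import random

# Hypothetical list of sequencing types for the example.
SEQUENCING_TYPES = ['genome', 'exome']


def generate_sequencing_type(rng):
    """Stand-in: return a random, non-empty selection of sequencing types."""
    count = rng.randint(1, len(SEQUENCING_TYPES))
    return rng.sample(SEQUENCING_TYPES, count)


def build_sequencing_groups(rng, max_per_type=2):
    """The two innermost loops, extracted into their own function."""
    groups = []
    for stype in generate_sequencing_type(rng):
        for _ in range(rng.randint(1, max_per_type)):
            groups.append({'type': stype})
    return groups


def build_samples(participant_ids, rng):
    """The whole samples list creation, extracted from main()."""
    samples = []
    for participant_id in participant_ids:
        sample = {
            'participant_id': participant_id,
            'sequencing_groups': build_sequencing_groups(rng),
        }
        samples.append(sample)
    return samples
```

With this shape, the main function only needs a single `build_samples(...)` call, and the nesting lives in one small, separately testable function.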

Contributor Author

Thanks for the feedback, Milo. Agreed that the main function was too long. I've moved the samples list creation into its own function and added some better comments, which hopefully make it easier to follow.

Contributor

I'd maybe clean it up a bit more. Break down the main function so it just makes function calls. Each commented block should be its own function. It just makes things so much easier to interpret.

I suspect this will be used (and potentially modified) a lot, so it's worth setting it up well in case people come back and want to make significant changes later on.

codecov-commenter commented Oct 31, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (dd1ba37) 71.39% compared to head (1199cb5) 71.39%.
Report is 4 commits behind head on dev.

Additional details and impacted files
@@           Coverage Diff           @@
##              dev     #593   +/-   ##
=======================================
  Coverage   71.39%   71.39%           
=======================================
  Files         116      116           
  Lines        9283     9283           
=======================================
  Hits         6628     6628           
  Misses       2655     2655           


Contributor

@violetbrina violetbrina left a comment

Main function refactor

Contributor Author

EddieLF commented Nov 2, 2023

Thanks for the feedback @violetbrina.
I've refactored the main function so that the various steps are compartmentalized into their own functions.

main() now does the following in only four lines:

  1. The families/participants/pedigrees are created and inserted
  2. The samples/sequencing groups/assays are created and inserted
  3. The CRAM analyses are inserted
  4. The AnnotateDataset/ES-Index analyses are inserted
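The four steps listed above suggest a main function that only makes function calls. A minimal sketch of that shape follows; the step function names are hypothetical, and each one only records that it ran, where the real script would create and insert records.

```python
def insert_families_and_participants(project, log):
    log.append((project, 'families/participants/pedigrees'))


def insert_samples_and_sequencing_groups(project, log):
    log.append((project, 'samples/sequencing groups/assays'))


def insert_cram_analyses(project, log):
    log.append((project, 'CRAM analyses'))


def insert_joint_called_analyses(project, log):
    log.append((project, 'AnnotateDataset/ES-Index analyses'))


def main(projects):
    """Run the four steps, in order, for every project."""
    log = []
    for project in projects:
        insert_families_and_participants(project, log)
        insert_samples_and_sequencing_groups(project, log)
        insert_cram_analyses(project, log)
        insert_joint_called_analyses(project, log)
    return log
```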

I also changed the analyses so that they are initialised as an empty list before iterating over the projects, and are then inserted in chunks at the end.
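The accumulate-then-insert-in-chunks pattern mentioned above might look like the sketch below. The chunk size, the project names, and the `insert_batch` callable are assumptions; in the real script the batch insert would be an API call rather than a list append.

```python
def chunk(items, chunk_size):
    """Yield successive slices of `items`, each at most `chunk_size` long."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def insert_in_chunks(analyses, insert_batch, chunk_size=50):
    """Call `insert_batch` once per chunk and return the number of batches."""
    batches = 0
    for batch in chunk(analyses, chunk_size):
        insert_batch(batch)
        batches += 1
    return batches


# Initialise the list once, before iterating over the projects.
analyses = []
for project in ['project-a', 'project-b', 'project-c']:
    analyses.extend({'project': project, 'index': i} for i in range(40))

# Insert everything at the end; a plain list stands in for the database here.
inserted = []
batch_count = insert_in_chunks(analyses, inserted.extend, chunk_size=50)
```

Collecting the analyses first and inserting them in batches keeps the number of insert calls proportional to the number of chunks rather than to the number of analyses.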

Contributor

@violetbrina violetbrina left a comment

Looks great to me!

@EddieLF EddieLF merged commit ff22bab into dev Nov 2, 2023
2 checks passed
@EddieLF EddieLF deleted the generate_seqr_project_data branch November 2, 2023 03:52
4 participants