Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

T012 - Data acquisition from KLIFS #79

Merged
merged 25 commits into from
Dec 14, 2020
Merged

T012 - Data acquisition from KLIFS #79

merged 25 commits into from
Dec 14, 2020

Conversation

dominiquesydow
Copy link
Collaborator

@dominiquesydow dominiquesydow commented Dec 2, 2020

Details

Content review

  • Potential labels or categories (e.g. machine learning, small molecules, online APIs): kinases, online APIs, protein-ligand complex, EGFR, profiling data, off-targets
  • One line summary: Structures and bound ligands for a query kinase are fetched from KLIFS using the website's Swagger API.
  • The table of contents reflects the talktorial story-line; order of #, ##, ### headers is correct
  • URLs are linked with meaningful words, instead of pasting the URL directly or linking words like here.
  • I have spell-checked the notebook
  • Images have enough resolution to be rendered with quality, without being too heavy.
  • All figures have a description
  • Markdown cell content is still in-line with code cell output (whenever results are discussed)
  • I have checked that cell outputs are not incredibly long (this applies also to DataFrames)
  • Formatting looks correctly on the Sphinx render (bold, italics, figure placing)

Code review

  • Time it took to execute (approx.): 20 s
  • Variable and function names follow snake case rules (e.g. a_variable_name vs aVariableName)
  • Spacing follows PEP8 (run Black on the code cells if needed)
  • Code line are under 99 characters each (run black -l 99)
  • Comments are useful and well placed
  • There are no unpythonic idioms like for i in range(len(list)) (see slides)
  • All 3rd party dependencies are listed at the top of the notebook
  • I have marked all code cell with output referenced in markdown cells with the label # TODO: CI - added # NBVAL_CHECK_OUTPUT
  • I have identified potential candidates for a code refactor / useful functions: KLIFS queries are already part of opencadd
  • All import ... lines are at the top (practice part) cell, ordered by standard library / 3rd party packages / our own (teachopencadd.*)
  • I have update the relative paths to absolute paths: Not needed in this stand-alone online API talkorial
  • List here unfamiliar libraries you find in the imports and their intended use: All libraries known and checked!

TODOs

  • Update environment: Use latest opencadd version
  • Update opencadd.databases.klifs TODO after PR is merged!
  • Reread the full talktorial one more time to check again for typos
  • Prepare talktorial TOC
  • Replace wiki reference with EGFR and Gefitinib references
  • Add discussion

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@dominiquesydow dominiquesydow self-assigned this Dec 2, 2020
@dominiquesydow dominiquesydow changed the base branch from master to t011-base December 2, 2020 16:48
@dominiquesydow dominiquesydow changed the title Refactor and extend KLIFS talktorial T012 T012 revamp Dec 5, 2020
@review-notebook-app
Copy link

review-notebook-app bot commented Dec 7, 2020

View / edit / reply to this conversation on ReviewNB

jaimergp commented on 2020-12-07T15:05:27Z
----------------------------------------------------------------

Maybe?

programmatic access to this database

dominiquesydow commented on 2020-12-07T15:41:11Z
----------------------------------------------------------------

Yes, better, thanks!

@review-notebook-app
Copy link

review-notebook-app bot commented Dec 7, 2020

View / edit / reply to this conversation on ReviewNB

jaimergp commented on 2020-12-07T15:05:27Z
----------------------------------------------------------------

There are 518 protein kinases encoded in the human genome, which were clustered based on their sequence into eight main kinase groups (AGC, CAMK, CK1, CMGC, STE, TK, TKL and Other

This depends on your source... We have seen slightly different numbers at openkinome/kinodata. I guess you need to cite Manning 2002 for this specific number.

Protein kinases catalyze the phosphorylation of tyrosine, serine and theorine residues of themselves or other kinases using their bound ATP.

Histidine residues can also be targeted!


dominiquesydow commented on 2020-12-07T15:41:21Z
----------------------------------------------------------------

Good point, added both.

@review-notebook-app
Copy link

review-notebook-app bot commented Dec 7, 2020

View / edit / reply to this conversation on ReviewNB

jaimergp commented on 2020-12-07T15:05:28Z
----------------------------------------------------------------

to programmatically access - from your computer -

Luckily, there is a solution - some websites provide you with a document that defines the REST API schema for you (OpenAPI specification).

Link: https://swagger.io/docs/specification/about/. Swagger opensourced the whole thing and now the schema is called OpenAPI :) Replace Swagger definitions for OpenAPI definitions in the rest of the cell/notebook too!


dominiquesydow commented on 2020-12-07T15:44:42Z
----------------------------------------------------------------

Thanks for clarifying!

@review-notebook-app
Copy link

review-notebook-app bot commented Dec 7, 2020

View / edit / reply to this conversation on ReviewNB

jaimergp commented on 2020-12-07T15:05:29Z
----------------------------------------------------------------

Volkamber Volkamer lab

@review-notebook-app
Copy link

review-notebook-app bot commented Dec 7, 2020

View / edit / reply to this conversation on ReviewNB

jaimergp commented on 2020-12-07T15:05:30Z
----------------------------------------------------------------

When you only want the first element of a comprehension you can use a generator to avoid iterating (and storing) the whole thing on memory. Like this:

kinase_klifs_id = next(kinase.kinase_ID for kinase in kinases if kinase.name==kinase_name)

Same applies to other instances of this notebook


@review-notebook-app
Copy link

review-notebook-app bot commented Dec 7, 2020

View / edit / reply to this conversation on ReviewNB

jaimergp commented on 2020-12-07T15:05:31Z
----------------------------------------------------------------

I don't see the image here but it might be a reviewnb thing


dominiquesydow commented on 2020-12-07T15:47:18Z
----------------------------------------------------------------

I did not push the rendered images, yet, will do so with the version for the full review for David.

@review-notebook-app
Copy link

review-notebook-app bot commented Dec 7, 2020

View / edit / reply to this conversation on ReviewNB

jaimergp commented on 2020-12-07T15:05:31Z
----------------------------------------------------------------

Truncate input list to avoid that warning:

show_ligands(ligands["ligand.smiles"].to_list()[:10])

@review-notebook-app
Copy link

review-notebook-app bot commented Dec 7, 2020

View / edit / reply to this conversation on ReviewNB

jaimergp commented on 2020-12-07T15:05:32Z
----------------------------------------------------------------

I think multiple returns can be expressed like this in docstrings:

    Returns
    -------
    molcomplex : str
        Complex structural data
    protein : str
        protein structural data
    ligands : list of str
        list of ligand SMILES.

@jaimergp
Copy link
Contributor

jaimergp commented Dec 7, 2020

Awesome content @dominiquesydow! I left a few comments clarifying some terms and definitions, and a couple of unimportant technicalities.

Always a pleasure to review high quality PRs such as this one!

Copy link
Collaborator Author

Yes, better, thanks!


View entire conversation on ReviewNB

@review-notebook-app
Copy link

review-notebook-app bot commented Dec 8, 2020

View / edit / reply to this conversation on ReviewNB

schallerdavid commented on 2020-12-08T16:34:23Z
----------------------------------------------------------------

typo?: the following information about:


@review-notebook-app
Copy link

review-notebook-app bot commented Dec 8, 2020

View / edit / reply to this conversation on ReviewNB

schallerdavid commented on 2020-12-08T16:34:24Z
----------------------------------------------------------------

typo: (IFP) for the example


@review-notebook-app
Copy link

review-notebook-app bot commented Dec 8, 2020

View / edit / reply to this conversation on ReviewNB

schallerdavid commented on 2020-12-08T16:34:25Z
----------------------------------------------------------------

Show only the first 10 file rows.

below it looks like you dont use the first 10 rows, but rows 100-110


@review-notebook-app
Copy link

review-notebook-app bot commented Dec 8, 2020

View / edit / reply to this conversation on ReviewNB

schallerdavid commented on 2020-12-08T16:34:26Z
----------------------------------------------------------------

I dont think all ligands in ChEMBL have drug-like properties


dominiquesydow commented on 2020-12-11T08:14:11Z
----------------------------------------------------------------

Good point. I took it from the first sentence on their website:

https://www.ebi.ac.uk/chembl/

ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs.

Rephrased this to "...ChEMBL, a manually curated compound database"

@review-notebook-app
Copy link

review-notebook-app bot commented Dec 8, 2020

View / edit / reply to this conversation on ReviewNB

schallerdavid commented on 2020-12-08T16:34:26Z
----------------------------------------------------------------

example ligand W19: 10

not sure what this means


dominiquesydow commented on 2020-12-11T08:17:51Z
----------------------------------------------------------------

This is the number of ChEMBL bioactivity values available at KLIFS ligand W19 (i.e. 10 values).

Rephrased from

Number of bioactivity values for example ligand W19: 10

to

Number of ChEMBL bioactivity values available at KLIFS for example ligand W19: 10

Is that better? Or am I missing the point you were making?

schallerdavid commented on 2020-12-11T20:33:11Z
----------------------------------------------------------------

Sorry my fault, I just did not read the code :). All clear from my side

@review-notebook-app
Copy link

review-notebook-app bot commented Dec 8, 2020

View / edit / reply to this conversation on ReviewNB

schallerdavid commented on 2020-12-08T16:34:27Z
----------------------------------------------------------------

the opencadd version in test_env.yaml does not contain a databases module


dominiquesydow commented on 2020-12-11T08:20:35Z
----------------------------------------------------------------

True! This is still one TODO in the PR description - will take care of this today.

[ ] Update environment: Use latest opencadd version

@review-notebook-app
Copy link

review-notebook-app bot commented Dec 8, 2020

View / edit / reply to this conversation on ReviewNB

schallerdavid commented on 2020-12-08T16:34:27Z
----------------------------------------------------------------

typo: per residues position or per residues positions


@review-notebook-app
Copy link

review-notebook-app bot commented Dec 8, 2020

View / edit / reply to this conversation on ReviewNB

schallerdavid commented on 2020-12-08T16:34:28Z
----------------------------------------------------------------

typo: and keep show the top 10 structures.


@review-notebook-app
Copy link

review-notebook-app bot commented Dec 8, 2020

View / edit / reply to this conversation on ReviewNB

schallerdavid commented on 2020-12-08T16:34:29Z
----------------------------------------------------------------

typo: there are a lot of

typo: of Gefitinib is EFGFR

typo: showning high activity


@review-notebook-app
Copy link

review-notebook-app bot commented Dec 8, 2020

View / edit / reply to this conversation on ReviewNB

schallerdavid commented on 2020-12-08T16:34:30Z
----------------------------------------------------------------

typo: You can specify one or more kinases of your choice to narrow down ...


@review-notebook-app
Copy link

review-notebook-app bot commented Dec 8, 2020

View / edit / reply to this conversation on ReviewNB

schallerdavid commented on 2020-12-08T16:34:31Z
----------------------------------------------------------------

typo: are available for the kinase CDK2?


@schallerdavid
Copy link
Collaborator

Great talktorial @dominiquesydow !

I mainly found typos. Everything ran smoothly. Only installing the environment from the test_env.yaml resulted in installation of an older opencadd version without the databases module.

Copy link
Collaborator Author

Good point. I took it from the first sentence on their website:

https://www.ebi.ac.uk/chembl/

ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs.

Rephrased this to "...ChEMBL, a manually curated compound database"


View entire conversation on ReviewNB

Copy link
Collaborator Author

This is the number of ChEMBL bioactivity values available at KLIFS ligand W19 (i.e. 10 values).

Rephrased from

Number of bioactivity values for example ligand W19: 10

to

Number of ChEMBL bioactivity values available at KLIFS for example ligand W19: 10

Is that better? Or am I missing the point you were making?


View entire conversation on ReviewNB

Copy link
Collaborator Author

True! This is still one TODO in the PR description - will take care of this today.

[ ] Update environment: Use latest opencadd version

View entire conversation on ReviewNB

Copy link
Collaborator Author

Thanks, I have rephrased the sentence to a broader meaning:

"EGFR was the first receptor for which a relationship between mutations, overexpression and cancer (tumor growth) has been shown. "


View entire conversation on ReviewNB

@dominiquesydow
Copy link
Collaborator Author

Thanks @schallerdavid! My apologies for the many typos!
Will now take care of adding the latest opencadd version to the yml files.

@dominiquesydow dominiquesydow mentioned this pull request Dec 11, 2020
27 tasks
@dominiquesydow dominiquesydow added the new talktorial New talktorial label Dec 11, 2020
@dominiquesydow dominiquesydow changed the title T012 revamp T012 - Data acquisition from KLIFS Dec 11, 2020
Copy link
Collaborator

Sorry my fault, I just did not read the code :). All clear from my side


View entire conversation on ReviewNB

@dominiquesydow dominiquesydow merged commit 2e99614 into t011-base Dec 14, 2020
@dominiquesydow dominiquesydow deleted the ds-012-revamp branch December 14, 2020 20:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
new talktorial New talktorial
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants