Download external data once and reuse them during run #27

Open · kltm opened this issue Mar 10, 2018 · 11 comments

@kltm (Member) commented Mar 10, 2018

Download ontologies and "annotation" upstreams once and reuse them in all stages during the run. This would be accomplished with some combination of catalogs and/or robot.

This serves three important purposes:

  • allows the pipeline to fail early -- it will fail within minutes rather than several hours in, wasting less time and causing less worry
  • allows effective retries at the pipeline level -- since we are failing early, we can add retries to cover temporary network issues
  • allows us to package the ontologies we used, creating truly reusable/repeatable environments

Ideally, once the initial data grabs are done up front, the pipeline stops talking to the outside world until it starts publishing.
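For illustration, one shape the catalog piece could take is a standard OASIS XML catalog (the catalog-v001.xml format understood by the OWL API and ROBOT) that remaps each remote ontology IRI to a pre-downloaded copy. The mirror/ layout and file choices here are invented for the sketch, not the pipeline's actual layout:

```xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<catalog prefer="public" xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- Remote IRI on the left, pre-downloaded local copy on the right.
       The mirror/ paths are hypothetical. -->
  <uri name="http://purl.obolibrary.org/obo/go.owl" uri="mirror/go.owl"/>
  <uri name="http://purl.obolibrary.org/obo/chebi.owl" uri="mirror/chebi.owl"/>
</catalog>
```

Stages run through ROBOT could then be pointed at the catalog (e.g. via its --catalog option), so imports resolve locally instead of over the network.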

@kltm changed the title from "Download ontologies once and reuse them during run" to "Download external data once and reuse them during run" on Mar 13, 2018
@kltm (Member Author) commented Mar 13, 2018

We are also having problems getting our hands on other artifacts. I've generalized this ticket so we can work out a joint strategy for getting all data in place up front, ensuring that we don't hit failures (in this case) about six hours in.

E.g.:

02:42:34 wget --quiet --retry-connrefused --waitretry=10 -t 10 --no-check-certificate ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/CHICKEN/goa_chicken_isoform.gaf.gz -O target/groups/goa_chicken_isoform/goa_chicken_isoform-src.gaf.gz.tmp && mv target/groups/goa_chicken_isoform/goa_chicken_isoform-src.gaf.gz.tmp target/groups/goa_chicken_isoform/goa_chicken_isoform-src.gaf.gz && touch target/groups/goa_chicken_isoform/goa_chicken_isoform-src.gaf.gz
02:42:37 target/Makefile:302: recipe for target 'target/groups/goa_chicken_isoform/goa_chicken_isoform-src.gaf.gz' failed
02:42:37 make: *** [target/groups/goa_chicken_isoform/goa_chicken_isoform-src.gaf.gz] Error 4
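As a sketch of the kind of early, retried fetch this ticket proposes, an outer retry loop could wrap a fetch like the one above (on top of wget's own retry flags) to ride out transient outages; the loop values and function name here are invented for illustration:

```sh
# Sketch: an outer retry loop around a single upstream fetch, on top of
# wget's built-in retries, so a transient FTP hiccup does not kill the run.
fetch_with_retry () {
    local url=$1 out=$2
    for attempt in 1 2 3; do
        if wget --quiet --retry-connrefused --waitretry=10 -t 10 \
                --no-check-certificate "$url" -O "$out.tmp"; then
            mv "$out.tmp" "$out"
            return 0
        fi
        echo "attempt $attempt failed for $url; retrying" >&2
        sleep 30
    done
    return 1
}

fetch_with_retry \
    ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/CHICKEN/goa_chicken_isoform.gaf.gz \
    target/groups/goa_chicken_isoform/goa_chicken_isoform-src.gaf.gz
```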

also tagging @cmungall @dougli1sqrd

@kltm (Member Author) commented Feb 4, 2019

Noting that the ontology portion of this ticket would largely be covered by geneontology/go-ontology#16876 (@balhoff).

Noting to @dougli1sqrd that the GAF/GPAD upstream part of this could be covered by the following steps (see the Makefile sketch after this list):

  • a new Makefile target, something like download
    • downloads the necessary upstream files to the local filesystem
    • make download could be wrapped in a retry and fail the pipeline early
  • run the regular mega target as usual
    • if the mega-Makefile's validator/downloader already sees the file on the filesystem, it accepts it without attempting a download

This would help address recent issues we've had with our upstreams where, even though a download fails, the pipeline may continue running for many hours on a different parallel job, both increasing the time it takes to notice a problem and burying the error message somewhere in a bajillion log lines.
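A minimal Makefile sketch of that two-phase layout, assuming invented names (the group list, URL scheme, and target names are placeholders, not the real pipeline rules):

```make
# Sketch only, not the real pipeline Makefile.
UPSTREAM_BASE := ftp://ftp.ebi.ac.uk/pub/databases/GO/goa
GROUPS        := goa_chicken goa_chicken_isoform
SRC_FILES     := $(foreach g,$(GROUPS),target/groups/$(g)/$(g)-src.gaf.gz)

# Phase 1: fetch every upstream file up front. Wrapping `make download`
# in an outer retry lets the pipeline fail (or recover) within minutes.
.PHONY: download
download: $(SRC_FILES)

# Pattern rule: the stem is "group/group", so $(*F) is the group name.
# The real URL scheme also has a species directory; flattened here.
target/groups/%-src.gaf.gz:
	mkdir -p $(dir $@)
	wget --quiet --retry-connrefused --waitretry=10 -t 10 \
	    "$(UPSTREAM_BASE)/$(*F).gaf.gz" -O $@.tmp && mv $@.tmp $@

# Phase 2: the regular mega target depends on the same files; since
# phase 1 already created them, make sees them as up to date and never
# touches the network again.
.PHONY: all
all: $(SRC_FILES)
	@echo "running the rest of the pipeline offline"
```

With that split, a wrapper like `for i in 1 2 3; do make download && break; sleep 60; done` gives the early, retried failure mode described above.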

@kltm (Member Author) commented Jun 6, 2019

Further commentary on #27: if every tool we realistically use (Noctua, AmiGO, etc.) loaded a single realized ontology that had the right pedigree information in it, we could just reference it locally and not have to worry about catalogs.

@kltm (Member Author) commented Jun 6, 2019

@balhoff With the closure of ontodev/robot#6, we should be able to reference the merged ontologies that we produce within the pipeline, taking out the guesswork. At this stage, which ones are being produced?

In pipeline, mostly for AmiGO:

http://purl.obolibrary.org/obo/chebi.owl
http://purl.obolibrary.org/obo/cl/cl-basic.owl
http://purl.obolibrary.org/obo/eco/eco-basic.owl
http://purl.obolibrary.org/obo/ncbitaxon/subsets/taxslim.owl
http://purl.obolibrary.org/obo/pato.owl
http://purl.obolibrary.org/obo/po.owl
http://purl.obolibrary.org/obo/uberon/basic.owl
http://purl.obolibrary.org/obo/wbbt.owl
http://skyhook.berkeleybop.org/master/ontology/extensions/go-lego.owl
http://skyhook.berkeleybop.org/snapshot/ontology/extensions/go-gaf.owl
http://skyhook.berkeleybop.org/snapshot/ontology/extensions/go-modules-annotations.owl
http://skyhook.berkeleybop.org/snapshot/ontology/extensions/go-taxon-subsets.owl
http://skyhook.berkeleybop.org/snapshot/ontology/extensions/gorel.owl

Tangentially, in minerva:
http://purl.obolibrary.org/obo/go/extensions/go-lego.owl

The vast majority are for AmiGO/GOlr (basically everything that is not go-lego). Where an ontology is our own product, they already point at the internal version. Could we not have another product for AmiGO/GOlr that bundles the rest of these up?
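For what it's worth, a sketch of what bundling those inputs into one AmiGO/GOlr product could look like with ROBOT; the output name go-amigo.owl is invented, and only the first few IRIs from the list above are shown:

```sh
# Sketch: merge the AmiGO/GOlr inputs listed above into one artifact.
# "go-amigo.owl" is a made-up product name; the remaining IRIs from the
# list above are elided for brevity.
robot merge \
    --input-iri http://purl.obolibrary.org/obo/chebi.owl \
    --input-iri http://purl.obolibrary.org/obo/cl/cl-basic.owl \
    --input-iri http://purl.obolibrary.org/obo/eco/eco-basic.owl \
    --output go-amigo.owl
```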

@dougli1sqrd Where are you getting the ontology information for ontobio?

@dougli1sqrd (Contributor)

@kltm ontobio uses the go.json from the PURL. It's downloaded by the go-site/pipeline Makefile before ontobio runs.

@kltm (Member Author) commented Jun 28, 2019

@balhoff I wanted to follow up on #27 (comment) above in reference to:

GOLR_INPUT_ONTOLOGIES = [

Are there plans to have these as a merged ontology, or should we work with @dougli1sqrd to make sure these are all available locally by the time we hit this point in the pipeline?

@dougli1sqrd (Contributor)

GAF upstream sources are now being downloaded and used in the pipeline. Is there anything left in this ticket?

@kltm (Member Author) commented Nov 27, 2019

We still have ontologies coming from all over the place. We might want to open another issue, but essentially we need to enforce catalogs (or something similar) so that there is no leaking during a run. For example, I believe there is currently a step that pulls in the public NEO load, meaning that it can be a month behind.

@kltm (Member Author) commented Dec 6, 2019

Talking to @dougli1sqrd, it turns out that the "mixin" process in ontobio will still grab the remote file (paint in this case), possibly causing errors if the resource is down, as experienced on 2019-12-05. So we're almost there, but still leaky.

That said, we probably don't want to keep going down the path of "tricking" ontobio by laying things out on the filesystem; rather, we want a more "catalog-like" system where the downloader generates a mapping file for the run that ontobio then consumes (see the sketch below).
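A sketch of what that could look like, assuming an invented two-column TSV format and made-up URLs and paths (nothing here is the real downloader or ontobio API):

```sh
# Sketch only: the downloader records where each remote file landed,
# in an invented tab-separated URL -> local-path map for the run.
cat > target/download-map.tsv <<'EOF'
http://example.org/upstream/paint_mixin.gaf.gz	target/mirror/paint_mixin.gaf.gz
http://example.org/upstream/goa_chicken.gaf.gz	target/mirror/goa_chicken.gaf.gz
EOF

# A consumer (e.g. the mixin step) would resolve URLs through the map
# and fail fast if something was never prefetched, instead of silently
# going back out to the network.
resolve () {
    local url=$1
    awk -v u="$url" -F '\t' '$1 == u { print $2; found=1 } END { exit !found }' \
        target/download-map.tsv || { echo "not prefetched: $url" >&2; return 1; }
}

resolve http://example.org/upstream/paint_mixin.gaf.gz
```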

@dougli1sqrd (Contributor)

Just a reminder to ourselves: this is still occurring: #27 (comment)
