Incomplete file listings #46

Open
1 task done
jbusecke opened this issue May 8, 2024 · 3 comments

Comments

@jbusecke
Owner

jbusecke commented May 8, 2024

leap-stc/cmip6-leap-feedstock#116 (comment) describes a case where I get a nice list of files back, but the list is not complete!
How do we detect this case before ingesting?

  • Is this an artifact of searching just one index node?
@jbusecke
Owner Author

jbusecke commented May 8, 2024

Yup, I just confirmed that the distributed search does not work properly 😡:

from pangeo_forge_esgf.client import ESGFClient

iid = 'CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710'
search_nodes = [
    "https://esgf-node.llnl.gov",
    "https://esgf-data.dkrz.de",
    "https://esgf.nci.org.au",
    "https://esgf-node.ornl.gov",
    "https://esgf-node.ipsl.upmc.fr",
    "https://esg-dn1.nsc.liu.se",
    "https://esgf.ceda.ac.uk",
]
for search_node in search_nodes:
    # Query each index node in turn, with distributed search enabled.
    client = ESGFClient(search_node, distributed=True)
    dataset_id = client.get_instance_id_input([iid])[iid]['id']
    details = client._search_files_from_dataset_ids([dataset_id])
    # A working distributed search should return the same file ids from every node.
    print(f"{search_node=} {[i['id'] for i in details]}")

This means I will now have to query every index node separately and combine the results. What a pain.
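Combining the per-node results could look roughly like the sketch below. It is not part of pangeo-forge-esgf: `combine_file_listings` is a hypothetical helper, and it assumes that each file record carries an `id` of the form `<file instance id>|<data node>`, by analogy with the dataset ids returned by the search.

```python
from collections import defaultdict


def combine_file_listings(results_by_node):
    """Merge per-node file records into one list of records per unique file.

    `results_by_node` maps a search node URL to the list of file records
    that node returned. Assumption: each record has an 'id' shaped like
    '<file instance id>|<data node>'.
    """
    combined = defaultdict(list)
    for records in results_by_node.values():
        for record in records:
            # Strip the data node suffix so replicas of a file share one key.
            file_key = record["id"].split("|")[0]
            # Different search nodes can return the identical record; keep one copy.
            if record not in combined[file_key]:
                combined[file_key].append(record)
    return dict(combined)
```

The union of keys across all nodes is then the best available guess at the full file list, and each key maps to every replica that was found for that file.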

I'll stop here for now, but here are the options I have going forward:

  1. Loop over multiple index nodes as part of my client class (seems slow and annoying)

  2. Use intake-esgf which already does this (would reduce my maintenance burden, but probably also be very slow; needs testing)

  3. Rewrite the client AGAIN to do all the requesting async (probably fastest, but also significant work)

Option 1 seems useless, since I might as well go with option 2. So I guess I'll time option 2 and then decide whether it is worth embarking on option 3.

😩
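For reference, the async option could be shaped roughly like this. It is only a sketch of the concurrency pattern using `asyncio.gather`: `search_all_nodes` and the `fetch` coroutine are hypothetical names, and the real per-node HTTP request is left out entirely.

```python
import asyncio


async def search_all_nodes(nodes, fetch):
    """Run `fetch(node)` for all nodes concurrently and collect the results.

    `fetch` is a hypothetical coroutine function that returns the file
    records for one search node. A node that raises is recorded as an empty
    list so that one unreachable node does not abort the whole search.
    """
    outcomes = await asyncio.gather(
        *(fetch(node) for node in nodes), return_exceptions=True
    )
    return {
        node: ([] if isinstance(outcome, Exception) else outcome)
        for node, outcome in zip(nodes, outcomes)
    }


async def fake_fetch(node):
    """Stand-in for a real request, used only to exercise the sketch."""
    await asyncio.sleep(0.01)
    if "broken" in node:
        raise ConnectionError(node)
    return [{"id": f"file.nc|{node}"}]


results = asyncio.run(search_all_nodes(["node-a", "broken-node", "node-b"], fake_fetch))
```

With this pattern the total wait is roughly that of the slowest node rather than the sum over all nodes, which is the main reason to consider the rewrite.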

@jbusecke
Owner Author

jbusecke commented May 8, 2024

Example of how intake-esgf might be used for this:

!pip install git+https://github.com/jbusecke/intake-esgf.git@http-links

import intake_esgf
from intake_esgf import ESGFCatalog
from intake_esgf.base import NoSearchResults
from pangeo_forge_esgf.utils import facets_from_iid

intake_esgf.conf.set(indices={
    "esgf-node.llnl.gov":True,
    "esg-dn1.nsc.liu.se":True,
    "esgf-data.dkrz.de":True,
    "esgf-node.ipsl.upmc.fr":True,
    "esgf-node.ornl.gov":True,
    "esgf.ceda.ac.uk":True,
    # "esgf.nci.org.au":True,
})
cat = ESGFCatalog()
def get_urls_from_intake_esgf(iid:str, cat:ESGFCatalog):
    print(iid)
    facets = facets_from_iid(iid)
    facets['version'] = facets['version'].replace('v','') # shouldn't be necessary once https://github.com/jbusecke/pangeo-forge-esgf/pull/41 is merged
    try:
        res = cat.search(**facets)
        return res.to_http_link_dict()
    except NoSearchResults:
        return None

iid = 'CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710'
a = get_urls_from_intake_esgf(iid, cat)
[i['path'] for i in a]

@jbusecke
Owner Author

jbusecke commented May 8, 2024

Ah, here is a way to catch these cases of incomplete file listings:

from pangeo_forge_esgf.client import ESGFClient
import json
iid = 'CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710'
client = ESGFClient()
d = client.get_instance_id_input([iid])
print(json.dumps(d, indent=4))

This produces

{
    "CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710": {
        "id": "CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710|esgf.nci.org.au",
        "version": "20190710",
        "access": ["HTTPServer", "GridFTP", "OPENDAP", "Globus"],
        "activity_drs": ["CMIP"],
        "activity_id": ["CMIP"],
        "cf_standard_name": ["air_temperature"],
        "citation_url": ["http://cera-www.dkrz.de/WDCC/meta/CMIP6/CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710.json"],
        "data_node": "esgf.nci.org.au",
        "data_specs_version": ["01.00.30"],
        "dataset_id_template_": ["%(mip_era)s.%(activity_drs)s.%(institution_id)s.%(source_id)s.%(experiment_id)s.%(member_id)s.%(table_id)s.%(variable_id)s.%(grid_label)s"],
        "datetime_start": "1975-01-16T12:00:00Z",
        "datetime_stop": "2014-12-16T12:00:00Z",
        "directory_format_template_": ["%(root)s/%(mip_era)s/%(activity_drs)s/%(institution_id)s/%(source_id)s/%(experiment_id)s/%(member_id)s/%(table_id)s/%(variable_id)s/%(grid_label)s/%(version)s"],
        "east_degrees": 359.0625,
        "experiment_id": ["historical"],
        "experiment_title": ["all-forcing simulation of the recent past"],
        "frequency": ["mon"],
        "further_info_url": ["https://furtherinfo.es-doc.org/CMIP6.MPI-M.MPI-ESM1-2-HR.historical.none.r1i1p1f1"],
        "geo": [
            "ENVELOPE(-180.0, -0.9375, 89.284225, -89.284225)",
            "ENVELOPE(0.0, 180.0, 89.284225, -89.284225)"
        ],
        "geo_units": ["degrees_east"],
        "grid": ["gn"],
        "grid_label": ["gn"],
        "index_node": "esgf.nci.org.au",
        "instance_id": "CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710",
        "institution_id": ["MPI-M"],
        "latest": true,
        "master_id": "CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn",
        "member_id": ["r1i1p1f1"],
        "mip_era": ["CMIP6"],
        "model_cohort": ["Registered"],
        "nominal_resolution": ["100 km"],
        "north_degrees": 89.284225,
        "number_of_aggregations": 1,
        "number_of_files": 8,
        "pid": ["hdl:21.14100/e7de3c1e-2c48-3470-ba5e-f97a62a1878c"],
        "product": ["model-output"],
        "project": ["CMIP6"],
        "realm": ["atmos"],
        "replica": true,
        "size": 56793078,
        "source_id": ["MPI-ESM1-2-HR"],
        "source_type": ["AOGCM"],
        "south_degrees": -89.284225,
        "sub_experiment_id": ["none"],
        "table_id": ["Amon"],
        "title": "CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn",
        "type": "Dataset",
        "url": ["http://esgf.nci.org.au/thredds/catalog/esgcet/CMIP6/CMIP/MPI-M/MPI-ESM1-2-HR/historical/r1i1p1f1/Amon/tas/gn/CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710.xml#CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710|application/xml+thredds|THREDDS"],
        "variable": ["tas"],
        "variable_id": ["tas"],
        "variable_long_name": ["Near-Surface Air Temperature"],
        "variable_units": ["K"],
        "variant_label": ["r1i1p1f1"],
        "west_degrees": 0.0,
        "xlink": [
            "http://cera-www.dkrz.de/WDCC/meta/CMIP6/CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710.json|Citation|citation",
            "http://hdl.handle.net/hdl:21.14100/e7de3c1e-2c48-3470-ba5e-f97a62a1878c|PID|pid"
        ],
        "_version_": 1689449850470400000,
        "retracted": false,
        "_timestamp": "2021-01-20T23:22:11.250Z",
        "score": 1.0
    }
}

My idea is to take these two fields:
"datetime_start": "1975-01-16T12:00:00Z",
"datetime_stop": "2014-12-16T12:00:00Z",

inject them as dataset attributes, and then check them against the dataset's actual time data to confirm that the dataset covers this range (or at least comes close to it).
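A minimal sketch of what that check could look like, using only the standard library. The function names and the 31-day default tolerance are assumptions for illustration, and the sketch presumes the dataset's first and last time values are already available as timezone-aware `datetime` objects. Datasets on non-standard calendars would need a cftime-aware comparison instead.

```python
from datetime import datetime, timedelta, timezone


def parse_esgf_time(value):
    """Parse a timestamp like '1975-01-16T12:00:00Z' into an aware datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def covers_expected_range(actual_start, actual_stop, expected_start, expected_stop,
                          tolerance=timedelta(days=31)):
    """Return True if the actual time span covers the advertised span.

    `expected_start` and `expected_stop` are the raw 'datetime_start' and
    'datetime_stop' strings from the search response. `tolerance` (an
    assumed default of about one monthly step) allows for small offsets
    at either end.
    """
    start_ok = actual_start <= parse_esgf_time(expected_start) + tolerance
    stop_ok = actual_stop >= parse_esgf_time(expected_stop) - tolerance
    return start_ok and stop_ok
```

A dataset whose last time step falls years before `datetime_stop` would fail this check, which is the signal that some files are missing from the listing.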
