Add Euclid HATS tutorial #108
Conversation
Co-authored-by: Brigitta Sipőcz <b.sipocz@gmail.com>
Co-authored-by: Jaladh Singhal <jaladhsinghal@gmail.com>
@@ -23,6 +23,9 @@ fsspec
sep>=1.4
h5py
requests
hats>=0.5.2
lsdb>=0.5.2
pyerfa>=2.0.1.3
do we need this?
Only if the notebook actually ends up using lsdb (currently, it does not). I'll prune the dependencies once that is known.
FYI: the new HATS requirement adds a very tight limit on version tolerance for our dependencies, so I will need to rethink how we do CI here. That points beyond this PR, so I will just push the CI workarounds for now, but will revise our CI approach for notebooks that rely on the latest and greatest features and libraries.
This now nicely triggered the need to update our CI approaches.
Thanks @bsipocz. Big picture, yes, we will need to require hats>=0.5.2 and lsdb>=0.5.2 in order to use lsdb with IRSA's HATS products. This notebook doesn't actually use lsdb right now, but it might before it's finished. If you'd prefer, I can just remove those dependencies for now and deal with adding them back later if/when needed.
Applied some feedback from @vandesai1. Remaining:
Looks mostly good to me! The pyarrow syntax throughout the notebook is easy to follow thanks to your comments and narrative.
Side note: from the HATS structure aspect, your notebook clearly demonstrates how to access multiple columns without needing to perform joins, and how to load only a slice of the data in memory (using pyarrow). But spatial filtering (where hpgeom and/or lsdb come into the picture) understandably isn't demonstrated, given the size of this notebook.
Related to this, I see your note to remove section 3.5 - perhaps this could become a second notebook that also does cross-matching of Euclid Q1 with some other catalog - maybe ZTF? This relates to the discussion we had at https://github.com/IPAC-SW/ipac-sp-notebooks/pull/104 (and the outline of such a tutorial). Let me know if you have a clear science use case in mind that could be pursued here.
pp_kwargs = dict(label=PHYSPARAM_GAL_Z + " (filtered)", color=tbl_colors["PHYSPARAM"], linestyle=":")
ax.hist(pp_df[PHYSPARAM_GAL_Z], **pp_kwargs, **hist_kwargs)
# Impose our final cuts.
pp_kwargs.update(label=PHYSPARAM_GAL_Z + " (quality)", linestyle="-")
[Maybe it's just me but] I find the "(filtered)" label quite confusing in contrast with "(quality)", because the latter is also a filtered set. I had to go back to previous cells to realize that "(filtered)" means "partial filter for quality" and "(quality)" means "final filter to eliminate further problematic sources". Maybe they could be named "(original filter)" and "(quality)", or some better naming that I can't think of (and AI can?!).
I struggled over those labels as well 😆. I'll give them some more thought.
- 'Norder' : (hats column) HEALPix order at which the data is partitioned.
- 'Npix' : (hats column) HEALPix pixel index at order Norder.
- 'Dir' : (hats column) Integer equal to 10_000 * floor[Npix / 10_000].
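As a concrete illustration of the 'Dir' formula above, here is a minimal sketch of how these three columns map to a partition path. The path template follows the standard HATS directory layout; the function name is hypothetical and not part of any library:

```python
def hats_partition_path(norder: int, npix: int) -> str:
    """Build a HATS-style partition path for a given HEALPix order and pixel.

    'Dir' groups up to 10,000 pixel files per directory so that no single
    directory listing becomes unmanageably large.
    """
    pixel_dir = 10_000 * (npix // 10_000)  # Dir = 10_000 * floor(Npix / 10_000)
    return f"Norder={norder}/Dir={pixel_dir}/Npix={npix}.parquet"


print(hats_partition_path(6, 12345))  # Norder=6/Dir=10000/Npix=12345.parquet
```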
[Optional] Here you can emphasize the last 3 partitioning columns by doing dataset.partitioning.schema in a cell.
Good idea, thanks.
s3_filesystem = pyarrow.fs.S3FileSystem()
schema = pyarrow.parquet.read_schema(euclid_parquet_schema_path, filesystem=s3_filesystem)
Since you have already read the dataset, why not use simpler syntax, replacing

s3_filesystem = pyarrow.fs.S3FileSystem()
schema = pyarrow.parquet.read_schema(euclid_parquet_schema_path, filesystem=s3_filesystem)

with

schema = dataset.schema
The schema in dataset.schema does not include the column metadata (units and descriptions), but the one I'm loading here does. I'll add some text mentioning that.
FYI for anyone interested, the reason is that including that metadata in the places that would make it show up in dataset.schema would result in a _metadata file (which is used to load the dataset) that is much bigger, and that makes it noticeably harder to work with. And the reason it's so much bigger is that the full schema gets repeated multiple times per data file (once per row group) in the _metadata file.
Oh, I wasn't aware of it. Some text for this will be helpful!
Thanks @jaladh-singhal!
Yes, that's also my thought at the moment. Need to finish getting the Euclid and ZTF HATS datasets released publicly, then I'll have a little more time to think through that notebook.
Ready for review, but should not be merged before the Euclid HATS catalog is released publicly.
Adds a notebook introducing the Euclid Q1 HATS product.
The dataset was initially in a testing bucket available from the Fornax and IPAC networks only; it has since been released publicly (nasa-fornax/fornax-demo-notebooks#416).