Add Euclid HATS tutorial #108


Draft: wants to merge 25 commits into base: main
Conversation

@troyraen (Contributor) commented Jun 5, 2025

Ready for review, but should not be merged before the Euclid HATS catalog is released publicly.

Adds a notebook introducing the Euclid Q1 HATS product. The dataset was initially in a testing bucket available only from the Fornax and IPAC networks; it has since been released publicly (nasa-fornax/fornax-demo-notebooks#416).

@@ -23,6 +23,9 @@ fsspec
sep>=1.4
h5py
requests
hats>=0.5.2
lsdb>=0.5.2
pyerfa>=2.0.1.3
Member:
do we need this?

Contributor (author):

Only if the notebook actually ends up using lsdb (currently, it does not). I'll prune the dependencies once that is known.

@bsipocz (Member) commented Jun 11, 2025

FYI: The new HATS requirement adds very tight version constraints on our dependencies, so I will need to rethink how we do CI here. This points beyond this PR, so I will just push the CI workarounds, but will revise the CI approach for notebooks that rely on the latest and greatest features and libraries.

@bsipocz (Member) commented Jun 11, 2025

This has nicely triggered the need to update our CI approach, as hats introduces very tight version requirements that would be nice not to enforce for all the other notebooks.
But this points beyond any reasonable scope here, and I will address it separately.

@troyraen (Contributor, Author):

Thanks @bsipocz. Big picture, yes, we will need to require hats>=0.5.2 and lsdb>=0.5.2 in order to use lsdb with IRSA's HATS products. This notebook doesn't actually use lsdb right now, but it might before it's finished. If you'd prefer, I can just remove those dependencies for now and deal with adding them back later if/when needed.

@troyraen (Contributor, Author):

Applied some feedback from @vandesai1. Remaining items:

  • I feel the Appendix should be part of the notebook, and earlier. I think a lot of astronomers don't know what "schema" means.
  • Should include a bit more description of how the catalog was joined, and any limitations if they exist.
  • Learning goals:
    • Compare to the learning goals in the "traditional" SPE notebook to check for balance/intentionality.
    • Should include why we want to use this format: pros/cons versus other methods.
    • I see three goals for this material: (a) show the format and the tools for it; (b) help users trust this new data product; (c) show users some of the ins and outs of Euclid catalog data. Tiffany and Anahita are going to provide additional reviews with these goals in mind.

@jaladh-singhal (Member) left a comment:

Looks mostly good to me! The pyarrow syntax throughout the notebook is easy to follow thanks to your comments and narrative.

Side note: From the HATS-structure perspective, your notebook clearly demonstrates how to access multiple columns without needing to perform joins, and how to load only a slice of the dataset into memory (using pyarrow). But spatial filtering (where hpgeom and/or lsdb comes into the picture) isn't demonstrated, understandably, given the size of this notebook.

Related to this, I see your note to remove section 3.5. Perhaps this can become a second notebook that also cross-matches Euclid Q1 with some other catalog, maybe ZTF? This relates to the discussion we had at https://github.com/IPAC-SW/ipac-sp-notebooks/pull/104 (and the outline of such a tutorial). Let me know if you have a clear science use case in mind that can be pursued here.

Comment on lines +439 to +442
pp_kwargs = dict(label=PHYSPARAM_GAL_Z + " (filtered)", color=tbl_colors["PHYSPARAM"], linestyle=":")
ax.hist(pp_df[PHYSPARAM_GAL_Z], **pp_kwargs, **hist_kwargs)
# Impose our final cuts.
pp_kwargs.update(label=PHYSPARAM_GAL_Z + " (quality)", linestyle="-")
Member:

[Maybe it's just me but] I find the "(filtered)" label quite confusing in contrast with "(quality)" because the latter is also a filtered set. I had to go back to previous cells to realize that "(filtered)" means "partial filter for quality" and "(quality)" means "final filter to eliminate further problematic sources". Maybe it can be named "(original filter)" and "(quality)" or some better naming that I can't think of (and AI can?!).

Contributor (author):

I struggled over those labels as well 😆. I'll give them some more thought.

- 'Norder' : (hats column) HEALPix order at which the data is partitioned.
- 'Npix' : (hats column) HEALPix pixel index at order Norder.
- 'Dir' : (hats column) Integer equal to 10_000 * floor[Npix / 10_000].
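The 'Dir' formula above can be sketched in a couple of lines of Python (a hedged illustration, not code from the notebook; the helper name `hats_dir` is my own):

```python
# Sketch of the 'Dir' partitioning value described above:
# Dir = 10_000 * floor(Npix / 10_000), i.e. data files are grouped
# into directories covering up to 10,000 HEALPix pixels each.
def hats_dir(npix: int) -> int:
    """Return the 'Dir' value for a given 'Npix' (hypothetical helper)."""
    return 10_000 * (npix // 10_000)

print(hats_dir(4567))    # 0
print(hats_dir(12345))   # 10000
print(hats_dir(20000))   # 20000
```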

Member:
[Optional] Here you can emphasize the last 3 partitioning columns by running dataset.partitioning.schema in a cell.

Contributor (author):

Good idea, thanks.

Comment on lines 1053 to 1054
s3_filesystem = pyarrow.fs.S3FileSystem()
schema = pyarrow.parquet.read_schema(euclid_parquet_schema_path, filesystem=s3_filesystem)
Member:

Since you have already read the dataset, why not use simpler syntax:

Suggested change
s3_filesystem = pyarrow.fs.S3FileSystem()
schema = pyarrow.parquet.read_schema(euclid_parquet_schema_path, filesystem=s3_filesystem)
schema = dataset.schema

Contributor (author):

The schema in dataset.schema does not include the column metadata (units and descriptions), but the one I'm loading here does. I'll add some text mentioning that.

FYI for anyone interested, the reason is that including that metadata in the places that would make it show up in dataset.schema would result in a much bigger _metadata file (which is used to load the dataset), and that makes the dataset noticeably harder to work with. It's so much bigger because the full schema gets repeated multiple times per data file (once per row group) in the _metadata file.

Member:

Oh, I wasn't aware of it. Some text for this will be helpful!

@troyraen (Contributor, Author):

Thanks @jaladh-singhal!

But spatial filtering (where hpgeom and/or lsdb comes into picture) isn't understandably demonstrated. ... perhaps this can become a second notebook which can also do cross-matching of Euclid q1 with some other catalogs - maybe ZTF?

Yes, that's also my thought at the moment. Need to finish getting Euclid and ZTF HATS datasets released publicly, then I'll have a little more time to think through that notebook.

Labels: content (Content related issues/PRs)
Projects: None yet
3 participants