Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: separate path and filename providers #25

Closed
wants to merge 2 commits into from
Closed

Conversation

jokasimr
Copy link
Contributor

@jokasimr jokasimr commented Feb 2, 2024

The old version did not allow loading local files into the workflow, everything had to go through pooch.
With this change the user can directly specify a local file path or the name of a file that is in the pooch data repository.

@jokasimr jokasimr requested a review from nvaytet February 2, 2024 10:07
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea isn't bad. But I think I would stick with Filename as the type for the actual loader and be more explicit about the argument to pooch. E.g. PoochFilename or BuiltinFilename.

Also, looking ahead, we will need something similar for SciCat. But there, a simple name is not enough. We need at least a (DatasetID, SciCatFilename) tuple. Not sure if this should affect the current PR, but it might inform the way forward.

Copy link
Contributor Author

@jokasimr jokasimr Feb 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll change the names. Good point about scicat but I don't think that affects the current PR since (I imagine) it would just be a question about adding a SciCatFilePath(id: SciCatFileIdentifier[RunType]) -> Filepath[RunType] provider that downloads the data somewhere and returns the path where it was stored.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly, but that can be difficult. You would need a way to map the RunType to a dataset id. And you need to know which file of the dataset you want.
But this is a discussion for another day

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SimonHeybrock had a different way of doing this for the Zoom data. It's about using different providers depending on whether your files are local or not.
You have to select which provider to use in the notebook.
See for example this local file provider compared to the remote file provider.

We should probably all agree on the same approach and use it everywhere.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this essentially the same approach but with different names?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this is using different types, and the other approach is to use different providers but the same types?

Copy link
Contributor Author

@jokasimr jokasimr Feb 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly, but that can be difficult. You would need a way to map the RunType to a dataset id. And you need to know which file of the dataset you want.
But this is a discussion for another day

Couldn't that mapping be provided by the user just like filenames are provided now? Like this:

pipeline[ScicatFileIdentifier[SampleRun]] = (dataset_id, 'sample_run.nxs')

# ... then we'd have a provider like this somewhere:
def download_scicat_data(identifier: ScicatFileIdentifier[RunType]) -> Filename[RunType]:
    ''' Downloads the dataset and puts it somewhere on disk, returns the path'''

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to amend my original comment. In ESSsans, we use Filename for some basic, user provided filename and FilePath as the actual path on disk. So pooch converts a Filename into a FilePath.

A SciCat provider would convert something like your ScicatFileIdentifier to a FilePath.

This is also listed here scipp/essreduce#7
But this issue for now lists current usage, not a guideline yet. So if you don't like this naming, please comment!

@@ -29,6 +29,7 @@
conversions.providers,
resolution.providers,
beamline.providers,
data.providers,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove this? If we insert this provider by default, then every time the user want to provide the file name in a different way, they first have to remove the pooch provider.

Copy link
Contributor Author

@jokasimr jokasimr Feb 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parameters override providers in Sciline so that is not a problem. See example below:

import sciline as sl

def test(f: float) -> int:
    return int(f)

sl.Pipeline([test], params={int: 1})

Similarly when replacing providers:

pl = sl.Pipeline([test], params={float: 1.0})
pl[int] = 2
pl.compute(int)

The above does not raise AmbiguousProviderError.
Or did I misunderstand you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @jl-wynen, this should be removed. The providers we use for the docs are not useful for normal users. If they forget to insert data (or a data provider) we want an error, not something that silently loads some unknown data and processes this. This could result in someone wasting hours debugging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense to exclude providers we only use in docs from the default provider list. I'll remove it.

" SampleRotation[Reference]: sc.scalar(0.8389, unit='deg'),\n",
" Filename[Reference]: \"reference.nxs\",\n",
" PoochFilename[Reference]: \"reference.nxs\",\n",
"}"
]
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also have a short cell that explains how to load local file by setting Filename[Run] in the params here, maybe...?

Copy link
Member

@SimonHeybrock SimonHeybrock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please have a look at how I solved a similar problem scipp/esssans#50. Not sure that is perfect, but the idea is that it would allow for adding things such as SciCat, which could convert a filename (or run ID) into a file path.

@jokasimr
Copy link
Contributor Author

jokasimr commented Feb 5, 2024

Please have a look at how I solved a similar problem scipp/esssans#50. Not sure that is perfect, but the idea is that it would allow for adding things such as SciCat, which could convert a filename (or run ID) into a file path.

@jl-wynen raised a similar concern earlier about future scicat file loading. I don't understand why the solution in this PR would hinder adding a SciCatDataset provider later, and I've described above how I would propose to solve that, but maybe there's something I'm missing, can you explain the problem?

@SimonHeybrock
Copy link
Member

Please have a look at how I solved a similar problem scipp/esssans#50. Not sure that is perfect, but the idea is that it would allow for adding things such as SciCat, which could convert a filename (or run ID) into a file path.

@jl-wynen raised a similar concern earlier about future scicat file loading. I don't understand why the solution in this PR would hinder adding a SciCatDataset provider later, and I've described above how I would propose to solve that, but maybe there's something I'm missing, can you explain the problem?

The problem is that we have to avoid using a slightly different solution in every project, i.e., we should coordinate.

@jokasimr
Copy link
Contributor Author

jokasimr commented Feb 6, 2024

The problem is that we have to avoid using a slightly different solution in every project, i.e., we should coordinate.

Absolutely agree.

I was looking through the PR you mentioned for Zoom and I think the solutions are similar but with some differences.
Roughly the types in the Zoom PR corresponds to the types in this PR like this:

# Zoom , # Reflectometry
Filename <-> PoochFilename
FilePath   <-> Filename
DataFolder <-> Does not exist
FilenameType <-> Does not exist

Is there a specific reason to split the file path into folder and name? In my experience that is a source of errors, for example in case you have files with the same name in different folders it is easy to accidentally select the file in the wrong folder.

I also think it seems generally more complicated to add four types instead of two. The way I see it we need N+1 types, 1 for the file path from where we are actually going to read the file, and N more for the N different file sources (Pooch, SciCat, etc).

This is why I'm reluctant to change this to mirror the zoom pr immediately, but if you still think that's the best way to do it then I'll do that, it is very good if file loading is identical everywhere.

@@ -23,11 +26,14 @@ def _make_pooch():
_pooch = _make_pooch()


def get_path(name: str) -> str:
def getpath(name: PoochFilename[Run]) -> FilePath[Run]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In ESSsans, these functions have this signature:

def get_path(filename: FilenameType) -> FilePath[FilenameType]:

with

FilenameType = TypeVar('FilenameType', bound=str)
class FilePath(sciline.Scope[FilenameType, str], str): ...

I don't know why this is parametrised. @nvaytet Can you clarify?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because FilePath can be used for any file, not just Run files, but also mask files, or the file containing the direct beam function. Does that answer your question?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thanks!

So @jokasimr, can you change this to match the implementation in ESSsans except for the FilenameType?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can change it, but I'm not sure what you mean by matching the implementation in ESSsans except for FilenameType?

Should FilePath not be parameterized by anything or should it be parameterized by only the RunType?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just about the names. Call the function get_path and the argument Filename. And keep the way it is parametrised.

@jokasimr jokasimr force-pushed the fix-local-files branch 2 times, most recently from ea8ddc1 to fd86bf6 Compare March 11, 2024 09:15
@jokasimr jokasimr enabled auto-merge March 11, 2024 09:17
@jokasimr jokasimr requested a review from jl-wynen March 25, 2024 15:10
The old version did not allow loading local files into the workflow,
everything had to go through pooch.
With this change the user can directly specify a local file path or
the name of a file that is in the pooch data repository.
@jokasimr
Copy link
Contributor Author

Closing because this issue was addressed in #40

@jokasimr jokasimr closed this Mar 25, 2024
auto-merge was automatically disabled March 25, 2024 15:18

Pull request was closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants