-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: separate path and filename providers #25
Conversation
src/essreflectometry/types.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea isn't bad. But I think I would stick with Filename
as the type for the actual loader and be more explicit about the argument to pooch. E.g. PoochFilename
or BuiltinFilename
.
Also, looking ahead, we will need something similar for SciCat. But there, a simple name is not enough. We need at least a (DatasetID, SciCatFilename)
tuple. Not sure if this should affect the current PR, but it might inform the way forward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll change the names. Good point about scicat but I don't think that affects the current PR since (I imagine) it would just be a question about adding a SciCatFilePath(id: SciCatFileIdentifier[RunType]) -> Filepath[RunType]
provider that downloads the data somewhere and returns the path where it was stored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly, but that can be difficult. You would need a way to map the RunType
to a dataset id. And you need to know which file of the dataset you want.
But this is a discussion for another day
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@SimonHeybrock had a different way of doing this for the Zoom data. It's about using different providers depending on whether your files are local or not.
You have to select which provider to use in the notebook.
See for example this local file provider compared to the remote file provider.
We should probably all agree on the same approach and use it everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this essentially the same approach but with different names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought this is using different types, and the other approach is to use different providers but the same types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly, but that can be difficult. You would need a way to map the RunType to a dataset id. And you need to know which file of the dataset you want.
But this is a discussion for another day
Couldn't that mapping be provided by the user just like filenames are provided now? Like this:
pipeline[ScicatFileIdentifier[SampleRun]] = (dataset_id, 'sample_run.nxs')
# ... then we'd have a provider like this somewhere:
def download_scicat_data(identifier: ScicatFileIdentifier[RunType]) -> Filename[RunType]:
''' Downloads the dataset and puts it somewhere on disk, returns the path'''
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to amend my original comment. In ESSsans, we use Filename for some basic, user provided filename and FilePath as the actual path on disk. So pooch converts a Filename into a FilePath.
A SciCat provider would convert something like your ScicatFileIdentifier to a FilePath.
This is also listed here scipp/essreduce#7
But this issue for now lists current usage, not a guideline yet. So if you don't like this naming, please comment!
@@ -29,6 +29,7 @@ | |||
conversions.providers, | |||
resolution.providers, | |||
beamline.providers, | |||
data.providers, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you remove this? If we insert this provider by default, then every time the user want to provide the file name in a different way, they first have to remove the pooch provider.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parameters override providers in Sciline so that is not a problem. See example below:
import sciline as sl
def test(f: float) -> int:
return int(f)
sl.Pipeline([test], params={int: 1})
Similarly when replacing providers:
pl = sl.Pipeline([test], params={float: 1.0})
pl[int] = 2
pl.compute(int)
The above does not raise AmbiguousProviderError
.
Or did I misunderstand you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @jl-wynen, this should be removed. The providers we use for the docs are not useful for normal users. If they forget to insert data (or a data provider) we want an error, not something that silently loads some unknown data and processes this. This could result in someone wasting hours debugging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense to exclude providers we only use in docs from the default provider list. I'll remove it.
docs/examples/amor.ipynb
Outdated
" SampleRotation[Reference]: sc.scalar(0.8389, unit='deg'),\n", | ||
" Filename[Reference]: \"reference.nxs\",\n", | ||
" PoochFilename[Reference]: \"reference.nxs\",\n", | ||
"}" | ||
] | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also have a short cell that explains how to load local file by setting Filename[Run]
in the params here, maybe...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please have a look at how I solved a similar problem scipp/esssans#50. Not sure that is perfect, but the idea is that it would allow for adding things such as SciCat, which could convert a filename (or run ID) into a file path.
@jl-wynen raised a similar concern earlier about future scicat file loading. I don't understand why the solution in this PR would hinder adding a SciCatDataset provider later, and I've described above how I would propose to solve that, but maybe there's something I'm missing, can you explain the problem? |
The problem is that we have to avoid using a slightly different solution in every project, i.e., we should coordinate. |
Absolutely agree. I was looking through the PR you mentioned for Zoom and I think the solutions are similar but with some differences. # Zoom , # Reflectometry
Filename <-> PoochFilename
FilePath <-> Filename
DataFolder <-> Does not exist
FilenameType <-> Does not exist Is there a specific reason to split the file path into folder and name? In my experience that is a source of errors, for example in case you have files with the same name in different folders it is easy to accidentally select the file in the wrong folder. I also think it seems generally more complicated to add four types instead of two. The way I see it we need N+1 types, 1 for the file path from where we are actually going to read the file, and N more for the N different file sources (Pooch, SciCat, etc). This is why I'm reluctant to change this to mirror the zoom pr immediately, but if you still think that's the best way to do it then I'll do that, it is very good if file loading is identical everywhere. |
8fbe33a
to
1b52c89
Compare
src/essreflectometry/amor/data.py
Outdated
@@ -23,11 +26,14 @@ def _make_pooch(): | |||
_pooch = _make_pooch() | |||
|
|||
|
|||
def get_path(name: str) -> str: | |||
def getpath(name: PoochFilename[Run]) -> FilePath[Run]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In ESSsans, these functions have this signature:
def get_path(filename: FilenameType) -> FilePath[FilenameType]:
with
FilenameType = TypeVar('FilenameType', bound=str)
class FilePath(sciline.Scope[FilenameType, str], str): ...
I don't know why this is parametrised. @nvaytet Can you clarify?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because FilePath
can be used for any file, not just Run
files, but also mask files, or the file containing the direct beam function. Does that answer your question?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thanks!
So @jokasimr, can you change this to match the implementation in ESSsans except for the FilenameType
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can change it, but I'm not sure what you mean by matching the implementation in ESSsans except for FilenameType
?
Should FilePath
not be parameterized by anything or should it be parameterized by only the RunType
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just about the names. Call the function get_path
and the argument Filename
. And keep the way it is parametrised.
ea8ddc1
to
fd86bf6
Compare
The old version did not allow loading local files into the workflow, everything had to go through pooch. With this change the user can directly specify a local file path or the name of a file that is in the pooch data repository.
219b7a5
to
4032b0b
Compare
Closing because this issue was addressed in #40 |
Pull request was closed
The old version did not allow loading local files into the workflow, everything had to go through pooch.
With this change the user can directly specify a local file path or the name of a file that is in the pooch data repository.