Make a n3fit.io provider #1081
Conversation
to do:

WIP, but also please take a look and tell me if it's horrible.
I should say this links to #1080, because I would vote to move the new module in this PR to that subdir.

Thanks! I'll have a look at it tomorrow.
The positivity is trivial if and only if you remember that both integrability and positivity go through the same place and you set all the correct flags (which for now is set `True` if `"INTEG" in dataset.name`). For the rest it is just a simpler dataset.
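The flag logic described above can be sketched as follows (a minimal sketch with a hypothetical helper name, not the actual n3fit code):

```python
# Hypothetical sketch: positivity and integrability datasets share the same
# code path; the integrability flag is derived from the dataset name, as
# described in the comment above.
def is_integrability(dataset_name: str) -> bool:
    """Return True when the dataset implements an integrability constraint."""
    return "INTEG" in dataset_name

# Positivity sets take the same path with the flag left False.
for name in ("POSF2U", "INTEGXT3"):  # illustrative names, not real datasets
    print(name, is_integrability(name))
```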
n3fit/src/n3fit/performfit.py (outdated):

```diff
@@ -225,7 +197,7 @@ def performfit(
     # Note: In the basic scenario we are only running for one replica and thus this loop is only
     # run once and all_exp_infos is a list of just than one element
     stopwatch.register_times("data_loaded")
-    for replica_number, exp_info, nnseed in zip(replica, all_exp_infos, nnseeds):
+    for replica_number, exp_info, nnseed in zip(replicas, all_exp_infos, replicas_nnseed):
```
I guess at this point (maybe for the future) it would make sense to have one single object which contains all_exp_infos and also the replica number and the corresponding seed? So that one can do `exps_replicas_fitting_data_dict.iterate_replicas()` or `exps_replicas_fitting_data_dict.iterate_experiments()`. But for now it surely makes sense to keep it with the "old structure".
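The suggested container could look something like this (a rough sketch under assumed names; the class and its fields are invented for illustration):

```python
# Hypothetical sketch of the suggested single object bundling experiment
# info, replica numbers and seeds, with the two iteration helpers proposed
# above. All names here are illustrative.
from dataclasses import dataclass

@dataclass
class ReplicaFittingData:
    replicas: list        # replica numbers
    nnseeds: list         # one seed per replica
    all_exp_infos: list   # one experiment-info list per replica

    def iterate_replicas(self):
        """Yield (replica_number, exp_info, seed) triples, one per replica."""
        yield from zip(self.replicas, self.all_exp_infos, self.nnseeds)

    def iterate_experiments(self):
        """Yield each experiment info across all replicas."""
        for exp_info in self.all_exp_infos:
            yield from exp_info

data = ReplicaFittingData([1, 2], [101, 202], [["exp_a"], ["exp_b"]])
for replica_number, exp_info, nnseed in data.iterate_replicas():
    print(replica_number, exp_info, nnseed)
```

This would replace the bare three-way `zip` in `performfit` with a named iteration, at the cost of one more small class to maintain.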
```python
from NNPDF import RandomGenerator
from reportengine import collect

from n3fit.io.reader import common_data_reader_experiment
```
I guess this `common_data_reader_experiment` would be made obsolete by the python commondata?
Possibly. I guess a large amount of `n3fit` relies on the dictionaries which are returned by the `common_data_reader_*` functions. Obviously the easiest thing would be to make some kind of "wrapper" which took the python commondata and then filled the same dictionary, so the replacement might look quite similar. It's been a while since I saw how deep these dictionaries get passed.

Btw, feel free to "break" any part of the reader you dislike. It was basically written during a time when […]. In summary, if at any point you wonder "should I use this clever vp function instead of this convoluted loop of loops" the answer is probably "yes".
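The "wrapper" idea mentioned above could be sketched like this (field names and the dictionary layout are assumptions for illustration, not the real `common_data_reader_*` output):

```python
# Hypothetical sketch of the wrapper: take a python commondata object and
# fill the same dictionary layout the legacy common_data_reader_* functions
# returned, so downstream n3fit code is unchanged. All field names invented.
from types import SimpleNamespace

def commondata_to_legacy_dict(commondata):
    """Adapt a python commondata object to the legacy reader dictionary."""
    return {
        "name": commondata.setname,
        "ndata": commondata.ndata,
        "central_values": list(commondata.central_values),
    }

# Stand-in object, just to show the shape of the adaptation.
fake_cd = SimpleNamespace(setname="NMC", ndata=3, central_values=[0.1, 0.2, 0.3])
print(commondata_to_legacy_dict(fake_cd))
```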
The positivity datasets were easy enough to "actionify"; the integrability datasets less so, because of the weird combo of parsing and production rule and the chance that they might not exist. In this case I just moved the relevant part of performfit to a different location. Since fitting gets expanded as a namespace, I was able to remove it as input and take the actual parameters. The obvious downside is that performfit takes loads of parameters now. I think the upside is that it's a bit less obfuscated what should be present in fitting. I still need to update the docstrings in the actions. Probably also the names; they're a bit ugly right now.
Looks good; as you say, a fit bot run will be nice to check that everything works as it should.

I will finish this... later
This is ready for review; I moved the reader to validphys now. I also added an action that returns the pseudodata. From what I can tell, the output should be pretty much identical to […]. Quite keen for somebody to take a look at this, because I'd like to dump the pseudodata from the closure tests for something Luigi and I are looking into; equally, I don't want to rush this through because it touches some pretty fundamental parts of running a fit (hopefully without changing anything, though). If the tests run I will run the fit bot.
Btw, as well as making several things easier, such as #1059, and allowing us to slot in the python replica generation, this is also one of the steps towards making separate packages for n3fit and validphys.
n3fit/src/n3fit/performfit.py (outdated):

```python
    load=None,
    sum_rules=True,
    tensorboard=None,
    save=None,
```
I would prefer to have the parameters, since there are so many, a bit more organized, like `load`/`save` being together and the same for `tensorboard`/`save_weights_each`/`debug`. Also `kfold_parameters` should have `=None`, looking at the docstr.
That makes sense; I think the ordering I was aiming for was order of use. But I agree that it would probably be easier to read if they were grouped by context. With regards to your other comment, would you prefer the check-only arguments first or last?
> Also `kfold_parameters` should have `=None`, looking at the docstr.
This doesn't need a default, because the default gets filled in by the production rule in the `Config` class. I think the docstring is consistent with its actual values.
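The production-rule mechanism referred to here can be sketched roughly as follows (a minimal stand-in for the reportengine pattern of `produce_*` methods on the `Config` class; the method and key names are illustrative, not the real API):

```python
# Hypothetical sketch of a reportengine-style "production rule": a produce_*
# method on the Config class supplies the default when the runcard omits the
# key, so the provider signature itself needs no `=None`.
class Config:
    def __init__(self, input_params):
        self.input_params = input_params

    def produce_kfold_parameters(self):
        # Fill in a default when the runcard omits the key.
        return self.input_params.get("kfold", {"folds": None})

print(Config({}).produce_kfold_parameters())
print(Config({"kfold": {"folds": 4}}).produce_kfold_parameters())
```

The point is that providers downstream always receive a value, which is why the `=None` default in the function signature would be redundant.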
That way it also goes with the other hyperopt options. Regarding the checks, I don't mind; maybe a comment is enough. It might confuse people at some point in the future (I guess pylint, for instance, will tell you to remove them).
Is that better?

```python
# place tables in last replica path
self.table_folder = path / TAB_FOLDER
self.table_folder.mkdir(exist_ok=True)
```
Can this be moved under `if file_content["fitting"].get("savepseudodata"):`? Or do we want the `tables` folder to always exist, even when it will be empty?
I put it here because it modifies the environment attributes; I felt like having empty tables folders wasn't a disaster, but I think saying I *want* that would be a step too far :P
The alternative would be to just set the table path to be exactly the replica path and the table would get put with all the other files. I don't think that would be so bad
I've done my alternative suggestion; I personally quite like it:

```
$ n3fit test.yml 1
...
$ ls test/nnfit/replica_1
chi2exps.log                                  test.exportgrid  test.params    test.time
datacuts_theory_fitting_pseudodata_table.csv  test.fitinfo     test.preproc   version.info
tboard                                        test.json        test.sumrules  weights.h5
```
I prefer it to the tables folder.
Yes, the name isn't ideal but also I don't mind that
I think it would be good, for the moment, to have a check: if more than one replica is requested, then pseudodata can't be saved and the fit won't run.

That would be one solution, perhaps the easiest. I was thinking that if one loaded and concatenated all the tables from all the replicas, then it wouldn't matter if somebody did the above. The function would then yield the columns of that super table in an order which corresponded to the final PDF replica.
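The "super table" idea can be sketched like this (a toy model only: tables are represented as plain dicts rather than the actual on-disk CSV tables, and the function name is invented):

```python
# Hypothetical sketch of the "super table" idea: gather the per-replica
# pseudodata tables, then yield the pseudodata column of each replica in the
# order of the final PDF replicas. Tables are modelled as plain dicts.
def iterate_replica_columns(replica_tables):
    """Yield (replica_number, column) pairs in ascending replica order."""
    for replica_number, table in sorted(replica_tables.items()):
        yield replica_number, table["pseudodata"]

# Tables may be loaded in any order; iteration restores replica order.
tables = {2: {"pseudodata": [0.4, 0.5]}, 1: {"pseudodata": [0.1, 0.2]}}
for replica, col in iterate_replica_columns(tables):
    print(replica, col)
```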
https://github.com/NNPDF/nnpdf/pull/1081/checks?check_run_id=2014283823#step:6:43 Why is Tensorflow 1.14 being installed with the conda package for the bot? Travis tests have passed, so the conda recipe should be ok...

and why is it coming from […]
I thought I had virtually copied the original seed behaviour when I refactored this part. Have I messed it up potentially? Are there any serious ramifications?
You maybe did; I'm not suggesting you caused the issue, I'm more concerned that you've been using […]. There's no real ramifications, apart from if somebody was relying on reproducing the pseudodata used in n3fit, whereas running […] would give you some different replicas.
Ah, my apologies everyone! Fortunately I ended up not really needing that function anyway. Sorry for missing this at the time.
No need to apologise. So the test that I can't easily get to pass is […]
… behaviour has changed slightly but now is reproducible independently of number of replicas being fitted at once
Are there any further comments on this? Or can we merge?

Another bump. I appreciate everyone is busy, but as mentioned on Friday I would like this for the closure fits, and as somebody else mentioned I would ideally have those fits done in advance of my PDF4LHC talk (clearly having them by the practice this week is unfeasible).

Sure thing mate, this is high on my priority list now; will have it reviewed by end of the day.
I think it'd be nice if you could save the training and validation splits as is done in the current C++ code, but I appreciate you want this merged as a matter of urgency.

Thanks for taking a look; I'll make appropriate changes tomorrow morning.
Hi @siranipour, I think I addressed all of the points I agreed with — thanks for the review. I also added an action which dumps the training masks with an index. This action is basically only callable from the API right now, but in the future people are welcome to save this with the data in the same way the pseudodata is saved. I've expanded the documentation on the relevant actions and updated the action in […]
Docs and examples from the new commits look great, cheers Mikey!

Don't think I'll be looking at this in detail, but just to echo that it indeed looks very good.

Yes, lgtm
Duplicate the parts of `n3fit.io.reader` which get called by a fit, but transform them into validphys-style providers.

I haven't looked at the integrability or positivity sets, although they looked kind of trivial in comparison from a scan (famous last words).

The idea here was to change to using providers in n3fit, but still driven by libnnpdf, so hopefully nothing changes in a fit. I don't think this breaks hyperopt or conflicts with any PRs out there. Obviously a fit bot run will test this fully.

The next step would be to replace the calls into libnnpdf with providers from the various "destroy C++" PRs. I would probably move this new module somewhere else and call it something better as well — this was more of a proof of principle. Changing the module name and moving it just requires updating the providers list in n3fit_exec, so it's very easy to do without breaking anything.
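To make the "validphys-style providers" phrasing concrete, here is a rough stand-in for the pattern (not the real reportengine API — `masked_dataset` and this toy `collect` are invented to show the shape of the idea):

```python
# Hypothetical sketch: a provider is a plain function whose arguments are
# resolved by name from the runcard namespace, and a collect-style helper
# gathers its results over several namespaces. Toy stand-in only; the real
# machinery lives in reportengine.
def masked_dataset(dataset, mask):
    """A provider: inputs arrive by name, the return value is the resource."""
    return [point for point, keep in zip(dataset, mask) if keep]

def collect(provider, namespaces):
    """Minimal stand-in for reportengine.collect: call the provider once
    per namespace and return the list of results."""
    return [provider(**ns) for ns in namespaces]

namespaces = [
    {"dataset": [1, 2, 3], "mask": [True, False, True]},
    {"dataset": [4, 5], "mask": [False, True]},
]
print(collect(masked_dataset, namespaces))
```

In the real code, the fit driver would then consume such collected resources instead of calling `n3fit.io.reader` directly, which is what allows the backend (libnnpdf now, pure python later) to be swapped without touching n3fit itself.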
I added some inefficiency with the mask functions, but the upside is that we expose the masks for each dataset. This action is easy to modify to remove the dependency on `theoryid`, taking `group_name` and `dataset_inputs_loaded_cd_with_cuts` instead. I don't think these functions are anywhere near the bottleneck, though, so I can't be bothered to think of a better way to do this.