Trvl mask layers #1661
Conversation
# Conflicts: # n3fit/src/n3fit/model_trainer.py
Intermediate merge master into trvl mask layers
Intermediate merge from master
I am facing problems with datasets with a single datapoint: some replicas will mask these out, others won't. That is a poor fit for the design I followed, which assumes the training/validation masks can be represented as block-wise boolean arrays with the same number of 'True' values per row, so that the output tensor is not strided.
This was recently changed in #1636. In principle I would be happy with a GPU implementation that treats datasets with a single point in the old way (i.e. including them in the training set). The difference in the result is negligible, so we could still use the GPU for most purposes; only for a very final fit before a release might it be better to have the new treatment of single-point datasets. @scarlehoff has many GPUs available, so perhaps he has an opinion on this.
I'd be happy with the solution of just ignoring those datasets for the time being.
Also fine for me. The point is (to address @goord's concerns) that this is not a showstopper for the parallel fits.
Yes, that would be the best solution, as things become very complicated if we can't assume an equal number of masked data points across replicas. I will try to include (or exclude?) all single-point datasets in the fit if parallel replicas is set and same_trvl_split is unset...
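To make the design assumption concrete, here is a minimal numpy sketch (shapes and variable names are illustrative, not the actual n3fit code): the per-replica training masks form one boolean block of shape (replicas, ndata), and only when every row contains the same number of True entries can the masked output be reshaped into a regular, non-ragged tensor. A single-point dataset that lands in the training set for some replicas but not others breaks exactly that property.

```python
import numpy as np

# Hypothetical sizes: 3 replicas, 8 data points, exactly 5 training points per replica
n_replicas, n_data, n_tr = 3, 8, 5

rng = np.random.default_rng(0)
tr_mask = np.zeros((n_replicas, n_data), dtype=bool)
for r in range(n_replicas):
    tr_mask[r, rng.choice(n_data, size=n_tr, replace=False)] = True

# The design relies on an equal True-count per row...
assert (tr_mask.sum(axis=1) == n_tr).all()

predictions = rng.normal(size=(n_replicas, n_data))
# ...so boolean indexing can be reshaped into a dense (replicas, n_tr) block
tr_block = predictions[tr_mask].reshape(n_replicas, n_tr)
```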
I think I like this solution. It is more or less what I had in mind I think.
RE the one-point datasets, I think we can eventually find a way around it but let's ignore them for the moment.
Did you run any parallel fits with this code? Did they work? Is the memory footprint greatly impacted?
masked_output_layers.append(mask_layer(output_layer))

# Finally concatenate all observables (so that experiments are one single entity)
ret = op.concatenate(masked_output_layers)
Why is the axis removed? (I guess the default is exactly the right axis, but I'd like to have it explicit, it makes debugging easier)
I thought it was a bit overly explicit, but since all tensor shapes have to be as explicit as possible for tensorflow to do the correct thing, I will re-insert it.
More than for tensorflow, it is for the person reading the code in this case, since it is hard to keep track of which axis is what :P
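For illustration (not the actual n3fit wrapper; this assumes op.concatenate simply forwards the axis keyword to the backend concatenation, as tf.concat does), the explicit version the reviewer is asking for would read:

```python
import tensorflow as tf

# Two toy "masked observable" outputs with a batch-like leading dimension
masked_output_layers = [tf.ones((1, 5)), tf.ones((1, 3))]

# Spell out the axis even when it matches the default: all observables are
# joined along the data axis so experiments become one single entity
ret = tf.concat(masked_output_layers, axis=-1)  # shape (1, 8)
```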
tr_mask_layers = []
vl_mask_layers = []
offset = 0
apply_masks = spec_dict.get("data_transformation_tr") is None and mask_array is not None
I'm a bit worried. If mask_array is not None but there is a data_transformation_tr, then the masks will not be applied. If this is necessary then it should fail at the beginning.
We usually do that by adding a check before the fit starts. In this case it should check whether the run options include a parallel fit and a data_transformation, and if so validphys will raise an exception telling the user which options are inconsistent.
(for the time being you can put just a raise Exception here to stop it and create the check at the end)
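For illustration, a rough sketch of what such a check could look like, assuming the make_argcheck/CheckError pattern used by the existing n3fit/validphys checks; the argument names and the exact condition are placeholders, not the real runcard keys:

```python
from reportengine.checks import CheckError, make_argcheck

@make_argcheck
def check_mask_compatible_options(parallel_models, data_transformation_tr):
    """Fail before the fit starts if a parallel fit with per-replica masks is
    requested together with a dataset that defines data_transformation_tr."""
    if parallel_models and data_transformation_tr is not None:
        raise CheckError(
            "Parallel fits with per-replica training/validation masks are "
            "incompatible with data_transformation_tr; disable one of the two."
        )
```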
n3fit/src/n3fit/model_gen.py
trmask = mask_array[:, offset:offset + dataset.ndata] if apply_masks else None
tr_mask_layers.append(Mask(trmask, axis=1, c=1) if apply_masks else None)
vl_mask_layers.append(Mask(~trmask, axis=1, c=1) if apply_masks else None)
Suggested change:

if apply_masks:
    trmask = mask_array[:, offset:offset + dataset.ndata]
    tr_mask_layers.append(Mask(bool_mask=trmask, axis=1))
    vl_mask_layers.append(Mask(bool_mask=~trmask, axis=1))
else:
    trmask = None
    tr_mask_layers.append(None)
    vl_mask_layers.append(None)
I don't like the idea of having a list of None; I think with the check above you will always be in a consistent state, and you might be able to check whether to apply the mask somewhere else (so you don't need the None).
Fixed in rev. bc56876
model_observables.append(obs_layer)

# shift offset for new mask array
offset = offset + dataset.ndata
Would there be a way to have a list of arrays from the outset (instead of having an offset that we move), such that each dataset in the list corresponds to one array?
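A minimal sketch of that idea (function and variable names are illustrative): the full (replicas, ndata_total) mask is split into one array per dataset up front, so the generation loop can zip over datasets and their masks instead of carrying an offset.

```python
import numpy as np

def split_mask_per_dataset(mask_array, datasets):
    # cumulative dataset sizes give the split points along the data axis
    boundaries = np.cumsum([ds.ndata for ds in datasets])[:-1]
    return np.split(mask_array, boundaries, axis=1)

# the loop would then look roughly like:
# for dataset, dataset_mask in zip(datasets, split_mask_per_dataset(mask_array, datasets)):
#     tr_mask_layers.append(Mask(dataset_mask, axis=1, c=1))
#     vl_mask_layers.append(Mask(~dataset_mask, axis=1, c=1))
```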
@@ -37,6 +40,7 @@ class ObservableWrapper:

    name: str
    observables: list
    trvl_mask_layers: list
Maybe we can have this be optional (= None), and if it is None then it doesn't get applied. Because (I think, maybe I'm wrong!) we should never be in a situation in which some of the masks exist and some are None; that way you can avoid the list of None.
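A small sketch of what that could look like (the real ObservableWrapper has more fields; this only illustrates making the attribute optional so a mixed list of layers and None never appears):

```python
from dataclasses import dataclass

@dataclass
class ObservableWrapperSketch:
    name: str
    observables: list
    trvl_mask_layers: list = None  # None means "apply no masking at all"

    def _apply_masks(self, output_layers):
        if self.trvl_mask_layers is None:
            return output_layers
        return [mask(out) for mask, out in zip(self.trvl_mask_layers, output_layers)]
```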
Fixed in rev. bc56876. I did not insert the default value yet though
n3fit/src/n3fit/model_trainer.py
import numpy |
Suggested change: remove the unused import numpy line.
Hi @scarlehoff, I haven't replied to your suggestions because I propose to first get the validation in order. After testing with the basic_parallel test case with only a single dataset, I believe I still have a bug somewhere in the implementation, because my chi2 values differ significantly from the master sequential (or parallel) runs.
These are means over 500 replicas.
I have solved a few problems with the implementation: the first (obvious) one was that the masked truth values weren't correctly propagating to the loss function. Another issue was that a replica-specific inverse covariance was not taken into account. The latest commits result in a basic-runcard fit that is bitwise identical to the sequential run for a small number of epochs. More validation results will follow.
Regarding memory usage: due to the different masks per replica, the FK table data is no longer shared between replicas, which increases the memory footprint.
Although the FK table loading is in its own function, it is still evaluated separately for each replica. The easiest fix is to wrap the loading call in an lru_cache.
Yes. I think that should work since all replicas are using the same fktables now (i.e., it should work as long as the fktables are indeed shared across replicas).
@functools.lru_cache
def load_cached_fk_tables(fk, cuts):
    return fk.load_with_cuts(cuts)
Wouldn't it be better to wrap fk.load_with_cuts directly? (in practical terms it should be the same)
Given an fktable and a set of cuts, I see no reason why we would want two different objects, so that is probably a positive change in other parts of the code as well*
*hopefully
(of course, for the purposes of testing and benchmarking maybe it is better to start here)
Yes, that would be the more elegant solution, but it needs thorough testing because it indeed impacts the code in more places: there will be shared objects where there were copies before. I can have a look.
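For reference, a sketch of the "wrap the method directly" variant (the class name is a stand-in; whether it is safe for callers to share the returned object is exactly what would need the thorough testing mentioned above):

```python
import functools

class SomeFKTableSpec:  # stand-in for whatever class exposes load_with_cuts
    @functools.lru_cache(maxsize=None)
    def load_with_cuts(self, cuts):
        # expensive load; now cached per (self, cuts) pair, so identical calls
        # return the very same object instead of a fresh copy
        ...
```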
# Conflicts: # n3fit/src/n3fit/model_gen.py # n3fit/src/n3fit/model_trainer.py # validphys2/src/validphys/n3fit_data.py # validphys2/src/validphys/n3fit_data_utils.py
Getting latest master changes
New validation and performance tests are underway.
A 100-replica fit comparison with the recent fits by @APJansen is available here: https://vp.nnpdf.science/Hz7Gwu95TzCUCH4oYoQOQA==
Intermediate merge of Aron's stuff
Can this be closed in favor of #1788?
Let me echo @RoyStegeman's question
Closed in favor of #1788.
Implementation of the training-validation split per parallel replica via a set of masking layers. This means there is only one (set of) observables, but still separate training and validation observable wrappers which mask out the correct data.
Currently, I still need to do: