Closure test L1 consistency in random noise generation #1695
Conversation
At some point I disconnected from the discussion last Wednesday, but could you explain why it has to be done this way? It would be useful if that could also be added as a comment in the code or to the docs, whichever is more appropriate.
@scarlehoff, @andreab1997, @giovannidecrescenzo In the code meeting of 15.03.2023 it was discussed that the needed modifications should be:
covmat = dataset_inputs_covmat_from_systematics(
commondata_wc,
level0_commondata_wc,
dataset_input_list,
use_weights_in_covmat=False,
norm_threshold=None,
_list_of_central_values=None,
_only_additive=sep_mult,
)
level1_data = make_replica(
    level0_commondata_wc, filterseed, covmat, sep_mult=sep_mult, genrep=True
)
and adding sep_mult as an arg of make_level1_data.
However, I do not think that this would be correct, since when _only_additive=True only the ADD types in the systype file are selected (see the cd.additive_errors method), and these are only a subset of the actual uncertainties.
It is in fact also clear, by running vp-setupfit, that the modifications with respect to master in the central L1 values would be huge.
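To make the objection concrete, here is a toy numpy sketch of what dropping the MULT systematics does to the covmat (all numbers and the fully correlated treatment are invented for illustration; this is not the actual dataset_inputs_covmat_from_systematics logic):

import numpy as np

# Toy illustration: two datapoints, one additive (ADD) and one
# multiplicative (MULT) systematic, both fully correlated across points.
add_err = np.array([0.011, 0.035])   # absolute ADD uncertainties
mult_err = np.array([0.020, 0.012])  # absolute MULT uncertainties

full_covmat = np.outer(add_err, add_err) + np.outer(mult_err, mult_err)
add_only_covmat = np.outer(add_err, add_err)

# A covmat built from the ADD column alone is missing the whole
# multiplicative contribution, so the sampled L1 noise changes a lot.
print(full_covmat - add_only_covmat)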
That's surprising. The only difference should be whether the multiplicative error is added with the covmat or after, and we said in the past that it didn't make a (big) difference. @andreab1997 maybe we got the combination of true/false wrong? AFAIR in the fit we have the true-true option for the sampling.
I think the point is that the experimental central values are not known within a closure test.
That I find unconvincing without further explanation. In a closure test we aim to reproduce a known underlying law, thus as L0 data we use predictions from an input PDF. Then we do the fits based on some loss function and see if we can recover the input PDF. Fine so far. Then we have a choice regarding the loss function: either we keep it as in the real case (thus multiplicative exp uncertainties determined using the exp central values), or we determine the uncertainties going into the loss function using the input PDF. I would like to keep the closure test as close to the real scenario as possible, so in principle I prefer to use the same covmat as in the real scenario. If there is some reason why using the experimental covmat is wrong/inconsistent, we'll have to accept that we need to deviate from the real covmat and use the input PDF as you propose. My point is that right now I don't see where we go wrong if we use the usual exp covmat.
I see your point; however, I think that the real (experimental) covariance matrix is never really used when fitting in a closure test.
@RoyStegeman this is only for the creation of the data and should not affect the fit. This change should make the level 1 data closer to what is done in the fit when we generate a replica.
Good point, somehow I forgot about that for a sec. Nevertheless, I'm not so interested in revisiting this discussion, since you've talked about it for a long time and seem to have come to some sort of agreement, which is why I asked for the reasoning behind the choice to be documented (which might be useful, since you apparently already remember something different from Mark after two days).
I believe you two were just talking about different things (or were thinking about different things). In any case, the difference between the two methods should be negligible. Having the separate multiplicative flag in the runcard will help check whether the difference is truly negligible.
@comane: re the real covmat I agree with you, but, of course, the additive part of it is exactly that.
@RoyStegeman about the concept itself I'd agree with @comane: the exp central values should never enter a closure test, not even in the covmat.
But the usual fit prescription is:
1. take the experimental central values;
2. build the covariance matrix from the experimental uncertainties (with the multiplicative ones rescaled by those central values);
3. sample the pseudodata fluctuations from that covmat.
But in a closure test step 1 is replaced by the fake central data, and the experimental input should consistently be limited to everything but the central values. Otherwise, we would be leaking information about the exp central values, to which the closure test should be completely blind (even if just through the covmat). However, I believe that this is the only argument in favor of not using the actual exp covmat, but it really depends on how you decide to factorize the experimental input, i.e. what you consider to be independent (in a statistical sense).
I don't agree that the closure test should be completely blind to the exp central values purely for the sake of it. I do agree that the point of the closure test is to have a perfectly controlled/self-consistent scenario. If the closure test needs to be completely blind to the exp central values to achieve this, then so be it, but this needs to be explained (which so far has not been done to my satisfaction, but if that's just me then please write down the argument in the docs). If you say we should use the L1 data for the mult uncertainties, e.g. because that is consistent with the L1 central data we want to fit while using the exp central values for the mult uncertainties is not, or because we need the same covmat for sampling and fitting, that would be a perfectly sensible reason to me. Though even those reasons would require explaining why they are more important than sampling the fluctuations from the same covmat as in a real fit.
It's not for the sake of it; it is because the exp central values are a specific instance of possible central values, and if you keep correlating to them in all your tests you can't prove that your methodology is good enough independently, because it might still be good only for some feature of that specific instance. So, let's put it like this: if we had the choice of picking a PDF that reproduces exactly the known central values at L0, should we use that one? Or should we pick a random one anyhow?
I don't see exactly how to spell out the consistency. It is a rather vague requirement.
That's the point, and it is the same one I mentioned above: I believe you want a test that is as independent as possible from the actual fit, to prove your methodology to be good enough in a large enough space.
@RoyStegeman, @scarlehoff, @alecandido, a consistent closure test should be defined as:

L1 = L0 + eta,      eta ~ N(0, C)
L2_k = L1 + eps_k,  eps_k ~ N(0, C)

The above closure test is consistent since both eta and eps_k are sampled from N(0, C), i.e. using the same covariance matrix.

What is actually done in the code (master branch) right now:

eta ~ N(0, C_exp),  eps_k ~ N(0, C_L1)

where C_L1 is a covariance matrix whose 'additive part' is exactly the same as that of C_exp (and therefore consistent); its 'multiplicative part', however, has been generated from the L1 central values.

What is done in this PR:

eta ~ N(0, C_L0),  eps_k ~ N(0, C_L1)

where C_L0 is a covariance matrix whose 'additive part' is exactly the same as that of C_exp (and therefore consistent); its 'multiplicative part', however, has been generated from the L0 central values.
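A minimal numpy sketch of the sampling structure described above (the toy_covmat construction and all numbers are invented; only the N(0, C) sampling pattern reflects the discussion):

import numpy as np

rng = np.random.default_rng(1695)
L0 = np.array([1.0, 2.0, 3.0])  # stand-in theory predictions from the input PDF

def toy_covmat(central):
    # Additive part is fixed; the multiplicative part scales with the
    # central values it is built from (the crux of the discussion above).
    additive = 0.05 * np.eye(len(central))
    multiplicative = np.outer(0.02 * central, 0.02 * central)
    return additive + multiplicative

C_L0 = toy_covmat(L0)  # covmat built from L0 centrals (this PR's choice)

# L1 data: one fixed draw of noise on top of the L0 predictions.
eta = rng.multivariate_normal(np.zeros(len(L0)), C_L0)
L1 = L0 + eta

# Replica (L2) noise is drawn around L1; its covmat uses the L1 centrals,
# so the additive parts of C_L0 and C_L1 coincide by construction.
C_L1 = toy_covmat(L1)
eps_k = rng.multivariate_normal(np.zeros(len(L0)), C_L1)
L2_k = L1 + eps_k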
Hi @RoyStegeman, thank you for your comments.
In my first comment I said "whichever is more appropriate", though since the argument contains some subtleties that cannot be summarized in one or two sentences, I would suggest adding it to the actual documentation: https://docs.nnpdf.science/. These docs are built from files in this repository. It may still be useful to add "what" the function does as a small comment in the code, but to me the docs seem like a more appropriate place to answer "why" it's done the way it is. There are no strict rules though, so just do whatever you think is best; people will have to review this before it gets merged anyway, so it could even be changed later.
Just to mention it: in principle even the docstring of |
Report for Rbv computed on a multiclosure test: https://vp.nnpdf.science/CCq46px1QAygO0h3HlPULQ==/#example-report Note: Rbv is not evaluated on test data but on the same data used in training.
@RoyStegeman, @Zaharid, @scarlehoff: https://vp.nnpdf.science/7sXVqI6BRgyFFRFzDRazuA==/ The training and test datasets are the same as the ones used in the NNPDF4.0 CT, see Table 6.2 of the NNPDF4.0 paper.
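For readers less familiar with the estimator: schematically, Rbv compares the bias of the central prediction to the replica spread. A simplified single-fit, single-dataset version is sketched below (the actual validphys implementation averages over many fits and datasets; the function name and signature here are invented):

import numpy as np

def bias_variance_ratio(replicas, law, covmat):
    """Schematic Rbv for one fit and one dataset: bias of the central
    prediction vs. spread of the replicas, both in the covmat norm.

    replicas: (nrep, ndata) array of predictions
    law: (ndata,) true observable from the input PDF
    """
    inv = np.linalg.inv(covmat)
    central = replicas.mean(axis=0)
    diff = central - law
    bias = diff @ inv @ diff
    deltas = replicas - central
    variance = np.mean([d @ inv @ d for d in deltas])
    return np.sqrt(bias / variance)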
Am I reading this correctly? If this is consistent with the closure tests of MW, that would mean we've had bugs in either the closure tests or the replica generation for the last two years… (well, we know this is the case because we have solved many; what I mean is that the difference with the MW tests was these bugs)
Thanks for these results. What changed between these last two reports? Only the in-sample and out-of-sample datasets, or also changes in the code?
Hi @RoyStegeman, the only thing that changed is the in-sample and out-of-sample datasets. Apparently Rbv strongly depends on the test (out-of-sample) datasets; the ones I used for these fits are the same that were used in the NNPDF4.0 paper.
Thanks. Having a quick look at the input PDFs, am I correct that the settings resulting in Rbv=0.8 did not have any out-of-sample datasets? Instead, it looks like it's a fit to the full NNPDF4.0 dataset and Rbv is calculated on this full dataset. Do you know if this is what Samuele has been doing? Just to make sure: for the report with Rbv=1.0, did you use the exact same settings as MW?
Yes, you are correct.
No, I think that is not what Samuele was doing. Samuele had some out-of-sample datasets; however (@giovannidecrescenzo correct me if I am wrong), these were not the same as the ones used by MW.
Same fakepdf and same datasets (which are the NNPDF3.1 datasets); some of the other parameters, like the smallx and largex initialisation, are slightly different. This is the MW fit I am referring to: 210326-mw-001
This I can confirm; Stefano told us last week that Samuele had to pick a different choice of in/out-of-sample datasets, because the main project was to introduce the inconsistencies, and Michael's split was not perfectly suitable for that purpose. However, we mentioned the option of using the folds of K-folding (and possibly drawing folds at random), in order to split more evenly and to exploit future improvements in fold creation.
Hi @scarlehoff, thank you for the review!
Yes, indeed. Let's hope the data rebuilt on macOS is bit-by-bit the same as on Linux, otherwise the test will fail just the same.
Hi @scarlehoff, not sure why this test is not passing. It looks like an assertion error on erf_tr in n3fit/tests/test_fit.py.
@giovannidecrescenzo @andreab1997 just a heads up that I'm merging this, so if you were using the latest master for the closure tests then there will be modifications to the numerical values you obtain. |
The idea of this PR is that currently (in master) the MULT part of the covmat used to compute the L1 noise in a closure test is computed from the experimental central values (which do not exist within a CT).
In this branch the MULT uncertainties are computed from the L0 central values, i.e. the theory predictions.
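Schematically, the change is just which central values the relative MULT uncertainties are rescaled by when assembling the L1-noise covmat (a toy sketch with invented numbers, not the actual implementation):

import numpy as np

mult_percent = np.array([2.0, 1.5])    # relative MULT uncertainties (%)
exp_central = np.array([1.10, 2.05])   # experimental central values
L0_central = np.array([1.00, 2.00])    # theory predictions from the input PDF

# master: rescale by experimental centrals (unknown inside a closure test)
mult_abs_master = mult_percent / 100 * exp_central

# this branch: rescale by L0 centrals, which are available by construction
mult_abs_pr = mult_percent / 100 * L0_central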