Closure test L1 consistency in random noise generation #1695
Conversation
At some point I disconnected from the discussion last Wednesday, but could you explain why it has to be done this way? It would be useful if that could also be added as a comment in the code or to the docs, whichever is more appropriate.
@scarlehoff, @andreab1997, @giovannidecrescenzo In the code meeting of 15.03.2023 it was discussed that the needed modifications should be:
covmat = dataset_inputs_covmat_from_systematics(
commondata_wc,
level0_commondata_wc,
dataset_input_list,
use_weights_in_covmat=False,
norm_threshold=None,
_list_of_central_values=None,
_only_additive=sep_mult,
)
level1_data = make_replica(
    level0_commondata_wc, filterseed, covmat, sep_mult=sep_mult, genrep=True
)
and adding sep_mult as an arg of make_level1_data.
However, I do not think that this would be correct, since when _only_additive=True only the ADD types in the systype file are selected (see the cd.additive_errors method), and these are only a subset of the actual uncertainties.
It is in fact also clear, by running vp-setupfit, that the modifications with respect to master in the central L1 values would be huge.
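To make the objection concrete, here is a toy numpy sketch of what dropping the MULT systematics does to the covmat (all numbers and the fully correlated treatment are invented for illustration; this is not the actual dataset_inputs_covmat_from_systematics logic):

import numpy as np

# Toy illustration: two datapoints, one additive (ADD) and one
# multiplicative (MULT) systematic, both fully correlated across points.
add_err = np.array([0.011, 0.035])   # absolute ADD uncertainties
mult_err = np.array([0.020, 0.012])  # absolute MULT uncertainties

full_covmat = np.outer(add_err, add_err) + np.outer(mult_err, mult_err)
add_only_covmat = np.outer(add_err, add_err)

# A covmat built from the ADD column alone is missing the whole
# multiplicative contribution, so the sampled L1 noise changes a lot.
print(full_covmat - add_only_covmat)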
That's surprising. The only difference should be whether the multiplicative error is added with the covmat or after, and we said in the past that it didn't make a (big) difference. @andreab1997 maybe we got the combination of true/false wrong? AFAIR in the fit we have the true-true option for the sampling.
I think the point is that the experimental central values are not known within a closure test.
That I find unconvincing without further explanation. In a closure test we aim to reproduce a known underlying law, thus as L0 data we use predictions from an input PDF. Then we do the fits based on some loss function and see if we can recover the input PDF. Fine so far. Then we have a choice regarding the loss function: either we keep it as in the real case (thus multiplicative exp uncertainties determined using the exp central values), or we determine the uncertainties going into the loss function using the input PDF. I would like to keep the closure test as close to the real scenario as possible, so in principle I prefer to use the same covmat as in the real scenario. If there is some reason why using the experimental covmat is wrong/inconsistent, we'll have to accept that we need to deviate from the real covmat and use the input PDF as you propose. My point is that right now I don't see where we go wrong if we use the usual exp covmat.
I see your point; however, I think that the real (experimental) covariance matrix is never really used when fitting in a closure test.
@RoyStegeman this is only for the creation of the data and should not affect the fit. This change should make the level 1 data closer to what is done in the fit when we generate a replica.
Good point, somehow I forgot about that for a sec. Nevertheless, I'm not so interested in revisiting this discussion, since you've talked about it for a long time and seem to have come to some sort of agreement, which is why I asked for the reasoning behind the choice to be documented (which might be useful, since you apparently already remember something different from Mark after two days).
I believe you two were just talking about different things (or were thinking about different things). In any case, the difference between the two methods should be negligible. Having the separate multiplicative flag in the runcard will help check whether the difference is truly negligible.
@comane: re the real covmat I agree with you, but, of course, the additive part of it is exactly that.
@RoyStegeman about the concept itself I'd agree with @comane: the exp central values should never enter a closure test, not even in the covmat.
But the usual fit prescription is:
1. take the experimental central values;
2. build the covariance matrix from the experimental uncertainties (with the multiplicative ones rescaled by those central values);
3. sample the pseudodata fluctuations from that covmat.
But in a closure test step 1 is replaced by the fake central data, and the experimental input should consistently be limited to everything but the central values. Otherwise, we would be leaking information about the exp central values, to which the closure test should be completely blind (even if just through the covmat). However, I believe that this is the only argument in favor of not using the actual exp covmat, but it really depends on how you decide to factorize the experimental input, i.e. what you consider to be independent (in a statistical sense).
I don't agree that the closure test should be completely blind to the exp central values purely for the sake of it. I do agree that the point of the closure test is to have a perfectly controlled/self-consistent scenario. If the closure test needs to be completely blind to the exp central values to achieve this, then so be it, but this needs to be explained (which so far has not been done to my satisfaction, but if that's just me then please write down the argument in the docs). If you say we should use the L1 data for the mult uncertainties, e.g. because that is consistent with the L1 central data we want to fit while using the exp central values for the mult uncertainties is not, or because we need the same covmat for sampling and fitting, that would be a perfectly sensible reason to me. Though even those reasons would require explaining why they are more important than sampling the fluctuations from the same covmat as in a real fit.
It's not for the sake of it; it is because the exp central values are a specific instance of possible central values, and if you keep correlating to them in all your tests you can't prove that your methodology is good enough independently, because it might still be good only for some feature of that specific instance. So, let's put it like this: if we had the choice of picking a PDF that reproduces exactly the known central values at L0, should we use that one? Or should we pick a random one anyhow?
I don't see exactly how to spell out the consistency. It is a rather vague requirement.
That's the point, and it is the same one I mentioned above: I believe you want a test that is as independent as possible from the actual fit, to prove your methodology to be good enough in a large enough space.
@RoyStegeman, @scarlehoff, @alecandido, a consistent closure test should be defined as:

L1 = L0 + eta,      eta ~ N(0, C)
L2_k = L1 + eps_k,  eps_k ~ N(0, C)

The above closure test is consistent since both eta and eps_k are sampled from N(0, C), i.e. using the same covariance matrix.

What is actually done in the code (master branch) right now:

eta ~ N(0, C_exp),  eps_k ~ N(0, C_L1)

where C_L1 is a covariance matrix whose 'additive part' is exactly the same as that of C_exp (and therefore consistent); its 'multiplicative part', however, has been generated from the L1 central values.

What is done in this PR:

eta ~ N(0, C_L0),  eps_k ~ N(0, C_L1)

where C_L0 is a covariance matrix whose 'additive part' is exactly the same as that of C_exp (and therefore consistent); its 'multiplicative part', however, has been generated from the L0 central values.
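A minimal numpy sketch of the sampling structure described above (the toy_covmat construction and all numbers are invented; only the N(0, C) sampling pattern reflects the discussion):

import numpy as np

rng = np.random.default_rng(1695)
L0 = np.array([1.0, 2.0, 3.0])  # stand-in theory predictions from the input PDF

def toy_covmat(central):
    # Additive part is fixed; the multiplicative part scales with the
    # central values it is built from (the crux of the discussion above).
    additive = 0.05 * np.eye(len(central))
    multiplicative = np.outer(0.02 * central, 0.02 * central)
    return additive + multiplicative

C_L0 = toy_covmat(L0)  # covmat built from L0 centrals (this PR's choice)

# L1 data: one fixed draw of noise on top of the L0 predictions.
eta = rng.multivariate_normal(np.zeros(len(L0)), C_L0)
L1 = L0 + eta

# Replica (L2) noise is drawn around L1; its covmat uses the L1 centrals,
# so the additive parts of C_L0 and C_L1 coincide by construction.
C_L1 = toy_covmat(L1)
eps_k = rng.multivariate_normal(np.zeros(len(L0)), C_L1)
L2_k = L1 + eps_k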
Hi @RoyStegeman, thank you for your comments.
In my first comment I said "whichever is more appropriate", though since the argument contains some subtleties that cannot be summarized in one or two sentences, I would suggest adding it to the actual documentation: https://docs.nnpdf.science/. These docs are built from files in this repository. It may still be useful to add "what" the function does as a small comment in the code, but to me the docs seem like a more appropriate place to answer "why" it's done the way it is. There are no strict rules though, so just do whatever you think is best; people will have to review this before it gets merged anyway, so it could even be changed later.
Just to mention it: in principle even the docstring of |
Report for Rbv computed on a multiclosure test: https://vp.nnpdf.science/CCq46px1QAygO0h3HlPULQ==/#example-report Note: Rbv is not evaluated on test data but on the same data used in training.
@RoyStegeman, @Zaharid, @scarlehoff: https://vp.nnpdf.science/7sXVqI6BRgyFFRFzDRazuA==/ The training and test datasets are the same as the ones used in the NNPDF4.0 CT, see Table 6.2 of the NNPDF4.0 paper.
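For readers less familiar with the estimator: schematically, Rbv compares the bias of the central prediction to the replica spread. A simplified single-fit, single-dataset version is sketched below (the actual validphys implementation averages over many fits and datasets; the function name and signature here are invented):

import numpy as np

def bias_variance_ratio(replicas, law, covmat):
    """Schematic Rbv for one fit and one dataset: bias of the central
    prediction vs. spread of the replicas, both in the covmat norm.

    replicas: (nrep, ndata) array of predictions
    law: (ndata,) true observable from the input PDF
    """
    inv = np.linalg.inv(covmat)
    central = replicas.mean(axis=0)
    diff = central - law
    bias = diff @ inv @ diff
    deltas = replicas - central
    variance = np.mean([d @ inv @ d for d in deltas])
    return np.sqrt(bias / variance)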
Am I reading this correctly? If this is consistent with the closure tests of MW, that would mean we've had bugs in either the closure tests or the replica generation for the last two years… (well, we know this is the case because we have solved many; what I mean is that the difference with the MW tests was these bugs)
Thanks for these results. What changed between these last two reports? Only the in-sample and out-of-sample datasets, or also changes in the code?
Hi @RoyStegeman, the only thing that changed is the in-sample and out-of-sample datasets. Apparently Rbv strongly depends on the test (out-of-sample) datasets; the ones I used for these fits are the same that were used in the NNPDF4.0 paper.
Thanks. Having a quick look at the input PDFs, am I correct that the settings resulting in Rbv=0.8 did not have any out-of-sample datasets? Instead, it looks like it's a fit to the full NNPDF4.0 dataset and Rbv is calculated on this full dataset. Do you know if this is what Samuele has been doing? Just to make sure: for the report with Rbv=1.0, did you use the exact same settings as MW?
Yes, you are correct.
No, I think that is not what Samuele was doing. Samuele had some out-of-sample datasets; however (@giovannidecrescenzo correct me if I am wrong), these were not the same as the ones used by MW.
Same fakepdf and same datasets (which are the NNPDF3.1 datasets); some of the other parameters, like the smallx and largex initialisation, are slightly different. This is the MW fit I am referring to: 210326-mw-001
This I can confirm; Stefano told us last week that Samuele had to pick a different choice of in/out-of-sample datasets, because the main project was to introduce the inconsistencies, and Michael's split was not perfectly suitable for that purpose. However, we mentioned the option of using the folds of K-folding (and possibly drawing folds at random), in order to split more evenly and to exploit future improvements in fold creation.
Hi @scarlehoff, thank you for the review!
Yes, indeed. Let's hope the data rebuilt on macOS is bit-by-bit the same as on Linux, otherwise the test will fail just the same.
Hi @scarlehoff, not sure why this test is not passing. It looks like an assertion error on erf_tr in n3fit/tests/test_fit.py.
@giovannidecrescenzo @andreab1997 just a heads up that I'm merging this, so if you were using the latest master for the closure tests then there will be modifications to the numerical values you obtain. |
The idea of this PR is that currently (in master) the MULT part of the covmat used to compute the L1 noise in a closure test is computed from the experimental central values (which do not exist within a CT).
In this branch the MULT uncertainties are computed from the L0 central values, i.e. the theory predictions.
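Schematically, the change is just which central values the relative MULT uncertainties are rescaled by when assembling the L1-noise covmat (a toy sketch with invented numbers, not the actual implementation):

import numpy as np

mult_percent = np.array([2.0, 1.5])    # relative MULT uncertainties (%)
exp_central = np.array([1.10, 2.05])   # experimental central values
L0_central = np.array([1.00, 2.00])    # theory predictions from the input PDF

# master: rescale by experimental centrals (unknown inside a closure test)
mult_abs_master = mult_percent / 100 * exp_central

# this branch: rescale by L0 centrals, which are available by construction
mult_abs_pr = mult_percent / 100 * L0_central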