
Multi Replica PDF #1782

Closed · wants to merge 1 commit
Conversation

@APJansen (Collaborator) commented Jul 24, 2023

Question

This will be some work, so before continuing past this I'd like to confirm that you agree that once finished this will be a beneficial change.

Idea

The idea of this PR is to refactor the TensorFlow model from taking a list of single-replica PDFs into taking a single multi-replica PDF, i.e. a single PDF whose output has an extra axis representing the replica. This is much faster on the GPU; see the tests below.

The main ingredient that makes this possible is a MultiDense layer (see here), which is essentially just a dense layer whose weights have one extra dimension, with size equal to the number of replicas. For the first layer, which takes the x's as input, this is exactly it. For deeper layers, the input already has a replica axis, so each replica's slice of the input has to be multiplied by the corresponding slice of the weights.
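To illustrate the idea, here is a minimal sketch of such a layer. This is my own illustrative code, not the actual implementation in the PR; the class and argument names are made up:

```python
import tensorflow as tf


class MultiDenseSketch(tf.keras.layers.Layer):
    """Dense layer whose weights carry an extra replica axis (illustrative only)."""

    def __init__(self, units, replicas, is_first_layer=False, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.replicas = replicas
        self.is_first_layer = is_first_layer

    def build(self, input_shape):
        # One weight matrix per replica, stacked along a leading replica axis
        self.kernel = self.add_weight(
            name="kernel", shape=(self.replicas, input_shape[-1], self.units)
        )

    def call(self, inputs):
        if self.is_first_layer:
            # Input has no replica axis yet: (batch, x, features).
            # The same input is broadcast to every replica's weights.
            return tf.einsum("bxf,rfu->brxu", inputs, self.kernel)
        # Deeper layers: input already has a replica axis, (batch, replica, x, features),
        # and each replica's slice is contracted only with its own weights.
        return tf.einsum("brxf,rfu->brxu", inputs, self.kernel)
```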

Development Strategy

To integrate this into the code, many small changes are necessary.
To make it as simple as possible to review and test, I aim to make small, independent changes that ideally are beneficial, or at least not detrimental, on their own. Wherever it's sensible, I'll first create a unit test that covers the changes I want to make and check that it still passes afterwards, and wherever possible I'll try to keep the outputs identical up to numerical errors. I'll put each of these changes on its own branch with its own PR (maybe I should create a special label for those PRs?).

Once those small changes are merged, the actual implementation should be manageable to review.

This PR itself is for now a placeholder; I just added the commit so that I could create a draft PR and so that you can check out the MultiDense layer.

I expect that as a final result you'll still want single-replica PDFs. I will add code that, once all computations are done, splits the multi-replica PDF into single ones, so that saving and any interaction with validphys will remain unchanged.
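As a rough sketch of what I mean (a hypothetical helper, not code from this PR), one could slice the replica axis of the joint model's output and wrap each slice in its own Keras model:

```python
import tensorflow as tf


def split_into_single_replicas(multi_pdf_model, num_replicas):
    """Wrap each slice along the replica axis into a standalone single-replica model.

    Assumes the replica axis is axis 1 of the output; purely illustrative.
    """
    single_models = []
    for r in range(num_replicas):
        single_output = multi_pdf_model.output[:, r]  # drop to a single replica
        single_models.append(
            tf.keras.Model(inputs=multi_pdf_model.inputs, outputs=single_output)
        )
    return single_models
```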

Performance

Timing

These are the timing tests I did on a 1/4 node on Snellius, with one GPU. I'm reporting the average seconds per epoch as printed in debug mode.

| runcard | replicas | multi_replica_pdf_test | trvl-mask-layers | master |
| --- | --- | --- | --- | --- |
| Basic | 200 | 0.12 | 1.2 | 2.3 |
| NNPDF40_nnlo_as_01180_1000 | 200 | out of memory | out of memory | - |
| NNPDF40_nnlo_as_01180_1000 | 100 | 0.76 | 1.12 | - |

Memory

Memory usage also appears to be significantly reduced.
I checked the peak CPU memory usage using libmemprofile on the basic runcard with 200 replicas, and found 3.5 GB versus 16.5 GB for the trvl-mask-layers branch.

Status

I have a test branch where this is working up to the end of the model training, which is what I used to obtain the timings above.

| branch | finished | tested | merged | comments |
| --- | --- | --- | --- | --- |
| refactor_xintegrator | X | unit | X | |
| refactor_msr | X | unit | X | |
| refactor_preprocessing | X | unit | X | |
| refactor_rotations | X | unit | X | |
| refactor_stopping | X | unit | X | |
| multi-dense-logistics | | | | currently working on this |
| multi_replica_pdf-test | | | | my test branch, which has the 4 above and trvl-mask-layers merged into it, and has the code that will eventually go into this PR |

@APJansen added the Refactoring and n3fit labels on Jul 24, 2023
@scarlehoff (Member) commented Jul 24, 2023

I'm guessing you already talked with @goord and are aware of some of the issues he found in #1661 when you merged it with your PRs, most notably the ambiguity in how to treat the training/validation split for datasets with a single point.

I would also ask that you finish the tests with the small PRs you've been doing, to ensure that the changes are incrementally merged (and so that they are not broken by other changes that might be made to the code in parallel). (I see this PR is a draft, so maybe this was already your plan.)

Now, the answer to your questions:

> I'd like to confirm that you agree that once finished this will be a beneficial change.

> I expect that as a final result you'll still want single replica pdf. I will add code that, once all computations are done, just splits the multi replica pdf into single ones, so the saving and any interaction with validphys will remain unchanged.

Yes to both. But note that this is not due to interactions with validphys (which we could modify at will) but rather because each replica is independent of all the others. I.e., training (+data, trvlsplit, stopping, lagrange multipliers, etc) should be independent.

This is in practice the main point, as long as every replica at the end is independent of all the others I'd say there is freedom on how to get there.

Edit: in other words, if the little interaction with vp at the end of the fit (to compute the arclength, and little more) is an issue we can easily fix that as long as the interpolation grids at the end are correct.

Regarding the photon or the hyperopt penalties (again, I guess this was your plan already, but writing it here to make sure we are all on the same page), I'd suggest leaving that for after the standard multireplica fit in GPU is well tested and merged. The photon might not even be suitable for GPU parallelization, since a non-negligible amount of time is spent calculating the photon with fiatlux, so the best thing would be to make sure that the QED fit is not broken when running it in the "normal 1-replica way".

@APJansen (Collaborator, Author)

> Yes to both. But note that this is not due to interactions with validphys (which we could modify at will) but rather because each replica is independent of all the others. I.e., training (+data, trvlsplit, stopping, lagrange multipliers, etc) should be independent.
There's also a practical consideration: doing everything per replica is very baked in (different folders to save to, etc.), so it would require a lot of changes.

And I imagine that there are users who want to be able to evaluate a single PDF without having to evaluate all replicas. I actually know nothing about what happens with these PDFs once they are trained; can you say something about that or link to something?

> Regarding the photon or the hyperopt penalties (again, I guess this was your plan already, but writing it here to make sure we are all on the same page), I'd suggest leaving that for after the standard multireplica fit in GPU is well tested and merged. The photon might not even be suitable for GPU parallelization since a non-negligible amount of time is spent calculating the photon with fiatlux so the best thing would be to make sure that the QED fit is not broken when running it in the "normal 1-replica way".

Yes, this whole branch only makes sense to merge after the trvl-mask-layers branch is merged. The hyperopt penalties can trivially be parallelized across replicas; for the photon I'm not sure, but if not, there just needs to be an interface extracting single replicas from the joint model.

@RoyStegeman (Member) commented Jul 27, 2023

> I actually know nothing about what happens with these PDFs once trained, can you say something about that or link to something?

In particle physics we collide protons, i.e. bound states of quarks and gluons, with another proton or with a lepton. However, in perturbative QCD we can only calculate Feynman diagrams with individual incoming quarks/gluons, not with incoming hadrons. To connect the pQCD calculation to what can be measured in experiments, each Feynman diagram essentially needs to be weighted by the probability of finding the corresponding incoming states inside the proton, which links the proton-lepton collision to the quark/gluon-electron Feynman diagram. These weights are what the PDFs provide, if you will. There are different "factorization" arguments for different processes that provide the theoretical underpinning of this factorizing of the quark from the proton (though there are no formal proofs for all processes).

For a general introduction to QCD/collider physics any set of lecture notes on the topic will do. For a more specific discussion of what NNPDF does you could have a look here: https://arxiv.org/pdf/2008.12305.pdf (see equation 1 in these notes for the factorization equation I explained above). Perhaps you don't want to read the entire thing, but up to section 2.2 might be useful.
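For reference, a schematic version of that factorization for a proton-lepton (DIS-like) process, in illustrative notation rather than the exact form of equation 1 of those notes:

$$
\sigma(x, Q^2) \;\simeq\; \sum_{a} \int_x^1 \frac{dz}{z}\; f_a\!\left(\frac{x}{z}, \mu^2\right)\, \hat{\sigma}_a\!\left(z, Q^2, \mu^2\right),
$$

where the $f_a$ are the PDFs and the $\hat{\sigma}_a$ are the partonic (Feynman-diagram level) cross sections.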

@niclaurenti (Contributor)

> Regarding the photon or the hyperopt penalties (again, I guess this was your plan already, but writing it here to make sure we are all on the same page), I'd suggest leaving that for after the standard multireplica fit in GPU is well tested and merged. The photon might not even be suitable for GPU parallelization since a non-negligible amount of time is spent calculating the photon with fiatlux so the best thing would be to make sure that the QED fit is not broken when running it in the "normal 1-replica way".

Dear @APJansen and @scarlehoff ,
just to let you know that at the moment the QED fit can handle the multireplica fits, with the limitation that all the photons will be computed sequentially. It means that for a 100-replica fit, computing all the photons will take roughly 30 min × 100 = forever.
This is because when I implemented it we were not using the parallel fits, so I didn't bother to parallelize the photon computation.
Obviously, if it is needed I'm happy to help speed up that part of the code.

@APJansen (Collaborator, Author)

@RoyStegeman Thanks, but this I knew; sorry for not being clear. (My background is in theoretical physics as well, though mostly in black hole physics, but I did take master's courses on QFT and particle physics, so I know the basics.)
I meant it on a practical level: does this code have users outside of the collaboration, or do people only use the model outputs that you provide, for example?

@RoyStegeman (Member) commented Jul 28, 2023

Ah I see. I am aware of your background (without some basic knowledge of QFT/SM I don't think my explanation would be very helpful anyway), but I indeed understood you to be asking for some collider physics notes, my bad!

The code is public but not really used outside our collaboration. Some parts of the code that produce the FKtables (these live in different repositories) are being used by others, and we hope to convince more people to use our codes, but producing theory predictions serves a more general purpose than a PDF fit using the NNPDF methodology as implemented in this repo.

Besides allowing people to check/reproduce our work by making it open source, some of the tools in validphys serve a more general purpose in the analysis of results compared to the n3fit fitting code and have been used by others as well, though to be honest that's the only example I can think of. There was some interest from the CMS collaboration, so a few months ago we did a workshop for them in which we explained how to install and run the code, but I haven't heard anything about that since.

The bottom line is thus: it's open source because we invite people to check our work and we hope it can be of use to some others as well, though in practice it will of course mainly be us who use the code and others just use the PDF grids we produce with it.

@APJansen (Collaborator, Author)

@scarlehoff Can you comment on this?

Looking at this again, I'm trying to rewrite everything in terms of pdf_model, which is a single model consisting of a stack of pdf_models, before actually merging the PDFs inside. I'm wondering what to do here.

Also, later in the same class, replicas are set to non-trainable, but this only takes effect when the model is recompiled, which as far as I can see is not happening here. (And this won't be possible any longer, I think, once all replicas are a single model.)
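For context, this is the standard Keras caveat I'm referring to; a tiny illustrative example, not code from this repository:

```python
import tensorflow as tf

# Tiny illustrative model: a "replica" sub-layer next to another dense layer.
inp = tf.keras.Input(shape=(3,))
replica_layer = tf.keras.layers.Dense(2, name="replica_0")
out = tf.keras.layers.Concatenate()([replica_layer(inp), tf.keras.layers.Dense(2)(inp)])
model = tf.keras.Model(inp, out)
model.compile(optimizer="adam", loss="mse")

# Keras caveat: toggling `trainable` only takes effect at the next compile().
replica_layer.trainable = False
model.compile(optimizer="adam", loss="mse")  # without recompiling, the layer keeps training
```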

@scarlehoff (Member)

Hi @APJansen, feel free to modify the strategy as you wish.
The important part is that the replicas are 100% independent from each other.
Back in the day I achieved this with trainable=False, and maybe that either triggered a recompilation or it was recompiled manually. Or maybe I was just storing the epoch number and rewinding back to the right epoch at the end. I don't remember (I stopped the multi-GPU development once JR got the grant for doing it in Amsterdam, so I cannot even promise I tested it...).
In any case, this was easy in principle, since the model was a concatenation of models, so you could treat them independently; the only unsolved thing at the time was the trvl split.

In any case, for your situation one possible strategy might be this (talking without actually having put my hands in the code to test what the problems/issues might be):

  1. Ensure there's no crosstalking between the nodes corresponding to different replicas
  2. Record the state of the part of the network that should've stopped by itself (say, weights 10 to 20)
  3. Set the chi2 of that part of the network to a constant 0 (so that it doesn't affect the loss landscape anymore)

Then continue training until all replicas have triggered the stopping condition. At that point you re-add all the recorded weights, since you already have each replica at its best, and you end up with a network in which each replica has been trained independently.
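A minimal sketch of how steps 2 and 3 could look in practice, with illustrative names and assuming a per-replica loss vector is available (not actual code from any branch):

```python
import tensorflow as tf

num_replicas = 5
# 1.0 while a replica is still training, 0.0 once its stopping condition has fired
active_mask = tf.Variable(tf.ones(num_replicas), trainable=False)
best_weights = [None] * num_replicas  # per-replica snapshot at its best epoch


def masked_total_loss(per_replica_loss):
    """Total loss as a sum over replicas, with stopped replicas contributing zero."""
    return tf.reduce_sum(per_replica_loss * active_mask)


def on_replica_stopped(replica, model):
    """Step 2: record the replica's best state. Step 3: freeze its loss contribution."""
    best_weights[replica] = [w.numpy() for w in model.trainable_weights]
    active_mask[replica].assign(0.0)
```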

@APJansen (Collaborator, Author)

Yes that was my plan, except that I hadn't thought of step 3, thanks!
I'll look into it!

@APJansen (Collaborator, Author)

So I've thought about it, and actually it shouldn't matter for the coupling between the replicas whether individual replicas are set to trainable=False or not, nor whether they contribute to the total loss or not. The total loss is just a sum over individual replica losses, which is linear. The weights of replica i are only affected by the gradient of that total loss w.r.t. those weights, and that gradient only receives contributions from replica i's own component.
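Schematically, with per-replica losses $L_j$ that depend only on their own weights $\theta_j$ (my notation):

$$
\frac{\partial}{\partial \theta_i} \sum_{j} L_j(\theta_j) \;=\; \frac{\partial L_i}{\partial \theta_i},
$$

so zeroing or keeping another replica's loss term cannot change the gradients applied to replica $i$.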

I've also verified that commenting out this line doesn't change anything. To test it, I added a log message when the function is called, to make sure that with the runcard I'm using the stopping conditions are being met. I tested with 5 replicas, and the results are identical.
Given the argument above, that is expected whether the line does anything or not, but I'm 99% sure that it doesn't do anything. I don't see any difference in timing, but more importantly I don't see the model being recompiled anywhere in the code, which is required for this to take effect.

So to conclude: step 3 is not an issue, and setting an individual replica to non-trainable won't be possible after this refactor, but it wasn't effectively being done in the first place, and the speedup from the refactor should outweigh that of a proper implementation of setting individual replicas to non-trainable.

@APJansen (Collaborator, Author) commented Mar 4, 2024

Closing this, all of this has been done.

@APJansen closed this on Mar 4, 2024