
Multi Replica PDF #1782

Closed · wants to merge 1 commit
Conversation

@APJansen (Collaborator) commented Jul 24, 2023

Question

This will be some work, so before continuing past this I'd like to confirm that you agree that once finished this will be a beneficial change.

Idea

The idea of this PR is to refactor the TensorFlow model from taking a list of single-replica PDFs into taking a single multi-replica PDF, i.e. a single PDF whose output has an extra axis representing the replica. This is much faster on the GPU; see the tests below.

The main ingredient that makes this possible is a MultiDense layer (see here), which is essentially just a dense layer whose weights have one extra dimension, with size equal to the number of replicas. For the first layer, which takes the x's as input, this is exactly it. For deeper layers, the input already has a replica axis, so each replica's slice of the input has to be multiplied by the corresponding slice of the weights.
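To illustrate the idea, here is a minimal sketch of such a layer. This is my own illustrative code, not the actual implementation in the PR; the class and argument names are made up:

```python
import tensorflow as tf


class MultiDenseSketch(tf.keras.layers.Layer):
    """Dense layer whose weights carry an extra replica axis (illustrative only)."""

    def __init__(self, units, replicas, is_first_layer=False, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.replicas = replicas
        self.is_first_layer = is_first_layer

    def build(self, input_shape):
        # One weight matrix per replica, stacked along a leading replica axis
        self.kernel = self.add_weight(
            name="kernel", shape=(self.replicas, input_shape[-1], self.units)
        )

    def call(self, inputs):
        if self.is_first_layer:
            # Input has no replica axis yet: (batch, x, features).
            # The same input is broadcast to every replica's weights.
            return tf.einsum("bxf,rfu->brxu", inputs, self.kernel)
        # Deeper layers: input already has a replica axis, (batch, replica, x, features),
        # and each replica's slice is contracted only with its own weights.
        return tf.einsum("brxf,rfu->brxu", inputs, self.kernel)
```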

Development Strategy

To integrate this into the code, many small changes are necessary.
To make it as simple as possible to review and test, I aim to make small, independent changes that ideally are beneficial, or at least not detrimental, on their own. Wherever it's sensible, I'll first create a unit test that covers the changes I want to make and check that it still passes afterwards, and wherever possible I'll try to keep the outputs identical up to numerical errors. I'll put each of these changes on its own branch with its own PR (maybe I should create a special label for those PRs?).

Once those small changes are merged, the actual implementation should be manageable to review.

This PR itself is for now a placeholder; I just added the commit so that I could create a draft PR and so that you can check out the MultiDense layer.

I expect that as a final result you'll still want single-replica PDFs. I will add code that, once all computations are done, splits the multi-replica PDF into single ones, so that saving and any interaction with validphys will remain unchanged.
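As a rough sketch of what I mean (a hypothetical helper, not code from this PR), one could slice the replica axis of the joint model's output and wrap each slice in its own Keras model:

```python
import tensorflow as tf


def split_into_single_replicas(multi_pdf_model, num_replicas):
    """Wrap each slice along the replica axis into a standalone single-replica model.

    Assumes the replica axis is axis 1 of the output; purely illustrative.
    """
    single_models = []
    for r in range(num_replicas):
        single_output = multi_pdf_model.output[:, r]  # drop to a single replica
        single_models.append(
            tf.keras.Model(inputs=multi_pdf_model.inputs, outputs=single_output)
        )
    return single_models
```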

Performance

Timing

These are the timing tests I did on a 1/4 node on Snellius, with one GPU. I'm reporting the average seconds per epoch as printed in debug mode.

| runcard | replicas | multi_replica_pdf_test | trvl-mask-layers | master |
| --- | --- | --- | --- | --- |
| Basic | 200 | 0.12 | 1.2 | 2.3 |
| NNPDF40_nnlo_as_01180_1000 | 200 | out of memory | out of memory | - |
| NNPDF40_nnlo_as_01180_1000 | 100 | 0.76 | 1.12 | - |

Memory

Memory usage also appears to be significantly reduced.
I checked the peak CPU memory usage using libmemprofile on the basic runcard with 200 replicas, and found 3.5 GB versus 16.5 GB for the trvl-mask-layers branch.

Status

I have a test branch where this is working up to the end of the model training, which is what I used to obtain the timings above.

| branch | finished | tested | merged | comments |
| --- | --- | --- | --- | --- |
| refactor_xintegrator | X | unit | X | |
| refactor_msr | X | unit | X | |
| refactor_preprocessing | X | unit | X | |
| refactor_rotations | X | unit | X | |
| refactor_stopping | X | unit | X | |
| multi-dense-logistics | | | | currently working on this |
| multi_replica_pdf-test | | | | my test branch, which has the 4 above and trvl-mask-layers merged into it, and has the code that will eventually go into this PR |

@APJansen added the Refactoring and n3fit labels on Jul 24, 2023
@scarlehoff (Member) commented Jul 24, 2023

I'm guessing you already talked with @goord and are aware of some of the issues he found in #1661 when you merged it with your PRs, most notably the ambiguity in how to treat the training/validation split for datasets with a single point.

I would also ask that you finish the tests with the small PRs you've been doing, to ensure that the changes are incrementally merged (and so that they are not broken by other changes that might be made to the code in parallel). (I see this PR is a draft, so maybe this was already your plan.)

Now, the answer to your questions:

> I'd like to confirm that you agree that once finished this will be a beneficial change.

> I expect that as a final result you'll still want single replica pdf. I will add code that, once all computations are done, just splits the multi replica pdf into single ones, so the saving and any interaction with validphys will remain unchanged.

Yes to both. But note that this is not due to interactions with validphys (which we could modify at will) but rather because each replica is independent of all the others. I.e., training (+data, trvlsplit, stopping, lagrange multipliers, etc) should be independent.

This is in practice the main point, as long as every replica at the end is independent of all the others I'd say there is freedom on how to get there.

Edit: in other words, if the little interaction with vp at the end of the fit (to compute the arclength, and little more) is an issue we can easily fix that as long as the interpolation grids at the end are correct.

Regarding the photon or the hyperopt penalties (again, I guess this was your plan already, but writing it here to make sure we are all on the same page), I'd suggest leaving that for after the standard multireplica fit in GPU is well tested and merged. The photon might not even be suitable for GPU parallelization, since a non-negligible amount of time is spent calculating the photon with fiatlux, so the best thing would be to make sure that the QED fit is not broken when running it in the "normal 1-replica way".

@APJansen (Collaborator, Author)

> Yes to both. But note that this is not due to interactions with validphys (which we could modify at will) but rather because each replica is independent of all the others. I.e., training (+data, trvlsplit, stopping, lagrange multipliers, etc) should be independent.
There's also a practical consideration: doing everything per replica is very baked in (different folders to save to, etc.), so it would require a lot of changes.

And I imagine that there are users who want to be able to evaluate a single PDF without having to evaluate all replicas. I actually know nothing about what happens with these PDFs once they are trained; can you say something about that or link to something?

> Regarding the photon or the hyperopt penalties (again, I guess this was your plan already, but writing it here to make sure we are all on the same page), I'd suggest leaving that for after the standard multireplica fit in GPU is well tested and merged. The photon might not even be suitable for GPU parallelization since a non-negligible amount of time is spent calculating the photon with fiatlux so the best thing would be to make sure that the QED fit is not broken when running it in the "normal 1-replica way".

Yes, this whole branch only makes sense to merge after the trvl-mask-layers branch is merged. The hyperopt penalties can trivially be parallelized across replicas; for the photon I'm not sure, but if not, there just needs to be an interface extracting single replicas from the joint model.

@RoyStegeman (Member) commented Jul 27, 2023

> I actually know nothing about what happens with these PDFs once trained, can you say something about that or link to something?

In particle physics we collide protons, i.e. bound states of quarks and gluons, with another proton or with a lepton. However, in perturbative QCD we can only calculate Feynman diagrams with individual incoming quarks/gluons, not with incoming hadrons. To connect the pQCD calculation to what can be measured in experiments, each Feynman diagram essentially needs to be weighted by the probability of finding the corresponding incoming states inside the proton, which links the proton-lepton collision to the quark/gluon-electron Feynman diagram. These weights are what the PDFs provide, if you will. There are different "factorization" arguments for different processes that provide the theoretical underpinning of this factorizing of the quark from the proton (though there are no formal proofs for all processes).

For a general introduction to QCD/collider physics any set of lecture notes on the topic will do. For a more specific discussion of what NNPDF does you could have a look here: https://arxiv.org/pdf/2008.12305.pdf (see equation 1 in these notes for the factorization equation I explained above). Perhaps you don't want to read the entire thing, but up to section 2.2 might be useful.
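For reference, a schematic version of that factorization for a proton-lepton (DIS-like) process, in illustrative notation rather than the exact form of equation 1 of those notes:

$$
\sigma(x, Q^2) \;\simeq\; \sum_{a} \int_x^1 \frac{dz}{z}\; f_a\!\left(\frac{x}{z}, \mu^2\right)\, \hat{\sigma}_a\!\left(z, Q^2, \mu^2\right),
$$

where the $f_a$ are the PDFs and the $\hat{\sigma}_a$ are the partonic (Feynman-diagram level) cross sections.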

@niclaurenti (Contributor)

> Regarding the photon or the hyperopt penalties (again, I guess this was your plan already, but writing it here to make sure we are all on the same page), I'd suggest leaving that for after the standard multireplica fit in GPU is well tested and merged. The photon might not even be suitable for GPU parallelization since a non-negligible amount of time is spent calculating the photon with fiatlux so the best thing would be to make sure that the QED fit is not broken when running it in the "normal 1-replica way".

Dear @APJansen and @scarlehoff ,
just to let you know that at the moment the QED fit can handle the multireplica fits, with the limitation that all the photons will be computed sequentially. It means that for a 100-replica fit, computing all the photons will take roughly 30 min × 100 = forever.
This is because when I implemented it we were not using the parallel fits, so I didn't bother to parallelize the photon computation.
Obviously, if it is needed I'm happy to help speed up that part of the code.

@APJansen (Collaborator, Author)

@RoyStegeman Thanks, but this I knew; sorry for not being clear. (My background is in theoretical physics as well, though mostly in black hole physics, but I did take master's courses on QFT and particle physics, so I know the basics.)
I meant it on a practical level: does this code have users outside of the collaboration, or do people only use the model outputs that you provide, for example?

@RoyStegeman (Member) commented Jul 28, 2023

Ah I see. I am aware of your background (without some basic knowledge of QFT/SM I don't think my explanation would be very helpful anyway), but I indeed understood you to be asking for some collider physics notes, my bad!

The code is public but not really used outside our collaboration. Some parts of the code that produce the FKtables (these live in different repositories) are being used by others, and we hope to convince more people to use our codes, but producing theory predictions serves a more general purpose than a PDF fit using the NNPDF methodology as implemented in this repo.

Besides allowing people to check/reproduce our work by making it open source, some of the tools in validphys serve a more general purpose in the analysis of results compared to the n3fit fitting code and have been used by others as well, though to be honest that's the only example I can think of. There was some interest from the CMS collaboration, so a few months ago we did a workshop for them in which we explained how to install and run the code, but I haven't heard anything about that since.

The bottom line is thus: it's open source because we invite people to check our work and we hope it can be of use to some others as well, though in practice it will of course mainly be us who use the code and others just use the PDF grids we produce with it.

@APJansen (Collaborator, Author)

@scarlehoff Can you comment on this?

Looking at this again, I'm trying to rewrite everything in terms of pdf_model, which is a single model consisting of a stack of pdf_models, before actually merging the PDFs inside. I'm wondering what to do here.

Also, later in the same class, replicas are set to non-trainable, but this only takes effect when the model is recompiled, which as far as I can see is not happening here. (And this won't be possible any longer, I think, once all replicas are a single model.)
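For context, this is the standard Keras caveat I'm referring to; a tiny illustrative example, not code from this repository:

```python
import tensorflow as tf

# Tiny illustrative model: a "replica" sub-layer next to another dense layer.
inp = tf.keras.Input(shape=(3,))
replica_layer = tf.keras.layers.Dense(2, name="replica_0")
out = tf.keras.layers.Concatenate()([replica_layer(inp), tf.keras.layers.Dense(2)(inp)])
model = tf.keras.Model(inp, out)
model.compile(optimizer="adam", loss="mse")

# Keras caveat: toggling `trainable` only takes effect at the next compile().
replica_layer.trainable = False
model.compile(optimizer="adam", loss="mse")  # without recompiling, the layer keeps training
```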

@scarlehoff (Member)

Hi @APJansen, feel free to modify the strategy as you wish.
The important part is that the replicas are 100% independent from each other.
Back in the day I achieved this with trainable=False, and maybe that either triggered a recompilation or it was recompiled manually. Or maybe I was just storing the epoch number and rewinding back to the right epoch at the end. I don't remember (I stopped the multi-GPU development once JR got the grant for doing it in Amsterdam, so I cannot even promise I tested it...).
In any case, this was easy in principle, since the model was a concatenation of models, so you could treat them independently; the only unsolved thing at the time was the trvl split.

In any case, for your situation one possible strategy might be this (talking without actually having put my hands in the code to test what the problems/issues might be):

  1. Ensure there's no crosstalking between the nodes corresponding to different replicas
  2. Record the state of the part of the network that should've stopped by itself (say, weights 10 to 20)
  3. Set the chi2 of that part of the network to a constant 0 (so that it doesn't affect the loss landscape anymore)

Then continue training until all replicas have triggered the stopping condition. At that point you re-add all the recorded weights, since you already have each replica at its best, and you end up with a network in which each replica has been trained independently.
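A minimal sketch of how steps 2 and 3 could look in practice, with illustrative names and assuming a per-replica loss vector is available (not actual code from any branch):

```python
import tensorflow as tf

num_replicas = 5
# 1.0 while a replica is still training, 0.0 once its stopping condition has fired
active_mask = tf.Variable(tf.ones(num_replicas), trainable=False)
best_weights = [None] * num_replicas  # per-replica snapshot at its best epoch


def masked_total_loss(per_replica_loss):
    """Total loss as a sum over replicas, with stopped replicas contributing zero."""
    return tf.reduce_sum(per_replica_loss * active_mask)


def on_replica_stopped(replica, model):
    """Step 2: record the replica's best state. Step 3: freeze its loss contribution."""
    best_weights[replica] = [w.numpy() for w in model.trainable_weights]
    active_mask[replica].assign(0.0)
```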

@APJansen (Collaborator, Author)

Yes that was my plan, except that I hadn't thought of step 3, thanks!
I'll look into it!

@APJansen (Collaborator, Author)

So I've thought about it, and actually it shouldn't matter for the coupling between the replicas whether individual replicas are set to trainable=False or not, nor whether they contribute to the total loss or not. The total loss is just a sum over individual replica losses, which is linear. The weights of replica i are only affected by the gradient of that total loss w.r.t. those weights, and that gradient only receives contributions from replica i's own component.
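Schematically, with per-replica losses $L_j$ that depend only on their own weights $\theta_j$ (my notation):

$$
\frac{\partial}{\partial \theta_i} \sum_{j} L_j(\theta_j) \;=\; \frac{\partial L_i}{\partial \theta_i},
$$

so zeroing or keeping another replica's loss term cannot change the gradients applied to replica $i$.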

I've also verified that commenting out this line doesn't change anything. To test it, I added a log message when the function is called, to make sure that with the runcard I'm using the stopping conditions are being met. I tested with 5 replicas, and the results are identical.
Given the argument above, that is expected whether the line does anything or not, but I'm 99% sure that it doesn't do anything. I don't see any difference in timing, but more importantly I don't see the model being recompiled anywhere in the code, which is required for this to take effect.

So to conclude: step 3 is not an issue, and setting an individual replica to non-trainable won't be possible after this refactor, but it wasn't effectively being done in the first place, and the speedup from the refactor should outweigh that of a proper implementation of setting individual replicas to non-trainable.

@APJansen (Collaborator, Author) commented Mar 4, 2024

Closing this, all of this has been done.

@APJansen closed this on Mar 4, 2024