Trvl mask layers #1661

Closed · wants to merge 29 commits into from

Conversation

goord (Collaborator) commented Jan 13, 2023

Implementation of the training/validation split per parallel replica via a set of masking layers. This means there is only one set of observables, but separate training and validation observable wrappers that mask out the correct data.

Currently, I still need to do:

  • Run the realistic nnpdf4 use case
  • Compare results of a basic run
  • Compare results of a realistic run
  • Monitor memory use, investigate scaling with the number of parallel replicas
  • Test on GPUs

goord (Collaborator Author) commented Jan 13, 2023

I am facing problems with datasets with a single datapoint: some replicas will mask these out, others won't. That is a poor fit for the design I followed, which assumes the training/validation masks can be represented as block-wise boolean arrays with the same number of 'True' values per row, such that the output tensor is not strided.
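
For illustration, a toy numpy sketch of that assumption (not code from this PR; shapes and values are made up): one boolean row per replica, each row keeping the same number of points, so applying the masks yields a dense rather than ragged training tensor.

import numpy as np

ndata = 6        # points in one dataset (toy value)
n_replicas = 3
n_tr = 4         # training points kept by every replica
rng = np.random.default_rng(0)

masks = np.zeros((n_replicas, ndata), dtype=bool)
for row in masks:
    row[rng.choice(ndata, size=n_tr, replace=False)] = True

data = np.arange(ndata, dtype=float)
# Every row keeps exactly n_tr points, so stacking gives a dense (3, 4) block.
training = np.stack([data[m] for m in masks])
print(training.shape)  # (3, 4)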

RoyStegeman (Member):

This was recently changed in #1636.

In principle I would be happy with a GPU implementation that treats datasets with a single point in the old way (i.e. including them in the training set). The difference in the result is negligible, so we could still use the GPU for most purposes; only for a very final fit before a release might it be better to have the new treatment of single-point datasets. @scarlehoff has many GPUs available, so perhaps he has an opinion on this.

scarlehoff (Member):

I'd be happy with the solution of just ignoring those datasets for the time being.

RoyStegeman (Member):

Also fine for me. The point is (to address @goord's concerns) that this is not a showstopper for the parallel fits.

goord (Collaborator Author) commented Jan 13, 2023

Yes, that would be the best solution, as things become very complicated if we can't assume an equal number of masked data points across replicas. I will try to include (or exclude?) all single-point datasets in the fit if parallel replicas is set and same_trvl_split is unset...
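
For illustration, a possible sketch of that fallback (the function name and logic are assumptions made for this comment thread, not the PR's actual code): when replicas get different splits, a single-point dataset is always assigned to training, so every replica keeps the same number of training points per dataset.

import numpy as np

def make_replica_mask(ndata, frac, rng):
    # Boolean training mask for one replica of one dataset (illustrative only).
    if ndata == 1:
        # old behaviour: a single-point dataset goes entirely into training
        return np.ones(1, dtype=bool)
    n_tr = int(frac * ndata)
    mask = np.zeros(ndata, dtype=bool)
    mask[rng.choice(ndata, size=n_tr, replace=False)] = True
    return mask

rng = np.random.default_rng(42)
print(make_replica_mask(1, 0.75, rng))  # [ True]
print(make_replica_mask(8, 0.75, rng))  # 6 True, 2 False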

scarlehoff (Member) left a review:

I think I like this solution; it is more or less what I had in mind.

Re the one-point datasets: I think we can eventually find a way around them, but let's ignore them for the moment.

Did you run any parallel fits with this code? Did they work? Is the memory footprint greatly impacted?

(Resolved review comments on n3fit/src/n3fit/layers/mask.py and n3fit/src/n3fit/model_gen.py)

masked_output_layers.append(mask_layer(output_layer))

# Finally concatenate all observables (so that experiments are one single entity)
ret = op.concatenate(masked_output_layers)
Member:

Why is the axis removed? (I guess the default is exactly the right axis, but I'd like to have it explicit; it makes debugging easier.)

Collaborator Author:

I thought it was a bit overly explicit, but since all tensor shapes have to be as explicit as possible for TensorFlow to do the correct thing, I will re-insert it.

Member:

It is more for the person reading the code than for TensorFlow in this case, since it is hard to keep track of which axis is what :P
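
For illustration, a self-contained TensorFlow snippet of the point being discussed (plain tf.concat rather than the n3fit op wrapper; the axis value and shapes are assumptions):

import tensorflow as tf

obs_a = tf.ones((1, 1, 3))  # (batch, replicas, ndata of dataset A), toy shapes
obs_b = tf.ones((1, 1, 5))  # (batch, replicas, ndata of dataset B)
# Explicit axis: concatenate the per-dataset blocks along the data axis.
ret = tf.concat([obs_a, obs_b], axis=-1)
print(ret.shape)  # (1, 1, 8)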

tr_mask_layers = []
vl_mask_layers = []
offset = 0
apply_masks = spec_dict.get("data_transformation_tr") is None and mask_array is not None
Member:

I'm a bit worried. If mask_array is not None but there is a data_transformation_tr, then the masks will not be applied. If this combination is not allowed then it should fail at the beginning.

We usually do that by adding a check before the fit starts. In this case it should check whether the run options include both a parallel fit and a data_transformation, and if so validphys will raise an exception telling the user which options are inconsistent.

(for the time being you can just put a raise Exception here to stop it, and create the proper check at the end)
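
For illustration, a minimal sketch of what such a pre-fit check could look like; the function name, argument names and the checked conditions are assumptions, not the actual validphys implementation.

from reportengine.checks import CheckError, make_argcheck

@make_argcheck
def check_consistent_parallel(parameters, parallel_models, same_trvl_per_replica):
    # Fail before the fit starts if a parallel multi-replica fit is requested
    # together with a training data transformation, which the per-replica
    # mask layers cannot handle (illustrative condition).
    if not parallel_models:
        return
    if parameters.get("diagonal_basis") and not same_trvl_per_replica:
        raise CheckError(
            "Parallel fits with different tr/vl splits per replica are not "
            "compatible with a data transformation; set same_trvl_per_replica "
            "to true or disable the transformation."
        )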

Comment on lines 173 to 175
trmask = mask_array[:, offset:offset + dataset.ndata] if apply_masks else None
tr_mask_layers.append(Mask(trmask, axis=1, c=1) if apply_masks else None)
vl_mask_layers.append(Mask(~trmask, axis=1, c=1) if apply_masks else None)
Member:

Suggested change (replacing the three lines above):

if apply_masks:
    trmask = mask_array[:, offset:offset + dataset.ndata]
    tr_mask_layers.append(Mask(bool_mask=trmask, axis=1))
    vl_mask_layers.append(Mask(bool_mask=~trmask, axis=1))
else:
    trmask = None
    tr_mask_layers.append(None)
    vl_mask_layers.append(None)

I don't like the idea of having a list of None. I think with the check above you will always be in a consistent state, and you might be able to check whether to apply the mask somewhere else (so you don't need the None).

Collaborator Author:

Fixed in rev. bc56876

model_observables.append(obs_layer)

# shift offset for new mask array
offset = offset + dataset.ndata
Member:

Would there be a way to have a list of arrays from the start (instead of an offset that we move), such that each dataset in the list corresponds to one array?
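
For illustration, a toy numpy sketch of that alternative (names and shapes are assumptions, not the PR's code): split the full per-replica mask into one array per dataset up front, so the loop can zip over (dataset, mask) pairs instead of carrying a running offset.

import numpy as np

n_replicas = 2
ndatas = [3, 1, 4]  # points per dataset (toy values)
mask_array = np.ones((n_replicas, sum(ndatas)), dtype=bool)

split_points = np.cumsum(ndatas)[:-1]              # [3, 4]
masks_per_dataset = np.split(mask_array, split_points, axis=1)

for ndata, trmask in zip(ndatas, masks_per_dataset):
    assert trmask.shape == (n_replicas, ndata)     # one mask block per dataset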

@@ -37,6 +40,7 @@ class ObservableWrapper:

name: str
observables: list
trvl_mask_layers: list
Member:

Maybe we can have this be optional with a default of None, and if it is None then it doesn't get applied.

Because (I think, maybe I'm wrong!) we should never be in a situation in which some of the masks exist and some are None; that way you can avoid the list of None.
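
A minimal sketch of that suggestion (only the fields visible in the diff above are kept; the apply helper and its name are illustrative, not the real ObservableWrapper):

from dataclasses import dataclass
from typing import Optional

@dataclass
class ObservableWrapper:
    name: str
    observables: list
    trvl_mask_layers: Optional[list] = None  # masks only applied when not None

    def apply_masks(self, outputs):
        # With no masks provided, pass the observables through unchanged.
        if self.trvl_mask_layers is None:
            return outputs
        return [mask(out) for mask, out in zip(self.trvl_mask_layers, outputs)]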

Collaborator Author:

Fixed in rev. bc56876. I did not insert the default value yet, though.

Comment on lines 14 to 15

import numpy
Member:

Suggested change: remove the import numpy line.

scarlehoff mentioned this pull request Jan 20, 2023

goord (Collaborator Author) commented Feb 6, 2023

Hi @scarlehoff, I haven't replied to your suggestions because I propose to first get the validation in order. After testing with the basic_parallel test case with only a single dataset, I believe I still have a bug somewhere in the implementation, because my chi2 values differ significantly from the master sequential (or parallel) runs.

variable | master-sequential | master-parallel | trvl-mask-layers parallel
-------- | ----------------- | --------------- | -------------------------
chi2     | 1.02              | 1.01            | 20.0
erf_tr   | 1.61              | 2.0             | 3.77
erf_vl   | 1.81              | 1.46            | 55.4

These are means over 500 replicas.

goord (Collaborator Author) commented Feb 15, 2023

I have solved a few problems with the implementation: the first (obvious) one was that the masked truth values weren't correctly propagating to the loss function. Another issue was that a replica-specific inverse covariance was not taken into account.

The latest commits result in a basic runcard fit that is bitwise identical to the sequential run for a small number of epochs. More validation results will follow.

goord (Collaborator Author) commented Mar 1, 2023

Memory use is now OK, slightly lower than on the master branch:

[memory-use plot]

goord (Collaborator Author) commented May 10, 2023

Regarding memory usage: due to the different masks per replica, the lru_cache for fittable_datasets_masked in n3fit_data.py is no longer triggered when same_trvl_per_replica is set to false. This causes the following loop in n3fit_data_utils.py to be executed for each replica:

for dspec, mask in zip_longest(datasets, tr_masks):
    # Load all fktables with the appropriate cuts
    fktables = [fk.load_with_cuts(dspec.cuts) for fk in dspec.fkspecs]
    # And now put them in a FittableDataSet object
    loaded_obs.append(FittableDataSet(dspec.name, fktables, dspec.op, dspec.frac, mask))

Although the FK table loading itself sits behind its own lru_cache and is not re-executed, the with_cuts call is not cached and triggers a copy of the FK tables for each replica, even though the cuts are identical and independent of the mask; this inflates the memory footprint.

The easiest fix is to wrap the fk.load_with_cuts(dspec.cuts) call in a dedicated function decorated with lru_cache. I wouldn't call this an elegant solution though; any thoughts, @scarlehoff?

scarlehoff (Member):

Yes, I think that should work, since all replicas are using the same fktables now (i.e., it should work if the datasets are the same for all replicas and the only thing that changes is tr_masks, but I think this is indeed the case).


@functools.lru_cache
def load_cached_fk_tables(fk, cuts):
    return fk.load_with_cuts(cuts)
scarlehoff (Member) commented May 10, 2023:

Wouldn't it be better to wrap fk.load_with_cuts directly?

https://github.com/NNPDF/nnpdf/blob/f02b49a8e6eb785af6a56dc3195133abff571046/validphys2/src/validphys/core.py#LL442C4-L442C4

(in practical terms it should be the same)

Given an fktable and a set of cuts, I see no reason why we would want two different objects, so that is probably a positive change in other parts of the code as well*

*hopefully

(of course, for the purposes of testing and benchmarking maybe it is better to start here)
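
For reference, a hedged sketch of what decorating the method itself might look like (the class and method names follow the link above; whether the arguments are suitable lru_cache keys would need checking, and this is not the actual validphys code):

import functools

class FKTableSpec:
    # ... rest of the class as in validphys.core ...

    @functools.lru_cache
    def load_with_cuts(self, cuts):
        # the original loading logic would go here; the cache makes repeated
        # calls with the same cuts return the shared object instead of a copy
        ...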

Collaborator Author:

Yes, that would be the more elegant solution, but it needs thorough testing because it impacts the code in more places: there will be shared objects where there were copies before. I can have a look.

goord marked this pull request as ready for review June 19, 2023 12:37

goord (Collaborator Author) commented Jun 19, 2023

New validation and performance tests are underway.

goord (Collaborator Author) commented Jul 13, 2023

A 100-replica fit comparison with the recent fits by @APJansen is available here: https://vp.nnpdf.science/Hz7Gwu95TzCUCH4oYoQOQA==

Commit: Intermediate merge of Aron's stuff
RoyStegeman (Member):

Can this be closed in favor of #1788?

RoyStegeman mentioned this pull request Aug 30, 2023
RoyStegeman mentioned this pull request Nov 1, 2023
scarlehoff (Member):

Let me echo @RoyStegeman's question:

Can this be closed in favor of #1788?

goord closed this Nov 13, 2023

goord (Collaborator Author) commented Nov 13, 2023

Closed in favor of #1788.
