Large refactoring of analysis logic #9

eguiraud · 2023-06-17T01:44:54Z

What and why

The goal of the refactoring was to make the actual analysis logic stand out more, making the code easier to follow. Please let me know if you agree this new version is more readable :)

The main changes are:

class TtbarAnalysis has been removed and logic has been split between a few free functions
a few helper functions and types have been added to utils.py, so that analysis.py contains almost only the actual analysis logic
plotting logic has been moved to plotting.py
helper.cpp has been renamed to helpers.cpp

As analysis.py has been completely rewritten this PR is probably better reviewed by checking out the branch and looking at the code directly.

Possible further improvements

ROOT.RDF.RunGraphs should accept RResultMaps just like it accepts RResultPtrs, so we don't need to filter RResultPtrs out (which is annoying to do in a way that is uniform for RDF and distRDF). Proposed fix at [DF] Make it possible to convert RResultMap to RResultHandle root#13193 .
ROOT.RDF.[RunGraphs,VariationsFor] (and ideally also SaveGraph) should just work:tm: when the DistRDF equivalents of the RDF types are passed to them

Quick preview

The main function now looks like this:

def main() -> None:
    program_start = time()
    args = parse_args()

    # Do not add histograms to TDirectories automatically: we'll do it ourselves as needed.
    ROOT.TH1.AddDirectory(False)

    if args.verbose:
        # Set higher RDF verbosity for the rest of the program.
        # To only change the verbosity in a given scope, use ROOT.Experimental.RLogScopedVerbosity.
        ROOT.Detail.RDF.RDFLogChannel.SetVerbosity(ROOT.Experimental.ELogLevel.kInfo)

    if args.scheduler == "mt":
        # Setup for local, multi-thread RDataFrame
        ROOT.EnableImplicitMT(args.ncores)
        print(f"Number of threads: {ROOT.GetThreadPoolSize()}")
        client = None
        load_cpp()
        run_graphs = ROOT.RDF.RunGraphs
    else:
        # Setup for distributed RDataFrame
        client = create_dask_client(args.scheduler, args.ncores, args.hosts)
        ROOT.RDF.Experimental.Distributed.initialize(load_cpp)
        run_graphs = ROOT.RDF.Experimental.Distributed.RunGraphs

    # Book RDataFrame results
    inputs: list[AGCInput] = retrieve_inputs(
        args.n_max_files_per_sample, args.remote_data_prefix, args.data_cache
    )
    results: list[AGCResult] = []
    for input in inputs:
        df = make_rdf(input.paths, client, args.npartitions)
        results += book_histos(df, input.process, input.variation, input.nevents)
    print(f"Building the computation graphs took {time() - program_start:.2f} seconds")

    # Run the event loops for all processes and variations here
    run_graphs_start = time()
    handles = [r.histo for r in results]
    n_computation_graphs_run = run_graphs(handles)
    assert n_computation_graphs_run == 9
    print(f"Executing the computation graphs took {time() - run_graphs_start:.2f} seconds")
    if client is not None:
        client.close()

    results = postprocess_results(results)
    save_plots(results)
    save_histos([r.histo for r in results], output_fname=args.output)
    print(f"Result histograms saved in file {args.output}")

eguiraud · 2023-06-28T20:08:33Z

I think I will convert those tasks listed in the PR description to separate issues and declare this PR ready to merge: the code as it is now should work for RDF also in distributed mode.

There is an issue with current master that @vepadulano is looking into but it seems to be a real distRDF regression rather than a problem with this implementation.

The goal of the refactoring was to make the actual analysis logic stand out more, making the code easier to follow. The main changes are: - class TtbarAnalysis has been removed: its logic has been split between a few free functions - a few helper functions and types have been added to utils.py, so that analysis.py contains almost only the actual analysis logic - plotting logic has been moved to plotting.py The new implementation is expected to produce exactly the same plots and output histograms as before the refactoring.

This removes the dependency of plotting.py on the jitted ROOT.Slice function. Slicing and then rebinning is not strictly equivalent to rebinning and then reducing the axis range so this change introduces minor changes in the aspect of the btag.png plot. However the reference implementation does not specify how the histograms should be plotted so I think it is ok: the contents of the histograms saved in output are not modified.

This post-processing step is only required for the fitting part of the analysis, which we do not implement yet. v1.1 of the Coffea reference implementation also saves the full histograms, differently from v1.0.

Avoids canvases flashing on screen before we save them to file.

vepadulano

I tried to run it on my laptop with 6.28/04, both MT and dask local work!

This is a great move, makes the whole workflow much clearer and easier to read, thanks a lot!

I have added a few small comments/questions

analyses/cms-open-data-ttbar/utils.py

vepadulano · 2023-07-07T09:29:53Z

analyses/cms-open-data-ttbar/utils.py

-            with tqdm(unit='B', unit_scale=True, unit_divisor=1024, miniters=1, desc=out_path.name) as t:
+            with tqdm(
+                unit="B", unit_scale=True, unit_divisor=1024, miniters=1, desc=out_path.name
+            ) as t:
                urlretrieve(url, out_path.absolute(), reporthook=_tqdm_urlretrieve_hook(t))


This means we can't cache files when the input paths use an xrootd prefix. It's not for this PR, I'll open a github issue

analyses/cms-open-data-ttbar/analysis.py

vepadulano · 2023-07-07T09:39:46Z

analyses/cms-open-data-ttbar/analysis.py

+
+    # Run the event loops for all processes and variations here
+    run_graphs_start = time()
+    run_graphs([r.nominal_histo for r in results])


Very nice given the current limitations!

analyses/cms-open-data-ttbar/utils.py

It does not slice/rebin anymore.

eguiraud requested a review from vepadulano June 17, 2023 01:44

eguiraud self-assigned this Jun 17, 2023

eguiraud force-pushed the refactoring branch 5 times, most recently from eb2b625 to 23723ce Compare June 26, 2023 22:51

eguiraud force-pushed the refactoring branch 2 times, most recently from 21bbe4a to 0910abd Compare July 6, 2023 19:17

eguiraud added 4 commits July 6, 2023 14:15

Write out histograms without slicing and rebinning

ef7b7fc

This post-processing step is only required for the fitting part of the analysis, which we do not implement yet. v1.1 of the Coffea reference implementation also saves the full histograms, differently from v1.0.

Disable interactive graphics

ab0fe4a

Avoids canvases flashing on screen before we save them to file.

eguiraud force-pushed the refactoring branch from 0910abd to ab0fe4a Compare July 6, 2023 20:16

eguiraud mentioned this pull request Jul 6, 2023

Add more CI checks #12

Merged

vepadulano approved these changes Jul 7, 2023

View reviewed changes

eguiraud added 2 commits July 7, 2023 17:02

[NFC] Add comment about distribution of C++ files

4da8d88

[NFC] Update postprocess_results docs

2c512c6

It does not slice/rebin anymore.

eguiraud merged commit e4d021a into root-project:main Jul 11, 2023
1 check passed

eguiraud deleted the refactoring branch July 11, 2023 16:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Large refactoring of analysis logic #9

Large refactoring of analysis logic #9

eguiraud commented Jun 17, 2023 •

edited

Loading

eguiraud commented Jun 28, 2023

vepadulano left a comment

vepadulano Jul 7, 2023

vepadulano Jul 7, 2023

Large refactoring of analysis logic #9

Large refactoring of analysis logic #9

Conversation

eguiraud commented Jun 17, 2023 • edited Loading

What and why

Possible further improvements

Quick preview

eguiraud commented Jun 28, 2023

vepadulano left a comment

Choose a reason for hiding this comment

vepadulano Jul 7, 2023

Choose a reason for hiding this comment

vepadulano Jul 7, 2023

Choose a reason for hiding this comment

eguiraud commented Jun 17, 2023 •

edited

Loading