Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large refactoring of analysis logic #9

Merged
merged 6 commits into from
Jul 11, 2023

Conversation

eguiraud
Copy link
Member

@eguiraud eguiraud commented Jun 17, 2023

What and why

The goal of the refactoring was to make the actual analysis logic stand out more, making the code easier to follow. Please let me know if you agree this new version is more readable :)

The main changes are:

  • class TtbarAnalysis has been removed and logic has been split between a few free functions
  • a few helper functions and types have been added to utils.py, so that analysis.py contains almost only the actual analysis logic
  • plotting logic has been moved to plotting.py
  • helper.cpp has been renamed to helpers.cpp

As analysis.py has been completely rewritten this PR is probably better reviewed by checking out the branch and looking at the code directly.

Possible further improvements

  • ROOT.RDF.RunGraphs should accept RResultMaps just like it accepts RResultPtrs, so we don't need to filter RResultPtrs out (which is annoying to do in a way that is uniform for RDF and distRDF). Proposed fix at [DF] Make it possible to convert RResultMap to RResultHandle root#13193 .
  • ROOT.RDF.[RunGraphs,VariationsFor] (and ideally also SaveGraph) should just work:tm: when the DistRDF equivalents of the RDF types are passed to them

Quick preview

The main function now looks like this:

def main() -> None:
    program_start = time()
    args = parse_args()

    # Do not add histograms to TDirectories automatically: we'll do it ourselves as needed.
    ROOT.TH1.AddDirectory(False)

    if args.verbose:
        # Set higher RDF verbosity for the rest of the program.
        # To only change the verbosity in a given scope, use ROOT.Experimental.RLogScopedVerbosity.
        ROOT.Detail.RDF.RDFLogChannel.SetVerbosity(ROOT.Experimental.ELogLevel.kInfo)

    if args.scheduler == "mt":
        # Setup for local, multi-thread RDataFrame
        ROOT.EnableImplicitMT(args.ncores)
        print(f"Number of threads: {ROOT.GetThreadPoolSize()}")
        client = None
        load_cpp()
        run_graphs = ROOT.RDF.RunGraphs
    else:
        # Setup for distributed RDataFrame
        client = create_dask_client(args.scheduler, args.ncores, args.hosts)
        ROOT.RDF.Experimental.Distributed.initialize(load_cpp)
        run_graphs = ROOT.RDF.Experimental.Distributed.RunGraphs

    # Book RDataFrame results
    inputs: list[AGCInput] = retrieve_inputs(
        args.n_max_files_per_sample, args.remote_data_prefix, args.data_cache
    )
    results: list[AGCResult] = []
    for input in inputs:
        df = make_rdf(input.paths, client, args.npartitions)
        results += book_histos(df, input.process, input.variation, input.nevents)
    print(f"Building the computation graphs took {time() - program_start:.2f} seconds")

    # Run the event loops for all processes and variations here
    run_graphs_start = time()
    handles = [r.histo for r in results]
    n_computation_graphs_run = run_graphs(handles)
    assert n_computation_graphs_run == 9
    print(f"Executing the computation graphs took {time() - run_graphs_start:.2f} seconds")
    if client is not None:
        client.close()

    results = postprocess_results(results)
    save_plots(results)
    save_histos([r.histo for r in results], output_fname=args.output)
    print(f"Result histograms saved in file {args.output}")

@eguiraud eguiraud requested a review from vepadulano June 17, 2023 01:44
@eguiraud eguiraud self-assigned this Jun 17, 2023
@eguiraud eguiraud force-pushed the refactoring branch 5 times, most recently from eb2b625 to 23723ce Compare June 26, 2023 22:51
@eguiraud
Copy link
Member Author

I think I will convert those tasks listed in the PR description to separate issues and declare this PR ready to merge: the code as it is now should work for RDF also in distributed mode.

There is an issue with current master that @vepadulano is looking into but it seems to be a real distRDF regression rather than a problem with this implementation.

@eguiraud eguiraud force-pushed the refactoring branch 2 times, most recently from 21bbe4a to 0910abd Compare July 6, 2023 19:17
The goal of the refactoring was to make the actual analysis logic stand
out more, making the code easier to follow.

The main changes are:

- class TtbarAnalysis has been removed:
  its logic has been split between a few free functions
- a few helper functions and types have been added to utils.py, so that
  analysis.py contains almost only the actual analysis logic
- plotting logic has been moved to plotting.py

The new implementation is expected to produce exactly the same
plots and output histograms as before the refactoring.
This removes the dependency of plotting.py on the
jitted ROOT.Slice function.

Slicing and then rebinning is not strictly
equivalent to rebinning and then reducing the axis
range so this change introduces minor changes in
the aspect of the btag.png plot.
However the reference implementation does not
specify how the histograms should be plotted so I
think it is ok: the contents of the histograms
saved in output are not modified.
This post-processing step is only required for the
fitting part of the analysis, which we do not
implement yet.

v1.1 of the Coffea reference implementation also
saves the full histograms, differently from v1.0.
Avoids canvases flashing on screen before we save them to file.
Copy link
Member

@vepadulano vepadulano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to run it on my laptop with 6.28/04, both MT and dask local work!

This is a great move, makes the whole workflow much clearer and easier to read, thanks a lot!

I have added a few small comments/questions

analyses/cms-open-data-ttbar/utils.py Show resolved Hide resolved
with tqdm(unit='B', unit_scale=True, unit_divisor=1024, miniters=1, desc=out_path.name) as t:
with tqdm(
unit="B", unit_scale=True, unit_divisor=1024, miniters=1, desc=out_path.name
) as t:
urlretrieve(url, out_path.absolute(), reporthook=_tqdm_urlretrieve_hook(t))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means we can't cache files when the input paths use an xrootd prefix. It's not for this PR, I'll open a github issue

analyses/cms-open-data-ttbar/analysis.py Show resolved Hide resolved
analyses/cms-open-data-ttbar/analysis.py Show resolved Hide resolved
analyses/cms-open-data-ttbar/analysis.py Show resolved Hide resolved
analyses/cms-open-data-ttbar/analysis.py Show resolved Hide resolved

# Run the event loops for all processes and variations here
run_graphs_start = time()
run_graphs([r.nominal_histo for r in results])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice given the current limitations!

analyses/cms-open-data-ttbar/utils.py Outdated Show resolved Hide resolved
@eguiraud eguiraud merged commit e4d021a into root-project:main Jul 11, 2023
1 check passed
@eguiraud eguiraud deleted the refactoring branch July 11, 2023 16:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants