Perf improvement to subgraph selection #4155

Merged
merged 3 commits into from
Oct 29, 2021

Conversation

leahwicz
Contributor

@leahwicz leahwicz commented Oct 28, 2021

resolves #4135

Description

This performance improvement changes the way we select our graph subsets. The new algorithm replaces the use of a transitive closure (polynomial time, but expensive to compute in full on a large graph) with a combined transitive reduction + subtraction. I haven't done the math to determine the exact complexity, but it should also be polynomial, and in practice it does much less work.

In a practical setting, it reduced the run time for the performance project from ~14.77s to ~11.17s (~25% faster) on my MacBook. I believe the improvement should scale with the size of the project: the larger the graph, the larger the gain.
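
The idea described above can be illustrated without networkx. The following is a minimal sketch, not the code merged in this PR: it assumes the graph is a plain dict mapping each node to the set of its direct children, and the function name `remove_unselected` and the node names are hypothetical.

```python
from itertools import product


def remove_unselected(children, selected):
    """Return a copy of the DAG restricted to the `selected` nodes.

    `children` maps node -> set of direct children (a hypothetical layout;
    the real code operates on a networkx DiGraph). Each unselected node is
    removed and its parents are wired directly to its children, so the
    ordering between the remaining nodes is preserved.
    """
    # Work on copies so the caller's graph is left untouched.
    graph = {node: set(kids) for node, kids in children.items()}
    for node in list(graph):
        if node in selected:
            continue
        sources = [p for p, kids in graph.items() if node in kids]
        targets = list(graph[node])
        # Bridge every parent to every child, skipping self-loops.
        for source, target in product(sources, targets):
            if source != target:
                graph[source].add(target)
        # Drop the node and every edge that pointed at it.
        del graph[node]
        for kids in graph.values():
            kids.discard(node)
    return graph


dag = {"a": {"b"}, "b": {"c"}, "c": {"d"}, "d": set()}
subset = remove_unselected(dag, selected={"a", "d"})
```

In this chain, removing `b` and `c` leaves a single bridging edge from `a` to `d`, so `d` still runs after `a` even though the nodes between them are gone.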

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change

@cla-bot cla-bot bot added the cla:yes label Oct 28, 2021
@leahwicz leahwicz marked this pull request as draft October 28, 2021 14:54
Contributor

@jtcohen6 jtcohen6 left a comment

This is fantastic. On my machine, using the sample performance project, it reduces the duration of the get_subset_graph method from 5.5s to 0.5-1s. I'll take a 5-10x speedup any day of the week.

Because the work to add edges happens each time we remove an unselected node from the graph, this approach means that get_subset_graph is faster when selecting all nodes (~0.5s locally) than when selecting only one node (~1s) in a large project. That makes sense. I'm only bringing it up because we should, when necessary, optimize for speedups when few nodes are selected (development) rather than many (production). In this case, we get to have it faster all across the board :)

I haven't managed to break this. All the tests seem to be passing, and it seems to be doing the right thing.

@iknox-fa iknox-fa changed the title Perf improvement to graph processing Perf improvement to subgraph selection Oct 28, 2021
Contributor

@kwigley kwigley left a comment

Looks great to me

Comment on lines +92 to +93
source_nodes = [x for x, _ in new_graph.in_edges(node)]
target_nodes = [x for _, x in new_graph.out_edges(node)]

Would it make sense for these to be sets?

Suggested change
- source_nodes = [x for x, _ in new_graph.in_edges(node)]
- target_nodes = [x for _, x in new_graph.out_edges(node)]
+ source_nodes = {x for x, _ in new_graph.in_edges(node)}
+ target_nodes = {x for _, x in new_graph.out_edges(node)}

In theory, lists are a little faster in this scenario (iteration as opposed to set membership)
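
One way to check this claim on a given machine is a quick timing comparison. This is only a rough sketch: `in_edges` and `out_edges` are hypothetical stand-ins for the edges of a single node, and the timings will vary between runs and machines.

```python
import timeit
from itertools import product

# Hypothetical stand-ins for the in-edges and out-edges of one node.
in_edges = [(f"parent_{i}", "node") for i in range(50)]
out_edges = [("node", f"child_{i}") for i in range(50)]


def pairs_with_lists():
    sources = [x for x, _ in in_edges]
    targets = [x for _, x in out_edges]
    return list(product(sources, targets))


def pairs_with_sets():
    sources = {x for x, _ in in_edges}
    targets = {x for _, x in out_edges}
    return list(product(sources, targets))


# Both variants produce the same pairs; only the cost of building and
# iterating the containers differs.
same_pairs = set(pairs_with_lists()) == set(pairs_with_sets())

list_time = timeit.timeit(pairs_with_lists, number=200)
set_time = timeit.timeit(pairs_with_sets, number=200)
print(f"lists: {list_time:.4f}s  sets: {set_time:.4f}s")
```

Since the containers are only iterated and never used for membership tests, the hashing a set performs on construction buys nothing here.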

Comment on lines 95 to 98
new_edges = product(source_nodes, target_nodes)
new_edges = [
    (source, target) for source, target in new_edges if source != target
]  # removes cyclic refs

Nice! Love being able to read this instead of looking up what nx.algorithms.transitive_closure does every time as well.

@kwigley
Contributor

kwigley commented Oct 29, 2021

I'm wondering how this compares to networkx.algorithms.dag.transitive_closure_dag, which implements a faster transitive closure algorithm for acyclic graphs. Using it would have the same performance impact regardless of how many nodes are selected. At the end of the day, I think it is a tradeoff about where you take the performance hit (as @jtcohen6 mentioned). Do you have a sense of how it compares to this PR?
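
For readers unfamiliar with the closure-based alternative, here is a minimal sketch of "compute the closure of a DAG, then keep only the selected nodes". It is not networkx's implementation: it swaps in a simple reverse-topological pass using the standard library's `graphlib` (Python 3.9+), and it assumes a hypothetical dict layout mapping each node to the set of its direct children.

```python
from graphlib import TopologicalSorter


def closure_subset(children, selected):
    """Transitive closure of a DAG, then keep only the `selected` nodes.

    `children` maps node -> set of direct children (hypothetical layout).
    This is a simple stand-in, not networkx's transitive_closure_dag.
    """
    # TopologicalSorter expects node -> predecessors and yields
    # predecessors first. Feeding it node -> children therefore yields
    # children first, so each node's descendants are already complete
    # by the time the node itself is visited.
    order = TopologicalSorter(children).static_order()
    descendants = {}
    for node in order:
        reachable = set(children.get(node, ()))
        for child in children.get(node, ()):
            reachable |= descendants[child]
        descendants[node] = reachable
    # Induced subgraph of the closure on the selected nodes.
    return {node: descendants[node] & selected for node in selected}


dag = {"a": {"b"}, "b": {"c"}, "c": {"d"}, "d": set()}
subset = closure_subset(dag, selected={"a", "d"})
```

The closure is computed over the whole graph before any filtering, which is why its cost does not depend on how many nodes are selected.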

@jtcohen6
Contributor

jtcohen6 commented Oct 29, 2021

@kwigley Nice find!! I just tried out on my local machine (using the same sample perf project, 2k models + 6k tests), and it looks like that one-line change (transitive_closure_dag instead of transitive_closure) cuts this down from 5.5s to 0.6s!

That seems like an uncontroversial improvement. There's no way it can be slower, right? And it preserves consistent behavior no matter how many nodes are being selected.

@iknox-fa
Contributor

iknox-fa commented Oct 29, 2021

@kwigley and @jtcohen6 I made some benchmarks using the performance project!
Each solution was tested 10x and averaged.

transitive_closure = ~4.02s (+/- 0.07s)
transitive_closure_dag = ~0.55s (+/- 0.02s)
this PR = ~0.11s (+/- 0.001s)

Note that this is testing the entire get_subset_graph method. This is because with this PR we don't need to do the extra step of removing nodes as it's built into the rest of the algo.
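
A benchmark of this shape (10 runs, averaged, with a spread) could be reproduced with a small harness like the one below. This is a sketch under assumptions: the comment does not say how the +/- figure was computed, so a sample standard deviation is assumed here, and `placeholder_workload` is a hypothetical stand-in for a call to get_subset_graph on a real project.

```python
import statistics
import timeit


def benchmark(func, runs=10):
    """Time `func` once per run and return (mean, stdev) in seconds."""
    timings = timeit.repeat(func, number=1, repeat=runs)
    return statistics.mean(timings), statistics.stdev(timings)


def placeholder_workload():
    # Hypothetical stand-in for get_subset_graph on a real project.
    return sum(i * i for i in range(10_000))


mean, spread = benchmark(placeholder_workload)
print(f"~{mean:.4f}s (+/- {spread:.4f}s)")
```

Timing the whole method rather than a single call inside it keeps the comparison fair when one candidate folds an extra step into its main loop.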

I like the approach they take in transitive_closure_dag* and I could potentially combine it with graph node removal and make it even moar faster(!), but I'm thinking we've put enough hours into this at this point and would recommend going with this PR's solution.

* they use a topo sort + graph distance calculation instead of a graph iteration + edge detection.

@iknox-fa iknox-fa marked this pull request as ready for review October 29, 2021 19:02
@iknox-fa iknox-fa merged commit dd7af47 into main Oct 29, 2021
@iknox-fa iknox-fa deleted the leahwicz/perfImprovement branch October 29, 2021 21:06
jtcohen6 pushed a commit that referenced this pull request Nov 2, 2021
Perf improvement to get_subset_graph
Co-authored-by: Ian Knox <ian.knox@fishtownanalytics.com>
leahwicz added a commit that referenced this pull request Nov 2, 2021
* Perf improvement to subgraph selection (#4155)

Perf improvement to get_subset_graph
Co-authored-by: Ian Knox <ian.knox@fishtownanalytics.com>

* Add extra graph edges for `build` only (#4143)

* Resolve extra graph edges for build only

* Fix flake8

* Change test to reflect functional change

* Rename method + args. Add changelog entry

* Any subset, strict or not (#4160)

* Fix test backport

Co-authored-by: leahwicz <60146280+leahwicz@users.noreply.github.com>
Successfully merging this pull request may close these issues.

[Bug] Huge slowdown of dbt initialization for projects with many tests since v0.21
4 participants