Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(backends): clean up resources produced by memtable #10055

Merged
merged 15 commits into from
Sep 10, 2024

Conversation

cpcloud
Copy link
Member

@cpcloud cpcloud commented Sep 7, 2024

Description of changes

PR on top of #10053 to clean up temporary tables produced by ibis.memtable.

The approach is a generalization of the one taken in #10042.

The main caveat here, which I don't think is a blocker is that the following
code will start to fail:

table = ibis.memtable({"a": [1, 2, 3]}, name="foo")
register_memtable(table)

table = None  # very likely to cause `table` op to be GC'd, and thus delete the underlying resource
run_query_against("foo")  # fails because `foo` has been dropped

Issues closed

Closes #10044.

@@ -382,31 +384,27 @@ def create_table(
else:
temp_name = name

table = sg.table(temp_name, catalog=database, quoted=quoted)
target = sge.Schema(this=table, expressions=column_defs)
table_expr = sg.table(temp_name, catalog=database, quoted=quoted)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed these variable names because table = <not the memtable> causes it to get GC'd. I gave another example of this in the PR description.

@cpcloud cpcloud marked this pull request as draft September 7, 2024 14:26
@cpcloud
Copy link
Member Author

cpcloud commented Sep 7, 2024

Still need to add at least one cross-backend test to verify the behavior.

@cpcloud cpcloud force-pushed the cleanup-temp-resources branch 9 times, most recently from fb642a4 to c6c0677 Compare September 8, 2024 13:25
@cpcloud cpcloud added refactor Issues or PRs related to refactoring the codebase performance Issues related to ibis's performance labels Sep 8, 2024
@cpcloud
Copy link
Member Author

cpcloud commented Sep 9, 2024

I'll split out the faster memtable checks into a separate PR once #10053 is merged.

@cpcloud
Copy link
Member Author

cpcloud commented Sep 9, 2024

Going to split this out into in-memory versus not-in-memory backends since the former are more important to clean up more aggressively.

@cpcloud cpcloud force-pushed the cleanup-temp-resources branch 5 times, most recently from 009261a to 837e3d2 Compare September 9, 2024 19:58
@cpcloud
Copy link
Member Author

cpcloud commented Sep 10, 2024

After hours of trying to understand what is happening with MySQL I was able to get a reproducer:

I can only reproduce this with a Python 3.12 environment.

pytest --randomly-seed=159558151 -m mysql -x --pdb ibis/backends/tests/test_generic.py -vv

Still not sure why executing SQL in a finalizer causes the socket on the connection to get set to None, so I'm going to disable memtable cleanup for MySQL for now. It's fairly low risk in that memtables are created with TEMPORARY, so they'll get clean up when the session ends.

We won't be able to prevent build up of disk space when constructing large memtables in a loop, but we won't keep all the storage around when the session ends.

@cpcloud cpcloud marked this pull request as ready for review September 10, 2024 12:08
@cpcloud
Copy link
Member Author

cpcloud commented Sep 10, 2024

I know I said I would break out the in-memory backends into their own PR, but that would require no-op-ing the non-in-memory backends and then changing that code again. That doesn't seem worth the trouble, given the hopefully manageable small number of changes here.

Copy link
Member

@jcrist jcrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A could questions/suggestions around error handling, but otherwise LGTM.

ibis/backends/exasol/__init__.py Outdated Show resolved Hide resolved
ibis/backends/__init__.py Outdated Show resolved Hide resolved
ibis/backends/duckdb/__init__.py Outdated Show resolved Hide resolved
ibis/backends/tests/test_client.py Outdated Show resolved Hide resolved
ibis/backends/tests/test_client.py Outdated Show resolved Hide resolved
@cpcloud cpcloud requested a review from jcrist September 10, 2024 15:09
Copy link
Member

@jcrist jcrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presuming tests pass, LGTM.

@cpcloud
Copy link
Member Author

cpcloud commented Sep 10, 2024

Running the clouds locally now.

@cpcloud cpcloud added this to the 9.5 milestone Sep 10, 2024
@cpcloud
Copy link
Member Author

cpcloud commented Sep 10, 2024

BigQuery is good:

❯ pytest -m bigquery -n auto --dist loadgroup --snapshot-update -q
bringing up nodes...
...........sssssssssssssssssssss.......x.....x.........x............x.............x....x........x.x..xx.....s..x..x..x...x......x...x.s...ssssssssssssssssssssssssssssssssssssssssssssssssssss [  8%]
sssssssssssssssssssssssssssssssssssssssssssssss.........s......x...xx.x.xx........x.....s........................x........x......x...........s................................................ [ 17%]
.x......x.....................x......x.........................x........................................................xx.....x....x.X............x.............xx..x.......x.....xx....x...x [ 26%]
.....xx.......x.x..x.....X.x...x.x....x.....x....x.x...........x......x.xx.x.xXxx...x.x.xxXxx....Xx....x...x........x.........x...x......x..................................x......x.........x [ 35%]
..........x.x...x..xxxxx.xx.x..xx.x.x.........x...x..x.........x...x..........x....x.......x..............x......x....x.............x....x.x............x.............x............x.......... [ 44%]
....xx.........x....x...x.x..x.............xx...x.xx.x.x..xx...xxxx..xxxx...x.....xxxx.xx.....x......xxx.....x.x.........x....x..xx..x....x........x.x...x..............x.......x.x....xx.x... [ 53%]
........x.x..x............x........xx..xx...x..x.....x......x.x.x...x....x......xx...x.......x...x....x.x...x..x..............x..x.......................x........x...x...x.....x.........x... [ 62%]
...x..x......x.......x........x..........x..x..............x.x...x..........x.....x....x..x................xx....xx.....x.....................x.................s.........s................... [ 71%]
.....................................................x.s.xxxx.xxx.x.xx.xxx.x..xx...xxx.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.xxxxxxxxxx.xxxxxxxxxxxxxxxxxx.xxxxxx................... [ 79%]
.............................x.............................x.....................x.......x........x...............x.x...x.....................x............................................... [ 88%]
..............x..x...................................................xxxxxxxx.xx...........................x...x.......................x..............................................x....... [ 97%]
.................................................                                                                                                                                              [100%]
1663 passed, 128 skipped, 343 xfailed, 5 xpassed in 344.90s (0:05:44)

@cpcloud
Copy link
Member Author

cpcloud commented Sep 10, 2024

Snowflake is good:

❯ pytest -m snowflake -n 8 --dist loadgroup --snapshot-update -q && pytest -m bigquery -n auto --dist loadgroup --snapshot-update -q
bringing up nodes...
.......x.....................s..................x..............x....x.............................x..................x...................x...x........x...x..................x................ [  9%]
............x.....x.......x........x..x.x...............x....................x....................................xx....................x...x..................x...x.........................x [ 19%]
.x.........x.....x.......x..x.........x..................x............................x..........x........x....x.....x......................x..........x.................x............x....... [ 29%]
..x.........x.....x..xx....x...x.......x..............xx..s.................x...............xxx.xx...............x......x..x.......x.x.x.......x......................s........x.............. [ 39%]
.......................s.......s........x.....xx.........xx...x..............x............xx.x......x..................x....xx......x............x.................x....x............x..x..... [ 49%]
x.x.....xx.x......x...xx..xx...x..x....................x..................x...............................................................x..............x.................................xx. [ 59%]
.........................x..xx..xxxx...x.x...x.xx.x.........xx..xx.....x.xxxx.x.xxxxxxxxx.x..xxxxx.x....xxxxx...x...xx.xx..xxx....xxxx..........xxx...x.....x.xx..x......x.................... [ 69%]
..........x..........x.s...x...x.........x..x..xxxxxxxxxxxx.x....................................................x........................x.......x........................................... [ 79%]
.....................x..x...........x........x.....x..x......................x...x........x.............xx...x...x.........x...............x.x.........x..........x....x...xxxxxxxx.x...x....x [ 89%]
.x.xx..........s...s...s.s.................................................................................................................................................................... [ 99%]
..........

@cpcloud cpcloud enabled auto-merge (squash) September 10, 2024 16:03
@cpcloud cpcloud merged commit 019cae5 into ibis-project:main Sep 10, 2024
81 checks passed
@cpcloud cpcloud deleted the cleanup-temp-resources branch September 10, 2024 16:29
@jcrist
Copy link
Member

jcrist commented Sep 10, 2024

Looks like the new test might be a bit flaky on bigquery: https://github.com/ibis-project/ibis/actions/runs/10798011642/job/29950561913

@cpcloud
Copy link
Member Author

cpcloud commented Sep 10, 2024

Blah, okay, looking into it.

cpcloud added a commit that referenced this pull request Sep 18, 2024
Replaces pymysql with mysqlclient, mostly out of frustration with bizarre GC behavior discovered during #10055. I think this is probably a breaking change due to some changes in how types are inferred for JSON, INET and UUID types.

BREAKING CHANGE: Ibis now uses the `MySQLdb` driver. You may need to install MySQL client libraries to **build** the extension.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Issues related to ibis's performance refactor Issues or PRs related to refactoring the codebase
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug: ibis should release resources when an ibis.memtable op is GC'd
2 participants