-
Notifications
You must be signed in to change notification settings - Fork 699
Ensure that groupKey objects are treated as their groupTarget for the purposes of joins and combines #6139
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for nextflow-docs-staging canceled.
|
static private safeStr(key) { | ||
key instanceof GString ? key.toString() : key | ||
if (key instanceof GString) | ||
return key.toString() | ||
if (key instanceof GroupKey) | ||
return key.getGroupTarget() | ||
return key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds very reasonable, it may be worth adding a unit test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Will fix that today.
result.keys = [entry] | ||
result.addKey(entry) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not strictly required, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the [entry]
is added directly to the result.keys
List, it doesn't pass through the safeStr
method, and thus would miss the new instanceOf GroupKey
check.
If we wanted to be strict about it, we could make the keys List private to make sure addKey
is the only way to add a KeyPair, but I'm not sure it's worth the extra complexity. KeyPair is not used broadly in the codebase.
Looking at this a bit more closely this morning, I see that this change does "unwrap" the groupKey if the channel passes through the The options are:
I think option 3 (the PR) is the cleanest without introducing major breaking changes. The only "breaking" change this introduces is if a workflow is relying on a groupKey being passed through a join/combine with the matching unwrapped value. Such a workflow is already in a fragile timing-dependent state anyway, so I think it's ok draw attention to this by introducing this "breaking" change. |
I need AI to summarise this 😆 |
In my view this could be managed, unwrapping temporary the key for the requirement comparison/indexing operation, but continuing to emit as wrapped GroupKey. This essential because this was designed to traverse a chain of operators |
If so, we may have to consider the pathological case where two groupKeys are joined, but the groupKey contains different size values 😬 |
How that would happen in practice? |
Now I hope you see why I would rather deal with this in operators v2 |
A fair concern. This is very unlikely, and I can't imagine anybody doing it intentionally. Maybe we can just issue a warning in this very rare circumstance.
Yes, Ben :D I would still like to find a neat solution for existing users. If you work on v2, I'll give some thought to an interim solution here. Nextflow is a path to reproducibility, so it would be great to help users avoid timing-dependent workflow bugs that break that reproducibility guarantee. |
Signed-off-by: Rob Syme <rob.syme@gmail.com>
- Store original keys in KeyPair and use them for output emission in join/combine - Always prefer GroupKey over plain keys when both are present for the same match - Fixes timing-dependent bugs when joining or combining channels with mixed GroupKey/plain keys - Updates buffer logic to store KeyPair objects, ensuring metadata like group size is retained - Updates remainder and mismatch handling to work with new buffer structure - Adds and passes comprehensive tests for GroupKey preservation and commutativity Fixes #4104 Signed-off-by: Rob Syme <rob.syme@gmail.com>
5fe466e
to
d105842
Compare
Ok, I've fixed it to ensure that the groupKey is preserved if combined with the target. I'll accept that preserving the key comes at the cost of some additional complexity 😬 |
Signed-off-by: Rob Syme <rob.syme@gmail.com>
6384cc5
to
fc58bec
Compare
b4b321e
to
069653d
Compare
When using operators like join or combine, if one side of the join uses a plain value (e.g., a string) and the other side uses a GroupKey (wrapping the same string), the join may fail to match them—even though they represent the same logical key.
This is because, although GroupKey implements
equals
andhashCode
to delegate to its target, there are subtle issues in Groovy/Java equality and hash semantics, especially when comparing a GroupKey to a plain value from the other side. This can lead to unpredictable or inconsistent join results, as described in #4104.This asymmetry can cause joins to miss matches, depending on which side emits the GroupKey. The ordering of items on left or right can depend on the order of items in channels, leading to difficult-to-debug timing issues. This issue affects the join and combine operators which will do these sorts of equality checks.
Both JoinOp and CombineOp use the
makeKey
utility (which creates a KeyPair) to extract keys from tuples for matching. By ensuring that all keys are "unwrapped" from GroupKey to their target value, we guarantee that:A key
'BATCH1'
and a keygroupKey('BATCH1')
are treated as the same for the purposes of joining or combining.This removes the asymmetry and inconsistency in key matching, regardless of which side emits a GroupKey or a plain value.