Suggestion for mandatory message key recovery mechanism #1701

Closed
ell1e opened this issue Jan 1, 2024 · 11 comments
Labels
improvement An idea/future MSC for the spec

Comments

@ell1e

ell1e commented Jan 1, 2024

Rationale and note that I'm incompetent:
I'm not qualified to make the following suggestions. But I've seen people abandon Matrix due to this, and for years the approach seems to have been to just prevent that scenario from happening rather than to implement a robust recovery in case it does. It seems to me that preventing it might be futile, since it hasn't worked out for years. Therefore, as a completely uninformed, unqualified person, I'm suggesting this mandatory recovery mechanism, since the whole situation is difficult for users and the current trajectory doesn't seem to be working out. Even if Element addressed this, other clients would likely be affected, wouldn't they? If Element can't fully avoid this after years, how could other clients be expected to? I don't understand it. The mechanism I'm proposing is based on this nheko bug report.

Problem: I've just had a target user (Element) get "unable to decrypt message" even though my sending client (Nheko) is still online and they're not using any sort of new device. The message is also only hours old, so plausible deniability gives no real reason why, within such a short time frame, my sending client shouldn't allow a receiving client to re-request the necessary keys if the same trust level is still present, which it should be, since there has been no new device, no device fingerprint change, or anything similar.

Steps to reproduce above underlying problem:

  1. Use Nheko to send a text message to an Element user
  2. Encounter some condition where the sending client fails to send the necessary keys to all target devices, or the keys somehow don't arrive on all of them (this seems fairly easy to hit with both Nheko and Element as a sender, as soon as the target has multiple clients and some of them are offline. Don't ask me how, since I don't know the protocol at that level, but it happens all the time, even without Nheko involvement)
  3. Receiving person comes online on one of their devices that didn't get the decryption key and then tries to read the message

What happened after above problem steps?

Receiver just gets "unable to decrypt message". Neither does it ever automatically recover (which it should), nor is there a button or anything to allow the receiving user to actively try a recovery mechanism of any sort.

Suggestion for the matrix specs to address this:

Any receiving client in the above scenario should be required, at least as default behavior (unless the user actively opts out), to ask the sending client via some E2EE-tunneled direct communication channel to send the keys for the message again.

Any sending client asked for key retransmission in this way should check that:

  1. the same trust level applies (either this is not a new unknown device, or per-user rather than per-device trust is enabled and the device has the necessary signatures for that),
  2. the message isn't too old (a reasonable time window would probably be around 24h), so plausible deniability is still reasonably adhered to, and
  3. the message wasn't deleted in the meantime.

Once all of these checks succeed, it should resend the necessary decryption keys to the target user's device, or make new keys and resend the message, or whatever is needed.
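The sender-side checks described above can be sketched roughly as follows. This is only an illustration of the proposal, not an existing client or SDK API: the `KeyRerequest` type, its fields, and `should_resend_key` are hypothetical names, and the 24-hour window is simply the figure suggested in the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical window, taken from the "around 24h" figure in the proposal.
MAX_MESSAGE_AGE = timedelta(hours=24)


@dataclass
class KeyRerequest:
    """Hypothetical summary of a re-request as seen by the sending client."""
    device_is_trusted: bool     # check 1: same trust level as when the message was sent
    message_sent_at: datetime   # check 2: used to compute the message age
    message_redacted: bool      # check 3: message was deleted in the meantime


def should_resend_key(req: KeyRerequest, now: datetime) -> bool:
    """Return True only if all three proposed checks pass."""
    if not req.device_is_trusted:
        return False
    if now - req.message_sent_at > MAX_MESSAGE_AGE:
        return False
    if req.message_redacted:
        return False
    return True
```

How `device_is_trusted` would actually be determined (device verification, cross-signing) is left out here on purpose; that is the part a real client would have to get right.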

Sorry if this is a dumb suggestion, but all other mechanisms meant to make this unnecessary, including Element's dehydrated devices (which other clients don't even seem to have dared to implement, so they aren't available to everyone in the ecosystem), seem to have failed to realistically get a handle on this.

@ell1e
Author

ell1e commented Jan 1, 2024

Also, my apologies, I think I already filed an Element bug report for this too. But I couldn't find an actual spec bug, so I'm assuming it didn't make it here yet. If it did and this is a duplicate, please close it.

@richvdh
Member

richvdh commented Jan 2, 2024

We used to have something similar to this, in the form of m.room_key_request messages. It was disabled primarily due to security problems (GHSA-6263-x97c-c4gg); however, it also had large performance problems which meant it actually ended up making the "UTD" problem worse (element-hq/element-web#26524) and, more generally, our experience of papering over delivery bugs by implementing workarounds is not good: it makes the underlying bug an order of magnitude harder to find and fix.
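For readers unfamiliar with the mechanism mentioned above, here is a minimal sketch of the to-device content an m.room_key_request carries. The field names follow the end-to-end encryption module of the Matrix client-server spec as far as I recall it; the helper name `build_room_key_request` and the example values are made up for illustration, and older spec versions also include a `sender_key` field in the body, which is omitted here.

```python
import uuid


def build_room_key_request(room_id: str, session_id: str, requesting_device_id: str) -> dict:
    """Build the content of an m.room_key_request to-device event (sketch)."""
    return {
        "action": "request",
        "body": {
            "algorithm": "m.megolm.v1.aes-sha2",
            "room_id": room_id,
            "session_id": session_id,
        },
        # Random identifier so a later cancellation can refer to this request.
        "request_id": uuid.uuid4().hex,
        "requesting_device_id": requesting_device_id,
    }


# Example values are placeholders, not real identifiers.
content = build_room_key_request("!room:example.org", "SESSIONID", "DEVICEID")
```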

If you have reason to believe that room key messages are being sent but not delivered, that is a bug that should be investigated and fixed.

We have a bunch of ideas for making crypto more reliable generally, but a blanket "I didn't get the keys" message is unlikely to be one of them, I'm afraid.

@richvdh richvdh closed this as completed Jan 2, 2024
@ell1e
Author

ell1e commented Jan 14, 2024

> If you have reason to believe that room key messages are being sent but not delivered, that is a bug that should be investigated and fixed.

I mean this seems to be exactly what is happening.

I have reported these bugs for years now with nothing ever happening to fully remove them, and new ones seem to pop up all the time. This also includes non-Element clients.

> a blanket "I didn't get the keys" message is unlikely to be one of them

If that's what works? It seems like everything else just doesn't, so I think you're moving in the wrong direction.

For performance reasons, you could for example limit it to direct messages. It seems like these lost key bugs happen less in larger rooms anyway, for whatever reason, while simultaneously having the biggest impact in 1on1 rooms.
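A rough sketch of what such a restriction could look like on the receiving side. Everything here is an assumption for illustration: the `RerequestPolicy` class is hypothetical, "direct message" is approximated as a room with exactly two joined members, and asking at most once per session is an extra guard I added to keep traffic down, not something from the thread.

```python
class RerequestPolicy:
    """Decide whether the receiving client should ask for a missing key (sketch)."""

    def __init__(self) -> None:
        # (room_id, session_id) pairs we have already asked for.
        self._requested = set()

    def should_request(self, room_id: str, session_id: str, joined_member_count: int) -> bool:
        # Hypothetical DM heuristic: only 1:1 rooms qualify.
        if joined_member_count != 2:
            return False
        key = (room_id, session_id)
        # Ask at most once per session to avoid request storms.
        if key in self._requested:
            return False
        self._requested.add(key)
        return True
```

A real client would probably want to retry after some delay rather than ask exactly once, but that trade-off is exactly where the traffic concern comes in.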

As a conclusion, I think you should reopen this until someone has a better idea, which for about 3-4 years nobody seems to have.

@ell1e
Author

ell1e commented Jan 14, 2024

I also checked the security issue you linked:

"Semi-trusted Impersonation" – matrix-js-sdk (and derived SDKs) accepted keys forwarded by other users, even if your client didn't request them. As a result, a malicious server admin could fake an encrypted message to look as if it was sent from a given user on that server.

That doesn't seem to be a relevant issue with my suggestion. A client would simply need to accept keys only when it actually requested them from that party, and then of course apply the usual checks as well. That messages which are clearly not marked as trusted can be injected in some corner cases seems like an unavoidable protocol issue anyway, I would think, or am I wrong? My apologies if this superficial look at it is wrong; I'm certainly not the right person to ask, but maybe this can prompt a discussion among people with a better understanding.
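To make the "only accept keys when actually requested from that party" idea concrete, here is a minimal sketch. The `PendingKeyRequests` class and its methods are hypothetical and not taken from matrix-js-sdk or any other SDK; it only covers the bookkeeping, not the "usual checks" (such as verifying the forwarding device) that would still have to follow.

```python
class PendingKeyRequests:
    """Track outstanding key requests so unsolicited forwarded keys are rejected (sketch)."""

    def __init__(self) -> None:
        # (room_id, session_id) -> set of (user_id, device_id) we asked.
        self._pending = {}

    def record_request(self, room_id: str, session_id: str, user_id: str, device_id: str) -> None:
        self._pending.setdefault((room_id, session_id), set()).add((user_id, device_id))

    def accept_forwarded_key(self, room_id: str, session_id: str,
                             from_user_id: str, from_device_id: str) -> bool:
        asked = self._pending.get((room_id, session_id), set())
        # Reject keys from anyone we did not explicitly ask for this session.
        if (from_user_id, from_device_id) not in asked:
            return False
        # The request is satisfied; later copies count as unsolicited.
        del self._pending[(room_id, session_id)]
        return True
```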

@richvdh
Member

richvdh commented Jan 14, 2024

> I mean this seems to be exactly what is happening.

I'd like to see a pair of rageshakes that demonstrate it, please. I haven't seen such behaviour since we fixed matrix-org/synapse#15335.

@ell1e
Author

ell1e commented Jan 14, 2024

I recently had it happen with nheko and element (the latter failing to get the key, and then also failing to re-request it which would have been addressed by this ticket here), and it seems to have happened to me every 6 months or so since I started using Matrix, which was many years ago. Also, you might think it's just me, but usually when I bring it up in the Matrix dev and support rooms, somebody mentions they've also had it happen recently.

@richvdh
Member

richvdh commented Jan 14, 2024

> As a conclusion, I think you should reopen this until someone has a better idea, which for about 3-4 years nobody seems to have.

Well, that's not entirely fair. We have made some improvements over the last few years. But it's also fair to say that unable-to-decrypt errors still happen too often. Now that our work on Element-Web R is reaching a conclusion, we're planning to devote some time trying to flush out the remaining causes of these errors.

If you're not aware of it, please take a look at element-hq/element-meta#245. It includes a list of known causes of "unable to decrypt" errors; a number of the linked issues also have ideas for fixing them.

@richvdh
Member

richvdh commented Jan 14, 2024

> I recently had it happen with nheko and element (the latter failing to get the key, and then also failing to re-request it which would have been addressed by this ticket here)

How do you know the failure mode you describe here happened? Without logs from both sides, it's somewhat impossible to be sure, and impossible to know if the solution you describe would have fixed it. There's a lot more going on here than "messages getting lost".

@ell1e
Author

ell1e commented Jan 14, 2024

Well, because it was exactly the situation I described in the initial comment, as it always has been when this happens: a 1on1 chat room, the key somehow isn't sent to all of the target's devices but only to some, due to whatever bug (in this case maybe on Nheko's sender side), and the message is usually a few minutes or at most a few hours old when the target user comes online and sees the usual "Unable to decrypt". So in that case, if Element on the receiver side had re-requested the keys from Nheko, like this ticket suggests all clients should do, it would have been fine. Instead the situation was basically stuck forever, until at some point it magically worked again, but with all the affected older messages lost.

I still think that, given how often and in how many variations this happens, and since it seems to involve other clients bugging out as well, this mechanism should be considered. From the outside, it just doesn't look like the problem is disappearing any time soon, even if Element fixed all instances in Element's code.

@richvdh
Member

richvdh commented Jan 15, 2024

There are several (known) root causes which could cause the symptoms you describe, and I don't particularly want to get into speculating about which it was without the logs. And I don't agree that you can assert that a specific solution will solve the problem unless you understand what the issue was.

The problems with the solution you suggest are (a) it can easily lead to an increase in traffic which then leads to a decrease in reliability; (b) if you simply allow any device to say "hey I didn't get the keys, send again please" then that makes it much easier for an attacker to pose as a recipient device and hence obtain the keys to the message. (Obviously you can apply more intelligence, but then we get into details and once again it becomes important to understand exactly which failure mode we are trying to solve).

So: (a) your proposal will not necessarily solve the problem you are having; (b) your proposal has downsides and at the very least needs refining.

In short: I'm not going to discuss this any further here.

@ell1e
Author

ell1e commented Jan 15, 2024

I see. I'm not sure if there's any point to me responding further then.

I hope it's fair to admit both that I'm, as stated multiple times, not fully qualified to judge anyway, but also that I don't easily see either how it wouldn't solve the situation given, or where the downsides are. (It seems like mostly the same problem as verifying trust for any other message, and there should therefore already be pre-existing mechanisms to use for that.) Nevertheless, if there's no interest in discussing this further, then everyone reading up to here can simply assume I'm wrong and leave it at that. Which is possible, that I'm wrong I mean, of course.

Maybe I should also add though that this unreliability seems concerning to me, and I don't seem to be the only one. I guess that's easy to say from the sidelines, however, I accept that.

2 participants