Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Messages not being sent over federation #7099

Closed
ErrorProne opened this issue Mar 18, 2020 · 21 comments · Fixed by #7376
Closed

Messages not being sent over federation #7099

ErrorProne opened this issue Mar 18, 2020 · 21 comments · Fixed by #7376
Assignees
Labels
z-bug (Deprecated Label)

Comments

@ErrorProne
Copy link

ErrorProne commented Mar 18, 2020

Description

I'll try to do this as exact as I'm able to in reverse.

After upgrading our matrix server running on ubuntu 18.04 to 1.11.1+bionic1 we did not receive any messages from servers running on older instances (only in 1:1 chats, groups have been working fine).

We are explicitly talking with a another company running a matrix instance and a freelancer and his own setup.

Both problems were solved after upgrading their versions to the matrix synapse 1.11.1 version. But since at least one party was running 1.11.0+bionic1 before, it looks like the bug has been introduced in the patch release.

All messages and presence updates going from 1.11.1+bionic1 to the other sides went through without problems. But nothing the other way around.

One major pain point was that we discovered this by getting suspicious that some people did not respond in time and thus tried other communication channels. So the users where totally unaware of this.

A last note: As mentioned bringing all instances to 1.11.1 (or the latest docker image) solved it, no message went missing! As soon as the other parties wrote a new message all the missing ones got redelivered.

Edit: As far as I can tell it did not matter on which side the 1:1 chatroom has been initially created.

Steps to reproduce

(I did not re-test this, but this should result in the problematic setup)

  • Connect two matrix servers running 1.11.0+bionic1
  • Create two 1:1 chats, one initialized by each side
  • Upgrade on instance to 1.11.1+bionic1

You should now only be able to send 1:1 messages from 1.11.1+bionic1 to 1.11.0+bionic1 but not the other way around (At least using any riot the user won't get any feedback about this. I'll write an issue for that project later).

I could not find any suspicious log entries on our 1.11.1 server. But I've requested logs from the other side, maybe we can find something there.

Version information

  • apt package installed: matrix-synapse-py3 1.11.1+bionic1 amd64
  • Not exactly sure about the other sides. One had matrix-synapse-py3 1.11.0+bionic1 apt
  • The other party had the docker container setup using matrixdotorg/synapse (version before upgrade not known)
@ErrorProne
Copy link
Author

ErrorProne commented Mar 19, 2020

Update: This continues, sometimes messaging in 1:1 rooms between two server works, sometimes it fails.
It looks like old 1:1 rooms could be bricked. We just tried to create new direct chats and that seems to work for now

Addon:
We have now checked logs on both server but could not find any error messages regarding the broken rooms

@richvdh
Copy link
Member

richvdh commented Mar 19, 2020

It sounds like the upgrade from 1.11.0 to 1.11.1 is irrelevant. Possibly it just seemed related because things coincidentally broke at the same time.

I'm wondering if this is somehow related to #7065: there was a bug in synapse 1.10 which could cause some incorrect database updates. There is a fix for this (#7070) in the forthcoming 1.12.0 release.

@ErrorProne
Copy link
Author

ErrorProne commented Mar 25, 2020

@richvdh thanks for the info. After all parties had updated their instances to 1.12.0 it looks like it works for everyone again =)
The problems still occurs on some 1:1 chats (only via federation). However if a 1:1 chat as been created after the problem first hit it seems to be not influenced by the bug.

This all fells like a some form of data corruption bug for 1:1 federation chats which at least can be worked around be starting new 1:1 chats

@richvdh
Copy link
Member

richvdh commented Mar 26, 2020

ok if you want to investigate further, please send a message in one of the affected rooms, and then share the logs from both the sending server and a server which didn't receive the message.

@ErrorProne
Copy link
Author

By chance we got a broken room again, we'll try to filter the logs on both servers accordingly. There is a lot going on in there. I'll keep you posted

@ErrorProne
Copy link
Author

ErrorProne commented Apr 14, 2020

Sorry for the delay! We have some logs now, also we've seen that:

  • Not all 1:1 chats are broken at once. We had cases where one chat would work and another one won't.
  • Restarts seem to resolve the problem for a period of time until it somehow breaks again.

Here are logs from the admins of server1 and server2.
user1@server1 -> user2@server2: Works
user2@server2 -> user1@server1: Does not work

All systems should be latest greatest (latest stable riot web and latest stable synapse version)

We've tried to get the correct point in time as good as we can and sanitize the logs

Server1 sending:

> 2020-04-14 07:51:23,694 - synapse.federation.sender.transaction_manager - 113 - INFO - federation_transaction_transmission_loop-7824- TX [server2] {1586268922514} Sending transaction [1586268922514], (PDUs: 0, EDUs: 1)
> 2020-04-14 07:51:23,695 - synapse.http.matrixfederationclient - 408 - INFO - federation_transaction_transmission_loop-7824- {PUT-O-10788} [server2] Sending request: PUT matrix://server2/_matrix/federation/v1/send/1586268922514; timeout 60.000000s
> 2020-04-14 07:51:23,706 - synapse.access.http.8008 - 302 - INFO - GET-181487- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 1.129sec/0.003sec (0.007sec, 0.001sec) (0.000sec/0.000sec/0) 759B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s316693_1032142_10579_84647_137398_58_247_4197_57 HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/80.0.3987.163 Chrome/80.0.3987.163 Safari/537.36" [0 dbevts]
> 2020-04-14 07:51:23,707 - synapse.access.http.8008 - 302 - INFO - PUT-181490- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 0.015sec/0.002sec (0.004sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "PUT /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX%3Aserver1/typing/%40user1%3Aserver1 HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/80.0.3987.163 Chrome/80.0.3987.163 Safari/537.36" [0 dbevts]
> 2020-04-14 07:51:23,749 - synapse.http.matrixfederationclient - 442 - INFO - federation_transaction_transmission_loop-7824- {PUT-O-10788} [server2] Got response headers: 200 OK
> 2020-04-14 07:51:23,750 - synapse.http.matrixfederationclient - 164 - INFO - federation_transaction_transmission_loop-7824- {PUT-O-10788} [server2] Completed: 200 OK
> 2020-04-14 07:51:23,750 - synapse.federation.sender.transaction_manager - 146 - INFO - federation_transaction_transmission_loop-7824- TX [server2] {1586268922514} got 200 response
> 2020-04-14 07:51:24,811 - synapse.federation.sender.transaction_manager - 113 - INFO - federation_transaction_transmission_loop-7825- TX [server2] {1586268922515} Sending transaction [1586268922515], (PDUs: 0, EDUs: 1)
> 2020-04-14 07:51:24,813 - synapse.http.matrixfederationclient - 408 - INFO - federation_transaction_transmission_loop-7825- {PUT-O-10789} [server2] Sending request: PUT matrix://server2/_matrix/federation/v1/send/1586268922515; timeout 60.000000s
> 2020-04-14 07:51:24,828 - synapse.access.http.8008 - 302 - INFO - GET-181491- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 1.056sec/0.007sec (0.008sec, 0.000sec) (0.000sec/0.000sec/0) 733B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s316693_1032142_10580_84647_137398_58_247_4197_57 HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/80.0.3987.163 Chrome/80.0.3987.163 Safari/537.36" [0 dbevts]
> 2020-04-14 07:51:24,829 - synapse.access.http.8008 - 302 - INFO - PUT-181493- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 0.015sec/0.007sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "PUT /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX%3Aserver1/typing/%40user1%3Aserver1 HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/80.0.3987.163 Chrome/80.0.3987.163 Safari/537.36" [0 dbevts]
> 2020-04-14 07:51:24,834 - synapse.metrics - 464 - INFO - - Collecting gc 1
> 2020-04-14 07:51:24,870 - synapse.http.matrixfederationclient - 442 - INFO - federation_transaction_transmission_loop-7825- {PUT-O-10789} [server2] Got response headers: 200 OK
> 2020-04-14 07:51:24,872 - synapse.http.matrixfederationclient - 164 - INFO - federation_transaction_transmission_loop-7825- {PUT-O-10789} [server2] Completed: 200 OK
> 2020-04-14 07:51:24,873 - synapse.federation.sender.transaction_manager - 146 - INFO - federation_transaction_transmission_loop-7825- TX [server2] {1586268922515} got 200 response
> 2020-04-14 07:51:24,903 - synapse.access.http.8008 - 302 - INFO - PUT-181494- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 0.064sec/0.001sec (0.009sec, 0.002sec) (0.003sec/0.008sec/4) 60B 200 "PUT /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX%3Aserver1/send/m.room.message/m1586850684575.73 HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/80.0.3987.163 Chrome/80.0.3987.163 Safari/537.36" [0 dbevts]
> 2020-04-14 07:51:24,934 - synapse.federation.sender.transaction_manager - 113 - INFO - federation_transaction_transmission_loop-7826- TX [server2] {1586268922516} Sending transaction [1586268922516], (PDUs: 1, EDUs: 0)
> 2020-04-14 07:51:24,937 - synapse.http.matrixfederationclient - 408 - INFO - federation_transaction_transmission_loop-7826- {PUT-O-10790} [server2] Sending request: PUT matrix://server2/_matrix/federation/v1/send/1586268922516; timeout 60.000000s
> 2020-04-14 07:51:25,045 - synapse.http.matrixfederationclient - 442 - INFO - federation_transaction_transmission_loop-7826- {PUT-O-10790} [server2] Got response headers: 200 OK
> 2020-04-14 07:51:25,047 - synapse.http.matrixfederationclient - 164 - INFO - federation_transaction_transmission_loop-7826- {PUT-O-10790} [server2] Completed: 200 OK
> 2020-04-14 07:51:25,047 - synapse.federation.sender.transaction_manager - 146 - INFO - federation_transaction_transmission_loop-7826- TX [server2] {1586268922516} got 200 response
> 2020-04-14 07:51:25,139 - synapse.access.http.8008 - 302 - INFO - GET-181496- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 0.016sec/0.001sec (0.010sec, 0.000sec) (0.002sec/0.002sec/1) 1337B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s316693_1032142_10581_84647_137398_58_247_4197_57 HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/80.0.3987.163 Chrome/80.0.3987.163 Safari/537.36" [0 dbevts]
> 2020-04-14 07:51:25,269 - synapse.access.http.8008 - 302 - INFO - POST-181497- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 0.018sec/0.001sec (0.002sec, 0.001sec) (0.001sec/0.004sec/2) 6008B 200 "POST /_matrix/client/r0/keys/query HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/80.0.3987.163 Chrome/80.0.3987.163 Safari/537.36" [0 dbevts]

Server 2 Receiving:

> 2020-04-14 07:51:23,719 - synapse.federation.transport.server - 406 - INFO - PUT-77621 - Received txn 1586268922514 from server1. (PDUs: 0, EDUs: 1)
> 2020-04-14 07:51:23,726 - synapse.state - 497 - INFO - PUT-77621 - Resolving state for !nmOxkCqCxiHFcJqVrX:server1 with 2 groups
> 2020-04-14 07:51:23,727 - synapse.state - 520 - INFO - PUT-77621 - Resolving conflicted state for '!nmOxkCqCxiHFcJqVrX:server1'
> 2020-04-14 07:51:23,735 - synapse.handlers.typing - 237 - INFO - PUT-77621 - Got typing update from @user1:server1: {'room_id': '!nmOxkCqCxiHFcJqVrX:server1', 'typing': True, 'user_id': '@user1:server1'}
> 2020-04-14 07:51:23,740 - synapse.access.http.8008 - 302 - INFO - PUT-77621 - 194.95.62.117 - 8008 - {server1} Processed request: 0.025sec/0.000sec (0.017sec, 0.000sec) (0.002sec/0.007sec/3) 12B 200 "PUT /_matrix/federation/v1/send/1586268922514 HTTP/1.0" "Synapse/1.12.3" [0 dbevts]
> 2020-04-14 07:51:24,838 - synapse.federation.transport.server - 406 - INFO - PUT-77622 - Received txn 1586268922515 from server1. (PDUs: 0, EDUs: 1)
> 2020-04-14 07:51:24,846 - synapse.handlers.typing - 237 - INFO - PUT-77622 - Got typing update from @user1:server1: {'room_id': '!nmOxkCqCxiHFcJqVrX:server1', 'typing': False, 'user_id': '@user1:server1'}
> 2020-04-14 07:51:24,859 - synapse.access.http.8008 - 302 - INFO - PUT-77622 - 194.95.62.117 - 8008 - {server1} Processed request: 0.023sec/0.001sec (0.013sec, 0.000sec) (0.002sec/0.005sec/2) 12B 200 "PUT /_matrix/federation/v1/send/1586268922515 HTTP/1.0" "Synapse/1.12.3" [0 dbevts]
> 2020-04-14 07:51:24,948 - synapse.federation.transport.server - 406 - INFO - PUT-77623 - Received txn 1586268922516 from server1. (PDUs: 1, EDUs: 0)
> 2020-04-14 07:51:24,957 - synapse.handlers.federation - 194 - INFO - PUT-77623-$RA-DUABeG2Ob6pmFoAs-2Uf-XAxjSnnnpOqBAMHhbuY - handling received PDU: <FrozenEventV3 event_id='$RA-DUABeG2Ob6pmFoAs-2Uf-XAxjSnnnpOqBAMHhbuY', type='m.room.message', state_key='None'>
> 2020-04-14 07:51:25,034 - synapse.access.http.8008 - 302 - INFO - PUT-77623 - 194.95.62.117 - 8008 - {server1} Processed request: 0.087sec/0.002sec (0.017sec, 0.000sec) (0.007sec/0.019sec/7) 62B 200 "PUT /_matrix/federation/v1/send/1586268922516 HTTP/1.0" "Synapse/1.12.3" [1 dbevts]
> 2020-04-14 07:51:25,037 - synapse.push.httppusher - 195 - INFO - httppush.process-181 - Processing 1 unprocessed push actions for @user2:server2/im.vector.app.android/xxxx starting at stream_ordering 7820

Server 2 sending:

> 2020-04-14 07:53:42,024 - synapse.federation.sender.transaction_manager - 113 - INFO - federation_transaction_transmission_loop-42873 - TX [server1] {1585132221531} Sending transaction [1585132221531], (PDUs: 0, EDUs: 1)
> 2020-04-14 07:53:42,026 - synapse.http.matrixfederationclient - 408 - INFO - federation_transaction_transmission_loop-42873 - {PUT-O-25694} [server1] Sending request: PUT matrix://server1/_matrix/federation/v1/send/1585132221531; timeout 60.000000s
> 2020-04-14 07:53:42,029 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-42873 - Connecting to chat3.server1:443
> 2020-04-14 07:53:42,041 - synapse.access.http.8008 - 302 - INFO - GET-77751 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 2.598sec/0.005sec (0.012sec, 0.000sec) (0.000sec/0.000sec/0) 738B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s9681_261342_1945_10059_199_5_3_6751_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-14 07:53:42,042 - synapse.access.http.8008 - 302 - INFO - PUT-77753 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.027sec/0.002sec (0.009sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "PUT /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX:server1/typing/@user2:server2 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-14 07:53:42,081 - synapse.access.http.8008 - 302 - INFO - OPTIONS-77754 - 31.17.249.107 - 8008 - {None} Processed request: 0.001sec/0.001sec (0.003sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/sync?filter=0&timeout=30000&since=s9681_261342_1946_10059_199_5_3_6751_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-14 07:53:42,159 - synapse.http.matrixfederationclient - 442 - INFO - federation_transaction_transmission_loop-42873 - {PUT-O-25694} [server1] Got response headers: 200 OK
> 2020-04-14 07:53:42,161 - synapse.http.matrixfederationclient - 164 - INFO - federation_transaction_transmission_loop-42873 - {PUT-O-25694} [server1] Completed: 200 OK
> 2020-04-14 07:53:42,161 - synapse.federation.sender.transaction_manager - 146 - INFO - federation_transaction_transmission_loop-42873 - TX [server1] {1585132221531} got 200 response
> 2020-04-14 07:53:43,791 - synapse.federation.sender.transaction_manager - 113 - INFO - federation_transaction_transmission_loop-42874 - TX [server1] {1585132221532} Sending transaction [1585132221532], (PDUs: 0, EDUs: 1)
> 2020-04-14 07:53:43,793 - synapse.http.matrixfederationclient - 408 - INFO - federation_transaction_transmission_loop-42874 - {PUT-O-25695} [server1] Sending request: PUT matrix://server1/_matrix/federation/v1/send/1585132221532; timeout 60.000000s
> 2020-04-14 07:53:43,806 - synapse.access.http.8008 - 302 - INFO - GET-77755 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 1.689sec/0.004sec (0.008sec, 0.000sec) (0.000sec/0.000sec/0) 713B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s9681_261342_1946_10059_199_5_3_6751_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-14 07:53:43,807 - synapse.access.http.8008 - 302 - INFO - PUT-77756 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.021sec/0.002sec (0.006sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "PUT /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX:server1/typing/@user2:server2 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-14 07:53:43,809 - synapse.metrics - 464 - INFO -  - Collecting gc 1
> 2020-04-14 07:53:43,814 - synapse.access.http.8008 - 302 - INFO - OPTIONS-77757 - 31.17.249.107 - 8008 - {None} Processed request: 0.000sec/0.000sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX:server1/send/m.room.message/m1586850823763.0 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-14 07:53:43,839 - synapse.http.matrixfederationclient - 442 - INFO - federation_transaction_transmission_loop-42874 - {PUT-O-25695} [server1] Got response headers: 200 OK
> 2020-04-14 07:53:43,842 - synapse.http.matrixfederationclient - 164 - INFO - federation_transaction_transmission_loop-42874 - {PUT-O-25695} [server1] Completed: 200 OK
> 2020-04-14 07:53:43,843 - synapse.federation.sender.transaction_manager - 146 - INFO - federation_transaction_transmission_loop-42874 - TX [server1] {1585132221532} got 200 response
> 2020-04-14 07:53:43,857 - synapse.access.http.8008 - 302 - INFO - OPTIONS-77759 - 31.17.249.107 - 8008 - {None} Processed request: 0.000sec/0.003sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/sync?filter=0&timeout=30000&since=s9681_261342_1947_10059_199_5_3_6751_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-14 07:53:43,892 - synapse.access.http.8008 - 302 - INFO - PUT-77758 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.041sec/0.006sec (0.014sec, 0.001sec) (0.002sec/0.004sec/3) 60B 200 "PUT /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX:server1/send/m.room.message/m1586850823763.0 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-14 07:53:43,917 - synapse.access.http.8008 - 302 - INFO - GET-77760 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.017sec/0.001sec (0.006sec, 0.000sec) (0.003sec/0.004sec/2) 942B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s9681_261342_1947_10059_199_5_3_6751_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-14 07:53:43,988 - synapse.access.http.8008 - 302 - INFO - OPTIONS-77762 - 31.17.249.107 - 8008 - {None} Processed request: 0.001sec/0.003sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/sync?filter=0&timeout=30000&since=s9682_261342_1947_10059_199_5_3_6751_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-14 07:53:43,994 - synapse.access.http.8008 - 302 - INFO - POST-77761 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.017sec/0.000sec (0.002sec, 0.004sec) (0.002sec/0.004sec/3) 2B 200 "POST /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX:server1/read_markers HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-14 07:53:44,039 - synapse.access.http.8008 - 302 - INFO - GET-77763 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.018sec/0.001sec (0.009sec, 0.001sec) (0.004sec/0.002sec/2) 2007B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s9682_261342_1947_10059_199_5_3_6751_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-14 07:53:44,077 - synapse.access.http.8008 - 302 - INFO - OPTIONS-77764 - 31.17.249.107 - 8008 - {None} Processed request: 0.001sec/0.001sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/sync?filter=0&timeout=30000&since=s9682_261342_1947_10059_200_5_3_6751_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]

Server 1 receiving:

> 2020-04-14 07:53:42,143 - synapse.federation.transport.server - 406 - INFO - PUT-181595- Received txn 1585132221531 from server2. (PDUs: 0, EDUs: 1)
> 2020-04-14 07:53:42,156 - synapse.access.http.8008 - 302 - INFO - PUT-181595- 10.0.11.10 - 8008 - {server2} Processed request: 0.024sec/0.001sec (0.002sec, 0.003sec) (0.002sec/0.005sec/2) 12B 200 "PUT /_matrix/federation/v1/send/1585132221531 HTTP/1.0" "Synapse/1.12.0" [0 dbevts]
> 2020-04-14 07:53:42,221 - synapse.rest.media.v1.preview_url_resource - 404 - INFO - expire_url_cache_data-58190- Running url preview cache expiry
> 2020-04-14 07:53:42,228 - synapse.rest.media.v1.preview_url_resource - 483 - INFO - expire_url_cache_data-58190- Deleted 0 media from url cache
> 2020-04-14 07:53:43,822 - synapse.federation.transport.server - 406 - INFO - PUT-181596- Received txn 1585132221532 from server2. (PDUs: 0, EDUs: 1)
> 2020-04-14 07:53:43,837 - synapse.access.http.8008 - 302 - INFO - PUT-181596- 10.0.11.10 - 8008 - {server2} Processed request: 0.017sec/0.001sec (0.008sec, 0.004sec) (0.002sec/0.008sec/2) 12B 200 "PUT /_matrix/federation/v1/send/1585132221532 HTTP/1.0" "Synapse/1.12.0" [0 dbevts]

@richvdh
Copy link
Member

richvdh commented Apr 20, 2020

looks like server2 isn't sending the events to server1 for some reason.

can you enable DEBUG logging for synapse.federation.sender and then sharing the logs from the sending server.

it's also worth grepping the logs for ERROR and CRITICAL lines to see if anything looks odd.

@richvdh
Copy link
Member

richvdh commented Apr 20, 2020

... "PUT /_matrix/federation/v1/send/1585132221531 HTTP/1.0" 

this is odd. have you got a reverse-proxy which is converting HTTP/1.1 requests to HTTP/1.0?

@ErrorProne
Copy link
Author

... "PUT /_matrix/federation/v1/send/1585132221531 HTTP/1.0" 

this is odd. have you got a reverse-proxy which is converting HTTP/1.1 requests to HTTP/1.0?

That is nginx default for proxy connections (https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_http_version)
So yeah, there are actually two nginx proxies in front. One in the DMZ and one on the VM itself to separate the slack-bridge from the matrix requests.

The admin of server2 changed the configuration as per your recommendation. As we found out before messages started flowing again after the restart, but we'll try until it breaks again.

@ErrorProne
Copy link
Author

ErrorProne commented Apr 29, 2020

Today it finally broke again.

Here are the logs with debug enabled for synapse.federation.sender on server2.

First I (user1) have sent a direct message to server 2, and than user2 on server 2 instantly tried to respond, which did not come through (also no read or typing notifications). If we would restart server2 now, it would work again and also old messages would be re-sent with the next new new message user 2 writes in the direct channel.

Server1 log (Can send, but does not receive)

> 2020-04-29 07:32:58,666 - synapse.access.http.8008 - 302 - INFO - GET-870110- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 0.905sec/0.003sec (0.008sec, 0.001sec) (0.000sec/0.000sec/0) 759B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s340843_1120144_41282_99343_162034_64_247_4474_57 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/81.0.4044.122 Chrome/81.0.4044.122 Safari/537.36" [0 dbevts]
> 2020-04-29 07:32:58,668 - synapse.access.http.8008 - 302 - INFO - PUT-870111- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 0.013sec/0.002sec (0.005sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "PUT /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX%3Aserver1/typing/%40user1%3Aserver1 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/81.0.4044.122 Chrome/81.0.4044.122 Safari/537.36" [0 dbevts]
> 2020-04-29 07:32:58,671 - synapse.federation.sender.transaction_manager - 113 - INFO - federation_transaction_transmission_loop-30313- TX [server2] {1586268953496} Sending transaction [1586268953496], (PDUs: 0, EDUs: 1)
> 2020-04-29 07:32:58,672 - synapse.http.matrixfederationclient - 408 - INFO - federation_transaction_transmission_loop-30313- {PUT-O-42488} [server2] Sending request: PUT matrix://server2/_matrix/federation/v1/send/1586268953496; timeout 60.000000s
> 2020-04-29 07:32:58,691 - synapse.http.federation.matrix_federation_agent - 242 - INFO - federation_transaction_transmission_loop-30313- Connecting to server2:8448
> 2020-04-29 07:32:58,724 - synapse.metrics - 464 - INFO - - Collecting gc 1
> 2020-04-29 07:32:58,794 - synapse.http.matrixfederationclient - 442 - INFO - federation_transaction_transmission_loop-30313- {PUT-O-42488} [server2] Got response headers: 200 OK
> 2020-04-29 07:32:58,796 - synapse.http.matrixfederationclient - 164 - INFO - federation_transaction_transmission_loop-30313- {PUT-O-42488} [server2] Completed: 200 OK
> 2020-04-29 07:32:58,796 - synapse.federation.sender.transaction_manager - 146 - INFO - federation_transaction_transmission_loop-30313- TX [server2] {1586268953496} got 200 response
> 2020-04-29 07:32:59,962 - synapse.federation.sender.transaction_manager - 113 - INFO - federation_transaction_transmission_loop-30314- TX [server2] {1586268953497} Sending transaction [1586268953497], (PDUs: 0, EDUs: 1)
> 2020-04-29 07:32:59,963 - synapse.http.matrixfederationclient - 408 - INFO - federation_transaction_transmission_loop-30314- {PUT-O-42489} [server2] Sending request: PUT matrix://server2/_matrix/federation/v1/send/1586268953497; timeout 60.000000s
> 2020-04-29 07:32:59,974 - synapse.access.http.8008 - 302 - INFO - GET-870112- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 1.256sec/0.003sec (0.008sec, 0.000sec) (0.000sec/0.000sec/0) 733B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s340843_1120144_41283_99343_162034_64_247_4474_57 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/81.0.4044.122 Chrome/81.0.4044.122 Safari/537.36" [0 dbevts]
> 2020-04-29 07:32:59,975 - synapse.access.http.8008 - 302 - INFO - PUT-870113- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 0.015sec/0.002sec (0.004sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "PUT /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX%3Aserver1/typing/%40user1%3Aserver1 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/81.0.4044.122 Chrome/81.0.4044.122 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:00,017 - synapse.http.matrixfederationclient - 442 - INFO - federation_transaction_transmission_loop-30314- {PUT-O-42489} [server2] Got response headers: 200 OK
> 2020-04-29 07:33:00,019 - synapse.http.matrixfederationclient - 164 - INFO - federation_transaction_transmission_loop-30314- {PUT-O-42489} [server2] Completed: 200 OK
> 2020-04-29 07:33:00,019 - synapse.federation.sender.transaction_manager - 146 - INFO - federation_transaction_transmission_loop-30314- TX [server2] {1586268953497} got 200 response
> 2020-04-29 07:33:00,092 - synapse.access.http.8008 - 302 - INFO - PUT-870115- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 0.066sec/0.001sec (0.001sec, 0.003sec) (0.002sec/0.006sec/3) 60B 200 "PUT /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX%3Aserver1/send/m.room.message/m1588145579925.49 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/81.0.4044.122 Chrome/81.0.4044.122 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:00,109 - synapse.access.http.8008 - 302 - INFO - GET-870114- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 0.093sec/0.004sec (0.008sec, 0.001sec) (0.003sec/0.008sec/1) 1336B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s340843_1120144_41284_99343_162034_64_247_4474_57 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/81.0.4044.122 Chrome/81.0.4044.122 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:00,123 - synapse.federation.sender.transaction_manager - 113 - INFO - federation_transaction_transmission_loop-30315- TX [server2] {1586268953498} Sending transaction [1586268953498], (PDUs: 1, EDUs: 0)
> 2020-04-29 07:33:00,125 - synapse.http.matrixfederationclient - 408 - INFO - federation_transaction_transmission_loop-30315- {PUT-O-42490} [server2] Sending request: PUT matrix://server2/_matrix/federation/v1/send/1586268953498; timeout 60.000000s
> 2020-04-29 07:33:00,228 - synapse.access.http.8008 - 302 - INFO - POST-870116- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 0.014sec/0.001sec (0.000sec, 0.001sec) (0.001sec/0.009sec/2) 6008B 200 "POST /_matrix/client/r0/keys/query HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/81.0.4044.122 Chrome/81.0.4044.122 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:00,247 - synapse.http.matrixfederationclient - 442 - INFO - federation_transaction_transmission_loop-30315- {PUT-O-42490} [server2] Got response headers: 200 OK
> 2020-04-29 07:33:00,248 - synapse.http.matrixfederationclient - 164 - INFO - federation_transaction_transmission_loop-30315- {PUT-O-42490} [server2] Completed: 200 OK
> 2020-04-29 07:33:00,249 - synapse.federation.sender.transaction_manager - 146 - INFO - federation_transaction_transmission_loop-30315- TX [server2] {1586268953498} got 200 response
> 2020-04-29 07:33:00,480 - synapse.access.http.8008 - 302 - INFO - POST-870118- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 0.018sec/0.004sec (0.008sec, 0.001sec) (0.002sec/0.008sec/3) 2B 200 "POST /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX%3Aserver1/read_markers HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/81.0.4044.122 Chrome/81.0.4044.122 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:00,493 - synapse.access.http.8008 - 302 - INFO - GET-870117- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 0.273sec/0.001sec (0.010sec, 0.001sec) (0.003sec/0.004sec/2) 781B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s340844_1120144_41284_99343_162034_64_247_4474_57 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/81.0.4044.122 Chrome/81.0.4044.122 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:01,950 - synapse.federation.transport.server - 406 - INFO - PUT-870120- Received txn 1587630318384 from matrix.gambio-server.net. (PDUs: 0, EDUs: 1)
> 2020-04-29 07:33:01,979 - synapse.access.http.8008 - 302 - INFO - GET-870119- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 1.433sec/0.015sec (0.004sec, 0.000sec) (0.000sec/0.000sec/0) 498B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s340844_1120144_41284_99343_162035_64_247_4474_57 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/81.0.4044.122 Chrome/81.0.4044.122 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:01,987 - synapse.access.http.8008 - 302 - INFO - PUT-870120- 10.0.11.10 - 8008 - {matrix.gambio-server.net} Processed request: 0.042sec/0.000sec (0.009sec, 0.002sec) (0.004sec/0.011sec/4) 12B 200 "PUT /_matrix/federation/v1/send/1587630318384 HTTP/1.1" "Synapse/1.12.3" [0 dbevts]
> 2020-04-29 07:33:02,210 - synapse.federation.transport.server - 406 - INFO - PUT-870125- Received txn 1587630318385 from matrix.gambio-server.net. (PDUs: 0, EDUs: 1)
> 2020-04-29 07:33:02,238 - synapse.rest.media.v1.preview_url_resource - 404 - INFO - expire_url_cache_data-187666- Running url preview cache expiry
> 2020-04-29 07:33:02,249 - synapse.access.http.8008 - 302 - INFO - GET-870122- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 0.188sec/0.024sec (0.004sec, 0.000sec) (0.000sec/0.000sec/0) 517B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s340844_1120145_41284_99343_162035_64_247_4474_57 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/81.0.4044.122 Chrome/81.0.4044.122 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:02,254 - synapse.access.http.8008 - 302 - INFO - PUT-870125- 10.0.11.10 - 8008 - {matrix.gambio-server.net} Processed request: 0.046sec/0.000sec (0.006sec, 0.001sec) (0.003sec/0.014sec/3) 12B 200 "PUT /_matrix/federation/v1/send/1587630318385 HTTP/1.1" "Synapse/1.12.3" [0 dbevts]
> 2020-04-29 07:33:02,260 - synapse.rest.media.v1.preview_url_resource - 483 - INFO - expire_url_cache_data-187666- Deleted 0 media from url cache
> 2020-04-29 07:33:03,620 - synapse.access.http.8008 - 302 - INFO - GET-870127- 10.0.11.10 - 8008 - {@user1:server1} Processed request: 1.317sec/0.005sec (0.009sec, 0.000sec) (0.000sec/0.000sec/0) 828B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s340844_1120146_41284_99343_162035_64_247_4474_57 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/81.0.4044.122 Chrome/81.0.4044.122 Safari/537.36" [0 dbevts]

Server2 log (Did receive all, but no messages are flowing back)

> 2020-04-29 07:32:48,958 - synapse.access.http.8008 - 302 - INFO - GET-39300 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 10.123sec/0.003sec (0.012sec, 0.000sec) (0.000sec/0.000sec/0) 711B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s10162_290056_721_11423_79_5_3_7771_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:32:48,959 - synapse.access.http.8008 - 302 - INFO - PUT-39302 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.018sec/0.002sec (0.011sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "PUT /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX:server1/typing/@user2:server2 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:32:49,003 - synapse.access.http.8008 - 302 - INFO - OPTIONS-39303 - 31.17.249.107 - 8008 - {None} Processed request: 0.001sec/0.001sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/sync?filter=0&timeout=30000&since=s10162_290056_722_11423_79_5_3_7771_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:32:58,774 - synapse.federation.transport.server - 406 - INFO - PUT-39305 - Received txn 1586268953496 from server1. (PDUs: 0, EDUs: 1)
> 2020-04-29 07:32:58,778 - synapse.handlers.typing - 237 - INFO - PUT-39305 - Got typing update from @user1:server1: {'room_id': '!nmOxkCqCxiHFcJqVrX:server1', 'typing': True, 'user_id': '@user1:server1'}
> 2020-04-29 07:32:58,785 - synapse.access.http.8008 - 302 - INFO - GET-39304 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 9.746sec/0.003sec (0.008sec, 0.000sec) (0.000sec/0.000sec/0) 737B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s10162_290056_722_11423_79_5_3_7771_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:32:58,787 - synapse.access.http.8008 - 302 - INFO - PUT-39305 - 194.95.62.117 - 8008 - {server1} Processed request: 0.020sec/0.000sec (0.010sec, 0.000sec) (0.004sec/0.004sec/3) 12B 200 "PUT /_matrix/federation/v1/send/1586268953496 HTTP/1.0" "Synapse/1.12.3" [0 dbevts]
> 2020-04-29 07:32:58,829 - synapse.access.http.8008 - 302 - INFO - OPTIONS-39306 - 31.17.249.107 - 8008 - {None} Processed request: 0.000sec/0.000sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/sync?filter=0&timeout=30000&since=s10162_290056_723_11423_79_5_3_7771_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:32:59,984 - synapse.federation.transport.server - 406 - INFO - PUT-39308 - Received txn 1586268953497 from server1. (PDUs: 0, EDUs: 1)
> 2020-04-29 07:32:59,993 - synapse.handlers.typing - 237 - INFO - PUT-39308 - Got typing update from @user1:server1: {'room_id': '!nmOxkCqCxiHFcJqVrX:server1', 'typing': False, 'user_id': '@user1:server1'}
> 2020-04-29 07:33:00,005 - synapse.access.http.8008 - 302 - INFO - GET-39307 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 1.144sec/0.006sec (0.007sec, 0.000sec) (0.000sec/0.000sec/0) 711B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s10162_290056_723_11423_79_5_3_7771_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:00,008 - synapse.access.http.8008 - 302 - INFO - PUT-39308 - 194.95.62.117 - 8008 - {server1} Processed request: 0.027sec/0.000sec (0.013sec, 0.000sec) (0.002sec/0.006sec/2) 12B 200 "PUT /_matrix/federation/v1/send/1586268953497 HTTP/1.0" "Synapse/1.12.3" [0 dbevts]
> 2020-04-29 07:33:00,051 - synapse.access.http.8008 - 302 - INFO - OPTIONS-39309 - 31.17.249.107 - 8008 - {None} Processed request: 0.001sec/0.001sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/sync?filter=0&timeout=30000&since=s10162_290056_724_11423_79_5_3_7771_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:00,137 - synapse.federation.transport.server - 406 - INFO - PUT-39311 - Received txn 1586268953498 from server1. (PDUs: 1, EDUs: 0)
> 2020-04-29 07:33:00,147 - synapse.handlers.federation - 194 - INFO - PUT-39311-$ZfnYHy5QusO2J6fFhyVqJhLmBpMpURpJ5URTETEMqw0 - handling received PDU: <FrozenEventV3 event_id='$ZfnYHy5QusO2J6fFhyVqJhLmBpMpURpJ5URTETEMqw0', type='m.room.message', state_key='None'>
> 2020-04-29 07:33:00,232 - synapse.federation.sender - 164 - DEBUG - process_event_queue_for_federation-315 - Handling 10162 -> 10163
> 2020-04-29 07:33:00,240 - synapse.access.http.8008 - 302 - INFO - PUT-39311 - 194.95.62.117 - 8008 - {server1} Processed request: 0.097sec/0.008sec (0.019sec, 0.001sec) (0.006sec/0.013sec/6) 62B 200 "PUT /_matrix/federation/v1/send/1586268953498 HTTP/1.0" "Synapse/1.12.3" [1 dbevts]
> 2020-04-29 07:33:00,244 - synapse.push.httppusher - 195 - INFO - httppush.process-46 - Processing 1 unprocessed push actions for @user2:server2/im.vector.app.android/cu04JX1a3so:APA91bGko9o762To6cI6Sg31_S8kmo2qXbE54uIG3Cq4GXIMXfzx8TIeSCpHka6T6_8fIcgM6qb_c07pe9EpWrPqwtErBEBe1ok9z3QEEiJ4uDmZX2q9uo4FA6lRzOhB9rbg15t3OIpj starting at stream_ordering 10160
> 2020-04-29 07:33:00,246 - synapse.http.client - 281 - INFO - httppush.process-46 - Sending request POST https://matrix.org/_matrix/push/v1/notify
> 2020-04-29 07:33:00,251 - synapse.access.http.8008 - 302 - INFO - GET-39310 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.161sec/0.007sec (0.011sec, 0.000sec) (0.006sec/0.013sec/2) 897B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s10162_290056_724_11423_79_5_3_7771_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:00,255 - synapse.federation.sender - 164 - DEBUG - process_event_queue_for_federation-315 - Handling 10163 -> 10163
> 2020-04-29 07:33:00,324 - synapse.access.http.8008 - 302 - INFO - OPTIONS-39312 - 31.17.249.107 - 8008 - {None} Processed request: 0.000sec/0.001sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/sync?filter=0&timeout=30000&since=s10163_290056_724_11423_79_5_3_7771_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:00,397 - synapse.http.client - 318 - INFO - httppush.process-46 - Received response to POST https://matrix.org/_matrix/push/v1/notify: 200
> 2020-04-29 07:33:00,767 - synapse.access.http.8008 - 302 - INFO - OPTIONS-39314 - 31.17.249.107 - 8008 - {None} Processed request: 0.000sec/0.000sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX:server1/send/m.room.message/m1588145580713.2 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:00,816 - synapse.metrics - 464 - INFO -  - Collecting gc 1
> 2020-04-29 07:33:00,868 - synapse.access.http.8008 - 302 - INFO - PUT-39315 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.065sec/0.005sec (0.013sec, 0.000sec) (0.002sec/0.005sec/3) 60B 200 "PUT /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX:server1/send/m.room.message/m1588145580713.2 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:00,884 - synapse.access.http.8008 - 302 - INFO - GET-39313 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.524sec/0.006sec (0.011sec, 0.001sec) (0.003sec/0.004sec/1) 1378B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s10163_290056_724_11423_79_5_3_7771_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:00,887 - synapse.federation.sender - 164 - DEBUG - process_event_queue_for_federation-316 - Handling 10163 -> 10164
> 2020-04-29 07:33:00,889 - synapse.federation.sender - 204 - DEBUG - process_event_queue_for_federation-316 - Sending <FrozenEventV3 event_id='$FlSDMyWVO-iVXSLfi4aAok4AUHi58zpFv_xiWczhFvc', type='m.room.message', state_key='None'> to {'server2'}
> 2020-04-29 07:33:00,889 - synapse.federation.sender - 265 - DEBUG - process_event_queue_for_federation-316 - Sending to: set()
> 2020-04-29 07:33:00,899 - synapse.access.http.8008 - 302 - INFO - GET-39316 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.030sec/0.000sec (0.010sec, 0.001sec) (0.003sec/0.005sec/4) 7458B 200 "GET /_matrix/client/r0/sync?filter=0&set_presence=offline&timeout=0&since=s10160_290014_716_11423_73_5_3_7771_10 HTTP/1.1" "Riot.im/0.9.10 (Linux; U; Android 8.0.0; XT1635-02 Build/OPNS27.76-12-22-; Flavour GooglePlay; MatrixAndroidSDK 0.9.33)" [0 dbevts]
> 2020-04-29 07:33:00,904 - synapse.federation.sender - 164 - DEBUG - process_event_queue_for_federation-316 - Handling 10164 -> 10164
> 2020-04-29 07:33:00,952 - synapse.access.http.8008 - 302 - INFO - OPTIONS-39317 - 31.17.249.107 - 8008 - {None} Processed request: 0.000sec/0.002sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/keys/query HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:00,954 - synapse.access.http.8008 - 302 - INFO - OPTIONS-39318 - 31.17.249.107 - 8008 - {None} Processed request: 0.000sec/0.003sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/sync?filter=0&timeout=30000&since=s10164_290056_724_11423_79_5_3_7771_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:00,996 - synapse.access.http.8008 - 302 - INFO - POST-39319 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.015sec/0.001sec (0.005sec, 0.000sec) (0.002sec/0.003sec/2) 3344B 200 "POST /_matrix/client/r0/keys/query HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:01,251 - synapse.access.http.8008 - 302 - INFO - OPTIONS-39321 - 31.17.249.107 - 8008 - {None} Processed request: 0.001sec/0.001sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX:server1/read_markers HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:01,336 - synapse.http.client - 281 - INFO - http_pusher.on_new_receipts-59 - Sending request POST https://matrix.org/_matrix/push/v1/notify
> 2020-04-29 07:33:01,338 - synapse.access.http.8008 - 302 - INFO - GET-39320 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.344sec/0.003sec (0.006sec, 0.000sec) (0.004sec/0.009sec/3) 808B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s10164_290056_724_11423_79_5_3_7771_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:01,341 - synapse.access.http.8008 - 302 - INFO - POST-39322 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.050sec/0.000sec (0.011sec, 0.001sec) (0.006sec/0.022sec/6) 2B 200 "POST /_matrix/client/r0/rooms/!nmOxkCqCxiHFcJqVrX:server1/read_markers HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:01,393 - synapse.access.http.8008 - 302 - INFO - OPTIONS-39323 - 31.17.249.107 - 8008 - {None} Processed request: 0.001sec/0.001sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/sync?filter=0&timeout=30000&since=s10164_290056_724_11424_79_5_3_7771_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:01,453 - synapse.access.http.8008 - 302 - INFO - GET-39324 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.019sec/0.000sec (0.014sec, 0.001sec) (0.001sec/0.002sec/2) 2565B 200 "GET /_matrix/client/r0/sync?filter=0&timeout=30000&since=s10164_290056_724_11424_79_5_3_7771_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:01,499 - synapse.access.http.8008 - 302 - INFO - OPTIONS-39325 - 31.17.249.107 - 8008 - {None} Processed request: 0.001sec/0.001sec (0.001sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "OPTIONS /_matrix/client/r0/sync?filter=0&timeout=30000&since=s10164_290056_724_11424_80_5_3_7771_10 HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.141 Safari/537.36" [0 dbevts]
> 2020-04-29 07:33:02,005 - synapse.http.client - 318 - INFO - http_pusher.on_new_receipts-59 - Received response to POST https://matrix.org/_matrix/push/v1/notify: 200
> 2020-04-29 07:33:02,098 - synapse.access.http.8008 - 302 - INFO - GET-39327 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.013sec/0.000sec (0.007sec, 0.000sec) (0.001sec/0.002sec/2) 2711B 200 "GET /_matrix/client/r0/sync?filter=0&set_presence=offline&timeout=0&since=s10164_290056_724_11423_79_5_3_7771_10 HTTP/1.1" "Riot.im/0.9.10 (Linux; U; Android 8.0.0; XT1635-02 Build/OPNS27.76-12-22-; Flavour GooglePlay; MatrixAndroidSDK 0.9.33)" [0 dbevts]
> 2020-04-29 07:33:02,415 - synapse.access.http.8008 - 302 - INFO - POST-39329 - 31.17.249.107 - 8008 - {@user2:server2} Processed request: 0.005sec/0.000sec (0.004sec, 0.000sec) (0.001sec/0.002sec/2) 3344B 200 "POST /_matrix/client/unstable/keys/query HTTP/1.1" "Riot.im/0.9.10 (Linux; U; Android 8.0.0; XT1635-02 Build/OPNS27.76-12-22-; Flavour GooglePlay; MatrixAndroidSDK 0.9.33)" [0 dbevts]
> 2020-04-29 07:33:15,795 - synapse.handlers.presence - 256 - INFO - persist_presence_changes-12502 - Persisting 1 unpersisted presence updates

@richvdh richvdh changed the title Messages did not arrive after an upgrade to 1.11.1+bionic1 Messages not being sent over federation Apr 29, 2020
@richvdh
Copy link
Member

richvdh commented Apr 29, 2020

> 2020-04-29 07:33:00,887 - synapse.federation.sender - 164 - DEBUG - process_event_queue_for_federation-316 - Handling 10163 -> 10164
> 2020-04-29 07:33:00,889 - synapse.federation.sender - 204 - DEBUG - process_event_queue_for_federation-316 - Sending <FrozenEventV3 event_id='$FlSDMyWVO-iVXSLfi4aAok4AUHi58zpFv_xiWczhFvc', type='m.room.message', state_key='None'> to {'server2'}
> 2020-04-29 07:33:00,889 - synapse.federation.sender - 265 - DEBUG - process_event_queue_for_federation-316 - Sending to: set()

For some reason, server2 has decided that server1 is no longer in the room :/.

Figuring out why is going to take some deep digging. Is the problem limited to a small number of rooms? Perhaps just creating new rooms is a viable workaround.

Otherwise, set up a manhole and prepare for the long haul...

@ErrorProne
Copy link
Author

ErrorProne commented Apr 29, 2020

Figuring out why is going to take some deep digging. Is the problem limited to a small number of rooms? Perhaps just creating new rooms is a viable workaround.

The pattern we've seen so far is that only direct chats are impacted by this.

Workarounds right now are restarting server2 (or server3, there is another one behaving exactly the same but with >50 people on it, so the logs are quite exhausting) or re-creating the room.

Given that there is another server acting the same way, could the problem be with server1?
I'm just not sure why restarting server2/3 than fixes the problem for the restarted server.

If we can provide any insights with more debug output etc. we are happy to help =), sadly it always takes a while until a room brakes again.

Side note which comes into my mind right now: Some direct chats with server2 and server3 did break in the same timeframe. However, most of the direct chats remained working.

@richvdh
Copy link
Member

richvdh commented Apr 29, 2020

do you have a manhole configured on each of the affected servers? if not, set one up, so that we can do some digging next time it happens. Docs are in the docs directory.

@ErrorProne
Copy link
Author

We'll set one up at server1 and server2. Could you provide us with the commands and information we should get through the manhole as soon as a room breaks again?

@richvdh
Copy link
Member

richvdh commented Apr 29, 2020

suggest once it happens you dm me at @richvdh:sw1v.org and I'll work through it with you

@ErrorProne
Copy link
Author

Sounds like a plan. We'll contact you. Thank you for all the effort =)

@richvdh
Copy link
Member

richvdh commented Apr 30, 2020

After investigation: it appears that this is due to custom status events corrupting the get_joined_users cache in synapse. The original bug was introduced by #2229, but has gone unnoticed until custom statuses were used.

@richvdh richvdh added z-bug (Deprecated Label) and removed info-needed labels Apr 30, 2020
richvdh added a commit that referenced this issue Apr 30, 2020
Fix a bug where the `get_joined_users` cache could be corrupted by custom
status events (or other state events with a state_key matching the user ID).

Fixes: #7099.
@richvdh
Copy link
Member

richvdh commented Apr 30, 2020

(many thanks to @ErrorProne and team for their patience in tracking this one down)

@richvdh
Copy link
Member

richvdh commented May 12, 2020

I'm labelling this as a release-blocker because it's becoming more of an issue now that custom statuses are a thing, and the fix is trivial (#7376)

@ErrorProne
Copy link
Author

I'm labelling this as a release-blocker because it's becoming more of an issue now that custom statuses are a thing, and the fix is trivial (#7376)

Great to hear, thank you again =)

@richvdh richvdh linked a pull request May 12, 2020 that will close this issue
richvdh added a commit that referenced this issue May 14, 2020
Fix a bug where the `get_joined_users` cache could be corrupted by custom
status events (or other state events with a state_key matching the user ID).

The bug was introduced by #2229, but has largely gone unnoticed since then.

Fixes #7099, #7373.
@richvdh richvdh closed this as completed May 14, 2020
phil-flex pushed a commit to phil-flex/synapse that referenced this issue Jun 16, 2020
)

Fix a bug where the `get_joined_users` cache could be corrupted by custom
status events (or other state events with a state_key matching the user ID).

The bug was introduced by matrix-org#2229, but has largely gone unnoticed since then.

Fixes matrix-org#7099, matrix-org#7373.
@TitanSpirit
Copy link

I had exactly the same problem, however I fixed it via changing my config on my apache reverse proxy.
This helped me solving direct message problem. https://github.com/matrix-org/synapse/blob/develop/docs/reverse_proxy.md
Add first line with AllowEncodedSlashes, add "nocanon" on the end of ProxyPass and add SecRuleEngine off. It fixed problem for me in my case.


    <IfModule security2_module>
           SecRuleEngine off
    </IfModule>

    AllowEncodedSlashes NoDecode
    ProxyPass /_matrix http://127.0.0.1:8008/_matrix nocanon
    ProxyPassReverse /_matrix http://127.0.0.1:8008/_matrix

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
z-bug (Deprecated Label)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants