
New Cluster Nodes Unable to Contact Webseed in TLS-Enabled Cluster Setup #964

Open
abhishek-das-gupta opened this issue Aug 6, 2024 · 26 comments

Comments

@abhishek-das-gupta

abhishek-das-gupta commented Aug 6, 2024

Overview

Adding new hosts to a TLS-enabled cluster fails because each new node must first receive a 14 GB file, distributed by the BitTorrent client running on these hosts, and this torrent process gets stuck indefinitely.

Architecture

Cluster Architecture

Within our cluster, we have a master node and worker nodes that report the cluster's state to the master. The master generates the .torrent file, which is a trackerless torrent file. The master somewhat acts as a tracker, providing each peer with information about other peers to communicate with during torrenting.

Torrent Architecture

Torrent Process During Fresh Cluster Install

This is the process followed during a fresh cluster setup:

  • The master node first downloads (HTTP fetch) the parcel from a remote server, then acts as a web seed.
  • Each peer gets other peers' information (host IP, port) using the heartbeat response from the master node. Each peer then calls the AddPeers() API from the torrent client. This AddPeers() API call happens after every heartbeat response from the master to the worker peer.
  • Torrenting starts between the peers (master + workers).
  • There is fallback logic that if the parcel download doesn't complete via BitTorrent in a certain time, the fallback mechanism is to do an HTTP download from the web seed.

Torrent Process During New Host Addition in Existing Cluster

This is the general flow of how new hosts are added in an existing cluster:

  • A set of new hosts getting added to the cluster install the Anacrolix/torrent binary and the libtorrent binary. By default, the Anacrolix/torrent client process runs.
  • These new peers/hosts/nodes contact the web seed (master node) present in the existing cluster (TLS enabled or not) to download the 14 GB file.
  • Simultaneously, these new peers start distributing pieces of this 14 GB file among each other.

Scenarios with New Host(s) Addition

Without TLS Enabled on the Existing Cluster

  • Whether Anacrolix/Torrent or libtorrent is enabled on the new hosts being added, the 14 GB file gets distributed quickly, and these new hosts are added to the cluster.
  • If Anacrolix/Torrent is used, during torrenting of the 14 GB parcel, these peers have web seed information in their statuses:
webseeds:
- CLOSED: http://ccycloud-1.b-135-no-tls.root.comops.site:7180/cmf/parcel/download/CDH-7.2.18-1.cdh7.2.18.p0.51297892-el8.parcel
  last unhandled error: never
  bep40-prio: e97fd7f2
  last msg: never, connected: never, last helpful: 147.05s ago, itime: 2m41.004987105s, etime: 13.875250838s
  1669/1669 completed, 0 pieces touched, good chunks: 40889/40889:0 reqq: 0+0/(84/128):0/1024, flags: i:WS:, dr: 47132.0 KiB/s
  requested pieces:

With TLS Enabled in the Existing Cluster

Case #1: Libtorrent Client Process Runs on the New Hosts

The 14 GB file gets distributed within a few minutes.

Case #2: Anacrolix/Torrent Process Runs on the New Hosts

The 14 GB file distribution gets stuck on these new nodes because none of the new peers can contact the web seed (master node) present in the existing cluster. The web seed section of full-status is empty:

webseeds:  <--- no web seed
2 peer conns:
- 10.140.93.137:51680-10.140.40.8:7191
  peer id: "-GT0003-\xb3.\x9epQ\xd6LG\x03\xad\xce8"
  extensions: 0000000000100005 (ltep, fast, dht)
  ltep extensions: map[ut_holepunch:2 ut_metadata:1 ut_pex:3]
  pex: 2 conns, 0 unsent events
  bep40-prio: e8a31f71
  last msg: 26.36s ago, connected: 86.37s ago, last helpful: never, itime: 0s, etime: 0s
  0/1669 completed, 0 pieces touched, good chunks: 0/0:0 reqq: 0+0/(1/1024):0/1024, flags: :M,e,v1:, dr: 0.0 KiB/s
  requested pieces:
- 10.140.93.137:7191-10.140.24.8:43468
  peer id: "-GT0003-\xfc\x93{w:\x94~\x8f\x13\u0671\x1b"
  extensions: 0000000000100005 (ltep, fast, dht)
  ltep extensions: map[ut_holepunch:2 ut_metadata:1 ut_pex:3]
  pex: 2 conns, 0 unsent events
  bep40-prio: d766eef0
  last msg: 86.29s ago, connected: 86.29s ago, last helpful: never, itime: 0s, etime: 0s
  0/1669 completed, 0 pieces touched, good chunks: 0/0:0 reqq: 0+0/(1/1024):0/1024, flags: :I,e,v1:, dr: 0.0 KiB/s
  requested pieces:

Hi @anacrolix, can you please provide pointers on why this API, http://<master-node>:<TLS-port>/cmf/parcel/download/<file-to-download>, is not reachable from a peer trying to contact the web seed?

@anacrolix
Owner

At a guess the webseed URL doesn't conform to the BEP for a multi file torrent. Make sure it's a single file torrent if you're going to specify a URL to a single file. You could also put a panic in where it's closing the webseed to find out its reasoning.

@abhishek-das-gupta
Author

At a guess the webseed URL doesn't conform to the BEP for a multi file torrent. Make sure it's a single file torrent if you're going to specify a URL to a single file

It is a single file of size 14GB. To be more accurate it is "gzip compressed data" of 14GB.

You could also put a panic in where it's closing the webseed to find out its reasoning.

Can you please provide more info on this? Where should I add more logs?

@anacrolix
Owner

Could you provide the metainfo here?

I'll get back to you on the close thing tomorrow.

@abhishek-das-gupta
Author

abhishek-das-gupta commented Aug 6, 2024

Here is the metainfo (.torrent)

Torrent name: <some-parcel>.parcel
Announced at: Seems to be trackerless
Created on..: Mon Aug 05 12:06:49 UTC 2024
Created by..: cm-server
Pieces......: 1669 piece(s) (8388608 byte(s)/piece)
Total size..: 13,997,539,212 byte(s)
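
As a quick sanity check on these numbers, the piece count and the byte range covered by each piece follow directly from the total size and piece length. The sketch below uses hypothetical helper names (`pieceCount`, `pieceRange`) and assumes a single-file torrent, where a piece index maps straight onto a byte offset in the file.

```go
package main

import "fmt"

// pieceCount returns how many pieces a torrent of totalSize bytes has.
func pieceCount(totalSize, pieceLen int64) int64 {
	return (totalSize + pieceLen - 1) / pieceLen
}

// pieceRange returns the inclusive byte range [start, end] of piece index i,
// in the form an HTTP Range header expects. The last piece may be shorter.
func pieceRange(i, totalSize, pieceLen int64) (start, end int64) {
	start = i * pieceLen
	end = start + pieceLen - 1
	if end >= totalSize {
		end = totalSize - 1
	}
	return start, end
}

func main() {
	const totalSize, pieceLen = 13997539212, 8388608 // values from the metainfo above
	n := pieceCount(totalSize, pieceLen)
	start, end := pieceRange(n-1, totalSize, pieceLen)
	fmt.Printf("pieces=%d last=bytes=%d-%d (%d bytes)\n", n, start, end, end-start+1)
}
```

With the values above this gives 1669 pieces, matching the metainfo, with a final piece of 5,341,068 bytes.
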

@anacrolix
Owner

Feel free to email it to me. Specifically I want to check the structure of the internal fields as that affects how webseeding works.

@abhishek-das-gupta
Author

Thanks! Sending it to you now. One thing though:

There is fallback logic that if the parcel download doesn't complete via BitTorrent in a certain time, the fallback mechanism is to do an HTTP download from the web seed.

Once this timeout occurs, the t.AddWebSeed() API is called with the URL http://<master-node>:<TLS-port>/cmf/parcel/download/<file-to-download>

@anacrolix
Owner

The info checks out (it is a single file, but the URL should also be fine).

I need to find out why the webseed peer is being closed. There should only be two ways: It's banned, or the torrent closes.

There should be copious logging calling out why, or you can put a panic here:

ws.peer.logger.Levelf(log.Debug, "closing")

@anacrolix
Owner

There is a sort of integration test in a semi-formed state that could help with this once we have a better reason.

@abhishek-das-gupta
Author

abhishek-das-gupta commented Aug 8, 2024

I need to determine why the webseed peer is being closed.

I'm not sure if you're looking at the correct case I mentioned. Apologies for any confusion. Here is the issue more clearly explained (copied from post above):

Case: Anacrolix/Torrent Process Runs on the New Hosts with existing cluster having TLS

The 14 GB file distribution gets stuck on these new nodes because none of the new peers can contact the web seed (master node) present in the existing cluster. In the web seed section of full-status, it is empty:

webseeds:  <--- no web seed
2 peer conns:
- 10.140.93.137:51680-10.140.40.8:7191
  peer id: "-GT0003-\xb3.\x9epQ\xd6LG\x03\xad\xce8"
  extensions: 0000000000100005 (ltep, fast, dht)
  ltep extensions: map[ut_holepunch:2 ut_metadata:1 ut_pex:3]
  pex: 2 conns, 0 unsent events
  bep40-prio: e8a31f71
  last msg: 26.36s ago, connected: 86.37s ago, last helpful: never, itime: 0s, etime: 0s
  0/1669 completed, 0 pieces touched, good chunks: 0/0:0 reqq: 0+0/(1/1024):0/1024, flags: :M,e,v1:, dr: 0.0 KiB/s
  requested pieces:
- 10.140.93.137:7191-10.140.24.8:43468
  peer id: "-GT0003-\xfc\x93{w:\x94~\x8f\x13\u0671\x1b"
  extensions: 0000000000100005 (ltep, fast, dht)
  ltep extensions: map[ut_holepunch:2 ut_metadata:1 ut_pex:3]
  pex: 2 conns, 0 unsent events
  bep40-prio: d766eef0
  last msg: 86.29s ago, connected: 86.29s ago, last helpful: never, itime: 0s, etime: 0s
  0/1669 completed, 0 pieces touched, good chunks: 0/0:0 reqq: 0+0/(1/1024):0/1024, flags: :I,e,v1:, dr: 0.0 KiB/s
  requested pieces:

I believe you might be looking at the wrong case. The CLOSED status shown below indicates that torrenting through Anacrolix completed successfully. I captured this full-status after the torrent process finished.

If Anacrolix/Torrent is used, during torrenting of the 14 GB parcel, these peers have web seed information in their statuses:
webseeds:

My main question is why the webseeds section remains empty when TLS is enabled. In the logs of the master node (the webseed), I do not see any of these new peers contacting it.

@anacrolix
Owner

I don't quite follow. If they're not able to contact the webseed, there should be errors generated telling you why.

@abhishek-das-gupta
Author

abhishek-das-gupta commented Aug 11, 2024

Hi @anacrolix,
Unfortunately, I'm not seeing any obvious log entries, either from the Anacrolix/torrent library (client) or on the master node (acting as the webseed server), that indicate a download request from the client, such as:

2024-08-09 05:59:03,201 INFO ParcelController: Parcel download request: <some-parcel> from: <web-seed-client>

For the webseed client, I've already configured it to skip server certificate verification during the torrent client setup using the cfg.WebTransport configuration:

config.WebTransport = &http.Transport{
    TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
}

This configuration works when I download the torrent file directly from the master node using a similar API (https://<master-node>:<TLS-port>/cmf/parcel/download/<file-to-download>.torrent). Here’s how the download of the .torrent file is handled from the master node:

// URL of the .torrent file served by the master node (placeholders kept as-is).
url := "https://<master-node>:<TLS-port>/cmf/parcel/download/<file-to-download>.torrent"
client := http.DefaultClient
if se.configs.AllowInsecureCerts {
    // Skip server certificate verification for this request.
    client = &http.Client{
        Transport: &http.Transport{
            TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
        },
    }
}
resp, err := client.Get(url)

This request results in a 200 response, successfully downloading the .torrent file.

However, when I provide a similar API/URL while adding the webseed (https://<master-node>:<TLS-port>/cmf/parcel/download/<file-to-download>) with cfg.WebTransport set, the download gets stuck. Note that I plan to implement TLS during BitTorrent in the future, but this isn't currently on the roadmap.

Could you please help me understand why the torrent file downloads successfully from the master node, but the webseed client fails to download from the same master node through a similar API, with InsecureSkipVerify: true set in both cases? Is there another configuration I need to pass, or is this a bug?

@anacrolix
Owner

There's no reason TLS shouldn't work; I've had it work before with webseeding in production scenarios. I think if there's a bug, it's that you're not seeing helpful log messages. I don't have much time to allocate to this at the moment, but the webseed code isn't lengthy, and some tracing through to find where things are going wrong might be worthwhile.

@anacrolix
Owner

I'm not sure WebTransport is the correct config item; unfortunately there are quite a few of them due to slight variations in how HTTP is consumed in BitTorrent that I haven't been able to merge. However, as above, you should be seeing a reason for it not working, so just fixing that isn't productive, for the project at least.

@gatisahu

gatisahu commented Aug 16, 2024

I am also using a TLS config through WebTransport. It is able to connect and send requests, but after some time I see the error below and get the following status:

Status :
webseeds:

  • CLOSED: https://######/parcel/download/some.parcel
    last unhandled error: never
    bep40-prio: fa2f2406
    last msg: never, connected: never, last helpful: 103.80s ago, itime: 3m20.45815674s, etime: 1m36.647589654s
    1645/1645 completed, 1 pieces touched, good chunks: 655/655:0 reqq: 0+0/(1/128):0/1024, flags: i:WS:, dr: 105.1 KiB/s
    requested pieces:

Error :

banning webseed peer for "https://######/parcel/download/some.parcel" for being sole dirtier of piece 6 after failed piece check  [ github.com/anacrolix/torrent   torrent.go:2458 ]

@anacrolix
Copy link
Owner

Okay, as above being banned would make sense. Is it possible your http server does not implement range requests or is serving incorrect or incomplete data?

@gatisahu

Yes, we have added response.addHeader("Accept-Ranges", "bytes");

One more thing I have observed: when we add the webseed peer and call download right away, it starts downloading. If we leave a gap of 2-3 minutes before adding the webseed, it does not start. I passed a torrent.AddWebSeedsOpt to AddWebSeeds to trace this, and I see that the torrent is not sending requests to the server.

@anacrolix
Owner

anacrolix commented Aug 18, 2024

Great. It's very likely missing a "tickle" for webseed peers if reader priorities have already been set. I should be able to statically verify that.

@gatisahu

I am also seeing this error:
error running handshook conn: main read loop: decoding message: reading message length: EOF
I think the webseed may not be using the config below:
config.WebTransport = &http.Transport{
    TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
}

@abhishek-das-gupta
Author

abhishek-das-gupta commented Aug 19, 2024

Hi @anacrolix, which configuration should be set to true to enable "Local Service Discovery"? I want to ensure cross-rack communication is possible.

@anacrolix
Owner

I've not implemented this yet. #248.

@abhishek-das-gupta
Author

Thanks for the update, @anacrolix. I have another question based on this.

In my public cloud environment, nodes are spread across different AZs/racks within a VPC network. The security group for this VPC network allows "All traffic" (all protocols, all ports) from all sources (0.0.0.0/0), which should mean that the torrent port 7191 (in my case) is open for communication across racks in the VPC network. However, when I attempted to start a connection between two nodes located in different racks, the connection was reset or closed every time.

[root@e2e-56716943-456-dl-gateway0 user]# telnet 10.80.221.8 7191
Trying 10.80.221.8...
Connected to 10.80.221.8.
Escape character is '^]'.
Connection closed by foreign host.

[root@e2e-56716943-456-dl-gateway0 user]# nc 10.80.221.8 7191
Ncat: Connection reset by peer.

The code snippet above shows that when trying to connect to the destination IP via the torrent port, the connection gets reset. Could this be because there is no LSD (Local Service Discovery) implementation within the library, which uses multicast advertisements to enable nodes to discover peers that may be able to help them with their downloads?

@anacrolix
Owner

I've pushed fixes to master that should improve webseed performance, and fix the stall that occurs if you add webseeds after adding the torrent (and some delay).

@anacrolix
Owner

I am also seeing error
error running handshook conn: main read loop: decoding message: reading message length: EOF
I think webseed may not used below config
config.WebTransport = &http.Transport{
    TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
}

I've checked this, you are setting it in the correct place.

@anacrolix
Owner

Thanks for the update, @anacrolix. I have another question based on this.

In my public cloud environment, nodes are spread across different AZs/racks within a VPC network. The security group for this VPC network allows "All traffic" (all protocols, all ports) from all sources (0.0.0.0/0), which should mean that the torrent port 7191 (in my case) is open for communication across racks in the VPC network. However, when I attempted to start a connection between two nodes located in different racks, the connection was reset or closed every time.

[root@e2e-56716943-456-dl-gateway0 user]# telnet 10.80.221.8 7191
Trying 10.80.221.8...
Connected to 10.80.221.8.
Escape character is '^]'.
Connection closed by foreign host.

[root@e2e-56716943-456-dl-gateway0 user]# nc 10.80.221.8 7191
Ncat: Connection reset by peer.

The code snippet above shows that when trying to connect to the destination IP via the torrent port, the connection gets reset. Could this be because there is no LSD (Local Service Discovery) implementation within the library, which uses multicast advertisements to enable nodes to discover peers that may be able to help them with their downloads?

This may be due to automatic blocking of internal IPs in the client. It won't be anything to do with the lack of LSD.

@anacrolix
Owner

hi @anacrolix, Unfortunately, I'm not seeing any obvious logs from either the Anacrolix/torrent library (client) or the master node's logs (acting as the webseed server) that indicate a download request from the client, such as:

Can you try running master with GO_LOG=webseed=all?

For the webseed client, I've already configured it to skip server certificate verification during the torrent client setup using the cfg.WebTransport configuration:

config.WebTransport = &http.Transport{
    TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
}

This configuration works when I download the torrent file directly from the master node using a similar API (https://<master-node>:<TLS-port>/cmf/parcel/download/<file-to-download>.torrent). Here’s how the download of the .torrent file is handled from the master node:

// URL of the .torrent file served by the master node (placeholders kept as-is).
url := "https://<master-node>:<TLS-port>/cmf/parcel/download/<file-to-download>.torrent"
client := http.DefaultClient
if se.configs.AllowInsecureCerts {
    // Skip server certificate verification for this request.
    client = &http.Client{
        Transport: &http.Transport{
            TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
        },
    }
}
resp, err := client.Get(url)

This request results in a 200 response, successfully downloading the .torrent file.

However, when I provide a similar API/URL while adding the webseed (https://<master-node>:<TLS-port>/cmf/parcel/download/<file-to-download>) with cfg.WebTransport set, the download gets stuck. Note that I plan to implement TLS during BitTorrent in the future, but this isn't currently on the roadmap.

Could you please help me understand why the torrent file downloads successfully from the master node, but the webseed client fails to download from the master node when using the similar API and both having the InsecureSkipVerify: true set ? Is there another configuration I need to pass? Or is this a bug?

Maybe take a look at

torrent/client.go

Lines 220 to 228 in f471182

if cl.httpClient.Transport == nil {
    cl.httpClient.Transport = &http.Transport{
        Proxy:       cfg.HTTPProxy,
        DialContext: cfg.HTTPDialContext,
        // I think this value was observed from some webseeds. It seems reasonable to extend it
        // to other uses of HTTP from the client.
        MaxConnsPerHost: 10,
    }
}

I still don't see why you wouldn't get errors though, although the recent fixes may explain that. Let me know how the logging and fixes mentioned at the top of this comment go.

@anacrolix
Owner

Is there any update on this?
