
fix: port collision on large amount of tests #257

Merged
merged 1 commit on Jan 10, 2023

Conversation

ChaoticTempest
Member

Port collisions would rarely happen, but they occur pretty frequently on a machine without enough resources to run a lot of nodes. This fix remedies the problem by adding lockfiles that hold each port until the server has actually started and fully acquired it.

Not sure if this is the best way of achieving this, so feel free to suggest anything different.

@DavidM-D
Contributor

DavidM-D commented Jan 9, 2023

This seems a bit hacky. Why not just bind to port zero? The OS will then assign you an unused port, which you can find using local_addr.
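DavidM-D's suggestion can be sketched with the standard library alone; the function name here is illustrative, not from the crate:

```rust
use std::net::TcpListener;

/// Bind to port 0 so the OS assigns an unused port, then read it back
/// from local_addr(). Holding the listener keeps the port reserved.
fn acquire_free_port() -> std::io::Result<u16> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    Ok(listener.local_addr()?.port())
}

fn main() -> std::io::Result<()> {
    let port = acquire_free_port()?;
    println!("OS assigned port {port}");
    Ok(())
}
```

The catch, as discussed below, is that the port is only reserved while the listener is held; handing the bare port number to another process reopens the race.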

Comment on lines +14 to +15
```rust
/// Acquire an unused port and lock it for the duration until the sandbox server has
/// been started.
```
Contributor


Was going to suggest the same thing as @DavidM-D, but it seems like this is the key difference: there's a potential race condition where two nodes see that a port is “free” at the same time, then simultaneously attempt to bind to it. And since there's no global node executor controlling the issuing of ports, there's no way to gate access to them if workspaces instances can't communicate with each other, which is probably why the common denominator used here is the filesystem.
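A minimal std-only sketch of that filesystem gating, using atomic create_new in place of the fs2 advisory lock the PR actually uses (directory and filename scheme are assumptions):

```rust
use std::fs::OpenOptions;
use std::io::ErrorKind;
use std::path::Path;

/// Claim a port by creating its lockfile with create_new(true): creation
/// is atomic, so only one process can succeed for a given port. (The PR
/// itself keeps the file open and takes an fs2 advisory lock instead,
/// which is released automatically when the process exits.)
fn try_claim_port(dir: &Path, port: u16) -> bool {
    let lockpath = dir.join(format!("port_{port}.lock"));
    match OpenOptions::new().write(true).create_new(true).open(&lockpath) {
        Ok(_) => true, // we now own this port
        Err(err) if err.kind() == ErrorKind::AlreadyExists => false, // taken
        Err(_) => false,
    }
}

fn main() {
    let dir = std::env::temp_dir();
    let _ = std::fs::remove_file(dir.join("port_34567.lock")); // clean slate
    let first = try_claim_port(&dir, 34567);
    let second = try_claim_port(&dir, 34567); // second claim must fail
    println!("first claim: {first}, second claim: {second}");
    let _ = std::fs::remove_file(dir.join("port_34567.lock"));
}
```

One reason to prefer the advisory-lock variant over create_new: a crashed process leaves a create_new lockfile behind forever, while an advisory lock is dropped with the process.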

@ChaoticTempest
Member Author

@DavidM-D exactly what @miraclx said. Also, the NEAR node requires RPC/NET port parameters, so we can't do the bind ourselves directly.

Comment on lines +21 to +24
```rust
let lockfile = File::create(lockpath).map_err(|err| {
    ErrorKind::Io.full(format!("failed to create lockfile for port {}", port), err)
})?;
if lockfile.try_lock_exclusive().is_ok() {
```
Contributor

@miraclx commented Jan 9, 2023


Curious what create does to a file that currently exists and is locked. The normal behaviour is to truncate existing files. Does it error at that point, forcing a failed return? Would it truncate the file (invalidating the previous lock, though that's unlikely), causing this new lock attempt to fail and a different port to be tried? Or would it skip truncation, yet still return a handle to the file on which we can attempt to secure a lock?

I assume you've probably tested this.

Member Author


So lockfiles are purely advisory signals that a resource is taken. They don't actually prevent the file from being written to, so truncation wouldn't error out.
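That behaviour is easy to confirm with the standard library alone; the path here is illustrative:

```rust
use std::fs::File;
use std::io::Write;

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("advisory_lock_demo.lock");

    // First handle: this is where fs2's try_lock_exclusive would be called.
    let mut first = File::create(&path)?;
    first.write_all(b"locked")?;

    // A second create does not error: it simply truncates the file to zero
    // bytes, because advisory locks only coordinate other lock attempts,
    // not ordinary reads, writes, or truncation.
    let second = File::create(&path)?;
    assert_eq!(second.metadata()?.len(), 0);
    println!("second create succeeded, length = {}", second.metadata()?.len());

    std::fs::remove_file(&path)?;
    Ok(())
}
```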

Contributor


Gotcha!

@mrLSD

mrLSD commented Jan 10, 2023

@ChaoticTempest does this also fix #253? One of the proposed solutions there was a timeout between requests and reducing the thread count, but that was too slow for running the tests.

@mrLSD

mrLSD commented Jan 10, 2023

but they occur pretty frequently on a machine without enough resources to run a lot of nodes.

@ChaoticTempest As our research has shown, this statement is incorrect. The problem is actually typical for instances with a large amount of resources, namely a high number of CPU cores, which results in a large number of tests running at the same time. And since the ports are selected randomly, and the randomizer's entropy is not of high quality, the probability of choosing an already open port increases dramatically.

More correct:
on a machine with a large number of CPU cores, which leads to running a lot of nodes ...
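The effect mrLSD describes is essentially the birthday paradox; a rough estimate, assuming an illustrative pool of 10,000 candidate ports (not necessarily the crate's real range):

```rust
/// Birthday-paradox estimate: probability that at least two of `k`
/// concurrent tests pick the same port out of `n` candidates.
fn collision_probability(n: u64, k: u64) -> f64 {
    let mut all_distinct = 1.0_f64;
    for i in 0..k {
        all_distinct *= (n - i) as f64 / n as f64;
    }
    1.0 - all_distinct
}

fn main() {
    // Illustrative numbers only: a pool of 10,000 ports.
    for k in [16u64, 64, 256] {
        println!("{k:>3} tests -> collision probability {:.3}",
                 collision_probability(10_000, k));
    }
}
```

With these assumed numbers, 64 concurrent tests already collide roughly one time in five, and 256 concurrent tests almost always collide, so more cores directly means more collisions.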

@ChaoticTempest
Member Author

does this also fix #253? One of the proposed solutions there was a timeout between requests and reducing the thread count, but that was too slow for running the tests.

This one doesn't fix that issue in particular. That still needs to be resolved in near/nearcore#8328.

@ChaoticTempest As our research has shown, this statement is incorrect. The problem is actually typical for instances with a large amount of resources, namely a high number of CPU cores, which results in a large number of tests running at the same time. And since the ports are selected randomly, and the randomizer's entropy is not of high quality, the probability of choosing an already open port increases dramatically.

Ahh, I was generalizing about the whole problem, including the issue I mentioned above. But this PR just fixes the fact that the RNG chooses very similar ports. The other issue should alleviate the patching-state problems so that you can run on as many threads as possible without hitting it.

@frol mentioned this pull request on Oct 4, 2023
4 participants