
fix: port collision on large amount of tests #257

Merged
merged 1 commit on Jan 10, 2023

Conversation

ChaoticTempest
Member

Port collisions would rarely happen, but they occur pretty frequently on a machine without enough resources to run a lot of nodes. This fix remedies the problem by adding lockfiles that hold each port until the server has actually started and fully acquired it.

Not sure if this is the best way of achieving this, so feel free to suggest anything different.

@DavidM-D
Contributor

DavidM-D commented Jan 9, 2023

This seems a bit hacky. Why not just bind to port zero? The OS will then assign you an unused port, which you can find using local_addr.
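DavidM-D's suggestion can be sketched with the standard library alone; the function name here is illustrative, not from the crate:

```rust
use std::net::TcpListener;

/// Bind to port 0 so the OS assigns an unused port, then read it back
/// from local_addr(). Holding the listener keeps the port reserved.
fn acquire_free_port() -> std::io::Result<u16> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    Ok(listener.local_addr()?.port())
}

fn main() -> std::io::Result<()> {
    let port = acquire_free_port()?;
    println!("OS assigned port {port}");
    Ok(())
}
```

The catch, as discussed below, is that the port is only reserved while the listener is held; handing the bare port number to another process reopens the race.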

Comment on lines +14 to +15
```rust
/// Acquire an unused port and lock it for the duration until the sandbox server has
/// been started.
```
Contributor


Was going to suggest the same thing as @DavidM-D, but it seems like this is the key difference: there's a potential race condition where two nodes see that a port is “free” at the same time, then simultaneously attempt to bind to it. And since there's no global node executor controlling the issuing of ports, there's no way to gate access to them if workspaces instances can't communicate with each other, which is probably why the common denominator used here is the filesystem.
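A minimal std-only sketch of that filesystem gating, using atomic create_new in place of the fs2 advisory lock the PR actually uses (directory and filename scheme are assumptions):

```rust
use std::fs::OpenOptions;
use std::io::ErrorKind;
use std::path::Path;

/// Claim a port by creating its lockfile with create_new(true): creation
/// is atomic, so only one process can succeed for a given port. (The PR
/// itself keeps the file open and takes an fs2 advisory lock instead,
/// which is released automatically when the process exits.)
fn try_claim_port(dir: &Path, port: u16) -> bool {
    let lockpath = dir.join(format!("port_{port}.lock"));
    match OpenOptions::new().write(true).create_new(true).open(&lockpath) {
        Ok(_) => true, // we now own this port
        Err(err) if err.kind() == ErrorKind::AlreadyExists => false, // taken
        Err(_) => false,
    }
}

fn main() {
    let dir = std::env::temp_dir();
    let _ = std::fs::remove_file(dir.join("port_34567.lock")); // clean slate
    let first = try_claim_port(&dir, 34567);
    let second = try_claim_port(&dir, 34567); // second claim must fail
    println!("first claim: {first}, second claim: {second}");
    let _ = std::fs::remove_file(dir.join("port_34567.lock"));
}
```

One reason to prefer the advisory-lock variant over create_new: a crashed process leaves a create_new lockfile behind forever, while an advisory lock is dropped with the process.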

@ChaoticTempest
Member Author

@DavidM-D exactly what @miraclx said. Also, the NEAR node requires RPC/NET port parameters, so we can't do the bind ourselves directly.

Comment on lines +21 to +24
```rust
let lockfile = File::create(lockpath).map_err(|err| {
    ErrorKind::Io.full(format!("failed to create lockfile for port {}", port), err)
})?;
if lockfile.try_lock_exclusive().is_ok() {
```
Contributor

@miraclx commented Jan 9, 2023


Curious what create does to a file that currently exists and is locked. The normal behaviour is to truncate existing files. Does it error at that point, forcing a failed return? Would it truncate the file (invalidating the previous lock, though that's unlikely), causing this new lock attempt to fail and a different port to be tried? Or would it skip truncation, yet still return a handle to the file on which we can attempt to secure a lock?

I assume you've probably tested this.

Member Author


So lockfiles are purely advisory signals that a resource is taken. They don't actually prevent the file from being written to, so truncation wouldn't error out.
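That behaviour is easy to confirm with the standard library alone; the path here is illustrative:

```rust
use std::fs::File;
use std::io::Write;

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("advisory_lock_demo.lock");

    // First handle: this is where fs2's try_lock_exclusive would be called.
    let mut first = File::create(&path)?;
    first.write_all(b"locked")?;

    // A second create does not error: it simply truncates the file to zero
    // bytes, because advisory locks only coordinate other lock attempts,
    // not ordinary reads, writes, or truncation.
    let second = File::create(&path)?;
    assert_eq!(second.metadata()?.len(), 0);
    println!("second create succeeded, length = {}", second.metadata()?.len());

    std::fs::remove_file(&path)?;
    Ok(())
}
```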

Contributor


Gotcha!

@mrLSD

mrLSD commented Jan 10, 2023

@ChaoticTempest does this also fix #253? One of the proposed solutions there was a timeout between requests and reducing the thread count, but that was too slow for running the tests.

@mrLSD

mrLSD commented Jan 10, 2023

but they occur pretty frequently on a machine without enough resources to run a lot of nodes.

@ChaoticTempest As our research has shown, this statement is incorrect. The problem is actually typical for instances with a large amount of resources, namely a high number of CPU cores, which results in a large number of tests running at the same time. And since the ports are selected randomly, and the randomizer's entropy is not of high quality, the probability of choosing an already open port increases dramatically.

More correct:
on a machine with a large number of CPU cores, which leads to running a lot of nodes ...
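The effect mrLSD describes is essentially the birthday paradox; a rough estimate, assuming an illustrative pool of 10,000 candidate ports (not necessarily the crate's real range):

```rust
/// Birthday-paradox estimate: probability that at least two of `k`
/// concurrent tests pick the same port out of `n` candidates.
fn collision_probability(n: u64, k: u64) -> f64 {
    let mut all_distinct = 1.0_f64;
    for i in 0..k {
        all_distinct *= (n - i) as f64 / n as f64;
    }
    1.0 - all_distinct
}

fn main() {
    // Illustrative numbers only: a pool of 10,000 ports.
    for k in [16u64, 64, 256] {
        println!("{k:>3} tests -> collision probability {:.3}",
                 collision_probability(10_000, k));
    }
}
```

With these assumed numbers, 64 concurrent tests already collide roughly one time in five, and 256 concurrent tests almost always collide, so more cores directly means more collisions.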

@ChaoticTempest
Member Author

does this also fix #253? One of the proposed solutions there was a timeout between requests and reducing the thread count, but that was too slow for running the tests.

This one doesn't fix that issue in particular. That still needs to be resolved in near/nearcore#8328.

@ChaoticTempest As our research has shown, this statement is incorrect. The problem is actually typical for instances with a large amount of resources, namely a high number of CPU cores, which results in a large number of tests running at the same time. And since the ports are selected randomly, and the randomizer's entropy is not of high quality, the probability of choosing an already open port increases dramatically.

Ahh, I was generalizing about the whole problem, including the issue I mentioned above. But this PR just fixes the fact that the RNG chooses very similar ports. The other issue should alleviate the patching-state problems so that you can run on as many threads as possible without hitting it.

@frol mentioned this pull request on Oct 4, 2023
4 participants