Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[patch] restore forbidding duplicate imports #1453

Merged
merged 3 commits into from
Jun 20, 2024

Conversation

samwaseda
Copy link
Member

This PR partially solves this problem, as it forbids to import a job whose name already exists in the project, but I make it a draft because it does not clarify why the problem is occurring in the first place.

Base automatically changed from fix_archiving_problem to main May 29, 2024 08:33
@samwaseda
Copy link
Member Author

So I guess for now this is the most pragmatic solution. I still do not perfectly understand what's happening, but it sounds reasonable to me to forbid importing jobs with an existing job name. I put the error in the importing function and not deeper in base, because I thought maybe it is not fundamentally forbidden that there are jobs with the same name in pyiron? Maybe @jan-janssen can make a comment? One way or other, I think this is more like a temporary fix, and we need an overhaul in this PR. What do you think?

@liamhuber
Copy link
Member

because I thought maybe it is not fundamentally forbidden that there are jobs with the same name in pyiron?

I actually wasn't sure what the behaviour would be here, so I checked. The executive summary is:

  • Project doesn't care at all about names until something has been run
  • Project won't run a job whose name is already a run job in its table
  • Creating a new job with an existing job name loads the existing job
  • The job column of the project table is not unique, but the combination of project and job is

Demo:

from pyiron_atomistics import Project
pr = Project("test")
pr.remove_jobs(silently=True)
j1 = pr.create.job.Lammps("lmp")
j1.structure = pr.create.structure.bulk("Al")

j2 = pr.create.job.Lammps("lmp")
j2.structure = pr.create.structure.bulk("Cu")

print(j1.structure.get_chemical_formula(), j2.structure.get_chemical_formula())
# Al Cu
print(j1.name, j2.name)
# lmp lmp
print(pr.job_table().job)
# Series([], Name: job, dtype: object)

j1.run()
# Runs as expected

j2.run()
# Prints a warning and doesn't run `job exists already and therefore was not created!`

j3 = pr.create.job.Lammps("lmp")
# Silently (really, I set `warnings.simplefilter("always")`) reloads the job

print(j1.structure.get_chemical_formula(), j2.structure.get_chemical_formula(), j3.structure.get_chemical_formula())
# Al Cu Al
print(pr.job_table().job)
# 0    lmp; Name: job, dtype: object

Only the project+job column gives a unique name, as it's fine to create and run a job in a subproject:

sub_pr = pr.create_group("sub_pr")
k1 = sub_pr.create.job.Lammps("lmp")
k1.structure = sub_pr.create.structure.bulk("Ni")
k1.run()
print(pr.job_table().job)
# 0    lmp; 1    lmp; Name: job, dtype: object

IMO there's nothing faulty with any of this, although it is annoying to find out "late" that your job name is non-unique and won't run -- would be nicer to learn that at creation time. This is impossible in the current setup though, as project is not truly aware of a job until after it has run.

I've attacked this general topic in pyiron_workflow using a Semantic class, where the most relevant bit is probably SemanticParent.add_child. This fundamentally differs as it's all happening at the object level in-memory, rather than referring to the database as (I think) Project is doing for these checks. I'm not suggesting we need to pull anything from there in here, it's just an FYI for attacking the topic from another angle.

Copy link
Member

@liamhuber liamhuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am completely behind the idea of having a check here, but I don't think this does what you want and it needs tests to prove it's doing the desired thing.

pyiron_base/project/archiving/import_archive.py Outdated Show resolved Hide resolved
tests/archiving/test_import.py Outdated Show resolved Hide resolved
@samwaseda
Copy link
Member Author

because I thought maybe it is not fundamentally forbidden that there are jobs with the same name in pyiron?

I actually wasn't sure what the behaviour would be here, so I checked. The executive summary is:

  • Project doesn't care at all about names until something has been run
  • Project won't run a job whose name is already a run job in its table
  • Creating a new job with an existing job name loads the existing job
  • The job column of the project table is not unique, but the combination of project and job is

Yep so far that was more or less what I had expected. What I don't know is whether it can happen that there are jobs with the same name that have run. I super vaguely remember that this can happen when there are two child jobs which share the same name. In that case it's the same project and the same job name but with different parents, but as I said I don't clearly remember it.

@liamhuber
Copy link
Member

I super vaguely remember that this can happen when there are two child jobs which share the same name. In that case it's the same project and the same job name but with different parents, but as I said I don't clearly remember it.

Oh yeah, I wasn't thinking about parent jobs. I don't intend to dig in and make an example for that though 😝

Copy link
Contributor

@pmrv pmrv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with @liamhuber's diagnosis. Probably you want to join job and project columns in the given df first, before making this check.

@jan-janssen jan-janssen marked this pull request as draft June 12, 2024 05:41
@samwaseda samwaseda marked this pull request as ready for review June 19, 2024 20:14
@samwaseda
Copy link
Member Author

I guess now I found the workaround and hopefully the tests will pass this time...

@samwaseda
Copy link
Member Author

In the end, it boiled down to pr.remove(enable=True), so I removed all the other lines that I had added. Since there'll be an overhaul in this PR, I don't think the error that I had added will be particularly helpful.

@samwaseda samwaseda changed the title restore forbidding duplicate imports [patch] restore forbidding duplicate imports Jun 20, 2024
Copy link
Member

@jan-janssen jan-janssen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. @samwaseda Can you merge this?

@samwaseda
Copy link
Member Author

It's blocked because @liamhuber's approval is missing

@jan-janssen jan-janssen dismissed liamhuber’s stale review June 20, 2024 12:06

The request was addressed.

@jan-janssen jan-janssen merged commit 5003ad0 into main Jun 20, 2024
24 checks passed
@jan-janssen jan-janssen deleted the forbid_importing_duplicate_jobs branch June 20, 2024 12:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants