Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Airflow .airflowignore not handling soft link properly. #23532

Closed
1 of 2 tasks
NickYadance opened this issue May 6, 2022 · 10 comments · Fixed by #23535
Closed
1 of 2 tasks

Airflow .airflowignore not handling soft link properly. #23532

NickYadance opened this issue May 6, 2022 · 10 comments · Fixed by #23535
Assignees
Labels
area:core kind:bug This is a clearly a bug
Milestone

Comments

@NickYadance
Copy link
Contributor

NickYadance commented May 6, 2022

Apache Airflow version

2.3.0 (latest released)

What happened

Soft link and folder under same root folder will be handled as the same relative path. Say i have dags folder which looks like this:

-dags:
  -- .airflowignore
  -- folder
  -- soft-links-to-folder -> folder

and .airflowignore:

folder/

both folder and soft-links-to-folder will be ignored.

What you think should happen instead

Only the folder should be ignored. This is the expected behavior in airflow 2.2.4, before i upgraded. The root cause is that both _RegexpIgnoreRule and _GlobIgnoreRule is calling relative_to method to get search path.

How to reproduce

check @tirkarthi comment for the test case.

Operating System

ubuntu

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@NickYadance NickYadance added area:core kind:bug This is a clearly a bug labels May 6, 2022
@boring-cyborg
Copy link

boring-cyborg bot commented May 6, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

@NickYadance
Copy link
Contributor Author

The background is that i use kubernetes/git-sync to sync git into the dag folder which creates the soft link.

@tirkarthi
Copy link
Contributor

Here is a sample test case of the report. I guess this is more with the fact resolve is called before match that will resolve symlink to the original folder here. As per report "soft-links-to-folder" will resolve to "folder" and get ignored

test_path: Path = path.resolve()

Test case passes before changes in #22051 . cc : @ianbuss

# tests/plugins/test_plugin_ignore.py
def test_symlink_not_ignored(self):
    shutil.rmtree(self.test_dir)
    self.test_dir = tempfile.mkdtemp(prefix="onotole")
    source = os.path.join(self.test_dir, "folder")
    target = os.path.join(self.test_dir, "symlink")
    py_file = os.path.join(source, "hello_world.py")
    ignore_file = os.path.join(self.test_dir, ".airflowignore")
    os.mkdir(source)
    os.symlink(source, target)

    with open(ignore_file, 'w') as f:
        f.write("folder")

    with open(py_file, 'w') as f:
        f.write("print('hello world')")


    ignore_list_file = ".airflowignore"
    found = []

    for path in find_path_from_directory(self.test_dir, ignore_list_file):
        found.append(path)

    assert os.path.join(self.test_dir, "symlink", "hello_world.py") in found

@NickYadance
Copy link
Contributor Author

Here is a sample test case of the report. I guess this is more with the fact resolve is called before match that will resolve symlink to the original folder here. As per report "soft-links-to-folder" will resolve to "folder" and get ignored

test_path: Path = path.resolve()

Test case passes before changes in #22051 . cc : @ianbuss

# tests/plugins/test_plugin_ignore.py
def test_symlink_not_ignored(self):
    shutil.rmtree(self.test_dir)
    self.test_dir = tempfile.mkdtemp(prefix="onotole")
    source = os.path.join(self.test_dir, "folder")
    target = os.path.join(self.test_dir, "symlink")
    py_file = os.path.join(source, "hello_world.py")
    ignore_file = os.path.join(self.test_dir, ".airflowignore")
    os.mkdir(source)
    os.symlink(source, target)

    with open(ignore_file, 'w') as f:
        f.write("folder")

    with open(py_file, 'w') as f:
        f.write("print('hello world')")


    ignore_list_file = ".airflowignore"
    found = []

    for path in find_path_from_directory(self.test_dir, ignore_list_file):
        found.append(path)

    assert os.path.join(self.test_dir, "symlink", "hello_world.py") in found

Correct.

@ianbuss
Copy link
Contributor

ianbuss commented May 6, 2022

Yep, you're right @tirkarthi this looks like something we need to handle better and expand the unit tests to cover expected cases. @NickYadance can you give a sample directory outline (with file, dirs and symlinks) when git-sync is in the picture?

@ianbuss
Copy link
Contributor

ianbuss commented May 6, 2022

Hopefully this can be relatively easily fixed with a change to absolute() rather than resolve() on the path but we should set up a realistic test case for the git-sync use case as well.

@NickYadance
Copy link
Contributor Author

NickYadance commented May 6, 2022

@ianbuss sure , check this:
image

Can't find a simple way to work around this. Maybe try to delete soft link manually after git-sync is done.

@ianbuss
Copy link
Contributor

ianbuss commented May 6, 2022

OK, and you have a .airflowignore that has repo.git/ or similar @NickYadance?

I guess there's a wider question here about whether Airflow should be following symlinks at all as it walks the directory tree. I think in some circumstances it makes sense to do that, where it refers to targets out of the dag directory tree, like for example:

dags/
  team1 -> /mnt/share/team1
  team2 -> /git/team2

Longer term, I'm less sure of the utility of following "local" symlinks to targets within the same directory structure as it will result in the same files being found more than once and you have to craft the ignore files to ignore either the symlinked path or the real path. We could update the logic to ignore symlinks that are "in tree". Curious what you think @ashb , @kaxil , @potiuk ?

But for now we should try to restore the behaviour to pre-2.3.0 as it is a regression.

@potiuk potiuk added this to the Airflow 2.3.1 milestone May 6, 2022
@potiuk
Copy link
Member

potiuk commented May 6, 2022

Agree soft-links in tree are bad idea that will lead to duplication which is never good in this case. But soft-links are also dangerous for another reason - bacause of potential recursion problem which are difficult to diagnose (unless we protect against them). But also they are useful to "cut the cake" differently so to speak (i.e. have different hierarchy in DAGs and in your original repo where you keep them.

I'd say we should restore the symlink behaviour, allow the symlinks but protect against in-tree and recursion.

@ianbuss
Copy link
Contributor

ianbuss commented May 6, 2022

Have prepped a simple initial PR which should hopefully restore the original behaviour (and includes the test case provided by @tirkarthi - thanks!) but would be good to get some additional eyes on it. If we want to make larger changes to the symlink handling that should perhaps be a future PR with further thought? Depends on the timeline of 2.3.1 I think.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:core kind:bug This is a clearly a bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants