[Prototyping] Using rclone lsjson for all searches #530

Open
wants to merge 57 commits into
base: main
57 commits
df17a5a
lots of changes, sort out.
JoeZiminski Oct 6, 2023
3275487
still working, needs a bit more tidying up.
JoeZiminski Oct 9, 2023
accd30d
Continue working.
JoeZiminski Oct 9, 2023
dfd4f82
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 24, 2023
7f0326e
Tidy ups.
JoeZiminski Nov 10, 2023
f12683c
Fix linting.
JoeZiminski Nov 10, 2023
e03f72d
Tidy ups and documentation.
JoeZiminski Nov 10, 2023
5ff22c3
Add documentation.
JoeZiminski Nov 10, 2023
790326c
Try different Dockerfile for linux.
JoeZiminski Nov 23, 2023
f098159
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 23, 2023
f276ba9
Fix issue after rebase.
JoeZiminski Apr 10, 2024
93615b8
Remove breakpoint.
JoeZiminski Apr 10, 2024
e454325
Don't use singularity, docker or nothing.
JoeZiminski Apr 10, 2024
8162051
Rework ssh setup.
JoeZiminski Apr 10, 2024
1c04749
Fix connection refused issue.
JoeZiminski Apr 11, 2024
b296ca6
Use run not POpen for actions.
JoeZiminski Apr 11, 2024
43976de
try in detachted state.
JoeZiminski Apr 11, 2024
e6eaff5
Add error messages on docker setup.
JoeZiminski Apr 18, 2024
2e7d318
run linting.
JoeZiminski Apr 18, 2024
9dc2734
update connection failed error message.
JoeZiminski Apr 18, 2024
4306129
Test version working locally on linux but only can connect with sudo.
JoeZiminski Apr 19, 2024
b4bf5f6
Try freeing up and using the free port.
JoeZiminski Apr 19, 2024
6ff5c7c
Test sudo service only on linux.
JoeZiminski Apr 19, 2024
029da65
Add sudo to docker setup commands.
JoeZiminski Apr 19, 2024
941c1e0
test with auto add policy.
JoeZiminski Apr 19, 2024
ab8e9bc
Really restrict to the connect call.
JoeZiminski Apr 19, 2024
a0e083c
restrict to tests of interest temporarily.
JoeZiminski Apr 19, 2024
6b51d28
Fix to port 3306 in tests and for paramiko.
JoeZiminski Apr 19, 2024
84a7762
Extend port to rclone.
JoeZiminski Apr 19, 2024
d5e2b73
Use environment variable to set port.
JoeZiminski Apr 19, 2024
8d11516
Add all OS back.
JoeZiminski Apr 19, 2024
8e29a65
try remove tag for windows.
JoeZiminski Apr 19, 2024
f465747
Update docker commands for windows.
JoeZiminski Apr 20, 2024
324a796
Fix nonsense docker build command.
JoeZiminski Apr 20, 2024
597a0b2
Only run when docker running and on ubuntu on runners.
JoeZiminski Apr 22, 2024
b529809
Try build and run docker only once per session.
JoeZiminski Apr 22, 2024
14579ba
Teardown image at end of ssh tests, factor out ssh tests.
JoeZiminski Apr 22, 2024
db78aab
SPlit ssh tests.
JoeZiminski Apr 22, 2024
9c6c064
Add sudo to the docker teardown commands for Linux.
JoeZiminski Apr 22, 2024
5466e4c
try class scope of setup ssh container.
JoeZiminski Apr 22, 2024
e2aca7b
Try move ssh fixture to classes.
JoeZiminski Apr 22, 2024
e157bb4
Try a different command to shutdown on linux.
JoeZiminski Apr 22, 2024
905a093
Tidy ups and some docs.
JoeZiminski Apr 22, 2024
ae192b3
Extend to macOS.
JoeZiminski Apr 22, 2024
01b020e
Finish tidying up docstrings.
JoeZiminski Apr 22, 2024
5af3add
Change ssh test image name and fix docstring.
JoeZiminski Apr 22, 2024
679f3b8
Small tidy ups.
JoeZiminski Apr 22, 2024
c29e250
Small fixes after rebase.
JoeZiminski Jun 20, 2025
6d417d1
More fix.es
JoeZiminski Jun 20, 2025
b7f9db8
Fix CI.
JoeZiminski Jun 20, 2025
8712bff
Skip tests on macOS.
JoeZiminski Jun 21, 2025
33e20de
Refactor transfer tests.
JoeZiminski Jun 21, 2025
de756a2
Continue refactoring.
JoeZiminski Jun 21, 2025
e457d3a
Fix tests again.
JoeZiminski Jun 21, 2025
19bcacd
Update CI script.
JoeZiminski Jun 21, 2025
76b5e80
Update CI.
JoeZiminski Jun 21, 2025
ee03d3a
Playing around with using rclone for all file searches.
JoeZiminski Jun 21, 2025
15 changes: 12 additions & 3 deletions .github/workflows/code_test_and_deploy.yml
Original file line number Diff line number Diff line change
@@ -32,7 +32,7 @@ jobs:
# macos-14 is M1, macos-13 is intel. Run on earliest and
# latest python versions. All python versions are tested in
# the weekly cron job.
os: [windows-latest, ubuntu-latest, macos-14, macos-13]
os: [ ubuntu-latest, windows-latest, macos-14, macos-13]
# Test all Python versions for cron job, and only first/last for other triggers
python-version: ${{ fromJson(github.event_name == 'schedule' && '["3.9", "3.10", "3.11", "3.12"]' || '["3.9", "3.12"]') }}

@@ -57,8 +57,17 @@ jobs:
run: |
python -m pip install --upgrade pip
pip install .[dev]
- name: Test
run: pytest
# Run the SSH tests only on Linux: the Windows and macOS runners
# already run inside a virtual container, so they cannot run
# Linux containers because nested containerisation is disabled.
- name: Test SSH (Linux only)
if: runner.os == 'Linux'
run: |
sudo service mysql stop # free up port 3306 for ssh tests
pytest tests/tests_transfers/ssh
- name: All Other Tests
run: |
pytest --ignore tests/tests_transfers/ssh

build_sdist_wheels:
name: Build source distribution
11 changes: 11 additions & 0 deletions datashuttle/configs/canonical_configs.py
@@ -11,6 +11,7 @@

from __future__ import annotations

import os
from typing import (
TYPE_CHECKING,
Dict,
@@ -58,6 +59,16 @@ def keys_str_on_file_but_path_in_class() -> list[str]:
]


def get_default_ssh_port() -> int:
"""
Get the default port used for SSH connections.
"""
if "DS_SSH_PORT" in os.environ:
return int(os.environ["DS_SSH_PORT"])
else:
return 22
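
The override above can be sketched in isolation as follows. This is a minimal sketch rather than the code in the PR: the `environ` parameter (added so the lookup can be exercised without touching the real environment) and the range check are assumptions that do not appear in the diff.

```python
import os

DEFAULT_SSH_PORT = 22  # standard SSH port, used when no override is set


def get_default_ssh_port(environ=None) -> int:
    """Return the SSH port, honouring a DS_SSH_PORT override if present.

    `environ` is a hypothetical parameter for testing; by default the
    real process environment is consulted.
    """
    env = os.environ if environ is None else environ
    raw = env.get("DS_SSH_PORT")
    if raw is None:
        return DEFAULT_SSH_PORT
    port = int(raw)  # raises ValueError on non-numeric input
    if not 1 <= port <= 65535:  # assumed sanity check, not in the PR
        raise ValueError(f"DS_SSH_PORT out of range: {port}")
    return port


print(get_default_ssh_port({}))                       # 22
print(get_default_ssh_port({"DS_SSH_PORT": "3306"}))  # 3306
```

In the CI workflow this is presumably what lets the SSH tests talk to the container on port 3306 while normal use stays on port 22.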


# -----------------------------------------------------------------------------
# Check Configs
# -----------------------------------------------------------------------------
3 changes: 3 additions & 0 deletions datashuttle/configs/config_class.py
@@ -200,6 +200,9 @@ def get_rclone_config_name(

return f"central_{self.project_name}_{connection_method}"

def get_rclone_config_name_local(self):
return f"local_{self.project_name}_local_filesystem"

def make_rclone_transfer_options(
self, overwrite_existing_files: OverwriteExistingFiles, dry_run: bool
) -> Dict:
1 change: 0 additions & 1 deletion datashuttle/utils/data_transfer.py
@@ -157,7 +157,6 @@ def build_a_list_of_all_files_and_folders_to_transfer(self) -> List[str]:
self.update_list_with_non_ses_sub_level_folders(
extra_folder_names, extra_filenames, sub
)

continue

# Datatype (sub and ses level) --------------------------------
133 changes: 118 additions & 15 deletions datashuttle/utils/folders.py
@@ -515,25 +515,66 @@ def search_for_folders(
verbose : If `True`, when a search folder cannot be found, a message
will be printed with the missing path.
"""
if local_or_central == "central" and cfg["connection_method"] == "ssh":
all_folder_names, all_filenames = ssh.search_ssh_central_for_folders(
search_path,
search_prefix,
cfg,
verbose,
return_full_path,
if local_or_central == "local":
all_folder_names, all_filenames = search_gdrive_or_aws_for_folders(
search_path, search_prefix, None, return_full_path
)
else:
if not search_path.exists():
if verbose:
utils.log_and_message(
f"No file found at {search_path.as_posix()}"
)
return [], []

all_folder_names, all_filenames = search_filesystem_path_for_folders(
all_folder_names_, all_filenames_ = search_filesystem_path_for_folders(
search_path / search_prefix, return_full_path
)

assert all_folder_names == all_folder_names_
assert all_filenames == all_filenames_

else:

if cfg["connection_method"] == "ssh":
all_folder_names, all_filenames = (
ssh.search_ssh_central_for_folders(
search_path,
search_prefix,
cfg,
verbose,
return_full_path,
)
)

all_folder_names_, all_filenames_ = (
search_gdrive_or_aws_for_folders(
search_path,
search_prefix,
cfg.get_rclone_config_name("ssh"),
return_full_path,
)
)
assert sorted(all_folder_names) == all_folder_names_
assert all_filenames == all_filenames_

else:
if not search_path.exists():
if verbose:
utils.log_and_message(
f"No file found at {search_path.as_posix()}"
)
return [], []

all_folder_names, all_filenames = search_gdrive_or_aws_for_folders(
search_path,
search_prefix,
cfg.get_rclone_config_name("local_filesystem"),
return_full_path,
)

all_folder_names_, all_filenames_ = (
search_filesystem_path_for_folders(
search_path / search_prefix, return_full_path
)
)

assert all_folder_names == all_folder_names_
assert all_filenames == all_filenames_

return all_folder_names, all_filenames
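
The prototype cross-checks each rclone-based listing against the existing search with `assert`, but only one of the comparisons sorts its input, so the checks depend on the order in which each backend happens to return entries. A minimal sketch of an order-insensitive comparison is given here; `normalise_listing` and `listings_match` are hypothetical helper names, not functions from datashuttle.

```python
from pathlib import Path
from typing import Iterable, List, Union

PathLike = Union[str, Path]


def normalise_listing(names: Iterable[PathLike]) -> List[str]:
    """Convert every entry to a POSIX-style string and sort the result."""
    return sorted(Path(name).as_posix() for name in names)


def listings_match(first: Iterable[PathLike], second: Iterable[PathLike]) -> bool:
    """Compare two listings while ignoring order and str/Path differences."""
    return normalise_listing(first) == normalise_listing(second)


# Same entries, different order and types: treated as equal.
print(listings_match(["sub-002", "sub-001"], [Path("sub-001"), Path("sub-002")]))
```

Normalising both sides also sidesteps the case where one search returns `Path` objects (with `return_full_path=True`) and the other returns strings.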


@@ -565,3 +606,65 @@ def search_filesystem_path_for_folders(
)

return all_folder_names, all_filenames


def search_gdrive_or_aws_for_folders(
search_path: Path,
search_prefix: str,
rclone_config_name: str | None,
return_full_path: bool = False,
) -> Tuple[List[Any], List[Any]]:
"""
Search a path for files and folders using the `rclone lsjson` command.
The command lists the entries at `search_path` in JSON format, with
per-entry information such as name and type. If `rclone_config_name`
is `None`, no remote prefix is added and the path is searched locally.
"""
import fnmatch
import json

from datashuttle.utils import rclone

if rclone_config_name:
config_prefix = f"{rclone_config_name}:"
else:
config_prefix = ""

output = rclone.call_rclone(
f'lsjson {config_prefix}"{search_path.as_posix()}"',
pipe_std=True,
)

all_folder_names: List[str] = []
all_filenames: List[str] = []

if output.returncode != 0:
utils.log_and_message(
f"Error searching files at {search_path.as_posix()} \n {output.stderr.decode('utf-8') if output.stderr else ''}"
)
return all_folder_names, all_filenames

files_and_folders = json.loads(output.stdout)

# try:
for file_or_folder in files_and_folders:

name = file_or_folder["Name"]

if not fnmatch.fnmatch(name, search_prefix):
continue

is_dir = file_or_folder.get("IsDir", False)

to_append = search_path / name if return_full_path else name

if is_dir:
all_folder_names.append(to_append)
else:
all_filenames.append(to_append)

# except Exception:
# utils.log_and_message(
# f"Error searching files at {search_path.as_posix()}"
# )

return all_folder_names, all_filenames
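
The parsing step of `search_gdrive_or_aws_for_folders` can be tried without calling rclone by feeding it a hand-written JSON string. This is a sketch under stated assumptions: `split_lsjson_entries` is a hypothetical name, the sample data is invented for illustration, and `fnmatch.fnmatchcase` is swapped in for the `fnmatch.fnmatch` used in the diff because `fnmatch.fnmatch` normalises case on Windows, which would make matching platform dependent.

```python
import fnmatch
import json
from typing import List, Tuple


def split_lsjson_entries(
    lsjson_stdout: str, search_prefix: str
) -> Tuple[List[str], List[str]]:
    """Split `rclone lsjson` output into (folder names, file names).

    Only entries whose `Name` matches the `search_prefix` wildcard
    pattern are kept. Both lists are sorted for a stable result.
    """
    folders: List[str] = []
    files: List[str] = []
    for entry in json.loads(lsjson_stdout):
        name = entry["Name"]
        if not fnmatch.fnmatchcase(name, search_prefix):
            continue
        if entry.get("IsDir", False):
            folders.append(name)
        else:
            files.append(name)
    return sorted(folders), sorted(files)


# Invented sample in the shape lsjson produces (a JSON list of objects).
sample = json.dumps(
    [
        {"Path": "sub-002", "Name": "sub-002", "IsDir": True},
        {"Path": "sub-001", "Name": "sub-001", "IsDir": True},
        {"Path": "notes.txt", "Name": "notes.txt", "IsDir": False},
    ]
)
print(split_lsjson_entries(sample, "sub-*"))  # (['sub-001', 'sub-002'], [])
```

Keeping the parsing separate from the subprocess call makes it straightforward to unit test without a configured remote.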
3 changes: 2 additions & 1 deletion datashuttle/utils/rclone.py
@@ -6,6 +6,7 @@
from subprocess import CompletedProcess
from typing import Dict, List, Literal

from datashuttle.configs import canonical_configs
from datashuttle.configs.config_class import Configs
from datashuttle.utils import utils
from datashuttle.utils.custom_types import TopLevelFolder
@@ -141,7 +142,7 @@ def setup_rclone_config_for_ssh(
f"sftp "
f"host {cfg['central_host_id']} "
f"user {cfg['central_host_username']} "
f"port 22 "
f"port {canonical_configs.get_default_ssh_port()} "
f"key_file {ssh_key_path.as_posix()}",
pipe_std=True,
)
17 changes: 13 additions & 4 deletions datashuttle/utils/ssh.py
@@ -14,6 +14,7 @@

import paramiko

from datashuttle.configs import canonical_configs
from datashuttle.utils import utils

# -----------------------------------------------------------------------------
@@ -42,6 +43,7 @@ def connect_client_core(
else None
),
look_for_keys=True,
port=canonical_configs.get_default_ssh_port(),
)


@@ -83,15 +85,21 @@ def get_remote_server_key(central_host_id: str):
connection.
"""
transport: paramiko.Transport
with paramiko.Transport(central_host_id) as transport:
with paramiko.Transport(
(central_host_id, canonical_configs.get_default_ssh_port())
) as transport:
transport.connect()
key = transport.get_remote_server_key()
return key


def save_hostkey_locally(key, central_host_id, hostkeys_path) -> None:
client = paramiko.SSHClient()
client.get_host_keys().add(central_host_id, key.get_name(), key)
client.get_host_keys().add(
f"[{central_host_id}]:{canonical_configs.get_default_ssh_port()}",
key.get_name(),
key,
)
client.get_host_keys().save(hostkeys_path.as_posix())
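
One point that may be worth checking in `save_hostkey_locally`: the OpenSSH `known_hosts` convention writes a host as `[host]:port` only for non-default ports and uses the bare hostname for port 22. To my understanding, `paramiko.SSHClient.connect` looks keys up the same way, so an entry always saved in bracketed form might not be found when the port is the default 22. A minimal sketch of the naming rule, where `known_hosts_entry_name` is a hypothetical helper and not part of paramiko or datashuttle:

```python
SSH_DEFAULT_PORT = 22


def known_hosts_entry_name(host: str, port: int = SSH_DEFAULT_PORT) -> str:
    """Return the known_hosts name for a host/port pair.

    Default-port entries use the bare hostname; any other port uses
    the bracketed `[host]:port` form.
    """
    if port == SSH_DEFAULT_PORT:
        return host
    return f"[{host}]:{port}"


print(known_hosts_entry_name("example.org"))        # example.org
print(known_hosts_entry_name("example.org", 3306))  # [example.org]:3306
```

If that reading of paramiko is right, passing the result of such a helper to `get_host_keys().add(...)` would keep the saved entry consistent with the later lookup for both the default and the overridden port.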


@@ -183,15 +191,16 @@ def connect_client_with_logging(
f"Connection to { cfg['central_host_id']} made successfully."
)

except Exception:
except Exception as e:
utils.log_and_raise_error(
f"Could not connect to server. Ensure that \n"
f"1) You have run setup_ssh_connection() \n"
f"2) You are on VPN network if required. \n"
f"3) The central_host_id: {cfg['central_host_id']} is"
f" correct.\n"
f"4) The central username:"
f" {cfg['central_host_username']}, and password are correct.",
f" {cfg['central_host_username']}, and password are correct.\n"
f"Original error: {e}",
ConnectionError,
)

1 change: 1 addition & 0 deletions pyproject.toml
@@ -107,6 +107,7 @@ select = ["I", "E", "F", "TCH", "TID252"]

[tool.ruff.lint.per-file-ignores]
"__init__.py" = ["F401"]
"tests/**/*" = ["TID252"]

[tool.ruff.lint.mccabe]
max-complexity = 18
Empty file added tests/__init__.py
Empty file.
11 changes: 6 additions & 5 deletions tests/tests_integration/base.py → tests/base.py
@@ -1,10 +1,11 @@
import warnings

import pytest
import test_utils

from datashuttle import DataShuttle

from . import test_utils

TEST_PROJECT_NAME = "test_project"


@@ -13,8 +14,8 @@ class BaseTest:
@pytest.fixture(scope="function")
def no_cfg_project(test):
"""
Fixture that creates an empty project. Ignore the warning
that no configs are setup yet.
Fixture that creates an empty project. Ignore the
warning that no configs are set up yet.
"""
test_utils.delete_project_if_it_exists(TEST_PROJECT_NAME)

@@ -64,8 +65,8 @@ def project(self, tmp_path, request):
def clean_project_name(self):
"""
Create an empty project, but ensure no
configs already exists, and delete created configs
after test.
configs already exist, and delete created
configs after test.
"""
project_name = TEST_PROJECT_NAME
test_utils.delete_project_if_it_exists(project_name)
47 changes: 0 additions & 47 deletions tests/conftest.py

This file was deleted.

5 changes: 0 additions & 5 deletions tests/quick_make_project.py

This file was deleted.
