Search_for_wildcards function updated to add @DATETO@ functionality #510

Diya910 · 2025-04-13T23:43:09Z

Before submitting a pull request (PR), please read the contributing guide.

Please fill out as much of this template as you can, but if you have any problems or questions, just leave a comment and we will help out :)

Description

What is this PR

Bug fix
[ yes] Addition of a new feature
Other

Why is this PR needed?
This PR introduces a new @Dateto@ wildcard that enables users to search for folders based on a date range embedded in their names. This feature is especially useful when users want to transfer data recorded within a specific date range, without needing to create folders for every date in that range.

What does this PR do?
Implements @Dateto@ pattern recognition inside search_for_wildcards.

Uses get_values_from_bids_formatted_name to extract date-YYYYMMDD from folder names.

Filters the folders based on whether the date falls within the provided range.

References

#508

How has this PR been tested?

Created automated tests (test_date_search_range) using a simulated folder structure with date-YYYYMMDD format.

Verified that only folders within the specified date range are returned.

Confirmed that existing wildcard functionality remains unaffected.

Is this a breaking change?

No, this feature is additive and does not alter existing behavior.

Does this PR require an update to the documentation?

Yes. The documentation should be updated to mention the new @Dateto@ wildcard and its usage.

If any features have changed, or have been added. Please explain how the
documentation has been updated.

Checklist:

[ yes] The code has been tested locally
[ yes] Tests have been added to cover all new functionality
The documentation has been updated to reflect any changes
The code has been formatted with pre-commit

There are two minor mypy errors I couldn't fully resolve:
A type conflict involving the dummy Configs class used in tests — guidance from maintainers would help finalize this.
A type mismatch originating from an existing code path — this appears unrelated to the new functionality added.

Diya910 · 2025-04-17T21:10:55Z

@adamltyson @JoeZiminski
Is there any update on the pull request. Your feedback will be really helpful.

sumana-2705 · 2025-04-22T18:19:40Z

Hello @Diya910,
The changes are looking great. The review process might be a little delayed since the team is currently a bit busy. In the meantime, it might be a good idea to take a look at the documentation as well, in the transfer_data.md file :)

JoeZiminski · 2025-05-04T16:47:27Z

Hi @Diya910 so sorry for the delay in response! thanks a lot for this PR and the extensive tests. I'm still not back full time but will definitely have time to review this within the next two weeks. Thanks for your patience

JoeZiminski

Hey @Diya910 thanks a lot for this, its a really nice implementation and is exactly what we need to do in this case. I have left a few comments on refactoring, this is because the introduced functionality can be aligned with some existing code to reduce duplication across the codebase. This requires some massaging of existing datashuttle code to make it a little more general so it can be called here. The suggestions also extend the implementation to handle the TIMETO and DATETIMETO case. For now I have not reviewed the tests as they might need changing after the refactor, but in general they look good and the attention to detail on testing is much appreciated.

Let me know if anything is not clear and if you have any questions or alternative ways to tackle this. Refactorings like those suggested can be a little fiddly. The linting / type checking will be useful when performing such refactorings. Of course, I'm happy to help wherever it would be useful. Thanks again for this contribution!

Just a reminder to myself, we will also need to add documentation for this new functionality.

datashuttle/utils/folders.py

JoeZiminski · 2025-05-14T14:38:00Z

datashuttle/utils/folders.py

-            name = name.replace(canonical_tags.tags("*"), "*")
-
-            matching_names: List[str]
+        if canonical_tags.tags("*") in name or "@DATETO@" in name:


I think we can split this case, I think at present if there is both a wildcard and date in the name it will overwrite search_str generated with search_str = name.replace(canonical_tags.tags("*"), "*") with search_str = re.sub(r"\d{8}@DATETO@\d{8}", "date-*", name)

search_str = name if canonical_tags.tags("*") in name or "@DATETO@" in name: search_str = search_str = name.replace(canonical_tags.tags("*"), "*") if @DATETO@ in search_str: ... date replacement code

This is not very nice as it is constantly mutating search-string which can be difficult to debug. However, I think the problem at hand calls for this and it is the neatest way to indication the intention. We can leave a comment to explain.

Reorganized the code to handle wildcards and datetime tags separately

JoeZiminski · 2025-05-14T14:40:45Z

datashuttle/utils/folders.py

+        if canonical_tags.tags("*") in name or "@DATETO@" in name:
+            search_str = name.replace(canonical_tags.tags("*"), "*")
+            # If a date-range tag is present, extract dates and update the search string.
+            if "@DATETO@" in name:


We have a canonical tags.tags() function that contains all the tags (just in case we change them or some other problem that requires their editing arises). So @DATETO@, @TIMETO@ and @DATETIMETO@ could be added to that function and here @DATETO@ replaced with tags.tags("DATETO")

Added DATETO, TIMETO, and DATETIMETO to canonical_tags.py and using them through tags()

JoeZiminski · 2025-05-14T15:02:05Z

datashuttle/utils/folders.py

+            search_str = name.replace(canonical_tags.tags("*"), "*")
+            # If a date-range tag is present, extract dates and update the search string.
+            if "@DATETO@" in name:
+                m = re.search(r"(\d{8})@DATETO@(\d{8})", name)


This is nice, we have some validation code for ISO formats here however this function works on a list of names, and returns in a slightly strange format (which makes sense in the context of the validation functions).

I think to centralise this we can do a few refactorings:

Move this code block:

formats = { "datetime": "%Y%m%dT%H%M%S", "time": "%H%M%S", "date": "%Y%m%d", }

to configs/canonical_tags and wrap it in a function like get_datetime_format based on the format.

Then this code in datetimes_in_iso_format can be factored into a separate function valdiate_datetime

strfmt = formats[key] try: datetime.strptime(format_to_check, strfmt) error_message = [] except ValueError: error_message = [get_datetime_error(key, name, strfmt, path_)]

except it can just return True / False and we can leave the error message stuff to datetimes_in_iso_format.

Now, we can call this new function from here. I think it is also worth having a quick function to get the expected number of values (8 for date) for use above, instead of hard-coding. This could be like:

def get_expected_num_datetime_values(format): format_str = get_datetime_format(format) today = datetime.now() return len(today.strftime(format_str)

We can then pass this "date", "time" or "datetime" to get the values.

This section would then look something like:

# somewhere need to check that @DATETO@, @TIMETO@ and @DATETIMETO@ are used exclusively format = tag = None if tags.tag("DATETO") in search_str: format = "date" tag = tags.tag("DATETO") elif tags.tag("TIMETO" in search_str: format = "time" tag = tags.tag("TIMETO") elif tags.tag("DATETIMETO") in search_str: format = "datetime" tag = tags.tag("DATETIMETO") expected_values = get_expected_num_datetime_values(format) full_tag_regex = fr"(\d{{{num_values}}}){re.escape(tag)}(\d{{{num_values}}})" match = re.search(full_tag_regex , search_str) if not match: ...raise (use utils. raise_error and raise a NeuroBlueprint error) start_str, end_str = match.groups() start_timepoint =datetime.strptime(start_str, get_datetime_format(format)") end_timepoint =datetime.strptime(end, get_datetime_format(format)") if not validate_datetime(start_timepoint, format): ... raise error <same for end_timepoint> search_str = re.sub(full_tag_regex , f{format}-*", search_str)

I think all of this could be isolated to a new function such that in this function we just have something like:

format = tag = None if tags.tag("DATETO") in search_str: format = "date" tag = tags.tag("DATETO") elif tags.tag("TIMETO" in search_str: format = "time" tag = tags.tag("TIMETO") elif tags.tag("DATETIMETO") in search_str: format = "datetime" tag = tags.tag("DATETIMETO") search_str = format_and_validate_datetime_search_str(search_str, format, tag)

Moved datetime formats to canonical_tags.py and created get_datetime_format function for centralized access

JoeZiminski · 2025-05-14T15:03:44Z

datashuttle/utils/folders.py

                )[0]
-
+            # If a date-range tag was provided, further filter the results.
+            if "@DATETO@" in name:


This is nice, and at this point we know we have validated dates. Ideally the validation should happen immediately before the point of use but in this case, there is no point wasting time searching if the dates are not valid, so it makes sense to do it before. But it is worth leaving a comment to indicate we know the dates are valid at this stage.

JoeZiminski · 2025-05-14T15:07:38Z

datashuttle/utils/folders.py

+            # If a date-range tag was provided, further filter the results.
+            if "@DATETO@" in name:
+                filtered_names: List[str] = []
+                for candidate in matching_names:


This is nice, to generalise it a bit more we can have:

if format is not None: assert tag is not None, "format and tag should be set together" get_values_from_bids_formatted_name can use `format` and in the strptime call use the new `get_datetime_format ` function

JoeZiminski · 2025-05-14T15:08:43Z

datashuttle/utils/folders.py

+                    values_list = get_values_from_bids_formatted_name(
+                        [candidate_basename], "date"
+                    )
+                    if not values_list:


I think we can assume this list is not empty because date-* was used to search the names already (?)

Removed unnecessary empty list check since the search pattern ensures valid datetime values

JoeZiminski · 2025-05-14T15:09:30Z

datashuttle/utils/folders.py

+                    except ValueError:
+                        continue
+                    if start_date <= candidate_date <= end_date:
+                        filtered_names.append(candidate)


I think this entire block could be isolated in a new function filter list of names by datetime just for readability (and it might also be useful in future)

Created filter_names_by_datetime_range() function

JoeZiminski · 2025-06-06T12:49:23Z

Hey @Diya910 do you think you would be interested in continuing to work on this PR? This is a great addition and it would be nice to release it in a version soon. I'm happy to finalise the PR as most of the work now is just refactoring into the existing codebase.

Diya910 · 2025-06-06T13:15:09Z

Hey @Diya910 do you think you would be interested in continuing to work on this PR? This is a great addition and it would be nice to release it in a version soon. I'm happy to finalise the PR as most of the work now is just refactoring into the existing codebase.

Yes yes, I am interested. I was busy with my exams and other stuffs. Just allow me a day or two. I'll do the required changes suggested by you.

JoeZiminski · 2025-06-06T13:21:49Z

Hey @Diya910 great! No rush BTW I was just checking in, please prioritise exams / other stuff / taking some time to recuperate after exams. I was thinking it might be nice to merge over the next few weeks (rather than next few days), thanks!

Diya910 · 2025-06-06T13:24:23Z

Thanks, I'll try to work on it as soon as possible.

…ion of code my making functions in validation.py and using in search_with_tags feature in folders file

Diya910 · 2025-06-15T15:11:25Z

Hey, @JoeZiminski I have probably done all the changes suggested by you and also centralized the code. I have also changed the test file with additional test functions, everything is working fine from side. If any other changes are required, please let me know. I will do them at the earliest.

JoeZiminski · 2025-06-18T13:53:19Z

Hi @Diya910 thanks a lot for this! Will review tomorrow

JoeZiminski

Hey @Diya910 thanks for this, this is really great stuff. The code is very clean, this is going to make a great feature. I have left a few comments on the code, they just suggest some minor refactoring's to reduce code duplication where possible. For critical code, it makes sense to define the key parts only in once place, just in case they are changed later but the editor forgets to check for all places they are defined.

The tests are great for ensuring the features works well, I have suggested a refactoring here to use our existing testing machinery which I think should reduce some boilerplate, let me know if you have any questions about this. The tests will should probably test all three cases, dateto, timeto and datetimteto, happy to help with this.

I just pushed some fixes to the pre-commit on the CI which was failing, just some minor typing issues (see here for some detail on the pre-commit hooks). This should move on to the full test suite now.

Thanks again Diya this is nearly done! I just remembered we will also need to document this change, the contributing guide for this is here. It would make sense to add the new tags to this section. Happy to do this because the documentation can be a bit fiddly, but if you are interested in this please feel free to go ahead, let me know if you have any questions!

JoeZiminski · 2025-06-19T18:11:01Z

datashuttle/configs/canonical_tags.py

    }
    return tags[tag_name]
+
+
+_DATETIME_FORMATS = {


Great thanks for this. Can this be refacotred to be returned from a function rather than a dict with global scope e.g.:

def get_datetime_formats(): return {...}

The reason is that _DATETIME_FORMATS becomes a dictionary with global scope across the application, meaning if it is accidentally changed in one part of the code this will propagate everywhere. Wrapping it in a function means the scope is not longer globa.

Apologies I see what you did here, that's great. In this case, you can move _DATETIME_FORMATS directly into get_datetime_format

Great suggestion, I will move it into the function

JoeZiminski · 2025-06-19T18:14:58Z

datashuttle/utils/validation.py

-
-    key = next((key for key in formats if key in name), None)
+    key = next(
+        (key for key in ["datetime", "time", "date"] if key in name), None


here you can do datetime_keys = list(get_datetime_format().keys()) then

key for key in datetime_keys (just to avoid the re-definition of these keys))

datetime_keys = list(canonical_tags.get_datetime_format()) key = next((key for key in datetime_keys if key in name), None)

Like this??

Oh I see apologies for the confusion, the return value of get_datetime_format is slightly different to that I thought.

What do you think of changing the function to take no arguments and return the entire dictionary, and then it can be indexed (instead of called). For example format = get_datetime_format(format_type) becomes format = get_datatime_formats()[format_type].

Then above, the signature can be:

datetime_keys = list(canonical_tags.get_datetime_formats().keys()) key = next((key for key in datetime_keys if key in name), None)

This signature is now different to the tags() function, but it is probably a better design because it is more flexible. Now the datetime names are canonically defined in a central place, and we can grab them all or index them as we like. Later on tags() can be changed to follow the same design. Sorry this would be a bit of a pain to change all the calls in your code though. What do you think?

Thanks, that makes a lot of sense! I agree that returning the full dictionary from get_datetime_formats() would make the design cleaner and more flexible — especially for cases like this where we need all the keys. Happy to refactor the calls where needed. Also agree that aligning the design of tags() later on would help keep things consistent across the codebase. Will go ahead with this change!

JoeZiminski · 2025-06-19T18:31:38Z

datashuttle/utils/validation.py

        try:
-            datetime.strptime(format_to_check, strfmt)
-            error_message = []
+            if not validate_datetime(format_to_check, key):


I may be incorrect, I don't think this needs to be in a try catch block anymore? Would the below work? (The order of the conditional is also reversed to make the positive case first which is usually slightly more readable):

if validate_datetime(format_to_check, key): error_message = [] else: error_message = [ get_datetime_error( key, name, canonical_tags.get_datetime_format(key), path_, ) ] return error_message

JoeZiminski · 2025-06-19T18:32:45Z

datashuttle/utils/validation.py


    return error_message


+def validate_datetime(datetime_str: str, format_type: str) -> bool:


This is a good name but because there are a few similar functions it could be slightly more explicit e.g. datetime_value_str_is_iso_format()

JoeZiminski · 2025-06-19T18:34:09Z

datashuttle/utils/validation.py

+        return False
+
+
+def get_expected_num_datetime_values(format_type: str) -> int:


Because this function get_expected_num_datetime_values and format_and_validate_datetime_search_str are only used in folders.py I think it makes sense to move them there (they could be under a Datetime Tag section or something)

JoeZiminski · 2025-06-19T19:22:34Z

datashuttle/utils/folders.py

+                match
+            ):  # We know this is true because format_and_validate_datetime_search_str succeeded
+                start_str, end_str = match.groups()
+                start_timepoint = datetime.strptime(


These lines can also use the new datetime_object_from_string function

datashuttle/utils/folders.py

JoeZiminski · 2025-06-19T20:19:43Z

tests/test_date_search_range.py

@@ -0,0 +1,217 @@
+import glob


This is a well thought out test script that puts a lot of emphasis on realistic tests which is excellent. I think on balance, here it would be easier to use some of the test functionality to actually make a project and some folders, then check these are found as expected. For example here.

The test code might look something like the below. The project fixture is inherited from the BaseTest class and automatically comes with set up and tear down. Thinking about it, you might as well test directly with project.transfer_custom and see that the correct folders are transferred. This will then check every cog in the machine:

class TestDateSearchRange(BaseTest): def test_date_search_range(self, project) sub_names = ["a list of example subs to test"] ses_names = ["a list of example ses to test] # it might actually be easier to test the ses and sub case separately test_utils.make_and_check_local_project_folders( project, "rawdata", subs, sessions, ["behav", "ephys"] ) project.upload_custom( some search strings) transerred_subjects = (project.get_central_path() / "rawdata).glob("*") # now check that the correct files have been transferred )

also see here

When I am opening the first example, it is causing some disruption in the ui and i am not getting where you are pointing. Can you please see it? Can you share that again

Hi @Diya910, I think I accidentally copied from a commit, does this work?

it is the function test_wildcard_transfer in /tests/tests_integration/test_filesystem_transfer.py

JoeZiminski · 2025-06-19T20:21:56Z

tests/test_date_search_range.py

+    assert found_dates == expected_dates
+
+
+def test_simple_wildcard(temp_project_dir: Path):


I think we can remove this function given this but if this checks an extra case we can keep of course

JoeZiminski · 2025-06-19T20:22:26Z

tests/test_date_search_range.py

+        search_with_tags(cfg, base_folder, local_or_central, [pattern])
+    assert "Invalid" in str(exc_info.value)
+
+


Brilliant idea, this can be adjusted as suggested above but the core test is great

Diya910 · 2025-07-02T18:06:26Z

Hey @JoeZiminski, I have done changes required by you. These were a lot of changes I am not able to reply to all of them individually. But I made sure to make changes suggested by you. I have tested the changes on draft test file and they are working fine. I haven't properly done work on test file. It was a lot for me to do in a go. Once you confirm these changes I'll move ahead in refactoring test file. I hope you are fine with it. If I missed any suggestion above just in case, please point out to that I'll make those changes.

JoeZiminski

Hey @Diya910 this is great, definitely good to go bar some very minor suggestions. Most of these are minor github code suggestions so you can directly commit them.

Apologies, one of my suggestions was actually worse than what was already there 😅 around the walrus operator. Sorry for the inconvenience of having to revert this.

After these changes are integrated I will message @Akseli-Ilmanen to test this manually while the other tests are been written. Let me know if you have any questions as you refactor the tests. Thanks again!

datashuttle/utils/folders.py

datashuttle/utils/validation.py

datashuttle/configs/canonical_tags.py

Co-authored-by: Joe Ziminski <55797454+JoeZiminski@users.noreply.github.com>

…per review

…into date_feature

Diya910 · 2025-07-04T08:09:48Z

@JoeZiminski I have done all the changes. Please have a look. I am not sure about if I have removed declarations the right way. Please let me know if you want me to change docstrings in any specific way. Thankyou

JoeZiminski · 2025-07-04T16:43:36Z

Hey @Diya910 thanks for these changes, all looks great! Docstrings are formatted correctly, I think good to move on to tests now! Let me know if you have any questions about anything.

Currently the pre-commit test is failing, you can read some details on our tools here. Basically, you can do pre-commit install in the environment, then every time you commit it will run all the checks. Let me know if you have any questions about solving the issues, sometimes it is not clear. You can also do pre-commit run --all-files to run on the entire codebase (which you can do now to fix the issues). You can see the issues in the Checks section at the bottom of this page.

At some point I will merge the new main into your branch and fix the conflicts (I think this will be easier if I do it, as I know the changes that are coming in from main branch so will be easier to solve). Let me know a day that you will not be using the branch and I can do it then.

The final thing to do will be to add this feature to the documentation, see a guide here. If you would like to learn the sphinx documentation stack, please give it a go and ask me any questions you have. Otherwise, I'm happy to do it.

Diya910 · 2025-07-04T16:54:24Z

@JoeZiminski I'll now move to making changes to test file. Once you approve them you can take the lead to fix pre-commit.ci error. I'll definitely like to add the documentation for this feature once all other things are done.

JoeZiminski · 2025-07-10T09:34:05Z

Great! Thanks @Diya910 that sounds like a good plan

{Search for wildcards function updated}

3503808

JoeZiminski requested changes May 14, 2025

View reviewed changes

Refactoring changes asked by the maintainer which include centralisat…

054a98b

…ion of code my making functions in validation.py and using in search_with_tags feature in folders file

JoeZiminski added this to the v2.8.0 milestone Jun 17, 2025

JoeZiminski requested changes Jun 19, 2025

View reviewed changes

JoeZiminski linked an issue Jun 21, 2025 that may be closed by this pull request

Search within date range #508

Open

JoeZiminski mentioned this pull request Jun 26, 2025

Extend developer docs with coventions mentioned in issues #533

Open

Fabrication of the code moved functions in folders.py

69699d9

Diya910 force-pushed the date_feature branch from d038ac8 to 69699d9 Compare July 2, 2025 17:49

JoeZiminski requested changes Jul 3, 2025

View reviewed changes

JoeZiminski mentioned this pull request Jul 3, 2025

Check tests for validation #539

Open

JoeZiminski reviewed Jul 3, 2025

View reviewed changes

datashuttle/configs/canonical_tags.py Outdated Show resolved Hide resolved

Diya910 and others added 7 commits July 4, 2025 12:50

Update datashuttle/utils/folders.py

15f8a3c

Co-authored-by: Joe Ziminski <55797454+JoeZiminski@users.noreply.github.com>

Update datashuttle/utils/folders.py

8104024

Co-authored-by: Joe Ziminski <55797454+JoeZiminski@users.noreply.github.com>

Update datashuttle/utils/folders.py

8a2cbbd

Co-authored-by: Joe Ziminski <55797454+JoeZiminski@users.noreply.github.com>

Update datashuttle/utils/folders.py

a44cb61

Co-authored-by: Joe Ziminski <55797454+JoeZiminski@users.noreply.github.com>

Update datashuttle/utils/validation.py

f9a21b4

Co-authored-by: Joe Ziminski <55797454+JoeZiminski@users.noreply.github.com>

Refactor: Clean up docstrings in folders.py and canonical_tags.py as …

bd12cd6

…per review

Merge branch 'date_feature' of https://github.com/Diya910/datashuttle …

e0d2441

…into date_feature


		return error_message


		def validate_datetime(datetime_str: str, format_type: str) -> bool:

		return False


		def get_expected_num_datetime_values(format_type: str) -> int:

		assert found_dates == expected_dates


		def test_simple_wildcard(temp_project_dir: Path):

		search_with_tags(cfg, base_folder, local_or_central, [pattern])
		assert "Invalid" in str(exc_info.value)

Search_for_wildcards function updated to add @DATETO@ functionality #510

Are you sure you want to change the base?

Search_for_wildcards function updated to add @DATETO@ functionality #510

Uh oh!

Conversation

Diya910 commented Apr 13, 2025

Description

References

How has this PR been tested?

Is this a breaking change?

Does this PR require an update to the documentation?

Checklist:

Uh oh!

Diya910 commented Apr 17, 2025

Uh oh!

sumana-2705 commented Apr 22, 2025

Uh oh!

JoeZiminski commented May 4, 2025

Uh oh!

JoeZiminski left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JoeZiminski commented Jun 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Diya910 commented Jun 6, 2025

Uh oh!

JoeZiminski commented Jun 6, 2025

Uh oh!

Diya910 commented Jun 6, 2025

Uh oh!

Diya910 commented Jun 15, 2025

Uh oh!

JoeZiminski commented Jun 18, 2025

Uh oh!

JoeZiminski left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Diya910 Jun 26, 2025 • edited by JoeZiminski Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

JoeZiminski commented Jun 6, 2025 •

edited

Loading

Diya910 Jun 26, 2025 •

edited by JoeZiminski

Loading