Feature Request: is_ISO8601 #867

Closed
billdenney opened this issue Mar 10, 2020 · 7 comments
Labels
feature: a feature request or enhancement

Comments

@billdenney
Contributor

I have an application where I need to be able to detect if a character string is formatted as required by ISO 8601. Given that #629 / #700 to format date-times as ISO 8601 was a good fit here with format_ISO8601(), I thought that a detection method would also be useful here.

What would you think about a function (or small family of functions) that was named something like is_ISO8601() which could do the following:

  • For character strings: returns TRUE (if formatted as ISO8601), FALSE (if not formatted as ISO8601), or NA (if NA).
  • For any type other than character string, the function would not apply (it would stop with an error).
  • It would take inputs of:
    • x: the character string to test,
    • representation: one or more of c("date", "datetime", "time", "duration", "interval", "repeating interval"),
    • precision: selects the required precision of the argument (not applicable for "duration" or the duration part of an "interval"), and
    • usetz: TRUE to require the time zone, FALSE to require no time zone, or NA to allow with or without a time zone (would not apply for durations).
  • While I think that people would typically test a single type, the default would allow for any of the representations or all of them.
@vspinu
Member

vspinu commented Mar 11, 2020

What about vectorised use case?

In order to have this one would need a dedicated ISO8601 parser, which doesn't seem to be worth the effort for such a tiny use case.

@vspinu added the feature label (a feature request or enhancement) on Mar 11, 2020
@billdenney
Contributor Author

Maybe I'm oversimplifying it, but I think that it could be a set of small functions with reasonably straightforward regular expressions and grepl calls (and therefore inherently vectorized).
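
A minimal sketch of the grepl idea, assuming only the extended calendar-date form (YYYY-MM-DD) needs to be recognised. The function name is_ISO8601_date() and its pattern are hypothetical and not part of lubridate; the sketch follows the TRUE / FALSE / NA contract from the original request and stops on non-character input.

```r
# Hypothetical helper, not a lubridate function.
# Checks the *format* only: "2020-02-31" would still return TRUE.
is_ISO8601_date <- function(x) {
  if (!is.character(x)) {
    stop("`x` must be a character vector")
  }
  pattern <- "^[0-9]{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12][0-9]|3[01])$"
  out <- grepl(pattern, x, perl = TRUE)  # vectorised over x
  out[is.na(x)] <- NA                    # grepl() returns FALSE for NA input
  out
}

is_ISO8601_date(c("2020-03-10", "10/03/2020", NA))
# TRUE FALSE NA
```

Because grepl() works element-wise, no explicit loop is needed for the vectorised case.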

I agree that a specialized parser would not be worthwhile.

If the grepl method doesn't seem like a good fit, no worries.

@vspinu
Member

vspinu commented Mar 12, 2020

You might be right actually, but this is low priority. If you can put together a PR and a bunch of tests for it, I would be more than happy to include it in the code base.

@billdenney
Contributor Author

billdenney commented Apr 13, 2020

The regular expressions become convoluted, but I am algorithmically building them in a way that makes them reasonable to review (e.g. make the year part then use that to make the date part then use that and the time part to make a whole regexp). And, I'm building many tests for each part, so that it should be understandable.
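
As a rough illustration of that layered approach (not the actual work-in-progress code), the sketch below builds a year pattern, reuses it for a date pattern, and then combines a full date with a time part. All variable names and the helper matches() are hypothetical; time zones and seconds are left out to keep it short.

```r
# Hypothetical component patterns; names are illustrative only.
year_re   <- "(?:158[3-9]|159[0-9]|1[6-9][0-9]{2}|[2-9][0-9]{3})"  # 1583 to 9999
month_re  <- "(?:0[1-9]|1[0-2])"
day_re    <- "(?:0[1-9]|[12][0-9]|3[01])"
hour_re   <- "(?:[01][0-9]|2[0-3])"
minute_re <- "[0-5][0-9]"

# Date with optional month, and optional day once a month is present.
date_re <- paste0(year_re, "(?:-", month_re, "(?:-", day_re, ")?)?")

# Time with optional minutes, then a full date followed by that time.
time_re     <- paste0("T", hour_re, "(?::", minute_re, ")?")
datetime_re <- paste0(year_re, "-", month_re, "-", day_re, time_re)

# Anchor a component so that it must match the whole string.
matches <- function(re, x) grepl(paste0("^", re, "$"), x, perl = TRUE)

# Each layer can be tested on its own before it is reused.
stopifnot(matches(year_re, "2020"), !matches(year_re, "1582"))
stopifnot(matches(date_re, c("2020", "2020-03", "2020-03-10")))
stopifnot(!matches(date_re, "2020-13"))
stopifnot(matches(datetime_re, "2020-03-10T12:30"))
```

Testing each component separately keeps the final pattern reviewable even though the assembled string is long.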

This is now a work in progress.

@billdenney
Contributor Author

With a lot of work, I now have a super-regexp and the ability to generate all variants (optional second, minute, hour, day, week/month, year). The regexp itself is a beast:

(?:(158[3-9]|159[0-9]|1[6-9][0-9]{2}|[2-9][0-9]{3})(?:(?:-(0[1-9]|1[0-2])(?:-(?:(0[1-9]|[12][0-9]|3[01])(?:(?:(?:T([01][0-9]|2[0-3])|T([01][0-9]|2[0-3]):([0-5][0-9])(?::((?:[0-5][0-9])(?:[\.,][0-9]+)?))?)(?:(Z|\+00(?::00)?|[\+-]00:(?:15|30|45)|[\+-](?:0[1-9]|1[1-4])(?::(?:00|15|30|45))?))?)?)?))?|-W(0[1-9]|[1-4][0-9]|5[0-3])(?:-(?:([1-7])(?:(?:(?:T([01][0-9]|2[0-3])|T([01][0-9]|2[0-3]):([0-5][0-9])(?::((?:[0-5][0-9])(?:[\.,][0-9]+)?))?)(?:(Z|\+00(?::00)?|[\+-]00:(?:15|30|45)|[\+-](?:0[1-9]|1[1-4])(?::(?:00|15|30|45))?))?)?)?))?|(?:-(?:(00[1-9]|0[1-9][0-9]|[12][0-9]{2}|3[0-5][0-9]|36[0-6])(?:(?:(?:T([01][0-9]|2[0-3])|T([01][0-9]|2[0-3]):([0-5][0-9])(?::((?:[0-5][0-9])(?:[\.,][0-9]+)?))?)(?:(Z|\+00(?::00)?|[\+-]00:(?:15|30|45)|[\+-](?:0[1-9]|1[1-4])(?::(?:00|15|30|45))?))?)?)?))?))?)?

With this easier-to-review visualization.

The part that I'd prefer to be able to fix is making it so that time is only represented once. I think that look-ahead and look-behind regexps may be the right answer, but I don't understand enough about them yet to be sure that's correct.
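
One possible way to state the time part only once, without look-arounds, is to put the three date forms in a single alternation and append one shared optional time group. The sketch below is a simplification under stated assumptions, not the regexp above: it requires a complete date before any time, uses a plain four-digit year, and omits fractional seconds and time zones. All variable names are hypothetical.

```r
# Hypothetical, simplified components (no time zone, no fractional seconds).
year_re     <- "[0-9]{4}"
calendar_re <- "-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12][0-9]|3[01])"   # -MM-DD
week_re     <- "-W(?:0[1-9]|[1-4][0-9]|5[0-3])-[1-7]"            # -Www-D
ordinal_re  <- "-(?:00[1-9]|0[1-9][0-9]|[12][0-9]{2}|3[0-5][0-9]|36[0-6])"  # -DDD
time_re     <- "T(?:[01][0-9]|2[0-3])(?::[0-5][0-9](?::[0-5][0-9])?)?"

# The date alternatives are grouped first; the time group appears only once.
pattern <- paste0(
  "^", year_re,
  "(?:", calendar_re, "|", week_re, "|", ordinal_re, ")",
  "(?:", time_re, ")?$"
)

grepl(
  pattern,
  c("2020-03-10T12:30", "2020-W11-2T12", "2020-070", "2020-03T12"),
  perl = TRUE
)
# TRUE TRUE TRUE FALSE
```

The trade-off is that truncated dates (year only, or year and month) would need their own alternatives outside the group that precedes the time.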

@vspinu
Member

vspinu commented Jan 25, 2022

Sorry for not coming back to this earlier, but I am afraid this is too complex. I am pretty sure there is C or C++ code somewhere that tests for this. Otherwise, it's probably not very difficult to write our own.

@billdenney
Contributor Author

Yeah, it makes sense that this isn't a good fit as-is.
