Invalid ISO 8601 in example (and it probably shouldn't parse) #38

Closed
billdenney opened this issue Jul 27, 2022 · 9 comments · Fixed by #40

@billdenney
Contributor

There are several examples in the code where the character string "20150101T08:35:32.123+05:30" is used as an example of condensed ISO 8601. According to Wikipedia, this is not valid ISO 8601 because the date and time parts must use the same format, either both basic or both extended (https://en.wikipedia.org/wiki/ISO_8601#Combined_date_and_time_representations).

Locations where I found it are:

I think that it would be useful if the default regexp only allowed one or the other of the basic and extended forms. (In practice, that would be simpler to implement as two regexps.)

@dgkf
Owner

dgkf commented Jul 27, 2022

Thanks for reporting! I agree - we should disallow these or split this style off as a "best guess"-style parser.

@dgkf
Owner

dgkf commented Jul 28, 2022

Just to make sure I understand correctly: according to this spec, the parser should treat times like this:

2015-01-01T08:35:32.123+05:30  # valid, both extended
20150101T083532.123+05:30  # valid, both basic
20150101T08:35:32.123+05:30  # invalid, basic date with extended time
2015-01-01T083532.123+05:30  # invalid, extended date with basic time

Is that your interpretation as well?

@billdenney
Contributor Author

My understanding of the spec has one difference relative to what you wrote:

20150101T083532.123+05:30 # valid, both basic

The time zone in that is not basic, as I understand it. The basic version would not include a colon in the time zone:

20150101T083532.123+0530 # valid, both basic
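
To make the distinction above concrete, here is a minimal R sketch that labels a string as basic, extended, or mixed by checking for dashes in the date, colons in the time, and a colon in the UTC offset. The name `classify_form` is hypothetical and not part of the package, and the sketch assumes a positive four-digit year and a `T` separator.

```r
# Hypothetical helper, not part of the package: label one date-time string
# by which ISO 8601 separators it uses.
classify_form <- function(x) {
  parts <- strsplit(x, "T", fixed = TRUE)[[1]]
  date <- parts[1]
  rest <- parts[2]
  # Split a trailing numeric offset (+05:30 or +0530) from the time
  offset <- regmatches(rest, regexpr("[+-][0-9:]+$", rest))
  time <- sub("[+-][0-9:]+$", "", rest)
  uses_separator <- c(
    grepl("-", date, fixed = TRUE),
    grepl(":", time, fixed = TRUE),
    grepl(":", offset, fixed = TRUE)  # zero-length when there is no offset
  )
  if (all(uses_separator)) {
    "extended"
  } else if (!any(uses_separator)) {
    "basic"
  } else {
    "mixed"
  }
}

x <- c(
  "2015-01-01T08:35:32.123+05:30",  # extended
  "20150101T083532.123+0530",       # basic
  "20150101T083532.123+05:30",      # mixed: colon only in the offset
  "20150101T08:35:32.123+05:30",    # mixed: basic date, extended time
  "2015-01-01T083532.123+05:30"     # mixed: extended date, basic time
)
forms <- vapply(x, classify_form, character(1), USE.NAMES = FALSE)
```

Only the first two strings come out as a single consistent form; the other three are the mixed cases discussed in this thread.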

@dgkf
Owner

dgkf commented Jul 28, 2022

Sounds good! I'll take a look and see how much effort this would take to enforce. If it's relatively straightforward I'll try to knock it out today.

@dgkf dgkf self-assigned this Jul 28, 2022
@billdenney
Contributor Author

billdenney commented Jul 28, 2022

I think that it would require making two versions of the ISO 8601 regexp here, one for basic and one for extended. That would then necessitate either giving an option to choose basic or extended to parse_iso8601_datetime() or trying both regexps and choosing the one that parsed it.
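
As a rough sketch of the two options described above (an explicit choice of form, or trying both patterns and keeping whichever matches), the following uses short date-only patterns. The names `re_forms` and `parse_date_form` and the `form` argument are hypothetical; this is not the signature of parse_iso8601_datetime().

```r
# Hypothetical date-only patterns, kept short for illustration
re_forms <- c(
  extended = "^(\\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\\d|3[01])$",
  basic    = "^(\\d{4})(0[1-9]|1[0-2])(0[1-9]|[12]\\d|3[01])$"
)

# Try the requested form, or both forms in turn, and report which one matched
parse_date_form <- function(x, form = c("either", "basic", "extended")) {
  form <- match.arg(form)
  candidates <- if (form == "either") names(re_forms) else form
  for (f in candidates) {
    m <- regmatches(x, regexec(re_forms[[f]], x, perl = TRUE))[[1]]
    if (length(m) > 0) {
      # m[1] is the full match; the rest are year, month, day
      return(list(form = f, fields = as.integer(m[-1])))
    }
  }
  NULL  # neither pattern matched
}

parse_date_form("2022-07-28")                   # matches the extended pattern
parse_date_form("20220728")                     # matches the basic pattern
parse_date_form("2022-0728")                    # NULL: mixed separators
parse_date_form("20220728", form = "extended")  # NULL: wrong form requested
```

Because each pattern is anchored and allows only one separator style, a mixed string cannot match either one.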

When I've tried to put it all into one regexp, the regexp was big and effectively unusable: tidyverse/lubridate#867 (comment)

And I don't think that it's worth going down the path of a full parser. (Having written one for ISO 8601, I think it would likely add more complexity than it is worth here.)

re_iso8601 <- paste0(
  "^\\s*",
  "(?<year>[\\+-]?\\d{4}(?!\\d{2}\\b))",
  "(?:",
    "(?<dash>-?)",
    "(?:(?<month>0[1-9]|1[0-2])",
      "(?:\\g{dash}(?<day>[12]\\d|0[1-9]|3[01]))?",
      "|W(?<week>[0-4]\\d|5[0-3])(?:-?(?<weekday>[1-7]))?",
      # day of year 001-366; "6[0-6]" because "6[1-6]" skipped day 360
      "|(?<yearday>00[1-9]|0[1-9]\\d|[12]\\d{2}|3(?:[0-5]\\d|6[0-6]))",
    ")",
    "(?<time>[T\\s]",
      "(?:",
        "(?:",
          "(?<hour>[01]\\d|2[0-3])",
          "(?:(?<colon>:?)(?<min>[0-5]\\d))?|24\\:?00",
        ")",
        "(?<frac>[\\.,]\\d+(?!:))?",
      ")?",
      "(?:\\g{colon}",
        "(?<sec>[0-5]\\d)",
        "(?<secfrac>[\\.,]\\d+)?",
      ")?",
      "(?<tz>",
        "[zZ]|(?<tzpm>[\\+-])",
        "(?<tzhour>[01]\\d|2[0-3])",
        ":?",
        "(?<tzmin>[0-5]\\d)?",
      ")?",
    ")?",
  ")?$"
)

@dgkf
Owner

dgkf commented Jul 28, 2022

I think the easiest way would just be to permit the invalid forms in the regex, but enforce the restrictions once the regex capture groups have been converted into a matrix:

There are more cases to consider, but just to communicate the simplest form of the idea, it would happen in post-processing like:

m[is.na(m[,"dash"]) != is.na(m[,"colon"]),] <- NA
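
A small, self-contained sketch of that post-processing step is below. It uses a simplified, hypothetical pattern rather than the package's regex, and it assumes that an unused separator is stored as NA in the capture matrix (here the empty captures are converted to NA by hand).

```r
# Simplified, hypothetical pattern: backreferences keep each separator
# consistent within the date and within the time, but the date and the
# time may still disagree with each other.
re <- paste0(
  "^(?<year>\\d{4})(?<dash>-?)\\d{2}\\g{dash}\\d{2}",
  "T\\d{2}(?<colon>:?)\\d{2}\\g{colon}\\d{2}$"
)
x <- c(
  "2015-01-01T08:35:32",  # both extended
  "20150101T083532",      # both basic
  "20150101T08:35:32",    # basic date, extended time
  "2015-01-01T083532"     # extended date, basic time
)

hit <- regexpr(re, x, perl = TRUE)
starts <- attr(hit, "capture.start")
lens <- attr(hit, "capture.length")

# Character matrix of captures: one row per input, one column per group
m <- matrix(
  substring(x, starts, starts + lens - 1),
  nrow = length(x),
  dimnames = list(NULL, attr(hit, "capture.names"))
)
m[m == ""] <- NA  # assumption: an unused separator is stored as NA

# The rule from the comment above: a dash without a colon, or the reverse,
# invalidates the whole row
m[is.na(m[, "dash"]) != is.na(m[, "colon"]), ] <- NA
valid <- !is.na(m[, "year"])
```

After the rule is applied, only the first two rows keep their captured year, so `valid` flags the two mixed strings as rejected.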

@billdenney
Contributor Author

I think that logic allows dates to look like these (which are also invalid): "2022-0728", "2022-0728T14:4657". We'd have to check that all or none of the dashes and colons exist to enforce validity. We could do something like adding names to all of the dashes and colons and then confirming that we have all or none of them for all the parts that exist in the date-time.
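
A minimal sketch of the named-separator idea, under the assumption that every separator slot gets its own capture group and no backreferences are used. The pattern `re_named` and the function `separators_consistent` are hypothetical and cover only a full calendar date with an optional time to the second.

```r
# Hypothetical pattern: every separator slot has its own named group and
# no backreferences are used, so inconsistent strings still match.
re_named <- paste0(
  "^\\d{4}(?<dash1>-?)\\d{2}(?<dash2>-?)\\d{2}",
  "(?:T\\d{2}(?<colon1>:?)\\d{2}(?<colon2>:?)\\d{2})?$"
)

separators_consistent <- function(x) {
  hit <- regexpr(re_named, x, perl = TRUE)
  # TRUE where a separator character was actually written
  present <- attr(hit, "capture.length") > 0
  colnames(present) <- attr(hit, "capture.names")
  # Colon slots only count when the string has a time part
  relevant <- matrix(TRUE, nrow(present), ncol(present),
                     dimnames = dimnames(present))
  relevant[!grepl("T", x, fixed = TRUE), c("colon1", "colon2")] <- FALSE
  n_present <- rowSums(present & relevant)
  # Valid when the pattern matched and all or none of the slots are filled
  as.vector(hit) != -1 & (n_present == 0 | n_present == rowSums(relevant))
}

x <- c("2022-07-28", "20220728", "2022-0728",
       "2022-07-28T14:46:57", "2022-0728T14:4657",
       "20220728T144657", "20220728T14:46:57")
ok <- separators_consistent(x)
```

The all-or-none count rejects both strings quoted above, as well as the basic-date, extended-time mix that started this issue.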

@dgkf
Owner

dgkf commented Jul 28, 2022

> I think that logic allows dates to look like these (which are also invalid): "2022-0728", "2022-0728T14:4657"

I'll have to test, but I don't think so. The backreferences to the earlier dash and colon capture groups should require that, if a separator is used in the first position, it is repeated in each of the other places where a dash or colon could appear. If that's not how it works now, I think it's well within scope to fix.
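
The backreference behavior described here can be checked in isolation. The pattern below is a cut-down, hypothetical version of the date part only, not the package's regex: the `dash` group captures either "-" or the empty string, and `\g{dash}` must then repeat exactly what was captured.

```r
# Cut-down, hypothetical date pattern using the same named backreference
# technique as the full regex quoted earlier in the thread
re_date <- paste0(
  "^\\d{4}(?<dash>-?)(?:0[1-9]|1[0-2])",
  "\\g{dash}(?:0[1-9]|[12]\\d|3[01])$"
)

x <- c("2022-07-28", "20220728", "2022-0728", "202207-28")
ok <- grepl(re_date, x, perl = TRUE)
# ok is TRUE, TRUE, FALSE, FALSE
```

The backreference therefore rejects "2022-0728" on its own. What it does not do is tie the date separator to the time separator, because `dash` and `colon` are independent groups, and the offset in the quoted regex uses a plain `:?` with no backreference at all.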

To handle it all in regex, it looks like it could use a conditional capture group. I've never used this regex behavior before, so I'm inclined to go with the R-side approach to not make the regex grow into a completely unintelligible mess with tricky-to-debug performance implications.
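
For comparison, here is a sketch of what the regex-only route might look like with a PCRE conditional group. The pattern is hypothetical and deliberately small. Note that the `dash` group has to be optional as a whole, `(?<dash>-)?`, because the condition `(?(<dash>)...)` tests whether the group took part in the match, not whether it captured a non-empty string.

```r
# Hypothetical pattern: if the date used a dash, every later separator slot
# requires its separator; otherwise each slot must be empty.
re_cond <- paste0(
  "^\\d{4}(?<dash>-)?\\d{2}(?(<dash>)-)\\d{2}",
  "T\\d{2}(?(<dash>):)\\d{2}(?(<dash>):)\\d{2}$"
)

x <- c(
  "2015-01-01T08:35:32",  # both extended
  "20150101T083532",      # both basic
  "20150101T08:35:32",    # basic date, extended time
  "2015-01-01T083532"     # extended date, basic time
)
ok <- grepl(re_cond, x, perl = TRUE)
# ok is TRUE, TRUE, FALSE, FALSE
```

This works on a toy pattern, but every optional component of the full regex would need the same treatment, which is the readability cost weighed above.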

@billdenney
Contributor Author

Overall, as long as we have an accurate, maintainable solution, I'm good with it. I haven't used backreferences like that before (and I've minimally used backreferences overall), so as long as it works, great!
