Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for providing Chrono formats when parsing strings -> datetime/nanos and Parser implementations #5303

Closed
Omega359 opened this issue Jan 15, 2024 · 5 comments
Labels
development-process Related to development process of arrow-rs enhancement Any new improvement worthy of a entry in the changelog

Comments

@Omega359
Copy link
Contributor

This is related to apache/datafusion#5398 where functions for parsing string -> datetime and string -> nanos would be added to include a second 'format' parameter. The format would be a chrono format (https://docs.rs/chrono/latest/chrono/format/strftime/index.html) and would return an Err if the string couldn't be parsed with the provided format.

The proposed two new functions that would be added would be :

  • pub fn string_to_datetime_formatted<T: TimeZone>(timezone: &T, s: &str, format: &str) -> Result<DateTime<T>, ArrowError> (mirrors the existing string_to_datetime function)
  • pub fn string_to_timestamp_nanos_formatted(s: &str, format: &str) -> Result<i64, ArrowError> (mirrors the existing string_to_timestamp_nanos function)

In addition, for completeness all the existing Parser implementations in that file would be updated to include a 'parse_formatted' implementation where missing.

@Omega359 Omega359 added the enhancement Any new improvement worthy of a entry in the changelog label Jan 15, 2024
@tustvold
Copy link
Contributor

tustvold commented Jan 15, 2024

A few points:

  • The current logic is quite highly optimised, we need to be careful not to introduce any performance regressions
  • There are a couple of known bugs with the chrono format strings where the parser is overly permissive - %Y should return an error when parsing a non-4 digit year chronotope/chrono#332
  • Timestamp parsing, especially timezones is very fiddly. I would request we keep code changes small, as it is very easy to subtly break things

Overall I am rather lukewarm on this functionality, as it is by definition diverging from established standards, but I will endeavour to review well tested and focused changes to this logic that don't regress existing workloads.

@Omega359
Copy link
Contributor Author

Overall I am rather lukewarm on this functionality, as it is by definition diverging from established standards, but I will endeavour to review well tested and focused changes to this logic that don't regress existing workloads.

Would you prefer this to not be in arrow-cast and rather be left in datafusion? I'm only adding it here to mirror the existing functionality - I have no desire to change any of the current implementation details.

@tustvold
Copy link
Contributor

Incubating functionality in DataFusion before lifting it into arrow-rs has been a successful model in the past, and might be a quicker way to iterate on apache/datafusion#5398 without having to wait on an arrow-rs releases.

@Omega359
Copy link
Contributor Author

Sounds good, I'll pull in my changes to DataFusion for now and we can close this issue.

@Omega359
Copy link
Contributor Author

Closing this issue with apache/datafusion#5398 as an alternative.

@tustvold tustvold added the development-process Related to development process of arrow-rs label Mar 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
development-process Related to development process of arrow-rs enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

2 participants