Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor temporal extract date part kernels #5319

Merged
merged 5 commits into from
Jan 24, 2024

Conversation

Jefffrey
Copy link
Contributor

Which issue does this PR close?

Closes #5266

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

No, aside from new function. Existing functions (hour(...), year(...)) are unchanged, only internals are changed and should maintain existing behaviour.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 21, 2024
Copy link
Contributor Author

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know what you think @tustvold , if this falls in line with what you're imagining.

I tried to make things simpler, without resorting to macros so the code might be duplicated a bit but I tried to reduce as much as I could.

Happy for any feedback

Comment on lines 121 to 122
// TODO expand support for Time types, see: https://github.com/apache/arrow-rs/issues/5261
_ => return_compute_error_with!(format!("{part} does not support"), array.data_type()),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can tackle this as separate issue to make this refactoring simpler (maintain existing behaviour)

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a nice step forward, I have some ideas on how we might use generic to reduce some of the duplication, but I'd need to try it out to know if it works 😅

}
});
/// Returns function to extract relevant [`DatePart`] from a [`NaiveDateTime`].
fn get_naive_date_time_part_extract_fn(part: DatePart) -> fn(NaiveDateTime) -> i32 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why this doesn't need to be impl fn

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason, really. I just stuck with the first signature that worked here 😅

arrow-arith/src/temporal.rs Outdated Show resolved Hide resolved
Comment on lines 253 to 265
fn date_part_timestamp_ns(
array: &PrimitiveArray<TimestampNanosecondType>,
part: DatePart,
) -> Int32Array {
let map_func = get_naive_date_time_part_extract_fn(part);
array.unary_opt(|d| timestamp_ns_to_datetime(d).map(map_func))
}

fn date_part_timestamp_ns_tz(
array: &PrimitiveArray<TimestampNanosecondType>,
part: DatePart,
tz: Tz,
) -> Int32Array {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One way that you might be able to avoid this duplication would be to make the functions take a generic Timezone argument, and then provide Utc as the timezone when none is provided

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, will look into it

@Jefffrey
Copy link
Contributor Author

I think this is a nice step forward, I have some ideas on how we might use generic to reduce some of the duplication, but I'd need to try it out to know if it works 😅

I've refactored it to use generics more, let me know if it looks better this way 👍

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very nice, thank you

}

/// Implement the specialized functions for extracting date part from temporal arrays.
trait ExtractDatePartExt {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, this is exactly what I was thinking of 👍

| DatePart::Microsecond
| DatePart::Nanosecond = part
{
Ok(self.unary(|_| 0))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW it might be cheaper to do Int32Array::new(vec![0; self.len()].into(), self.nulls().clone()) or something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, applied change

@@ -228,14 +372,14 @@ where
T: ArrowTemporalType + ArrowNumericType,
i64: From<T::Native>,
{
time_fraction_internal(array, "year", |t| t.year())
date_part_primitive(array, DatePart::Year)
}

/// Extracts the quarter of a given temporal array as an array of integersa within
/// the range of [1, 4]. If the given array isn't temporal primitive or dictionary array,
/// an `Err` will be returned.
pub fn quarter_dyn(array: &dyn Array) -> Result<ArrayRef, ArrowError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we deprecate these functions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably for the best, as if in the future we want to support more parts like century or decade in arrow-rs natively, we wouldn't have to add these shim functions for consistency if they are already deprecated.

That does raise a question that comes to mind, that if we add more of these parts, by extending the DatePart enum, would that constitute a breaking API change? Since we're introducing a new enum variant. Maybe mark it as non-exhaustive, or would this be unnecessary since its mainly used as an input to the date_part() function?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non exhaustive sounds like a reasonable idea to me

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add this and the deprecations for the shim functions, then PR should be good to go

@tustvold tustvold merged commit 20e723e into apache:master Jan 24, 2024
22 checks passed
@Jefffrey Jefffrey deleted the refactor_temporal_date_part_kernels branch January 24, 2024 20:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Temporal Extract/Date Part Kernel
2 participants