Improve type coercion and casting #8302

jayzhan211 · 2023-11-22T12:50:33Z

Is your feature request related to a problem or challenge?

I think there is room for improvement in type coerceion or casting.

Background

comparison_coercion is widely used in datafusion, a lossless conversion
https://github.com/apache/arrow-datafusion/blob/main/datafusion/expr/src/type_coercion/binary.rs

can_coerce_from is used mainly for signature, a lossless conversion
https://github.com/apache/arrow-datafusion/blob/main/datafusion/expr/src/type_coercion/functions.rs

can_cast_types is from arrow-cast, which is a lossy conversion. It is also used in some comparison_coercion building block. https://github.com/apache/arrow-rs/blob/df69ef57d055453c399fa925ad315d19211d7ab2/arrow-cast/src/cast.rs#L76-L273

Not sure if there is other coercion I missed

Proposal

comparison_coercion and can_coerce_from seem like doing the similar thing, maybe we can just have one lossless conversion. If lossless conversion is useful for arrow-rs, we can introduce a lossless version of can_cast_types, then rely on it for datafusion.

Lossy conversion vs Lossless

I think the definition for lossy is that the value is not recoverable after casting back, otherwise it is lossless.

Lossy

Int32 to Int16 / Int8

Lossless

Int32 to Int64

Describe the solution you'd like

Replace can_coerce_from with comparison_coercion's building block numeric coercion, list coercion, string coercion, null coercion, etc
Split list_coercion from string_coercion to make each building block of coercion clear on the task it focus on. list_coercion do list/fixed size list/large list coercion, string_coercion do utf/large utf coercion.
Introduce these lossless coercion to arrow-rs?

Known issue or question I have

Introduce list_coercion that currently exist in string_concat_coercion
No list coercion for can_coerce_from
Decimal128 can cast to Float64 in can_coerce_from, why?

Describe alternatives you've considered

If there are many customize conversion need, then this change might not be helpful at all. We need other approach to let type casting / coercion easy to use.

No response

Additional context

No response

The text was updated successfully, but these errors were encountered:

jayzhan211 · 2023-11-22T13:15:19Z

@tustvold Do you think introduce a lossless version of can_cast_types helpful?
cc @alamb

tustvold · 2023-11-22T13:19:27Z

Decimal128 can cast to Float64 in can_coerce_from, why?

Floating points inherently permit precision loss. It is fairly standard that otherwise checked conversion permit precision loss when converting between floats.

For example - https://docs.rs/num-traits/latest/num_traits/cast/trait.FromPrimitive.html

A value can be represented by the target type when it lies within the range of scalars supported by the target type. For example, a negative integer cannot be represented by an unsigned integer type, and an i64 with a very high magnitude might not be convertible to an i32. On the other hand, conversions with possible precision loss or truncation are admitted, like an f32 with a decimal part to an integer type, or even a large f64 saturating to f32 infinity.

No list coercion for can_coerce_from

I think this would make sense as an addition, the current support for nested types in arrow-cast is very limited.

If lossless conversion is useful for arrow-rs

In general I think it would be very confusing if coercion can result in silent data loss, perhaps you might articulate why this would be useful? My current thinking is casting should never silently lose data.

jayzhan211 · 2023-11-22T13:59:30Z

My current thinking is casting should never silently lose data.

https://github.com/apache/arrow-rs/blob/df69ef57d055453c399fa925ad315d19211d7ab2/arrow-cast/src/cast.rs#L76C2-L80

Comment in can_cast_types mention that cast is lossy.

https://github.com/apache/arrow-rs/blob/df69ef57d055453c399fa925ad315d19211d7ab2/arrow-cast/src/cast.rs#L212-L216

Casting Int64 to Int32 return true. I'm not sure if the actual casting block this lossy conversion, but if not, could it silently lose data?

tustvold · 2023-11-22T14:01:49Z

That comment is incorrect and doesn't match the implementation. I believe it means that it supports conversions that are potentially fallible, e.g. Int32 to Int8, but it never silently looses data

jayzhan211 · 2023-11-22T14:05:59Z

That comment is incorrect and doesn't match the implementation. I believe it means that it supports conversions that are potentially fallible, e.g. Int32 to Int8, but it never silently looses data

Any scenario that we should hint it as potentially fallible but actually do nothing for them? It seems to me return false is more straightforward if we do nothing.

jayzhan211 · 2023-11-22T14:09:51Z

However, I think we can still cleanup comparison_coercion and can_coerce_from and forget about lossy conversion.

jayzhan211 · 2023-11-22T14:17:13Z

That comment is incorrect and doesn't match the implementation. I believe it means that it supports conversions that are potentially fallible, e.g. Int32 to Int8, but it never silently looses data

Any scenario that we should hint it as potentially fallible but actually do nothing for them? It seems to me return false is more straightforward if we do nothing.

Maybe Int64(1) cast to Int32(1) is one that is able to safely cast without data loss.

jayzhan211 · 2023-11-22T14:23:30Z

Maybe we need another verion of can_cast_types that no possible fallible hint. For i64, only u64 or larger type, no i32 or i16 since we don't know the actual data.

tustvold · 2023-11-22T14:24:53Z

Is that not what can_coerce_from is?

jayzhan211 · 2023-11-22T14:31:38Z

can_coerce_from

Yes! Should we have it aside can_cast_types in arrow-rs? Or just let it stay in datafusion for now.

tustvold · 2023-11-22T14:34:45Z

Given the type coercion logic itself lives in DataFusion I personally think it makes sense for it to remain alongside it

jayzhan211 · 2023-11-23T01:29:26Z

Short summary for the next step of this issue

Replace can_coerce_from with component in comparison_coercion, so we dont need to think about which one to use. Reuse the code if possible, call can_cast_types if possible.
Add nested support. Start from List

TODO

Introduce Boolean Coercion #8331
Support LargeUtf8 to Temporal Coercion #8357
Improve numerice_coercion and reuse in both compare op and math op #8385
Move Coercion for MakeArray to coerce_arguments_for_signature and introduce another one for ArrayAppend #8317
coercion for MakeArray anf Expr::InList should be the same https://github.com/apache/arrow-datafusion/blob/06bbe1298fa8aa042b6a6462e55b2890969d884a/datafusion/optimizer/src/analyzer/type_coercion.rs#L284-L317
Reuse array_concat/append/prepend signature for list concat ops.

Issue

~~Return Result for comparison_coercion, so we can differentiate unimplemented casting or not possible casting~~ Return Option<T> is probably enough.
~~Consider having component of comparison_coercion public, like Allow default value on lag/lead if they are coerceable to value #8308 can have numerice_coercion instead of comparison_coercion~~. No use case.
Float is converted to Decimal with possible overflow in comparison_coercion, while Decimal is coerced to Float in can_coerced_from. Confused 😕 😕 😕

Note

can_coerce_from / comparison_coercion unlike arrow-cast should only check the casting to the larger type since the actual data is unknown.

jayzhan211 · 2023-11-29T01:12:55Z

https://github.com/apache/arrow-datafusion/blob/2a692446f46ef96f48eb9ba19231e9576be9ff5a/datafusion/expr/src/type_coercion/binary.rs#L388-L406

In comparison_binary_numeric_coercion, signed type and u64 is coerced to i64, anyone knows why is this acceptable? what kind of cases are?

for type u64 and f32, it will return f32 which is smaller than u64, not sure why is this too

alamb · 2023-11-29T21:53:20Z

https://github.com/apache/arrow-datafusion/blob/2a692446f46ef96f48eb9ba19231e9576be9ff5a/datafusion/expr/src/type_coercion/binary.rs#L388-L406

In comparison_binary_numeric_coercion, signed type and u64 is coerced to i64, anyone knows why is this acceptable? what kind of cases are?

for type u64 and f32, it will return f32 which is smaller than u64, not sure why is this too

The comment in the code you linked seems to say "some information loss is inevitable":

        // The following match arms encode the following logic: Given the two
        // integral types, we choose the narrowest possible integral type that
        // accommodates all values of both types. Note that some information
        // loss is inevitable when we have a signed type and a `UInt64`, in
        // which case we use `Int64`;i.e. the widest signed integral type.

It is likely for "usability" so that you can compare i64 and u64 without an explicit type cast 🤷

jayzhan211 · 2023-11-30T00:45:23Z

https://github.com/apache/arrow-datafusion/blob/2a692446f46ef96f48eb9ba19231e9576be9ff5a/datafusion/expr/src/type_coercion/binary.rs#L388-L406

In comparison_binary_numeric_coercion, signed type and u64 is coerced to i64, anyone knows why is this acceptable? what kind of cases are?
for type u64 and f32, it will return f32 which is smaller than u64, not sure why is this too

The comment in the code you linked seems to say "some information loss is inevitable":
        // The following match arms encode the following logic: Given the two
        // integral types, we choose the narrowest possible integral type that
        // accommodates all values of both types. Note that some information
        // loss is inevitable when we have a signed type and a `UInt64`, in
        // which case we use `Int64`;i.e. the widest signed integral type.
It is likely for "usability" so that you can compare i64 and u64 without an explicit type cast 🤷

I don't think that usability is the good reason :( It might cause data loss.

@alamb I would like to change them to the straightforward version, choose the larger type between two. I think it can also be used all the place in datafusion easily (one place is can_coerce_from). Is there any concern about this change?

If such implicit casting is found helpful in the end, we can have two version, one that lossy is acceptable, one is not.

alamb · 2023-11-30T00:56:43Z

. Is there any concern about this change?

I would recommend you write some tests / try out the change before making a full featured PR. I am not sure what the implications are with the change you propose, but I suspect it could be non trivial and subtle

jayzhan211 · 2023-11-30T13:06:26Z

// The following match arms encode the following logic: Given the two
// integral types, we choose the narrowest possible integral type that
// accommodates all values of both types. Note that some information
// loss is inevitable when we have a signed type and a UInt64, in
// which case we use Int64;i.e. the widest signed integral type.

I rethink about this. This is the best we can do, like the comment mentioned, it is inevitable :(

jayzhan211 · 2023-12-02T05:55:34Z

After #8385, I found that it is not ideal to have one coercion for all the place (compare op, math op, signature coercion, etc... ).

jayzhan211 added the enhancement New feature or request label Nov 22, 2023

tustvold mentioned this issue Nov 22, 2023

Improve cast docs apache/arrow-rs#5114

Merged

alamb mentioned this issue Nov 22, 2023

DataFusion weekly project plan (Andrew Lamb) - Nov 20, 2023 #8274

Closed

7 tasks

jayzhan211 mentioned this issue Nov 25, 2023

Move Coercion for MakeArray to coerce_arguments_for_signature and introduce another one for ArrayAppend #8317

Merged

alamb mentioned this issue Nov 27, 2023

DataFusion weekly project plan (Andrew Lamb) - Nov 27, 2023 #8329

Closed

8 tasks

jayzhan211 mentioned this issue Nov 27, 2023

Introduce Boolean Coercion #8331

Closed

jayzhan211 changed the title ~~Categorize lossless type coercion and lossy type coercion~~ Unify type coercion and casting Nov 27, 2023

jayzhan211 mentioned this issue Nov 29, 2023

Support LargeUtf8 to Temporal Coercion #8357

Merged

jayzhan211 mentioned this issue Dec 1, 2023

Improve numerice_coercion and reuse in both compare op and math op #8385

Closed

jayzhan211 changed the title ~~Unify type coercion and casting~~ Improve type coercion and casting Dec 2, 2023

jayzhan211 mentioned this issue Apr 15, 2024

Coerce Dictionary types for scalar functions #10077

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve type coercion and casting #8302

Improve type coercion and casting #8302

jayzhan211 commented Nov 22, 2023 •

edited

Loading

jayzhan211 commented Nov 22, 2023

tustvold commented Nov 22, 2023 •

edited

Loading

jayzhan211 commented Nov 22, 2023 •

edited

Loading

tustvold commented Nov 22, 2023 •

edited

Loading

jayzhan211 commented Nov 22, 2023 •

edited

Loading

jayzhan211 commented Nov 22, 2023

jayzhan211 commented Nov 22, 2023

jayzhan211 commented Nov 22, 2023 •

edited

Loading

tustvold commented Nov 22, 2023

jayzhan211 commented Nov 22, 2023

tustvold commented Nov 22, 2023

jayzhan211 commented Nov 23, 2023 •

edited

Loading

jayzhan211 commented Nov 29, 2023 •

edited

Loading

alamb commented Nov 29, 2023

jayzhan211 commented Nov 30, 2023 •

edited

Loading

alamb commented Nov 30, 2023

jayzhan211 commented Nov 30, 2023

jayzhan211 commented Dec 2, 2023

Improve type coercion and casting #8302

Improve type coercion and casting #8302

Comments

jayzhan211 commented Nov 22, 2023 • edited Loading

Is your feature request related to a problem or challenge?

Background

Proposal

Lossy conversion vs Lossless

Lossy

Lossless

Describe the solution you'd like

Known issue or question I have

Describe alternatives you've considered

Additional context

jayzhan211 commented Nov 22, 2023

tustvold commented Nov 22, 2023 • edited Loading

jayzhan211 commented Nov 22, 2023 • edited Loading

tustvold commented Nov 22, 2023 • edited Loading

jayzhan211 commented Nov 22, 2023 • edited Loading

jayzhan211 commented Nov 22, 2023

jayzhan211 commented Nov 22, 2023

jayzhan211 commented Nov 22, 2023 • edited Loading

tustvold commented Nov 22, 2023

jayzhan211 commented Nov 22, 2023

tustvold commented Nov 22, 2023

jayzhan211 commented Nov 23, 2023 • edited Loading

TODO

Issue

Note

jayzhan211 commented Nov 29, 2023 • edited Loading

alamb commented Nov 29, 2023

jayzhan211 commented Nov 30, 2023 • edited Loading

alamb commented Nov 30, 2023

jayzhan211 commented Nov 30, 2023

jayzhan211 commented Dec 2, 2023

jayzhan211 commented Nov 22, 2023 •

edited

Loading

tustvold commented Nov 22, 2023 •

edited

Loading

jayzhan211 commented Nov 22, 2023 •

edited

Loading

tustvold commented Nov 22, 2023 •

edited

Loading

jayzhan211 commented Nov 22, 2023 •

edited

Loading

jayzhan211 commented Nov 22, 2023 •

edited

Loading

jayzhan211 commented Nov 23, 2023 •

edited

Loading

jayzhan211 commented Nov 29, 2023 •

edited

Loading

jayzhan211 commented Nov 30, 2023 •

edited

Loading