Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor Interval Arithmetic Updates #8276

Merged
merged 47 commits into from
Nov 21, 2023
Merged

Refactor Interval Arithmetic Updates #8276

merged 47 commits into from
Nov 21, 2023

Conversation

berkaysynnada
Copy link
Contributor

@berkaysynnada berkaysynnada commented Nov 20, 2023

Which issue does this PR close?

Closes #7883 .

Rationale for this change

This is a refactoring PR for the interval library interval_arithmetic.rs. The key points are:

  1. Move the interval library to the expr crate to make it accessible to logical plans.
  2. Do we really need a bound type? Adopting a convention of always using open or always closed intervals may simplify usage.
  3. There should be no possible ways to create invalid intervals.
  4. There must be no multiple representations of the same intervals.
  5. Support for multiplication (Mul) and division (Div).
  6. Improved cardinality calculation.
  7. Improved overflow handling.
  8. Expanded test coverage.
  9. Support for temporal types.
  10. Propagation of boolean intervals through comparison and equality operators.
  11. A more understandable API and documentation for the library and cp_solver.rs functions.

What changes are included in this PR?

This PR includes the following changes:

  1. Interval bounds are now always closed. For strict equality cases, we use one-step increase or decrease utilities.
  2. There is a single method for creating intervals, which always returns valid and standardized intervals.
  3. Added support for multiplication and division.
  4. Overflows are handled more intelligently.
  5. Cardinality calculation has been simplified and made more accurate.
  6. Temporal types are now supported.
  7. Boolean intervals can be propagated through logical operators.
  8. OR operator is supported.
  9. Documentation has been improved. The new structure makes future implementations easier.

Are these changes tested?

Yes, both with existing tests and newly added tests for new features.

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions optimizer Optimizer rules labels Nov 20, 2023
Copy link
Contributor

@ozankabak ozankabak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed this very carefully and collaborated with @berkaysynnada closely on this. Since this will be a foundational utility for many use cases, we are looking forward to reviews from other pairs of eyes as well -- @alamb, PTAL

@alamb
Copy link
Contributor

alamb commented Nov 20, 2023

I will try to review this carefully over the next day or two, but given its size I don't think I will be able to provide a super detailed review. However, since @ozankabak has already done so I feel it is in good hands.

I will focus on the interaction with other components / test updates.

@berkaysynnada
Copy link
Contributor Author

There is such a diff size because we moved the main code. The key changes are in interval_arithmetic.rs and cp_solver.rs. We have carefully reviewed and thoroughly tested the functions for arithmetic operations on intervals. Reviewing from a higher-level perspective might be more time-efficient.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @berkaysynnada -- I think this API is much nicer.

I clearly wasn't able to give the whole thing a thorough review, but I did review test cases (outside the interval implementation itself) that were changed as well as the API where it interacted with the rest of the system

All in all I think it is really nicely done.

I had some additional improvement suggestions, but nothing I think is needed prior to merge

@@ -207,7 +206,7 @@ impl<S: SimplifyInfo> ExprSimplifier<S> {
/// (
/// col("x"),
/// NullableInterval::NotNull {
/// values: Interval::make(Some(3_i64), Some(5_i64), (false, false)),
/// values: Interval::make(Some(3_i64), Some(5_i64)).unwrap()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is much nicer (avoid the explicit open/closed boundaries)

@@ -3281,7 +3279,7 @@ mod tests {
(
col("c3"),
NullableInterval::NotNull {
values: Interval::make(Some(0_i64), Some(2_i64), (false, false)),
values: Interval::make(Some(0_i64), Some(2_i64)).unwrap(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the presence of Interval::make and Interval::try_new confusing as they both do the same thing but looking at this code I wonder "why sometimes use make and sometimes try_new?".

Would it be possible to remove all calls t Interval::make and instead call Interval::try_new (could be a follow on PR)

Copy link
Contributor

@ozankabak ozankabak Nov 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make is a utility function for use in tests only. Using try_new in test functions balloons the line count 🙂 I am adding a comment in the docstring to make this clear.

BTW I tried a few tricks to make this a test-only function (using config directives, moving to a separate test_utils.rs file etc.), but it is not very straightforward since this is used in tests of multiple crates. One way I was able to make it work was via feature flags, but since that is a more involved change I decided not to put it in this PR.

@@ -262,18 +260,18 @@ mod tests {
#[test]
fn test_inequalities_non_null_bounded() {
let guarantees = vec![
// x ∈ (1, 3] (not null)
// x ∈ [1, 3] (not null)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @wjones127 (as I think you added this code originally in #7467)

@@ -34,13 +34,16 @@ path = "src/lib.rs"

[features]
crypto_expressions = ["md-5", "sha2", "blake2", "blake3"]
default = ["crypto_expressions", "regex_expressions", "unicode_expressions", "encoding_expressions"]
default = ["crypto_expressions", "regex_expressions", "unicode_expressions", "encoding_expressions",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a strange reformatting (as is the reformatting in the other .toml files) but I don't see anything wrong with it.

If you can figure out how to avoid such changes I think it would result in smaller diffs that are easier to review

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this kind of thing is distracting. Must be some VS code config thing, we'll share if we figure out why/how it turns up.

.unwrap_or(empty_field),
),
);
let interval = Interval::try_new(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much nicer

}
}

#[cfg(test)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 for half the file being tests

/// };
/// assert_eq!(interval.single_value(), None);
/// ```
pub fn single_value(&self) -> Option<ScalarValue> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: You could probably return Option<&ScalarValue> here to avoid the clone in some cases

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look at this, but the constructor call in the Null variant creates friction.

// In IEEE 754 standard for floating-point arithmetic, if we keep the sign and exponent fields same,
// we can represent 4503599627370496+1 different numbers by changing the mantissa
// (4503599627370496 = 2^52, since there are 52 bits in mantissa, and 2^23 = 8388608 for f32).
let distinct_f64 = 4503599627370497;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code looks like all floating point cardinality estimate is basically a large constant value I think it is not very valuable for use in cost models. It seems like there non trivial amounts of code that implies the result will be important (for example

    pub fn cardinality(&self) -> Option<u64> {
        let data_type = self.data_type();
        if data_type.is_integer() {
            self.upper.distance(&self.lower).map(|diff| diff as u64)
        } else if data_type.is_floating() {
            // Negative numbers are sorted in the reverse order. To
            // always have a positive difference after the subtraction,
            // we perform following transformation:
            match (&self.lower, &self.upper) {
                (
                    ScalarValue::Float32(Some(lower)),
                    ScalarValue::Float32(Some(upper)),
                ) => {
                    let lower_bits = map_floating_point_order!(lower.to_bits(), u32);
                    let upper_bits = map_floating_point_order!(upper.to_bits(), u32);
                    Some((upper_bits - lower_bits) as u64)
                }
                (
                    ScalarValue::Float64(Some(lower)),
                    ScalarValue::Float64(Some(upper)),
                ) => {
                    let lower_bits = map_floating_point_order!(lower.to_bits(), u64);
                    let upper_bits = map_floating_point_order!(upper.to_bits(), u64);
                    let count = upper_bits - lower_bits;
                    (count != u64::MAX).then_some(count)
                }
                _ => None,
            }

Given that, what do you think about simply special casing the values in this test (e.g. return 4503599627370497) in the implementation as well?

Copy link
Contributor

@ozankabak ozankabak Nov 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That value in tests was carefully chosen to match exponents-of-two boundaries. It is just for ease of testing -- since IEEE 754 floats are logarithmically distributed in the normal zone but linearly distributed in the subnormal zone, one needs to resort to tricks like this during testing.

The actual implementation works in non-boundary-aligned cases too; it achieves generality by exploiting the bit-pattern ordering specification clause in IEEE 754.

I will leave comments both in the impl code and the test code to clarify this.

// Negative numbers are sorted in the reverse order. To
// always have a positive difference after the subtraction,
// we perform following transformation:
match (&self.lower, &self.upper) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment below -- on the surface this looks like it is doing something complex for floating point values but the output doesn't seem to be very sophisticated. I think we could simplify the code, as metioned below)

/// The `Interval` type represents a closed interval used for computing
/// reliable bounds for mathematical expressions.
///
/// Conventions:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@ozankabak
Copy link
Contributor

I will merge this to unlock future work on statistics and other dependent work. If there is anything we missed, just drop a line here and we will address with a quick follow-on PR.

@ozankabak ozankabak merged commit e9b9645 into apache:main Nov 21, 2023
22 checks passed
@alamb alamb changed the title Interval Arithmetic Updates Refactor Interval Arithmetic Updates Nov 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for OR to interval arithmetic
4 participants