Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POC: Splitting scalar functions outside Datafusion core #7752

Closed
wants to merge 1 commit into from

Conversation

2010YOUY01
Copy link
Contributor

@2010YOUY01 2010YOUY01 commented Oct 6, 2023

Which issue does this PR close?

POC for #7110

Rationale for this change

This POC PR defined a new interface for ButinScalarFunction: trait ScalarFunctionDef, to make it possible to define scalar functions outside the core as #7110 stated.
This PR also demonstrated how to define function packages in their own crate and register them into context in datafusion-examples/examples/external_function_package.rs

Background

Currently, there are two ways to define a scalar function in Datafusion

  1. built-in functions are defined as enum BuiltinScalarFunction: it's a huge enum and there are several large functions like return_type() to check every enum variant. This way of implementation is not possible to define functions in other places, so there exist another way to define UDFs.
  2. UDFs: currently it uses struct ScalarUDF as an internal representation for UDF's logical plan, it's expressive enough to define all builtin functions (almost all, need some minor modifications like supporting aliases). There is an interface create_udf() on top of struct ScalarUDF for users to define UDFs.

 ┌───────────────────────┐  ┌──────────────────────────┐  ┌────────────────────────────────┐
 │                       │  │                          │  │                                │
 │                       │  │ create_udf()             │  │*THIS PR                        │
 │                       │  │ (interface to define UDF)│  │ trait ScalarFunctionDef        │
 │                       │  │                          │  │ (an interface to define        │
 │                       │  │                          │  │  scalar functions outside      │
 │                       │  │                          │  │  datafusion core)              │
 │ BuiltinScalarFunction │  │                          │  │                                │
 │                       │  └──────────────────────────┘  └────────────────────────────────┘
 │                       │
 │                       │  ┌──────────────────────────────────────────────────────────────┐
 │                       │  │ struct ScalarUDF                                             │
 │                       │  │ (internal representation for UDF's logical plan)             │
 │                       │  │                                                              │
 └───────────────────────┘  └──────────────────────────────────────────────────────────────┘

A new interface trait ScalarFunctionDef

ScalarFunctionDef trait can be used to define scalar functions, and the trait object can be converted into ScalarUDF, and reuse UDF infrastructures to register and use the functions.

pub trait ScalarFunctionDef: Sync + Send + std::fmt::Debug {
    fn name(&self) -> &str;
    fn signature(&self) -> Signature;
    fn return_type(&self) -> ReturnTypeFunction;
    fn execute(&self, args: &[ArrayRef]) -> Result<ArrayRef>;
}

One potential concern is why is this new trait needed, can we just use the existing interface struct ScalarUDF?
If the function interface is defined as a struct, then a new function should be defined in an imperative way (see examples/simple_udf.rs). There might be a lot of functions for function packages to manage, defining them in a declarative way (using trait) is cleaner and more ergonomic.
Besides, trait ScalarFunctionDef should cover the same functionality as struct ScalarUDF (also the same as BuiltinScalarFunction), now it's converted into ScalarUDF just because this way can reuse the existing implementation and minimize the patch size, it's possible to replace ScalarUDF with trait ScalarFunctionDef in the future as a cleanup task.
I think it's also possible to replace BuitlinScalarFunction with the new interface, as another cleanup refactor task. Adding a new function into BuiltinScalarFunction requires modifications in multiple places/files, the new ScalarFunctionDef interface only needs modifying one place, which can be simpler.

What changes are included in this PR?

1. ScalarFunctionDef trait as an alternative way to define scalar functions

Struct ScalarUDF is currently the internal representation of UDF's logical plan, it should be expressive enough to implement all built-in functions. (with some minor changes like supporting alias)
ScalarFunctionDef trait can be used to define scalar functions, and the trait object can be converted into ScalarUDF, and reuse UDF infrastructures to register and use the functions.

2. Managing extension points

Another issue is how to manage those pluggable crates, possibly with a extension/ folder under datafusion root?
Recently I saw discussions about pulling parquet reader outside the core, those optional extension points can all be placed inside extension/ folder

.
├── extension
│   ├── data-source
│   │   └── parquet-reader
│   └── scalar-function
│       ├── hash-functions
│       ├── string-functions
│       └── ...

For example arrow-datafusion/extension/scalar-function/hash-functions is a separate crate defined all hash-related functions.

If this approach is possible I can try to pull some functions from datafusion core into a separate crate after

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions core Core DataFusion crate labels Oct 6, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really cool @2010YOUY01 -- thank you very much 🙏

I suggest the next steps is we get some more feedback (I will post this PR's link around and see if we can get anyone else to comment on it).

Then I agree it would be great to try porting 2-3 exising BuiltInScalarFunctons and see what it looks like with the new API

It is great to see this work pushed forward.

2. Managing extension points

Another issue is how to manage those pluggable crates, possibly with a extension/ folder under datafusion root? Recently I saw discussions about pulling parquet reader outside the core, those optional extension points can all be placed inside extension/ folder

I left a suggestion inline

@@ -41,6 +41,7 @@ datafusion-common = { path = "../datafusion/common" }
datafusion-expr = { path = "../datafusion/expr" }
datafusion-optimizer = { path = "../datafusion/optimizer" }
datafusion-sql = { path = "../datafusion/sql" }
datafusion-extension-test-scalar-func = {path = "../extension/scalar-function/test-func"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since most of DataFusion is extensions, another potential might be a path structure like the following?

├── core #  (already exists)
├── functions-aggregate-core
├── functions-aggregate-statistics
├── functions-scalar-array
├── functions-scalar-core
├── functions-scalar-crypto
├── functions-scalar-timestamp
├── physical-expr #  (already exists)
...
└── physical-plan #  (already exists)

@@ -792,6 +794,30 @@ impl SessionContext {
.add_var_provider(variable_type, provider);
}

/// Register a function package into this context
pub fn register_scalar_function_package(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to implement a new API, maybe we can have it cover all types of functions (window, aggregate, scalar) so there is a place to add table functions eventually too:

Something like

    pub fn register_function_package(&self, func_pkg: Box<dyn FunctionPackage>)

Also I wonder if it should take Box<dyn FunctionPackage> (which is owned) or Arc<dyn FunctionPackage> which can be shared. Looking at this code, it seems like there is no reason for an owned pointer and we could get away with Arc (so function packages can be quickly registered with multiple SessionContexts)


// TODO: ReturnTypeFunction -> a ENUM
// most function's return type is either the same as 1st arg or a fixed type
fn return_type(&self) -> ReturnTypeFunction;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current API requires return_type to be passed the specific argument types

Also, #7657 suggests there is additional room for improvement

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might mark this API as experimental for now, it can have frequent changes to this API, other unexpected cases like #7657 might show up when porting some trickier functions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that idea --

maybe the implementation plan sequence could be:

  1. Introduce ScalarFunctionDef
  2. impl ScalarFunctionDef for ScalarUDF and BuiltInFunction
  3. Update the core DataFusion code to only use ScalarFunctionDef
  4. Port all BuiltInFunction to ScalarFunctionDef
  5. Deprecate ScalarFunctionDef
  6. Remove BuiltinFunctionDef

Copy link
Contributor Author

@2010YOUY01 2010YOUY01 Oct 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds great 👍🏼 My plan is
1st PR: impl ScalarFunctionDef for BuiltinScalarFunction and replace the execution code with the new interface. This step makes sure the new interface can cover all functionalities, and determines how the interface looks like
2nd PR: replace current UDF impl with the new interface
3rd and after: start porting existing functions

use std::fmt;
use std::fmt::Debug;
use std::fmt::Formatter;
use std::sync::Arc;

pub trait ScalarFunctionDef: Sync + Send + std::fmt::Debug {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like using a trait to define functions. To be honest, I am not sure why DataFusion didn't so that originally.

My only concern is that adding this we would now have three ways to define scalar functions (the ScalarUDF, ScalarFunctionDef and BuiltInScalarFunction).

I wonder if this trait is needed, or can we extend ScalarUDF to account for aliases?

Or perhaps should we be aiming to consolidate all functions to use ScalarFuntionDef 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using a trait to define functions is more clear and easy to use, if extending ScalarUDF, then we must init a struct for each function in an imperative way, it can get messy if there are a lot of function packages to manage.
And trait ScalarFunctionDef should be equivalent to ScalarUDF and BuitInScalarFunction, it's possible to replace all with ScalarFunctionDef in the future
(I also updated a more detailed answer for this concern in the original PR rationale part)

Copy link
Contributor

@alamb alamb Oct 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using a trait to define functions is more clear and easy to use

Yes, I agree

And trait ScalarFunctionDef should be equivalent to ScalarUDF

Maybe we could do something like

impl SclarFunctionDef for ScalarUDF {
...
}

So all of DataFusion's code was in terms of ScalarFunctionDef.

And then (eventally) deprecate ScalarUDF as part of helping people migrate to using ScalarFunctionUDF 🤔

}

pub trait ScalarFunctionPackage {
fn functions(&self) -> Vec<Box<dyn ScalarFunctionDef>>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest using Arc here as I don't see any reason for the to be state / need an owned copy

@alamb alamb marked this pull request as draft October 10, 2023 16:17
@alamb
Copy link
Contributor

alamb commented Oct 11, 2023

I think @sadboy, @schulte-lukas and @wolfram-s may also be interested in this idea

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants