diff --git a/datafusion/core/src/lib.rs b/datafusion/core/src/lib.rs index f5805bc06982..63dbe824c231 100644 --- a/datafusion/core/src/lib.rs +++ b/datafusion/core/src/lib.rs @@ -620,6 +620,12 @@ doc_comment::doctest!( user_guide_example_usage ); +#[cfg(doctest)] +doc_comment::doctest!( + "../../../docs/source/user-guide/crate-configuration.md", + user_guide_crate_configuration +); + #[cfg(doctest)] doc_comment::doctest!( "../../../docs/source/user-guide/configs.md", diff --git a/docs/source/index.rst b/docs/source/index.rst index d491df04f7fe..8fbff208f561 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -41,13 +41,16 @@ DataFusion offers SQL and Dataframe APIs, excellent CSV, Parquet, JSON, and Avro, extensive customization, and a great community. -To get started with examples, see the `example usage`_ section of the user guide and the `datafusion-examples`_ directory. +To get started, see -See the `developer’s guide`_ for contributing and `communication`_ for getting in touch with us. +* The `example usage`_ section of the user guide and the `datafusion-examples`_ directory. +* The `library user guide`_ for examples of using DataFusion's extension APIs +* The `developer’s guide`_ for contributing and `communication`_ for getting in touch with us. .. _example usage: user-guide/example-usage.html .. _datafusion-examples: https://github.com/apache/datafusion/tree/main/datafusion-examples .. _developer’s guide: contributor-guide/index.html#developer-s-guide +.. _library user guide: library-user-guide/index.html .. _communication: contributor-guide/communication.html .. _toc.asf-links: @@ -80,6 +83,7 @@ See the `developer’s guide`_ for contributing and `communication`_ for getting user-guide/introduction user-guide/example-usage + user-guide/crate-configuration user-guide/cli/index user-guide/dataframe user-guide/expressions diff --git a/docs/source/library-user-guide/index.md b/docs/source/library-user-guide/index.md index 47257e0c926e..fd126a1120ed 100644 --- a/docs/source/library-user-guide/index.md +++ b/docs/source/library-user-guide/index.md @@ -19,8 +19,25 @@ # Introduction -The library user guide explains how to use the DataFusion library as a dependency in your Rust project. Please check out the user-guide for more details on how to use DataFusion's SQL and DataFrame APIs, or the contributor guide for details on how to contribute to DataFusion. +The library user guide explains how to use the DataFusion library as a +dependency in your Rust project and customize its behavior using its extension APIs. -If you haven't reviewed the [architecture section in the docs][docs], it's a useful place to get the lay of the land before starting down a specific path. +Please check out the [user guide] for getting started using +DataFusion's SQL and DataFrame APIs, or the [contributor guide] +for details on how to contribute to DataFusion. +If you haven't reviewed the [architecture section in the docs][docs], it's a +useful place to get the lay of the land before starting down a specific path. + +DataFusion is designed to be extensible at all points, including + +- [x] User Defined Functions (UDFs) +- [x] User Defined Aggregate Functions (UDAFs) +- [x] User Defined Table Source (`TableProvider`) for tables +- [x] User Defined `Optimizer` passes (plan rewrites) +- [x] User Defined `LogicalPlan` nodes +- [x] User Defined `ExecutionPlan` nodes + +[user guide]: ../user-guide/example-usage.md +[contributor guide]: ../contributor-guide/index.md [docs]: https://docs.rs/datafusion/latest/datafusion/#architecture diff --git a/docs/source/user-guide/crate-configuration.md b/docs/source/user-guide/crate-configuration.md new file mode 100644 index 000000000000..0587d06a3919 --- /dev/null +++ b/docs/source/user-guide/crate-configuration.md @@ -0,0 +1,146 @@ + + +# Crate Configuration + +This section contains information on how to configure DataFusion in your Rust +project. See the [Configuration Settings] section for a list of options that +control DataFusion's behavior. + +[configuration settings]: configs.md + +## Add latest non published DataFusion dependency + +DataFusion changes are published to `crates.io` according to the [release schedule](https://github.com/apache/datafusion/blob/main/dev/release/README.md#release-process) + +If you would like to test out DataFusion changes which are merged but not yet +published, Cargo supports adding dependency directly to GitHub branch: + +```toml +datafusion = { git = "https://github.com/apache/datafusion", branch = "main"} +``` + +Also it works on the package level + +```toml +datafusion-common = { git = "https://github.com/apache/datafusion", branch = "main", package = "datafusion-common"} +``` + +And with features + +```toml +datafusion = { git = "https://github.com/apache/datafusion", branch = "main", default-features = false, features = ["unicode_expressions"] } +``` + +More on [Cargo dependencies](https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html#specifying-dependencies) + +## Optimized Configuration + +For an optimized build several steps are required. First, use the below in your `Cargo.toml`. It is +worth noting that using the settings in the `[profile.release]` section will significantly increase the build time. + +```toml +[dependencies] +datafusion = { version = "22.0" } +tokio = { version = "^1.0", features = ["rt-multi-thread"] } +snmalloc-rs = "0.3" + +[profile.release] +lto = true +codegen-units = 1 +``` + +Then, in `main.rs.` update the memory allocator with the below after your imports: + +```rust ,ignore +use datafusion::prelude::*; + +#[global_allocator] +static ALLOC: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc; + +#[tokio::main] +async fn main() -> datafusion::error::Result<()> { + Ok(()) +} +``` + +Based on the instruction set architecture you are building on you will want to configure the `target-cpu` as well, ideally +with `native` or at least `avx2`. + +```shell +RUSTFLAGS='-C target-cpu=native' cargo run --release +``` + +## Enable backtraces + +By default Datafusion returns errors as a plain message. There is option to enable more verbose details about the error, +like error backtrace. To enable a backtrace you need to add Datafusion `backtrace` feature to your `Cargo.toml` file: + +```toml +datafusion = { version = "31.0.0", features = ["backtrace"]} +``` + +Set environment [variables](https://doc.rust-lang.org/std/backtrace/index.html#environment-variables) + +```bash +RUST_BACKTRACE=1 ./target/debug/datafusion-cli +DataFusion CLI v31.0.0 +> select row_numer() over (partition by a order by a) from (select 1 a); +Error during planning: Invalid function 'row_numer'. +Did you mean 'ROW_NUMBER'? + +backtrace: 0: std::backtrace_rs::backtrace::libunwind::trace + at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5 + 1: std::backtrace_rs::backtrace::trace_unsynchronized + at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5 + 2: std::backtrace::Backtrace::create + at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/backtrace.rs:332:13 + 3: std::backtrace::Backtrace::capture + at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/backtrace.rs:298:9 + 4: datafusion_common::error::DataFusionError::get_back_trace + at /datafusion/datafusion/common/src/error.rs:436:30 + 5: datafusion_sql::expr::function::>::sql_function_to_expr + ............ +``` + +The backtraces are useful when debugging code. If there is a test in `datafusion/core/src/physical_planner.rs` + +``` +#[tokio::test] +async fn test_get_backtrace_for_failed_code() -> Result<()> { + let ctx = SessionContext::new(); + + let sql = " + select row_numer() over (partition by a order by a) from (select 1 a); + "; + + let _ = ctx.sql(sql).await?.collect().await?; + + Ok(()) +} +``` + +To obtain a backtrace: + +```bash +cargo build --features=backtrace +RUST_BACKTRACE=1 cargo test --features=backtrace --package datafusion --lib -- physical_planner::tests::test_get_backtrace_for_failed_code --exact --nocapture +``` + +Note: The backtrace wrapped into systems calls, so some steps on top of the backtrace can be ignored diff --git a/docs/source/user-guide/example-usage.md b/docs/source/user-guide/example-usage.md index 7dbd4045e75b..813dbb1bc02a 100644 --- a/docs/source/user-guide/example-usage.md +++ b/docs/source/user-guide/example-usage.md @@ -33,29 +33,6 @@ datafusion = "latest_version" tokio = { version = "1.0", features = ["rt-multi-thread"] } ``` -## Add latest non published DataFusion dependency - -DataFusion changes are published to `crates.io` according to [release schedule](https://github.com/apache/datafusion/blob/main/dev/release/README.md#release-process) -In case if it is required to test out DataFusion changes which are merged but yet to be published, Cargo supports adding dependency directly to GitHub branch - -```toml -datafusion = { git = "https://github.com/apache/datafusion", branch = "main"} -``` - -Also it works on the package level - -```toml -datafusion-common = { git = "https://github.com/apache/datafusion", branch = "main", package = "datafusion-common"} -``` - -And with features - -```toml -datafusion = { git = "https://github.com/apache/datafusion", branch = "main", default-features = false, features = ["unicode_expressions"] } -``` - -More on [Cargo dependencies](https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html#specifying-dependencies) - ## Run a SQL query against data stored in a CSV ```rust @@ -201,109 +178,3 @@ async fn main() -> datafusion::error::Result<()> { | 1 | 2 | +---+--------+ ``` - -## Extensibility - -DataFusion is designed to be extensible at all points. To that end, you can provide your own custom: - -- [x] User Defined Functions (UDFs) -- [x] User Defined Aggregate Functions (UDAFs) -- [x] User Defined Table Source (`TableProvider`) for tables -- [x] User Defined `Optimizer` passes (plan rewrites) -- [x] User Defined `LogicalPlan` nodes -- [x] User Defined `ExecutionPlan` nodes - -## Optimized Configuration - -For an optimized build several steps are required. First, use the below in your `Cargo.toml`. It is -worth noting that using the settings in the `[profile.release]` section will significantly increase the build time. - -```toml -[dependencies] -datafusion = { version = "22.0" } -tokio = { version = "^1.0", features = ["rt-multi-thread"] } -snmalloc-rs = "0.3" - -[profile.release] -lto = true -codegen-units = 1 -``` - -Then, in `main.rs.` update the memory allocator with the below after your imports: - -```rust ,ignore -use datafusion::prelude::*; - -#[global_allocator] -static ALLOC: snmalloc_rs::SnMalloc = snmalloc_rs::SnMalloc; - -#[tokio::main] -async fn main() -> datafusion::error::Result<()> { - Ok(()) -} -``` - -Based on the instruction set architecture you are building on you will want to configure the `target-cpu` as well, ideally -with `native` or at least `avx2`. - -```shell -RUSTFLAGS='-C target-cpu=native' cargo run --release -``` - -## Enable backtraces - -By default Datafusion returns errors as a plain message. There is option to enable more verbose details about the error, -like error backtrace. To enable a backtrace you need to add Datafusion `backtrace` feature to your `Cargo.toml` file: - -```toml -datafusion = { version = "31.0.0", features = ["backtrace"]} -``` - -Set environment [variables](https://doc.rust-lang.org/std/backtrace/index.html#environment-variables) - -```bash -RUST_BACKTRACE=1 ./target/debug/datafusion-cli -DataFusion CLI v31.0.0 -> select row_number() over (partition by a order by a) from (select 1 a); -Error during planning: Invalid function 'row_number'. -Did you mean 'ROW_NUMBER'? - -backtrace: 0: std::backtrace_rs::backtrace::libunwind::trace - at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5 - 1: std::backtrace_rs::backtrace::trace_unsynchronized - at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5 - 2: std::backtrace::Backtrace::create - at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/backtrace.rs:332:13 - 3: std::backtrace::Backtrace::capture - at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/backtrace.rs:298:9 - 4: datafusion_common::error::DataFusionError::get_back_trace - at /datafusion/datafusion/common/src/error.rs:436:30 - 5: datafusion_sql::expr::function::>::sql_function_to_expr - ............ -``` - -The backtraces are useful when debugging code. If there is a test in `datafusion/core/src/physical_planner.rs` - -``` -#[tokio::test] -async fn test_get_backtrace_for_failed_code() -> Result<()> { - let ctx = SessionContext::new(); - - let sql = " - select row_number() over (partition by a order by a) from (select 1 a); - "; - - let _ = ctx.sql(sql).await?.collect().await?; - - Ok(()) -} -``` - -To obtain a backtrace: - -```bash -cargo build --features=backtrace -RUST_BACKTRACE=1 cargo test --features=backtrace --package datafusion --lib -- physical_planner::tests::test_get_backtrace_for_failed_code --exact --nocapture -``` - -Note: The backtrace wrapped into systems calls, so some steps on top of the backtrace can be ignored