Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve default output column names #6758

Open
alamb opened this issue Jun 24, 2023 · 0 comments
Open

Improve default output column names #6758

alamb opened this issue Jun 24, 2023 · 0 comments
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Jun 24, 2023

Is your feature request related to a problem or challenge?

The way that DataFusion names the output columns can be quite messy. For example, the query below produces a column named "FIRST_VALUE(foo.free) PARTITION BY [foo.host] ORDER BY [foo.time DESC NULLS FIRST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + Int64(5) / Int64(12)" (yes that whole thing)

Arrow requires column names to be distinct strings and DataFusion creates the name based on the content of the expression.

Here is the entire query

❯ create table foo(host varchar, free int, time int) as values ('a', 1, 2), ('a', 3, 4);
0 rows in set. Query took 0.003 seconds.
❯ select first_value(free) over (partition by host order by time desc) + 5 / 12 from foo limit 10;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
| FIRST_VALUE(foo.free) PARTITION BY [foo.host] ORDER BY [foo.time DESC NULLS FIRST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + Int64(5) / Int64(12) |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 3                                                                                                                                                           |
| 3                                                                                                                                                           |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.003 seconds.

It also means you can't select the same expression without an alias, which while annoying is likely not a critical usecase

select free / 1, free / 1 from foo;
Error during planning: Projections require unique expression names but the expression "foo.free / Int64(1)" at position 0 and "foo.free / Int64(1)" at position 1 have the same name. Consider aliasing ("AS") one of them.
❯ select free / 1, free / 1 as col2 from foo;
+---------------------+------+
| foo.free / Int64(1) | col2 |
+---------------------+------+
| 1                   | 1    |
| 3                   | 3    |
+---------------------+------+

Describe the solution you'd like

While the output column names can be controlled via adding an explicit alias, I think it would be good to have a more concise way of naming the default outputs, especially for window functions

Perhaps something like not adding the OVER clause's contents so FIRST_VALUE(foo.free) in the above example.

If we do this we need to be careful to allow multiple calls to FIRST_VALUE with different OVER clauses in the same query

Describe alternatives you've considered

No response

Additional context

If we make the names more concise, we will probably need a solution like #6543 to ensure they remain unique

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant