Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ClickHouse some ENUM types doesn't stores in "datasources" - psycopg2.errors.StringDataRightTruncation: value too long for type character varying(32) #13572

Closed
Slach opened this issue Mar 11, 2021 · 7 comments
Labels
#bug Bug report data:connect:clickhouse Related to Clickhouse

Comments

@Slach
Copy link
Contributor

Slach commented Mar 11, 2021

I setup superset 1.0.1 + master version clickhouse driver from git https://github.com/xzkostyan/clickhouse-sqlalchemy
I successfully added "database" and try to add new "dataset", and press "Add"
image

Expected results

Successful added dataset

Actual results

image

Additional context

Stacktrace

superset_app            | [SQL: INSERT INTO table_columns (uuid, created_on, changed_on, column_name, verbose_name, is_active, type, groupby, filterable, description, table_id, is_dttm, expression, python_date_format, created_by_fk, changed_by_fk) VALUES (%(uuid)s, %(created_on)s, %(changed_on)s, %(column_name)s, %(verbose_name)s, %(is_active)s, %(type)s, %(groupby)s, %(filterable)s, %(description)s, %(table_id)s, %(is_dttm)s, %(expression)s, %(python_date_format)s, %(created_by_fk)s, %(changed_by_fk)s) RETURNING table_columns.id]
superset_app            | [parameters: {'uuid': UUID('0c461639-f66d-46b2-ae56-e994bee37d2d'), 'created_on': datetime.datetime(2021, 3, 11, 9, 40, 12, 26645), 'changed_on': datetime.datetime(2021, 3, 11, 9, 40, 12, 26678), 'column_name': 'event_type', 'verbose_name': None, 'is_active': True, 'type': "ENUM8('COMMITCOMMENTEVENT' = 1, 'CREATEEVENT' = 2, 'DELETEEVENT' = 3, 'FORKEVENT' = 4, 'GOLLUMEVENT' = 5, 'ISSUECOMMENTEVENT' = 6, 'ISSUESEVENT' = 7, ... (184 characters truncated) ...  'GISTEVENT' = 16, 'FOLLOWEVENT' = 17, 'DOWNLOADEVENT' = 18, 'PULLREQUESTREVIEWEVENT' = 19, 'FORKAPPLYEVENT' = 20, 'EVENT' = 21, 'TEAMADDEVENT' = 22)", 'groupby': True, 'filterable': True, 'description': None, 'table_id': 17, 'is_dttm': False, 'expression': None, 'python_date_format': None, 'created_by_fk': 1, 'changed_by_fk': 1}]
superset_app            | (Background on this error at: http://sqlalche.me/e/13/9h9h)
superset_app            | Traceback (most recent call last):
superset_app            |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1277, in _execute_context
superset_app            |     cursor, statement, parameters, context
superset_app            |   File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 593, in do_execute
superset_app            |     cursor.execute(statement, parameters)
superset_app            | psycopg2.errors.StringDataRightTruncation: value too long for type character varying(32)

How exactly should converts ENUM8 type from ClickHouse to superset?

@Slach Slach added the #bug Bug report label Mar 11, 2021
@hodgesrm
Copy link

Hi @Slach ! This looks a little weird. psycopg2 is the PostgreSQL driver. Are you running Superset in Docker with PostgreSQL as the backend?

(Superset folks, @Slach is working with me on ClickHouse driver support.)

@Slach
Copy link
Contributor Author

Slach commented Mar 11, 2021

@hodgesrm when we try to add "dataset" then superset itself try to store selected table from Datasource fields names and types into internal PostgreSQL \ MySQL \ sqlite table with name table_columns

@hodgesrm
Copy link

hodgesrm commented Mar 13, 2021

Here's a detailed reproduction of this issue. It does not come up in SQLite when I use a virtual env and dev setup. It seems to be related to the VARCHAR(36) column size in PostgreSQL.

  1. Update your [virtual] environment to install master version of clickhouse-sqlalchemy from github: pip install git+https://github.com/xzkostyan/clickhouse-sqlalchemy
  2. Go into Databases based and set up a database with the following SQLAlchemy URL: clickhouse+native://demo:demo@github.demo.trial.altinity.cloud/default?secure=true
  3. Save the database and go to Datasets panel.
  4. Select a new dataset as shown in the image in the bug description. Schema is 'default' and the table name is 'github_events'.
  5. Press Save to see the error.

ClickHouse Enum types are potentially very large because they list all values. It's hard to give an upper bound for ClickHouse types but if you are storing ClickHouse native types names in VARCHAR I would allocate at least 500 characters.

Another option obviously would be to convert to standard SQL types within Superset, which I understand is possible to do.

@junlincc junlincc added the data:connect:clickhouse Related to Clickhouse label Apr 9, 2021
@cccs-jc
Copy link
Contributor

cccs-jc commented Oct 7, 2021

I have encountered similar issues with Trino columns. Trino supports nested data types such as an array of struct with many fields. Some of these fields can be arrays of other struct etc. So the type of the column can be quite long

ARRAY<STRUCT F1 INT , F2 STRING, F3 ARRAY<STRUCT FF1 STRING>>

When you try to save a dataset containing columns like this it can fail because Superset's DataSet Model only allocates 32 characters to store the type of the column. In the UI you only see an error and it does not explain why it fails.

In the server logs you can see the cause which is a column type is too long to fit into the 32 chars. However as a user you have no idea.

If the type is less then 32 chars say STRUCT< F1 INT , F2 STRING>. Superset will accept it and will treat it as a string when rendering the value.

Instead of trying to store the type of such columns it might be better to fall back to a type that indicates it's a complex type. For example if a column is of type ARRAY or STRUCT save it as COMPLEX or JSON.

@villebro did you encounter this issue before? Any other solution?

@villebro
Copy link
Member

villebro commented Oct 8, 2021

Thanks for raising this issue. This isn't the first time people have bumped into the 32 char limit, and with NoSQL/complex datatypes becoming ever more prevalent, we should definitely address this shortcoming. Let's just change it to TEXT and make sure we can have arbitrarily complex types going forward. @cccs-jc would you or someone else in your org be willing to do this valuable improvement?

On a related note, @betodealmeida did amazing work to improve support for complex data types on BigQuery in #16822. I'm personally a big fan of Clickhouse and Trino, and would love to see similar functionality and tests added for Clickhouse and Trino, too. So let's collaborate on this!

@cccs-jc
Copy link
Contributor

cccs-jc commented Oct 12, 2021

Sounds good. We will explore falling back to TEXT when the fields is of STRUCT/ARRAY type. This should eliminate the 32 chars issue while keeping the current functionality.

@villebro
Copy link
Member

Fixed by #17360 , closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
#bug Bug report data:connect:clickhouse Related to Clickhouse
Projects
None yet
Development

No branches or pull requests

5 participants