Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SIP-117] Improve SQL parsing #26786

Closed
betodealmeida opened this issue Jan 24, 2024 · 14 comments
Closed

[SIP-117] Improve SQL parsing #26786

betodealmeida opened this issue Jan 24, 2024 · 14 comments
Assignees
Labels
sip Superset Improvement Proposal

Comments

@betodealmeida
Copy link
Member

betodealmeida commented Jan 24, 2024

[SIP-117] Improve SQL parsing

Motivation

The current status of SQL parsing in Superset is not ideal. A few reasons:

  1. We currently use 3 libraries for parsing:
    a. sqlparse is the original library used for SQL parsing. It's non-validating and non-dialect specific, which brings a few challenges.
    b. sqloxide was introduced as an optional dependency, where performance was critical.
    c. sqlglot was introduced in fix(sqlparse): improve table parsing #26476 to fix security issues.
  2. Superset has an interface to SQL parsing in superset.sql_parse.ParsedQuery, but many places in the code call sqlparse directly. This includes complex operations, like injecting RLS rules into SQL queries.
  3. ParsedQuery is sometimes used for single statements, sometimes for multi-statement queries, which could lead to subtle bugs, since it has methods that assume a single statement.
  4. In many functions a query will be repeatedly parsed and converted back to a string, which is inefficient.
  5. Parsing is done using a generic parser that is not dialect specific.

Superset needs robust and reliable SQL parsing, so it can understand and modify queries being run in programmatic ways. Today, many of the manipulations to SQL are done using string functions, because the sqlparse AST is too low-level. When queries are manipulated programmatically, like in RLS injection, the code is complex and hard to follow, for the same reason.

Proposed Change

My proposal is to create 2 new classes that should provide a clean interface to SQL parsing:

class SQLScript:

    """
    Class representing a SQL script (or block) with multiple statements.

    Examples:

        "SELECT 1; SELECT 2"  # 2 statements
        "CREATE TABLE a (b INT)"  # single statement
        ""  # 0 statements

    """

    statements: List[SQLStatement]

    def __init__(self, query: str, engine: str):
        ...


class SQLStatement:

    """
    Class representing a single SQL statement.

    Optionally in the future we might want to have derived classes for
    more specific statements, like DML or query.
    """
     
    def __init__(self, statement: str, engine: str):
        ...

(This is a simplified version, see #26767 for more details.)

These classes will provide:

  1. A clear differentiation between a parsed query (with multiple statements), and a single statement.
  2. Dialect-specific parsing, via a required attribute indicating the DB engine spec engine.
  3. All SQL-parsing related functionality needed by Superset, so that the Superset code will only use these classes for anything related to SQL parsing, introspection, and manipulation. No SQL parsing library should be imported anywhere else in Superset.
  4. Wrapped exceptions, so that no 3rd-party specific exceptions are bubbled up.

Initially these classes will be implemented using sqlglot, since it's fast, easy to install (pure Python), and has support for several dialects. The interface should be agnostic enough that it should be easy to rewrite the classes using a different parsing library in the future, if we ever need to.

New or Changed Public Interfaces

No public interfaces will be changed, but:

  1. sqlparse will be removed as a dependency.
  2. ParsedQuery will be removed, and replaced by SQLScript/SQLStatement.
  3. Indented SQL ("pretty-printed") will look different, since sqlglot formats SQL differently than sqlparse.

New dependencies

None.

Migration Plan and Compatibility

None.

Rejected Alternatives

None.

@betodealmeida betodealmeida added the sip Superset Improvement Proposal label Jan 24, 2024
@betodealmeida betodealmeida changed the title [SIP] Improve SQL parsing [SIP-117] Improve SQL parsing Jan 24, 2024
@TechAuditBI
Copy link
Contributor

I think that this SIP and the SIP-115 can be perfectly merged together due the same theme of SQL parsing.
P.S. Sorry I didn't update the SIP-115's description according to our discussion during Office hours yet.

@TechAuditBI
Copy link
Contributor

TechAuditBI commented Jan 24, 2024

To provide a brief overview of what we've got to. The main idea is not to create a new UI element for "PRE SQL" wich would not be very comfortable to use and also make it almost impossible to downgrade metadata DB without a sugnificant loss in functionality. So it was suggested to conduct all the necessary query parcing under the hood.

@betodealmeida betodealmeida self-assigned this Jan 24, 2024
@rusackas
Copy link
Member

@TechAuditBI perhaps you could make sure that all your requirements (e.g. WHEN clauses in MS-SQL) are supported by this proposal, and it would effectively replace SIP-115, so we can close that one?

Also CC @john-bodley since we were talking about this stuff just this morning.

@betodealmeida
Copy link
Member Author

@TechAuditBI perhaps you could make sure that all your requirements (e.g. WHEN clauses in MS-SQL) are supported by this proposal, and it would effectively replace SIP-115, so we can close that one?

Just to be clear, this proposal is to create a cleaner interface for SQL parsing that supports current functionality.

I think the novel functionality from SIP-115 is needed and should be implemented in the classes being proposed in this SIP. Since we'd be using sqlglot and dialect-specific parsing, I think it would be easy to implement the kind of manipulation that is needed for SIP-115. But I don't think they should be merged, because this SIP is about standardizing an interface, and SIP-117 is about new functionality.

@john-bodley
Copy link
Member

@betodealmeida I'm very supportive of consolidating our somewhat fragmented SQL parsing landscape.

Personally I was a fan of sqlparse (in part because I've used it extensively internally and have contributed to the repo) but sadly the project seems somewhat inactive—I created andialbrecht/sqlparse#745 (with an associated fix) back in November and there's been no response.

I also gave an internal talk (back in October) about non-technical learnings from open source where I mentioned:

SQLGlot (which has 3,900 GitHub stars) has likely surpassed sqlparse (which has 3,400 GitHub stars) as the preferred Python SQL parser. SQLGlot is actively being developed whereas sqlparse feels like it’s in maintenance mode. SQLGlot has had 18 major releases in the same period that sqlparse has had three patch releases. The later also has over 200 outstanding open issues.

The TL;DR is I sense SQLGlot is the way to go.

Out of interest would sqloxide also be removed alongside sqlparse?

@betodealmeida
Copy link
Member Author

betodealmeida commented Jan 25, 2024

@john-bodley personally I'm a fan of sqloxide because the AST is very elegant, but the support that sqlglot has for different dialects makes it a better choice for Superset IMHO (sqloxide supports dialects, but not nearly as many).

Today sqloxide is optional, so we could easily get rid of it as well.

@michael-s-molina
Copy link
Member

Thank you for writing this SIP @betodealmeida. These are really welcome changes.

No SQL parsing library should be imported anywhere else in Superset.

Can we add a Pylint check for this?

@geido
Copy link
Member

geido commented Jan 25, 2024

+1 for sqlglot!

@betodealmeida
Copy link
Member Author

No SQL parsing library should be imported anywhere else in Superset.

Can we add a Pylint check for this?

Yeah, we can do this: https://github.com/apache/superset/pull/26803/files

@michael-s-molina
Copy link
Member

Today sqloxide is optional, so we could easily get rid of it as well.

+1

@betodealmeida
Copy link
Member Author

SIP was approved on 2024-02-07.

@rusackas
Copy link
Member

rusackas commented Feb 9, 2024

Closing the issue... not because it's done, but because the SIP was approved. Keep on truckin'!

@rusackas
Copy link
Member

Any updates on what work remains here @betodealmeida? Might be good to give a "state of affairs" comment here.

@betodealmeida
Copy link
Member Author

@rusackas this is still work in progress. I'm currently working on moving the RLS functionality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sip Superset Improvement Proposal
Projects
Development

No branches or pull requests

6 participants