Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BEAM-2446] restrict scope of BeamSqlEnv in dsl query #3372

Closed
wants to merge 3 commits into from

Conversation

mingmxu
Copy link

@mingmxu mingmxu commented Jun 15, 2017

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling eb5852b on XuMingmin:BEAM-2446 into ** on apache:DSL_SQL**.

@xumingming
Copy link
Contributor

Some questions was left undiscussed from previous PR. I re-raise it here, the sample code of simpleQuery:

//run a simple query, and register the output as a table in BeamSql;
String sql1 = "select MY_FUNC(c1), c2 from TABLE_A";
PCollection<BeamSqlRow> outputTableA = inputTableA.apply(BeamSql.simpleQuery(sql1))
        .withUdf("MY_FUNC", myFunc);

TABLE_A is not registered before using. I understand your purpose is to make it more convenient to use, but IMO, it might confuse user as he need to know the convention before using it. How about something like:

inputTableA.apply(BeamSql.simpleQuery("TABLE_A", sql1));

Let user specify the TABLE_A explicitly. It is a little verboser but easier to understand.

@xumingming
Copy link
Contributor

One more question: I see BeamSqlEnv widely used in internal implementation classes(BeamAggregationRel, BeamMinusRel etc). As BeamSqlEnv is a surface api which needs to keep stable, while internal implementation might change from time to time, will it be better to create a dedicated class(similar to BeamSqlEnv) for internal implementation usage.

@mingmxu
Copy link
Author

mingmxu commented Jun 16, 2017

@xumingming , I'm open to both options. BeamSql.simpleQuery("TABLE_A", sql1) sounds better to me, to let users specify table name explicitly.
@lukecwik any comments?

I don't think it's necessary to have a duplicate copy of BeamSqlEnv for internal usage. After this clean-up phase, the backend interface should be stable as well, although implementation may be changed.

@lukecwik
Copy link
Member

Can't you just get the table name from the SQL query?

Would be annoying for users to write BeamSql.simpleQuery("TABLE_A", "SELECT * FROM TABLE_A");
Also, what would you do if they didn't match?

@mingmxu
Copy link
Author

mingmxu commented Jun 16, 2017

Technically, table name is not must, can be parsed as existing implementation.

Without this parameter, user can put any table name in query. If this parameter is added, they need be consistent.

@takidau
Copy link
Contributor

takidau commented Jun 16, 2017

Is there any way to restrict the set of valid table names when using simpleQuery? E.g., require the user to always use a name of TABLE or PCOLLECTION? Accepting an arbitrary string seems very strange to me.

@mingmxu
Copy link
Author

mingmxu commented Jun 17, 2017

rebased.

So now we have 3 opinions:

  1. BeamSql.simpleQuery("SELECT * FROM TABLE_A"); TABLE_A can be any name;
  2. BeamSql.simpleQuery("TABLE_A", "SELECT * FROM TABLE_A"); The two TABLE_A should be consistent;
  3. BeamSql.simpleQuery("SELECT * FROM TABLE|PCOLLECTION"); TABLE|PCOLLECTION is a static name.

Technically, each is doable, it's all about usability. Any clue or have a small vote?

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling a5d1a0d on XuMingmin:BEAM-2446 into ** on apache:DSL_SQL**.

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling 132064f on XuMingmin:BEAM-2446 into ** on apache:DSL_SQL**.

@takidau
Copy link
Contributor

takidau commented Jun 17, 2017

My thinking is that from a user's perspective:

  1. Convenient, but unintuitive, especially to an outsider trying to understand what the statement means. SQL syntax is typically very precise. I can't think of any other examples where an identifier is used as simply a placeholder, regardless of what the actual identifier is. When I first read the simpleQuery examples, I spent a bunch of time trying to figure out what the TABLE_A name was associated with. The fact that it isn't explicitly associated with anything is unexpected.
  2. Allows the SQL syntax to be checked against a specific identifier, which is nice, but requires that identifier to be redundantly specified in the same method call. In that sense, it's almost equivalent to option 1 (though it does at least give an answer to where the identifier comes from), but with the arbitrary name specified twice.
  3. Allows the SQL syntax to be checked against a specific identifier, and in the case of PCOLLECTION (and/or PCOL?), that identifier also helps document what is happening: you're very explicitly querying the PCollection in that case. For that reason, I'm somewhat less fond of TABLE, since it doesn't have the same self-describing quality.

So from that perspective, #3 with PCOLLECTION (and maybe optionally PCOL as a shorthand) is my preferred choice.

Other perspectives?

@xumingming
Copy link
Contributor

xumingming commented Jun 17, 2017

I want to analyze this problem from another perspective: the simpleQuery is not so simple that you just need to write a sql, you can't execute query on an arbitrary PCollection, before you invoke simpleQuery, you need to do some preparation to convert a normal PCollection to a PCollection<BeamSqlRow> -- to convert unstructured data to structured rows:

TableSchema tableBSchema = ...;
PCollection<BeamSqlRow> inputTableB = p.apply(TextIO.read().from("/my/input/pathb"))
    .apply(BeamSql.fromTextRow(tableBSchema));

with all these preparation code already there, I don't think adding an explicit tableName declaration will make it verbose. It just does the registration and query in the same method.

So if vote, I will vote for #2.


And this also reminds me another option: can we declare the tableName when do the conversion(convert normal data into BeamSqlRow), something like:

TableSchema tableBSchema = ...;
PCollection<BeamSqlRow> inputTableB = p.apply(TextIO.read().from("/my/input/pathb"))
    .apply(BeamSql.fromTextRow(tableName, tableBSchema));

it tells user that:

we are turning your data into a table with the given tableName and these resulting rows.

And then you can query without specifying any tableName, but the signature of the query methods might need to change.

@takidau
Copy link
Contributor

takidau commented Jun 17, 2017

If the registration can happen as part of BeamSql.fromTextRow, and be valid for any future invocation of simpleQuery, that sounds like it might be a good option. Would these registrations be valid for normal query() invocations as well, then? Or would PCollectionTuple association be necessary still?

@xumingming
Copy link
Contributor

The registration is valid for this whole session, valid for all query methods. And the PCollectionTuple will not be needed any more, a full usage sample might look like this:

PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
//create table from TextIO;
TableSchema tableASchema = ...;
PCollection<BeamSqlRow> inputTableA = p.apply(TextIO.read().from("/my/input/patha"))
    .apply(BeamSql.fromTextRow("TableA", tableASchema));
TableSchema tableBSchema = ...;
PCollection<BeamSqlRow> inputTableB = p.apply(TextIO.read().from("/my/input/pathb"))
    .apply(BeamSql.fromTextRow("TableB", tableBSchema));

//run a simple query, and register the output as a table in BeamSql;
PCollection<BeamSqlRow> outputTableA = BeamSql.simpleQuery("select MY_FUNC(c1), c2 from TableA");
//run a JOIN with one table from TextIO, and one table from another query
PCollection<BeamSqlRow> outputTableB = BeamSql.query("select * from TABLE_O_A JOIN TABLE_B where ..."));

//output the final result with TextIO
outputTableB.apply(BeamSql.toTextRow()).apply(TextIO.write().to("/my/output/path"));

But this seems deviate a lot from the original design, might miss some design consideration @xumingmin and @lukecwik originally have.

@mingmxu
Copy link
Author

mingmxu commented Jun 17, 2017

@xumingming we also talked this one, and move to the existing solution which follows the Beam-style, to translate query into a PTransform. --We would go with the one you mentioned in BeamSqlCli.

Back to the discussion, from my experience:

  1. An arbitrary table name would be confused to users, especially the same PCollection can have different table names, and different PCollection can have the same table name;
  2. explicit table name is more clear, as how we do in BeamSql.query();
  3. Since each query has its own Schema namespace, the table name in BeamSql.simpleQuery() is useless, required only by SQL grammar. A well-documented static name looks good here.

Another point to clarify is, BeamSql.fromTextRow is removed as we cannot force users to create PCollection<BeamSqlRow> only with this method.

Overall, I slightly prefer to #3.

@lukecwik
Copy link
Member

lukecwik commented Jun 19, 2017

I'm of the opinion that there are two usecases:

  1. BeamSqlCli: this makes sense for a registration mechanism that spans the lifetime of the CLI session. Since the interface the user is using is the CLI, the Java interface is not important to them.
  2. Programmatic pipeline builders: I believe that these people will be developers and
    a) I believe that having a "global" registration mechanism makes it difficult for someone to reason about the code. Imagine you had the following line of code BeamSql.simpleQuery("select * from TABLE_A where ..."));. As a developer I need to search the entire codebase for where TABLE_A was defined. If the code was pcollection.apply(BeamSql.simpleQuery("select * from TABLE_A where ...")));, as a developer I can use my IDE to find and follow where pcollection came from.
    b) That a "global" registration mechanism doesn't allow for libraries to provide meaningful names for tables and is error prone. Imagine you had a library that used BeamSql and it had a function myFancyQueries(...) that ran several queries based upon several attributes. What table names should the function use (when consuming outside tables, when producing new tables, what about intermediate tables)?, What if the user invoked it multiple times at different places within their codebase (should it overwrite previous tables or define new ones)?

@xumingming
Copy link
Contributor

xumingming commented Jun 20, 2017

Ok, since a) Each query has its own schema namespace, rather than sharing the same schema namespace across several queries(I missed this before), and b) It's for building Pipeline programmatically using BeamSql. I agree #3 might be better.

Copy link
Contributor

@xumingming xumingming left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Let's merge it.

@mingmxu
Copy link
Author

mingmxu commented Jun 20, 2017

Thanks all for the discussion. I've updated the validation logic in BeamSql.simpleQuery with option #3.

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling f7e4a51 on XuMingmin:BEAM-2446 into ** on apache:DSL_SQL**.

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling f7e4a51 on XuMingmin:BEAM-2446 into ** on apache:DSL_SQL**.

@mingmxu
Copy link
Author

mingmxu commented Jun 20, 2017

@takidau could you merge it if no pending changes?

@takidau
Copy link
Contributor

takidau commented Jun 22, 2017

I'm sorry for the extra work, especially after my delay in reviewing this (I've been stuck in all-day training sessions most of the week :-P), but it would be better to separate the sqlEnv changes and the PCOLLECTION changes into independent PRs, as they are unrelated AFAICT. Can you please do that? I'm happy with the changes contained in this PR as a whole, so once you have one of those factored out into a separate PR, I'm happy to merge both.

@mingmxu
Copy link
Author

mingmxu commented Jun 22, 2017

remove the part of PCOLLECTION as table name.
@takidau could you help to merge this one first. I'll prepare another PR after that to avoid unnecessary rebase.

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling 8cb47dd on XuMingmin:BEAM-2446 into ** on apache:DSL_SQL**.

@mingmxu
Copy link
Author

mingmxu commented Jun 22, 2017

retest this please

asfgit pushed a commit that referenced this pull request Jun 22, 2017
@takidau
Copy link
Contributor

takidau commented Jun 22, 2017

@xumingmin: merged. Thank you!

@mingmxu
Copy link
Author

mingmxu commented Jun 22, 2017

Thank you @takidau @xumingming !

@mingmxu mingmxu closed this Jun 22, 2017
@mingmxu mingmxu deleted the BEAM-2446 branch June 22, 2017 20:46
@coveralls
Copy link

Coverage Status

Changes Unknown when pulling 8cb47dd on XuMingmin:BEAM-2446 into ** on apache:DSL_SQL**.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants