[BEAM-2446] restrict scope of BeamSqlEnv in dsl query #3372

mingmxu · 2017-06-15T22:06:33Z

coveralls · 2017-06-15T23:23:22Z

Changes Unknown when pulling eb5852b on XuMingmin:BEAM-2446 into ** on apache:DSL_SQL**.

xumingming · 2017-06-16T02:31:19Z

Some questions were left undiscussed in the previous PR, so I'm re-raising them here. Take the sample code for simpleQuery:

//run a simple query, and register the output as a table in BeamSql;
String sql1 = "select MY_FUNC(c1), c2 from TABLE_A";
PCollection<BeamSqlRow> outputTableA = inputTableA.apply(
        BeamSql.simpleQuery(sql1).withUdf("MY_FUNC", myFunc));

TABLE_A is not registered before it is used. I understand the purpose is to make the API more convenient, but IMO it might confuse users, since they need to know the convention before using it. How about something like:

inputTableA.apply(BeamSql.simpleQuery("TABLE_A", sql1));

This lets the user specify TABLE_A explicitly. It is a little more verbose but easier to understand.

xumingming · 2017-06-16T05:58:37Z

One more question: I see BeamSqlEnv widely used in internal implementation classes (BeamAggregationRel, BeamMinusRel, etc.). BeamSqlEnv is a surface API that needs to stay stable, while the internal implementation might change from time to time. Would it be better to create a dedicated class (similar to BeamSqlEnv) for internal implementation use?

mingmxu · 2017-06-16T16:58:44Z

@xumingming, I'm open to both options. BeamSql.simpleQuery("TABLE_A", sql1) sounds better to me, as it lets users specify the table name explicitly.
@lukecwik any comments?

I don't think it's necessary to have a duplicate copy of BeamSqlEnv for internal usage. After this clean-up phase, the backend interface should be stable as well, although the implementation may change.

lukecwik · 2017-06-16T19:08:42Z

Can't you just get the table name from the SQL query?

Would be annoying for users to write BeamSql.simpleQuery("TABLE_A", "SELECT * FROM TABLE_A");
Also, what would you do if they didn't match?

mingmxu · 2017-06-16T19:28:26Z

Technically, the table name parameter is not required; the name can be parsed from the query, as the existing implementation does.

Without this parameter, the user can put any table name in the query. If this parameter is added, the two names need to be consistent.
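
The consistency concern for the two-argument form can be sketched as follows. This is a minimal illustration, not the BeamSql implementation: `TableNameCheck`, `extractTableName`, and `checkConsistent` are hypothetical names, a regular expression stands in for the real SQL parser (so only the first table after FROM is considered), and the case-insensitive comparison is an assumption.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of BeamSql: a regex stands in for a real SQL parser.
final class TableNameCheck {
  // Matches the first identifier that follows the FROM keyword.
  private static final Pattern FROM_TABLE =
      Pattern.compile("\\bfrom\\s+([A-Za-z_][A-Za-z0-9_]*)", Pattern.CASE_INSENSITIVE);

  // Returns the table name referenced by the query, if one can be found.
  static Optional<String> extractTableName(String sql) {
    Matcher m = FROM_TABLE.matcher(sql);
    return m.find() ? Optional.of(m.group(1)) : Optional.empty();
  }

  // The explicitly declared name and the name in the query must agree
  // (compared case-insensitively here, which is an assumption).
  static void checkConsistent(String declaredName, String sql) {
    String inQuery = extractTableName(sql)
        .orElseThrow(() -> new IllegalArgumentException("no table found in: " + sql));
    if (!inQuery.equalsIgnoreCase(declaredName)) {
      throw new IllegalArgumentException(
          "declared table " + declaredName + " but query reads from " + inQuery);
    }
  }

  public static void main(String[] args) {
    checkConsistent("TABLE_A", "SELECT * FROM TABLE_A"); // names agree, no exception
    try {
      checkConsistent("TABLE_A", "SELECT * FROM TABLE_B"); // names differ
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Rejecting the mismatch with an exception is only one possible answer to "what would you do if they didn't match?"; the thread does not settle on one.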

takidau · 2017-06-16T21:05:53Z

Is there any way to restrict the set of valid table names when using simpleQuery? E.g., require the user to always use a name of TABLE or PCOLLECTION? Accepting an arbitrary string seems very strange to me.

mingmxu · 2017-06-17T00:19:31Z

Rebased.

So now we have 3 options:

1. BeamSql.simpleQuery("SELECT * FROM TABLE_A"); TABLE_A can be any name;
2. BeamSql.simpleQuery("TABLE_A", "SELECT * FROM TABLE_A"); The two TABLE_A should be consistent;
3. BeamSql.simpleQuery("SELECT * FROM TABLE|PCOLLECTION"); TABLE|PCOLLECTION is a static name.

Technically, each is doable; it's all about usability. Any thoughts, or shall we take a quick vote?

coveralls · 2017-06-17T00:40:38Z

Changes Unknown when pulling a5d1a0d on XuMingmin:BEAM-2446 into ** on apache:DSL_SQL**.

coveralls · 2017-06-17T03:06:10Z

Changes Unknown when pulling 132064f on XuMingmin:BEAM-2446 into ** on apache:DSL_SQL**.

takidau · 2017-06-17T03:17:34Z

My thinking is that from a user's perspective:

1. Convenient, but unintuitive, especially to an outsider trying to understand what the statement means. SQL syntax is typically very precise. I can't think of any other examples where an identifier is used as simply a placeholder, regardless of what the actual identifier is. When I first read the simpleQuery examples, I spent a bunch of time trying to figure out what the TABLE_A name was associated with. The fact that it isn't explicitly associated with anything is unexpected.
2. Allows the SQL syntax to be checked against a specific identifier, which is nice, but requires that identifier to be redundantly specified in the same method call. In that sense, it's almost equivalent to option 1 (though it does at least give an answer to where the identifier comes from), but with the arbitrary name specified twice.
3. Allows the SQL syntax to be checked against a specific identifier, and in the case of PCOLLECTION (and/or PCOL?), that identifier also helps document what is happening: you're very explicitly querying the PCollection in that case. For that reason, I'm somewhat less fond of TABLE, since it doesn't have the same self-describing quality.

So from that perspective, #3 with PCOLLECTION (and maybe optionally PCOL as a shorthand) is my preferred choice.

Other perspectives?
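
The static-name option amounts to a validation step on the query text. A minimal sketch, assuming a hypothetical `StaticTableNameValidator` class and using a regular expression in place of the real SQL parser; the name PCOLLECTION comes from the discussion, while the case-insensitive match is an assumption:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the static-name option; a real check would use the SQL parser.
final class StaticTableNameValidator {
  // The single table name a simple query may reference.
  static final String PCOLLECTION_TABLE_NAME = "PCOLLECTION";

  // Matches the first identifier that follows the FROM keyword.
  private static final Pattern FROM_TABLE =
      Pattern.compile("\\bfrom\\s+([A-Za-z_][A-Za-z0-9_]*)", Pattern.CASE_INSENSITIVE);

  // Returns true only when the query reads from the static table name.
  static boolean isValid(String sql) {
    Matcher m = FROM_TABLE.matcher(sql);
    return m.find() && PCOLLECTION_TABLE_NAME.equalsIgnoreCase(m.group(1));
  }

  public static void main(String[] args) {
    System.out.println(isValid("SELECT * FROM PCOLLECTION")); // true
    System.out.println(isValid("SELECT * FROM TABLE_A"));     // false
  }
}
```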

xumingming · 2017-06-17T03:26:43Z

I want to analyze this problem from another perspective: simpleQuery is not so simple that you only need to write a SQL statement. You can't execute a query on an arbitrary PCollection. Before you invoke simpleQuery, you need to do some preparation to convert a normal PCollection into a PCollection<BeamSqlRow>, that is, to convert unstructured data into structured rows:

TableSchema tableBSchema = ...;
PCollection<BeamSqlRow> inputTableB = p.apply(TextIO.read().from("/my/input/pathb"))
    .apply(BeamSql.fromTextRow(tableBSchema));

With all this preparation code already there, I don't think adding an explicit tableName declaration makes it verbose. It just does the registration and the query in the same method.

So if we vote, I will vote for #2.

This also reminds me of another option: can we declare the tableName when doing the conversion (converting normal data into BeamSqlRow)? Something like:

TableSchema tableBSchema = ...;
PCollection<BeamSqlRow> inputTableB = p.apply(TextIO.read().from("/my/input/pathb"))
    .apply(BeamSql.fromTextRow(tableName, tableBSchema));

This tells the user: we are turning your data into a table with the given tableName and these resulting rows.

You can then query without specifying any tableName, but the signature of the query methods might need to change.

takidau · 2017-06-17T03:43:14Z

If the registration can happen as part of BeamSql.fromTextRow, and be valid for any future invocation of simpleQuery, that sounds like it might be a good option. Would these registrations be valid for normal query() invocations as well, then? Or would PCollectionTuple association be necessary still?

xumingming · 2017-06-17T04:12:48Z

The registration is valid for the whole session and for all query methods, and the PCollectionTuple will not be needed any more. A full usage sample might look like this:

PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
//create table from TextIO;
TableSchema tableASchema = ...;
PCollection<BeamSqlRow> inputTableA = p.apply(TextIO.read().from("/my/input/patha"))
    .apply(BeamSql.fromTextRow("TableA", tableASchema));
TableSchema tableBSchema = ...;
PCollection<BeamSqlRow> inputTableB = p.apply(TextIO.read().from("/my/input/pathb"))
    .apply(BeamSql.fromTextRow("TableB", tableBSchema));

//run a simple query, and register the output as a table in BeamSql;
PCollection<BeamSqlRow> outputTableA = BeamSql.simpleQuery("select MY_FUNC(c1), c2 from TableA");
//run a JOIN with one table from TextIO, and one table from another query
PCollection<BeamSqlRow> outputTableB = BeamSql.query("select * from TABLE_O_A JOIN TABLE_B where ...");

//output the final result with TextIO
outputTableB.apply(BeamSql.toTextRow()).apply(TextIO.write().to("/my/output/path"));

But this seems to deviate a lot from the original design, and it might miss some design considerations that @xumingmin and @lukecwik originally had.

mingmxu · 2017-06-17T05:01:47Z

@xumingming we also discussed this option, and moved to the existing solution, which follows the Beam style of translating a query into a PTransform. We would go with the one you mentioned in BeamSqlCli.

Back to the discussion, from my experience:

1. An arbitrary table name would confuse users, especially since the same PCollection can have different table names, and different PCollections can have the same table name.
2. An explicit table name is clearer, as in BeamSql.query().
3. Since each query has its own Schema namespace, the table name in BeamSql.simpleQuery() serves no purpose beyond satisfying the SQL grammar. A well-documented static name looks good here.

Another point to clarify: BeamSql.fromTextRow has been removed, as we cannot force users to create a PCollection<BeamSqlRow> only with this method.

Overall, I slightly prefer #3.

lukecwik · 2017-06-19T16:29:12Z

I'm of the opinion that there are two use cases:

1. BeamSqlCli: this makes sense for a registration mechanism that spans the lifetime of the CLI session. Since the interface the user is using is the CLI, the Java interface is not important to them.
2. Programmatic pipeline builders: I believe that these people will be developers, and that:
   a) A "global" registration mechanism makes it difficult for someone to reason about the code. Imagine you had the following line of code: BeamSql.simpleQuery("select * from TABLE_A where ...");. As a developer, I need to search the entire codebase for where TABLE_A was defined. If the code was pcollection.apply(BeamSql.simpleQuery("select * from TABLE_A where ..."));, as a developer I can use my IDE to find and follow where pcollection came from.
   b) A "global" registration mechanism doesn't allow libraries to provide meaningful names for tables and is error prone. Imagine you had a library that used BeamSql and it had a function myFancyQueries(...) that ran several queries based upon several attributes. What table names should the function use (when consuming outside tables, when producing new tables, and for intermediate tables)? What if the user invoked it multiple times at different places within their codebase (should it overwrite previous tables or define new ones)?
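
The overwrite hazard described in the last point can be illustrated with plain maps. This is a hypothetical sketch: `RegistryScopes` and its methods are invented for illustration and do not model the real BeamSqlEnv; table sources are represented as strings.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; neither approach models the real BeamSqlEnv.
final class RegistryScopes {
  // A process-wide registry: every caller shares the same table names.
  static final Map<String, String> GLOBAL_TABLES = new HashMap<>();

  // A library function that registers an intermediate table globally.
  // Returns the source previously bound to the name, or null if none.
  static String libraryQueryGlobal(String source) {
    return GLOBAL_TABLES.put("TMP", source);
  }

  // The same function with a per-query namespace: nothing outlives the call.
  static Map<String, String> libraryQueryScoped(String source) {
    Map<String, String> tables = new HashMap<>();
    tables.put("TMP", source);
    return tables;
  }

  public static void main(String[] args) {
    libraryQueryGlobal("inputA");
    // The second invocation silently replaces the first binding.
    String overwritten = libraryQueryGlobal("inputB");
    System.out.println("global registry overwrote: " + overwritten);

    // Scoped namespaces stay independent across invocations.
    Map<String, String> first = libraryQueryScoped("inputA");
    Map<String, String> second = libraryQueryScoped("inputB");
    System.out.println(first.get("TMP") + " / " + second.get("TMP"));
  }
}
```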

xumingming · 2017-06-20T00:25:32Z

OK, since a) each query has its own schema namespace, rather than sharing one schema namespace across several queries (I missed this before), and b) this is for building a Pipeline programmatically using BeamSql, I agree #3 might be better.

xumingming

LGTM.

Let's merge it.

mingmxu · 2017-06-20T01:47:17Z

Thanks all for the discussion. I've updated the validation logic in BeamSql.simpleQuery with option #3.

coveralls · 2017-06-20T02:57:56Z

Changes Unknown when pulling f7e4a51 on XuMingmin:BEAM-2446 into ** on apache:DSL_SQL**.

coveralls · 2017-06-20T03:02:42Z

Changes Unknown when pulling f7e4a51 on XuMingmin:BEAM-2446 into ** on apache:DSL_SQL**.

mingmxu · 2017-06-20T16:44:06Z

@takidau could you merge it if there are no pending changes?

takidau · 2017-06-22T15:48:03Z

I'm sorry for the extra work, especially after my delay in reviewing this (I've been stuck in all-day training sessions most of the week :-P), but it would be better to separate the sqlEnv changes and the PCOLLECTION changes into independent PRs, as they are unrelated AFAICT. Can you please do that? I'm happy with the changes contained in this PR as a whole, so once you have one of those factored out into a separate PR, I'm happy to merge both.

mingmxu · 2017-06-22T17:40:11Z

Removed the part that uses PCOLLECTION as the table name.
@takidau could you help merge this one first? I'll prepare another PR after that to avoid an unnecessary rebase.

coveralls · 2017-06-22T19:02:03Z

Changes Unknown when pulling 8cb47dd on XuMingmin:BEAM-2446 into ** on apache:DSL_SQL**.

mingmxu · 2017-06-22T19:35:54Z

retest this please

takidau · 2017-06-22T20:23:25Z

@xumingmin: merged. Thank you!

mingmxu · 2017-06-22T20:46:35Z

Thank you @takidau @xumingming !

coveralls · 2017-06-22T21:12:15Z

Changes Unknown when pulling 8cb47dd on XuMingmin:BEAM-2446 into ** on apache:DSL_SQL**.

mingmxu force-pushed the BEAM-2446 branch from eb5852b to a5d1a0d Compare June 16, 2017 23:27

mingmxu force-pushed the BEAM-2446 branch from a5d1a0d to 132064f Compare June 17, 2017 01:49

restrict the scope of BeamSqlEnv

ae6ace2

mingmxu force-pushed the BEAM-2446 branch from a6fe50d to 0fc16cd Compare June 20, 2017 01:36

update interface of BeamSql.simpleQuery

f7e4a51

mingmxu force-pushed the BEAM-2446 branch from 0fc16cd to f7e4a51 Compare June 20, 2017 01:40

xumingming approved these changes Jun 20, 2017

remove select * from PCOLLECTION to a separated PR.

8cb47dd

asfgit pushed a commit that referenced this pull request Jun 22, 2017

[BEAM-2446] This closes #3372

a680904

mingmxu closed this Jun 22, 2017

mingmxu deleted the BEAM-2446 branch June 22, 2017 20:46

mingmxu mentioned this pull request Jun 23, 2017

[BEAM-2503] use static table name in BeamSql.simpleQuery #3427

Merged
