-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-2446] restrict scope of BeamSqlEnv in dsl query #3372
Conversation
Some questions was left undiscussed from previous PR. I re-raise it here, the sample code of //run a simple query, and register the output as a table in BeamSql;
String sql1 = "select MY_FUNC(c1), c2 from TABLE_A";
PCollection<BeamSqlRow> outputTableA = inputTableA.apply(BeamSql.simpleQuery(sql1))
.withUdf("MY_FUNC", myFunc);
inputTableA.apply(BeamSql.simpleQuery("TABLE_A", sql1)); Let user specify the |
One more question: I see |
@xumingming , I'm open to both options. I don't think it's necessary to have a duplicate copy of |
Can't you just get the table name from the SQL query? Would be annoying for users to write BeamSql.simpleQuery("TABLE_A", "SELECT * FROM TABLE_A"); |
Technically, table name is not must, can be parsed as existing implementation. Without this parameter, user can put any table name in query. If this parameter is added, they need be consistent. |
Is there any way to restrict the set of valid table names when using simpleQuery? E.g., require the user to always use a name of TABLE or PCOLLECTION? Accepting an arbitrary string seems very strange to me. |
rebased. So now we have 3 opinions:
Technically, each is doable, it's all about usability. Any clue or have a small vote? |
My thinking is that from a user's perspective:
So from that perspective, #3 with PCOLLECTION (and maybe optionally PCOL as a shorthand) is my preferred choice. Other perspectives? |
I want to analyze this problem from another perspective: the TableSchema tableBSchema = ...;
PCollection<BeamSqlRow> inputTableB = p.apply(TextIO.read().from("/my/input/pathb"))
.apply(BeamSql.fromTextRow(tableBSchema)); with all these preparation code already there, I don't think adding an explicit tableName declaration will make it verbose. It just does the So if vote, I will vote for #2. And this also reminds me another option: can we declare the tableName when do the conversion(convert normal data into BeamSqlRow), something like: TableSchema tableBSchema = ...;
PCollection<BeamSqlRow> inputTableB = p.apply(TextIO.read().from("/my/input/pathb"))
.apply(BeamSql.fromTextRow(tableName, tableBSchema)); it tells user that:
And then you can query without specifying any tableName, but the signature of the query methods might need to change. |
If the registration can happen as part of BeamSql.fromTextRow, and be valid for any future invocation of simpleQuery, that sounds like it might be a good option. Would these registrations be valid for normal query() invocations as well, then? Or would PCollectionTuple association be necessary still? |
The registration is valid for this whole session, valid for all query methods. And the PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
//create table from TextIO;
TableSchema tableASchema = ...;
PCollection<BeamSqlRow> inputTableA = p.apply(TextIO.read().from("/my/input/patha"))
.apply(BeamSql.fromTextRow("TableA", tableASchema));
TableSchema tableBSchema = ...;
PCollection<BeamSqlRow> inputTableB = p.apply(TextIO.read().from("/my/input/pathb"))
.apply(BeamSql.fromTextRow("TableB", tableBSchema));
//run a simple query, and register the output as a table in BeamSql;
PCollection<BeamSqlRow> outputTableA = BeamSql.simpleQuery("select MY_FUNC(c1), c2 from TableA");
//run a JOIN with one table from TextIO, and one table from another query
PCollection<BeamSqlRow> outputTableB = BeamSql.query("select * from TABLE_O_A JOIN TABLE_B where ..."));
//output the final result with TextIO
outputTableB.apply(BeamSql.toTextRow()).apply(TextIO.write().to("/my/output/path")); But this seems deviate a lot from the original design, might miss some design consideration @xumingmin and @lukecwik originally have. |
@xumingming we also talked this one, and move to the existing solution which follows the Beam-style, to translate query into a Back to the discussion, from my experience:
Another point to clarify is, Overall, I slightly prefer to #3. |
I'm of the opinion that there are two usecases:
|
Ok, since a) Each query has its own schema namespace, rather than sharing the same schema namespace across several queries(I missed this before), and b) It's for building Pipeline programmatically using BeamSql. I agree #3 might be better. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Let's merge it.
Thanks all for the discussion. I've updated the validation logic in |
@takidau could you merge it if no pending changes? |
I'm sorry for the extra work, especially after my delay in reviewing this (I've been stuck in all-day training sessions most of the week :-P), but it would be better to separate the sqlEnv changes and the PCOLLECTION changes into independent PRs, as they are unrelated AFAICT. Can you please do that? I'm happy with the changes contained in this PR as a whole, so once you have one of those factored out into a separate PR, I'm happy to merge both. |
remove the part of |
retest this please |
@xumingmin: merged. Thank you! |
Thank you @takidau @xumingming ! |
R: @xumingming @takidau