Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue #692 obv dupe changes for ftd and matcher #694

Merged
merged 33 commits into from
Oct 22, 2023

Conversation

vikasgupta78
Copy link
Contributor

Issue #692 obv dupe changes for ftd and matcher

@@ -0,0 +1,247 @@
package zingg.common.core.executor;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this class is not an executor and should be broken down into two classes at least. they can live in a separate package zingg.common.core.obviousDupes. one that takes args, dsutil and creates filters. one that applies those.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had thought that may be we might want to create a phase out of it in future? 100% match filter kinds? If no such plan can break it down, no problem

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we are talking of two different things here.

  • A new possible phase for obviousDupes. If we do that, we need some kind of executor.
  • Breaking down the class

The first one can always be defined later using the classes in second.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactor to two seperate classes in commit a3d77a6 and f3480e1

@@ -126,7 +121,7 @@ public void execute() throws ZinggClientException {
ZFrame<D, R, C> obvDupePairs = getObvDupePairs(blocked);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this logic should move to obvious dupes. the matcher should just ask the obv dupes to add/remove dupes as necessary by passing blocks/pairs whatever

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic is already moved , this method delegates work to the obv dupe class. The method is kept as it gets overriden in enterprise matcher for purpose of instrumentation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is getObvDupePairs method in Matcher

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then we should override the new class in enterprise and introduce im and weave into the ent matcher.

Obvious dupes logic should be encapsulated in its own package and classes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went back and checked the code. The following should not be part of the Matcher.
if (obvDupePairs != null) { long obvDupeCount = obvDupePairs.count(); LOG.debug("obvDupePairs count " + obvDupeCount); if (obvDupeCount > 0) { blocks = removeObvDupesFromBlocks(blocks); } }

The methods in the Matcher that you have defined can continue to call the obv dupes classes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in commit 70f1c33, please review

* @return
*/

public C getObviousDupesFilter(ZFrame<D, R, C> df1, ZFrame<D, R, C> dfToJoin, String obviousDupeString, C extraAndCond) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you define the obvious filter not through a string put through json, this code becomes much simpler. how about

"obviousDupes": [
"matchCondition": [{"column":"fname"},{"column":"lname"}]
"matchCondition":[{}]
]

You can simply define this object as part of the args. It gets populated automatically, no need to parse, trim etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will create a JSON sample for various scenarios and get it reviewed before implementation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in commit c2ddce9 please review

@vikasgupta78
Copy link
Contributor Author

have completed all review comments of obv dupe filter in OSS
once you have merged than will do enterprise side of changes

public static final Log LOG = LogFactory.getLog(ObvDupeFilter.class);

protected Arguments args;
protected Context<S,D,R,C,T> context;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to pass Context or can we pass DSUtil ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also see if S and T are needed for this class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in commit b29758c , please review

ZFrame<D,R,C> onlyIds = null;

// loop thru the values and build a filter condition
for (int i = 0; i < obviousDupes.length; i++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is tyhere a reason to loop through this one by one? I thought the filter would return the and, or conditions to join on completely?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to loop thru all the match conditions and within match conditions for every field name.

e.g "obviousDupes":[{ "matchCondition":[ { "fieldName":"fname" }, { "fieldName":"stNo" }, { "fieldName":"add1" } ]},{"matchCondition":[ { "fieldName":"ssn" }] }, { "matchCondition":[{"fieldName":"fname" }, {"fieldName":"stNo" }, {"fieldName":"lname" } ]} ],

so we loop thru to get 3 matchCondition and then within each matchCondition for each fieldName

forming something like
(fname = z_fname and stNo = z_stNo and add1 = z_add1) OR
(ssn = z_ssn) OR
(fname = z_fname and stNo = z_stNo and lname = z_lname)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I get that. The point I am trying to make is can we have one single Column returned for all the ors along with the ands? Eventually the query push down is anyways going to do it in one single statement. Do we need to do the unions or can we get the resuklt using one single join?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is query formation.

We don't actually use it exactly like this.

In case of getting obv dupes from blocked:

  1. We use each and condition
  2. union all the dfs got from above
  3. do distinct
    Directly doing OR was causing performance issues

For removing obv dupe from blocks
We first form a combined condition like above and than return a combined condition with NOT:
NOT (
(fname = z_fname and stNo = z_stNo and add1 = z_add1) OR
(ssn = z_ssn) OR
(fname = z_fname and stNo = z_stNo and lname = z_lname)
)

The way we make Column filter expression is independent of how we use it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically we don;t use one final Column in end and we are flexible. You have asked similar question below, I hope this answers both

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By keeping build of column of filter expression independent of how we use it we can use it as needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also many of the methods are private only the public/protected can be used by outside code

* @param obvDupes
* @return
*/
public ZFrame<D,R,C> massageObvDupes(ZFrame<D,R,C> obvDupes) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this belongs to the other class. It seems that the class names can be swapped - ObvDupeUtil can be renamed to ObviousDupeCondiiton or even ObvDupeFilter as it is only building the join/filter conditions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please tell the names of both and I will do it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

method moved : commit 45e9205

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ObvDupeUtil is the class that applies the filters, adds and removes obv dupes
ObvDupeCondition or ObvDupeFilter is the class that reads the args vaues and creates the conditions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed in commit 6db0092 , please review

ZFrame<D,R,C> onlyIds = null;

// loop thru the values and build a filter condition
for (int i = 0; i < obviousDupes.length; i++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I get that. The point I am trying to make is can we have one single Column returned for all the ors along with the ands? Eventually the query push down is anyways going to do it in one single statement. Do we need to do the unions or can we get the resuklt using one single join?

* @param obvDupes
* @return
*/
public ZFrame<D,R,C> massageObvDupes(ZFrame<D,R,C> obvDupes) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ObvDupeUtil is the class that applies the filters, adds and removes obv dupes
ObvDupeCondition or ObvDupeFilter is the class that reads the args vaues and creates the conditions

@sonalgoyal sonalgoyal merged commit 7cfb69d into zinggAI:0.4.0 Oct 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants