data and scripts for amazon-google dataset #131

navinrathore · 2022-01-14T13:09:03Z

No description provided.

navinrathore · 2022-01-14T13:09:35Z

Run the PySpark scripts from the directory examples/amazon-google/scripts:
spark-submit prepareTestDataSpark.py

The training file currently has cluster_id and label as its last columns. I am trying to move them to the beginning; the column list used in the selection needs to be adjusted accordingly.
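A minimal sketch of the column reordering described above. The column names other than cluster_id and label are assumptions for illustration, and the helper `move_to_front` is hypothetical, not part of the scripts in this PR.

```python
def move_to_front(columns, leading):
    """Return `columns` with the names in `leading` first, keeping the rest in order."""
    missing = [c for c in leading if c not in columns]
    if missing:
        raise ValueError(f"columns not found: {missing}")
    rest = [c for c in columns if c not in leading]
    return list(leading) + rest


# Assumed layout of the training file; the real file may have different columns.
columns = ["id", "title", "manufacturer", "price", "cluster_id", "label"]
reordered = move_to_front(columns, ["cluster_id", "label"])
print(reordered)

# In a PySpark script the list could then be applied with: df = df.select(*reordered)
```

Computing the list separately from the select keeps the script independent of how many data columns a given dataset has.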

sonalgoyal · 2022-01-14T13:13:19Z

examples/amazon-google/config.json

@@ -0,0 +1,82 @@
+{


Why do we need two configs? Don't we need just one for linking?

Removed.
It was added in case somebody wants to generate their own training/labelled data.

The file removal will be part of the next check-in. The version from the first check-in is still there.

examples/amazon-google/configWithTrainingSamples.json

examples/amazon-google/scripts/prepareTestDataSpark.py

examples/amazon-google/scripts/prepareTrainDataSpark.py

sonalgoyal

What is the need for config.json?

examples/amazon-google/config.json

sonalgoyal · 2022-01-14T13:21:22Z

examples/amazon-google/config.json

+					}"
+				},
+				{
+					"name": "test2",


call it aws?

Yes. I've made the changes.
For other datasets, these fields may be updated accordingly.

examples/amazon-google/scripts/prepareTestDataSpark.py

examples/amazon-google/configWithTrainingSamples.json

navinrathore · 2022-01-15T17:46:06Z

The code was updated yesterday: some improvements were made and some mistakes were rectified.
Some analysis of the results was also done.
Results:
Test1 (Google): 2293 records in total -> 771 unique ==> after linking, 565 records were found matched
Test2 (AWS): 2293 records in total -> 1090 unique ==> 2287 records were found matched
Observations: in the test2 results, besides the matched records there were also partially matched records. Updating the model with different parameters may improve the results. The score/probability was mostly around 0.45.
Training data: both train.csv (6874 records) and valid.csv (2293 records) were used to train the model.


navinrathore · 2022-01-16T03:24:13Z

Some queries/notes about changes:

Field 'price' is 'exact'/double; all others are 'fuzzy'/string.
In configWithTrainingSamples.json, the training sample data comprises both train.csv and valid.csv.
In configWithTrainingSamples.json, is the "data" attribute compulsory? It currently has both tableA and tableB. Or could it hold the data corresponding to train.csv and valid.csv?
What would be the ideal location for the script? With some minor changes it can be made completely generic in terms of location.
The same script works with all datasets without any change.
Naming is simple: testA, testB, etc.
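To illustrate the first note, here is a rough sketch that builds field definitions with 'price' as exact/double and every other field as fuzzy/string. The column names (title, manufacturer) and the JSON key names (fieldName, fields, matchType, dataType) are assumptions and should be checked against the config files in this PR and the Zingg version in use.

```python
import json


def field_definitions(columns, exact_numeric=("price",)):
    """Build one definition per column: exact/double for numeric fields, fuzzy/string otherwise."""
    defs = []
    for name in columns:
        if name in exact_numeric:
            match_type, data_type = "exact", "double"
        else:
            match_type, data_type = "fuzzy", "string"
        defs.append(
            {
                "fieldName": name,  # assumed key names, see the note above
                "fields": name,
                "matchType": match_type,
                "dataType": data_type,
            }
        )
    return defs


# Assumed column names for the amazon-google dataset.
defs = field_definitions(["title", "manufacturer", "price"])
print(json.dumps(defs, indent=2))
```

For another dataset, only the column list and the set of exact numeric fields would need to change.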

added amazon-google dataset used in DeepMatcher

54a41be

sonalgoyal requested changes Jan 14, 2022


pyspark scripts to preprocess amazon-google data zinggAI#110

933fb69

navinrathore force-pushed the zNewModels branch from f14e405 to 933fb69 Compare January 14, 2022 19:11

sonalgoyal reviewed Jan 15, 2022


single pyspark script to do all the needful to preprocess data zinggA…

17facc4

…I#110

sonalgoyal merged commit 9cbb28b into zinggAI:main Jan 17, 2022

sonalgoyal mentioned this pull request Jan 18, 2022

Revert "data and scripts for amazon-google dataset " #135

Merged
