data and scripts for amazon-google dataset #131

Merged
merged 3 commits into from
Jan 17, 2022

Conversation

navinrathore
Contributor

No description provided.

@navinrathore
Contributor Author

Run the PySpark scripts from the directory examples/amazon-google/scripts:
spark-submit prepareTestDataSpark.py

The training file currently has cluster_id and label at the end. I am trying to move them to the beginning; the column selection list needs to be adjusted accordingly.
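
As a rough illustration of the column reordering described above, here is a minimal sketch in plain Python. The column names other than cluster_id and label are assumptions made for the example, and the helper `move_to_front` is hypothetical, not something taken from the scripts in this PR.

```python
def move_to_front(columns, leading):
    """Return `columns` with the `leading` names first, keeping the rest in order."""
    missing = [c for c in leading if c not in columns]
    if missing:
        raise ValueError(f"columns not found: {missing}")
    rest = [c for c in columns if c not in leading]
    return list(leading) + rest


# Assumed layout of the training file: only cluster_id and label come from the PR text.
columns = ["id", "title", "manufacturer", "price", "cluster_id", "label"]
reordered = move_to_front(columns, ["cluster_id", "label"])
print(reordered)
```

With a PySpark DataFrame `df`, the resulting list could then be passed on as `df.select(*reordered)`.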

@@ -0,0 +1,82 @@
{
Member

Why do we need two configs? Don't we need just one for linking?

Contributor Author

Removed.
It was added in case somebody wants to generate their own training/labelled data.

Contributor Author

The file removal will be part of the next check-in. The version from the first check-in is still there.

Member

@sonalgoyal sonalgoyal left a comment

what is the need for config.json?

examples/amazon-google/config.json
}"
},
{
"name": "test2",
Member

call it aws?

Contributor Author

ok

Contributor Author

Yes. I've made the changes.
For other datasets, these fields may be updated accordingly.

@navinrathore
Contributor Author

The code was updated yesterday: some improvements were made and some mistakes were rectified.
Some analysis of the results was also done.
Results
Test1 (Google): 2293 records in total -> 771 unique ==> after linking, 565 records found matched
Test2 (AWS): 2293 records in total -> 1090 unique ==> 2287 records found matched
Observations: in the Test2 results, besides the matched records there were also partially matched records. Tuning some model parameters might improve the results. The score/probability was mostly around 0.45.
Training data: both train.csv (6874 records) and valid.csv (2293 records) were used to train the model.

@navinrathore
Contributor Author

Some queries/notes about changes:

  • The field 'price' is 'exact'/double; all other fields are 'fuzzy'/string.
  • In configWithTrainingSamples.json, the training sample data comprises both train.csv and valid.csv.
  • In configWithTrainingSamples.json, is the "data" attribute compulsory? It currently has both tableA and tableB. Alternatively, it could hold the data corresponding to train.csv and valid.csv.
  • What would be the ideal location for the script? With some minor changes it can be made completely generic in terms of location.
  • The same script works with all datasets without any change.
  • Naming is simple: testA, testB, etc.
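
To make the first note above concrete, here is a small sketch that generates field definitions where 'price' is exact/double and every other field is fuzzy/string. The field names title and manufacturer, the helper `build_field_definitions`, and the key names (fieldName, fields, matchType, dataType) are assumptions for illustration; the actual config schema used in this PR may differ.

```python
import json


def build_field_definitions(field_names, exact_double_fields=("price",)):
    """Build one definition per field: exact/double for listed fields, fuzzy/string otherwise."""
    definitions = []
    for name in field_names:
        is_exact = name in exact_double_fields
        definitions.append({
            # Key names are assumed, not confirmed by the PR.
            "fieldName": name,
            "fields": name,
            "matchType": "exact" if is_exact else "fuzzy",
            "dataType": "double" if is_exact else "string",
        })
    return definitions


# Hypothetical field list; only 'price' is named in the notes above.
field_defs = build_field_definitions(["title", "manufacturer", "price"])
print(json.dumps(field_defs, indent=2))
```

For other datasets, only the field list and the set of exact/double fields would need to change.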
