Export labeled data from zingg learner and import it to new zingg instance/model #117

Closed
delta824 opened this issue Jan 8, 2022 · 10 comments

@delta824

delta824 commented Jan 8, 2022

Feature request: the ability to export the labeled data from Zingg's learner and import it into a new Zingg config/model. Currently there is no way to retain the labeled data when the config file is changed or updated. For example, after adding a column to the same dataset, the existing labels are still true and unchanged, yet they are lost.

@sonalgoyal
Member

sonalgoyal commented Jan 8, 2022

Thanks for reporting @delta824. What would be the values of the columns you want to add?

@delta824
Author

delta824 commented Jan 8, 2022

> Thanks for reporting @delta824. What would be the values of the columns you want to add?

Text and/or numbers

@sonalgoyal
Member

What would be the default values?

@delta824
Author

delta824 commented Jan 8, 2022

For example, the dataset(s) were labeled using zingg's learner based on 5 columns. Later on, an additional column will be appended to the dataset(s) to provide more details to help improve match accuracy.

Zingg could output the labeled data from the learner in the same format as the "Using preexisting training data" feature described in #115 and https://docs.zingg.ai/docs/setup/training/addOwnTrainingData.html

@sonalgoyal
Member

I see your point. The problem here is: what value should Zingg assign to the newly added 6th column in the old labelled data, which has only 5 columns? When building the model, we need to learn from all the columns collectively, so we can't leave it blank or null, as that would not be representative of how the 6th column will look in the data.

@delta824
Author

delta824 commented Jan 8, 2022

Is it possible to rebuild the model from zero and learn from all columns collectively, including the 6th newly added column using the exported labeled data? Similar to how the model will be built when using pre-existing training data. So it won't be a modification of the existing model, but rather a full re-build and new values for all columns again.

@sonalgoyal
Member

Yes, that is clearly possible. In fact, every time you run train, a new model is created which overwrites the last one. Does something like this work, @delta824?

  • Zingg command to export the labelled data to CSV
  • User changes the columns
  • User adds this data through trainingSamples
  • A few rounds of findTrainingData and label to tune

@delta824
Author

delta824 commented Jan 8, 2022

Yes, that would work!

  • export labeled data from zingg's learner in the same format as adding data through trainingSamples
  • user can add/remove/modify column(s) to exported labeled data
  • user runs train phase which will create a new model based on all columns collectively
  • run a few rounds of findTrainingData and label to tune
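
The "add a column" step in the list above can be done with any tool. Below is a minimal sketch in plain Python, assuming the export is a headerless CSV (the default for Spark's `write.csv`) and that one field of each row holds a record id. The function name `append_column`, the `id_index` position and the lookup dict are all hypothetical, chosen only for illustration; the actual layout of the exported file should be checked first.

```python
import csv


def append_column(src_path, dst_path, id_index, new_values):
    """Copy a headerless CSV, appending one extra field to every row.

    id_index   -- position of the record-id field (assumed layout)
    new_values -- dict mapping record id -> value for the new column
    """
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            # A missing id raises KeyError instead of silently writing a blank.
            writer.writerow(row + [new_values[row[id_index]]])
```

The lookup fails loudly on a missing id rather than filling in a blank, since a blank or null value would not be representative of the new column, as discussed earlier in this thread.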

@sonalgoyal
Member

@navinrathore can you please provide steps to convert the parquet files in the model's marked folder to CSV from spark-shell? Also, see if there is a way to print or convert the schema into a form that can be used in the config JSON.
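
For the schema part, here is a rough sketch of what such a string could look like. It builds the JSON layout that Spark uses to describe a `StructType` (in a pyspark session, `DataFrame.schema.json()` returns this form for a loaded DataFrame). Whether the config accepts exactly this layout is an assumption to verify against the Zingg docs, and the helper name and column names are hypothetical.

```python
import json


def struct_schema_json(columns):
    """Build a Spark StructType description from (name, type) pairs."""
    fields = [
        {"name": name, "type": dtype, "nullable": True, "metadata": {}}
        for name, dtype in columns
    ]
    return json.dumps({"type": "struct", "fields": fields})


# Hypothetical column names, for illustration only.
schema = struct_schema_json([("fname", "string"), ("lname", "string")])

# Dumping a second time escapes the quotes, so the result can be pasted
# as a string value inside another JSON document.
print(json.dumps(schema))
```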

@navinrathore
Contributor

navinrathore commented Jan 10, 2022

Here are the steps to run in Python.

$ pyspark
>>> parquetFile = spark.read.parquet("models/<modelId>/trainingData/marked")
>>> parquetFile.coalesce(1).write.csv("outfile")

Notes:
a) The input is the directory where the marked files are stored. Replace <modelId> with the actual value.
b) The output is also a directory, which contains the generated CSV file.
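
Because the output is a directory, the CSV itself sits inside it under a generated `part-*.csv` name next to Spark's marker files. A small sketch for copying it to a stable path, using only the standard library; the function name `collect_single_csv` and the target path are hypothetical:

```python
import glob
import os
import shutil


def collect_single_csv(spark_output_dir, target_path):
    """Copy the lone part-*.csv file out of a Spark output directory."""
    parts = sorted(glob.glob(os.path.join(spark_output_dir, "part-*.csv")))
    if len(parts) != 1:
        # coalesce(1) should yield one part file; anything else is unexpected.
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    shutil.copyfile(parts[0], target_path)
    return target_path
```

For example, `collect_single_csv("outfile", "marked.csv")` would copy the single generated file to `marked.csv`, assuming the export above wrote exactly one part.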

navinrathore added a commit to navinrathore/zingg-1 that referenced this issue Jan 14, 2022
sonalgoyal added a commit that referenced this issue Jan 14, 2022
Document for Exporting labeled data as training samples #117