How to use on custom dataset? #3

Open
sarmientoj24 opened this issue Jul 6, 2021 · 7 comments

@sarmientoj24

For example, I want to use only one class from OpenImages, say the car class, together with the InceptionV3 embedding.

How do I use it on a custom dataset?

@TDeVries
Collaborator

TDeVries commented Jul 6, 2021

It should be fairly easy to apply to custom datasets. You need to create a dataset object that contains your data and then pass it to the select_instances function. See https://github.com/uoguelph-mlrg/instance_selection_for_gans#applying-instance-selection-to-your-own-dataset or https://github.com/uoguelph-mlrg/instance_selection_for_gans/blob/master/instance_selection.py#L91. To use the InceptionV3 embedding, pass 'inceptionv3' as the embedding argument.

If you only want a single class from a larger dataset (like OpenImages), you may need to create a custom dataset object that only loads images from that class. If all car images are in a single folder, you could probably use an ImageFolder dataset (https://pytorch.org/vision/stable/datasets.html#torchvision.datasets.ImageFolder).
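
As a minimal sketch of the single-folder approach: ImageFolder expects a root directory with one subdirectory per class, so one option is to copy the images of the wanted class into `root/<class_name>/`. The helper name `collect_single_class`, the `(image_id, label)` annotation pairs, and the `.jpg` file naming are assumptions made for illustration; they are not the OpenImages annotation format.

```python
import shutil
from pathlib import Path


def collect_single_class(annotations, image_dir, out_root, class_name, ext=".jpg"):
    """Copy every image labelled `class_name` into out_root/class_name/.

    `annotations` is an iterable of (image_id, label) pairs (hypothetical
    format); images are assumed to live at image_dir/<image_id><ext>.
    Returns the sorted list of copied file paths.
    """
    target = Path(out_root) / class_name
    target.mkdir(parents=True, exist_ok=True)

    copied = []
    seen = set()
    for image_id, label in annotations:
        # Keep only the wanted class, and copy each image at most once.
        if label != class_name or image_id in seen:
            continue
        seen.add(image_id)
        src = Path(image_dir) / f"{image_id}{ext}"
        if src.is_file():  # skip annotations whose image file is missing
            dst = target / src.name
            shutil.copy2(src, dst)
            copied.append(dst)
    return sorted(copied)
```

The resulting `out_root` could then be loaded with `torchvision.datasets.ImageFolder(out_root, transform=...)`, and that dataset object passed to select_instances.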

@sarmientoj24
Author

For the custom dataset, is it fine if I only have one class that I have already loaded from, say, a folder?

@TDeVries
Collaborator

TDeVries commented Jul 7, 2021

Yes, that should work fine.

@kc-puttagunta

Is there a place where I can access this subset of selected instances? I would like to inspect these images and pass them on to StyleGAN2 as my training dataset.
I have an unlabelled custom dataset in a single folder, and I applied the select_instances function to it with InceptionV3 embeddings, but I don't see any output, and the original folder has not been reduced.
Thanks!

@TDeVries
Collaborator

The select_instances function returns a Subset dataset object, so if you want to view the images it selected, you could iterate through it to generate some sample sheets or even save each image to a separate file, if that's what you are looking for. It works by keeping track of the indices from the original dataset that it needs for the reduced dataset, so your original folder won't be changed, and it doesn't save any new files.
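
To illustrate why the original folder is untouched, here is a minimal sketch of an index-based view, modelled on how `torch.utils.data.Subset` behaves. `IndexSubset` is a stand-in written for this example, not the class the repository returns, and the string items are placeholders for images.

```python
class IndexSubset:
    """A read-only view over `dataset` restricted to `indices`."""

    def __init__(self, dataset, indices):
        self.dataset = dataset
        self.indices = list(indices)

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, i):
        # Translate a position in the subset to a position in the original.
        return self.dataset[self.indices[i]]


original = ["img0", "img1", "img2", "img3", "img4"]  # placeholder items
subset = IndexSubset(original, [4, 1, 2])

# Iterating the view only reads from the original; nothing is copied or removed.
selected = [subset[i] for i in range(len(subset))]
print(selected)  # ['img4', 'img1', 'img2']
```

With a real Subset of an ImageFolder, each item is an (image, label) pair, so the same loop could call `image.save(...)` on each PIL image, provided no tensor transform has been applied.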

If you want to save the indices that it selected, you can pass the function a file path ending in .pkl (https://github.com/uoguelph-mlrg/instance_selection_for_gans/blob/master/instance_selection.py#L95), and it will save a pickle file of indices. However, that might be a bit tricky to interpret, since the indices refer to the order of the file paths in the original dataset object, which may not match how files are listed in your directory. You might be able to dig into the dataset attributes to find a list of file paths, though. For example, if your original dataset is a DatasetFolder or ImageFolder object, then it should have a samples attribute, a list of (path, class_index) tuples, whose order lines up with the selected indices.
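
A minimal sketch of mapping saved indices back to files, assuming the pickle holds a plain sequence of integer indices into the original dataset and that `samples` is a list of `(path, class_index)` tuples as on ImageFolder. The function name `export_selected` and the output naming scheme are hypothetical.

```python
import pickle
import shutil
from pathlib import Path


def export_selected(index_file, samples, out_dir):
    """Copy the files picked by instance selection into `out_dir`.

    Assumes `index_file` is a pickle of integer indices into `samples`,
    and `samples` is a list of (path, class_index) tuples.
    Returns the selected source paths in index order.
    """
    with open(index_file, "rb") as f:
        indices = pickle.load(f)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    selected_paths = []
    for i in indices:
        path = Path(samples[int(i)][0])
        # Prefix with the index so files with equal names do not collide.
        shutil.copy2(path, out_dir / f"{int(i):06d}_{path.name}")
        selected_paths.append(path)
    return selected_paths
```

With an ImageFolder called `dataset`, the call might look like `export_selected("indices.pkl", dataset.samples, "selected/")`, and the output folder could then serve as a training set for another model.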

@kc-puttagunta

Thank you for the prompt response. This was very helpful.
I am applying your code to a generation problem in the low-data regime, and it has proven somewhat useful in bringing about convergence. I have a few questions about identifying clusters within the data, so that high-density data belonging to a single homogeneous cluster can be retained more precisely. Currently, the retention ratio seems somewhat arbitrary and can only be validated after the generation task.
If this piques your interest, are you available for a chat sometime? :)
Thanks in advance!

@kc-puttagunta

Also, would it help to extract embeddings from more recent and deeper architectures that perform better than InceptionV3 and others?
