Odd size revamp #247

Merged: 14 commits on Feb 7, 2024


@sanjanag sanjanag commented Jan 29, 2024

  1. The major change in this PR is the logic behind the odd size issue type.
    Previous logic:
    An image was marked as oddly sized if it was x times larger or smaller than the median size in the dataset, where x was hardcoded to 10.
    New logic:
    We compute q1 = 25th percentile and q3 = 75th percentile of the image sizes in the dataset, with IQR = q3 - q1.
    An image is marked as oddly sized if size > q3 + 3 * IQR or size < q1 - 3 * IQR.

For scoring the issues, the distance of each image's size from the midpoint, where midpoint = (q1 + q3) / 2, is computed and normalized such that images within the range have a score of 0.5 or less and outlier images have a score > 0.5. There is one edge case, where q1 = median = q3. In this case images with size = median are assigned a score of 1.0, the rest of the scores are scaled accordingly, and the threshold is 1.0. This threshold can be supplied as a hyperparameter.

Since cleanvision shows low scores for problematic images, these scores are flipped and clipped between 0 and 1.
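
To make the rule and the scoring concrete, here is a minimal sketch in plain Python. It is not the code from this PR: it assumes the usual IQR fences (q1 - k * IQR and q3 + k * IQR), normalizes the distance from the midpoint by the full width of that range so that a size exactly on a fence scores 0.5, and leaves out the q1 = q3 edge case. The names `odd_size_scores` and `iqr_factor` are made up for illustration.

```python
from statistics import quantiles


def odd_size_scores(sizes, iqr_factor=3.0):
    """Return (scores, is_odd) for a list of image sizes (illustrative sketch)."""
    # "inclusive" uses linear interpolation between the sorted data points.
    q1, _, q3 = quantiles(sizes, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        # The q1 == median == q3 edge case is not covered by this sketch.
        raise ValueError("q1 == q3: degenerate case not handled here")

    lower = q1 - iqr_factor * iqr
    upper = q3 + iqr_factor * iqr
    midpoint = (q1 + q3) / 2
    width = upper - lower  # distance / width is exactly 0.5 on a fence

    scores, is_odd = [], []
    for size in sizes:
        raw = abs(size - midpoint) / width          # > 0.5 means outlier
        scores.append(1.0 - min(max(raw, 0.0), 1.0))  # flip, clip to [0, 1]
        is_odd.append(size < lower or size > upper)
    return scores, is_odd


sizes = [100, 110, 120, 130, 140, 1000]
scores, is_odd = odd_size_scores(sizes)
print(is_odd)      # only the last image is outside the range
print(scores[-1])  # 0.0 after flipping and clipping
```

With these made-up sizes the range is [37.5, 212.5], so only the 1000 image is flagged, and every other image keeps a flipped score above 0.5.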

  2. Another change is the output of imagelab.info['statistics']. Earlier it held the raw values of each image property for the whole dataset; now it contains summary statistics from pd.Series.describe() instead.

Here's an example of how it would look

Screenshot 2024-02-01 at 4 57 27 PM
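
In case the screenshot does not render, here is roughly what `pd.Series.describe()` returns for a single numeric property. The `size` values are invented, and how the summaries are arranged inside imagelab.info['statistics'] is not shown by this snippet.

```python
import pandas as pd

# Hypothetical per-image values of one property (not taken from any dataset).
sizes = pd.Series([100.0, 110.0, 120.0, 130.0, 140.0, 1000.0], name="size")

# describe() summarizes the series instead of keeping every raw value.
summary = sizes.describe()
print(summary.index.tolist())
# ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
print(summary["25%"], summary["75%"])  # the q1 and q3 used by the IQR rule
```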
  3. In the visualization of odd sized images, changed the title to size instead of original size.
  4. Changed the signature of the internal method ImageProperty.mark_issue from
    def mark_issue(self, scores: pd.DataFrame, threshold: float, issue_type: str): to
    def mark_issue(self, scores: pd.DataFrame, issue_type: str, threshold: Optional[float] = None)
    This makes it more visible that the threshold argument is optional.
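
To illustrate what the new argument order means at a call site, here is a small sketch. `ImagePropertySketch`, its `DEFAULT_THRESHOLD` of 0.5, the `<issue_type>_score` and `is_<issue_type>_issue` column names, and the method body are all assumptions for illustration; only the signature comes from this PR.

```python
from typing import Optional

import pandas as pd


class ImagePropertySketch:
    """Hypothetical stand-in, not cleanvision's ImageProperty."""

    DEFAULT_THRESHOLD = 0.5  # assumed fallback when no threshold is passed

    def mark_issue(
        self,
        scores: pd.DataFrame,
        issue_type: str,
        threshold: Optional[float] = None,
    ) -> pd.DataFrame:
        if threshold is None:
            threshold = self.DEFAULT_THRESHOLD
        # Low scores indicate problematic images.
        flagged = scores[f"{issue_type}_score"] < threshold
        return flagged.to_frame(name=f"is_{issue_type}_issue")


scores = pd.DataFrame({"odd_size_score": [0.86, 0.97, 0.0]})
checker = ImagePropertySketch()

default_flags = checker.mark_issue(scores, "odd_size")                # threshold omitted
strict_flags = checker.mark_issue(scores, "odd_size", threshold=0.9)  # explicit override
```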

Results on cleanvision test/demo dataset

Screenshot 2024-02-01 at 4 56 12 PM

codecov bot commented Feb 1, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (53ee321) 95.93% compared to head (5d51a9f) 96.10%.
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #247      +/-   ##
==========================================
+ Coverage   95.93%   96.10%   +0.16%     
==========================================
  Files          16       16              
  Lines         985     1001      +16     
  Branches      194      195       +1     
==========================================
+ Hits          945      962      +17     
+ Misses         20       19       -1     
  Partials       20       20              

@sanjanag sanjanag marked this pull request as ready for review February 2, 2024 18:24
@sanjanag sanjanag requested a review from elisno February 2, 2024 23:08
@elisno elisno left a comment

LGTM.

Just curious if you think a size distribution that is multi-modal (e.g. mostly 8Mb and 12Mb images, but a handful of, say, 16Mb images) should work well with this IQR rule?

@@ -153,17 +150,17 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The main way to interface with your data is via the `Imagelab` class. This class can be used to understand the issues in your dataset at a high level (global overview) and low level (issues and quality scores for each image) as well as additional information about the dataset. It has three main attributes:\n",
+"The main way to interface with your data is via the [Imagelab](https://cleanvision.readthedocs.io/en/latest/cleanvision/imagelab.html#cleanvision.imagelab.Imagelab) class. This class can be used to understand the issues in your dataset at a high level (global overview) and low level (issues and quality scores for each image) as well as additional information about the dataset. It has three main attributes:\n",
Member

Should this link be "stable" or are these the same?

Member

But both should work.

Member Author

Tutorial installs the latest version (main branch), so I think this link is good.

sanjanag commented Feb 7, 2024

> Just curious if you think a size distribution that is multi-modal (e.g. mostly 8Mb and 12Mb images, but a handful of, say, 16Mb images) should work well with this IQR rule?

It wouldn't, unless you change the IQR multiplier hyperparameter. Say most of the images are 8MB and 12MB, such that the 25th percentile is 8 and the 75th percentile is 12. The rule would not flag the 16MB images, since the IQR multiplier is 3. I used 3 instead of the standard 1.5 because it looked like a better threshold across the benchmark datasets I tested with; 1.5 flagged too many images as odd sized.
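
To put numbers on this example, here is a short sketch of the bounds for a few multipliers. It assumes the usual IQR fences (q1 - k * IQR and q3 + k * IQR); `iqr_bounds` is a made-up helper, not a cleanvision function.

```python
def iqr_bounds(q1, q3, iqr_factor):
    """Lower and upper fences for a given IQR multiplier (illustrative helper)."""
    iqr = q3 - q1
    return q1 - iqr_factor * iqr, q3 + iqr_factor * iqr


# q1 = 8 and q3 = 12 as in the example above, so IQR = 4.
for factor in (3.0, 1.5, 0.5):
    lower, upper = iqr_bounds(8.0, 12.0, factor)
    print(factor, (lower, upper), 16.0 > upper)
# 3.0 (-4.0, 24.0) False
# 1.5 (2.0, 18.0) False
# 0.5 (6.0, 14.0) True
```

Under these assumptions the upper fence is 12 + 4k, so a 16MB image is only flagged once the multiplier drops below 1.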

@sanjanag sanjanag merged commit 972f060 into cleanlab:main Feb 7, 2024
24 checks passed
@sanjanag sanjanag deleted the odd-size-revamp branch February 7, 2024 19:04