Odd size revamp #247

sanjanag · 2024-01-29T07:46:39Z

The major change in this PR is the logic behind the odd size issue type.
Previous logic:
An image was flagged if it was x times larger or smaller than the median size in the dataset, where x was hardcoded to 10.
New logic:
We now compute q1 (the 25th percentile) and q3 (the 75th percentile) of the image sizes in the dataset, and IQR = q3 - q1.
An image is marked as oddly sized if size > q1 + 3 * IQR or size < q3 - 3 * IQR.

To score the issues, the distance of each image's size from the midpoint is computed, where midpoint = (q1 + q3) / 2. The distance is normalized so that images within the range have a score of 0.5 or less and outlier images have a score > 0.5. There is one edge case, where q1 = median = q3. In this case, images with size = median are assigned a score of 1.0, the remaining scores are scaled accordingly, and the threshold is 1.0. This threshold can be supplied as a hyperparameter.

Since cleanvision shows low scores for problematic images, these scores are flipped and clipped between 0 and 1.
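The scoring described above can be sketched roughly as follows. This is a minimal illustration built only from the description in this comment, not the code in the PR: the function name `odd_size_scores` and the parameter name `iqr_factor` are hypothetical, and the bounds follow the description literally (the conventional Tukey fences would instead be q3 + k * IQR and q1 - k * IQR). It reads "0.5 or less" as the normalized distance before the flip, so after flipping, in-range images end up with scores of 0.5 or more.

```python
import numpy as np


def odd_size_scores(sizes, iqr_factor=3.0):
    """Return (scores, is_issue), following the PR description literally.

    Hypothetical helper for illustration; assumes iqr_factor > 0.5 so that
    the upper bound lies above the lower bound.
    """
    sizes = np.asarray(sizes, dtype=float)
    q1, q3 = np.percentile(sizes, [25, 75])
    iqr = q3 - q1

    # Bounds exactly as written in the description.
    upper = q1 + iqr_factor * iqr
    lower = q3 - iqr_factor * iqr

    midpoint = (q1 + q3) / 2
    distance = np.abs(sizes - midpoint)

    if iqr > 0:
        # Dividing by the full width puts both bounds at a normalized distance of 0.5.
        norm_value = upper - lower
        threshold = 0.5
    else:
        # Edge case q1 == median == q3: only sizes equal to the median keep a score of 1.0.
        norm_value = midpoint if midpoint > 0 else 1.0
        threshold = 1.0

    # Flip so that low scores mean "problematic", then clip into [0, 1].
    scores = 1.0 - np.clip(distance / norm_value, 0.0, 1.0)
    return scores, scores < threshold


# One clear outlier among otherwise similar sizes.
scores, is_issue = odd_size_scores([10, 11, 12, 13, 14, 100])

# Degenerate case where q1 == q3.
edge_scores, edge_is_issue = odd_size_scores([50, 50, 50, 50, 200])
```

In the first call only the size-100 image is flagged (its score is clipped to 0); in the second, every image of size 50 scores 1.0 and the size-200 image is flagged.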

Another change is the output of `imagelab.info['statistics']`. Previously it held the raw image-property values for every image in the dataset; it now holds the summary statistics returned by `pd.Series.describe()` instead.

Here's an example of how it would look
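As a rough illustration of the shape of such summaries (not the actual cleanvision output), this is what `pd.Series.describe()` returns for a numeric column. The `sizes` values are made up for the sketch.

```python
import pandas as pd

# Hypothetical per-image sizes; real values come from the dataset being checked.
sizes = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 100.0], name="size")

# describe() collapses the column into a fixed set of summary statistics.
summary = sizes.describe()
print(summary)
# The index holds: count, mean, std, min, 25%, 50%, 75%, max
```

The 25% and 75% entries are the same quartiles the new odd size rule is built on.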

In the visualization of odd-sized images, the title is now size instead of original size.
The signature of the internal method `ImageProperty.mark_issue` changed from
`def mark_issue(self, scores: pd.DataFrame, threshold: float, issue_type: str):`
to
`def mark_issue(self, scores: pd.DataFrame, issue_type: str, threshold: Optional[float] = None)`
This makes it more visible that the threshold argument is optional.
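To show what the reordered signature allows at the call site, here is a standalone sketch. It is not the `ImageProperty` method: the body, the `DEFAULT_THRESHOLD` fallback and the column names are assumptions made for illustration only.

```python
from typing import Optional

import pandas as pd

DEFAULT_THRESHOLD = 0.5  # hypothetical fallback, not taken from cleanvision


def mark_issue(
    scores: pd.DataFrame, issue_type: str, threshold: Optional[float] = None
) -> pd.DataFrame:
    """Flag rows whose score falls below the threshold (illustrative only)."""
    # Fall back to the default when the caller does not pass a threshold.
    effective = DEFAULT_THRESHOLD if threshold is None else threshold
    # The column names below are assumptions made for this sketch.
    flags = scores[f"{issue_type}_score"] < effective
    return pd.DataFrame({f"is_{issue_type}_issue": flags})


scores = pd.DataFrame({"odd_size_score": [0.9, 0.4, 0.0]})

# The threshold can now be omitted entirely...
default_flags = mark_issue(scores, "odd_size")
# ...or passed explicitly as a keyword argument.
strict_flags = mark_issue(scores, "odd_size", threshold=0.95)
```

Because Python requires parameters with defaults to come after those without, giving `threshold` a default of `None` is what forces it to move behind `issue_type`.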

Results on cleanvision test/demo dataset

codecov · 2024-02-01T07:09:55Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (53ee321) 95.93% compared to head (5d51a9f) 96.10%.
Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #247      +/-   ##
==========================================
+ Coverage   95.93%   96.10%   +0.16%     
==========================================
  Files          16       16              
  Lines         985     1001      +16     
  Branches      194      195       +1     
==========================================
+ Hits          945      962      +17     
+ Misses         20       19       -1     
  Partials       20       20


elisno

LGTM.

Just curious if you think a size distribution that is multi-modal (e.g. mostly 8Mb and 12Mb images, but a handful of, say, 16Mb images) should work well with this IQR rule?

elisno · 2024-02-02T23:25:51Z

docs/source/tutorials/tutorial.ipynb

@@ -153,17 +150,17 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "The main way to interface with your data is via the `Imagelab` class. This class can be used to understand the issues in your dataset at a high level (global overview) and low level (issues and quality scores for each image) as well as additional information about the dataset. It has three main attributes:\n",
+    "The main way to interface with your data is via the [Imagelab](https://cleanvision.readthedocs.io/en/latest/cleanvision/imagelab.html#cleanvision.imagelab.Imagelab) class. This class can be used to understand the issues in your dataset at a high level (global overview) and low level (issues and quality scores for each image) as well as additional information about the dataset. It has three main attributes:\n",


Should this link be "stable" or are these the same?

elisno · 2024-02-02

But both should work.

sanjanag · 2024-02-06

The tutorial installs the latest version (main branch), so I think this link is good.

sanjanag · 2024-02-07T18:13:03Z

> Just curious if you think a size distribution that is multi-modal (e.g. mostly 8Mb and 12Mb images, but a handful of, say, 16Mb images) should work well with this IQR rule?

It wouldn't, unless you change the IQR hyperparameter. Say most of the images are 8MB and 12MB, such that the 25th percentile is 8 and the 75th percentile is 12. The 16MB images would not be detected, since the IQR multiplier is 3. I used 3 instead of the standard 1.5 because it looked like a better threshold across the benchmark datasets I tested with; 1.5 flagged too many examples as odd sized.
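The arithmetic in this reply can be checked directly. The sketch below assumes q1 = 8 and q3 = 12 as in the example, and evaluates the upper bound both as the PR description states it (q1 + factor * IQR) and as the conventional Tukey fence (q3 + factor * IQR). The function names are hypothetical and nothing here is taken from the cleanvision source.

```python
def upper_bound_as_described(q1, q3, factor):
    """Upper bound as written in the PR description: q1 + factor * IQR."""
    return q1 + factor * (q3 - q1)


def upper_bound_conventional(q1, q3, factor):
    """Conventional Tukey upper fence: q3 + factor * IQR."""
    return q3 + factor * (q3 - q1)


q1, q3, size = 8.0, 12.0, 16.0

results = {}
for factor in (3.0, 1.5):
    described = upper_bound_as_described(q1, q3, factor)
    conventional = upper_bound_conventional(q1, q3, factor)
    # Record each bound and whether a 16MB image would exceed it.
    results[factor] = (described, size > described, conventional, size > conventional)
    print(factor, results[factor])
# 3.0 (20.0, False, 24.0, False)
# 1.5 (14.0, True, 18.0, False)
```

With a multiplier of 3, the 16MB images stay inside the upper bound under either reading. With 1.5, they are flagged only under the bound as written in the description.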

sanjanag added 9 commits January 28, 2024 23:44

fixed formatting

e22e5e8

updated flake8 check

5314972

Merge branch 'main' into odd-size-revamp

ba986fb

Fixed test

2e48319

Fixed test

b563265

Removed unncessary dependencies in dev req file

ec1fd57

Fixed odd size tests

1cd9f1a

Fixed mypy error'

96f0143

Fixed typing syntax

8c73b2e

sanjanag added 3 commits February 1, 2024 15:16

Updated odd size title key

459ffaf

Update info['statistics'] with describe stats

50076ab

Updated tutorial notebook

1e311fe

sanjanag added 2 commits February 1, 2024 15:32

Fixed tests

ab50895

Merge branch 'main' into odd-size-revamp

5d51a9f

sanjanag marked this pull request as ready for review February 2, 2024 18:24

sanjanag requested a review from elisno February 2, 2024 23:08

elisno approved these changes Feb 2, 2024

sanjanag merged commit 972f060 into cleanlab:main Feb 7, 2024
24 checks passed

sanjanag deleted the odd-size-revamp branch February 7, 2024 19:04
