4.1.0-java

Latest
@github-actions github-actions released this 06 Aug 06:51
f2984b5
Fix Confidence Adjustment for Larger Shingle Sizes (#407)

* Fix Confidence Adjustment for Larger Shingle Sizes

This PR addresses further adjustments to the confidence calculation issue discussed in PR 405. While PR 405 successfully resolved the issue for a shingle size of 4, it did not achieve the same results for larger shingle sizes like 8.

Key Changes
1. Refinement of seenValues Calculation:
* Previously, the formula increased confidence even as numImputed (number of imputations seen) increased because seenValues (all values seen) also increased.
* This PR fixes the issue by counting only non-imputed values as seenValues.
2. Upper Bound for numImputed:
* numImputed is now capped at the shingle size.
* The impute fraction calculation, which uses numberOfImputed * 1.0 / shingleSize, now ensures the fraction does not exceed 1.
3. Decrementing numberOfImputed:
* The numberOfImputed is decremented when there is no imputation.
* Previously, numberOfImputed remained unchanged when an imputation occurred, because it was both incremented and decremented, which kept the imputation fraction constant. With this PR, the imputation fraction reflects the current state, so the forest update decision, which relies on that fraction, works correctly: the forest is updated only when the imputation fraction is below the threshold of 0.5.
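
The three changes above can be illustrated with a minimal sketch. This is not the library's code: the class ImputationTracker and its method names are hypothetical, and clamping the counter at zero on decrement is an assumption. Only the cap at the shingle size, the formula numberOfImputed * 1.0 / shingleSize, the counting of non-imputed values as seenValues, and the 0.5 threshold come from the description above.

```java
// Hypothetical sketch of the bookkeeping described above; not the library's API.
final class ImputationTracker {
    // Threshold taken from the description: update only when the fraction is below 0.5.
    private static final double UPDATE_THRESHOLD = 0.5;

    private final int shingleSize;
    private int numberOfImputed = 0;
    private long seenValues = 0; // counts non-imputed values only

    ImputationTracker(int shingleSize) {
        if (shingleSize <= 0) {
            throw new IllegalArgumentException("shingleSize must be positive");
        }
        this.shingleSize = shingleSize;
    }

    /** Records one incoming value and whether it had to be imputed. */
    void record(boolean imputed) {
        if (imputed) {
            // Cap the counter at the shingle size so the fraction never exceeds 1.
            numberOfImputed = Math.min(numberOfImputed + 1, shingleSize);
        } else {
            // Decrement on a real observation; the floor at 0 is an assumption.
            numberOfImputed = Math.max(numberOfImputed - 1, 0);
            seenValues++;
        }
    }

    double imputeFraction() {
        return numberOfImputed * 1.0 / shingleSize;
    }

    boolean shouldUpdateForest() {
        return imputeFraction() < UPDATE_THRESHOLD;
    }

    long seenValues() {
        return seenValues;
    }

    public static void main(String[] args) {
        ImputationTracker tracker = new ImputationTracker(8);
        for (int i = 0; i < 10; i++) {
            tracker.record(true); // a long run of imputed values
        }
        System.out.println("fraction after imputations: " + tracker.imputeFraction());
        for (int i = 0; i < 5; i++) {
            tracker.record(false); // real observations arrive again
        }
        System.out.println("fraction after recovery: " + tracker.imputeFraction());
        System.out.println("update forest: " + tracker.shouldUpdateForest());
    }
}
```

In this sketch a run of imputations longer than the shingle size saturates the fraction at 1, and each real observation afterwards pulls it back down, so the update decision tracks how much of the current shingle is imputed.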

Testing
* Added test scenarios with various shingle sizes to verify the changes.

Signed-off-by: Kaituo Li <kaituo@amazon.com>

* added comment

Signed-off-by: Kaituo Li <kaituo@amazon.com>

---------

Signed-off-by: Kaituo Li <kaituo@amazon.com>