
misc(scripts): add lantern evaluation scripts #5257

Merged — 6 commits merged into master on May 23, 2018
Conversation

@patrickhulce (Collaborator) commented May 18, 2018

Phase 2 of running lantern accuracy checks per commit.

Related: #5237

This adds a set of scripts that downloads a set of 100 minified traces, runs lantern over them, and prints out some summary statistics.

The trace set is mostly representative, skewing slightly negative so there's more identifiable room for improvement. The MAPE/Spearman's rho stats on this set are also slightly lower than you might have seen in previous PRs, both because of that skew and because this compares 1 trace to the median of WPT runs instead of the median of 9 traces to WPT.
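For reference, the MAPE column in the output below is mean absolute percentage error over the (observed, estimated) metric pairs. A minimal sketch (illustrative helper and sample values, not the PR's actual script code):

```javascript
// Mean absolute percentage error: average of |estimate - actual| / actual.
// Illustrative sketch of the statistic, not the PR's implementation.
function mape(pairs) {
  const errors = pairs.map(({actual, estimate}) => Math.abs(estimate - actual) / actual);
  return errors.reduce((sum, err) => sum + err, 0) / errors.length;
}

// Example: one perfect estimate and one 50% underestimate -> 25% MAPE.
const sample = [
  {actual: 4000, estimate: 4000},
  {actual: 8000, estimate: 4000},
];
console.log(mape(sample)); // 0.25
```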

The Good/OK/Bad thresholds are very open to discussion; right now:

  • Good
    <20% absolute error OR <10% percentile difference (predicting 60s when the real value is 80s is good IMO).
    Sites here are roughly indistinguishable from WPT results on single runs, since the error is within the normal variance level of WPT.
  • OK
    <50% absolute error. Sites here are roughly as inaccurate as DevTools throttling on its edge cases.
  • Bad
    >50% absolute error. Sites here are usually inaccurate and should be dug into; they mostly fall into a few categories:
    1. There's a large difference between optimistic/pessimistic and the estimate ends up somewhere in the middle
    2. The page is much slower on a real phone than when throttled (in some cases never reaching TTI)
    3. The underlying runs were very different (should probably exclude these from the dataset eventually)
    4. Other unclear reasons (prime investigation targets!)
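The threshold scheme above could be sketched roughly like this (hypothetical helper; for brevity this only implements the absolute-error branch, omitting the <10% percentile-difference condition for Good):

```javascript
// Classify a single estimate by absolute percentage error, per the proposed
// thresholds: <20% good, <50% ok, otherwise bad. Sketch only; the real
// evaluation also counts a <10% percentile difference as good.
function classify(actual, estimate) {
  const pctError = Math.abs(estimate - actual) / actual;
  if (pctError < 0.2) return 'good';
  if (pctError < 0.5) return 'ok';
  return 'bad';
}

console.log(classify(10000, 8500));  // 15% error -> 'good'
console.log(classify(10000, 14000)); // 40% error -> 'ok'
console.log(classify(10000, 16000)); // 60% error -> 'bad'
```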
See sample output:
----    Metric Stats    ----
metric                         estimate                  rank error   MAPE       Good/OK/Bad
firstContentfulPaint           optimisticFCP             15.6%        32.4%      62/21/16
firstContentfulPaint           pessimisticFCP            15.6%        29.9%      57/29/13
firstContentfulPaint           roughEstimateOfFCP        15.4%        26.9%      59/26/14
firstMeaningfulPaint           optimisticFMP             16.9%        36.1%      61/19/19
firstMeaningfulPaint           pessimisticFMP            16.1%        32.1%      57/29/13
firstMeaningfulPaint           roughEstimateOfFMP        16.4%        28.1%      61/20/18
timeToFirstInteractive         optimisticTTFCPUI         11.7%        33.7%      62/21/16
timeToFirstInteractive         pessimisticTTFCPUI        14.2%        36.3%      60/15/24
timeToFirstInteractive         roughEstimateOfTTFCPUI    11.7%        33.3%      67/19/13
timeToConsistentlyInteractive  optimisticTTI             10.8%        36.1%      71/5/22
timeToConsistentlyInteractive  pessimisticTTI            11.4%        38.9%      61/18/19
timeToConsistentlyInteractive  roughEstimateOfTTI        11%          34.4%      62/21/15
speedIndex                     optimisticSI              19.8%        60%        45/22/32
speedIndex                     pessimisticSI             27.1%        54.8%      37/19/43
speedIndex                     roughEstimateOfSI         20.7%        39.6%      45/28/26

----    Summary Stats    ----
Good: 60%
OK: 23%
Bad: 17%

----    Worst10 Sites    ----
underestimated firstMeaningfulPaint by 23199 on http://www.thefreedictionary.com/
underestimated firstContentfulPaint by 23444 on http://www.thefreedictionary.com/
underestimated speedIndex by 22445 on http://www.thefreedictionary.com/
underestimated firstMeaningfulPaint by 14593 on http://www.rakuten.ne.jp/
overestimated speedIndex by 7867 on http://www.foxnews.com/
underestimated speedIndex by 23633 on http://www.huffingtonpost.com/
overestimated firstContentfulPaint by 3129 on http://www.7k7k.com/
underestimated speedIndex by 24705 on http://www.cnet.com/
underestimated speedIndex by 7606 on http://www.metacafe.com/
overestimated timeToFirstInteractive by 10463 on http://www.hatena.ne.jp/

@brendankenny (Member) left a comment:

Evaluating firstContentfulPaint vs. optimisticFCP: 15.6% 32.4% - 62/21/16
Evaluating firstContentfulPaint vs. pessimisticFCP: 15.6% 29.9% - 57/29/13

what do these numbers refer to? Maybe add a header or something to the print out?

@patrickhulce (Collaborator, Author) replied:

what do these numbers refer to? Maybe add a header or something to the print out?

headers added 👍

@@ -0,0 +1,10 @@
#!/bin/bash

# THIS SCRIPT ASSUMES CWD IS ROOT PROJECT
Member commented:

if you want,

pwd="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
lhroot_path="$pwd/../.."

@patrickhulce (Collaborator, Author) replied:

Anything else here, folks?

LH_ROOT_PATH="$DIRNAME/../../.."
cd $LH_ROOT_PATH

TAR_URL="https://drive.google.com/a/chromium.org/uc?id=1_w2g6fQVLgHI62FApsyUDejZyHNXMLm0&export=download"
Member commented:

Please add a comment on what this is and whether it's ever updated, e.g. "frozen snapshots from XXX date".

@brendankenny (Member) left a comment:

I'm good with this too.

I mentioned this (jokingly) in another PR, but the residual distribution is really one of the key signals we should be looking at for evaluating the regression and, in turn, the generation of the pessimistic/optimistic signals and whether we're failing to incorporate influential variables that we should be.

We would need a random ("random") sample for that, though, and the scripts themselves are good regardless of the data run through them, so I'm 👍 👍
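The residual-distribution check described above could be sketched like this (hypothetical helper and sample values, not part of the PR's scripts):

```javascript
// Residual = estimate - actual. A well-behaved regression has residuals
// centered near zero with no heavy tail; a skewed summary suggests a
// missing influential variable. Sketch over (actual, estimate) pairs.
function residualSummary(pairs) {
  const residuals = pairs
    .map(({actual, estimate}) => estimate - actual)
    .sort((a, b) => a - b);
  const mean = residuals.reduce((sum, r) => sum + r, 0) / residuals.length;
  const median = residuals[Math.floor(residuals.length / 2)];
  return {mean, median, min: residuals[0], max: residuals[residuals.length - 1]};
}

const summary = residualSummary([
  {actual: 4000, estimate: 3500},
  {actual: 6000, estimate: 6500},
  {actual: 8000, estimate: 9000},
]);
console.log(summary); // median 500, min -500, max 1000
```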

const totalBad = [];

/**
* @param {string} metric
Member commented:

keyof whatever might let you skip some of the ts-ignores below

@patrickhulce (Collaborator, Author) replied:

didn't remove ts-ignores but helps elsewhere 👍

* @property {string} url
* @property {string} tracePath
* @property {string} devtoolsLogPath
* @property {*} lantern
Member commented:

What's the def for this? Is it just that it's long?

@patrickhulce (Collaborator, Author) replied:

fixed

@patrickhulce patrickhulce merged commit e523e2d into master May 23, 2018
@patrickhulce patrickhulce deleted the run_lantern branch May 23, 2018 23:26