
Fix empty urlClassification #122

Closed
franciscawijaya opened this issue Jul 21, 2024 · 20 comments · Fixed by #127
Assignees: @franciscawijaya
Labels: bug, core functionality, infrastructure

Comments

@franciscawijaya
Member

franciscawijaya commented Jul 21, 2024

As mentioned in the previous meeting, when I did the first batch of the crawl, all the data were collected except for Firefox's urlClassification. When I opened the sites manually, without a crawl, I was able to see the sites flagged by Firefox Tracking Protection.

Some fundamental things that I checked:

  • Opened the protection dashboard while the crawl was ongoing to check whether first-party and third-party sites were being blocked but simply not recorded, or whether Firefox's Enhanced Tracking Protection was not flagging and blocking the relevant sites during the crawl at all. The observation showed that it is the latter.
  • Used the code version from the June crawl and ran the crawl for one isolated site; it still didn't collect the urlClassification, and the protections dashboard also reported 'No trackers known to Nightly were detected on this page'.

Since then, I have tried out different things and ruled out some of the possible causes:

  • VPN: I tried the crawl with the VPN both on and off and from different IP locations, but the issue still persists.
  • Firefox version: I made sure the version I used manually and the version used for the crawl are the same by opening Firefox's settings while the crawl was ongoing, before the crawl exited Firefox (there was a recent Firefox version update on July 9).
  • Dependabot update on the OptMeowt extension: I repackaged the XPI extension to include the Dependabot update as well as @Mattm27's edit to the extension's workflow, and checked it manually by dropping it into Firefox as an extension and opening a site. I was able to get the urlClassification when doing it manually. However, when I ran the crawl for just one isolated site with the newly updated XPI file, the urlClassification data was not collected.
@franciscawijaya added the bug, discussion, and invalid labels on Jul 21, 2024
@franciscawijaya self-assigned this on Jul 21, 2024
@franciscawijaya
Member Author

Main problem: no urlClassification data is collected during the crawl because Firefox's Enhanced Tracking Protection did not detect any tracking sites during the crawl. (When visiting a site manually, opening the protection dashboard showed that Enhanced Tracking Protection detected and blocked first-party and third-party sites that fall under the relevant categories.)

@franciscawijaya
Member Author

Ruled out the recent Firefox Nightly update: I downloaded the older Nightly version (the one I used for the June crawl) and ran it with both the current version of the code and the June-crawl version of the code; both still gave an empty output for the urlClassification.

@franciscawijaya
Member Author

Ruled out a problem with the local device: I ran the crawl on my own computer and got the same empty output for urlClassification.

@franciscawijaya
Member Author

franciscawijaya commented Jul 25, 2024

I read more about Firefox's Enhanced Tracking Protection, and I think the problem comes from the act of crawling itself or from Firefox's response to the crawl.

These are some of the observations that I made from the previous tries:

  • Every time I entered a site manually (without using a crawl), the protection banner indicated that Firefox blocked trackers and harmful scripts on the site, and I could see the sites that were blocked by clicking the protection banner (i.e., the urlClassification data that the crawl is looking for).
  • When I ran the crawl on the same set of sites for testing, whenever I checked the protection banner while the crawl was ongoing, it showed that Enhanced Tracking Protection was turned on for the site but that Firefox did not block any trackers or scripts (I noted that Enhanced Tracking Protection was not inactive, which suggests that the problem comes from how Firefox responds to our crawl).
[Screenshots attached]

I also ruled out the Enhanced Tracking Protection settings as the cause of the problem: while the crawl was ongoing, I made sure all of the Enhanced Tracking Protection settings were set to Standard and were the same as when I visited the sites manually.

For now, it seems to me that the behavior of Enhanced Tracking Protection (ETP) during the crawl is the problem; even though ETP is on and has the same settings in both conditions (manual and crawl access to the site), it does not report any blocked sites for the crawl.

@franciscawijaya
Member Author

franciscawijaya commented Aug 6, 2024

As discussed, I have checked several things over the past week:

  • Dependencies: I went through the versions of all the dependencies for the July codebase and June codebase and they are identical (I also compared them to April's)
  • Selenium version: I realized that our crawler has been using the same Selenium version (4.7.1 - released in December 2022 according to this archive).

For the July, June, and even April crawls, the codebase is the same with regard to the selenium-webdriver version used (see attached). I further confirmed this by checking the "resolved" field in the package-lock.json in our codebase; it does indeed point to the downloaded 4.7.1 version, so npm install will install this exact version. The only ways for us to update the Selenium version when running the crawl are to manually edit the package.json file or to run npm update selenium-webdriver. I can be sure that I used version 4.7.1 for the June crawl, and presumably that is also the case for the previous crawls.

[Screenshot attached]
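For anyone re-checking this later, a quick way to confirm which Selenium version a given crawl actually ran with is to log it at startup. This is only a sketch, assuming it is run from the repo root next to node_modules (not actual crawler code):

const fs = require('fs');
const path = require('path');

// Log the selenium-webdriver version that npm actually installed for this crawl.
const pkgPath = path.resolve('node_modules', 'selenium-webdriver', 'package.json');
const { version } = JSON.parse(fs.readFileSync(pkgPath, 'utf8'));
console.log(`selenium-webdriver in use: ${version}`); // expected to print 4.7.1 here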

So, my last experiment was to manually edit the package-lock.json file to update the Selenium version to the latest one (4.23.0) in issue-122. However, it still gave me no urlClassification. Nevertheless, I don't think we can rule out Selenium versions just yet, given that the jump from v4.7.1 to v4.23.0 is pretty big.

@SebastianZimmeck
Member

SebastianZimmeck commented Aug 6, 2024

I went through the versions of all the dependencies for the July codebase and June codebase and they are identical (I also compared them to April's)

Thanks, @franciscawijaya! You compared the April/June dependency versions against the July dependency versions that you have installed locally on your computer, right? (as opposed to the July dependency versions that are committed here in the remote repo)

@franciscawijaya
Member Author

You compared the April/June dependency versions against the July dependency versions that you have installed locally on your computer, right?

Yes, correct.

@SebastianZimmeck changed the title from "July Crawl - empty urlClassification" to "Fix empty urlClassification" on Aug 14, 2024
@Mattm27
Member

Mattm27 commented Aug 15, 2024

Like @franciscawijaya, I received empty urlClassification entries when performing the crawl with the same dependencies from the April crawl. In addition, after using the identical codebase from the April crawl I still received an empty urlClassification. However, I performed this analysis before reading through @SebastianZimmeck's comment above, so I will double-check that the dependencies are identical in those three places and see if the results change at all.

I will continue to dive deeper into this issue in the coming days to see if I come across any new information/findings that may be relevant!

@SebastianZimmeck
Member

However, I performed this analysis before reading through @SebastianZimmeck's comment above ...

Sounds good! (Also, as additional clarification, what matters are the local dependencies on your computer and not on the remote repo.)

@SebastianZimmeck added the core functionality and infrastructure labels and removed the discussion and invalid labels on Aug 15, 2024
@Mattm27
Member

Mattm27 commented Aug 20, 2024

I have gone through and run the crawl with identical dependencies from the June crawl at the top level, REST API, and crawler, and I still received empty URL classification results. The API call is working properly within the extension, indicating that it is most likely not an issue with the API.

After some more thought, it occurred to me as I was reviewing analysis.js that it could possibly be a timing issue with the urlClassification data being accessed before it is fully populated. I will look more into this, but I would imagine this issue would have been accounted for in previous crawls if it was causing problems with the crawl data.

As @SebastianZimmeck mentioned in last week's meeting, it will be beneficial to create a high-level crawler that only pulls the urlClassification object from a site, which @franciscawijaya and I will begin to work on!

@franciscawijaya
Member Author

franciscawijaya commented Aug 20, 2024

Similar to what @Mattm27 encountered, the issue persisted on my end even after making sure that all the dependencies in the three package.json files are identical.

This past week, I also wrote my own Python script using Selenium to exclusively extract and collect the urlClassification from Nightly's Enhanced Tracking Protection (ETP), and I finally finished trying it out. However, the result is still empty even with this isolation.

[Screenshot attached]

To double-check that it's not a matter of the data being there but not being collected properly, I also checked the crawl manually while it was still ongoing, and a similar circumstance to our GPC crawler's issue also existed: when the protection banner is clicked during the ongoing crawl, no trackers known to Nightly are detected. This seems to suggest that the problem comes from using the Selenium WebDriver to collect ETP data.
[Screenshot attached]

After some more thought, it occurred to me as I was reviewing analysis.js that it could possibly be a timing issue with the urlClassification data being accessed before it is fully populated. I will look more into this, but I would imagine this issue would have been accounted for in previous crawls if it was causing problems with the crawl data.

This is a good point for exploration! Following this suggestion, I reran the script after extending the timeout for Selenium's implicit wait to 30 seconds and then 60 seconds. Unfortunately, with the Python script below, the longer timeouts did not seem to make any difference.

Here is the script that I wrote and used. I will discuss with @Mattm27 whether there is anything I missed, as this script is far from perfect at the moment.
crawler.py.zip
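(For reference, the equivalent timeout bump in the Node selenium-webdriver bindings that the main crawler uses would look roughly like the sketch below; the driver setup shown is illustrative, not the actual crawler code.)

const { Builder } = require('selenium-webdriver');

(async () => {
  const driver = await new Builder().forBrowser('firefox').build();
  // Raise the implicit wait so element lookups keep retrying for up to 60 s.
  await driver.manage().setTimeouts({ implicit: 60000 });
  // ... visit sites and read urlClassification as usual ...
  await driver.quit();
})();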

@SebastianZimmeck
Member

@wesley-tan, given your expertise, do you know why our calls to the urlClassification API return nothing?

This behavior seems to have to do with Selenium. The call only fails when we use it as part of our crawl infrastructure with Selenium. @franciscawijaya created a minimal working example (which also requires Selenium to run).

Do you have any thoughts?

@franciscawijaya
Member Author

franciscawijaya commented Aug 22, 2024

As mentioned during the call, the API that I used in the minimal working example is unfortunately an internal API, which means the script cannot access the ETP data and is therefore invalid.

Right now, I'm working on a new script that uses the web APIs and extension APIs (e.g., methods like chrome.webRequest.onBeforeRequest.addListener that are used in our actual crawler) so that I can access the ETP data. As of now, the new minimal working example script is still not working (i.e., it's not printing the data yet), but when I ran the minimal Selenium crawl and checked it manually while it was visiting a site, the protection banner showed that there were sites blocked by Firefox according to ETP -- which is a good sign, given that our first challenge with the GPC crawler was that the protection banner showed 'None detected' for sites blocked by ETP.
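A minimal sketch of that listener-based approach, as a WebExtension background script (the event, filter, and logging shown here are illustrative assumptions, not the actual OptMeowt code; the extension needs the webRequest permission and host permissions for the listener to fire):

// Log Firefox's ETP classification for each response the browser sees.
browser.webRequest.onHeadersReceived.addListener(
  (details) => {
    const { firstParty = [], thirdParty = [] } = details.urlClassification || {};
    if (firstParty.length || thirdParty.length) {
      console.log(details.url, details.urlClassification);
    }
  },
  { urls: ["<all_urls>"] }
);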

I will continue working on this and start discussing with @Mattm27 possible ways to extract the now-present sites blocked by ETP with the new minimal working example script.

@SebastianZimmeck
Member

@wesley-tan is saying:

A few things that may be of interest:

When I looked more into Safe Browsing, I did a few small crawls and found that the main things toggling whether urlClassification data is collected are the following preferences:

fo.set_preference("browser.safebrowsing.provider.mozilla.gethashURL", "")
fo.set_preference("browser.safebrowsing.provider.mozilla.updateURL", "")
fo.set_preference("browser.safebrowsing.provider.mozilla.lists", "")

These lines of code are disabling various aspects of Firefox's Safe Browsing feature, which includes URL classification.

These URLs are used to check if a specific URL is in the Safe Browsing blocklist.

  • By setting gethashURL preferences to empty strings, the browser won't be able to check URLs against the Safe Browsing database.
  • updateURL is used to update the local Safe Browsing database with new threat information. Setting them to empty strings prevents the browser from updating its Safe Browsing lists.
  • lists preferences specify which Safe Browsing lists the browser should use. Setting them to empty strings effectively disables all Safe Browsing lists.

Read more here:
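(For context, the snippet above is the Python Selenium form; in the Node selenium-webdriver bindings the crawler uses, the same preferences would be set roughly as sketched below. Blanking them out disables the Safe Browsing/tracking-list lookups, so a crawler that wants urlClassification data should not set them to empty strings. The options object here is illustrative, not the crawler's actual setup.)

const firefox = require('selenium-webdriver/firefox');

const options = new firefox.Options();
// Equivalent of fo.set_preference(...) in the Python snippet above.
options.setPreference('browser.safebrowsing.provider.mozilla.gethashURL', '');
options.setPreference('browser.safebrowsing.provider.mozilla.updateURL', '');
options.setPreference('browser.safebrowsing.provider.mozilla.lists', '');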

@Mattm27
Member

Mattm27 commented Sep 2, 2024

I began to play around with the preferences highlighted above by @wesley-tan in local-crawler.js; however, the crawl still returned empty urlClassification data. I will continue to look into the other resources sent over in the background.

@eakubilo
Member

eakubilo commented Sep 5, 2024

I did a lot of investigating, but I only made progress once I tried to figure out how Firefox blocks a tracker for a website to begin with. That train of thought leads you to the about:url-classifier page. On that page, you can search for a domain and see if it is included in a tracking blacklist somewhere. I searched https://ad-delivery.net, and in my regular Nightly installation I see two rows:
[Screenshot attached]
The query returns no results when trying to do the same thing in the crawler's Nightly browser. Even clicking "Trigger Update" on the two providers doesn't make the query return a result.

If you inspect element, you can find the actual code that's run when you search:

let input = "https://ad-delivery.net"

    let uri;
    try {
      uri = Services.io.newURI(input);
      if (!uri) {
        Search.reportError("url-classifier-search-error-invalid-url");
      }
    } catch (ex) {
      Search.reportError("url-classifier-search-error-invalid-url");
    }
console.log(uri);
    
let classifier = Cc["@mozilla.org/url-classifier/dbservice;1"].getService(
      Ci.nsIURIClassifier
    );    
let featureNames = classifier.getFeatureNames();
    let features = [];
    featureNames.forEach(featureName => {
      if (document.getElementById("feature_" + featureName).checked) {
        let feature = classifier.getFeatureByName(featureName);
        if (feature) {
          features.push(feature);
        }
      }
    });

    if (!features.length) {
      Search.reportError("url-classifier-search-error-no-features");
    }

    let listType =
      document.getElementById("search-listtype").value == 0
        ? Ci.nsIUrlClassifierFeature.blocklist
        : Ci.nsIUrlClassifierFeature.entitylist;

    classifier.asyncClassifyLocalWithFeatures( uri, features, listType, list =>
      Search.showResults(list)
    );

I thought the classifier object was interesting, so I just inspected the object in the browser console:
[Screenshot attached]
I found the most interesting function available was 'getTables'. If you check the firefox source code, you'll see some uses of getTables: https://searchfox.org/mozilla-central/search?q=.getTables%28&path=&case=false&regexp=false

It's used to get the table states from the database. If you run the following line in the browser console:

var tables = () =>
  Cc["@mozilla.org/url-classifier/dbservice;1"]
    .getService(Ci.nsIURIClassifier)
    .getTables(table => console.log(table));
tables(); // invoke it so the table states are actually printed

Running this in the normal Nightly browser prints a very full list, while the crawler browser prints nothing. This gave me the idea that the tracking list information simply wasn't in the database. I am not sure about the exact mechanics, but I read enough docs to realize that maybe remote settings had a hand in propagating the tracking protection lists. Seeing that the first errors in the crawler's browser console come from RemoteSettingsClient gives more credence to the idea that a poor interaction with remote settings was causing this. Below, I detail what the broken config is and how to fix it.

I realized which setting to fiddle with by clicking resource://services-settings/Utils.sys.mjs:441 in the first error in the crawler's browser console.
[Screenshot attached]

The relevant code reads:

  async fetchLatestChanges(serverUrl, options = {}) {
    const { expectedTimestamp, lastEtag = "", filters = {} } = options;

    let url = serverUrl + Utils.CHANGES_PATH;
    const params = {
      ...filters,
      _expected: expectedTimestamp ?? 0,
    };
    if (lastEtag != "") {
      params._since = lastEtag;
    }
    if (params) {
      url +=
        "?" +
        Object.entries(params)
          .map(([k, v]) => `${k}=${encodeURIComponent(v)}`)
          .join("&");
    }
    const response = await Utils.fetch(url);

    if (response.status >= 500) {
      throw new Error(`Server error ${response.status} ${response.statusText}`);
    }

    const is404FromCustomServer =
      response.status == 404 &&
      Services.prefs.prefHasUserValue("services.settings.server");

    const ct = response.headers.get("Content-Type");
    if (!is404FromCustomServer && (!ct || !ct.includes("application/json"))) {
      throw new Error(`Unexpected content-type "${ct}"`);
    }

So, the RemoteSettingsClient is fetching data from the server provided, but we get caught by a mismatched Content-Type header. If we look at the server we query our settings from (which can be found in about:config under services.settings.server), we see its default value has been set to data:,#remote-settings-dummy/v1. This is a dummy value, and I suspected this is why things weren't working. Remember, we are still in the crawler browser. If you change the server to https://firefox.settings.services.mozilla.com/v1, then go back to about:url-classifier, click download for the provider mozilla, and try to search https://ad-delivery.net, you will see the two rows we expected. So, I set the preference at startup in the code, and the crawler started working again.
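For readers looking for the concrete change, here is a minimal sketch of setting that preference at driver startup, assuming the crawler builds its Firefox driver with the Node selenium-webdriver bindings (the exact wiring in local-crawler.js may differ):

const { Builder } = require('selenium-webdriver');
const firefox = require('selenium-webdriver/firefox');

const options = new firefox.Options();
// Point remote settings back at the real server instead of the webdriver dummy value,
// so Nightly can download the tracking-protection lists that feed urlClassification.
options.setPreference('services.settings.server',
  'https://firefox.settings.services.mozilla.com/v1');

async function buildDriver() {
  return new Builder().forBrowser('firefox').setFirefoxOptions(options).build();
}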

@SebastianZimmeck
Member

You applied nice intuition and excellent reasoning! Well done, @eakubilo! You earned that A+.

I created a reference to your writeup in the readme.

@eakubilo
Member

eakubilo commented Sep 6, 2024

Thank you! The privacy project is amazing; I am happy I got a chance to contribute to it. For the sake of completeness:

  1. I believe the "default settings" are set here: https://searchfox.org/mozilla-central/source/remote/shared/RecommendedPreferences.sys.mjs
    You'll notice the following lines:
// Ensure remote settings do not hit the network
  ["services.settings.server", "data:,#remote-settings-dummy/v1"]

This was the only trace on the internet I found of the default value of services.setting.server preference, hopefully someone else will be able to provide a much more sound answer.
2. I thought Firefox used Shavar to update the tracking protections list: https://wiki.mozilla.org/Services/Shavar
3. The flow of how FireFox blocks a tracker can be found here: https://wiki.mozilla.org/Security/Tracking_protection
specifically:
Screen Shot 2024-09-06 at 11 48 04 AM
3.1 Because of 1. here, I thought I would have to finagle some local instance of Shavar for the Firefox client to remotely check. That line of thought lead me nowhere, as the Services/Shavar page doesn't actually tell you how to test new tracking protections list data sets in Firefox.
4. We can find more documentation for Shavar here: https://firefox-source-docs.mozilla.org/toolkit/components/antitracking/anti-tracking/tracking-lists/index.html
4.1. These docs holds the best information about how Shavar works specifically. Simply put, every 6 hours your client will request the most recent tracking list from Shavar. I presume it also asks for the most recent tracking list on startup.
4.2. Scrolling down further on the page, we see some breakdown of Remote Settings. Specifically, we see that Shavar is on the final vestiges of existence.
Screen Shot 2024-09-06 at 12 03 06 PM
4.3 Seeing that remote settings is now the primary manager of "evergreen data" in Firefox and the default behavior of the web drivers seems to be to have a dummy server target, it seems like what happened in between the June and July crawls was that Mozilla changed something so that Firefox now exclusively gets tracking blacklists from remote settings. This is, I think, the best diagnosis of the issue I can come up with.

I hope this helped!

@SebastianZimmeck
Member

SebastianZimmeck commented Sep 9, 2024

Thank you so much, @eakubilo! This is super!
