Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only include rules with diagnostics in SARIF metadata #13268

Merged

Conversation

RussellLuo
Copy link
Contributor

This PR can be considered as a follow-up of #13180.

Motivation

The current Ruff's SARIF output contains all the 800+ rules, the vast majority of which have no relevance to the errors in results. IMO, this results in a large output and a lot of noise.

Proposal

Per the SARIF example, other linters such as ESLint only emits rules relevant to the errors in results. I think this could be an approach worth considering.

{
  "version": "2.1.0",
  "$schema": "http://json.schemastore.org/sarif-2.1.0-rtm.4",
  "runs": [
    {
      "tool": {
        "driver": {
          "name": "ESLint",
          "informationUri": "https://eslint.org",
          "rules": [
            {
              "id": "no-unused-vars",
              "shortDescription": {
                "text": "disallow unused variables"
              },
              "helpUri": "https://eslint.org/docs/rules/no-unused-vars",
              "properties": {
                "category": "Variables"
              }
            }
          ]
        }
      },
      "artifacts": [
        {
          "location": {
            "uri": "file:///C:/dev/sarif/sarif-tutorials/samples/Introduction/simple-example.js"
          }
        }
      ],
      "results": [
        {
          "level": "error",
          "message": {
            "text": "'x' is assigned a value but never used."
          },
          "locations": [
            {
              "physicalLocation": {
                "artifactLocation": {
                  "uri": "file:///C:/dev/sarif/sarif-tutorials/samples/Introduction/simple-example.js",
                  "index": 0
                },
                "region": {
                  "startLine": 1,
                  "startColumn": 5
                }
              }
            }
          ],
          "ruleId": "no-unused-vars",
          "ruleIndex": 0
        }
      ]
    }
  ]
}

Copy link
Contributor

github-actions bot commented Sep 6, 2024

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Copy link
Member

@MichaReiser MichaReiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the follow up.

I'm not sure if reducing the rules to the reported tools is the right behavior. According to the spec

A toolComponent object MAY contain a property named rules whose value is an array of zero or more unique (§3.7.3) reportingDescriptor objects (§3.49) each of which provides information about an analysis rule supported by the tool component.

The way I understand this is that the rules table should list all rules supported by the tool and not just rules seen in this analysis run. Do you have a resource that states if it is okay for the rules table to only include the reported rules?

@MichaReiser MichaReiser added the cli Related to the command-line interface label Sep 8, 2024
@RussellLuo
Copy link
Contributor Author

@MichaReiser Thanks for the reply!

I have found two resources that might be relevant:

  1. Rule metadata

    Rule metadata is optional. An analysis tool can choose not to include it at all, to include metadata for only those rules that are relevant to the results, or to include metadata for all rules known to the tool.13

  2. Appendix E. (Informative) Locating rule and notification metadata

    The SARIF format allows rule and notification metadata to be included in a SARIF log file (see §3.19.23 and §3.19.24). A SARIF log file does not need to include any metadata. This raises the questions of when metadata should be included in a log file, and how to locate the metadata if it is not included in the log file.

    Metadata should be included in a log file in the following circumstances:

    · The log file is intended to be viewed in a tool such as a log file viewer that needs to display metadata related to each result or notification even when the tool is not connected to a network.

    · The log file is intended to be uploaded to a result management system which requires information about every rule specified by every result, and which might not have prior knowledge of the rules specified by the results in this log file.

    · Neither of the above applies, but the increased log file size due to the metadata is not considered significant.

To summarize, including only the rules relevant to the results or including all rules is okay, but IMO the latter would result in a large output.

Copy link
Member

@MichaReiser MichaReiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. These two sections are extremely helpful.

From the Appendix E:

Metadata should be included in a log file in the following circumstances:

Our challenge is that we can't make assumptions about the use case and I don't fully understand the following scenario:

The log file is intended to be uploaded to a result management system which requires information about every rule specified by every result, and which might not have prior knowledge of the rules specified by the results in this log file.

What do you make of this scenario?

Either way. I'm slightly leaning towards moving forward with this until someone comes up with a use case where they need all metadata.

@@ -27,6 +28,13 @@ impl Emitter for SarifEmitter {
.map(SarifResult::from_message)
.collect::<Result<Vec<_>>>()?;

let mut rule_ids = HashSet::new();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can simplify this to

Suggested change
let mut rule_ids = HashSet::new();
let unique_rules: HashSet<_> = results.iter().filter_map(|result| result.rule).collect();
let mut rules: Vec<SarifRule> = unique_rules.into_iter().map(SarifRule::from).collect();
rules.sort_by(|a, b| a.code.cmp(&b.code));

and then use rules on line 46. This avoids allocating unnecessary Strings

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this is fantastic!

(This PR is my second time writing Rust code. Learned a lot!)

@RussellLuo
Copy link
Contributor Author

What do you make of this scenario?

To answer this question, let me analyze the second section (from spec) step by step.

The log file is intended to be viewed in a tool such as a log file viewer that needs to display metadata related to each result or notification even when the tool is not connected to a network.

Regarding "display metadata related to each result", my understanding is that rules are primarily prepared for the results.

The log file is intended to be uploaded to a result management system which requires information about every rule specified by every result, and which might not have prior knowledge of the rules specified by the results in this log file.

From "requires information about every rule specified by every result", again, I think rules are mainly used as additional details about the corresponding results.

Neither of the above applies, but the increased log file size due to the metadata is not considered significant.

I'm currently using Ruff through MegaLinter, which provides a webhook mechanism to send the SARIF output. Therefore, I'm concerned about the size of the HTTP request body (of the webhook).

@MichaReiser
Copy link
Member

I'm currently using Ruff through MegaLinter, which provides a webhook mechanism to send the SARIF output. Therefore, I'm concerned about the size of the HTTP request body (of the webhook).

I wouldn't be too concerned about that. I would suspect that the JSON file, even with all rule metadata, is still less than 1 MB, and HTTP compression should bring that down to 100 KB.

@RussellLuo
Copy link
Contributor Author

Given this code snippet:

# test.py
import os

def greet():
    print('hello {name}'.format())

I just performed a quick test:

# Using the official Ruff
$ ruff check --output-format=sarif test.py > official_result.json
$ du -sh official_result.json 
2.0M	official_result.json
# Using my local Ruff
$ ./target/debug/ruff check --output-format=sarif test.py > local_result.json
$ du -sh local_result.json   
8.0K	local_result.json

Overall, I agree with you that the size of the request body is not a big deal (although smaller is perhaps better). To my understanding of the spec, as described above, I think the rules are primarily used in conjunction with results. I'm not sure if there are scenarios where all rules are needed.

@MichaReiser MichaReiser changed the title Remove unnecessary SARIF rules Only include rules with diagnostics in SARIF metadata Sep 9, 2024
@MichaReiser MichaReiser merged commit 5ef6979 into astral-sh:main Sep 9, 2024
20 checks passed
@MichaReiser
Copy link
Member

Thanks for making this contribution. This does sound reasonable to me and we can always revert back to the previous behavior if there's an actual use case for it.

@RussellLuo RussellLuo deleted the remove-unnecessary-sarif-rules branch September 10, 2024 00:36
dhruvmanila added a commit that referenced this pull request Sep 13, 2024
## Summary

Follow-up from #13268, this PR updates the test case to use
`assert_snapshot` now that the output is limited to only include the
rules with diagnostics.

## Test Plan

`cargo insta test`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cli Related to the command-line interface
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants