Validation of parsed data with assertions #435

Open
GreyCat opened this issue May 15, 2018 · 26 comments

@GreyCat
Member

GreyCat commented May 15, 2018

Extracting the easier part of #81, with a concrete proposal for implementation.

Context: we want a way to state that certain attributes have "valid" values. Stating that has two implications:

  1. All values which are "not valid" must trigger a parsing error
  2. In some cases, we can calculate values automatically when generating a stream

The simplest existing implementation is contents: it is essentially a byte array, which validates that its contents are exactly as specified in the ksy file. On parsing, we throw an error if the byte array contents don't match the expected value. On generation, we can just write the expected value, and not bother the end user with setting the attribute manually.

Syntax proposal

Add a valid key to the attribute spec. Inside it, there could be:

  • A string, integer or boolean => it will be treated as expression that's supposed to be the only valid value of this attribute. For example:
- id: foo
  type: u4le
  valid: 0x42 # expected in stream: 42 00 00 00
- id: bar
  type: strz
  valid: '"abcd"' # expected in stream: 61 62 63 64 00
- id: baz
  type: u2le
  valid: '2 * foo' # expected in stream: 84 00
  • A map, which can feature keys:
    • eq: expected states that attribute value must be equal to given value (i.e. the same as above)
    • min: expected states minimum valid value for this attribute (compared with a >= expected)
    • max: expected states maximum valid value for this attribute (compared with a <= expected)
    • any-of: list-of-expected states that valid value must be equal to one in the given list of expressions
    • expr: expression states that the given expression (with _ substituted by the actual attribute value) must evaluate to true for the attribute value to be treated as valid.

All expected and expression values are KS expression strings, which can be constants or can depend on other attributes. list-of-expected must be a YAML array of KS expression strings. expected inferred type must be comparable with type of the attribute. expression inferred type must be boolean.
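
To make the proposed semantics concrete, here is a minimal sketch in JavaScript of how the map form could be checked at parse time. The names checkValid and normalizeValid, the camelCase key anyOf, and the returned messages are all hypothetical and exist only for this illustration. The sketch assumes that every expected value has already been evaluated to a constant, and that expr is a plain predicate function standing in for a KS expression with _ bound to the attribute value. It is not what ksc generates.

```javascript
// Hypothetical helper (not part of any Kaitai Struct runtime): checks a parsed
// value against an already-evaluated `valid` map. Returns null if the value is
// valid, otherwise a short description of the first violated constraint.
function checkValid(value, spec) {
  if ("eq" in spec && value !== spec.eq) {
    return "not equal to " + spec.eq;
  }
  if ("min" in spec && !(value >= spec.min)) {
    return "less than min " + spec.min;
  }
  if ("max" in spec && !(value <= spec.max)) {
    return "greater than max " + spec.max;
  }
  if ("anyOf" in spec && !spec.anyOf.includes(value)) {
    return "not any of [" + spec.anyOf.join(", ") + "]";
  }
  // `expr` stands in for a KS expression; its argument plays the role of `_`.
  if ("expr" in spec && spec.expr(value) !== true) {
    return "expression is not true";
  }
  return null;
}

// The short form (`valid: 0x42`) is treated the same as `valid: { eq: 0x42 }`.
function normalizeValid(valid) {
  return (typeof valid === "object" && valid !== null) ? valid : { eq: valid };
}
```

For example, checkValid(0x42, normalizeValid(0x42)) returns null, while checkValid(7, { min: 10 }) returns "less than min 10".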

@KOLANICH

You have suggested checking such relations using YAML syntax. Having this syntax in YAML has both drawbacks and benefits. The drawback is that we have to support it alongside the usual expressions. The benefit is that it is YAML, so any tool working with YAML would be able to parse it.

So should we drop our current expr syntax and start using an entirely YAML-based one?

@GreyCat
Member Author

GreyCat commented May 16, 2018

So should we drop our current expr syntax

What exactly is "our current expr syntax", and what do you mean by "dropping" it? We had no syntax proposal for validation until now, and I suspect what you imply is covered in valid/expr, for example, checking that value is even:

- id: foo
  type: u4
  valid:
    expr: _ % 2 == 0

@webbnh

webbnh commented May 16, 2018

A map, which can feature keys

If the field is supposed to contain an enum value, can I check that by specifying a key of enum, or can any-of take the enum type-name instead of a list?

@GreyCat
Member Author

GreyCat commented May 16, 2018

@webbnh Probably for enums we can just do the following:

  1. Ensure that in all languages "invalid" enum values do not trigger an error by default
  2. Add an extra key here, something like this:
- id: foo
  type: u4
  enum: bar
  valid:
    enum-key: true

@KOLANICH

What exactly is "our current expr syntax", and what do you mean by "dropping" it?

What you have proposed can be expressed as a logical expression, assuming that we have some facilities.
eq: expected <=>

throw:
  if: _ != expected

min: expected <=> if: _ < expected
max: expected <=> if: _ > expected
any-of, enum-key <=> #434

But we can approach it from the other side: we can eliminate textual expressions and force the developer to use a YAML AST.

For example, if: a != 0 and b > _io.size would be something like

size:
  and:
    - '!=':
      - a
      - 0
    - '>':
      - b
      - '_io.size'

I know that it is monstrously space-wasting and unusable, but I'm not sure ;)
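
As a rough illustration of what consuming such a YAML AST could involve, here is a minimal JavaScript sketch that evaluates the nested structure above once a YAML parser has turned it into plain objects. The function evalAst and the convention that strings are names and numbers are literals are assumptions made for this example; nothing like this exists in Kaitai Struct.

```javascript
// Hypothetical evaluator for the YAML-AST form sketched above.
// Convention assumed here: strings are names looked up in `env`,
// numbers are literals, and every other node is { operator: [operands] }.
function evalAst(node, env) {
  if (typeof node === "number") {
    return node;
  }
  if (typeof node === "string") {
    if (!(node in env)) {
      throw new Error("unknown name: " + node);
    }
    return env[node];
  }
  const op = Object.keys(node)[0];
  // All operands are evaluated eagerly (no short-circuiting in this sketch).
  const args = node[op].map((arg) => evalAst(arg, env));
  switch (op) {
    case "and": return args.every(Boolean);
    case "!=": return args[0] !== args[1];
    case ">": return args[0] > args[1];
    default: throw new Error("unsupported operator: " + op);
  }
}

// `if: a != 0 and b > _io.size` as the object a YAML parser would yield.
const cond = { and: [{ "!=": ["a", 0] }, { ">": ["b", "_io.size"] }] };
```

Calling evalAst(cond, { a: 1, b: 10, "_io.size": 5 }) yields true. Even this tiny evaluator needs a rule for telling names from literals, which the textual expression syntax already provides.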

@GreyCat
Member Author

GreyCat commented May 17, 2018

My point here is actually very simple. We could just do it all as one expression, and we actually still can, i.e. stuff like this is legal:

valid:
  expr: _ == 42 # the same as `eq: 42`
valid:
  expr: _ >= 42 # the same as `min: 42`

It's easy to do, but we lose declarativity this way. If it's only an expression, one can't just look at this validation spec and render it as a range input (for example, if min/max boundaries are known), or as a combo box with a set of choices (if a set of choices is known). Basically, if we only have a validation expression, the only thing we can do (without reversing it and applying some symbolic solver) is check every particular value for validity, and that's it. A declarative configuration offers more: for example, you can clearly get a list of valid values, or boundaries, or stuff like that.

The proposed mechanism is also extensible. For example, if we want to add size limits for strings in the future, we can do something like

valid:
  max-size: 42

and it could be rendered as a length-limited input box, instead of asking a user to enter an arbitrary string and then rejecting it for some mysterious reason.
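
The GUI argument can be sketched in a few lines of JavaScript: with a declarative map, a tool can pick an input widget by looking at the keys alone. The function describeInput, the camelCase keys anyOf and maxSize, and the widget names are hypothetical and chosen only for illustration; no existing Kaitai Struct tool is being described.

```javascript
// Hypothetical sketch: derive an input widget description from a declarative
// `valid` map without evaluating anything.
function describeInput(valid) {
  if (valid === undefined) {
    return { widget: "free" };
  }
  if ("anyOf" in valid) {
    // A known set of choices maps naturally to a combo box.
    return { widget: "combo", choices: valid.anyOf.slice() };
  }
  if ("min" in valid || "max" in valid) {
    // Known boundaries map to a range input; a missing bound stays open.
    return {
      widget: "range",
      min: "min" in valid ? valid.min : null,
      max: "max" in valid ? valid.max : null,
    };
  }
  if ("maxSize" in valid) {
    return { widget: "text", maxLength: valid.maxSize };
  }
  // Only an opaque expression is left: nothing can be inferred without a solver.
  return { widget: "free" };
}
```

With { min: 0, max: 42 } this yields a range widget, whereas a spec that only carries expr falls through to the free-form case, which is the loss of information described above.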

@KOLANICH

KOLANICH commented May 17, 2018

I hadn't thought about using that info as metadata for GUI purposes. It's a cool idea. In this case

reversing it and applying some symbolic solver

IMHO is the best way to do it. If we have problems with performance we can try to cache solver results.

It's easy to do, but we lose declarativity this way.

I guess we don't lose declarativity, since we still declare "this value must satisfy this logical expression" and all the needed info is there; what we lose is some verbosity. What we gain is that we don't have to assume the programmer used the proposed verbose syntax instead of exprs. So even if a programmer used exprs, a system that recovers this info from exprs should work correctly, even retroactively: when a new extractor is added, it would work on old specs without any modification.

@kalidasya

Please take into account bit-sized values like:

- id: fixed_10
  type: b2
  contents: [2]

- id: fixed_10
  contents: [true, false]

@GreyCat
Member Author

GreyCat commented Jul 6, 2019

I went forward and did a proof-of-concept implementation in the current compiler. Namely:

  • valid is now recognized and parsed in its short form => only equals is supported now, but it's really easy to extend it to support ranges, lists, arbitrary expressions, etc.
  • There is a system of relevant ValidationSpec objects created during KSY loading
  • For Java, Ruby, Python, JS: codegen actually generates validation code
  • Added 3 tests valid_*.ksy:
  • Added a more or less formal system of exceptions specific to KS
    • Implemented it in Ruby, Python, Java; JS is on the way
    • Added relevant knowledge of this system of exceptions into ksc

@generalmimon
Member

The current KSC (0.9-SNAPSHOT) generates incorrect code when valid is used on an optional field (with if). The validation check is executed outside the if clause instead of inside it.

Test code:

meta:
  id: valid_if_test
seq:
  - id: optional
    if: false
    type: u1
    valid: 42

produces (in JavaScript, but it happens in all languages):

ValidIfTest.prototype._read = function() {
  if (false) {
    this.optional = this._io.readU1();
  }
  if (!(this.optional == 42)) {
    throw new KaitaiStream.ValidationNotEqualError(42, this.optional, this._io, "/seq/0");
  }
}
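
For comparison, here is a self-contained sketch of the expected shape, with the check moved inside the if block. FakeStream, the ValidationNotEqualError class and readOptional are stand-ins written for this example only; they are neither the real runtime classes nor actual ksc output.

```javascript
// Stand-in error class for this sketch only (the real one lives in the runtime).
class ValidationNotEqualError extends Error {
  constructor(expected, actual, srcPath) {
    super("not equal, expected " + expected + ", but got " + actual + " at " + srcPath);
    this.expected = expected;
    this.actual = actual;
  }
}

// Minimal stand-in for a stream that yields unsigned bytes.
class FakeStream {
  constructor(bytes) {
    this.bytes = bytes;
    this.pos = 0;
  }
  readU1() {
    return this.bytes[this.pos++];
  }
}

// The field is read AND validated only when the `if` condition holds.
function readOptional(io, condition) {
  const obj = {};
  if (condition) {
    obj.optional = io.readU1();
    if (!(obj.optional == 42)) {
      throw new ValidationNotEqualError(42, obj.optional, "/seq/0");
    }
  }
  return obj;
}
```

With the condition false, readOptional returns an object whose optional is left undefined and no validation error is raised, which is the behaviour the test spec above expects.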

@GreyCat
Member Author

GreyCat commented Oct 19, 2019

@generalmimon

The current KSC (0.9-SNAPSHOT) generates incorrect code when valid is used on an optional field (with if). The validation check is executed outside the if clause instead of inside it.

Probably moving it inside if would make sense, good catch! Can I ask you to contribute a relevant test to the tests repo?

@generalmimon
Member

Sure.

generalmimon added a commit to generalmimon/kaitai_struct_formats that referenced this issue Nov 24, 2019
There are two types of RIFF structs: with asserts (chunk and
parent_chunk_data) and without asserts (chunk_generic and
parent_chunk_data_generic). Those without asserts won't be
needed when kaitai-io/kaitai_struct#435 (comment) is resolved.
@GreyCat
Member Author

GreyCat commented Nov 25, 2019

Recent compiler updates fixed ValidFailInst and ValidNotParsedIf tests by @generalmimon. Thanks for enhancing our test base!

generalmimon added a commit to kaitai-io/kaitai_struct_compiler that referenced this issue Jun 17, 2022
generalmimon added a commit to kaitai-io/coreldraw_cdr.ksy that referenced this issue Oct 1, 2022
I forgot that the compiler doesn't yet perform type checking on `valid` expression
(see kaitai-io/kaitai_struct#435 (comment)),
so we need to pay attention ourselves.
@Omar-Abdul-Azeez

Omar-Abdul-Azeez commented Jan 14, 2023

Considering that we can assign contents: ["some string", 0x30, 30] shouldn't we also be able to assign

seq:
  - id: sig
    size: 6
    valid:
      any-of: [ [ "AA", 0, 0, 1, 0 ] , [ "BB", 0, 0, 1, 0 ] ]

currently the compiler throws an error as it expects "string"? I assume that means KSY expressions so I tried [ '["AA", 0, 0, 1, 0]' , '["BB", 0, 0, 1, 0]' ] to no avail... can't combine output types: CalcStrType vs Int1Type(true)

Heck even the painful approach doesn't work:

seq:
  - id: sig1
    type: str
    size: 2
    valid:
      any-of: [ "AA", "BB" ]
  - id: sig2
    contents: [ 0, 0, 1, 0 ]

KSC seems to evaluate AA and BB as KSY expressions which makes it search for ids matching those values but an id can only have lower case letters hence an error is thrown...

@demberto

I accidentally discovered that this was a thing. When will the relevant keywords be added to the docs and the YAML schema (the kaitai-struct-vscode plugin uses it)?

@krisutofu

krisutofu commented Apr 13, 2023

I accidentally discovered that this was a thing. When will the relevant keywords be added to the docs and the YAML schema (the kaitai-struct-vscode plugin uses it)?

Which version of the VSCode extension do you have? I have v0.8.1 and it does not have the valid expressions included. They are underlined red in my version. It seems to be outdated since it doesn't know the bit-endian meta property.

A year ago, I tried to contribute documentation about the valid property but it was entirely ignored.
Here it is: valid expression documentation (Jan, 2022)

@demberto

demberto commented Apr 13, 2023

Which version of the VSCode extension do you have? I have v0.8.1

Same.

mblsha added a commit to mblsha/kaitai_struct_doc that referenced this issue Dec 26, 2023
Adapted the discussion thread [1] to document the expected `valid` key behavior.

[1]: kaitai-io/kaitai_struct#435
generalmimon added a commit to kaitai-io/kaitai_struct_cpp_stl_runtime that referenced this issue Aug 19, 2024
generalmimon added a commit to kaitai-io/kaitai_struct_csharp_runtime that referenced this issue Aug 19, 2024
generalmimon added a commit to kaitai-io/kaitai_struct_go_runtime that referenced this issue Aug 19, 2024
generalmimon added a commit to kaitai-io/kaitai_struct_java_runtime that referenced this issue Aug 19, 2024
generalmimon added a commit to kaitai-io/kaitai_struct_javascript_runtime that referenced this issue Aug 19, 2024
generalmimon added a commit to kaitai-io/kaitai_struct_php_runtime that referenced this issue Aug 19, 2024
generalmimon added a commit to kaitai-io/kaitai_struct_python_runtime that referenced this issue Aug 19, 2024
generalmimon added a commit to kaitai-io/kaitai_struct_ruby_runtime that referenced this issue Aug 19, 2024
generalmimon added a commit to kaitai-io/kaitai_struct_compiler that referenced this issue Aug 19, 2024
generalmimon added a commit to kaitai-io/kaitai_struct_tests that referenced this issue Aug 19, 2024
generalmimon added a commit to kaitai-io/kaitai_struct_tests that referenced this issue Aug 29, 2024
@generalmimon
Member

I almost forgot to mention that there is now a new subkey of valid called valid/in-enum based on the idea in #435 (comment). I suppose this is also related to #778, because it was also discussed there. It's implemented for all 9 languages where valid is implemented in the first place (well, after checking ValidFailInEnum at https://ci.kaitai.io/, except for Rust, which would be the 10th such language). The only thing I changed was the name. IMHO valid/in-enum sounds more like a validation constraint than valid/enum-key, and it seems to be a more intuitive name when you see it in an actual .ksy spec like this:

seq:
  - id: foo
    type: u4
    enum: animal
    valid:
      in-enum: true

enums:
  animal:
    4: dog
    12: chicken

valid/in-enum: true checks that the parsed enum value is "in enum", i.e. that it is a known value defined in the enum specified by the enum key. It's functionally equivalent to using valid/any-of with an exhaustive list of all enum entries:

seq:
  - id: foo
    type: u4
    enum: animal
    valid:
      any-of:
        - animal::dog
        - animal::chicken

enums:
  animal:
    4: dog
    12: chicken

... but obviously more convenient to use and easier to maintain. Of course, this valid/any-of pattern still has a use case when you only want to allow a subset of enum values, but if the intent is that it can be any value in the enum, valid/in-enum: true should be preferred.

It's intended only for fields with the enum key. If used on a field without it, a compilation error is planned, but it's not implemented yet at the time of writing (as with most other valid compile-time checks).

In case you were wondering, true is the only allowed value for valid/in-enum. valid/in-enum: false won't work, and there's an already passing test for that:

attr_bad_valid_in_enum_false.ksy: /seq/0/valid/in-enum:
	error: only `true` is supported as value, got `false` (if you don't want any validation, omit the `valid` key)
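
The stated equivalence between valid/in-enum: true and an exhaustive valid/any-of can be illustrated with a short JavaScript sketch over the animal enum from the example. The helpers inEnum and anyOf are hypothetical names for this illustration and do not reflect how the generated code or the runtimes implement the check.

```javascript
// Enum from the example above: numeric value -> name.
const animal = { 4: "dog", 12: "chicken" };

// Sketch of `valid/in-enum: true`: the raw value must be a key defined in the enum.
function inEnum(enumDef, rawValue) {
  return Object.prototype.hasOwnProperty.call(enumDef, rawValue);
}

// Sketch of `valid/any-of`: the raw value must be one of the listed values.
function anyOf(allowed, rawValue) {
  return allowed.includes(rawValue);
}

// Exhaustive list of all enum entries, as one would write by hand in `any-of`.
const exhaustive = Object.keys(animal).map(Number);
```

For every raw value, inEnum(animal, v) and anyOf(exhaustive, v) agree. The difference is maintenance: the any-of list has to be updated by hand whenever the enum gains an entry.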
