Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid UTF-8 was detected in one or more arguments #4954

Open
quasis opened this issue Sep 24, 2022 · 11 comments
Open

Invalid UTF-8 was detected in one or more arguments #4954

quasis opened this issue Sep 24, 2022 · 11 comments
Labels
bug Incorrect behavior in the current implementation that needs fixing

Comments

@quasis
Copy link

quasis commented Sep 24, 2022

Running a WASM file with a binary argument, such as:

wasmtime run hmac_md5 --key `printf "\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa"` "String to hash"

or

wasmtime run hmac_md5 -- --key `printf "\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xaa"` "String to hash"

Aborts execution with the following error:

error: Invalid UTF-8 was detected in one or more arguments

I see three problems with this:

  1. There is no standard I'm aware of that requires the arguments to be textual
  2. There is no standard that requires them to be UTF-8 encoded
  3. Wasmtime is doing unneeded work while validating the arguments, and/or converting them to UTF-8
@quasis quasis added the bug Incorrect behavior in the current implementation that needs fixing label Sep 24, 2022
@bjorn3
Copy link
Contributor

bjorn3 commented Sep 24, 2022

Arguments are text, not binary data. For example on Windows it has to transcode from UTF-16 (as provided by the OS) to UTF-8 (as expected by wasi programs) Additionally null bytes are not allowed in arguments on Windows and Unix as they indicate the end of the argument (on Unix) or the end of the entire commandline (on Windows).

@quasis
Copy link
Author

quasis commented Sep 24, 2022

I believe we have to differentiate between two types of arguments here: arguments to wasmtime itself and arguments to the executed program. The program in its minimal variant will look like this:

int
main(int argc, char *argv[]) {
    // parse 2nd type of arguments here and convert them to whatever encoding necessary,
    // or fail if encoding doesn't fit the design
}

Right now wasmtime is forcing UTF-8 arguments on the program itself, which is IMHO quite limiting for the program (if done by design).

@bjorn3
Copy link
Contributor

bjorn3 commented Sep 25, 2022

If you want the program to run on Windows, UTF-16 to UTF-8 conversion has to be done one way or another as wasi programs don't know the host architecture.

@sunfishcode
Copy link
Member

Also, command-line arguments are already not suitable for passing arbitrary binary data, because they're passed to C programs as NUL-terminated strings, so they can't contain embedded 0 bytes. If you want to pass arbitrary binary data as arguments to programs today, I suggest using an escape or quoting system, or a format like BASE64, to ensure that all the bytes are preserved.

@alexcrichton
Copy link
Member

I believe there's a bug here where the wasmtime executable shouldn't panic on invalid-utf-8, but otherwise I agree with others that your program shouldn't rely on being able to receive arbitrary bytes as arguments.

Additionally the wasmtime CLI has no way of actually communicating this argument to the program at this time. The wasmtime cli can only invoke functions which take numbers as input, which means that even if this were accepted it wouldn't end up doing anything.

@quasis
Copy link
Author

quasis commented Sep 26, 2022

I believe we're concentrating on the wrong side of the coin. Binary arguments are useful for unit tests, where you fully control the environment, know the limitations, and want as little opcodes as possible between the essence of the test and the environment.

The other side of the coin is the forceful conversion of arguments to the UTF-8 encoding, which is a rather large constraint IMHO considering the zoo of character encodings out there. Perhaps the problem with UTF-16 on Windows, which as I understand is the main reason for conversion to UTF-8, can be solved some other way?

p.s. There is a relevant discussion on the SG-16 group page: sg16-unicode/sg16#66, it seems they aim to solve the same problem that we're facing?

@bjorn3
Copy link
Contributor

bjorn3 commented Sep 26, 2022

Binary arguments are useful for unit tests, where you fully control the environment, know the limitations, and want as little opcodes as possible between the essence of the test and the environment.

On Windows, binary arguments would have to have a multiple-of-two size as the argument list consist of a single 16bit character list. In addition on Windows the program itself it responsible for parsing the argument list from a single string into however many arguments the user intended to parse. This is not the responsibility of the shell like on Unix. This means that if your binary argument contains a space character, it would either cause two arguments to be passed, or you would need to escape it, which doesn't really work well with binary arguments I would think.

The other side of the coin is the forceful conversion of arguments to the UTF-8 encoding, which is a rather large constraint IMHO considering the zoo of character encodings out there.

How would you suggest making wasi programs agnostic to the character encoding without forcefully converting from and to UTF-8 at the boundary between the wasi program and the runtime? On Windows the A variants of all system api's have a behavior depending on the system character set, which wasi programs can't query and even if it could, now wasi programs would need to ship with huge tables translating between every possible character set and whatever encoding it uses internally (like UTF-8).

@quasis
Copy link
Author

quasis commented Sep 26, 2022

How would you suggest making wasi programs agnostic to the character encoding without forcefully converting from and to UTF-8 at the boundary between the wasi program and the runtime?

Perhaps a flag to wasmtime, which defaults to UTF-8? Something like:

wasmtime --default-encoding=[system|utf-8|...]

This was the first thing I was looking for after facing the problem the first time. It might not be perfect, but at least intuitive. :)

@sunfishcode
Copy link
Member

I believe we're concentrating on the wrong side of the coin. Binary arguments are useful for unit tests, where you fully control the environment, know the limitations, and want as little opcodes as possible between the essence of the test and the environment.

For a truly minimal unit test, could you use the embedding API, instead of the CLI? Command-line arguments involve quite a lot of code, on both the host side and in wasm, even ignoring the character encoding issues. With the embedding API, you can have a function that takes an array of u8 directly, which would avoid the whole question about string encodings, and avoid all the command-line argument code too.

@quasis
Copy link
Author

quasis commented Sep 26, 2022

For a truly minimal unit test, could you use the embedding API, instead of the CLI? Command-line arguments involve quite a lot of code, on both the host side and in wasm, even ignoring the character encoding issues. With the embedding API, you can have a function that takes an array of u8 directly, which would avoid the whole question about string encodings, and avoid all the command-line argument code too.

I definitely agree with your reasoning. My only comment on that, is that Linux CLI is based on a mature code-base and provides a "battle-proven" solution for almost any upstream/downstream problem, while embedding API almost always lacks an "in-house" solution and requires a third partly library, that is generally either not mature enough or not maintained for years. So there are advantages and disadvantages in both approaches.

@sunfishcode
Copy link
Member

I'm a little unclear about what you're referring to. When I say embedding API, II mean Wasmtime's embedding API: https://docs.wasmtime.dev/lang.html

I can also mention that the current direction for WASI is that it will define command-line arguments to be Unicode strings. While it may be useful in some unit-test-like use cases to let users that know how the host environment works to take advantage of it, the goal in many real-world wasm use cases is to have modules that don't know the environment, and don't interact with the "system" character encoding, because such knowledge would make them less portable.

This only applies to things which are called "strings", which includes command-line arguments. If you have APIs that take u8s, you'll always be able to pass arbitrary bytes through.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior in the current implementation that needs fixing
Projects
None yet
Development

No branches or pull requests

4 participants