Improve portable ingestion of command-line arguments #66

Open
peter-b opened this issue Jan 22, 2021 · 7 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed), paper needed (A paper proposing a specific solution is needed)

Comments

@peter-b
Collaborator

peter-b commented Jan 22, 2021

Currently, C++ requires implementations to support the following forms of the main() function:

int main() { /* body */ }

int main(int argc, char* argv[]) { /* body */ }

Implementations can also define additional entry points. For example, some implementations permit main() to accept environment variables:

int main(int argc, char* argv[], char* environ[]);

Some permit arguments to be accepted as wide characters; examples:

int main(int argc, wchar_t* argv[]);
int wmain(int argc, wchar_t* argv[]);

Many applications make assumptions about the encoding of the contents of argv (and environ if available). These assumptions are very rarely portable between different deployments of a single C++ implementation, let alone across multiple implementations. On some implementations, individual components of argv and environ variables may have different encodings; some may not even be text. On some implementations, using the contents of argv may be guaranteed to lose data, and implementation-specific library functions must be used to safely access arguments.
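
To illustrate why the bytes in argv cannot simply be assumed to be text in a known encoding, here is a minimal sketch that checks whether an argument is well-formed UTF-8 before treating it as such. The function names are assumptions made for this illustration and are not part of any proposal.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Returns true if `bytes` is well-formed UTF-8: no overlong forms,
// no encoded surrogates, and nothing above U+10FFFF.
bool is_well_formed_utf8(std::string_view bytes) {
  const std::size_t n = bytes.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
    if (b0 <= 0x7F) { ++i; continue; }   // single-byte (ASCII) sequence
    std::size_t len = 0;
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range for the second byte
    if (b0 >= 0xC2 && b0 <= 0xDF) len = 2;
    else if (b0 == 0xE0) { len = 3; lo = 0xA0; }   // rejects overlong forms
    else if (b0 >= 0xE1 && b0 <= 0xEC) len = 3;
    else if (b0 == 0xED) { len = 3; hi = 0x9F; }   // rejects surrogates
    else if (b0 == 0xEE || b0 == 0xEF) len = 3;
    else if (b0 == 0xF0) { len = 4; lo = 0x90; }   // rejects overlong forms
    else if (b0 >= 0xF1 && b0 <= 0xF3) len = 4;
    else if (b0 == 0xF4) { len = 4; hi = 0x8F; }   // rejects > U+10FFFF
    else return false;                             // invalid lead byte
    if (n - i < len) return false;                 // truncated sequence
    const unsigned char b1 = static_cast<unsigned char>(bytes[i + 1]);
    if (b1 < lo || b1 > hi) return false;
    for (std::size_t k = 2; k < len; ++k) {
      const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
      if (c < 0x80 || c > 0xBF) return false;      // not a continuation byte
    }
    i += len;
  }
  return true;
}

// Example inputs: raw byte sequences as they might appear in argv.
void check_examples() {
  assert(is_well_formed_utf8("--name"));
  assert(is_well_formed_utf8("caf\xC3\xA9"));  // U+00E9 encoded as UTF-8
  assert(!is_well_formed_utf8("caf\xE9"));     // same letter in ISO-8859-1
}
```

A program that receives the ISO-8859-1 form has no portable way to learn which encoding was intended, which is the gap this issue describes.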

We should standardize portable ways to access data from outside the program via command-line arguments and environment variables, for example by:

  • requiring implementations to support more expressive prototypes for main()
  • adding standard library features for safely accessing command-line arguments and environment variables

peter-b added the enhancement, help wanted, and paper needed labels Jan 22, 2021

@tahonermann
Member

Some relevant discussion can be found in the SG16 mailing list archives at https://lists.isocpp.org/sg16/2021/01/2005.php. Unfortunately, the list archives have predictably mojibaked the message, but the experimental results presented are still understandable.

Peter Brett and I recently discussed the following design to address command line options:

  • A library solution that provides global immutable access to command line options. This would allow access to command line options from constructors of global objects prior to the call to main(). This is a feature that, while not frequently needed, is quite useful for some use cases and is already provided by most implementations in some form.
  • That library solution would provide the sequence of command line arguments as a range.
  • The range elements, each corresponding to an argument, would provide access to the raw (char or wchar_t based) argument as well as to transcoded char, wchar_t, char8_t, char16_t, and char32_t representations; similar to the std::filesystem::path string format observers.
  • Providing conversion to std::filesystem::path would enable implementations to provide appropriately preserved paths portably; on Windows, the implementation could construct from the wchar_t-based command line while implementations for other OSs construct from the char-based command line.
  • No need for additional signatures for main().

Code using this might look like the following (which would itself benefit from being wrapped in a nicer command line argument handling facility).

#include <filesystem>
#include <program_arguments>
#include <ranges>
#include <string>

int main() {
  auto &&args = std::program_arguments(1);   // Skip argument 0, keep 1...
  for (auto arg = std::ranges::begin(args);
       arg != std::ranges::end(args);
       ++arg)
  {
    // arguments implicitly convert to std::string_view or a similar char-based range.
    // When only basic source characters are expected, comparison with execution character set is fine.
    if (*arg == "--file") {
      if (++arg != std::ranges::end(args)) {
        // Retrieve the filename operand converted to a path.
        std::filesystem::path filename = arg->path();
      }
    }
    else if (*arg == "--name") {
      if (++arg != std::ranges::end(args)) {
        // Retrieve the provided username converted from the command line encoding to UTF-8.
        std::u8string username = arg->u8string();
      }
    }
    else {
      ...
    }
  }
}

@jensmaurer
Collaborator

jensmaurer commented Jan 22, 2021

Is this also the suggested model for accessing other external input, such as environment variables or (maybe) registry-style settings? I understand the fundamental idea here is "ability to convert to path or u8string or UTF-16 or UTF-32; the implementation knows best what the lossless source encoding is and will transcode as necessary".

Oh, and it seems to require quite a bit of hackery to access command-line arguments as global state in Linux, so I'd prefer a solution that uses argc and argv as-is. (Or is this lossy on Windows?)

@tahonermann
Member

> Is this also the suggested model for accessing other external input, such as environment variables or (maybe) registry-style settings?

I think it should work for environment variables as well, but I haven't given it as much thought yet. The situation on Windows is complicated. For a program using the Microsoft C run-time, there are (at least) three sets of environment variables:

  • The Win32 environment block provided at process creation. This is the one manipulated by the GetEnvironmentStrings(), FreeEnvironmentStrings(), GetEnvironmentVariable(), and SetEnvironmentVariable() functions. Each of these functions comes in ANSI and Unicode variants. However, I think the ANSI versions of the first two use the OEM character set, while the ANSI versions of the latter two use the MBCS. The actual environment block may be in either ANSI or Unicode form; which one depends on whether the CreateProcess() invocation that created the process passed the CREATE_UNICODE_ENVIRONMENT flag.
  • The MBCS environment managed by the Microsoft C run-time. This is the one managed by the getenv() and _putenv() functions. It is initially constructed from the Win32 environment block for an ANSI program (one that uses main()), or lazily copied from the Unicode C run-time environment when needed (at which point the run-time synchronizes updates between the two copies).
  • The Unicode environment managed by the Microsoft C run-time. This is the one managed by the _wgetenv() and _wputenv() functions. It is initially constructed from the Win32 environment block for a Unicode program (one that uses wmain()), or lazily copied from the ANSI C run-time environment when needed (at which point the run-time synchronizes updates between the two copies).

It seems clear that the implementation knows what encoding to use for each of these blocks, so which one is used can be left to the implementor's discretion. Not losing data (due to encoding) would presumably require use of the Win32 environment block.

I believe most, if not all, implementations (including Windows) permit data in the environment block that is not well-formed for any particular encoding.
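
The point about ill-formed data can be demonstrated on a POSIX system with a small sketch. It uses `setenv`, which is a POSIX function rather than part of ISO C or C++, and the helper name and the variable name `SG16_DEMO_VALUE` are made up for the illustration.

```cpp
#include <cassert>
#include <cstdlib>   // std::getenv
#include <stdlib.h>  // setenv (POSIX only)
#include <string>

// Stores `value` under `name` with POSIX setenv, then reads it back.
// Returns the bytes reported by std::getenv, or an empty string on failure.
std::string round_trip_env(const char* name, const std::string& value) {
  if (setenv(name, value.c_str(), /*overwrite=*/1) != 0) return {};
  const char* stored = std::getenv(name);
  return stored ? std::string(stored) : std::string();
}

// The environment accepts and returns bytes that are not valid UTF-8.
void demo_non_text_value() {
  const std::string bytes = "\xFF\xFE";  // ill-formed as UTF-8
  assert(round_trip_env("SG16_DEMO_VALUE", bytes) == bytes);
}
```

Any portable interface therefore needs a way to expose the raw value alongside transcoded views, because transcoding such a value cannot be lossless.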

> Oh, and it seems to require quite a bit of hackery to access command-line arguments as global state in Linux, so I'd prefer a solution that uses argc and argv as-is. (Or is this lossy on Windows?)

Not losing data on Windows would require using the wmain() entry point, which is non-portable.

@tahonermann
Member

> Oh, and it seems to require quite a bit of hackery to access command-line arguments as global state in Linux

I do recall that being cumbersome the last time I needed to access them. I believe it required reading /proc/<pid>/cmdline (which, of course, may not be available if procfs is not mounted). On other systems, there was generally a global __argc and __argv or similarly named variable pair available.
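
For reference, here is a minimal sketch of the procfs approach on Linux. The kernel exposes a process's arguments as NUL-terminated strings in /proc/self/cmdline (the environment is in /proc/self/environ). The function names are assumptions for this illustration, and the approach fails when procfs is not mounted.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Splits a buffer of NUL-terminated strings, the layout used by
// /proc/<pid>/cmdline, into individual arguments.
std::vector<std::string> split_nul_terminated(const std::string& buffer) {
  std::vector<std::string> args;
  std::string::size_type start = 0;
  while (start < buffer.size()) {
    std::string::size_type end = buffer.find('\0', start);
    if (end == std::string::npos) end = buffer.size();
    args.emplace_back(buffer, start, end - start);
    start = end + 1;
  }
  return args;
}

// Reads the current process's arguments without going through main().
// Returns an empty vector if procfs is not mounted or not readable.
std::vector<std::string> arguments_from_procfs() {
  std::ifstream in("/proc/self/cmdline", std::ios::binary);
  if (!in) return {};
  const std::string buffer((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());
  return split_nul_terminated(buffer);
}

// Example: three arguments, the middle one empty.
void demo_split() {
  std::string buffer;
  buffer += "prog";  buffer += '\0';
  buffer += '\0';
  buffer += "--name"; buffer += '\0';
  const std::vector<std::string> args = split_nul_terminated(buffer);
  assert(args.size() == 3 && args[1].empty());
}
```

Because `arguments_from_procfs()` does not depend on argc and argv, it could be called from the constructor of a global object, which is the use case the library design aims to support.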

@cor3ntin
Collaborator

An idea I have had for a while is to add:

int main(int argc, char8_t* argv[]);

When this overload is selected:

  • Arguments are UTF-8

  • Locale is C.UTF-8

  • The functions u8getenv and u8putenv are provided to handle the environment

  • getenv/putenv are also UTF-8 but deal in char

When another overload is selected, calling u8getenv or u8putenv is UB.

@jensmaurer
Collaborator

"putenv" does not exist in either C or C++.

@cor3ntin
Collaborator

cor3ntin commented Mar 24, 2021 via email
