Skip to content
This repository has been archived by the owner on Aug 8, 2023. It is now read-only.

Port Collator expressions to core/iOS/Android/Qt #11869

Merged
merged 8 commits into from
Jul 3, 2018
Merged

Conversation

ChrisLoer
Copy link
Contributor

@ChrisLoer ChrisLoer commented May 8, 2018

When finished, this PR will fix issue #11692 (port of mapbox/mapbox-gl-js#6270). The functionality requires 3 separate platform implementations. Currently, I have an initial implementation of the core part (expression parsing/evaluation etc.), and I plan to start working through the platform implementations starting with iOS/macOS.

  • Core implementation
  • BCP-47 locale identifier parser/generator (necessary for going back and forth between platform-specific locale identifiers)
  • BCP-47 unit test
  • Darwin platform implementation
  • Darwin tests: done with test-expressions
  • Android platform implementation
  • Android tests (exercising "collator" and "resolved-locale" through render tests)
  • locale-unaware "default" implementation using nunicode ICU (requires new Mason build with bundled collators)
  • nunicode/default tests: done with test-expressions
  • Qt implementation? Using default/nunicode for now
    /cc @anandthakker @1ec5 @nickidlugash @brunoabinader @tobrun

@ChrisLoer ChrisLoer added ⚠️ DO NOT MERGE Work in progress, proof of concept, or on hold GL JS parity For feature parity with Mapbox GL JS localization Human language support and internationalization labels May 8, 2018
@ChrisLoer
Copy link
Contributor Author

I've pushed initial implementations for darwin and Android, but without any testing so far -- basically just making sure I can hook everything up. Doing that exercise made me realize I hadn't fully thought through how the PRIMARY/SECONDARY/TERTIARY collator strengths (used by Android/ICU) would translate to case-sensitive/diacritic-sensitive. There's no built-in support for "diacritic-insensitive but case-sensitive". 😬

I implemented a workaround of the form:

  • Compare diacritic-inensitive/case-insensitive -- if there's a difference, return the result
  • If they appear to be the same, use platform::lowercase to see if there's any difference in case. If not, return 0.
  • Compare the two strings using the base (i.e. no locale awareness) comparison, and return the result. If there was a "diacritic" difference, that could cause the comparison result to be the opposite of what we'd get if we were looking just at case.

There is another wrinkle here in that it is up to the locale what PRIMARY/SECONDARY/TERTIARY actually mean, so it's not guaranteed they'll map the way people expect to "case" and "diacritic", but I think it should usually (?) be a pretty good mapping.

@1ec5
Copy link
Contributor

1ec5 commented May 15, 2018

I think the collator strengths will correspond to the flags found on other platforms, at least for Latin alphabets. Vietnamese is as good a stress test for Latin alphabets as I can think of:

  • A versus B — primary difference
  • A versus  — primary difference, because  is a different base letter than A
  • Ạ versus Ậ — primary difference
  • A versus Ạ — secondary difference
  • Â versus Ậ — secondary difference
  • Ậ versus ậ — tertiary difference

}

int compare(const std::string& lhs, const std::string& rhs) const {
NSString* nsLhs = [NSString stringWithUTF8String:lhs.c_str()];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: you can write @(lhs.c_str()) instead of calling -[NSString stringWithUTF8String:] explicitly.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

int compare(const std::string& lhs, const std::string& rhs) const {
NSString* nsLhs = [NSString stringWithUTF8String:lhs.c_str()];
// TODO: verify "abc" != "abcde" -- the "range" argument seems strange to me
// https://developer.apple.com/documentation/foundation/nsstring/1414561-compare
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it does seem weird that there isn’t a -compare:options:locale: that doesn’t take an NSRange. It looks like CFString has the same quirk.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could implement this ourselves with a length check -- ie:

  • First check that lhs == rhs over lhs's range
  • If they're equal, check if rhs is longer than lhs, in which case rhs > lhs no matter what's in the extra characters
    But I can imagine cases where that's not necessarily correct because the "extra characters" affect the sort order of the original strings (e.g. "ue" gets split up but it's in a german locale that treats "ue" as equal to "ü")

Copy link
Contributor

@1ec5 1ec5 May 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After playing around with this method a bit, I’m satisfied that the length check won’t even be necessary. As far as I can tell, a string always sorts after its prefix. The case below is the closest I could come to an unexpected result, but I think even this is correct. (Pardon the Swift; the playground makes it much easier to try out possibilities.)

ü

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also verified the pre-fix sort behavior, and the only other concern would be character folding right at the end of the range -- however, it appears Apple doesn't do "ü"->"ue" type folding.

@ChrisLoer
Copy link
Contributor Author

I just pushed an initial implementation of BCP 47 parsing in language_tag.cpp. Using boost::spirit to implement the grammar was fun, although it came with a healthy dose of inscrutable template error messages.

I did the most basic test of creating a collator with the unlikely language tag de-Hans-US-variant-x-private and verifying that passed through to the Apple locale identifier de-Hans_US. For some reason I haven't figured out yet, the extlang rule seems to break if you provide a 4-character script, so I just have it disabled for now.

@ChrisLoer
Copy link
Contributor Author

I'm not sure why spirit doesn't backtrack on reaching end of input without matching, and it seems like it'd be better to configure that globally, but for now I added explicit lookaheads to the extlang rule. Now it can handle the even more unlikely but still valid tag de-ext-two-Hans-US-variant-x-private.

@ChrisLoer
Copy link
Contributor Author

🤔Another wrinkle, this time on the Darwin side. The accent-equals-de test fails because NSString using the "de" NSLocale doesn't fold "ü" to "ue". I think this falls in the class of "locale-based comparison is inherently system-dependent" behavior we have to accept with the approach we're using, but it's unfortunate that it's a relatively prominent difference from the GL JS behavior (at least running in Chrome on my machine!). I don't know if NSString avoids all folding that changes the length of a string, but it occurs to me that compare(_:options:range:locale:) is only really well-defined if they never change character lengths, so that might be a good hint.

The good news is: the rest of the collator tests running on darwin pass! (can't unignore them yet, though, since they don't have a "default" implementation).

@kkaefer and/or @brunoabinader: what do you think of doing a very minimal default/linux/Qt implementation of Collator that basically ignores locale and just uses a relatively simple hardwired latin-character diacritic-stripping table? Although using ICU and bundling locale information gets us the closest to the GL-JS behavior, working on the iOS/Android implementations has driven home for me the point that we will always have to tolerate locale-specific differences in this behavior. A non-locale-aware approach for Qt would still be able to reasonably render styles using collator expressions, and it would significantly constrain the scope of this PR.

I can see the counter-argument that if we don't do locale support on Qt, then either (1) we're piling up hidden technical debt, or (2) locale-awareness wasn't really necessary in the first place. But I still think this is a good punt -- if we start developing bilingual in-car navigation apps w/ Qt, this would be enough to get started.

@ChrisLoer
Copy link
Contributor Author

I just pushed a "default" implementation using nunicode (depends on adding nunicode 1.8 to mason: mapbox/mason#616).

I believe the nunicode implementation should be able to pass all the tests except accent-equals-de, which already doesn't pass on iOS.

@ChrisLoer ChrisLoer force-pushed the port-collator branch 2 times, most recently from 30180fc to 23b33b3 Compare June 4, 2018 22:40
@ChrisLoer
Copy link
Contributor Author

Capturing some very useful conversation with @brunoabinader:

  • Even if we set up mason to build and publish Windows binaries, we still wouldn't want to include any pre-compiled mason binaries in our Qt build, because we need to be able to build as part of Qt, and Qt has a significantly larger CI matrix than ours.
  • However, nunicode is small, simple, and MIT licensed, so we should be able to fold it directly into the gl-native project (as we've done with csscolorparser and parsedate). I think it's a little unfortunate that folding it into our project kind of hides the dependency, but my plan is to give this a try (and abandon all my experiments with using mason on Windows).
  • Long-term, the most satisfying solution here will be to use QCollator instead of nunicode, but we're blocked on diacritic-insensitivity (https://bugreports.qt.io/browse/QTBUG-22727). Bruno will push a little bit to see if Qt can merge some version of the proposed patch.

@ChrisLoer ChrisLoer force-pushed the port-collator branch 3 times, most recently from 3fff201 to f5f9b95 Compare June 7, 2018 21:14
@ChrisLoer
Copy link
Contributor Author

ChrisLoer commented Jun 11, 2018

Android is almost passing! And the failure is on the weird diacritic-insensitve/case-sensitive combo that I hacked together...

Expected:
actual

Actual (Android):
expected

Also I think the render tests crash every once in a while, which probably isn't supposed to happen...

@tobrun
Copy link
Member

tobrun commented Jun 12, 2018

Also I think the render tests crash every once in a while, which probably isn't supposed to happen...

Initially had some issues with stability of the MapSnapshotter when running render test but that shouldn't happen anymore. Is this happening randomly or specific to your test?

@ChrisLoer
Copy link
Contributor Author

Is this happening randomly or specific to your test?

I only just figured out how to get the Studio logcat to keep showing logs for a dead process and it hasn't crashed since I've been following it that way -- but my guess would be the crash is in the collator test, considering my dev process has been "keep changing JNI invocations until it seems to work". 😜

@ChrisLoer
Copy link
Contributor Author

@tobrun Actually it looks like the crash is occurring outside of collator. I just saw it in the icon-image test (which happens before any collator expressions are used in any other tests):

06-12 10:54:22.462 21677-21677/com.mapbox.mapboxsdk.testapp D/RenderTestActivity: Next test: 317 / 557
    Render test zoom-and-property-function,icon-offset
06-12 10:54:22.689 21677-21677/com.mapbox.mapboxsdk.testapp D/RenderTestActivity: Next test: 318 / 557
    Render test default,icon-no-cross-source-collision
06-12 10:54:22.939 21677-21677/com.mapbox.mapboxsdk.testapp D/RenderTestActivity: Next test: 319 / 557
    Render test literal,icon-image
06-12 10:54:22.942 21677-21677/com.mapbox.mapboxsdk.testapp A/libc: Fatal signal 11 (SIGSEGV), code 1, fault addr 0x100 in tid 21677 (pboxsdk.testapp), pid 21677 (pboxsdk.testapp)

This is just a guess, but I noticed the screen dim on my test device (a Pixel 2) just before the crash. Maybe some Android user-interaction idle timeout is messing with it?

@tobrun
Copy link
Member

tobrun commented Jun 12, 2018

This is just a guess, but I noticed the screen dim on my test device (a Pixel 2) just before the crash. Maybe some Android user-interaction idle timeout is messing with it?

Thank you for adding that information, will try reproducing and I'm hoping that this is the source of the crash. Fwiw on my test device I always have the following enabled:

Settings > System > Developer Options > Stay Awake (screen will never sleep while charging)

@ChrisLoer ChrisLoer removed the ⚠️ DO NOT MERGE Work in progress, proof of concept, or on hold label Jun 12, 2018
@ChrisLoer ChrisLoer changed the title WIP: Port Collator expressions to core/iOS/Android/Qt Port Collator expressions to core/iOS/Android/Qt Jun 12, 2018
@ChrisLoer
Copy link
Contributor Author

@kkaefer I checked in your vendoring script and used it to re-import nunicode. One issue is that the nunicode implementation files expect to be in the same relative path as the header files, but I didn't want to pollute the root header search path with all the nunicode stuff, so instead I had to modify all the includes to use <libnu/[file].h>

I squashed and re-wrote the commit history, in the process taking out a few TODOs that were covered by review comments. I also stopped using our hand-rolled BCP 47 parser in the darwin implementation, based on @1ec5 's assurance that "language identifiers" can already handle BCP 47.

I think this is ready? Follow up work:

  • NSExpression binding and ios/macos changelog
  • Android binding and android changelog
  • (longer term) @nagineni getting diacritic-sensitivity into QCollator and switching Qt over to that

(diacriticSensitive ? 0 : NSDiacriticInsensitiveSearch))
, locale(locale_ ?
[[NSLocale alloc] initWithLocaleIdentifier:@((*locale_).c_str())] :
[NSLocale currentLocale])
Copy link
Contributor

@1ec5 1ec5 Jun 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although localeIdentifier is BCP 47 compliant when NSLocale is created with a BCP 47 locale identifier, +[NSLocale currentLocale] places an underscore before the region code, for backwards compatibility. So the one modification we need to make is to replace underscores with hyphens. (Alternatively, we can build a canonical BCP 47 code by stringing together the languageCode, scriptCode, and regionCode properties instead of using localeIdentifier.) Otherwise we’re in good shape on iOS and macOS.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean in resolvedLocale we should do the _/- replacement? I guess since _ is not a valid part of any BCP 47 tag, that's a relatively safe replacement to do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the replacement -- experimentally, it looks like the identifier passed into initWithLocaleIdentifier gets round-tripped intact whether it uses the underscore or not. Not sure what pathway currentLocale goes down.

// may append the region tag with an "_". Changing that to a "-" makes the identifier BCP 47 compliant.
// Experimentally, "zh-Hans-HK" and "zh-Hans_HK" both round trip -- if the second is used by
// `currentLocale`, we don't want to return the underscore.
return [[locale localeIdentifier] stringByReplacingOccurrencesOfString:@"_" withString:@"-"].UTF8String;
Copy link
Contributor

@1ec5 1ec5 Jun 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, this is good enough to avoid the underscore issue, since none of the ISO or UN region codes have hyphens or underscores. If we want to be super safe and protect against extensions in the locale identifier, then this code would ensure that we only get the bits we care about:

// NSLocale.languageCode & company are unavailable in iOS 9.
NSArray *keys = @[NSLocaleLanguageCode, NSLocaleScriptCode, NSLocaleCountryCode];
NSMutableArray *components = [NSMutableArray array];
for (NSString *key in keys) {
    NSString *code = [locale objectForKey:NSLocaleLanguageCode];
    if (code) {
        [components addObject:code];
    }
}
return [components componentsJoinedByString:@"-"].UTF8String;

(Or the C++ string stream equivalent.) But I’m not even sure that an extension can ever be present in +[NSLocale currentLocale]. For example, my Mac’s locale settings are all over the place, including custom sorting and date formats, but the locale identifier is still just vi_US, which would be fine with just the underscore replacement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think if there's an extension that doesn't conform to BCP 47 conventions, all bets are off anyway.

We should try to get your personal machine containerized and used by CI for testing locale functionality. 😉

@friedbunny friedbunny added the needs changelog Indicates PR needs a changelog entry prior to merging. label Jun 28, 2018
@@ -0,0 +1,38 @@
macro(mbgl_nunicode_core)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using a macro mixin, we could also use an interface library in CMake, like this: https://github.com/mapbox/mapbox-gl-native/blob/master/cmake/loop-uv.cmake

This allows you to add target_link_libraries(mbgl-core PRIVATE nunicode) to "mix in" the sources and include paths.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, building libnu as a standalone static library also makes sense because it allows us to restrict the define flags to just the nunicode files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 Standalone static probably makes sense and is most similar to how we used to load it with mason. I'll look into setting that up (my CMake game is not that strong, but time to learn).

Do you have any concerns about the manual stripping of unused files I did in the process of vendoring? It's a potential source of confusion, but just helps reduce the total number of lines of code that have to get pulled into the repo, included in search, etc...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The define flags actually have to be public because they toggle sections of the header files on and off (unless we just hard-wire the headers to the state we want).

nunicode/LICENSE Outdated
@@ -0,0 +1,19 @@
Copyright (c) 2013 Aleksey Tulinov <aleksey.tulinov@gmail.com>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have this at a level that is not the root level? e.g. in vendor/nunicode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand -- are you suggesting just the license should live in vendor/nunicode, or the entire nunicode should live under a vendor folder? Moving everything into a vendor folder under the root makes sense to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the entire nunicode folder

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@ChrisLoer
Copy link
Contributor Author

@friedbunny I think the plan is to add changelog entries with the platform specific NSExpression and Expression.java wrappers. A possible entry could be:

Add "collator" expression that enables locale-aware string comparison with optional case and diacritic sensitivity, following the pattern of javascript's Intl.Collator object. Collator expressions can be passed as optional arguments to all string comparison expressions. Also adds "resolved-locale" expression to test whether request locales are locally available.

@friedbunny friedbunny removed the needs changelog Indicates PR needs a changelog entry prior to merging. label Jun 29, 2018
@ChrisLoer
Copy link
Contributor Author

I filed #12268 and #12269 for the Android/iOS-specific follow-on items.

@kkaefer Does the nunicode vendoring looks serviceable to you now?

@ChrisLoer ChrisLoer force-pushed the port-collator branch 2 times, most recently from ad763e2 to cc98bfc Compare June 29, 2018 23:30
@ChrisLoer
Copy link
Contributor Author

@jfirebaugh I just rebased on top of your DSL changes from #12258, and had to modify the eq implementation to use a nullopt Collator. It looks like the DSL stuff is test-oriented, so it's OK to put off adding collators to the DSL until we want to use them in a test?

@jfirebaugh
Copy link
Contributor

Yeah, the DSL is piecemeal as needed.

- Version bump to 1.8 necessary for "unaccent" functionality
- Qt now depends on nunicode, ruling out use of precompiled binaries
Cross platform parsing and evaluation code.
 - Based on nunicode
 - Not locale-aware
 - Used by linux and Qt builds
 - Uses NSString for comparison
 - Uses NSLocale for loading locales
 - Uses java.text.Collator for comparison
 - Uses java.util.Locale for locale loading
 - Uses LanguageTag for BCP 47 parsing
 - Falls back to non-locale-aware nunicode/default comparison for case-sensitive/diacritic-insensitive.
 - Testing these changes depends on running Android render tests
 - "collator" is not yet exposed in the SDK bindings.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
GL JS parity For feature parity with Mapbox GL JS localization Human language support and internationalization
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants