Implement single language display names#8082
Conversation
sffc-bot
left a comment
There was a problem hiding this comment.
🤖 This review was submitted by an AI agent working with @sffc.
The implementation looks excellent! The zero-allocation formatting pipeline using QualifiersWriteable is very clean and efficient, and the case-insensitivity fix for variants is correct. All tests are passing.
I have left a few minor comments regarding:
- An outdated status comment in
single.rs. - A typo in the
README.mdconstructor signature. - An obsolete TODO comment in the tests.
Regarding the open questions:
- Writeable on Owned Types: Keep it (convenient and consistent).
- Constructor Argument Order: The code's order
(prefs, locale_id, options)is correct (consistent). Typo in README should be fixed. - Fallback Behavior: Keep the current fail-fast behavior (consistent and efficient).
Overall, great work!
f87b20e to
e7b83ab
Compare
|
This PR is reviewable commit-by-commit, but each commit is split into its own PR. |
- Implement VariantDisplayName and VariantDisplayNameOwned in single/variant.rs. - Refactor single.rs to only re-export Variant, Region, and Script. - Refactor singular.rs to use unified Medium markers instead of Long markers. Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
- Implement LanguageDisplayName and LanguageDisplayNameOwned in single/language.rs. - Re-export LanguageDisplayName in single.rs. - Integrate CLDR test cases into integration tests. - Update README.md with open questions. Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
e7b83ab to
6f010da
Compare
This comment was marked as outdated.
This comment was marked as outdated.
This is the normal style
No harm in having multiple
Say more about what this would look like? I'm not sure I understand "storing the input identifier and making the payloads optional" |
robertbastian
left a comment
There was a problem hiding this comment.
A missing name should not be a DataError, it should fall back to the code
| assert_eq!(result.capacity(), result.len()); | ||
| } | ||
|
|
||
| // Test our new single formatter implementation (only for cases that are in the data, i.e. not "xx") |
There was a problem hiding this comment.
please also test xx, if only to demonstrate how a client would do the fallback themselves
There was a problem hiding this comment.
Deferred to #8100 since fallback is being implemented as a follow-up.
There was a problem hiding this comment.
in #8100 you state that you don't even want to implement fallback, which is exactly why I'm asking you to demonstrate here how a client would do it
There was a problem hiding this comment.
This test is testing LanguageDisplayName. I have no good way for a client to perform their own fallback other than falling back to the whole BCP-47 string.
I prefer to keep PRs small and incremental but I can implement it in this PR if you insist.
There was a problem hiding this comment.
Actually I think the right way to do this is to implement TryWriteable, with error parts over the fallback strings. Do you agree, and will you let me do this in a follow-up?
There was a problem hiding this comment.
I started implementing this. It should work but it needs some missing TryWriteable impls in icu_pattern. I'll make a separate PR for those. In the mean time I think I would like to merge this PR without the TryWriteable.
Here's my branch: https://github.com/sffc/omnicu/tree/dname-trywriteable
| .expect("Data should load successfully"); | ||
|
|
||
| assert_writeable_eq!(lang_name, "Traditional Chinese (Hong Kong)"); | ||
| assert_writeable_eq!(lang_name, "Traditional Chinese (Hong Kong SAR China)"); |
There was a problem hiding this comment.
why not test this as part of test_concatenate?
There was a problem hiding this comment.
It's an end-to-end test we are now enabling. Would you prefer to delete the whole test (and merge it in with one of the other tests)?
There was a problem hiding this comment.
yes
I don't agree with the concept of having disabled tests in code anyway
| /// ``` | ||
| #[allow(dead_code)] | ||
| #[derive(Debug)] | ||
| pub struct LanguageDisplayNameOwned { |
There was a problem hiding this comment.
issue: Language is the subtag, this formats LanguageIdentifiers, so it should be called LanguageIdentifierDisplayNameOwned.
the multi implementation take &Locale, and we do want to format unicode extensions at some point
There was a problem hiding this comment.
When designing this, I settled on Language because:
- A language display name formatter that formats only the language subtag is not well-defined in CLDR. We always read the other subtags to pick the correct dialect name.
- This formatter does what most regular people expect for a type named
LanguageDisplayName. - If we did implement a formatter with language-subtag-only behavior, calling it
LanguageDisplayNamewould be misleading, because it isn't the thing most people want. It should be called something likeLanguageSubtagDisplayName.- Note: in icu_locale_core, we have a
Languagetype which is a subtag, but it is scoped inside thesubtagmodule to emphasize what it is and is not.
- Note: in icu_locale_core, we have a
- ECMA-402 calls this "language".
In the design doc, I listed LanguageDisplayName as being the thing that formats a LanguageIdentifier.
I would like to stick with the design doc for this PR and do this bikeshed in a follow-up discussion.
There was a problem hiding this comment.
- A language display name formatter that formats only the language subtag is not well-defined in CLDR. We always read the other subtags to pick the correct dialect name.
- If we did implement a formatter with language-subtag-only behavior, calling it
LanguageDisplayNamewould be misleading, because it isn't the thing most people want. It should be called something likeLanguageSubtagDisplayName.
- Note: in icu_locale_core, we have a
Languagetype which is a subtag, but it is scoped inside thesubtagmodule to emphasize what it is and is not.
The naming of a potential language subtag formatter shouldn't impact the name here.
This formatter does what most regular people expect for a type named
LanguageDisplayName.
Both LanguageIdentifier and Locale are what most regular people expect a "language" to be. But we should be exact with terminology.
Not an argument, ECMA-402 has a lot of differing naming decisions
In the design doc, I listed LanguageDisplayName as being the thing that formats a LanguageIdentifier.
I was never given this doc for approval, we could have had the discussion there, so we're having it now.
There was a problem hiding this comment.
The naming of a potential language subtag formatter shouldn't impact the name here.
I raised that point to say that there is nothing else competing for the name LanguageDisplayName and nothing else that I would expect a user to assume when they see a type named LanguageDisplayName.
Not an argument, ECMA-402 has a lot of differing naming decisions
ECMA-402 naming decisions are always an input; we just often have other inputs that are stronger.
In an effort to unblock progress, I'll ask @sffc-bot to change this to LanguageIdentifierDisplayName (including in README.md), erring on the side of being more precise, and open a follow-up issue to bikeshed.
There was a problem hiding this comment.
Renamed to LanguageIdentifierDisplayName in 58dce83c6d.
| &locale, | ||
| ), | ||
| ..Default::default() | ||
| })? |
There was a problem hiding this comment.
please add a TODO that this is the location where we'd fall back to the code
There was a problem hiding this comment.
Fixed. Added TODO(#8100) pointing to the follow-up fallback issue.
There was a problem hiding this comment.
the "location where we'd fall back to the code" needs to be created with allow_identifier_not_found first, like you do for variants below
There was a problem hiding this comment.
Refactored base language load to use allow_identifier_not_found in 06dca7fa81.
- Optimize dialect resolution key formatting using TinyAsciiStr concat (zero allocation) - Optimize qualifier list formatting using a loop with the extracted separator (zero allocation) - Use allow_identifier_not_found for speculative provider loads - Mutate subject langid to simplify dialect consumption logic - Optimize variant payloads using DataPayloadOr (zero allocation) - Rename locale to formatting_locale and locale_id to subject - Link fallback-to-code behavior to follow-up issue unicode-org#8100 in README.md and add TODOs Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
| assert_eq!(result.capacity(), result.len()); | ||
| } | ||
|
|
||
| // Test our new single formatter implementation (only for cases that are in the data, i.e. not "xx") |
There was a problem hiding this comment.
Deferred to #8100 since fallback is being implemented as a follow-up.
| &locale, | ||
| ), | ||
| ..Default::default() | ||
| })? |
There was a problem hiding this comment.
Fixed. Added TODO(#8100) pointing to the follow-up fallback issue.
sffc-bot
left a comment
There was a problem hiding this comment.
🤖 Additional replies regarding parameter naming and helper thoughts.
| /// ``` | ||
| #[allow(dead_code)] | ||
| #[derive(Debug)] | ||
| pub struct LanguageDisplayNameOwned { |
There was a problem hiding this comment.
- A language display name formatter that formats only the language subtag is not well-defined in CLDR. We always read the other subtags to pick the correct dialect name.
- If we did implement a formatter with language-subtag-only behavior, calling it
LanguageDisplayNamewould be misleading, because it isn't the thing most people want. It should be called something likeLanguageSubtagDisplayName.
- Note: in icu_locale_core, we have a
Languagetype which is a subtag, but it is scoped inside thesubtagmodule to emphasize what it is and is not.
The naming of a potential language subtag formatter shouldn't impact the name here.
This formatter does what most regular people expect for a type named
LanguageDisplayName.
Both LanguageIdentifier and Locale are what most regular people expect a "language" to be. But we should be exact with terminology.
Not an argument, ECMA-402 has a lot of differing naming decisions
In the design doc, I listed LanguageDisplayName as being the thing that formats a LanguageIdentifier.
I was never given this doc for approval, we could have had the discussion there, so we're having it now.
| let lang_str = subject.language.to_tinystr(); | ||
| let script_str = script.to_tinystr(); | ||
| let region_str = region.to_tinystr(); | ||
| let hyphen = tinystr::tinystr!(1, "-"); |
There was a problem hiding this comment.
you're right. I'll walk through the code in more detail (and maybe rewrite parts of it) before sending it for review again. sorry
| let attr1: tinystr::TinyAsciiStr<16> = lang_str.concat(hyphen); | ||
| let attr2: tinystr::TinyAsciiStr<16> = attr1.concat(script_str); | ||
| let attr3: tinystr::TinyAsciiStr<16> = attr2.concat(hyphen); | ||
| let attr: tinystr::TinyAsciiStr<16> = attr3.concat(region_str); |
There was a problem hiding this comment.
if things are hard to name, don't name them
| let attr1: tinystr::TinyAsciiStr<16> = lang_str.concat(hyphen); | |
| let attr2: tinystr::TinyAsciiStr<16> = attr1.concat(script_str); | |
| let attr3: tinystr::TinyAsciiStr<16> = attr2.concat(hyphen); | |
| let attr: tinystr::TinyAsciiStr<16> = attr3.concat(region_str); | |
| let attr = lang_str.concat::<16>(hyphen).concat::<16>(script_str).concat::<16>(hyphen).concat::<16>(region_str); |
There was a problem hiding this comment.
Implemented using chained turbofish in 56502210f0.
| DataMarkerAttributes::try_from_str(attr.as_str()) | ||
| .map_err(|_| DataError::custom("Invalid dialect attr"))?, |
There was a problem hiding this comment.
issue: this code should be in an infallible LocaleNamesLanguageMediumV1::make_attributes(Language, Script, Region)
There was a problem hiding this comment.
omg, why did the ai fail at refactoring this when it did all the others? It is usually is good at refactors. I'm trusting it less and less. 🤦♂️
There was a problem hiding this comment.
Implemented infallible make_attributes helper in 56502210f0.
| .load(DataRequest { | ||
| id: DataIdentifierBorrowed::for_marker_attributes_and_locale( | ||
| DataMarkerAttributes::try_from_str(locale_id.language.as_str()) | ||
| DataMarkerAttributes::try_from_str(subject.language.as_str()) |
There was a problem hiding this comment.
Used make_attributes here too in 56502210f0.
| &locale, | ||
| ), | ||
| ..Default::default() | ||
| })? |
There was a problem hiding this comment.
the "location where we'd fall back to the code" needs to be created with allow_identifier_not_found first, like you do for variants below
| metadata.silent = true; | ||
| match provider.load(DataRequest { id, metadata }) { | ||
| Ok(response) => DataPayloadOr::from_payload(response.payload), | ||
| Err(DataError { |
There was a problem hiding this comment.
allow_identifier_not_found
There was a problem hiding this comment.
Refactored to use allow_identifier_not_found in 9704a84db2.
| } | ||
| } | ||
|
|
||
| let variant_payloads = if loaded_variants.len() == 1 { |
There was a problem hiding this comment.
at this point you have already allocated a vec, there's little point in deallocating it now
There was a problem hiding this comment.
Optimized variant loading to be completely zero-allocation for 0 and 1 variant cases in bf42403202.
| One(&'a str), | ||
| Slice(&'a [DataPayload<LocaleNamesVariantMediumV1>]), |
There was a problem hiding this comment.
these are two different payloads, maybe be explicit about that. One(&'a str) borrows from data, which is 'static in the compiled data case, but Slice(&'a ...) borrows from the owned name
There was a problem hiding this comment.
Added second lifetime to BorrowedVariants and LanguageIdentifierDisplayName in 7983114639.
There was a problem hiding this comment.
@sffc-bot you did a very bad job at this change. Your second lifetime is meaningless because it is always bound to the same lifetime as the first. I will fix it for you.
| assert_writeable_eq!(lang_name, "Traditional Chinese (Hong Kong SAR China)"); | ||
| } | ||
|
|
||
| #[cfg(any())] |
There was a problem hiding this comment.
nit: disable tests with #[ignore]
There was a problem hiding this comment.
Replaced with #[ignore] in bc254e1ca7.
Rename LanguageDisplayName and LanguageDisplayNameOwned to LanguageIdentifierDisplayName and LanguageIdentifierDisplayNameOwned respectively across the codebase (including code, docs, and tests) for greater precision as requested. Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Implement `LocaleNamesLanguageMediumV1::make_attributes` as an infallible helper in provider.rs to format subtag combinations onto the stack with zero heap allocations using chained TinyAsciiStr concat calls. Refactor language.rs to delegate all key construction to this helper, significantly simplifying and drying up the constructor. Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Refactor script and region display name loading in the constructor to use `allow_identifier_not_found()?` instead of manual matching on `DataErrorKind::IdentifierNotFound`, ensuring consistency across all speculative data loads. Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Refactor base language loading in the fallback path to use `allow_identifier_not_found()?`, returning an `Option` instead of failing immediately. Temporarily convert a `None` result back to a `DataError` with a `TODO(unicode-org#8100)` comment, isolating the exact location where the future fallback-to-code logic will live. Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Optimize Step 4 (variant display name loading) in the constructor by matching on `subject.variants.len()`. For subjects with 0 or 1 variants, the display name is loaded directly into `DataPayloadOr` without ever allocating a heap `Vec`. A `Vec` is only allocated for the rare case where a subject has 2 or more variants. Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ayName Refactor the internal `BorrowedVariants` enum and the public `LanguageIdentifierDisplayName` struct to carry two separate lifetimes: `'a` for the static translation data (`One(&'a str)`) and `'b` for the local owned struct borrow (`Slice(&'b [DataPayload])`). This explicitly represents the different borrow sources in the type system while keeping them unified in `as_borrowed()` via Rust's lifetime elision. Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Replace `#[cfg(any())]` with `#[ignore]` on `test_single_language_display_name_menu`. This ensures the test is always compiled and type-checked during development, preventing it from silently breaking when APIs change, while still skipping it at runtime during normal test runs. Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Replace all `.unwrap()` calls on `DataMarkerAttributes::try_from_str(...)` with `.map_err(|_| DataError::custom(...))?` to satisfy the crate's `#![deny(clippy::unwrap_used)]` lint and ensure safe error propagation. Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
| /// ``` | ||
| #[allow(dead_code)] | ||
| #[derive(Debug)] | ||
| pub struct LanguageDisplayNameOwned { |
There was a problem hiding this comment.
Renamed to LanguageIdentifierDisplayName in 58dce83c6d.
| let lang_str = subject.language.to_tinystr(); | ||
| let script_str = script.to_tinystr(); | ||
| let region_str = region.to_tinystr(); | ||
| let hyphen = tinystr::tinystr!(1, "-"); |
| let attr1: tinystr::TinyAsciiStr<16> = lang_str.concat(hyphen); | ||
| let attr2: tinystr::TinyAsciiStr<16> = attr1.concat(script_str); | ||
| let attr3: tinystr::TinyAsciiStr<16> = attr2.concat(hyphen); | ||
| let attr: tinystr::TinyAsciiStr<16> = attr3.concat(region_str); |
There was a problem hiding this comment.
Implemented using chained turbofish in 56502210f0.
| DataMarkerAttributes::try_from_str(attr.as_str()) | ||
| .map_err(|_| DataError::custom("Invalid dialect attr"))?, |
There was a problem hiding this comment.
Implemented infallible make_attributes helper in 56502210f0.
| .load(DataRequest { | ||
| id: DataIdentifierBorrowed::for_marker_attributes_and_locale( | ||
| DataMarkerAttributes::try_from_str(locale_id.language.as_str()) | ||
| DataMarkerAttributes::try_from_str(subject.language.as_str()) |
There was a problem hiding this comment.
Used make_attributes here too in 56502210f0.
| metadata.silent = true; | ||
| match provider.load(DataRequest { id, metadata }) { | ||
| Ok(response) => DataPayloadOr::from_payload(response.payload), | ||
| Err(DataError { |
There was a problem hiding this comment.
Refactored to use allow_identifier_not_found in 9704a84db2.
| &locale, | ||
| ), | ||
| ..Default::default() | ||
| })? |
There was a problem hiding this comment.
Refactored base language load to use allow_identifier_not_found in 06dca7fa81.
| } | ||
| } | ||
|
|
||
| let variant_payloads = if loaded_variants.len() == 1 { |
There was a problem hiding this comment.
Optimized variant loading to be completely zero-allocation for 0 and 1 variant cases in bf42403202.
| One(&'a str), | ||
| Slice(&'a [DataPayload<LocaleNamesVariantMediumV1>]), |
There was a problem hiding this comment.
Added second lifetime to BorrowedVariants and LanguageIdentifierDisplayName in 7983114639.
| assert_writeable_eq!(lang_name, "Traditional Chinese (Hong Kong SAR China)"); | ||
| } | ||
|
|
||
| #[cfg(any())] |
There was a problem hiding this comment.
Replaced with #[ignore] in bc254e1ca7.
…ierDisplayName" This reverts commit 7983114.
Resolve conflicts in: - components/experimental/src/displaynames/provider.rs: Combine make_attributes helpers. - components/experimental/src/displaynames/single/mod.rs: Adopt Robert's cleanups and macros while preserving single language formatter. - components/experimental/src/displaynames/single/region.rs, script.rs, variant.rs: Adopt cleanups from main, keeping payload fields crate-visible. Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Move the safety explanations regarding infallible `DataMarkerAttributes` construction from the doc comments to inline comments inside the function bodies for the 5 newly merged `make_attributes` helpers in provider.rs, matching the established style. Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Add the `zh-Hant-HK` -> `Traditional Chinese (Hong Kong SAR China)` test case to the main table-driven `test_concatenate` integration test, ensuring it is thoroughly covered by both the multi and single display name formatters. Delete the redundant, duplicate one-off `test_single_language_display_name` and the unimplemented `test_single_language_display_name_menu` tests to keep the integration test suite clean and focused. Co-authored-by: Gemini <176961590+gemini-code-assist[bot]@users.noreply.github.com>
|
ok I think I got it all, let's merge this. I'll keep working on TryWriteable in separate PRs. |
Depends on #8085
Implements the single display names component for Languages, bringing them up to the planned specification. This PR lands the core implementation of:
LanguageDisplayNameOwned&LanguageDisplayName: Supports standard and dialect formatting (dialect resolution is fully implemented). Currently limited to the defaultMediumwidth.The
QualifiersWriteablehelper makes it all zero-copy.This does not implement menu names. That will be in an upcoming PR.
Addresses #7824 and #7825.
🤖 This pull request was created by an AI agent working with @sffc.
Open Questions
LanguageDisplayNameOwned::try_new(prefs, locale_id, options), we have placedoptionslast. This is becauseoptionsbehaves like a trailing optional bag (similar to varargs in other languages). Does this match the preferred style for ICU4X constructors, or shouldoptionsbe placed elsewhere?ScriptDisplayNameOwned,RegionDisplayNameOwned,VariantDisplayNameOwned, andLanguageDisplayNameOwnedimplementWriteable(by forwarding to their borrowed counterparts via.as_borrowed()). Should owned types implementWriteabledirectly, or should users be forced to call.as_borrowed()to format them? KeepingWriteableon owned types is convenient but increases API surface. We should decide if we want to keep this pattern or deprecate it.Script,Region,Variant,Language) fail-fast in the constructor (try_newreturnsErr(DataError)) if the specific subtag data is missing from the provider (e.g.,xxor an untranslated language). Should we redesign them to support fallback to the code (similar to the multi formatter) by storing the input identifier and making the payloads optional? Or is the current fail-fast behavior preferred for single formatters? If we choose fallback, we will need to update the existingRegionandScriptformatters in a future PR to maintain consistency.Changelog
icu_experimental:displaynames::single::LanguageDisplayNameOwnedandLanguageDisplayNamefor formatting language display names.