
Chapter 12 Strings and Formatting


Strings in Silo are UTF-8, byte-indexed at the storage level but codepoint-oriented at the API level. This chapter covers the ordinary string operations, the format macro, and the string-pattern machinery that unifies formatting and parsing.

The basics

A string literal is a double-quoted UTF-8 sequence. The usual escapes work:

"hello world"
"tab:\there\n"                 # \n \r \t \\ \" \0
"\u{1F600}"                    # Unicode escape: 😀

.len returns the byte length in O(1):

"hello" .len                   # ⌊5⌉
"日本語" .len                   # ⌊9⌉  (3 codepoints × 3 bytes each)

.codepoints gives you an Iterator that yields one Codepoint per scalar value:

"日本語" .codepoints           # iterator yielding '日' '本' '語'

The byte-vs-codepoint split matters because a lot of codepoint-looking operations on UTF-8 are really byte operations. Silo keeps them visibly separate so you pick deliberately.
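
The same split can be observed in any language that exposes both views of a UTF-8 string. The sketch below uses Python purely as a stand-in (it is not Silo); byte_len and codepoints are hypothetical helper names chosen to mirror .len and .codepoints.

```python
def byte_len(s: str) -> int:
    """Length of the UTF-8 encoding, analogous to Silo's .len."""
    return len(s.encode("utf-8"))


def codepoints(s: str):
    """Iterate one Unicode scalar value at a time, analogous to .codepoints."""
    return iter(s)


text = "日本語"
print(byte_len(text))                # 9 bytes
print(len(list(codepoints(text))))   # 3 codepoints

# Cutting the byte view at offset 4 lands inside the second codepoint,
# so the decoded result ends in the replacement character U+FFFD.
truncated = text.encode("utf-8")[:4].decode("utf-8", errors="replace")
print(truncated)
```

Unlike Silo's .len, this Python helper re-encodes the string and is therefore not O(1); it only illustrates why the two lengths differ.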

Concatenation uses +, which is just the Add trait impl for (Add Str Str | Str):

"hello" " " + "world" +        # ⌊"hello world"⌉

String patterns are a type

Here's the distinctive bit. A string literal containing {} holes isn't a Str; it's a Pattern. The compiler reads the holes and infers a Pattern shape from their declared types:

"hello"                         # ⌊"hello"⌉ : Str
"{} is {} years old"            # a value of type (Pattern Str Int)
r"no {} here"                   # ⌊"no {} here"⌉ : Str (raw: braces literal)

(Pattern Str Int) is "a format pattern that takes a Str and an Int." It's a first-class type: you can bind it, pass it around, store it in a record. And because it's a real type, the compiler can check every use of it.

format and parse are bidirectional

A typed Pattern drives both directions:

"Alice" 30 "{} is {} years old" format
# ⌊"Alice is 30 years old"⌉

"Alice is 30 years old" "{} is {} years old" parse
# ⌊"Alice" 30⌉       the two captures as Str, Int

format fills the holes from the stack; parse reads the holes off a source string and pushes the captures. Same pattern value, opposite directions, same grammar.

The compiler checks the template at compile time: hole count must match the number of stack values consumed, and each hole's expected type must match what's on the stack (or match the source string's structure for parse).
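
To make the two directions concrete, here is a rough model in Python (a stand-in, not Silo). The Pattern class, its constructor signature, and the regex-based parse are all assumptions made for illustration. The checks that Silo performs at compile time happen at runtime here, and only Str and Int holes are modelled.

```python
import re


class Pattern:
    """Hypothetical stand-in for a typed pattern such as (Pattern Str Int)."""

    def __init__(self, template: str, *types: type):
        self.literals = template.split("{}")
        # Hole count must agree with the declared hole types.
        if len(self.literals) - 1 != len(types):
            raise TypeError("hole count does not match declared types")
        self.types = types

    def format(self, *values) -> str:
        if len(values) != len(self.types):
            raise TypeError("wrong number of values")
        for value, expected in zip(values, self.types):
            if not isinstance(value, expected):
                raise TypeError(f"expected {expected.__name__}, got {value!r}")
        out = [self.literals[0]]
        for value, literal in zip(values, self.literals[1:]):
            out.append(str(value))
            out.append(literal)
        return "".join(out)

    def parse(self, source: str):
        # Each hole becomes a capture group; literal text must match exactly.
        groups = {str: r"(.+?)", int: r"([+-]?\d+)"}
        regex = re.escape(self.literals[0])
        for expected, literal in zip(self.types, self.literals[1:]):
            regex += groups[expected] + re.escape(literal)
        match = re.fullmatch(regex, source)
        if match is None:
            return None
        return tuple(t(text) for t, text in zip(self.types, match.groups()))


person = Pattern("{} is {} years old", str, int)
print(person.format("Alice", 30))
print(person.parse("Alice is 30 years old"))
```

The point of the sketch is that one value (person) carries enough information, literals plus hole types, to run in either direction.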

This is a real unification, not a syntactic convenience. Silo's pattern machinery is the same machinery the .sil localisation format uses (chapter 14), which is why a localised message can be round-tripped.

Typed format holes

A hole can name a format trait instead of being bare. The trait determines how the value is rendered (or parsed), and the hole's type is constrained to any value implementing it:

255 "{Hex}" format             # ⌊"FF"⌉
"FF" "{Hex}" parse             # ⌊Some(255)⌉
42 "{Bin}" format              # ⌊"101010"⌉

Format traits are bidirectional by construction: each defines both a value -> Str direction and a Str -> (Option value) direction. That's what makes format and parse interchangeable on the same pattern.
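
A format trait with both directions might be modelled as follows. This is a Python sketch under assumptions: the class names Hex and Bin and the method names render and parse are illustrative, None stands in for the empty Option, and only non-negative integers are handled.

```python
from typing import Optional


class Hex:
    """Hypothetical format trait: uppercase hexadecimal, non-negative ints."""

    DIGITS = "0123456789ABCDEFabcdef"

    @staticmethod
    def render(value: int) -> str:
        return format(value, "X")

    @classmethod
    def parse(cls, text: str) -> Optional[int]:
        # Reject empty input and anything that is not a hex digit.
        if not text or any(ch not in cls.DIGITS for ch in text):
            return None
        return int(text, 16)


class Bin:
    """Hypothetical format trait: binary, non-negative ints."""

    @staticmethod
    def render(value: int) -> str:
        return format(value, "b")

    @staticmethod
    def parse(text: str) -> Optional[int]:
        if not text or any(ch not in "01" for ch in text):
            return None
        return int(text, 2)


print(Hex.render(255))   # FF
print(Hex.parse("FF"))   # 255
print(Bin.render(42))    # 101010
```

Because each trait supplies both functions, parse(render(v)) returns v for every value the trait accepts, which is the property that lets one pattern serve both format and parse.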

The standard-library format traits:

Trait      Rendering               Applies to
Hex        Hexadecimal             Integer types
Dec        Decimal (no grouping)   Integer types (default)
Oct        Octal                   Integer types
Bin        Binary                  Integer types
Decimal    Locale-aware decimal    Numeric types (ICU4X + CurrentLocale)
Percent    Percentage              Float/Numeric (ICU4X)
Currency   Locale-aware currency   Numeric (ICU4X + CurrentLocale)
Debug      Structural dump         All types (derivable)

A format hole has two orthogonal layers, separated by an optional colon (:):

{ RENDERING : LAYOUT }

Both halves are optional.

Rendering (before the :) is a Silo expression that produces a formatter value. It's postfix: trait-specific parameters come first, then the trait name that consumes them. If omitted, the hole uses Display.

Layout (after the :) is Rust-style field-fitting: [fill][align][sign][#][0][width][.precision][?]. It's applied to the rendered string to fit a field. If omitted, no padding or precision is applied.

Silo drops most of Rust's trailing type character shortcuts. There's no :x / :X / :b / :o β€” rendering traits ({Hex}, {UpperHex}, {Bin}, {Oct}) cover that ground. Two Rust conventions are kept:

  • ? is the {Debug} shortcut. {:?} means "Debug", same as {Debug}, and combines with layout ({:>10?} is Debug + right-align).
  • # is the alternate-form flag. It's a request the rendering trait sees at render time; Hex responds by prepending 0x, Bin with 0b, Oct with 0o. Traits that don't recognise it ignore it.

Keeping the layout half almost entirely about field fitting means Decimal, Hex, and custom traits all share the same padding machinery without reimplementing it.

"{}"                         # default Display, no layout
"{:>10}"                     # Display, right-aligned width 10
"{:0>8}"                     # Display, zero-padded width 8
"{:?}"                       # Debug shortcut: "42"
"{:>10?}"                    # Debug + right-aligned width 10
"{Debug}"                    # Debug trait (long form)
"{Hex}"                      # Hex trait
"{Hex:#}"                    # Hex + alternate form: "0xFF"
"{Hex:>#10}"                 # Hex alternate form, right-aligned width 10: "      0xFF"
"{Hex:>10}"                  # Hex, no prefix: "        FF"
"{Hex:0>8}"                  # zero-padded: "000000FF"
"{Decimal}"                  # locale-aware Decimal
"{Decimal:.2}"               # precision 2
"{Decimal:>10.2}"            # precision 2, right-aligned width 10
"{\"EUR\" Currency}"         # Currency(EUR) with defaults
"{\"EUR\" Currency:>10.2}"   # Currency(EUR), precision 2, right-aligned 10

Because the rendering side is real Silo, you can construct a formatter outside a hole and reuse it:

"EUR" Currency pop-> euros
99.99 euros .render            # ⌊"€99.99"⌉   (respecting CurrentLocale)

The format macro's job is to thread the formatter through its .render method on the value, then apply the layout spec. The {…} syntax is sugar for that two-step.
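
The two-step can be sketched in Python (a stand-in, not Silo). Everything named here is hypothetical: render_hex plays the rendering trait, fit plays the layout layer, and LAYOUT recognises only the [fill][align][#][width] subset of the grammar. Defaulting to left alignment when no align character is given is an assumption of this sketch.

```python
import re

# Subset of the layout grammar: [fill][align][#][width].
# Sign, 0, precision and ? are left out of this sketch.
LAYOUT = re.compile(
    r"(?:(?P<fill>.)?(?P<align>[<>^]))?(?P<alt>#)?(?P<width>\d+)?"
)


def render_hex(value: int, alternate: bool) -> str:
    """Step 1: the rendering trait produces the text and sees the # flag."""
    return ("0x" if alternate else "") + format(value, "X")


def fit(text: str, fill: str, align: str, width: int) -> str:
    """Step 2: pad the rendered text to the field, knowing nothing of Hex."""
    if align == ">":
        return text.rjust(width, fill)
    if align == "^":
        return text.center(width, fill)
    return text.ljust(width, fill)


def format_hex_hole(value: int, layout: str = "") -> str:
    m = LAYOUT.fullmatch(layout)
    if m is None:
        raise ValueError(f"bad layout spec: {layout!r}")
    text = render_hex(value, alternate=m["alt"] is not None)
    return fit(text, m["fill"] or " ", m["align"] or "<", int(m["width"] or 0))


print(format_hex_hole(255))           # FF
print(format_hex_hole(255, "#"))      # 0xFF
print(format_hex_hole(255, "0>8"))    # 000000FF
print(repr(format_hex_hole(255, ">#10")))
```

Note that fit never inspects the value, only the rendered string, which is why any rendering trait can reuse it unchanged.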

Trait parameters vs layout parameters

The split falls out cleanly once you ask "does this change the text content, or only how it sits in a field?":

  • Trait parameter (pre-:): changes what the text is. Currency code, locale hint, rounding mode.
  • Layout parameter (post-:): changes only how the text fits. Width, alignment, fill, and precision (because precision is fundamentally "how long is the rendered string" and belongs with the field-fitting layer).

Custom format traits plug into the rendering side: define a record for the formatter, implement the format-trait methods on it, and the layout spec applies to whatever string you produce.

ICU4X at the core

The locale-aware half of the trait table above isn't a convenience layer; it's the whole model. Silo's standard library integrates ICU4X (icu crate) as a foundational dependency:

  • Pattern is backed by icu::pattern::Pattern<MultiNamedPlaceholder> under the hood. The pattern parser, the interpolation engine, and the named-placeholder layout all come from ICU4X; Silo adds typed holes on top.
  • Locale is a real ICU locale value (language, script, region, variant, Unicode extensions), not a symbol tag. (Locale .try-from) parses a BCP-47 tag.
  • Decimal, Percent, Currency format traits dispatch to ICU4X's FixedDecimalFormatter, PercentFormatter, and CurrencyFormatter respectively.
  • Date, time, and calendar formatting (chapter 15) use ICU4X calendars and formatters.
  • Collation, case mapping, and Unicode segmentation are all via ICU4X primitives.

Practically, this means Silo has proper Unicode and locale-awareness everywhere you'd want them, without each application having to shop for a library. Str .to-upper is locale-sensitive. Decimal formatting picks the right grouping character and decimal mark for the current locale. Dates render with the right calendar for the locale.

The trade-off is a runtime commitment to ICU4X data. Hosts that need a smaller footprint can swap in a cut-down data profile, but the API doesn't change.

Locale-aware formatting in practice

Locale dispatch is wired through the CurrentLocale aspect. The default locale is whatever the host installs at startup (usually en-US or the system locale); :with overrides for a scope:

1234567.89 "{Decimal}" format
# ⌊"1,234,567.89"⌉   under the default locale

'de-DE (Locale .try-from) .unwrap :with CurrentLocale
  1234567.89 "{Decimal}" format
# ⌊"1.234.567,89"⌉   de-DE grouping and decimal mark
:end

Same pattern, same format call; the difference is purely the installed locale aspect. That's the whole API.
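
The scoped-override behaviour resembles a context variable. Below is a minimal Python model (a stand-in, not Silo): with_locale, format_decimal, and the two-entry SEPARATORS table are hypothetical, and the fixed two fraction digits are an assumption of the sketch. A real implementation would draw its separators from locale data rather than a literal table.

```python
import contextvars
from contextlib import contextmanager

# Hand-written (grouping, decimal mark) pairs for two locales only.
SEPARATORS = {"en-US": (",", "."), "de-DE": (".", ",")}

_current_locale = contextvars.ContextVar("current_locale", default="en-US")


@contextmanager
def with_locale(tag: str):
    """Install tag as the current locale for the duration of the block."""
    token = _current_locale.set(tag)
    try:
        yield
    finally:
        _current_locale.reset(token)  # restore whatever was installed before


def format_decimal(value: float) -> str:
    group, mark = SEPARATORS[_current_locale.get()]
    neutral = f"{value:,.2f}"  # "," as grouping, "." as decimal mark
    # translate swaps both characters in one pass, so they cannot collide.
    return neutral.translate(str.maketrans({",": group, ".": mark}))


print(format_decimal(1234567.89))        # 1,234,567.89
with with_locale("de-DE"):
    print(format_decimal(1234567.89))    # 1.234.567,89
print(format_decimal(1234567.89))        # 1,234,567.89
```

The call site of format_decimal is identical in all three places; only the installed locale differs, and the override ends with the block.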

Ordinary string operations

Alongside the pattern-based API, strings support the usual collection-style operations. A few common ones:

"a,b,c" "," .split             # ⌊[ "a" "b" "c" ]⌉
"hello world" "world" .contains  # ⌊true⌉
"  hello  " .trim              # ⌊"hello"⌉
"hello" .to-upper "HELLO" .=    # ⌊true⌉

Most of them are trait methods: .split comes from the Splittable surface, .contains from Searchable, and so on. A full reference lives in spec/stdlib/str.md.

Raw strings and byte strings

Two literal prefixes handle the edge cases:

r"no {} here"                  # Str: raw; braces are literal, escapes ignored
b"raw bytes \x00"              # Bytes: raw bytes, not a UTF-8 sequence

Use raw strings when you want a Str that doesn't trigger the Pattern typing β€” regular expressions, file paths on Windows, anywhere { and } are part of the literal. Use byte strings when you want raw bytes rather than a UTF-8 sequence.

Key points

  • Strings are UTF-8. .len is bytes; .codepoints gives you the codepoint iterator. The two are always visibly different.
  • A string literal with {} holes is a (Pattern …) typed value, not a Str. The type encodes the hole count and types. Raw strings (r"…") opt out of pattern typing when you want literal braces.
  • format and parse are bidirectional on the same pattern. Compile-time checks ensure hole count and types line up in both directions.
  • A format hole splits into rendering (a Silo postfix expression producing a formatter, left of :) and layout (Rust-style width/align/fill/precision spec, right of :). Either half is optional. The rendering expression is first-class Silo: construct a formatter, bind it, reuse it with .render.
  • ICU4X is a foundational dependency. Pattern, Locale, locale-aware number/currency formatting, calendar handling, collation, and Unicode segmentation are all ICU4X-backed.
  • Locale dispatch runs through the CurrentLocale aspect. :with overrides for a scope.

Next: localisation, covering .sil message catalogues, typed placeholders, and how they plug into the pattern machinery above.