Chapter 12 Strings and Formatting
Strings in Silo are UTF-8, byte-indexed at the storage level but codepoint-oriented at the API level. This chapter covers the ordinary string operations, the format macro, and the string-pattern machinery that unifies formatting and parsing.
The basics
A string literal is a double-quoted UTF-8 sequence. The usual escapes work:
"hello world"
"tab:\there\n" # \n \r \t \\ \" \0
"\u{1F600}" # Unicode escape → 😀
.len returns the byte length in O(1):
"hello" .len # → 5
"日本語" .len # → 9 (3 codepoints × 3 bytes each)
.codepoints gives you an Iterator that yields one Codepoint per scalar value:
"日本語" .codepoints # iterator yielding '日' '本' '語'
The byte-vs-codepoint split matters because a lot of codepoint-looking operations on UTF-8 are really byte operations. Silo keeps them visibly separate so you pick deliberately.
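The same split can be seen in any language that stores text as UTF-8. The sketch below is Python, not Silo, and only illustrates why the two views disagree: the byte count and the codepoint count differ, and cutting at an arbitrary byte offset can land inside a codepoint.

```python
# Python analogy (not Silo): byte view vs. codepoint view of the same text.
text = "日本語"

utf8_bytes = text.encode("utf-8")   # storage-level view
byte_len = len(utf8_bytes)          # 9: three codepoints, three bytes each
codepoints = list(text)             # one item per Unicode scalar value

# Cutting the byte view at offset 2 lands inside the first codepoint,
# so the fragment is no longer valid UTF-8.
try:
    utf8_bytes[:2].decode("utf-8")
    split_is_valid = True
except UnicodeDecodeError:
    split_is_valid = False

print(byte_len, codepoints, split_is_valid)
```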
Concatenation uses +, which is just the Add trait impl for (Add Str Str | Str):
"hello" " " + "world" + # → "hello world"
String patterns are a type
Here's the distinctive bit. A string literal containing {} holes isn't a Str; it's a Pattern. The compiler reads the holes and infers a Pattern shape from their declared types:
"hello" # → "hello" : Str
"{} is {} years old" # a value of type (Pattern Str Int)
r"no {} here" # → "no {} here" : Str (raw: braces literal)
(Pattern Str Int) is "a format pattern that takes a Str and an Int." It's a first-class type: you can bind it, pass it around, store it in a record. And because it's a real type, the compiler can check every use of it.
format and parse are bidirectional
A typed Pattern drives both directions:
"Alice" 30 "{} is {} years old" format
# → "Alice is 30 years old"
"Alice is 30 years old" "{} is {} years old" parse
# → "Alice" 30 (the two captures, as Str and Int)
format fills the holes from the stack; parse reads the holes off a source string and pushes the captures. Same pattern value, opposite directions, same grammar.
The compiler checks the template at compile time: hole count must match the number of stack values consumed, and each hole's expected type must match what's on the stack (or match the source string's structure for parse).
This is a real unification, not a syntactic convenience. Silo's pattern machinery is the same machinery the .sil localisation format uses (chapter 14), which is why a localised message can be round-tripped.
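To make the "one pattern, two directions" idea concrete, here is a minimal Python sketch (not Silo, and not how Silo implements it). The `Pattern` class, its constructor arguments, and its method names are hypothetical. It handles only bare {} holes, and it checks hole counts and types at run time, where Silo does so at compile time.

```python
import re


class Pattern:
    """Hypothetical stand-in for a typed pattern: literal text plus one type per {} hole."""

    def __init__(self, template, hole_types):
        self.parts = template.split("{}")            # literal text around the holes
        if len(self.parts) - 1 != len(hole_types):
            raise TypeError("hole count does not match the declared types")
        self.hole_types = tuple(hole_types)

    def format(self, *values):
        """Fill the holes, left to right, after checking count and types."""
        if len(values) != len(self.hole_types):
            raise TypeError("wrong number of values for this pattern")
        for value, expected in zip(values, self.hole_types):
            if not isinstance(value, expected):
                raise TypeError(f"expected {expected.__name__}, got {type(value).__name__}")
        out = [self.parts[0]]
        for value, literal in zip(values, self.parts[1:]):
            out.append(str(value))
            out.append(literal)
        return "".join(out)

    def parse(self, source):
        """Read the same template in reverse: each hole becomes a capture."""
        regex = "(.+?)".join(re.escape(part) for part in self.parts)
        match = re.fullmatch(regex, source)
        if match is None:
            return None
        try:
            return tuple(t(text) for t, text in zip(self.hole_types, match.groups()))
        except ValueError:
            return None                               # a capture did not convert to its type


person = Pattern("{} is {} years old", (str, int))
line = person.format("Alice", 30)
print(line)                  # Alice is 30 years old
print(person.parse(line))    # ('Alice', 30)
```

The point of the sketch is that `format` and `parse` share one template and one list of hole types, so a mismatch is caught in a single place.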
Typed format holes
A hole can name a format trait instead of being bare. The trait determines how the value is rendered (or parsed), and the hole's type is constrained to any type that implements it:
255 "{Hex}" format # → "FF"
"FF" "{Hex}" parse # → Some(255)
42 "{Bin}" format # → "101010"
Format traits are bidirectional by construction: each defines both a value -> Str direction and a Str -> (Option value) direction. That's what makes format and parse interchangeable on the same pattern.
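A format trait can be pictured as an object that carries both directions. The Python sketch below is an analogy only: the class and method names are hypothetical, and Python's `None` stands in for the empty Option.

```python
from typing import Optional


class Hex:
    """Hypothetical format trait: value -> Str and Str -> Option value."""

    def render(self, value: int) -> str:
        return format(value, "X")        # uppercase hexadecimal digits

    def parse(self, text: str) -> Optional[int]:
        try:
            return int(text, 16)
        except ValueError:
            return None                  # text is not valid hexadecimal


class Bin:
    """Same shape, different base."""

    def render(self, value: int) -> str:
        return format(value, "b")

    def parse(self, text: str) -> Optional[int]:
        try:
            return int(text, 2)
        except ValueError:
            return None


print(Hex().render(255), Hex().parse("FF"), Bin().render(42))
```

Because each trait defines both methods, anything rendered by it can be read back by the same trait.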
The standard-library format traits:
| Trait | Rendering | Applies to |
|---|---|---|
| Hex | Hexadecimal | Integer types |
| Dec | Decimal (no grouping) | Integer types (default) |
| Oct | Octal | Integer types |
| Bin | Binary | Integer types |
| Decimal | Locale-aware decimal | Numeric types (ICU4X + CurrentLocale) |
| Percent | Percentage | Float/Numeric (ICU4X) |
| Currency | Locale-aware currency | Numeric (ICU4X + CurrentLocale) |
| Debug | Structural dump | All types (derivable) |
A format hole has two orthogonal layers, separated by an optional colon:
{ RENDERING : LAYOUT }
Both halves are optional.
Rendering (before the :) is a Silo expression that produces a formatter value. It's postfix: trait-specific parameters come first, then the trait name that consumes them. If omitted, the hole uses Display.
Layout (after the :) is Rust-style field-fitting: [fill][align][sign][#][0][width][.precision][?]. It's applied to the rendered string to fit a field. If omitted, no padding or precision is applied.
Silo drops most of Rust's trailing type character shortcuts. There's no :x / :X / :b / :o; rendering traits ({Hex}, {UpperHex}, {Bin}, {Oct}) cover that ground. Two Rust conventions are kept:
- ?: the {Debug} shortcut. {:?} means "Debug", same as {Debug}, and combines with layout ({:>10?} is Debug + right-align).
- #: the alternate-form flag. It's a request the rendering trait sees at render time; Hex responds by prepending 0x, Bin with 0b, Oct with 0o. Traits that don't recognise it ignore it.
Keeping the layout half almost entirely about field fitting means Decimal, Hex, and custom traits all share the same padding machinery without reimplementing it.
"{}" # default Display, no layout
"{:>10}" # Display, right-aligned width 10
"{:0>8}" # Display, zero-padded width 8
"{:?}" # Debug shortcut → "42"
"{:>10?}" # Debug + right-aligned width 10
"{Debug}" # Debug trait (long form)
"{Hex}" # Hex trait
"{Hex:#}" # Hex + alternate form → "0xFF"
"{Hex:>#10}" # Hex alternate form + right-aligned width 10 → "      0xFF"
"{Hex:>10}" # Hex, no prefix → "        FF"
"{Hex:0>8}" # zero-padded → "000000FF"
"{Decimal}" # locale-aware Decimal
"{Decimal:.2}" # precision 2
"{Decimal:>10.2}" # precision 2, right-aligned width 10
"{\"EUR\" Currency}" # Currency(EUR) with defaults
"{\"EUR\" Currency:>10.2}" # Currency(EUR), precision 2, right-aligned 10
Because the rendering side is real Silo, you can construct a formatter outside a hole and reuse it:
"EUR" Currency pop-> euros
99.99 euros .render # → "€99.99" (respecting CurrentLocale)
The format macro's job is to call the formatter's .render method on the value, then apply the layout spec to the resulting string. The {…} syntax is sugar for that two-step.
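The two-step can be sketched in Python (an analogy, not Silo's implementation). The `Hex` class and `fill_hole` function are hypothetical names, and the alternate-form flag is passed as a plain argument rather than parsed out of a layout string. Python's own string format spec does the field fitting here because its fill, alignment, and width fields behave much like the Rust-style ones described above.

```python
class Hex:
    """Hypothetical formatter: decides what the text is."""

    def render(self, value: int, alternate: bool = False) -> str:
        sign = "-" if value < 0 else ""
        prefix = "0x" if alternate else ""      # response to the alternate-form request
        return sign + prefix + format(abs(value), "X")


def fill_hole(value, formatter, layout: str = "", alternate: bool = False) -> str:
    """Step 1: render with the formatter. Step 2: fit the rendered text into a field."""
    text = formatter.render(value, alternate)
    return format(text, layout)                 # fill, alignment, width applied to the string


hex_fmt = Hex()                                 # built once, reused for every hole
print(repr(fill_hole(255, hex_fmt)))
print(repr(fill_hole(255, hex_fmt, ">10")))
print(repr(fill_hole(255, hex_fmt, "0>8")))
print(repr(fill_hole(255, hex_fmt, ">10", alternate=True)))
```

Note that `fill_hole` never inspects the digits: padding works the same for any formatter that returns a string.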
Trait parameters vs layout parameters
The split falls out cleanly once you ask "does this change the text content, or only how it sits in a field?":
- Trait parameter (pre-:): changes what the text is. Currency code, locale hint, rounding mode.
- Layout parameter (post-:): changes only how the text fits. Width, alignment, fill, and precision (because precision is fundamentally "how long is the rendered string" and belongs with the field-fitting layer).
Custom format traits plug into the rendering side: define a record for the formatter, implement the format-trait methods on it, and the layout spec applies to whatever string you produce.
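As a rough illustration of the split, here is a Python sketch of a custom formatter record (not Silo). `Temperature` and its `unit` field are invented for this example; `unit` plays the role of a trait parameter because it changes the text itself, while the layout step only pads whatever string comes back.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Temperature:
    """Hypothetical custom formatter record; `unit` acts as a trait parameter."""
    unit: str = "C"                      # "C" or "F"

    def render(self, celsius: float) -> str:
        if self.unit == "F":
            return f"{round(celsius * 9 / 5 + 32)}°F"
        return f"{round(celsius)}°C"


boiling = 100.0
in_c = Temperature("C").render(boiling)  # trait parameter changes the content
in_f = Temperature("F").render(boiling)

# Layout is applied afterwards and never looks inside the text.
padded_c = format(in_c, ">10")
padded_f = format(in_f, "*<10")
print(repr(in_c), repr(in_f), repr(padded_c), repr(padded_f))
```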
ICU4X at the core
The locale-aware half of the trait table above isn't a convenience layer; it's the whole model. Silo's standard library integrates ICU4X (icu crate) as a foundational dependency:
- Pattern is backed by icu::pattern::Pattern<MultiNamedPlaceholder> under the hood. The pattern parser, the interpolation engine, and the named-placeholder layout all come from ICU4X; Silo adds typed holes on top.
- Locale is a real ICU locale value (language, script, region, variant, Unicode extensions), not a symbol tag. (Locale .try-from) parses a BCP-47 tag.
- Decimal, Percent, and Currency format traits dispatch to ICU4X's FixedDecimalFormatter, PercentFormatter, and CurrencyFormatter respectively.
- Date, time, and calendar formatting (chapter 15) use ICU4X calendars and formatters.
- Collation, case mapping, and Unicode segmentation are all via ICU4X primitives.
Practically, this means Silo has proper Unicode and locale-awareness everywhere you'd want them, without each application having to shop for a library. Str .to-upper is locale-sensitive. Decimal formatting picks the right grouping character and decimal mark for the current locale. Dates render with the right calendar for the locale.
The trade-off is a runtime commitment to ICU4X data. Hosts that need a smaller footprint can swap in a cut-down data profile, but the API doesn't change.
Locale-aware formatting in practice
Locale dispatch is wired through the CurrentLocale aspect. The default locale is whatever the host installs at startup (usually en-US or the system locale); :with overrides for a scope:
1234567.89 "{Decimal}" format
# → "1,234,567.89" under an en-US default locale
'de-DE (Locale .try-from) .unwrap :with CurrentLocale
1234567.89 "{Decimal}" format
# → "1.234.567,89" (de-DE grouping and decimal mark)
:end
Same pattern, same format call; the difference is purely the installed locale aspect. That's the whole API.
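The scoping behaviour can be approximated in Python with a context variable (a sketch, not Silo's aspect system). Everything here is hypothetical: `with_locale`, `format_decimal`, and the two-entry `SEPARATORS` table, which stands in for data that Silo gets from ICU4X.

```python
import contextvars
from contextlib import contextmanager

# Hypothetical two-entry table; Silo would take these marks from ICU4X locale data.
SEPARATORS = {
    "en-US": {"group": ",", "decimal": "."},
    "de-DE": {"group": ".", "decimal": ","},
}

current_locale = contextvars.ContextVar("current_locale", default="en-US")


@contextmanager
def with_locale(tag: str):
    """Install `tag` as the current locale for the duration of the block."""
    token = current_locale.set(tag)
    try:
        yield
    finally:
        current_locale.reset(token)      # restore whatever was installed before


def format_decimal(value: float) -> str:
    seps = SEPARATORS[current_locale.get()]
    neutral = f"{value:,.2f}"            # ',' for groups and '.' for the decimal mark
    # Swap both marks in one pass so they do not overwrite each other.
    table = str.maketrans({",": seps["group"], ".": seps["decimal"]})
    return neutral.translate(table)


print(format_decimal(1234567.89))
with with_locale("de-DE"):
    print(format_decimal(1234567.89))
print(format_decimal(1234567.89))        # the override ended with the block
```

The call site is identical in all three cases; only the installed locale differs, which is the property the aspect gives you.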
Ordinary string operations
Alongside the pattern-based API, strings support the usual collection-style operations. A few common ones:
"a,b,c" "," .split # → [ "a" "b" "c" ]
"hello world" "world" .contains # → true
" hello " .trim # → "hello"
"hello" .to-upper "HELLO" .= # → true
Most of them are trait methods: .split comes from the Splittable surface, .contains from Searchable, and so on. A full reference lives in spec/stdlib/str.md.
Raw strings and byte strings
Two literal prefixes handle the edge cases:
r"no {} here" # Str (raw): braces are literal, escapes ignored
b"raw bytes \x00" # Bytes: UTF-8-safe byte string
Use raw strings when you want a Str that doesn't trigger the Pattern typing: regular expressions, file paths on Windows, anywhere { and } are part of the literal. Use byte strings when you want raw bytes rather than a UTF-8 sequence.
Key points
- Strings are UTF-8. .len is bytes; .codepoints gives you the codepoint iterator. The two are always visibly different.
- A string literal with {} holes is a (Pattern …) typed value, not a Str. The type encodes the hole count and types. Raw strings (r"…") opt out of pattern typing when you want literal braces.
- format and parse are bidirectional on the same pattern. Compile-time checks ensure hole count and types line up in both directions.
- A format hole splits into rendering (a Silo postfix expression producing a formatter, left of :) and layout (Rust-style width/align/fill/precision spec, right of :). Either half is optional. The rendering expression is first-class Silo: construct a formatter, bind it, reuse it with .render.
- ICU4X is a foundational dependency. Pattern, Locale, locale-aware number/currency formatting, calendar handling, collation, and Unicode segmentation are all ICU4X-backed.
- Locale dispatch runs through the CurrentLocale aspect. :with overrides for a scope.
Next: localization, covering .sil message catalogues, typed placeholders, and how they plug into the pattern machinery above.