# Strings and Formatting

Strings in Silo are UTF-8, byte-indexed at the storage level but
codepoint-oriented at the API level. This chapter covers the
ordinary string operations, the `format` macro, and the
string-pattern machinery that unifies formatting and parsing.

## The basics

A string literal is a double-quoted UTF-8 sequence. The usual
escapes work:

```silo
"hello world"
"tab:\there\n"        # \n \r \t \\ \" \0
"\u{1F600}"           # Unicode escape — 😀
```

`.len` returns the **byte** length in O(1):

```silo
"hello" .len          # ⌊5⌉
"日本語" .len          # ⌊9⌉  (3 codepoints × 3 bytes each)
```

`.codepoints` gives you an `Iterator` that yields one
`Codepoint` per scalar value:

```silo
"日本語" .codepoints   # iterator yielding '日' '本' '語'
```

The byte-vs-codepoint split matters because a lot of
codepoint-looking operations on UTF-8 are really byte
operations. Silo keeps them visibly separate so you pick
deliberately.

Concatenation uses `+`, which is just the `Add` trait impl for
`(Add Str Str | Str)`:

```silo
"hello" " " + "world" +    # ⌊"hello world"⌉
```

## String patterns are a type

Here's the distinctive bit. **A string literal containing `{}`
holes isn't a `Str` — it's a `Pattern`.** The compiler reads
the holes and infers a `Pattern` shape from their declared
types:

```silo
"hello"                    # ⌊"hello"⌉ : Str
"{} is {} years old"       # a value of type (Pattern Str Int)
r"no {} here"              # ⌊"no {} here"⌉ : Str (raw: braces literal)
```

`(Pattern Str Int)` is "a format pattern that takes a `Str`
and an `Int`." It's a first-class type — you can bind it, pass
it around, store it in a record. And because it's a real type,
the compiler can check every use of it.
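
To make "a real type the compiler can check" concrete, here's a minimal Python model of a typed pattern. It's a sketch under loud assumptions: the `Pattern` class, its `hole_types` field, and its `format` method are hypothetical stand-ins invented for illustration, not Silo's implementation, and Python checks at run time what Silo is described as checking at compile time. Only bare `{}` holes are modelled.

```python
import re
from dataclasses import dataclass

HOLE = re.compile(r"\{\}")  # assumption: only bare `{}` holes are modelled


@dataclass(frozen=True)
class Pattern:
    """Hypothetical typed pattern: a template plus one type per hole."""

    template: str
    hole_types: tuple

    def __post_init__(self):
        # The declared shape must agree with the holes in the template.
        holes = len(HOLE.findall(self.template))
        if holes != len(self.hole_types):
            raise TypeError(
                f"template has {holes} holes, {len(self.hole_types)} types declared"
            )

    def format(self, *args) -> str:
        # Check the argument count and each argument's type against the shape.
        if len(args) != len(self.hole_types):
            raise TypeError(f"expected {len(self.hole_types)} values, got {len(args)}")
        for arg, expected in zip(args, self.hole_types):
            if not isinstance(arg, expected):
                raise TypeError(f"expected {expected.__name__}, got {type(arg).__name__}")
        # Fill the holes left to right.
        rendered = iter(str(arg) for arg in args)
        return HOLE.sub(lambda _match: next(rendered), self.template)


greeting = Pattern("{} is {} years old", (str, int))
print(greeting.format("Alice", 30))  # Alice is 30 years old
```

Because `greeting` is an ordinary value, it can be bound, passed around, or stored like any other, and its hole shape travels with it.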

## `format` and `parse` are bidirectional

A typed `Pattern` drives both directions:

```silo
"Alice" 30 "{} is {} years old" format
# ⌊"Alice is 30 years old"⌉

"Alice is 30 years old" "{} is {} years old" parse
# ⌊"Alice" 30⌉  the two captures as Str, Int
```

`format` fills the holes from the stack; `parse` reads the
holes off a source string and pushes the captures. Same
pattern value, opposite directions, same grammar.

The compiler checks the template at compile time: hole count
must match the number of stack values consumed, and each
hole's expected type must match what's on the stack (or match
the source string's structure for `parse`).

This is a real unification, not a syntactic convenience.
Silo's pattern machinery is the same machinery the `.sil`
localisation format uses
([chapter 14](./14-localization.simd)), which is why a
localised message can be round-tripped.

## Typed format holes

A hole can name a **format trait** instead of being bare. The
trait determines how the value is rendered (or parsed), and
the hole's type is constrained to any value implementing it:

```silo
255 "{Hex}" format       # ⌊"FF"⌉
"FF" "{Hex}" parse       # ⌊Some(255)⌉
42 "{Bin}" format        # ⌊"101010"⌉
```

Format traits are **bidirectional by construction**: each
defines both a `value -> Str` direction and a
`Str -> (Option value)` direction. That's what makes `format`
and `parse` interchangeable on the same pattern.
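
One way to picture "bidirectional by construction" is a pair of functions that invert each other. The sketch below models two format traits in Python. The class layout and the `round_trips` helper are assumptions made for illustration rather than Silo's trait definitions, and it handles non-negative integers only. `Optional[int]` stands in for `(Option value)`.

```python
from typing import Optional


class Hex:
    """Hypothetical format trait: uppercase hexadecimal, both directions."""

    DIGITS = "0123456789ABCDEFabcdef"

    @staticmethod
    def render(value: int) -> str:
        return format(value, "X")

    @staticmethod
    def parse(text: str) -> Optional[int]:
        # Reject anything that is not purely hex digits, so parse never guesses.
        if text and all(ch in Hex.DIGITS for ch in text):
            return int(text, 16)
        return None


class Bin:
    """Hypothetical format trait: binary, both directions."""

    @staticmethod
    def render(value: int) -> str:
        return format(value, "b")

    @staticmethod
    def parse(text: str) -> Optional[int]:
        if text and all(ch in "01" for ch in text):
            return int(text, 2)
        return None


def round_trips(trait, value: int) -> bool:
    """True when parsing the rendered text gives back the original value."""
    return trait.parse(trait.render(value)) == value


print(Hex.render(255))       # FF
print(Hex.parse("FF"))       # 255
print(Hex.parse("zz"))       # None
print(Bin.render(42))        # 101010
print(round_trips(Hex, 255)) # True
```

The parse direction returns an optional because not every string is a valid rendering; the render direction has no such failure case.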

The standard-library format traits:

| Trait       | Rendering               | Applies to                              |
|-------------|-------------------------|-----------------------------------------|
| `Hex`       | Hexadecimal             | Integer types                           |
| `Dec`       | Decimal (no grouping)   | Integer types (default)                 |
| `Oct`       | Octal                   | Integer types                           |
| `Bin`       | Binary                  | Integer types                           |
| `Decimal`   | Locale-aware decimal    | Numeric types (ICU4X + `CurrentLocale`) |
| `Percent`   | Percentage              | Float/Numeric (ICU4X)                   |
| `Currency`  | Locale-aware currency   | Numeric (ICU4X + `CurrentLocale`)       |
| `Debug`     | Structural dump         | All types (derivable)                   |

A format hole has two orthogonal layers, separated by an
optional `:`:

```
{ RENDERING : LAYOUT }
```

Both halves are optional.

**Rendering** (before the `:`) is a Silo expression that
produces a formatter value. It's postfix — trait-specific
parameters first, then the trait name that consumes them. If
omitted, the hole uses `Display`.

**Layout** (after the `:`) is Rust-style field-fitting:
`[fill][align][sign][#][0][width][.precision][?]`. It's
applied to the rendered string to fit a field. If omitted, no
padding or precision is applied.

Silo drops most of Rust's trailing *type character* shortcuts.
There's no `:x` / `:X` / `:b` / `:o` — rendering traits
(`{Hex}`, `{UpperHex}`, `{Bin}`, `{Oct}`) cover that ground.
Two Rust conventions are kept:

- `?` — the `{Debug}` shortcut. `{:?}` means "Debug", same as
  `{Debug}`, and combines with layout (`{:>10?}` is Debug +
  right-align).
- `#` — the alternate-form flag. It's a request the rendering
  trait sees at render time; `Hex` responds by prepending
  `0x`, `Bin` with `0b`, `Oct` with `0o`. Traits that don't
  recognise it ignore it.

Keeping the layout half almost entirely about field fitting
means `Decimal`, `Hex`, and custom traits all share the same
padding machinery without reimplementing it.
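
The two-layer split can be modelled in a few lines: render first, then fit the result into a field. This Python sketch is an illustration, not Silo's implementation. `render_hole`, `hex_render`, and the `LAYOUT` regex are hypothetical names, only the fill, align, `#`, and width parts of the layout grammar are handled, and left alignment as the default is an arbitrary choice made here.

```python
import re

# Assumption: a cut-down layout grammar, [fill][align][#][width] only.
LAYOUT = re.compile(
    r"(?:(?P<fill>.)?(?P<align>[<>^]))?"  # optional fill, then an alignment char
    r"(?P<alt>#)?"                        # alternate-form flag
    r"(?P<width>[1-9]\d*)?"               # minimum field width
)


def hex_render(value: int, alternate: bool) -> str:
    """Rendering layer: decides what the text is, including the 0x prefix."""
    digits = format(value, "X")
    return "0x" + digits if alternate else digits


def render_hole(value, render, layout: str = "") -> str:
    """Render the value, then fit the rendered text into the field."""
    spec = LAYOUT.fullmatch(layout)
    if spec is None:
        raise ValueError(f"unsupported layout spec: {layout!r}")
    # The renderer sees the `#` flag; everything else is pure field fitting.
    text = render(value, alternate=spec["alt"] is not None)
    width = int(spec["width"] or 0)
    fill = spec["fill"] or " "
    align = spec["align"] or "<"
    if align == "<":
        return text.ljust(width, fill)
    if align == ">":
        return text.rjust(width, fill)
    return text.center(width, fill)


print(render_hole(255, hex_render))         # FF
print(render_hole(255, hex_render, "#"))    # 0xFF
print(render_hole(255, hex_render, "0>8"))  # 000000FF
```

The padding code never inspects what the renderer produced, which is the point of keeping layout separate: any trait that yields a string gets the same field fitting.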

```silo
"{}"                         # default Display, no layout
"{:>10}"                     # Display, right-aligned width 10
"{:0>8}"                     # Display, zero-padded width 8
"{:?}"                       # Debug shortcut — "42"
"{:>10?}"                    # Debug + right-aligned width 10
"{Debug}"                    # Debug trait (long form)
"{Hex}"                      # Hex trait
"{Hex:#}"                    # Hex + alternate form — "0xFF"
"{Hex:>#10}"                 # Hex alternate form, right-aligned width 10 — "      0xFF"
"{Hex:>10}"                  # Hex, no prefix — "        FF"
"{Hex:0>8}"                  # zero-padded — "000000FF"
"{Decimal}"                  # locale-aware Decimal
"{Decimal:.2}"               # precision 2
"{Decimal:>10.2}"            # precision 2, right-aligned width 10
"{\"EUR\" Currency}"         # Currency(EUR) with defaults
"{\"EUR\" Currency:>10.2}"   # Currency(EUR), precision 2, right-aligned 10
```

Because the rendering side is real Silo, you can construct a
formatter outside a hole and reuse it:

```silo
"EUR" Currency pop-> euros
99.99 euros .render          # ⌊"€99.99"⌉ (respecting CurrentLocale)
```

The `format` macro's job is to call the formatter's `.render`
method on the value, then apply the layout spec.
The `{…}` syntax is sugar for that two-step.

### Trait parameters vs layout parameters

The split falls out cleanly once you ask "does this change the
text content, or only how it sits in a field?":

- **Trait parameter** (pre-`:`): changes what the text is.
  Currency code, locale hint, rounding mode.
- **Layout parameter** (post-`:`): changes only how the text
  fits. Width, alignment, fill, and precision (because
  precision is fundamentally "how long is the rendered string"
  and belongs with the field-fitting layer).

Custom format traits plug into the rendering side: define a
record for the formatter, implement the format-trait methods
on it, and the layout spec applies to whatever string you
produce.

## ICU4X at the core

The locale-aware half of the trait table above isn't a
convenience layer — it's the whole model.
Silo's standard library integrates ICU4X (`icu` crate) as a
foundational dependency:

- `Pattern` is backed by `icu::pattern::Pattern<MultiNamedPlaceholder>`
  under the hood. The pattern parser, the interpolation
  engine, and the named-placeholder layout all come from
  ICU4X; Silo adds typed holes on top.
- `Locale` is a real ICU locale value (language, script,
  region, variant, Unicode extensions), not a symbol tag.
  `(Locale .try-from)` parses a BCP-47 tag.
- The `Decimal`, `Percent`, and `Currency` format traits
  dispatch to ICU4X's `FixedDecimalFormatter`,
  `PercentFormatter`, and `CurrencyFormatter` respectively.
- Date, time, and calendar formatting
  ([chapter 15](./15-temporal.simd)) uses ICU4X calendars and
  formatters.
- Collation, case mapping, and Unicode segmentation all go
  through ICU4X primitives.

Practically, this means Silo has proper Unicode and
locale awareness everywhere you'd want them, without each
application having to shop for a library. `Str .to-upper` is
locale-sensitive. `Decimal` formatting picks the right
grouping character and decimal mark for the current locale.
Dates render with the right calendar for the locale.

The trade-off is a runtime commitment to ICU4X data. Hosts that
need a smaller footprint can swap in a cut-down data profile,
but the API doesn't change.

## Locale-aware formatting in practice

Locale dispatch is wired through the
**:gloss[`CurrentLocale` aspect](./A1-glossary.simd#aspect)**.
The default locale is whatever the host installs at startup
(usually `en-US` or the system locale); `:with` overrides it
for a scope:

```silo
1234567.89 "{Decimal}" format
# ⌊"1,234,567.89"⌉  under the default locale

'de-DE (Locale .try-from) .unwrap :with CurrentLocale
  1234567.89 "{Decimal}" format
  # ⌊"1.234.567,89"⌉  de-DE grouping and decimal mark
:end
```

Same pattern, same `format` call; the difference is purely the
installed locale aspect. That's the whole API.

## Ordinary string operations

Alongside the pattern-based API, strings support the usual
collection-style operations. A few common ones:

```silo
"a,b,c" "," .split                 # ⌊[ "a" "b" "c" ]⌉
"hello world" "world" .contains    # ⌊true⌉
" hello " .trim                    # ⌊"hello"⌉
"hello" .to-upper "HELLO" .=       # ⌊true⌉
```

Most of them are trait methods — `.split` comes from the
`Splittable` surface, `.contains` from `Searchable`, and so
on. A full reference lives in `spec/stdlib/str.md`.

## Raw strings and byte strings

Two literal prefixes handle the edge cases:

```silo
r"no {} here"          # Str — raw; braces are literal, escapes ignored
b"raw bytes \x00"      # Bytes — UTF-8-safe byte string
```

Use raw strings when you want a `Str` that doesn't trigger the
`Pattern` typing — regular expressions, file paths on Windows,
anywhere `{` and `}` are part of the literal. Use byte strings
when you want raw bytes rather than a UTF-8 sequence.

## Key points

- Strings are UTF-8. `.len` is bytes; `.codepoints` gives you
  the codepoint iterator. The two are always visibly
  different.
- A string literal with `{}` holes is a `(Pattern …)` typed
  value, not a `Str`. The type encodes the hole count and
  types. Raw strings (`r"…"`) opt out of pattern typing when
  you want literal braces.
- `format` and `parse` are bidirectional on the same pattern.
  Compile-time checks ensure hole count and types line up in
  both directions.
- A format hole splits into **rendering** (a Silo postfix
  expression producing a formatter, left of `:`) and **layout**
  (Rust-style width/align/fill/precision spec, right of `:`).
  Either half is optional. The rendering expression is
  first-class Silo — construct a formatter, bind it, reuse it
  with `.render`.
- ICU4X is a foundational dependency. `Pattern`, `Locale`,
  locale-aware number/currency formatting, calendar handling,
  collation, and Unicode segmentation are all ICU4X-backed.
- Locale dispatch runs through the `CurrentLocale` aspect.
  `:with` overrides it for a scope.

Next: [localization](./14-localization.simd) — `.sil` message
catalogues, typed placeholders, and how they plug into the
pattern machinery above.