# Strings and Formatting

Strings in Silo are UTF-8, byte-indexed at the storage level but
codepoint-oriented at the API level. This chapter covers the
ordinary string operations, the `format` macro, and the
string-pattern machinery that unifies formatting and parsing.

## The basics

A string literal is a double-quoted UTF-8 sequence. The usual
escapes work:

```silo
"hello world"
"tab:\there\n"                 # \n \r \t \\ \" \0
"\u{1F600}"                    # Unicode escape — 😀
```

`.len` returns the **byte** length in O(1):

```silo
"hello" .len                   # ⌊5⌉
"日本語" .len                   # ⌊9⌉  (3 codepoints × 3 bytes each)
```

`.codepoints` gives you an `Iterator` that yields one
`Codepoint` per scalar value:

```silo
"日本語" .codepoints           # iterator yielding '日' '本' '語'
```

The byte-vs-codepoint split matters because a lot of
codepoint-looking operations on UTF-8 are really byte
operations. Silo keeps them visibly separate so you pick
deliberately.

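To make the byte-vs-codepoint split concrete, here is a minimal Python
sketch (Python, not Silo) of the two measurements. The helper names are
hypothetical: `byte_len` stands in for what `.len` is described as
returning, and `codepoint_len` for counting what `.codepoints` yields.
`is_codepoint_boundary` shows why a byte offset is not automatically a
safe place to cut.

```python
def byte_len(s: str) -> int:
    """Length of the UTF-8 encoding, in bytes."""
    return len(s.encode("utf-8"))


def codepoint_len(s: str) -> int:
    """Number of codepoints (a Python str is a sequence of codepoints)."""
    return len(s)


def is_codepoint_boundary(s: str, byte_offset: int) -> bool:
    """True if cutting the UTF-8 bytes at byte_offset splits no codepoint."""
    data = s.encode("utf-8")
    if byte_offset in (0, len(data)):
        return True
    if not 0 < byte_offset < len(data):
        return False
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (data[byte_offset] & 0xC0) != 0x80


print(byte_len("日本語"), codepoint_len("日本語"))  # 9 3
print(is_codepoint_boundary("日本語", 3))            # True
print(is_codepoint_boundary("日本語", 4))            # False
```

Offset 3 falls between two codepoints, while offset 4 lands inside the
three-byte encoding of the second one.
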
Concatenation uses `+`, which is just the `Add` trait impl for
`(Add Str Str | Str)`:

```silo
"hello" " " + "world" +        # ⌊"hello world"⌉
```

## String patterns are a type

Here's the distinctive bit. **A string literal containing `{}`
holes isn't a `Str` — it's a `Pattern`.** The compiler reads
the holes and infers a `Pattern` shape from their declared
types:

```silo
"hello"                         # ⌊"hello"⌉ : Str
"{} is {} years old"            # a value of type (Pattern Str Int)
r"no {} here"                   # ⌊"no {} here"⌉ : Str (raw: braces literal)
```

`(Pattern Str Int)` is "a format pattern that takes a `Str`
and an `Int`." It's a first-class type — you can bind it, pass
it around, store it in a record. And because it's a real type,
the compiler can check every use of it.

## `format` and `parse` are bidirectional

A typed `Pattern` drives both directions:

```silo
"Alice" 30 "{} is {} years old" format
# ⌊"Alice is 30 years old"⌉

"Alice is 30 years old" "{} is {} years old" parse
# ⌊"Alice" 30⌉       the two captures as Str, Int
```

`format` fills the holes from the stack; `parse` reads the
holes off a source string and pushes the captures. Same
pattern value, opposite directions, same grammar.

The compiler checks the template at compile time: hole count
must match the number of stack values consumed, and each
hole's expected type must match what's on the stack (or match
the source string's structure for `parse`).

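The idea of one pattern value driving both directions can be sketched in
Python (not Silo). Everything here is an assumption made for
illustration: the `Pattern` class and its methods are hypothetical, only
`str` and `int` holes are supported, the checks run at call time rather
than at compile time, and `parse` returns a tuple (or `None` on a
mismatch) instead of pushing captures onto a stack.

```python
import re
from typing import Any, Optional, Tuple


class Pattern:
    """A template with bare `{}` holes and one declared type per hole."""

    # How each supported hole type is captured when parsing.
    _CAPTURE = {int: r"([+-]?\d+)", str: r"(.*?)"}

    def __init__(self, template: str, *types: type) -> None:
        parts = template.split("{}")
        if len(parts) - 1 != len(types):
            raise TypeError(f"{len(parts) - 1} holes but {len(types)} types")
        if any(t not in self._CAPTURE for t in types):
            raise TypeError("this sketch only supports str and int holes")
        self.parts = parts
        self.types = types

    def format(self, *values: Any) -> str:
        # Stand-in for the compile-time check: count and types must line up.
        if len(values) != len(self.types):
            raise TypeError("value count does not match hole count")
        for value, expected in zip(values, self.types):
            if not isinstance(value, expected):
                raise TypeError(f"expected {expected.__name__}, got {value!r}")
        out = [self.parts[0]]
        for value, literal in zip(values, self.parts[1:]):
            out.append(str(value))
            out.append(literal)
        return "".join(out)

    def parse(self, source: str) -> Optional[Tuple[Any, ...]]:
        # Build a regex from the same literal segments and hole types.
        regex = re.escape(self.parts[0])
        for hole_type, literal in zip(self.types, self.parts[1:]):
            regex += self._CAPTURE[hole_type] + re.escape(literal)
        match = re.fullmatch(regex, source)
        if match is None:
            return None
        return tuple(t(g) for t, g in zip(self.types, match.groups()))


person = Pattern("{} is {} years old", str, int)
print(person.format("Alice", 30))             # Alice is 30 years old
print(person.parse("Alice is 30 years old"))  # ('Alice', 30)
```

Because both methods walk the same `parts` and `types`, anything
`format` produces can be fed back through `parse` to recover the
original values, which is the round-trip property the text describes.
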
This is a real unification, not a syntactic convenience.
Silo's pattern machinery is the same machinery the `.sil`
localisation format uses
([chapter 14](./14-localization.simd)), which is why a
localised message can be round-tripped.

## Typed format holes

A hole can name a **format trait** instead of being bare. The
trait determines how the value is rendered (or parsed), and
the hole's type is constrained to any value implementing it:

```silo
255 "{Hex}" format             # ⌊"FF"⌉
"FF" "{Hex}" parse             # ⌊Some(255)⌉
42 "{Bin}" format              # ⌊"101010"⌉
```

Format traits are **bidirectional by construction**: each
defines both a `value -> Str` direction and a `Str -> (Option
value)` direction. That's what makes `format` and `parse`
interchangeable on the same pattern.

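As a rough Python model (not Silo) of "bidirectional by construction",
the sketch below pairs a render direction with a parse direction that
returns `None` on failure, mirroring the `Str -> (Option value)` shape.
`RadixTrait`, `HEX`, and `BIN` are hypothetical names, and the sketch
assumes non-negative integers only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RadixTrait:
    """One object carrying both directions for an integer radix."""

    base: int
    spec: str    # Python format-spec type character used for rendering
    digits: str  # characters accepted when parsing

    def render(self, value: int) -> str:
        if value < 0:
            raise ValueError("this sketch only handles non-negative integers")
        return format(value, self.spec)

    def parse(self, text: str) -> Optional[int]:
        # Reject empty input and any character outside the radix.
        if not text or any(ch not in self.digits for ch in text):
            return None
        return int(text, self.base)


HEX = RadixTrait(base=16, spec="X", digits="0123456789ABCDEFabcdef")
BIN = RadixTrait(base=2, spec="b", digits="01")

print(HEX.render(255))   # FF
print(HEX.parse("FF"))   # 255
print(HEX.parse("zz"))   # None
print(BIN.render(42))    # 101010
```
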
The standard-library format traits:

| Trait       | Rendering               | Applies to                              |
|-------------|-------------------------|-----------------------------------------|
| `Hex`       | Hexadecimal             | Integer types                           |
| `Dec`       | Decimal (no grouping)   | Integer types (default)                 |
| `Oct`       | Octal                   | Integer types                           |
| `Bin`       | Binary                  | Integer types                           |
| `Decimal`   | Locale-aware decimal    | Numeric types (ICU4X + `CurrentLocale`) |
| `Percent`   | Percentage              | Float/Numeric (ICU4X)                   |
| `Currency`  | Locale-aware currency   | Numeric (ICU4X + `CurrentLocale`)       |
| `Debug`     | Structural dump         | All types (derivable)                   |

A format hole has two orthogonal layers, separated by an
optional `:`:

```
{ RENDERING : LAYOUT }
```

Both halves are optional.

**Rendering** (before the `:`) is a Silo expression that
produces a formatter value. It's postfix — trait-specific
parameters first, then the trait name that consumes them. If
omitted, the hole uses `Display`.

**Layout** (after the `:`) is Rust-style field-fitting:
`[fill][align][sign][#][0][width][.precision][?]`. It's
applied to the rendered string to fit a field. If omitted, no
padding or precision is applied.

Silo drops most of Rust's trailing *type character* shortcuts.
There's no `:x` / `:X` / `:b` / `:o` — rendering traits
(`{Hex}`, `{UpperHex}`, `{Bin}`, `{Oct}`) cover that ground.
Two Rust conventions are kept:

- `?` — the `{Debug}` shortcut. `{:?}` means "Debug", same as
  `{Debug}`, and combines with layout (`{:>10?}` is Debug +
  right-align).
- `#` — the alternate-form flag. It's a request the rendering
  trait sees at render time; `Hex` responds by prepending
  `0x`, `Bin` with `0b`, `Oct` with `0o`. Traits that don't
  recognise it ignore it.

Keeping the layout half almost entirely about field fitting
means `Decimal`, `Hex`, and custom traits all share the same
padding machinery without reimplementing it.

```silo
"{}"                         # default Display, no layout
"{:>10}"                     # Display, right-aligned width 10
"{:0>8}"                     # Display, zero-padded width 8
"{:?}"                       # Debug shortcut — "42"
"{:>10?}"                    # Debug + right-aligned width 10
"{Debug}"                    # Debug trait (long form)
"{Hex}"                      # Hex trait
"{Hex:#}"                    # Hex + alternate form — "0xFF"
"{Hex:>#10}"                 # Hex alternate form, right-aligned width 10 — "      0xFF"
"{Hex:>10}"                  # Hex, no prefix — "        FF"
"{Hex:0>8}"                  # zero-padded — "000000FF"
"{Decimal}"                  # locale-aware Decimal
"{Decimal:.2}"               # precision 2
"{Decimal:>10.2}"            # precision 2, right-aligned width 10
"{\"EUR\" Currency}"         # Currency(EUR) with defaults
"{\"EUR\" Currency:>10.2}"   # Currency(EUR), precision 2, right-aligned 10
```

Because the rendering side is real Silo, you can construct a
formatter outside a hole and reuse it:

```silo
"EUR" Currency pop-> euros
99.99 euros .render            # ⌊"€99.99"⌉   (respecting CurrentLocale)
```

The `format` macro's job is to thread the formatter through
its `.render` method on the value, then apply the layout spec.
The `{…}` syntax is sugar for that two-step.

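The two-step (render first, then fit the result into a field) can be
modelled in Python (not Silo) as below. This is a sketch under stated
assumptions: `format_hole`, `hex_render`, and `fit` are hypothetical
names; only the `[fill][align][#][width]` part of the layout grammar is
handled (no sign, `0` flag, precision, or `?`); and left alignment is
assumed when no alignment is given.

```python
import re
from typing import Callable

# Subset of the layout grammar: [fill][align][#][width]
LAYOUT = re.compile(
    r"(?:(?P<fill>.)?(?P<align>[<^>]))?(?P<alt>#)?(?P<width>\d+)?"
)


def hex_render(value: int, alternate: bool) -> str:
    """Rendering step: the trait sees the `#` request and decides what to do."""
    body = format(value, "X")
    return "0x" + body if alternate else body


def fit(text: str, fill: str, align: str, width: int) -> str:
    """Layout step: pad an already-rendered string; never changes its content."""
    if align == ">":
        return text.rjust(width, fill)
    if align == "^":
        return text.center(width, fill)
    return text.ljust(width, fill)


def format_hole(value: int, render: Callable[[int, bool], str],
                layout: str = "") -> str:
    match = LAYOUT.fullmatch(layout)
    if match is None:
        raise ValueError(f"unsupported layout spec: {layout!r}")
    rendered = render(value, match["alt"] is not None)
    width = int(match["width"] or 0)
    # Assumption: left alignment and space fill when the spec omits them.
    return fit(rendered, match["fill"] or " ", match["align"] or "<", width)


print(repr(format_hole(255, hex_render)))           # 'FF'
print(repr(format_hole(255, hex_render, "#")))      # '0xFF'
print(repr(format_hole(255, hex_render, ">10")))    # '        FF'
print(repr(format_hole(255, hex_render, "0>8")))    # '000000FF'
print(repr(format_hole(255, hex_render, ">#10")))   # '      0xFF'
```

Note that `fit` knows nothing about hexadecimal: any renderer that
returns a string gets the same padding behaviour, which is the point of
keeping the two layers separate.
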
### Trait parameters vs layout parameters

The split falls out cleanly once you ask "does this change the
text content, or only how it sits in a field?":

- **Trait parameter** (pre-`:`): changes what the text is.
  Currency code, locale hint, rounding mode.
- **Layout parameter** (post-`:`): changes only how the text
  fits. Width, alignment, fill, and precision (because
  precision is fundamentally "how long is the rendered string"
  and belongs with the field-fitting layer).

Custom format traits plug into the rendering side: define a
record for the formatter, implement the format-trait methods
on it, and the layout spec applies to whatever string you
produce.

## ICU4X at the core

The locale-aware half of the trait table above isn't a
convenience layer — it's the whole model. Silo's standard
library integrates ICU4X (`icu` crate) as a foundational
dependency:

- `Pattern` is backed by `icu::pattern::Pattern<MultiNamedPlaceholder>`
  under the hood. The pattern parser, the interpolation
  engine, and the named-placeholder layout all come from
  ICU4X; Silo adds typed holes on top.
- `Locale` is a real ICU locale value (language, script,
  region, variant, Unicode extensions), not a symbol tag.
  `(Locale .try-from)` parses a BCP-47 tag.
- `Decimal`, `Percent`, `Currency` format traits dispatch to
  ICU4X's `FixedDecimalFormatter`, `PercentFormatter`, and
  `CurrencyFormatter` respectively.
- Date, time, and calendar formatting
  ([chapter 15](./15-temporal.simd)) use ICU4X calendars and
  formatters.
- Collation, case mapping, and Unicode segmentation are all
  via ICU4X primitives.

Practically, this means Silo has proper Unicode and
locale-awareness everywhere you'd want them, without each
application having to shop for a library. `Str .to-upper` is
locale-sensitive. `Decimal` formatting picks the right
grouping character and decimal mark for the current locale.
Dates render with the right calendar for the locale.

The trade-off is a runtime commitment to ICU4X data. Hosts that
need a smaller footprint can swap in a cut-down data profile,
but the API doesn't change.

## Locale-aware formatting in practice

Locale dispatch is wired through the
**:gloss[`CurrentLocale` aspect](./A1-glossary.simd#aspect)**. The
default locale is whatever the host installs at startup
(usually `en-US` or the system locale); `:with` overrides for
a scope:

```silo
1234567.89 "{Decimal}" format
# ⌊"1,234,567.89"⌉   under the default locale

'de-DE (Locale .try-from) .unwrap :with CurrentLocale
  1234567.89 "{Decimal}" format
# ⌊"1.234.567,89"⌉   de-DE grouping and decimal mark
:end
```

Same pattern, same `format` call; the difference is purely the
installed locale aspect. That's the whole API.

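The scoped-override behaviour can be approximated in Python (not Silo)
with a context variable. This is only a model of the idea, not of Silo's
aspect system or of ICU4X: `current_locale`, `with_locale`, and
`format_decimal` are hypothetical names, the separator table is a
hand-written two-entry stand-in for real locale data, and the output is
fixed at two fraction digits for simplicity.

```python
from contextlib import contextmanager
from contextvars import ContextVar

# Hypothetical stand-in for locale data: (grouping separator, decimal mark).
SEPARATORS = {"en-US": (",", "."), "de-DE": (".", ",")}

# Plays the role of the installed default locale.
current_locale = ContextVar("current_locale", default="en-US")


@contextmanager
def with_locale(tag: str):
    """Override the current locale for the duration of a `with` block."""
    token = current_locale.set(tag)
    try:
        yield
    finally:
        current_locale.reset(token)  # restore whatever was installed before


def format_decimal(value: float) -> str:
    group, mark = SEPARATORS[current_locale.get()]
    plain = f"{value:,.2f}"  # en-US style: comma grouping, dot decimal mark
    # Swap both separators in one pass so they do not overwrite each other.
    return plain.translate(str.maketrans({",": group, ".": mark}))


print(format_decimal(1234567.89))      # 1,234,567.89
with with_locale("de-DE"):
    print(format_decimal(1234567.89))  # 1.234.567,89
print(format_decimal(1234567.89))      # 1,234,567.89
```

The call site is identical in all three cases; only the ambient locale
differs, and it is restored automatically when the scope ends.
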
## Ordinary string operations

Alongside the pattern-based API, strings support the usual
collection-style operations. A few common ones:

```silo
"a,b,c" "," .split             # ⌊[ "a" "b" "c" ]⌉
"hello world" "world" .contains  # ⌊true⌉
"  hello  " .trim              # ⌊"hello"⌉
"hello" .to-upper "HELLO" .=    # ⌊true⌉
```

Most of them are trait methods — `.split` comes from the
`Splittable` surface, `.contains` from `Searchable`, and so
on. A full reference lives in `spec/stdlib/str.md`.

## Raw strings and byte strings

Two literal prefixes handle the edge cases:

```silo
r"no {} here"                  # Str — raw; braces are literal, escapes ignored
b"raw bytes \x00"              # Bytes — raw byte string, not a UTF-8 Str
```

Use raw strings when you want a `Str` that doesn't trigger the
`Pattern` typing — regular expressions, file paths on Windows,
anywhere `{` and `}` are part of the literal. Use byte strings
when you want raw bytes rather than a UTF-8 sequence.

## Key points

- Strings are UTF-8. `.len` is bytes; `.codepoints` gives you
  the codepoint iterator. The two are kept visibly separate.
- A string literal with `{}` holes is a `(Pattern …)` typed
  value, not a `Str`. The type encodes the hole count and
  types. Raw strings (`r"…"`) opt out of pattern typing when
  you want literal braces.
- `format` and `parse` are bidirectional on the same pattern.
  Compile-time checks ensure hole count and types line up in
  both directions.
- A format hole splits into **rendering** (a Silo postfix
  expression producing a formatter, left of `:`) and **layout**
  (Rust-style width/align/fill/precision spec, right of `:`).
  Either half is optional. The rendering expression is
  first-class Silo — construct a formatter, bind it, reuse it
  with `.render`.
- ICU4X is a foundational dependency. `Pattern`, `Locale`,
  locale-aware number/currency formatting, calendar handling,
  collation, and Unicode segmentation are all ICU4X-backed.
- Locale dispatch runs through the `CurrentLocale` aspect.
  `:with` overrides for a scope.

Next: [localization](./14-localization.simd) — `.sil` message
catalogues, typed placeholders, and how they plug into the
pattern machinery above.