@@ -227,8 +227,7 @@ assert_eq!(parts.next(), None);
[reference-level-explanation]: #reference-level-explanation

It is trivial to apply the pattern API to `OsStr` on platforms where it is just an `[u8]`. The main
- difficulty is on Windows where it is an `[u16]` encoded as WTF-8. This RFC thus focuses on Windows
- only.
+ difficulty is on Windows where it is an `[u16]` encoded as WTF-8. This RFC thus focuses on Windows.

We will generalize the encoding of `OsStr` to specify these two capabilities:
@@ -262,13 +261,43 @@ representing the high surrogate by the first 3 bytes, and the low surrogate by t
"\u{10000}"[2..] = 90 80 80
```

+ The index splitting the surrogate pair will be positioned at the middle of the 4-byte sequence
+ (index "2" in the above example).
+
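The split described above can be modelled on raw bytes. This is a minimal sketch and not part of the RFC text: `split_at_index` is a hypothetical helper, it works on a plain `&[u8]` holding well-formed WTF-8 rather than on `OsStr`, and it assumes the index is either an ordinary sequence boundary or byte 2 of a 4-byte sequence.

```rust
/// Hypothetical helper (not part of the RFC or of std): returns the byte
/// ranges that `x[..i]` and `x[i..]` would cover under the "middle of the
/// 4-byte sequence" rule. Assumes `bytes` is well-formed WTF-8 and `i` is
/// either a sequence boundary or byte 2 of a 4-byte sequence.
fn split_at_index(bytes: &[u8], i: usize) -> (&[u8], &[u8]) {
    // A 4-byte lead byte (0xf0..=0xf4) two positions back means that `i`
    // sits in the middle of a surrogate pair encoded as one 4-byte sequence.
    let splits_pair =
        i >= 2 && i + 2 <= bytes.len() && (0xf0..=0xf4).contains(&bytes[i - 2]);
    if splits_pair {
        // Left half keeps the first 3 bytes, right half keeps the last 3.
        (&bytes[..i + 1], &bytes[i - 1..])
    } else {
        bytes.split_at(i)
    }
}

fn main() {
    let s = "\u{10000}".as_bytes(); // f0 90 80 80
    let (left, right) = split_at_index(s, 2);
    assert_eq!(left, [0xf0, 0x90, 0x80]);
    assert_eq!(right, [0x90, 0x80, 0x80]);
    // The halves overlap, so their lengths do not add up to the whole.
    assert_eq!(left.len() + right.len(), 6);
    assert_eq!(s.len(), 4);
}
```

In this model the two halves share the middle two bytes, which is why each half is longer than the index suggests.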
Note that this means:

1. `x[..i]` and `x[i..]` will have overlapping parts. This makes `OsStr::split_at_mut` (if exists)
   unable to split a surrogate pair in half. This also means `Pattern<&mut OsStr>` cannot be
   implemented for `&OsStr`.
2. The length of `x[..n]` may be longer than `n`.

+ ### Platform-agnostic guarantees
+
+ If an index points to an invalid position (e.g. `"\u{1000}"[1..]` or `"\u{10000}"[1..]` or
+ `"\u{10000}"[3..]`), a panic will be raised, similar to that of `str`. The following are guaranteed
+ to be valid positions on all platforms:
+
+ * `0`.
+ * `self.len()`.
+ * The returned indices from `find()`, `rfind()`, `match_indices()` and `rmatch_indices()`.
+ * The returned ranges from `find_range()`, `rfind_range()`, `match_ranges()` and `rmatch_ranges()`.
+
+ Index arithmetic is unreliable for `OsStr`, i.e. `i + n` may not produce the correct index (see
+ [Drawbacks](#drawbacks)).
+
+ For WTF-8 encoding on Windows, we define:
+
+ * The boundary of a character or surrogate byte sequence is valid.
+ * The middle (byte 2) of a 4-byte sequence is valid.
+ * The interior of a 2- or 3-byte sequence is invalid.
+ * Byte 1 or 3 of a 4-byte sequence is invalid.
+
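The WTF-8 rules listed above can be expressed as a small check over raw bytes. This is only an illustrative sketch, not the RFC's implementation: `seq_len` and `is_valid_index` are hypothetical names, and the input is assumed to be well-formed WTF-8 with no already-sliced halves at either end.

```rust
/// Length of the WTF-8 sequence introduced by `lead`.
/// Assumption: the input is well-formed, so any unexpected byte is simply
/// treated as a 1-byte sequence.
fn seq_len(lead: u8) -> usize {
    match lead {
        0xc0..=0xdf => 2,
        0xe0..=0xef => 3,
        0xf0..=0xf7 => 4,
        _ => 1,
    }
}

/// Hypothetical check (not part of the RFC or of std) for whether `i` is a
/// valid slicing position in `bytes` under the rules listed above.
fn is_valid_index(bytes: &[u8], i: usize) -> bool {
    if i >= bytes.len() {
        // `len()` is always valid; anything past it is not.
        return i == bytes.len();
    }
    let mut pos = 0;
    while pos < bytes.len() {
        if i == pos {
            return true; // boundary of a sequence
        }
        let n = seq_len(bytes[pos]);
        if n == 4 && i == pos + 2 {
            return true; // middle of a 4-byte sequence
        }
        if i < pos + n {
            return false; // any other interior position
        }
        pos += n;
    }
    false
}

fn main() {
    let four = "\u{10000}".as_bytes(); // f0 90 80 80
    let three = "\u{1000}".as_bytes(); // e1 80 80
    assert!(is_valid_index(four, 0) && is_valid_index(four, 2) && is_valid_index(four, 4));
    assert!(!is_valid_index(four, 1) && !is_valid_index(four, 3));
    assert!(!is_valid_index(three, 1) && !is_valid_index(three, 2));
}
```

The rejected positions match the panicking examples given earlier in this section: index 1 of `"\u{1000}"` and indices 1 and 3 of `"\u{10000}"`.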
+ Outside of Windows where the `OsStr` consists of arbitrary bytes, all indices are considered valid.
+ This is because we want to allow `os_str.find(OsStr::from_bytes(b"\xff"))`.
+
+ Note that we have never guaranteed the actual `OsStr` encoding; these should only be considered an
+ implementation detail.
+
## Comparison and storage

All `OsStr` strings with sliced 4-byte sequence can be converted back to proper WTF-8 with an O(1)
@@ -284,7 +313,9 @@ We can this transformation “*canonicalization*”.
All owned `OsStr` should be canonicalized to contain well-formed WTF-8 only: `Box<OsStr>`,
`Rc<OsStr>`, `Arc<OsStr>` and `OsString`.

- Two `OsStr` are compared equal if they have the same canonicalization.
+ Two `OsStr` are compared equal if they have the same canonicalization. This may slightly reduce
+ performance by a constant overhead, since comparison needs extra checks involving the first and
+ last three bytes.

## Matching

@@ -423,7 +454,9 @@ match self.matcher.next_match() {
# Rationale and alternatives
[alternatives]: #alternatives

- This is the only design which allows borrowing a sub-slice of a surrogate code point from a
+ ## Indivisible surrogate pair
+
+ This RFC is the only design which allows borrowing a sub-slice of a surrogate code point from a
surrogate pair.

An alternative is to keep using the vanilla WTF-8, and treat a surrogate pair as an atomic entity:
@@ -446,7 +479,48 @@ There are two potential implementations when we want to match with an unpaired s
Note that, for consistency, we need to make `"\u{10000}".starts_with("\u{d800}")` return `false` or
panic.

+ ## Slicing at real byte offset
+
+ The current RFC defines the index that splits a surrogate pair in half at byte 2 of the 4-byte
+ sequence. This has the drawback that `"\u{10000}"[..2].len() == 3`, which makes index arithmetic
+ wrong.
+
+ ```
+ "\u{10000}" = f0 90 80 80
+ "\u{10000}"[..2] = f0 90 80
+ "\u{10000}"[2..] = 90 80 80
+ ```
+
+ The main advantage of this scheme is that we can use the same number as the start and end index.
+
+ ```rust
+ let s = OsStr::new("\u{10000}");
+ assert_eq!(s.len(), 4);
+ let index = s.find('\u{dc00}').unwrap();
+ let right = &s[index..]; // [90 80 80]
+ let left = &s[..index]; // [f0 90 80]
+ ```
+
+ An alternative is to make the index refer to the real byte offsets:
+
+ ```
+ "\u{10000}" = f0 90 80 80
+ "\u{10000}"[..3] = f0 90 80
+ "\u{10000}"[1..] = 90 80 80
+ ```
+
+ However, the question would be: what should `s[..1]` do?
+
+ * **Panic**: But this means we cannot get `left`. We could inspect the raw bytes of `s` itself and
+   perform `&s[..(index + 2)]`, but we never explicitly exposed the encoding of `OsStr`, so we
+   cannot read a single byte, which makes this impossible.
+
+ * **Treat as same as `s[..3]`**: But then this inherits all the disadvantages of using 2 as a valid
+   index, plus we need to consider whether `s[1..3]` and `s[3..1]` should be valid.
+
+ Given these, we decided not to treat the real byte offsets as valid indices.
+
# Unresolved questions
[unresolved]: #unresolved-questions

- None yet.
+ None yet.