Description
Arising out of discussion in #46.
Right now, we have parsers in the modules Text.Parsing.StringParser.CodePoints and Text.Parsing.StringParser.CodeUnits, which use the same Parser data type, except that the former treats the integer pos field in the parser state as the number of code points we have consumed in the string being parsed, and the latter treats the pos field as the number of code units. This can cause problems if they are mixed:
> runParser (Tuple <$> CP.string "🐱" <*> CP.anyChar) "🐱hi"
(Right (Tuple "🐱" 'h'))
> runParser (Tuple <$> CP.string "🐱" <*> CU.anyChar) "🐱hi"
(Right (Tuple "🐱" '�'))
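To see where the stray character comes from, it may help to compare the two ways of measuring the same input. The following is a small standalone sketch, not part of the library; it only uses `Data.String.CodePoints`, `Data.String.CodeUnits` and `Data.Char` from the core string packages. After `CP.string "🐱"` the state holds `pos = 1` (one code point), but read as a code unit index, `1` points at the second half of the surrogate pair.

```purescript
module Main where

import Prelude

import Data.Char (toCharCode)
import Data.String.CodePoints as SCP
import Data.String.CodeUnits as SCU
import Effect (Effect)
import Effect.Console (logShow)

main :: Effect Unit
main = do
  let input = "🐱hi"
  -- The cat emoji is a single code point but two UTF-16 code units.
  logShow (SCP.length input) -- 3 code points
  logShow (SCU.length input) -- 4 code units
  -- Index 1 as a code point index is 'h' ...
  logShow (SCP.codePointAt 1 input)
  -- ... but index 1 as a code unit index is the low surrogate of the emoji
  -- (a value in the range 0xDC00-0xDFFF), which is what CU.anyChar returns.
  logShow (map toCharCode (SCU.charAt 1 input))
  -- The 'h' lives at code unit index 2.
  logShow (SCU.charAt 2 input) -- (Just 'h')
```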
Additionally, storing an index into code points is not really justifiable from a performance perspective, since indexing into a string using code points is an O(n) operation, where n is the index; it requires looking at every code point in the string up to the given index.
If we compare the APIs exported by the Text.Parsing.StringParser.Code{Units,Points} modules, they are basically the same; in particular, the CodePoints parsers still use Char almost everywhere, which limits their utility quite severely. As far as I can tell, the only difference between the CodePoints and CodeUnits parsers (now that #46 has been merged) is that the CodePoints ones will fail rather than splitting up surrogate pairs.
I think the ideal solution would be to do the following:
- Say that the pos field in the parser state always counts code units.
- Unify Text.Parsing.StringParser.CodePoints and Text.Parsing.StringParser.CodeUnits into just one module; effectively, get rid of the former and move most/all of the contents of the latter back to Text.Parsing.StringParser.
- Clearly demarcate parsers which have the ability to split up surrogate pairs, like anyChar. This could be done with doc-comments, or we could move them into their own module Text.Parsing.StringParser.CodeUnits.
- Provide CodePoint-based alternatives to any of the parsers which are currently based on Char, so that it is possible to do everything you might want to do without having to resort to using parsers like anyChar which can split up surrogate pairs.
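As a rough illustration of the first and last points, here is a sketch of what a CodePoint-based counterpart to anyChar could look like when pos counts code units. The `State` and `Step` types and the name `anyCodePoint` are hypothetical stand-ins, not the library's actual definitions; the real `Parser` type would need the same logic adapted to its own state and error records.

```purescript
module Main where

import Prelude

import Data.Either (Either(..))
import Data.Maybe (Maybe(..))
import Data.String.CodePoints (CodePoint)
import Data.String.CodePoints as SCP
import Data.String.CodeUnits as SCU
import Effect (Effect)
import Effect.Console (logShow)

-- Hypothetical, simplified parser state: pos is always a code unit offset.
type State = { str :: String, pos :: Int }

-- Hypothetical result of one parsing step.
type Step a = Either String { result :: a, suffix :: State }

-- Consume one whole code point, advancing pos by its width in code units
-- (1 for characters in the BMP, 2 for a surrogate pair).
anyCodePoint :: State -> Step CodePoint
anyCodePoint { str, pos } =
  case SCP.codePointAt 0 (SCU.drop pos str) of
    Nothing -> Left "Unexpected EOF"
    Just cp ->
      Right
        { result: cp
        , suffix: { str: str, pos: pos + SCU.length (SCP.singleton cp) }
        }

main :: Effect Unit
main = do
  -- The emoji occupies two code units, so pos moves from 0 to 2.
  logShow (map _.suffix.pos (anyCodePoint { str: "🐱hi", pos: 0 })) -- (Right 2)
  -- 'h' occupies one code unit, so pos moves from 2 to 3.
  logShow (map _.suffix.pos (anyCodePoint { str: "🐱hi", pos: 2 })) -- (Right 3)
```

Because the offset is stored in code units, a parser like this never needs to walk the string from the start to find its position, and it cannot stop in the middle of a surrogate pair as long as it starts on a code point boundary.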