Parsing improvements #874

Jolanrensen · 2024-09-19T14:47:35Z

fixes #849

Small logic rewrite for tryParseImpl and added kdocs.

StringParsers can now be "covered by" another parser, meaning they will be skipped if the other parser is run. It, for instance, makes no sense to check whether a string can be parsed as a java.time.Instant if it cannot be parsed as a kotlinx.datetime.Instant.

Why I didn't remove parsers that are covered by other parsers is keep open the option to skip some parser types in the future. Say a user wants Java date-time classes instead of kotlin ones, the kotlin ones can be skipped and the java ones will still run. This would need to be implemented separately in the future, but I have plans to implement parser-skipping for CSV readers that already handle the parsing of some types but not all.

More importantly, this PR removes as many exceptions as possible from the "default path" of parsing:

When running parse() on a DF with columns of normal strings (a pretty common scenario), for each r number of rows in the column, all parsers are run. Many date-time parsers threw exceptions that needed to be caught which is a heavy affair (about 15 exceptions per String, according to the issue). Now is r usually 1, which limits the expected number of exceptions, however, it still scales with the number of columns c. It's easy to find a worst-case scenario. Imagine a DF consisting of valid LocalTimes for all rows except the final row (maybe a user made an error copy-pasting), We'll now get 14 exceptions times r rows times c columns and the resulting DF will be unchanged... We probably cannot completely erase this scenario, but at least we could limit the number of exceptions thrown :)

To avoid exceptions I did the following:

kotlinx.datetime.Instant
- Instant.parse calls DateTimeComponents.Formats.ISO_DATE_TIME_OFFSET.parse().toInstantUsingOffset() instead of the exception-less parseOrNull(), so we'll simply use that version instead. Plus, to catch leap seconds, when it fails to parse it, we'll try the java instants too.
java.time.Duration
- This function works by applying a regex to the String. I simply copied the regex to manually check whether it can parse it or not before passing it to the java function. Tests are in place to notice us of any changes.
kotlin.time.Duration
- This was a bit more difficult, as it supports both the ISO-8601 format, like Java, but also its own String notation, like "1d 23h 2s". The method I chose was to copy over the logic from the stdlib and replace the exceptions with return null. This might be a bit more difficult to maintain, however, I put some tests in place to check behavioral changes on the kotlin side. (Tests that are "inspired" by the official tests)
java.time.Instant, java.time.LocalDateTime, java.time.LocalDate, java.time.LocalTime
- Instead of calling DateTimeFormatter.parse directly, we can call parseUnresolved first. This we can catch failing without it throwing an error. If it does not fail, we call the normal parse, which has a much lower chance of throwing an exception now.
JSON
- expanded the checks a tiny bit so that the chance of JSON exceptions is slightly lower

Finally I used coroutines (as suggested in #723) to parallelize the parse operation per-column.

… can now be "covered by" another parser, meaning they will be skipped if the other parser is run. parsersOrder was also cleaned up a tiny bit

…ns in parsing

… parsing implementation

Jolanrensen · 2024-09-23T18:45:35Z

I ran a small local test with the IntelliJ profiler to see how effective the drop of exceptions is. The test might not reflect real-world differences, but I emphasised the worst-case scenario (aka a DF with many String columns).

The test:

val df = dataFrameOf(List(5_000) { "_$it" }).fill(100) {
    Random.nextInt().toChar().toString() + Random.nextInt().toChar()
} // a 100r x 5000c DF filled with strings

df.parse()

All parsers are run, coveredBy is disabled for these tests.

These are the results:

Type	CPU Time old	CPU Time new	Memory Allocations old	Memory Allocations new
kotlinx.datetime.Instant	9,563 ms	990 ms	5.41 GB	929.55 MB
java.time.Instant	2,851 ms	140 ms	1.42 GB	423.11 MB
kotlinx.datetime.LocalDateTime	10,314 ms	160 ms	4.45 GB	480.71 MB
java.time.LocalDateTime	10,344 ms	190 ms	4.45 GB	481.29 MB
kotlinx.datetime.LocalDate	10,634 ms	190 ms	4.44 GB	488.12 MB
java.time.LocalDate	10,053 ms	180 ms	4.45 GB	486.04 MB
kotlin.time.Duration	4,492 ms	40 ms	1.7 GB	24.64 MB
java.time.Duration	2,501 ms	30 ms	840.99 MB	123.74 MB
kotlinx.datetime.LocalTime	10,894 ms	160 ms	4.44 GB	483.92 MB
java.time.LocalTime	9,674 ms	210 ms	4.44 GB	490.72 MB

(cherry picked from commit 5c567c5)

# Conflicts: # core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/parse.kt

…as keys like Double appear multiple times)

zaleslaw · 2024-10-04T11:44:44Z

core/generated-sources/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/parse.kt

    fun applyOptions(options: ParserOptions?): (String) -> T?

+    /** If a parser with one of these types is run, this parser can be skipped. */
+    val coveredBy: Collection<KType>


Comment is clear, but probably naming is weird

any suggestions? :)

zaleslaw · 2024-10-04T12:08:37Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/parse.kt

+                    // parse each value/frame column at any depth inside each DataFrame in the frame column
+                    col.isFrameColumn() ->
+                        col.values.map {
+                            async {


do we really need a 2-level coroutine here? How many coroutines will be created during parsing large file, saying 1000 000 rows on 100 columns?

Probably we need to set up correct dispatcher or give this ability to the user
https://kotlinlang.org/api/kotlinx.coroutines/kotlinx-coroutines-core/kotlinx.coroutines/-coroutine-dispatcher/

In the case, if we run with the Default Dispatcher, it could consume too many resources on the machine.

do we really need a 2-level coroutine here? How many coroutines will be created during parsing large file, saying 1000 000 rows on 100 columns?

Coroutines are cheap :) DataFrames in the cells of frame columns are independent of each other, as are columns in a DF. It only makes sense to split them off in different coroutine branches.

Probably we need to set up correct dispatcher or give this ability to the user

I agree, but I'm not sure how to do this neatly from the API. The way to do it correctly, would be to make parse() and all its overloads suspend functions. That way the user can decide which scope to run it on and with which dispatcher. The problem is that the DF API is not built around coroutines, nor should users be forced to put every call in a suspending context, so this would require all overloads to be written twice, both with and without suspend... @koperagen any ideas?

zaleslaw · 2024-10-04T12:10:38Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/parse.kt

+
+                    // Base case, parse the column if it's a `String?` column
+                    col.isSubtypeOf<String?>() ->
+                        col.cast<String?>().tryParse(options)


Are we throwing for now here some exceptions? in tryParse?

No, DataColumn.tryParse means it tries to parse it, but if it fails, it will keep it as String. This is in contrast with DataColumn.parse which does throw an exception if it couldn't be parsed.

I'll add some quick kdoc, because it looks confusing indeed.

zaleslaw

I worry about parallel parsing a little bit

Also I had a thought, what if we enalbe DEBUG logging for catchSilent {
logger.debug ("parsing problems")
}

… and parse.

Jolanrensen · 2024-10-04T18:56:24Z

I worry about parallel parsing a little bit

Also I had a thought, what if we enalbe DEBUG logging for catchSilent { logger.debug ("parsing problems") }

parallel might need a little redesign indeed. Especially if we'd want to introduce coroutines in other parts of the library where they might be useful.

I wouldn't add logging to each individual parser personally, unless we can guarantee that we have 0 impact on performance when the logging level is not debug. But even then, with debug enabled we would generate a TON of logs when parsing.

koperagen · 2024-10-07T10:39:14Z

I'd try ParallelStream instead of coroutines

koperagen · 2024-10-07T10:43:19Z

core/generated-sources/src/test/kotlin/org/jetbrains/kotlinx/dataframe/api/parse.kt

    }

+    @Test
+    fun `can parse instants`() {


can we ask stdlib for a non-throwing parsing function?

ah, my bad, it's from Java, right?

in any case, i want to understand better what copy-pasted parts are needed

for kotlin Instants, we've got parseOrNull, a non-throwing parsing function :) For Java I managed to make my own parseOrNull.

I copied over a lot of tests from kotlinx-datetime such that we can catch functional changes when we update java/kotlinx-datetime versions.

The biggest thing I copied was regarding Kotlin Duration. It would be great if we could have a non-throwing Duration.parseOrNull in the stdlib. I solved this by creating a function Duration.canParse which contains copied logic but returns false instead of throwing an exception. For this I also put plenty of tests in place to catch functional changes if we bump Kotlin versoin.

Jolanrensen · 2024-10-08T10:38:43Z

I'd try ParallelStream instead of coroutines

ParallelStream does work, however, for big json-like DataFrames I'm a bit concerned:
How does ParallelStream handle where a node spawns multiple other parallelstreams? such as for frame columns. I know coroutines neatly manage and distribute their work. Plus, it's more multiplatform-ready.

Jolanrensen · 2024-10-08T12:24:46Z

@zaleslaw @koperagen I adjusted the coroutine implementation such that users can supply how they want the parser to run in their coroutinescope, if they desire to do so.

Example can be found in the tests:

dataframe/core/src/test/kotlin/org/jetbrains/kotlinx/dataframe/api/parse.kt

Line 488 in ba6299c

fun `Parse with different coroutine scope or context`() {

I made the parse functions inline as to "leak" the suspend scope inside. The benefit of this notation is that the function can be called both inside and outside suspend functions while still allowing the user to control how it's executed.

More explanation here:

dataframe/core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/aliases.kt

Line 232 in ba6299c

public typealias CoroutineProvider<T> = (suspend CoroutineScope.() -> T) -> T

Jolanrensen · 2024-10-14T17:22:16Z

Removing parallel behavior for now. We can discuss it in #723

github-actions · 2024-10-14T17:26:46Z

Generated sources will be updated after merging this PR.
Please inspect the changes in here.

small logic rewrite for tryParseImpl and added kdocs. StringParsers…

0ca3883

… can now be "covered by" another parser, meaning they will be skipped if the other parser is run. parsersOrder was also cleaned up a tiny bit

Jolanrensen force-pushed the parsing-optimization branch from 5de4ca6 to 0ca3883 Compare September 19, 2024 14:54

added "canParse" functions for Kotlin/Java duration to avoid exceptio…

95b9df6

…ns in parsing

Jolanrensen force-pushed the parsing-optimization branch from 86a089f to 95b9df6 Compare September 19, 2024 18:51

added toInstantOrNull functions avoiding exceptions when parsing

abe03dc

Jolanrensen force-pushed the parsing-optimization branch from cde546c to abe03dc Compare September 20, 2024 12:08

adding tests to ensure consistency between duration and instant parsing

e0c4de7

Jolanrensen force-pushed the parsing-optimization branch from eb1cb37 to e0c4de7 Compare September 20, 2024 13:47

Jolanrensen added 2 commits September 23, 2024 13:40

creating low-in-exceptions DateTimeFormatter.parseOrNull function for…

af19cfd

… parsing implementation

small json parsing fix

e3dfcfd

cleaning up

bee6c2f

Jolanrensen marked this pull request as ready for review September 23, 2024 19:08

Jolanrensen added 2 commits September 24, 2024 13:53

adding contracts to AnyCol.isX functions

fcfac95

(cherry picked from commit 5c567c5)

enabling column-parallel parsing with coroutines

de81ab6

Jolanrensen force-pushed the parsing-optimization branch from 7bcfef2 to 37847a3 Compare September 24, 2024 12:28

Merge branch 'master' into parsing-optimization

bd178e3

# Conflicts: # core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/parse.kt

Jolanrensen force-pushed the parsing-optimization branch from 37847a3 to bd178e3 Compare September 24, 2024 12:42

Jolanrensen requested review from koperagen and zaleslaw September 24, 2024 12:43

Jolanrensen added 3 commits September 25, 2024 13:08

Merge branch 'master' into parsing-optimization

219ddff

fixup! enabling column-parallel parsing with coroutines

7b9c86b

tiny logic change so we take all parsers and not just the parserMap (…

5057eeb

…as keys like Double appear multiple times)

zaleslaw reviewed Oct 4, 2024

View reviewed changes

zaleslaw requested changes Oct 4, 2024

View reviewed changes

copied some kdoc to denote the difference between DataColumn.tryParse…

72c8a17

… and parse.

koperagen reviewed Oct 7, 2024

View reviewed changes

Jolanrensen requested review from koperagen and zaleslaw October 11, 2024 12:48

Jolanrensen force-pushed the parsing-optimization branch from ba6299c to cd7463b Compare October 14, 2024 10:21

removing parallelization from parse for now

2a90396

Jolanrensen force-pushed the parsing-optimization branch from cd7463b to 2a90396 Compare October 14, 2024 17:20

Jolanrensen merged commit a8cee48 into master Oct 15, 2024
5 checks passed

Jolanrensen deleted the parsing-optimization branch October 15, 2024 12:34

Parsing improvements #874

Parsing improvements #874

Uh oh!

Conversation

Jolanrensen commented Sep 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Jolanrensen commented Sep 23, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zaleslaw left a comment

Choose a reason for hiding this comment

Uh oh!

Jolanrensen commented Oct 4, 2024

Uh oh!

koperagen commented Oct 7, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Jolanrensen Oct 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Jolanrensen commented Oct 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Jolanrensen commented Oct 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Jolanrensen commented Oct 14, 2024

Uh oh!

github-actions bot commented Oct 14, 2024

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Jolanrensen commented Sep 19, 2024 •

edited

Loading

Jolanrensen Oct 8, 2024 •

edited

Loading

Jolanrensen commented Oct 8, 2024 •

edited

Loading

Jolanrensen commented Oct 8, 2024 •

edited

Loading