Parsing improvements #874
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parsing improvements #874
Conversation
… can now be "covered by" another parser, meaning they will be skipped if the other parser is run. parsersOrder was also cleaned up a tiny bit
… parsing implementation
I ran a small local test with the IntelliJ profiler to see how effective the drop of exceptions is. The test might not reflect real-world differences, but I emphasised the worst-case scenario (a DF with many String columns). The test:

```kotlin
val df = dataFrameOf(List(5_000) { "_$it" }).fill(100) {
    Random.nextInt().toChar().toString() + Random.nextInt().toChar()
} // a 100r x 5000c DF filled with strings
df.parse()
```

All parsers are run. These are the results:
(cherry picked from commit 5c567c5)
# Conflicts: # core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/parse.kt
…as keys like Double appear multiple times)
```kotlin
fun applyOptions(options: ParserOptions?): (String) -> T?

/** If a parser with one of these types is run, this parser can be skipped. */
val coveredBy: Collection<KType>
```
The comment is clear, but the naming is perhaps a bit odd
any suggestions? :)
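The "covered by" mechanism under discussion can be sketched in a few lines. This is a hypothetical, stdlib-only illustration (the `Parser` class and `parseWithCoverage` function are made up for this sketch, not the actual DataFrame internals): a parser is skipped whenever a parser it is covered by has already run.

```kotlin
import kotlin.reflect.KType
import kotlin.reflect.typeOf

// Hypothetical stand-in for a StringParser: it lists the types whose parsers,
// once attempted, make running this parser pointless.
class Parser(
    val type: KType,
    /** If a parser with one of these types is run, this parser can be skipped. */
    val coveredBy: Set<KType> = emptySet(),
    val parse: (String) -> Any?,
)

// Try parsers in order; skip any parser covered by one that already ran.
fun parseWithCoverage(value: String, parsersInOrder: List<Parser>): Any? {
    val ran = mutableSetOf<KType>()
    for (parser in parsersInOrder) {
        if (parser.coveredBy.any { it in ran }) continue // skipped entirely
        ran += parser.type
        parser.parse(value)?.let { return it }
    }
    return null
}

val parsers = listOf(
    Parser(typeOf<Int>()) { it.toIntOrNull() },
    // Every Short-parseable string is Int-parseable, so if the Int parser
    // already ran (and failed), trying Short again is pointless:
    Parser(typeOf<Short>(), setOf(typeOf<Int>())) { it.toShortOrNull() },
    Parser(typeOf<Double>()) { it.toDoubleOrNull() },
)
```

The `Short`-covered-by-`Int` pair mirrors the `java.time.Instant` / `kotlinx.datetime.Instant` example from the PR description: failure of the covering parser implies failure of the covered one.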
```kotlin
// parse each value/frame column at any depth inside each DataFrame in the frame column
col.isFrameColumn() ->
    col.values.map {
        async {
```
do we really need a 2-level coroutine here? How many coroutines will be created when parsing a large file, say 1,000,000 rows by 100 columns?
Probably we need to set up the correct dispatcher, or give this ability to the user:
https://kotlinlang.org/api/kotlinx.coroutines/kotlinx-coroutines-core/kotlinx.coroutines/-coroutine-dispatcher/
If we run with the Default dispatcher, it could consume too many resources on the machine.
> do we really need a 2-level coroutine here? How many coroutines will be created when parsing a large file, say 1,000,000 rows by 100 columns?
Coroutines are cheap :) DataFrames in the cells of frame columns are independent of each other, as are the columns in a DF, so it only makes sense to split them off into different coroutine branches.
> Probably we need to set up the correct dispatcher or give this ability to the user
I agree, but I'm not sure how to do this neatly from the API. The way to do it correctly would be to make `parse()` and all its overloads suspend functions. That way the user can decide which scope to run it in and with which dispatcher. The problem is that the DF API is not built around coroutines, nor should users be forced to put every call in a suspending context, so this would require all overloads to be written twice, both with and without `suspend`... @koperagen any ideas?
```kotlin
// Base case, parse the column if it's a `String?` column
col.isSubtypeOf<String?>() ->
    col.cast<String?>().tryParse(options)
```
Are we still throwing some exceptions here, in `tryParse`?
No, `DataColumn.tryParse` means it tries to parse the column, but if that fails, it keeps it as `String`. This is in contrast with `DataColumn.parse`, which does throw an exception if the column couldn't be parsed.
I'll add some quick KDoc, because it does look confusing indeed.
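The `tryParse`/`parse` contract described above can be illustrated on plain lists instead of `DataColumn`s. These helper names (`tryParseInts`, `parseInts`) are made up for the sketch:

```kotlin
// Hypothetical stand-in for DataColumn.tryParse: if any value fails to parse,
// keep the whole column as String instead of throwing.
fun tryParseInts(column: List<String>): List<Any> =
    column.map { it.toIntOrNull() ?: return column } // fall back to the String column

// Hypothetical stand-in for DataColumn.parse: throw when parsing fails.
fun parseInts(column: List<String>): List<Int> =
    column.map { it.toIntOrNull() ?: throw IllegalStateException("Couldn't parse '$it' as Int") }
```

The non-local `return column` inside `map` is what makes the fallback cheap: the first unparseable value short-circuits the whole attempt.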
I worry about parallel parsing a little bit.
Also I had a thought: what if we enable DEBUG logging, e.g.

```kotlin
catchSilent {
    logger.debug("parsing problems")
}
```
Parallel might need a little redesign indeed, especially if we'd want to introduce coroutines in other parts of the library where they might be useful. Personally, I wouldn't add logging to each individual parser unless we can guarantee zero impact on performance when the logging level is not DEBUG. But even then, with DEBUG enabled we would generate a ton of logs when parsing.
I'd try ParallelStream instead of coroutines
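For comparison, the `ParallelStream` suggestion could look roughly like this JDK-only sketch, which parses columns on the common `ForkJoinPool`. The function name and the toy per-column "parse" step (`toIntOrNull`) are made up for illustration:

```kotlin
import java.util.stream.Collectors

// Parse each column on the common ForkJoinPool via Java parallel streams.
// Each column is independent, so per-column parallelism is safe here.
fun parseColumnsInParallel(columns: Map<String, List<String>>): Map<String, List<Int?>> =
    columns.entries
        .parallelStream()
        .map { (name, values) -> name to values.map { it.toIntOrNull() } }
        .collect(Collectors.toMap({ it.first }, { it.second }))
```

Unlike coroutines, this gives no control over the pool per call site (short of wrapping the call in a custom `ForkJoinPool`), which is one side of the dispatcher discussion above.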
```kotlin
@Test
fun `can parse instants`() {
```
can we ask the stdlib for a non-throwing parsing function?
ah, my bad, it's from Java, right?
In any case, I want to understand better which copy-pasted parts are needed.
For Kotlin Instants we've got `parseOrNull`, a non-throwing parsing function :) For Java I managed to make my own `parseOrNull`.
I copied over a lot of tests from kotlinx-datetime so that we can catch functional changes when we update java/kotlinx-datetime versions.
The biggest thing I copied concerns Kotlin `Duration`. It would be great if we could have a non-throwing `Duration.parseOrNull` in the stdlib. I solved this by creating a function `Duration.canParse`, which contains the copied logic but returns `false` instead of throwing an exception. For this I also put plenty of tests in place to catch functional changes if we bump the Kotlin version.
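The `canParse` idea is a validate-before-parse check: confirm the string's shape up front so the happy path never constructs an exception. Here is a deliberately simplified sketch of that pattern for `java.time.Duration` — the regex only approximates ISO-8601 durations (it rejects some valid forms, such as negative components) and is NOT the logic copied in the PR; `parseDurationOrNull` is a hypothetical name:

```kotlin
import java.time.Duration

// Rough shape check for ISO-8601 durations like "PT1H30M" or "P2DT3H".
// A false positive here is harmless: the real parse below still guards it.
private val isoDurationRegex =
    Regex("""[+-]?P(?:\d+D)?(?:T(?:\d+H)?(?:\d+M)?(?:\d+(?:\.\d+)?S)?)?""", RegexOption.IGNORE_CASE)

fun parseDurationOrNull(text: String): Duration? {
    // Reject strings that cannot be durations without ever throwing:
    // "P" alone and a trailing "T" match the regex but are invalid.
    if (!isoDurationRegex.matches(text) ||
        text.equals("P", ignoreCase = true) ||
        text.endsWith("T", ignoreCase = true)
    ) {
        return null
    }
    return Duration.parse(text)
}
```

The point is the cost model: for columns that are mostly *not* durations, the cheap regex rejection replaces an exception throw per value.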
`ParallelStream` does work; however, for big JSON-like DataFrames I'm a bit concerned:
@zaleslaw @koperagen I adjusted the coroutine implementation so that users can supply how they want the parser to run in their `CoroutineScope`, if they desire to do so. An example can be found in the tests:
I made the parse functions `inline` so as to "leak" the suspend scope inside. The benefit of this notation is that the function can be called both inside and outside suspend functions, while still allowing the user to control how it's executed. More explanation here:
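The inline trick can be sketched with the stdlib alone. In this hypothetical `parseEach` (not the PR's actual signature), the execution strategy is a lambda parameter; because the function is `inline`, that lambda is inlined into the caller, so a caller that is already inside a suspend function may pass a strategy that suspends (e.g. one that launches coroutines), while plain callers get the sequential default — without writing suspend and non-suspend overloads twice:

```kotlin
// Hypothetical sketch: `runAll` decides HOW the per-item parse tasks execute.
// Being an inlined lambda, it may contain suspend calls when the call site is
// itself inside a suspend function.
inline fun <T, R> parseEach(
    items: List<T>,
    runAll: (List<() -> R>) -> List<R> = { tasks -> tasks.map { it() } },
    crossinline parseOne: (T) -> R,
): List<R> = runAll(items.map { item -> { parseOne(item) } })
```

In a suspend context a caller could pass, say, `runAll = { tasks -> coroutineScope { tasks.map { async { it() } }.awaitAll() } }` (requires kotlinx-coroutines); outside one, the default sequential strategy applies and no coroutine machinery is involved.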
Removing parallel behavior for now. We can discuss it in #723
Generated sources will be updated after merging this PR.
fixes #849
Small logic rewrite for `tryParseImpl`, and added KDocs. `StringParser`s can now be "covered by" another parser, meaning they will be skipped if the other parser is run. It makes no sense, for instance, to check whether a string can be parsed as a `java.time.Instant` if it cannot be parsed as a `kotlinx.datetime.Instant`.

The reason I didn't remove parsers that are covered by other parsers is to keep open the option to skip some parser types in the future. Say a user wants Java date-time classes instead of Kotlin ones: the Kotlin parsers can then be skipped and the Java ones will still run. This would need to be implemented separately in the future, but I have plans to implement parser-skipping for CSV readers that already handle the parsing of some types, but not all.

More importantly, this PR removes as many exceptions as possible from the "default path" of parsing.
To avoid exceptions I did the following:
- `kotlinx.datetime.Instant`: `Instant.parse` calls `DateTimeComponents.Formats.ISO_DATE_TIME_OFFSET.parse().toInstantUsingOffset()` instead of the exception-less `parseOrNull()`, so we'll simply use that version instead. Plus, to catch leap seconds, when it fails to parse, we'll try the Java instants too.
- `java.time.Duration` and `kotlin.time.Duration`: made failed parse attempts return `null` instead of throwing. This might be a bit more difficult to maintain; however, I put some tests in place to check for behavioral changes on the Kotlin side (tests that are "inspired" by the official tests).
- `java.time.Instant`, `java.time.LocalDateTime`, `java.time.LocalDate`, `java.time.LocalTime`: instead of calling `DateTimeFormatter.parse` directly, we can call `parseUnresolved` first, which we can catch failing without it throwing an error. If it does not fail, we call the normal `parse`, which now has a much lower chance of throwing an exception.

Finally, I used coroutines (as suggested in #723) to parallelize the parse operation per-column.
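The `parseUnresolved`-then-`parse` pattern for the `java.time` types can be sketched as follows. `DateTimeFormatter.parseUnresolved` is a real JDK API that reports failure through the `ParsePosition` instead of throwing; the helper name `tryParseLocalDate` is made up for this sketch:

```kotlin
import java.text.ParsePosition
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.time.format.DateTimeParseException

// parseUnresolved() signals "this is not a date at all" via the ParsePosition,
// so the common failure case costs no exception construction.
fun tryParseLocalDate(text: String, formatter: DateTimeFormatter): LocalDate? {
    val position = ParsePosition(0)
    val parsed = formatter.parseUnresolved(text, position)
    // Parse error, or trailing unparsed characters: bail out cheaply.
    if (parsed == null || position.errorIndex >= 0 || position.index < text.length) return null
    // The shape looks right; the resolving parse below rarely throws now
    // (mainly for values that only resolution rejects, e.g. month 13).
    return try {
        LocalDate.parse(text, formatter)
    } catch (e: DateTimeParseException) {
        null
    }
}
```

The residual try/catch stays because `parseUnresolved` skips resolution-time validation, but it is now only hit by near-valid inputs rather than by every non-date string in a column.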