Improve `list.chunked()` + `List<List<T>>.toDataFrame` for parsing .srt and similar text files #1486

koperagen · 2025-10-09T13:52:58Z

Consider a file structured like this:

header1
data1

header2
data2

I believe this layout is common. One example is `.srt`, and I personally have had to deal with it a lot.

It can be parsed into a dataframe:

File("data").readText().lines().chunked(3).map { it.dropLast(1) }.toDataFrame()
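To make the shape of the intermediate value concrete, here is a minimal standard-library sketch of the `chunked(3)` + `dropLast(1)` step. The sample text and the `toRows` helper name are assumptions made for illustration; no DataFrame API is involved.

```kotlin
// Hypothetical helper: groups lines into [header, data] rows, assuming every
// record is exactly two lines followed by one blank separator line.
fun toRows(text: String): List<List<String>> =
    text.lines()
        .chunked(3)              // [header, data, ""] per record
        .map { it.dropLast(1) }  // drop the blank separator line

fun main() {
    val text = "header1\ndata1\n\nheader2\ndata2\n"
    println(toRows(text)) // prints [[header1, data1], [header2, data2]]
}
```

One caveat of this approach: if the input does not end with a trailing line break, the last chunk holds only two lines and `dropLast(1)` discards its data line. Using `take(2)` instead of `dropLast(1)` would avoid that.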

Surprisingly, `toDataFrame` here is not the generic `Iterable.toDataFrame` but a completely different function.
Problem: in its current shape it's not helpful.
I end up with something very close to what I want, but the required code change is somewhat non-trivial.

I'd either have to prepend a header row manually:

val data = File("data").readText().lines().chunked(3).map { it.dropLast(1) }
val df = buildList { 
  add(listOf("col1", "col2"))
  addAll(data)
}.toDataFrame()

Or switch to a completely different route:

val df = File("data").readText().lines().chunked(3).map { it.dropLast(1) }.toDataFrame("data").split("data").cast<List<*>>().into("col1", "col2")

With the compiler plugin:

val df = File("data").readText().lines().chunked(3).map { it.dropLast(1) }.toDataFrame("data").split { data }.into("col1", "col2")

With this API change:

val df = File("data").readText().lines().chunked(3).map { it.dropLast(1) }.toDataFrame(header = listOf("col1", "col2"))

The plugin will understand the resulting schema too.
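The intended semantics of the `header` parameter can be sketched without the library. The code below is not the DataFrame implementation; `rowsToColumns` and the `Map`-based stand-in for a dataframe are hypothetical, and the only assumption carried over from this thread is that the first row supplies the column names when no header is given.

```kotlin
// Hypothetical stand-in for a dataframe: column name -> column values.
// This is NOT the library implementation, only an illustration of the idea.
fun <T> rowsToColumns(rows: List<List<T>>, header: List<String>? = null): Map<String, List<T>> {
    // Without an explicit header, the first row supplies the column names.
    val names = header ?: rows.first().map { it.toString() }
    val body = if (header == null) rows.drop(1) else rows
    require(body.all { it.size == names.size }) { "Every row must have ${names.size} values" }
    // Transpose the row-oriented input into named columns.
    return names.withIndex().associate { (i, name) -> name to body.map { it[i] } }
}

fun main() {
    val rows = listOf(listOf("header1", "data1"), listOf("header2", "data2"))
    println(rowsToColumns(rows, header = listOf("col1", "col2")))
    // prints {col1=[header1, header2], col2=[data1, data2]}
}
```

With an explicit header every chunk stays a data row, which is the case the `buildList` workaround above emulates by hand.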

zaleslaw · 2025-10-10T11:47:28Z

I agree that this is a popular use case. I faced the same problem and handled it with File API parsing.

It would be great if a tiny example, say with subtitles, were also added.

Jolanrensen · 2025-10-10T15:08:53Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/common.kt

-public fun <T> List<List<T>>.toDataFrame(containsColumns: Boolean = false): AnyFrame =
+@Refine
+@Interpretable("ValuesListsToDataFrame")
+public fun <T> List<List<T>>.toDataFrame(header: List<String>? = null, containsColumns: Boolean = false): AnyFrame =


This is a breaking change. Can we keep the old function and deprecate it?

Deprecated it and moved it from the io package to the api package.
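For context, adding a parameter to a Kotlin function changes its JVM signature even when the parameter has a default value, so already-compiled callers of the old one-parameter function would stop linking. The usual remedy is to keep the old entry point and mark it `@Deprecated`. The sketch below shows that pattern in general terms; `toTable`, `toTableLegacy`, and the `Map`-based result are hypothetical names, not the signatures from this PR.

```kotlin
// New entry point with the additional optional parameter (hypothetical name).
fun List<List<Any?>>.toTable(header: List<String>? = null): Map<String, List<Any?>> {
    val names = header ?: first().map { it.toString() }
    val body = if (header == null) drop(1) else this
    return names.withIndex().associate { (i, name) -> name to body.map { it[i] } }
}

// Old entry point kept so existing callers still compile and link.
@Deprecated(
    message = "Use toTable(header) instead.",
    replaceWith = ReplaceWith("toTable()"),
    level = DeprecationLevel.WARNING,
)
fun List<List<Any?>>.toTableLegacy(): Map<String, List<Any?>> = toTable(header = null)

@Suppress("DEPRECATION")
fun main() {
    val rows = listOf(listOf("a", "b"), listOf(1, 2))
    // Both calls produce the same result; the legacy one only adds a warning.
    println(rows.toTableLegacy() == rows.toTable())
}
```

`ReplaceWith` lets the IDE offer an automatic migration, and the deprecation level can later be raised to `ERROR` and then `HIDDEN` before the old function is removed.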

Jolanrensen

Seems like a useful addition :) It's common in Excel/CSV files too to take the first row as headers unless headers are supplied explicitly, so I guess it makes sense here too.

koperagen · 2025-10-15T16:44:35Z

@zaleslaw

It would be great if a tiny example, say with subtitles, were also added.

Updated the documentation.
In the process, I added headers to the page so that individual operations can be referred to.
Before:

After:

Jolanrensen · 2025-10-16T13:06:29Z

docs/StardustDocs/topics/createDataFrame.md

-Creates a [`DataFrame`](DataFrame.md) from an [`Iterable`](https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.collections/-iterable/) of objects:
+#### [`DataFrame`](DataFrame.md) from `List<List<T>>`:
+
+This is useful for parsing text files. For example, the `.srt` subtitle format can be parsed like this:
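The documentation example itself is not reproduced in this thread. As an illustration only, the sketch below turns SubRip-style text into `[index, timing, text]` rows using just the standard library. It assumes cues are separated by blank lines and that the text part may span several lines, which is why it groups by blank lines rather than using a fixed `chunked(n)`. The `parseSrt` name and the three-column layout are hypothetical.

```kotlin
// Hypothetical helper: splits SubRip-style text into [index, timing, text] rows.
// Assumption: cues are separated by one or more blank lines.
fun parseSrt(text: String): List<List<String>> {
    val cues = mutableListOf<List<String>>()
    var current = mutableListOf<String>()
    for (line in text.lines()) {
        if (line.isBlank()) {
            // A blank line closes the cue collected so far.
            if (current.isNotEmpty()) {
                cues.add(current)
                current = mutableListOf()
            }
        } else {
            current.add(line)
        }
    }
    if (current.isNotEmpty()) cues.add(current)
    // Join multi-line subtitle text so every row has exactly three values.
    return cues.map { cue ->
        listOf(cue[0], cue.getOrElse(1) { "" }, cue.drop(2).joinToString("\n"))
    }
}

fun main() {
    val srt = """
        1
        00:00:01,000 --> 00:00:02,500
        Hello

        2
        00:00:03,000 --> 00:00:05,000
        Two-line
        subtitle
    """.trimIndent()
    parseSrt(srt).forEach(::println)
}
```

Rows of this shape are what the `toDataFrame(header = ...)` overload proposed in this PR expects, with three column names of your choosing.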


core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/common.kt

Jolanrensen

useful feature and even better example :)

You still have a failing test, but other than that it looks good

koperagen added this to the 1.0.0-Beta4 milestone Oct 9, 2025

koperagen requested review from AndreiKingsley, Jolanrensen and zaleslaw October 9, 2025 13:52

koperagen self-assigned this Oct 9, 2025

koperagen added the enhancement (New feature or request) and Compiler plugin (Anything related to the DataFrame Compiler Plugin) labels Oct 9, 2025

Jolanrensen reviewed Oct 10, 2025

Jolanrensen approved these changes Oct 10, 2025

koperagen added 3 commits October 15, 2025 19:40

Improve list.chunked() + List<List<T>>.toDataFrame use case

194e962

Move List<List<T>>.toDataFrame from io to api with deprecation

4ab8ebf

Add an example in the documentation for List<List<T>>.toDataFrame

e84ddaa

koperagen force-pushed the lists-todataframe branch from b6c1734 to 995529d Compare October 15, 2025 16:42

koperagen requested a review from Jolanrensen October 15, 2025 16:44

koperagen mentioned this pull request Oct 15, 2025

Update spacing in createDataFrame.md #1480

Closed

Add headers for operations in createDataFrame.md to refer from other pages

075e96d

koperagen force-pushed the lists-todataframe branch from 995529d to 075e96d Compare October 15, 2025 16:51

Jolanrensen reviewed Oct 16, 2025

Jolanrensen approved these changes Oct 16, 2025

Fix test and add deprecation level

dc0fd33

koperagen changed the title ~~Improve list.chunked() + List<List<T>>.toDataFrame use case~~ Improve list.chunked() + List<List<T>>.toDataFrame for parsing .srt and similar text files Oct 16, 2025

koperagen merged commit 643a123 into master Oct 16, 2025
6 checks passed
