Skip to content

Conversation

@halhen
Copy link

@halhen halhen commented Jul 14, 2022

This improves the runtime significantly for loading data with many
columns. The order of loop nesting as well as a much more efficient
binary search does the trick.

In a real world example, fetching ~300k rows with ~50 columns from
MongoDB, this brings the query + load time from 70 seconds to ~40.
Used to be: ~10 seconds query, ~30 seconds transpose_list, and ~30
seconds simplifying colums. The transpose_list now takes <2 seconds.

Microbenchmark with synthetic data on an AMD 5950X, 128GB RAM, Fedora
Linux 36, R 4.1.3, jsonlite 1.8.0.9000 commit 8085435

> set.seed(1)
> rows <- 10000
> columns <- 100
> p_missing <- 0.2
>
> recordlist <- lapply(1:rows, function(rownum) {
+   row <- as.list(1:columns)
+   names(row) <- paste0("col_", row)
+   row[runif(columns) > p_missing]
+ })
> columns <- unique(unlist(lapply(recordlist, names), recursive = FALSE,
+                          use.names = FALSE))

Before this change

> microbenchmark::microbenchmark(
+     jsonlite:::transpose_list(recordlist, columns),
+     times = 10
+ )
Unit: milliseconds
                                           expr      min       lq     mean   median       uq      max neval
 jsonlite:::transpose_list(recordlist, columns) 577.8338 589.4064 593.0518 591.6895 599.4221 607.3057    10

With this change

> microbenchmark::microbenchmark(
+     jsonlite:::transpose_list(recordlist, columns),
+     times = 10
+ )
Unit: milliseconds
                                           expr      min       lq     mean   median       uq      max neval
 jsonlite:::transpose_list(recordlist, columns) 41.37537 43.22655 43.88987 43.76705 45.43552 46.81052    10

This improves the runtime significantly for loading data with many
columns. The order of loop nesting as well as a much more efficient
binary search does the trick.

In a real world example, fetching ~300k rows with ~50 columns from
MongoDB, this brings the query + load time from 70 seconds to ~40.

Microbenchmark with synthetic data on an AMD 5950X, 128GB RAM, Fedora
Linux 36, R 4.1.3, jsonlite 1.8.0.9000 commit 8085435

```
> set.seed(1)
> rows <- 10000
> columns <- 100
> p_missing <- 0.2
>
> recordlist <- lapply(1:rows, function(rownum) {
+   row <- as.list(1:columns)
+   names(row) <- paste0("col_", row)
+   row[runif(columns) > p_missing]
+ })
> columns <- unique(unlist(lapply(recordlist, names), recursive = FALSE,
+                          use.names = FALSE))
```

Before this change

```
> microbenchmark::microbenchmark(
+     jsonlite:::transpose_list(recordlist, columns),
+     times = 10
+ )
Unit: milliseconds
                                           expr      min       lq     mean   median       uq      max neval
 jsonlite:::transpose_list(recordlist, columns) 577.8338 589.4064 593.0518 591.6895 599.4221 607.3057    10
```

With this change

```
> microbenchmark::microbenchmark(
+     jsonlite:::transpose_list(recordlist, columns),
+     times = 10
+ )
Unit: milliseconds
                                           expr      min       lq     mean   median       uq      max neval
 jsonlite:::transpose_list(recordlist, columns) 41.37537 43.22655 43.88987 43.76705 45.43552 46.81052    10
```
#' @useDynLib jsonlite C_transpose_list
transpose_list <- function(x, names) {
.Call(C_transpose_list, x, names)
# Sort names before entering C, allowing for a binary search
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorting names lets us use a binary search in C. Sorting is a lot easier to do in R. If you are willing to add {stringi} or {withr} as dependencies, we can sort C-style without this ugly Sys.setlocale() dance? There might well be other, cleaner ways to do this.

for(size_t k = 0; k < Rf_length(listnames); k++){
if(!strcmp(CHAR(STRING_ELT(listnames, k)), targetname)){
SET_VECTOR_ELT(out, i, col);
UNPROTECT(1);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very inexperienced integrating R and C. Please take extra care to review my PROTECT():s.

halhen added 2 commits July 17, 2022 11:57
If a name exists in the data, sorted less than the smallest being
requested, the previous code would end up in an infinite loop.
@jeroen jeroen force-pushed the master branch 8 times, most recently from e43a8f6 to 89a3e99 Compare December 5, 2022 22:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant