summarize() with multi-row returns #6382

@krlmlr

As of dplyr 1.0.0, summarize() creates one row per element of the value returned by the summary function, so a group can produce several rows, or none at all. This leads to unintended behavior when a vector return is accidental, and it can also lose data: in the example below, the group with n = 0 returns a zero-length vector and silently disappears from the result.

library(conflicted)
library(dplyr)

my_custom_summary_function <- function(n) {
  # Should return a scalar, but I accidentally return a vector
  rep(n, n)
}

tibble(n = 2:0) %>% 
  group_by(n) %>% 
  summarize(out = my_custom_summary_function(n), .groups = "drop") %>% 
  ungroup()
#> # A tibble: 3 × 2
#>       n   out
#>   <int> <int>
#> 1     1     1
#> 2     2     2
#> 3     2     2

Created on 2022-08-01 by the reprex package (v2.0.1)

Should we introduce a .multi = c("allow", "require", "fail") argument, where .multi = "fail" restores the strict pre-1.0.0 behavior of raising an error when a summary does not have length 1? Should .multi = "fail" even be the default?

library(conflicted)
library(dplyr)

my_custom_summary_function <- function(n) {
  # Should return a scalar, but I accidentally return a vector
  rep(n, n)
}

tibble(n = 2:0) %>% 
  group_by(n) %>% 
  summarize(out = my_custom_summary_function(n), .groups = "drop", .multi = "fail") %>% 
  ungroup()
## Error: `out` has length != 1 in groups 1, 3, use `.multi = "allow"` if this is intended

Imagined on 2022-08-01 by the reprex package (v2.0.1)
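
Until something like .multi exists, the strict check can be approximated in user code. The following is only a minimal sketch: `scalar()` is a hypothetical helper written for this illustration, not part of dplyr, and the error is captured with `tryCatch()` so that the script runs to completion.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): pass a value through
# only if it has exactly one element, otherwise raise an error.
scalar <- function(x) {
  if (length(x) != 1L) {
    stop("Expected a summary of length 1, got length ", length(x), call. = FALSE)
  }
  x
}

my_custom_summary_function <- function(n) {
  # Should return a scalar, but accidentally returns a vector
  rep(n, n)
}

# Wrapping the summary in scalar() turns the silent reshaping into an error.
# tryCatch() stores the condition object instead of aborting the script.
result <- tryCatch(
  tibble(n = 2:0) %>%
    group_by(n) %>%
    summarize(out = scalar(my_custom_summary_function(n)), .groups = "drop"),
  error = function(e) e
)

inherits(result, "error")
```

This only guards the calls that are wrapped explicitly, which is why an argument on summarize() itself would still be preferable.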
