Description
Submitted by: James Sams; Assigned to: Nobody; R-Forge link
TL;DR: dim(unique(..., by=c(A, B))) reports MORE rows than dim(unique(..., by=c(A, B, C))). Affects duplicated() and merge(). I see this in 1.8.11, not 1.8.10.
I discovered this when a merge that previously worked started failing because it believed itself to be a cartesian join, so the affected code is used by merge() as well. However, I think the problem is clearer with unique(). I have a data.table with 3 columns (double, integer, integer). The double column is read by fread as integer64, but I've found integer64 to be unreliable, so I stick to double/numeric. The values are up to 12 digits, all positive, and always integral. I've reproduced this problem by coercing the other columns to double, and by reading with read.delim and then coercing to data.table.
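As a side note, the claim that these values are safe to hold in a double can be checked directly. The sketch below uses a small made-up vector `upc` (hypothetical values, not the reporter's data) and tests that every value is positive, integral, and below 2^53, the bound under which doubles represent integers exactly.

```r
# Hypothetical stand-in for the upc column: positive, integral, up to 12 digits.
upc <- c(12345678901, 999999999999, 1)

# Doubles represent every integer with magnitude below 2^53 exactly,
# so 12-digit integral values lose no precision.
is_exact <- all(upc > 0) &&
  all(upc == floor(upc)) &&
  max(abs(upc)) < 2^53

is_exact
```

If this holds for the real column, precision loss in the stored values themselves should not explain the discrepancy.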
sapply(DT, class)
# upc upc_ver_uc panel_year
# "numeric" "integer" "integer"
#
str(DT)
# Classes ‘data.table’ and 'data.frame': 779473 obs. of 3 variables:
# <censored>
# - attr(*, ".internal.selfref")=<externalptr>
# - attr(*, "sorted")= chr "upc" "panel_year"
#
dim(DT)
# [1] 779473 3
key(DT)
# [1] "upc" "panel_year"
dim(unique(DT))
# [1] 779473 3
dim(unique(DT, by=c("upc", "panel_year")))
# [1] 779473 3
THIS is where things go wrong. Notice that adding a column to by lowers the number of rows:
dim(unique(DT, by=c("upc", "upc_ver_uc", "panel_year")))
# [1] 725228 3
There are no NA's or similar in the data:
DT[,list(sum(is.na(upc), is.na(upc_ver_uc), is.na(panel_year)))]
# V1
#1: 0
DT[,list(sum(is.nan(upc), is.nan(upc_ver_uc), is.nan(panel_year)))]
# V1
#1: 0
DT[,list(sum(is.null(upc), is.null(upc_ver_uc), is.null(panel_year)))]
# V1
#1: 0
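The expectation behind this report is that grouping by a superset of columns can never yield fewer distinct rows than grouping by a subset. A minimal sketch of that check in base R, independent of data.table, might look like the following. The data frame `df` and the helper `n_distinct` are hypothetical names on toy values, not the reporter's data.

```r
# Toy data (hypothetical values) with the same column names as the report.
df <- data.frame(
  upc        = c(1, 1, 2, 2, 3),
  upc_ver_uc = c(1L, 2L, 1L, 1L, 1L),
  panel_year = c(2004L, 2004L, 2004L, 2005L, 2004L)
)

# Hypothetical helper: number of distinct combinations of the given columns.
n_distinct <- function(d, cols) nrow(unique(d[, cols, drop = FALSE]))

n_sub   <- n_distinct(df, c("upc", "panel_year"))
n_super <- n_distinct(df, c("upc", "upc_ver_uc", "panel_year"))

# Adding a column can split groups but never merge them.
stopifnot(n_sub <= n_super)
```

Running the same comparison on the real data with base R would give a reference count to compare against what unique() on the data.table reports.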