[R-Forge #5369] Implement integer64 grouping/unique etc or options(datatable.tolerance=0) or both #342

@arunsrinivasan

Submitted by: James Sams; Assigned to: Nobody; R-Forge link

TL;DR: dim(unique(..., by=c(A, B))) reports MORE rows than dim(unique(..., by=c(A, B, C))). Affects duplicated() and merge(). I see this in 1.8.11, not 1.8.10.

I actually discovered this when a merge that had been working stopped working because it believed itself to be a cartesian join, so the affected code is used by merge() as well. However, I think the problem is clearer with unique(). I have a data.table with 3 columns (double, integer, integer). fread reads the double column as integer64, but I've found integer64 to be unreliable, so I stick to double/numeric. The values are up to 12 digits, all positive, and, as I said, always integral. I've reproduced this problem after coercing the other columns to double, reading with read.delim, and coercing to data.table.
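
The issue title points at a comparison tolerance, so here is a minimal base-R sketch of how two distinct 12-digit integral doubles can look "equal" under a relative tolerance. It assumes nothing about data.table's internals: all.equal() is only a stand-in for a tolerant comparison, and the values a and b are made up for illustration.

```r
# Hypothetical 12-digit values, not taken from the reporter's data.
a <- 123456789012
b <- 123456789013

# Both are below 2^53, so a double stores them exactly and they compare unequal.
exact_equal <- a == b

# all.equal() uses a relative tolerance of about 1.5e-8 by default.
# The relative gap between a and b is about 8e-12, so they count as "equal".
tolerant_equal <- isTRUE(all.equal(a, b))

print(c(exact = exact_equal, tolerant = tolerant_equal))
```

If grouping on doubles applies a tolerance of this kind, neighbouring upc values could be merged into one group, which would be consistent with the row counts reported here. That is a guess from the title, not a confirmed diagnosis.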

sapply(DT, class)
#        upc upc_ver_uc panel_year 
#  "numeric"  "integer"  "integer"
# 
str(DT)
# Classes ‘data.table’ and 'data.frame':  779473 obs. of  3 variables:
# <censored>
#  - attr(*, ".internal.selfref")=<externalptr> 
#  - attr(*, "sorted")= chr  "upc" "panel_year"
# 
dim(DT)
# [1] 779473      3
key(DT)
# [1] "upc"        "panel_year"
dim(unique(DT))
# [1] 779473      3
dim(unique(DT, by=c("upc", "panel_year")))
# [1] 779473      3

THIS is where things go wrong. Notice that adding a column (upc_ver_uc) to by yields FEWER unique rows, which should be impossible:

dim(unique(DT, by=c("upc", "upc_ver_uc", "panel_year")))
# [1] 725228      3
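
To make the expected behaviour explicit, here is a small base-R sketch of the invariant being violated: counting distinct rows over a superset of columns can never give a smaller number than counting over a subset. The data frame and the helper n_unique() are hypothetical and use base duplicated() on a data.frame, not data.table.

```r
# Small made-up data.frame with the same column names as the report.
df <- data.frame(
  upc        = c(1, 1, 2, 2, 3),
  upc_ver_uc = c(1L, 2L, 1L, 1L, 1L),
  panel_year = c(2004L, 2004L, 2005L, 2005L, 2006L)
)

# Count distinct rows over a chosen set of columns using base R only.
n_unique <- function(d, cols) sum(!duplicated(d[, cols, drop = FALSE]))

n_two   <- n_unique(df, c("upc", "panel_year"))
n_three <- n_unique(df, c("upc", "upc_ver_uc", "panel_year"))

# Adding a column can split groups but never merge them.
stopifnot(n_three >= n_two)
print(c(two = n_two, three = n_three))
```

Running the same comparison on the real DT with base duplicated(as.data.frame(DT)[, cols]) would be one way to check whether the discrepancy is specific to data.table's methods.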

There are no NAs or similar in the data:

DT[,list(sum(is.na(upc), is.na(upc_ver_uc), is.na(panel_year)))]
#    V1
#1:  0
DT[,list(sum(is.nan(upc), is.nan(upc_ver_uc), is.nan(panel_year)))]
#    V1
#1:  0
DT[,list(sum(is.null(upc), is.null(upc_ver_uc), is.null(panel_year)))]
#    V1
#1:  0
