# Start replacing TRUELENGTH markers with a hash (#6694)
## Conversation
**Codecov Report** ❌ Patch coverage is …

```diff
@@            Coverage Diff             @@
##           master    #6694      +/-   ##
==========================================
- Coverage   99.10%   99.09%   -0.02%
==========================================
  Files          84       85       +1
  Lines       16126    16166      +40
==========================================
+ Hits        15981    16019      +38
- Misses        145      147       +2
```

---

Generated via commit 56dcbf9. Download link for the artifact containing the test results: atime-results.zip

---

Also avoid crashing when creating a 0-size hash.
This will likely require a dynamically growing hash of TRUELENGTHs instead of the current pre-allocation approach with a very conservative over-estimate.
The hash needs O(n) memory (more precisely, 2*n/load_factor entries), which isn't great.
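For concreteness, here is a minimal sketch of the sizing rule implied above, including a guard for the 0-size case from the commit message; the helper name and the 0.5 load factor are illustrative assumptions, not this PR's actual constants:

```c
#include <stddef.h>

// Hypothetical sizing helper: for n expected keys at load factor 0.5,
// allocate at least 2*n slots, rounded up to a power of two for cheap index
// wrapping, and never return 0 so that an empty input cannot produce a
// zero-slot table for the probe loop to spin on.
static size_t hash_size_for(size_t n) {
  size_t size = 2;
  while (size < 2 * n) size *= 2;
  return size;
}
```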

---

hi, thanks for this. Can you please propose one or two performance test cases that you think may be adversely affected by these changes? Is it when we create a table with one column, and then use …

---

The … The … I'll try giving many …

---

Perhaps you can also try a fast 3rd-party hash map: https://martin.ankerl.com/2019/04/01/hashmap-benchmarks-01-overview/ In particular, Google's Abseil hash is pretty fast: https://abseil.io/docs/cpp/guides/container https://abseil.io/docs/cpp/guides/hash

---

It's pretty bad. For typical cases, the current hash table eats >1 order of magnitude more R* memory, and it's similarly slower in … The hash table is only on par by time in the worst case for …

```r
# may need 16G of RAM to run comfortably due to pathological memory allocation patterns
library(atime)
pkg.path <- '.'
limit <- 1
# taken from .ci/atime/tests.R
pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path) {
pkg_find_replace <- function(glob, FIND, REPLACE) {
atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
}
Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
new.Package_ <- paste0(Package_, "_", sha)
pkg_find_replace(
"DESCRIPTION",
paste0("Package:\\s+", old.Package),
paste("Package:", new.Package))
pkg_find_replace(
file.path("src", "Makevars.*in"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
sprintf('packageVersion\\("%s"\\)', old.Package),
sprintf('packageVersion\\("%s"\\)', new.Package))
pkg_find_replace(
file.path("src", "init.c"),
paste0("R_init_", Package_regex),
paste0("R_init_", gsub("[.]", "_", new.Package_)))
pkg_find_replace(
"NAMESPACE",
sprintf('useDynLib\\("?%s"?', Package_regex),
paste0('useDynLib(', new.Package_))
}
versions <- c(
master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
truehash = '24e81785669e70caac31501bf4424ba14dbc90f9'
)
N <- 10^seq(2, 8.5, .25)
# expected case: a few distinct strings
forderv1_work <- lapply(setNames(nm = N), \(N)
sample(letters, N, TRUE)
)
forderv1 <- atime_versions(
pkg.path, N,
expr = data.table:::forderv(forderv1_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(forderv1_work); gc(full = TRUE)
# worst case: all strings different
# (a challenge for the allocator too due to many small immovable objects)
N <- 10^seq(2, 7.5, .25)
forderv2_work <- lapply(setNames(nm = N), \(N)
format(runif(N), digits = 16)
)
forderv2 <- atime_versions(
pkg.path, N,
expr = data.table:::forderv(forderv2_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(forderv2_work); gc(full = TRUE)
# expected case: all columns named the same
N <- 10^seq(1, 6.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist1_work <- lapply(setNames(nm = N), \(N)
rep(list(setNames(as.list(1:k), letters[1:k])), N)
)
rbindlist1 <- atime_versions(
pkg.path, N,
expr = data.table::rbindlist(rbindlist1_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist1_work); gc(full = TRUE)
# worst case: all columns different
N <- 10^seq(1, 5.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist2_work <- lapply(setNames(nm = N), \(N)
replicate(N, setNames(as.list(1:k), format(runif(k), digits = 16)), FALSE)
)
rbindlist2 <- atime_versions(
pkg.path, N,
expr = data.table::rbindlist(rbindlist2_work[[as.character(N)]], fill = TRUE),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist2_work); gc(full = TRUE)
save(forderv1, forderv2, rbindlist1, rbindlist2, file = 'times.rda')
```

* Edit: Some of the memory use in … I'll try profiling the code. Thanks @SebKrantz for the link; a newer benchmark by the same author is also very instructive.

---

thanks for proposing the performance test cases and sharing the atime benchmark graphs. I agree that we should try to avoid an order-of-magnitude constant-factor increase in time/memory usage.
In forder() and rbindlist(), there is no good upper bound on the number of elements in the hash known ahead of time, so grow the hash table dynamically. Since the R/W locks are far too slow and OpenMP atomics are too limited, rely on strategically placed flushes, which isn't really a solution.
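A minimal sketch of what such a scheme could look like, under the stated constraints (no R/W locks, only flushes); `shared_marks`, `hash_grow`, and `marks_grow_if_needed` are hypothetical names, not this PR's API:

```c
#include <omp.h>

typedef struct hashtab hashtab;
extern hashtab *shared_marks;               // hypothetical shared table pointer
extern hashtab *hash_grow(const hashtab *); // hypothetical: allocate a larger copy

void marks_grow_if_needed(int need_grow) {
  if (!need_grow) return;
  #pragma omp critical(marks_grow)
  {
    hashtab *bigger = hash_grow(shared_marks);
    // The old buffer is deliberately not freed here: readers that already
    // loaded the stale pointer can finish their probes safely.
    shared_marks = bigger;
    #pragma omp flush // publish the new pointer to threads that re-read it
  }
}
```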

---

Since profiling has shown that a noticeable amount of time is wasted initialising the giant pre-allocated hash tables, I was able to make the slowdown factor closer to 2 by dynamically re-allocating the hash table. The memory use is significantly reduced (except for the worst cases), but cannot be measured with …

```r
library(atime)
pkg.path <- '.'
limit <- 1
# taken from .ci/atime/tests.R
pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path) {
pkg_find_replace <- function(glob, FIND, REPLACE) {
atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
}
Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
new.Package_ <- paste0(Package_, "_", sha)
pkg_find_replace(
"DESCRIPTION",
paste0("Package:\\s+", old.Package),
paste("Package:", new.Package))
pkg_find_replace(
file.path("src", "Makevars.*in"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
sprintf('packageVersion\\("%s"\\)', old.Package),
sprintf('packageVersion\\("%s"\\)', new.Package))
pkg_find_replace(
file.path("src", "init.c"),
paste0("R_init_", Package_regex),
paste0("R_init_", gsub("[.]", "_", new.Package_)))
pkg_find_replace(
"NAMESPACE",
sprintf('useDynLib\\("?%s"?', Package_regex),
paste0('useDynLib(', new.Package_))
}
versions <- c(
master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
static_hash = '24e81785669e70caac31501bf4424ba14dbc90f9',
dynamic_hash = 'd7a9a1707ec94ec4f2bd86a5dfb5609207029ba4'
)
N <- 10^seq(2, 7.5, .25)
# expected case: a few distinct strings
forderv1_work <- lapply(setNames(nm = N), \(N)
sample(letters, N, TRUE)
)
forderv1 <- atime_versions(
pkg.path, N,
expr = data.table:::forderv(forderv1_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(forderv1_work); gc(full = TRUE)
# worst case: all strings different
# (a challenge for the allocator too due to many small immovable objects)
N <- 10^seq(2, 7.5, .25)
forderv2_work <- lapply(setNames(nm = N), \(N)
format(runif(N), digits = 16)
)
forderv2 <- atime_versions(
pkg.path, N,
expr = data.table:::forderv(forderv2_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(forderv2_work); gc(full = TRUE)
# expected case: all columns named the same
N <- 10^seq(1, 5.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist1_work <- lapply(setNames(nm = N), \(N)
rep(list(setNames(as.list(1:k), letters[1:k])), N)
)
rbindlist1 <- atime_versions(
pkg.path, N,
expr = data.table::rbindlist(rbindlist1_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist1_work); gc(full = TRUE)
# worst case: all columns different
N <- 10^seq(1, 4.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist2_work <- lapply(setNames(nm = N), \(N)
replicate(N, setNames(as.list(1:k), format(runif(k), digits = 16)), FALSE)
)
rbindlist2 <- atime_versions(
pkg.path, N,
expr = data.table::rbindlist(rbindlist2_work[[as.character(N)]], fill = TRUE),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist2_work); gc(full = TRUE)
#save(forderv1, forderv2, rbindlist1, rbindlist2, file = 'times.rda')
```

The main problem with the current approach is that since the parallel loop in … The current code keeps one previous hash table until the next reallocation cycle and hopes that …

---

In case it helps, {collapse}'s hash functions (https://github.com/SebKrantz/collapse/blob/master/src/kit_dup.c and https://github.com/SebKrantz/collapse/blob/master/src/match.c) are pretty fast as well; inspired by base R, they use a multiplication hash with an unsigned integer prime number. It's bloody fast but requires a large table. But Calloc() is quite efficient. Anyway, it would be great if you'd test the Google hash function; curious to see if it can do much better. PS: you can test …
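For reference, a sketch of that multiplication-hash family (the constant here is the Knuth prime 2654435761, a common choice; whether it is the exact constant used by base R or {collapse} is not claimed): multiply by a large odd constant and keep the top K bits to index a table of 2^K slots.

```c
#include <stdint.h>

// Valid for 0 < K < 32: the multiplication mixes every input bit into the
// high output bits, so the top K bits spread well over a 2^K-slot table.
static inline uint32_t mult_hash(uint32_t key, int K) {
  return (key * 2654435761u) >> (32 - K);
}
```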

---

The abseil hash function is very slightly slower in my tests, although the difference may not be significant. Perhaps that's because my C port fails to inline some of the things that naturally inline in the original C++ with templates. I can try harder, but that's a lot of extra code to bring in properly.

```r
library(atime)
pkg.path <- '.'
limit <- 1
# taken from .ci/atime/tests.R
pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path) {
pkg_find_replace <- function(glob, FIND, REPLACE) {
atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
}
Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
new.Package_ <- paste0(Package_, "_", sha)
pkg_find_replace(
"DESCRIPTION",
paste0("Package:\\s+", old.Package),
paste("Package:", new.Package))
pkg_find_replace(
file.path("src", "Makevars.*in"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
sprintf('packageVersion\\("%s"\\)', old.Package),
sprintf('packageVersion\\("%s"\\)', new.Package))
pkg_find_replace(
file.path("src", "init.c"),
paste0("R_init_", Package_regex),
paste0("R_init_", gsub("[.]", "_", new.Package_)))
pkg_find_replace(
"NAMESPACE",
sprintf('useDynLib\\("?%s"?', Package_regex),
paste0('useDynLib(', new.Package_))
}
versions <- c(
master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
'Knuth_hash' = 'd7a9a1707ec94ec4f2bd86a5dfb5609207029ba4',
'abseil_hash' = '159e1d48926b72af9f212b8c645a8bc8ab6b20be'
)
N <- 10^seq(2, 7.5, .25)
# expected case: a few distinct strings
forderv1_work <- lapply(setNames(nm = N), \(N)
sample(letters, N, TRUE)
)
forderv1 <- atime_versions(
pkg.path, N,
expr = data.table:::forderv(forderv1_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(forderv1_work); gc(full = TRUE)
# worst case: all strings different
# (a challenge for the allocator too due to many small immovable objects)
N <- 10^seq(2, 7.5, .25)
forderv2_work <- lapply(setNames(nm = N), \(N)
format(runif(N), digits = 16)
)
forderv2 <- atime_versions(
pkg.path, N,
expr = data.table:::forderv(forderv2_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(forderv2_work); gc(full = TRUE)
# expected case: all columns named the same
N <- 10^seq(1, 5.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist1_work <- lapply(setNames(nm = N), \(N)
rep(list(setNames(as.list(1:k), letters[1:k])), N)
)
rbindlist1 <- atime_versions(
pkg.path, N,
expr = data.table::rbindlist(rbindlist1_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist1_work); gc(full = TRUE)
# worst case: all columns different
N <- 10^seq(1, 4.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist2_work <- lapply(setNames(nm = N), \(N)
replicate(N, setNames(as.list(1:k), format(runif(k), digits = 16)), FALSE)
)
rbindlist2 <- atime_versions(
pkg.path, N,
expr = data.table::rbindlist(rbindlist2_work[[as.character(N)]], fill = TRUE),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist2_work); gc(full = TRUE)
save(forderv1, forderv2, rbindlist1, rbindlist2, file = 'times.rda')
```
```r
library(atime)
library(data.table)
limit <- 1
# assumes that atime_versions() had pre-installed the packages
# master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
# 'Knuth_hash' = 'd7a9a1707ec94ec4f2bd86a5dfb5609207029ba4',
N <- 10^seq(2, 7.5, .25)
# expected case: a few distinct strings
chmatch_work1 <- lapply(setNames(nm = N), \(N)
sample(letters, N, TRUE)
)
chmatch1 <- atime(
N,
seconds.limit = limit, verbose = TRUE,
master = data.table.70c64ac08c6becae5847cd59ab1efcb4c46437ac::chmatch(chmatch_work1[[as.character(N)]], letters),
Knuth_hash = data.table.d7a9a1707ec94ec4f2bd86a5dfb5609207029ba4::chmatch(chmatch_work1[[as.character(N)]], letters),
collapse = collapse::fmatch(chmatch_work1[[as.character(N)]], letters)
)
rm(chmatch_work1); gc(full = TRUE)
save(chmatch1, file = 'times_collapse.rda')
```

And the real memory cost isn't even that large:

```r
library(atime)
library(data.table)
# assumes that atime_versions() had pre-installed the packages
# master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
# 'Knuth_hash' = 'd7a9a1707ec94ec4f2bd86a5dfb5609207029ba4',
library(data.table.70c64ac08c6becae5847cd59ab1efcb4c46437ac)
library(data.table.d7a9a1707ec94ec4f2bd86a5dfb5609207029ba4)
library(parallel)
# only tested on a recent Linux system
# measures the _maximal_ amount of memory in kB used by the current process
writeLines('
#include <sys/resource.h>
void maxrss(double * kb) {
struct rusage ru;
int ret = getrusage(RUSAGE_SELF, &ru);
*kb = ret ? -1 : ru.ru_maxrss;
}
', 'maxrss.c')
tools::Rcmd('SHLIB maxrss.c')
dyn.load(paste0('maxrss', .Platform$dynlib.ext))
limit <- 1
N <- 10^seq(2, 7.5, .25)
# expected case: a few distinct strings
chmatch_work1 <- lapply(setNames(nm = N), \(N)
sample(letters, N, TRUE)
)
versions <- expression(
master = data.table.70c64ac08c6becae5847cd59ab1efcb4c46437ac::chmatch(chmatch_work1[[as.character(N)]], letters),
Knuth_hash = data.table.d7a9a1707ec94ec4f2bd86a5dfb5609207029ba4::chmatch(chmatch_work1[[as.character(N)]], letters),
collapse = collapse::fmatch(chmatch_work1[[as.character(N)]], letters)
)
plan <- expand.grid(N = N, version = names(versions))
chmatch1 <- lapply(seq_len(nrow(plan)), \(i) {
# use a disposable child process
mccollect(mcparallel({
eval(versions[[plan$version[[i]]]], list(N = plan$N[[i]]))
.C('maxrss', kb = double(1))$kb
}))
})
rm(chmatch_work1); gc(full = TRUE)
save(chmatch1, file = 'times_collapse.rda')
chmatch1p <- lattice::xyplot(
maxrss_kb ~ N, cbind(plan, maxrss_kb = unlist(chmatch1)), group = version,
auto.key = TRUE, scales = list(log = 10),
par.settings=list(superpose.symbol=list(pch=19))
)
```

---

Nice! Thanks. I always wondered about this tradeoff between the size of the table and the quality of the hash function. Looks like speed + a large table still wins. Anyway, if you want to adopt it, feel free to copy it under your MPL license. Just mention me at the top of the file and as a contributor.

---

PS: I believe it also depends on the size of the …

---

excellent work thank you very much |
Use only 28 bits of the pointer (the lower 32, discarding the lowest 4). Inline the linear search by advancing the pointer instead of repeatedly computing and dividing the hash value. Average improvement of 10%.
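A sketch of the pointer truncation this commit describes, assuming (as the commit implies) that the hashed pointers are at least 16-byte aligned, so the lowest 4 bits are always zero:

```c
#include <stdint.h>

// Keep 28 useful bits: take the low 32 bits of the pointer, then drop the
// 4 alignment bits that carry no information.
static inline uint32_t ptr_bits28(const void *p) {
  return ((uint32_t)(uintptr_t)p) >> 4;
}
```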

---

The … There's … Initialising the hash currently takes time and memory proportional to … (see lines 163 to 170 in 9fe1b8d).

A good … I still have a few ideas for how to speed this up.
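If the cost is the up-front zeroing of every slot, one possible mitigation (an assumption about the direction hinted at here, not necessarily what this PR does) is to take zero-filled memory from calloc(), which on common platforms maps zero pages lazily:

```c
#include <stdlib.h>

typedef struct { void *key; int value; } slot; // illustrative slot layout

// calloc() returns zero-initialised memory without touching every page up
// front on typical OSes, so an over-sized table costs real memory only for
// the pages actually probed -- provided key == NULL means "empty slot".
static slot *alloc_table(size_t n_slots) {
  return calloc(n_slots, sizeof(slot));
}
```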
```c
int tl = hash_lookup(marks, s, 0);
if (tl==0) hash_set(marks, s, chmatchdup ? -(++nuniq) : -i-1); // first time seen this string in table
```
hash_set wastes time performing the lookup again. Need to expose a "lookup or create" operation somehow to avoid that.
I guess we can do that analogously to hash_set?
```c
R_xlen_t hash_lookup_or_insert(hashtab *h, SEXP key, R_xlen_t value) {
struct hash_pair *cell = h->tb + hash_index(key, h->multiplier) % h->size, *end = h->tb + h->size - 1;
for (size_t i = 0; i < h->size; ++i, cell = (cell == end ? h->tb : cell + 1)) {
if (cell->key == key) {
return cell->value; // Key exists, don't update, return value
} else if (!cell->key) {
if (!h->free) internal_error(
__func__, "no free slots left (full size=%zu)", h->size
);
--h->free;
*cell = (struct hash_pair){.key = key, .value = value};
return value; // insert here
}
}
internal_error( // # nocov
__func__, "did not find a free slot for key %p; size=%zu, free=%zu",
(void*)key, h->size, h->free
);
// Should be impossible, but just in case:
return value;
}
```
It can be found in 337a0c2. Gives a little speedup but not too much.

---

WDYT about the possibility of proposing that our %chin% just be upstreamed into R itself? Our version has been "well-tested & battle-hardened" for many years, and seems naturally suited to r-devel given its reliance on internal behavior.
```c
SEXP key;
R_xlen_t value;
```
Can halve the memory requirements by storing int indices into an array instead of SEXP pointers and int mark values instead of R_xlen_t (thanks to Sebastian Krantz for the idea). Cost: this makes the hash table incompatible with long vectors. Since int dimensions for matrices and data.frames are baked into almost all of R, it's probably worth using ints here too.
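A sketch of the compacted slot this suggests (hypothetical field names, not the PR's layout); the side array of unique CHARSXP pointers is assumed to live elsewhere:

```c
// 8 bytes per slot instead of 16: key_index is a 1-based index into a
// separate array of the unique string pointers (0 marks an empty slot), and
// value holds the int mark. Both being int rules out long vectors, matching
// R's int-based matrix/data.frame dimensions.
struct hash_pair_compact {
  int key_index;
  int value;
};
```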
```c
if (TRUELENGTH(s)>0) // save any of R's own usage of tl (assumed positive, so we can both count and save in one scan), to restore
  savetl(s); // afterwards. From R 2.14.0, tl is initialized to 0, prior to that it was random so this step saved too much.
// now save unique SEXP in ustr so i) we can loop through them afterwards and reset TRUELENGTH to 0 and ii) sort uniques when sorting too
```

Why is it acceptable to call dhash_lookup when marks can be shared between threads?
So far the only way I know to make this provably safe is to throw away old hash table buffers (to be garbage collected by R).
Fast, thread-safe, dynamically growing, memory-efficient hash table: choose two out of four, maybe three if you're lucky.
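A sketch of that buffer-retirement idea (illustrative names, not the PR's code): instead of free()ing the old table on growth, push it onto a retire list, so a stale pointer briefly held by another thread stays valid; the whole list is released once no reader can remain.

```c
#include <stdlib.h>

struct retired { void *buf; struct retired *next; };
static struct retired *retire_list = NULL;

// Called instead of free() when the table grows.
static void retire(void *old_buf) {
  struct retired *node = malloc(sizeof *node);
  node->buf = old_buf;
  node->next = retire_list;
  retire_list = node;
}

// Called after the parallel region, when no thread can hold a stale pointer.
static void retire_free_all(void) {
  while (retire_list) {
    struct retired *next = retire_list->next;
    free(retire_list->buf);
    free(retire_list);
    retire_list = next;
  }
}
```

(In the PR's setting the buffers would instead be R allocations left to the garbage collector, per the comment above.)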

---

@MichaelChirico, it may be worth a try. I suppose this would mean falling back to … The … The remaining uses of …
It is worth noting that there is an ongoing effort, expected to start in May 2025, to work on a growable vector API in R. Then we may expect easier, or more performant, ways to address the current problems.

---

If …

---

I extended Ivan's benchmark a little further, e.g. when our table gets bigger, and also included NO matches. We definitely have to take a closer look at this with varying table size etc. (might also try other hashing methods like double, Robin Hood or cuckoo hashing; a sketch of cuckoo lookup follows the second script below).
```r
library(atime)
library(data.table)
limit <- 1 # used by seconds.limit below; not defined in the original snippet
# master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
# 'Knuth_hash' = 'd7a9a1707ec94ec4f2bd86a5dfb5609207029ba4',
sample_strings = function(N=10, len=4) {
do.call(paste0, replicate(len, sample(LETTERS, N, TRUE), FALSE))
}
N <- 10^seq(2, 7.5, .25)
tab_full = sample_strings(1e6, 10)
tab_small = sample(tab_full, 9e5)
# expected case: a few distinct strings
chmatch_work1 <- lapply(setNames(nm = N), \(N)
sample(tab_full, N, TRUE)
)
chmatch1 <- atime(
N,
seconds.limit = limit, verbose = TRUE,
master = data.table.70c64ac08c6becae5847cd59ab1efcb4c46437ac::chmatch(chmatch_work1[[as.character(N)]], tab_small),
Knuth_hash = data.table.d7a9a1707ec94ec4f2bd86a5dfb5609207029ba4::chmatch(chmatch_work1[[as.character(N)]], tab_small),
collapse = collapse::fmatch(chmatch_work1[[as.character(N)]], tab_small)
)
plot(chmatch1)
rm(chmatch_work1); gc(full = TRUE)
save(chmatch1, file = 'fmatch_missings.rda')
```

Edit: now with cuckoo hashing with the original setup, and it seems close to master (in terms of speed).

Ivan's original case:
Case with misses:
```r
library(atime)
library(data.table)
pkg.path <- '.'
limit <- 1
# taken from .ci/atime/tests.R
pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path) {
pkg_find_replace <- function(glob, FIND, REPLACE) {
atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
}
Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
new.Package_ <- paste0(Package_, "_", sha)
pkg_find_replace(
"DESCRIPTION",
paste0("Package:\\s+", old.Package),
paste("Package:", new.Package))
pkg_find_replace(
file.path("src", "Makevars.*in"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
sprintf('packageVersion\\("%s"\\)', old.Package),
sprintf('packageVersion\\("%s"\\)', new.Package))
pkg_find_replace(
file.path("src", "init.c"),
paste0("R_init_", Package_regex),
paste0("R_init_", gsub("[.]", "_", new.Package_)))
pkg_find_replace(
"NAMESPACE",
sprintf('useDynLib\\("?%s"?', Package_regex),
paste0('useDynLib(', new.Package_))
}
versions <- c(
master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
knuth_hash = 'd7a9a1707ec94ec4f2bd86a5dfb5609207029ba4',
lookup_insert = '337a0c2d508a31c59885416d7929ff6d6a4b0bda',
cuckoo_hash = '09b3725acce257bbc6ef2cb55c36220528bc42e0'
)
sample_strings = function(N=10, len=4) {
do.call(paste0, replicate(len, sample(LETTERS, N, TRUE), FALSE))
}
N <- 10^seq(2, 7.5, .25)
tab_full = sample_strings(1e6, 10)
tab_small = sample(tab_full, 9e5)
chmatch_work1 <- lapply(setNames(nm = N), \(N)
sample(tab_full, N, TRUE)
)
chmatch1 <- atime_versions(
pkg.path, N,
expr = data.table::chmatch(chmatch_work1[[as.character(N)]], tab_small),
seconds.limit = limit, verbose = TRUE, sha.vec = versions,
pkg.edit.fun = pkg.edit.fun
)
plot(chmatch1)
# expected case: a few distinct strings
chmatch_work2 <- lapply(setNames(nm = N), \(N)
sample(letters, N, TRUE)
)
chmatch2 <- atime_versions(
pkg.path, N,
expr = data.table::chmatch(chmatch_work2[[as.character(N)]], letters),
seconds.limit = limit, verbose = TRUE, sha.vec = versions,
pkg.edit.fun = pkg.edit.fun
)
plot(chmatch2)
```
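For reference, a minimal sketch of cuckoo-hash lookup, the scheme the cuckoo_hash commit refers to (names, constants and layout are illustrative, not the actual code): every key can only live in one of two slots, one per hash function, so a lookup is at most two probes regardless of load.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { const void *key; int value; } slot;

typedef struct {
  slot *t1, *t2; // two tables (or two halves of one buffer)
  size_t mask;   // table size - 1, with size a power of two
} cuckoo;

static inline size_t h1(const void *k) { return ((uintptr_t)k >> 4) * 2654435761u; }
static inline size_t h2(const void *k) { return ((uintptr_t)k >> 4) * 0x85ebca6bu; }

// At most two probes: the key is either in its t1 slot, its t2 slot, or absent.
static int cuckoo_lookup(const cuckoo *h, const void *key, int notfound) {
  const slot *a = &h->t1[h1(key) & h->mask];
  if (a->key == key) return a->value;
  const slot *b = &h->t2[h2(key) & h->mask];
  if (b->key == key) return b->value;
  return notfound;
}
```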

---

The hash can only be enlarged from (1) a single-thread context, or (2) under a critical section, so there is no need to worry about other threads getting a use-after-free due to a reallocation. This should halve the memory use by the hash table.














With apologies to Matt Dowle, who had poured a lot of effort into making `data.table` go fast.

Ongoing work towards #6180. Unfortunately, it doesn't completely remove any uses of non-API entry points by itself. Detailed motivation here in a pending blog post. Can't start implementing stretchy ALTREP vectors until `data.table` stops using `TRUELENGTH` to mark them.

Currently implemented:

- `TRUELENGTH` to mark `CHARSXP`s or columns replaced with a hash

Needs more work:

- `rbindlist()` and `forder()` pre-allocate memory for the worst-case usage
- `forder.c`, the last remaining user of `savetl`
- `SET_TRUELENGTH` is atomic, `hash_set` is not; will need additional care in a multi-threaded environment (see the sketch below)
- `savetl` machinery in `assign.c`

Let's just see how much worse the performance is going to get.
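To illustrate the multi-threading caveat above (hypothetical names, not data.table's code): a one-word TRUELENGTH-style mark can be written with a single atomic store, while a hash slot is a multi-word update with no single-instruction equivalent, so it needs a critical section or a more elaborate lock-free protocol.

```c
#include <omp.h>

extern int *marks_word; // stand-in for one-word TRUELENGTH-style marks

void set_mark_atomic(int i, int v) {
  #pragma omp atomic write
  marks_word[i] = v; // fine: a single scalar store
}

struct pair { void *key; int value; };
extern struct pair *slots;

void set_pair(int i, void *k, int v) {
  // NOT atomic as a unit: another thread may observe the new key with the old
  // value (or vice versa), which is why hash_set needs additional care.
  slots[i].key = k;
  slots[i].value = v;
}
```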