# Start replacing TRUELENGTH markers with a hash (#6694)
## Conversation
**Codecov Report** ❌ Patch coverage is …

```diff
@@            Coverage Diff             @@
##           master    #6694      +/-   ##
==========================================
- Coverage   99.10%   99.09%   -0.02%
==========================================
  Files          84       85       +1
  Lines       16126    16166      +40
==========================================
+ Hits        15981    16019      +38
- Misses        145      147       +2
```

---

Generated via commit 56dcbf9. Download link for the artifact containing the test results: atime-results.zip

---

Also avoid crashing when creating a 0-size hash.
This will likely require a dynamically growing hash of TRUELENGTHs instead of the current pre-allocation approach with a very conservative over-estimate.
The hash needs O(n) memory (more precisely, 2*n/load_factor entries), which isn't great.
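For concreteness, here is a minimal sketch of the sizing rule implied above, including a guard for the 0-size case from the commit message; the helper name and the 0.5 load factor are illustrative assumptions, not this PR's actual constants:

```c
#include <stddef.h>

// Hypothetical sizing helper: for n expected keys at load factor 0.5,
// allocate at least 2*n slots, rounded up to a power of two for cheap index
// wrapping, and never return 0 so that an empty input cannot produce a
// zero-slot table for the probe loop to spin on.
static size_t hash_size_for(size_t n) {
  size_t size = 2;
  while (size < 2 * n) size *= 2;
  return size;
}
```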

---

hi, thanks for this. Can you please propose one or two performance test cases that you think may be adversely affected by these changes? Is it when we create a table with one column, and then use …

---

The … The … I'll try giving many …

---

Perhaps you can also try a fast 3rd-party hash map: https://martin.ankerl.com/2019/04/01/hashmap-benchmarks-01-overview/ In particular, Google's Abseil hash is pretty fast: https://abseil.io/docs/cpp/guides/container https://abseil.io/docs/cpp/guides/hash

---

It's pretty bad. For typical cases, the current hash table eats >1 order of magnitude more R* memory, and it's similarly slower in … The hash table is only on par by time in the worst case for …

```r
# may need 16G of RAM to run comfortably due to pathological memory allocation patterns
library(atime)
pkg.path <- '.'
limit <- 1
# taken from .ci/atime/tests.R
pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path) {
pkg_find_replace <- function(glob, FIND, REPLACE) {
atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
}
Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
new.Package_ <- paste0(Package_, "_", sha)
pkg_find_replace(
"DESCRIPTION",
paste0("Package:\\s+", old.Package),
paste("Package:", new.Package))
pkg_find_replace(
file.path("src", "Makevars.*in"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
sprintf('packageVersion\\("%s"\\)', old.Package),
sprintf('packageVersion\\("%s"\\)', new.Package))
pkg_find_replace(
file.path("src", "init.c"),
paste0("R_init_", Package_regex),
paste0("R_init_", gsub("[.]", "_", new.Package_)))
pkg_find_replace(
"NAMESPACE",
sprintf('useDynLib\\("?%s"?', Package_regex),
paste0('useDynLib(', new.Package_))
}
versions <- c(
master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
truehash = '24e81785669e70caac31501bf4424ba14dbc90f9'
)
N <- 10^seq(2, 8.5, .25)
# expected case: a few distinct strings
forderv1_work <- lapply(setNames(nm = N), \(N)
sample(letters, N, TRUE)
)
forderv1 <- atime_versions(
pkg.path, N,
expr = data.table:::forderv(forderv1_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(forderv1_work); gc(full = TRUE)
# worst case: all strings different
# (a challenge for the allocator too due to many small immovable objects)
N <- 10^seq(2, 7.5, .25)
forderv2_work <- lapply(setNames(nm = N), \(N)
format(runif(N), digits = 16)
)
forderv2 <- atime_versions(
pkg.path, N,
expr = data.table:::forderv(forderv2_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(forderv2_work); gc(full = TRUE)
# expected case: all columns named the same
N <- 10^seq(1, 6.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist1_work <- lapply(setNames(nm = N), \(N)
rep(list(setNames(as.list(1:k), letters[1:k])), N)
)
rbindlist1 <- atime_versions(
pkg.path, N,
expr = data.table::rbindlist(rbindlist1_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist1_work); gc(full = TRUE)
# worst case: all columns different
N <- 10^seq(1, 5.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist2_work <- lapply(setNames(nm = N), \(N)
replicate(N, setNames(as.list(1:k), format(runif(k), digits = 16)), FALSE)
)
rbindlist2 <- atime_versions(
pkg.path, N,
expr = data.table::rbindlist(rbindlist2_work[[as.character(N)]], fill = TRUE),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist2_work); gc(full = TRUE)
save(forderv1, forderv2, rbindlist1, rbindlist2, file = 'times.rda')
```

* Edit: Some of the memory use in … I'll try profiling the code. Thanks @SebKrantz for the link; a newer benchmark by the same author is also very instructive.

---

thanks for proposing the performance test cases and sharing the atime benchmark graphs. I agree that we should try to avoid an order-of-magnitude constant-factor increase in time/memory usage.
In forder() and rbindlist(), there is no good upper bound on the number of elements in the hash known ahead of time, so grow the hash table dynamically. Since the R/W locks are far too slow and OpenMP atomics are too limited, rely on strategically placed flushes, which isn't really a solution.
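A minimal sketch of what such a scheme could look like, under the stated constraints (no R/W locks, only flushes); `shared_marks`, `hash_grow`, and `marks_grow_if_needed` are hypothetical names, not this PR's API:

```c
#include <omp.h>

typedef struct hashtab hashtab;
extern hashtab *shared_marks;               // hypothetical shared table pointer
extern hashtab *hash_grow(const hashtab *); // hypothetical: allocate a larger copy

void marks_grow_if_needed(int need_grow) {
  if (!need_grow) return;
  #pragma omp critical(marks_grow)
  {
    hashtab *bigger = hash_grow(shared_marks);
    // The old buffer is deliberately not freed here: readers that already
    // loaded the stale pointer can finish their probes safely.
    shared_marks = bigger;
    #pragma omp flush // publish the new pointer to threads that re-read it
  }
}
```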

---

Since profiling has shown that a noticeable amount of time is wasted initialising the giant pre-allocated hash tables, I was able to make the slowdown factor closer to 2 by dynamically re-allocating the hash table. The memory use is significantly reduced (except for the worst cases), but cannot be measured with …

```r
library(atime)
pkg.path <- '.'
limit <- 1
# taken from .ci/atime/tests.R
pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path) {
pkg_find_replace <- function(glob, FIND, REPLACE) {
atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
}
Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
new.Package_ <- paste0(Package_, "_", sha)
pkg_find_replace(
"DESCRIPTION",
paste0("Package:\\s+", old.Package),
paste("Package:", new.Package))
pkg_find_replace(
file.path("src", "Makevars.*in"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
sprintf('packageVersion\\("%s"\\)', old.Package),
sprintf('packageVersion\\("%s"\\)', new.Package))
pkg_find_replace(
file.path("src", "init.c"),
paste0("R_init_", Package_regex),
paste0("R_init_", gsub("[.]", "_", new.Package_)))
pkg_find_replace(
"NAMESPACE",
sprintf('useDynLib\\("?%s"?', Package_regex),
paste0('useDynLib(', new.Package_))
}
versions <- c(
master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
static_hash = '24e81785669e70caac31501bf4424ba14dbc90f9',
dynamic_hash = 'd7a9a1707ec94ec4f2bd86a5dfb5609207029ba4'
)
N <- 10^seq(2, 7.5, .25)
# expected case: a few distinct strings
forderv1_work <- lapply(setNames(nm = N), \(N)
sample(letters, N, TRUE)
)
forderv1 <- atime_versions(
pkg.path, N,
expr = data.table:::forderv(forderv1_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(forderv1_work); gc(full = TRUE)
# worst case: all strings different
# (a challenge for the allocator too due to many small immovable objects)
N <- 10^seq(2, 7.5, .25)
forderv2_work <- lapply(setNames(nm = N), \(N)
format(runif(N), digits = 16)
)
forderv2 <- atime_versions(
pkg.path, N,
expr = data.table:::forderv(forderv2_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(forderv2_work); gc(full = TRUE)
# expected case: all columns named the same
N <- 10^seq(1, 5.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist1_work <- lapply(setNames(nm = N), \(N)
rep(list(setNames(as.list(1:k), letters[1:k])), N)
)
rbindlist1 <- atime_versions(
pkg.path, N,
expr = data.table::rbindlist(rbindlist1_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist1_work); gc(full = TRUE)
# worst case: all columns different
N <- 10^seq(1, 4.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist2_work <- lapply(setNames(nm = N), \(N)
replicate(N, setNames(as.list(1:k), format(runif(k), digits = 16)), FALSE)
)
rbindlist2 <- atime_versions(
pkg.path, N,
expr = data.table::rbindlist(rbindlist2_work[[as.character(N)]], fill = TRUE),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist2_work); gc(full = TRUE)
#save(forderv1, forderv2, rbindlist1, rbindlist2, file = 'times.rda')
```

The main problem with the current approach is that since the parallel loop in … The current code keeps one previous hash table until the next reallocation cycle and hopes that …

---

In case it helps, {collapse}'s hash functions (https://github.com/SebKrantz/collapse/blob/master/src/kit_dup.c and https://github.com/SebKrantz/collapse/blob/master/src/match.c) are pretty fast as well; inspired by base R, they use a multiplication hash with an unsigned integer prime number. It's bloody fast but requires a large table. But Calloc() is quite efficient. Anyway, it would be great if you'd test the Google hash function; curious to see if it can do much better. PS: you can test …
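For reference, a sketch of that multiplication-hash family (the constant here is the Knuth prime 2654435761, a common choice; whether it is the exact constant used by base R or {collapse} is not claimed): multiply by a large odd constant and keep the top K bits to index a table of 2^K slots.

```c
#include <stdint.h>

// Valid for 0 < K < 32: the multiplication mixes every input bit into the
// high output bits, so the top K bits spread well over a 2^K-slot table.
static inline uint32_t mult_hash(uint32_t key, int K) {
  return (key * 2654435761u) >> (32 - K);
}
```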

---

The abseil hash function is very slightly slower in my tests, although the difference may not be significant. Perhaps that's because my C port fails to inline some of the things that naturally inline in the original C++ with templates. I can try harder, but that's a lot of extra code to bring in properly.

```r
library(atime)
pkg.path <- '.'
limit <- 1
# taken from .ci/atime/tests.R
pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path) {
pkg_find_replace <- function(glob, FIND, REPLACE) {
atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
}
Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
new.Package_ <- paste0(Package_, "_", sha)
pkg_find_replace(
"DESCRIPTION",
paste0("Package:\\s+", old.Package),
paste("Package:", new.Package))
pkg_find_replace(
file.path("src", "Makevars.*in"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
sprintf('packageVersion\\("%s"\\)', old.Package),
sprintf('packageVersion\\("%s"\\)', new.Package))
pkg_find_replace(
file.path("src", "init.c"),
paste0("R_init_", Package_regex),
paste0("R_init_", gsub("[.]", "_", new.Package_)))
pkg_find_replace(
"NAMESPACE",
sprintf('useDynLib\\("?%s"?', Package_regex),
paste0('useDynLib(', new.Package_))
}
versions <- c(
master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
'Knuth_hash' = 'd7a9a1707ec94ec4f2bd86a5dfb5609207029ba4',
'abseil_hash' = '159e1d48926b72af9f212b8c645a8bc8ab6b20be'
)
N <- 10^seq(2, 7.5, .25)
# expected case: a few distinct strings
forderv1_work <- lapply(setNames(nm = N), \(N)
sample(letters, N, TRUE)
)
forderv1 <- atime_versions(
pkg.path, N,
expr = data.table:::forderv(forderv1_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(forderv1_work); gc(full = TRUE)
# worst case: all strings different
# (a challenge for the allocator too due to many small immovable objects)
N <- 10^seq(2, 7.5, .25)
forderv2_work <- lapply(setNames(nm = N), \(N)
format(runif(N), digits = 16)
)
forderv2 <- atime_versions(
pkg.path, N,
expr = data.table:::forderv(forderv2_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(forderv2_work); gc(full = TRUE)
# expected case: all columns named the same
N <- 10^seq(1, 5.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist1_work <- lapply(setNames(nm = N), \(N)
rep(list(setNames(as.list(1:k), letters[1:k])), N)
)
rbindlist1 <- atime_versions(
pkg.path, N,
expr = data.table::rbindlist(rbindlist1_work[[as.character(N)]]),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist1_work); gc(full = TRUE)
# worst case: all columns different
N <- 10^seq(1, 4.5, .25) # number of data.tables in the list
k <- 10 # number of columns per data.table
rbindlist2_work <- lapply(setNames(nm = N), \(N)
replicate(N, setNames(as.list(1:k), format(runif(k), digits = 16)), FALSE)
)
rbindlist2 <- atime_versions(
pkg.path, N,
expr = data.table::rbindlist(rbindlist2_work[[as.character(N)]], fill = TRUE),
sha.vec = versions, seconds.limit = limit, verbose = TRUE,
pkg.edit.fun = pkg.edit.fun
)
rm(rbindlist2_work); gc(full = TRUE)
save(forderv1, forderv2, rbindlist1, rbindlist2, file = 'times.rda')
```
```r
library(atime)
library(data.table)
limit <- 1
# assumes that atime_versions() had pre-installed the packages
# master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
# 'Knuth_hash' = 'd7a9a1707ec94ec4f2bd86a5dfb5609207029ba4',
N <- 10^seq(2, 7.5, .25)
# expected case: a few distinct strings
chmatch_work1 <- lapply(setNames(nm = N), \(N)
sample(letters, N, TRUE)
)
chmatch1 <- atime(
N,
seconds.limit = limit, verbose = TRUE,
master = data.table.70c64ac08c6becae5847cd59ab1efcb4c46437ac::chmatch(chmatch_work1[[as.character(N)]], letters),
Knuth_hash = data.table.d7a9a1707ec94ec4f2bd86a5dfb5609207029ba4::chmatch(chmatch_work1[[as.character(N)]], letters),
collapse = collapse::fmatch(chmatch_work1[[as.character(N)]], letters)
)
rm(chmatch_work1); gc(full = TRUE)
save(chmatch1, file = 'times_collapse.rda')
```

And the real memory cost isn't even that large:

```r
library(atime)
library(data.table)
# assumes that atime_versions() had pre-installed the packages
# master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
# 'Knuth_hash' = 'd7a9a1707ec94ec4f2bd86a5dfb5609207029ba4',
library(data.table.70c64ac08c6becae5847cd59ab1efcb4c46437ac)
library(data.table.d7a9a1707ec94ec4f2bd86a5dfb5609207029ba4)
library(parallel)
# only tested on a recent Linux system
# measures the _maximal_ amount of memory in kB used by the current process
writeLines('
#include <sys/resource.h>
void maxrss(double * kb) {
struct rusage ru;
int ret = getrusage(RUSAGE_SELF, &ru);
*kb = ret ? -1 : ru.ru_maxrss;
}
', 'maxrss.c')
tools::Rcmd('SHLIB maxrss.c')
dyn.load(paste0('maxrss', .Platform$dynlib.ext))
limit <- 1
N <- 10^seq(2, 7.5, .25)
# expected case: a few distinct strings
chmatch_work1 <- lapply(setNames(nm = N), \(N)
sample(letters, N, TRUE)
)
versions <- expression(
master = data.table.70c64ac08c6becae5847cd59ab1efcb4c46437ac::chmatch(chmatch_work1[[as.character(N)]], letters),
Knuth_hash = data.table.d7a9a1707ec94ec4f2bd86a5dfb5609207029ba4::chmatch(chmatch_work1[[as.character(N)]], letters),
collapse = collapse::fmatch(chmatch_work1[[as.character(N)]], letters)
)
plan <- expand.grid(N = N, version = names(versions))
chmatch1 <- lapply(seq_len(nrow(plan)), \(i) {
# use a disposable child process
mccollect(mcparallel({
eval(versions[[plan$version[[i]]]], list(N = plan$N[[i]]))
.C('maxrss', kb = double(1))$kb
}))
})
rm(chmatch_work1); gc(full = TRUE)
save(chmatch1, file = 'times_collapse.rda')
chmatch1p <- lattice::xyplot(
maxrss_kb ~ N, cbind(plan, maxrss_kb = unlist(chmatch1)), group = version,
auto.key = TRUE, scales = list(log = 10),
par.settings=list(superpose.symbol=list(pch=19))
)
```

---

Nice! Thanks. I always wondered about this tradeoff between the size of the table and the quality of the hash function. Looks like speed + a large table still wins. Anyway, if you want to adopt it, feel free to copy it under your MPL license. Just mention me at the top of the file and as a contributor.

---

PS: I believe it also depends on the size of the …

---

excellent work thank you very much |
Use only 28 bits of the pointer (the lower 32, discarding the lowest 4). Inline the linear search by advancing the pointer instead of repeatedly computing and dividing the hash value. Average improvement of 10%.
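A sketch of the pointer truncation this commit describes, assuming (as the commit implies) that the hashed pointers are at least 16-byte aligned, so the lowest 4 bits are always zero:

```c
#include <stdint.h>

// Keep 28 useful bits: take the low 32 bits of the pointer, then drop the
// 4 alignment bits that carry no information.
static inline uint32_t ptr_bits28(const void *p) {
  return ((uint32_t)(uintptr_t)p) >> 4;
}
```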

---

The … There's … Initialising the hash currently takes time and memory proportional to … (see lines 163 to 170 in 9fe1b8d).

A good … I still have a few ideas for how to speed this up.
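If the cost is the up-front zeroing of every slot, one possible mitigation (an assumption about the direction hinted at here, not necessarily what this PR does) is to take zero-filled memory from calloc(), which on common platforms maps zero pages lazily:

```c
#include <stdlib.h>

typedef struct { void *key; int value; } slot; // illustrative slot layout

// calloc() returns zero-initialised memory without touching every page up
// front on typical OSes, so an over-sized table costs real memory only for
// the pages actually probed -- provided key == NULL means "empty slot".
static slot *alloc_table(size_t n_slots) {
  return calloc(n_slots, sizeof(slot));
}
```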
```c
int tl = hash_lookup(marks, s, 0);
if (tl==0) hash_set(marks, s, chmatchdup ? -(++nuniq) : -i-1); // first time seen this string in table
```
hash_set wastes time performing the lookup again. Need to expose a "lookup or create" operation somehow to avoid that.
I guess we can do that analogously to hash_set?
```c
R_xlen_t hash_lookup_or_insert(hashtab *h, SEXP key, R_xlen_t value) {
struct hash_pair *cell = h->tb + hash_index(key, h->multiplier) % h->size, *end = h->tb + h->size - 1;
for (size_t i = 0; i < h->size; ++i, cell = (cell == end ? h->tb : cell + 1)) {
if (cell->key == key) {
return cell->value; // Key exists, don't update, return value
} else if (!cell->key) {
if (!h->free) internal_error(
__func__, "no free slots left (full size=%zu)", h->size
);
--h->free;
*cell = (struct hash_pair){.key = key, .value = value};
return value; // insert here
}
}
internal_error( // # nocov
__func__, "did not find a free slot for key %p; size=%zu, free=%zu",
(void*)key, h->size, h->free
);
// Should be impossible, but just in case:
return value;
}
```
It can be found in 337a0c2. Gives a little speedup but not too much.

---

WDYT about the possibility of proposing that our %chin% just be upstreamed into R itself? Our version has been "well-tested & battle-hardened" for many years, and seems naturally suited to r-devel given its reliance on internal behavior.
```c
SEXP key;
R_xlen_t value;
```
Can halve the memory requirements by storing int indices into an array instead of SEXP pointers and int mark values instead of R_xlen_t (thanks to Sebastian Krantz for the idea). Cost: this makes the hash table incompatible with long vectors. Since int dimensions for matrices and data.frames are baked into almost all of R, it's probably worth using ints here too.
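A sketch of the compacted slot this suggests (hypothetical field names, not the PR's layout); the side array of unique CHARSXP pointers is assumed to live elsewhere:

```c
// 8 bytes per slot instead of 16: key_index is a 1-based index into a
// separate array of the unique string pointers (0 marks an empty slot), and
// value holds the int mark. Both being int rules out long vectors, matching
// R's int-based matrix/data.frame dimensions.
struct hash_pair_compact {
  int key_index;
  int value;
};
```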
```c
if (TRUELENGTH(s)>0) // save any of R's own usage of tl (assumed positive, so we can both count and save in one scan), to restore
  savetl(s); // afterwards. From R 2.14.0, tl is initialized to 0, prior to that it was random so this step saved too much.
// now save unique SEXP in ustr so i) we can loop through them afterwards and reset TRUELENGTH to 0 and ii) sort uniques when sorting too
```

Why is it acceptable to call dhash_lookup when marks can be shared between threads?
So far the only way I know to make this provably safe is to throw away old hash table buffers (to be garbage collected by R).
Fast, thread-safe, dynamically growing, memory-efficient hash table: choose two out of four, maybe three if you're lucky.
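A sketch of that buffer-retirement idea (illustrative names, not the PR's code): instead of free()ing the old table on growth, push it onto a retire list, so a stale pointer briefly held by another thread stays valid; the whole list is released once no reader can remain.

```c
#include <stdlib.h>

struct retired { void *buf; struct retired *next; };
static struct retired *retire_list = NULL;

// Called instead of free() when the table grows.
static void retire(void *old_buf) {
  struct retired *node = malloc(sizeof *node);
  node->buf = old_buf;
  node->next = retire_list;
  retire_list = node;
}

// Called after the parallel region, when no thread can hold a stale pointer.
static void retire_free_all(void) {
  while (retire_list) {
    struct retired *next = retire_list->next;
    free(retire_list->buf);
    free(retire_list);
    retire_list = next;
  }
}
```

(In the PR's setting the buffers would instead be R allocations left to the garbage collector, per the comment above.)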

---

@MichaelChirico, it may be worth a try. I suppose this would mean falling back to … The … The remaining uses of …
It is worth noting that there is an ongoing effort, expected to start in May 2025, to work on a growable vector API in R. Then we may expect easier, or more performant, ways to address the current problems.

---

If …

---

I extended Ivan's benchmark a little further, e.g. when our table gets bigger, and also included NO matches. We definitely have to take a closer look at this with varying table size etc. (might also try other hashing methods like double, Robin Hood or cuckoo hashing; a sketch of cuckoo lookup follows the second script below).
```r
library(atime)
library(data.table)
limit <- 1 # used by seconds.limit below; not defined in the original snippet
# master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
# 'Knuth_hash' = 'd7a9a1707ec94ec4f2bd86a5dfb5609207029ba4',
sample_strings = function(N=10, len=4) {
do.call(paste0, replicate(len, sample(LETTERS, N, TRUE), FALSE))
}
N <- 10^seq(2, 7.5, .25)
tab_full = sample_strings(1e6, 10)
tab_small = sample(tab_full, 9e5)
# expected case: a few distinct strings
chmatch_work1 <- lapply(setNames(nm = N), \(N)
sample(tab_full, N, TRUE)
)
chmatch1 <- atime(
N,
seconds.limit = limit, verbose = TRUE,
master = data.table.70c64ac08c6becae5847cd59ab1efcb4c46437ac::chmatch(chmatch_work1[[as.character(N)]], tab_small),
Knuth_hash = data.table.d7a9a1707ec94ec4f2bd86a5dfb5609207029ba4::chmatch(chmatch_work1[[as.character(N)]], tab_small),
collapse = collapse::fmatch(chmatch_work1[[as.character(N)]], tab_small)
)
plot(chmatch1)
rm(chmatch_work1); gc(full = TRUE)
save(chmatch1, file = 'fmatch_missings.rda')
```

Edit: now with cuckoo hashing with the original setup, and it seems close to master (in terms of speed).

Ivan's original case:
Case with misses:
```r
library(atime)
library(data.table)
pkg.path <- '.'
limit <- 1
# taken from .ci/atime/tests.R
pkg.edit.fun <- function(old.Package, new.Package, sha, new.pkg.path) {
pkg_find_replace <- function(glob, FIND, REPLACE) {
atime::glob_find_replace(file.path(new.pkg.path, glob), FIND, REPLACE)
}
Package_regex <- gsub(".", "_?", old.Package, fixed = TRUE)
Package_ <- gsub(".", "_", old.Package, fixed = TRUE)
new.Package_ <- paste0(Package_, "_", sha)
pkg_find_replace(
"DESCRIPTION",
paste0("Package:\\s+", old.Package),
paste("Package:", new.Package))
pkg_find_replace(
file.path("src", "Makevars.*in"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
Package_regex,
new.Package_)
pkg_find_replace(
file.path("R", "onLoad.R"),
sprintf('packageVersion\\("%s"\\)', old.Package),
sprintf('packageVersion\\("%s"\\)', new.Package))
pkg_find_replace(
file.path("src", "init.c"),
paste0("R_init_", Package_regex),
paste0("R_init_", gsub("[.]", "_", new.Package_)))
pkg_find_replace(
"NAMESPACE",
sprintf('useDynLib\\("?%s"?', Package_regex),
paste0('useDynLib(', new.Package_))
}
versions <- c(
master = '70c64ac08c6becae5847cd59ab1efcb4c46437ac',
knuth_hash = 'd7a9a1707ec94ec4f2bd86a5dfb5609207029ba4',
lookup_insert = '337a0c2d508a31c59885416d7929ff6d6a4b0bda',
cuckoo_hash = '09b3725acce257bbc6ef2cb55c36220528bc42e0'
)
sample_strings = function(N=10, len=4) {
do.call(paste0, replicate(len, sample(LETTERS, N, TRUE), FALSE))
}
N <- 10^seq(2, 7.5, .25)
tab_full = sample_strings(1e6, 10)
tab_small = sample(tab_full, 9e5)
chmatch_work1 <- lapply(setNames(nm = N), \(N)
sample(tab_full, N, TRUE)
)
chmatch1 <- atime_versions(
pkg.path, N,
expr = data.table::chmatch(chmatch_work1[[as.character(N)]], tab_small),
seconds.limit = limit, verbose = TRUE, sha.vec = versions,
pkg.edit.fun = pkg.edit.fun
)
plot(chmatch1)
# expected case: a few distinct strings
chmatch_work2 <- lapply(setNames(nm = N), \(N)
sample(letters, N, TRUE)
)
chmatch2 <- atime_versions(
pkg.path, N,
expr = data.table::chmatch(chmatch_work2[[as.character(N)]], letters),
seconds.limit = limit, verbose = TRUE, sha.vec = versions,
pkg.edit.fun = pkg.edit.fun
)
plot(chmatch2)
```
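For reference, a minimal sketch of cuckoo-hash lookup, the scheme the cuckoo_hash commit refers to (names, constants and layout are illustrative, not the actual code): every key can only live in one of two slots, one per hash function, so a lookup is at most two probes regardless of load.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { const void *key; int value; } slot;

typedef struct {
  slot *t1, *t2; // two tables (or two halves of one buffer)
  size_t mask;   // table size - 1, with size a power of two
} cuckoo;

static inline size_t h1(const void *k) { return ((uintptr_t)k >> 4) * 2654435761u; }
static inline size_t h2(const void *k) { return ((uintptr_t)k >> 4) * 0x85ebca6bu; }

// At most two probes: the key is either in its t1 slot, its t2 slot, or absent.
static int cuckoo_lookup(const cuckoo *h, const void *key, int notfound) {
  const slot *a = &h->t1[h1(key) & h->mask];
  if (a->key == key) return a->value;
  const slot *b = &h->t2[h2(key) & h->mask];
  if (b->key == key) return b->value;
  return notfound;
}
```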

---

The hash can only be enlarged from (1) a single-thread context, or (2) under a critical section, so there is no need to worry about other threads getting a use-after-free due to a reallocation. This should halve the memory use by the hash table.














With apologies to Matt Dowle, who had poured a lot of effort into making `data.table` go fast.

Ongoing work towards #6180. Unfortunately, it doesn't completely remove any uses of non-API entry points by itself. Detailed motivation here in a pending blog post. Can't start implementing stretchy ALTREP vectors until `data.table` stops using `TRUELENGTH` to mark them.

Currently implemented:

- `TRUELENGTH` to mark `CHARSXP`s or columns replaced with a hash

Needs more work:

- `rbindlist()` and `forder()` pre-allocate memory for the worst-case usage
- `forder.c`, the last remaining user of `savetl`
- `SET_TRUELENGTH` is atomic, `hash_set` is not; will need additional care in a multi-threaded environment (see the sketch below)
- `savetl` machinery in `assign.c`

Let's just see how much worse the performance is going to get.
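To illustrate the multi-threading caveat above (hypothetical names, not data.table's code): a one-word TRUELENGTH-style mark can be written with a single atomic store, while a hash slot is a multi-word update with no single-instruction equivalent, so it needs a critical section or a more elaborate lock-free protocol.

```c
#include <omp.h>

extern int *marks_word; // stand-in for one-word TRUELENGTH-style marks

void set_mark_atomic(int i, int v) {
  #pragma omp atomic write
  marks_word[i] = v; // fine: a single scalar store
}

struct pair { void *key; int value; };
extern struct pair *slots;

void set_pair(int i, void *k, int v) {
  // NOT atomic as a unit: another thread may observe the new key with the old
  // value (or vice versa), which is why hash_set needs additional care.
  slots[i].key = k;
  slots[i].value = v;
}
```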