Simulation Code for Cooperative Penalized Regression (CooPeR) Project

Code accompanying the paper "High-Dimensional Variable Selection With Competing Events Using Cooperative Penalized Regression".

Simulation code for cooper in competing risk settings, based on the original fwelnet implementation (preprint).

Structure

R/: Helper functions for simulation and plotting.
- Loaded automatically via .Rprofile.
n-<name>/: Code for simulation experiments and application example.
- 1-proof-of-concept: Supports section 3.1. Proof of Concept.
- 2-variable-selection-sim: Supports section 3.2. High-Dimensional Data.
- 3-example-bladder: Supports section 4. Application Example.
data-raw/: Preprocessing script and intermediate / raw files for the bladder cancer example application.
data/: Processed data in different forms depending on use-case.
run-all.R: Wraps relevant scripts needed to reproduce results, with additional instructions for simulations. See below.
registries/: Holds batchtools registries as created by n-run-batchtools.R scripts.
renv/, renv.lock, .renvignore ensure fixed R package versions via renv.

Note that the project uses renv to ensure consistent R dependencies.
On first load, renv will prompt you to run renv::restore() to locally install the dependencies specified in renv.lock.

Reproducing results

Run renv::restore() to install all dependencies. You may also want to install the appropriate R version (see e.g. rig).
Refer to run-all.R to reproduce figures and tables from the simulation results and the data example.

If renv is not used to restore packages, it is necessary to install the cooper package from GitHub:

# install.packages("remotes")
remotes::install_github("jemus42/cooper")

Then, run the run-all.R script to reproduce the results.
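If renv is not used, it can help to check which packages are missing before running anything. The following is a minimal sketch; missing_packages() is a hypothetical helper that is not part of this repository, and it only reports missing packages rather than installing them.

```r
# Hypothetical helper (not part of this repository):
# return the subset of `pkgs` that cannot be loaded
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Report whether cooper still needs to be installed from GitHub
if ("cooper" %in% missing_packages(c("remotes", "cooper"))) {
  message('cooper is not installed; run remotes::install_github("jemus42/cooper")')
}
```

The check is deliberately read-only so it can be run offline; the actual installation step stays as shown above.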

Section 2: Proof of Concept

To re-run the simulations, the 1-run-batchtools.R script needs to be adapted to the local environment, as it expects an HPC setting (lines 72f). See the instructions below for running it locally.
The number of replications can be changed using the repls variable in line 15: repls = 1000.
Results are stored in results/ as .rds files 1-results-poc.rds and 1-results-long-poc.rds.
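To inspect the stored results directly, the .rds files can be read with readRDS(). The sketch below assumes the working directory is the project root; read_result() is a hypothetical convenience wrapper, not a function from R/, and the structure of the stored objects is not assumed here.

```r
# Hypothetical wrapper: read a result file from results/ if it exists
read_result <- function(file, dir = "results") {
  path <- file.path(dir, file)
  if (!file.exists(path)) {
    message("Not found (has the simulation been run?): ", path)
    return(invisible(NULL))
  }
  readRDS(path)
}

res_poc <- read_result("1-results-poc.rds")
res_poc_long <- read_result("1-results-long-poc.rds")

# Show only the top-level structure, if the file was found
if (!is.null(res_poc)) str(res_poc, max.level = 1)
```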
Figure 1 of the manuscript is then produced by 1-figure-1-proof-of-concept-bias.R.
Figures 2 through 4 are produced by 2-figure-2-3-4-high-dim-sim-variable-selection.R.
Figure 5 is produced by 2-figure-5-high-dim-sim-performance.R.
Tables 1 through 4 are produced by 2-tables-1-2-3-high-dim-sim-variable-selection.R.

Section 3: High-Dimensional Data

The simulations are run by a batchtools script analogous to 1-run-batchtools.R.
Lines 91f of that script need to be adapted to the local environment. See the instructions below for running it locally.
The number of replications can be changed using the repls variable in line 15: repls = 1000.
Results are stored in results/ as .rds files:
- 2-results-varsel-csc-varsel.rds for the variable selection performance scores.
- 2-results-varsel-csc-perf.rds for the prediction performance scores.

Section 4: Bladder Cancer Data

The data is described by Dyrskjøt et al. (2007).
Visit https://aacrjournals.org/clincancerres/article/13/12/3545/13137/Gene-Expression-Signatures-Predict-Outcome-in-Non
- Download supplemental tables 1 and 2 to data-raw/.
- The script data-raw/preprocess-bladder-data.R downloads the remaining data automatically, provided an internet connection is available.

The expected folder structure is then

data-raw
├── 10780432ccr062940-sup-supplemental_file_1.xls
├── 10780432ccr062940-sup-supplemental_file_2.xls
├── GSE5479_Final_processed_data_1.txt
├── GSE5479_Final_processed_data_2.txt
└── preprocess-bladder-data.R
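Before running the bladder example, the folder structure above can be verified from R. This is a minimal sketch assuming the working directory is the project root; missing_raw_files() is a hypothetical helper introduced only for illustration.

```r
# File names taken from the expected folder structure above
expected_files <- c(
  "10780432ccr062940-sup-supplemental_file_1.xls",
  "10780432ccr062940-sup-supplemental_file_2.xls",
  "GSE5479_Final_processed_data_1.txt",
  "GSE5479_Final_processed_data_2.txt",
  "preprocess-bladder-data.R"
)

# Hypothetical helper: return the expected files not present in `dir`
missing_raw_files <- function(dir = "data-raw", files = expected_files) {
  files[!file.exists(file.path(dir, files))]
}

missing <- missing_raw_files()
if (length(missing) > 0) {
  message("Missing in data-raw/: ", paste(missing, collapse = ", "))
}
```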

The remaining scripts in 3-example-bladder/ take some time to fit the models and write results prefixed with 3- to results/.
Figure 6 is produced by 3-figure-6-bladder-performance.R.

Running simulations locally

The simulations from sections 2 and 3 are designed to be run on a high-performance computing (HPC) cluster. To run them locally, create a file ./batchtools.conf.R with contents

cluster.functions <- makeClusterFunctionsSSH(list(Worker$new("localhost", ncpus = 4, max.load = 10)), fs.latency = 0)

Here, ncpus determines the number of jobs to run simultaneously, and max.load sets the maximum load average allowed on the local machine. For example, on a machine with 8 logical cores, ncpus = 4 and max.load = 7 would run 4 jobs in parallel as long as the load average stays below 7 (8 would be the maximum).
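As a rough way to pick these values for a given machine, the sketch below derives them from the number of logical cores. The rule used here (half the cores for ncpus, one less than the core count for max.load) is only an assumption chosen to reproduce the 8-core example above, and suggest_worker_settings() is a hypothetical helper, not part of batchtools or this repository.

```r
# Hypothetical helper: derive ncpus and max.load from the logical core count.
# The halving rule is an assumption, not a recommendation from batchtools.
suggest_worker_settings <- function(n_cores = parallel::detectCores(logical = TRUE)) {
  # detectCores() may return NA on some platforms; fall back to one core
  if (is.na(n_cores) || n_cores < 1) n_cores <- 1L
  list(
    ncpus = max(1L, as.integer(n_cores %/% 2)),
    max.load = max(1, n_cores - 1)
  )
}

# Settings for a machine with 8 logical cores
settings <- suggest_worker_settings(8)
str(settings)
```

The resulting values can then be entered by hand into ./batchtools.conf.R.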

With this file in place, running the *-run-batchtools.R scripts will run all simulation jobs locally in the background with the specified number of cores.

To only run a subset of jobs, e.g. 5 randomly chosen iterations for each simulation setting and algorithm, replace the submitJobs() call with the following:

sample_ids = jobtbl[, .SD[sample(nrow(.SD), 5)], by = c("problem", "algorithm")]

message("Submitting subset of jobs only!")
submitJobs(sample_ids)

which may take some time but will create at least some results. The remainder of the scripts can then be run as usual and will create (partial) intermediate results.
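To see what the per-group sampling expression does, it can be tried on a toy table. The sketch below uses a made-up jobtbl with hypothetical problem and algorithm labels; the real job table is created by the batchtools scripts. It also caps the sample size with min(), which is a small deviation from the snippet above that avoids an error when a group has fewer than 5 jobs.

```r
library(data.table)

# Toy stand-in for the job table (labels are made up for illustration)
set.seed(1)
jobtbl <- data.table(
  job.id = 1:40,
  problem = rep(c("sim_a", "sim_b"), each = 20),
  algorithm = rep(rep(c("algo_x", "algo_y"), each = 10), times = 2)
)

# Draw up to 5 random rows per problem/algorithm combination;
# min() guards against groups with fewer than 5 jobs
sample_ids <- jobtbl[, .SD[sample(.N, min(.N, 5))], by = c("problem", "algorithm")]

# Number of sampled jobs per combination
sample_ids[, .N, by = c("problem", "algorithm")]
```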

Session Info

In addition to renv.lock, sessionInfo() also presents operating system and locale information.

R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Europe/Berlin
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

loaded via a namespace (and not attached):
[1] compiler_4.4.1 tools_4.4.1    renv_1.0.7
