A "framework" (Makefile pattern rules + conventions) for fully reproducible academic papers.
Even though no figures or computational results are committed to the repository,
anyone can generate an exact copy of your published .pdf with the following steps†:
git clone paper_repo
cd paper_repo
conda env create -f environment.yml
conda activate paper_env
make
Capturing the dependencies between LaTeX, figures, data, and code in a Makefile and storing computational results on disk ensures that the paper always reflects changes in the data and code without requiring a full rebuild for every change.
A few other convenient features are included:
- Generate abridged and extended versions from the same .tex files using environment variables and conditional compilation.
- Strip comments and generate a .zip file for arXiv upload.
† We assume a Unix-like system with Anaconda and LaTeX installed.
To use this repository as a starting point for your own project, click "Use
this template" above. Or, you can git clone it, then rm -rf .git and git init to start your own repo with a clean commit history.
This is a low-tech solution using directory layout conventions and Makefile pattern rules. The hand-written portion of your project is laid out like:
input/ : Data sets from "the world" -- not your own computational results.
src/ : Source code for computations, figures, and generated LaTeX.
tex/ : Hand-written LaTeX code for the paper.
The computed results are laid out like:
build/ : Final .pdf file and .zip file for arXiv.
data/ : Computational results, optionally using /input/.
figures/ : Plots, produced from files in /data/.
tex/ : LaTeX generated from /data/ is added alongside hand-written files.
The Makefile pattern rules for figures implement the following dependency structure,
where .data is your chosen data file format
and .img is your chosen image file format:
    src/x_fig.py                            --,
    src/x_data.py  ---,                       |
                      |                    ,--+--> figures/x.img
    input/x.data   --ANY--> data/x.data  --+
                      |                    '--+--> tex/x_gen.tex
    <nothing>      ---'                       |
    src/x_gentex.py                         --'
Figure- and LaTeX-generating scripts always look for the
corresponding x.data file. We support three cases:
- Program-generated data: If the script src/x_data.py exists, make will use it to generate data/x.data.
- Data from the outside world: If input/x.data exists, make will copy it to data/x.data.
- No data needed to make figure/LaTeX: If neither of the above conditions is satisfied, make will run the figure/LaTeX-generating script anyway.
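The three cases can be modeled in a few lines of Python. This is only an illustrative sketch of the decision, not the Makefile's actual implementation. The function name `data_source`, the `root` and `dataext` parameters, and the choice to check for a generating script before an input file are all assumptions made for the example.

```python
from pathlib import Path


def data_source(name, root=".", dataext="data"):
    """Classify how data/<name>.<dataext> would be obtained.

    Returns "generate", "copy", or "none". The order of the checks is an
    assumption of this sketch; the README does not state a precedence.
    """
    root = Path(root)
    if (root / "src" / f"{name}_data.py").exists():
        # Program-generated data: src/<name>_data.py produces the data file.
        return "generate"
    if (root / "input" / f"{name}.{dataext}").exists():
        # Data from the outside world: input/<name>.<dataext> is copied to data/.
        return "copy"
    # No data needed: the figure/LaTeX script is run without a data file.
    return "none"
```

For example, `data_source("x")` called from the repository root reports which of the three cases would apply to a figure named `x`.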
Figure-generating scripts should take the input and output paths as command-line arguments:
python src/x_fig.py data/x.data figures/x.img
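A figure-generating script that follows this convention might look like the sketch below. It assumes, purely for illustration, that the data file is JSON with `x` and `y` arrays and that matplotlib is available in the conda environment; the helper names `load_series`, `make_figure`, and `main` are hypothetical.

```python
import json


def load_series(data_path):
    """Read the x and y series from a JSON data file (assumed layout)."""
    with open(data_path) as f:
        table = json.load(f)
    return table["x"], table["y"]


def make_figure(data_path, fig_path):
    """Plot the series stored in data_path and save the figure to fig_path."""
    # Imported here so the data-loading helper works without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend, suitable for running under make
    import matplotlib.pyplot as plt

    x, y = load_series(data_path)
    fig, ax = plt.subplots()
    ax.plot(x, y)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    fig.savefig(fig_path)  # output format is inferred from the file extension
    plt.close(fig)


def main(argv):
    """argv holds the input path followed by the output path."""
    if len(argv) != 2:
        raise SystemExit("usage: x_fig.py INPUT OUTPUT")
    make_figure(argv[0], argv[1])


# To use this as src/x_fig.py, end the file with:
#     if __name__ == "__main__":
#         import sys
#         main(sys.argv[1:])
```

Keeping all paths in the arguments, rather than hard-coding them, lets the same script be driven by a pattern rule.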
LaTeX-generating scripts should take the input path as the command-line argument and print to stdout:
python src/x_gentex.py data/x.data > tex/x_gen.tex
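A LaTeX-generating script in this style could be as small as the following sketch. The JSON layout (a list of `[label, value]` pairs) and the names `render_tabular` and `main` are assumptions for the example, not part of the template.

```python
import json


def render_tabular(rows):
    """Format (label, value) pairs as a two-column LaTeX tabular."""
    lines = [r"\begin{tabular}{lr}", r"\hline"]
    for label, value in rows:
        # Each row ends with the LaTeX row terminator, a double backslash.
        lines.append(rf"{label} & {value:.3f} \\")
    lines += [r"\hline", r"\end{tabular}"]
    return "\n".join(lines)


def main(argv):
    """argv holds the input path; the LaTeX is printed to stdout."""
    if len(argv) != 1:
        raise SystemExit("usage: x_gentex.py INPUT > OUTPUT")
    with open(argv[0]) as f:
        rows = json.load(f)  # assumed layout: [["label", value], ...]
    print(render_tabular(rows))


# To use this as src/x_gentex.py, end the file with:
#     if __name__ == "__main__":
#         import sys
#         main(sys.argv[1:])
```

Because the script only prints, the shell redirection decides where the generated file lands.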
It may also be useful to generate data files from other data files, for example:
- Saving intermediate results in very slow computations.
- Building several plots/tables from one experiment.
- Parameterizing a data generation process, with parameters stored as input data.
- etc...
With data-to-data scripts there is no formula to derive the input file name from
the output file name. The user must write the rule manually in the Makefile.
This is demonstrated by the example sine_taylor_data.py.
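As a rough illustration of a data-to-data step (not a description of what sine_taylor_data.py does), the sketch below reduces a file of raw samples to a small summary file. The script name `summary_data.py`, the JSON layouts, and the input-then-output argument order are all hypothetical. Since you write the rule yourself, the calling convention is up to you.

```python
import json
import statistics


def summarize(samples):
    """Reduce raw samples to the few numbers that later plots or tables need."""
    return {
        "n": len(samples),
        "mean": statistics.fmean(samples),
        # The sample standard deviation needs at least two points.
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
    }


def main(argv):
    """argv holds the upstream data path followed by the output data path."""
    if len(argv) != 2:
        raise SystemExit("usage: summary_data.py INPUT OUTPUT")
    with open(argv[0]) as f:
        samples = json.load(f)  # assumed layout: a flat list of numbers
    with open(argv[1], "w") as f:
        json.dump(summarize(samples), f, indent=2)
```

The hand-written Makefile rule for such a script would typically list both the script and the upstream data file as prerequisites, so that a change to either triggers a rebuild of the derived file.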
make will apply the figure- and LaTeX-generating rules in an "opt-in" way
based on the lists figs and texs in the Makefile. You must edit these
lists whenever you add a new figure or generated LaTeX file.
The figure and data formats are controlled by the variables FIGEXT and
DATAEXT in the Makefile.
All files in data/ are marked as .PRECIOUS in the Makefile, so they will
not be deleted even though they are
intermediate files.
The included tex/preamble.tex implements conditional compilation based on the
environment variable ABRIDGED. See the example tex/reproducible.tex for usage.
The default make target is the unabridged/extended version.
To build the abridged version, run make abridged_build/reproducible.pdf.
It will set the environment variable ABRIDGED and store the result separately.
To package the source code for upload to arXiv, run make build/arxiv.zip.
The unabridged version is zipped. The LaTeX source files are passed through
arxiv-latex-cleaner
to strip comments. The zipped package includes only the .bbl file generated
by BibTeX instead of the .bib files, so only the references you used in the
paper are included.
Overall, there are six kinds of recipes this Makefile will run. We provide examples of each:
| Input | Output | Description | Example |
|---|---|---|---|
| --- | Data | Data generation | sine_derivs_data.py |
| --- | Figure | Figure generation | quadratic_roots_fig.py |
| --- | TeX | TeX generation | quadratic_gentex.py |
| Data | Data | Processing | sine_taylor_data.py |
| Data | Figure | Visualization | sine_taylor_fig.py |
| Data | TeX | Tables, etc. | sine_taylor_gentex.py |
- If your project contains a lot of source code, it may be better to create a separate library-like repository and include it as a git submodule in src/.
- To help decouple the computation and plotting stages, we suggest storing all the data you think you might need in the "tidy data" layout. Plotting tools designed to consume "tidy" data make it easy to select and combine data to generate many different kinds of plots.
- For deterministic results, remember to seed random number generators with a constant.
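The last two tips can be combined in one small sketch: a data-generating function that seeds its own random number generator with a constant and returns its results as tidy records (one row per observation, one column per variable). The experiment itself is a placeholder, and the names `SEED`, `run_experiment`, and `to_csv`, as well as the column names, are invented for the example.

```python
import csv
import io
import random

SEED = 0  # constant seed, so every rebuild produces identical data


def run_experiment(methods, sizes, trials, seed=SEED):
    """Return tidy records: one dict per observation."""
    rng = random.Random(seed)  # private generator, unaffected by other code
    rows = []
    for method in methods:
        for n in sizes:
            for trial in range(trials):
                # Placeholder measurement; a real computation would go here.
                error = rng.random() / n
                rows.append({"method": method, "n": n, "trial": trial, "error": error})
    return rows


def to_csv(rows):
    """Serialize tidy records as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["method", "n", "trial", "error"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

With this layout, a plotting script can pick out any slice it needs, for example `[r for r in rows if r["method"] == "a"]`, without the data-generating script having to anticipate each plot.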