Code & Data
Several categories of code are used in this work. In the sections below, we describe each category and provide links to its contents where necessary.
Data Accessibility
In the sections below, we link every script with its required data files. This excludes data needed to run the `tpm` module as well as the preprocessed `.mat` files used by `compile_data.py`. The raw image files are very large (~10 TB), so they are preserved on cold storage and are available upon request. The results from the image files (~10 GB) that are scraped by the `compile_data.py` script are stored on the CaltechDATA research data repository and are accessible under the DOI 10.22002/D1.1288.
The `tpm` Matlab Module
This module, available from the associated GitHub repository, is used during the acquisition of, and direct measurement from, raw image data of tethered beads. It is implemented as previously described in Lovely et al., PNAS 112(14), 2015 and Johnson et al., Nucleic Acids Research 40(16), 2012.
The `vdj` Python Module
This module, written explicitly for this work, is composed of a variety of Python functions useful for the processing, analysis, and presentation of data. This module is also available from the associated GitHub repository.
Python Processing Script
We used a single Python script, `compile_data.py`, which extracts measurements from a series of `.mat` files produced by the `tpm` module.
Python Analysis Scripts
These are Python files which can be run independently from the command line and perform data analysis processes, such as parameter inference and bootstrapping.
- `looping_frequency_bootstrap.py`: This script performs the bootstrapping analysis of the looping frequencies. Necessary data sets:
- `pooled_cutting_probability.py`: This script calculates the summary statistics for the cleavage probability of each sequence and numerically evaluates the log posterior over a range of probabilities. Necessary data sets:
- `leaving_rate_inference.py`: This script performs the parameter inference of the leaving rates, assuming exponentially distributed paired complex dwell times. Necessary materials:
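The first two analyses above can be illustrated with a minimal, standard-library sketch. It is not the code in `looping_frequency_bootstrap.py` or `pooled_cutting_probability.py`. It assumes that the looping frequency is the total number of loops divided by the number of beads, that bootstrap replicates are drawn by resampling beads with replacement, and that the cleavage probability has a binomial likelihood with a uniform prior. All function names and the example counts are hypothetical.

```python
import math
import random


def bootstrap_looping_frequency(loops_per_bead, n_boot=1000, seed=0):
    """Resample beads with replacement and return the sorted replicate
    looping frequencies (assumed here to be total loops / total beads)."""
    rng = random.Random(seed)
    n_beads = len(loops_per_bead)
    replicates = []
    for _ in range(n_boot):
        resample = rng.choices(loops_per_bead, k=n_beads)
        replicates.append(sum(resample) / n_beads)
    return sorted(replicates)


def percentile_interval(sorted_values, mass=0.95):
    """Central percentile interval taken from sorted bootstrap replicates."""
    n = len(sorted_values)
    lower = sorted_values[math.floor((1 - mass) / 2 * (n - 1))]
    upper = sorted_values[math.ceil((1 + mass) / 2 * (n - 1))]
    return lower, upper


def log_posterior_cleavage(p, n_loops, n_cuts):
    """Normalized log posterior of the cleavage probability p.

    Assumes a binomial likelihood and a uniform prior, which gives a
    Beta(n_cuts + 1, n_loops - n_cuts + 1) posterior.
    """
    if not 0.0 < p < 1.0:
        return -math.inf
    log_norm = (math.lgamma(n_loops + 2)
                - math.lgamma(n_cuts + 1)
                - math.lgamma(n_loops - n_cuts + 1))
    return (log_norm
            + n_cuts * math.log(p)
            + (n_loops - n_cuts) * math.log(1.0 - p))


# Hypothetical example: loop counts observed for eight beads.
loops_per_bead = [0, 1, 2, 0, 1, 0, 0, 3]
replicates = bootstrap_looping_frequency(loops_per_bead)
interval = percentile_interval(replicates)

# Hypothetical example: 3 cleavage events out of 10 observed loops,
# evaluated on a grid of probabilities.
grid = [i / 100 for i in range(1, 100)]
log_post = [log_posterior_cleavage(p, n_loops=10, n_cuts=3) for p in grid]
```

Evaluating the posterior on a grid keeps the calculation transparent and is cheap for a single parameter; the leaving-rate inference is not sketched here because the source does not state which sampler or prior it uses.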
Python Figure Scripts
These are Python files which can be run independently from the command line and produce all data-based figures in the work, including the interactive figures.
- Interactive Endogenous Sequence Explorer| Generates the interactive figure for comparing endogenous RSSs. Necessary Data Sets:
- Summary of looping frequency bootstrap analysis
- Compiled paired complex dwell times.
- Summary of cleavage probabilities.
- Posterior distributions for cleavage probabilities.
- Interactive Synthetic Sequence Explorer| Generates the interactive figure for examining the effects of single point mutations in the V4-57-1 sequence. Necessary Data Sets:
- Summary of looping frequency bootstrap analysis
- Compiled paired complex dwell times.
- Summary of cleavage probabilities.
- Posterior distributions for cleavage probabilities.
- Interactive Comparison of Endogenous and Synthetic RSSs| Generates the interactive figure for examining how point mutations in the V4-57-1 sequence compare to other endogenous sequences. Necessary Data Sets:
- Summary of looping frequency bootstrap analysis
- Compiled paired complex dwell times.
- Summary of cleavage probabilities.
- Posterior distributions for cleavage probabilities.
- `JavaScript` file used for interaction
- Fig. 3: TPM data for point mutations introduced at various positions of the reference RSS.| Generates the stickplots and posterior distribution ridgeline plots for the point mutants. Necessary Data Sets:
- Summary of looping frequency bootstrap analysis
- Compiled paired complex dwell times.
- Summary of cleavage probabilities.
- Posterior distributions for cleavage probabilities.
- Fig. 4: Observed dynamics between RAG and endogenous RSS sequences.| Generates the stickplots of looping frequency, median dwell time, and cleavage probabilities for the endogenous RSSs. Necessary Data Sets:
- Summary of looping frequency bootstrap analysis
- Compiled paired complex dwell times.
- Summary of cleavage probabilities.
- Fig. 5: Non-exponential waiting time distributions for synthetic and endogenous RSSs.| Generates a figure showing the observed dwell time distributions overlaid with the credible region of a fit to a single exponential. Necessary Data Sets:
- Compiled paired complex dwell times.
- Parameter summaries from exponential fitting.
- Posterior samples from exponential fitting.
- Fig. 6: Empirical cumulative distributions of paired complex lifetimes with different divalent cations.| Used to compare the dwell time distributions of three RSSs with either calcium or magnesium salts. Necessary Data Sets:
- Fig. S2: Representative bootstrap looping frequency distribution.| Performs bootstrapping and calculation of confidence intervals for the looping frequency of the reference RSS. Necessary Data Sets:
- Fig. S3: Posterior distributions for endogenous RSS cleavage probabilities| Generates a ridgeline plot of the posterior distributions of PC cleavage probability for the endogenous RSSs. Necessary Data Sets:
- Fig. S4 - S5: Comparisons of looping frequencies for special cases| Generates two figures comparing the looping frequencies for differing coding flank sequences as well as for a critical SpaceC1A mutation. Necessary Data Sets:
- Fig. S6: Comparison of looping frequencies for different divalent cations.| Generates a figure comparing the looping frequencies of three RSSs in both calcium and magnesium conditioned buffers. Necessary Data Sets:
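
The dwell time comparisons made by the Fig. 5 and Fig. 6 scripts rest on two simple quantities: an empirical cumulative distribution and the cumulative distribution of a single exponential. The sketch below is an illustration only, not the figure code. It uses a maximum-likelihood point estimate of the rate as a stand-in for the credible region described above, and the function names and dwell times are hypothetical.

```python
import math


def ecdf(samples):
    """Return the sorted samples and the ECDF height at each one
    (the i-th smallest of n values is assigned i / n)."""
    x = sorted(samples)
    n = len(x)
    y = [(i + 1) / n for i in range(n)]
    return x, y


def exponential_rate_mle(dwell_times):
    """Maximum-likelihood rate of an exponential distribution: 1 / mean."""
    return len(dwell_times) / sum(dwell_times)


def exponential_cdf(t, rate):
    """Cumulative distribution of an exponential with the given rate."""
    return 1.0 - math.exp(-rate * t)


# Hypothetical dwell times, in arbitrary time units.
dwell_times = [3.0, 1.0, 2.0]
x, y = ecdf(dwell_times)
rate = exponential_rate_mle(dwell_times)
model = [exponential_cdf(t, rate) for t in x]
```

Plotting `y` and `model` against `x` shows where the observed distribution departs from single-exponential behavior.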