In previous vignettes we have demonstrated the slendr R interface for defining and executing msprime and SLiM population genetic models. We have also provided an overview for its interface to the tree-sequence processing library tskit.
However, although its model-definition interface is quite convenient, slendr cannot (and never will) support every possible msprime or SLiM. model The array of features provided by these simulation frameworks is simply to big and implementing an R interface to every single one of them would not make much sense.
That said, the tree-sequence outputs produced by “pure” (i.e. non-slendr) msprime and SLiM scripts are no different from those generated by slendr models themselves. For users who would rather use their own simulation scripts but find slendr’s tskit R interface (not its simulation interface) appealing, the R package provides a possibility to load, process, visualize and analyze standard tree sequences which were not generated by slendr itself.
In this vignette we give a brief overview of how this works, using example tree sequences produced by two very simple simulation scripts written in pure SLiM and msprime (i.e. scripts which don’t utilize slendr’s spatial features, symbolic names of simulated individuals, and automated translation of time units). We won’t be going into detail explaining how those scripts work, because we assume this functionality will be used by those already familiar with msprime or SLiM. Similarly, we won’t cover the R-tskit functionality of slendr in detail either, simply because the contents of this vignette is already covered by other tutorials provided by slendr.
library(slendr)
library(dplyr)
library(magrittr)
library(ggplot2)
<- 42
SEED set.seed(SEED)
Consider the following SLiM script, which creates a couple of
populations (with different \(N_e\))
splitting from an ancestral population p1
(lets save it to
/tmp/nonspatial.slim
):
initialize() {
setSeed(42);
initializeTreeSeq();
initializeMutationRate(0);
initializeMutationType("m1", 0.5, "f", 0.0);
initializeGenomicElementType("g1", m1, 1.0);
initializeGenomicElement(g1, 0, 1e6);
initializeRecombinationRate(1e-8);
}
1 {
sim.addSubpop("p1", 10);
}
1000 {
sim.addSubpopSplit("p2", 500, p1);
}
3000 {
sim.addSubpopSplit("p3", 2500, p1);
}
5000 {
sim.addSubpopSplit("p4", 10000, p1);
}
6000 late() {
sim.treeSeqOutput("/var/folders/d_/hblb15pd3b94rg0v35920wd80000gn/T//RtmpCCdv89/file15ed810e6379c");
}
After we run this script in SLiM, we can use slendr to load
the output tree sequence (saved to
/tmp/nonspatial-slim.trees
), simplify it, and overlay
mutations on it using the standard
functionality originally developed for slendr tree
sequences. Note that this is the same command we would use for loading
slendr tree sequences, except we direct the
ts_load()
function straight to the tree-sequence output
file rather than using the ts_load(<model-object>)
format used when working with standard slendr simulations. This
way, slendr
<- ts_load(nonspatial_trees_file) %>%
ts ts_simplify() %>%
ts_mutate(mutation_rate = 1e-7, random_seed = SEED)
We can extract information about individual’s names, nodes,
population assignments, etc. just as with any slendr tree
sequence with the function ts_nodes()
. As with standard
slendr models, this function loads the “raw” node and
individual tree-sequences
tables, performs a couple of join operations, and presents the whole
thing in a nice unified form for interactive data analysis (which can
also include spatial information—see below):
<- ts_nodes(ts) %>% dplyr::filter(sampled)
data
data#> slendr 'nodes' object
#> ---------------------
#> times are expressed in a forward time direction
#>
#> summary of the table data contents:
#> p1 - 10 'sampled', 0 'remembered', 0 'retained', 10 'alive' individuals
#> p2 - 500 'sampled', 0 'remembered', 0 'retained', 500 'alive' individuals
#> p3 - 2500 'sampled', 0 'remembered', 0 'retained', 2500 'alive' individuals
#> p4 - 10000 'sampled', 0 'remembered', 0 'retained', 10000 'alive' individuals
#>
#> total:
#> - 13010 'sampled' individuals
#> - 0 'remembered' individuals
#> - 0 'retained' individuals
#> - 13010 'alive' individuals
#> ---------------------
#> oldest sampled individual: 0 time units 'before present'
#> youngest sampled individual: 0 time units 'before present'
#>
#> oldest node: 0 time units 'before present'
#> youngest node: 0 time units 'before present'
#> ---------------------
#> overview of the underlying table object:
#>
#> # A tibble: 26,020 × 10
#> pop node_id time time_tskit sampled remembe…¹ retai…² alive pedig…³ ind_id
#> <chr> <int> <dbl> <dbl> <lgl> <lgl> <lgl> <lgl> <dbl> <dbl>
#> 1 p1 0 0 0 TRUE FALSE FALSE TRUE 2.01e7 0
#> 2 p1 1 0 0 TRUE FALSE FALSE TRUE 2.01e7 0
#> 3 p1 2 0 0 TRUE FALSE FALSE TRUE 2.01e7 1
#> 4 p1 3 0 0 TRUE FALSE FALSE TRUE 2.01e7 1
#> 5 p1 4 0 0 TRUE FALSE FALSE TRUE 2.01e7 2
#> 6 p1 5 0 0 TRUE FALSE FALSE TRUE 2.01e7 2
#> 7 p1 6 0 0 TRUE FALSE FALSE TRUE 2.01e7 3
#> 8 p1 7 0 0 TRUE FALSE FALSE TRUE 2.01e7 3
#> 9 p1 8 0 0 TRUE FALSE FALSE TRUE 2.01e7 4
#> 10 p1 9 0 0 TRUE FALSE FALSE TRUE 2.01e7 4
#> # … with 26,010 more rows, and abbreviated variable names ¹remembered,
#> # ²retained, ³pedigree_id
#> # ℹ Use `print(n = ...)` to see more rows
Moving on to tskit statistics, we can use the data table above to
extract a list of nodes belonging to each population (this is what
various tskit tree-sequence statistics operate on, and slendr
follows that design). Here we are computing the nucleotide diversity in
each of the four populations using the ts_diversity()
function, by first creating a list of lists with node IDs
(i.e. chromosomes) of individuals in each population:
<- split(data$node_id, data$pop)
sample_sets
# compute nucleotide diversity in each population
# (any other ts_*() tskit R interface function should work)
ts_diversity(ts, sample_sets)
#> # A tibble: 4 × 2
#> set diversity
#> <chr> <dbl>
#> 1 p1 0.00000755
#> 2 p2 0.000241
#> 3 p3 0.000463
#> 4 p4 0.000201
Just as with slendr tree sequences (as demonstrated in our paper) we can get a individual trees too, extracted in the in the phylogenetic format provided by the ape R package. Here we first simplify the tree sequence even further to just 10 nodes to make things manageable:
<- sample(data$node_id, 10)
samples <- ts_simplify(ts, simplify_to = samples)
ts_small
# extract the 42nd tree in the genealogy to an R 'phylo' format
<- ts_phylo(ts_small, 42)
tree #> Starting checking the validity of tree...
#> Found number of tips: n = 10
#> Found number of nodes: m = 9
#> Done.
tree#>
#> Phylogenetic tree with 10 tips and 9 internal nodes.
#>
#> Tip labels:
#> 1, 0, 9, 4, 8, 7, ...
#> Node labels:
#> 45, 28, 22, 31, 35, 36, ...
#>
#> Rooted; includes branch lengths.
Once we have that R tree object, we can use packages like ggtree to visualize the
tree (any other phylogenetic package would work too). Note that because
nodes of ‘ape phylo’ trees must conform to a strict format (they must be
labelled 1...N
), we will extract the information about the
node IDs in the tskit tree-sequence data to be able to plot them in the
tree.
library(ggtree)
<- ts_nodes(tree) %>% select(node = phylo_id, tskit_id = node_id)
labels
ggtree(tree, branch.length = "none") %<+% labels +
geom_label(aes(label = tskit_id))
library(ape)
plot(tree, show.tip.label = FALSE)
nodelabels()
tiplabels()
The same as above applies also to msprime tree sequences (which is really not that surprising, given that it’s all tskit under the hood).
We can start with Python:
import msprime
= msprime.sim_ancestry(100)
ts <trees file>) ts.dump(
And then we can proceed with loading the msprime tree sequence into R and analyze it with the slendr functionality:
<- ts_load(msprime_trees_file)
ts
ts_nodes(ts)
#> slendr 'nodes' object
#> ---------------------
#> times are expressed in a backward time direction
#> ---------------------
#> overview of the underlying table object:
#>
#> # A tibble: 199 × 7
#> pop ind_id node_id time time_tskit sampled pop_id
#> <chr> <dbl> <int> <dbl> <dbl> <lgl> <int>
#> 1 0 NA 0 0 0 TRUE 0
#> 2 0 NA 1 0 0 TRUE 0
#> 3 0 NA 2 0 0 TRUE 0
#> 4 0 NA 3 0 0 TRUE 0
#> 5 0 NA 4 0 0 TRUE 0
#> 6 0 NA 5 0 0 TRUE 0
#> 7 0 NA 6 0 0 TRUE 0
#> 8 0 NA 7 0 0 TRUE 0
#> 9 0 NA 8 0 0 TRUE 0
#> 10 0 NA 9 0 0 TRUE 0
#> # … with 189 more rows
#> # ℹ Use `print(n = ...)` to see more rows
Furthermore, the generalized interface also supports slendr’s spatial tree-sequence features, with all the bells and whistles.
For instance, lets take the following spatial SLiM script (modified from the SLiM manual) and execute it with SLiM the usual way:
initialize() {
setSeed(42);
initializeSLiMOptions(keepPedigrees=T, dimensionality="xy");
initializeTreeSeq();
initializeMutationRate(1e-7);
initializeMutationType("m1", 0.5, "f", 0.0);
initializeGenomicElementType("g1", m1, 1.0);
initializeGenomicElement(g1, 0, 1e6);
initializeRecombinationRate(1e-8);
}
1 early() {
sim.addSubpop("p1", 500);
// initial positions are random in ([0,1], [0,1])
p1.individuals.x = runif(p1.individualCount);
p1.individuals.y = runif(p1.individualCount);
}
modifyChild() {
// draw a child position near the first parent, within bounds
do child.x = parent1.x + rnorm(1, 0, 0.02);
while ((child.x < 0.0) | (child.x > 1.0));
do child.y = parent1.y + rnorm(1, 0, 0.02);
while ((child.y < 0.0) | (child.y > 1.0));
return T;
}
1: late() {
sim.treeSeqRememberIndividuals(sim.subpopulations.individuals, permanent = F);
}
10000 late() {
sim.treeSeqOutput("/var/folders/d_/hblb15pd3b94rg0v35920wd80000gn/T//RtmpCCdv89/file15ed84ff4c17c");
}
We can then load and simplify the output tree sequence in just as we did above in this vignette (or anywhere in the slendr documentation):
<- ts_load(spatial_trees_file) %>% ts_simplify() ts
Finally, we can access the spatio-temporal data embedded in the
output tree sequence in the standard slendr way (note the
spatial sf column
location
with the POINT
data type):
<- ts_nodes(ts)
data
data#> slendr 'nodes' object
#> ---------------------
#> times are expressed in a forward time direction
#>
#> summary of the table data contents:
#> p1 - 500 'sampled', 0 'remembered', 500 'retained', 500 'alive' individuals
#>
#> total:
#> - 500 'sampled' individuals
#> - 0 'remembered' individuals
#> - 895 'retained' individuals
#> - 500 'alive' individuals
#> ---------------------
#> oldest sampled individual: 0 time units 'before present'
#> youngest sampled individual: 0 time units 'before present'
#>
#> oldest node: 8463 time units 'before present'
#> youngest node: 0 time units 'before present'
#> ---------------------
#> overview of the underlying sf object:
#>
#> # A tibble: 1,955 × 11
#> pop node_id time time_…¹ location sampled remembered retai…²
#> <chr> <int> <dbl> <dbl> <POINT> <lgl> <lgl> <lgl>
#> 1 p1 0 0 0 (0.904485 0.0777489) TRUE FALSE TRUE
#> 2 p1 1 0 0 (0.904485 0.0777489) TRUE FALSE TRUE
#> 3 p1 2 0 0 (0.668828 0.5246381) TRUE FALSE TRUE
#> 4 p1 3 0 0 (0.668828 0.5246381) TRUE FALSE TRUE
#> 5 p1 4 0 0 (0.7707701 0.1518462) TRUE FALSE TRUE
#> 6 p1 5 0 0 (0.7707701 0.1518462) TRUE FALSE TRUE
#> 7 p1 6 0 0 (0.7617792 0.2616453) TRUE FALSE TRUE
#> 8 p1 7 0 0 (0.7617792 0.2616453) TRUE FALSE TRUE
#> 9 p1 8 0 0 (0.9807475 0.04223008) TRUE FALSE TRUE
#> 10 p1 9 0 0 (0.9807475 0.04223008) TRUE FALSE TRUE
#> # … with 1,945 more rows, 3 more variables: alive <lgl>, pedigree_id <dbl>,
#> # ind_id <dbl>, and abbreviated variable names ¹time_tskit, ²retained
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
Because we get the tree sequence converted to the spatial sf data format, we can use standard geospatial packages to use any spatial data analysis methods that those packages provide.
To briefly demonstrate what this means, we can trivially plot the location of each recorded node:
ggplot() + geom_sf(data = data, aes(color = time), alpha = 0.5)
We can also collect spatio-temporal ancestry information of a particular node (i.e. the times and locations of all of its ancestors all the way to the root, with each “link” in the plot signifying parent-child edge somewhere along the tree sequence) and plot it on a 2D surface (x and y dimensions [0, 1]). The plot is a little chaotic, but hopefully conveys the idea (the “focal node” 0 is highlighted in red). This is essentially the same plot we have in the last figure of our paper.
<- ts_ancestors(ts, 0)
ancestral_links
ggplot() +
geom_sf(data = ancestral_links, size = 0.5, aes(alpha = parent_time)) +
geom_sf(data = sf::st_set_geometry(ancestral_links, "parent_location"), aes(color = parent_time)) +
geom_sf(data = data[data$node_id == 0, ], size = 3, color = "red")
In this vignette we gave a brief overview of using slendr’s R-tskit interface for loading, processing, and analyzing “pure” non-slendr tree sequences produced by msprime and SLiM scripts.
Although we have only touched upon the most basic features of its R-tskit interface for standard tree sequences, it is important to note that as far as slendr is concerned, it does not matter how a tree sequence was produced, as long as it conforms to the tskit specification. This means that regardless of the source of your tree sequence data, you should be able to use slendr’s tskit functionality to run your analyses.