Most simple example with bulk ATAC

In this tutorial, we will show you a first brief guide how to run EpiTrace analysis. We do this by using a tiny demo dataset of bulk ATAC-seq from [Corces et al., 2016]. These demo data are from FACS-sorted pure hematopoietic cell populations and sequenced by ATAC-seq.

We would suggest users to use the iterative updating algorithm regardless of whether it is bulk or single-cell ATAC-seq data. At the moment, we show you how to do this with both non-iterative algorithm and iterative algorithm.

First of all, the input data for EpiTrace are:

a count matrix where rows are genomic loci, and columns are cells, and elements are ATAC-seq read numbers.
a data frame or GRanges object corresponds to the genomic loci (we call it “peak sets”)

You also need to choose a reference clock-like loci for age estimation. Currently, the package provides a human clock-like differential methylated loci dataset as reference. Alternative datasets could be chosen by user.

To interpret the data, an important note is that EpiTrace age is reversed between bulk and single cell datasets. For a detailed technical explanation, please see [Xiao et al., 2024]. At the moment, please remember that higher age corresponds to lower EpiTrace age in bulk ATAC datasets, and higher EpiTrace age in single cell ATAC datasets.

Step 1. Initialize environment

Load EpiTrace as:

library(EpiTrace)

There are many other libraries should be loaded for this tutorial:

library(dplyr)
library(tidyr)
library(readr)
library(GenomicRanges)
library(reshape2)
library(openxlsx)
library(ggplot2)
library(Matrix)
library(Seurat)
library(SeuratObject)
library(ggtree)
library(EnsDb.Hsapiens.v86)
library(patchwork)
library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(org.Hs.eg.db)

Step 2. Load data

We load the demo human FACS-sorted pure blood cell data using:

data("clock_gr_list")
data("hematopoiesis.2018")

The package provided clock_gr_list is a list of GRanges, each corresponds to a set of clock-like DML (clockDML) in human.

There are three sets of human clockDML that provided in clock_gr_list:

Mitosis: 616 clock-like DML compiled from papers of [Yang et al., 2016] and [Youn and Wang, 2018].

Chronology: 125625 clock-like DML/DMR computed from genome-wide bisulfite capture sequencing data and a compilation of 450k methyl-array data, removing the ones that overlaps with Mitosis dataset.

solo_WCGW: 1669720 solitary WCGW motif in partial methylated domain, from [Zhou et al., 2018]. This is just for comparison purpose and in general not used in age estimation.

Step 3. Initialize peak sets

Peak sets are the annotation of ATAC peaks in your input dataset. Either a dataframe with columns chr,start,end or a GRanges object is fine. Here we provide you with an example GRanges inputpeak from the data hematopoiesis.2018. Initaliztion of peak sets is necessary for following steps:

initiated_peaks <- Init_Peakset(inputpeak)

Step 4. Initialize read count matrix

We initiate a count matrix for EpiTrace by:

initiated_matrix <- Init_Matrix(cellname = colnames(inputmatrix),peakname = initiated_peaks$peakId, matrix = inputmatrix)

Step 5. Calculate EpiTrace Age

The non-iterative algorithm first needs to build an object:

epitrace_obj <- EpiTrace_prepare_object(initiated_peaks,inputmatrix,cell_metadata$celltype,ref_genome = 'hg19',non_standard_clock = F,clock_gr_list = clock_gr_list)
epitrace_obj <- RunEpiTraceAge(epitrace_obj)

It emits:

ref clock list is set to be standard (Homo sapiens, hg19)
Input peakset is set to be hg19
Joining with `by = join_by(Clock_panel)`
add MitosisClock
good quality cells 120 and peaks 259
add ChronologyClock
good quality cells 120 and peaks 24446
add AllClock
good quality cells 120 and peaks 24600
Performing TF-IDF normalization
Running SVD
Scaling cell embeddings
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
19:18:27 UMAP embedding parameters a = 0.9922 b = 1.112
Found more than one class "dist" in cache; using the first, from namespace 'spam'
Also defined by ‘BiocGenerics’
19:18:27 Read 120 rows and found 49 numeric columns
19:18:27 Using Annoy for neighbor search, n_neighbors = 30
Found more than one class "dist" in cache; using the first, from namespace 'spam'
Also defined by ‘BiocGenerics’
19:18:27 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
19:18:27 Writing NN index file to temp file /var/folders/0t/xpysj0d91jxcjqvbkhnlg88w0000gn/T//Rtmp1Xns5f/file1136be2dda88
19:18:27 Searching Annoy index using 1 thread, search_k = 3000
19:18:27 Annoy recall = 100%
19:18:27 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
19:18:28 Initializing from normalized Laplacian + noise (using RSpectra)
19:18:28 Commencing optimization for 500 epochs, with 4368 positive edges
Using method 'umap'
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
19:18:29 Optimization finished
Computing nearest neighbor graph
Computing SNN
...
Estimating Age by EpiTraceAge...
Estimating Age by EpiTraceAge...
Estimating Age by EpiTraceAge...
Joining with `by = join_by(cell)`
Joining with `by = join_by(cell)`
Joining with `by = join_by(cell)`

Alternatively, you can use the iterative algorithm. It defaults to use the reference clock-like loci combining both Mitosis and Chronology datasets. A one-shot function to calculate EpiTrace age using standard (human) ClockDML as reference:

epitrace_obj_age_estimated_iterative <- EpiTraceAge_Convergence(initiated_peaks,inputmatrix,ref_genome = 'hg19',non_standard_clock = F,parallel = F,iterative_time = 10,Z_cutoff = 3,qualnum = 1)
epitrace_obj_age_estimated_iterative@meta.data$celltype <- left_join(epitrace_obj_age_estimated_iterative@meta.data,cell_metadata %>% dplyr::select(cell,celltype)) %>% dplyr::select(celltype) %>% unlist()

It emits:

Preparing obj...
ref clock list is not standard. Please make sure the input data, peak set and clock set are in similar reference genome.
Input peakset is set to be hg19
Joining with `by = join_by(Clock_panel)`
add iterative
good quality cells 120 and peaks 24600
Finished prepare obj
Estimating age for initialization...
Finished age estimation
Iterating 1
Performing Corr
Finished Corr
Calculate iterative age
Single thread run
Update dataframe
Joining with `by = join_by(cell)`
mean_error = 0

Step 6. Compute the phylogenetic tree based on EpiTrace

We pass the prepared epitrace object to estimate per-cell-cluster phylogenetic tree using RunEpiTracePhylogeny command.

This generates a list which contains the assay (id of assay), tree (phylogenetic tree of the clusters), tree_plot (a ggtree object).

If you would like to use other clustering of cells (say, single cell) you have to change the Idents of the object. Currently we tend to suggest use well annotated single cell clusters (or in this demo case, known FACS-sorted cell types).:

Idents(epitrace_obj) <- epitrace_obj$celltype
phylotree_res_myeloid <- RunEpiTracePhylogeny(subset(epitrace_obj,celltype %in% c('HSC','MPP','CMP','GMP','Monocyte')))
chronology_tree_myeloid <- phylotree_res_myeloid[['AllClock']][[2]]
chronology_tree_myeloid <- ape::root(chronology_tree_myeloid,outgroup='HSC')
plot(chronology_tree_myeloid)

You can change the subset of cells to each lineage, for example:

phylotree_res_lymphoid <- RunEpiTracePhylogeny(subset(epitrace_obj,celltype %in% c('HSC','MPP','LMPP','CLP','NK','B','CD4T','CD8T')))
chronology_tree_lymphoid <- phylotree_res_lymphoid[['AllClock']][[2]]
chronology_tree_lymphoid <- ape::root(chronology_tree_lymphoid,outgroup='HSC')
plot(chronology_tree_lymphoid)

phylotree_res_erythroid <- RunEpiTracePhylogeny(subset(epitrace_obj,celltype %in% c('HSC','MPP','CMP','MEP','Ery')))
chronology_tree_erythroid <- phylotree_res_erythroid[['AllClock']][[2]]
chronology_tree_erythroid <- ape::root(chronology_tree_erythroid,outgroup='HSC')
plot(chronology_tree_erythroid)

Expected messages would be (as with Seurat V5. Other versions might change):

As of Seurat v5, we recommend using AggregateExpression to perform pseudo-bulk analysis.
This message is displayed once per session.
Performing TF-IDF normalization
Running SVD
Warning in irlba(A = t(x = object), nv = n, work = irlba.work, tol = tol) :
  You're computing too large a percentage of total singular values, use a standard svd instead.
Warning in irlba(A = t(x = object), nv = n, work = irlba.work, tol = tol) :
  did not converge--results might be invalid!; try increasing work or maxit
Scaling cell embeddings
Performing TF-IDF normalization
.......
Warning message:
The `slot` argument of `AverageExpression()` is deprecated as of Seurat 5.0.0.
ℹ Please use the `layer` argument instead.
ℹ The deprecated feature was likely used in the Seurat package.
  Please report the issue at <https://github.com/satijalab/seurat/issues>.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

Example phylogenetic trees plotted would be like:

Step 7. Compute association between regulatory region accessibility and cell age.

We pass the age-estimated object to AssociationOfPeaksToAge function to compute association result.

Note that association is most meaningful for a sub-population or a lineage:

associated_res_myeloid <- AssociationOfPeaksToAge(subset(epitrace_obj,celltype %in% c('HSC','MPP','CMP','GMP','Monocyte')),epitrace_age_name = "EpiTraceAge_AllClock")
txdb = TxDb.Hsapiens.UCSC.hg19.knownGene
associated_res_myeloid <- separate(associated_res_myeloid,col='locus',into=c('chr','start','end'),remove=F,convert=T)
associated_res_myeloid_gr <- makeGRangesFromDataFrame(associated_res_myeloid)
findOverlaps(associated_res_myeloid_gr,plyranges::reduce_ranges(c(clock_gr_list[[1]],clock_gr_list[[2]])))@from %>% unique -> peaks_overlap_with_clock
associated_res_myeloid_gr$locus_type <- 'Others'
associated_res_myeloid_gr$locus_type[peaks_overlap_with_clock] <- 'Clock-like DML'
annotatePeak(associated_res_myeloid_gr, tssRegion=c(-2000, 500), TxDb=txdb,addFlankGeneInfo=F, flankDistance=50000,annoDb = "org.Hs.eg.db") -> associated_res_myeloid_gr_anno
as.data.frame(associated_res_myeloid_gr_anno@anno) -> associated_res_myeloid_gr_anno_df
cbind(associated_res_myeloid, associated_res_myeloid_gr_anno_df %>% dplyr::select(SYMBOL,distanceToTSS,annotation,locus_type) ) -> associated_res_myeloid
associated_res_myeloid$promoter_scaled_cor <- NA
associated_res_myeloid$promoter_scaled_cor[grepl('Promoter', associated_res_myeloid$annotation)] <- scale(associated_res_myeloid $correlation_of_EpiTraceAge[grepl('Promoter', associated_res_myeloid$annotation)])
associated_res_myeloid <- arrange(associated_res_myeloid,scaled_correlation_of_EpiTraceAge)
(associated_res_myeloid) %>% na.omit() %>% tail(5)

You can now check which peak is associated with replicative age shift in this lineage by looking at the dataframe associated_res_myeloid.