Most simple example with bulk ATAC ---------------------------------- In this tutorial, we will show you a first brief guide how to run EpiTrace analysis. We do this by using a tiny demo dataset of bulk ATAC-seq from :cite:p:`Corces2016`. These demo data are from FACS-sorted pure hematopoietic cell populations and sequenced by ATAC-seq. We would suggest users to use the iterative updating algorithm regardless of whether it is bulk or single-cell ATAC-seq data. At the moment, we show you how to do this with both non-iterative algorithm and iterative algorithm. First of all, the input data for EpiTrace are: - a count matrix where rows are genomic loci, and columns are cells, and elements are ATAC-seq read numbers. - a data frame or GRanges object corresponds to the genomic loci (we call it "peak sets") You also need to choose a reference clock-like loci for age estimation. Currently, the package provides a human clock-like differential methylated loci dataset as reference. Alternative datasets could be chosen by user. To interpret the data, an important note is that EpiTrace age is reversed between bulk and single cell datasets. For a detailed technical explanation, please see :cite:p:`Xiao2024`. At the moment, please remember that higher age corresponds to **lower** EpiTrace age in bulk ATAC datasets, and **higher** EpiTrace age in single cell ATAC datasets. Step 1. Initialize environment '''''''''''''''''''''''''''''' Load EpiTrace as:: library(EpiTrace) There are many other libraries should be loaded for this tutorial:: library(dplyr) library(tidyr) library(readr) library(GenomicRanges) library(reshape2) library(openxlsx) library(ggplot2) library(Matrix) library(Seurat) library(SeuratObject) library(ggtree) library(EnsDb.Hsapiens.v86) library(patchwork) library(ChIPseeker) library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(org.Hs.eg.db) Step 2. Load data ''''''''''''''''' We load the demo human FACS-sorted pure blood cell data using:: data("clock_gr_list") data("hematopoiesis.2018") The package provided ``clock_gr_list`` is a list of GRanges, each corresponds to a set of clock-like DML (clockDML) in human. There are three sets of human clockDML that provided in `clock_gr_list`: `Mitosis`: 616 clock-like DML compiled from papers of :cite:p:`Yang2016` and :cite:p:`Youn2018`. `Chronology`: 125625 clock-like DML/DMR computed from genome-wide bisulfite capture sequencing data and a compilation of 450k methyl-array data, removing the ones that overlaps with Mitosis dataset. `solo_WCGW`: 1669720 solitary WCGW motif in partial methylated domain, from :cite:p:`Zhou2018`. This is just for comparison purpose and in general not used in age estimation. Step 3. Initialize peak sets '''''''''''''''''''''''''''' Peak sets are the annotation of ATAC peaks in your input dataset. Either a dataframe with columns chr,start,end or a GRanges object is fine. Here we provide you with an example GRanges ``inputpeak`` from the data ``hematopoiesis.2018``. Initaliztion of peak sets is necessary for following steps:: initiated_peaks <- Init_Peakset(inputpeak) Step 4. Initialize read count matrix '''''''''''''''''''''''''''''''''''' We initiate a count matrix for EpiTrace by:: initiated_matrix <- Init_Matrix(cellname = colnames(inputmatrix),peakname = initiated_peaks$peakId, matrix = inputmatrix) Step 5. Calculate EpiTrace Age '''''''''''''''''''''''''''''' The non-iterative algorithm first needs to build an object:: epitrace_obj <- EpiTrace_prepare_object(initiated_peaks,inputmatrix,cell_metadata$celltype,ref_genome = 'hg19',non_standard_clock = F,clock_gr_list = clock_gr_list) epitrace_obj <- RunEpiTraceAge(epitrace_obj) It emits:: ref clock list is set to be standard (Homo sapiens, hg19) Input peakset is set to be hg19 Joining with `by = join_by(Clock_panel)` add MitosisClock good quality cells 120 and peaks 259 add ChronologyClock good quality cells 120 and peaks 24446 add AllClock good quality cells 120 and peaks 24600 Performing TF-IDF normalization Running SVD Scaling cell embeddings Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation' This message will be shown once per session 19:18:27 UMAP embedding parameters a = 0.9922 b = 1.112 Found more than one class "dist" in cache; using the first, from namespace 'spam' Also defined by ‘BiocGenerics’ 19:18:27 Read 120 rows and found 49 numeric columns 19:18:27 Using Annoy for neighbor search, n_neighbors = 30 Found more than one class "dist" in cache; using the first, from namespace 'spam' Also defined by ‘BiocGenerics’ 19:18:27 Building Annoy index with metric = cosine, n_trees = 50 0% 10 20 30 40 50 60 70 80 90 100% [----|----|----|----|----|----|----|----|----|----| **************************************************| 19:18:27 Writing NN index file to temp file /var/folders/0t/xpysj0d91jxcjqvbkhnlg88w0000gn/T//Rtmp1Xns5f/file1136be2dda88 19:18:27 Searching Annoy index using 1 thread, search_k = 3000 19:18:27 Annoy recall = 100% 19:18:27 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30 19:18:28 Initializing from normalized Laplacian + noise (using RSpectra) 19:18:28 Commencing optimization for 500 epochs, with 4368 positive edges Using method 'umap' 0% 10 20 30 40 50 60 70 80 90 100% [----|----|----|----|----|----|----|----|----|----| **************************************************| 19:18:29 Optimization finished Computing nearest neighbor graph Computing SNN ... Estimating Age by EpiTraceAge... Estimating Age by EpiTraceAge... Estimating Age by EpiTraceAge... Joining with `by = join_by(cell)` Joining with `by = join_by(cell)` Joining with `by = join_by(cell)` Alternatively, you can use the iterative algorithm. It defaults to use the reference clock-like loci combining both ``Mitosis`` and ``Chronology`` datasets. A one-shot function to calculate EpiTrace age using standard (human) ClockDML as reference:: epitrace_obj_age_estimated_iterative <- EpiTraceAge_Convergence(initiated_peaks,inputmatrix,ref_genome = 'hg19',non_standard_clock = F,parallel = F,iterative_time = 10,Z_cutoff = 3,qualnum = 1) epitrace_obj_age_estimated_iterative@meta.data$celltype <- left_join(epitrace_obj_age_estimated_iterative@meta.data,cell_metadata %>% dplyr::select(cell,celltype)) %>% dplyr::select(celltype) %>% unlist() It emits:: Preparing obj... ref clock list is not standard. Please make sure the input data, peak set and clock set are in similar reference genome. Input peakset is set to be hg19 Joining with `by = join_by(Clock_panel)` add iterative good quality cells 120 and peaks 24600 Finished prepare obj Estimating age for initialization... Finished age estimation Iterating 1 Performing Corr Finished Corr Calculate iterative age Single thread run Update dataframe Joining with `by = join_by(cell)` mean_error = 0 Step 6. Compute the phylogenetic tree based on EpiTrace ''''''''''''''''''''''''''''''''''''''''''''''''''''''' We pass the prepared epitrace object to estimate per-cell-cluster phylogenetic tree using RunEpiTracePhylogeny command. This generates a list which contains the assay (id of assay), tree (phylogenetic tree of the clusters), tree_plot (a ggtree object). If you would like to use other clustering of cells (say, single cell) you have to change the Idents of the object. Currently we tend to suggest use well annotated single cell clusters (or in this demo case, known FACS-sorted cell types).:: Idents(epitrace_obj) <- epitrace_obj$celltype phylotree_res_myeloid <- RunEpiTracePhylogeny(subset(epitrace_obj,celltype %in% c('HSC','MPP','CMP','GMP','Monocyte'))) chronology_tree_myeloid <- phylotree_res_myeloid[['AllClock']][[2]] chronology_tree_myeloid <- ape::root(chronology_tree_myeloid,outgroup='HSC') plot(chronology_tree_myeloid) You can change the subset of cells to each lineage, for example:: phylotree_res_lymphoid <- RunEpiTracePhylogeny(subset(epitrace_obj,celltype %in% c('HSC','MPP','LMPP','CLP','NK','B','CD4T','CD8T'))) chronology_tree_lymphoid <- phylotree_res_lymphoid[['AllClock']][[2]] chronology_tree_lymphoid <- ape::root(chronology_tree_lymphoid,outgroup='HSC') plot(chronology_tree_lymphoid) phylotree_res_erythroid <- RunEpiTracePhylogeny(subset(epitrace_obj,celltype %in% c('HSC','MPP','CMP','MEP','Ery'))) chronology_tree_erythroid <- phylotree_res_erythroid[['AllClock']][[2]] chronology_tree_erythroid <- ape::root(chronology_tree_erythroid,outgroup='HSC') plot(chronology_tree_erythroid) Expected messages would be (as with Seurat V5. Other versions might change):: As of Seurat v5, we recommend using AggregateExpression to perform pseudo-bulk analysis. This message is displayed once per session. Performing TF-IDF normalization Running SVD Warning in irlba(A = t(x = object), nv = n, work = irlba.work, tol = tol) : You're computing too large a percentage of total singular values, use a standard svd instead. Warning in irlba(A = t(x = object), nv = n, work = irlba.work, tol = tol) : did not converge--results might be invalid!; try increasing work or maxit Scaling cell embeddings Performing TF-IDF normalization ....... Warning message: The `slot` argument of `AverageExpression()` is deprecated as of Seurat 5.0.0. ℹ Please use the `layer` argument instead. ℹ The deprecated feature was likely used in the Seurat package. Please report the issue at . This warning is displayed once every 8 hours. Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated. Example phylogenetic trees plotted would be like: .. image:: _static/Phylogeny_BULK_ATAC.svg :width: 600px :align: center Step 7. Compute association between regulatory region accessibility and cell age. ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' We pass the age-estimated object to AssociationOfPeaksToAge function to compute association result. Note that association is most meaningful for a sub-population or a lineage:: associated_res_myeloid <- AssociationOfPeaksToAge(subset(epitrace_obj,celltype %in% c('HSC','MPP','CMP','GMP','Monocyte')),epitrace_age_name = "EpiTraceAge_AllClock") txdb = TxDb.Hsapiens.UCSC.hg19.knownGene associated_res_myeloid <- separate(associated_res_myeloid,col='locus',into=c('chr','start','end'),remove=F,convert=T) associated_res_myeloid_gr <- makeGRangesFromDataFrame(associated_res_myeloid) findOverlaps(associated_res_myeloid_gr,plyranges::reduce_ranges(c(clock_gr_list[[1]],clock_gr_list[[2]])))@from %>% unique -> peaks_overlap_with_clock associated_res_myeloid_gr$locus_type <- 'Others' associated_res_myeloid_gr$locus_type[peaks_overlap_with_clock] <- 'Clock-like DML' annotatePeak(associated_res_myeloid_gr, tssRegion=c(-2000, 500), TxDb=txdb,addFlankGeneInfo=F, flankDistance=50000,annoDb = "org.Hs.eg.db") -> associated_res_myeloid_gr_anno as.data.frame(associated_res_myeloid_gr_anno@anno) -> associated_res_myeloid_gr_anno_df cbind(associated_res_myeloid, associated_res_myeloid_gr_anno_df %>% dplyr::select(SYMBOL,distanceToTSS,annotation,locus_type) ) -> associated_res_myeloid associated_res_myeloid$promoter_scaled_cor <- NA associated_res_myeloid$promoter_scaled_cor[grepl('Promoter', associated_res_myeloid$annotation)] <- scale(associated_res_myeloid $correlation_of_EpiTraceAge[grepl('Promoter', associated_res_myeloid$annotation)]) associated_res_myeloid <- arrange(associated_res_myeloid,scaled_correlation_of_EpiTraceAge) (associated_res_myeloid) %>% na.omit() %>% tail(5) You can now check which peak is associated with replicative age shift in this lineage by looking at the dataframe ``associated_res_myeloid``.